TimeSeriesImputer Class

Imputation transformer for filling in missing values in data frame columns.

Example: construct a sample data frame, df1. Note that df1 is not a regular time series, because store 'a' is missing the row for date '2017-01-03'.


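>>> import numpy as np
>>> import pandas as pd
>>> from azureml.automl.runtime._time_series_data_set import TimeSeriesDataSet
>>> from azureml.automl.runtime.featurizer.transformer.timeseries.time_series_imputer import TimeSeriesImputer
>>> data1 = pd.DataFrame(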
...  {'store': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd',
...            'd', 'd', 'd', 'd', 'd', 'd', 'd'],
...   'date': pd.to_datetime(
...      ['2017-01-02', '2017-01-04', '2017-01-01', '2017-01-02',
...       '2017-01-01', '2017-01-02', '2017-01-03', '2017-01-01',
...       '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
...       '2017-01-06', '2017-01-07', '2017-01-08']),
...   'sales': [1, np.nan, 2, np.nan, 6, 7, np.nan, 10, 11, 15, 13, 14,
...             np.nan, np.nan, 15],
...   'price': [np.nan, 3, np.nan, 4, 3, 6, np.nan, 2, 6, 3, 5, 5,
...             np.nan, np.nan, 6]})
>>> df1 = TimeSeriesDataSet(data1, time_series_id_column_names=['store'],
...                           time_column_name='date', target_column_name='sales')
>>> df1.data
                  price  sales
date       store
2017-01-02 a        nan   1.00
2017-01-04 a       3.00    nan
2017-01-01 b        nan   2.00
2017-01-02 b       4.00    nan
2017-01-01 c       3.00   6.00
2017-01-02 c       6.00   7.00
2017-01-03 c        nan    nan
2017-01-01 d       2.00  10.00
2017-01-02 d       6.00  11.00
2017-01-03 d       3.00  15.00
2017-01-04 d       5.00  13.00
2017-01-05 d       5.00  14.00
2017-01-06 d        nan    nan
2017-01-07 d        nan    nan
2017-01-08 d       6.00  15.00

If you run infer_freq, the 'regular_ts' attribute is False and the inferred frequency is 'D'.


>>> df1.infer_freq()
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
{'regular_ts': False, 'freq': 'D'}

>>> sorted(df1.infer_freq().items())
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
[('freq', 'D'), ('regular_ts', False)]
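
The warning reports two frequencies because store 'a' has a two-day gap while the other stores are daily. The following is a minimal pandas sketch of that observation only; it is not how infer_freq is implemented, and the names `dates` and `gaps` are introduced here for illustration.

```python
import pandas as pd

# Same store/date layout as df1 (only the dates matter here).
dates = pd.DataFrame({
    'store': ['a', 'a', 'b', 'b', 'c', 'c', 'c'] + ['d'] * 8,
    'date': pd.to_datetime(
        ['2017-01-02', '2017-01-04', '2017-01-01', '2017-01-02',
         '2017-01-01', '2017-01-02', '2017-01-03', '2017-01-01',
         '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
         '2017-01-06', '2017-01-07', '2017-01-08'])})

# Smallest gap between consecutive dates within each store.
gaps = {store: group['date'].sort_values().diff().min()
        for store, group in dates.groupby('store')}
print(gaps)
```

Store 'a' yields a two-day gap and every other store a one-day gap, which matches the two frequencies named in the warning.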

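Impute df1 for the single column 'sales' with the option 'default'. Note that for store 'a', the missing row for date '2017-01-03' is also added and imputed. By default, the frequency used to fill in the missing dates is the one inferred by df1.infer_freq(), which is 'D' in this case.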


>>> imputer1 = TimeSeriesImputer(input_column='sales', option='default')
>>> imputer1.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   7.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  14.333333     d
14 2017-01-07    NaN  14.666667     d
15 2017-01-08    6.0  15.000000     d

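To specify the frequency explicitly, pass it with the 'freq' keyword argument. This is useful because the inferred frequency is not always accurate: for example, no frequency may be inferred from any of the time series, or several frequencies may be inferred and the one chosen is not the one you want.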


>>> imputer2 = TimeSeriesImputer(input_column='sales', option='default',
...                              freq='D')
>>> imputer2.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   7.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  14.333333     d
14 2017-01-07    NaN  14.666667     d
15 2017-01-08    6.0  15.000000     d

The 'default' option is equivalent to setting option='interpolate', method='linear', and limit_direction='both'.


>>> imputer3 = TimeSeriesImputer(input_column='sales',
...                              option='interpolate', method='linear',
...                              limit_direction='both')
>>> imputer3.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   7.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  14.333333     d
14 2017-01-07    NaN  14.666667     d
15 2017-01-08    6.0  15.000000     d
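
To see where these numbers come from, the sketch below reproduces the 'sales' values for stores 'a' and 'd' with plain pandas: each store is reindexed onto a complete daily calendar and then interpolated linearly in both directions. It is an approximation under the assumption that the imputer behaves as the equivalence above describes, not the library's own code; `impute_linear` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Stores 'a' and 'd' from df1, 'sales' column only.
sales = pd.DataFrame({
    'store': ['a', 'a'] + ['d'] * 8,
    'date': pd.to_datetime(
        ['2017-01-02', '2017-01-04', '2017-01-01', '2017-01-02',
         '2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06',
         '2017-01-07', '2017-01-08']),
    'sales': [1, np.nan, 10, 11, 15, 13, 14, np.nan, np.nan, 15]})


def impute_linear(group, freq='D'):
    """Add the missing dates at `freq`, then interpolate linearly."""
    s = group.set_index('date')['sales']
    full_index = pd.date_range(s.index.min(), s.index.max(), freq=freq)
    # limit_direction='both' also fills leading and trailing gaps.
    return s.reindex(full_index).interpolate(method='linear',
                                             limit_direction='both')


imputed = {store: impute_linear(group)
           for store, group in sales.groupby('store')}
print(imputed['a'].tolist())
print(imputed['d'].round(6).tolist())
```

Store 'a' has no later observation to interpolate toward, so its trailing gaps take the last known value; store 'd' gets evenly spaced values between 14 and 15.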

You can also impute a list of columns. Here, df1 is imputed for both the 'sales' and 'price' columns.


>>> imputer4 = TimeSeriesImputer(input_column=['sales', 'price'],
...                              option='default')
>>> imputer4.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date     price      sales store
0  2017-01-02  3.000000   1.000000     a
1  2017-01-03  3.000000   1.000000     a
2  2017-01-04  3.000000   1.000000     a
3  2017-01-01  4.000000   2.000000     b
4  2017-01-02  4.000000   2.000000     b
5  2017-01-01  3.000000   6.000000     c
6  2017-01-02  6.000000   7.000000     c
7  2017-01-03  6.000000   7.000000     c
8  2017-01-01  2.000000  10.000000     d
9  2017-01-02  6.000000  11.000000     d
10 2017-01-03  3.000000  15.000000     d
11 2017-01-04  5.000000  13.000000     d
12 2017-01-05  5.000000  14.000000     d
13 2017-01-06  5.333333  14.333333     d
14 2017-01-07  5.666667  14.666667     d
15 2017-01-08  6.000000  15.000000     d

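You can also set the option to 'interpolate' and use keyword arguments from pandas.Series.interpolate, such as 'method', 'limit', 'limit_direction', and 'order'. Note that if the chosen method does not work for some grains, linear interpolation is used for those grains instead.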


>>> imputer5 = TimeSeriesImputer(input_column=['sales'],
...                              option='interpolate', method='barycentric')
>>> imputer5.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   8.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  26.904762     d
14 2017-01-07    NaN  42.428571     d
15 2017-01-08    6.0  15.000000     d
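
The fallback to linear interpolation can be pictured as a try/except around pandas.Series.interpolate. This is only a sketch of the idea, not the library's code: `interpolate_with_fallback` is a hypothetical helper, and the failure is triggered here by requesting 'polynomial' without an 'order', which pandas rejects with a ValueError.

```python
import numpy as np
import pandas as pd


def interpolate_with_fallback(series, method, **kwargs):
    """Try the requested method; use linear interpolation if it fails."""
    try:
        return series.interpolate(method=method, **kwargs)
    except (ValueError, ImportError):
        # Assumed fallback: linear interpolation in both directions.
        return series.interpolate(method='linear', limit_direction='both')


s = pd.Series([1.0, np.nan, 3.0])
# 'polynomial' needs an 'order'; without it the linear fallback is used.
result = interpolate_with_fallback(s, method='polynomial')
print(result.tolist())
```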

You can also set the option to 'fillna' and use keyword arguments from pandas.Series.fillna, such as 'method', 'value', and 'limit'.


>>> imputer6 = TimeSeriesImputer(input_column=['sales'], option='fillna', method='ffill')
>>> imputer6.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2017-01-02    NaN    1.0     a
1  2017-01-03    NaN    1.0     a
2  2017-01-04    3.0    1.0     a
3  2017-01-01    NaN    2.0     b
4  2017-01-02    4.0    2.0     b
5  2017-01-01    3.0    6.0     c
6  2017-01-02    6.0    7.0     c
7  2017-01-03    NaN    7.0     c
8  2017-01-01    2.0   10.0     d
9  2017-01-02    6.0   11.0     d
10 2017-01-03    3.0   15.0     d
11 2017-01-04    5.0   13.0     d
12 2017-01-05    5.0   14.0     d
13 2017-01-06    NaN   14.0     d
14 2017-01-07    NaN   14.0     d
15 2017-01-08    6.0   15.0     d

>>> imputer7 = TimeSeriesImputer(input_column=['sales'], option='fillna',
...                              value=0)
>>> imputer7.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2017-01-02    NaN    1.0     a
1  2017-01-03    NaN    0.0     a
2  2017-01-04    3.0    0.0     a
3  2017-01-01    NaN    2.0     b
4  2017-01-02    4.0    0.0     b
5  2017-01-01    3.0    6.0     c
6  2017-01-02    6.0    7.0     c
7  2017-01-03    NaN    0.0     c
8  2017-01-01    2.0   10.0     d
9  2017-01-02    6.0   11.0     d
10 2017-01-03    3.0   15.0     d
11 2017-01-04    5.0   13.0     d
12 2017-01-05    5.0   14.0     d
13 2017-01-06    NaN    0.0     d
14 2017-01-07    NaN    0.0     d
15 2017-01-08    6.0   15.0     d
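
For comparison, the two 'fillna' variants correspond to the following plain pandas operations on the 'sales' values of store 'd'. This is a sketch of the underlying pandas behavior only. Recent pandas versions deprecate fillna(method='ffill') in favor of Series.ffill(), which is used here.

```python
import numpy as np
import pandas as pd

# 'sales' for store 'd' from df1, already on a complete daily index.
s = pd.Series([10, 11, 15, 13, 14, np.nan, np.nan, 15],
              index=pd.date_range('2017-01-01', periods=8, freq='D'))

forward_filled = s.ffill()   # carry the last observed value forward
zero_filled = s.fillna(0)    # replace every gap with a constant

print(forward_filled.tolist())
print(zero_filled.tolist())
```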

Sometimes you may want to fill values back to an earlier date or forward to a later date. Use the origin and end arguments for this.


>>> imputer8 = TimeSeriesImputer(input_column=['sales'], option='fillna',
...                              value=0, origin='2016-12-28')
>>> imputer8.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2016-12-28    NaN    0.0     a
1  2016-12-29    NaN    0.0     a
2  2016-12-30    NaN    0.0     a
3  2016-12-31    NaN    0.0     a
4  2017-01-01    NaN    0.0     a
5  2017-01-02    NaN    1.0     a
6  2017-01-03    NaN    0.0     a
7  2017-01-04    3.0    0.0     a
8  2016-12-28    NaN    0.0     b
9  2016-12-29    NaN    0.0     b
10 2016-12-30    NaN    0.0     b
11 2016-12-31    NaN    0.0     b
12 2017-01-01    NaN    2.0     b
13 2017-01-02    4.0    0.0     b
14 2016-12-28    NaN    0.0     c
15 2016-12-29    NaN    0.0     c
16 2016-12-30    NaN    0.0     c
17 2016-12-31    NaN    0.0     c
18 2017-01-01    3.0    6.0     c
19 2017-01-02    6.0    7.0     c
20 2017-01-03    NaN    0.0     c
21 2016-12-28    NaN    0.0     d
22 2016-12-29    NaN    0.0     d
23 2016-12-30    NaN    0.0     d
24 2016-12-31    NaN    0.0     d
25 2017-01-01    2.0   10.0     d
26 2017-01-02    6.0   11.0     d
27 2017-01-03    3.0   15.0     d
28 2017-01-04    5.0   13.0     d
29 2017-01-05    5.0   14.0     d
30 2017-01-06    NaN    0.0     d
31 2017-01-07    NaN    0.0     d
32 2017-01-08    6.0   15.0     d

>>> imputer9 = TimeSeriesImputer(input_column=['sales'], option='fillna',
...                              value=0, end='2017-01-10')
>>> imputer9.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2017-01-02    NaN    1.0     a
1  2017-01-03    NaN    0.0     a
2  2017-01-04    3.0    0.0     a
3  2017-01-05    NaN    0.0     a
4  2017-01-06    NaN    0.0     a
5  2017-01-07    NaN    0.0     a
6  2017-01-08    NaN    0.0     a
7  2017-01-09    NaN    0.0     a
8  2017-01-10    NaN    0.0     a
9  2017-01-01    NaN    2.0     b
10 2017-01-02    4.0    0.0     b
11 2017-01-03    NaN    0.0     b
12 2017-01-04    NaN    0.0     b
13 2017-01-05    NaN    0.0     b
14 2017-01-06    NaN    0.0     b
15 2017-01-07    NaN    0.0     b
16 2017-01-08    NaN    0.0     b
17 2017-01-09    NaN    0.0     b
18 2017-01-10    NaN    0.0     b
19 2017-01-01    3.0    6.0     c
20 2017-01-02    6.0    7.0     c
21 2017-01-03    NaN    0.0     c
22 2017-01-04    NaN    0.0     c
23 2017-01-05    NaN    0.0     c
24 2017-01-06    NaN    0.0     c
25 2017-01-07    NaN    0.0     c
26 2017-01-08    NaN    0.0     c
27 2017-01-09    NaN    0.0     c
28 2017-01-10    NaN    0.0     c
29 2017-01-01    2.0   10.0     d
30 2017-01-02    6.0   11.0     d
31 2017-01-03    3.0   15.0     d
32 2017-01-04    5.0   13.0     d
33 2017-01-05    5.0   14.0     d
34 2017-01-06    NaN    0.0     d
35 2017-01-07    NaN    0.0     d
36 2017-01-08    6.0   15.0     d
37 2017-01-09    NaN    0.0     d
38 2017-01-10    NaN    0.0     d
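
The effect of origin and end can be approximated in plain pandas by reindexing a series onto a date range that starts at the origin or stops at the end, then filling the new rows. The sketch below does this for the 'sales' values of store 'b'; `extend_and_fill` is a hypothetical helper, not part of the library.

```python
import numpy as np
import pandas as pd

# 'sales' for store 'b' from df1.
s = pd.Series([2, np.nan],
              index=pd.to_datetime(['2017-01-01', '2017-01-02']))


def extend_and_fill(series, origin=None, end=None, freq='D', value=0):
    """Extend the index back to `origin` or forward to `end`, then fill."""
    start = pd.Timestamp(origin) if origin is not None else series.index.min()
    stop = pd.Timestamp(end) if end is not None else series.index.max()
    full_index = pd.date_range(start, stop, freq=freq)
    return series.reindex(full_index).fillna(value)


back = extend_and_fill(s, origin='2016-12-28')
forward = extend_and_fill(s, end='2017-01-10')
print(back.tolist())
print(forward.tolist())
```

The results line up with the rows for store 'b' in the imputer8 and imputer9 outputs above.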

Inheritance
TimeSeriesImputer
azureml.automl.runtime.featurizer.transformer.timeseries.forecasting_base_estimator._GrainBasedStatefulTransformer
TimeSeriesImputer

Constructor

TimeSeriesImputer(input_column: Union[str, List[str]], option: str = 'fillna', method: Optional[Union[Dict[str, List[Any]], str]] = None, value: Optional[Any] = None, limit: Optional[int] = None, limit_direction: Optional[str] = 'forward', order: Optional[str] = None, freq: Optional[Union[str, pandas._libs.tslibs.offsets.DateOffset]] = None, origin: Optional[str] = None, end: Optional[str] = None, impute_by_horizon: bool = False)

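Parameters

input_column
str or list[str]
Required

The name of the column or columns to impute.

option
str
Default value: fillna

One of {'interpolate', 'fillna'}. The 'interpolate' and 'fillna' options take additional parameters that specify how they operate.

method
str, dict
Default value: None

Can be used with the 'interpolate' or 'fillna' option; see pandas.Series.interpolate or pandas.Series.fillna, respectively, for details. In addition, the 'fillna' option accepts a dictionary of the form {pandas.Series.fillna method: [applicable cols]}.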
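
The dictionary form maps each fill method to the columns it applies to. The sketch below illustrates that shape with plain pandas; the mapping and the loop are illustrative assumptions about what such a dictionary expresses, not the imputer's internal logic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sales': [1.0, np.nan, 3.0],
                   'price': [np.nan, 4.0, np.nan]})

# Hypothetical mapping in the documented shape {fill method: [columns]}.
method = {'ffill': ['sales'], 'bfill': ['price']}

filled = df.copy()
for fill_method, columns in method.items():
    for column in columns:
        # Series.ffill() and Series.bfill() are the method-call forms
        # of fillna(method='ffill') and fillna(method='bfill').
        filled[column] = getattr(filled[column], fill_method)()

print(filled)
```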

limit
int
Default value: None

Can be used with the 'interpolate' or 'fillna' option; see pandas.Series.interpolate or pandas.Series.fillna, respectively, for details.

value
Default value: None

Can be used with the 'fillna' option; see pandas.Series.fillna for details.

limit_direction
str
Default value: forward

Can be used with the 'interpolate' option; see pandas.Series.interpolate for details.

order
str
Default value: None

Can be used with the 'interpolate' option; see pandas.Series.interpolate for details.

freq
str or <xref:pandas.tseries.offsets.DateOffset>
Default value: None

The time series frequency. If freq is a string, it must be a pandas offset alias. See pandas.tseries.offsets.DateOffset for details.
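
An offset alias and a DateOffset object describe the same thing. The short pandas sketch below converts aliases to offsets with `to_offset`; it shows only the pandas side and makes no claim about how the imputer parses freq.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

daily = to_offset('D')             # alias -> DateOffset object
every_other_day = to_offset('2D')  # a multiple of the base alias

print(daily == pd.offsets.Day(1))
print(every_other_day.n)
```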

origin
str
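Default value: None

If provided, the datetimes are filled back to origin for all grains.

end
str
Default value: None

If provided, the datetimes are filled forward to end for all grains.

impute_by_horizon
bool
Default value: False

See the TimeSeriesImputer.transform() documentation for details.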

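Methods

fit

Fit updates the transformer's _known_df.

This method is just a pass-through.

get_params

Get the parameters of this estimator.

transform

Perform imputation on the requested data frame columns.

A brief summary of how TimeSeriesImputer works:

  1. When the input TimeSeriesDataSet does not have the origin_time_column_name attribute:

    Values are imputed within each time series, that is, within each single combination of time_series_id_column_names.

  2. When the input TimeSeriesDataSet has the origin_time_column_name attribute:

    a) If the series within the same time_series_id_column_names have the same value whenever the value of time_column_name is the same, the series is condensed to drop origin_time_column_name and imputed on the condensed data frame. The imputed values are joined back to the original data by time_series_id_column_names and time_column_name.

    b) If the series within the same time_series_id_column_names have the same value whenever the value of origin_time_column_name is the same, the series is condensed to drop time_column_name and imputed on the condensed data frame. The imputed values are joined back to the original data by time_series_id_column_names and origin_time_column_name.

    c) For series that fall under neither a) nor b): if impute_by_horizon is True, imputation is applied within each sub-time-series formed by a single combination of time_series_id_column_names and origin_time_column_name. Otherwise, it is applied within each sub-time-series formed by a single combination of time_series_id_column_names and horizon.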

fit

Fit updates the transformer's _known_df.

This method is just a pass-through.

fit(X: azureml.automl.runtime._time_series_data_set.TimeSeriesDataSet, y: Optional[Union[numpy.ndarray, pandas.core.series.Series, pandas.core.arrays.categorical.Categorical, azureml.dataprep.api.dataflow.Dataflow]] = None) -> azureml.automl.runtime.featurizer.transformer.timeseries.time_series_imputer.TimeSeriesImputer

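Parameters

X
Required

The input time series data set.

y
Default value: None

Ignored.

Returns

The fitted transformer.

Return type

azureml.automl.runtime.featurizer.transformer.timeseries.time_series_imputer.TimeSeriesImputer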

get_params

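Get the parameters of this estimator.

get_params(deep=True)

Parameters

deep
bool
Default value: True

If True, returns the parameters of this estimator and of any contained sub-objects that are estimators.

Returns

params: parameter names mapped to their values.

Return type

mapping of string to any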

transform

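Perform imputation on the requested data frame columns.

A brief summary of how TimeSeriesImputer works:

  1. When the input TimeSeriesDataSet does not have the origin_time_column_name attribute:

    Values are imputed within each time series, that is, within each single combination of time_series_id_column_names.

  2. When the input TimeSeriesDataSet has the origin_time_column_name attribute:

    a) If the series within the same time_series_id_column_names have the same value whenever the value of time_column_name is the same, the series is condensed to drop origin_time_column_name and imputed on the condensed data frame. The imputed values are joined back to the original data by time_series_id_column_names and time_column_name.

    b) If the series within the same time_series_id_column_names have the same value whenever the value of origin_time_column_name is the same, the series is condensed to drop time_column_name and imputed on the condensed data frame. The imputed values are joined back to the original data by time_series_id_column_names and origin_time_column_name.

    c) For series that fall under neither a) nor b): if impute_by_horizon is True, imputation is applied within each sub-time-series formed by a single combination of time_series_id_column_names and origin_time_column_name. Otherwise, it is applied within each sub-time-series formed by a single combination of time_series_id_column_names and horizon.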
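
Case a) can be pictured as a condense, impute, join-back sequence. The pandas sketch below walks through it on a toy frame in which 'price' depends only on the store and date, not on the origin time. The column names ('store', 'date', 'origin', 'price') and the use of linear interpolation are assumptions made for illustration; this is not the library's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'store': ['a'] * 6,
    'date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-02',
                            '2017-01-02', '2017-01-03', '2017-01-03']),
    'origin': pd.to_datetime(['2016-12-30', '2016-12-31'] * 3),
    'price': [1.0, 1.0, np.nan, np.nan, 3.0, 3.0]})
keys = ['store', 'date']

# Case a) applies when every (store, date) pair carries a single value.
is_case_a = bool(
    (df.groupby(keys)['price'].nunique(dropna=False) <= 1).all())

# Condense: drop the origin column and keep one row per (store, date).
condensed = df.drop(columns='origin').drop_duplicates(subset=keys)

# Impute on the condensed frame, within each store.
condensed = condensed.assign(
    price=condensed.groupby('store')['price'].transform(
        lambda s: s.interpolate(method='linear', limit_direction='both')))

# Join the imputed values back to the original rows.
result = df.drop(columns='price').merge(condensed, on=keys, how='left')
print(result)
```

Both rows for '2017-01-02' receive the same imputed value, which is the point of condensing before imputing.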

transform(X: azureml.automl.runtime._time_series_data_set.TimeSeriesDataSet) -> azureml.automl.runtime._time_series_data_set.TimeSeriesDataSet

Parameters

X
<xref:azureml.automl.runtime._time_series_data_set.TimeSeriesDataSet>
Required

The data frame to transform.

Returns

A data frame with the imputed columns.

Return type

<xref:azureml.automl.runtime._time_series_data_set.TimeSeriesDataSet>

Attributes

freq

Return the frequency of the data set.

FFILL_METHOD_STR

FFILL_METHOD_STR = 'ffill'