time_series_imputer module

Impute missing data in a time-series-specific way, such as interpolation.

TimeSeriesImputer

Imputation transformer for imputing the missing values of data frame columns.

Examples: Construct a sample data frame, df1. Note that df1 is not a regular time series, because the row for date '2017-01-03' is missing for store 'a'.


>>> data1 = pd.DataFrame(
...  {'store': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd',
...            'd', 'd', 'd', 'd', 'd', 'd', 'd'],
...   'date': pd.to_datetime(
...      ['2017-01-02', '2017-01-04', '2017-01-01', '2017-01-02',
...       '2017-01-01', '2017-01-02', '2017-01-03', '2017-01-01',
...       '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
...       '2017-01-06', '2017-01-07', '2017-01-08']),
...   'sales': [1, np.nan, 2, np.nan, 6, 7, np.nan, 10, 11, 15, 13, 14,
...             np.nan, np.nan, 15],
...   'price': [np.nan, 3, np.nan, 4, 3, 6, np.nan, 2, 6, 3, 5, 5,
...             np.nan, np.nan, 6]})
>>> df1 = TimeSeriesDataSet(data1, time_series_id_column_names=['store'],
...                           time_column_name='date', target_column_name='sales')
>>> df1.data
                  price  sales
date       store
2017-01-02 a        nan   1.00
2017-01-04 a       3.00    nan
2017-01-01 b        nan   2.00
2017-01-02 b       4.00    nan
2017-01-01 c       3.00   6.00
2017-01-02 c       6.00   7.00
2017-01-03 c        nan    nan
2017-01-01 d       2.00  10.00
2017-01-02 d       6.00  11.00
2017-01-03 d       3.00  15.00
2017-01-04 d       5.00  13.00
2017-01-05 d       5.00  14.00
2017-01-06 d        nan    nan
2017-01-07 d        nan    nan
2017-01-08 d       6.00  15.00

If you run infer_freq, the 'regular_ts' attribute is False and the inferred frequency is 'D'.


>>> df1.infer_freq()
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
{'regular_ts': False, 'freq': 'D'}

>>> sorted(df1.infer_freq().items())
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
[('freq', 'D'), ('regular_ts', False)]
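
The warning above appears because store 'a' has a two-day gap while the other stores are spaced one day apart. As a rough illustration only, the per-series spacing can be inspected with plain pandas. This is not the library's implementation: `smallest_step_per_series` is a hypothetical helper, and picking the most common step as the overall frequency is an assumption made for this sketch.

```python
from collections import Counter

import pandas as pd


def smallest_step_per_series(df, id_col="store", time_col="date"):
    """Return the smallest gap between consecutive timestamps for each series."""
    steps = {}
    for key, group in df.groupby(id_col):
        gaps = group[time_col].sort_values().diff().dropna()
        if not gaps.empty:
            steps[key] = gaps.min()
    return steps


data = pd.DataFrame({
    "store": ["a", "a", "b", "b", "c", "c", "c"],
    "date": pd.to_datetime([
        "2017-01-02", "2017-01-04",
        "2017-01-01", "2017-01-02",
        "2017-01-01", "2017-01-02", "2017-01-03",
    ]),
})

steps = smallest_step_per_series(data)
# Assumption for this sketch: the most common step is taken as the overall frequency.
common_step = Counter(steps.values()).most_common(1)[0][0]
print(steps)
print(common_step)
```

Store 'a' reports a two-day step because its '2017-01-03' row is absent, which is why two distinct frequencies are detected.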

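Impute df1 for the single column 'sales' using option 'default'. Note that for store 'a', the missing row for date '2017-01-03' is also added and imputed. In addition, the frequency used to fill in missing dates defaults to the frequency inferred by df1.infer_freq(), which is 'D' in this case.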


>>> imputer1 = TimeSeriesImputer(input_column='sales', option='default')
>>> imputer1.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   7.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  14.333333     d
14 2017-01-07    NaN  14.666667     d
15 2017-01-08    6.0  15.000000     d
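
To make the two steps in this output concrete (adding missing dates, then interpolating), here is a minimal pandas-only sketch. It is an approximation under stated assumptions, not the TimeSeriesImputer implementation: `impute_linear` is a hypothetical helper, and it assumes each series is extended only between its own first and last observed dates.

```python
import numpy as np
import pandas as pd


def impute_linear(df, column, id_col="store", time_col="date", freq="D"):
    """Add missing timestamps per series, then linearly interpolate `column`."""
    pieces = []
    for key, group in df.groupby(id_col, sort=True):
        full_range = pd.date_range(group[time_col].min(), group[time_col].max(), freq=freq)
        filled = group.set_index(time_col).reindex(full_range)
        filled[id_col] = key  # rows added by reindex have no series id yet
        # limit_direction="both" also fills leading and trailing gaps
        filled[column] = filled[column].interpolate(method="linear", limit_direction="both")
        pieces.append(filled.rename_axis(time_col).reset_index())
    return pd.concat(pieces, ignore_index=True)


data = pd.DataFrame({
    "store": ["a", "a", "d", "d", "d", "d"],
    "date": pd.to_datetime([
        "2017-01-02", "2017-01-04",
        "2017-01-05", "2017-01-06", "2017-01-07", "2017-01-08",
    ]),
    "sales": [1, np.nan, 14, np.nan, np.nan, 15],
})

result = impute_linear(data, "sales")
print(result)
```

With this subset of df1, store 'a' gains a row for '2017-01-03' and the two gaps in store 'd' are filled on a straight line between 14 and 15, matching the values in the table above.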

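To specify the frequency explicitly, pass it through the 'freq' keyword argument. This is useful because the inferred frequency is not always accurate: for example, no frequency may be inferred from any of the time series, or multiple frequencies may be inferred and the one chosen is not the one you want.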


>>> imputer2 = TimeSeriesImputer(input_column='sales', option='default',
...                              freq='D')
>>> imputer2.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   7.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  14.333333     d
14 2017-01-07    NaN  14.666667     d
15 2017-01-08    6.0  15.000000     d
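
The choice of frequency determines which rows count as missing. The following sketch uses only pandas (no TimeSeriesImputer call) to show the date grid for store 'a' under 'D' and under '2D'. The variable names are illustrative.

```python
import pandas as pd

# Observed dates for store 'a' in df1.
observed = pd.to_datetime(["2017-01-02", "2017-01-04"])

daily_grid = pd.date_range(observed.min(), observed.max(), freq="D")
two_day_grid = pd.date_range(observed.min(), observed.max(), freq="2D")

# Timestamps that a fill step would have to add under each frequency.
missing_daily = daily_grid.difference(observed)
missing_two_day = two_day_grid.difference(observed)
print(list(missing_daily))
print(list(missing_two_day))
```

Under 'D' the date '2017-01-03' has to be added, while under '2D' the series already looks complete, so an unintended frequency can silently change which rows are imputed.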

The default option is the same as setting option='interpolate', method='linear', and limit_direction='both'.


>>> imputer3 = TimeSeriesImputer(input_column='sales',
...                              option='interpolate', method='linear',
...                              limit_direction='both')
>>> imputer3.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   7.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  14.333333     d
14 2017-01-07    NaN  14.666667     d
15 2017-01-08    6.0  15.000000     d

You can also impute a list of columns. Here, df1 is imputed for both the 'sales' and 'price' columns.


>>> imputer4 = TimeSeriesImputer(input_column=['sales', 'price'],
...                              option='default')
>>> imputer4.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date     price      sales store
0  2017-01-02  3.000000   1.000000     a
1  2017-01-03  3.000000   1.000000     a
2  2017-01-04  3.000000   1.000000     a
3  2017-01-01  4.000000   2.000000     b
4  2017-01-02  4.000000   2.000000     b
5  2017-01-01  3.000000   6.000000     c
6  2017-01-02  6.000000   7.000000     c
7  2017-01-03  6.000000   7.000000     c
8  2017-01-01  2.000000  10.000000     d
9  2017-01-02  6.000000  11.000000     d
10 2017-01-03  3.000000  15.000000     d
11 2017-01-04  5.000000  13.000000     d
12 2017-01-05  5.000000  14.000000     d
13 2017-01-06  5.333333  14.333333     d
14 2017-01-07  5.666667  14.666667     d
15 2017-01-08  6.000000  15.000000     d
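
As a hedged illustration of imputing several columns at once, the sketch below interpolates each listed column within each store using plain pandas. It is not the library's code path, and it assumes the rows are already on a regular daily grid, which holds for stores 'c' and 'd' in df1.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "store": ["c", "c", "c", "d", "d", "d", "d"],
    "date": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-01-03",
        "2017-01-05", "2017-01-06", "2017-01-07", "2017-01-08",
    ]),
    "sales": [6, 7, np.nan, 14, np.nan, np.nan, 15],
    "price": [3, 6, np.nan, 5, np.nan, np.nan, 6],
})

columns = ["sales", "price"]
imputed = data.copy()
for col in columns:
    # Interpolate inside each store so values never cross series boundaries.
    imputed[col] = data.groupby("store")[col].transform(
        lambda s: s.interpolate(method="linear", limit_direction="both")
    )
print(imputed)
```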

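You can also set option to 'interpolate' and use keyword arguments from pandas.Series.interpolate, such as 'method', 'limit', 'limit_direction', and 'order'. Note that if the specified method does not work for some grains, the default linear interpolation is used for those grains.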


>>> imputer5 = TimeSeriesImputer(input_column=['sales'],
...                              option='interpolate', method='barycentric')
>>> imputer5.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price      sales store
0  2017-01-02    NaN   1.000000     a
1  2017-01-03    NaN   1.000000     a
2  2017-01-04    3.0   1.000000     a
3  2017-01-01    NaN   2.000000     b
4  2017-01-02    4.0   2.000000     b
5  2017-01-01    3.0   6.000000     c
6  2017-01-02    6.0   7.000000     c
7  2017-01-03    NaN   8.000000     c
8  2017-01-01    2.0  10.000000     d
9  2017-01-02    6.0  11.000000     d
10 2017-01-03    3.0  15.000000     d
11 2017-01-04    5.0  13.000000     d
12 2017-01-05    5.0  14.000000     d
13 2017-01-06    NaN  26.904762     d
14 2017-01-07    NaN  42.428571     d
15 2017-01-08    6.0  15.000000     d
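
The fallback behavior described above can be sketched as a try/except around pandas.Series.interpolate. This is only an assumption about the general shape of such logic, and `interpolate_with_fallback` is a hypothetical helper rather than part of the library. The example deliberately triggers a failure by requesting 'polynomial' without the 'order' argument that pandas requires for it.

```python
import numpy as np
import pandas as pd


def interpolate_with_fallback(series, method, **kwargs):
    """Try the requested method; fall back to linear interpolation on failure."""
    try:
        return series.interpolate(method=method, **kwargs)
    except Exception:  # broad on purpose: invalid arguments, missing SciPy, etc.
        return series.interpolate(method="linear", limit_direction="both")


# Sales for store 'c' in df1.
sales_c = pd.Series([6.0, 7.0, np.nan])

# 'polynomial' needs an 'order' argument, so this call falls back to linear.
filled = interpolate_with_fallback(sales_c, method="polynomial")
print(filled.tolist())
```

Catching every exception keeps the sketch short. Real code would probably want to catch narrower exception types and report which series fell back.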

You can also set option to 'fillna' and use keyword arguments from pandas.Series.fillna, such as 'method', 'value', and 'limit'.


>>> imputer6 = TimeSeriesImputer(input_column=['sales'], option='fillna', method='ffill')
>>> imputer6.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.


         date  price  sales store
0  2017-01-02    NaN    1.0     a
1  2017-01-03    NaN    1.0     a
2  2017-01-04    3.0    1.0     a
3  2017-01-01    NaN    2.0     b
4  2017-01-02    4.0    2.0     b
5  2017-01-01    3.0    6.0     c
6  2017-01-02    6.0    7.0     c
7  2017-01-03    NaN    7.0     c
8  2017-01-01    2.0   10.0     d
9  2017-01-02    6.0   11.0     d
10 2017-01-03    3.0   15.0     d
11 2017-01-04    5.0   13.0     d
12 2017-01-05    5.0   14.0     d
13 2017-01-06    NaN   14.0     d
14 2017-01-07    NaN   14.0     d
15 2017-01-08    6.0   15.0     d

>>> imputer7 = TimeSeriesImputer(input_column=['sales'], option='fillna',
...                              value=0)
>>> imputer7.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2017-01-02    NaN    1.0     a
1  2017-01-03    NaN    0.0     a
2  2017-01-04    3.0    0.0     a
3  2017-01-01    NaN    2.0     b
4  2017-01-02    4.0    0.0     b
5  2017-01-01    3.0    6.0     c
6  2017-01-02    6.0    7.0     c
7  2017-01-03    NaN    0.0     c
8  2017-01-01    2.0   10.0     d
9  2017-01-02    6.0   11.0     d
10 2017-01-03    3.0   15.0     d
11 2017-01-04    5.0   13.0     d
12 2017-01-05    5.0   14.0     d
13 2017-01-06    NaN    0.0     d
14 2017-01-07    NaN    0.0     d
15 2017-01-08    6.0   15.0     d
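
The two 'fillna' variants above correspond to a forward fill and a constant fill. The sketch below reproduces both with plain pandas on a small made-up frame (the values are illustrative, not df1). It assumes that filling should happen within each store, which is why the forward fill is grouped.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "store": ["a", "a", "b", "b", "d", "d", "d"],
    "sales": [1, np.nan, np.nan, 4, 14, np.nan, np.nan],
})

# Forward-fill inside each store so values never leak across series.
data["sales_ffill"] = data.groupby("store")["sales"].ffill()

# A constant fill does not depend on neighbouring rows, so no grouping is needed.
data["sales_zero"] = data["sales"].fillna(0)
print(data)
```

Note that the first row of store 'b' stays missing after the grouped forward fill, because there is no earlier value in the same store to carry forward.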

Sometimes you may want to fill values back to an earlier date or forward to a later date. Use the origin and end properties for this purpose.


>>> imputer8 = TimeSeriesImputer(input_column=['sales'], option='fillna',
...                              value=0, origin='2016-12-28')
>>> imputer8.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2016-12-28    NaN    0.0     a
1  2016-12-29    NaN    0.0     a
2  2016-12-30    NaN    0.0     a
3  2016-12-31    NaN    0.0     a
4  2017-01-01    NaN    0.0     a
5  2017-01-02    NaN    1.0     a
6  2017-01-03    NaN    0.0     a
7  2017-01-04    3.0    0.0     a
8  2016-12-28    NaN    0.0     b
9  2016-12-29    NaN    0.0     b
10 2016-12-30    NaN    0.0     b
11 2016-12-31    NaN    0.0     b
12 2017-01-01    NaN    2.0     b
13 2017-01-02    4.0    0.0     b
14 2016-12-28    NaN    0.0     c
15 2016-12-29    NaN    0.0     c
16 2016-12-30    NaN    0.0     c
17 2016-12-31    NaN    0.0     c
18 2017-01-01    3.0    6.0     c
19 2017-01-02    6.0    7.0     c
20 2017-01-03    NaN    0.0     c
21 2016-12-28    NaN    0.0     d
22 2016-12-29    NaN    0.0     d
23 2016-12-30    NaN    0.0     d
24 2016-12-31    NaN    0.0     d
25 2017-01-01    2.0   10.0     d
26 2017-01-02    6.0   11.0     d
27 2017-01-03    3.0   15.0     d
28 2017-01-04    5.0   13.0     d
29 2017-01-05    5.0   14.0     d
30 2017-01-06    NaN    0.0     d
31 2017-01-07    NaN    0.0     d
32 2017-01-08    6.0   15.0     d

>>> imputer9 = TimeSeriesImputer(input_column=['sales'], option='fillna',
...                              value=0, end='2017-01-10')
>>> imputer9.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.

         date  price  sales store
0  2017-01-02    NaN    1.0     a
1  2017-01-03    NaN    0.0     a
2  2017-01-04    3.0    0.0     a
3  2017-01-05    NaN    0.0     a
4  2017-01-06    NaN    0.0     a
5  2017-01-07    NaN    0.0     a
6  2017-01-08    NaN    0.0     a
7  2017-01-09    NaN    0.0     a
8  2017-01-10    NaN    0.0     a
9  2017-01-01    NaN    2.0     b
10 2017-01-02    4.0    0.0     b
11 2017-01-03    NaN    0.0     b
12 2017-01-04    NaN    0.0     b
13 2017-01-05    NaN    0.0     b
14 2017-01-06    NaN    0.0     b
15 2017-01-07    NaN    0.0     b
16 2017-01-08    NaN    0.0     b
17 2017-01-09    NaN    0.0     b
18 2017-01-10    NaN    0.0     b
19 2017-01-01    3.0    6.0     c
20 2017-01-02    6.0    7.0     c
21 2017-01-03    NaN    0.0     c
22 2017-01-04    NaN    0.0     c
23 2017-01-05    NaN    0.0     c
24 2017-01-06    NaN    0.0     c
25 2017-01-07    NaN    0.0     c
26 2017-01-08    NaN    0.0     c
27 2017-01-09    NaN    0.0     c
28 2017-01-10    NaN    0.0     c
29 2017-01-01    2.0   10.0     d
30 2017-01-02    6.0   11.0     d
31 2017-01-03    3.0   15.0     d
32 2017-01-04    5.0   13.0     d
33 2017-01-05    5.0   14.0     d
34 2017-01-06    NaN    0.0     d
35 2017-01-07    NaN    0.0     d
36 2017-01-08    6.0   15.0     d
37 2017-01-09    NaN    0.0     d
38 2017-01-10    NaN    0.0     d
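
The effect of origin and end can be approximated for a single series by reindexing onto a wider date range before filling. This is a pandas-only sketch under stated assumptions: `extend_and_fill` is a hypothetical helper, and it assumes origin is not later than the first observation and end is not earlier than the last one (otherwise the reindex would drop observed rows).

```python
import pandas as pd


def extend_and_fill(series, freq="D", origin=None, end=None, value=0):
    """Reindex a date-indexed series onto [origin, end] and fill gaps with `value`."""
    start = pd.Timestamp(origin) if origin is not None else series.index.min()
    stop = pd.Timestamp(end) if end is not None else series.index.max()
    full_range = pd.date_range(start, stop, freq=freq)
    return series.reindex(full_range).fillna(value)


# Sales for store 'b' in df1.
sales_b = pd.Series(
    [2.0, float("nan")],
    index=pd.to_datetime(["2017-01-01", "2017-01-02"]),
)

padded = extend_and_fill(sales_b, origin="2016-12-28", end="2017-01-04")
print(padded)
```

Here both bounds are passed in one call for brevity. The examples above set origin and end separately.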