|
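Imputation transformer for filling in missing values in data frame columns.
Example: construct the sample data frame df1. Note that df1 is not a regular time series because, for store 'a', the row for date '2017-01-03' is missing.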
>>> data1 = pd.DataFrame(
... {'store': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd',
... 'd', 'd', 'd', 'd', 'd', 'd', 'd'],
... 'date': pd.to_datetime(
... ['2017-01-02', '2017-01-04', '2017-01-01', '2017-01-02',
... '2017-01-01', '2017-01-02', '2017-01-03', '2017-01-01',
... '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
... '2017-01-06', '2017-01-07', '2017-01-08']),
... 'sales': [1, np.nan, 2, np.nan, 6, 7, np.nan, 10, 11, 15, 13, 14,
... np.nan, np.nan, 15],
... 'price': [np.nan, 3, np.nan, 4, 3, 6, np.nan, 2, 6, 3, 5, 5,
... np.nan, np.nan, 6]})
>>> df1 = TimeSeriesDataSet(data1, time_series_id_column_names=['store'],
... time_column_name='date', target_column_name='sales')
>>> df1.data
price sales
date store
2017-01-02 a nan 1.00
2017-01-04 a 3.00 nan
2017-01-01 b nan 2.00
2017-01-02 b 4.00 nan
2017-01-01 c 3.00 6.00
2017-01-02 c 6.00 7.00
2017-01-03 c nan nan
2017-01-01 d 2.00 10.00
2017-01-02 d 6.00 11.00
2017-01-03 d 3.00 15.00
2017-01-04 d 5.00 13.00
2017-01-05 d 5.00 14.00
2017-01-06 d nan nan
2017-01-07 d nan nan
2017-01-08 d 6.00 15.00
If you run infer_freq, the 'regular_ts' attribute is False and the inferred frequency is 'D'.
>>> df1.infer_freq()
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
{'regular_ts': False, 'freq': 'D'}
>>> sorted(df1.infer_freq().items())
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
[('freq', 'D'), ('regular_ts', False)]
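The warning above reports two frequencies because the gaps between consecutive dates differ across stores. The following is a minimal pandas-only sketch of that check, using the dates of stores 'a', 'b' and 'c' from df1. It is an illustration under stated assumptions, not the implementation of infer_freq; the names `gap_days` and `is_regular` are hypothetical.

```python
import pandas as pd

# Dates for stores 'a', 'b' and 'c', copied from the sample data above.
dates = pd.DataFrame({
    "store": ["a", "a", "b", "b", "c", "c", "c"],
    "date": pd.to_datetime([
        "2017-01-02", "2017-01-04",
        "2017-01-01", "2017-01-02",
        "2017-01-01", "2017-01-02", "2017-01-03",
    ]),
})

# Gap between consecutive dates within each store.
ordered = dates.sort_values(["store", "date"])
gaps = ordered.groupby("store")["date"].diff().dropna()

# Distinct gap lengths in days; more than one means the data is irregular.
gap_days = sorted({int(days) for days in gaps.dt.days})
is_regular = len(gap_days) == 1
print(gap_days, is_regular)
```

Store 'a' contributes a two-day gap while the other stores have one-day gaps, which corresponds to the '2D' and 'D' frequencies named in the warning.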
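Impute df1 for the single column 'sales' using option='default'. Note that, for store 'a', the missing row for date '2017-01-03' is also added and imputed.
Also, by default, the frequency used to fill in missing dates is the one inferred by df1.infer_freq(), which is 'D' in this case.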
>>> imputer1 = TimeSeriesImputer(input_column='sales', option='default')
>>> imputer1.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.000000 a
1 2017-01-03 NaN 1.000000 a
2 2017-01-04 3.0 1.000000 a
3 2017-01-01 NaN 2.000000 b
4 2017-01-02 4.0 2.000000 b
5 2017-01-01 3.0 6.000000 c
6 2017-01-02 6.0 7.000000 c
7 2017-01-03 NaN 7.000000 c
8 2017-01-01 2.0 10.000000 d
9 2017-01-02 6.0 11.000000 d
10 2017-01-03 3.0 15.000000 d
11 2017-01-04 5.0 13.000000 d
12 2017-01-05 5.0 14.000000 d
13 2017-01-06 NaN 14.333333 d
14 2017-01-07 NaN 14.666667 d
15 2017-01-08 6.0 15.000000 d
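The two steps described above (add the missing dates, then interpolate) can be approximated with plain pandas. This is a rough sketch under assumptions, not the TimeSeriesImputer implementation: the helper `impute_default` is hypothetical, and only a subset of df1 (store 'a' and the tail of store 'd') is used.

```python
import numpy as np
import pandas as pd

# Subset of the sample data: store 'a' and the last four days of store 'd'.
data = pd.DataFrame({
    "store": ["a", "a", "d", "d", "d", "d"],
    "date": pd.to_datetime([
        "2017-01-02", "2017-01-04",
        "2017-01-05", "2017-01-06", "2017-01-07", "2017-01-08",
    ]),
    "sales": [1, np.nan, 14, np.nan, np.nan, 15],
})

def impute_default(df, freq="D"):
    """Hypothetical helper: fill date gaps per store, then interpolate linearly."""
    pieces = []
    for store, group in df.groupby("store", sort=False):
        series = group.set_index("date")["sales"]
        # Step 1: add rows for every missing date between first and last date.
        full_index = pd.date_range(series.index.min(), series.index.max(), freq=freq)
        series = series.reindex(full_index)
        # Step 2: linear interpolation, filling in both directions.
        series = series.interpolate(method="linear", limit_direction="both")
        pieces.append(series.rename_axis("date").reset_index().assign(store=store))
    return pd.concat(pieces, ignore_index=True)

result = impute_default(data)
print(result)
```

For store 'a' the row for 2017-01-03 is created and both missing values become 1.0; for store 'd' the two missing days are filled with evenly spaced values between 14 and 15.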
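To specify the frequency explicitly, pass it with the 'freq' keyword argument. This is useful when the inferred frequency is unreliable, for example when no frequency can be inferred from any of the time series, or when several frequencies are inferred and the one chosen is not the one you want.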
>>> imputer2 = TimeSeriesImputer(input_column='sales', option='default',
... freq='D')
>>> imputer2.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.000000 a
1 2017-01-03 NaN 1.000000 a
2 2017-01-04 3.0 1.000000 a
3 2017-01-01 NaN 2.000000 b
4 2017-01-02 4.0 2.000000 b
5 2017-01-01 3.0 6.000000 c
6 2017-01-02 6.0 7.000000 c
7 2017-01-03 NaN 7.000000 c
8 2017-01-01 2.0 10.000000 d
9 2017-01-02 6.0 11.000000 d
10 2017-01-03 3.0 15.000000 d
11 2017-01-04 5.0 13.000000 d
12 2017-01-05 5.0 14.000000 d
13 2017-01-06 NaN 14.333333 d
14 2017-01-07 NaN 14.666667 d
15 2017-01-08 6.0 15.000000 d
The default option is equivalent to setting option='interpolate', method='linear' and limit_direction='both'.
>>> imputer3 = TimeSeriesImputer(input_column='sales',
... option='interpolate', method='linear',
... limit_direction='both')
>>> imputer3.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.000000 a
1 2017-01-03 NaN 1.000000 a
2 2017-01-04 3.0 1.000000 a
3 2017-01-01 NaN 2.000000 b
4 2017-01-02 4.0 2.000000 b
5 2017-01-01 3.0 6.000000 c
6 2017-01-02 6.0 7.000000 c
7 2017-01-03 NaN 7.000000 c
8 2017-01-01 2.0 10.000000 d
9 2017-01-02 6.0 11.000000 d
10 2017-01-03 3.0 15.000000 d
11 2017-01-04 5.0 13.000000 d
12 2017-01-05 5.0 14.000000 d
13 2017-01-06 NaN 14.333333 d
14 2017-01-07 NaN 14.666667 d
15 2017-01-08 6.0 15.000000 d
You can also impute a list of columns. Here, df1 is imputed for both the 'sales' and 'price' columns.
>>> imputer4 = TimeSeriesImputer(input_column=['sales', 'price'],
... option='default')
>>> imputer4.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 3.000000 1.000000 a
1 2017-01-03 3.000000 1.000000 a
2 2017-01-04 3.000000 1.000000 a
3 2017-01-01 4.000000 2.000000 b
4 2017-01-02 4.000000 2.000000 b
5 2017-01-01 3.000000 6.000000 c
6 2017-01-02 6.000000 7.000000 c
7 2017-01-03 6.000000 7.000000 c
8 2017-01-01 2.000000 10.000000 d
9 2017-01-02 6.000000 11.000000 d
10 2017-01-03 3.000000 15.000000 d
11 2017-01-04 5.000000 13.000000 d
12 2017-01-05 5.000000 14.000000 d
13 2017-01-06 5.333333 14.333333 d
14 2017-01-07 5.666667 14.666667 d
15 2017-01-08 6.000000 15.000000 d
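As a rough pandas-only illustration of imputing several columns at once, the sketch below applies the same linear interpolation to both columns for the last four days of store 'd'. The variable names are hypothetical and this is not the TimeSeriesImputer code path.

```python
import numpy as np
import pandas as pd

# Last four days of store 'd' from the sample data.
store_d_tail = pd.DataFrame(
    {"price": [5, np.nan, np.nan, 6], "sales": [14, np.nan, np.nan, 15]},
    index=pd.date_range("2017-01-05", periods=4, freq="D"),
)

# Interpolate each listed column independently.
columns = ["sales", "price"]
imputed = store_d_tail.copy()
imputed[columns] = store_d_tail[columns].interpolate(
    method="linear", limit_direction="both"
)
print(imputed)
```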
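You can also set option='interpolate' and use keyword arguments from pandas.Series.interpolate, such as 'method', 'limit', 'limit_direction' and 'order'. Note that if the chosen method does not work for some grains (individual time series), linear interpolation is used for those grains instead.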
>>> imputer5 = TimeSeriesImputer(input_column=['sales'],
... option='interpolate', method='barycentric')
>>> imputer5.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.000000 a
1 2017-01-03 NaN 1.000000 a
2 2017-01-04 3.0 1.000000 a
3 2017-01-01 NaN 2.000000 b
4 2017-01-02 4.0 2.000000 b
5 2017-01-01 3.0 6.000000 c
6 2017-01-02 6.0 7.000000 c
7 2017-01-03 NaN 8.000000 c
8 2017-01-01 2.0 10.000000 d
9 2017-01-02 6.0 11.000000 d
10 2017-01-03 3.0 15.000000 d
11 2017-01-04 5.0 13.000000 d
12 2017-01-05 5.0 14.000000 d
13 2017-01-06 NaN 26.904762 d
14 2017-01-07 NaN 42.428571 d
15 2017-01-08 6.0 15.000000 d
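The large values for store 'd' on 2017-01-06 and 2017-01-07 follow from how barycentric interpolation works: it fits a single polynomial through all known points, which can swing widely between them. The sketch below evaluates that polynomial in its equivalent Lagrange form using only the standard library. It assumes dates are expressed as day offsets from 2017-01-01; the function `lagrange_value` is hypothetical and is not what pandas or SciPy call internally.

```python
from fractions import Fraction

# Known 'sales' points for store 'd' as {day offset: value}.
# Offsets 5 and 6 (2017-01-06 and 2017-01-07) are the missing days.
known = {0: 10, 1: 11, 2: 15, 3: 13, 4: 14, 7: 15}

def lagrange_value(points, x):
    """Evaluate the unique polynomial through all `points` at position x."""
    total = Fraction(0)
    for xi, yi in points.items():
        # Lagrange basis weight for the point at xi.
        weight = Fraction(1)
        for xj in points:
            if xj != xi:
                weight *= Fraction(x - xj, xi - xj)
        total += yi * weight
    return total

jan_06 = lagrange_value(known, 5)
jan_07 = lagrange_value(known, 6)
print(float(jan_06), float(jan_07))
```

The results are 565/21 and 297/7, which round to the 26.904762 and 42.428571 shown in the output above.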
You can also set option='fillna' and use keyword arguments from pandas.Series.fillna, such as 'method', 'value' and 'limit'.
>>> imputer6 = TimeSeriesImputer(input_column=['sales'], option='fillna', method='ffill')
>>> imputer6.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.0 a
1 2017-01-03 NaN 1.0 a
2 2017-01-04 3.0 1.0 a
3 2017-01-01 NaN 2.0 b
4 2017-01-02 4.0 2.0 b
5 2017-01-01 3.0 6.0 c
6 2017-01-02 6.0 7.0 c
7 2017-01-03 NaN 7.0 c
8 2017-01-01 2.0 10.0 d
9 2017-01-02 6.0 11.0 d
10 2017-01-03 3.0 15.0 d
11 2017-01-04 5.0 13.0 d
12 2017-01-05 5.0 14.0 d
13 2017-01-06 NaN 14.0 d
14 2017-01-07 NaN 14.0 d
15 2017-01-08 6.0 15.0 d
>>> imputer7 = TimeSeriesImputer(input_column=['sales'], option='fillna',
... value=0)
>>> imputer7.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.0 a
1 2017-01-03 NaN 0.0 a
2 2017-01-04 3.0 0.0 a
3 2017-01-01 NaN 2.0 b
4 2017-01-02 4.0 0.0 b
5 2017-01-01 3.0 6.0 c
6 2017-01-02 6.0 7.0 c
7 2017-01-03 NaN 0.0 c
8 2017-01-01 2.0 10.0 d
9 2017-01-02 6.0 11.0 d
10 2017-01-03 3.0 15.0 d
11 2017-01-04 5.0 13.0 d
12 2017-01-05 5.0 14.0 d
13 2017-01-06 NaN 0.0 d
14 2017-01-07 NaN 0.0 d
15 2017-01-08 6.0 15.0 d
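The two fillna variants above correspond to two familiar pandas operations. The sketch below shows them on the 'sales' values of store 'd'. It is a plain pandas illustration, not the TimeSeriesImputer internals; it uses Series.ffill() rather than fillna(method='ffill') because recent pandas versions deprecate the 'method' argument of fillna.

```python
import numpy as np
import pandas as pd

# 'sales' for store 'd' from the sample data.
sales_d = pd.Series(
    [10, 11, 15, 13, 14, np.nan, np.nan, 15],
    index=pd.date_range("2017-01-01", periods=8, freq="D"),
    name="sales",
)

# Forward fill: carry the last observed value into the gap.
forward_filled = sales_d.ffill()

# Constant fill: replace every missing value with 0.
zero_filled = sales_d.fillna(0)

print(forward_filled.tolist())
print(zero_filled.tolist())
```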
Sometimes you may want to fill in values back to an earlier date or forward to a later date. Use the 'origin' and 'end' arguments for this.
>>> imputer8 = TimeSeriesImputer(input_column=['sales'], option='fillna',
... value=0, origin='2016-12-28')
>>> imputer8.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2016-12-28 NaN 0.0 a
1 2016-12-29 NaN 0.0 a
2 2016-12-30 NaN 0.0 a
3 2016-12-31 NaN 0.0 a
4 2017-01-01 NaN 0.0 a
5 2017-01-02 NaN 1.0 a
6 2017-01-03 NaN 0.0 a
7 2017-01-04 3.0 0.0 a
8 2016-12-28 NaN 0.0 b
9 2016-12-29 NaN 0.0 b
10 2016-12-30 NaN 0.0 b
11 2016-12-31 NaN 0.0 b
12 2017-01-01 NaN 2.0 b
13 2017-01-02 4.0 0.0 b
14 2016-12-28 NaN 0.0 c
15 2016-12-29 NaN 0.0 c
16 2016-12-30 NaN 0.0 c
17 2016-12-31 NaN 0.0 c
18 2017-01-01 3.0 6.0 c
19 2017-01-02 6.0 7.0 c
20 2017-01-03 NaN 0.0 c
21 2016-12-28 NaN 0.0 d
22 2016-12-29 NaN 0.0 d
23 2016-12-30 NaN 0.0 d
24 2016-12-31 NaN 0.0 d
25 2017-01-01 2.0 10.0 d
26 2017-01-02 6.0 11.0 d
27 2017-01-03 3.0 15.0 d
28 2017-01-04 5.0 13.0 d
29 2017-01-05 5.0 14.0 d
30 2017-01-06 NaN 0.0 d
31 2017-01-07 NaN 0.0 d
32 2017-01-08 6.0 15.0 d
>>> imputer9 = TimeSeriesImputer(input_column=['sales'], option='fillna',
... value=0, end='2017-01-10')
>>> imputer9.transform(df1).data
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
expect 1 distinct datetime frequency from all ['store'] in the data,
with 2 distinct datetime frequencies (['2D' 'D']) inferred.
date price sales store
0 2017-01-02 NaN 1.0 a
1 2017-01-03 NaN 0.0 a
2 2017-01-04 3.0 0.0 a
3 2017-01-05 NaN 0.0 a
4 2017-01-06 NaN 0.0 a
5 2017-01-07 NaN 0.0 a
6 2017-01-08 NaN 0.0 a
7 2017-01-09 NaN 0.0 a
8 2017-01-10 NaN 0.0 a
9 2017-01-01 NaN 2.0 b
10 2017-01-02 4.0 0.0 b
11 2017-01-03 NaN 0.0 b
12 2017-01-04 NaN 0.0 b
13 2017-01-05 NaN 0.0 b
14 2017-01-06 NaN 0.0 b
15 2017-01-07 NaN 0.0 b
16 2017-01-08 NaN 0.0 b
17 2017-01-09 NaN 0.0 b
18 2017-01-10 NaN 0.0 b
19 2017-01-01 3.0 6.0 c
20 2017-01-02 6.0 7.0 c
21 2017-01-03 NaN 0.0 c
22 2017-01-04 NaN 0.0 c
23 2017-01-05 NaN 0.0 c
24 2017-01-06 NaN 0.0 c
25 2017-01-07 NaN 0.0 c
26 2017-01-08 NaN 0.0 c
27 2017-01-09 NaN 0.0 c
28 2017-01-10 NaN 0.0 c
29 2017-01-01 2.0 10.0 d
30 2017-01-02 6.0 11.0 d
31 2017-01-03 3.0 15.0 d
32 2017-01-04 5.0 13.0 d
33 2017-01-05 5.0 14.0 d
34 2017-01-06 NaN 0.0 d
35 2017-01-07 NaN 0.0 d
36 2017-01-08 6.0 15.0 d
37 2017-01-09 NaN 0.0 d
38 2017-01-10 NaN 0.0 d
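The effect of 'origin' and 'end' can be pictured as widening the date range before filling. Below is a minimal pandas sketch for the 'sales' values of store 'b'. The helper `extend_and_fill` and its parameters are hypothetical stand-ins chosen for illustration, not part of the TimeSeriesImputer API.

```python
import numpy as np
import pandas as pd

# 'sales' for store 'b' from the sample data.
sales_b = pd.Series(
    [2, np.nan],
    index=pd.to_datetime(["2017-01-01", "2017-01-02"]),
    name="sales",
)

def extend_and_fill(series, origin=None, end=None, freq="D", value=0):
    """Hypothetical helper: widen the date range, then fill gaps with `value`."""
    start = pd.Timestamp(origin) if origin is not None else series.index.min()
    stop = pd.Timestamp(end) if end is not None else series.index.max()
    full_index = pd.date_range(start, stop, freq=freq)
    # Dates outside the original range become NaN and are then filled.
    return series.reindex(full_index).fillna(value)

with_origin = extend_and_fill(sales_b, origin="2016-12-28")
with_end = extend_and_fill(sales_b, end="2017-01-10")
print(with_origin.tolist())
print(with_end.tolist())
```

With origin='2016-12-28' store 'b' gains four leading rows, and with end='2017-01-10' it gains eight trailing rows, matching the row counts for store 'b' in the two outputs above.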
|