microsoftml.categorical: Convert text columns into categories

Usage

microsoftml.categorical(cols: [str, dict, list], output_kind: ['Bag', 'Ind',
    'Key', 'Bin'] = 'Ind', max_num_terms: int = 1000000,
    terms: int = None, sort: ['Occurrence', 'Value'] = 'Occurrence',
    text_key_values: bool = False, **kargs)

Description

Categorical transform that can be performed on data before training a model.

Details

The categorical transform passes through a data set, operating on text columns, to build a dictionary of categories. For each row, the entire text string appearing in the input column is defined as a category. The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary. The categorical transform can be applied to one or more columns, in which case it builds a separate dictionary for each column it is applied to.

categorical does not currently support handling factor data.
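The dictionary-and-indicator mechanics described above can be sketched in plain Python. This is a minimal illustration of the concept only, not the microsoftml implementation; the helper names are hypothetical:

```python
# Sketch: how a categorical transform builds its dictionary and emits
# indicator vectors. Illustrative only; not microsoftml code.
def build_dictionary(column):
    """Collect the distinct text values, in order of first appearance."""
    seen = []
    for value in column:
        if value not in seen:
            seen.append(value)
    return seen

def indicator_vector(value, dictionary):
    """One slot per dictionary category; 1 in the slot matching the value."""
    return [1 if cat == value else 0 for cat in dictionary]

column = ["red", "blue", "red", "green"]
dictionary = build_dictionary(column)                      # ['red', 'blue', 'green']
vectors = [indicator_vector(v, dictionary) for v in column]
# Each vector has one slot per category, so its length equals the
# dictionary size: vectors[0] == [1, 0, 0], vectors[3] == [0, 0, 1].
```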

Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of the new variables to be created.

output_kind

A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector which is the binary representation of the category.

The default value is "Ind".
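The four output kinds can be illustrated on a small dictionary with a pure-Python sketch (the real transform runs inside the ML pipeline; the bit ordering in the "Bin" helper is an assumption for illustration):

```python
# Illustrative encodings for a dictionary of 4 categories; not microsoftml code.
dictionary = ["a", "b", "c", "d"]

def key_encoding(value):
    """'Key': integer ID between 1 and the dictionary size."""
    return dictionary.index(value) + 1

def ind_encoding(value):
    """'Ind': indicator vector with one slot per category."""
    return [1 if cat == value else 0 for cat in dictionary]

def bag_encoding(values):
    """'Bag': multi-set vector; each slot counts occurrences in the input."""
    return [sum(1 for v in values if v == cat) for cat in dictionary]

def bin_encoding(value):
    """'Bin': binary representation of the key ID.

    Little-endian bit order is assumed here for illustration.
    """
    key = key_encoding(value)
    width = len(dictionary).bit_length()
    return [(key >> i) & 1 for i in range(width)]

print(key_encoding("c"))              # 3
print(ind_encoding("c"))              # [0, 0, 1, 0]
print(bag_encoding(["a", "c", "a"]))  # [2, 0, 1, 0]
```

Note that for a single-category input, `bag_encoding(["c"])` equals `ind_encoding("c")`, matching the equivalence described for "Bag" above.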

max_num_terms

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 1000000.

terms

An optional character vector of terms or categories.

sort

A character string that specifies the sorting criteria.

  • "Occurrence": Sort categories by occurrence. The most frequent is first.

  • "Value": Sort categories by value.
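The two sort orders can be sketched with a hypothetical helper (illustration only; microsoftml performs this ordering internally when building the dictionary):

```python
from collections import Counter

def sort_categories(column, sort="Occurrence"):
    """Order the dictionary by frequency (most frequent first) or by value."""
    counts = Counter(column)
    if sort == "Occurrence":
        # most_common() returns categories from most to least frequent.
        return [cat for cat, _ in counts.most_common()]
    return sorted(counts)  # "Value": lexicographic order

column = ["pear", "apple", "pear", "kiwi", "pear", "apple"]
print(sort_categories(column, "Occurrence"))  # ['pear', 'apple', 'kiwi']
print(sort_categories(column, "Value"))       # ['apple', 'kiwi', 'pear']
```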

text_key_values

Whether key value metadata should be text, regardless of the actual input type.

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical_hash

Example

'''
Example on rx_logistic_regression and categorical.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical, rx_predict

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))

# Use a categorical transform: the entire string is treated as a category
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical(cols=dict(reviewCat="review"))])
                
# Note that 'I hate it' and 'I love it' (the only strings appearing more than once)
# have non-zero weights.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 20
improvement criterion: Mean Improvement
L1 regularization selected 3 of 20 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:01.6550695
Elapsed time: 00:00:00.2259981
OrderedDict([('(Bias)', 0.21317288279533386), ('I hate it', -0.7937591671943665), ('I love it', 0.19668534398078918)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.1385248
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213173     0.553092
1       I hate it          False -0.580586     0.358798
2         Love it           True  0.213173     0.553092
3  Really like it           True  0.213173     0.553092
4       I hate it          False -0.580586     0.358798