microsoftml.categorical_hash: Hashes and converts a text column into categories

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

Description

A categorical hash transform that can be performed on data before training a model.

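Details

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index into the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support factor data.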
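The mapping described above can be pictured with a minimal sketch. It is not the library's implementation: the hash function (MD5 from `hashlib`) and the helper names `hash_slot` and `indicator_bag` are assumptions chosen for illustration, and a small `hash_bits` keeps the bag short enough to print.

```python
import hashlib


def hash_slot(value, hash_bits=4, seed=314489979):
    """Map a string to a slot index in [0, 2**hash_bits).

    MD5 is only a stand-in hash for illustration (assumption).
    """
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % (1 << hash_bits)


def indicator_bag(values, hash_bits=4, seed=314489979):
    """Return one bag for a vector of categories; each slot counts its hits."""
    bag = [0] * (1 << hash_bits)
    for value in values:
        bag[hash_slot(value, hash_bits, seed)] += 1
    return bag


single = indicator_bag(["red"])                  # one category -> one hit
vector = indicator_bag(["red", "green", "red"])  # vector input -> a single bag
print(single)
print(vector)
```

A vector input still yields one bag, and the repeated category simply raises the count in its slot.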

Arguments

cols

A character string or list of variable names to transform. If a dict, the keys represent the names of the new variables to be created.

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
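As a rough guide to choosing hash_bits, the number of available slots doubles with each extra bit. The helper `slot_count` below is a hypothetical name used only for this arithmetic sketch; it is not part of microsoftml.

```python
def slot_count(hash_bits):
    """Number of distinct hash slots for a given hash_bits (illustrative helper)."""
    if not 1 <= hash_bits <= 30:
        raise ValueError("hash_bits must be between 1 and 30, inclusive")
    return 1 << hash_bits


for bits in (1, 8, 16, 30):
    print(bits, slot_count(bits))
```

With the default of 16 bits this gives 65536 slots. The example output on this page reports `num vars: 65537`, which would be consistent with 65536 hashed slots plus one bias term, although the page does not state that explicitly.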

seed

An integer specifying the hashing seed. The default value is 314489979.

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.
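One way to picture the effect of ordered is that the position becomes part of what gets hashed. The sketch below assumes a hypothetical `position|term` key format and uses MD5 as a stand-in hash; neither is confirmed by this page, and `hash_key` and `slot_for` are illustrative names.

```python
import hashlib


def hash_key(term, position, ordered=True):
    """Build the string that is hashed (hypothetical format, for illustration)."""
    return f"{position}|{term}" if ordered else term


def slot_for(key, hash_bits=16):
    """Stand-in hash: map the key to a slot in [0, 2**hash_bits)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % (1 << hash_bits)


terms = ["hate", "it", "hate"]
for ordered in (True, False):
    slots = [slot_for(hash_key(t, i, ordered)) for i, t in enumerate(terms)]
    print(ordered, slots)
```

With ordered=False the two occurrences of "hate" always land in the same slot. With ordered=True they are hashed from different keys, so they normally land in different slots.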

invert_hash

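An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.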
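The trade-off can be sketched as bookkeeping that remembers which original keys hashed into each slot. This is a conceptual illustration only: the MD5 stand-in hash, the helper names, the per-slot reading of the limit, and the `|` separator are all assumptions. The `f<slot>` fallback mirrors names such as `f1783` in the example output on this page.

```python
import hashlib


def hash_slot(value, hash_bits=4):
    """Stand-in hash: map a string to a slot in [0, 2**hash_bits)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % (1 << hash_bits)


def slot_names(values, invert_hash=0, hash_bits=4):
    """Return {slot: name}, keeping original keys only when invert_hash allows it.

    invert_hash == 0  -> keep no keys; names fall back to 'f<slot>'
    invert_hash == -1 -> keep every key
    invert_hash  > 0  -> keep at most that many keys per slot (assumed meaning)
    """
    kept = {}
    for value in values:
        bucket = kept.setdefault(hash_slot(value, hash_bits), [])
        has_room = invert_hash == -1 or len(bucket) < invert_hash
        if has_room and value not in bucket:
            bucket.append(value)
    return {
        slot: "|".join(bucket) if bucket else f"f{slot}"
        for slot, bucket in kept.items()
    }


print(slot_names(["red", "green", "blue"], invert_hash=0))
print(slot_names(["red", "green", "blue"], invert_hash=-1))
```

Keeping no keys avoids the extra bookkeeping, which is in line with the performance note above, at the cost of opaque slot names.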

output_kind

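A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector that is the binary representation of the category.

The default value is "Bag".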
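The four output kinds can be compared on a tiny, hand-picked input. The sketch below starts from slot indices that are assumed to have been produced by hashing already (with hash_bits=2, so 4 slots). The helper names are hypothetical, and encoding "Bin" from the zero-based slot index is an assumption, since this page does not say which index is encoded.

```python
def to_ind(slots, n_slots):
    """'Ind': one indicator (one-hot) vector per position of the input."""
    return [[1 if i == s else 0 for i in range(n_slots)] for s in slots]


def to_bag(slots, n_slots):
    """'Bag': a single vector counting how often each slot occurs."""
    bag = [0] * n_slots
    for s in slots:
        bag[s] += 1
    return bag


def to_key(slots):
    """'Key': integer IDs starting at 1."""
    return [s + 1 for s in slots]


def to_bin(slots, n_bits):
    """'Bin': binary digits of the zero-based slot index (assumption)."""
    return [[(s >> b) & 1 for b in reversed(range(n_bits))] for s in slots]


slots = [2, 0, 2]            # pretend these came from hashing three categories
print(to_ind(slots, 4))      # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
print(to_bag(slots, 4))      # [1, 0, 2, 0]
print(to_key(slots))         # [3, 1, 3]
print(to_bin(slots, 2))      # [[1, 0], [0, 0], [1, 0]]
```

For a single category the two vector forms coincide: `to_bag([2], 4)` and `to_ind([2], 4)[0]` are both `[0, 0, 1, 0]`.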

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

Example

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# The learned weights are similar to those obtained with the categorical transform.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761