microsoftml.categorical_hash: Hashes and converts a text column into categories

Article
05/23/2023

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

Description

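Categorical hash transform that can be performed on data before training a model.

Details

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support handling factor data.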
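The idea described above (hash a value, then use the hash as an index into a bag) can be sketched in plain Python. This is only an illustration, not the microsoftml implementation: the helper names slot_for and hashed_bag are hypothetical, and zlib.crc32 is used as a stand-in hash function.

```python
import zlib


def slot_for(value, hash_bits=16, seed=314489979):
    """Map a category value to a slot index in [0, 2**hash_bits)."""
    # Assumption: zlib.crc32 stands in for the real hash function.
    h = zlib.crc32(value.encode("utf-8"), seed)
    # Keep only the lowest hash_bits bits of the hash.
    return h & ((1 << hash_bits) - 1)


def hashed_bag(values, hash_bits=16, seed=314489979):
    """Return a sparse bag {slot: count} for a list of category values."""
    bag = {}
    for v in values:
        s = slot_for(v, hash_bits, seed)
        bag[s] = bag.get(s, 0) + 1
    return bag


# A single value yields an indicator: exactly one slot is set to 1.
single = hashed_bag(["I hate it"])

# A vector of values yields one bag; repeated values add up in the same slot.
vector = hashed_bag(["red", "green", "red"])
```

Because the slot is computed directly from the value, no dictionary of categories has to be built first. The trade-off is that two different values can land in the same slot.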

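Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of the new variables to be created.

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.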
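As a rough guide to choosing hash_bits, the sketch below computes the number of slots that a given bit count addresses, and the expected number of colliding category pairs if the hash were perfectly uniform. Both helper names are hypothetical and the uniformity assumption is an idealization.

```python
def slot_count(hash_bits):
    """Number of distinct slots addressed by hash_bits bits."""
    if not 1 <= hash_bits <= 30:
        raise ValueError("hash_bits must be between 1 and 30, inclusive")
    return 1 << hash_bits


def expected_colliding_pairs(n_categories, hash_bits):
    """Expected number of category pairs sharing a slot under a uniform hash."""
    pairs = n_categories * (n_categories - 1) / 2
    return pairs / slot_count(hash_bits)


default_slots = slot_count(16)                      # 65536
collisions = expected_colliding_pairs(1000, 16)     # about 7.6 pairs
```

The default of 16 bits gives 65536 slots, which is consistent with the "num vars: 65537" line in the example output at the end of this article if one additional variable is counted for the bias term.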

seed

An integer specifying the hashing seed. The default value is 314489979.

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.
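One way to picture the effect of ordered is shown below. The sketch assumes that the position is mixed in by prefixing it to the hashed key, and it uses zlib.crc32 as a stand-in hash. Both are assumptions made for illustration, and term_slot is a hypothetical helper, not part of microsoftml.

```python
import zlib


def term_slot(term, position, ordered=True, hash_bits=16, seed=314489979):
    """Slot index for a term, optionally taking its position into account."""
    # Assumption: the position is mixed in by prefixing it to the key.
    key = f"{position}:{term}" if ordered else term
    return zlib.crc32(key.encode("utf-8"), seed) & ((1 << hash_bits) - 1)


terms = ["like", "it", "like"]

# ordered=True: the same term at positions 0 and 2 is hashed with different keys.
ordered_slots = [term_slot(t, i, ordered=True) for i, t in enumerate(terms)]

# ordered=False: the same term always maps to the same slot.
unordered_slots = [term_slot(t, i, ordered=False) for i, t in enumerate(terms)]
```

With ordered=False, the first and third entries of unordered_slots are always equal. With ordered=True, they are hashed from different keys and therefore usually differ.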

invert_hash

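An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.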
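Hashing is one-way, so readable slot names require remembering which original values fell into each slot. The following sketch mimics the 0, -1, and positive settings described above. The helper slot_names, the "|" separator, and the use of zlib.crc32 are illustrative assumptions rather than the library's actual behavior.

```python
import zlib
from collections import defaultdict


def slot_names(values, hash_bits=16, seed=314489979, invert_hash=0):
    """Map each used slot to a name built from the values hashed into it."""
    if invert_hash == 0:
        # No invert hashing: nothing is remembered, so no names are available.
        return {}
    mask = (1 << hash_bits) - 1
    keys = defaultdict(list)
    for v in values:
        slot = zlib.crc32(v.encode("utf-8"), seed) & mask
        if v in keys[slot]:
            continue
        # -1 means no limit; otherwise keep at most invert_hash keys per slot.
        if invert_hash == -1 or len(keys[slot]) < invert_hash:
            keys[slot].append(v)
    return {slot: "|".join(ks) for slot, ks in keys.items()}


reviews = ["I hate it", "Love it", "I hate it"]
no_names = slot_names(reviews, invert_hash=0)
all_names = slot_names(reviews, invert_hash=-1)
```

The bookkeeping costs memory and time that grow with the number of distinct values, which is in line with the note that a value of 0 gives better performance.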

output_kind

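A character string that specifies the kind of output.

"Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.
"Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.
"Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.
"Bin": Outputs a vector which is the binary representation of the category.

The default value is "Bag".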
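The four output kinds can be compared on a tiny input. The sketch below starts from 1-based category IDs (as in the "Key" description) for a dictionary of four categories. The function encode is hypothetical, and the bit order and width used for "Bin" are assumptions.

```python
def encode(keys, n_categories, output_kind="Bag"):
    """Encode a vector of 1-based category IDs in the requested output kind."""
    categories = range(1, n_categories + 1)
    if output_kind == "Key":
        # One integer ID per input slot.
        return list(keys)
    if output_kind == "Ind":
        # One indicator vector per input slot.
        return [[1 if k == c else 0 for c in categories] for k in keys]
    if output_kind == "Bag":
        # A single vector holding the count of each category.
        return [sum(1 for k in keys if k == c) for c in categories]
    if output_kind == "Bin":
        # Assumption: most significant bit first, just wide enough for all IDs.
        width = max(1, n_categories.bit_length())
        return [[(k >> b) & 1 for b in reversed(range(width))] for k in keys]
    raise ValueError("output_kind must be 'Bag', 'Ind', 'Key', or 'Bin'")


keys = [2, 4, 2]
bag = encode(keys, 4, "Bag")    # [0, 2, 0, 1]
ind = encode(keys, 4, "Ind")    # [[0, 1, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
key = encode(keys, 4, "Key")    # [2, 4, 2]
bin_ = encode(keys, 4, "Bin")   # [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

For an input with a single category, such as encode([3], 4, "Bag"), the bag equals the one indicator vector that "Ind" produces, matching the equivalence noted for "Bag".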

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

Example

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# Weights are similar to categorical.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761