Data transformations

Data transformations are used to:

  • prepare data for model training
  • apply an imported model in TensorFlow or ONNX format
  • post-process data after it has been passed through a model

The transformations in this guide return classes that implement the IEstimator interface. Data transformations can be chained together. Each transformation expects and produces data of specific types and formats, which are specified in the linked reference documentation.
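
Chaining can be pictured with a small, library-independent sketch in Python. The class names (`Scale`, `Center`, `Chain`) and the `fit`/`transform` method names are hypothetical stand-ins chosen for this illustration; they are not the IEstimator API. The point it tries to show is that each stage is fitted on the output of the stage before it.

```python
class Scale:
    """Stateless stage: multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, values):
        return self  # nothing to learn from the data

    def transform(self, values):
        return [v * self.factor for v in values]


class Center:
    """Stateful stage: learns the mean during fit()."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class Chain:
    """Runs stages in order; each stage is fitted on the previous stage's output."""

    def __init__(self, *stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


chain = Chain(Scale(2.0), Center()).fit([1.0, 2.0, 3.0])
chained = chain.transform([1.0, 2.0, 3.0])  # [-2.0, 0.0, 2.0]
```

Here `Center` learns a mean of 4.0 because it sees the scaled values, not the raw ones.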

Some data transformations require training data to calculate their parameters. For example, the NormalizeMeanVariance transformer calculates the mean and variance of the training data during the Fit() operation, and uses those parameters in the Transform() operation.
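
As a rough illustration of that split between Fit() and Transform(), the Python sketch below learns a mean and a (population) variance from training values and reuses them on new data. The class `MeanVarianceNormalizer` is hypothetical, and dividing by the variance simply follows the wording of this guide; a real implementation may scale differently.

```python
class MeanVarianceNormalizer:
    """Hypothetical sketch: parameters are learned in fit() and reused in transform()."""

    def fit(self, training_values):
        n = len(training_values)
        self.mean = sum(training_values) / n
        # Population variance of the training data.
        self.variance = sum((v - self.mean) ** 2 for v in training_values) / n
        return self

    def transform(self, values):
        # Assumption: divide by the variance, as this guide describes it.
        return [(v - self.mean) / self.variance for v in values]


normalizer = MeanVarianceNormalizer().fit([0.0, 4.0])  # mean 2.0, variance 4.0
normalized = normalizer.transform([6.0])               # [1.0]
```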

Other data transformations don't require training data. For example, the ConvertToGrayscale transformation can perform the Transform() operation without having seen any training data during the Fit() operation.
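
A stateless transformation can be sketched as a plain function, since there is nothing to learn beforehand. The luminance weights below (0.299, 0.587, 0.114) are a common convention and an assumption of this sketch, not necessarily what ConvertToGrayscale uses.

```python
def to_grayscale(pixel):
    """Map an (R, G, B) pixel to one gray level using fixed weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# No fitting step is needed: the weights are constants.
gray = [to_grayscale(p) for p in [(0, 0, 0), (255, 255, 255), (255, 0, 0)]]
# gray == [0, 255, 76]
```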

Column mapping and grouping

Transform | Definition
Concatenate | Concatenate one or more input columns into a new output column
CopyColumns | Copy and rename one or more input columns
DropColumns | Drop one or more input columns
SelectColumns | Select one or more columns to keep from the input data
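
The four column operations can be pictured on a single row held as a Python dictionary. This is a conceptual sketch only: the function names are hypothetical, and `concatenate` handles scalar columns only.

```python
def concatenate(row, output, *inputs):
    """Add a vector column built from the named scalar columns."""
    return {**row, output: [row[name] for name in inputs]}


def copy_column(row, output, source):
    """Add a copy of an existing column under a new name."""
    return {**row, output: row[source]}


def drop_columns(row, *names):
    """Remove the named columns and keep everything else."""
    return {k: v for k, v in row.items() if k not in names}


def select_columns(row, *names):
    """Keep only the named columns."""
    return {k: row[k] for k in names}


row = {"Age": 30.0, "Height": 1.8, "Name": "Ada"}
row = concatenate(row, "Features", "Age", "Height")
row = copy_column(row, "Label", "Name")
row = drop_columns(row, "Name")
kept = select_columns(row, "Features", "Label")
# kept == {"Features": [30.0, 1.8], "Label": "Ada"}
```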

Normalization and scaling

Transform | Definition
NormalizeMeanVariance | Subtract the mean (of the training data) and divide by the variance (of the training data)
NormalizeLogMeanVariance | Normalize based on the logarithm of the training data
NormalizeLpNorm | Scale input vectors by their lp-norm, where p is 1, 2 or infinity. Defaults to the l2 (Euclidean distance) norm
NormalizeGlobalContrast | Scale each value in a row by subtracting the mean of the row data, dividing by either the standard deviation or the l2-norm of the row data, and multiplying by a configurable scale factor (default 2)
NormalizeBinning | Assign the input value to a bin index and divide by the number of bins to produce a float value between 0 and 1. The bin boundaries are calculated to evenly distribute the training data across bins
NormalizeSupervisedBinning | Assign the input value to a bin based on its correlation with the label column
NormalizeMinMax | Scale the input by the difference between the minimum and maximum values in the training data
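
Two of these normalizers are easy to sketch in Python. `lp_normalize` needs no training data, while the hypothetical `MinMaxScaler` learns its range in `fit`. The min-max formula used here (shift by the minimum, then divide by the range) is the textbook form and an assumption of this sketch; the library's exact behavior may differ.

```python
import math


def lp_normalize(vector, p=2):
    """Divide a vector by its lp-norm, for p = 1, 2 or math.inf."""
    if p == math.inf:
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in vector] if norm else list(vector)


class MinMaxScaler:
    """Hypothetical sketch: the range is learned from the training data."""

    def fit(self, training_values):
        self.low = min(training_values)
        self.high = max(training_values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # guard against a constant column
        return [(v - self.low) / span for v in values]


unit = lp_normalize([3.0, 4.0])  # approximately [0.6, 0.8]
scaled = MinMaxScaler().fit([10.0, 20.0, 30.0]).transform([15.0])  # [0.25]
```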

Conversions between data types

Transform | Definition
ConvertType | Convert the type of an input column to a new type
MapValue | Map values to keys (categories) based on the supplied dictionary of mappings
MapValueToKey | Map values to keys (categories) by creating the mapping from the input data
MapKeyToValue | Convert keys back to their original values
MapKeyToVector | Convert keys back to vectors of original values
MapKeyToBinaryVector | Convert keys back to a binary vector of original values
Hash | Hash the value in the input column
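
The relationship between values, keys, and vectors can be sketched as follows. The class and function names are hypothetical, and two details are assumptions of this sketch: keys start at 1 in order of first appearance, and 0 stands for a value that was not seen during fitting.

```python
class ValueToKey:
    """Hypothetical sketch: builds the value-to-key mapping from the input data."""

    def fit(self, values):
        self.keys = {}
        for v in values:
            # Assign the next key (1, 2, 3, ...) the first time a value appears.
            self.keys.setdefault(v, len(self.keys) + 1)
        self.values = {k: v for v, k in self.keys.items()}
        return self

    def transform(self, values):
        return [self.keys.get(v, 0) for v in values]  # 0 = unseen value


def key_to_value(mapper, keys):
    """Convert keys back to their original values (None for key 0)."""
    return [mapper.values.get(k) for k in keys]


def key_to_vector(key, key_count):
    """Indicator vector with a 1.0 in the slot that belongs to the key."""
    return [1.0 if i == key else 0.0 for i in range(1, key_count + 1)]


mapper = ValueToKey().fit(["red", "green", "red", "blue"])
keys = mapper.transform(["blue", "red", "pink"])  # [3, 1, 0]
originals = key_to_value(mapper, [3, 1])          # ["blue", "red"]
vector = key_to_vector(3, key_count=3)            # [0.0, 0.0, 1.0]
```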

Text transformations

Transform | Definition
FeaturizeText | Transform a text column into a float array of normalized ngram and char-gram counts
TokenizeIntoWords | Split one or more text columns into individual words
TokenizeIntoCharactersAsKeys | Split one or more text columns into individual characters, output as keys
NormalizeText | Change case, remove diacritical marks, punctuation marks, and numbers
ProduceNgrams | Transform a text column into a bag of counts of ngrams (sequences of consecutive words)
ProduceWordBags | Transform a text column into a vector of ngram counts (a bag of words)
ProduceHashedNgrams | Transform a text column into a vector of hashed ngram counts
ProduceHashedWordBags | Transform a text column into a bag of hashed ngram counts
RemoveDefaultStopWords | Remove default stop words for the specified language from input columns
RemoveStopWords | Remove specified stop words from input columns
LatentDirichletAllocation | Transform a document (represented as a vector of floats) into a vector of floats over a set of topics
ApplyWordEmbedding | Convert vectors of text tokens into sentence vectors using a pre-trained model
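
To make the tokenizing, stop-word, and ngram rows concrete, here is a minimal Python sketch. The function names mirror the table for readability but are hypothetical; whitespace splitting and exact-match stop words are simplifying assumptions.

```python
from collections import Counter


def tokenize_into_words(text):
    """Split text on whitespace (a simplification of real tokenization)."""
    return text.split()


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the given stop-word set."""
    return [t for t in tokens if t not in stop_words]


def produce_ngrams(tokens, n=2):
    """Count each run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = tokenize_into_words("the cat saw the cat")
bigrams = produce_ngrams(tokens, n=2)
content = remove_stop_words(tokens, {"the"})
# bigrams[("the", "cat")] == 2, content == ["cat", "saw", "cat"]
```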

Image transformations

Transform | Definition
ConvertToGrayscale | Convert an image to grayscale
ConvertToImage | Convert a vector of pixels to ImageDataViewType
ExtractPixels | Convert pixels from an input image into a vector of numbers
LoadImages | Load images from a folder into memory
ResizeImages | Resize images
DnnFeaturizeImage | Apply a pre-trained deep neural network (DNN) model to transform an input image into a feature vector

Categorical data transformations

Transform | Definition
OneHotEncoding | Convert one or more text columns into one-hot encoded vectors
OneHotHashEncoding | Convert one or more text columns into hash-based one-hot encoded vectors
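
Hash-based one-hot encoding avoids building a vocabulary: the slot is derived from a hash of the value. The sketch below uses CRC32 from Python's standard library purely for illustration; the hash function and the number of slots are assumptions, and distinct values can collide in the same slot.

```python
import zlib


def one_hot_hash(value, num_slots=8):
    """Set the single slot chosen by hashing the value; no fitting is needed."""
    slot = zlib.crc32(value.encode("utf-8")) % num_slots
    vector = [0.0] * num_slots
    vector[slot] = 1.0
    return vector


encoded = one_hot_hash("green")
```

The same input always lands in the same slot, so the encoding is consistent between training and prediction without storing a dictionary.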

Time series data transformations

Transform | Definition
DetectAnomalyBySrCnn | Detect anomalies in the input time series data using the Spectral Residual (SR) algorithm
DetectChangePointBySsa | Detect change points in time series data using singular spectrum analysis (SSA)
DetectIidChangePoint | Detect change points in independent and identically distributed (IID) time series data using adaptive kernel density estimation and martingale scores
ForecastBySsa | Forecast time series data using singular spectrum analysis (SSA)
DetectSpikeBySsa | Detect spikes in time series data using singular spectrum analysis (SSA)
DetectIidSpike | Detect spikes in independent and identically distributed (IID) time series data using adaptive kernel density estimation and martingale scores

Missing values

Transform | Definition
IndicateMissingValues | Create a new boolean output column, the value of which is true when the value in the input column is missing
ReplaceMissingValues | Create a new output column, the value of which is set to a default value if the value is missing from the input column, and the input value otherwise
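
Both operations are simple to sketch when missing floats are represented as NaN, which is an assumption of this example. The function names and the default replacement of 0.0 are likewise hypothetical.

```python
import math


def indicate_missing(values):
    """True where the input value is missing (NaN), False otherwise."""
    return [math.isnan(v) for v in values]


def replace_missing(values, default=0.0):
    """Substitute a default for missing values and pass the rest through."""
    return [default if math.isnan(v) else v for v in values]


column = [1.5, float("nan"), 3.0]
flags = indicate_missing(column)   # [False, True, False]
filled = replace_missing(column)   # [1.5, 0.0, 3.0]
```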

Feature selection

Transform | Definition
SelectFeaturesBasedOnCount | Select features whose count of non-default values is greater than or equal to a threshold
SelectFeaturesBasedOnMutualInformation | Select the features on which the data in the label column is most dependent
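
Count-based selection can be sketched as counting, per feature slot, how many rows hold a non-default value. Treating 0.0 as the default value and keeping slots whose count is at least the threshold are assumptions of this sketch, and the function name is hypothetical.

```python
def select_features_by_count(rows, threshold):
    """Return the kept slot indices and the rows reduced to those slots."""
    slot_count = len(rows[0])
    counts = [sum(1 for row in rows if row[i] != 0.0) for i in range(slot_count)]
    kept = [i for i, count in enumerate(counts) if count >= threshold]
    return kept, [[row[i] for i in kept] for row in rows]


rows = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
    [4.0, 5.0, 0.0],
]
kept, reduced = select_features_by_count(rows, threshold=2)
# Only slot 0 has at least two non-zero values.
```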

Feature transformations

Transform | Definition
ApproximatedKernelMap | Map each input vector onto a lower dimensional feature space, where inner products approximate a kernel function, so that the features can be used as inputs to linear algorithms
ProjectToPrincipalComponents | Reduce the dimensions of the input feature vector by applying the Principal Component Analysis algorithm

Explainability transformations

Transform | Definition
CalculateFeatureContribution | Calculate contribution scores for each element of a feature vector

Calibration transformations

Transform | Definition
Platt(String, String, String) | Transform a binary classifier raw score into a class probability using logistic regression with parameters estimated using the training data
Platt(Double, Double, String) | Transform a binary classifier raw score into a class probability using logistic regression with fixed parameters
Naive | Transform a binary classifier raw score into a class probability by assigning scores to bins, and calculating the probability based on the distribution among the bins
Isotonic | Transform a binary classifier raw score into a class probability by assigning scores to bins, where the position of boundaries and the size of bins are estimated using the training data
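
Platt calibration with fixed parameters reduces to a logistic function of the raw score. The sketch below uses one common parameterization, `1 / (1 + exp(slope * score + offset))`; the sign convention and the parameter names are assumptions of this example, so check the reference documentation before relying on them.

```python
import math


def platt(score, slope, offset):
    """Map a raw classifier score to a probability with fixed logistic parameters."""
    return 1.0 / (1.0 + math.exp(slope * score + offset))


# With a negative slope, larger scores give larger probabilities.
neutral = platt(0.0, slope=-1.0, offset=0.0)    # 0.5
confident = platt(2.0, slope=-1.0, offset=0.0)  # above 0.5
```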

Deep learning transformations

Transform | Definition
ApplyOnnxModel | Transform the input data with an imported ONNX model
LoadTensorFlowModel | Transform the input data with an imported TensorFlow model

Custom transformations

Transform | Definition
CustomMapping | Transform existing columns into new ones with a user-defined mapping
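
The idea of a user-defined mapping can be sketched as a function that receives a row and returns the new columns to add. Both `custom_mapping` and `is_adult` are hypothetical names for this illustration, and rows are modeled as dictionaries.

```python
def custom_mapping(rows, mapping):
    """Apply a user-defined function to each row and merge in its output columns."""
    return [{**row, **mapping(row)} for row in rows]


def is_adult(row):
    """Hypothetical user-defined mapping: derive a boolean column from Age."""
    return {"IsAdult": row["Age"] >= 18}


people = [{"Age": 12}, {"Age": 40}]
mapped = custom_mapping(people, is_adult)
# mapped == [{"Age": 12, "IsAdult": False}, {"Age": 40, "IsAdult": True}]
```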