Create a sentiment analysis model in Azure Machine Learning Studio (classic)

You can use Azure Machine Learning Studio (classic) to build and operationalize text analytics models. These models can help you solve problems such as document classification or sentiment analysis.

In a text analytics experiment, you would typically:

  1. Clean and preprocess the text dataset
  2. Extract numeric feature vectors from the pre-processed text
  3. Train a classification or regression model
  4. Score and validate the model
  5. Deploy the model to production

In this tutorial, you learn these steps as we walk through a sentiment analysis model built on the Amazon Book Reviews dataset (see the research paper "Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification" by John Blitzer, Mark Dredze, and Fernando Pereira; Association for Computational Linguistics (ACL), 2007). This dataset consists of review scores (1-2 or 4-5) and free-form text. The goal is to predict the review score: low (1-2) or high (4-5).

You can find the experiments covered in this tutorial in the Azure AI Gallery:

Predict Book Reviews

Predict Book Reviews - Predictive Experiment

Step 1: Clean and preprocess the text dataset

We begin the experiment by dividing the review scores into categorical low and high buckets to formulate the problem as two-class classification. We use the Edit Metadata and Group Categorical Values modules.

Create label
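Azure Machine Learning Studio modules are configured in the UI, so no code is needed for this step. Purely as an illustration of the same bucketing idea, here is a minimal pandas sketch; the column names score, text, and sentiment are assumptions made for this example, not names from the tutorial's dataset.

```python
import pandas as pd

# Illustrative rows; in the tutorial the data comes from the Amazon Book Reviews dataset.
reviews = pd.DataFrame({
    "score": [1, 2, 4, 5, 5],
    "text": [
        "Dull and repetitive.",
        "Not worth the price.",
        "A pleasant, easy read.",
        "Loved every chapter!",
        "Brilliant storytelling.",
    ],
})

# Bucket the numeric review score into a two-class label: low (1-2) vs. high (4-5).
reviews["sentiment"] = reviews["score"].map(lambda s: "low" if s <= 2 else "high")
print(reviews[["score", "sentiment"]])
```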

Then, we clean the text using the Preprocess Text module. The cleaning reduces the noise in the dataset, helps you find the most important features, and improves the accuracy of the final model. We remove stopwords - common words such as "the" or "a" - as well as numbers, special characters, duplicated characters, email addresses, and URLs. We also convert the text to lowercase, lemmatize the words, and detect sentence boundaries, which are then indicated by the "|||" symbol in the pre-processed text.

Preprocess Text
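The Preprocess Text module performs all of this for you. If you wanted to approximate a few of those steps (lowercasing and removing URLs, email addresses, special characters, and stopwords) outside Studio, a rough Python sketch could look like the following; the stopword list here is deliberately tiny and purely illustrative.

```python
import re

# A tiny illustrative stopword list; the Preprocess Text module ships with a much larger one.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "it"}

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)  # drop URLs and email addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # drop numbers and special characters
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(clean_text("The plot is brilliant! Details at http://example.com"))
# -> "plot brilliant details at"
```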

What if you want to use a custom list of stopwords? You can pass it in as an optional input. You can also use a custom C# syntax regular expression to replace substrings, and remove words by part of speech: nouns, verbs, or adjectives.

After the preprocessing is complete, we split the data into training and test sets.
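In Studio this is done with the Split Data module. A rough scikit-learn equivalent, continuing the small reviews DataFrame from the earlier sketch (the 75/25 ratio is an illustrative choice):

```python
from sklearn.model_selection import train_test_split

# Hold out a test set so the model can later be validated on data it has not seen.
train_text, test_text, train_label, test_label = train_test_split(
    reviews["text"], reviews["sentiment"], test_size=0.25, random_state=42
)
```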

Step 2: Extract numeric feature vectors from the pre-processed text

To build a model for text data, you typically have to convert free-form text into numeric feature vectors. In this example, we use the Extract N-Gram Features from Text module to transform the text data into that format. This module takes a column of whitespace-separated words and computes a dictionary of words, or N-grams of words, that appear in your dataset. Then, it counts how many times each word, or N-gram, appears in each record, and creates feature vectors from those counts. In this tutorial, we set the N-gram size to 2, so our feature vectors include single words and combinations of two subsequent words.

Extract N-grams
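As a rough analogue of what the module computes, scikit-learn's CountVectorizer builds the same kind of unigram-plus-bigram count features. This sketch continues the earlier snippets and is not the Studio module itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# N-gram size 2: count single words and pairs of consecutive words in each record.
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
train_counts = count_vectorizer.fit_transform(train_text)

print(train_counts.shape)                             # (records, size of the N-gram dictionary)
print(count_vectorizer.get_feature_names_out()[:10])  # a peek at the learned dictionary
```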

We apply TF*IDF (Term Frequency Inverse Document Frequency) weighting to the N-gram counts. This approach increases the weight of words that appear frequently in a single record but are rare across the entire dataset. Other options include binary, TF, and graph weighting.

Such text features often have high dimensionality. For example, if your corpus has 100,000 unique words, your feature space would have 100,000 dimensions, or more if N-grams are used. The Extract N-Gram Features module gives you a set of options to reduce the dimensionality. You can choose to exclude words that are short or long, or too uncommon or too frequent to have significant predictive value. In this tutorial, we exclude N-grams that appear in fewer than 5 records or in more than 80% of records.
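Pulling the last two paragraphs together, a scikit-learn sketch of TF*IDF-weighted N-gram features with the same frequency cut-offs might look like this. The min_df and max_df thresholds mirror the tutorial's settings, but they assume a realistically sized corpus (on the tiny toy data above they would prune everything away).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams, TF*IDF weighted, keeping only N-grams that occur in
# at least 5 records and in at most 80% of records.
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8)
train_features = tfidf_vectorizer.fit_transform(train_text)
```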

Also, you can use feature selection to select only those features that are most correlated with the prediction target. We use Chi-Squared feature selection to select 1000 features. You can view the vocabulary of the selected words or N-grams by clicking the right output of the Extract N-Gram Features module.
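A comparable step outside Studio is the chi-squared feature selection offered by scikit-learn; this sketch assumes the train_features matrix and train_label series from the previous snippets.

```python
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 1000 N-gram features most associated with the label by the chi-squared test.
selector = SelectKBest(chi2, k=1000)
train_selected = selector.fit_transform(train_features, train_label)
```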

As an alternative to Extract N-Gram Features, you can use the Feature Hashing module. Note though that Feature Hashing does not have built-in feature selection capabilities or TF*IDF weighting.
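For comparison, the hashing approach looks like this in scikit-learn; the number of hash features is an arbitrary illustrative choice.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hash N-grams directly into a fixed-size feature space: no stored vocabulary,
# and no built-in TF*IDF weighting or feature selection.
hashing_vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
hashed_features = hashing_vectorizer.transform(train_text)
```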

Step 3: Train a classification or regression model

Now the text has been transformed into numeric feature columns. The dataset still contains string columns from the previous stages, so we use Select Columns in Dataset to exclude them.

We then use Two-Class Logistic Regression to predict our target: a high or low review score. At this point, the text analytics problem has been transformed into a regular classification problem. You can use the tools available in Azure Machine Learning Studio (classic) to improve the model. For example, you can experiment with different classifiers to see how accurate their results are, or use hyperparameter tuning to improve the accuracy.

Train and score
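Outside Studio, this step reduces to fitting an ordinary two-class logistic regression on the selected features; a minimal scikit-learn sketch, continuing the earlier snippets:

```python
from sklearn.linear_model import LogisticRegression

# Two-class logistic regression on the chi-squared-selected TF*IDF features.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_selected, train_label)
```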

Step 4: Score and validate the model

How would you validate the trained model? We score it against the test dataset and evaluate the accuracy. However, the model learned the vocabulary of N-grams and their weights from the training dataset. Therefore, we should use that vocabulary and those weights when extracting features from the test data, as opposed to creating the vocabulary anew. So we add an Extract N-Gram Features module to the scoring branch of the experiment, connect the output vocabulary from the training branch, and set the vocabulary mode to read-only. We also disable the filtering of N-grams by frequency by setting the minimum to 1 instance and the maximum to 100%, and we turn off feature selection.

After the text column in the test data has been transformed into numeric feature columns, we exclude the string columns from the previous stages, as in the training branch. We then use the Score Model module to make predictions and the Evaluate Model module to evaluate the accuracy.
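The scikit-learn analogue of the read-only vocabulary mode is simply calling transform, never fit_transform, on the test data with the vectorizer and selector that were fitted on the training data; a sketch, continuing the earlier snippets:

```python
from sklearn.metrics import accuracy_score

# Reuse the training vocabulary and weights: transform only, never re-fit on test data.
test_features = tfidf_vectorizer.transform(test_text)
test_selected = selector.transform(test_features)

predictions = classifier.predict(test_selected)
print("Accuracy:", accuracy_score(test_label, predictions))
```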

Step 5: Deploy the model to production

The model is almost ready to be deployed to production. When deployed as a web service, it takes a free-form text string as input and returns a prediction of "high" or "low." It uses the learned N-gram vocabulary to transform the text into features, and the trained logistic regression model to make a prediction from those features.
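Conceptually, the deployed service runs something like the following function for each request. This is only a sketch of the idea in terms of the earlier scikit-learn snippets, not the Studio web service itself.

```python
def predict_sentiment(review_text: str) -> str:
    """Apply the saved preprocessing, N-gram vocabulary, and trained model to one text."""
    cleaned = clean_text(review_text)
    features = selector.transform(tfidf_vectorizer.transform([cleaned]))
    return classifier.predict(features)[0]  # "high" or "low"

print(predict_sentiment("A wonderful, gripping novel."))
```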

To set up the predictive experiment, we first save the N-gram vocabulary as a dataset, along with the trained logistic regression model from the training branch of the experiment. Then, we save the experiment using "Save As" to create an experiment graph for the predictive experiment. We remove the Split Data module and the training branch from the experiment. We then connect the previously saved N-gram vocabulary and model to the Extract N-Gram Features and Score Model modules, respectively. We also remove the Evaluate Model module.

We insert a Select Columns in Dataset module before the Preprocess Text module to remove the label column, and unselect the "Append score column to dataset" option in the Score Model module. That way, the web service does not request the label it is trying to predict, and does not echo the input features in the response.

Predictive experiment

Now we have an experiment that can be published as a web service and called using the request-response or batch execution APIs.
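Once published, the request-response API is an HTTP endpoint. The snippet below is a generic illustration only: the URL, API key, and exact JSON payload shape are placeholders, and you would replace them with the values and sample code shown on your web service's API help page.

```python
import requests

# Placeholder values - copy the real ones from the web service dashboard.
url = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute"
api_key = "<your-api-key>"

# Hypothetical payload with a single free-form text column named "text".
payload = {"Inputs": {"input1": {"ColumnNames": ["text"], "Values": [["A wonderful, gripping novel."]]}}}

response = requests.post(url, json=payload, headers={"Authorization": "Bearer " + api_key})
print(response.json())
```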

Next steps

Learn more about the text analytics modules in the MSDN documentation.