Walkthrough Step 2: Upload existing data into an Azure Machine Learning Studio experiment

This is the second step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning:

  1. Create a Machine Learning workspace
  2. Upload existing data
  3. Create a new experiment
  4. Train and evaluate the models
  5. Deploy the Web service
  6. Access the Web service

To develop a predictive model for credit risk, we need data that we can use to train and then test the model. For this walkthrough, we'll use the "UCI Statlog (German Credit Data) Data Set" from the UC Irvine Machine Learning Repository. You can find it here:
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

We'll use the file named german.data. Download this file to your local hard drive.

The german.data dataset contains rows of 20 variables for 1000 past applicants for credit. These 20 variables represent the dataset's set of features (the feature vector), which provides identifying characteristics for each credit applicant. An additional column in each row represents the applicant's calculated credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.
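Before going further, it can be worth checking that the downloaded file matches this description. The following is a minimal sketch, not part of the original walkthrough. It assumes that german.data is whitespace-separated and that the last field on each row is the credit-risk column; the function name `summarize_rows` is hypothetical.

```python
from collections import Counter


def summarize_rows(lines):
    """Summarize whitespace-separated rows.

    Returns (row_count, distinct_field_counts, label_counts).
    Assumption: the last field on each row is the credit-risk label.
    """
    # Skip blank lines so a trailing newline does not count as a row.
    rows = [line.split() for line in lines if line.strip()]
    # A well-formed file has exactly one distinct field count.
    field_counts = {len(row) for row in rows}
    # Tally how many rows carry each label value.
    label_counts = Counter(row[-1] for row in rows)
    return len(rows), field_counts, label_counts
```

To run it against the real file, pass an open file handle, for example `summarize_rows(open("german.data"))`. Based on the description above, we would expect 1000 rows, 21 fields per row (20 features plus the risk column), and two label values split 700 to 300.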

The UCI website provides a description of the attributes of the feature vector for this data. This includes financial information, credit history, employment status, and personal information. For each applicant, a binary rating has been given indicating whether they are a low or high credit risk.

We'll use this data to train a predictive analytics model. When we're done, our model should be able to accept a feature vector for a new individual and predict whether he or she is a low or high credit risk.

Here's an interesting twist. The description of the dataset on the UCI website mentions what it costs if we misclassify a person's credit risk. If the model predicts a high credit risk for someone who is actually a low credit risk, the model has made a misclassification. But the reverse misclassification, predicting a low credit risk for someone who is actually a high credit risk, is five times more costly to the financial institution.
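To make the asymmetry concrete, the sketch below totals the cost of a set of predictions using the 5-to-1 weighting described above. It is an illustration only: the function name, parameter names, and the label strings "high" and "low" are hypothetical placeholders, not values taken from the dataset.

```python
def misclassification_cost(actual, predicted,
                           high_as_low_cost=5, low_as_high_cost=1):
    """Total cost of predictions under an asymmetric cost scheme.

    Labels are the placeholder strings "high" and "low" (an assumption
    made for this sketch; the real dataset may encode risk differently).
    """
    total = 0
    for a, p in zip(actual, predicted):
        if a == "high" and p == "low":
            # Costly error: a high-risk applicant predicted as low risk.
            total += high_as_low_cost
        elif a == "low" and p == "high":
            # Cheaper error: a low-risk applicant predicted as high risk.
            total += low_as_high_cost
    return total
```

With these defaults, one error of each kind costs 5 + 1 = 6, so a model that makes fewer of the first kind of error can come out ahead even if it makes more errors overall.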

So, we want to train our model so that the cost of this latter type of misclassification is five times higher than misclassifying the other way. One simple way to do this when training the model in our experiment is by duplicating (five times) those entries that represent someone with a high credit risk. Then, if the model misclassifies someone as a low credit risk when they're actually a high risk, the model does that same misclassification five times, once for each duplicate. This will increase the cost of this error in the training results.
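The duplication idea can be sketched outside of Studio as follows. This is not the walkthrough's own implementation. It assumes that "duplicating five times" means each high-risk row ends up appearing five times in total, and it takes a caller-supplied predicate `is_high_risk` because this step does not say how the risk column is encoded. Both the function name and the predicate are hypothetical.

```python
def replicate_high_risk(rows, is_high_risk, copies=5):
    """Return a new list in which each high-risk row appears `copies` times.

    Low-risk rows are kept once. `is_high_risk` is a caller-supplied
    function that decides whether a row is high risk (an assumption of
    this sketch).
    """
    weighted = []
    for row in rows:
        repeat = copies if is_high_risk(row) else 1
        weighted.extend([row] * repeat)
    return weighted
```

Under that assumption, applying this to all 1000 rows described earlier would give 700 + 5 × 300 = 2200 rows.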

Convert the dataset format

The original dataset uses a blank-separated format. Machine Learning Studio works better with a comma-separated value (CSV) file, so we'll convert the dataset by replacing spaces with commas.

There are many ways to convert this data. One way is by using the following Windows PowerShell command:

cat german.data | %{$_ -replace " ",","} | sc german.csv  

Another way is by using the Unix sed command:

sed 's/ /,/g' german.data > german.csv  

In either case, we have created a comma-separated version of the data in a file named german.csv that we can use in our experiment.
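If neither PowerShell nor sed is available, the same conversion can be sketched in Python using the standard csv module. This is an alternative of our own, not part of the original walkthrough, and the function name `spaces_to_csv` is hypothetical. Unlike the two commands above, which replace every single space with a comma, this version splits on any run of whitespace and skips blank lines.

```python
import csv


def spaces_to_csv(src, dst):
    """Copy whitespace-separated rows from `src` to `dst` as CSV.

    `src` and `dst` are open text file objects. Runs of whitespace are
    treated as one separator, and blank lines are skipped.
    """
    writer = csv.writer(dst, lineterminator="\n")
    for line in src:
        fields = line.split()
        if fields:
            writer.writerow(fields)
```

A typical call would open german.data for reading and german.csv for writing (with `newline=""`, as the csv module documentation recommends) and pass both handles to `spaces_to_csv`.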

Upload the dataset to Machine Learning Studio

Once the data has been converted to CSV format, we need to upload it into Machine Learning Studio.

  1. Open the Machine Learning Studio home page (https://studio.azureml.net).

  2. Click the menu icon in the upper-left corner of the window, click Azure Machine Learning, select Studio, and sign in.

  3. Click +NEW at the bottom of the window.

  4. Select DATASET.

  5. Select FROM LOCAL FILE.

    (Screenshot: Add a dataset from a local file)

  6. In the Upload a new dataset dialog, click Browse and find the german.csv file you created.

  7. Enter a name for the dataset. For this walkthrough, call it "UCI German Credit Card Data".

  8. For data type, select Generic CSV File With no header (.nh.csv).

  9. Add a description if you'd like.

  10. Click the OK check mark.

    (Screenshot: Upload the dataset)

This uploads the data into a dataset module that we can use in an experiment.

You can manage datasets that you've uploaded to Studio by clicking the DATASETS tab on the left side of the Studio window.

(Screenshot: Manage datasets)

For more information about importing other types of data into an experiment, see Import your training data into Azure Machine Learning Studio.

Next: Create a new experiment