question

DJamin-2804 avatar image
0 Votes"
DJamin-2804 asked ramr-msft commented

Input data sample training clarifications

Dear users and experts,
I am not clear about how is used the input data sample.
I have a csv file with 5 fields and 20k lines entries.
When I run a classification training, does the 20k entries used ? Or the algorithm split this data sample randomly?

My final goal is to ensure that I am training on the desired examples I am providing, in order to make my training better when I find some new examples on which the algortihm is doing wrong.

Best regards.

azure-machine-learningdotnet-ml-big-data
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@DJamin-2804 Thanks for the question. Are you trying with the Azure AutoML or AML Studio?

0 Votes 0 ·

Hi, I am using Visual Studio 2019, .NET Core console for C#

0 Votes 0 ·
ramr-msft avatar image
0 Votes"
ramr-msft answered ramr-msft edited

@DJamin-2804 Thanks for the question. The algorithm will not split the data. The idea is to split the whole dataset into training and test, where the test dataset is held back from training your model. Then in the training stage, the original training dataset is divided again into the (secondary) training dataset and validation dataset, where the validation dataset is also held back from training your model. The reason for the second split of training dataset is that the most models have some hyperparameters that need to be tuned, where the role of validation dataset is to be used for this purpose with a specific model. Thus, if my model does not have hyperparameters to be tuned, I do not need to have the training dataset split into the (secondary) training and validation datasets.

• Training Dataset: The sample of data used to fit the model.
• Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
• Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

DJamin-2804 avatar image
0 Votes"
DJamin-2804 answered ramr-msft commented

Dear expert,
thanks for the feedback.
But sorry I am not clear with your explanation of the splitting.
The whole 20k dataset is used for training and these same 20k entries are used again for evaluation?

I have another last question : is there a way to provide a training sample that will be used fully. And then provide a test sample (containing different data) to evaluate the accuracy of the training. This test sample will be used fully too.

I want to ensure that all the desired data will be used for learning.
Anyone knows if such possiblity exists?
Best regards.

· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@DJamin-2804 Thanks for the details, Can you please share your code/document that you are trying for the classification. The model will be overfitting to a single training set if cross validation not done.


0 Votes 0 ·

I cannot provide my dataset sorry.
I am using the default Classification of .Net Core Console (C#), so I have only the control on my dataset sample from what I understand.
The choice of the ML model (FastTreeOva,...) and the tunings are handled by the .NET Core ML algorithm.
I imagine the k-fold is part of the tunings you mentionned above.
If there is a documentation of .NetCore ML, I would be happy to read it.

0 Votes 0 ·

@DJamin-2804 Thanks for the details. Here is the link to the ML.NET.

https://dotnet.microsoft.com/learn/ml-dotnet/what-is-mldotnet

0 Votes 0 ·