Load and classify your own datasets
In this short tutorial, you will learn how to load your own dataset to train a classifier model with the Quantum Development Kit (QDK).
We highly recommend storing your data in a standardized serialization format such as JSON. Such formats are well supported across frameworks, including Python and the .NET ecosystem. In particular, we recommend using our template for loading the data, so that you can copy and paste the code directly from the samples.
Template for loading your datasets
Suppose we have a training dataset $(x, y)$ of size $N=2$ where each instance $x_i$ of $x$ has three features: $x_{i1}$, $x_{i2}$ and $x_{i3}$.
The validation dataset has the same structure.
These datasets can be represented by a data.json file similar to the following:
```json
{
    "TrainingData": {
        "Features": [
            [
                x_11,
                x_12,
                x_13
            ],
            [
                x_21,
                x_22,
                x_23
            ]
        ],
        "Labels": [
            y_1,
            y_2
        ]
    },
    "ValidationData": {
        "Features": [
            [
                xv_11,
                xv_12,
                xv_13
            ],
            [
                xv_21,
                xv_22,
                xv_23
            ]
        ],
        "Labels": [
            yv_1,
            yv_2
        ]
    }
}
```
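If you prefer to produce the file programmatically, the template maps directly onto nested Python lists. The following is a minimal sketch, assuming placeholder feature values standing in for the $x_{ij}$, $y_i$, $xv_{ij}$, and $yv_i$ entries above; only the standard-library json module is needed.

```python
import json

# Placeholder values standing in for the template entries above.
data = {
    "TrainingData": {
        "Features": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        "Labels": [0, 1],
    },
    "ValidationData": {
        "Features": [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
        "Labels": [1, 0],
    },
}

# Serialize the template to data.json.
with open("data.json", "w") as f:
    json.dump(data, f, indent=4)
```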
Example using the template
Suppose we have a small dataset with the heights and weights of different cats and dogs. This dataset is too small to train a useful model, but it is enough to demonstrate the process of loading a dataset.
| Height (m) | Weight (kg) | Animal |
|---|---|---|
| 0.54 | 30 | Dog |
| 0.30 | 8 | Cat |
| 0.91 | 44 | Dog |
| 0.86 | 31 | Dog |
| 0.32 | 5 | Cat |
| 0.25 | 4 | Cat |
The process is:
- First, we need to separate the dataset into training and validation data. In this case we can take the first three samples for training and the remaining samples for validation. In general, it is good practice to sample the training and validation datasets randomly to avoid unwanted biases in the training data.
- Secondly, we need to assign a numeric label to each class. Note that, for the moment, the QML library only admits binary classification problems, so we will assign the label 0 to the class `Dog` and the label 1 to the class `Cat`.
- Finally, we fill the template using the data from our dataset. Note that for big datasets you should write a small script to generate the template automatically; this script will depend on the original format of your dataset. A minimal sketch of such a script is shown after this list.
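As a sketch of what such a script might look like, the following assumes the raw data is already in memory as Python tuples; the sample list, the 50/50 split ratio, and the label mapping are illustrative choices, not part of the QDK. It also uses the randomized split recommended in the first step, whereas the example file below keeps the simple first-three split for readability.

```python
import json
import random

# Raw samples as (height_m, weight_kg, animal) tuples, taken from the table above.
samples = [
    (0.54, 30, "Dog"), (0.30, 8, "Cat"), (0.91, 44, "Dog"),
    (0.86, 31, "Dog"), (0.32, 5, "Cat"), (0.25, 4, "Cat"),
]

# Binary labels: the QML library currently admits only binary classification.
label_of = {"Dog": 0, "Cat": 1}

# Shuffle before splitting to avoid unwanted biases in the training data.
random.shuffle(samples)
split = len(samples) // 2
training, validation = samples[:split], samples[split:]

data = {
    "TrainingData": {
        "Features": [[h, w] for h, w, _ in training],
        "Labels": [label_of[animal] for _, _, animal in training],
    },
    "ValidationData": {
        "Features": [[h, w] for h, w, _ in validation],
        "Labels": [label_of[animal] for _, _, animal in validation],
    },
}

# Write the filled-in template to data.json.
with open("data.json", "w") as f:
    json.dump(data, f, indent=4)
```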
For our dataset the data.json file is:
```json
{
    "TrainingData": {
        "Features": [
            [
                0.54,
                30
            ],
            [
                0.30,
                8
            ],
            [
                0.91,
                44
            ]
        ],
        "Labels": [
            0,
            1,
            0
        ]
    },
    "ValidationData": {
        "Features": [
            [
                0.86,
                31
            ],
            [
                0.32,
                5
            ],
            [
                0.25,
                4
            ]
        ],
        "Labels": [
            0,
            1,
            1
        ]
    }
}
```
Loading the data
Once you have your data serialized as a JSON file, you can load it using the JSON libraries provided with your host language of choice.
Python provides the built-in json package for working with JSON-serialized data:
```python
import json

# Parse the serialized dataset from data.json.
with open("data.json") as f:
    data = json.load(f)

# Example starting values for the classifier's trainable parameters.
parameter_starting_points = [
    [0.060057, 3.00522, 2.03083, 0.63527, 1.03771, 1.27881, 4.10186, 5.34396],
]
```
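From the parsed dictionary you can then pull out the individual arrays defined by the template; the variable names below are illustrative:

```python
# Unpack the arrays defined by the template.
training_features = data["TrainingData"]["Features"]
training_labels = data["TrainingData"]["Labels"]
validation_features = data["ValidationData"]["Features"]
validation_labels = data["ValidationData"]["Labels"]

# Sanity check: one label per feature vector.
assert len(training_features) == len(training_labels)
assert len(validation_features) == len(validation_labels)
```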
Next steps
Now you are ready to start running experiments with your own datasets. Try different classifiers and datasets, and contribute to the community by sharing your results!