Accepted data formats

Data that you import into custom text classification must follow a specific format. If you don't have data to import, you can create your project and use Language Studio to label your documents.

Labels file format

Your labels file should be in the JSON format below, which enables you to import your labels into a project.

{
    "projectFileVersion": "2022-05-01",
    "stringIndexType": "Utf16CodeUnit",
    "metadata": {
      "projectKind": "CustomMultiLabelClassification",
      "storageInputContainerName": "{CONTAINER-NAME}",
      "projectName": "{PROJECT-NAME}",
      "multilingual": false,
      "description": "Project-description",
      "language": "en-us"
    },
    "assets": {
      "projectKind": "CustomMultiLabelClassification",
      "classes": [
        {
          "category": "Class1"
        },
        {
          "category": "Class2"
        }
      ],
      "documents": [
          {
              "location": "{DOCUMENT-NAME}",
              "language": "{LANGUAGE-CODE}",
              "dataset": "{DATASET}",
              "classes": [
                  {
                      "category": "Class1"
                  },
                  {
                      "category": "Class2"
                  }
              ]
          }
      ]
    }
}
| Key | Placeholder | Value | Example |
|---|---|---|---|
| multilingual | true | A boolean value. When set to true, your dataset can contain documents in multiple languages, and the deployed model can be queried in any supported language (not necessarily one included in your training documents). See language support to learn more about multilingual support. | true |
| projectName | {PROJECT-NAME} | Project name | myproject |
| storageInputContainerName | {CONTAINER-NAME} | Container name | mycontainer |
| classes | [] | Array containing all the classes you have in the project. These are the classes you want to classify your documents into. | [] |
| documents | [] | Array containing all the documents in your project and the classes labeled for each document. | [] |
| location | {DOCUMENT-NAME} | The location of the document in the storage container. Since all the documents are in the root of the container, this value should be the document name. | doc1.txt |
| dataset | {DATASET} | The dataset this document is assigned to when the data is split before training. See How to train a model for more information. Possible values for this field are Train and Test. | Train |
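
If you generate the labels file with a script rather than writing it by hand, the structure above can be assembled programmatically. The following is a minimal Python sketch, not an official tool: the helper name `build_labels_file` and its parameters are hypothetical and are not part of any Azure SDK. It assumes a multi-label project, a single language for all documents, and that every document sits in the root of the container. It uses only the keys and values shown in the format above.

```python
import json

# Values the "dataset" field accepts, per the table above.
VALID_DATASETS = {"Train", "Test"}


def build_labels_file(project_name, container_name, labeled_docs,
                      language="en-us", multilingual=False,
                      description="Project-description"):
    """Assemble a labels file as a Python dict (hypothetical helper).

    labeled_docs maps a document name to a (dataset, [class names]) pair.
    """
    project_kind = "CustomMultiLabelClassification"
    class_names = []  # project-level classes, in first-seen order
    documents = []

    for location, (dataset, categories) in labeled_docs.items():
        if dataset not in VALID_DATASETS:
            raise ValueError(f"{location}: dataset must be Train or Test, got {dataset!r}")
        for category in categories:
            if category not in class_names:
                class_names.append(category)
        documents.append({
            "location": location,
            "language": language,
            "dataset": dataset,
            "classes": [{"category": c} for c in categories],
        })

    return {
        "projectFileVersion": "2022-05-01",
        "stringIndexType": "Utf16CodeUnit",
        "metadata": {
            "projectKind": project_kind,
            "storageInputContainerName": container_name,
            "projectName": project_name,
            "multilingual": multilingual,
            "description": description,
            "language": language,
        },
        "assets": {
            "projectKind": project_kind,
            "classes": [{"category": c} for c in class_names],
            "documents": documents,
        },
    }


# Example usage with the sample values from the table above.
example = build_labels_file(
    "myproject",
    "mycontainer",
    {"doc1.txt": ("Train", ["Class1", "Class2"])},
)
print(json.dumps(example, indent=4))
```

Deriving the project-level `classes` array from the per-document labels keeps the two lists consistent, so a document never refers to a class that is missing from the project.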

Next steps