Tokenizer sample skill for AI search

Code Sample
11/15/2023

This custom skill extracts normalized non-stop words from a text using the ML.NET library.

To choose the language used for stop word removal, optionally set the languageCode parameter to an ISO 639-1 code. Supported languages are:

Arabic (ar)
Czech (cs)
Danish (da)
Dutch (nl)
English (en), the default if no language is specified
French (fr)
German (de)
Italian (it)
Japanese (ja)
Norwegian Bokmål (nb)
Polish (pl)
Portuguese (pt)
Spanish (es)
Swedish (sv)
Russian (ru)
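
As an illustration of the languageCode rules above, here is a minimal Python sketch of how a caller might validate a code before sending it to the skill. The helper name resolve_language_code is hypothetical and is not part of the sample. How the skill itself responds to an unsupported code is not stated here, so raising an error is an assumption.

```python
# ISO 639-1 codes copied from the supported-language list above.
SUPPORTED_LANGUAGE_CODES = frozenset({
    "ar", "cs", "da", "nl", "en", "fr", "de", "it",
    "ja", "nb", "pl", "pt", "es", "sv", "ru",
})
DEFAULT_LANGUAGE_CODE = "en"  # English is used when no language is specified


def resolve_language_code(language_code=None):
    """Hypothetical helper: return a supported code, defaulting to English."""
    if language_code is None or not language_code.strip():
        return DEFAULT_LANGUAGE_CODE
    code = language_code.strip().lower()
    if code not in SUPPORTED_LANGUAGE_CODES:
        # Assumption: reject unknown codes on the caller side.
        raise ValueError(f"Unsupported languageCode: {language_code!r}")
    return code
```

For example, resolve_language_code() returns "en" and resolve_language_code(" FR ") returns "fr".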

Requirements

This skill has no requirements beyond those described in the root README.md file.

Deployment

tokenizer

Sample Input:

{
    "values": [
        {
            "recordId": "record1",
            "data": {
                "text": "ML.NET's RemoveDefaultStopWords API removes stop words from tHe text/string. It requires the text/string to be tokenized beforehand.",
                "languageCode": "en"
            }
        }
    ]
}

Sample Output:

{
    "values": [
        {
            "recordId": "record1",
            "data": {
                "words": [
                    "mlnets",
                    "removedefaultstopwords",
                    "api",
                    "removes",
                    "stop",
                    "words",
                    "textstring",
                    "requires",
                    "textstring",
                    "tokenized"
                ]
            },
            "errors": [],
            "warnings": []
        }
    ]
}
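
To make the mapping from the sample input to the sample output easier to follow, the sketch below approximates the processing in plain Python: lowercase the text, strip punctuation, split on whitespace, and drop stop words. It is not the skill's implementation, which uses ML.NET. The function names and the six-word stop list are assumptions chosen for this one sentence, and the sketch ignores languageCode.

```python
import re

# Illustrative subset only (an assumption for this example);
# the real English stop word list in ML.NET is much larger.
ILLUSTRATIVE_STOP_WORDS = frozenset(
    {"from", "the", "it", "to", "be", "beforehand"}
)


def extract_words(text, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stop words."""
    normalized = re.sub(r"[^\w\s]", "", text.lower())
    return [word for word in normalized.split() if word not in stop_words]


def run_skill(request):
    """Hypothetical handler: map each input record to an output record."""
    values = []
    for record in request["values"]:
        values.append({
            "recordId": record["recordId"],
            "data": {"words": extract_words(record["data"]["text"])},
            "errors": [],
            "warnings": [],
        })
    return {"values": values}


sample_request = {
    "values": [
        {
            "recordId": "record1",
            "data": {
                "text": "ML.NET's RemoveDefaultStopWords API removes stop words "
                        "from tHe text/string. It requires the text/string to "
                        "be tokenized beforehand.",
                "languageCode": "en",
            },
        }
    ]
}

response = run_skill(sample_request)
print(response["values"][0]["data"]["words"])
```

With this stop list, the printed words match the "words" array in the sample output: "ML.NET's" becomes "mlnets" and "text/string" becomes "textstring" because punctuation is removed before tokenizing.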

Sample Skillset Integration

To use this skill in an AI search pipeline, add a skill definition to your skillset. Here's a sample skill definition for this example (update the inputs and outputs to reflect your scenario and skillset environment):

{
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "description": "Tokenizer",
    "uri": "[AzureFunctionEndpointUrl]/api/tokenizer?code=[AzureFunctionDefaultHostKey]",
    "batchSize": 1,
    "context": "/document/content",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "words",
            "targetName": "words"
        }
    ]
}
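
Because the endpoint URL, host key, and input sources vary per deployment, it can help to generate the definition from parameters. The following Python sketch is one possible way to do that. The function name build_tokenizer_skill, the example URL, and the placeholder key are all hypothetical. The check that each source starts with /document reflects the convention that enrichment paths are rooted at the document.

```python
import json
from urllib.parse import urlencode


def build_tokenizer_skill(function_url, host_key,
                          text_source="/document/content",
                          language_source="/document/language"):
    """Hypothetical helper: return the skill definition as a dict."""
    for source in (text_source, language_source):
        if not source.startswith("/document"):
            raise ValueError(f"Source path must start with /document: {source!r}")
    query = urlencode({"code": host_key})  # URL-encode the key for the query string
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Tokenizer",
        "uri": f"{function_url.rstrip('/')}/api/tokenizer?{query}",
        "batchSize": 1,
        "context": "/document/content",
        "inputs": [
            {"name": "text", "source": text_source},
            {"name": "languageCode", "source": language_source},
        ],
        "outputs": [{"name": "words", "targetName": "words"}],
    }


# Placeholder values; substitute your own endpoint and key.
skill = build_tokenizer_skill("https://example.azurewebsites.net", "PLACEHOLDER_KEY")
print(json.dumps(skill, indent=4))
```

The printed JSON can be pasted into the skills array of a skillset after replacing the placeholder values.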