What's new in ML.NET

Note

This article is a work in progress.

You can find all of the release notes for the ML.NET API in the dotnet/machinelearning repo.

New deep-learning tasks

ML.NET 3.0 added support for the following deep-learning tasks:

  • Object detection (backed by TorchSharp)
  • Named entity recognition (NER)
  • Question answering (QA)

These trainers are included in the Microsoft.ML.TorchSharp package. For more information, see Announcing ML.NET 3.0.
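
A minimal sketch of what wiring up one of these trainers might look like is shown below. It assumes a reference to the Microsoft.ML.TorchSharp package; the trainer method name (NamedEntityRecognition), the column names, and the in-memory trainingExamples collection are hypothetical placeholders for illustration, not APIs confirmed by this article.

using Microsoft.ML;
using Microsoft.ML.TorchSharp; // exposes the deep-learning trainers on MLContext

var mlContext = new MLContext();

// Assumed in-memory training set: each example carries a Sentence and a Label column.
IDataView trainData = mlContext.Data.LoadFromEnumerable(trainingExamples);

// Hypothetical pipeline; the trainer name and its option names are illustrative only.
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.MulticlassClassification.Trainers.NamedEntityRecognition(
        labelColumnName: "Label",
        outputColumnName: "PredictedLabel",
        sentence1ColumnName: "Sentence"))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

ITransformer model = pipeline.Fit(trainData);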

AutoML

In ML.NET 3.0, the AutoML sweeper was updated to support the sentence similarity, question answering, and object detection tasks. For more information about AutoML, see How to use the ML.NET Automated Machine Learning (AutoML) API.
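
The sketch below illustrates the general AutoMLExperiment pattern the AutoML API uses. It sweeps a plain regression task as a stand-in, and the file path, column names, and time budget are placeholder assumptions; the deep-learning tasks listed above are configured through the same experiment pattern.

using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Infer the schema of a local file; the path and label column name are placeholders.
ColumnInferenceResults columnInference =
    mlContext.Auto().InferColumns("data.csv", labelColumnName: "Label", groupColumns: false);

TextLoader loader = mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
IDataView data = loader.Load("data.csv");
var trainValidationData = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

// Let AutoML propose featurization and sweep over regression trainers.
SweepablePipeline pipeline =
    mlContext.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(mlContext.Auto().Regression(labelColumnName: "Label"));

// Configure and run the experiment within a fixed time budget.
AutoMLExperiment experiment = mlContext.Auto().CreateExperiment();
experiment
    .SetPipeline(pipeline)
    .SetRegressionMetric(RegressionMetric.RSquared, labelColumn: "Label")
    .SetTrainingTimeInSeconds(60)
    .SetDataset(trainValidationData);

TrialResult result = await experiment.RunAsync();
Console.WriteLine($"Best metric: {result.Metric}");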

Additional tokenizer support

Microsoft.ML.Tokenizers is an open-source, cross-platform tokenization library. When it was introduced, the library was scoped to the Byte-Pair Encoding (BPE) tokenization strategy to satisfy the set of language scenarios in ML.NET. Version 4.0 Preview 1 added support for the Tiktoken tokenizer.

The following examples show how to use the Tiktoken text tokenizer.

using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");
string text = "Hello, World!";

// Encode to IDs.
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {9906, 11, 4435, 0}

// Decode IDs to text.
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!

// Get token count.
int idsCount = tokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 4

// Full encoding.
EncodingResult result = tokenizer.Encode(text);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Tokens)}'}}");
// result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Offsets = {{{string.Join(", ", result.Offsets)}}}");
// result.Offsets = {(0, 5), (5, 1), (6, 6), (12, 1)}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Ids)}}}");
// result.Ids = {9906, 11, 4435, 0}

// Encode up to number of tokens limit.
int index1 = tokenizer.IndexOfTokenCount(
    text,
    maxTokenCount: 1,
    out string processedText1,
    out int tokenCount1
    ); // Encode up to one token.
Console.WriteLine($"processedText1 = {processedText1}");
// processedText1 = Hello, World!
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");
// index1 = 5

int index2 = tokenizer.LastIndexOfTokenCount(
    text,
    maxTokenCount: 1,
    out string processedText2,
    out int tokenCount2
    ); // Encode from end up to one token.
Console.WriteLine($"processedText2 = {processedText2}");
// processedText2 = Hello, World!
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 12

About tokenization

Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as tokens. When using services like Azure OpenAI, you can use tokenizers to get a better understanding of cost and manage context. When working with self-hosted or local models, tokens are the inputs provided to those models.
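
For example, building on the Tiktoken tokenizer shown earlier, you can count tokens locally before calling a hosted model. The context-window size and per-token price below are placeholder values for illustration, not real service quotas.

using System;
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");

string prompt = "Summarize the following paragraph in one sentence: ...";

// Count tokens locally to estimate cost and check the prompt against a context window.
int promptTokens = tokenizer.CountTokens(prompt);

const int contextWindow = 8192;         // placeholder context limit
const decimal pricePer1KTokens = 0.03m; // placeholder price, not a real quote

Console.WriteLine($"Prompt tokens: {promptTokens}");
Console.WriteLine($"Estimated input cost: {promptTokens / 1000m * pricePer1KTokens:F4}");
Console.WriteLine($"Fits in context window: {promptTokens <= contextWindow}");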

Model Builder (Visual Studio extension)

Model Builder has been updated to consume the ML.NET 3.0 release. Model Builder version 17.18.0 added question answering (QA) and named entity recognition (NER) scenarios.

You can find all of the Model Builder release notes in the dotnet/machinelearning-modelbuilder repo.

See also