CategoricalCatalog.OneHotHashEncoding Method

Definition

Overloads

OneHotHashEncoding(TransformsCatalog+CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator+OutputKind, Int32, UInt32, Boolean, Int32)

Create a OneHotHashEncodingEstimator, which converts one or more input text columns specified by columns into as many columns of hash-based one-hot encoded vectors.

OneHotHashEncoding(TransformsCatalog+CategoricalTransforms, String, String, OneHotEncodingEstimator+OutputKind, Int32, UInt32, Boolean, Int32)

Create a OneHotHashEncodingEstimator, which converts a text column specified by inputColumnName into a hash-based one-hot encoded vector column named outputColumnName.

OneHotHashEncoding(TransformsCatalog+CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator+OutputKind, Int32, UInt32, Boolean, Int32)

Create a OneHotHashEncodingEstimator, which converts one or more input text columns specified by columns into as many columns of hash-based one-hot encoded vectors.

public static Microsoft.ML.Transforms.OneHotHashEncodingEstimator OneHotHashEncoding (this Microsoft.ML.TransformsCatalog.CategoricalTransforms catalog, Microsoft.ML.InputOutputColumnPair[] columns, Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind outputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, int numberOfBits = 16, uint seed = 314489979, bool useOrderedHashing = true, int maximumNumberOfInverts = 0);
static member OneHotHashEncoding : Microsoft.ML.TransformsCatalog.CategoricalTransforms * Microsoft.ML.InputOutputColumnPair[] * Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind * int * uint32 * bool * int -> Microsoft.ML.Transforms.OneHotHashEncodingEstimator
<Extension()>
Public Function OneHotHashEncoding (catalog As TransformsCatalog.CategoricalTransforms, columns As InputOutputColumnPair(), Optional outputKind As OneHotEncodingEstimator.OutputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, Optional numberOfBits As Integer = 16, Optional seed As UInteger = 314489979, Optional useOrderedHashing As Boolean = true, Optional maximumNumberOfInverts As Integer = 0) As OneHotHashEncodingEstimator

Parameters

catalog
TransformsCatalog.CategoricalTransforms

The transform catalog

columns
InputOutputColumnPair[]

The pairs of input and output columns. The output columns' data type will be a vector of Single if outputKind is Bag, Indicator, and Binary. If outputKind is Key, the output columns' data type will be a key in the case of scalar input column or a vector of keys in the case of a vector input column.

outputKind
OneHotEncodingEstimator.OutputKind

The conversion mode.

numberOfBits
Int32

Number of bits to hash into. Must be between 1 and 30, inclusive.

seed
UInt32

Hashing seed.

useOrderedHashing
Boolean

Whether the position of each term should be included in the hash.

maximumNumberOfInverts
Int32

During hashing we constuct mappings between original values and the produced hash values. Text representation of original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one. maximumNumberOfInverts specifies the upper bound of the number of distinct input values mapping to a hash that should be retained. 0 does not retain any input values. -1 retains all input values mapping to each hash.

Returns

Examples

using System;
using Microsoft.ML;

namespace Samples.Dynamic.Transforms.Categorical
{
    public static class OneHotHashEncodingMultiColumn
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging as well as the source of randomness.
            var mlContext = new MLContext();

            // Get a small dataset as an IEnumerable.
            var samples = new[]
            {
                new DataPoint {Education = "0-5yrs", ZipCode = "98005"},
                new DataPoint {Education = "0-5yrs", ZipCode = "98052"},
                new DataPoint {Education = "6-11yrs", ZipCode = "98005"},
                new DataPoint {Education = "6-11yrs", ZipCode = "98052"},
                new DataPoint {Education = "11-15yrs", ZipCode = "98005"}
            };

            // Convert training data to IDataView.
            IDataView data = mlContext.Data.LoadFromEnumerable(samples);

            // Multi column example: A pipeline for one hot has encoding two
            // columns 'Education' and 'ZipCode'.
            var multiColumnKeyPipeline =
                mlContext.Transforms.Categorical.OneHotHashEncoding(
                    new[]
                    {
                        new InputOutputColumnPair("Education"),
                        new InputOutputColumnPair("ZipCode")
                    },
                    numberOfBits: 3);

            // Fit and Transform the data.
            IDataView transformedData =
                multiColumnKeyPipeline.Fit(data).Transform(data);

            var convertedData =
                mlContext.Data.CreateEnumerable<TransformedData>(transformedData,
                    true);

            Console.WriteLine(
                "One Hot Hash Encoding of two columns 'Education' and 'ZipCode'.");

            // One Hot Hash Encoding of two columns 'Education' and 'ZipCode'.

            foreach (TransformedData item in convertedData)
                Console.WriteLine("{0}\t\t\t{1}", string.Join(" ", item.Education),
                    string.Join(" ", item.ZipCode));

            // We have 8 slots, because we used numberOfBits = 3.

            // 0 0 0 1 0 0 0 0                 0 0 0 0 0 0 0 1
            // 0 0 0 1 0 0 0 0                 1 0 0 0 0 0 0 0
            // 0 0 0 0 1 0 0 0                 0 0 0 0 0 0 0 1
            // 0 0 0 0 1 0 0 0                 1 0 0 0 0 0 0 0
            // 0 0 0 0 0 0 0 1                 0 0 0 0 0 0 0 1
        }

        private class DataPoint
        {
            public string Education { get; set; }

            public string ZipCode { get; set; }
        }

        private class TransformedData
        {
            public float[] Education { get; set; }

            public float[] ZipCode { get; set; }
        }
    }
}

Remarks

If multiple columns are passed to the estimator, all of the columns will be processed in a single pass over the data. Therefore, it is more efficient to specify one estimator with many columns than it is to specify many estimators each with a single column.

Applies to

OneHotHashEncoding(TransformsCatalog+CategoricalTransforms, String, String, OneHotEncodingEstimator+OutputKind, Int32, UInt32, Boolean, Int32)

Create a OneHotHashEncodingEstimator, which converts a text column specified by inputColumnName into a hash-based one-hot encoded vector column named outputColumnName.

public static Microsoft.ML.Transforms.OneHotHashEncodingEstimator OneHotHashEncoding (this Microsoft.ML.TransformsCatalog.CategoricalTransforms catalog, string outputColumnName, string inputColumnName = default, Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind outputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, int numberOfBits = 16, uint seed = 314489979, bool useOrderedHashing = true, int maximumNumberOfInverts = 0);
static member OneHotHashEncoding : Microsoft.ML.TransformsCatalog.CategoricalTransforms * string * string * Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind * int * uint32 * bool * int -> Microsoft.ML.Transforms.OneHotHashEncodingEstimator
<Extension()>
Public Function OneHotHashEncoding (catalog As TransformsCatalog.CategoricalTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional outputKind As OneHotEncodingEstimator.OutputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, Optional numberOfBits As Integer = 16, Optional seed As UInteger = 314489979, Optional useOrderedHashing As Boolean = true, Optional maximumNumberOfInverts As Integer = 0) As OneHotHashEncodingEstimator

Parameters

catalog
TransformsCatalog.CategoricalTransforms

The transform catalog.

outputColumnName
String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a vector of Single if outputKind is Bag, Indicator, and Binary. If outputKind is Key, this column's data type will be a key in the case of a scalar input column or a vector of keys in the case of a vector input column.

inputColumnName
String

Name of column to transform. If set to null, the value of the outputColumnName will be used as source. This column's data type can be scalar or vector of numeric, text, boolean, DateTime or DateTimeOffset.

outputKind
OneHotEncodingEstimator.OutputKind

The conversion mode.

numberOfBits
Int32

Number of bits to hash into. Must be between 1 and 30, inclusive.

seed
UInt32

Hashing seed.

useOrderedHashing
Boolean

Whether the position of each term should be included in the hash.

maximumNumberOfInverts
Int32

During hashing we constuct mappings between original values and the produced hash values. Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one. maximumNumberOfInverts specifies the upper bound of the number of distinct input values mapping to a hash that should be retained. 0 does not retain any input values. -1 retains all input values mapping to each hash.

Returns

Examples

using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

namespace Samples.Dynamic.Transforms.Categorical
{
    public static class OneHotHashEncoding
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new[]
            {
                new DataPoint {Education = "0-5yrs"},
                new DataPoint {Education = "0-5yrs"},
                new DataPoint {Education = "6-11yrs"},
                new DataPoint {Education = "6-11yrs"},
                new DataPoint {Education = "11-15yrs"}
            };

            // Convert training data to an IDataView.
            IDataView data = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for one hot hash encoding the 'Education' column.
            var pipeline = mlContext.Transforms.Categorical.OneHotHashEncoding(
                "EducationOneHotHashEncoded", "Education", numberOfBits: 3);

            // Fit and transform the data.
            IDataView hashEncodedData = pipeline.Fit(data).Transform(data);

            PrintDataColumn(hashEncodedData, "EducationOneHotHashEncoded");
            // We have 8 slots, because we used numberOfBits = 3.

            // 0 0 0 1 0 0 0 0
            // 0 0 0 1 0 0 0 0
            // 0 0 0 0 1 0 0 0
            // 0 0 0 0 1 0 0 0
            // 0 0 0 0 0 0 0 1

            // A pipeline for one hot hash encoding the 'Education' column
            // (using keying strategy).
            var keyPipeline = mlContext.Transforms.Categorical.OneHotHashEncoding(
                "EducationOneHotHashEncoded", "Education",
                OneHotEncodingEstimator.OutputKind.Key, 3);

            // Fit and transform the data.
            IDataView hashKeyEncodedData = keyPipeline.Fit(data).Transform(data);

            // Get the data of the newly created column for inspecting.
            var keyEncodedColumn =
                hashKeyEncodedData.GetColumn<uint>("EducationOneHotHashEncoded");

            Console.WriteLine(
                "One Hot Hash Encoding of single column 'Education', with key " +
                "type output.");

            // One Hot Hash Encoding of single column 'Education', with key type output.

            foreach (uint element in keyEncodedColumn)
                Console.WriteLine(element);

            // 4
            // 4
            // 5
            // 5
            // 8
        }

        private static void PrintDataColumn(IDataView transformedData,
            string columnName)
        {
            var countSelectColumn = transformedData.GetColumn<float[]>(
                transformedData.Schema[columnName]);

            foreach (var row in countSelectColumn)
            {
                for (var i = 0; i < row.Length; i++)
                    Console.Write($"{row[i]}\t");
                Console.WriteLine();
            }
        }

        private class DataPoint
        {
            public string Education { get; set; }
        }
    }
}

Applies to