DataSchema Class

Reference

Defines a schema for a dataset.

Inheritance: builtins.object -> DataSchema

Constructor

DataSchema(schema, **options)

Examples


   from nimbusml import DataSchema, FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd

   data = pd.DataFrame(dict(real = [0.1, 2.2],
                           text = ['word','class'],
                           y = [1,3]))
   data.to_csv('data.csv', index = False, header = True)

   schema = DataSchema.read_schema('data.csv', collapse = False,
                                   numeric_dtype = np.float32,
                                   sep = ',')
   print(schema)
   #col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,

   exp = Pipeline([
                OneHotVectorizer(columns = ['text']),
                LightGbmRegressor(minimum_example_count_per_leaf = 1)
               ])

   exp.fit(FileDataStream('data.csv', schema = schema), 'y')

Remarks

The DataSchema class generates a description of the data schema from a data source, which may be a list, an array, a dataframe, or a file. Every nimbusml trainer and transform requires a schema; when one is not provided explicitly, it must be inferred before any data processing can occur. For lists, arrays, and dataframes, schema inference is usually straightforward. When the data source is a file, inspect the inferred schema to confirm that it matches the data and that the types are aligned as needed (e.g. R4 vs I4).

For more details on the schema format, refer to Schema, Types and Vector Type.
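To make the type-alignment concern concrete, here is a minimal, standard-library sketch of schema inference from delimited text. It is not nimbusml's implementation: the helper names `infer_type` and `infer_schema` are hypothetical, and the mapping (integers to I4, other numbers to R4, everything else to TX) is a simplification chosen for illustration. The output format mirrors the schema string printed in the example above.

```python
import csv
import io


def infer_type(values, numeric_type=None):
    """Return a type code for a column of raw strings (simplified rules)."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        code = "I4"
    elif all_parse(float):
        code = "R4"
    else:
        return "TX"
    # Like numeric_dtype, an override forces every numeric column to one type.
    return numeric_type or code


def infer_schema(text, sep=",", numeric_type=None):
    """Build a schema string from delimited text that has a header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header, body = rows[0], rows[1:]
    parts = []
    for index, name in enumerate(header):
        column = [row[index] for row in body]
        parts.append("col={}:{}:{}".format(
            name, infer_type(column, numeric_type), index))
    parts.append("header=+")
    parts.append("sep=" + sep)
    return " ".join(parts)


csv_text = "real,text,y\n0.1,word,1\n2.2,class,3\n"
print(infer_schema(csv_text))
# col=real:R4:0 col=text:TX:1 col=y:I4:2 header=+ sep=,
print(infer_schema(csv_text, numeric_type="R4"))
# col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,
```

Without the override, the integer-valued column y is inferred as I4; forcing a single numeric type aligns it with the other numeric column, which is the kind of mismatch worth checking when a schema is inferred from a file.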

Methods

clone
extract_idv_schema_from_file
format_options	Formats the options for the parser from the core library.
parse	Parses a schema defined as a string.
read_schema	Infers the schema of a data view.
read_schema_file	Infers the schema of a file.
rename	Renames a column.
to_string	Converts the schema into a string.

clone

clone()

extract_idv_schema_from_file

extract_idv_schema_from_file(path)

Parameters

path

Required

format_options

Formats the options for the parser from the core library.

format_options(add_sep=False)

Parameters

add_sep

default value: False

The core library usually requires the separator. It is not added if the user does not explicitly specify it, unless add_sep is True; in that case, the default value is added.

Returns

formatted options as a string
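The add_sep behavior described above can be sketched in a few lines. This is an illustration only, not the library's code: the function name `format_options_sketch`, the options dictionary, and the default separator `,` are all assumptions made for this sketch.

```python
DEFAULT_SEP = ","  # hypothetical default, chosen only for this sketch


def format_options_sketch(options, add_sep=False):
    """Format loader options; emit a separator only when it is known or forced."""
    parts = []
    if options.get("header"):
        parts.append("header=+")
    if "sep" in options:
        # The user specified a separator, so it is always written out.
        parts.append("sep=" + options["sep"])
    elif add_sep:
        # No separator was specified; fall back to the default on request.
        parts.append("sep=" + DEFAULT_SEP)
    return " ".join(parts)


print(format_options_sketch({"header": True}))
# header=+
print(format_options_sketch({"header": True}, add_sep=True))
# header=+ sep=,
print(format_options_sketch({"header": True, "sep": ";"}))
# header=+ sep=;
```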

parse

Parses a schema defined as a string.

parse(schema)

Parameters

schema

Required
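As a rough illustration of what parsing a schema string involves, the sketch below splits a string in the format printed by the example above (`col=name:type:position` tokens followed by options such as `header=+` and `sep=,`). The function `parse_schema` is a hypothetical helper, not the library's parser, and it assumes that tokens are separated by spaces and that column names contain neither spaces nor colons.

```python
def parse_schema(text):
    """Split a schema string into column records and loader options."""
    columns, options = [], {}
    for token in text.split():
        key, _, value = token.partition("=")
        if key == "col":
            # A column token has the form name:type:position.
            name, type_code, position = value.split(":")
            columns.append(
                {"name": name, "type": type_code, "position": position})
        else:
            # Anything else is treated as a key=value loader option.
            options[key] = value
    return columns, options


columns, options = parse_schema(
    "col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,")
for column in columns:
    print(column["name"], column["type"], column["position"])
print(options)
```

The position is kept as a string on purpose, so that the sketch does not have to assume every position is a single integer.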

read_schema

Infers the schema of a data view.

read_schema(*data, **options)

Parameters

data

Required

features, labels, weights, groups

collapse

Required

Defaults to False. Collapses columns of the same type, following the internal structure of the dataframe returned by the read_csv function. If collapse == 'all', the method collapses all columns not specified in the names parameter.

sep

Required

string value of the file separator character (for example: ',')

header

Required

whether the data has a header row; defaults to True

dtype

Required

change dtype of specific columns; takes dictionary of column names mapped to desired dtype

numeric_dtype

Required

if not None, changes all numeric types into this type

names

Required

specify new names for columns; takes dictionary of column index mapped to desired name

ind

Required

first column index (in case DataFrames are concatenated)

tool

Required

'pandas' or 'nimbusml'

Returns

schema as a string
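The idea behind the collapse option, merging columns of the same type, can be sketched as follows. This is a guess at the general technique rather than the library's behavior: the helper names are hypothetical, the collapsed column simply keeps the name of the first column in each run, and the `first-last` position range in the output is an assumed notation.

```python
from itertools import groupby


def collapse_columns(columns):
    """Merge runs of adjacent columns that share a type code.

    `columns` is a list of (name, type_code) pairs in file order.
    Returns (name, type_code, first_index, last_index) tuples.
    """
    collapsed = []
    index = 0
    for type_code, run in groupby(columns, key=lambda column: column[1]):
        names = [name for name, _ in run]
        first, last = index, index + len(names) - 1
        # Assumption: the merged column keeps the first name of the run.
        collapsed.append((names[0], type_code, first, last))
        index = last + 1
    return collapsed


def to_schema_string(collapsed):
    """Render collapsed columns, using a range for multi-column runs."""
    parts = []
    for name, type_code, first, last in collapsed:
        span = str(first) if first == last else "{}-{}".format(first, last)
        parts.append("col={}:{}:{}".format(name, type_code, span))
    return " ".join(parts)


columns = [("f0", "R4"), ("f1", "R4"), ("f2", "R4"), ("label", "TX")]
print(to_schema_string(collapse_columns(columns)))
# col=f0:R4:0-2 col=label:TX:3
```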

read_schema_file

Infers the schema of a file.

Additional options:

collapse: defaults to False. Collapses columns of the same type, following the internal structure of the dataframe returned by the read_csv function. If collapse == 'all', the method collapses all columns not specified in the names parameter.
numeric_dtype: if not None, changes all numeric types into this type.

read_schema_file(filepath_or_buffer, tool='pandas', nrows=100, **options)

Parameters

filepath_or_buffer

Required

stream or filename

tool

default value: pandas

'pandas' or 'nimbusml'

nrows

default value: 100

number of rows to read from the top of the file; only these rows are used to infer the schema

options

Required

additional options for read_csv from pandas or internal reader

Returns

schema

rename

Renames a column.

rename(old_name, new_name)

Parameters

old_name

Required

old name

new_name

Required

new name

Returns

self
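Because rename returns self, calls can be chained. The toy class below shows the pattern; `TinySchema` is a hypothetical stand-in written for this sketch, not the DataSchema class, and raising KeyError for an unknown column is an assumption.

```python
class TinySchema:
    """Hypothetical stand-in for a schema: an ordered list of [name, type]."""

    def __init__(self, columns):
        self.columns = [list(column) for column in columns]

    def rename(self, old_name, new_name):
        for column in self.columns:
            if column[0] == old_name:
                column[0] = new_name
                return self  # returning self allows chained calls
        raise KeyError("no column named {!r}".format(old_name))

    def to_string(self):
        return " ".join(
            "col={}:{}:{}".format(name, type_code, index)
            for index, (name, type_code) in enumerate(self.columns)
        )


schema = TinySchema([("real", "R4"), ("text", "TX"), ("y", "R4")])
print(schema.rename("y", "label").rename("text", "word").to_string())
# col=real:R4:0 col=word:TX:1 col=label:R4:2
```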

to_string

Converts the schema into a string.

to_string(add_sep=False)

Parameters

add_sep

default value: False

The separator is not added if the user does not specify it. Because the core library requires one, setting add_sep to True adds the default value when none is specified.

Returns

formatted schema as a string
