FileDataStream Class

Data view from a file.

Inheritance
nimbusml.internal.utils.data_stream.DataStream
FileDataStream

Constructor

FileDataStream(filename, schema, roles=None)

Examples


   import numpy as np
   import pandas as pd
   from nimbusml import FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer

   # Write a small two-row dataset to disk so it can be streamed back.
   data = pd.DataFrame(dict(real=[0.1, 2.2],
                            text=['word', 'class'],
                            y=[1, 3]))
   data.to_csv('data.csv', index=False, header=True)

   # Infer the schema from the file; numeric columns are read as float32.
   ds = FileDataStream.read_csv('data.csv', collapse=False,
                                numeric_dtype=np.float32, sep=',')
   ds.head()
   #   real   text    y
   #0   0.1   word  1.0
   #1   2.2  class  3.0

   # One-hot encode the text column, then train a LightGBM regressor.
   exp = Pipeline([
       OneHotVectorizer(columns=['text']),
       LightGbmRegressor(minimum_example_count_per_leaf=1)
   ])

   # Train on the streamed data, using 'y' as the label column.
   exp.fit(ds, 'y')

Remarks

FileDataStream enables training from files by streaming the examples sequentially. Some trainers require the full data matrix to be resident in memory and will cache the data if required. For trainers that implement online or batch techniques, using FileDataStream substantially reduces overall memory utilization. Runtime efficiency also improves, and data copying is minimized, when nimbusml trainers and transforms are used in conjunction with the FileDataStream text reader.

A schema is required to describe the column names, positions, types and delimiters of the data. It can be provided explicitly to FileDataStream by constructing it with the DataSchema class, or the read_csv method can be used to infer the schema automatically. For more control over column names and index ranges, especially for vector-type columns, the schema can be designed manually.

For more details of the schema format, refer to Schema and DataSchema.
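As a rough illustration of providing the schema explicitly, the sketch below builds a schema string for the three-column file from the example above and hands it to FileDataStream through DataSchema. The helper names schema_string and open_stream are hypothetical. The string layout (col=name:type:position, header=+, sep=,) and the ability to construct DataSchema from such a string are assumptions here and should be checked against the Schema and DataSchema pages.

```python
def schema_string(columns, sep=",", header=True):
    """Build a schema string from (name, type) pairs, in file order.

    Assumed layout: 'col=<name>:<type>:<position> ... header=+ sep=<sep>'.
    """
    parts = [f"col={name}:{kind}:{position}"
             for position, (name, kind) in enumerate(columns)]
    if header:
        parts.append("header=+")
    parts.append(f"sep={sep}")
    return " ".join(parts)


def open_stream(path, schema_text):
    """Hypothetical wrapper: open `path` with an explicit schema."""
    # Imported lazily so the string helper above works without nimbusml.
    from nimbusml import DataSchema, FileDataStream
    # Assumption: DataSchema can be constructed from a schema string.
    return FileDataStream(path, schema=DataSchema(schema_text))


# R4 (32-bit float) and TX (text) are assumed type codes.
text = schema_string([("real", "R4"), ("text", "TX"), ("y", "R4")])
print(text)
```

With nimbusml installed, open_stream('data.csv', text) would then be expected to yield a stream equivalent to the one read_csv infers in the example above.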

Methods

clone

Copy/clone the object.

clone()

read_csv

Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments accepted by DataSchema.read_schema() apply to this method as well.

read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)

Parameters

filepath_or_buffer
Required

filename or stream

tool
Optional (default: None)

Parser used to guess the schema: 'internal' (this module) or 'pandas'. If None, the function chooses the most relevant one based on the additional arguments passed to it.

nrows
Optional (default: 100)

number of rows used to guess the schema

numeric_dtype
Optional

Converts all numeric columns to a single type. numpy.float32 is recommended in many cases.

collapse
Optional

Collapses columns of the same type (False by default) when the schema is guessed through the read_csv function. Uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the names parameter.

sep
Optional

Separator between the data columns, such as ',' or '\t'.

header
Optional

Whether the input data has a header row; can be True or False.

names
Optional

Renames the data columns. Users can specify a dictionary with the column number as the key, such as {0:'Label', 1:'GroupId', (2,None):'Features'}. This renames columns 0 and 1 to Label and GroupId, and renames columns 2 through the end to Features_0, ..., Features_2040.

dtype
Optional

Overrides the data column types. Users can specify a dictionary with the column name as the key, such as {'column1': numpy.float32}.

kwargs
Optional

additional parameters sent to read_csv or the internal parser.

Returns

a FileDataStream instance
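The renaming rule described for the names parameter can be sketched in plain Python. This is an illustration of the documented behavior only, not nimbusml's implementation: expand_names is a hypothetical helper, the original column names are made up, and treating a non-None range end as inclusive is an assumption (the text above only documents the (start, None) form).

```python
def expand_names(names, original):
    """Return the column names after applying a `names` mapping.

    Integer keys rename one column. Tuple keys (start, stop) rename a
    range of columns to <name>_0, <name>_1, ...; stop=None means
    "through the last column".
    """
    renamed = list(original)
    last = len(original) - 1
    for key, new_name in names.items():
        if isinstance(key, tuple):
            start, stop = key
            # Assumption: a non-None stop is inclusive.
            stop = last if stop is None else stop
            for offset, position in enumerate(range(start, stop + 1)):
                renamed[position] = f"{new_name}_{offset}"
        else:
            renamed[key] = new_name
    return renamed


original = ["c0", "c1", "c2", "c3", "c4"]
names = {0: "Label", 1: "GroupId", (2, None): "Features"}
print(expand_names(names, original))
# ['Label', 'GroupId', 'Features_0', 'Features_1', 'Features_2']
```

For a file with 2043 columns, the same mapping would end at Features_2040, matching the example given for names.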

read_csv_pandas

Creates a FileDataStream from a filename or a buffer.

The method uses pandas read_csv to guess the schema from the first nrows rows of the file.

read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)

Parameters

filepath_or_buffer
Required

filename or stream

nrows
Optional (default: 100)

number of rows used to guess the schema

kwargs
Optional

additional parameters sent to read_csv or the internal parser

numeric_dtype
Optional (default: None)

changes all numeric types into the same one

collapse
Optional (default: False)

collapse into one vector column all columns sharing the same type

Returns

a FileDataStream instance
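To make the role of nrows concrete, here is a minimal standard-library sketch of guessing column kinds from only the first rows of a buffer. It is a conceptual stand-in, not what nimbusml or pandas actually does: guess_column_kinds and its 'numeric'/'text' labels are hypothetical, and real schema inference distinguishes many more types.

```python
import csv
import io
from itertools import islice


def _is_number(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def guess_column_kinds(buffer, nrows=100, sep=","):
    """Label each column 'numeric' or 'text' using the first nrows rows."""
    reader = csv.reader(buffer, delimiter=sep)
    header = next(reader)                 # first line holds column names
    sample = list(islice(reader, nrows))  # rows beyond nrows are never read
    kinds = []
    for index, name in enumerate(header):
        values = [row[index] for row in sample]
        numeric = bool(values) and all(_is_number(v) for v in values)
        kinds.append((name, "numeric" if numeric else "text"))
    return kinds


csv_text = "real,text,y\n0.1,word,1\n2.2,class,3\n"
print(guess_column_kinds(io.StringIO(csv_text)))
# [('real', 'numeric'), ('text', 'text'), ('y', 'numeric')]
```

Because only the sampled rows are inspected, a column whose first nrows values look numeric is labeled numeric even if a later row contains text. Raising nrows or overriding the type explicitly is the usual way around this kind of mismatch.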