FileDataStream Class

Reference

Data view from a file.

Inheritance: nimbusml.internal.utils.data_stream.DataStream -> FileDataStream

Constructor

FileDataStream(filename, schema, roles=None)

Examples


   from nimbusml import FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd

   # Write a small sample dataset to disk.
   data = pd.DataFrame(dict(real=[0.1, 2.2],
                            text=['word', 'class'],
                            y=[1, 3]))
   data.to_csv('data.csv', index=False, header=True)

   # Infer the schema from the file; read every numeric column as float32.
   ds = FileDataStream.read_csv('data.csv', collapse=False,
                                numeric_dtype=np.float32, sep=',')
   ds.head()
   #   real   text    y
   #0   0.1   word  1.0
   #1   2.2  class  3.0

   # One-hot encode the text column, then train a regressor on label 'y'.
   exp = Pipeline([
       OneHotVectorizer(columns=['text']),
       LightGbmRegressor(minimum_example_count_per_leaf=1)
   ])

   exp.fit(ds, 'y')

Remarks

FileDataStream enables training from files by streaming the examples sequentially. Some trainers require the full data matrix to be resident in memory and will cache the data if required. For trainers that implement online or batch techniques, using FileDataStream substantially reduces overall memory utilization. When nimbusml trainers and transforms are used together with the FileDataStream text reader, runtime efficiency also improves and data copying is minimized.
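The idea of streaming examples sequentially can be illustrated without nimbusml. The sketch below is not how FileDataStream is implemented; it only shows, using the standard library, why a single pass over a file needs far less memory than loading the full matrix. The helper names `stream_rows` and `running_mean` are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io


def stream_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row


def running_mean(rows, column):
    """Single-pass mean: only a count and a running total are kept in memory."""
    count = 0
    total = 0.0
    for row in rows:
        count += 1
        total += float(row[column])
    return total / count if count else float("nan")


# In-memory stand-in for a CSV file; the contents mirror the example above.
csv_text = "real,text,y\n0.1,word,1\n2.2,class,3\n"
mean_y = running_mean(stream_rows(io.StringIO(csv_text)), "y")
print(mean_y)  # 2.0
```

An algorithm that can update its state one row (or one batch) at a time fits this pattern; one that needs random access to every row does not, which is why such trainers cache the data.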

A schema is required to describe the column names, positions, types and delimiters of the data. It can be provided explicitly to FileDataStream by constructing it with the DataSchema class. Alternatively, the read_csv method can infer the schema automatically. For more control over column names and index ranges, especially for Vector Type columns, the schema can be designed manually.

For more details of the schema format, refer to Schema and DataSchema.
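To make concrete what information a schema carries, here is a minimal standard-library sketch. It does not use DataSchema or its schema string format; `ColumnSpec` and `parse_line` are hypothetical names introduced only to show column names, positions, types and a delimiter being applied to raw text.

```python
import csv
import io
from typing import NamedTuple


class ColumnSpec(NamedTuple):
    """Hypothetical description of one column: name, position, and type."""
    name: str
    position: int
    kind: type  # float for numeric columns, str for text columns


# Hand-written description of the three-column file from the example above.
columns = [
    ColumnSpec("real", 0, float),
    ColumnSpec("text", 1, str),
    ColumnSpec("y", 2, float),
]
delimiter = ","


def parse_line(fields, columns):
    """Pick each field by position and convert it to the declared type."""
    return {c.name: c.kind(fields[c.position]) for c in columns}


data = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
reader = csv.reader(data, delimiter=delimiter)
next(reader)  # skip the header row
rows = [parse_line(fields, columns) for fields in reader]
print(rows)
```

Without such a description a reader cannot tell whether `1` is a number or a label, which is why a schema must be either supplied or inferred.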

Methods

clone

Copy/clone the object.

read_csv

Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() accepts apply to this method as well.

read_csv_pandas

Creates a FileDataStream from a filename or a buffer.

The method uses the pandas read_csv function to guess the schema of a file from its first nrows rows.

clone

Copy/clone the object.

clone()

read_csv

read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)

Parameters

filepath_or_buffer

Required

filename or stream

tool

Optional (default: None)

Parser used to guess the schema: 'internal' (this module) or 'pandas'. If None, the function chooses the most relevant one based on the additional arguments passed to it.

nrows

Optional (default: 100)

Number of rows used to guess the schema.

numeric_dtype

Optional

Converts all numeric types to the same one. Using numpy.float32 is recommended in many cases.

collapse

Optional (default: False)

Collapses columns of the same type into one column when they follow one another. This uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the names parameter.

sep

Optional

Separator between the data columns, such as ',' or '\t'.

header

Optional

Whether the input data has a header row. Can be True or False.

names

Optional

Renames the data columns. Users can specify a dictionary keyed by column number, such as {0:'Label', 1:'GroupId', (2,None):'Features'}. This renames columns 0 and 1 to Label and GroupId, and renames columns 2 through the end to Features_0, Features_1, and so on (up to Features_2040 for a file with 2043 columns).

dtype

Optional

Overrides the data column types. Users can specify a dictionary keyed by column name, such as {'column1': numpy.float32}.

kwargs

Optional

Additional parameters sent to read_csv or the internal parser.

Returns

a FileDataStream instance
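The renaming rule described for the names parameter can be sketched in plain Python. `expand_names` is a hypothetical helper, not part of nimbusml; it covers only the two key forms shown in the parameter description (an integer column number and an open-ended `(start, None)` range) and rejects anything else rather than guess at its meaning.

```python
def expand_names(names, n_columns):
    """Hypothetical helper: expand a names dictionary into per-column names.

    Integer keys rename a single column. A (start, None) key renames every
    column from start through the last one, appending a zero-based suffix.
    """
    expanded = {}
    for key, name in names.items():
        if isinstance(key, int):
            expanded[key] = name
        elif isinstance(key, tuple) and len(key) == 2 and key[1] is None:
            start = key[0]
            for offset, index in enumerate(range(start, n_columns)):
                expanded[index] = f"{name}_{offset}"
        else:
            # Ranges with an explicit end are not described here, so this
            # sketch refuses them instead of assuming a convention.
            raise ValueError(f"key {key!r} is not covered by this sketch")
    return [expanded[i] for i in sorted(expanded)]


names = {0: 'Label', 1: 'GroupId', (2, None): 'Features'}
small = expand_names(names, n_columns=5)
print(small)
# ['Label', 'GroupId', 'Features_0', 'Features_1', 'Features_2']
```

With 2043 columns the same call ends at Features_2040, matching the range quoted in the parameter description.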

read_csv_pandas

Creates a FileDataStream from a filename or a buffer.

The method uses the pandas read_csv function to guess the schema of a file from its first nrows rows.

read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)

Parameters

filepath_or_buffer

Required

filename or stream

nrows

Optional (default: 100)

Number of rows used to guess the schema.

kwargs

Optional

Additional parameters sent to read_csv or the internal parser.

numeric_dtype

Optional (default: None)

Converts all numeric types to the same one.

collapse

Optional (default: False)

Collapses all columns sharing the same type into one vector column.

Returns

a FileDataStream instance
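The two ideas behind this method, guessing column types from the first nrows rows and collapsing columns that share a type, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the nimbusml or pandas logic: `infer_kinds` and `collapse_adjacent` are hypothetical helpers, only "numeric" and "text" kinds are distinguished, and collapsing is assumed to merge neighbouring columns only.

```python
import csv
import io
from itertools import groupby, islice


def infer_kinds(handle, nrows=100, sep=","):
    """Guess 'numeric' or 'text' per column from at most nrows data rows."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    kinds = ["numeric"] * len(header)
    for row in islice(reader, nrows):
        for i, value in enumerate(row[:len(header)]):
            try:
                float(value)
            except ValueError:
                # One non-numeric value is enough to treat the column as text.
                kinds[i] = "text"
    return list(zip(header, kinds))


def collapse_adjacent(columns):
    """Group runs of neighbouring columns that share the same kind."""
    return [
        (kind, [name for name, _ in run])
        for kind, run in groupby(columns, key=lambda column: column[1])
    ]


sample = io.StringIO("a,b,label,c\n1,2,x,3\n4,5,y,6\n")
inferred = infer_kinds(sample, nrows=100)
collapsed = collapse_adjacent(inferred)
print(collapsed)
# [('numeric', ['a', 'b']), ('text', ['label']), ('numeric', ['c'])]
```

Because only the first nrows rows are inspected, a column whose first non-numeric value appears later in the file would be misclassified; raising nrows trades inference time for safety.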
