FileDataStream class

Definition

Data view from a file.

FileDataStream(filename, schema, roles=None)
Inheritance
builtins.object
nimbusml.internal.utils.data_stream.DataStream
FileDataStream

Examples


   from nimbusml import FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd

   data = pd.DataFrame(dict(real = [0.1, 2.2],
                            text = ['word','class'],
                            y = [1,3]))
   data.to_csv('data.csv', index = False, header = True)

   ds = FileDataStream.read_csv('data.csv', collapse = False,
                               numeric_dtype = np.float32, sep = ',')
   ds.head()
   #   real   text    y
   #0   0.1   word  1.0
   #1   2.2  class  3.0
   exp = Pipeline([
                OneHotVectorizer(columns = ['text']),
                LightGbmRegressor(minimum_example_count_per_leaf = 1)
               ])

   exp.fit(ds, 'y')

Remarks

FileDataStream enables training from files by streaming the examples sequentially. Some trainers require the full data matrix to be resident in memory, and will cache the data if required. For trainers that implement online or batch techniques, using FileDataStream will substantially reduce overall memory utilization. Runtime efficiency is also increased and data copying is minimized for nimbusml trainers/transforms when used in conjunction with FileDataStream text reader.
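The memory argument above can be illustrated without nimbusml. The following is a minimal sketch using only the Python standard library, not the library's own reader: `stream_rows` and `running_mean` are hypothetical helper names, and an in-memory buffer stands in for a file on disk. It shows how an online computation only needs one row at a time.

```python
import csv
import io


def stream_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row


def running_mean(rows, column):
    """Online mean: only the count and the current mean are kept in memory."""
    count = 0
    mean = 0.0
    for row in rows:
        count += 1
        mean += (float(row[column]) - mean) / count
    return mean


# In-memory stand-in for a CSV file with the same content as data.csv above.
sample = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
mean_y = running_mean(stream_rows(sample), "y")
```

A trainer that needs the full data matrix would instead have to materialize every row, which is the caching case described above.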

A schema of the data is required to describe the column names, positions, types and delimiters. This can be provided explicitly to FileDataStream by using the DataSchema class to construct it, or optionally the read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs) method can be used to infer the schema automatically. For more control over column names and index ranges, especially Vector Type columns, the schema can be designed manually.

For more details of the schema format, refer to Schema and DataSchema.
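To make the idea of inferring a schema from the first rows more concrete, here is a rough standard-library sketch. It is not the nimbusml implementation: `guess_type` and `infer_schema` are hypothetical names, and the type labels 'int', 'float' and 'str' are illustrative rather than the library's own type codes. It reads the header plus at most `nrows` rows and records each column's position, name and guessed type.

```python
import csv
import io
from itertools import islice


def guess_type(values):
    """Return 'int', 'float' or 'str' for a sequence of raw string values."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "str"


def infer_schema(handle, nrows=100, sep=","):
    """Infer (position, name, type) triples from the header and first nrows rows."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    sample_rows = list(islice(reader, nrows))
    columns = list(zip(*sample_rows))  # transpose rows into columns
    return [(position, name, guess_type(values))
            for position, (name, values) in enumerate(zip(header, columns))]


sample = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
schema = infer_schema(sample, nrows=100)
```

Because only the first `nrows` rows are inspected, a column whose later rows contain a different kind of value would be mistyped, which is one reason to design the schema manually when more control is needed.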

Methods

clone()

Copy/clone the object.

read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)

Creates a FileDataStream from a filename or a buffer. For details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() accepts apply to this method as well.

read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)

Creates a FileDataStream from a filename or a buffer.

The method uses read_csv to infer the schema from the first nrows rows of the file.

read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)

Creates a FileDataStream from a filename or a buffer. For details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() accepts apply to this method as well.

Parameters

filepath_or_buffer

filename or stream

tool

parser used to infer the schema, either 'internal' (this module) or 'pandas'. If None, the function chooses the most suitable one based on the additional arguments passed to it.

nrows

number of rows used to guess the schema

numeric_dtype

converts all numeric types to a single type; numpy.float32 is recommended in many cases

collapse

False by default. If enabled, collapses columns of the same type into one vector column, using the internal structure of the dataframe returned by read_csv. If collapse == 'all', the method collapses all columns not specified in the names parameter.

sep

separator between the data columns, such as ',' or '\t'

header

whether the input data has a header row; can be True or False

names

renames the data columns. Users can specify a dictionary with the column number as the key, such as {0:'Label', 1:'GroupId', (2,None):'Features'}. This renames columns 0 and 1 to Label and GroupId, and renames columns 2 through the end to Features_0, ..., Features_2040.

dtype

overrides the data column types. Users can specify a dictionary with the column name as the key, such as {'column1':numpy.float32}

kwargs

additional parameters sent to read_csv or the internal parser.

Returns

a FileDataStream instance
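The names mapping described in the parameters above can be read as a per-position renaming rule. The sketch below is one interpretation in plain Python, not the library's code: `expand_names` is a hypothetical helper, the `colN` fallback for unnamed columns is invented for illustration, and treating a non-None range end as inclusive is an assumption.

```python
def expand_names(names, n_columns):
    """Expand a names mapping into one name per column position.

    Integer keys rename a single column. A (start, end) tuple renames a
    range; end=None is taken to mean "through the last column", and the
    columns in the range get the suffixes _0, _1, ...
    """
    result = {}
    for key, name in names.items():
        if isinstance(key, tuple):
            start, end = key
            # Assumption: a non-None end is inclusive.
            stop = n_columns if end is None else end + 1
            for offset, position in enumerate(range(start, stop)):
                result[position] = "{}_{}".format(name, offset)
        else:
            result[key] = name
    # Hypothetical fallback name for columns the mapping does not mention.
    return [result.get(i, "col{}".format(i)) for i in range(n_columns)]


expanded = expand_names({0: "Label", 1: "GroupId", (2, None): "Features"}, 5)
```

With five columns, the open-ended key (2, None) covers positions 2 to 4; with 2043 columns the same key would produce Features_0 through Features_2040, matching the example in the names description.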

read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)

Creates a FileDataStream from a filename or a buffer.

The method uses read_csv to infer the schema from the first nrows rows of the file.

Parameters

filepath_or_buffer

filename or stream

nrows

number of rows used to guess the schema

kwargs

additional parameters sent to read_csv or the internal parser.

numeric_dtype

changes all numeric types into the same one

collapse

collapse into one vector column all columns sharing the same type

Returns

a FileDataStream instance
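As a rough illustration of what collapsing columns by type means, the following sketch groups column names by their type. It is only a conceptual model in plain Python: `collapse_by_type` is a hypothetical helper, and how the library names the resulting vector columns or treats non-adjacent columns is not specified here.

```python
def collapse_by_type(columns):
    """Group column names by type, keeping the first-seen order of the types."""
    groups = {}
    for name, dtype in columns:
        groups.setdefault(dtype, []).append(name)
    return groups


# (name, type) pairs in file order; the names and type labels are made up.
columns = [("f0", "float"), ("f1", "float"), ("text", "str"), ("f2", "float")]
vectors = collapse_by_type(columns)
```

Each entry of `vectors` corresponds to one candidate vector column holding all source columns of that type.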