azureml-opendatasets Package

Packages

opendatasets

Contains functionality for consuming Azure Open Datasets as dataframes and for enriching customer data.

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. You can convert these public datasets into Spark and pandas dataframes with filters applied. For some datasets, you can use an enricher to join the public data with your data. For example, you can join your data with weather data by longitude and latitude or zip code and time.
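As a minimal sketch of converting an open dataset into a pandas dataframe with a date filter applied, the following assumes the azureml-opendatasets package is installed and network access is available. It uses the NycTlcGreen class mentioned later on this page. The helper name load_green_taxi_sample and the chosen dates are illustrative only.

```python
from datetime import datetime


def load_green_taxi_sample(start=datetime(2016, 1, 1), end=datetime(2016, 1, 31)):
    """Return NYC green taxi trips between start and end as a pandas dataframe."""
    # Imported lazily so this file can be imported without the package installed.
    from azureml.opendatasets import NycTlcGreen

    # The date range acts as the filter on the public dataset.
    dataset = NycTlcGreen(start_date=start, end_date=end)
    return dataset.to_pandas_dataframe()
```

Calling load_green_taxi_sample() downloads data, so it is best tried first with a short date range.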

Included in Azure Open Datasets are public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. Open Datasets are in the cloud on Microsoft Azure and are integrated into Azure Machine Learning. For more information about working with Azure Open Datasets, see Create datasets with Azure Open Datasets.

For general information about Azure Open Datasets, see Azure Open Datasets Documentation.

Modules

country_or_region_time_customer_data

Customer data with location and time columns should be wrapped using this class.

country_or_region_time_public_data

Public data with country_or_region and time columns can be wrapped with this class.

country_region_data

Contains functionality for working with location data, with supported column classes.

customer_data

Contains the base class of all customer data.

location_data

Contains functionality for working with location data, with supported column classes.

location_time_customer_data

Contains functionality for wrapping customer data with location and time columns.

location_time_public_data

Contains functionality for wrapping public data with location and time columns.

open_dataset_base

Base class for tabular open datasets.

public_data

Contains the public data base class.

time_data

Contains functionality for representing time data and related operations in opendatasets.

aggregator

Defines the base class for all aggregators.

aggregator_all

Contains the aggregator for including all columns, that is, when no aggregation is performed.

aggregator_avg

Contains the aggregator average class.

aggregator_max

Contains the aggregator max class.

aggregator_min

Contains the aggregator min class.

aggregator_top

Contains the aggregator top class.
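To illustrate the idea behind the aggregator modules above, here is a small standard-library sketch. The class names (KeepAll, Average, Maximum, Minimum, Top) and the aggregate helper are hypothetical stand-ins, not the package's classes, and "top" is assumed to mean "keep the first n values in each group".

```python
from collections import defaultdict
from statistics import mean


class Aggregator:
    """Base class: reduce the values of one column within a group."""

    def apply(self, values):
        raise NotImplementedError


class KeepAll(Aggregator):
    def apply(self, values):
        return list(values)  # no aggregation: keep every value


class Average(Aggregator):
    def apply(self, values):
        return mean(values)


class Maximum(Aggregator):
    def apply(self, values):
        return max(values)


class Minimum(Aggregator):
    def apply(self, values):
        return min(values)


class Top(Aggregator):
    def __init__(self, n=1):
        self.n = n

    def apply(self, values):
        return list(values)[: self.n]  # assumed meaning: first n values


def aggregate(rows, key, column, aggregator):
    """Group rows (dicts) by key and aggregate one column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {k: aggregator.apply(v) for k, v in groups.items()}


readings = [
    {"day": "2019-01-01", "temp": 2.0},
    {"day": "2019-01-01", "temp": 6.0},
    {"day": "2019-01-02", "temp": -1.0},
]
daily_avg = aggregate(readings, "day", "temp", Average())
```

Swapping Average() for Maximum() or Top(1) changes only how each group is reduced, which is the reason for giving every aggregator a common base class.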

base_blob_info

Contains the blob info base class.

blob_parquet_descriptor

Contains the descriptor of blob parquet.

dataset_partition_prep

Contains functionality for specifying dataset partition preparation.

Partition preparation occurs automatically when you use an opendatasets class that requires a partition of data, such as the NycTlcGreen class.
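As a rough illustration of what preparing partitions for a date range can involve, the sketch below lists the (year, month) partitions that cover a requested range. It assumes month-level partitioning and uses a hypothetical helper name, month_partitions. It is not the package's implementation.

```python
from datetime import date


def month_partitions(start, end):
    """List (year, month) pairs covering the inclusive range start..end."""
    if start > end:
        return []
    partitions = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        partitions.append((year, month))
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return partitions
```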

pandas_data_load_limit

Contains functionality for limiting how much data pandas loads when parquet files are large.

With this module's functionality, you can specify how to limit the data that pandas loads when the parquet files are too large to load in full.
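One possible limiting policy, sketched here with the standard library only, is to keep just the most recent months of partitioned data. The function name limit_to_recent_months and the (year, month) partition representation are assumptions for illustration and do not describe the module's actual classes.

```python
def limit_to_recent_months(partitions, max_months=None):
    """Keep at most max_months of the most recent (year, month) partitions.

    max_months=None means no limit is applied.
    """
    ordered = sorted(partitions)
    if max_months is None:
        return ordered
    if max_months <= 0:
        return []
    return ordered[-max_months:]
```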

common_weather_enricher

Contains functionality for enriching custom data with weather public data.

enricher

Defines the generic enricher class for joining together data with different granularity and aggregators.

This module contains static function overloads: get_max_date_by_granularity(max_date, granularity), where granularity is one of MonthGranularity, DayGranularity, or HourGranularity. These static methods return the max date based on the specified granularity.
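The overload-per-granularity pattern can be sketched with functools.singledispatch. The classes HourBucket, DayBucket, and MonthBucket and the function bucket_end are hypothetical stand-ins. The behaviour shown (returning the exclusive end of the hour, day, or month that contains max_date) is an assumed interpretation, not the documented behaviour of get_max_date_by_granularity.

```python
from datetime import datetime, timedelta
from functools import singledispatch


class HourBucket:
    """Stand-in for an hour-level granularity."""


class DayBucket:
    """Stand-in for a day-level granularity."""


class MonthBucket:
    """Stand-in for a month-level granularity."""


@singledispatch
def bucket_end(granularity, max_date):
    raise TypeError(f"unsupported granularity: {type(granularity).__name__}")


@bucket_end.register(HourBucket)
def _(granularity, max_date):
    # Start of the hour containing max_date, plus one hour.
    return max_date.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


@bucket_end.register(DayBucket)
def _(granularity, max_date):
    # Midnight of the day containing max_date, plus one day.
    start = max_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return start + timedelta(days=1)


@bucket_end.register(MonthBucket)
def _(granularity, max_date):
    # First instant of the month after the one containing max_date.
    if max_date.month == 12:
        return datetime(max_date.year + 1, 1, 1)
    return datetime(max_date.year, max_date.month + 1, 1)
```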

holiday_enricher

Contains functionality for enriching custom data with holiday public data.

environ

Defines runtime environment classes where Azure Open Datasets are used.

The classes in this module ensure Azure Open Datasets functionality is optimized for different environments. In general, you do not need to instantiate these environment classes or worry about their implementation. Instead, use the get_environ module function to return the environment.

granularity

Contains granularity definitions for time and location.

The granularities are organized by time (for example, MonthGranularity, DayGranularity, and HourGranularity) and by location.

You work with a granularity by specifying it in an enricher function. For example, when using the HolidayEnricher class methods to enrich data, specify the TimeGranularity as an input parameter to the method.
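To show why a time granularity matters when enriching data, the following standard-library sketch truncates timestamps to a chosen granularity and uses the result as the join key. The names TRUNCATE and enrich, the string granularity labels, and the sample rows are all hypothetical. The real enricher classes take granularity objects rather than strings.

```python
from datetime import datetime

# Map a granularity label to a function that truncates a timestamp to it.
TRUNCATE = {
    "month": lambda t: t.replace(day=1, hour=0, minute=0, second=0, microsecond=0),
    "day": lambda t: t.replace(hour=0, minute=0, second=0, microsecond=0),
    "hour": lambda t: t.replace(minute=0, second=0, microsecond=0),
}


def enrich(customer_rows, public_rows, granularity, time_col="time"):
    """Left-join public rows onto customer rows by truncated timestamp."""
    truncate = TRUNCATE[granularity]
    lookup = {truncate(row[time_col]): row for row in public_rows}
    enriched = []
    for row in customer_rows:
        match = lookup.get(truncate(row[time_col]), {})
        # Copy every public column except the join column itself.
        extra = {k: v for k, v in match.items() if k != time_col}
        enriched.append({**row, **extra})
    return enriched


customers = [
    {"time": datetime(2019, 7, 4, 9, 30), "sales": 10},
    {"time": datetime(2019, 7, 5, 9, 30), "sales": 12},
]
holidays = [{"time": datetime(2019, 7, 4), "holiday": "Independence Day"}]
result = enrich(customers, holidays, "day")
```

With "day" granularity the 09:30 sale on 4 July matches the holiday row. With "hour" granularity it would not, because the truncated keys differ.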

country_region_selector

Contains the country region selector class.

enricher_selector

Contains the base classes for location and time selectors.

EnricherSelector has two subclasses, one for location selectors and one for time selectors. It is the root class of LocationClosestSelector and TimeNearestSelector.

location_closest_selector

Contains the location closest selector class.
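The idea of selecting the closest location can be sketched with a great-circle (haversine) distance. The function names, the station records, and the choice of haversine distance are assumptions for illustration. The source does not state how LocationClosestSelector measures distance.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_station(lat, lon, stations):
    """Return the station record nearest to the given point."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))


stations = [
    {"id": "A", "lat": 47.6, "lon": -122.3},
    {"id": "B", "lat": 40.7, "lon": -74.0},
]
```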

time_nearest_selector

Contains the time nearest selector class.
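Selecting the nearest time can be sketched with a binary search over sorted timestamps. The function nearest_time is a hypothetical illustration, not the TimeNearestSelector implementation. On a tie it returns the earlier timestamp.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_time(target, sorted_times):
    """Return the element of sorted_times closest to target."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    i = bisect_left(sorted_times, target)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_times[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda t: abs(t - target))


observations = [
    datetime(2019, 1, 1, 0),
    datetime(2019, 1, 1, 6),
    datetime(2019, 1, 1, 12),
]
```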