Feature engineering with MLlib

03/01/2024

Apache Spark MLlib contains many utility functions for performing feature engineering at scale, including methods for encoding and transforming features. These methods can also be used to process features for other machine learning libraries.

Azure Databricks recommends the following Apache Spark MLlib guides:

Extracting, transforming and selecting features with MLlib
MLlib Programming Guide
Python API Reference
Scala API Reference

The following PySpark-based notebook walks through a binary classification example. Its preprocessing steps convert categorical data to numeric variables using category indexing and one-hot encoding.

Binary classification example
