# Machine Learning

As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalizations, recommendations, and predictive insights. The Apache Spark machine learning library (MLlib) allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, and so on).

In this tutorial module, you will learn how to:

- Load sample data
- Prepare and visualize data for ML algorithms
- Run a linear regression model
- Evaluate a linear regression model
- Visualize a linear regression model

We also provide a sample notebook that you can import to access and run all of the code examples included in the module.

## Load sample data

The easiest way to start working with machine learning is to use one of the example Azure Databricks datasets in the `/databricks-datasets` folder, which is accessible within the Azure Databricks workspace. For example, the file that compares city population to median sale prices of homes is `/databricks-datasets/samples/population-vs-price/data_geo.csv`.

```
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")
data.cache() # Cache data for faster reuse
```
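
The two options above can be pictured with a plain-Python sketch that needs no Spark cluster. It parses a small in-memory CSV, treats the first line as the header, and guesses a type for each value. The sample rows and the `infer_value` helper are hypothetical, made up purely for illustration; Spark's own schema inference decides one type per column and supports more types than this.

```python
import csv
import io

# Hypothetical sample with a header line and one missing value (not real rows).
SAMPLE_CSV = """City,Population,Price
Alpha,1000,150.5
Beta,2500,
"""

def infer_value(text):
    """Guess a Python type for one CSV field: None, int, float, or str."""
    if text == "":
        return None  # an empty field is treated as a missing value
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text

# DictReader uses the first line as column names, similar to header="true".
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = [{name: infer_value(value) for name, value in row.items()} for row in reader]

for row in rows:
    print(row)
```

Without type inference every field would stay a string, which is why the tutorial enables `inferSchema` before treating the columns as numbers.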

To view this data in a tabular format, instead of exporting it to a third-party tool, use the `display()` command in your Databricks notebook.

```
display(data)
```

## Prepare and visualize data for ML algorithms

In supervised learning (for example, a regression algorithm), you typically define a label and a set of features. In this linear regression example, the label is the 2015 median sales price and the feature is the 2014 population estimate. That is, you use the feature (population) to predict the label (sales price).

Drop rows with missing values and rename the feature and label columns, replacing spaces with `_`.

```
from pyspark.sql.functions import col
data = data.dropna() # drop rows with missing values
exprs = [col(column).alias(column.replace(' ', '_')) for column in data.columns]
```
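
To see what the renaming expression produces, the following plain-Python sketch applies the same `replace(' ', '_')` call to a list of column names. The two names used here are an assumption, inferred from the identifiers in the `selectExpr` call of the next step rather than read from the file.

```python
# Assumed original column names (inferred from the later selectExpr call).
columns = ["2014 Population estimate", "2015 median sales price"]

# Same string transformation that the col(...).alias(...) expressions apply.
renamed = [column.replace(" ", "_") for column in columns]

for old, new in zip(columns, renamed):
    print(old, "->", new)
```

Note that in the tutorial code, building `exprs` does not change `data` by itself. The aliases take effect only when they are passed to `data.select(*exprs)`.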

Select and vectorize the population feature column:

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
vdata = data.select(*exprs).selectExpr("2014_Population_estimate as population", "2015_median_sales_price as label")
stages = []
assembler = VectorAssembler(inputCols=["population"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(vdata)
dataset = pipelineModel.transform(vdata)
# Keep relevant columns
selectedcols = ["features", "label"]
```
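
Conceptually, `VectorAssembler` packs the listed input columns into a single vector column, which is the input shape that Spark ML estimators expect. The sketch below mimics that step in plain Python, with lists standing in for Spark vectors. The rows and the `assemble` helper are hypothetical and exist only for illustration.

```python
# Hypothetical rows after the selectExpr step: one feature column and the label.
rows = [
    {"population": 1000.0, "label": 150.5},
    {"population": 2500.0, "label": 210.0},
]

def assemble(row, input_cols, output_col="features"):
    """Return a copy of row with input_cols packed into one list under output_col."""
    assembled = dict(row)
    assembled[output_col] = [row[name] for name in input_cols]
    return assembled

dataset_sketch = [assemble(row, ["population"]) for row in rows]

for row in dataset_sketch:
    print(row["features"], row["label"])
```

With a single input column the vector has one element, which is why the plotting code later in this tutorial reads the population back as `features[0]`.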

Display the selected columns:

```
display(dataset.select(selectedcols))
```

## Run the linear regression model

This section fits two linear regression models with different regularization parameters to compare how well each one predicts the sales price (label) from the population (feature).

### Build the model

```
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression
# Define LinearRegression algorithm
lr = LinearRegression()
# Fit 2 models, using different regularization parameters
modelA = lr.fit(dataset, {lr.regParam:0.0})
modelB = lr.fit(dataset, {lr.regParam:100.0})
```
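
The effect of `regParam` can be illustrated with a minimal sketch of L2-regularized (ridge) regression on a single feature, written in plain Python. It assumes the simplified objective stated in the docstring. Spark's `LinearRegression` standardizes features by default and uses its own solver, so its coefficients will generally not match this closed form. The `fit_ridge_1d` helper and the toy data are hypothetical.

```python
def fit_ridge_1d(xs, ys, reg_param):
    """Fit y ~ w*x + b by minimizing
    (1/(2n)) * sum((y - w*x - b)**2) + (reg_param / 2) * w**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    w = cov_xy / (var_x + reg_param)  # the penalty shrinks the slope toward 0
    b = mean_y - w * mean_x           # the intercept is not penalized
    return w, b

# Toy data lying exactly on y = 2x + 1 (illustrative, not from the dataset).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w_a, b_a = fit_ridge_1d(xs, ys, reg_param=0.0)   # no regularization
w_b, b_b = fit_ridge_1d(xs, ys, reg_param=1.25)  # regularized
print("no penalty:  slope", w_a, "intercept", b_a)
print("with penalty: slope", w_b, "intercept", b_b)
```

With no penalty the fit recovers the exact line. A larger penalty flattens the slope, trading some fit on the training data for a simpler model.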

Using the model, you can also make predictions with the `transform()` function, which adds a new column of predictions. For example, the code below takes the first model (`modelA`) and shows both the label (original sales price) and the prediction (predicted sales price) based on the features (population).

```
# Make predictions
predictionsA = modelA.transform(dataset)
display(predictionsA)
```

## Evaluate the model

To evaluate the regression analysis, calculate the root mean squared error (RMSE) using the `RegressionEvaluator`. Here is the Python code for evaluating the two models, with their output shown in the comments.

```
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(predictionsA)
print("ModelA: Root Mean Squared Error = " + str(RMSE))
# ModelA: Root Mean Squared Error = 128.602026843
predictionsB = modelB.transform(dataset)
RMSE = evaluator.evaluate(predictionsB)
print("ModelB: Root Mean Squared Error = " + str(RMSE))
# ModelB: Root Mean Squared Error = 129.496300193
```
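
For reference, the RMSE metric is the square root of the mean squared difference between labels and predictions. The plain-Python sketch below computes it directly. The `rmse` helper and the toy values are hypothetical and are not taken from the dataset.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction)**2))."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values chosen for easy arithmetic: errors are 1, 0, and -2.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(rmse(labels, predictions))
```

Because the tutorial evaluates both models on the same `dataset` they were fit on, the reported values are training errors. A held-out split would be needed to estimate how the models perform on unseen data.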

## Visualize the model

As with many machine learning workflows, it helps to plot the data together with the fitted models. Because Azure Databricks supports pandas and ggplot, the code below builds a pandas DataFrame (`pydf`) and uses ggplot to draw the scatterplot and the two regression lines on log-scaled axes.

```
# Import pandas and ggplot
from pandas import DataFrame
from ggplot import *
# Collect the feature, label, and predictions to the driver as Python lists
pop = dataset.rdd.map(lambda p: (p.features[0])).collect()
price = dataset.rdd.map(lambda p: (p.label)).collect()
predA = predictionsA.select("prediction").rdd.map(lambda r: r[0]).collect()
predB = predictionsB.select("prediction").rdd.map(lambda r: r[0]).collect()
# Create a pandas DataFrame
pydf = DataFrame({'pop':pop,'price':price,'predA':predA, 'predB':predB})
# Visualize the model
# Create a scatter plot and the two regression lines with ggplot (log-scaled axes)
p = ggplot(pydf, aes('pop','price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop','predA'), color='red') + \
    geom_line(pydf, aes('pop','predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)
```

## Notebook

To access all of these code examples, import the **Population vs. Median Home Prices** notebook. For more machine learning examples, see Machine Learning.
