Spark library management
Applies to: SQL Server 2019 (15.x)
This article provides guidance on how to import and install packages for a Spark session through session and notebook configurations.
Built-in tools
- Spark and Hadoop base packages
- Python 3.7 and Python 2.7
- Pandas, Sklearn, Numpy, and other data processing packages
- R and MRO packages
- Sparklyr
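To check which versions of these built-in packages a session provides, you can query them from a PySpark notebook cell. The following is a minimal sketch that assumes the default Python kernel for the Spark session:

import sys
import numpy, pandas, sklearn
# Print the interpreter and package versions available to the session
print(sys.version)
print("numpy", numpy.__version__, "pandas", pandas.__version__, "scikit-learn", sklearn.__version__)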
Install packages from a Maven repository onto the Spark cluster at runtime
Maven packages can be installed onto your Spark cluster by using notebook cell configuration at the start of your Spark session. Before starting a Spark session in Azure Data Studio, run the following code:
%%configure -f \
{"conf": {"spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1"}}
Install Python packages at PySpark job-submission time
- Specify the path to a requirements.txt file in HDFS to use as a reference for packages to install.
%%configure -f \
{"conf": {
"spark.pyspark.virtualenv.enabled" : "true",
"spark.pyspark.virtualenv.type" : "conda",
"spark.pyspark.virtualenv.requirements" : "requirements.txt",
"spark.pyspark.virtualenv.bin.path" : "/opt/mls/python/bin/conda"
},
"files": ["hdfs://nmnode-0/tmp/requirements.txt"]
}
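The requirements file referenced above is a plain-text list of package specifications, one per line, uploaded to the HDFS path given under "files". A hypothetical example of its contents:

numpy==1.18.1
pandas
scikit-learn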
- Create a conda virtualenv without a requirements file and dynamically add packages during the Spark session.
%%configure -f \
{"conf": {
"spark.pyspark.virtualenv.enabled" : "true",
"spark.pyspark.virtualenv.type" : "conda",
"spark.pyspark.virtualenv.bin.path" : "/opt/mls/python/bin/conda",
"spark.pyspark.virtualenv.python_version": "3.6"
}
}
# Run after the Spark session starts to install a package into the session's virtualenv
sc.install_packages("numpy==1.11.0")
import numpy as np
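Packages installed with sc.install_packages are added to the session's virtualenv, so they can also be used inside functions that Spark ships to the executors. A small sketch using the numpy package installed above:

# numpy installed above is importable inside closures that run on the executors
rdd = sc.parallelize(range(1, 5), 2)
print(rdd.map(lambda x: float(np.ones(x).sum())).collect())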
Import .jar from HDFS for use at runtime
Import a .jar at runtime through Azure Data Studio notebook cell configuration:
%%configure -f
{"conf": {"spark.jars": "/jar/mycodeJar.jar"}}
Next steps
For more information on SQL Server Big Data Clusters and related scenarios, see SQL Server Big Data Clusters.