Data Overview
This article describes how to import data into Azure Databricks using the UI, read imported data using Spark and local file APIs, and modify imported data using Databricks File System (DBFS) commands.
Import data
If you have small data files on your local machine that you want to analyze with Azure Databricks, you can easily import them to Databricks File System (DBFS) using the UI:

- Drop files into, or browse to files in, the Import & Explore Data box on the landing page.
- Upload the files in the Create table UI.
Files imported to DBFS using one of these methods are stored in FileStore.
For production environments, we recommend that you explicitly upload files into DBFS using the DBFS CLI, the DBFS API, or Databricks file system utilities (dbutils.fs).
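For example, here is a minimal sketch that copies a file from the driver node's local disk into FileStore using dbutils.fs.cp from a notebook; the source path and file name are hypothetical:

Python
# Copy a local file on the driver node into DBFS (paths are examples)
dbutils.fs.cp("file:/tmp/state_income.csv", "dbfs:/FileStore/tables/state_income.csv")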
You can also access data from a wide variety of data sources.
Read data on cluster nodes using Spark APIs
You can read data imported to DBFS into Apache Spark DataFrames using Spark APIs. For example, if you import a CSV file, you can read the data using one of these examples.
Tip
For easier access, we recommend that you create a table. See Databases and Tables for more information.
Scala
val sparkDF = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/state_income-9f7c5.csv")
Python
sparkDF = spark.read.csv('/FileStore/tables/state_income-9f7c5.csv', header="true", inferSchema="true")
R
sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header="true", inferSchema = "true")
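Following the tip above, one way to persist the DataFrame as a table is the standard Spark saveAsTable API. A sketch in Python (the table name is an example):

Python
# Save the DataFrame as a managed table so it can be queried by name
sparkDF.write.saveAsTable("state_income")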
Read data on cluster nodes using local APIs
You can also read data imported to DBFS in programs running on the Spark driver node using local file APIs. For example:
Python
import pandas as pd

pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')
R
df <- read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)
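Because DBFS is mounted at /dbfs on the driver node, ordinary file utilities work as well. As a sketch, you could list the FileStore tables directory with Python's os module to confirm an uploaded file's name:

Python
import os
# DBFS appears as a normal directory tree under /dbfs on the driver node
print(os.listdir("/dbfs/FileStore/tables"))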
Modify uploaded data
You cannot edit imported data directly within Azure Databricks, but you can overwrite a data file using Spark APIs, the DBFS CLI, the DBFS API, or Databricks file system utilities (dbutils.fs).
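For example, here is a sketch that overwrites the uploaded CSV with the contents of a Spark DataFrame (it assumes the sparkDF from the earlier examples); note that Spark writes a directory of part files rather than a single CSV file:

Python
# Replace the original data with the DataFrame contents
sparkDF.write.mode("overwrite").option("header", "true").csv("/FileStore/tables/state_income-9f7c5.csv")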
To delete data from DBFS, use the same APIs and tools. For example, you can use the Databricks Utilities command dbutils.fs.rm:
dbutils.fs.rm("dbfs:/FileStore/tables/state_income-9f7c5.csv", true)
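The second argument enables recursive deletion, which is required when removing a directory.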
Warning
Deleted data cannot be recovered.