Databases and tables

An Azure Databricks database is a collection of tables. An Azure Databricks table is a collection of structured data. You can cache, filter, and perform any operations supported by Apache Spark DataFrames on Azure Databricks tables. You can query tables with Spark APIs and Spark SQL.
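
For example, assuming a table named diamonds already exists (the sections below show how to create one), a minimal Python sketch of reading the table as a DataFrame, filtering and caching it, and running the equivalent Spark SQL query:

# Read an existing table as a DataFrame (assumes a table named "diamonds" exists)
diamonds = spark.table("diamonds")

# Filter on a column and cache the result for repeated use
expensive = diamonds.filter("price > 1000").cache()
print(expensive.count())

# The same filter expressed in Spark SQL
print(spark.sql("SELECT COUNT(*) FROM diamonds WHERE price > 1000").first()[0])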

Requirements

To view and create databases and tables, you must have a running cluster.

Types of tables

There are two types of tables: global and local. A global table is available across all clusters. Azure Databricks registers global tables either to the Azure Databricks Hive metastore or to an external Hive metastore. For details about Hive support, see Apache Hive compatibility. A local table is not accessible from other clusters and is not registered in the Hive metastore. This is also known as a temporary view.

View databases and tables

Click Data in the sidebar. Azure Databricks selects a running cluster to which you have access. The Databases folder displays the list of databases with the default database selected. The Tables folder displays the list of tables in the default database.

You can change the cluster from the Databases menu, the Create Table UI, or the View Table UI. For example, from the Databases menu:

  1. Click the Down Caret at the top of the Databases folder.

  2. Select a cluster.

Create a database

To create a database in SQL:

CREATE DATABASE <database-name> ...

For more options, see CREATE DATABASE.
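
For example, a minimal sketch (run from a Python notebook; the database name, comment, and location are illustrative placeholders) that creates a database only if it does not already exist:

# Create a database with an optional comment and storage location (placeholder values)
spark.sql("""
  CREATE DATABASE IF NOT EXISTS my_database
  COMMENT 'Example database'
  LOCATION '/tmp/my_database'
""")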

Create a table

You can create a table with Create in the sidebar, the Create Table UI, or programmatically. You can populate a table from files in DBFS or data stored in any of the supported data sources.

Note

When you create a table using the UI, you cannot:

Create a table using Create in the sidebar

Using the Create icon in the sidebar, you can only create global tables. To create a local table, see Create a table programmatically.

  1. Click Create in the sidebar and select Table from the menu. The Create New Table dialog appears.
  2. Follow the steps in Create a table using the UI, beginning at Step 4.

Create a table using the UI

With the UI, you can only create global tables. To create a local table, see Create a table programmatically.

  1. Click Data in the sidebar. The Databases and Tables folders display.

  2. In the Databases folder, select a database.

  3. Above the Tables folder, click Create Table.

  4. Choose a data source and follow the steps in the corresponding section to configure the table.

    If an Azure Databricks administrator has disabled the Upload File option, you do not have the option to upload files; you can create tables using one of the other data sources.

    Instructions for Upload data option

    1. Drag files to the Files dropzone or click the dropzone to browse and choose files. After upload, a path displays for each file. The path will be something like /FileStore/tables/<filename>-<integer>.<file-type>. You can use this path in a notebook to read data (see the example after this procedure).

    2. Click Create Table with UI.

    3. In the Cluster drop-down, choose a cluster.

    Instructions for DBFS option

    1. Select a file.
    2. Click Create Table with UI.
    3. In the Cluster drop-down, choose a cluster.
  5. Click Preview Table to view the table.

  6. In the Table Name field, optionally override the default table name. A table name can contain only lowercase alphanumeric characters and underscores and must start with a lowercase letter or underscore.

  7. In the Create in Database field, optionally override the selected default database.

  8. In the File Type field, optionally override the inferred file type.

  9. If the file type is CSV:

    1. In the Column Delimiter field, select whether to override the inferred delimiter.
    2. Indicate whether to use the first row as the column titles.
    3. Indicate whether to infer the schema.
  10. If the file type is JSON, indicate whether the file is multi-line.

  11. Click Create Table.
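
If you uploaded a file, you can also read it directly in a notebook using the path displayed after the upload. A minimal Python sketch, assuming a CSV file was uploaded (replace the path with the one shown for your file):

# Read an uploaded CSV file from DBFS; substitute the path displayed after upload
df = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/FileStore/tables/<filename>-<integer>.<file-type>")
display(df)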

Create a table in a notebook

In the Create New Table UI you can use quickstart notebooks provided by Azure Databricks to connect to any data source.

  • DBFS: Click Create Table in Notebook.
  • Other Data Sources: In the Connector drop-down, select a data source type. Then click Create Table in Notebook.

Create a table programmatically

This section describes how to create global and local tables programmatically.

Create a global table

To create a global table in SQL:

CREATE TABLE <table-name> ...

For more options, see Create Table for Databricks Runtime 5.5 LTS and Databricks Runtime 6.4, or CREATE TABLE for Databricks Runtime 7.1 and above.

To create a global table from a DataFrame in Python or Scala:

dataFrame.write.saveAsTable("<table-name>")
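
To save the table into a database other than the default, you can qualify the table name with the database name. A minimal Python sketch, assuming a database named my_database already exists:

# Save the DataFrame as a global table in a specific database (my_database is a placeholder)
dataFrame.write.saveAsTable("my_database.my_table")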

Create a local table

To create a local table from a DataFrame in Python or Scala:

dataFrame.createOrReplaceTempView("<table-name>")

Here is an example that creates a local table called diamonds from a file in Databricks File System (DBFS):

# Path to the source CSV file in DBFS
csv_path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"

# Read the CSV and register it as a temporary view named "diamonds"
spark.read.format("csv").option("header", "true") \
  .option("inferSchema", "true").load(csv_path) \
  .createOrReplaceTempView("diamonds")

Access a table

You can view table details and read, update, and delete tables.

View table details

The table details view shows the table schema and sample data.

  1. Click Data in the sidebar.

  2. In the Databases folder, click a database.

  3. In the Tables folder, click the table name.

  4. In the Cluster drop-down, optionally select another cluster to render the table preview.

    Note

    To display the table preview, a Spark SQL query runs on the cluster selected in the Cluster drop-down. If the cluster already has a workload running on it, the table preview may take longer to load.

Query a table

These examples show you how to query and display a table called diamonds.

SQL

SELECT * FROM diamonds

Python

diamonds = spark.sql("select * from diamonds")
display(diamonds.select("*"))

diamonds = spark.table("diamonds")
display(diamonds.select("*"))

R

diamonds <- sql(sqlContext, "select * from diamonds")
display(diamonds)

diamonds <- table(sqlContext, "diamonds")
display(diamonds)

Scala

val diamonds = spark.sql("select * from diamonds")
display(diamonds.select("*"))

val diamonds = spark.table("diamonds")
display(diamonds.select("*"))

Update a table

The table schema is immutable. However, you can update table data by changing the underlying files.

For example, for tables created from a storage directory, adding or removing files in that directory changes the contents of the table.

After updating the files underlying a table, refresh the table using the following command:

REFRESH TABLE <table-name>

This ensures that when you access the table, Spark SQL reads the correct files even if the underlying files change.
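
The same refresh can be issued from a Python notebook, either through a SQL statement or the catalog API; a minimal sketch (the table name is a placeholder):

# Refresh a table after its underlying files change
spark.sql("REFRESH TABLE <table-name>")

# Equivalent call using the catalog API
spark.catalog.refreshTable("<table-name>")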

Delete a table

Delete a table using the UI

  1. Click Data in the sidebar.
  2. Click the dropdown menu next to the table name and select Delete.

Delete a table programmatically

DROP TABLE <table-name>

Managed and unmanaged tables

Every Spark SQL table has metadata information, such as the schema, in addition to the data itself.

A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Azure Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the table, DROP TABLE <example-table> deletes both the metadata and the data.

Some common ways of creating a managed table are:

SQL

CREATE TABLE <example-table>(id STRING, value STRING)

Python

dataframe.write.saveAsTable("<example-table>")

Another option is to let Spark SQL manage the metadata, while you control the data location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

You can create an unmanaged table with your data in external data sources such as Cassandra or a JDBC table. See Data sources for more information about the data sources supported by Databricks. Some common ways of creating an unmanaged table are:

SQL

CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")

Python

dataframe.write.option('path', "<your-storage-path>").saveAsTable("<example-table>")
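
To check whether an existing table is managed or unmanaged, you can inspect its metadata. A minimal Python sketch, assuming the table already exists; the Type row of the output shows MANAGED or EXTERNAL:

# Show extended table metadata; the "Type" row indicates MANAGED or EXTERNAL (unmanaged)
display(spark.sql("DESCRIBE TABLE EXTENDED <example-table>"))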

Replace table contents

There are two main approaches to replacing table contents: simple and recommended.

Simple way to replace table contents

The simplest way to replace table contents is to delete the table metadata and data and create another table.

Managed table
DROP TABLE IF EXISTS <example-table>     -- deletes the metadata and data
CREATE TABLE <example-table> AS SELECT ...
Unmanaged table
DROP TABLE IF EXISTS <example-table>         -- deletes the metadata
dbutils.fs.rm("<your-storage-path>", true)   // deletes the data (run in a notebook; not part of the SQL statement)
CREATE TABLE <example-table> USING org.apache.spark.sql.parquet OPTIONS (path "<your-storage-path>") AS SELECT ...

An alternative is to:

  1. Create the table using the SQL DDL:

    CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"
    
  2. Store new data in <storage-location>.

  3. Run REFRESH TABLE <table-name>.

To avoid potential consistency issues, the best approach to replacing table contents is to overwrite the table.

Python
dataframe.write.mode("overwrite").saveAsTable("<example-table>")  # Managed overwrite
dataframe.write.mode("overwrite").option("path", "<your-storage-path>").saveAsTable("<example-table>")  # Unmanaged overwrite
SQL

Use the INSERT OVERWRITE statement. This method applies to managed and unmanaged tables. For example, for an unmanaged table:

CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")
INSERT OVERWRITE TABLE <example-table> SELECT ...
Scala
dataframe.write.mode(SaveMode.Overwrite).saveAsTable("<example-table>")    // Managed Overwrite
dataframe.write.mode(SaveMode.Overwrite).option("path", "<your-storage-path>").saveAsTable("<example-table>")  // Unmanaged Overwrite

Partitioned tables

Spark SQL is able to generate partitions dynamically at the file storage level to provide partition columns for tables.

Create a partitioned table

These examples partition data that you write. Spark SQL discovers the partitions and registers them in the Hive metastore.

// Create managed table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("<example-table>")

// Create unmanaged/external table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").option("path", "<file-path>").saveAsTable("<example-table>")

However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore. In this case, SELECT * FROM <example-table> does not return results. To register the partitions, run MSCK REPAIR TABLE <example-table> to generate them.

// Save data to external files
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").parquet("<file-path>")

// Create unmanaged/external table
spark.sql("CREATE TABLE <example-table> (id STRING, value STRING) USING parquet PARTITIONED BY (id) LOCATION '<file-path>'")
spark.sql("MSCK REPAIR TABLE <example-table>")

Partition pruning

When the table is scanned, Spark pushes down the filter predicates involving the partitionBy keys. In that case, Spark avoids reading data that doesn’t satisfy those predicates. For example, suppose you have a table <example-data> that is partitioned by <date>. A query such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10' reads only the data files containing tuples whose date value matches the one specified in the query.
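
A quick way to confirm that pruning occurs is to inspect the physical plan. A minimal Python sketch, assuming a table named example_data partitioned by a date column (both names are placeholders):

# Look for a PartitionFilters entry in the file scan node of the plan output;
# it shows that only partitions matching the predicate are read
spark.sql("SELECT max(id) FROM example_data WHERE date = '2010-10-10'").explain()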

Table access control

Table access control allows admins and users to give fine-grained access to other users. See Data object privileges for details.
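
For example, a minimal sketch of granting read access on a table to a user, assuming table access control is enabled on the cluster (the table and user names are placeholders; the statement can also be run in a SQL cell):

# Grant SELECT on a table to a user (requires table access control; placeholder names)
spark.sql("GRANT SELECT ON TABLE diamonds TO `someone@example.com`")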