dbt Core integration with Azure Databricks

dbt (data build tool) is a development environment that enables data analysts and data engineers to transform data by simply writing select statements. dbt handles turning these select statements into tables and views. dbt compiles your code into raw SQL and then runs that code on the specified database in Azure Databricks. dbt supports collaborative coding patterns and best practices such as version control, documentation, modularity, and more. For more information, see What, exactly, is dbt? and Analytics Engineering for Everyone: Databricks in dbt Cloud on the dbt website.
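
For example, a dbt model is a select statement saved in a .sql file. When you run your project, dbt wraps that statement in the data definition language (DDL) needed to materialize it as a table or view. The following sketch is purely illustrative; the model and table names are hypothetical, and the exact SQL that dbt generates depends on the adapter and the materialization you choose.

    -- models/orders_summary.sql (a hypothetical dbt model)
    select order_id, sum(amount) as total_amount
    from raw_orders
    group by order_id

    -- Roughly what dbt runs against the database for a view materialization
    create or replace view orders_summary as
    select order_id, sum(amount) as total_amount
    from raw_orders
    group by order_id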

dbt does not extract or load data. dbt focuses on the transformation step only, using a “transform after load” architecture. dbt assumes that you already have a copy of your data in your database.

This article focuses on using dbt Core. dbt Core enables you to write dbt code in the text editor or IDE of your choice on your local development machine and then run dbt from the command line. dbt Core includes the dbt Command Line Interface (CLI). The dbt CLI is free to use and open source.

A hosted version of dbt called dbt Cloud is also available. dbt Cloud comes equipped with turnkey support for scheduling jobs, CI/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE). For more information, see dbt Cloud integration with Azure Databricks. The dbt Cloud Developer plan provides one free developer seat; Team and Enterprise paid plans are also available. For more information, see dbt Pricing on the dbt website.

Because dbt Core and dbt Cloud can use hosted git repositories (for example, on GitHub, GitLab, or Bitbucket), you can use dbt Core to create a dbt project and then make it available to your dbt Cloud users. For more information, see Creating a dbt project and Using an existing project on the dbt website.

Requirements

Before you install dbt Core, you must install the following on your local development machine:

  • Python. This procedure was tested with Python 3.8.6.
  • pipenv, which this procedure uses to create and manage a Python virtual environment.
  • The Databricks ODBC driver, which the dbt Spark ODBC plugin uses to connect to your Azure Databricks cluster or SQL endpoint.

You also need access to an Azure Databricks cluster or SQL endpoint and an Azure Databricks personal access token, which you use in Step 3 to configure the connection.

Note

This article has been tested on macOS with the following components: Python 3.8.6, pyodbc 4.0.30, dbt 0.19.1, and the dbt Spark ODBC plugin (dbt-spark[ODBC]) 0.19.1. If you use different versions, adjust the version numbers in the steps that follow.

Step 1: Create and activate a Python virtual environment

In this step, you use pipenv to create a Python virtual environment. We recommend using a Python virtual environment because it isolates the package versions and code dependencies for this project from those in other environments, which helps reduce unexpected package version mismatches and dependency collisions.

  1. From your terminal, switch to an empty directory, creating that directory first if necessary. This procedure creates an empty directory named dbt_demo in the root of your user home directory.

    Unix, Linux, macOS

    mkdir ~/dbt_demo
    cd ~/dbt_demo
    

    Windows

    mkdir %USERPROFILE%\dbt_demo
    cd %USERPROFILE%\dbt_demo
    
  2. In this empty directory, create a file named Pipfile with the following content. This Pipfile instructs pipenv to use Python version 3.8.6. If you use a different version, replace 3.8.6 with your version number.

    [[source]]
    url = "https://pypi.org/simple"
    verify_ssl = true
    name = "pypi"
    
    [requires]
    python_version = "3.8.6"
    
  3. Create a Python virtual environment in this directory by running pipenv and specifying the Python version to use. This command specifies Python version 3.8.6. If you use a different version, replace 3.8.6 with your version number:

    pipenv --python 3.8.6
    
  4. Activate this virtual environment by running pipenv shell. To confirm the activation, check that the terminal displays (dbt_demo) before the prompt. The virtual environment uses the specified version of Python and isolates all package versions and code dependencies within this new environment.

    pipenv shell
    

    Note

    To deactivate this virtual environment, run exit. (dbt_demo) disappears from before the prompt. If you run python --version or pip list with this virtual environment deactivated, you might see a different version of Python, a different list of available packages or package versions, or both.

  5. Confirm that your virtual environment is running the expected version of Python by running python with the --version option.

    python --version
    
    Python 3.8.6
    

    If an unexpected version of Python displays, make sure you have activated your virtual environment by running pipenv shell. This procedure assumes the expected version for Python is 3.8.6.

Step 2: Install required software into your virtual environment

In this step, you install the pyodbc module, dbt, and the dbt Spark ODBC plugin into your Python virtual environment.

  1. With the virtual environment activated, install the pyodbc module by running pipenv. This command specifies pyodbc version 4.0.30. If you use a different version, replace 4.0.30 with a different version number, or install the latest version by omitting ==4.0.30.

    pipenv install pyodbc==4.0.30
    
  2. Install dbt by running pipenv with the name of the dbt package from the Python Package Index (PyPI), which is dbt. This command specifies dbt version 0.19.1. If you use a different version, replace 0.19.1 with a different version number, or install the latest version by omitting ==0.19.1.

    Important

    If your local development machine runs Ubuntu, Debian, CentOS, or Windows, you must complete additional steps before installing dbt.

    pipenv install dbt==0.19.1
    
  3. Install the dbt Spark ODBC plugin by running pipenv with the name of the dbt Spark ODBC plugin package from PyPI, which is dbt-spark[ODBC]. This command specifies dbt Spark ODBC plugin version 0.19.1. If you use a different version, replace 0.19.1 with a different version number, or install the latest version by omitting ==0.19.1.

    pipenv install "dbt-spark[ODBC]==0.19.1"
    

    You must enclose dbt-spark[ODBC], including the version number if specified, in quotes.

  4. Confirm that your virtual environment is running the expected versions of dbt and the dbt Spark ODBC plugin by running dbt with the --version option. This procedure assumes the expected version for each is 0.19.1.

    dbt --version
    
    Installed version: 0.19.1
    ...
    Plugins:
    ...
      - spark: 0.19.1
    

    If an unexpected version of dbt or the dbt Spark ODBC plugin displays, make sure you have activated your virtual environment by running pipenv shell. If an unexpected version still displays, try installing dbt or the dbt Spark ODBC plugin again after you activate your virtual environment.

Step 3: Create a dbt project and specify and test connection settings

In this step, you create a dbt project, which is a collection of related directories and files that are required to use dbt. You then configure your connection profiles, which contain connection settings to an Azure Databricks cluster, a SQL endpoint, or both. To increase security, dbt projects and profiles are stored in separate locations by default.

  1. With the virtual environment still activated, run the dbt init command with a name for your project, specifying the spark adapter. This procedure creates a project named my_dbt_demo.

    dbt init my_dbt_demo --adapter spark
    

    Tip

    Carefully read the output of the dbt init command, as it provides additional helpful usage guidance.

  2. From within the dbt configuration folder, open the profiles.yml file. The location of this file is listed in the output of the dbt init command.

    Tip

    If you forget where the dbt configuration folder is located, you can list it by running the dbt debug --config-dir command.
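
    For example, running the following command prints the location of the folder that contains your profiles.yml file:

    dbt debug --config-dir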

  3. Modify the contents of the profiles.yml file, depending on what you want to connect to.

    Cluster

    databricks_cluster:
      outputs:
        dev:
          type: spark
          method: odbc
          driver: <path-to-odbc-driver>
          schema: default
          host: <cluster-server-hostname>
          organization: "<organization-id>"
          token: <personal-access-token>
          cluster: <cluster-id>
          port: <cluster-port-number>
          connect_retries: 5
          connect_timeout: 60
      target: dev
    

    Replace:

    • <path-to-odbc-driver> with the path to where the Databricks ODBC driver is installed on your local development machine. To get the path, for Unix, Linux, or macOS, run cat /etc/odbcinst.ini. For Windows, use the Registry Editor to browse to the COMPUTER\HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI key.

    • <cluster-server-hostname> with the Server Hostname value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks cluster.

    • <organization-id> with the value between ?o= and any other query string separators in the URL to your Azure Databricks workspace. For example, if your workspace URL ends in ?o=1234567890123456, the organization ID is 1234567890123456.

      Important

      The organization ID value must be within quotes.

    • <personal-access-token> with the value of your personal access token.

    • <cluster-id> with the ID of your cluster. You can get this ID from the HTTP Path value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks cluster. The ID is the string of characters following the final forward slash character. For example, if the HTTP Path value is sql/protocolv1/o/1234567890123456/0123-456789-test012, the cluster ID is 0123-456789-test012.

    • <cluster-port-number> with the Port value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks cluster.

    For more information, see Connecting to Databricks ODBC on the dbt website and Configuring your profile in the dbt-labs/dbt-spark repository in GitHub.
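
    As an illustration, a completed cluster profile might look like the following. Every value shown is a hypothetical placeholder, and the driver path is a typical macOS install location for the Databricks ODBC driver; substitute the values described above.

    # All values below are hypothetical placeholders.
    databricks_cluster:
      outputs:
        dev:
          type: spark
          method: odbc
          driver: /Library/simba/spark/lib/libsparkodbc_sbu.dylib
          schema: default
          host: adb-1234567890123456.7.azuredatabricks.net
          organization: "1234567890123456"
          token: dapiXXXXXXXXXXXXXXXX
          cluster: 0123-456789-test012
          port: 443
          connect_retries: 5
          connect_timeout: 60
      target: dev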

    SQL endpoint

    sql_endpoint:
      outputs:
        dev:
          type: spark
          method: odbc
          driver: <path-to-odbc-driver>
          schema: default
          host: <sql-endpoint-server-hostname>
          organization: "<organization-id>"
          token: <personal-access-token>
          endpoint: <sql-endpoint-id>
          port: <sql-endpoint-port-number>
          connect_retries: 5
          connect_timeout: 60
      target: dev
    

    Replace:

    • <path-to-odbc-driver> with the path to where the Databricks ODBC driver is installed on your local development machine. To get the path, for Unix, Linux, or macOS, run cat /etc/odbcinst.ini. For Windows, use the Registry Editor to browse to the COMPUTER\HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI key.

    • <sql-endpoint-server-hostname> with the Server Hostname value from the Connection Details tab for your SQL endpoint.

    • <organization-id> with the value between adb- and the first dot (.) in the Server Hostname value from the Connection Details tab for your SQL endpoint.

      Important

      This value must be within quotes.

    • <personal-access-token> with the value of your personal access token.

    • <sql-endpoint-id> with the ID of your SQL endpoint. You can get this ID from the HTTP Path value from the Connection Details tab for your SQL endpoint. The ID is the string of characters following the final forward slash character. For example, if the HTTP Path value is /sql/1.0/endpoints/a123456bcde7f890, the SQL endpoint ID is a123456bcde7f890.

    • <sql-endpoint-port-number> with the Port value from the Connection Details tab for your SQL endpoint.

    For more information, see Connecting to Databricks ODBC on the dbt website and Configuring your profile in the dbt-labs/dbt-spark repository in GitHub.

    Tip

    You do not have to use the connection profile name provided in the profiles.yml example (such as databricks_cluster). You can use whatever connection profile names you want. To allow dbt to switch connections, you can add a separate profile entry for each connection, giving each profile entry a unique name, for example:

    databricks_cluster:
      outputs:
        ...
      target: ...
    
    <some-unique-name-for-this-second-connection-profile>:
      outputs:
        ...
      target: ...
    
    <some-other-unique-name-for-this-third-connection-profile>:
      outputs:
        ...
      target: ...
    
  4. In your project’s dbt_project.yml file, change the value of the profile setting to match the name of your connection profile in the profiles.yml file. This procedure uses a connection profile named databricks_cluster.

    profile: 'databricks_cluster'
    
  5. Confirm that the connection details are correct by running the dbt debug command.

    dbt debug
    
    ...
    Configuration:
      profiles.yml file [OK found and valid]
      dbt_project.yml file [OK found and valid]
    
    Required dependencies:
     - git [OK found]
    
    Connection:
      ...
      Connection test: OK connection ok
    

Step 4: Create and run models

In this step, you use your favorite text editor to create models, which are select statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample diamonds table from the Azure Databricks datasets, as described in the Create a table documentation. This procedure assumes this table has already been created in your workspace’s default database.

  1. In the project’s models directory, create a file named diamonds_four_cs.sql with the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from the diamonds table. The config block instructs dbt to create a table in the database based on this statement.

    {{ config(
      materialized='table',
      file_format='delta'
    ) }}
    
    select carat, cut, color, clarity
    from diamonds
    

    Tip

    For additional config options such as using the Delta file format and the merge incremental strategy, see Apache Spark configurations on the dbt website and the “Model Configuration” and “Incremental Models” sections of the Usage Notes in the dbt-labs/dbt-spark repository in GitHub.
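
    As a minimal sketch, an incremental model that uses the Delta file format with the merge strategy might be configured as follows. The source table and column names here are hypothetical and only illustrate the config block; see the dbt-spark documentation linked above for the supported options.

    {{ config(
      materialized='incremental',
      file_format='delta',
      incremental_strategy='merge',
      unique_key='id'
    ) }}

    -- Hypothetical source table and columns. On incremental runs, dbt merges
    -- new or changed rows into the existing Delta table, matching on id.
    select id, updated_at, amount
    from some_source_table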

  2. In the project’s models directory, create a second file named diamonds_list_colors.sql with the following SQL statement. This statement selects unique values from the color column in the diamonds_four_cs table, sorting the results in ascending alphabetical order. Because there is no config block, this model instructs dbt to create a view in the database based on this statement.

    select distinct color
    from diamonds_four_cs
    sort by color asc
    
  3. In the project’s models directory, create a third file named diamonds_prices.sql with the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement.

    select color, avg(price) as price
    from diamonds
    group by color
    order by price desc
    
  4. With the virtual environment activated, run the dbt run command with the paths to the three preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named diamonds_four_cs and two views named diamonds_list_colors and diamonds_prices. dbt gets these view and table names from their related .sql file names.

    dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql
    
    ...
    ... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
    ... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
    ... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
    ... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
    ... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
    ... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
    ... |
    ... | Finished running 1 table model, 2 view models ...
    
    Completed successfully
    
    Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
    
  5. Run the following SQL code to list information about the new views and to select all rows from the table and views.

    If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

    SHOW VIEWS IN default;
    
    +-----------+----------------------+-------------+
    | namespace | viewName             | isTemporary |
    +===========+======================+=============+
    | default   | diamonds_list_colors | false       |
    +-----------+----------------------+-------------+
    | default   | diamonds_prices      | false       |
    +-----------+----------------------+-------------+
    
    SELECT * FROM diamonds_four_cs;
    
    +-------+---------+-------+---------+
    | carat | cut     | color | clarity |
    +=======+=========+=======+=========+
    | 0.23  | Ideal   | E     | SI2     |
    +-------+---------+-------+---------+
    | 0.21  | Premium | E     | SI1     |
    +-------+---------+-------+---------+
    ...
    
    SELECT * FROM diamonds_list_colors;
    
    +-------+
    | color |
    +=======+
    | D     |
    +-------+
    | E     |
    +-------+
    ...
    
    SELECT * FROM diamonds_prices;
    
    +-------+---------+
    | color | price   |
    +=======+=========+
    | J     | 5323.82 |
    +-------+---------+
    | I     | 5091.87 |
    +-------+---------+
    ...
    

Step 5: Create and run more complex models

In this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models.

  1. Run the following SQL code to create the necessary data tables.

    If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

    The tables and views in this step start with zzz_ to help identify them as part of this example. You do not need to follow this pattern for your own tables and views.

    DROP TABLE IF EXISTS zzz_game_opponents;
    DROP TABLE IF EXISTS zzz_game_scores;
    DROP TABLE IF EXISTS zzz_games;
    DROP TABLE IF EXISTS zzz_teams;
    
    CREATE TABLE zzz_game_opponents (
    game_id INT,
    home_team_id INT,
    visitor_team_id INT
    ) USING DELTA;
    
    INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
    INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
    INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
    INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
    INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
    INSERT INTO zzz_game_opponents VALUES (6, 3, 2);
    
    /*
    +---------+--------------+-----------------+
    | game_id | home_team_id | visitor_team_id |
    +=========+==============+=================+
    | 1       | 1            | 2               |
    +---------+--------------+-----------------+
    | 2       | 1            | 3               |
    +---------+--------------+-----------------+
    | 3       | 2            | 1               |
    +---------+--------------+-----------------+
    | 4       | 2            | 3               |
    +---------+--------------+-----------------+
    | 5       | 3            | 1               |
    +---------+--------------+-----------------+
    | 6       | 3            | 2               |
    +---------+--------------+-----------------+
    */
    
    CREATE TABLE zzz_game_scores (
    game_id INT,
    home_team_score INT,
    visitor_team_score INT
    ) USING DELTA;
    
    INSERT INTO zzz_game_scores VALUES (1, 4, 2);
    INSERT INTO zzz_game_scores VALUES (2, 0, 1);
    INSERT INTO zzz_game_scores VALUES (3, 1, 2);
    INSERT INTO zzz_game_scores VALUES (4, 3, 2);
    INSERT INTO zzz_game_scores VALUES (5, 3, 0);
    INSERT INTO zzz_game_scores VALUES (6, 3, 1);
    
    /*
    +---------+-----------------+--------------------+
    | game_id | home_team_score | visitor_team_score |
    +=========+=================+====================+
    | 1       | 4               | 2                  |
    +---------+-----------------+--------------------+
    | 2       | 0               | 1                  |
    +---------+-----------------+--------------------+
    | 3       | 1               | 2                  |
    +---------+-----------------+--------------------+
    | 4       | 3               | 2                  |
    +---------+-----------------+--------------------+
    | 5       | 3               | 0                  |
    +---------+-----------------+--------------------+
    | 6       | 3               | 1                  |
    +---------+-----------------+--------------------+
    */
    
    CREATE TABLE zzz_games (
    game_id INT,
    game_date DATE
    ) USING DELTA;
    
    INSERT INTO zzz_games VALUES (1, '2020-12-12');
    INSERT INTO zzz_games VALUES (2, '2021-01-09');
    INSERT INTO zzz_games VALUES (3, '2020-12-19');
    INSERT INTO zzz_games VALUES (4, '2021-01-16');
    INSERT INTO zzz_games VALUES (5, '2021-01-23');
    INSERT INTO zzz_games VALUES (6, '2021-02-06');
    
    /*
    +---------+------------+
    | game_id | game_date  |
    +=========+============+
    | 1       | 2020-12-12 |
    +---------+------------+
    | 2       | 2021-01-09 |
    +---------+------------+
    | 3       | 2020-12-19 |
    +---------+------------+
    | 4       | 2021-01-16 |
    +---------+------------+
    | 5       | 2021-01-23 |
    +---------+------------+
    | 6       | 2021-02-06 |
    +---------+------------+
    */
    
    CREATE TABLE zzz_teams (
    team_id INT,
    team_city VARCHAR(15)
    ) USING DELTA;
    
    INSERT INTO zzz_teams VALUES (1, "San Francisco");
    INSERT INTO zzz_teams VALUES (2, "Seattle");
    INSERT INTO zzz_teams VALUES (3, "Amsterdam");
    
    /*
    +---------+---------------+
    | team_id | team_city     |
    +=========+===============+
    | 1       | San Francisco |
    +---------+---------------+
    | 2       | Seattle       |
    +---------+---------------+
    | 3       | Amsterdam     |
    +---------+---------------+
    */
    
  2. In the project’s models directory, create a file named zzz_game_details.sql with the following SQL statement. This statement creates a table that provides the details of each game, such as team names and scores. The config block instructs dbt to create a table in the database based on this statement.

    -- Create a table that provides full details for each game, including
    -- the game ID, the home and visiting teams' city names and scores,
    -- the game winner's city name, and the game date.
    
    {{ config(
      materialized='table',
      file_format='delta'
    ) }}
    
    -- Step 4 of 4: Replace the visitor team IDs with their city names.
    select
      game_id,
      home,
      t.team_city as visitor,
      home_score,
      visitor_score,
      -- Step 3 of 4: Display the city name for each game's winner.
      case
        when
          home_score > visitor_score
            then
              home
        when
          visitor_score > home_score
            then
              t.team_city
      end as winner,
      game_date as date
    from (
      -- Step 2 of 4: Replace the home team IDs with their actual city names.
      select
        game_id,
        t.team_city as home,
        home_score,
        visitor_team_id,
        visitor_score,
        game_date
      from (
        -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
        select
          g.game_id,
          go.home_team_id,
          gs.home_team_score as home_score,
          go.visitor_team_id,
          gs.visitor_team_score as visitor_score,
          g.game_date
        from
          zzz_games as g,
          zzz_game_opponents as go,
          zzz_game_scores as gs
        where
          g.game_id = go.game_id and
          g.game_id = gs.game_id
      ) as all_ids,
        zzz_teams as t
      where
        all_ids.home_team_id = t.team_id
    ) as visitor_ids,
      zzz_teams as t
    where
      visitor_ids.visitor_team_id = t.team_id
    order by game_date desc
    
  3. In the project’s models directory, create a file named zzz_win_loss_records.sql with the following SQL statement. This statement creates a view that lists team win-loss records for the season.

    -- Create a view that summarizes the season's win and loss records by team.
    
    -- Step 2 of 2: Calculate the number of wins and losses for each team.
    select
      winner as team,
      count(winner) as wins,
      -- Each team played in 4 games.
      (4 - count(winner)) as losses
    from (
      -- Step 1 of 2: Determine the winner and loser for each game.
      select
        game_id,
        winner,
        case
          when
            home = winner
              then
                visitor
          else
            home
        end as loser
      from zzz_game_details
    )
    group by winner
    order by wins desc
    
  4. With the virtual environment activated, run the dbt run command with the paths to the two preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named zzz_game_details and one view named zzz_win_loss_records. dbt gets these view and table names from their related .sql file names.

    dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql
    
    ...
    ... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
    ... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
    ... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
    ... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
    ... |
    ... | Finished running 1 table model, 1 view model ...
    
    Completed successfully
    
    Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
    
  5. Run the following SQL code to list information about the new view and to select all rows from the table and view.

    If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

    SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';
    
    +-----------+----------------------+-------------+
    | namespace | viewName             | isTemporary |
    +===========+======================+=============+
    | default   | zzz_win_loss_records | false       |
    +-----------+----------------------+-------------+
    
    SELECT * FROM zzz_game_details;
    
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | game_id | home          | visitor       | home_score | visitor_score | winner        | date       |
    +=========+===============+===============+============+===============+===============+============+
    | 1       | San Francisco | Seattle       | 4          | 2             | San Francisco | 2020-12-12 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 2       | San Francisco | Amsterdam     | 0          | 1             | Amsterdam     | 2021-01-09 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 3       | Seattle       | San Francisco | 1          | 2             | San Francisco | 2020-12-19 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 4       | Seattle       | Amsterdam     | 3          | 2             | Seattle       | 2021-01-16 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 5       | Amsterdam     | San Francisco | 3          | 0             | Amsterdam     | 2021-01-23 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 6       | Amsterdam     | Seattle       | 3          | 1             | Amsterdam     | 2021-02-06 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    
    SELECT * FROM zzz_win_loss_records;
    
    +---------------+------+--------+
    | team          | wins | losses |
    +===============+======+========+
    | Amsterdam     | 3    | 1      |
    +---------------+------+--------+
    | San Francisco | 2    | 2      |
    +---------------+------+--------+
    | Seattle       | 1    | 3      |
    +---------------+------+--------+
    

Step 6: Create and run tests

In this step, you create tests, which are assertions you make about your models. When you run these tests, dbt will tell you if each test in your project passes or fails.

There are two types of tests. Schema tests, applied in YAML, return the number of records that do not pass an assertion; when this number is zero, all records pass the assertion, and the test passes. Data tests are specific queries that must return zero records to pass.

  1. In the project’s models directory, create a file named schema.yml with the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination.

    version: 2
    
    models:
      - name: zzz_game_details
        columns:
          - name: game_id
            tests:
              - unique
              - not_null
          - name: home
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: visitor
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: home_score
            tests:
              - not_null
          - name: visitor_score
            tests:
              - not_null
          - name: winner
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: date
            tests:
              - not_null
      - name: zzz_win_loss_records
        columns:
          - name: team
            tests:
              - unique
              - not_null
              - relationships:
                  to: ref('zzz_game_details')
                  field: home
          - name: wins
            tests:
              - not_null
          - name: losses
            tests:
              - not_null
    
  2. In the project’s tests directory, create a file named zzz_game_details_check_dates.sql with the following SQL statement. This file includes a data test to determine whether any games happened outside of the regular season.

    -- This season's games happened between 2020-12-12 and 2021-02-06.
    -- For this test to pass, this query must return no results.
    
    select date
    from zzz_game_details
    where date < '2020-12-12'
    or date > '2021-02-06'
    
  3. In the project’s tests directory, create a file named zzz_game_details_check_scores.sql with the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied.

    -- This sport allows no negative scores or tie games.
    -- For this test to pass, this query must return no results.
    
    select home_score, visitor_score
    from zzz_game_details
    where home_score < 0
    or visitor_score < 0
    or home_score = visitor_score
    
  4. In the project’s tests directory, create a file named zzz_win_loss_records_check_records.sql with the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed.

    -- Each team participated in 4 games this season.
    -- For this test to pass, this query must return no results.
    
    select wins, losses
    from zzz_win_loss_records
    where wins < 0 or wins > 4
    or losses < 0 or losses > 4
    or (wins + losses) > 4
    
  5. With the virtual environment activated, run the dbt test command with the --schema option and names of the two models in the models/schema.yml file to run the tests that are specified for those models.

    dbt test --schema --models zzz_game_details zzz_win_loss_records
    
    ...
    ... | 1 of 15 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
    ... | 1 of 15 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
    ...
    ... |
    ... | Finished running 15 tests ...
    
    Completed successfully
    
    Done. PASS=15 WARN=0 ERROR=0 SKIP=0 TOTAL=15
    
  6. Run the dbt test command with the --data option to run the tests in the project’s tests directory.

    dbt test --data
    
    ...
    ... | 1 of 3 START test zzz_game_details_check_dates....................... [RUN]
    ... | 1 of 3 PASS zzz_game_details_check_dates............................. [PASS ...]
    ...
    ... |
    ... | Finished running 3 tests ...
    
    Completed successfully
    
    Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
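
    Tip

    To run the schema tests and the data tests together in a single invocation, run dbt test without the --schema or --data option:

    dbt test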
    

Step 7: Clean up

You can delete the tables and views you created for this example by running the following SQL code.

If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;

DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;

Next steps

  • Learn more about dbt models.
  • Learn more about how to test your dbt projects.
  • Learn how to use Jinja, a templating language, for programming SQL in your dbt projects.
  • Learn about dbt best practices.
  • Learn about dbt Cloud, a hosted version of dbt.
