dbt Core integration with Azure Databricks

dbt (data build tool) is a development environment that enables data analysts and data engineers to transform data by simply writing select statements. dbt handles turning these select statements into tables and views. dbt compiles your code into raw SQL and then runs that code on the specified database in Azure Databricks. dbt supports collaborative coding patterns and best practices such as version control, documentation, modularity, and more. For more information, see What, exactly, is dbt? and Analytics Engineering for Everyone: Databricks in dbt Cloud on the dbt website.
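
For example, a dbt model is a select statement saved in a .sql file. When you run your project, dbt wraps that statement in the data definition language (DDL) needed to materialize it as a table or view. The following sketch is purely illustrative; the model and table names are hypothetical, and the exact SQL that dbt generates depends on the adapter and the materialization you choose.

    -- models/orders_summary.sql (a hypothetical dbt model)
    select order_id, sum(amount) as total_amount
    from raw_orders
    group by order_id

    -- Roughly what dbt runs against the database for a view materialization
    create or replace view orders_summary as
    select order_id, sum(amount) as total_amount
    from raw_orders
    group by order_id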

dbt does not extract or load data. dbt focuses on the transformation step only, using a “transform after load” architecture. dbt assumes that you already have a copy of your data in your database.

This article focuses on using dbt Core. dbt Core enables you to write dbt code in the text editor or IDE of your choice on your local development machine and then run dbt from the command line. dbt Core includes the dbt Command Line Interface (CLI). The dbt CLI is free to use and open source.

A hosted version of dbt called dbt Cloud is also available. dbt Cloud comes equipped with turnkey support for scheduling jobs, CI/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE). For more information, see dbt Cloud integration with Azure Databricks. The dbt Cloud Developer plan provides one free developer seat; Team and Enterprise paid plans are also available. For more information, see dbt Pricing on the dbt website.

Because dbt Core and dbt Cloud can use hosted git repositories (for example, on GitHub, GitLab, or Bitbucket), you can use dbt Core to create a dbt project and then make it available to your dbt Cloud users. For more information, see Creating a dbt project and Using an existing project on the dbt website.

Requirements

Before you install dbt Core, you must install the following on your local development machine:

  • Python. This procedure was tested with Python 3.8.6.
  • pipenv, which this procedure uses to create and manage a Python virtual environment.
  • The Databricks ODBC driver, which the dbt Spark ODBC plugin uses to connect to your Azure Databricks cluster or SQL endpoint.

You also need access to an Azure Databricks cluster or SQL endpoint and an Azure Databricks personal access token, which you use in Step 3 to configure the connection.

Note

This article has been tested on macOS with the following components: Python 3.8.6, pyodbc 4.0.30, dbt 0.19.1, and the dbt Spark ODBC plugin (dbt-spark[ODBC]) 0.19.1. If you use different versions, adjust the version numbers in the steps that follow.

Step 1: Create and activate a Python virtual environment

In this step, you use pipenv to create a Python virtual environment. We recommend using a Python virtual environment because it isolates the package versions and code dependencies for this project from those in other environments, which helps reduce unexpected package version mismatches and dependency collisions.

  1. From your terminal, switch to an empty directory, creating that directory first if necessary. This procedure creates an empty directory named dbt_demo in the root of your user home directory.

    Unix, Linux, macOS

    mkdir ~/dbt_demo
    cd ~/dbt_demo
    

    Windows

    mkdir %USERPROFILE%\dbt_demo
    cd %USERPROFILE%\dbt_demo
    
  2. In this empty directory, create a file named Pipfile with the following content. This Pipfile instructs pipenv to use Python version 3.8.6. If you use a different version, replace 3.8.6 with your version number.

    [[source]]
    url = "https://pypi.org/simple"
    verify_ssl = true
    name = "pypi"
    
    [requires]
    python_version = "3.8.6"
    
  3. Create a Python virtual environment in this directory by running pipenv and specifying the Python version to use. This command specifies Python version 3.8.6. If you use a different version, replace 3.8.6 with your version number:

    pipenv --python 3.8.6
    
  4. Activate this virtual environment by running pipenv shell. To confirm the activation, check that the terminal displays (dbt_demo) before the prompt. The virtual environment uses the specified version of Python and isolates all package versions and code dependencies within this new environment.

    pipenv shell
    

    Note

    To deactivate this virtual environment, run exit. (dbt_demo) disappears from before the prompt. If you run python --version or pip list with this virtual environment deactivated, you might see a different version of Python, a different list of available packages or package versions, or both.

  5. Confirm that your virtual environment is running the expected version of Python by running python with the --version option.

    python --version
    
    Python 3.8.6
    

    If an unexpected version of Python displays, make sure you have activated your virtual environment by running pipenv shell. This procedure assumes the expected version for Python is 3.8.6.

Step 2: Install required software into your virtual environment

In this step, you install the pyodbc module, dbt, and the dbt Spark ODBC plugin into your Python virtual environment.

  1. With the virtual environment activated, install the pyodbc module by running pipenv. This command specifies pyodbc version 4.0.30. If you use a different version, replace 4.0.30 with a different version number, or install the latest version by omitting ==4.0.30.

    pipenv install pyodbc==4.0.30
    
  2. Install dbt by running pipenv with the name of the dbt package from the Python Package Index (PyPI), which is dbt. This command specifies dbt version 0.19.1. If you use a different version, replace 0.19.1 with a different version number, or install the latest version by omitting ==0.19.1.

    Important

    If your local development machine runs Ubuntu, Debian, CentOS, or Windows, you must complete additional steps before installing dbt.

    pipenv install dbt==0.19.1
    
  3. Install the dbt Spark ODBC plugin by running pipenv with the name of the dbt Spark ODBC plugin package from PyPI, which is dbt-spark[ODBC]. This command specifies dbt Spark ODBC plugin version 0.19.1. If you use a different version, replace 0.19.1 with a different version number, or install the latest version by omitting ==0.19.1.

    pipenv install "dbt-spark[ODBC]==0.19.1"
    

    You must enclose dbt-spark[ODBC], including the version number if specified, in quotes.

  4. Confirm that your virtual environment is running the expected versions of dbt and the dbt Spark ODBC plugin by running dbt with the --version option. This procedure assumes the expected version for each is 0.19.1.

    dbt --version
    
    Installed version: 0.19.1
    ...
    Plugins:
    ...
      - spark: 0.19.1
    

    If an unexpected version of dbt or the dbt Spark ODBC plugin displays, make sure you have activated your virtual environment by running pipenv shell. If an unexpected version still displays, try installing dbt or the dbt Spark ODBC plugin again after you activate your virtual environment.

Step 3: Create a dbt project and specify and test connection settings

In this step, you create a dbt project, which is a collection of related directories and files that are required to use dbt. You then configure your connection profiles, which contain connection settings to an Azure Databricks cluster, a SQL endpoint, or both. To increase security, dbt projects and profiles are stored in separate locations by default.

  1. With the virtual environment still activated, run the dbt init command with a name for your project, specifying the spark adapter. This procedure creates a project named my_dbt_demo.

    dbt init my_dbt_demo --adapter spark
    

    Tip

    Carefully read the output of the dbt init command, as it provides additional helpful usage guidance.

  2. From within the dbt configuration folder, open the profiles.yml file. The location of this file is listed in the output of the dbt init command.

    Tip

    If you forget where the dbt configuration folder is located, you can list it by running the dbt debug --config-dir command.
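
    For example, running the following command prints the location of the folder that contains your profiles.yml file:

    dbt debug --config-dir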

  3. Modify the contents of the profiles.yml file, depending on what you want to connect to.

    Cluster

    databricks_cluster:
      outputs:
        dev:
          type: spark
          method: odbc
          driver: <path-to-odbc-driver>
          schema: default
          host: <cluster-server-hostname>
          organization: "<organization-id>"
          token: <personal-access-token>
          cluster: <cluster-id>
          port: <cluster-port-number>
          connect_retries: 5
          connect_timeout: 60
      target: dev
    

    Replace:

    • <path-to-odbc-driver> with the path to where the Databricks ODBC driver is installed on your local development machine. To get the path, for Unix, Linux, or macOS, run cat /etc/odbcinst.ini. For Windows, use the Registry Editor to browse to the COMPUTER\HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI key.

    • <cluster-server-hostname> with the Server Hostname value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks cluster.

    • <organization-id> with the value between ?o= and any other query string separators in the URL to your Azure Databricks workspace. For example, if your workspace URL ends in ?o=1234567890123456, the organization ID is 1234567890123456.

      Important

      The organization ID value must be within quotes.

    • <personal-access-token> with the value of your personal access token.

    • <cluster-id> with the ID of your cluster. You can get this ID from the HTTP Path value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks cluster. The ID is the string of characters following the final forward slash character. For example, if the HTTP Path value is sql/protocolv1/o/1234567890123456/0123-456789-test012, the cluster ID is 0123-456789-test012.

    • <cluster-port-number> with the Port value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks cluster.

    For more information, see Connecting to Databricks ODBC on the dbt website and Configuring your profile in the dbt-labs/dbt-spark repository in GitHub.
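
    As an illustration, a completed cluster profile might look like the following. Every value shown is a hypothetical placeholder, and the driver path is a typical macOS install location for the Databricks ODBC driver; substitute the values described above.

    # All values below are hypothetical placeholders.
    databricks_cluster:
      outputs:
        dev:
          type: spark
          method: odbc
          driver: /Library/simba/spark/lib/libsparkodbc_sbu.dylib
          schema: default
          host: adb-1234567890123456.7.azuredatabricks.net
          organization: "1234567890123456"
          token: dapiXXXXXXXXXXXXXXXX
          cluster: 0123-456789-test012
          port: 443
          connect_retries: 5
          connect_timeout: 60
      target: dev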

    SQL endpoint

    sql_endpoint:
      outputs:
        dev:
          type: spark
          method: odbc
          driver: <path-to-odbc-driver>
          schema: default
          host: <sql-endpoint-server-hostname>
          organization: "<organization-id>"
          token: <personal-access-token>
          endpoint: <sql-endpoint-id>
          port: <sql-endpoint-port-number>
          connect_retries: 5
          connect_timeout: 60
      target: dev
    

    Replace:

    • <path-to-odbc-driver> with the path to where the Databricks ODBC driver is installed on your local development machine. To get the path, for Unix, Linux, or macOS, run cat /etc/odbcinst.ini. For Windows, use the Registry Editor to browse to the COMPUTER\HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI key.

    • <sql-endpoint-server-hostname> with the Server Hostname value from the Connection Details tab for your SQL endpoint.

    • <organization-id> with the value between adb- and the first dot (.) in the Server Hostname value from the Connection Details tab for your SQL endpoint.

      Important

      This value must be within quotes.

    • <personal-access-token> with the value of your personal access token.

    • <sql-endpoint-id> with the ID of your SQL endpoint. You can get this ID from the HTTP Path value from the Connection Details tab for your SQL endpoint. The ID is the string of characters following the final forward slash character. For example, if the HTTP Path value is /sql/1.0/endpoints/a123456bcde7f890, the SQL endpoint ID is a123456bcde7f890.

    • <sql-endpoint-port-number> with the Port value from the Connection Details tab for your SQL endpoint.

    For more information, see Connecting to Databricks ODBC on the dbt website and Configuring your profile in the dbt-labs/dbt-spark repository in GitHub.

    Tip

    You do not have to use the connection profile name provided in the profiles.yml example (such as databricks_cluster). You can use whatever connection profile names you want. To allow dbt to switch connections, you can add a separate profile entry for each connection, giving each profile entry a unique name, for example:

    databricks_cluster:
      outputs:
        ...
      target: ...
    
    <some-unique-name-for-this-second-connection-profile>:
      outputs:
        ...
      target: ...
    
    <some-other-unique-name-for-this-third-connection-profile>:
      outputs:
        ...
      target: ...
    
  4. In your project’s dbt_project.yml file, change the value of the profile setting to match the name of your connection profile in the profiles.yml file. This procedure uses a connection profile named databricks_cluster.

    profile: 'databricks_cluster'
    
  5. Confirm that the connection details are correct by running the dbt debug command.

    dbt debug
    
    ...
    Configuration:
      profiles.yml file [OK found and valid]
      dbt_project.yml file [OK found and valid]
    
    Required dependencies:
     - git [OK found]
    
    Connection:
      ...
      Connection test: OK connection ok
    

Step 4: Create and run models

In this step, you use your favorite text editor to create models, which are select statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample diamonds table from the Azure Databricks datasets, as described in the Create a table documentation. This procedure assumes this table has already been created in your workspace’s default database.

  1. In the project’s models directory, create a file named diamonds_four_cs.sql with the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from the diamonds table. The config block instructs dbt to create a table in the database based on this statement.

    {{ config(
      materialized='table',
      file_format='delta'
    ) }}
    
    select carat, cut, color, clarity
    from diamonds
    

    Tip

    For additional config options such as using the Delta file format and the merge incremental strategy, see Apache Spark configurations on the dbt website and the “Model Configuration” and “Incremental Models” sections of the Usage Notes in the dbt-labs/dbt-spark repository in GitHub.
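
    As a minimal sketch, an incremental model that uses the Delta file format with the merge strategy might be configured as follows. The source table and column names here are hypothetical and only illustrate the config block; see the dbt-spark documentation linked above for the supported options.

    {{ config(
      materialized='incremental',
      file_format='delta',
      incremental_strategy='merge',
      unique_key='id'
    ) }}

    -- Hypothetical source table and columns. On incremental runs, dbt merges
    -- new or changed rows into the existing Delta table, matching on id.
    select id, updated_at, amount
    from some_source_table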

  2. In the project’s models directory, create a second file named diamonds_list_colors.sql with the following SQL statement. This statement selects unique values from the color column in the diamonds_four_cs table, sorting the results in ascending alphabetical order. Because there is no config block, this model instructs dbt to create a view in the database based on this statement.

    select distinct color
    from diamonds_four_cs
    sort by color asc
    
  3. In the project’s models directory, create a third file named diamonds_prices.sql with the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement.

    select color, avg(price) as price
    from diamonds
    group by color
    order by price desc
    
  4. With the virtual environment activated, run the dbt run command with the paths to the three preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named diamonds_four_cs and two views named diamonds_list_colors and diamonds_prices. dbt gets these view and table names from their related .sql file names.

    dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql
    
    ...
    ... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
    ... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
    ... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
    ... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
    ... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
    ... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
    ... |
    ... | Finished running 1 table model, 2 view models ...
    
    Completed successfully
    
    Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
    
  5. Run the following SQL code to list information about the new views and to select all rows from the table and views.

    If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

    SHOW VIEWS IN default;
    
    +-----------+----------------------+-------------+
    | namespace | viewName             | isTemporary |
    +===========+======================+=============+
    | default   | diamonds_list_colors | false       |
    +-----------+----------------------+-------------+
    | default   | diamonds_prices      | false       |
    +-----------+----------------------+-------------+
    
    SELECT * FROM diamonds_four_cs;
    
    +-------+---------+-------+---------+
    | carat | cut     | color | clarity |
    +=======+=========+=======+=========+
    | 0.23  | Ideal   | E     | SI2     |
    +-------+---------+-------+---------+
    | 0.21  | Premium | E     | SI1     |
    +-------+---------+-------+---------+
    ...
    
    SELECT * FROM diamonds_list_colors;
    
    +-------+
    | color |
    +=======+
    | D     |
    +-------+
    | E     |
    +-------+
    ...
    
    SELECT * FROM diamonds_prices;
    
    +-------+---------+
    | color | price   |
    +=======+=========+
    | J     | 5323.82 |
    +-------+---------+
    | I     | 5091.87 |
    +-------+---------+
    ...
    

Step 5: Create and run more complex models

In this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models.

  1. Run the following SQL code to create the necessary data tables.

    If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

    The tables and views in this step start with zzz_ to help identify them as part of this example. You do not need to follow this pattern for your own tables and views.

    DROP TABLE IF EXISTS zzz_game_opponents;
    DROP TABLE IF EXISTS zzz_game_scores;
    DROP TABLE IF EXISTS zzz_games;
    DROP TABLE IF EXISTS zzz_teams;
    
    CREATE TABLE zzz_game_opponents (
    game_id INT,
    home_team_id INT,
    visitor_team_id INT
    ) USING DELTA;
    
    INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
    INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
    INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
    INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
    INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
    INSERT INTO zzz_game_opponents VALUES (6, 3, 2);
    
    /*
    +---------+--------------+-----------------+
    | game_id | home_team_id | visitor_team_id |
    +=========+==============+=================+
    | 1       | 1            | 2               |
    +---------+--------------+-----------------+
    | 2       | 1            | 3               |
    +---------+--------------+-----------------+
    | 3       | 2            | 1               |
    +---------+--------------+-----------------+
    | 4       | 2            | 3               |
    +---------+--------------+-----------------+
    | 5       | 3            | 1               |
    +---------+--------------+-----------------+
    | 6       | 3            | 2               |
    +---------+--------------+-----------------+
    */
    
    CREATE TABLE zzz_game_scores (
    game_id INT,
    home_team_score INT,
    visitor_team_score INT
    ) USING DELTA;
    
    INSERT INTO zzz_game_scores VALUES (1, 4, 2);
    INSERT INTO zzz_game_scores VALUES (2, 0, 1);
    INSERT INTO zzz_game_scores VALUES (3, 1, 2);
    INSERT INTO zzz_game_scores VALUES (4, 3, 2);
    INSERT INTO zzz_game_scores VALUES (5, 3, 0);
    INSERT INTO zzz_game_scores VALUES (6, 3, 1);
    
    /*
    +---------+-----------------+--------------------+
    | game_id | home_team_score | visitor_team_score |
    +=========+=================+====================+
    | 1       | 4               | 2                  |
    +---------+-----------------+--------------------+
    | 2       | 0               | 1                  |
    +---------+-----------------+--------------------+
    | 3       | 1               | 2                  |
    +---------+-----------------+--------------------+
    | 4       | 3               | 2                  |
    +---------+-----------------+--------------------+
    | 5       | 3               | 0                  |
    +---------+-----------------+--------------------+
    | 6       | 3               | 1                  |
    +---------+-----------------+--------------------+
    */
    
    CREATE TABLE zzz_games (
    game_id INT,
    game_date DATE
    ) USING DELTA;
    
    INSERT INTO zzz_games VALUES (1, '2020-12-12');
    INSERT INTO zzz_games VALUES (2, '2021-01-09');
    INSERT INTO zzz_games VALUES (3, '2020-12-19');
    INSERT INTO zzz_games VALUES (4, '2021-01-16');
    INSERT INTO zzz_games VALUES (5, '2021-01-23');
    INSERT INTO zzz_games VALUES (6, '2021-02-06');
    
    /*
    +---------+------------+
    | game_id | game_date  |
    +=========+============+
    | 1       | 2020-12-12 |
    +---------+------------+
    | 2       | 2021-01-09 |
    +---------+------------+
    | 3       | 2020-12-19 |
    +---------+------------+
    | 4       | 2021-01-16 |
    +---------+------------+
    | 5       | 2021-01-23 |
    +---------+------------+
    | 6       | 2021-02-06 |
    +---------+------------+
    */
    
    CREATE TABLE zzz_teams (
    team_id INT,
    team_city VARCHAR(15)
    ) USING DELTA;
    
    INSERT INTO zzz_teams VALUES (1, "San Francisco");
    INSERT INTO zzz_teams VALUES (2, "Seattle");
    INSERT INTO zzz_teams VALUES (3, "Amsterdam");
    
    /*
    +---------+---------------+
    | team_id | team_city     |
    +=========+===============+
    | 1       | San Francisco |
    +---------+---------------+
    | 2       | Seattle       |
    +---------+---------------+
    | 3       | Amsterdam     |
    +---------+---------------+
    */
    
  2. In the project’s models directory, create a file named zzz_game_details.sql with the following SQL statement. This statement creates a table that provides the details of each game, such as team names and scores. The config block instructs dbt to create a table in the database based on this statement.

    -- Create a table that provides full details for each game, including
    -- the game ID, the home and visiting teams' city names and scores,
    -- the game winner's city name, and the game date.
    
    {{ config(
      materialized='table',
      file_format='delta'
    ) }}
    
    -- Step 4 of 4: Replace the visitor team IDs with their city names.
    select
      game_id,
      home,
      t.team_city as visitor,
      home_score,
      visitor_score,
      -- Step 3 of 4: Display the city name for each game's winner.
      case
        when
          home_score > visitor_score
            then
              home
        when
          visitor_score > home_score
            then
              t.team_city
      end as winner,
      game_date as date
    from (
      -- Step 2 of 4: Replace the home team IDs with their actual city names.
      select
        game_id,
        t.team_city as home,
        home_score,
        visitor_team_id,
        visitor_score,
        game_date
      from (
        -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
        select
          g.game_id,
          go.home_team_id,
          gs.home_team_score as home_score,
          go.visitor_team_id,
          gs.visitor_team_score as visitor_score,
          g.game_date
        from
          zzz_games as g,
          zzz_game_opponents as go,
          zzz_game_scores as gs
        where
          g.game_id = go.game_id and
          g.game_id = gs.game_id
      ) as all_ids,
        zzz_teams as t
      where
        all_ids.home_team_id = t.team_id
    ) as visitor_ids,
      zzz_teams as t
    where
      visitor_ids.visitor_team_id = t.team_id
    order by game_date desc
    
  3. In the project’s models directory, create a file named zzz_win_loss_records.sql with the following SQL statement. This statement creates a view that lists team win-loss records for the season.

    -- Create a view that summarizes the season's win and loss records by team.
    
    -- Step 2 of 2: Calculate the number of wins and losses for each team.
    select
      winner as team,
      count(winner) as wins,
      -- Each team played in 4 games.
      (4 - count(winner)) as losses
    from (
      -- Step 1 of 2: Determine the winner and loser for each game.
      select
        game_id,
        winner,
        case
          when
            home = winner
              then
                visitor
          else
            home
        end as loser
      from zzz_game_details
    )
    group by winner
    order by wins desc
    
  4. With the virtual environment activated, run the dbt run command with the paths to the two preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named zzz_game_details and one view named zzz_win_loss_records. dbt gets these view and table names from their related .sql file names.

    dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql
    
    ...
    ... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
    ... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
    ... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
    ... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
    ... |
    ... | Finished running 1 table model, 1 view model ...
    
    Completed successfully
    
    Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
    
  5. Run the following SQL code to list information about the new view and to select all rows from the table and view.

    If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

    SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';
    
    +-----------+----------------------+-------------+
    | namespace | viewName             | isTemporary |
    +===========+======================+=============+
    | default   | zzz_win_loss_records | false       |
    +-----------+----------------------+-------------+
    
    SELECT * FROM zzz_game_details;
    
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | game_id | home          | visitor       | home_score | visitor_score | winner        | date       |
    +=========+===============+===============+============+===============+===============+============+
    | 1       | San Francisco | Seattle       | 4          | 2             | San Francisco | 2020-12-12 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 2       | San Francisco | Amsterdam     | 0          | 1             | Amsterdam     | 2021-01-09 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 3       | Seattle       | San Francisco | 1          | 2             | San Francisco | 2020-12-19 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 4       | Seattle       | Amsterdam     | 3          | 2             | Seattle       | 2021-01-16 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 5       | Amsterdam     | San Francisco | 3          | 0             | Amsterdam     | 2021-01-23 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 6       | Amsterdam     | Seattle       | 3          | 1             | Amsterdam     | 2021-02-06 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    
    SELECT * FROM zzz_win_loss_records;
    
    +---------------+------+--------+
    | team          | wins | losses |
    +===============+======+========+
    | Amsterdam     | 3    | 1      |
    +---------------+------+--------+
    | San Francisco | 2    | 2      |
    +---------------+------+--------+
    | Seattle       | 1    | 3      |
    +---------------+------+--------+
    

Step 6: Create and run tests

In this step, you create tests, which are assertions you make about your models. When you run these tests, dbt will tell you if each test in your project passes or fails.

There are two types of tests. Schema tests, applied in YAML, return the number of records that do not pass an assertion; when this number is zero, all records pass the assertion, and the test passes. Data tests are specific queries that must return zero records to pass.

  1. In the project’s models directory, create a file named schema.yml with the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination.

    version: 2
    
    models:
      - name: zzz_game_details
        columns:
          - name: game_id
            tests:
              - unique
              - not_null
          - name: home
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: visitor
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: home_score
            tests:
              - not_null
          - name: visitor_score
            tests:
              - not_null
          - name: winner
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: date
            tests:
              - not_null
      - name: zzz_win_loss_records
        columns:
          - name: team
            tests:
              - unique
              - not_null
              - relationships:
                  to: ref('zzz_game_details')
                  field: home
          - name: wins
            tests:
              - not_null
          - name: losses
            tests:
              - not_null
    
  2. In the project’s tests directory, create a file named zzz_game_details_check_dates.sql with the following SQL statement. This file includes a data test to determine whether any games happened outside of the regular season.

    -- This season's games happened between 2020-12-12 and 2021-02-06.
    -- For this test to pass, this query must return no results.
    
    select date
    from zzz_game_details
    where date < '2020-12-12'
    or date > '2021-02-06'
    
  3. In the project’s tests directory, create a file named zzz_game_details_check_scores.sql with the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied.

    -- This sport allows no negative scores or tie games.
    -- For this test to pass, this query must return no results.
    
    select home_score, visitor_score
    from zzz_game_details
    where home_score < 0
    or visitor_score < 0
    or home_score = visitor_score
    
  4. In the project’s tests directory, create a file named zzz_win_loss_records_check_records.sql with the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed.

    -- Each team participated in 4 games this season.
    -- For this test to pass, this query must return no results.
    
    select wins, losses
    from zzz_win_loss_records
    where wins < 0 or wins > 4
    or losses < 0 or losses > 4
    or (wins + losses) > 4
    
  5. With the virtual environment activated, run the dbt test command with the --schema option and names of the two models in the models/schema.yml file to run the tests that are specified for those models.

    dbt test --schema --models zzz_game_details zzz_win_loss_records
    
    ...
    ... | 1 of 15 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
    ... | 1 of 15 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
    ...
    ... |
    ... | Finished running 15 tests ...
    
    Completed successfully
    
    Done. PASS=15 WARN=0 ERROR=0 SKIP=0 TOTAL=15
    
  6. Run the dbt test command with the --data option to run the tests in the project’s tests directory.

    dbt test --data
    
    ...
    ... | 1 of 3 START test zzz_game_details_check_dates....................... [RUN]
    ... | 1 of 3 PASS zzz_game_details_check_dates............................. [PASS ...]
    ...
    ... |
    ... | Finished running 3 tests ...
    
    Completed successfully
    
    Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
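
    Tip

    To run the schema tests and the data tests together in a single invocation, run dbt test without the --schema or --data option:

    dbt test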
    

Step 7: Clean up

You can delete the tables and views you created for this example by running the following SQL code.

If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL endpoint, you can run this SQL code from a query.

DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;

DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;

Next steps

  • Learn more about dbt models.
  • Learn more about how to test your dbt projects.
  • Learn how to use Jinja, a templating language, for programming SQL in your dbt projects.
  • Learn about dbt best practices.
  • Learn about dbt Cloud, a hosted version of dbt.
