Quickstart: Create an Azure Database for PostgreSQL - Hyperscale (Citus) in the Azure portal

Azure Database for PostgreSQL is a managed service that you use to run, manage, and scale highly available PostgreSQL databases in the cloud. This Quickstart shows you how to create an Azure Database for PostgreSQL - Hyperscale (Citus) server group using the Azure portal. You'll explore distributed data: sharding tables across nodes, ingesting sample data, and running queries that execute on multiple nodes.

If you don't have an Azure subscription, create a free account before you begin.

Sign in to the Azure portal

Sign in to the Azure portal.

Create an Azure Database for PostgreSQL - Hyperscale (Citus)

Follow these steps to create an Azure Database for PostgreSQL server:

  1. Click Create a resource in the upper left-hand corner of the Azure portal.

  2. Select Databases from the New page, and select Azure Database for PostgreSQL from the Databases page.

  3. For the deployment option, click the Create button under Hyperscale (Citus) server group.

  4. Fill out the new server details form with the following information:

    • Resource group: click the Create new link below the text box for this field. Enter a name such as myresourcegroup.
    • Server group name: enter a unique name for the new server group, which will also be used for a server subdomain.
    • Admin username: currently required to be the value citus, and can't be changed.
    • Password: must be at least eight characters long and contain characters from three of the following categories – English uppercase letters, English lowercase letters, numbers (0-9), and non-alphanumeric characters (!, $, #, %, and so on.)
    • Location: use the location that is closest to your users to give them the fastest access to the data.

    Important

    The server admin password that you specify here is required to log in to the server and its databases. Remember or record this information for later use.

  5. Click Configure server group. Leave the settings in that section unchanged and click Save.

  6. Click Next : Networking > at the bottom of the screen.

  7. In the Networking tab, click the Public endpoint radio button. Public endpoint selected

  8. Click the link + Add current client IP address. Added client IP

    Note

    Azure PostgreSQL server communicates over port 5432. If you are trying to connect from within a corporate network, outbound traffic over port 5432 may not be allowed by your network's firewall. If so, you cannot connect to your Hyperscale (Citus) cluster unless your IT department opens port 5432.

  9. Click Review + create and then Create to provision the server. Provisioning takes a few minutes.

  10. The page will redirect to monitor deployment. When the live status changes from Your deployment is underway to Your deployment is complete, click the Outputs menu item on the left of the page.

  11. The outputs page will contain a coordinator hostname with a button next to it to copy the value to the clipboard. Record this information for later use.

Connect to the database using psql

When you create your Azure Database for PostgreSQL server, a default database named citus is created. To connect to your database server, you need a connection string and the admin password.

  1. Obtain the connection string. In the server group page click the Connection strings menu item. (It's under Settings.) Find the string marked psql. It will be of the form:

    psql "host=hostname.postgres.database.azure.com port=5432 dbname=citus user=citus password={your_password} sslmode=require"
    

    Copy the string. You will need to replace "{your_password}" with the administrative password you chose earlier. The system doesn't store your plaintext password and so can't display it for you in the connection string.

  2. Open a terminal window on your local computer.

  3. At the prompt, connect to your Azure Database for PostgreSQL server with the psql utility. Pass your connection string in quotes, being sure it contains your password:

    psql "host=..."
    

    For example, the following command connects to the coordinator node of the server group mydemoserver:

    psql "host=mydemoserver-c.postgres.database.azure.com port=5432 dbname=citus user=citus password={your_password} sslmode=require"
    

Create and distribute tables

Once connected to the hyperscale coordinator node using psql, you can complete some basic tasks.

Within Hyperscale servers there are three types of tables:

  • Distributed or sharded tables (spread out to help scaling for performance and parallelization)
  • Reference tables (multiple copies maintained)
  • Local tables (often used for internal admin tables)

In this quickstart, we'll primarily focus on distributed tables and getting familiar with them.

The data model we're going to work with is simple: user and event data from GitHub. Events include fork creation, git commits related to an organization, and more.

Once you've connected via psql, let's create our tables. In the psql console run:

CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    user_id bigint,
    org jsonb,
    created_at timestamp
);

CREATE TABLE github_users
(
    user_id bigint,
    url text,
    login text,
    avatar_url text,
    gravatar_id text,
    display_login text
);

The payload field of github_events has a JSONB datatype. JSONB is the JSON datatype in binary form in Postgres. The datatype makes it easy to store a flexible schema in a single column.

Postgres can create a GIN index on this type, which will index every key and value within it. With an index, it becomes fast and easy to query the payload with various conditions. Let's go ahead and create a couple of indexes before we load our data. In psql:

CREATE INDEX event_type_index ON github_events (event_type);
CREATE INDEX payload_index ON github_events USING GIN (payload jsonb_path_ops);

Next we’ll take those Postgres tables on the coordinator node and tell Hyperscale to shard them across the workers. To do so, we’ll run a query for each table specifying the key to shard it on. In the current example we’ll shard both the events and users table on user_id:

SELECT create_distributed_table('github_events', 'user_id');
SELECT create_distributed_table('github_users', 'user_id');

We're ready to load data. In psql still, shell out to download the files:

\! curl -O https://examples.citusdata.com/users.csv
\! curl -O https://examples.citusdata.com/events.csv

Next, load the data from the files into the distributed tables:

SET CLIENT_ENCODING TO 'utf8';

\copy github_events from 'events.csv' WITH CSV
\copy github_users from 'users.csv' WITH CSV

Run queries

Now it's time for the fun part, actually running some queries. Let's start with a simple count (*) to see how much data we loaded:

SELECT count(*) from github_events;

That worked nicely. We'll come back to that sort of aggregation in a bit, but for now let’s look at a few other queries. Within the JSONB payload column there's a good bit of data, but it varies based on event type. PushEvent events contain a size that includes the number of distinct commits for the push. We can use it to find the total number of commits per hour:

SELECT date_trunc('hour', created_at) AS hour,
       sum((payload->>'distinct_size')::int) AS num_commits
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY hour
ORDER BY hour;

So far the queries have involved the github_events exclusively, but we can combine this information with github_users. Since we sharded both users and events on the same identifier (user_id), the rows of both tables with matching user IDs will be colocated on the same database nodes and can easily be joined.

If we join on user_id, Hyperscale can push the join execution down into shards for execution in parallel on worker nodes. For example, let's find the users who created the greatest number of repositories:

SELECT gu.login, count(*)
  FROM github_events ge
  JOIN github_users gu
    ON ge.user_id = gu.user_id
 WHERE ge.event_type = 'CreateEvent'
   AND ge.payload @> '{"ref_type": "repository"}'
 GROUP BY gu.login
 ORDER BY count(*) DESC;

Clean up resources

In the preceding steps, you created Azure resources in a server group. If you don't expect to need these resources in the future, delete the server group. Press the Delete button in the Overview page for your server group. When prompted on a pop-up page, confirm the name of the server group and click the final Delete button.

Next steps

In this quickstart, you learned how to provision a Hyperscale (Citus) server group. You connected to it with psql, created a schema, and distributed data.

Next, follow a tutorial to build scalable multi-tenant applications.