教學課程:使用適用於 PostgreSQL 的 Azure 資料庫 – Hyperscale (Citus) (預覽) 設計即時分析儀表板Tutorial: Design a real-time analytics dashboard by using Azure Database for PostgreSQL – Hyperscale (Citus) (preview)

在本教學課程中,您可以使用適用於 PostgreSQL 的 Azure 資料庫 – Hyperscale (Citus) (預覽) 了解如何:In this tutorial, you use Azure Database for PostgreSQL - Hyperscale (Citus) (preview) to learn how to:

  • 建立 Hyperscale (Citus) 伺服器群組Create a Hyperscale (Citus) server group
  • 使用 psql 公用程式建立結構描述Use psql utility to create a schema
  • 進行跨節點的資料表分區化Shard tables across nodes
  • 產生範例資料Generate sample data
  • 執行彙總Perform rollups
  • 查詢原始和彙總資料Query raw and aggregated data
  • 使資料過期Expire data

必要條件Prerequisites

如果您沒有 Azure 訂用帳戶,請在開始前建立免費帳戶If you don't have an Azure subscription, create a free account before you begin.

登入 Azure 入口網站Sign in to the Azure portal

登入 Azure 入口網站Sign in to the Azure portal.

建立適用於 PostgreSQL 的 Azure 資料庫 – Hyperscale (Citus)Create an Azure Database for PostgreSQL - Hyperscale (Citus)

請依照下列步驟來建立「適用於 PostgreSQL 的 Azure 資料庫」伺服器:Follow these steps to create an Azure Database for PostgreSQL server:

  1. 按一下 Azure 入口網站左上角的 [建立資源] 。Click Create a resource in the upper left-hand corner of the Azure portal.

  2. 從 [新增] 頁面中選取 [資料庫] ,然後從 [資料庫] 頁面中選取 [適用於 PostgreSQL 的 Azure 資料庫] 。Select Databases from the New page, and select Azure Database for PostgreSQL from the Databases page.

  3. 針對部署選項,按一下 [Hyperscale (Citus) 伺服器群組 - 預覽] 下方的 [建立] 按鈕。For the deployment option, click the Create button under Hyperscale (Citus) server group - PREVIEW.

  4. 在新伺服器詳細資料表單中填寫下列資訊︰Fill out the new server details form with the following information:

    • 資源群組:按一下此欄位的文字方塊下方的 [新建] 連結。Resource group: click the Create new link below the text box for this field. 輸入名稱,例如 myresourcegroupEnter a name such as myresourcegroup.
    • 伺服器群組名稱:輸入新伺服器群組的唯一名稱,該名稱也會用於伺服器子網域。Server group name: enter a unique name for the new server group, which will also be used for a server subdomain.
    • 系統管理員使用者名稱:其值目前必須是 citus,且無法變更。Admin username: currently required to be the value citus, and can't be changed.
    • 密碼:長度必須至少有八個字元,且包含來自下列三種類別的字元:英文大寫字母、英文小寫字母、數字 (0-9) 和非英數字元 (!、$、#、% 等等)。Password: must be at least eight characters long and contain characters from three of the following categories – English uppercase letters, English lowercase letters, numbers (0-9), and non-alphanumeric characters (!, $, #, %, and so on.)
    • 位置:使用最接近使用者的位置,讓他們能以最快速度存取資料。Location: use the location that is closest to your users to give them the fastest access to the data.

    重要

    必須有您在此處指定的伺服器系統管理員密碼,才能登入伺服器和其資料庫。The server admin password that you specify here is required to log in to the server and its databases. 請記住或記錄此資訊,以供稍後使用。Remember or record this information for later use.

  5. 按一下 [設定伺服器群組] 。Click Configure server group. 讓該區段中的設定維持不變,然後按一下 [儲存] 。Leave the settings in that section unchanged and click Save.

  6. 按一下 [檢閱 + 建立] ,然後按一下 [建立] 以佈建伺服器。Click Review + create and then Create to provision the server. 佈建需要幾分鐘的時間。Provisioning takes a few minutes.

  7. 此頁面會重新導向以監視部署。The page will redirect to monitor deployment. 當即時狀態從 [您的部署正在進行中] 變更為 [您的部署已完成] 時,按一下頁面左側的 [輸出] 功能表項目。When the live status changes from Your deployment is underway to Your deployment is complete, click the Outputs menu item on the left of the page.

  8. [輸出] 頁面將包含協調器主機名稱,且旁邊會有按鈕可將此值複製到剪貼簿。The outputs page will contain a coordinator hostname with a button next to it to copy the value to the clipboard. 請記錄此資訊,以供後續使用。Record this information for later use.

設定伺服器層級防火牆規則Configure a server-level firewall rule

「適用於 PostgreSQL 的 Azure 資料庫 – Hyperscale (Citus) (預覽)」服務會使用伺服器層級的防火牆。The Azure Database for PostgreSQL – Hyperscale (Citus) (preview) service uses a firewall at the server-level. 根據預設,防火牆會防止所有外部應用程式和工具連線到協調器節點及其中的任何資料庫。By default, the firewall prevents all external applications and tools from connecting to the coordinator node and any databases inside. 我們必須新增規則,以針對特定 IP 位址範圍開啟防火牆。We must add a rule to open the firewall for a specific IP address range.

  1. 在您先前複製協調器節點主機名稱的 [輸出] 區段中,按一下 [上一頁] 返回 [概觀] 功能表項目。From the Outputs section where you previously copied the coordinator node hostname, click back into the Overview menu item.

  2. 尋找部署的伺服器群組名稱,然後對其按一下。Find the name of your deployment's server group and click it. (伺服器群組名稱將「不會」 有尾碼。(The server group name will not have a suffix. 舉例來說,名稱結尾為 "-c"、"-w0" 或 "-w1" 的項目不是伺服器群組)。Items with names ending in, for example, "-c", "-w0", or "-w1" are not the server group.)

  3. 按一下左側功能表中 [安全性] 下方的 [防火牆] 。Click Firewall under Security in the left-hand menu.

  4. 按一下[+ 為目前的用戶端 IP 位址新增防火牆規則] 連結。Click the link + Add firewall rule for current client IP address.

  5. 最後,按一下 [儲存] 按鈕。Finally, click the Save button.

    注意

    Azure PostgreSQL 伺服器會透過連接埠 5432 進行通訊。Azure PostgreSQL server communicates over port 5432. 如果您嘗試從公司網路內進行連線,您網路的防火牆可能不允許透過連接埠 5432 的輸出流量。If you are trying to connect from within a corporate network, outbound traffic over port 5432 may not be allowed by your network's firewall. 若情況如此,除非 IT 部門開啟連接埠 5432,否則您無法連線至 Azure SQL Database 伺服器。If so, you cannot connect to your Azure SQL Database server unless your IT department opens port 5432.

使用 psql 連線到資料庫Connect to the database using psql

當您建立 Azure Database for PostgreSQL 伺服器時,系統會建立名為 citus 的預設資料庫。When you create your Azure Database for PostgreSQL server, a default database named citus is created. 若要連線到資料庫伺服器,您需要連接字串和系統管理員密碼。To connect to your database server, you need a connection string and the admin password.

  1. 取得連接字串。Obtain the connection string. 在 [伺服器群組] 頁面中,按一下 [連接字串] 功能表項目。In the server group page click the Connection strings menu item. (其位於 [設定] 底下)。尋找標示為 C++ (libpq) 的字串。(It's under Settings.) Find the string marked C++ (libpq). 其格式將會是:It will be of the form:

    host=hostname.postgres.database.azure.com port=5432 dbname=citus user=citus password={your_password} sslmode=require
    

    複製字串。Copy the string. 您必須將 "{your_password}" 取代為您先前選擇的系統管理密碼。You will need to replace "{your_password}" with the administrative password you chose earlier. 系統不會儲存您的純文字密碼,因此無法在連接字串中對您顯示。The system doesn't store your plaintext password and so can't display it for you in the connection string.

  2. 在本機電腦上開啟終端機視窗。Open a terminal window on your local computer.

  3. 在提示字元處,使用 psql 公用程式來連線到「適用於 PostgreSQL 的 Azure 資料庫」伺服器。At the prompt, connect to your Azure Database for PostgreSQL server with the psql utility. 以引號括住來傳遞連接字串,並確定其包含密碼:Pass your connection string in quotes, being sure it contains your password:

    psql "{connection_string}"
    

    例如,下列命令會連線至伺服器群組 mydemoserver 的協調器節點:For example, the following command connects to the coordinator node of the server group mydemoserver:

    psql "host=mydemoserver-c.postgres.database.azure.com port=5432 dbname=citus user=citus password={your_password} sslmode=require"
    

使用 psql 公用程式建立結構描述Use psql utility to create a schema

使用 psql 連線到適用於 PostgreSQL 的 Azure 資料庫 - Hyperscale (Citus) (預覽) 後,您可以完成一些基本工作。Once connected to the Azure Database for PostgreSQL - Hyperscale (Citus) (preview) using psql, you can complete some basic tasks. 本教學課程會引導您擷取 Web 分析中的流量資料,然後彙總資料,以便根據該資料提供即時儀表板。This tutorial walks you through ingesting traffic data from web analytics, then rolling up the data to provide real-time dashboards based on that data.

我們會建立一個資料表,該資料表將取用所有未經處理的 Web 流量資料。Let's create a table that will consume all of our raw web traffic data. 在 psql 終端機中執行下列命令:Run the following commands in the psql terminal:

CREATE TABLE http_request (
  site_id INT,
  ingest_time TIMESTAMPTZ DEFAULT now(),

  url TEXT,
  request_country TEXT,
  ip_address TEXT,

  status_code INT,
  response_time_msec INT
);

我們也會建立一個可保存每分鐘彙總的資料表,以及一個可維護最後一個彙總位置的資料表。We're also going to create a table that will hold our per-minute aggregates, and a table that maintains the position of our last rollup. 也在 psql 中執行下列命令:Run the following commands in psql as well:

CREATE TABLE http_request_1min (
  site_id INT,
  ingest_time TIMESTAMPTZ, -- which minute this row represents

  error_count INT,
  success_count INT,
  request_count INT,
  average_response_time_msec INT,
  CHECK (request_count = error_count + success_count),
  CHECK (ingest_time = date_trunc('minute', ingest_time))
);

CREATE INDEX http_request_1min_idx ON http_request_1min (site_id, ingest_time);

CREATE TABLE latest_rollup (
  minute timestamptz PRIMARY KEY,

  CHECK (minute = date_trunc('minute', minute))
);

您現在可以使用此 psql 命令來查看資料表清單中新建立的資料表:You can see the newly created tables in the list of tables now with this psql command:

\dt

進行跨節點的資料表分區化Shard tables across nodes

超大規模部署會根據使用者指定的資料行值,在不同的節點上儲存資料表資料列。A hyperscale deployment stores table rows on different nodes based on the value of a user-designated column. 此「散發資料行」表示跨節點的資料分區化方式。This "distribution column" marks how data is sharded across nodes.

讓我們將散發資料行設定為 site_id,即分區索引鍵。Let's set the distribution column to be site_id, the shard key. 在 psql 中,執行下列函式:In psql, run these functions:

SELECT create_distributed_table('http_request',      'site_id');
SELECT create_distributed_table('http_request_1min', 'site_id');

產生範例資料Generate sample data

我們的伺服器群組現在應已準備好擷取一些資料。Now our server group should be ready to ingest some data. 我們可以在本機從我們的 psql 連線執行下列命令,以持續插入資料。We can run the following locally from our psql connection to continuously insert data.

DO $$
  BEGIN LOOP
    INSERT INTO http_request (
      site_id, ingest_time, url, request_country,
      ip_address, status_code, response_time_msec
    ) VALUES (
      trunc(random()*32), clock_timestamp(),
      concat('http://example.com/', md5(random()::text)),
      ('{China,India,USA,Indonesia}'::text[])[ceil(random()*4)],
      concat(
        trunc(random()*250 + 2), '.',
        trunc(random()*250 + 2), '.',
        trunc(random()*250 + 2), '.',
        trunc(random()*250 + 2)
      )::inet,
      ('{200,404}'::int[])[ceil(random()*2)],
      5+trunc(random()*150)
    );
    COMMIT;
    PERFORM pg_sleep(random() * 0.25);
  END LOOP;
END $$;

查詢會每一秒插入大約八個資料列。The query inserts approximately eight rows every second. 如散發資料行 site_id 所引導,資料列會儲存在不同的背景工作角色節點上。The rows are stored on different worker nodes as directed by the distribution column, site_id.

注意

讓資料產生查詢保持執行,並針對本教學課程中的剩餘命令,開啟第二個 psql 連線。Leave the data generation query running, and open a second psql connection for the remaining commands in this tutorial.

查詢Query

超大規模裝載選項可讓多個節點平行處理查詢以加快速度。The hyperscale hosting option allows multiple nodes to process queries in parallel for speed. 例如,資料庫會在背景工作角色節點上計算 SUM 和 COUNT 等彙總,並將結果結合成最終的答案。For instance, the database calculates aggregates like SUM and COUNT on worker nodes, and combines the results into a final answer.

以下的查詢可計算每分鐘的 Web 要求以及一些統計資料。Here's a query to count web requests per minute along with a few statistics. 嘗試在 psql 中執行該查詢並觀察結果。Try running it in psql and observe the results.

SELECT
  site_id,
  date_trunc('minute', ingest_time) as minute,
  COUNT(1) AS request_count,
  SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
  SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
  SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
FROM http_request
WHERE date_trunc('minute', ingest_time) > now() - '5 minutes'::interval
GROUP BY site_id, minute
ORDER BY minute ASC;

彙總資料Rolling up data

上述查詢在早期階段可正常運作,但其效能會隨著您的資料擴充而下降。The previous query works fine in the early stages, but its performance degrades as your data scales. 即使使用分散式處理,預先計算資料的速度會比重複重新計算還要快。Even with distributed processing, it's faster to pre-compute the data than to recalculate it repeatedly.

我們可以定期將未經處理的資料彙總到彙總資料表中,以確保儀表板保持快速。We can ensure our dashboard stays fast by regularly rolling up the raw data into an aggregate table. 您可以試驗彙總持續時間。You can experiment with the aggregation duration. 我們已使用每分鐘彙總資料表,但您可以改為將資料分成 5、15 或 60 分鐘。We used a per-minute aggregation table, but you could break data into 5, 15, or 60 minutes instead.

為了更輕鬆地執行此彙總,我們將其放入 plpgsql 函式。To run this roll-up more easily, we're going to put it into a plpgsql function. 在 psql 中執行下列命令以建立 rollup_http_request 函式。Run these commands in psql to create the rollup_http_request function.

-- initialize to a time long ago
INSERT INTO latest_rollup VALUES ('10-10-1901');

-- function to do the rollup
CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
  curr_rollup_time timestamptz := date_trunc('minute', now());
  last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
  INSERT INTO http_request_1min (
    site_id, ingest_time, request_count,
    success_count, error_count, average_response_time_msec
  ) SELECT
    site_id,
    date_trunc('minute', ingest_time),
    COUNT(1) as request_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
    SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
  FROM http_request
  -- roll up only data new since last_rollup_time
  WHERE date_trunc('minute', ingest_time) <@
          tstzrange(last_rollup_time, curr_rollup_time, '(]')
  GROUP BY 1, 2;

  -- update the value in latest_rollup so that next time we run the
  -- rollup it will operate on data newer than curr_rollup_time
  UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;

備妥我們的函式,加以執行來彙總資料:With our function in place, execute it to roll up the data:

SELECT rollup_http_request();

然後將資料放在預先彙總的表單中,我們即可查詢彙總資料表,以取得與稍早相同的報告。And with our data in a pre-aggregated form we can query the rollup table to get the same report as earlier. 請執行下列查詢:Run the following query:

SELECT site_id, ingest_time as minute, request_count,
       success_count, error_count, average_response_time_msec
  FROM http_request_1min
 WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;

使舊資料過期Expiring old data

彙總讓查詢速度變快,但我們仍需要使舊資料過期,以避免不受控制的儲存體成本。The rollups make queries faster, but we still need to expire old data to avoid unbounded storage costs. 決定您要針對每種細微性保留資料多久,並使用標準查詢來刪除過期的資料。Decide how long you’d like to keep data for each granularity, and use standard queries to delete expired data. 在下列範例中,我們已決定將未經處理的資料保留一天,並將每分鐘的彙總保留一個月:In the following example, we decided to keep raw data for one day, and per-minute aggregations for one month:

DELETE FROM http_request WHERE ingest_time < now() - interval '1 day';
DELETE FROM http_request_1min WHERE ingest_time < now() - interval '1 month';

在生產環境中,您可將這些查詢包裝在函式中,並且在 cron 作業中每分鐘呼叫該函式。In production, you could wrap these queries in a function and call it every minute in a cron job.

清除資源Clean up resources

在前述步驟中,您在伺服器群組中建立了 Azure 資源。In the preceding steps, you created Azure resources in a server group. 如果您認為未來不需要這些資源,請刪除伺服器群組。If you don't expect to need these resources in the future, delete the server group. 在您伺服器群組的 [概觀] 頁面中,按一下 [刪除] 按鈕。Press the Delete button in the Overview page for your server group. 當快顯頁面上出現提示時,請確認伺服器群組的名稱,然後按一下最後一個 [刪除] 按鈕。When prompted on a pop-up page, confirm the name of the server group and click the final Delete button.

後續步驟Next steps

在本教學課程中,您已了解如何佈建 Hyperscale (Citus) 伺服器群組。In this tutorial, you learned how to provision a Hyperscale (Citus) server group. 您已使用 psql 連線到該群組、建立結構描述,以及分散資料。You connected to it with psql, created a schema, and distributed data. 您已了解如何查詢未經處理格式的資料、定期彙總該資料、查詢彙總的資料表,以及使舊資料過期。You learned to query data both in the raw form, regularly aggregate that data, query the aggregated tables, and expire old data.

接下來,了解超大規模的概念。Next, learn about the concepts of hyperscale.