Quickstart: Search Azure Managed Instance for Apache Cassandra using Lucene Index (Preview)

Article
05/15/2024

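Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide full-text search as well as free multivariable, geospatial, and bitemporal search. It does this through an Apache Lucene based implementation of Cassandra secondary indexes, in which each node of the cluster indexes its own data. This quickstart demonstrates how to search Azure Managed Instance for Apache Cassandra using Lucene Index.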

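Important

Lucene Index is in public preview. This feature is provided without a service level agreement and is not recommended for production workloads. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.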

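Warning

A limitation of the Lucene Index plugin is that cross-partition searches cannot be executed solely in the index; Cassandra must send the query to each node. This can cause performance problems (memory and CPU load) for cross-partition searches, which in turn may affect steady-state workloads.

If your search requirements are significant, we recommend deploying a dedicated secondary data center that is used only for searches, with a minimal number of nodes, each having a high number of cores (at least 16). Then configure the keyspaces in your primary (operational) data center to replicate data to your secondary (search) data center.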
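
As a rough sketch of that replication setup, assume a keyspace named demo (like the one created later in this quickstart), an operational data center named datacenter-1, and a search data center with the hypothetical name datacenter-search. The data center names and replication factors below are placeholders, not values taken from the service; substitute the ones from your own cluster.

```cql
-- Assumption: 'datacenter-search' is a hypothetical name for the dedicated search data center.
-- Replicate the keyspace to both the operational and the search data center.
ALTER KEYSPACE demo
WITH REPLICATION = {
   'class': 'NetworkTopologyStrategy',
   'datacenter-1': 3,
   'datacenter-search': 3
};
```

Search clients would then typically connect to the search data center and use a data-center-local consistency level, so that search load stays off the operational nodes.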

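Prerequisites

If you don't have an Azure subscription, create a free account before you begin.
Deploy an Azure Managed Instance for Apache Cassandra cluster. You can do this through the portal; Lucene indexes are enabled by default when clusters are deployed from the portal. To add Lucene indexes to an existing cluster, click Update in the portal overview blade, select Cassandra Lucene Index, and then click Update to deploy.
Connect to your cluster from CQLSH.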

Create data with Lucene Index

In your CQLSH command window, create a keyspace and table as follows:

   CREATE KEYSPACE demo
   WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'datacenter-1': 3};
   USE demo;
   CREATE TABLE tweets (
      id INT PRIMARY KEY,
      user TEXT,
      body TEXT,
      time TIMESTAMP,
      latitude FLOAT,
      longitude FLOAT
   );

Now create a custom secondary index on the table using Lucene Index:

   CREATE CUSTOM INDEX tweets_index ON tweets ()
   USING 'com.stratio.cassandra.lucene.Index'
   WITH OPTIONS = {
      'refresh_seconds': '1',
      'schema': '{
         fields: {
            id: {type: "integer"},
            user: {type: "string"},
            body: {type: "text", analyzer: "english"},
            time: {type: "date", pattern: "yyyy/MM/dd"},
            place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
         }
      }'
   };

Insert the following sample tweets:

    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (1,'theo','Make money fast, 5 easy tips', '2023-04-01T11:21:59.001+0000', 0.0, 0.0);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (2,'theo','Click my link, like my stuff!', '2023-04-01T11:21:59.001+0000', 0.0, 0.0);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (3,'quetzal','Click my link, like my stuff!', '2023-04-02T11:21:59.001+0000', 0.0, 0.0);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (4,'quetzal','Click my link, like my stuff!', '2023-04-01T11:21:59.001+0000', 40.3930, -3.7328);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (5,'quetzal','Click my link, like my stuff!', '2023-04-01T11:21:59.001+0000', 40.3930, -3.7329);

Control read consistency

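The index you created earlier indexes all the columns in the table that have the specified types, and the read index used for searching is refreshed once per second. Alternatively, you can explicitly refresh all the index shards by running an empty search with a consistency of ALL: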
```
    CONSISTENCY ALL
    SELECT * FROM tweets WHERE expr(tweets_index, '{refresh:true}');
    CONSISTENCY QUORUM
```

Now you can search for tweets within a certain date range:

    SELECT * FROM tweets WHERE expr(tweets_index, '{filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"}}');
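
The range search in the Stratio Lucene plugin also accepts optional include_lower and include_upper flags. As far as the plugin documentation describes, both default to false, so the bounds are exclusive unless stated otherwise. A minimal sketch of an inclusive variant of the search above, reusing the same index and field names:

```cql
-- Same date range as above, but with both bounds treated as inclusive.
SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {
      type: "range", field: "time",
      lower: "2023/03/01", upper: "2023/05/01",
      include_lower: true, include_upper: true
   }
}');
```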

You can also run this search while forcing an explicit refresh of the involved index shards:

    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
       refresh: true
    }') LIMIT 100;

Search data

To search for the 100 most relevant tweets whose body field contains the phrase "Click my link" within a certain date range:

    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1}
    }') LIMIT 100;
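
In the search above, filter restricts the candidate rows without contributing to relevance scoring, while query both restricts and scores them, which is what orders the results by relevance. As a hedged variation, a match query can be used when a single analyzed term is enough rather than a whole phrase. The term money below is only an illustration, chosen because it appears in one of the sample tweets:

```cql
-- Relevance search on a single term instead of a phrase.
SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
   query: {type: "match", field: "body", value: "money"}
}') LIMIT 100;
```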

To refine the search so that it returns only tweets written by users whose names start with "q":

    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1}
    }') LIMIT 100;
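
Because user is mapped as a string field, which is indexed without text analysis, a match filter can select an exact user name instead of a prefix. A minimal sketch using the sample user quetzal:

```cql
-- Exact match on the user field instead of a prefix.
SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
      {type: "match", field: "user", value: "quetzal"}
   ],
   query: {type: "phrase", field: "body", value: "Click my link", slop: 1}
}') LIMIT 100;
```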

To get the 100 most recent filtered results, you can use the sort option:

    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1},
       sort: {field: "time", reverse: true}
    }') LIMIT 100;

The previous search can be restricted to tweets created close to a geographical position:

    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"},
          {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1},
       sort: {field: "time", reverse: true}
    }') LIMIT 100;
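
When a rectangular area is more convenient than a radius, the Stratio Lucene plugin also documents a geo_bbox search over geo_point fields. The sketch below assumes the parameter names min_latitude, max_latitude, min_longitude, and max_longitude from that documentation; the coordinates are illustrative values around the sample position, not values from this quickstart:

```cql
-- Bounding-box filter on the place field; coordinates are illustrative.
SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {
      type: "geo_bbox", field: "place",
      min_latitude: 40.38, max_latitude: 40.40,
      min_longitude: -3.74, max_longitude: -3.72
   }
}') LIMIT 100;
```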

You can also sort the results by distance to a geographical position:

    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"},
          {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1},
       sort: [
          {field: "time", reverse: true},
          {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328}
       ]
    }') LIMIT 100;

Next steps

In this quickstart, you learned how to search your Azure Managed Instance for Apache Cassandra cluster using Lucene Search. You can now start working with the cluster:

Deploy a Managed Apache Spark Cluster with Azure Databricks
