Azure Databricks Lakehouse Monitoring queries

Sudipta Goswami 0 Reputation points
2024-04-28T11:50:10.0333333+00:00

Hi Team,

I was exploring Azure Databricks Lakehouse Monitoring and have a few questions about it:

  1. When I run a "refresh metrics" operation, whether on an automated schedule or manually, which compute does it run on? There is no mechanism to point it at a specific compute. Since the refresh option is available in the dashboard, does it run on some serverless warehouse?
  2. When I explore the profile metrics and drift tables, I can see a few cluster IDs, but these cluster IDs are not traceable; their entries are not even present in the warehouse events system tables. So what are these cluster IDs?
  3. I have run a refresh on tables with 1K, 50K, and 100K records. The refresh takes somewhere between 7 and 13 minutes. Is this expected behavior, or can the execution times be improved? If so, what factors do they depend on?
  4. Is there any place where an audit log is kept for each refresh run, so that I can get historical values and not only the current and baseline tables?
  5. How does billing work for the monitoring feature?

Can you please help me with these? We are planning to use the Lakehouse Monitoring feature extensively, but before we do, we need to be clear on these aspects.

Thanks & Regards,

Sudipta Goswami,


1 answer

  1. Vinodh247-1375 11,301 Reputation points
    2024-04-28T13:48:31.5533333+00:00

    Hi Sudipta Goswami

    Thanks for reaching out to Microsoft Q&A.

    When I run a "refresh metrics" operation, whether on an automated schedule or manually, which compute does it run on? There is no mechanism to point it at a specific compute. Since the refresh option is available in the dashboard, does it run on some serverless warehouse?

    A metrics refresh doesn't let you explicitly specify a particular compute. It operates within the context of the cluster or compute resource from which the associated dashboard or metric visualization is being accessed. The refresh does not run in a separate serverless warehouse; it leverages the compute resources already available to the cluster where the dashboard is displayed.
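    For completeness, a refresh can also be triggered programmatically rather than from the dashboard. Below is a minimal sketch assuming the databricks-lakehouse-monitoring Python client is installed and a monitor already exists on the table; "main.sales.orders" is a placeholder table name for illustration:

    ```python
    # Minimal sketch: trigger a metrics refresh from a notebook.
    # Assumes the databricks-lakehouse-monitoring package is installed and a
    # monitor is already attached to the (placeholder) table below.
    import databricks.lakehouse_monitoring as lm

    refresh = lm.run_refresh(table_name="main.sales.orders")
    print(refresh.refresh_id, refresh.state)  # e.g. PENDING / RUNNING
    ```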

    When I explore the profile metrics and drift tables, I can see a few cluster IDs, but these cluster IDs are not traceable; their entries are not even present in the warehouse events system tables. So what are these cluster IDs?

    You won't be able to trace them. These cluster IDs are not directly traceable to specific instances or events outside the context of the ADB workspace. The cluster IDs you observe in the profile metrics correspond to the clusters that were active during the data profiling or drift analysis; they represent the specific compute resources (clusters) used when generating the metrics or analyzing data changes.
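    If you want to see exactly which ID-like columns the monitor writes, you can inspect the generated output tables directly. A quick sketch, where the profile metrics table name below is a placeholder for your monitor's actual output table:

    ```python
    # Inspect the monitor's profile metrics output table to see which
    # identifier columns it actually contains (table name is a placeholder).
    profile = spark.table("main.sales.orders_profile_metrics")
    profile.printSchema()                   # list the available columns
    profile.limit(10).show(truncate=False)  # sample rows to examine the IDs
    ```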

    I have run a refresh on tables with 1K, 50K, and 100K records. The refresh takes somewhere between 7 and 13 minutes. Is this expected behavior, or can the execution times be improved? If so, what factors do they depend on?

    It depends on various factors, such as data volume, transformation complexity (for example, too many joins), and the configured cluster (node count and instance types). I can't say whether the times you are seeing are reasonable without knowing your configuration and environment, but for optimization you can consider the following common points (see the sketch after this list for measuring refresh durations):

    • Using appropriately sized clusters.
    • Leveraging caching where possible.
    • Optimizing your data transformations.
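    To gauge how refresh time scales with your data, you can list past refreshes and compute their durations. A hedged sketch, again assuming the databricks-lakehouse-monitoring client and a placeholder table name; the timestamp field names follow the refresh-info object but are worth verifying against your library version:

    ```python
    # Compare refresh durations across runs to see how they scale with volume.
    import databricks.lakehouse_monitoring as lm

    for r in lm.list_refreshes(table_name="main.sales.orders"):
        # end_time_ms is only set once the run has finished.
        duration_s = (r.end_time_ms - r.start_time_ms) / 1000 if r.end_time_ms else None
        print(r.refresh_id, r.state, duration_s)
    ```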

    Is there any place where an audit log is kept for each refresh run, so that I can get historical values and not only the current and baseline tables?

    The auditing feature primarily focuses on tracking user activity within the Databricks workspace, but the audit logs do capture details that include refresh runs. The audit log system table schema includes columns such as event_time, event_date, workspace_id, source_ip_address, user_agent, and more. Sending logs to Azure Log Analytics (after custom configuration) might also let you gather logs for each refresh run; please give that a try.
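    As a concrete starting point, if system tables are enabled in your account you can query the audit table directly. The service-name filter below is an assumption to adapt once you see which events your refreshes actually emit:

    ```python
    # Search the audit system table for monitoring-related events.
    # Assumes system tables are enabled; the ILIKE filter is an assumption
    # to refine after inspecting real service_name/action_name values.
    events = spark.sql("""
        SELECT event_time, event_date, workspace_id, user_identity, action_name
        FROM system.access.audit
        WHERE service_name ILIKE '%monitor%'
        ORDER BY event_time DESC
        LIMIT 100
    """)
    events.show(truncate=False)
    ```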

    How does billing work for the monitoring feature?

    Billing for the monitoring aspects (such as metrics, profiling, and drift analysis) is included in your overall Azure Databricks usage; there is no separate billing specifically for monitoring features. However, if you have specific requirements for extended retention or additional logging, consider the associated costs under your Azure subscription.
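    Since monitoring consumption is not broken out as its own line item, one way to see where the DBUs land is to aggregate the billing system table by SKU. A sketch assuming billing system tables are enabled for your account:

    ```python
    # Aggregate DBU usage by SKU and day to see where consumption lands.
    # Assumes the system.billing.usage system table is enabled.
    usage = spark.sql("""
        SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        GROUP BY usage_date, sku_name
        ORDER BY usage_date DESC, dbus DESC
    """)
    usage.show(truncate=False)
    ```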

    Hope this answers all of your questions.

    Please 'Upvote' (Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.
