Data observability

Data observability is your ability to understand the health of your data and data systems by collecting and correlating events across areas like data, storage, compute and processing pipelines.

Building and operating a resilient, scalable, and performant data platform requires adopting proven DevOps-inspired processes across teams that represent functional domains. Data observability enables business owners, DevOps engineers, data architects, data engineers, and site reliability engineers to automate issue detection, prediction, and prevention, and to avoid downtime that can break production analytics and AI.

Key areas of data observability

Most data platforms operate across these key areas of data observability:

  • Data platform service monitoring
  • Data pipeline performance monitoring
  • Data quality monitoring
  • Data lineage
  • Data discovery

End-to-end data observability involves not just capturing events and measuring metrics across all these components but also correlating those events and metrics. This provides a comprehensive view of your enterprise data environment's health and reliability.

This article describes each component and how it contributes to achieving data observability.

Data platform service monitoring

Foundational infrastructure for an enterprise data platform can include a mix of both provider-managed and self-managed infrastructure to enable storage and computing. DevOps engineers or infrastructure engineers need to monitor this foundational infrastructure so they can identify and resolve system outages and performance bottlenecks that affect modern data and analytics pipelines.

Monitoring data from databases and networking layers can help you improve processing throughput and minimize network latency. Teams need tools that they can use to capture metrics, send notifications, track and remediate incidents, and correlate infrastructure events with data and analytics issues.

We recommend that your teams incorporate observability-as-code into your infrastructure-as-code layer so monitoring instrumentation is enabled out of the box as soon as a resource is created. Most Azure services offer built-in instrumentation for key resource metrics and diagnostic data.
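As a hedged illustration, the following Python sketch shows one way to bake diagnostic-settings enablement into a provisioning step by using the azure-mgmt-monitor package. The subscription, resource, and workspace IDs are placeholders, the setting name is arbitrary, and the "allLogs" category group assumes a recent package and API version.

```python
# Sketch: enable platform diagnostics as part of resource provisioning
# (observability-as-code). Assumes the azure-identity and azure-mgmt-monitor
# packages; the IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    DiagnosticSettingsResource,
    LogSettings,
    MetricSettings,
)

SUBSCRIPTION_ID = "<subscription-id>"                    # placeholder
RESOURCE_ID = "<resource-id-of-newly-created-service>"   # placeholder
WORKSPACE_ID = "<log-analytics-workspace-resource-id>"   # placeholder


def enable_diagnostics(resource_id: str, workspace_id: str) -> None:
    """Send all resource logs and metrics to a Log Analytics workspace."""
    client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    settings = DiagnosticSettingsResource(
        workspace_id=workspace_id,
        logs=[LogSettings(category_group="allLogs", enabled=True)],
        metrics=[MetricSettings(category="AllMetrics", enabled=True)],
    )
    client.diagnostic_settings.create_or_update(
        resource_id, "observability-as-code", settings
    )


if __name__ == "__main__":
    # Call this immediately after your infrastructure-as-code step creates the resource.
    enable_diagnostics(RESOURCE_ID, WORKSPACE_ID)
```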

Data pipeline performance monitoring

Increasingly complex data pipelines containing multiple stages and dependencies now generate massive amounts of monitoring data. This data includes events, metrics, and logs. You can optimize your data pipeline performance by collecting and analyzing monitoring data.

Your data teams should track the state of your data pipelines across multiple related data products and business domains. When your team is notified early about failures or runtimes that are longer than expected, they can minimize and remediate downtime. Correlation of pipeline monitoring data and platform service monitoring can provide recommendations for performance tuning, such as boosting CPU and memory for your high load pipelines.
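For example, a lightweight runtime check like the following Python sketch can flag runs that fail or take significantly longer than their recent baseline so the team is notified before downstream consumers notice. The run records, status values, and 1.5x slowdown threshold are illustrative assumptions.

```python
# Sketch: flag pipeline runs that failed or ran much longer than their baseline.
# Run records, status strings, and the 1.5x threshold are illustrative.
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PipelineRun:
    pipeline: str
    start: datetime
    end: datetime
    status: str  # "Succeeded" or "Failed"

    @property
    def minutes(self) -> float:
        return (self.end - self.start).total_seconds() / 60


def find_anomalies(history: list[PipelineRun], latest: PipelineRun,
                   slowdown_factor: float = 1.5) -> list[str]:
    """Return alert messages for failures and unusually long runtimes."""
    alerts = []
    if latest.status != "Succeeded":
        alerts.append(f"{latest.pipeline}: run failed at {latest.end:%Y-%m-%d %H:%M}")
    successful_minutes = [r.minutes for r in history if r.status == "Succeeded"]
    if successful_minutes:
        baseline = median(successful_minutes)
        if latest.minutes > slowdown_factor * baseline:
            alerts.append(
                f"{latest.pipeline}: ran {latest.minutes:.0f} min "
                f"(baseline {baseline:.0f} min); consider boosting CPU/memory"
            )
    return alerts
```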

Data quality monitoring

Data quality is the degree to which your data is accurate, complete, timely, and consistent with your organization's requirements. You need to constantly monitor your data sets for quality to ensure that the data applications they power remain reliable and trustworthy. DataOps practices consistently improve data reliability and performance by automating data quality tests (unit, functional, and integration tests). These improvements enable faster, more efficient fault detection and data analytics.

To adopt DevOps and SRE principles into data quality, teams must build repeatable, iterative processes and frameworks to catch data quality issues, track those issues in dashboards, and set up alerts for any deviations.
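As one hedged illustration, a repeatable check framework can start as simply as the following Python sketch, which runs named checks over a pandas DataFrame and returns results that a dashboard or alerting job can consume. The column names, rules, and freshness assumption are hypothetical.

```python
# Sketch: a minimal, repeatable data quality check runner.
# Column names and rules are illustrative; pandas is assumed to be available.
import pandas as pd


def check_no_null_customer_name(df: pd.DataFrame) -> bool:
    return df["customer_name"].notna().all()


def check_region_present(df: pd.DataFrame) -> bool:
    return df["customer_region"].notna().all()


def check_freshness(df: pd.DataFrame, max_age_hours: int = 24) -> bool:
    # Assumes sales_date is stored as a timezone-naive UTC timestamp.
    newest = pd.to_datetime(df["sales_date"]).max()
    now_utc = pd.Timestamp.utcnow().tz_localize(None)
    return now_utc - newest <= pd.Timedelta(hours=max_age_hours)


CHECKS = {
    "null_check_customer_name": check_no_null_customer_name,
    "missing_value_customer_region": check_region_present,
    "freshness_sales_date": check_freshness,
}


def run_checks(df: pd.DataFrame) -> dict[str, bool]:
    """Run every registered check; failed checks should raise alerts downstream."""
    results = {name: bool(check(df)) for name, check in CHECKS.items()}
    failures = [name for name, passed in results.items() if not passed]
    if failures:
        print(f"ALERT: data quality checks failed: {', '.join(failures)}")
    return results
```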

Time to Detect (TTD), Time to Recovery (TTR), and other data quality metrics can be tracked through your data quality monitoring. TTD is the length of time it takes for your data team to detect a data quality issue of any kind, from freshness anomalies to schema changes that break entire pipelines. TTR is the length of time it takes for your team to resolve a data incident after it's alerted. Improving data quality is more than a technical challenge; it requires significant organizational and cultural support.
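To make those two metrics concrete, the following Python sketch computes TTD and TTR per incident and checks them against an agreed TTR objective of 6 hours. The incident records and timestamps are hypothetical.

```python
# Sketch: compute Time to Detect (TTD) and Time to Recovery (TTR)
# from incident records. Timestamps and the 6-hour objective are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataIncident:
    occurred_at: datetime   # when the data issue was introduced
    detected_at: datetime   # when monitoring or a consumer flagged it
    resolved_at: datetime   # when the fix was confirmed

    @property
    def ttd(self) -> timedelta:
        return self.detected_at - self.occurred_at

    @property
    def ttr(self) -> timedelta:
        return self.resolved_at - self.detected_at


def ttr_objective_met(incidents: list[DataIncident],
                      objective: timedelta = timedelta(hours=6)) -> bool:
    """True if every incident was resolved within the agreed TTR objective."""
    return all(incident.ttr <= objective for incident in incidents)


incidents = [
    DataIncident(datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 5, 30),
                 datetime(2024, 3, 1, 9, 0)),
]
print(ttr_objective_met(incidents))  # True: TTR is 3.5 hours, under 6
```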

The governance section on data quality explores how you can implement data quality within your scenario.

Data lineage

Data lineage is broadly understood as a continuous record that follows your data's origin, transformations, and movement over time across your data estate. Data lineage is used in retrospective tasks, including troubleshooting, debugging, and tracing root causes of pipeline issues. Lineage is also used for data quality analysis, compliance, and "what if" scenarios, which are often referred to as impact analysis.

Lineage is represented visually to show data moving from source to destination, including how the data is transformed over time.
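A hedged sketch of how lineage supports those retrospective and impact-analysis tasks: represent lineage as a directed graph of dataset dependencies, then walk upstream for root-cause candidates or downstream for impact analysis. The dataset names below are hypothetical.

```python
# Sketch: lineage as a directed graph. Edges map each dataset to its
# direct upstream sources. Dataset names are illustrative.
UPSTREAM = {
    "sales_dashboard_table": ["daily_sales_aggregate"],
    "daily_sales_aggregate": ["raw_sales_events", "customer_dim"],
    "customer_dim": ["crm_export"],
}


def upstream_of(dataset: str) -> set[str]:
    """All ancestors of a dataset: candidates when tracing a root cause."""
    seen: set[str] = set()
    stack = list(UPSTREAM.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(UPSTREAM.get(node, []))
    return seen


def downstream_of(dataset: str) -> set[str]:
    """All descendants of a dataset: the blast radius for impact analysis."""
    seen: set[str] = set()
    changed = True
    while changed:
        changed = False
        for child, parents in UPSTREAM.items():
            if child not in seen and (dataset in parents or seen & set(parents)):
                seen.add(child)
                changed = True
    return seen


print(upstream_of("sales_dashboard_table"))   # root-cause candidates
print(downstream_of("raw_sales_events"))      # impacted downstream data sets
```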

The governance section on data lineage explores how you can implement data lineage within your scenario.

Data discovery

Data discovery is the first step in a data analytics or data governance workload for data consumers. In an enterprise data lake platform, it's difficult for data consumers (like data scientists and analysts) to locate the data they need and evaluate its reliability. Data catalogs with accurate metadata make searches easier through a data index that provides:

  • the location of available data
  • data quality information
  • an understanding of the data structure
  • an understanding of data lineage
  • access to the data consumers need

Data catalogs offering these search capabilities increase the speed of all data discovery processes.
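As an illustrative sketch (not any particular catalog product's API), a minimal in-memory index in Python shows how those capabilities combine: each entry carries location, quality, structure, and lineage metadata, and a keyword search returns matching entries. All entries, fields, and paths are hypothetical.

```python
# Sketch: a minimal in-memory data catalog index. Entries and fields are
# illustrative; real catalogs expose far richer metadata and search.
CATALOG = [
    {
        "name": "daily_sales_aggregate",
        "location": "abfss://curated@contosolake.dfs.core.windows.net/sales/daily",
        "quality_score": 0.97,
        "schema": ["sale_id", "customer_region", "amount", "sales_date"],
        "upstream": ["raw_sales_events"],
        "tags": ["sales", "curated", "daily"],
    },
    {
        "name": "raw_sales_events",
        "location": "abfss://raw@contosolake.dfs.core.windows.net/sales/events",
        "quality_score": 0.88,
        "schema": ["event_id", "payload", "ingested_at"],
        "upstream": [],
        "tags": ["sales", "raw"],
    },
]


def search(keyword: str, min_quality: float = 0.0) -> list[dict]:
    """Find data sets whose name or tags match the keyword and meet a quality bar."""
    keyword = keyword.lower()
    return [
        entry for entry in CATALOG
        if (keyword in entry["name"].lower() or keyword in entry["tags"])
        and entry["quality_score"] >= min_quality
    ]


for hit in search("sales", min_quality=0.9):
    print(hit["name"], hit["location"])
```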

The governance section on data catalogs explores how you can implement data discovery within your scenario.

Set SLAs, SLIs, and SLOs

Your organization's teams can adopt DevOps-style Site Reliability Engineering (SRE) practices for data monitoring. Service level agreements (SLAs), service level indicators (SLIs), and service level objectives (SLOs) can help your organization reduce downtime and ensure your data's reliability.

Service level agreements (SLAs)

SLAs require well-defined SLIs, which are quantitative measures of service quality, and agreed-upon SLOs, which are the ideal values or ranges each SLI should meet.

Setting a data SLA requires the active participation and collaboration of all stakeholders that will be affected by an SLA. These stakeholders can include data producers, data engineers, data analysts, data consumers, business analysts, and others.

Setting reliability SLAs usually includes three steps: defining, measuring, and tracking.

Begin setting your SLA by defining what reliability means. All key stakeholders must agree on this definition. Ensure every key stakeholder is involved and buys in, especially if your downstream consumers come from different teams or different geographical regions and time zones.

Your SLA needs to be carefully crafted. Involve your legal team if data consumers are external paid customers. For internal customers, your SLA definition should include key areas like the data promise, data quality, and a process to handle data incidents if the promise isn't met.

Example SLA

Suppose Contoso is a media company that runs an enterprise data lake, and this data lake powers multiple data products across different business domains. Contoso's data application team is responsible for delivering the prior-day sales data that powers Contoso's sales dashboard. When the team misses a data delivery or delivers incomplete data, the data engineering team faces emails from frustrated executives and has to manually triage the broken pipeline that's supposed to deliver sales data. To measure and improve on their deliverables, the data team sets an SLA with the sales team, as demonstrated in the following section.

Service Level Agreement - Data Team to Sales Team

| Agreement | Description |
| --- | --- |
| Business area | The data team promises to empower the sales team's ability to make data-driven decisions. |
| Promise | The data team promises to deliver the prior-day sales data that powers the sales dashboard. This data provides sales and conversion rates for all US regions. Data pipelines deliver the data that powers the sales dashboard before 6:00 UTC. |
| Data quality | Null check: customer name can't be null. Missing value: customer region can't be missing. Freshness: sales data should include any transaction made before 24:00 UTC. |
| Data incident management | If the promise of data delivery isn't met, the sales team can report the problem, and the data team promises to resolve it with a TTR of less than 6 hours. |

Service level indicators (SLIs)

SLIs should always meet or exceed the SLOs outlined in your SLA. When setting an SLI, begin by identifying key metrics you can track and measure to achieve your agreed-upon SLA.

SLI example

Suppose Contoso's data team identifies key metrics from different areas to meet the SLA outlined in the previous example. They also build a dashboard, set up alerts for when key metrics deviate from a set baseline, and automate actions to mitigate some issues.

| Metric | Purpose |
| --- | --- |
| Spark cluster CPU and memory usage | To measure any performance bottleneck in the underlying infrastructure used to run data pipelines |
| Pipeline total run time in minutes | To measure whether a pipeline takes more time than expected to run |
| Pipeline failure and success rates | To measure how many pipelines fail or succeed |
| Data quality metrics (downstream) | To ensure the data delivered by the data pipeline meets expectations |
| Data quality metrics (upstream) | To ensure that upstream dependencies for raw data quality are met |
| Transformation metadata updates | To ensure that lineage from upstream to downstream contains metadata about all transformations applied to the data |
| Downstream data indexing and updates | To ensure the sales team can discover all data sets that power their dashboard |
| Defined process for measuring TTD and TTR | To measure TTD and TTR and ensure TTR < 6 hours |

Service level objectives (SLOs)

An SLO consists of an SLI, the duration over which that SLI is measured, and the targeted success rate that is practically achievable. Defining your direction and targeted success can be an overwhelming task initially. Don't expect perfection, but rather steady improvement over multiple iterations.

SLOs can depend on:

  • Data product
  • Data category
  • Data source regions
  • Data observability components

SLO example

Suppose Contoso's data team delivers sales data across seven different United States regions. 210 data sets are delivered across all regions every calendar year, and only 200 of them are complete and meet the SLA. Those successful deliveries translate to a success rate of roughly 95 percent (200 of 210) for the year. The 10 failed (incomplete) data sets represent an error rate of about 5 percent, which the teams consider acceptable.

The data team creates a monitoring dashboard that tracks aggregated SLIs to monitor this SLO over a period of 30 days. Both the data team and the sales team get notified when the target success rate isn't achieved.
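A hedged sketch of that evaluation in Python: given a 30-day window of delivery records, compute the success rate and notify both teams when it falls below the target. The records, the 95 percent target, and the notification step are illustrative assumptions.

```python
# Sketch: evaluate an SLO over a 30-day window of delivery records.
# The records, 95% target, and notification mechanism are illustrative.
from dataclasses import dataclass


@dataclass
class Delivery:
    region: str
    on_time: bool       # delivered before 6:00 UTC
    complete: bool      # passed all data quality checks


def evaluate_slo(deliveries: list[Delivery], target: float = 0.95) -> bool:
    """Return True when the success rate over the window meets the target."""
    successes = sum(1 for d in deliveries if d.on_time and d.complete)
    rate = successes / len(deliveries)
    print(f"SLI: {successes}/{len(deliveries)} successful deliveries ({rate:.1%})")
    if rate < target:
        print("ALERT: notify the data team and the sales team - SLO missed")
        return False
    return True


# Example window: 210 deliveries with a handful of late ones.
window = [Delivery("us-west", True, True)] * 205 + [Delivery("us-east", False, True)] * 5
evaluate_slo(window)
```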

Data observability maturity model

Data observability is an essential part of the DataOps framework and should be considered parallel to your efforts to improve your organization's DataOps processes. The following maturity model can help you assess the current state of your data observability and decide on the next steps for your journey.

| Stage | Data platform service monitoring | Data pipeline performance monitoring | Data quality monitoring | Data lineage | Data discovery |
| --- | --- | --- | --- | --- | --- |
| Stage 5 (Highly advanced) | Data is collected across all the data observability components from one or more data products in a unified view and is correlated by using machine learning to find anomalies. Dashboards track SLOs, SLIs, and SLAs across all data observability components. | Data pipeline performance metrics are tracked across multiple data products. Root cause analysis is completed and driven by the system. | A high level of trust in data quality is established. Data consumers can verify the reliability of the data. | Data lineage is visually represented and is used in multiple ways, such as tracing the root causes of pipeline failures, data quality analysis, and compliance. | Data consumers can easily find the available data that they need. |
| Stage 4 (Advanced) | Dashboards track SLOs, SLIs, and SLAs across the most critical data observability components. Platform monitoring data and pipeline performance monitoring data are correlated by using automation. | Data incident tools monitor and measure TTD and TTR metrics for any incidents. | Data quality is maintained through a framework that's usable across multiple data products and is tracked by using dashboards. | Data lineage includes data quality tags and is connected to data discoverability. | Data discoverability is connected to data lineage and surfaces data quality tags. |
| Stage 3 (Evolving) | Well-defined SLOs, SLIs, and SLAs cover almost all of the critical components for data observability. Data incidents are managed with specialized tools. | Platform monitoring data is correlated with data pipeline performance monitoring by using some amount of automation. | Data quality checks are well defined and mapped to custom metrics. | Data lineage has matured to contain enough metadata for decision making. | Data discoverability is achieved by using specialized data catalog tools. |
| Stage 2 (Planning) | An initial draft of SLOs, SLIs, and SLAs covers the most critical components needed for data observability. Platform monitoring data is centralized, and there's a unified view of the entire data environment. All data incident management is manual. | Data pipeline performance metrics are defined and measured. | Data quality checks exist, but no standard metrics are defined, measured, or visualized. | Data lineage is limited to a single data product or isn't tracked. | Data discoverability is achieved, but no sophisticated tools are used. |
| Stage 1 (Learning) | Every critical platform service (provider-managed and self-managed) in the data landscape is monitored. | Pipeline monitoring is minimal. Failures trigger alerts, but there's no insight into possible causes. | Data quality tests can be run from the pipeline, but no metrics are measured or tracked. | Data lineage doesn't exist. | Data discoverability doesn't exist. |