Analyze monitoring data for cloud applications

After you've collected data from various data sources, analyze the data to assess the overall well-being of the system. For analysis, have a clear understanding of:

  • How to structure data based on KPIs and performance metrics you've defined.
  • How to correlate the data captured in different metrics and log files. This correlation is important when tracking a sequence of events and helps diagnose problems.

In most cases, data for each component of the architecture is captured locally and then combined with data generated by other components.

For example, a three-tier application has:

  • Presentation tier that allows a user to connect to a website
  • Middle tier that hosts a set of microservices that processes business logic
  • Database tier that stores data associated with the operation

The usage data for a single business operation might span across all three tiers. This information needs to be correlated to provide an overall view of the resource and processing usage for the operation. The correlation might involve some preprocessing and filtering of data on the database tier. On the middle tier, common tasks are aggregation and formatting.
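As an illustration, the cross-tier correlation for a single business operation can be sketched as follows. The event records and their field names (operation_id, tier, duration_ms) are hypothetical, not the output of any particular APM tool:

```python
from collections import defaultdict

# Hypothetical per-tier events for two business operations; in practice
# each tier would emit these locally and they'd be combined afterwards.
events = [
    {"operation_id": "op-1", "tier": "presentation", "duration_ms": 40},
    {"operation_id": "op-1", "tier": "middle", "duration_ms": 120},
    {"operation_id": "op-1", "tier": "database", "duration_ms": 85},
    {"operation_id": "op-2", "tier": "presentation", "duration_ms": 35},
    {"operation_id": "op-2", "tier": "middle", "duration_ms": 90},
]

def correlate_by_operation(events):
    """Group per-tier events by operation ID and total their durations."""
    per_op = defaultdict(dict)
    for e in events:
        per_op[e["operation_id"]][e["tier"]] = e["duration_ms"]
    return {
        op: {"tiers": tiers, "total_ms": sum(tiers.values())}
        for op, tiers in per_op.items()
    }

summary = correlate_by_operation(events)
print(summary["op-1"]["total_ms"])  # 245
```

The aggregation step here (summing per-tier durations) stands in for the preprocessing, filtering, and formatting that each tier would contribute in a real pipeline.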

Best practices

  • Correlate application and resource level logs—Evaluate data at both levels to optimize the detection of issues and troubleshooting of detected issues. You can aggregate the data in a single data sink or have ways to query events across both levels. Using a unified solution, such as Azure Log Analytics, is recommended to aggregate and query application and resource level logs.

  • Define clear retention times on storage for cold analysis—This practice is recommended to allow historic analysis over a specific period. Another benefit is control over storage costs. Have processes that make sure data is archived to cheaper storage and aggregated for long-term trend analysis.

  • Analyze long-term trends to predict operational issues—Evaluate long-term data to form operational strategies and to predict what operational issues are likely to occur and when. For instance, if average response times have been slowly increasing and getting closer to the maximum target, you can act before that target is breached.
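The trend prediction in the last bullet can be sketched with a simple least-squares fit. The weekly samples, the target value, and the function name weeks_until_breach are all illustrative assumptions:

```python
# Hedged sketch: fit a linear trend to weekly average response times and
# estimate when the trend would cross a maximum target. Data is made up.
def weeks_until_breach(samples, target_ms):
    """samples: list of (week_index, avg_response_ms) pairs.
    Returns the predicted week index at which the linear trend hits
    target_ms, or None if response times are not trending upward."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no upward trend; no predicted breach
    return (target_ms - intercept) / slope

history = [(0, 200), (1, 210), (2, 220), (3, 230)]  # steadily rising
print(weeks_until_breach(history, target_ms=300))  # 10.0
```

A real system would fit over much longer windows and account for seasonality, but the idea is the same: turn the long-term trend into an estimate of when a target will be exceeded.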

Correlate data

Data generated by instrumentation and captured by Application Performance Monitoring (APM) tools can provide a snapshot of the system state. The main purpose of the snapshot is to make this data actionable.

For example:

  • What has caused an intense I/O loading at the system level at a specific time?
  • Is it the result of a large number of database operations?
  • Is this reflected in the database response times, the number of transactions per second, and application response times at the same juncture?

A possible remedial action to reduce the load might be to shard the data over more servers. In addition, exceptions can arise because of a fault in any tier. An exception in one level often triggers another fault in the level above.
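As a minimal sketch of the sharding remedy mentioned above, assuming a hash-based assignment of record keys to servers:

```python
import hashlib

def shard_for(key, num_shards):
    """Deterministically assign a record key to one of num_shards servers
    by hashing the key. The same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Example: route a customer's data to one of four database servers.
shard = shard_for("customer-123", 4)
print(shard)
```

Spreading I/O over shards like this reduces the load on any single server, at the cost of making cross-shard queries and resharding more complex.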

So it's recommended that you correlate different types of monitoring data at each level to produce an overall view of the state of the system and the applications that are running on it. This is vital for root cause analysis for failures. You can also use this information to make decisions about whether the system is functioning acceptably or not, and determine what can be done to improve the quality of the system.

In a distributed application, an operation will lead to multiple transactions. Each transaction generates events in different services and those events must be correlated and visualized. This way you can spot performance bottlenecks or failure hotspots across services. Application Map is a popular choice for visualizing flows.

Telemetry correlation ensures that the transactions can be mapped through the end-to-end application and critical system flows. Platform-level metrics and logs such as CPU percentage, network in/out, and disk operations/sec should be collected from the application to inform a health model and detect/predict issues. This can also help to distinguish between transient and non-transient faults.

Here's an Application Map of an application that has several microservices. With this visual representation, you can see that the Workflow service is getting errors from the Delivery service.

Application Map for microservices

When you look at the end-to-end transaction, you can see that an exception is thrown due to memory limits in Azure Cache for Redis.

End to end transaction

When correlating data, make sure that the raw instrumentation data includes sufficient context and activity ID information to support the required aggregations for correlating events. Additionally, this data might be held in different formats, and it might be necessary to parse this information to convert it into a standardized format for analysis.
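A minimal sketch of that normalization step, assuming two hypothetical raw formats (a pipe-delimited application log line and a JSON resource metric) that share an activity ID:

```python
import json

# Illustrative raw records; the field names and the pipe-delimited
# layout are assumptions for the sketch, not a real product's format.
app_log_line = "2024-05-01T12:00:03Z|act-42|checkout|ERROR|payment timed out"
resource_metric_json = (
    '{"time": "2024-05-01T12:00:01Z", "activityId": "act-42", "cpuPercent": 97}'
)

def normalize_app_line(line):
    """Parse a pipe-delimited application log line into a standard record."""
    ts, activity_id, operation, level, message = line.split("|")
    return {"time": ts, "activity_id": activity_id, "source": "app",
            "detail": f"{operation} {level}: {message}"}

def normalize_resource_json(raw):
    """Parse a JSON resource metric into the same standard record shape."""
    rec = json.loads(raw)
    return {"time": rec["time"], "activity_id": rec["activityId"],
            "source": "resource", "detail": f"cpuPercent={rec['cpuPercent']}"}

def correlated_timeline(records):
    """Order normalized records for one activity ID into event sequence."""
    return sorted(records, key=lambda r: r["time"])

timeline = correlated_timeline([normalize_app_line(app_log_line),
                                normalize_resource_json(resource_metric_json)])
print([r["source"] for r in timeline])  # ['resource', 'app']
```

Once both sources share one record shape and one activity ID, the sequence of events across levels (here, a CPU spike followed by an application error) becomes visible in a single timeline.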

Analysis patterns

Analyzing and reformatting data for visualization, reporting, and alerting purposes can be a complex process that consumes its own set of resources. Some forms of monitoring are time-critical and require immediate analysis of data to be effective. Other forms of analysis are less time-critical and might require some computation and aggregation after the raw data has been received. This table shows the patterns for analysis.

Hot analysis
  Characteristics: Time-critical and requires immediate analysis.
  Considerations:
    • Data must be structured and available quickly for efficient processing.
    • Move the analysis processing to the individual VMs that store the data.
  Use cases:
    • Alerting.
    • Detecting a security attack on the system.

Warm analysis
  Characteristics: Less time-critical and might require aggregation before analysis.
  Considerations:
    • Perform computation and aggregation on the received raw data.
    • Aggregate a series of events instead of an isolated event.
  Use cases:
    • Performance analysis on data from a series of events to identify reliability issues.
    • Diagnosis of health issues by statistical evaluation of the events leading up to a health event.

Cold analysis
  Characteristics: Analysis can be performed at a later date.
  Considerations:
    • Data is stored safely after it has been captured.
    • Aggregate a series of events instead of an isolated event.
  Use cases:
    • Usage monitoring and auditing that need a view of the system at points in time.
    • Predictive health analysis with historical information.
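A minimal sketch of how one event stream might feed all three paths; the severity threshold and the list-based "paths" are illustrative assumptions:

```python
# Hedged sketch of routing a single event stream into the three
# analysis paths described above. Real systems would use queues or
# streams rather than in-memory lists.
hot_path, warm_path, cold_path = [], [], []

def route_event(event):
    """Hot: security or critical events analyzed immediately.
    Warm: other operational events aggregated shortly after receipt.
    Cold: every raw event is also retained for later analysis."""
    if event.get("severity") == "critical" or event.get("category") == "security":
        hot_path.append(event)   # e.g. raise an alert right away
    else:
        warm_path.append(event)  # e.g. aggregate a series of events
    cold_path.append(event)      # always keep the raw record

for e in [{"severity": "critical", "category": "security",
           "msg": "auth brute force"},
          {"severity": "info", "msg": "slow query"}]:
    route_event(e)

print(len(hot_path), len(warm_path), len(cold_path))  # 1 1 2
```

Note that the cold path receives everything: later historical and predictive analysis depends on the raw record being stored safely regardless of how the event was handled in real time.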

Combined approach

In common monitoring use cases, you'll use a combination of all three patterns.

For example, a health event is typically processed through hot analysis and can raise an alert immediately. An operator can then drill into the reasons for the health event by examining the data from the warm path. This data should contain information about the events leading up to the issue that caused the health event. An operator can also use cold analysis to provide the data for predictive health analysis. The operator can gather historical information over a specified period and use it in conjunction with the current health data (retrieved from the hot path) to spot trends that might soon cause health issues. In these cases, it might be necessary to raise an alert so that corrective action can be taken.

Diagnose issues

Diagnosis requires the ability to determine the cause of faults or unexpected behavior, including performing root cause analysis. The information that's required typically includes:

  • Detailed information from event logs and traces, either for the entire system or for a specified subsystem during a specified time window.
  • Complete stack traces resulting from exceptions and faults of any specified level that occur within the system or a specified subsystem during a specified period.
  • Crash dumps for any failed processes either anywhere in the system or for a specified subsystem during a specified time window.
  • Activity logs recording the operations that are performed either by all users or for selected users during a specified period.
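Selecting diagnostic data for a specified subsystem and time window, as described in the list above, can be sketched like this; the record shapes and subsystem names are hypothetical:

```python
from datetime import datetime

# Illustrative trace records; in practice these would come from event
# logs, stack traces, crash dumps, or activity logs as described above.
traces = [
    {"time": "2024-05-01T09:15:00", "subsystem": "orders", "event": "exception"},
    {"time": "2024-05-01T09:45:00", "subsystem": "billing", "event": "trace"},
    {"time": "2024-05-01T10:30:00", "subsystem": "orders", "event": "crash"},
]

def diagnostic_slice(records, subsystem, start, end):
    """Select records for one subsystem within a specified time window."""
    return [r for r in records
            if r["subsystem"] == subsystem
            and start <= datetime.fromisoformat(r["time"]) <= end]

window = diagnostic_slice(traces, "orders",
                          datetime(2024, 5, 1, 9, 0),
                          datetime(2024, 5, 1, 10, 0))
print([r["event"] for r in window])  # ['exception']
```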

Analyzing data for troubleshooting purposes often requires a deep technical understanding of the system architecture and the various components that compose the solution. As a result, a large degree of manual intervention is often required to interpret the data, establish the cause of problems, and recommend an appropriate strategy to correct them. It might be appropriate simply to store a copy of this information in its original format and make it available for cold analysis by an expert.

Next steps