Audit logging and monitoring overview

How do Microsoft online services employ audit logging?

Microsoft online services employ audit logging to detect unauthorized activities and provide accountability for Microsoft personnel. Audit logs capture details about system configuration changes and access events, with details to identify who was responsible for the activity, when and where the activity took place, and what the outcome of the activity was. Automated log analysis supports near real-time detection of suspicious behavior. Potential incidents are escalated to the appropriate Microsoft security response team for further investigation.
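As an illustration only, the who/when/where/outcome details described above could be modeled as a small immutable record. This is a minimal sketch, not the schema Microsoft uses: the class name `AuditRecord` and every field name and sample value are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be modified after creation
class AuditRecord:
    """Hypothetical audit record; field names are illustrative only."""
    actor: str           # who was responsible for the activity
    timestamp: datetime  # when the activity took place
    source: str          # where the activity took place (host or service)
    action: str          # what was done (configuration change, access event)
    outcome: str         # what the result was, e.g. "success" or "denied"

    def to_json(self) -> str:
        data = asdict(self)
        # datetime is not JSON serializable, so store it as ISO 8601 text
        data["timestamp"] = self.timestamp.isoformat()
        return json.dumps(data, sort_keys=True)


record = AuditRecord(
    actor="engineer@example.com",
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    source="server-01",
    action="config.update",
    outcome="success",
)
print(record.to_json())
```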

Microsoft online services internal audit logging captures log data from various sources, such as:

  • Event logs
  • AppLocker logs
  • Performance data
  • System Center data
  • Call detail records
  • Quality of experience data
  • IIS Web Server logs
  • SQL Server logs
  • Syslog data
  • Security audit logs

How do Microsoft online services centralize and report on audit logs?

Many different types of log data are uploaded from Microsoft servers to a proprietary security monitoring solution for near real-time (NRT) analysis and an internal big data computing service (Cosmos) or Azure Data Explorer (Kusto) for long-term storage. This data transfer occurs over a FIPS 140-2-validated TLS connection on approved ports and protocols using automated log management tools.

Logs are processed in NRT using rule-based, statistical, and machine learning methods to detect system performance indicators and potential security events. Machine learning models use incoming log data and historical log data stored in Cosmos or Kusto to continuously improve detection capabilities. Security-related detections generate alerts, notifying on-call engineers of a potential incident and triggering automated remediation actions when applicable. In addition to automated security monitoring, service teams use analysis tools and dashboards for data correlation, interactive queries, and data analytics. These reports are used to monitor and improve the overall performance of the service.
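To make the difference between rule-based and statistical detection concrete, the following sketch shows one toy example of each. It is an assumption-laden illustration rather than the detection logic of any Microsoft service: the rule, the event field names, the threshold of three standard deviations, and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev


def rule_based_alert(event: dict) -> bool:
    """Hypothetical rule: flag a denied attempt to change configuration."""
    return event.get("action") == "config.update" and event.get("outcome") == "denied"


def statistical_alert(history: list[float], current: float, threshold: float = 3.0) -> bool:
    """Flag `current` when it lies more than `threshold` standard
    deviations away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to estimate a baseline
    sigma = pstdev(history)
    if sigma == 0:
        return current != history[0]  # any deviation from a flat baseline
    return abs(current - mean(history)) / sigma > threshold


# Hypothetical historical counts of failed sign-ins per minute
failed_logins_per_minute = [4, 5, 6, 5, 4, 6, 5]
print(statistical_alert(failed_logins_per_minute, 40))  # far above the baseline
print(statistical_alert(failed_logins_per_minute, 6))   # within normal variation
```

A rule fires on a known pattern regardless of volume, while the statistical check only needs a baseline, which is why the two approaches are typically combined.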

Figure: Audit data flow.

How do Microsoft online services protect audit logs?

The tools used in Microsoft online services to collect and process audit records don’t allow permanent or irreversible changes to the original audit record content or time ordering. Access to Microsoft online service data stored in Cosmos or Kusto is restricted to authorized personnel. In addition, Microsoft restricts the management of audit logs to a limited subset of security team members responsible for audit functionality. Security team personnel don’t have standing administrative access to Cosmos or Kusto. Administrative access requires Just-In-Time (JIT) access approval, and all changes to logging mechanisms for Cosmos are recorded and audited. Audit logs are retained long enough to support incident investigations and meet regulatory requirements. The exact retention period for audit log data is determined by the service teams; most audit log data is retained for 90 days in Cosmos and 180 days in Kusto.
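The typical retention periods quoted above can be expressed as a simple age check. This is only a sketch under stated assumptions: the store keys, the function name `is_past_retention`, and the sample dates are hypothetical, and because the exact period is set by each service team, the 90-day and 180-day values are just the common case mentioned in the text.

```python
from datetime import datetime, timedelta, timezone

# Typical retention periods from the text; individual service teams may differ.
RETENTION = {
    "cosmos": timedelta(days=90),
    "kusto": timedelta(days=180),
}


def is_past_retention(store: str, logged_at: datetime, now: datetime) -> bool:
    """Return True when a record is older than the store's retention period."""
    return now - logged_at > RETENTION[store]


logged_at = datetime(2024, 3, 1, tzinfo=timezone.utc)
now = datetime(2024, 7, 1, tzinfo=timezone.utc)  # 122 days after logged_at
print(is_past_retention("cosmos", logged_at, now))
print(is_past_retention("kusto", logged_at, now))
```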

How do Microsoft online services protect user personal data that may be captured in audit logs?

Before log data is uploaded, an automated log management application uses a scrubbing service to remove any fields that contain customer data, such as tenant information and user personal data, and replace those fields with a hash value. The anonymized and hashed logs are rewritten and then uploaded into Cosmos. All log transfers occur over a FIPS 140-2-validated, TLS-encrypted connection.
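The scrubbing step can be pictured as replacing selected fields with a hash before the record leaves the server. The sketch below is a simplified illustration, not the actual scrubbing service: the field names in `SENSITIVE_FIELDS`, the function name `scrub`, and the choice of keyed HMAC-SHA-256 are all assumptions, since the text says only that the fields are replaced with "a hash value".

```python
import hashlib
import hmac

# Hypothetical names of fields that hold customer data
SENSITIVE_FIELDS = {"tenant_id", "user_email"}


def scrub(record: dict, key: bytes) -> dict:
    """Return a copy of `record` with sensitive fields replaced by a keyed hash.

    The same input value always maps to the same digest, so scrubbed logs
    can still be correlated without exposing the original value.
    """
    scrubbed = {}
    for name, value in record.items():
        if name in SENSITIVE_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            scrubbed[name] = digest.hexdigest()
        else:
            scrubbed[name] = value
    return scrubbed


raw = {"tenant_id": "contoso", "user_email": "user@example.com", "action": "login"}
clean = scrub(raw, key=b"example-secret-key")
print(clean["action"], len(clean["user_email"]))
```

A keyed hash is used here instead of a bare digest because low-entropy values such as email addresses are otherwise easy to guess and re-hash; that design choice belongs to this example and is not taken from the source.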

What is Microsoft's strategy for monitoring security?

Microsoft engages in continuous security monitoring of its systems to detect and respond to threats to Microsoft online services. Our key principles for security monitoring and alerting are:

  • Robustness: signals and logic to detect various attack behaviors
  • Accuracy: meaningful alerts to avoid distractions from noise
  • Speed: ability to catch attackers quickly enough to stop them

Automation, scale, and cloud-based solutions are key pillars of our monitoring and response strategy. For us to effectively prevent attacks at the scale of some of the Microsoft online services, our monitoring systems need to automatically raise highly accurate alerts in near real time. Likewise, when an issue is detected, we need the ability to mitigate the risk at scale; we cannot rely on our team to manually fix issues machine by machine. To mitigate risks at scale, we use cloud-based tools to automatically apply countermeasures and provide engineers with tools to apply approved mitigation actions quickly across the environment.

How do Microsoft online services perform security monitoring?

Microsoft online services use centralized logging to collect and analyze log events for activities that might indicate a security incident. Centralized logging tools aggregate logs from all system components, including event logs, application logs, access control logs, and network-based intrusion detection systems. In addition to server logging and application-level data, core infrastructure is equipped with customized security agents that generate detailed telemetry and provide host-based intrusion detection. We use this telemetry for monitoring and forensics.

The logging and telemetry data we collect enables 24/7 security alerting. Our alerting system analyzes log data as it gets uploaded, producing alerts in near real time. This includes rules-based alerts and more sophisticated alerting based on machine learning models. Our monitoring logic goes beyond generic attack scenarios and incorporates deep awareness of service architecture and operations. We analyze security monitoring data to continuously improve our models to detect new kinds of attacks and improve the accuracy of our security monitoring.

How do Microsoft online services respond to security monitoring alerts?

When security events that trigger alerts require responsive action or further investigation of forensic evidence throughout the service, our cloud-based tools allow for rapid response throughout the environment. These tools include fully automated, intelligent agents that respond to detected threats with security countermeasures. In many cases, these agents deploy automatic countermeasures to mitigate security detections at scale without human intervention. When this response is not possible, the security monitoring system automatically alerts the appropriate on-call engineers, who are equipped with a set of tools that enable them to act in real time to mitigate detected threats at scale. Potential incidents are escalated to the appropriate Microsoft security response team and are resolved using the security incident response process.
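The response flow described above (automatic countermeasure when one exists, otherwise an alert to the on-call engineer) can be sketched as a small dispatcher. Everything named here is hypothetical, including the alert types, the `COUNTERMEASURES` table, and the paging function; it illustrates the decision logic only and does not represent Microsoft's tooling.

```python
from typing import Callable

# Hypothetical mapping from detection type to an automated countermeasure
COUNTERMEASURES: dict[str, Callable[[dict], str]] = {
    "malware_detected": lambda alert: f"isolated host {alert['host']}",
}


def page_on_call(alert: dict) -> str:
    """Stand-in for notifying the on-call engineer."""
    return f"paged on-call engineer for {alert['type']} on {alert['host']}"


def respond(alert: dict) -> str:
    """Apply an automated countermeasure when one is defined for the alert
    type; otherwise fall back to paging a human."""
    countermeasure = COUNTERMEASURES.get(alert["type"])
    if countermeasure is not None:
        return countermeasure(alert)
    return page_on_call(alert)


print(respond({"type": "malware_detected", "host": "server-01"}))
print(respond({"type": "unusual_sign_in", "host": "server-02"}))
```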

How do Microsoft online services monitor system availability?

Microsoft actively monitors its systems for indicators of resource over-utilization and abnormal use. Resource monitoring is complemented by service redundancies to help avoid unexpected downtime and provide customers with reliable access to products and services. Microsoft online service health issues are communicated promptly to customers through the Service Health Dashboard (SHD).

Azure and Dynamics 365 online services use multiple infrastructure services to monitor their security, health, and availability. Synthetic Transaction (STX) testing allows Azure and Dynamics services to check the availability of their services. The STX framework is designed to support the automated testing of components in running services and is tested on live site failure alerts. Additionally, the Azure Security Monitoring (ASM) service has implemented centralized synthetic testing procedures to verify that security alerts function as expected in both new and running services.
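In general terms, a synthetic transaction is a scripted request that exercises a running service and records whether it succeeded. The following sketch shows that idea under stated assumptions: it is not the STX framework, and the function name, the `probe` callable, and the reported metrics are hypothetical.

```python
import time
from typing import Callable


def run_synthetic_transaction(probe: Callable[[], bool], attempts: int = 3) -> dict:
    """Run `probe` several times and summarize availability and latency.

    `probe` is any callable that performs a scripted request against the
    service and returns True on success; an exception counts as a failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    successes = 0
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        latencies.append(time.perf_counter() - start)
        if ok:
            successes += 1
    return {"availability": successes / attempts, "max_latency_s": max(latencies)}


def healthy_probe() -> bool:
    return True  # stand-in for a real scripted request


def failing_probe() -> bool:
    raise ConnectionError("service unreachable")  # simulated outage


print(run_synthetic_transaction(healthy_probe)["availability"])
print(run_synthetic_transaction(failing_probe)["availability"])
```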

Microsoft online services are regularly audited for compliance with external regulations and certifications. The audits listed below validate controls related to audit logging and monitoring.

Azure and Dynamics 365

External audits: ISO 27001/27002 (Statement of Applicability, Certificate)
Section:
  • A.12.1.3: Availability monitoring and capacity planning
  • A.12.4: Logging and monitoring
Latest report date: November 6, 2023

External audits: ISO 27017 (Statement of Applicability, Certificate)
Section:
  • A.12.1.3: Availability monitoring and capacity planning
  • A.12.4: Logging and monitoring
  • A.16.1: Management of information security incidents and improvements
Latest report date: November 6, 2023

External audits: ISO 27018 (Statement of Applicability, Certificate)
Section:
  • A.12.4: Logging and monitoring
Latest report date: November 6, 2023

External audits: SOC 1
Section:
  • IM-1: Incident management framework
  • IM-2: Incident detection configuration
  • IM-3: Incident management procedures
  • IM-4: Incident post-mortem
  • VM-1: Security event logging and collection
  • VM-12: Azure services availability monitoring
  • VM-4: Malicious events investigation
  • VM-6: Security vulnerability monitoring
Latest report date: November 17, 2023

External audits: SOC 2, SOC 3
Section:
  • C5-6: Restricted access to logs
  • IM-1: Incident management framework
  • IM-2: Incident detection configuration
  • IM-3: Incident management procedures
  • IM-4: Incident post-mortem
  • PI-2: Azure portal SLA performance review
  • VM-1: Security event logging and collection
  • VM-12: Azure services availability monitoring
  • VM-4: Malicious events investigation
  • VM-6: Security vulnerability monitoring
Latest report date: November 17, 2023

Microsoft 365

External audits: FedRAMP (Office 365)
Section:
  • AC-2: Account management
  • AC-17: Remote access
  • AU-2: Audit events
  • AU-3: Content of audit records
  • AU-4: Audit storage capacity
  • AU-5: Response to audit processing failures
  • AU-6: Audit review, analysis, and reporting
  • AU-7: Audit reduction and report generation
  • AU-8: Time stamps
  • AU-9: Protection of audit information
  • AU-10: Non-repudiation
  • AU-11: Audit record retention
  • AU-12: Audit generation
  • SI-4: Information system monitoring
  • SI-7: Software, firmware, and information integrity
Latest report date: July 31, 2023

External audits: ISO 27001/27002/27017 (Statement of Applicability, Certification (27001/27002), Certification (27017))
Section:
  • A.12.3: Availability monitoring and capacity planning
  • A.12.4: Logging and monitoring
Latest report date: March 2024

External audits: SOC 1, SOC 2
Section:
  • CA-19: Change monitoring
  • CA-26: Security incident reporting
  • CA-29: On-call engineers
  • CA-30: Availability monitoring
  • CA-48: Datacenter logging
  • CA-60: Audit logging
Latest report date: January 23, 2024

External audits: SOC 3
Section:
  • CUEC-08: Reporting incidents
  • CUEC-10: Service contracts
Latest report date: January 23, 2024

Resources