This article presents a solution for genomic analysis and reporting. The processes and results are appropriate for precision medicine scenarios, or areas of medical care that use genetic profiling.
Architecture
The diagram contains two boxes. The first, on the left, has the label Azure Data Factory for orchestration. The second box has the label Clinician views. The first box contains several smaller boxes that represent data or various Azure components. Arrows connect the boxes, and numbered labels on the arrows correspond with the numbered steps in the document text. Two arrows flow between the boxes, ending in the Clinician views box. One arrow points to a clinician icon. The other points to a Power BI icon.
Workflow
Azure Data Factory orchestrates the workflow:
1. Data Factory transfers the initial sample file to Azure Blob Storage. The file is in FASTQ format.
2. Microsoft Genomics runs secondary analysis on the file.
3. Microsoft Genomics stores the output in Blob Storage in one of these formats:
   - Variant call format (VCF)
   - Genomic VCF (GVCF)
4. A Jupyter notebook that runs on Azure Databricks annotates the output file.
5. Azure Data Lake Storage stores the annotated file.
6. A Jupyter notebook that runs on Azure Databricks merges the file with other datasets and analyzes the data.
7. Data Lake Storage stores the processed data.
8. Azure Healthcare APIs packages the data into a Fast Healthcare Interoperability Resources (FHIR) bundle. The clinical data then enters the patient electronic health record (EHR).
9. Clinicians view the results in Power BI dashboards.
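The annotation step in the workflow above can be illustrated with a minimal Python sketch. It is not the notebook that this solution uses. It assumes a plain-text VCF with single-allele records, and the `Variant` class, the `annotate` function, the sample records, and the lookup table are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    annotations: dict = field(default_factory=dict)


def parse_vcf(text):
    """Parse the fixed CHROM, POS, REF, and ALT columns from VCF data lines."""
    variants = []
    for line in text.splitlines():
        # Lines that start with '#' are meta-information or the column header.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Simplification: ALT is kept as-is, so multi-allelic records
        # (comma-separated ALT values) are not split.
        variants.append(Variant(cols[0], int(cols[1]), cols[3], cols[4]))
    return variants


def annotate(variants, lookup):
    """Attach annotations from a lookup keyed by (chrom, pos, ref, alt)."""
    for v in variants:
        v.annotations = dict(lookup.get((v.chrom, v.pos, v.ref, v.alt), {}))
    return variants


# Hypothetical two-variant VCF and annotation table, for illustration only.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
    "chr2\t67890\t.\tC\tT\t99\tPASS\t.",
])
LOOKUP = {
    ("chr1", 12345, "A", "G"): {"gene": "EXAMPLE1", "significance": "uncertain"},
}

annotated = annotate(parse_vcf(SAMPLE_VCF), LOOKUP)
for v in annotated:
    print(v.chrom, v.pos, v.ref, v.alt, v.annotations)
```

In practice, a Spark-based notebook would more likely join variant records against a reference annotation dataset at scale. The dictionary lookup here only shows the shape of the operation.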
Components
The solution uses the following components:
Microsoft Genomics
Microsoft Genomics offers an efficient and accurate genomics pipeline that implements the industry's best practices. Its high-performance engine is optimized for these tasks:
- Reading large files of genomic data
- Processing them efficiently across many cores
- Sorting and filtering the results
- Writing the results to output files
To maximize throughput, the engine runs the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) HaplotypeCaller variant caller. The engine also uses several other components that make up standard genomics pipelines, such as duplicate marking, base quality score recalibration, and indexing. Starting from raw reads, the engine can process a single genomic sample on a single multi-core server in a few hours. The output consists of aligned reads and variant calls.
Internally, the Microsoft Genomics controller manages these aspects of the process:
- Distributing batches of genomes across pools of machines in the cloud
- Maintaining a queue of incoming requests
- Distributing the requests to servers that run the genomics engine
- Monitoring the servers' performance and progress
- Evaluating the results
- Ensuring that processing runs reliably and securely at scale, behind a secure web service API
You can easily use Microsoft Genomics results in tertiary analysis and machine learning services. And because Microsoft Genomics is a cloud service, you don't need to manage or update hardware or software.
Other components
Data Factory is an integration service that works with data from disparate data stores. You can use this fully managed, serverless platform to orchestrate and automate workflows. In this solution, Data Factory pipelines transfer data to Azure. A sequence of pipelines then triggers each step of the workflow.
Blob Storage offers optimized cloud object storage for large amounts of unstructured data. In this scenario, Blob Storage provides the initial landing zone for the FASTQ file. This service also functions as the output target for the VCF and GVCF files that Microsoft Genomics generates. Tiering functionality in Blob Storage provides a way to archive FASTQ files in inexpensive long-term storage after processing.
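One way to archive processed FASTQ files is a lifecycle management policy on the storage account. The source article doesn't say which tiering mechanism the solution uses, so the following Python sketch is only an assumption. It builds a policy document that moves block blobs under a prefix to the archive tier. The rule name, the `samples/fastq/` prefix, and the 30-day threshold are hypothetical values. The policy also keys on days since last modification, which only approximates "after processing."

```python
import json


def archive_policy(prefix, days):
    """Build a lifecycle policy that archives block blobs under a prefix."""
    return {
        "rules": [
            {
                "enabled": True,
                "name": "archive-processed-fastq",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": days
                            }
                        }
                    },
                },
            }
        ]
    }


# Hypothetical container path and threshold, for illustration only.
policy = archive_policy("samples/fastq/", 30)
print(json.dumps(policy, indent=2))
```

The resulting JSON can be applied to a storage account through the Azure portal, the Azure CLI, or a deployment template.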
Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from various sources. In this solution, Azure Databricks provides the computational resources that Jupyter Notebook needs to annotate, merge, and analyze the data.
Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data may be structured, semi-structured, or unstructured. It typically comes from multiple, heterogeneous sources. In this architecture, Data Lake Storage provides the final landing zone for the annotated files and the merged datasets. It also gives downstream systems access to the final output.
Power BI is a collection of software services and apps that display analytics information. You can use Power BI to connect and display unrelated sources of data. In this solution, you can populate Power BI dashboards with the results. Clinicians can then create visuals from the final dataset.
Azure Healthcare APIs is a managed, standards-based, compliant interface for accessing clinical health data. In this scenario, Azure Healthcare APIs passes a FHIR bundle that contains the clinical data to the EHR.
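To make the idea of a FHIR bundle concrete, the following Python sketch assembles a small transaction bundle of `Observation` resources. It is an illustration under simplifying assumptions, not the bundle that this solution produces. The `build_bundle` function, the patient ID, and the finding are hypothetical, and the observations use free-text codes, whereas real genomic reports typically rely on coded terminologies and more specific profiles.

```python
import json


def build_bundle(patient_id, findings):
    """Wrap each finding in an Observation and collect them in a Bundle."""
    entries = []
    for finding in findings:
        observation = {
            "resourceType": "Observation",
            "status": "final",
            # Simplification: free-text code instead of a coded terminology.
            "code": {"text": finding["name"]},
            "subject": {"reference": f"Patient/{patient_id}"},
            "valueString": finding["value"],
        }
        entries.append({
            "resource": observation,
            # In a transaction bundle, each entry states how to process it.
            "request": {"method": "POST", "url": "Observation"},
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}


# Hypothetical patient ID and finding, for illustration only.
bundle = build_bundle(
    "example-patient-1",
    [{"name": "Variant chr1:12345 A>G", "value": "uncertain significance"}],
)
print(json.dumps(bundle, indent=2))
```

A client would send a bundle like this to the base URL of the FHIR service in a single HTTP POST request.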
Scenario details
This solution provides a clinical genomics workflow that automates these tasks:
- Taking data from a sequencer
- Moving the data through secondary analysis
- Providing results that clinicians can consume
The growing scale, complexity, and security requirements of genomics make it an ideal candidate for moving to the cloud. Consequently, the solution uses Azure services in addition to open-source tools. This approach takes advantage of the security, performance, and scalability features of the Azure cloud:
- Scientists plan on sequencing hundreds of thousands of genomes in coming years. The task of storing and analyzing this data requires significant computing power and storage capacity. With data centers around the world that provide these resources, Azure can meet these demands.
- Azure is certified for major global security and privacy standards, such as ISO 27001.
- Azure complies with the security and provenance standards that the Health Insurance Portability and Accountability Act (HIPAA) establishes for personal health information.
A key component of the solution is Microsoft Genomics. This service offers an optimized secondary analysis implementation that can process a 30x genome in a few hours. Standard technologies can take days.
Potential use cases
This solution is ideal for the healthcare industry. It applies to many areas:
- Risk scoring patients for cancer
- Identifying patients with genetic markers that predispose them to disease
- Generating patient cohorts for studies
Considerations
The following considerations align with the Microsoft Azure Well-Architected Framework and apply to this solution:
Availability
The service level agreements (SLAs) of most Azure components guarantee availability:
- At least 99.9 percent of Data Factory pipelines are guaranteed to run successfully.
- The Azure Databricks SLA guarantees 99.95 percent availability.
- Microsoft Genomics offers a 99.99 percent availability SLA for workflow requests.
- Blob Storage and Data Lake Storage are part of Azure Storage, which offers availability through redundancy.
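To see what these figures imply for the workflow as a whole, you can multiply them into a rough composite estimate. The following Python sketch assumes that the three services with stated SLAs are chained in series and fail independently. It leaves out Azure Storage because no figure is given above. It also treats the Data Factory figure, which measures successful pipeline runs, as if it were availability, so the result is only an approximation.

```python
import math

# SLA figures quoted in the list above, expressed as fractions.
SLAS = {
    "Data Factory": 0.999,
    "Azure Databricks": 0.9995,
    "Microsoft Genomics": 0.9999,
}

# For independent services in series, availabilities multiply.
composite = math.prod(SLAS.values())

# Expected downtime in a 30-day month, in minutes.
minutes_per_month = 30 * 24 * 60
downtime_minutes = (1 - composite) * minutes_per_month

print(f"Composite availability: {composite:.4%}")
print(f"Estimated downtime per 30-day month: {downtime_minutes:.1f} minutes")
```

The composite value is lower than any individual SLA, which is why each added dependency in the chain matters for the availability of the whole workflow.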
Scalability
Most Azure services are scalable by design:
- Data Factory transforms data at scale.
- The clusters in Azure Databricks resize as needed.
- For information on optimizing scalability in Blob Storage, see Performance and scalability checklist for Blob Storage.
- Data Lake Storage can manage exabytes of data.
- Microsoft Genomics runs exabyte-scale workloads.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
The technologies in this solution meet most companies' requirements for security.
Guidelines
Because of the sensitive nature of medical data, establish governance and security by following the guidelines in these documents:
- Security in the Microsoft Cloud Adoption Framework for Azure
- Practical guide to designing secure health solutions using Microsoft Azure
- Enterprise-scale landing zones
Regulatory compliance
For information on complying with HIPAA and the Health Information Technology for Economic and Clinical Health (HITECH) Act, see Microsoft Azure Compliance Offerings. According to that document, the components of this solution are in scope for HIPAA. If you substitute any other components, validate them first against the list in that document's appendix.
General security features
Several components also secure data in other ways:
- Azure Databricks provides many tools for securing network infrastructure and data. Examples include access control lists, secrets, and no public IP (NPIP).
- Blob Storage supports storage service encryption (SSE), which automatically encrypts data before storing it. It also provides many other ways to protect data and networks.
- Data Lake Storage provides access control. Its model supports these types of controls:
  - Azure role-based access control (RBAC)
  - Portable Operating System Interface (POSIX) access control lists (ACLs)
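As a rough illustration of the POSIX-style ACL model, the following Python sketch parses an ACL string in the short form `[default:]type:qualifier:permissions`, with entries separated by commas. The `AclEntry` type and `parse_acl` function are hypothetical helpers, and `clinician-app` is a placeholder for the object ID of a security principal. The sketch only parses entries. It does not reproduce how the service evaluates them.

```python
from typing import NamedTuple


class AclEntry(NamedTuple):
    scope: str      # "access" or "default"
    kind: str       # "user", "group", "mask", or "other"
    qualifier: str  # empty for the owning user, owning group, mask, and other
    read: bool
    write: bool
    execute: bool


def parse_acl(spec):
    """Parse a comma-separated ACL string into a list of AclEntry values."""
    entries = []
    for raw in spec.split(","):
        parts = raw.strip().split(":")
        scope = "access"
        if parts[0] == "default":
            scope = "default"
            parts = parts[1:]
        if len(parts) != 3:
            raise ValueError(f"malformed ACL entry: {raw!r}")
        kind, qualifier, perms = parts
        if kind not in ("user", "group", "mask", "other") or len(perms) != 3:
            raise ValueError(f"malformed ACL entry: {raw!r}")
        entries.append(AclEntry(
            scope, kind, qualifier,
            perms[0] == "r", perms[1] == "w", perms[2] == "x",
        ))
    return entries


# Hypothetical ACL: the named principal can read and traverse, but not write.
acl = parse_acl("user::rwx,user:clinician-app:r-x,group::r-x,mask::r-x,other::---")
for entry in acl:
    print(entry)
```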
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
With most Azure services, you can reduce costs by paying only for what you use:
- With Data Factory, your activity run volume determines the cost.
- Azure Databricks offers many tiers, workloads, and pricing plans to help you minimize costs.
- Blob Storage costs depend on data redundancy options and volume.
- With Data Lake Storage, pricing depends on many factors: your namespace type, storage capacity, and choice of tier.
- For Microsoft Genomics, the charge depends on the number of gigabases that each workflow processes.
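Because the Microsoft Genomics charge scales with gigabases processed, you can estimate it from the size of the input. The following Python sketch counts the bases in FASTQ text and multiplies by a per-gigabase price. It assumes standard four-line FASTQ records, and the `fastq_gigabases` and `estimate_cost` helpers, the sample reads, and the price are hypothetical. Check the current pricing page for the actual rate.

```python
def fastq_gigabases(fastq_text):
    """Count the sequenced bases in FASTQ text and return gigabases."""
    lines = fastq_text.splitlines()
    # Each record spans four lines: ID, bases, separator, quality scores.
    # The bases are on the second line of each record.
    bases = sum(len(lines[i].strip()) for i in range(1, len(lines), 4))
    return bases / 1e9


def estimate_cost(gigabases, price_per_gigabase):
    """Estimate the workflow charge for a given number of gigabases."""
    return gigabases * price_per_gigabase


# Hypothetical two-read FASTQ sample, for illustration only.
SAMPLE_FASTQ = "\n".join([
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
])
sample_gigabases = fastq_gigabases(SAMPLE_FASTQ)

# Hypothetical price; this is a placeholder, not a published rate.
HYPOTHETICAL_PRICE_PER_GIGABASE = 0.5
print(estimate_cost(100, HYPOTHETICAL_PRICE_PER_GIGABASE))
```

For real samples, you would stream the FASTQ file instead of loading it into memory, because a single genome can produce files of many gigabytes.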
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal authors:
- Wylie Graham | Senior Program Manager
- Matt Hansen | Senior Cloud Solution Architect
Next steps
- Microsoft Genomics: Common questions
- Genomics quickstart starter kit
- Burrows-Wheeler Aligner
- Genome Analysis Toolkit
Related resources
Fully deployable architectures:
Data Factory solutions
- Automated enterprise BI
- Hybrid ETL with Azure Data Factory
- Replicate and sync mainframe data in Azure
Analytics solutions
- Data warehousing and analytics
- Geospatial data processing and analytics
- Stream processing with Azure Databricks