Quickstart: Run a workflow through the Microsoft Genomics service

In this quickstart, you upload input data into an Azure Blob storage account and run a workflow through the Microsoft Genomics service by using the Python Genomics client. Microsoft Genomics is a scalable, secure service for secondary analysis that can rapidly process a genome, starting from raw reads and producing aligned reads and variant calls.

Prerequisites

  • An Azure account with an active subscription. Create an account for free.
  • Python 2.7.12+, with pip installed, and python in your system path. The Microsoft Genomics client isn't compatible with Python 3.

Set up: Create a Microsoft Genomics account in the Azure portal

To create a Microsoft Genomics account, navigate to Create a Genomics account in the Azure portal. If you don't have an Azure subscription yet, create one before creating a Microsoft Genomics account.

[Image: Microsoft Genomics on Azure portal]

Configure your Genomics account with the following information, as shown in the preceding image.

| Setting | Suggested value | Field description |
| --- | --- | --- |
| Subscription | Your subscription name | This is the billing unit for your Azure services. For details about your subscription, see Subscriptions. |
| Resource group | MyResourceGroup | Resource groups allow you to group multiple Azure resources (storage account, genomics account, etc.) into a single group for simple management. For more information, see Resource Groups. For valid resource group names, see Naming Rules. |
| Account name | MyGenomicsAccount | Choose a unique account identifier. For valid names, see Naming Rules. |
| Location | West US 2 | Service is available in West US 2, West Europe, and Southeast Asia. |

You can select Notifications in the top menu bar to monitor the deployment process.

[Image: Notifications]

For more information about Microsoft Genomics, see What is Microsoft Genomics?

Set up: Install the Microsoft Genomics Python client

You need to install both Python and the Microsoft Genomics Python client msgen in your local environment.

Install Python

The Microsoft Genomics Python client is compatible with Python 2.7.12 or a later 2.7.xx version. 2.7.14 is the suggested version. You can find the download here.

Important

Python 3.x isn't compatible with Python 2.7.xx. msgen is a Python 2.7 application. When running msgen, make sure that your active Python environment is using a 2.7.xx version of Python. You may get errors when trying to use msgen with a 3.x version of Python.
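One way to confirm the active interpreter before running msgen is a small version check. The following is a minimal sketch, not part of msgen itself; the function name is_supported_python is hypothetical, and the rule it encodes (2.7.12 or a later 2.7.xx release) is taken from the requirement stated above.

```python
import sys


def is_supported_python(version_info):
    """Return True for Python 2.7.12 or any later 2.7.x release.

    Hypothetical helper: it only mirrors the version rule stated in
    this quickstart and is not provided by msgen.
    """
    major, minor, micro = version_info[0], version_info[1], version_info[2]
    return (major, minor) == (2, 7) and micro >= 12


if __name__ == "__main__":
    if is_supported_python(sys.version_info):
        print("OK: this interpreter matches the documented msgen requirement")
    else:
        print("Warning: msgen expects Python 2.7.12+ (2.7.x), found %d.%d.%d"
              % tuple(sys.version_info[:3]))
```

Running this script with the same python that is on your system path shows whether that environment matches the documented requirement.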

Install the Microsoft Genomics Python client msgen

Use Python pip to install the Microsoft Genomics client msgen. The following instructions assume Python 2.x is already in your system path. If pip install isn't recognized, add Python and the scripts subfolder to your system path.

pip install --upgrade --no-deps msgen
pip install msgen

If you don't want to install msgen as a system-wide binary and modify system-wide Python packages, use the --user flag with pip. When you use the package-based installation or setup.py, all required packages are installed.

Test msgen Python client

To test the Microsoft Genomics client, download the config file from your Genomics account. In the Azure portal, navigate to your Genomics account by selecting All services in the top left, and then searching for and selecting Genomics accounts.

[Image: Find Microsoft Genomics on Azure portal]

Select the Genomics account you just made, navigate to Access Keys, and download the configuration file.

[Image: Download config file from Microsoft Genomics]

Test that the Microsoft Genomics Python client is working with the following command:

msgen list -f "<full path where you saved the config file>"

Create a Microsoft Azure Storage account

The Microsoft Genomics service expects inputs to be stored as block blobs in an Azure storage account. It also writes output files as block blobs to a user-specified container in an Azure storage account. The inputs and outputs can reside in different storage accounts. If you already have your data in an Azure storage account, you only need to make sure that it is in the same location as your Genomics account. Otherwise, egress charges are incurred when running the Microsoft Genomics service.

If you don't yet have an Azure storage account, you need to create one and upload your data. You can find more information about Azure storage accounts here, including what a storage account is and what services it provides. To create an Azure storage account, navigate to Create storage account in the Azure portal.

[Image: Storage account create page]

Configure your storage account with the following information, as shown in the preceding image. Use most of the standard options for a storage account, specifying only that the account is BlobStorage, not general purpose. Blob storage can be 2-5x faster than general purpose for downloads and uploads. The default deployment model, Azure Resource Manager, is recommended.

| Setting | Suggested value | Field description |
| --- | --- | --- |
| Subscription | Your Azure subscription | For details about your subscription, see Subscriptions. |
| Resource group | MyResourceGroup | You can select the same resource group as your Genomics account. For valid resource group names, see Naming rules. |
| Storage account name | MyStorageAccount | Choose a unique account identifier. For valid names, see Naming rules. |
| Location | West US 2 | Use the same location as your Genomics account to reduce egress charges and latency. |
| Performance | Standard | The default is standard. For more details on standard and premium storage accounts, see Introduction to Microsoft Azure storage. |
| Account kind | BlobStorage | Blob storage can be 2-5x faster than general purpose for downloads and uploads. |
| Replication | Locally redundant storage | Locally redundant storage replicates your data within the datacenter in the region where you created your storage account. For more information, see Azure Storage replication. |
| Access tier | Hot | Hot access indicates that objects in the storage account will be accessed more frequently. |

Then select Review + create to create your storage account. As you did with the creation of your Genomics account, you can select Notifications in the top menu bar to monitor the deployment process.

Upload input data to your storage account

The Microsoft Genomics service expects paired end reads (fastq or bam files) as input files. You can either upload your own data or explore using publicly available sample data provided for you. If you would like to use the publicly available sample data, it is hosted here:

https://msgensampledata.blob.core.windows.net/small/chr21_1.fq.gz
https://msgensampledata.blob.core.windows.net/small/chr21_2.fq.gz

Within your storage account, you need to make one blob container for your input data and a second blob container for your output data. Upload the input data into your input blob container. Various tools can be used to do this, including Microsoft Azure Storage Explorer, BlobPorter, or AzCopy.
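When moving the sample files into your own container, it can help to see how a blob URL breaks down into a storage account, a container, and a blob name. The following is a minimal sketch that uses only the Python standard library; split_blob_url is a hypothetical helper, and it assumes the usual https://<account>.blob.core.windows.net/<container>/<blob> layout seen in the sample URLs above.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def split_blob_url(url):
    """Split a blob URL into (account, container, blob_name).

    Hypothetical helper; assumes the layout
    https://<account>.blob.core.windows.net/<container>/<blob>.
    """
    parsed = urlparse(url)
    # The storage account is the first label of the host name.
    account = parsed.netloc.split(".")[0]
    # The first path segment is the container; the rest is the blob name.
    container, _, blob_name = parsed.path.lstrip("/").partition("/")
    return account, container, blob_name


SAMPLE_URLS = [
    "https://msgensampledata.blob.core.windows.net/small/chr21_1.fq.gz",
    "https://msgensampledata.blob.core.windows.net/small/chr21_2.fq.gz",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(split_blob_url(url))
```

For the sample data, the account is msgensampledata, the container is small, and the blob names are chr21_1.fq.gz and chr21_2.fq.gz.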

Run a workflow through the Microsoft Genomics service using the msgen Python client

To run a workflow through the Microsoft Genomics service, edit the config.txt file to specify the input and output storage container for your data. Open the config.txt file that you downloaded from your Genomics account. The sections you need to specify are your subscription key and the six items at the bottom: the storage account name, key, and container name for both the input and the output. You can find this information by navigating in the Azure portal to Access keys for your storage account, or directly from the Azure Storage Explorer.

[Image: Genomics config]
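Before submitting, you may want to check that the six storage settings are filled in. The following is a rough sketch under two assumptions: that config.txt uses one "key: value" pair per line (the form of the bgzip_output example in this article), and that the six items use the key names listed in REQUIRED_KEYS. Those key names, and the helpers parse_config and missing_keys, are illustrative and not confirmed by this article; compare them against the file you downloaded.

```python
# Assumed key names for the six storage items; verify against your config.txt.
REQUIRED_KEYS = [
    "input_storage_account_name",
    "input_storage_account_key",
    "input_storage_account_container",
    "output_storage_account_name",
    "output_storage_account_key",
    "output_storage_account_container",
]


def parse_config(text):
    """Parse 'key: value' lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values such as URLs stay intact.
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not settings.get(key)]


if __name__ == "__main__":
    example = (
        "input_storage_account_name: mystorage\n"
        "input_storage_account_key:\n"
    )
    print("Still to fill in:", missing_keys(parse_config(example)))
```

To check a real file, read its contents with open(path).read() and pass the text to parse_config.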

If you would like to run GATK4, set the process_name parameter to gatk4.

By default, the Genomics service outputs VCF files. If you would like a gVCF output rather than a VCF output (equivalent to -emitRefConfidence in GATK 3.x and emit-ref-confidence in GATK 4.x), add the emit_ref_confidence parameter to your config.txt and set it to gvcf, as shown in the preceding figure. To change back to VCF output, either remove the parameter from the config.txt file or set it to none.

bgzip is a tool that compresses the vcf or gvcf file, and tabix creates an index for the compressed file. By default, the Genomics service runs bgzip followed by tabix on ".g.vcf" output, but it does not run these tools on ".vcf" output. When run, the service produces ".gz" (bgzip output) and ".tbi" (tabix output) files. The argument is a boolean, which defaults to false for ".vcf" output and to true for ".g.vcf" output. To use it on the command line, specify -bz or --bgzip-output as true (run bgzip and tabix) or false. To use this argument in the config.txt file, add bgzip_output: true or bgzip_output: false to the file.
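The interaction between emit_ref_confidence and bgzip_output can be summarized in a few lines. The following is a minimal sketch of the default rule as described above, not the service's own logic; the function name effective_bgzip is hypothetical.

```python
def effective_bgzip(emit_ref_confidence=None, bgzip_output=None):
    """Return True if bgzip and tabix would run under the rules above.

    emit_ref_confidence: "gvcf", "none", or None when the parameter is absent.
    bgzip_output: True or False when set explicitly, None when absent.
    """
    if bgzip_output is not None:
        # An explicit setting overrides the default.
        return bgzip_output
    # Default: compress and index only gVCF (".g.vcf") output.
    return emit_ref_confidence == "gvcf"


if __name__ == "__main__":
    print(effective_bgzip())                          # plain VCF, no override
    print(effective_bgzip("gvcf"))                    # gVCF, no override
    print(effective_bgzip("gvcf", bgzip_output=False))  # gVCF, explicitly off
```

In other words, you only need to set bgzip_output when you want compressed VCF output or uncompressed gVCF output.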

Submit your workflow to the Microsoft Genomics service using the msgen Python client

Use the Microsoft Genomics Python client to submit your workflow with the following command:

msgen submit -f [full path to your config file] -b1 [name of your first paired end read] -b2 [name of your second paired end read]

You can view the status of your workflows using the following command:

msgen list -f c:\temp\config.txt 
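If you prefer to drive these two commands from a script, the arguments can be assembled as lists and passed to subprocess. The following is a sketch that uses only the flags shown in this article (-f, -b1, -b2); the helper names are hypothetical, and the blob names in the example assume you uploaded the sample data under its original file names.

```python
import subprocess


def build_submit_command(config_path, first_read, second_read):
    """Build the argument list for 'msgen submit' (hypothetical helper)."""
    return ["msgen", "submit", "-f", config_path,
            "-b1", first_read, "-b2", second_read]


def build_list_command(config_path):
    """Build the argument list for 'msgen list' (hypothetical helper)."""
    return ["msgen", "list", "-f", config_path]


def run(command):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(command)


if __name__ == "__main__":
    config = r"c:\temp\config.txt"
    submit = build_submit_command(config, "chr21_1.fq.gz", "chr21_2.fq.gz")
    # Print the command for review; call run(submit) to actually submit.
    print(" ".join(submit))
    print(" ".join(build_list_command(config)))
```

Passing the arguments as a list avoids shell quoting problems when the config path contains spaces.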

Once your workflow completes, you can view the output files in your Azure storage account in the output container that you configured.

Next steps

In this article, you uploaded sample input data into Azure storage and submitted a workflow to the Microsoft Genomics service through the msgen Python client. To learn more about other input file types that can be used with the Microsoft Genomics service, see the following pages: paired FASTQ | BAM | Multiple FASTQ or BAM.