Tumor/Normal pipeline

The Azure Databricks tumor/normal pipeline is a GATK best practices compliant pipeline for short read alignment and somatic variant calling using the MuTect2 variant caller.

Walkthrough

The pipeline consists of the following steps:

  1. Normal sample alignment using BWA-MEM.
  2. Tumor sample alignment using BWA-MEM.
  3. Variant calling with MuTect2.

Setup

The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
  "num_workers": {
    "type": "unlimited",
    "defaultValue": 13
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "Standard_F32s_v2"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.0.x-hls-scala2.12"
  }
}
  • The cluster configuration should use Databricks Runtime for Genomics.
  • The task should be the tumor/normal notebook found at the bottom of this page.
  • For best performance, use compute optimized VMs with at least 60 GB of memory. We recommend Standard_F32s_v2 VMs.
  • If you're running base quality score recalibration, use general purpose (Standard_D32s_v3) instances instead, since this operation requires more memory.
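
The cluster policy above restricts spark_version with the regex ".*-hls.*". As a rough illustration of what that constraint accepts, the following Python sketch checks version strings against the same pattern. The helper name is hypothetical, and the use of a full-string match is an assumption about how the policy is evaluated (for this particular pattern, a full match and a substring search give the same answer).

```python
import re

# Pattern copied from the "spark_version" entry of the cluster policy.
HLS_PATTERN = re.compile(r".*-hls.*")


def is_genomics_runtime(spark_version: str) -> bool:
    """Return True if the version string satisfies the policy's regex.

    Hypothetical helper for illustration only; it is not part of any
    Databricks API.
    """
    return HLS_PATTERN.fullmatch(spark_version) is not None


print(is_genomics_runtime("7.0.x-hls-scala2.12"))  # True
print(is_genomics_runtime("7.0.x-scala2.12"))      # False
```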

Parameters

The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. Parameters can be set for all runs or per-run.

Parameter | Default | Description
manifest | n/a | The manifest describing the input.
output | n/a | The path where pipeline output should be written.
replayMode | skip | If skip, stages will be skipped if output already exists. If overwrite, existing output will be deleted.
exportVCF | false | If true, the pipeline writes results to a VCF file as well as Delta.
perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: 's' for seconds, 'm' for minutes, or 'h' for hours. For example, '60m' will result in a timeout of 60 minutes.
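
To make the perSampleTimeout format concrete, here is a minimal Python sketch that converts values such as '60m' or '12h' into seconds. It is not the pipeline's own parser, which is not shown in this page; the function name and the decision to accept only whole numbers are assumptions made for illustration.

```python
import re

# Seconds per unit, for the three units named in the parameter description.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def timeout_to_seconds(value: str) -> int:
    """Convert a string like '60m' into a number of seconds.

    Hypothetical helper: accepts a whole number followed by s, m, or h.
    """
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"timeout must be a number plus s, m, or h: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(timeout_to_seconds("60m"))  # 3600
print(timeout_to_seconds("12h"))  # 43200
```

A value without a unit, such as '60', raises ValueError in this sketch, mirroring the requirement that the unit be included.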

Tip

To optimize runtime, set spark.sql.shuffle.partitions in the Spark config to three times the number of cores of the cluster.
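
As a worked example of this rule of thumb, the sketch below computes the setting for the cluster policy defaults shown earlier (13 workers of type Standard_F32s_v2). It assumes that only worker cores are counted and that each Standard_F32s_v2 VM provides 32 vCPUs; the function name is hypothetical.

```python
def shuffle_partitions(num_workers: int, cores_per_worker: int, factor: int = 3) -> int:
    """Return factor x total worker cores (assumption: driver cores are not counted)."""
    return factor * num_workers * cores_per_worker


# 13 workers x 32 cores = 416 cores; three times that is 1248.
print(shuffle_partitions(num_workers=13, cores_per_worker=32))  # 1248
```

Under these assumptions the Spark config entry would be spark.sql.shuffle.partitions 1248.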

Reference genomes

You must configure the reference genome using an environment variable. To use GRCh37, set an environment variable like this:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38.

To use a custom reference genome, see instructions in Custom reference genomes.

Manifest format

Note

Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:

pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor

If your input consists of unaligned BAM files, leave the paired_end field empty:

pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,normal,,read_group_normal
HG001,*.tumor.bam,HG001_tumor,tumor,,read_group_tumor

The tumor and normal samples for a given individual are grouped by the pair_id field. Within a pair, the tumor and normal samples must have different sample names and different read group names.
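
The pairing rules can be checked before submitting a job. The following Python sketch parses a manifest whose columns follow the header order (label before paired_end), groups rows by pair_id, and reports any sample_id or read_group_id shared between the tumor and normal samples of a pair. The function name and the exact set of checks are assumptions for illustration; the pipeline may validate its input differently.

```python
import csv
import io
from collections import defaultdict

MANIFEST = """\
pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor
"""


def check_manifest(text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    # pair_id -> label -> set of values seen for that label
    samples = defaultdict(lambda: defaultdict(set))
    groups = defaultdict(lambda: defaultdict(set))
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        label = row["label"]
        if label not in ("tumor", "normal"):
            problems.append(f"line {line_no}: unexpected label {label!r}")
            continue
        samples[row["pair_id"]][label].add(row["sample_id"])
        groups[row["pair_id"]][label].add(row["read_group_id"])
    for pair_id in samples:
        for field, seen in (("sample_id", samples), ("read_group_id", groups)):
            shared = seen[pair_id]["tumor"] & seen[pair_id]["normal"]
            if shared:
                problems.append(
                    f"pair {pair_id}: {field} shared by tumor and normal: {sorted(shared)}"
                )
    return problems


print(check_manifest(MANIFEST))  # []
```

A row whose label column holds something other than tumor or normal (for example, because two columns were swapped) is reported with its line number.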

Tip

If the provided manifest is a file, the file_path field in each row may be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs (*) to match many files.
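
The relative-path rule can be illustrated with a short Python sketch. It assumes plain POSIX-style paths (no URI scheme handling), and both the function name and the example location /mnt/genomics/manifest.csv are hypothetical. Glob characters are left untouched, since expanding them is the pipeline's job.

```python
import posixpath


def resolve_file_path(manifest_path: str, file_path: str) -> str:
    """Resolve a manifest row's file_path against the manifest's directory.

    Hypothetical helper: absolute paths are returned unchanged, relative
    paths are joined to the directory that contains the manifest file.
    """
    if posixpath.isabs(file_path):
        return file_path
    return posixpath.join(posixpath.dirname(manifest_path), file_path)


print(resolve_file_path("/mnt/genomics/manifest.csv", "*_R1_*.normal.fastq.bgz"))
# /mnt/genomics/*_R1_*.normal.fastq.bgz
print(resolve_file_path("/mnt/genomics/manifest.csv", "/data/HG001.tumor.bam"))
# /data/HG001.tumor.bam
```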

Additional usage info and troubleshooting

The tumor/normal pipeline shares many operational details with the other Azure Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, steps for setting up custom reference genomes, and common issues, see DNASeq pipeline.

Note

The pipeline was renamed from TNSeq to MutSeq in Databricks Runtime 7.3 LTS for Genomics and above.

MutSeq pipeline notebook

Get notebook

TNSeq pipeline notebook (Legacy)

Get notebook