GATKBaseRecalibration

This is a wrapper node for the AnalyzeCovariates, BaseRecalibrator and PrintReads walkers of the Genome Analysis Toolkit (GATK). It addresses systematic errors in the base quality scores emitted by sequencing machines. Because many variant calling tools rely on these base qualities, removing the bias leads to more accurate variant calls. Recalibration consists of three steps.
Step 1: A model of covariation is trained on the actual data and on known sites of genetic variation. (walkers: BaseRecalibrator)
Step 2: This optional step builds a second model and compares it to the first one. The comparison is used to generate before/after plots of the quality values. (walkers: BaseRecalibrator + AnalyzeCovariates)
Step 3: Finally, the model is applied to the alignment data, and the base qualities are adjusted for the biases found. (walkers: PrintReads)
For further information, see the GATK documentation of the BaseRecalibrator, the AnalyzeCovariates and the PrintReads walkers.
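The three steps above can be sketched as a sequence of command lines. This is a minimal illustration in Python that assumes GATK 3 style walker invocations (-T, -R, -I, -knownSites, -BQSR, -o); the commands the node actually runs are not shown in this description, and all file names as well as the helper functions gatk and recalibration_commands are hypothetical.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # hypothetical location of the GATK jar


def gatk(walker: str, reference: str, *args: str) -> List[str]:
    """Build a GATK 3 style command line for one walker."""
    return ["java", "-jar", GATK_JAR, "-T", walker, "-R", reference, *args]


def recalibration_commands(bam: str, reference: str, known_sites: List[str],
                           with_plots: bool = True) -> List[List[str]]:
    """Return the command lines for the recalibration steps (nothing is executed)."""
    known = [arg for vcf in known_sites for arg in ("-knownSites", vcf)]
    commands = [
        # Step 1: train the first model on the data and the known sites.
        gatk("BaseRecalibrator", reference, "-I", bam, *known, "-o", "recal.table"),
    ]
    if with_plots:
        # Step 2 (optional): build a second model on top of the first one ...
        commands.append(gatk("BaseRecalibrator", reference, "-I", bam, *known,
                             "-BQSR", "recal.table", "-o", "post_recal.table"))
        # ... and compare both models to produce before/after plots.
        commands.append(gatk("AnalyzeCovariates", reference,
                             "-before", "recal.table",
                             "-after", "post_recal.table",
                             "-plots", "recalibration_plots.pdf"))
    # Step 3: apply the first model to the alignment data.
    commands.append(gatk("PrintReads", reference, "-I", bam,
                         "-BQSR", "recal.table", "-o", "recalibrated.bam"))
    return commands
```

Each returned list could be passed to subprocess.run; skipping the optional second step leaves only the BaseRecalibrator and PrintReads calls.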

Options

Sets of known polymorphisms
You have to provide the node with at least one of the three named sets: indels from the 1000 Genomes project, indels from Mills and the 1000 Genomes project, or variants from dbSNP. BaseRecalibrator needs these sets to train its model.
Interval for recalibration
Check this option to restrict recalibration to certain genomic regions. Specify the intervals in a text file in BED format and select the file in the file browser.
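As a rough illustration of what such an interval file contains, the following Python sketch parses a small BED text. BED lines hold a chromosome name, a 0-based start and an exclusive end, followed by optional extra columns. The example coordinates and the helper parse_bed are made up for illustration and are not part of the node.

```python
from typing import List, Tuple

# Hypothetical example content of a BED interval file (whitespace-separated).
EXAMPLE_BED = "chr1\t10000\t20000\nchr2\t500\t1500\texon_7\n"


def parse_bed(text: str) -> List[Tuple[str, int, int]]:
    """Parse the first three BED columns: chromosome, 0-based start, exclusive end."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"invalid interval: {line!r}")
        intervals.append((chrom, start, end))
    return intervals


def covered_bases(intervals: List[Tuple[str, int, int]]) -> int:
    """Total interval length; with half-open coordinates this is simply end - start."""
    return sum(end - start for _, start, end in intervals)
```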
Analyze Covariates
Before/after plots of the base quality score can be generated.
Optional flags: Set additional command line flags for the AnalyzeCovariates walker.
Print Reads
Specify whether to remove all additional information from the output BAM file except for the read group tag. This option reduces the output file size.
Optional flags: Set additional command line flags for the PrintReads walker.
General options
Number of CPU threads: Increasing the number of threads speeds up the node, but it also increases the memory required for the calculations. The BaseRecalibrator and the PrintReads walker run in multi-threaded mode.
Shared Java Memory: Set the maximum Java heap size shared by all CPU threads.
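A minimal sketch of how these two general options could map onto a command line, assuming the heap size is passed to the JVM as -Xmx and the thread count as the GATK 3 -nct argument (CPU threads per data thread). Whether the node uses exactly these arguments is an assumption; java_prefix and with_threads are hypothetical helpers.

```python
from typing import List


def java_prefix(heap_gb: int, jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """JVM call with a maximum heap size that all CPU threads share."""
    if heap_gb < 1:
        raise ValueError("heap size must be at least 1 GB")
    return ["java", f"-Xmx{heap_gb}g", "-jar", jar]


def with_threads(walker_args: List[str], threads: int) -> List[str]:
    """Append -nct only when more than one CPU thread is requested."""
    if threads < 1:
        raise ValueError("thread count must be at least 1")
    return walker_args + (["-nct", str(threads)] if threads > 1 else [])
```

Because the heap is shared, raising the thread count without raising the heap size leaves less memory per thread.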

BaseRecalibrator

Cycle threshold
Set the maximum cycle value permitted for the Cycle covariate. (Default value = 500)
The cycle covariate will generate an error if it encounters a cycle greater than this value. This argument is ignored if the Cycle covariate is not used.
Gap open penalty
Gap open penalty for calculating BAQ (per-base alignment quality, the probability that a base is not correctly aligned). (Default value = 40) A value of 30 may work better for whole genome call sets.
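Penalties and qualities of this kind are usually given on the Phred scale, where a value Q corresponds to an error probability of 10^(-Q/10). Assuming the gap open penalty follows that convention, the short Python sketch below converts between the two representations; the function names are illustrative only.

```python
import math


def phred_to_probability(q: float) -> float:
    """Convert a Phred-scaled value Q into a probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def probability_to_phred(p: float) -> float:
    """Convert a probability into a Phred-scaled value: Q = -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)
```

On this scale, lowering the penalty from 40 to 30 raises the corresponding probability by a factor of ten.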
Default quality for deletions
Set the default quality to use as a prior (reported quality) in the base deletion covariate model. (Default value = 45)
If set, this default value replaces all base qualities in the read. A negative value turns the option off.
Default quality for insertions
Set the default base quality to use as a prior (reported quality) in the base insertion covariate model. (Default value = 45)
This value is used for all reads that do not carry per-base insertion quality scores. The option is on by default; setting the value to -1 disables it.
Default quality for mismatches
Set the default quality to use as a prior (reported quality) in the base mismatch covariate model. (Default value = -1)
If set, this default value replaces all base qualities in the read. A negative value turns the option off.
k-mer context size for indels
Define the size of the k-mer context to be used for base insertions and deletions. (Default value = 3)
The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. The value must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.
k-mer context size for mismatches
Set the size of the k-mer context to be used for base mismatches. (Default value = 2)
The context covariate will use a context of this size to calculate its covariate value for base mismatches. The value must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required java heap size.
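To give an intuition for the context size, the sketch below lists the k-mer ending at each base of a read and counts how many distinct contexts are possible for a given k. It assumes that a context is simply the k bases ending at the current position; GATK's exact handling of strands, ambiguous bases and read ends is not reproduced, and the function names are hypothetical.

```python
from typing import List, Optional


def kmer_contexts(read: str, k: int) -> List[Optional[str]]:
    """Return, for every base, the k-mer ending at that base (None if too close to the start)."""
    if not 1 <= k <= 13:
        raise ValueError("context size must be between 1 and 13 (inclusive)")
    contexts: List[Optional[str]] = []
    for i in range(len(read)):
        if i + 1 < k:
            contexts.append(None)  # not enough preceding bases for a full context
        else:
            contexts.append(read[i - k + 1:i + 1])
    return contexts


def possible_contexts(k: int) -> int:
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k
```

The number of possible contexts grows as 4^k, which is why larger context sizes need more runtime and Java heap.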
Quality threshold for read tails
Define the minimum quality for the bases in the tail of the reads to be considered. (Default value = 2)
Low-quality bases on either tail of a read (beginning or end) will not be considered in the context. This parameter defines the quality at or below which a tail base is considered low quality.
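The effect of the threshold can be pictured with a small Python sketch that trims runs of low-quality bases from both ends of a read. This is only an illustration of the rule described above (quality at or below the threshold counts as low), not GATK's implementation; usable_range is a hypothetical helper.

```python
from typing import List, Tuple


def usable_range(quals: List[int], threshold: int = 2) -> Tuple[int, int]:
    """Return the half-open index range [start, end) left after removing low-quality tails."""
    start = 0
    # Skip low-quality bases at the beginning of the read.
    while start < len(quals) and quals[start] <= threshold:
        start += 1
    end = len(quals)
    # Skip low-quality bases at the end of the read.
    while end > start and quals[end - 1] <= threshold:
        end -= 1
    return start, end
```

A low-quality base in the middle of the read is not affected; only the uninterrupted runs at either end are excluded.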
Optional flags
Set additional command line flags for the BaseRecalibrator walker.

Preference page

HTE
Set threshold for repeated execution. Only used if HTE is enabled in the preference page.
Path to GATK jar file
Set the path to the GenomeAnalysisTK.jar. This will be done automatically if the path is already defined in the preference page.
Path to reference genome
Set the path to the reference genome.
Path to 1000G Indels
Set the path to the 1000G project indels data set.
Path to Mills
Set the path to the Mills and 1000G reference data set.
Path to dbSNP
Set the path to the dbSNP reference data set.

Input Ports

Cell 0: Path to input BAM file

Output Ports

Cell 0: Path to recalibrated BAM file

Views

STDOUT / STDERR
The node offers a direct view of the standard output and standard error of the tool.
