GATKRealignment

IBIS Helmholtz-Node extension for KNIME Workbench version 1.8.1.201707071203 by IBIS KNIME Team

This is a wrapper node for the RealignerTargetCreator and IndelRealigner walkers of the Genome Analysis Toolkit (GATK). The node realigns reads around putative insertions and deletions in a BAM file to reduce mapping errors and avoid false positive variant calls. This is done in two steps. First, the RealignerTargetCreator walker generates a list of candidate regions for realignment. Second, the IndelRealigner walker performs the realignment for all candidate regions. To increase the accuracy of realignment, both walkers can be provided with sites of known indels. For further information, see the GATK documentation of the RealignerTargetCreator and the IndelRealigner.
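The two steps can be sketched as two GATK command lines. This is a minimal illustration, not the node's actual implementation: the function name, the file names and the `.intervals` suffix are hypothetical, and the flags (`-T`, `-R`, `-I`, `-o`, `-known`, `-nt`, `-targetIntervals`) are the GATK 3 command-line arguments as commonly documented, so check them against your installed GATK version.

```python
def realignment_commands(gatk_jar, reference, bam, out_bam,
                         known=(), threads=1, mem_gb=4):
    """Return the two GATK 3 command lines (as argument lists) for indel realignment."""
    # Hypothetical name for the candidate-region list passed from step 1 to step 2.
    intervals = out_bam + ".intervals"
    java = ["java", f"-Xmx{mem_gb}G", "-jar", gatk_jar]
    # Each known-indel file is passed with its own -known flag to both walkers.
    known_args = [arg for vcf in known for arg in ("-known", vcf)]

    # Step 1: list candidate regions; only this walker is run multi-threaded (-nt).
    target_creator = java + [
        "-T", "RealignerTargetCreator",
        "-R", reference, "-I", bam,
        "-nt", str(threads),
        "-o", intervals,
    ] + known_args

    # Step 2: realign the reads inside the candidate regions.
    indel_realigner = java + [
        "-T", "IndelRealigner",
        "-R", reference, "-I", bam,
        "-targetIntervals", intervals,
        "-o", out_bam,
    ] + known_args

    return target_creator, indel_realigner
```

Each returned list could be handed to `subprocess.run(...)` in turn; the second command must only start after the first has written the interval list.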

Options

Sets of known indels
Supporting the realignment with a set of known indels (indels from the 1000 Genomes Project or the Mills and 1000 Genomes Gold Standard Indels) increases the realignment accuracy. File paths can be set via the preference page.
Interval for realignment
You can check this option to perform realignment in certain genomic regions. You have to specify the intervals in a text file in BED format and select the file in the file browser.
Number of threads
Increasing the number of threads speeds up the node, but also increases the memory required for the calculations. Only the RealignerTargetCreator tool can be run in multi-threaded mode.
Java Memory in GB
Set the maximum Java heap size per thread.
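For the interval option above, a BED file is a tab-separated text file with one region per line: chromosome, 0-based start, and exclusive end. The helper below is a small sketch for producing such lines; the function name and the example region are hypothetical, and the chromosome names must match those used in your reference genome.

```python
def format_bed(regions):
    """Format (chrom, start, end) tuples as BED text.

    BED coordinates are 0-based with an exclusive end, so the
    1-based region chr1:1000-2000 becomes ("chr1", 999, 2000).
    """
    lines = []
    for chrom, start, end in regions:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\n")
    return "".join(lines)
```

The returned text can be written to a file, which is then selected in the node's file browser.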

RealignerTargetCreator

Maximum length of realignment interval
This option serves to limit the interval length for realignment. By defining the maximum interval size, any intervals larger than this value will be dropped. (Default value = 500)
Note that the realignment algorithm has quadratic complexity, so longer intervals heavily impact the runtime.
Minimum number of reads for entropy calculations
Define the minimum number of reads at a locus to enable using the entropy calculation. (Default value = 4)
If a locus is covered by at least that many reads, RealignerTargetCreator calculates an entropy value for the site.
Fraction of mismatching base qualities
This option sets the minimum fraction of mismatches at a locus for the locus to be considered high entropy.
To disable this behavior, set this value to 0. This feature is only needed when the primary alignment was ungapped. (Default value = 0.0)
Window size for clustering SNPs
Any two SNP calls and/or high entropy positions are considered clustered when they occur no more than this many basepairs apart. Must be > 1. (Default value = 10)
Optional flags
Set additional command line flags for the RealignerTargetCreator.
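The interplay of the window size and the maximum interval length can be illustrated with a simplified sketch. This is not GATK's implementation; it only shows the idea that positions lying close together are merged into one candidate interval and that overly long intervals are dropped. The function name is hypothetical, and the defaults mirror the values listed above.

```python
def cluster_positions(positions, window_size=10, max_interval=500):
    """Merge positions at most window_size bp apart; drop intervals longer than max_interval."""
    intervals = []
    start = prev = None
    for pos in sorted(positions):
        if start is None:
            start = prev = pos            # open the first interval
        elif pos - prev <= window_size:
            prev = pos                    # close enough: extend the current interval
        else:
            intervals.append((start, prev))
            start = prev = pos            # too far away: open a new interval
    if start is not None:
        intervals.append((start, prev))
    # Interval length is counted inclusively (end - start + 1).
    return [(s, e) for s, e in intervals if e - s + 1 <= max_interval]
```

For example, the positions 100, 105 and 112 form one interval because each neighbour is within 10 bp of the previous one, while a position at 300 starts a new interval.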

IndelRealigner

Consensus determination model
You can choose between three different models to calculate the alternate consensus sequence.
  • USE_READS: Recommended option. This model uses known indels and the indels in the original alignment for identifying a consensus.
  • USE_SW: Additionally uses 'Smith-Waterman' to generate alternate consensuses. If you have used an ungapped aligner you should select this model.
  • KNOWNS_ONLY: Uses only indels from a provided ROD of known indels for determining the consensus sequence.
LOD Threshold
The LOD is a measure of significance. A low LOD for a region implies that realigning it leads only to small improvements. Therefore only regions with a LOD equal to or above this threshold will be realigned, which makes realignment more efficient. Decreasing the threshold can be helpful when dealing with low coverage data or when searching for rare indels. (Default value = 5)
Entropy threshold
The entropy threshold defines the minimum fraction of mismatches at a locus for the locus to be considered high entropy. IndelRealigner realigns only at loci with high entropy, and only if the realignment reduces the overall entropy of the region. (Default value = 0.15)
Consensus threshold
Define the max alternate consensuses to try (necessary to improve performance in deep coverage). (Default value = 30)
This option tries to reduce the overall realignment runtime.
Insert size threshold
The insert size of a read pair is defined as the distance between the leftmost and the rightmost mapping position of the read edges. An extremely high insert size indicates a completely misplaced read and is likely not to represent the correct region of origin of a read. Define the maximum insert size of read pairs by setting a threshold. (Default value = 3000)
Read shift threshold
This option aims to reduce the overall runtime. Set the maximum distance in base pairs by which a read may be moved during realignment. (Default value = 200)
Maximum number of reads used for consensus calculation
Set a threshold for the max number of reads to be used for finding the alternate consensuses. (Default value = 120)
This option aims to reduce the overall runtime. As consensus calculation for a large number of reads is costly in terms of time, all regions exceeding this threshold will not be realigned. Increase this value according to the depth of your data if you do not want to exclude regions of high coverage.
Maximum number of reads for realignment
Define the max number of reads allowed at an interval for realignment. (Default value = 20 000)
This option is to limit the usage of memory. IndelRealigner has to load all reads in need of realignment of an interval into memory. If the number of reads exceeds this threshold no realignment is performed. When dealing with deep sequencing data, you can increase this value.
Output original cigar string
IndelRealigner resets the mapping position and CIGAR string of all realigned reads in the output BAM file. By default (option not ticked), the tool stores the original coordinates as a tag. Ticking this option prevents this behaviour and reduces the size of your output BAM file.
Optional flags
Set additional command line flags for the IndelRealigner walker.
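The options in this section correspond to IndelRealigner command-line arguments. The mapping below is a sketch under assumptions: the flag names are the GATK 3 long argument names as commonly documented and should be verified against your GATK version, the Python option names are hypothetical, and USE_READS is treated as the default model because it is the recommended one. How the node itself assembles its command line is not described on this page.

```python
# Hypothetical option name -> (GATK 3 flag, default value as listed on this page)
INDEL_REALIGNER_OPTIONS = {
    "consensus_model": ("--consensusDeterminationModel", "USE_READS"),
    "lod_threshold": ("--LODThresholdForCleaning", 5.0),
    "entropy_threshold": ("--entropyThreshold", 0.15),
    "max_consensuses": ("--maxConsensuses", 30),
    "max_insert_size": ("--maxIsizeForMovement", 3000),
    "max_read_shift": ("--maxPositionalMoveAllowed", 200),
    "max_reads_consensus": ("--maxReadsForConsensuses", 120),
    "max_reads_realignment": ("--maxReadsForRealignment", 20000),
}


def indel_realigner_args(no_original_tags=False, **overrides):
    """Return extra IndelRealigner arguments for every option that differs from its default."""
    unknown = set(overrides) - set(INDEL_REALIGNER_OPTIONS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    args = []
    for name, (flag, default) in INDEL_REALIGNER_OPTIONS.items():
        value = overrides.get(name, default)
        if value != default:
            args += [flag, str(value)]
    if no_original_tags:
        # Suppresses the tags that store the original alignment of realigned reads.
        args.append("--noOriginalAlignmentTags")
    return args
```

For low-coverage data one might, for instance, call `indel_realigner_args(lod_threshold=2.0)` and append the result to the IndelRealigner command.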

Preference page

HTE
Set threshold for repeated execution. Only used if HTE is enabled in the preference page.
Path to GATK jar file
Set the path to the GenomeAnalysisTK.jar. This will be done automatically if the path is already defined in the preference page.
Path to reference genome
Set the path to the reference genome. This will be done automatically if the path is already defined in the preference page.
Path to 1000G Indels
Set the path to the 1000G project indels data set.
Path to Mills
Set the path to the Mills and 1000G reference data set.

Input Ports

Cell 0: Path to input BAM file

Output Ports

Cell 0: Path to realigned BAM file

Views

STDOUT / STDERR
The node offers a direct view of the standard output and standard error of the tool.

Installation

To use this node in KNIME, install KNIME4NGS from the following update site:

KNIME 4.3