GATKRealignment

This is a wrapper node for the RealignerTargetCreator and IndelRealigner walkers of the Genome Analysis Toolkit (GATK). The node realigns reads around putative insertions and deletions in a BAM file to reduce mapping errors and to avoid false positive variant calls. This is achieved in two steps. First, the RealignerTargetCreator walker generates a list of candidate regions for realignment. Second, the IndelRealigner walker performs the realignment for all candidate regions. To increase the accuracy of the realignment, both walkers can be provided with sites of known indels. For further information, see the GATK documentation of the RealignerTargetCreator and the IndelRealigner.
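
The two steps can be pictured as two consecutive GATK command lines. The following is a minimal sketch, assuming a GATK 3 style command line; the function name, the name of the intermediate intervals file and the way memory and threads are passed are illustrative assumptions and not necessarily what the node does internally.

```python
from typing import List, Optional, Tuple


def build_realignment_commands(
    gatk_jar: str,
    reference: str,
    input_bam: str,
    output_bam: str,
    known_indels: Optional[List[str]] = None,
    threads: int = 1,
    memory_gb: int = 4,
) -> Tuple[List[str], List[str]]:
    """Return the argument lists for the two realignment steps."""
    # Hypothetical name for the list of candidate regions (assumption).
    intervals = output_bam + ".intervals"
    base = ["java", f"-Xmx{memory_gb}g", "-jar", gatk_jar,
            "-R", reference, "-I", input_bam]
    # Each set of known indels is passed with its own -known argument.
    known = [arg for vcf in (known_indels or []) for arg in ("-known", vcf)]

    # Step 1: find candidate regions; only this walker is run multi-threaded.
    create_targets = (base + ["-T", "RealignerTargetCreator",
                              "-nt", str(threads), "-o", intervals] + known)
    # Step 2: realign the reads inside the candidate regions.
    realign = (base + ["-T", "IndelRealigner",
                       "-targetIntervals", intervals, "-o", output_bam] + known)
    return create_targets, realign
```

Each returned list could be handed to `subprocess.run`, one after the other; the second command depends on the intervals file written by the first.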

Options

Sets of known indels: Supporting the realignment with a set of known indels (indels from the 1000 Genomes Project or the Mills and 1000 Genomes Gold Standard Indels) increases the realignment accuracy. The file paths can be set via the preference page.
Interval for realignment: Check this option to restrict realignment to certain genomic regions. Specify the intervals in a text file in BED format and select the file in the file browser.
Number of threads: Increasing the number of threads speeds up the node, but also increases the memory required for the calculations. Only the RealignerTargetCreator tool can be run in multi-threaded mode.
Java Memory in GB: Set the maximum Java heap size per thread.
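
An interval file in BED format can be produced with a few lines of code. The sketch below assumes that the regions of interest are given as 1-based, inclusive coordinates and converts them to the 0-based, half-open coordinates used by BED; the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def to_bed_lines(regions: Iterable[Tuple[str, int, int]]) -> List[str]:
    """Convert 1-based, inclusive (chrom, start, end) regions to BED lines.

    BED uses 0-based, half-open coordinates, so only the start is shifted.
    """
    lines = []
    for chrom, start, end in regions:
        if start < 1 or end < start:
            raise ValueError(f"invalid region: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start - 1}\t{end}")
    return lines


def write_bed(path: str, regions: Iterable[Tuple[str, int, int]]) -> None:
    """Write the regions to a tab-separated BED file."""
    with open(path, "w") as handle:
        handle.write("\n".join(to_bed_lines(regions)) + "\n")
```

For example, `write_bed("targets.bed", [("chr1", 1000, 2000)])` writes a single line with the columns `chr1`, `999` and `2000`.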

RealignerTargetCreator

Maximum length of realignment interval: This option limits the interval length for realignment. Any interval larger than this value is dropped. (Default value = 500)
Note that the realignment algorithm has quadratic complexity, so longer intervals heavily impact the runtime.
Minimum number of reads for entropy calculations: Define the minimum number of reads at a locus required for the entropy calculation. (Default value = 4)
If a locus is covered by at least that many reads, RealignerTargetCreator calculates an entropy value for the site.
Fraction of mismatching base qualities: This option sets the minimum fraction of mismatches at a locus for the locus to be considered high entropy.
To disable this behavior, set this value to 0. This feature is only necessary if the primary alignment is ungapped. (Default value = 0.0)
Window size for clustering SNPs: Any two SNP calls and/or high entropy positions are considered clustered when they occur no more than this many base pairs apart. Must be > 1. (Default value = 10)
Optional flags: Set additional command line flags for the RealignerTargetCreator.
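
The interplay of these options can be illustrated with a simplified model. The sketch below is not the RealignerTargetCreator implementation: it assumes plain mismatch counts instead of base qualities, chains neighbouring positions into one cluster, and ignores other evidence that the walker may use, such as indels in the reads or known indel sites. All function names are hypothetical.

```python
from typing import List, Tuple


def is_high_entropy(mismatches: int, depth: int, min_reads: int = 4,
                    mismatch_fraction: float = 0.0) -> bool:
    """Flag a locus whose mismatch fraction reaches the configured minimum.

    A fraction of 0 disables the check, and loci covered by fewer than
    min_reads reads are never evaluated.
    """
    if mismatch_fraction <= 0.0 or depth < min_reads:
        return False
    return mismatches / depth >= mismatch_fraction


def cluster_sites(positions: List[int], window: int = 10,
                  max_interval: int = 500) -> List[Tuple[int, int]]:
    """Merge positions that lie at most `window` bp apart into intervals.

    Isolated positions form no interval, and intervals longer than
    max_interval are dropped.
    """
    clusters: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[-1] > window:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [(c[0], c[-1]) for c in clusters
            if len(c) >= 2 and c[-1] - c[0] + 1 <= max_interval]
```

With the default window of 10, the positions 100, 105 and 112 form one candidate interval, while an isolated position at 300 does not.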

IndelRealigner

Consensus determination model

You can choose between three different models to calculate the alternate consensus sequence.

USE_READS: Recommended option. This model uses known indels and the indels in the original alignment for identifying a consensus.
USE_SW: Additionally uses Smith-Waterman alignment to generate alternate consensuses. Select this model if you have used an ungapped aligner.
KNOWNS_ONLY: Uses only indels from a provided ROD (reference-ordered data) file of known indels for determining the consensus sequence.

LOD Threshold

The LOD is a measure of significance. A low LOD for a region implies that realigning it leads only to small improvements. Therefore, only regions with a LOD equal to or above this threshold are realigned, which makes realignment more efficient. Decreasing the threshold can be helpful when dealing with low coverage data or when searching for rare indels. (Default value = 5)

Entropy threshold

The entropy threshold defines the minimum fraction of mismatches at a locus for the locus to be considered high entropy. IndelRealigner realigns only at loci with high entropy, and only if the realignment reduces the overall entropy of the region. (Default value = 0.15)
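
The decision rule described above can be sketched as follows. This is a simplified model under stated assumptions: each locus is summarised by a plain count of mismatching reads and total reads, and the "overall entropy" of a region is taken to be the number of high-entropy loci. The walker's own calculation may differ, for example by weighting base qualities; the names below are hypothetical.

```python
from typing import List, Tuple

# (mismatching reads, total reads) at one locus; an illustrative type.
Column = Tuple[int, int]


def count_high_entropy(columns: List[Column], threshold: float = 0.15) -> int:
    """Count loci whose mismatch fraction is at least the threshold."""
    return sum(1 for mismatches, depth in columns
               if depth > 0 and mismatches / depth >= threshold)


def realignment_reduces_entropy(before: List[Column], after: List[Column],
                                threshold: float = 0.15) -> bool:
    """Accept a realignment only if it lowers the number of high-entropy loci."""
    return (count_high_entropy(after, threshold)
            < count_high_entropy(before, threshold))
```

A region with two high-entropy loci before and one after realignment would be accepted under this model; a region where the count stays the same would be left untouched.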

Consensus threshold

Define the maximum number of alternate consensuses to try. Limiting this number reduces the overall realignment runtime, which is necessary for acceptable performance with deep coverage data. (Default value = 30)

Insert size threshold

The insert size of a read pair is defined as the distance between the leftmost and the rightmost mapped position of the two reads. An extremely large insert size indicates a misplaced read whose mapping position is unlikely to represent its true region of origin. Define the maximum insert size of read pairs by setting a threshold. (Default value = 3000)
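
The definition of the insert size translates directly into code. The sketch below assumes 1-based, inclusive alignment coordinates and assumes that read pairs above the threshold are simply not considered for realignment; both function names are hypothetical.

```python
def insert_size(read_start: int, read_end: int,
                mate_start: int, mate_end: int) -> int:
    """Distance between the leftmost and rightmost mapped base of a read pair."""
    return max(read_end, mate_end) - min(read_start, mate_start) + 1


def may_be_realigned(read_start: int, read_end: int,
                     mate_start: int, mate_end: int,
                     max_insert_size: int = 3000) -> bool:
    """True if the pair's insert size does not exceed the threshold."""
    size = insert_size(read_start, read_end, mate_start, mate_end)
    return size <= max_insert_size
```

A pair mapped at 100-199 and 400-499 has an insert size of 400 and passes the default threshold, whereas a mate mapped 5000 bp away does not.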

Read shift threshold

This option aims to reduce the overall runtime. Define the maximum number of base pairs by which the position of a read may be shifted during realignment. (Default value = 200)

Maximum number of reads used for consensus calculation

Set the maximum number of reads used for finding the alternate consensuses. (Default value = 120)
This option aims to reduce the overall runtime. Because consensus calculation for a large number of reads is time-consuming, regions exceeding this threshold are not realigned. Increase this value according to the depth of your data if you do not want to exclude regions of high coverage.

Maximum number of reads for realignment

Define the maximum number of reads allowed at an interval for realignment. (Default value = 20 000)
This option limits memory usage. IndelRealigner has to load all reads of an interval that need realignment into memory. If the number of reads exceeds this threshold, no realignment is performed. When dealing with deep sequencing data, you can increase this value.
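
To estimate how many intervals a given threshold would exclude, the read counts per interval can be checked up front. This is a minimal sketch, assuming the read counts have already been obtained by other means; the function name and the input dictionary are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_read_limit(read_counts: Dict[str, int],
                        max_reads: int = 20000) -> Tuple[List[str], List[str]]:
    """Separate intervals that would be realigned from those that are skipped.

    An interval is skipped when its read count exceeds max_reads.
    """
    realigned = [name for name, n in read_counts.items() if n <= max_reads]
    skipped = [name for name, n in read_counts.items() if n > max_reads]
    return realigned, skipped
```

If many intervals of a deep sequencing run end up in the skipped list, raising the threshold (and the available memory) may be worthwhile.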

Output original cigar string

IndelRealigner resets the mapping position and CIGAR string of every realigned read in the output BAM file. By default, that is, if this option is not checked, the tool adds the original coordinates as tags. Checking this option prevents this behaviour and reduces the size of your output BAM file.

Optional flags

Set additional command line flags for the IndelRealigner walker.
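
The IndelRealigner options described in this section correspond to command line arguments of the walker. The sketch below collects them with the default values listed above. The argument names follow the GATK 3 IndelRealigner documentation as far as known and should be checked against the installed GATK version; how the node itself maps its settings to arguments is an assumption, and the class name is hypothetical.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class IndelRealignerOptions:
    """Illustrative container for the node settings (hypothetical)."""
    model: str = "USE_READS"
    lod_threshold: float = 5.0
    entropy_threshold: float = 0.15
    max_consensuses: int = 30
    max_insert_size: int = 3000
    max_positional_move: int = 200
    max_reads_for_consensuses: int = 120
    max_reads_for_realignment: int = 20000
    no_original_tags: bool = False
    extra_flags: str = ""

    def to_args(self) -> List[str]:
        """Translate the settings into command line arguments."""
        if self.model not in ("USE_READS", "USE_SW", "KNOWNS_ONLY"):
            raise ValueError(f"unknown consensus model: {self.model}")
        args = [
            "--consensusDeterminationModel", self.model,
            "--LODThresholdForCleaning", str(self.lod_threshold),
            "--entropyThreshold", str(self.entropy_threshold),
            "--maxConsensuses", str(self.max_consensuses),
            "--maxIsizeForMovement", str(self.max_insert_size),
            "--maxPositionalMoveAllowed", str(self.max_positional_move),
            "--maxReadsForConsensuses", str(self.max_reads_for_consensuses),
            "--maxReadsForRealignment", str(self.max_reads_for_realignment),
        ]
        if self.no_original_tags:
            args.append("--noOriginalAlignmentTags")
        # Optional flags are appended unchanged, split like a shell would.
        return args + shlex.split(self.extra_flags)
```

For example, `IndelRealignerOptions(model="USE_SW", lod_threshold=2.5).to_args()` yields the arguments for an ungapped primary alignment with a lowered LOD threshold.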

Preference page

HTE: Set threshold for repeated execution. Only used if HTE is enabled in the preference page.
Path to GATK jar file: Set the path to the GenomeAnalysisTK.jar. This will be done automatically if the path is already defined in the preference page.
Path to reference genome: Set the path to the reference genome. This will be done automatically if the path is already defined in the preference page.
Path to 1000G Indels: Set the path to the 1000G project indels data set.
Path to Mills: Set the path to the Mills and 1000G reference data set.

Input Ports

Cell 0: Path to input BAM file

Output Ports

Cell 0: Path to realigned BAM file

Views

STDOUT / STDERR: The node offers a direct view of the standard output and the standard error of the tool.

Installation

To use this node in KNIME, install the extension KNIME4NGS from the below update site following our NodePit Product and Node Installation Guide:

v5.6

Plugin provider: IBIS KNIME Team

Plugin version: 1.8.1.201707071203

On NodePit since: 2025-08-15

Last update: 2025-08-17

KNIME versions: Since v3.6
