
GATKUnifiedGenotyper

IBIS Helmholtz-Node extension for KNIME Workbench version 1.8.1.201707071203 by IBIS KNIME Team

This is a wrapper node for UnifiedGenotyper, which is part of the Genome Analysis Toolkit (GATK). UnifiedGenotyper is based on a Bayesian genotype likelihood model and can identify SNPs and indels. It can also annotate all found variants with their corresponding dbSNP ID. Please note that this tool has been deprecated in favor of HaplotypeCaller, a more sophisticated variant caller that produces better calls, especially on indels, and includes features that allow it to scale to much larger cohort sizes. For further information, see the online documentation of the UnifiedGenotyper.

Options

Choose variant type(s)
This option enables you to choose whether to call only SNPs, only indels or both.
Use dbSNP
Tick this option in order to annotate a variant with its ID from dbSNP. The path to the dbSNP VCF file can be set in the preference page (tab).
Output folder
You can choose an output folder for the resulting VCF file. Leave this field empty if you want to use the folder in which the input BAM files are located.
SNP calling
Define the minimum base quality required to consider a base for calling. (Default value = 17)
Note that the base quality of a base is capped by the mapping quality, so bases on reads with low mapping quality may get filtered out depending on this value. Note also that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with a fixed threshold of Q20).
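The capping rule described above can be sketched as follows. This is only an illustration, not GATK's implementation: the function name is hypothetical, and treating a base whose capped quality equals the threshold as usable is an assumption.

```python
def base_is_usable(base_quality, mapping_quality, min_base_quality=17):
    """Return True if a base would be considered for SNP calling.

    The base quality is capped by the mapping quality of its read,
    then compared with the minimum base quality (node default: 17).
    Assumption: a capped quality equal to the threshold passes.
    """
    capped_quality = min(base_quality, mapping_quality)
    return capped_quality >= min_base_quality


# A high-quality base on a poorly mapped read is filtered out.
print(base_is_usable(base_quality=30, mapping_quality=60))
print(base_is_usable(base_quality=30, mapping_quality=10))
```

The second call shows why a low mapping quality can remove bases even when their own base quality is well above the threshold.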
Indel calling
Indel heterozygosity value: The probabilistic model of UnifiedGenotyper uses the indel heterozygosity value for calculating the prior likelihood for an indel. (Default value = 1.25E-4)
Minimum count of indel reads: Set the minimum number of consensus indels required to trigger a genotyping run. (Default value = 5)
A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value leads to higher sensitivity, increased runtime and larger rates of false positives.
Minimum fraction of indel reads: Define the minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles. (Default value = 0.25)
This option is complementary to the option "Minimum count of indel reads". Only samples with at least this fraction of indel-containing reads contribute to the count that must reach the minimum number of consensus indels required to trigger a genotyping run. This parameter ensures that, in deep data, many very rare errors are not summed up to overcome the default threshold of 5 reads. It should work equally well for low-coverage and high-coverage samples, as low-coverage samples with any indel-containing reads should easily overcome this threshold.
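The interplay between the minimum count and the minimum fraction can be sketched as below. This is a simplified illustration under stated assumptions: the function name and the input layout are hypothetical, and the sketch counts all indel-containing reads together instead of tracking each consensus allele separately as the real tool does.

```python
def indel_site_triggers_genotyping(samples, min_count=5, min_fraction=0.25):
    """Decide whether a candidate indel site would be genotyped.

    samples: list of (indel_reads, total_reads) tuples, one per sample.
    A sample contributes its indel reads only if the fraction of
    indel-containing reads reaches min_fraction. The site triggers
    genotyping if the summed contributions reach min_count.
    """
    contributing_reads = 0
    for indel_reads, total_reads in samples:
        if total_reads == 0:
            continue  # no coverage, nothing to contribute
        if indel_reads / total_reads >= min_fraction:
            contributing_reads += indel_reads
    return contributing_reads >= min_count


# Six indel reads in 1000 reads are most likely errors: no trigger.
print(indel_site_triggers_genotyping([(6, 1000)]))
# Two low-coverage samples with a clear indel signal: trigger.
print(indel_site_triggers_genotyping([(3, 10), (4, 12)]))
```

In the first call the count alone would pass the threshold of 5, but the fraction filter excludes the sample, which is the situation the option is meant to handle in deep data.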
Indel gap open penalty: The gap open penalty for an indel is the phred-scaled probability assumed for the occurrence of an indel start. It is used for calculating alignment scores. (Default value = 45)
Indel gap continuation penalty: The gap continuation penalty for an indel is the phred-scaled probability assumed for the occurrence of an indel continuation. It is used for calculating alignment scores. (Default value = 10)
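To make the phred-scaled penalties more tangible, the following sketch converts them to probabilities with the standard phred relation P = 10^(-Q/10). The function name is hypothetical; only the conversion formula and the two default values from the options above are used.

```python
def phred_to_probability(phred_score):
    """Convert a phred-scaled value Q to a probability: P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


# Default penalties of this node, expressed as probabilities.
gap_open_probability = phred_to_probability(45)          # about 3.2e-05
gap_continuation_probability = phred_to_probability(10)  # 0.1
print(gap_open_probability, gap_continuation_probability)
```

With the defaults, opening an indel is assumed to be far less likely than extending one that has already started.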
Further parameters
Sample contamination: Define the fraction of contamination in sequence data (for all samples) to aggressively remove. (Default value = 0)
If this fraction is greater than zero, the caller will aggressively attempt to remove contamination through biased down-sampling of reads. Basically, it will ignore the contamination fraction of reads for each alternate allele. So if the pileup contains N total bases, the caller will try to remove (N * contamination fraction) bases for each alternate allele.
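The arithmetic behind the contamination fraction can be sketched as follows. The function name is hypothetical, and rounding down to a whole number of bases is an assumption; the rounding used by GATK itself is not stated here.

```python
import math


def bases_to_remove_per_alt_allele(pileup_depth, contamination_fraction):
    """Number of bases the caller would try to remove per alternate allele.

    Implements N * contamination fraction from the description above.
    Assumption: the result is rounded down to a whole number of bases.
    """
    return math.floor(pileup_depth * contamination_fraction)


# With the default fraction of 0, nothing is removed.
print(bases_to_remove_per_alt_allele(40, 0.0))
# A pileup of 40 bases and a fraction of 0.25 gives 10 bases per allele.
print(bases_to_remove_per_alt_allele(40, 0.25))
```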
Heterozygosity: Set the heterozygosity value. The probabilistic model of UnifiedGenotyper uses this heterozygosity value for calculating the prior likelihood that a locus is non-reference. (Default value = 0.001)
That is, a heterozygosity value of 0.001 implies that two randomly chosen chromosomes from the population of organisms would differ from each other at a rate of 1 in 1000 bp.
Fraction of deletions: Set a threshold for the maximum fraction of reads with deletions spanning this locus for it to be callable. (Default value = 0.05)
If the fraction of reads with deletions spanning a locus is greater than this value, the site will not be considered callable and will be skipped. To disable the use of this parameter, set its value to >1. All loci below this threshold are examined for additional variants.
PCR error: Estimate the expected PCR error rate, which is used for computing fragment-based likelihoods. (Default value = 0.0001)
The PCR error rate is independent of the sequencing error rate, which is necessary because the tool cannot necessarily distinguish between PCR errors and sequencing errors. In practice, this value effectively acts as a cap on the base qualities.
Confidence threshold for calling: Define the minimum phred-scaled confidence threshold at which variants should be called.
This threshold refers to the GATK variant quality score. The minimum phred-scaled Qscore threshold separates high confidence from low confidence calls. Only genotypes with confidence above or equal to this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling (this is the default).
Confidence threshold for emitting: Set the minimum phred-scaled confidence threshold at which variants should be emitted. (Default value = 30)
This threshold refers to the GATK variant quality score. GATK will output all variants with a score equal to or above this threshold. Emitted variants whose score is below the confidence threshold for calling are marked as low confidence in the filter field of the VCF file.
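The relationship between the two confidence thresholds can be sketched as below. The function name and the returned labels are hypothetical and chosen only for illustration; the exact text GATK writes to the VCF filter field is not reproduced here.

```python
def classify_variant(quality_score, call_threshold=30.0, emit_threshold=30.0):
    """Classify a variant by its phred-scaled quality score.

    - below the emitting threshold: not written to the VCF file
    - at or above the emitting threshold, below the calling threshold:
      written, but marked as low confidence
    - at or above the calling threshold: written as a confident call
    """
    if quality_score < emit_threshold:
        return "not emitted"
    if quality_score < call_threshold:
        return "emitted, low confidence"
    return "called"


# Lowering only the emitting threshold keeps borderline variants
# in the output while still flagging them.
for score in (5.0, 20.0, 50.0):
    print(score, classify_variant(score, call_threshold=30.0, emit_threshold=10.0))
```

With both thresholds at their default of 30, the middle category is empty: every emitted variant is also a confident call.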
Malformed read filter
When the length of the read does not match the length of the base quality score, GATK will report an error. By ticking this option you force GATK to skip such reads.
Number of threads
Set the number of threads to be used.
Increasing the number of threads speeds up the node, but it also increases the memory required for the calculations.

GATK

Java Memory
Set the maximum Java heap size (in GB).
Use BED file?
Tick this option in order to call variants in certain genomic regions. You have to specify the intervals in a text file in BED format and select the file in the file browser.
Further options
Set additional command line flags for the GATKUnifiedGenotyper.
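For orientation, the settings in this section correspond roughly to a GATK 3.x command line like the one assembled in the sketch below. How the node actually builds its command is not documented here, so this is an assumption; the function name and all file names are hypothetical. The flags used (`-T`, `-R`, `-I`, `-o`, `-glm`, `-nt`, `-L`, `--dbsnp`) are options of GATK 3.x UnifiedGenotyper. The sketch only builds the argument list and does not run GATK.

```python
def build_unified_genotyper_command(gatk_jar, reference, bam, vcf_out,
                                    memory_gb=4, threads=1,
                                    bed=None, dbsnp=None, extra_flags=()):
    """Assemble a GATK 3.x UnifiedGenotyper call as an argument list."""
    command = [
        "java", f"-Xmx{memory_gb}g",  # maximum Java heap size in GB
        "-jar", gatk_jar,
        "-T", "UnifiedGenotyper",
        "-R", reference,              # reference sequence
        "-I", bam,                    # indexed input BAM file
        "-o", vcf_out,                # output VCF file
        "-glm", "BOTH",               # call SNPs and indels
        "-nt", str(threads),          # number of threads
    ]
    if bed is not None:
        command += ["-L", bed]        # restrict calling to BED intervals
    if dbsnp is not None:
        command += ["--dbsnp", dbsnp] # annotate variants with dbSNP IDs
    command += list(extra_flags)      # "Further options" of the node
    return command


# Hypothetical file names, for illustration only.
print(" ".join(build_unified_genotyper_command(
    "GenomeAnalysisTK.jar", "reference.fasta", "sample.bam", "sample.vcf",
    memory_gb=8, threads=4, bed="regions.bed", dbsnp="dbsnp.vcf")))
```

Returning a list instead of a single string keeps paths with spaces intact if the list is later passed to a process launcher.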

Preference page

HTE
Set a threshold for repeated execution. Only used if HTE is enabled in the preference page.
Path to dbSNP
Set the path to the dbSNP data set file. This will be done automatically if the path is already defined in the preference page.
Path to reference sequence
Set the path to the reference sequence. This will be done automatically if the path is already defined in the preference page.
Path to GATK jar file
Set the path to GenomeAnalysisTK.jar. This will be done automatically if the path is already defined in the preference page.

Input Ports

Cell 0: Path to BAM file (BAM file for variant calling; it has to be indexed)

Output Ports

Cell 0: Path to VCF file

Views

STDOUT / STDERR
The node offers a direct view of the standard output and standard error of the tool.

Installation

To use this node in KNIME, install KNIME4NGS from the following update site:

KNIME 4.3
