This node uses the DESeq package of R to test for differential expression based on a model using the negative binomial distribution.
DESeq takes as input a count table and an annotation file.


Empirical dispersion calculation
This method obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions) it first computes for each gene an empirical dispersion value (a.k.a. a raw SCV value), then fits by regression a dispersion-mean relationship and finally chooses for each gene a dispersion parameter that will be used in subsequent tests from the empirical and the fitted value according to the 'sharingMode' argument.
There are four ways in which the empirical dispersion can be computed:
  • pooled - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, and assign it to all samples.
  • pooled-CR - Estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL).
  • per-condition - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used.
  • blind - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates.
(default: pooled)
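
As a rough illustration of how the method decides which samples contribute to a gene's empirical dispersion, the following Python sketch contrasts 'blind' with 'per-condition'. It uses a simplified method-of-moments estimate (sample variance minus the expected Poisson noise, divided by the squared mean). The function names and the example counts are hypothetical, and the code is an assumption-laden sketch, not the DESeq implementation.

```python
from statistics import mean, variance


def raw_dispersion(counts, size_factors):
    """Simplified method-of-moments dispersion estimate for one gene.

    This is an illustrative formula, not DESeq's actual code. The result can
    be negative when a gene varies less than Poisson noise would predict.
    """
    normalized = [c / s for c, s in zip(counts, size_factors)]
    mu = mean(normalized)
    # Expected Poisson (shot-noise) part of the variance of normalized counts.
    shot_noise = mu * mean([1 / s for s in size_factors])
    return (variance(normalized) - shot_noise) / mu ** 2


def per_condition_dispersion(counts, size_factors, conditions):
    """Estimate one value per replicated condition, using only its samples."""
    result = {}
    for cond in sorted(set(conditions)):
        idx = [i for i, c in enumerate(conditions) if c == cond]
        if len(idx) < 2:
            continue  # an unreplicated condition cannot be estimated on its own
        result[cond] = raw_dispersion(
            [counts[i] for i in idx], [size_factors[i] for i in idx]
        )
    return result


# Hypothetical counts for a single gene in four samples.
counts = [10, 20, 100, 120]
size_factors = [1.0, 1.0, 1.0, 1.0]
conditions = ["A", "A", "B", "B"]

# 'blind': ignore the labels and treat all samples as replicates.
blind = raw_dispersion(counts, size_factors)
# 'per-condition': one estimate per condition.
per_cond = per_condition_dispersion(counts, size_factors, conditions)
print(blind, per_cond)
```

In this made-up example the gene differs strongly between the two conditions, so the 'blind' estimate absorbs the between-condition difference and comes out larger than either per-condition estimate.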
Sharing mode
After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce the variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to that of this gene). The sharingMode argument specifies which of these two values will be written to the featureData's disp_ columns.
  • fit-only - Use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability.
  • maximum - Take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates, and possibly even with only two replicates.
  • gene-est-only - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.
(default: maximum)
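
The selection described above can be sketched in a few lines of Python. The function name and the example values are hypothetical; the sketch only mirrors the three options as described here and is not taken from the DESeq source.

```python
def choose_dispersion(empirical, fitted, sharing_mode="maximum"):
    """Pick the per-gene dispersion that is used in subsequent tests."""
    if sharing_mode == "fit-only":
        return fitted  # the empirical value only served as input to the fit
    if sharing_mode == "maximum":
        return max(empirical, fitted)  # conservative choice
    if sharing_mode == "gene-est-only":
        return empirical  # no sharing of information across genes
    raise ValueError(f"unknown sharing mode: {sharing_mode!r}")


# A hypothetical gene whose empirical estimate lies below the fitted trend.
for mode in ("fit-only", "maximum", "gene-est-only"):
    print(mode, choose_dispersion(empirical=0.02, fitted=0.10, sharing_mode=mode))
```

For a gene whose empirical value lies below the fitted trend, only 'gene-est-only' keeps the smaller value, which is why it is the option most exposed to false positives when replicates are few.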

Input Ports

Count table
Row names: IDs of features.
Column headers: Names of the samples.
Cell 0...n: Count of features in the samples.

Annotation file
Row names: Names of the samples as they are named in the count table.
Column header: Should be named 'condition'.
Cell 0: Condition (should only contain two conditions).
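
The requirements on the two inputs can be checked before the node is run. The following Python sketch assumes the sample names of the count table and the sample-to-condition mapping of the annotation have already been read into plain Python objects; the function name and the example data are hypothetical.

```python
def validate_inputs(count_columns, annotation):
    """Check that the annotation matches the count table.

    count_columns: sample names taken from the count table's column headers.
    annotation: dict mapping sample name -> value of the 'condition' column.
    Returns the two condition labels in sorted order.
    """
    missing = [s for s in count_columns if s not in annotation]
    if missing:
        raise ValueError(f"samples missing from annotation: {missing}")
    conditions = sorted({annotation[s] for s in count_columns})
    if len(conditions) != 2:
        raise ValueError(f"expected exactly two conditions, got {conditions}")
    return conditions


# Hypothetical example: four samples, two conditions.
samples = ["s1", "s2", "s3", "s4"]
annotation = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
print(validate_inputs(samples, annotation))
```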

Output Ports

Cell 0: ID of feature
Cell 1: Average log2 CPM (counts per million) expression
Cell 2: Log2 CPM A expression
Cell 3: Log2 CPM B expression
Cell 4: Fold change
Cell 5: Log2 fold change
Cell 6: P-value
Cell 7: Adjusted p-value (corrected with BH)
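
To make the last columns concrete, the following Python sketch shows how a log2 fold change relates to a fold change and how a Benjamini-Hochberg (BH) adjustment of p-values can be computed. The function names are hypothetical, and the direction of the fold change (B relative to A) is an assumption, since it is not stated here.

```python
import math


def log2_fold_change(mean_a, mean_b):
    """Log2 of the fold change, assuming the fold change is B relative to A."""
    return math.log2(mean_b / mean_a)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value to the smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # 1-based rank of this p-value in ascending order
        # Scale by n / rank and enforce monotonicity with a running minimum.
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical values for illustration only.
print(log2_fold_change(mean_a=50.0, mean_b=200.0))
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```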


STDOUT and STDERR of the underlying R script.



