Nominal Probability Distribution Creator

Creates a column containing a probability distribution either from numeric columns or from a single string column. For numeric columns, one or more columns that contain probability values can be picked. The probability values in each row must be non-negative and must sum up to 1.
For a string column, exactly one column can be selected, and the resulting probability distribution is a one-hot encoding of that column. This requires the column to have a valid domain, i.e., the possible values of the column must be known. You can use a Domain Calculator to calculate these values if they are not present. Each possible value is treated as a separate class, so the number of distinct values in the string column is the number of classes in the created probability distribution. The string value of a cell gets a probability of 1, whereas all other possible string values of the column get a probability of 0. The same output can be achieved by applying the One to Many node to the same string column and creating a probability distribution from its numeric output columns.
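The one-hot encoding described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the node's implementation; the function name `one_hot_distribution` and the example domain are assumptions made for this sketch.

```python
def one_hot_distribution(value, domain):
    """Map each class in the domain to 1.0 for the cell's value and 0.0 otherwise."""
    if value not in domain:
        # Without a known domain entry the class cannot be encoded.
        raise ValueError(f"{value!r} is not in the column domain {domain}")
    return {cls: (1.0 if cls == value else 0.0) for cls in domain}


# Hypothetical domain of a string column with three distinct values.
domain = ["red", "green", "blue"]
print(one_hot_distribution("green", domain))
# prints {'red': 0.0, 'green': 1.0, 'blue': 0.0}
```

The number of entries in the result equals the number of values in the domain, which mirrors the statement that each possible value becomes one class.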

Options

Column type
Choose whether to create probability distributions from numeric columns or a single string column.
  • Numeric columns: Creates probability distributions from one or more numeric columns containing probability values.
  • String column: Creates a one-hot encoded probability distribution from a single string column.
String column
Select a single string column with a valid domain to create a one-hot encoded probability distribution. The number of distinct values in the string column will be the number of classes in the created distribution. The string value of a cell will have a probability of 1, whereas all other possible string values of the column will have a probability of 0.
Numeric columns
Move the columns that contain the probability values to the "Include" list.
Allow probabilities that sum up to 1 imprecisely
If enabled, the probabilities do not need to sum up to exactly 1. This can be helpful if, e.g., the probability values contain rounding errors.
Precision (number of decimal digits)
Defines the precision that the sum of the probabilities must have by restricting the number of decimal digits that must be precise. The sum is accepted if abs(sum - 1) <= 10^(-precision), e.g., if the sum is 0.999, it is only accepted with a precision of <=2. The lower the specified number, the higher the tolerance.
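The acceptance rule abs(sum - 1) <= 10^(-precision) can be written out as a minimal Python sketch. The function name `sum_is_accepted` is hypothetical, and the node's actual arithmetic may differ in detail.

```python
def sum_is_accepted(probabilities, precision):
    """Return True if the probabilities sum to 1 within 10^(-precision)."""
    tolerance = 10 ** (-precision)  # e.g. precision 2 gives a tolerance of 0.01
    return abs(sum(probabilities) - 1) <= tolerance


row = [0.5, 0.3, 0.199]          # sums to roughly 0.999
print(sum_is_accepted(row, 2))   # prints True  (difference of about 0.001 <= 0.01)
print(sum_is_accepted(row, 4))   # prints False (difference of about 0.001 > 0.0001)
```

A sum that lies exactly on the boundary, such as 0.999 with a precision of 3, is sensitive to floating-point rounding, so the sketch deliberately avoids that case.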
Invalid probability distribution handling
Specify how to treat invalid probabilities. Invalid means, e.g., negative probabilities or probabilities that do not sum up to 1 (with respect to the specified precision). If 'Fail' is selected, the node will fail. Otherwise, the node just gives a warning and puts missing values in the output for the corresponding rows.
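The two behaviours for invalid rows can be illustrated with a short Python sketch. The function name `distribution_or_missing` and the boolean flag `fail_on_invalid` are assumptions for this example; `None` stands in for a missing value in the output.

```python
import warnings


def distribution_or_missing(probabilities, precision, fail_on_invalid):
    """Return the row as a distribution, or None (missing) if it is invalid."""
    negative = any(p < 0 for p in probabilities)
    bad_sum = abs(sum(probabilities) - 1) > 10 ** (-precision)
    if not (negative or bad_sum):
        return list(probabilities)
    message = f"invalid probability distribution: {probabilities}"
    if fail_on_invalid:
        # Corresponds to selecting 'Fail': processing stops with an error.
        raise ValueError(message)
    # Otherwise: warn and emit a missing value for this row.
    warnings.warn(message)
    return None


print(distribution_or_missing([0.25, 0.75], 4, fail_on_invalid=True))
# prints [0.25, 0.75]
```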
Output column name
Specify the name of the created column containing the probability distribution.
Remove included columns
If selected, the included numeric columns or the picked string column will be removed from the output.
Missing value handling
Specify how to treat a missing value in one of the input columns. If 'Fail' is selected, the node will fail. If 'Ignore' is selected, the node just gives a warning and puts missing values in the output for the corresponding rows. If 'Treat as zero' is selected, the missing value will be treated as 0.
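The three missing value options can be sketched as follows. This is an illustrative sketch only: the function name `handle_missing` is hypothetical, `None` represents a missing cell, and it is an assumption that values produced by 'Treat as zero' are still subject to the validity checks described above.

```python
def handle_missing(row, mode):
    """Apply a missing value option to one row of probability values.

    Returns the values to use, or None if the output cell should be missing.
    """
    if all(v is not None for v in row):
        return list(row)  # nothing is missing, keep the row as it is
    if mode == "Fail":
        raise ValueError("missing value in one of the input columns")
    if mode == "Ignore":
        return None  # the output cell for this row becomes missing
    if mode == "Treat as zero":
        return [0.0 if v is None else v for v in row]
    raise ValueError(f"unknown mode: {mode}")


print(handle_missing([0.4, None, 0.6], "Treat as zero"))  # prints [0.4, 0.0, 0.6]
print(handle_missing([0.4, None, 0.6], "Ignore"))         # prints None
```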

Input Ports

Data with columns containing probability values or a column containing string values.

Output Ports

Input data with an appended column that contains the nominal probability distribution.

Views

This node has no views
