Synthetic Data Generator (Nominal)

This component fills a nominal column with synthetic values drawn from the frequency distribution of the original nominal column. The synthetic data can also be based on the multivariate (joint) frequency distribution of the nominal column and one or more dependency columns. Each synthetic value can be traced back to its row in the original data by the row ID.
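The idea of drawing values from a frequency distribution can be sketched in a few lines of Python. This is only an illustration under assumptions, not the component's actual implementation: the function name `generate_nominal`, the dictionary-based table layout, and the choice to draw exactly one value per original row are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def generate_nominal(rows, target, dependencies=(), seed=None):
    """Draw one synthetic value of `target` per row.

    rows: dict mapping row ID -> dict of column name -> value (assumed layout).
    dependencies: columns whose joint distribution with `target` is kept.
    """
    rng = random.Random(seed)

    # Count how often each target value occurs within each dependency group.
    # With no dependency columns, every row falls into the single group ().
    freq = defaultdict(Counter)
    for values in rows.values():
        group = tuple(values[c] for c in dependencies)
        freq[group][values[target]] += 1

    # For every row, sample a value with probability proportional to its
    # frequency in the row's own dependency group. Row IDs are preserved.
    synthetic = {}
    for row_id, values in rows.items():
        group = tuple(values[c] for c in dependencies)
        categories = list(freq[group])
        weights = list(freq[group].values())
        synthetic[row_id] = rng.choices(categories, weights=weights, k=1)[0]
    return synthetic

rows = {
    "Row0": {"region": "N", "color": "red"},
    "Row1": {"region": "N", "color": "red"},
    "Row2": {"region": "S", "color": "blue"},
}
with_dependency = generate_nominal(rows, "color", ["region"], seed=1)
```

In this toy table the color is fully determined by the region, so sampling within dependency groups reproduces the original values. Without the dependency column, any row may receive either color, in proportion to the overall frequencies.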

In addition, it is possible to exclude categories with too few original examples from the data generation, and to add random noise to the synthetic data.

Synthetic data generation is used, for example, when the original data is confidential (anonymization) or difficult or expensive to collect.

Options

Use Seed: Use a seed to make the output reproducible.
Select Dependency Columns: The frequency distribution between the nominal column and the columns selected here will be maintained during the synthetic data generation. Numeric columns will be binned.
Select Nominal Column to Generate: Values in this column will be replaced by synthetic values.
Select Noise: The number entered here corresponds to the fraction of rows where noise will be added. Noise means replacing the generated value with a random value within the column domain.
Select Required Number of Examples: If a nominal category has fewer than the selected number of examples, the rows in this category are excluded from the synthetic data generation. This can happen when many dependency columns are selected, or when any of the nominal or dependency columns has many unique values. Note that if all dependency groups are smaller than the required size, the component returns an empty table at its top output.
Select Number of Bins for Numeric Dependency Columns: If a numeric dependency column was selected, it will be categorized by creating the selected number of bins of equal width. If no numeric dependency column was selected, this option has no effect.
Select Sample Size: The number of generated rows in the component's output.
Select Seed: The seed for the data generation, used if the seed is enabled.
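
Three of the options above (equal-width binning, the required number of examples, and noise) can be illustrated with a small Python sketch. The helper names `equal_width_bin`, `drop_small_categories`, and `add_noise` are hypothetical, and the details (for example, how bin edges are handled or how the number of noisy rows is rounded) are assumptions rather than the component's documented behavior.

```python
import random
from collections import Counter

def equal_width_bin(values, n_bins):
    """Map each number to a bin index 0..n_bins-1 using bins of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a constant column needs only one bin
    width = (hi - lo) / n_bins
    # The maximum would fall just outside the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def drop_small_categories(categories, min_examples):
    """Return the indices of rows whose category has enough examples."""
    counts = Counter(categories)
    return [i for i, c in enumerate(categories) if counts[c] >= min_examples]

def add_noise(generated, domain, fraction, seed=None):
    """Replace a fraction of the generated values with random domain values."""
    rng = random.Random(seed)
    n_noisy = round(fraction * len(generated))
    noisy = list(generated)
    # Pick distinct rows and overwrite each with a uniform draw from the domain.
    for i in rng.sample(range(len(generated)), n_noisy):
        noisy[i] = rng.choice(domain)
    return noisy

bins = equal_width_bin([0, 5, 10], n_bins=2)
kept = drop_small_categories(["a", "a", "b"], min_examples=2)
noisy = add_noise(["x"] * 10, domain=["y"], fraction=0.3, seed=0)
```

Note that a noisy row may receive the same value it already had, because the replacement is drawn from the whole column domain. If every category is smaller than the required size, `drop_small_categories` returns an empty list, which mirrors the empty output table described above.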

Input Ports

: The original nominal column and possibly dependency columns

Output Ports

: The synthetic nominal column together with the original row IDs
: The frequency distribution of the original nominal column and possibly the dependency columns
: The dependency group ID of each row ID in the original data
