Synthetic Data Generator (Nominal)

This component fills a nominal column with synthetic values drawn from the frequency distribution of the original nominal column. Synthetic data can also be generated from the multivariate (joint) frequency distribution of the nominal column and one or more dependency columns. Each synthetic value can be matched to its row in the original data by the row ID.
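
As a rough illustration of frequency-based generation, the sketch below counts how often each combination of dependency values and nominal value occurs, then draws new combinations with probability proportional to those counts. It is a minimal sketch of the general idea, not the component's actual implementation; the function name `sample_nominal`, its parameters, and the example columns `region` and `plan` are hypothetical.

```python
import random
from collections import Counter

def sample_nominal(rows, target, dependencies=(), sample_size=5, seed=None):
    """Draw synthetic (dependency..., target) combinations by observed frequency.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    columns = (*dependencies, target)
    # Count how often each combination of dependency and target values occurs.
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    combos = list(counts)
    weights = [counts[c] for c in combos]
    # Sample combinations with probability proportional to their frequency.
    drawn = rng.choices(combos, weights=weights, k=sample_size)
    return [dict(zip(columns, combo)) for combo in drawn]

original = [
    {"region": "north", "plan": "basic"},
    {"region": "north", "plan": "basic"},
    {"region": "north", "plan": "premium"},
    {"region": "south", "plan": "premium"},
]
synthetic = sample_nominal(original, "plan", ["region"], sample_size=1000, seed=42)
```

Because only observed combinations are counted, a combination that never occurs in the original data (here `south` with `basic`) is never generated unless noise is added afterwards.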

In addition, categories with too few original examples can be excluded from the data generation, and random noise can be added to the synthetic data.

Synthetic data generation is used, for example, when the original data is confidential (anonymization) or difficult or expensive to collect.

Options

Use Seed
Use a seed to make the output reproducible.
Select Dependency Columns
The joint frequency distribution of the nominal column and the columns selected here will be maintained during the synthetic data generation. Numeric columns will be binned.
Select Nominal Column to Generate
Values in this column will be replaced by synthetic values.
Select Noise
The fraction of rows to which noise is added. Adding noise means replacing the generated value with a random value from the column domain.
Select Required Number of Examples
If a nominal category has fewer than the selected number of examples, the rows in this category will be excluded from the synthetic data generation. This can happen when many dependency columns are selected, or when any of the nominal or dependency columns has many unique values. Note that if all dependency groups are smaller than the required size, the component returns an empty table at its top output.
Select Number of Bins for Numeric Dependency Columns
If a numeric dependency column was selected, it will be categorized by creating the selected number of bins of equal width. If no numeric dependency column was selected, this option has no effect.
Select Sample Size
The number of generated rows in the component's output.
Select Seed
The seed used for the data generation if Use Seed is enabled.
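
The binning, minimum-examples, and noise options described above can be sketched as three small helpers. This is a sketch under assumptions, not the component's code: the helper names are hypothetical, the noise fraction is assumed to select an exact number of rows, and how the component treats bin edges is not documented here, so the maximum value is simply placed in the last bin.

```python
import random
from collections import Counter

def equal_width_bin(value, lo, hi, n_bins):
    """Return the 0-based index of the equal-width bin that contains value."""
    if hi == lo:
        return 0
    width = (hi - lo) / n_bins
    # Assumption: the maximum value falls into the last bin, not a new one.
    return min(int((value - lo) / width), n_bins - 1)

def keep_large_groups(groups, required):
    """Return the indices of rows whose group has at least `required` examples."""
    sizes = Counter(groups)
    return [i for i, g in enumerate(groups) if sizes[g] >= required]

def add_noise(values, domain, fraction, seed=None):
    """Replace the value in a `fraction` of the rows with a random domain value."""
    rng = random.Random(seed)
    noisy = list(values)
    n_noisy = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), n_noisy):
        # The random value may coincide with the value it replaces.
        noisy[i] = rng.choice(domain)
    return noisy

ages = [18, 25, 40, 62, 78]
bins = [equal_width_bin(a, min(ages), max(ages), 3) for a in ages]  # [0, 0, 1, 2, 2]
kept = keep_large_groups(bins, required=2)                          # [0, 1, 3, 4]
noisy = add_noise(["a"] * 10, ["a", "b", "c"], fraction=0.3, seed=1)
```

In this example the single row in bin 1 is dropped because its group is smaller than the required size. If every group were that small, `kept` would be empty, which corresponds to the empty top output mentioned above.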

Input Ports

The original nominal column and possibly dependency columns

Output Ports

The synthetic nominal column together with the original row IDs
The frequency distribution of the original nominal column and possibly the dependency columns
The dependency group ID of each row ID in the original data
