Synthetic Data Generator (Numeric)

This component replaces the values of a numeric column with synthetic values sampled from a selected distribution (Uniform, Gaussian, or Gamma) whose parameters are estimated from the original column. Synthetic data can also be generated from separate distributions for different subsets of the data (dependency groups), which are defined by one or more dependency columns. Each synthetic value can be matched to its row in the original data by the row ID.
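The idea of fitting parameters per dependency group and then sampling can be sketched in plain Python. This is a minimal illustration, not the component's actual implementation: the function names (`fit_parameters`, `draw`, `generate`), the input layout, and the choice of estimators (min/max for Uniform, mean and standard deviation for Gaussian, method of moments for Gamma) are all assumptions made for the example.

```python
import random
import statistics
from collections import defaultdict


def fit_parameters(values, distribution):
    """Estimate distribution parameters from the original values.

    The estimators used here are assumptions made for this sketch.
    """
    if distribution == "uniform":
        return {"low": min(values), "high": max(values)}
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)
    if distribution == "gaussian":
        return {"mean": mean, "std": variance ** 0.5}
    if distribution == "gamma":
        # Method of moments; needs a positive mean and non-zero variance.
        return {"shape": mean ** 2 / variance, "scale": variance / mean}
    raise ValueError(f"Unknown distribution: {distribution}")


def draw(params, distribution, rng):
    """Draw one synthetic value from the fitted distribution."""
    if distribution == "uniform":
        return rng.uniform(params["low"], params["high"])
    if distribution == "gaussian":
        return rng.gauss(params["mean"], params["std"])
    return rng.gammavariate(params["shape"], params["scale"])


def generate(rows, distribution, seed=None):
    """rows: list of (row_id, dependency_group, value) tuples (hypothetical layout)."""
    rng = random.Random(seed)
    values_by_group = defaultdict(list)
    for _, group, value in rows:
        values_by_group[group].append(value)
    # One parameter set per dependency group.
    params = {g: fit_parameters(v, distribution) for g, v in values_by_group.items()}
    # Each synthetic value stays attached to the original row ID.
    synthetic = {row_id: draw(params[group], distribution, rng)
                 for row_id, group, _ in rows}
    return synthetic, params


rows = [("Row0", "A", 1.0), ("Row1", "A", 3.0),
        ("Row2", "B", 10.0), ("Row3", "B", 20.0)]
synthetic, params = generate(rows, "uniform", seed=42)
```

Passing the same seed twice yields the same synthetic values, which mirrors the purpose of the seed options described below.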

In addition, dependency groups with too few examples can be excluded from the data generation, and random noise can be added to the synthetic data.
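Both steps can be illustrated with a short Python sketch, following the behavior described under Options: groups below a required size are dropped, and noise replaces a fraction of the generated values with a random value within the column domain. The function names, the row layout, and the use of a uniform draw between the original minimum and maximum are assumptions of this example, not confirmed details of the component.

```python
import random
from collections import Counter


def filter_small_groups(rows, min_examples):
    """Keep only rows whose dependency group has at least min_examples rows.

    rows: list of (row_id, dependency_group, value) tuples (hypothetical layout).
    """
    counts = Counter(group for _, group, _ in rows)
    return [row for row in rows if counts[row[1]] >= min_examples]


def add_noise(synthetic, original_values, noise_fraction, seed=None):
    """Replace a fraction of the synthetic values with random in-domain values.

    Assumption: "within the column domain" is read as a uniform draw
    between the minimum and maximum of the original column.
    """
    rng = random.Random(seed)
    low, high = min(original_values), max(original_values)
    row_ids = list(synthetic)
    n_noisy = round(noise_fraction * len(row_ids))
    noisy_ids = set(rng.sample(row_ids, n_noisy))
    return {row_id: rng.uniform(low, high) if row_id in noisy_ids else value
            for row_id, value in synthetic.items()}


rows = [("Row0", "A", 1.0), ("Row1", "A", 3.0), ("Row2", "B", 10.0)]
kept = filter_small_groups(rows, min_examples=2)  # group "B" has only one row
```

If every group is smaller than `min_examples`, `filter_small_groups` returns an empty list, which corresponds to the empty-table case noted under Options.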

Synthetic data generation is used, for example, when the original data is confidential (anonymization) or difficult or expensive to collect.

Options

Use Seed
Use a seed to make the output reproducible.
Select Dependency Columns
These columns determine the subsets (dependency groups) for which the distribution parameters are calculated separately. Numeric dependency columns are binned before the subsets are determined.
Select Numeric Column to Generate
Values in this column will be replaced by synthetic values.
Select Noise
The fraction of rows to which noise is added. Adding noise means replacing the generated value with a random value within the column domain.
Select Required Number of Examples
If a dependency group has fewer than the selected number of examples, the rows in this group are excluded from the synthetic data generation. This can happen if you select many dependency columns, or a dependency column with many unique values. Note that if all dependency groups are smaller than the required size, the component returns an empty table at its top output.
Select Number of Bins for Numeric Dependency Columns
If a numeric dependency column is selected, it is categorized by creating the selected number of equal-width bins. If no numeric dependency column is selected, this option has no effect.
Select Sample Size
The number of generated rows in the component's output.
Select Seed
The seed for the data generation. It is used only if Use Seed is enabled.
Select Distribution
The synthetic values are generated from the selected distribution: Uniform, Gaussian, or Gamma.
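The binning of numeric dependency columns and the resulting dependency groups can be sketched as follows. This is an illustrative example only: the table layout (a dict of column lists), the function names, and the handling of edge cases (the maximum value falling into the last bin, a constant column collapsing into a single bin) are assumptions rather than documented behavior.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return the index of the equal-width bin that contains value."""
    if high == low:
        return 0  # constant column: a single bin (assumption)
    width = (high - low) / n_bins
    # The maximum value is placed in the last bin rather than a new one.
    return min(int((value - low) / width), n_bins - 1)


def group_keys(table, dependency_columns, n_bins):
    """Build one dependency-group key per row.

    table: dict mapping column name to a list of values (hypothetical layout).
    Numeric columns are binned first; other columns are used as they are.
    """
    keys_per_column = []
    for name in dependency_columns:
        column = table[name]
        if all(isinstance(v, (int, float)) for v in column):
            low, high = min(column), max(column)
            keys_per_column.append(
                [equal_width_bin(v, low, high, n_bins) for v in column])
        else:
            keys_per_column.append(list(column))
    # Rows that share the same combination of keys form one dependency group.
    return list(zip(*keys_per_column))


table = {"age": [20, 30, 40, 60], "city": ["A", "B", "A", "B"]}
keys = group_keys(table, ["age", "city"], n_bins=2)
```

Because each additional dependency column multiplies the number of possible key combinations, groups shrink quickly, which is why the required number of examples matters.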

Input Ports

The original numeric column and, optionally, the dependency columns.

Output Ports

The synthetic numeric column together with the original row IDs.
The distribution parameters of the original numeric column, reported separately for each dependency group if dependency columns are selected.
The dependency group ID for each row ID in the original data.
