Synthetic Data Generator (Numeric)

This component replaces the values of a numeric column with synthetic values sampled from a selected distribution (Uniform, Gaussian, or Gamma), whose parameters are estimated from the original column. The parameters can also be estimated separately for different subsets of the data (dependency groups), which are defined by one or more dependency columns, so that each group gets its own distribution. Each synthetic value can be matched to its row in the original data by the row ID.
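The idea of fitting parameters per dependency group and then sampling can be sketched in a few lines of Python. This is a minimal illustration, not the component's actual implementation: the function names (`fit_params`, `sample`, `generate`), the `(row_id, group, value)` row layout, and the estimators (min/max for Uniform, mean and standard deviation for Gaussian, method of moments for Gamma) are all assumptions made for the example.

```python
import random
import statistics
from collections import defaultdict


def fit_params(values, distribution):
    """Estimate distribution parameters from original values (assumed estimators)."""
    if distribution == "uniform":
        return {"low": min(values), "high": max(values)}
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    if distribution == "gaussian":
        return {"mean": mean, "std": var ** 0.5}
    if distribution == "gamma":
        # Method of moments; needs a positive mean and a non-zero variance.
        return {"shape": mean * mean / var, "scale": var / mean}
    raise ValueError(f"unknown distribution: {distribution}")


def sample(params, distribution, rng):
    """Draw one synthetic value from the fitted distribution."""
    if distribution == "uniform":
        return rng.uniform(params["low"], params["high"])
    if distribution == "gaussian":
        return rng.gauss(params["mean"], params["std"])
    if distribution == "gamma":
        return rng.gammavariate(params["shape"], params["scale"])
    raise ValueError(f"unknown distribution: {distribution}")


def generate(rows, distribution="gaussian", seed=None):
    """rows: list of (row_id, group, value). Returns {row_id: synthetic value}."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    by_group = defaultdict(list)
    for _, group, value in rows:
        by_group[group].append(value)
    # One parameter set per dependency group.
    params = {g: fit_params(v, distribution) for g, v in by_group.items()}
    return {rid: sample(params[g], distribution, rng) for rid, g, _ in rows}
```

Because the result is keyed by row ID, each synthetic value can be matched back to the original row, and passing the same `seed` yields the same output.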

In addition, it is possible to exclude dependency groups with too few examples from the data generation, and add random noise to the synthetic data.
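Both steps can be illustrated with a short Python sketch. It is only an approximation of the behavior described here: the helper names, the `(row_id, group, value)` row layout, and the choice to pick exactly `round(fraction * n)` rows for noise (rather than, say, selecting each row independently) are assumptions. Noise is taken to mean a uniform draw between the column's minimum and maximum.

```python
import random
from collections import Counter


def filter_small_groups(rows, min_examples):
    """Keep only rows whose dependency group has at least min_examples rows."""
    counts = Counter(group for _, group, _ in rows)
    return [row for row in rows if counts[row[1]] >= min_examples]


def add_noise(synthetic, fraction, low, high, seed=None):
    """Replace a fraction of the generated values with uniform draws from [low, high].

    synthetic: {row_id: generated value}; low/high: assumed column domain.
    """
    rng = random.Random(seed)
    row_ids = sorted(synthetic)
    n_noisy = round(fraction * len(row_ids))
    noisy = set(rng.sample(row_ids, n_noisy))
    return {
        rid: rng.uniform(low, high) if rid in noisy else value
        for rid, value in synthetic.items()
    }
```

If every group is smaller than `min_examples`, `filter_small_groups` returns an empty list, which mirrors the empty-table case of the component.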

Synthetic data generation is used, for example, when the original data is confidential (anonymization) or difficult or expensive to collect.

Options

Use Seed: Use a seed to make the output reproducible
Select Dependency Columns: These columns determine the subsets for which the distribution parameters for the data generation will be calculated separately. Numeric columns will be binned before determining the subsets.
Select Numeric Column to Generate: Values in this column will be replaced by synthetic values
Select Noise: The fraction of rows to which noise will be added. Adding noise means replacing the generated value with a random value within the column domain.
Select Required Number of Examples: If a dependency group has fewer than the selected number of examples, the rows in this group will be excluded from the synthetic data generation. This can happen if you select many dependency columns, or a dependency column with many unique values. Note that if all dependency groups are smaller than the required size, the component returns an empty table in its top output.
Select Number of Bins for Numeric Dependency Columns: If a numeric dependency column was selected, it will be categorized by creating the selected number of bins of equal width. If no numeric dependency column is selected, this option has no effect.
Select Sample Size: The number of generated rows in the component's output
Select Seed: Select the seed for the data generation if the seed is enabled
Select Distribution: The distribution from which the synthetic values are generated: Uniform, Gaussian, or Gamma
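The equal-width binning of numeric dependency columns can be sketched as follows. This is a hypothetical helper for illustration only: the function name and the edge convention (the maximum value falls into the last bin) are assumptions, and the component may handle bin boundaries differently.

```python
def equal_width_bin(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so they share a single bin.
        return [0] * len(values)
    # Clamp so the maximum value lands in the last bin rather than a new one.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# The bin index can then act as a categorical value when forming dependency groups.
bins = equal_width_bin([0.0, 2.5, 5.0, 7.5, 10.0], 4)
print(bins)  # [0, 1, 2, 3, 3]
```

More bins produce more and smaller dependency groups, which makes it more likely that some groups fall below the required number of examples.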

Input Ports

The original numeric column and, optionally, the dependency columns

Output Ports

The synthetic numeric column together with the original row IDs
The distribution parameters of the original numeric column, possibly separately for the different dependency groups
The dependency group ID of each row ID in the original data
