auto synthetic data generator

This node uses CTGAN to generate synthetic data. CTGAN is a collection of deep-learning-based synthetic data generators for single-table data that learn from real data and generate synthetic data with high fidelity. With ML tools such as CTGAN, you feed real data into the software; it learns the patterns in that data and outputs new data that matches those patterns.

For more about this technology, see the paper 'Modeling Tabular Data using Conditional GAN' at https://arxiv.org/abs/1907.00503 and the 'sdv' site: https://sdv.dev/SDV/user_guides/index.html

Synthetic data is generated for all columns of the table, whether numeric or categorical.
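The learn-then-sample workflow described above can be sketched roughly as follows. This is a minimal illustration that calls the open-source 'ctgan' package directly; it is not the component's internal code, which relies on 'sdv' and is not shown on this page. The function name generate_synthetic and the rule for picking categorical columns (every non-numeric column) are assumptions made for the example.

```python
def generate_synthetic(real_df, n_rows, epochs=300, batch_size=500):
    """Fit CTGAN on a pandas DataFrame and sample n_rows synthetic rows.

    Hypothetical helper for illustration; not the component's actual code.
    """
    # Imported lazily so this sketch can be loaded without ctgan installed.
    from ctgan import CTGAN

    # Assumption: every non-numeric column is treated as categorical.
    discrete_columns = real_df.select_dtypes(exclude="number").columns.tolist()

    model = CTGAN(epochs=epochs, batch_size=batch_size)
    model.fit(real_df, discrete_columns=discrete_columns)

    # Return the synthetic rows and the fitted model, which can be sampled again.
    return model.sample(n_rows), model
```

Calling generate_synthetic(df, 1000) on a pandas DataFrame df would return 1000 synthetic rows together with the fitted model.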

A set of Python libraries, including 'sdv', must be installed. If your KNIME installation is configured to access packages in the 'base' Anaconda environment, all necessary packages are installed automatically on first execution of the component. The principal package among these is PyTorch.
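To see in advance whether the environment already has the required packages, a check along these lines may help. It is a sketch only: the helper missing_packages and the list of import names ('sdv', plus 'torch' for PyTorch) are assumptions, and the component may install additional packages.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; PyTorch is imported as "torch".
REQUIRED = ["sdv", "torch"]

# Prints the packages that still need to be installed (empty list if none).
print(missing_packages(REQUIRED))
```

Run this with the same Python interpreter that KNIME is configured to use, otherwise the result describes a different environment.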

One of the outputs contains evaluation metrics that indicate how close the synthetic data is to the real data.
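This page does not say which metrics the component reports. As an illustration of what such a closeness score can look like, the sketch below computes the complement of the total variation distance between the category frequencies of a real and a synthetic column: 1.0 means identical distributions, 0.0 means no overlap. The function name tv_complement is chosen for this example, and the component's own metrics may differ.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """Return 1 - total variation distance between two categorical columns."""
    if not real_values or not synthetic_values:
        raise ValueError("both columns must contain at least one value")

    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)

    # Compare relative frequencies over every category seen in either column.
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - distance


real = ["a", "a", "b", "b"]
synthetic = ["a", "a", "a", "b"]
print(tv_complement(real, synthetic))  # 0.75
```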

Options

Output sample size: Number of synthetic rows to generate.
No of epochs: Number of training epochs (passes over the data) the model performs to optimize its parameters.
Batch Size: Number of samples used in each iteration step. Must be a multiple of 10.
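The constraints on these three options can be checked before running the component, for example with a small helper like the one below. It is a sketch: validate_options is a hypothetical name, and the requirement that sample size and epochs be at least 1 is an assumption; only the multiple-of-10 rule for the batch size comes from the description above.

```python
def validate_options(sample_size, epochs, batch_size):
    """Return a list of problems with the given option values (empty if none)."""
    errors = []
    # Assumption: at least one row and one epoch are needed.
    if sample_size < 1:
        errors.append("Output sample size must be at least 1")
    if epochs < 1:
        errors.append("No of epochs must be at least 1")
    # Stated in the option description: batch size must be a multiple of 10.
    if batch_size < 1 or batch_size % 10 != 0:
        errors.append("Batch Size must be a positive multiple of 10")
    return errors


print(validate_options(sample_size=1000, epochs=300, batch_size=500))  # []
print(validate_options(sample_size=1000, epochs=300, batch_size=505))
```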

Input Ports

Input KNIME table, with or without missing values

Output Ports

Synthetic data
The model created by the generator, which can be used again to generate synthetic data
Evaluation metrics indicating how close the synthetic data is to the real data

Nodes

Integer Configuration (3 ×)
Component Input (1 ×)
Component Output (1 ×)
Conda Environment Propagation (1 ×)
Merge Variables (1 ×)
