
02_Fetch_And_Transform_PubChem_Data

Fetch and Transform PubChem Data

This workflow prepares a data set for the Data Chefs Battle: Chemistry vs Biology using the Local Big Data Environment. It collects the results of biological experiments from the PubChem database through its API and cleans them up.
In the top part, the results of biological experiments are collected from the PubChem database using its API. In the middle part, the data are preprocessed in the Local Big Data Environment. In the bottom part, the data are backed up.
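The selection and preprocessing steps described above can be sketched in plain Python. This is only an illustration of the logic, not the workflow's actual Spark implementation: the function names, the tuple layout `(compound_id, experiment_id, outcome)`, the sample values, and the choice of inclusive bounds for the 200k - 350k range are all assumptions.

```python
from collections import defaultdict


def select_experiments(counts, low=200_000, high=350_000):
    """Keep experiment IDs whose tested-compound count lies in [low, high].

    The inclusive bounds are an assumption; experiments without a count
    are skipped.
    """
    return [aid for aid, n in counts.items()
            if n is not None and low <= n <= high]


def pivot_results(rows):
    """Drop rows with a missing value, then pivot to one entry per compound.

    rows: iterable of (compound_id, experiment_id, outcome) tuples
    (a hypothetical long format).
    """
    table = defaultdict(dict)
    for compound_id, experiment_id, outcome in rows:
        if compound_id is None or experiment_id is None or outcome is None:
            continue  # remove missing values before pivoting
        table[compound_id][experiment_id] = outcome
    return dict(table)


# Made-up sample data, for illustration only.
counts = {101: 150_000, 102: 250_000, 103: 349_999, 104: None}
selected = select_experiments(counts)

rows = [
    (1, 102, "active"),
    (1, 103, "inactive"),
    (2, 102, None),        # missing outcome, dropped
    (2, 103, "active"),
]
pivoted = pivot_results(r for r in rows if r[1] in selected)

print(selected)
print(pivoted)
```

In the workflow itself these steps run on Spark nodes, so the in-memory dictionaries here stand in for Spark data frames.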

Workflow annotations:

GET IDs of the relevant experiments. Here we collect experiments with type "Screening".
Extract count of tested molecules for each experiment. Here we request counts for the first 150 experiments.
OPTIONAL: backup data.
Download results of the experiments from PubChem. Here we collect experiments in which 200k - 350k compounds were tested.
Process Data in Local Big Data Environment. Step I: Remove missing values and pivot. Step II: Clean up pivoting results.
Save data to the Local Big Data Environment.

For more information see the workflow metadata. Find it here: View -> Description.

Required extensions: KNIME Expressions, KNIME Extension for Apache Spark, KNIME Extension for Big Data File Formats, KNIME Extension for Local Big Data Environments

Output labels: Experiments with counts, 200k - 350k results, Pivoted data

Nodes: Get IDs of Experiments, Extract Counts for Experiments, Fetch Experiment Details from PubChem, Row Filter, Column Filter, Column Rename, Table Writer, ORC Writer (used twice), Create Local Big Data Environment, ORC to Spark, Preprocess in Spark Step I, Preprocess in Spark Step II, Spark Column Rename (Regex), Spark Column Filter, Spark to ORC, Spark to Table
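As a rough companion to the first two annotations, the requests could look like the sketch below. The endpoint paths follow the public PubChem PUG REST URL scheme as I understand it; whether the workflow issues exactly these requests is an assumption, and the helper names are hypothetical. The block only builds URLs and does not contact the service.

```python
# Base address of the PubChem PUG REST service.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def aids_by_type_url(assay_type="screening"):
    """URL listing the IDs (AIDs) of assays of the given type."""
    return f"{BASE}/assay/type/{assay_type}/aids/JSON"


def summary_url(aids):
    """URL for the summaries of one or more assays, which include counts."""
    joined = ",".join(str(aid) for aid in aids)
    return f"{BASE}/assay/aid/{joined}/summary/JSON"


def first_n(aids, n=150):
    """Limit the request to the first n experiments, as the annotation does."""
    return list(aids)[:n]


# Made-up assay IDs, for illustration only.
example_aids = first_n(range(1000, 2000), n=3)
print(aids_by_type_url())
print(summary_url(example_aids))
```

Batching several IDs into one summary request keeps the number of calls low, but the batch size the service accepts should be checked against the PUG REST documentation.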
