
04_Generate_Features

This is the fourth workflow in the PubChem Big Data story.

We prepare three datasets for the machine learning experiments:

Set 1: Compounds, their chemical structures, and their bioactivity values as reported in PubChem.
Set 2: Compounds, their chemical structures, and their bioactivity values, where missing values were replaced with 0 (i.e., the compounds were assumed to have shown no activity).
Set 3: Unique compounds (duplicates removed).
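As a rough illustration of how Sets 2 and 3 could be derived in Spark, here is a minimal PySpark sketch. The input path and the column names (activity, canonical_smiles) are assumptions for illustration, not the workflow's actual identifiers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set 1: bioactivity values as reported in PubChem (nulls kept).
set1 = spark.read.orc("s3://my-bucket/pubchem/bioactivity.orc")  # hypothetical path

# Set 2: missing bioactivity replaced with 0 (compound assumed inactive).
set2 = set1.fillna(0, subset=["activity"])  # "activity" is a hypothetical column name

# Set 3: unique compounds, deduplicated on the canonical SMILES.
set3 = set1.dropDuplicates(["canonical_smiles"])  # hypothetical column name
```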

Additionally, in step 4, we collect the counts of active and inactive compounds per AID.
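A minimal PySpark sketch of that aggregation, assuming a hypothetical outcome column holding Active/Inactive labels and hypothetical S3 paths:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
bio = spark.read.orc("s3://my-bucket/pubchem/bioactivity.orc")  # hypothetical path

# Count active vs. inactive compounds per assay (AID).
# F.when(...) without otherwise() yields null for non-matching rows,
# and F.count() only counts non-null values.
counts = bio.groupBy("AID").agg(
    F.count(F.when(F.col("outcome") == "Active", 1)).alias("n_active"),
    F.count(F.when(F.col("outcome") == "Inactive", 1)).alias("n_inactive"),
)
counts.write.mode("overwrite").csv("s3://my-bucket/pubchem/aid_counts")  # hypothetical path
```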

The AWS Authentication component, the Paths to Livy and S3 component, and the Create Spark Context (Livy) node require configuration.

This workflow shows how to preprocess chemical structures using the KNIME Workflow Executor for Apache Spark. For more information see the workflow metadata (View -> Description).

Required extensions: KNIME Extension for Apache Spark, KNIME Workflow Executor for Apache Spark (Preview), KNIME Extension for Big Data File Formats, KNIME Base Chemistry Types & Nodes, RDKit Nodes Feature.

Setup: Connect to AWS and create the big data environment.
Step 1. Standardize the chemical structures.
Step 2. Calculate chemical fingerprints (Morgan2) and collect statistics.
Step 3. Collect SMILES of the duplicates, removing duplicates via canonical SMILES.
Step 4. Collect the counts of active and inactive CIDs per AID.

Key nodes: AWS Authentication, Amazon S3 Connector, Create Spark Context (Livy), ORC to Spark, Compute Features, Generate Canonical SMILES, Standardize the bioactivity values, Spark GroupBy, Spark SQL Query, Spark Column Filter, Spark to ORC, CSV Writer, Destroy Spark Context.
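The per-structure work of Steps 1 and 2 (canonical SMILES plus a radius-2 Morgan fingerprint) can be sketched with plain RDKit; the function name and the fingerprint length below are assumptions for illustration:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def featurize(smiles: str):
    """Standardize a structure to canonical SMILES and compute a Morgan2 fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparsable structure
    canonical = Chem.MolToSmiles(mol, canonical=True)
    # Radius-2 Morgan (ECFP4-like) fingerprint; 1024 bits is an assumed size.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    return canonical, fp.ToBitString()
```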
