
01_Fetch_BioAssays

This is the first workflow in the PubChem Big Data story.

In the top part of the workflow, we download the assay data from the PubChem database using its API and upload it to a specified S3 bucket on AWS, one file per assay/experiment (AID).
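As a rough illustration of that download step, here is a minimal Python sketch that pulls one assay's data table per AID through PubChem's PUG REST interface. The endpoint pattern follows the public PUG REST conventions, and the output directory is a placeholder, not a value taken from the workflow.

import os
import requests

# PubChem's PUG REST API serves an assay's data table as CSV, addressed by AID.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/CSV"

def fetch_assay(aid, out_dir="/tmp/pubchem"):
    """Download the data table of one assay (AID) into its own CSV file."""
    os.makedirs(out_dir, exist_ok=True)
    resp = requests.get(PUG_REST.format(aid=aid), timeout=120)
    resp.raise_for_status()
    path = os.path.join(out_dir, "assay_{}.csv".format(aid))
    with open(path, "wb") as f:
        f.write(resp.content)
    return path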

In the bottom part, we clean up the assay data using the KNIME Extension for Apache Spark and store the cleaned-up files on AWS.
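The cleanup itself is done with KNIME Spark nodes running on Livy. In plain PySpark, the operations named in the workflow annotations (deduplicate on SID/CID/Outcome, remove missing values, add an AID column, write ORC) amount to roughly the sketch below; the column names come from the canvas annotations and may differ from the actual files.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean_bioassays").getOrCreate()

def clean_assay(csv_path, aid, orc_path):
    """Apply the cleanup steps from the workflow annotations to one fetched file."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df = (df
          .dropDuplicates(["SID", "CID", "Outcome"])  # find duplicated requests
          .dropna()                                   # rm missing
          .withColumn("AID", F.lit(aid)))             # add AID column to each table
    df.write.mode("overwrite").orc(orc_path)          # store in the ORC format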

The AWS Authentication component, the Paths to Livy and S3 component, and the Create Spark Context (Livy) node require configuration.
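For orientation, the Create Spark Context (Livy) node ultimately opens a session against an Apache Livy endpoint, which is what the REST call below does. The Livy URL here is a placeholder you would replace with the value configured in the Paths to Livy and S3 component.

import requests

LIVY_URL = "http://livy.example.com:8998"  # placeholder; set via Paths to Livy and S3

# Open a Spark session on the cluster through Livy's REST API.
resp = requests.post(LIVY_URL + "/sessions", json={"kind": "spark"})
resp.raise_for_status()
print("Livy session id:", resp.json()["id"])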

Workflow annotations:

This workflow demonstrates how to fetch bioactivity data from the PubChem database via its REST API, preprocess the data using Apache Spark, and transfer it to the cloud on Amazon S3. For more information see the workflow metadata (View -> Description). Required extensions: KNIME Extension for Apache Spark, and KNIME Workflow Executor for Apache Spark (Preview).

Steps:
1. Connect to AWS and create the Big Data Environment.
2. GET the IDs of the experiments with type "Screening" (1562 assays, with 1 to 642275 molecules each; 480 assays have > 100k molecules).
3. Prioritize the experiments to work with.
4. Download the data for these experiments using the PubChem API into a temporary folder, with a delay between requests.
5. Upload the temporary files to AWS S3.
6. Clean up each fetched file using the KNIME Workflow Executor for Apache Spark: find duplicated SID & CID & Outcome requests, remove missing values, add an AID column to each table, and store the results in the ORC format.
7. Read the remote files and collect statistics.
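The upload steps are handled in KNIME by the Amazon S3 Connector and Transfer Files nodes. In script form, the equivalent boto3 sketch looks like this; bucket name and key prefix are placeholders.

import os
import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS credential chain

def upload_folder(local_dir, bucket, prefix):
    """Copy every file in a local folder to s3://bucket/prefix/."""
    for name in sorted(os.listdir(local_dir)):
        s3.upload_file(os.path.join(local_dir, name), bucket, prefix + "/" + name)

upload_folder("/tmp/pubchem", "my-bioassay-bucket", "pubchem/raw")  # placeholders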

Nodes

AWS Authentication
Row Filter
GroupBy
Table Row To Variable Loop Start
Variable Loop End
String Manipulation
CSV Writer
List Files/Folders
Timer Info
Path to String (Variable)
Merge Variables
Transfer Files
Amazon S3 Connector
Create Spark Context (Livy)
Destroy Spark Context
CSV to Spark
Spark to CSV
Spark Column Filter
Spark Row Filter
Spark DataFrame Java Snippet
Get IDs of Experiments
Extract Counts for Experiments
Fetch Experiment Details from PubChem
Generate Path
Paths to Livy and S3

Extensions

KNIME Extension for Apache Spark
KNIME Workflow Executor for Apache Spark (Preview)
