
02_Pivot_PubChemData

This is the second workflow in the PubChem Big Data story.

In the top part of the workflow we pivot the assay data using the KNIME Extension for Apache Spark. We save two files: the pivoted matrix and the IDs of the compounds (CIDs) that appear in that matrix.
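The core of this step is a pivot from long-format assay outcomes (CID/AID/Outcome triples) to a CID-vs-AID matrix. The workflow does this in Spark on the full dataset; as a minimal small-scale sketch, the same reshaping can be illustrated with pandas (column names and toy values are assumptions, not the actual PubChem data):

```python
import pandas as pd

# Long-format assay outcomes: one row per (compound, assay) measurement.
long_df = pd.DataFrame({
    "CID": [1, 1, 2, 2, 3],
    "AID": ["A10", "A20", "A10", "A30", "A20"],
    "Outcome": ["Active", "Inactive", "Inactive", "Active", "Active"],
})

# Pivot to a matrix: rows = compounds (CID), columns = assays (AID),
# cells = outcomes. Unmeasured combinations become nulls.
matrix = long_df.pivot(index="CID", columns="AID", values="Outcome")
print(matrix)
```

In the workflow itself this corresponds to the Spark Pivot node, which performs the equivalent operation distributed across the cluster.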

In the (optional) bottom part we extract additional information from the data.

The AWS Authentication component, the Paths to Livy and S3 component, and the Create Spark Context (Livy) node require configuration.

Workflow annotations:

02_Pivot_PubChemData: This workflow demonstrates how to pivot downloaded PubChem data using Apache Spark. For more information see the workflow metadata (View -> Description). Required extensions: KNIME Extension for Apache Spark, KNIME Workflow Executor for Apache Spark (Preview), and KNIME Extension for Big Data File Formats.

Main part: Connect to AWS and create the big data environment; read the individual fetched documents into Spark; pivot the data (CID/AID/Outcome) into a matrix of CIDs vs AIDs filled with outcomes; remove compounds that were measured in less than half of the experiments (rows with a count of nulls less than 238 are kept); save the outputs (the filtered matrix is written as CIDs_vs_AIDs_nulls_filtered.orc, along with the unique CIDs); collect statistics.

Optional part: Extract additional information on the data: 1. The complete matrix of compounds vs assays filled with outcomes. 2. Assay IDs that are missing compound IDs (both missing and non-missing CIDs were reported for 179 assays). Note: the same big data environment could be used for this part as for the main workflow (marked in yellow); we create a new environment here for clarity.
