Icon

01_​Exercise

Data Manipulation

Solution to "Manipulation Data" exercise for basic Life Science User Training
- Concatenate data from two different sources
- Modify String values
- Join data from multiple tables
- Remove duplicates in the data

URL: Advaced ETL Functionalities and Machine Learning based Pre-Processing https://youtu.be/IEAsUTN8q68
URL: Joining Data Tables https://youtu.be/6BigLM6vbhs
URL: Joining Data Tables - Inner Join https://youtu.be/9uV99ByH-TA
URL: Concatenate https://youtu.be/VzH2lHbDAg0
URL: Concatenate Node https://youtu.be/ku6SyEZ1Pv8
URL: Data Manipulation: Numbers, Strings, and Rules https://youtu.be/mJrBXmLQ4ko
URL: Slides for KNIME Analytics Platform Course (L1-LS) https://www.knime.com/form/material-download-registration

19. Select all nodes in this box, right-click on one of them and create a component. Explore and selectsome data for export. REOS rules-- "MW" if AMW is smaller than 100 or greater than 700-- "Complexity" if NumHeavyAtoms is smaller than 5 or greater than 50 or NumRotatableBonds is greater or equal to 12-- "HBond" if NumHBD is greater than 5 or NumHBA is greater than 10-- "logP" if SlogP is smaller -5 or greater than 7.5-- "Pass" for all other cases 3. Use the SDF Reader to read in themolecules and their properties. Set knime://knime.workflow/../data/malariahts_molecules.sdf as path. Extract the Mol Block. Rememberto extract the properties in the PropertyHandling tab. 4. Use the Table Reader to import themalariahts_molecules_feature.table fromthe data folder. Use a workflow relativepath to navigate to the table. 2. Use the Excel Reader to import themalariahts_experiment_no-hits.xlsx fromthe data folder. TIP: just drag and drop thefile from the KNIME Explorer on the leftside 5. Filter out the rows with missingvalues in the Pf3D7_pEC50 columnusing the Missing Value node. Chose"Remove Row" in the Column SettingsTab. 6. Concatenate the data from theMissing Value node and the ExcelReader. Notice the data types of allcolumns. Connect the output of theConcatenate node to the ColumnExpressions node. 7. This node is pre-configured.Just execute it and inspect theoutput table. Look at the Sample column. Is there a inconsistencyyou notice? 8. Use the String Manipulationnode to upper case all sampleIDs. Try to assemble theexpression yourself by selectingthe method "upperCase" anddouble-click on it, and double-clickon the column "Sample" to insert itinto the brackets. In case thatdoesn't work, copy-paste upperCase($Sample$)into the expression field. 9. Use the Joiner node to join thetables from the SDF Reader andthe Table Reader. Add a matchingcriterion and join based on theSample column. Stay with theinner join. Exclude the samplecolumn from the bottom input tablein the Column Selection tab toavoid having it twice. 13. This Joiner node is pre-configured and joins the tablewith the assay data with thetable containing the moleculedata 10. Let's clean up the data tableusing the Table Manipulator node. - exclude the ExactMW column- rename theAverageMolecularWeight columnto AMW- change the data type ofNumRotatableBond andNumHeavyAtoms from String toInteger 11. This node is pre-configuredand will create a new column thisREOS rule tags based on the rulesstated above. 12. Exclude all the rows that areduplicate using the Duplicate RowFIlter node. Detect duplicatesbased on the columns Sampleand Mol Block. 14. Use the Row Filter or RowSplitter node to exclude rows withmissing values in the columnPf3D7_pEC50. The Row Splitterallows you to inspect the excludeddata, the Row Filter doesn't. Makesure to select "Exclude rows byattribute value" 15. Use the Column Filter node toexclude the columnNumSaturatedRings. 1. This node is pre-configured. Justexecute it and inspect the output table. EXERCISETry to follow the instructions in each box. Bold words give you a hint on which node to use. In case you get lost, have a sneak peek to the 02_Solution workflow. Tip: Check and familiarize yourself with the output table of each node, to see how the data changed in comparison to theprevious step and to see how it still deviates from how you want it to look like. 16. This two nodes are pre-configured, justexecute them. Check the output table of the RDKitMolecule to SVG node. The molecular structuresare now rendered to images, so we can usethem in the interactive view. 18. Visualize the data using theScatter Plot node. - select the axes/columns to plot inthe options tab (green vs red)- give a title in the General PlotOptions tab- make sure it interacts with otherview nodes in the View Controlstab under Selection 20. Define locations youwant to export the data to. 17. Assign colours to thePf3D7_pEC50 column using theColor Manager node. Connect itsinput port to the RDKit Molecule toSVG node. 21. Use the Sorter node andconnect it to the Row Filter/Splitternode from step 14. Sort accordingto highest pEC50. 22. Use the Table Row to Variablenode to make the content of thefirst row available as flow variable.Select the InChI Key. 23. Connect the flow variable fromabove with a String Manipulationnode. Join the string from the tablecreator with the flow variablecontaining the InChI Key. Replacethe chemblidURL column. 24. Configure a GET Requestnode to use the URL in thechemlidURL column that weassembled in the prevous step. 25. Parse the output of the GET Requestnode with the XPath node. In theconfiguration dialogue, scroll through theXML Cell Prview, click the blue items andAdd XPath. Extract the - assa_chembl_id- assay_description- standard_type- standard_units- standard_valueWhen adding the XPath, choose MultipleRows Keep SelectedKNIME-native formatmalariahts_experiment_hits.csvcreateREOS rulescorrect column typeexclude "selected" columnconvert the Mol to InChI Keybase URLChEMBLextract ChEMBL ID/ parsing the outputactivity URLChEMBLconstruct activity URL Row Filter Tile View(JavaScript) Column Resorter GroupBy Table View(JavaScript) Table Writer CSV Reader Rule Engine Column Expressions Joiner RDKit From Molecule RDKit Moleculeto SVG Parallel CoordinatesPlot (JavaScript) Color Manager SDF Writer Excel Writer Column Filter Column Renamer RDKit To InChI Table Rowto Variable Table Creator JSON Path Table Creator Table Rowto Variable String Manipulation GET Request 19. Select all nodes in this box, right-click on one of them and create a component. Explore and selectsome data for export. REOS rules-- "MW" if AMW is smaller than 100 or greater than 700-- "Complexity" if NumHeavyAtoms is smaller than 5 or greater than 50 or NumRotatableBonds is greater or equal to 12-- "HBond" if NumHBD is greater than 5 or NumHBA is greater than 10-- "logP" if SlogP is smaller -5 or greater than 7.5-- "Pass" for all other cases 3. Use the SDF Reader to read in themolecules and their properties. Set knime://knime.workflow/../data/malariahts_molecules.sdf as path. Extract the Mol Block. Rememberto extract the properties in the PropertyHandling tab. 4. Use the Table Reader to import themalariahts_molecules_feature.table fromthe data folder. Use a workflow relativepath to navigate to the table. 2. Use the Excel Reader to import themalariahts_experiment_no-hits.xlsx fromthe data folder. TIP: just drag and drop thefile from the KNIME Explorer on the leftside 5. Filter out the rows with missingvalues in the Pf3D7_pEC50 columnusing the Missing Value node. Chose"Remove Row" in the Column SettingsTab. 6. Concatenate the data from theMissing Value node and the ExcelReader. Notice the data types of allcolumns. Connect the output of theConcatenate node to the ColumnExpressions node. 7. This node is pre-configured.Just execute it and inspect theoutput table. Look at the Sample column. Is there a inconsistencyyou notice? 8. Use the String Manipulationnode to upper case all sampleIDs. Try to assemble theexpression yourself by selectingthe method "upperCase" anddouble-click on it, and double-clickon the column "Sample" to insert itinto the brackets. In case thatdoesn't work, copy-paste upperCase($Sample$)into the expression field. 9. Use the Joiner node to join thetables from the SDF Reader andthe Table Reader. Add a matchingcriterion and join based on theSample column. Stay with theinner join. Exclude the samplecolumn from the bottom input tablein the Column Selection tab toavoid having it twice. 13. This Joiner node is pre-configured and joins the tablewith the assay data with thetable containing the moleculedata 10. Let's clean up the data tableusing the Table Manipulator node. - exclude the ExactMW column- rename theAverageMolecularWeight columnto AMW- change the data type ofNumRotatableBond andNumHeavyAtoms from String toInteger 11. This node is pre-configuredand will create a new column thisREOS rule tags based on the rulesstated above. 12. Exclude all the rows that areduplicate using the Duplicate RowFIlter node. Detect duplicatesbased on the columns Sampleand Mol Block. 14. Use the Row Filter or RowSplitter node to exclude rows withmissing values in the columnPf3D7_pEC50. The Row Splitterallows you to inspect the excludeddata, the Row Filter doesn't. Makesure to select "Exclude rows byattribute value" 15. Use the Column Filter node toexclude the columnNumSaturatedRings. 1. This node is pre-configured. Justexecute it and inspect the output table. EXERCISETry to follow the instructions in each box. Bold words give you a hint on which node to use. In case you get lost, have a sneak peek to the 02_Solution workflow. Tip: Check and familiarize yourself with the output table of each node, to see how the data changed in comparison to theprevious step and to see how it still deviates from how you want it to look like. 16. This two nodes are pre-configured, justexecute them. Check the output table of the RDKitMolecule to SVG node. The molecular structuresare now rendered to images, so we can usethem in the interactive view. 18. Visualize the data using theScatter Plot node. - select the axes/columns to plot inthe options tab (green vs red)- give a title in the General PlotOptions tab- make sure it interacts with otherview nodes in the View Controlstab under Selection 20. Define locations youwant to export the data to. 17. Assign colours to thePf3D7_pEC50 column using theColor Manager node. Connect itsinput port to the RDKit Molecule toSVG node. 21. Use the Sorter node andconnect it to the Row Filter/Splitternode from step 14. Sort accordingto highest pEC50. 22. Use the Table Row to Variablenode to make the content of thefirst row available as flow variable.Select the InChI Key. 23. Connect the flow variable fromabove with a String Manipulationnode. Join the string from the tablecreator with the flow variablecontaining the InChI Key. Replacethe chemblidURL column. 24. Configure a GET Requestnode to use the URL in thechemlidURL column that weassembled in the prevous step. 25. Parse the output of the GET Requestnode with the XPath node. In theconfiguration dialogue, scroll through theXML Cell Prview, click the blue items andAdd XPath. Extract the - assa_chembl_id- assay_description- standard_type- standard_units- standard_valueWhen adding the XPath, choose MultipleRows Keep SelectedKNIME-native formatmalariahts_experiment_hits.csvcreateREOS rulescorrect column typeexclude "selected" columnconvert the Mol to InChI Keybase URLChEMBLextract ChEMBL ID/ parsing the outputactivity URLChEMBLconstruct activity URL Row Filter Tile View(JavaScript) Column Resorter GroupBy Table View(JavaScript) Table Writer CSV Reader Rule Engine Column Expressions Joiner RDKit From Molecule RDKit Moleculeto SVG Parallel CoordinatesPlot (JavaScript) Color Manager SDF Writer Excel Writer Column Filter Column Renamer RDKit To InChI Table Rowto Variable Table Creator JSON Path Table Creator Table Rowto Variable String Manipulation GET Request

Nodes

Extensions

Links