Chemistry_Exercise

01_Chemistry_basics

This workflow demonstrates basic cheminformatics functionality within KNIME Analytics Platform:
Reading and writing various chemistry data formats, canonicalization of chemical structures, duplicate filtering, descriptor calculation, and interactive filtering on multiple properties.
Data sets were collected from ChEMBLdb. Each set corresponds to a publication in which lipophilicity was determined experimentally.




Chemistry_Manipulation_and_Visualization

Build a workflow that reads several input files, preprocesses the data and chemical structures, displays the data in an interactive view, and writes several output files.

1. Read data from multiple files using the corresponding Reader nodes. Find them in Node repository >> IO >> Read.
2. Customize the data by adding column names and removing redundant columns.
3. Generate canonical SMILES and remove duplicates.
4. Compute descriptors and use the Parallel Coordinates Plot to filter data interactively on multiple properties. (Make sure to keep the selection from the view.)
5. Finally, save the data to TABLE, Excel, and SDF files.

Required extensions: RDKit KNIME Integration, Chemistry Base Types and Nodes

Step 1. Read data from different sources

1. Drag and drop the File Reader (Complex Format) node from the node repository and point it to the inhouse_CHEMBL3301363.csv file. Change the column type of the column with SMILES strings to "smiles" by clicking on the column in the preview.
2. Read the rest of the data by dragging the files from the "data" folder: public_AID_686912.csv; public_AID_686912.sdf; new_inhouse.xlsx

Hint: Remember to extract the Molecule Name for the SD file.

Step 2. Preprocess and customize the data

1. Make sure that each data table contains the columns "Molecule_ID" and "LogD" (each with the same data type in every table).
2. The public data table is missing chemical structures. Add them from the SDF file using the Joiner node on the IDs/Names.
3. Generate an RDKit molecule for each of the tables. Make sure to generate coordinates, perform partial sanitization, and assign the same column name.
4. Collect all the data in a single table and filter out any redundant columns.

Hint: Remember to convert the SMILES strings to the SMILES type with the Molecule Type Cast node where necessary.

Step 3. Remove duplicates

1. Use the RDKit nodes to strip the salts and to generate canonical SMILES.
2. Remove duplicates based on the canonical SMILES.

Step 4. Compute descriptors and filter data table on multiple properties

1. Compute physchem properties using the RDKit Descriptor Calculation node.
2. Color the data rows based on the values in the "Tag" column. Connect the node to the RDKit From Molecule node.
3. Compute the images of the molecules from the salt-stripped molecule using the RDKit Molecule to SVG node and display them with the Tile View node.
4. Configure the Parallel Coordinates Plot to display the experimental LogD and the computed physchem properties.
5. Use the GroupBy node to compute the mean and SD of LogD and the number of unique molecules for each data set. Connect it to the Table View to display the results.
6. Select the nodes after the RDKit Molecule to SVG node. Right click on the selection > "Create Component". Give the component a descriptive name, e.g. "Visualize molecules with experimental data".
7. Execute the component and explore the interactive view. Check its layout. Do you like it? If not, Ctrl/Cmd + double click on the component to open its content. Adjust the layout by clicking on the corresponding icon in the toolbar.
8. Explore the interactive view again. In the view, select the compounds you are interested in and close the view.
9. Add an output port to the component via Right Click > Component > Setup.

Hint: To enable interactivity in views following the GroupBy node, you need to enable hiliting in the GroupBy node.

Step 5. Save

1. Write the resulting table to an SDF file and to a table file. Hint: use the corresponding writer nodes.
2. Save the images of the molecules along with some properties in an Excel table. Make sure to filter out columns of SD and mol type (Excel tables don't like those).
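The table logic of Steps 2 and 3 (join structures on the ID, tag each data set, collect everything in one table, drop duplicates by canonical SMILES) can be sketched outside KNIME in plain Python. This is only an illustration and not the workflow's implementation: all rows are made-up toy values, the helper names (join_structures, tag, remove_duplicates) and the column name "canonical_smiles" are hypothetical, and the SMILES are assumed to be already canonical (in the workflow this is done by the RDKit nodes).

```python
# Toy rows standing in for the tables read in Step 1 (all values are made up).
inhouse = [
    {"Molecule_ID": "A1", "LogD": 1.2, "canonical_smiles": "CCO"},
    {"Molecule_ID": "A2", "LogD": 2.5, "canonical_smiles": "c1ccccc1"},
]
# The public table has no structures; they live in a separate ID -> SMILES map,
# standing in for the SDF file.
public_data = [
    {"Molecule_ID": "P1", "LogD": 1.3},
    {"Molecule_ID": "P2", "LogD": 0.4},
]
public_structures = {"P1": "CCO", "P2": "CC(=O)O"}


def join_structures(rows, structures):
    """Inner join on Molecule_ID, analogous to joining on the IDs/Names."""
    return [
        {**row, "canonical_smiles": structures[row["Molecule_ID"]]}
        for row in rows
        if row["Molecule_ID"] in structures
    ]


def tag(rows, label):
    """Add a constant "Tag" column naming the data set each row came from."""
    return [{**row, "Tag": label} for row in rows]


def remove_duplicates(rows, key="canonical_smiles"):
    """Keep the first row seen for each value of the key column."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


combined = tag(inhouse, "inhouse") + tag(
    join_structures(public_data, public_structures), "public"
)
unique_rows = remove_duplicates(combined)
print([row["Molecule_ID"] for row in unique_rows])
```

Keeping the first occurrence is a design choice of this sketch; which row survives depends on the order in which the tables are concatenated.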
Nodes shown on the workflow canvas: Constant Value Column (x3, labeled "add Tag"), Column Rename (x3, two labeled "Molecule_ID LogD"), Tile View, Table View, Parallel Coordinates Plot, Row Filter, Column Resorter, Sorter. Further labels: "Keep Selected", "Node 563".
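The per-data-set aggregation from Step 4 (mean and SD of LogD plus the number of unique molecules per data set) can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the rows are made-up toy values, the function name summarize and the column name "canonical_smiles" are hypothetical, uniqueness is counted on the canonical SMILES, and the SD is computed as the sample standard deviation, which may or may not match the setting chosen in the GroupBy node.

```python
from collections import defaultdict
from statistics import mean, stdev

# Toy rows (made up); "Tag" names the data set each row came from.
rows = [
    {"Tag": "inhouse", "LogD": 1.0, "canonical_smiles": "CCO"},
    {"Tag": "inhouse", "LogD": 3.0, "canonical_smiles": "c1ccccc1"},
    {"Tag": "public", "LogD": 0.5, "canonical_smiles": "CC(=O)O"},
    {"Tag": "public", "LogD": 1.5, "canonical_smiles": "CCN"},
    {"Tag": "public", "LogD": 2.5, "canonical_smiles": "CCN"},
]


def summarize(rows):
    """Group rows by "Tag" and aggregate LogD and the unique structure count."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Tag"]].append(row)

    summary = {}
    for tag, members in groups.items():
        logd = [m["LogD"] for m in members]
        summary[tag] = {
            "mean_LogD": mean(logd),
            # The sample SD needs at least two values; report None otherwise.
            "sd_LogD": stdev(logd) if len(logd) > 1 else None,
            "unique_molecules": len({m["canonical_smiles"] for m in members}),
        }
    return summary


summary = summarize(rows)
for tag, stats in summary.items():
    print(tag, stats)
```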
