01_Data_Importing_and_Preprocessing_solution

01 Data Importing and Preprocessing

This workflow is part of a collection of exercise/solution materials used at a hands-on workshop held at the German Conference for Bioinformatics (GCB-2020). The title of the workshop is "Binding Preference Prediction using KNIME Analytics Platform and its Keras Deep Learning Integration".

Exercise 1: Data Importing and Preprocessing

1. Configure the provided Simple File Reader node to read the file data/iDeepS_PARCLIP/training/ELAVL1A.fa.gz as follows:
   Column delimiter = “\n”
   Row delimiter = “>”
   Has column header = unchecked
2. Add a second Simple File Reader node to read the file data/iDeepS_PARCLIP/test/ELAVL1A.fa.gz with the same settings as the node from step 1. Hint: you can copy and paste the node you configured in step 1 and change only the file path. Change the node annotation to "Test Data" by double-clicking on it.
3. Use a Concatenate node to append the output of the second Simple File Reader node to the output of the first one.
4. Use a Column Combiner node to concatenate the DNA sequences from Column1 and Column2. In the configuration dialog of the node, set the “Replace Delimiter” option to an empty string, set “Name of appended column” to “Sequence”, and check the “remove included columns” option.
5. Use the Cell Splitter node to extract the class information (whether a sequence is binding or not) by splitting Column0 using a colon as the delimiter.
6. Use the Column Rename node to rename the column containing the class information to “IsBindingSite”.
7. Use the Column Filter node to keep only the Sequence and IsBindingSite columns.
8. Create a new column named isNotBindingSite using a Math Formula node. (Hint: use the expression abs(1 - $IsBindingSite$).)
9. Use a Table Writer node to write the processed table. Save the file where it can be easily located for later use.

The data used in this workflow are from the following publication:
Xiaoyong Pan, Peter Rijnbeek, Junchi Yan, Hong-Bin Shen. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics, 2018, 19:511.
Specifically: https://github.com/xypan1232/iDeepS/tree/master/datasets/clip

Training Data: 30,000 rows
Test Data: 10,000 rows
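
For readers who want to check their KNIME result outside the platform, the preprocessing in Exercise 1 can be roughly sketched in plain Python. This is only an illustration, not part of the workshop workflow: the function names (parse_records, preprocess, read_fasta_gz) are hypothetical, and the sample header layout, with the class label following the last colon, is an assumption inferred from step 5 rather than taken from the actual files.

```python
import gzip


def parse_records(text):
    """Split FASTA-like text into (header, sequence) pairs.

    Mirrors the reader settings from step 1: records are delimited by ">",
    fields within a record by newlines. All lines after the header are
    joined with no separator, as in step 4.
    """
    records = []
    for chunk in text.split(">"):
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if not lines:
            continue  # skip the empty chunk before the first ">"
        header, sequence = lines[0], "".join(lines[1:])
        records.append((header, sequence))
    return records


def preprocess(records):
    """Turn (header, sequence) pairs into rows with the two class columns."""
    rows = []
    for header, sequence in records:
        # Steps 5-6: ASSUMPTION: the class label follows the last colon.
        is_binding = int(header.rsplit(":", 1)[1])
        rows.append({
            "Sequence": sequence,
            "IsBindingSite": is_binding,
            # Step 8: same arithmetic as abs(1 - $IsBindingSite$).
            "isNotBindingSite": abs(1 - is_binding),
        })
    return rows


def read_fasta_gz(path):
    """Read a gzip-compressed text file and parse it into records."""
    with gzip.open(path, "rt") as handle:
        return parse_records(handle.read())


# Hypothetical two-record sample; the real header layout may differ.
sample = (
    ">chr1,+,100,200; class:1\nACGU\nGGCA\n"
    ">chr2,-,300,400; class:0\nUUAC\nCCGA\n"
)
rows = preprocess(parse_records(sample))
```

To mirror step 3 on the real data, the two files named in steps 1 and 2 could be read with read_fasta_gz and the resulting lists added together before calling preprocess. Writing the table (step 9) is left out here, since the KNIME table format is specific to the platform.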
