02 One Hot Encoding

This workflow is part of a collection of exercise and solution materials used at a hands-on workshop held at the German Conference for Bioinformatics (GCB-2020). The workshop was titled "Binding Preference Prediction using KNIME Analytics Platform and its Keras Deep Learning Integration".

Exercise 2: One-hot Encoding a DNA/RNA sequence

One-hot encoding table
======================
A -> 1,0,0,0
C -> 0,1,0,0
G -> 0,0,1,0
T -> 0,0,0,1
N -> 0.25,0.25,0.25,0.25

1. Start from the Table Reader node, which reads the data exported in Exercise 1. The output table contains DNA sequences and whether or not each one is a binding site.
2. Using a String Replacer node, replace all occurrences of the nucleotide “A” in the Sequence column with the text “1,0,0,0,”. Do not forget the last comma.
3. Repeat the same process for the other nucleotides (C, G, T and N), using the corresponding replacements listed in the table above. At the end you should have a series of five String Replacer nodes connected sequentially.
4. Use a Cell Splitter node to split the output of the last String Replacer node, using “,” as the delimiter. Check the Remove input column option and leave the other settings unchanged.
5. Use a Table Writer node to write the one-hot encoded data table. Save the file where it can be easily located, using the filename “OneHot_Encoded_Sequences.table”.

The data used in this workflow are from the following publication:
Xiaoyong Pan, Peter Rijnbeek, Junchi Yan, Hong-Bin Shen. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics, 2018, 19:511.
Specifically: https://github.com/xypan1232/iDeepS/tree/master/datasets/clip
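
For readers who want to check the encoding outside KNIME, the replace-then-split steps of the exercise can be sketched in plain Python. This is only an illustration and not part of the workflow: the helper name one_hot_encode and the example sequence are hypothetical, and dropping the empty piece left by the trailing comma is a choice made here, not a documented Cell Splitter behavior.

```python
# Replacement table from the exercise. Each value ends with a comma,
# mirroring the replacement text entered in the String Replacer nodes.
ONE_HOT = {
    "A": "1,0,0,0,",
    "C": "0,1,0,0,",
    "G": "0,0,1,0,",
    "T": "0,0,0,1,",
    "N": "0.25,0.25,0.25,0.25,",
}


def one_hot_encode(sequence):
    """Hypothetical helper: encode a sequence via chained string replacements."""
    encoded = sequence
    for nucleotide, replacement in ONE_HOT.items():
        # One pass per nucleotide, analogous to one String Replacer node each.
        # The replacement text contains only digits, dots and commas, so a
        # later pass never matches text produced by an earlier one.
        encoded = encoded.replace(nucleotide, replacement)
    # Split on "," as in the Cell Splitter step; the trailing comma yields an
    # empty final piece, which this sketch discards.
    return [float(value) for value in encoded.split(",") if value]


if __name__ == "__main__":
    # Made-up example sequence, not taken from the workshop data.
    print(one_hot_encode("ACGTN"))
```

A sequence of length n produces 4 × n values, which is why the split table gets four columns per nucleotide.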
