Fragment_generator_synCor_Ubuntu_finalV1


The workflow can be used to generate novel fragments for FBDD using the "syntax corrected" dual encoder model reported in Bilsland et al (2021). See https://pubs.acs.org/doi/10.1021/acs.jcim.0c01226. A full description of the KNIME workflow, instructions, and requirements is given in Bilsland et al (2022). See: https://pubs.rsc.org/en/content/articlelanding/2022/md/d2md00152g. The Conda environment required for the workflow is available at: https://github.com/abilsland/fragmentEncoder_Knime. Further details on suggested augmentation of data files supplied with the workflow are also given there.

The data folder contains a modified version of the GPUutil.py file by Anders Krogh Mortensen (anderskm). See https://github.com/anderskm/gputil. The MIT license is included in the modified file. The data folder also contains the file BaseFeatures.fdef by Greg Landrum containing base definitions of pharmacophore features. The BSD 3-clause license covering rdkit code redistribution is:

Copyright (c) 2006-2015, Rational Discovery LLC, Greg Landrum, and Julie Penzotti and others
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Initial processing of seed molecules. Molecules should be provided in either a .smi or .smiles file without a header. The workflow data folder contains a default file. On Knime server, users can upload an alternative file. Stereochemistry is stripped and molecules with some common atoms not in the model vocabulary are excluded. The list of excluded atoms is not exhaustive. If you are likely to have exotic input fragments, some additions may be needed. The model vocabulary is: [,n,3,),=,-,],!,O,N,1,s,C,+,E,2,X,Y,c,(,H,4,o,K,S,F,# where "!" and "E" are start and stop characters. Rdkit Morgan and pharmacophore features are generated and the features used by the model are retained. The pharmacophore features are defined in BaseFeatures.fdef by Greg Landrum (https://github.com/rdkit/rdkit/blob/master/Data/BaseFeatures.fdef) in the data folder. The BSD3 clause relating to Rdkit code redistribution is included in the file. The 2-character atoms "Cl" and "Br" are replaced by "X" and "Y", respectively.

If more than one GPU is available, choose which to perform the run on. The choice is passed to all nodes with Keras networks. Also get the data folder path: PSO is performed in a python node with no awareness of the Knime protocol, and server jobs are launched in a runtime directory. To try to preserve portability, the assumed workflow repository is extracted from context properties. The os module is invoked in the python script to find a file in the workflow data folder (blankFileForOSToFind.txt). If alternative versions of the workflow group are saved, change both the name of the file and the reference in the table creator node to avoid ending in the wrong data folder.

Fingerprint branch of the encoder model. Encode fingerprints generated for each smiles.

Smiles branch of dual encoder. Loop through input smiles and one-hot encode according to the model vocabulary. Send to smiles encoder network.
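The atom substitution and one-hot encoding described above can be sketched in plain Python. This is a minimal illustration, not the workflow's own script: the index order of the vocabulary, the absence of padding to a fixed length, and the function names substitute_halogens and one_hot_encode are all assumptions made for the example. Only the vocabulary characters, the "!" and "E" start/stop characters, and the Cl->X, Br->Y substitution come from the annotations.

```python
# Vocabulary as listed in the workflow annotation. The index order used by
# the trained model is NOT stated in the source; this order is an assumption.
VOCAB = ["[", "n", "3", ")", "=", "-", "]", "!", "O", "N", "1", "s", "C", "+",
         "E", "2", "X", "Y", "c", "(", "H", "4", "o", "K", "S", "F", "#"]
CHAR_TO_INDEX = {char: index for index, char in enumerate(VOCAB)}


def substitute_halogens(smiles):
    """Replace the two-character atoms Cl and Br with single characters X and Y."""
    return smiles.replace("Cl", "X").replace("Br", "Y")


def one_hot_encode(smiles):
    """Return one one-hot row per character, wrapped in start (!) and stop (E)."""
    tokens = "!" + substitute_halogens(smiles) + "E"
    rows = []
    for char in tokens:
        if char not in CHAR_TO_INDEX:
            # Mirrors the exclusion of molecules with atoms outside the vocabulary.
            raise ValueError(f"character {char!r} is not in the model vocabulary")
        row = [0] * len(VOCAB)
        row[CHAR_TO_INDEX[char]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("ClCc1ccccc1")  # benzyl chloride, becomes "XCc1ccccc1"
print(len(encoded), len(encoded[0]))  # 12 27
```

A real network input would also need a fixed sequence length; how the workflow pads or truncates is not described in the annotations, so that step is left out here.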
Select PSO swarm configuration with sliders:
- iterations
- N_particles
- bounds
- N_samples
- acceleration
- inertia weight
Merge these variables with the path and gpu data from the above part of the workflow.

Pass the current fingerprint encoding unaltered along this branch, then join to the smiles encoding generated in the branch above. Together, these are input to the latent vector encoder.

Collect outputs of fingerprint and smiles encoding - pass to latent vector encoder. Then perform PSO. It is easier in the PSO script to pass in a single column containing the entire (comma separated) vector and then split this, rather than having all 64 output columns from the latent vector network. So, combine all to a single column then drop the rest. PSO optimises for a range of properties as described in the paper, which are combined to an overall score. If a decoded molecule has a better overall score than the initial seed in any iteration, that molecule is retained for the next iteration up to the maximum number defined by the user in "get pso iterations". Returned smiles are filtered downstream on overall score.

Join encodings of the branches and loop through these to get latent vectors corresponding to each PSO initial seed molecule.

Collect and display all results - convert string outputs of PSO script to numbers and round, then filter on overall score and display.

Implementation of an AI-assisted fragment-generator in an open-source platform
Bilsland et al, RSC Med Chem, in press

Knime implementation of dual fragment autoencoder model as described in Bilsland et al, J. Chem. Inf. Model. 2021, 61, 6, 2547–2559. The model is trained to reproduce both SMILES and chemical/pharmacophore fingerprints using ~465K fragments from commercial sources. We applied transfer learning to fingerprint decoder layers using data from previous in-house fragment screens to develop a classifier for "privileged fragments".
The generative model uses particle swarm optimisation to move toward better fragments, as defined by parameters including fraction sp3-hybridisation, SAS, privileged fragment score, heavy atom count, structural alerts, and ring sizes. See the above paper and ...............(where this workflow is reported) for references.

NOTE TO USERS: we provide a default list of structural alerts from surechembl (https://www.surechembl.org/knowledgebase/169485-non-medchem-friendly-smarts) augmented with a few DL-specific filters we have found useful. The user is strongly advised to provide their own more comprehensive list for optimal performance. However, gains could also be made by adjusting PSO parameters. The default settings are those from the original paper.

Workflow diagram node labels: fingerprint encoder; data reshape for encoder; get current smiles; one-hot; strip stereo, exclude K, I, check for rdkit validity; recover smiles string; substitute Cl->X, Br->Y; join smiles and encoded fingerprints; make new row IDs; get current fingerprint encoding; smiles encoder; concat smiles and fp encodings; combine all to single column; keep combined; overall score to number; latent vector encoder; get latent vector; select each pso seed smiles; end read; initial seed smiles; PSO_matrixDecode; get workspace variables; file to find; get data path.

Node types: Keras Network Reader, Keras Network Executor, Python Script (1⇒1), Python Script (2⇒1), Variable to Table Row, Table Row To Variable Loop Start, Loop End, Joiner, RowID, Column Combiner, Column Filter, String To Number, Round Double, Merge Variables, Extract Context Properties, Table Creator, Component.
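The particle swarm step described in the annotations can be illustrated with a generic sketch. It is a textbook PSO under stated assumptions, not the workflow's PSO script: the default values, the use of one acceleration constant for both attraction terms, the reading of "bounds" as the spread of initial particles around the seed, and the function toy_score (a hypothetical stand-in for decoding a latent vector and computing the overall score) are all invented for the example. The comma-separated vector input and the rule that only candidates scoring better than the seed are kept follow the annotations.

```python
import random


def parse_latent(cell):
    """Split one comma-separated table cell back into a list of floats."""
    return [float(value) for value in cell.split(",")]


def pso(seed, score, iterations=20, n_particles=10, bound=1.0,
        inertia=0.7, acceleration=1.5, rng=None):
    """Maximise score(position) from a seed; returns (best_position, best_score).

    All default values are hypothetical, not the workflow's settings.
    """
    rng = rng or random.Random(0)
    dim = len(seed)
    # Assumption: particles start as random perturbations of the seed.
    positions = [[s + rng.uniform(-bound, bound) for s in seed]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [score(p) for p in positions]
    # The seed is the baseline: only improvements over it are retained.
    best_position, best_score = list(seed), score(seed)
    for position, value in zip(personal_best, personal_score):
        if value > best_score:
            best_position, best_score = position[:], value

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + acceleration * r1 * (personal_best[i][d] - positions[i][d])
                    + acceleration * r2 * (best_position[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            current = score(positions[i])
            if current > personal_score[i]:
                personal_best[i], personal_score[i] = positions[i][:], current
            if current > best_score:
                best_position, best_score = positions[i][:], current
    return best_position, best_score


def toy_score(position):
    """Hypothetical stand-in for decode-and-score: best at the origin."""
    return -sum(x * x for x in position)


seed = parse_latent("0.5,-0.5,0.25")
best_position, best_score = pso(seed, toy_score)
print(best_score >= toy_score(seed))  # True: the result is never worse than the seed
```

In the workflow the vector has 64 components and the score comes from decoded molecules, so each evaluation is far more expensive than in this toy; that cost is why the iteration and particle counts are exposed as sliders.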
