
Generate Spark Database

Cresset KNIME Nodes version 2.1.0.20539 by www.cresset-group.com

Generate Spark Database is a tool for generating or updating Spark databases. It reads a list of molecules, breaks them into fragments, and stores the fragments in a database file for use with Spark or the "Spark Database Search" node.

This node wraps the executable 'sparkdb', which must be installed with a valid license for this node to work. If this is installed in the default location on Windows, then it should be found automatically. Otherwise, you must either set the "Cresset Home" preference or the CRESSET_HOME environment variable to the base Cresset software install directory. You may also set the "sparkdb Path" preference or the CRESSET_SPARKDB_EXE environment variable to point directly at the executable itself.
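The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the node's actual logic: the precedence (explicit executable path first, then the install directory), the recursive search beneath CRESSET_HOME, and the helper name find_sparkdb are all assumptions. Only the two environment variable names come from the text.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_sparkdb(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return a path to the sparkdb executable, or None if not found."""
    # 1. A direct pointer to the executable (assumed to take precedence).
    exe = env.get("CRESSET_SPARKDB_EXE")
    if exe and Path(exe).is_file():
        return Path(exe)

    # 2. Search beneath the base install directory. The layout below
    #    CRESSET_HOME is not documented here, so search recursively.
    home = env.get("CRESSET_HOME")
    if home and Path(home).is_dir():
        for name in ("sparkdb", "sparkdb.exe"):
            for candidate in sorted(Path(home).rglob(name)):
                if candidate.is_file():
                    return candidate
    return None
```

Calling find_sparkdb() with no arguments reads the current process environment. A return value of None suggests that neither variable points at a usable installation.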

The Generate Spark Database node can be configured to use additional resources to perform calculations. The time taken for the node to run will be drastically reduced if you use the Cresset Engine Broker. To use this facility, set either the "Cresset Engine Broker" preference or the CRESSET_BROKER environment variable to point to the location of your local Engine Broker. If you do not currently have the Cresset Engine Broker, contact Cresset (enquiries@cresset-group.com) for pricing on local and cloud-based brokers.

For more information visit www.cresset-group.com or contact us at support@cresset-group.com.

Options

Basic

Column containing input molecules structures
The column in the first input datatable containing the molecules to fragment and add to the database.
Title column
The column in the first input datatable containing the title of the molecule. If this is left blank then the title in the molecule structure column will be used. The molecule title will appear in the "Spark Database Search" node output in the "Parent Title" column.
Extra meta data column 1
The column in the first input datatable containing extra meta data to store in the database. The meta data will appear in the "Spark Database Search" node output in the "Parent Aux1" column.
Extra meta data column 2
The column in the first input datatable containing extra meta data to store in the database. The meta data will appear in the "Spark Database Search" node output in the "Parent Aux2" column.
Database to create/update
The full path to the database to create or update. For this database to be accessible from the "Spark Database Search" node the database should be saved to one of the directories listed in the "SPARK_CRESSET_DB" or "SPARK_DB" environment variables.
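As a rough check before running the node, the following Python sketch tests whether a chosen database path lies directly in one of the listed directories. It assumes that each variable may hold several directories separated by the platform's path separator (os.pathsep); that format is not stated here. The function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def is_searchable_location(db_path: str, env: Mapping[str, str] = os.environ) -> bool:
    """True if db_path sits directly in a directory listed in
    SPARK_CRESSET_DB or SPARK_DB."""
    parent = Path(db_path).resolve().parent
    for var in ("SPARK_CRESSET_DB", "SPARK_DB"):
        # Assumption: multiple directories are separated by os.pathsep.
        for entry in env.get(var, "").split(os.pathsep):
            if entry and Path(entry).resolve() == parent:
                return True
    return False
```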
Category
When selecting databases in Spark, this database will be listed under the category specified by this option. If this is not set, then the database will appear as 'Uncategorized'.
Sub-category
When selecting databases in Spark, this database will be listed under the subcategory specified by this option. If this is not set, then the database will appear directly under its category.
Description
A description of the database.
Speed
Speed of the operation. Choose from (in order of decreasing speed, but increasing thoroughness): Quick, Normal or Exhaustive. Note that changing this option will alter the values of several other options.

Fragmentation Method

Fragmentation mode
Specify how the input molecules are to be handled.
  • Molecules, need fragmentation - the input molecules will be broken into pieces using Cresset's fragmentation rules, and the pieces stored in the database.
  • Pre-labelled fragments - the input molecules are assumed to be pre-existing fragments which have been labelled with a particular element marking the attachment points (see the Attachment point labels option). The attachment point labels will be removed and the molecules will be imported into the database without further fragmentation.
  • Reagent importer - the input molecules are reagents to be processed according to one or more of Cresset's reagent-handling rules (see the Reagent type parameter). This mode converts a file of usable reagents into the R group that is used in the final molecule.
Maximum pieces per fragment
The fragmentation process breaks each molecule into multiple small pieces. Fragments consist of all connected sets of up to N pieces. This parameter sets N, the maximum number of connected pieces used to create a fragment. If this value is large, more fragments will be created, but these will be larger, more flexible and more functionalised.
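To make "all connected sets of up to N pieces" concrete, here is a small Python sketch that enumerates such sets for a toy piece graph. The adjacency-dictionary representation and the function name are assumptions made for illustration. Cresset's actual fragmentation rules, and the other limits on this page, are not modelled.

```python
from typing import Dict, FrozenSet, Set


def connected_piece_sets(
    adjacency: Dict[str, Set[str]], max_pieces: int
) -> Set[FrozenSet[str]]:
    """All connected sets of 1..max_pieces pieces in a piece graph."""
    found: Set[FrozenSet[str]] = set()
    frontier = {frozenset([piece]) for piece in adjacency}
    for _ in range(max_pieces):
        found |= frontier
        # Grow every set by one piece bonded to a piece already in it.
        frontier = {
            pieces | {neighbour}
            for pieces in frontier
            for piece in pieces
            for neighbour in adjacency[piece]
            if neighbour not in pieces
        }
    return found


# A molecule cut into four pieces joined in a chain: A-B-C-D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(len(connected_piece_sets(chain, 2)))  # 7: four single pieces, three pairs
print(len(connected_piece_sets(chain, 3)))  # 9: adds A-B-C and B-C-D
```

Even in this linear example the count rises with N. Branched molecules grow faster, which is why larger values produce bigger databases.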
Attachment point labels
This option sets the atomic number of the element which labels the fragments' attachment point (e.g., 52 for Tellurium). Any molecule without such a label is ignored.
Create all reagent databases
If checked, each of the available reagent-handling rules will be applied to the input molecules, creating the appropriate databases in turn. The databases will be named as specified by the 'Database to create/update' option, with the reagent name added. Any input molecule that does not contain a matching pattern will be ignored.
Reagent type
If importing reagents, you need to specify what the reactive group is and how it is changed during the reaction. For example, a set of boronic acids for use in a Suzuki coupling needs to have the boronic acid removed and the atom that it was attached to labelled as the fragment attachment point. The default reagent types are:
  • Amines, delete the N - Primary amines as an alkylating agent where the N is deleted on addition e.g. R-NH2 -> R-*
  • Aliphatic alcohols, delete the O - Alcohols used as alkylating agents where the O is deleted on addition e.g. R-OH -> R-*
  • Aromatic alcohols, keep the O - Aromatic alcohols where the attachment is through the oxygen e.g. Ar-OH -> Ar-O-*
  • Secondary amines, keep the N - Secondary amines where the N is the attachment point such as in nucleophilic substitution e.g. R1(R2)NH -> R1(R2)N-*
  • Olefins, delete the -C=C - Terminal olefins, keep only the attached group e.g. R-C=C -> R-*
  • Primary aliphatic halide - Primary aliphatic halides (Cl,Br,I) e.g. R-CH2-Cl -> R-CH2-*
  • Aromatic halide - Aromatic halides (Cl,Br,I) e.g. Ph-Cl -> Ph-*
  • Aliphatic halide - Primary/secondary/tertiary aliphatic halides (Cl,Br,I) e.g. R(1-3)C-Cl -> R(1-3)C-*
  • Sulfonic acids/acid chlorides, delete the -SO2X - Sulfonic acids where we keep only the group attached to the sulfur e.g. R-SO3H -> R-*
  • Thiols, keep the S - Thiols where the attachment is through the sulfur e.g. R-SH -> R-S-*
  • Acids/acid chlorides, delete the -COOH - Acids where we keep only the group attached to the acid carbonyl. e.g. R-COOH -> R-*
  • Isocyanates, delete -NCO - Isocyanates, keeping only the attached group e.g. R-NCO -> R-*
  • Aromatic boronic acids, delete -B(OH)2 - Aromatic boronic acids for Suzuki couplings etc: lose the boronic acid and attach the remainder. e.g. Ph-B(OH)2 -> Ph-*
  • Sulfonic acids/acid chlorides, Keep the -SO2 - Sulfonic acids where we keep the -SO2 group e.g. R-SO3H -> R-SO2-*
  • Alkynes, delete the -C#C - Alkynes, keep only the attached group e.g. R-C#C -> R-*
  • Acids/acid chlorides, keep the carbonyl - Acids where we attach through the carbonyl group (e.g. acylations) e.g. R-COOH -> R-C(=O)-*
  • Aliphatic thiols, delete the S - Thiols used as alkylating agents where the S is deleted on addition e.g. R-SH -> R-*
  • Cyano groups, delete -CN - Cyano reagents, keeping only the attached group e.g. R-CN -> R-*
  • Amines, keep the N - Primary and secondary amines where the N is the attachment point such as in reductive aminations e.g. R-NH2 -> R-NH-*
  • Alcohols, keep the O - Alcohols where the attachment is through the oxygen e.g. R-OH -> R-O-*

Fragmentation Settings

Maximum attachment points per fragment
Specifies the maximum number of attachment points a fragment can have. The larger the value, the larger the database.
Only keep ring-containing fragments
If checked, only fragments containing one or more ring atoms will be kept.
Enumerate tautomers of input structures
If this option is checked, then Cresset's rules for tautomer enumeration will be applied to every input molecule, creating all tautomers where a 1,3, 1,5 or 1,7 hydrogen shift is possible. If you prefer to use your own enumeration technique, leave this option unchecked and load the input molecules as a pre-populated set of enumerated tautomers.
Reprocess molecules that have already been seen
If this option is checked, molecules that have been fragmented previously will be processed again. This option is useful if you want to change the fragmentation settings (e.g. Maximum attachment points per fragment). Note that the frequency of occurrence data will no longer be reliable when this option is turned on (for example, if you run the same file twice, all fragment frequencies will double), and that using this option will not cause fragments that are already in the database to have their conformations recalculated. The default (unchecked) is to completely skip molecules that have already been seen.
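The effect on the frequency data can be illustrated with a toy model in Python. The molecule and fragment names are invented, and the bookkeeping is a simplification for illustration only, not a description of how the database stores its counts.

```python
from collections import Counter
from typing import Dict, List, Set


def process_file(
    molecules: Dict[str, List[str]],
    frequencies: Counter,
    seen: Set[str],
    reprocess: bool = False,
) -> None:
    """Add one input file's fragments to the running frequency counts."""
    for name, fragments in molecules.items():
        if name in seen and not reprocess:
            continue  # default behaviour: skip molecules already seen
        seen.add(name)
        frequencies.update(fragments)


# Hypothetical input: two molecules and the fragments they yield.
molecules = {"mol1": ["fragA", "fragB"], "mol2": ["fragA"]}
frequencies, seen = Counter(), set()

process_file(molecules, frequencies, seen)                  # first run
process_file(molecules, frequencies, seen)                  # skipped, counts unchanged
process_file(molecules, frequencies, seen, reprocess=True)  # counts double
print(frequencies["fragA"], frequencies["fragB"])           # 4 2
```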
Max confs before committing to database
The fragmentation process collects fragments and commits them to the database in batches. The larger the batch size, the more memory is used, but the more efficient the process is, especially when multiple parallel fragmentation processes are running on the same database. If you are using large numbers of Field Engines, you should increase this value.
Maximum fragment heavy atom count
Fragments with more than this number of heavy atoms will not be generated.
Maximum fragment molecular weight
Fragments which weigh more than this limit will not be generated.
Maximum number of rotatable bonds
Fragments which exceed this limit will not be generated. This is useful to prevent long alkyl chains and the like from appearing in the database.
One rotatable bond counts as this many heavy atoms
One rotatable bond counts as this many heavy atoms when checking the maximum fragment heavy atom count. This gives the option to penalize molecules with large numbers of rotatable bonds by including them in the 'Maximum fragment heavy atom count'.
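One reading of how the four size limits above combine is sketched below in Python. It assumes that the rotatable bond penalty is simply added to the real heavy atom count before comparing against the limit. The function and parameter names are hypothetical, and the limits in the usage lines are arbitrary values chosen for the example.

```python
def passes_size_filters(
    heavy_atoms: int,
    rotatable_bonds: int,
    mol_weight: float,
    *,
    max_heavy_atoms: float,
    max_mol_weight: float,
    max_rotatable_bonds: int,
    atoms_per_rotatable_bond: float = 0.0,
) -> bool:
    """Return True if a fragment is within all three size limits."""
    # Assumption: each rotatable bond adds a fixed penalty to the atom count.
    effective_atoms = heavy_atoms + atoms_per_rotatable_bond * rotatable_bonds
    return (
        effective_atoms <= max_heavy_atoms
        and mol_weight <= max_mol_weight
        and rotatable_bonds <= max_rotatable_bonds
    )


limits = dict(max_heavy_atoms=12, max_mol_weight=200.0, max_rotatable_bonds=5)
print(passes_size_filters(10, 4, 150.0, **limits))                                # True
print(passes_size_filters(10, 4, 150.0, atoms_per_rotatable_bond=1.0, **limits))  # False
```

In the second call the four rotatable bonds raise the effective count from 10 to 14, which exceeds the limit of 12 even though the rotatable bond limit itself is not reached.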

Conformer Hunt

Filter duplicate conformers at RMS
Sets the similarity threshold below which two conformers are deemed identical. This effectively controls the coarseness of the sampling of conformational space. A low value leads to conformations that are only marginally different, while using a large value means that a conformation near the 'correct' one may not be generated. Values of 0.5 to 1.0 are recommended: values at the higher end of the range are more appropriate for larger, more flexible molecules.
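The role of the threshold can be shown with a simplified duplicate filter in Python. This is a sketch only: it compares atom positions directly, with no superposition of the two conformers, and keeps the first conformer of each group. How the conformer hunt aligns and compares structures is not described here, so both choices are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]


def rms_distance(a: Coords, b: Coords) -> float:
    """RMS distance between matching atoms, without superposition."""
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def filter_duplicates(conformers: List[Coords], threshold: float) -> List[Coords]:
    """Keep a conformer only if it is at least `threshold` from all kept ones."""
    kept: List[Coords] = []
    for conformer in conformers:
        if all(rms_distance(conformer, other) >= threshold for other in kept):
            kept.append(conformer)
    return kept


# Three two-atom "conformers": the second is a slight distortion of the first.
c1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
c2 = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]
c3 = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(len(filter_duplicates([c1, c2, c3], 0.5)))   # 2: c2 is dropped as a duplicate
print(len(filter_duplicates([c1, c2, c3], 0.05)))  # 3: a low threshold keeps all
```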
Maximum number of conformations
The maximum number of conformations to generate for any fragment. Values of 20-30 are recommended: this should usually suffice to cover the conformational space of most reasonable fragments. If you are generating particularly large or flexible fragments you may want to increase this to 50 (at the expense of longer generation time, larger database files and longer search times).
If set to 0 then no conformations will be generated and the fragments will be imported in the input conformations. This is useful for e.g. building databases from PDB or CSD conformations.
No. of high-T dynamics runs for flexible rings
Most small rings are handled using a ring conformation library. Conformations for rings that are not found in the library are sampled using high-temperature (~600K) dynamics with energy initially distributed into torsional degrees of freedom. The number of dynamics runs (and hence the degree of ring conformation sampling) is set by this value. Values of 2-10 are recommended. Values above 5 make little difference to flexible rings of fewer than 8 atoms.
Gradient cutoff for conformer minimization
All conformers found are minimized using the XED force field. This option sets the gradient cut-off at which the minimization is terminated. Values that are too small lead to insufficient sampling of conformational space and long run times. Values that are too large can lead to unrealistic structures being generated. Values of 0.1 kcal/mol/A to 1.0 kcal/mol/A are recommended with values at the smaller end of the range being preferred if the 'Include coulombics' option is not checked.
Energy window
Conformations whose minimized energy is outside the energy window are discarded. The window is calculated from the lowest energy conformation that has been found. The ideal value for this option depends on the 'Gradient cutoff for conformer minimization' and 'Include coulombics' options. When the 'Include coulombics' option is not checked, the best results are obtained by minimizing to a low gradient (0.1 or better) and applying a smaller energy window (3 kcal/mol), but this significantly increases the calculation time. Checking the 'Include coulombics' option requires a significantly larger energy window for large molecules (12 kcal/mol), as these can form very low energy collapsed and internally H-bonded structures.
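The window test itself amounts to a comparison against the lowest energy found, as in this minimal Python sketch. The function name and the energies are made up for the example, and the inclusive comparison at the window boundary is an assumption.

```python
from typing import List


def apply_energy_window(energies: List[float], window: float) -> List[float]:
    """Keep energies (kcal/mol) within `window` of the lowest one."""
    if not energies:
        return []
    lowest = min(energies)
    return [energy for energy in energies if energy - lowest <= window]


energies = [10.0, 11.5, 14.0, 25.0]         # hypothetical minimized energies
print(apply_energy_window(energies, 3.0))   # [10.0, 11.5]
print(apply_energy_window(energies, 12.0))  # [10.0, 11.5, 14.0]
```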
Acyclic secondary amides handling
Specify how the conformation hunter is to handle amides.
  • Force amides trans - forces all secondary amides to adopt the trans geometry.
  • Use input amide geometry - leaves secondary amides in the geometry they had in the input file, and sets them as non-rotatable. As a result, if the input molecule was drawn with a cis amide then only conformations with cis amides will be generated.
  • Allow amides to spin - allows the amide bond to spin, so a mixture of cis and trans amides can be generated.
Include coulombics
If checked, then the conformer generation process uses the full force field, including long-range electrostatics. Better conformer populations are usually generated with this option not checked.

Ignore Existing Fragments

Ignore Fragments in Databases
Skips fragments that are already present in the specified databases. More than one database can be selected by holding the "Ctrl" key.

Input Ports

The molecules to be fragmented and added to the database.

Update Site

To use this node in KNIME, install Cresset KNIME Nodes from the following update site: