USPTO_Yield_Conversion_Lowe_Schwaller

Yield cleanup of USPTO csv/rsmi files

A simple way to clean up the yield columns in the Lowe/Schwaller public data-sets.
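As a rough illustration of what such a yield cleanup can look like outside KNIME, here is a minimal Python sketch. It is not the workflow's actual logic: the function names, the handling of ranges (keep the largest number), the 0 to 100 % validity window, and the "take the larger of the two values" merge rule are all assumptions made for this example.

```python
import re
from typing import Optional

# Matches plain decimal numbers such as "85" or "82.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_yield(raw: Optional[str]) -> Optional[float]:
    """Turn a raw yield string such as '85%' or '>=90 %' into a float, or None."""
    if raw is None:
        return None
    numbers = [float(m) for m in _NUMBER.findall(raw)]
    if not numbers:
        return None
    # Assumption: for a range like '50 to 60%' keep the largest number.
    value = max(numbers)
    # Assumption: values outside 0..100 % are treated as unusable.
    return value if 0.0 <= value <= 100.0 else None


def merge_yields(text_mined: Optional[str], calculated: Optional[str]) -> Optional[float]:
    """Combine two yield columns into one value.

    Assumed rule (hypothetical): use the larger parsed value,
    falling back to whichever of the two is available.
    """
    parsed = [parse_yield(text_mined), parse_yield(calculated)]
    candidates = [v for v in parsed if v is not None]
    return max(candidates) if candidates else None
```

A call such as `merge_yields("70%", "82.5%")` would then yield a single numeric column value; swap in a different merge rule if it suits your data better.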

Bonus: a (simple) way to split the reaction column into separate components.
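For readers who want the same idea without the cheminformatics nodes, the split can be sketched in plain Python. This assumes the reaction column holds reaction SMILES in the usual `reactants>agents>products` form, with `.` separating molecules; the function name and the returned dictionary keys are hypothetical, and no chemistry toolkit is used to validate the structures.

```python
from typing import Dict, List


def split_reaction(rxn_smiles: str) -> Dict[str, List[str]]:
    """Split a reaction SMILES into reactant, agent and product SMILES lists."""
    # Assumption: anything after the first space is a trailing annotation; drop it.
    core = rxn_smiles.strip().split(" ", 1)[0]
    parts = core.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products', got: {rxn_smiles!r}")
    # '.' separates the individual molecules within each role.
    reactants, agents, products = [
        [smi for smi in part.split(".") if smi] for part in parts
    ]
    return {"reactants": reactants, "agents": agents, "products": products}
```

Note that splitting on `.` also separates the ions of a salt (for example `[Na+].[Cl-]`) into two entries, which may or may not be what you want.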

A simplistic way to deconvolute the two yield columns in the publicly available USPTO files, the so-called "Lowe" and later "Schwaller" data-sets. These deconvoluted data sets and more explanation can be found on FigShare: https://doi.org/10.6084/m9.figshare.14414039
An alternative Python script version can be found on GitHub: https://github.com/DocMinus/Yield_curation_USPTO/tree/main

Bonus in this KNIME workflow: a POC for splitting the reaction string into its components.

The yield conversion uses only core KNIME nodes. The Bonus part requires the OpenBabel node and Indigo2. The former is auto-recognized by KNIME; the latter requires a manual software site entry: KNIME Community Extensions (Experimental) - https://update.knime.com/community-contributions/4.x (x depending on your KNIME version).

RI.SE

File Reader: Read csv or rsmi file. Settings, if not found automatically:
Lowe rsmi files: Header = yes, delim = <tab>
Schwaller csv files: Header = yes, delim = <tab>; Advanced: Limit Rows: Skip Rows = 2; Short Lines: Allow Short Lines = yes

FixYield: Here the yield will be fixed (some optional filters & sorting inside). Check inside if you want to remove or keep certain columns or change the order. Also: an ID is created which one might want to adapt to one's own needs or remove entirely.

CSV Writer: Write to a text file, or choose any other output depending on your needs. Write Header = yes, Separator = \t (i.e. tab).

Convert (Bonus, proof of concept): Split the reaction into components. Explicit goal: csv output, but the intermediate mol structures could of course be used in other ways. The result goes to a second CSV Writer with Write Header = yes, Separator = \t (i.e. tab).
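The reader settings above (tab delimiter, header row, skipped leading rows, short lines allowed) can be mirrored in a few lines of Python using only the standard library. This is a sketch under assumptions: it takes "Skip Rows" to mean lines skipped before the header, it pads short lines with empty strings, and the column names in the sample data are invented for the example.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_tab_file(handle: TextIO, skip_rows: int = 0) -> List[Dict[str, str]]:
    """Read a tab-delimited file with a header row, tolerating short lines."""
    # Assumption: the skipped rows come before the header line.
    for _ in range(skip_rows):
        next(handle, None)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore completely empty lines
        # "Allow short lines": pad missing trailing cells with empty strings.
        fields = fields + [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


# Hypothetical sample: two comment lines, a header, one full and one short line.
sample = io.StringIO(
    "# comment 1\n# comment 2\n"
    "id\treaction\tyield\n"
    "1\tCCO>>CC=O\t85%\n"
    "2\tCC>>C=C\n"
)
rows = read_tab_file(sample, skip_rows=2)
```

For a real file, pass an open file object instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")`.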