
TroveKleaner

WELCOME TO THE TroveKleaner.
This workflow will correct some (not all) OCR errors in large collections of documents. For more information, please read the content in the "More information" node.

1. PRELIMINARY CLEANING
Before you generate corrections specific to your data, use these nodes to remove unwanted characters and unusable documents. Here you can also apply previously saved corrections. The result is a file called Cleaned-text.zip.

2. SELECT INPUT FILE
Select the input file to process. If you have not generated any corrections yet, you will only be able to choose the cleaned text.

3. CORRECT CONTENT WORDS
This node guesses corrections for terms that appear to be OCR errors. It only corrects terms that have semantic content; that is, it ignores stop-words. It does, however, build a list of errors that are likely to be stop-words. The output is a file called LDA-corrections.zip.

4. CORRECT STOP-WORDS
This node finds replacements for the likely stop-word errors identified in the previous step. The corrected output is a file called Stopword-corrections.zip.

5. TAG BIGRAMS
This node tags statistically interesting pairs of consecutive words, or bigrams. The tagged bigrams will yield more numerous and more reliable corrections from the previous two steps when you next run them. The output is a file called Bigrams-tagged.zip.

REFRESH OR FINISH
This node lets you split bigrams, strip punctuation, re-apply corrections, and save your data as a CSV file.

INSPECT RESULTS
View the log and compare corrected and uncorrected versions of the text.

Node annotations: "RESET and run this node first!", "Double-click to select", "Double-click to open".

Node labels: List files, Clean text, Select input file, Correct content words, Correct stopwords, Tag Bigrams, Finishing, Inspect results, More information.
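As a rough illustration of what tagging "statistically interesting" bigrams can mean, here is a minimal Python sketch. It is not the workflow's actual implementation: the description above does not say which statistic TroveKleaner uses, so pointwise mutual information (PMI) is assumed here, and the function name `tag_bigrams`, the thresholds `min_count` and `min_pmi`, and the underscore joiner are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tag_bigrams(docs, min_count=2, min_pmi=3.0):
    """Join frequent, high-PMI word pairs with an underscore.

    Assumption: PMI stands in for whatever measure the workflow really uses.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Count single words and adjacent word pairs across the whole collection.
    unigrams = Counter(tok for toks in tokenized for tok in toks)
    bigrams = Counter(pair for toks in tokenized for pair in zip(toks, toks[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(pair):
        # log2 of observed pair probability over the probability expected
        # if the two words occurred independently.
        p_pair = bigrams[pair] / n_bi
        p_a = unigrams[pair[0]] / n_uni
        p_b = unigrams[pair[1]] / n_uni
        return math.log2(p_pair / (p_a * p_b))

    keep = {p for p, c in bigrams.items() if c >= min_count and pmi(p) >= min_pmi}

    tagged = []
    for toks in tokenized:
        out, i = [], 0
        while i < len(toks):
            # Greedy left-to-right match: tag the pair and skip both tokens.
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) in keep:
                out.append(toks[i] + "_" + toks[i + 1])
                i += 2
            else:
                out.append(toks[i])
                i += 1
        tagged.append(" ".join(out))
    return tagged


docs = [
    "new york is big",
    "new york is old",
    "the big dog is old",
    "the old cat is big",
]
print(tag_bigrams(docs))
```

Treating a tagged pair as a single token is what would let the later correction steps handle it as one term; splitting the underscore back into a space corresponds to the "split bigrams" option in the finishing step.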
