

TextKleaner
A KNIME workflow for preparing textual datasets for topic modelling and other types of analysis.
Created by Angus Veitch, January 2020. Version 0.1.1.

1. Load data and scrub text
Select your dataset (in either CSV or KNIME's table format), create unique document IDs, and scrub the text to make it ready for further processing.

2. Filter documents by relevance, terms or topics
Find and remove unwanted documents from your dataset. The excluded documents are saved in a separate file, which you can review at any stage.

3. Find duplicates and remove boilerplating
Detecting duplicated documents is highly recommended if you plan to tag ngrams or use topic modelling.

4. Tag and filter terms
Tag names and ngrams, standardise plurals and other variants, and remove terms that are rare or uninformative. The outputs will be suitable for topic modelling or term frequency analyses.

Workflow nodes:
Load data
Create Document IDs
Scrub text
Filter by key phrases
Filter by relevance score
Filter by topic
Review and rescue excluded docs
Detect duplicates
Remove boilerplate text
Tag named entities
Tag Ngrams
Filter and standardise terms
Save processed documents
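To make the steps above more concrete, here is a minimal Python sketch of three of them: creating document IDs and scrubbing text, detecting exact duplicates, and removing rare terms. It is an illustration only and is not how TextKleaner itself works, since TextKleaner is a KNIME workflow built from nodes. The function names, the `doc_0001` ID format, the hash-based duplicate check, and the `min_docs` threshold are all assumptions chosen for this example.

```python
import hashlib
import re
from collections import Counter


def scrub(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def make_doc_id(index):
    """Build a zero-padded ID such as 'doc_0001' (hypothetical format)."""
    return f"doc_{index:04d}"


def find_exact_duplicates(docs):
    """Map each duplicate document ID to the ID of its first occurrence.

    Only catches documents that are identical after lowercasing; near
    duplicates would need a fuzzier comparison.
    """
    seen = {}
    duplicates = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates[doc_id] = seen[digest]
        else:
            seen[digest] = doc_id
    return duplicates


def drop_rare_terms(docs, min_docs=2):
    """Keep only terms that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(text.lower().split()))
    return {
        doc_id: [t for t in text.lower().split() if doc_freq[t] >= min_docs]
        for doc_id, text in docs.items()
    }


raw = [
    "Topic  models need\nclean text.",
    "topic models need clean text.",
    "Clean text helps topic models",
]

# Step 1: assign IDs and scrub the text.
docs = {make_doc_id(i): scrub(t) for i, t in enumerate(raw, start=1)}

# Step 3: find duplicates and keep only the first copy of each document.
duplicates = find_exact_duplicates(docs)
unique_docs = {k: v for k, v in docs.items() if k not in duplicates}

# Step 4: remove terms that appear in fewer than two documents.
terms = drop_rare_terms(unique_docs)
```

Tokens are split on whitespace only, so `text.` and `text` count as different terms here; a real pipeline would normally strip punctuation during scrubbing.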


