Keyword Search

This component extracts the most relevant English keywords from a corpus (a collection of documents) using three techniques:

- Topic Extraction using LDA: collects a set of keywords for each topic, where the topics cluster the documents into groups.

- Term Co-Occurrence: finds pairs of keywords that frequently appear together across documents.

- Max(TF-IDF) measure: ranks terms by their highest TF-IDF score across all documents, as a measure of their importance throughout the corpus.
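
To make the last two techniques concrete, here is a minimal Python sketch that runs outside the component. It is not the component's implementation: the toy corpus, the function names, and the choice of relative term frequency with a plain log(N / df) IDF are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is already tokenized and pre-processed.
docs = [
    ["data", "mining", "text"],
    ["text", "mining", "keyword"],
    ["data", "science"],
]

def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the documents that contain both terms."""
    counts = Counter()
    for tokens in documents:
        # Use the set of terms so a pair is counted at most once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

def max_tfidf(documents):
    """Return each term's highest TF-IDF score over all documents."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    best = {}
    for tokens in documents:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                  # assumed: relative term frequency
            idf = math.log(n_docs / doc_freq[term])   # assumed: plain log(N / df)
            best[term] = max(best.get(term, 0.0), tf * idf)
    return best

pairs = cooccurrence_counts(docs)
scores = max_tfidf(docs)
```

Sorting `pairs` by count or `scores` by value in descending order gives rankings comparable in spirit to the second and third output tables.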

This component takes as input a column of Document type (e.g. created with the Strings To Document node) and identifies keywords in the corpus according to the hyper-parameters defined in the configuration dialogue. The collected keywords are provided in three output tables, one for each of the three techniques above.

By default, the component applies basic text pre-processing (e.g. removal of stop words and symbols) based on the English language. This pre-processing can be deactivated in the dialogue and performed outside of the component when working with other or multiple languages.
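
As a rough illustration of what such basic pre-processing involves, the following Python sketch lowercases a text, strips symbols, and drops stop words. The stop-word list and the `preprocess` function are hypothetical; the component's actual pre-processing steps and word list are not specified here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration only).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def preprocess(text):
    """Lowercase the text, keep only alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # symbols and digits are discarded
    return [t for t in tokens if t not in STOPWORDS]
```

For another language, the same shape applies with a different token pattern and stop-word list, which is the kind of upstream step the deactivation option leaves room for.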

Options

Deactivate Text Pre-Processing: By default, the component applies pre-processing for English text. Check this box if you want to analyze another language and/or apply custom text pre-processing outside of the component (upstream).
Document Column: Select the column containing the documents to search.
Parallel LDA: Beta: The beta parameter defines the prior on the per-topic multinomial distribution over words. It defines the prior weight of word w in a topic. The library uses the given beta for all words. It is normally a number much less than 1, e.g. 0.001, to strongly prefer sparse word distributions, i.e. few words per topic.
Parallel LDA: Alpha: The alpha parameter defines the Dirichlet prior on the per-document topic distributions. It defines the prior weight of topic k in a document. The library uses the given alpha for all topics. It is normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per document.
Parallel LDA: No. of Topics: Enter the desired number of topics for LDA.
Parallel LDA: Number of Words per Topic: Enter the number of keywords to return for each topic.
IDF Measure: Select the IDF variant to be used for TF-IDF.
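
The effect of small Alpha and Beta values described above can be checked with a short Python sketch. It samples from a symmetric Dirichlet distribution, using the standard construction of normalized Gamma draws, and measures how much weight falls on the single largest component. The function names, the 10-component size, and the concentration values are assumptions chosen for illustration; this is not the LDA library's code.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size=10, trials=2000, seed=0):
    """Average weight of the largest component over many samples."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    largest = (max(sample_dirichlet(concentration, size, rng)) for _ in range(trials))
    return sum(largest) / trials

sparse = mean_largest_share(0.1)  # concentration well below 1
dense = mean_largest_share(5.0)   # concentration well above 1
```

With a concentration well below 1, most of the probability mass lands on one or two components, so `sparse` comes out larger than `dense`. Read as Alpha, that means few topics per document; read as Beta, few words per topic.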

Input Ports

: Table containing the column of documents to analyze (Document type, selected via the Document Column option).

Output Ports

: Nouns, adjectives and verbs, along with the weights assigned by LDA, in a column.
: Nouns, adjectives and verbs, along with the counts of terms occurring in the corpus.
: Table of terms with the highest TF-IDF across all documents.
