topic extraction beg 1.knar

Topic Extraction

This workflow shows how to extract topics from text documents using the Topic Extractor node, and how to determine an optimal number of topics using the Elbow method.

The workflow reads textual data from a table (alternatively, the data can be fetched directly from news websites using the RSS Feed Reader node) and converts it into documents. The documents are then preprocessed, i.e. tagged, filtered, lemmatized, etc. After that, the Topic Extractor node can be applied to the preprocessed documents. However, the node requires the number of topics to extract as an input.

Several methods exist for determining the best number of topics; this workflow uses the Elbow method. It runs k-means clustering on the input documents for a range of values of the number of clusters k (e.g., from 1 to 20) and, for each value, calculates the within-cluster sum of squared errors (SSE), i.e. the sum of the squared distances of each data point to the center of its cluster. The SSE for each k is then plotted in a scatter plot. The best number of clusters is the one at which the SSE stops dropping sharply and the curve bends, forming an "elbow" in the plot.

Note that the Elbow method does not work for all data sets. If the plot shows no clear elbow, try a different approach, such as the Silhouette Coefficient.

Once the optimal number of clusters/topics has been found, the Topic Extractor node is executed and a tag cloud is created to visualize the terms of each topic.

[Workflow diagram. Sections: Preprocessing (tagging, filtering, lemmatizing, ...); Elbow Method (to determine the best number of clusters/topics); Transformation; Extract topics and create the tag cloud. Node annotations: extract topics from documents; POS tagging, lemmatization, stop word, number, ... filtering; filter terms based on too low or high frequencies; create bit vectors for documents; reduce the dimensions of the feature space; 20 iterations, k: 1-20; count the occurrences of each term; create one tag cloud for each topic. Nodes: Excel Reader (XLS), Table Creator, Strings To Document, Duplicate Row Filter, Preprocessing, Case Converter, Stop Word Filter, Document Vector, PCA (deprecated), Counting Loop Start, Java Edit Variable, k-Means, Calculate sum of squared errors, Variable to Table Column (deprecated), Loop End, Find Elbow, Scatter Plot, Topic Extractor (Parallel LDA), Term Count, Tag Clouds.]
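The Elbow method described above can be sketched outside KNIME in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: the toy 2-D points stand in for the document vectors, the farthest-point initialization is chosen only to make the run deterministic, and the rule of picking the k with the largest second difference of the SSE curve is an assumed heuristic (the internals of the workflow's Find Elbow component are not shown here). All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point that is farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_sse(points, k, max_iter=20):
    """Run a basic k-means and return the within-cluster sum of squared errors."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the members; keep the old center if a cluster is empty.
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def find_elbow(sse_by_k):
    """Return the k where the SSE curve bends most (largest second difference).

    Assumed heuristic; needs at least three values of k to be meaningful.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = ks[0], float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[cur]) - (sse_by_k[cur] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k


# Toy data: three well-separated groups of four points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0),
]

# Loop over candidate values of k, as the workflow does with k from 1 to 20.
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 7)}
best_k = find_elbow(sse_by_k)

for k in sorted(sse_by_k):
    print(f"k={k}  SSE={sse_by_k[k]:.2f}")
print("elbow at k =", best_k)
```

On real document vectors the SSE curve is usually much smoother than on this toy data, so it is worth inspecting the plot rather than relying on an automatic rule alone.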
