02_Text_Classification_from_Forum_Posts

Workflow

Assigning Topics to the KNIME Forum Posts
This workflow performs supervised topic classification on the forum posts. The training set consists of the description files of the KNIME nodes. Topic classes are the nodes' top-level categories in the Node Repository (IO, Data Manipulation, etc.) from KNIME versions prior to 3.0. The model is built on this training set and applied to the forum posts. For each post, the three topics with the highest probability are chosen as its topic classes. A Tree Ensemble is used as the classification model.
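The "top three topics" step can be sketched outside KNIME as a ranking over per-class probabilities. This is a minimal illustration in plain Python, not the workflow's actual implementation: the function name `top_k_topics`, the example probabilities, and every category other than IO and Data Manipulation are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def top_k_topics(class_probabilities: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k topic classes with the highest predicted probability."""
    # Sort by probability (descending); ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(class_probabilities.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical classifier output for a single forum post (made-up values).
example_post = {
    "IO": 0.40,
    "Data Manipulation": 0.30,
    "Mining": 0.15,
    "Data Views": 0.10,
    "Flow Control": 0.05,
}

for rank, (topic, probability) in enumerate(top_k_topics(example_post), start=1):
    print(f"topic {rank}: {topic} ({probability:.2f})")
```

In the workflow itself, the probabilities would come from the Tree Ensemble Predictor; the sketch only shows how three classes could be selected from such an output.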
NLP, text processing, natural language processing, topic detection, KNIME forum
Reporting
In order to see the report, execute the entire workflow and then click the "Open the report" button in the toolbar.

Reads contents from the KNIME Forum site after web crawling
This node reads the pages downloaded from the KNIME Forum web site after web crawling. The web crawling part used to be here as well, as an alternative to reading the prepared data. However, the KNIME Forum web site has changed over the years and the old workflows do not parse the extracted content correctly anymore. If you want to implement your own web crawler, use the Palladian or the Selenium nodes. The Palladian extension is available in the community extensions. The Selenium nodes can be downloaded from: http://seleniumnodes.com/

KNIME Forum: Classify Topics in Posts
This workflow assigns topics to the forum posts.

Prepare Data Set from KNIME HTML node description files
Node category is topic class. This data set is for training and evaluation of the model for topic classification.

Train and evaluate a random forest model

Train random forest model on description files and apply model to forum posts

Node annotations: 1st topic; 2nd topic; 3rd topic; training and test set; from BoW to keywords; from BoW to keywords; fix keyword domain; extract title, year, etc. from document; topics in Row0; topics in Row1; topics in Row2; extract posts from 01-01-2007 to 31-12-2012; remove HTML; read all HTML files with node descriptions; destination folder for unzipped files; remote reading from server; random forest; random forest with same config settings; classify posts

Nodes: Data to Report; Data to Report; Data to Report; Partitioning; string manipulations; data preparation; data preparation; document vector; document vector; fix domain; classes 1-3; document properties; Joiner; View Row0; View Row1; View Row2; data cleaning; xml files; Create Temp Dir; Scorer; Table Reader; Unzip Files; Tree Ensemble Learner (deprecated); Tree Ensemble Predictor (deprecated); Tree Ensemble Learner (deprecated); Tree Ensemble Predictor (deprecated)
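The annotations above mention removing HTML, restricting posts to 01-01-2007 through 31-12-2012, and turning bag-of-words keywords into document vectors. The following is a rough sketch of those three steps using only the Python standard library. It is not how the KNIME nodes are implemented: the helper names, the tokenization rule, and the use of raw term counts as vector entries are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import date
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def remove_html(fragment: str) -> str:
    """Strip tags from a post body and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def in_date_range(posted: date,
                  start: date = date(2007, 1, 1),
                  end: date = date(2012, 12, 31)) -> bool:
    """Keep only posts inside the period named in the annotation (inclusive)."""
    return start <= posted <= end


def document_vector(text: str, keywords: List[str]) -> List[int]:
    """Count how often each keyword occurs in the text (assumed vector type)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[keyword] for keyword in keywords]


# Hypothetical post and keyword list, for illustration only.
raw_post = "<p>How do I <b>read</b> a CSV file?</p>"
clean_post = remove_html(raw_post)
print(clean_post)
print(document_vector(clean_post, ["read", "csv", "loop"]))
```

Fixing the keyword list up front mirrors the "fix keyword domain" annotation: training documents and forum posts need vectors with the same columns before one model can be applied to both.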

Download

Get this workflow from the following link: Download

Resources

Nodes

02_Text_Classification_from_Forum_Posts consists of the following 148 node(s):

Plugins

02_Text_Classification_from_Forum_Posts contains nodes provided by the following 9 plugin(s):