
Building Sentiment Predictor -- Solution

Building a Sentiment Analysis Predictive Model - Supervised Machine Learning

This workflow uses a Kaggle dataset of 14K customer tweets directed at six US airlines (https://www.kaggle.com/crowdflower/twitter-airline-sentiment). Contributors annotated the valence of each tweet as positive, negative, or neutral. Once satisfied with the model evaluation, users should export (1) the vector space and (2) the trained model for deployment on non-annotated data.

If you use this workflow, please cite: 
F. Villarroel Ordenes & R. Silipo, "Machine learning for marketing on the KNIME Hub: The development of a live repository for marketing applications", Journal of Business Research 137:393-410, 2021. DOI: 10.1016/j.jbusres.2021.08.036.

Your task here is to train different models and save them for deployment on non-annotated data.

1. Read the annotated Twitter dataset. Besides the CSV Reader node used here, KNIME provides a wide range of nodes for reading other dataset formats (e.g., Parquet, JSON, images).

2. Data manipulation/preparation. The data manipulation process is kept simple. Users working on text mining might want to add a Spell Checker node to handle spelling issues (e.g., "hapy" instead of "happy"). The most important node here is Strings To Document, which combines several string columns (e.g., author, text, title) into a single document column that can be text-mined in KNIME.

3. Use text mining to transform text into numbers. Enrichment means using dictionaries (e.g., LIWC) to tag words with predefined categories; this can serve purposes such as protecting those words from removal, or creating intensity measures (word-category percentages) per document. Preprocessing simplifies the analysis by (1) removing punctuation and stop words, (2) performing stemming, and (3) executing other cleanup tasks. From the preprocessed documents, a document vector is created to represent each document in a vector space. This blog post reviews different document encoding options: https://www.knime.com/blog/text-encoding-a-review. A rough Python equivalent of these three steps is sketched below.
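For readers who prefer code to the visual workflow, the reading, cleanup, and vector-space steps can be approximated with scikit-learn. This is a minimal sketch, not part of the KNIME workflow: it assumes the Kaggle file Tweets.csv with columns text and airline_sentiment, and its lowercasing and stop-word handling only approximate the KNIME preprocessing nodes (stemming is omitted here).

```python
# Minimal sketch of steps 1-3 in Python (assumes the Kaggle CSV "Tweets.csv"
# with columns "text" and "airline_sentiment"); approximates the KNIME
# preprocessing + Document Vector nodes with scikit-learn's TF-IDF encoder.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("Tweets.csv")             # 1. read the annotated dataset
df = df.drop_duplicates(subset="text")     # roughly the Duplicate Row Filter

# 2.-3. lowercase, remove English stop words, and build the document-term
# matrix: the "vector space" that is later exported for deployment.
# (Stemming would need an extra library such as NLTK and is skipped here.)
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(df["text"])
y = df["airline_sentiment"]                # positive / negative / neutral
```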
Try and test different machine learning models: use the training and test sets from the Partitioning node to train at least three different classifiers, and compute the accuracy and the time taken to train each model. Feel free to also try hyperparameter optimization on the models if time permits. As part of the solution, five learners and their corresponding predictors are tested; the Random Forest Learner is parameter-optimized over the total number of tree models and the splitting criterion.

Every model follows the same wiring: drag a suitable Learner node and connect it to the first output port (training set) of the Partitioning node; connect the second output port (test set) to the corresponding Predictor node; connect the Learner's model output port to the Predictor's model input port; train the model, test it with the Predictor node, and measure accuracy with a Scorer node. A Python sketch of this train-predict-score loop follows the list below.

FIRST MODEL: Weka's Naive Bayes Multinomial.

SECOND MODEL: hyperparameter-optimized Random Forest (a computationally expensive process!). The Random Forest model is optimized over the number of models in the forest and the splitting criterion; the best parameter set is selected by maximum accuracy.

THIRD MODEL: Gradient Boosted Trees Learner.

FOURTH MODEL: SVM Learner. P.S.: this is the same as in the original workflow; it is good to keep it side by side for easier comparison with the other models.

FIFTH MODEL (bonus): RProp MultiLayer Perceptron Learner.

EXECUTION TIME: Can you think of an appropriate node/process to measure the total execution time in hours?
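Outside KNIME, the same comparison can be sketched in a few lines of scikit-learn, continuing from the previous snippet. The estimators below are rough stand-ins for the KNIME/Weka nodes rather than exact equivalents, and the final line shows one way to report total training time in hours, addressing the execution-time question above.

```python
# Minimal sketch: train several classifiers on the X, y from the previous
# snippet and compare accuracy and training time per model. The estimators
# are rough stand-ins for the KNIME nodes, not exact equivalents.
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# 80/20 split, mirroring the Partitioning node.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Naive Bayes Multinomial": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=600),
    "Gradient Boosted Trees": GradientBoostingClassifier(),
    "SVM": LinearSVC(),
    "MLP": MLPClassifier(max_iter=300),
}

total = 0.0
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)                           # Learner node
    elapsed = time.perf_counter() - start
    total += elapsed
    acc = accuracy_score(y_test, model.predict(X_test))   # Predictor + Scorer
    print(f"{name}: accuracy={acc:.3f}, training time={elapsed:.1f}s")

print(f"Total training time: {total / 3600:.2f} h")       # Execution Time
```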
The workflow annotations on the canvas record the setup and results: the Kaggle dataset (N = 14,640 tweets from consumers to airlines) is loaded by drag and drop, strings are converted to documents, the sentiment annotation is extracted from the metadata in the document column, common preprocessing steps are applied (e.g., lowercasing, stop-word removal), words are transformed into term frequencies or other measures, the data is partitioned 80/20 with the document column excluded, and the vector space is exported for deployment. Test-set accuracies in the solution:

- SVM: 77.2%
- Gradient Boosted Trees: 76.3%
- Random Forest (best parameters): 75.7%
- Weka Naive Bayes Multinomial: 75.0%
- RProp MLP: 69.3%

For the Random Forest optimization, set the step size, start, and stop for the number of models in the ensemble; the parameter table covers models (600, 800, 1000) and splitting criterion (Gini Index, Information Gain, Information Gain Ratio).
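As a point of comparison, the Parameter Optimization Loop can be approximated in scikit-learn with a grid search over the same values. Note that the criteria do not map one-to-one: scikit-learn offers "gini" and "entropy" (information gain), but has no Information Gain Ratio counterpart, so this is only an approximation.

```python
# Rough Python counterpart of the Parameter Optimization Loop: grid-search
# the Random Forest over the number of trees and the splitting criterion.
# scikit-learn has no "Information Gain Ratio" criterion, so "gini" and
# "entropy" (information gain) are the closest available choices.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [600, 800, 1000],   # models in the ensemble
    "criterion": ["gini", "entropy"],   # splitting criterion
}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      scoring="accuracy", cv=3, n_jobs=-1)
search.fit(X_train, y_train)            # X_train, y_train from above
print(search.best_params_, search.best_score_)
```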
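Finally, once the evaluation is satisfactory, the workflow's Model Writer nodes export the trained models, alongside the vector space, for deployment on non-annotated data. Continuing the sketches above, the Python equivalent would be serializing the fitted vectorizer and the best model, e.g. with joblib (file names are illustrative):

```python
# Sketch of the "Export Vector Space" / "Export Model" step: persist the
# fitted TF-IDF vectorizer and the best classifier so that a deployment
# script can reload them and score non-annotated tweets.
import joblib

joblib.dump(vectorizer, "vector_space.joblib")
joblib.dump(search.best_estimator_, "sentiment_model.joblib")

# Deployment side: reload and score new, non-annotated tweets.
vec = joblib.load("vector_space.joblib")
clf = joblib.load("sentiment_model.joblib")
print(clf.predict(vec.transform(["@airline my flight was delayed again"])))
```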

Nodes

CSV Reader, Duplicate Row Filter, Strings To Document, Document Data Extractor, Column Filter, Partitioning, NaiveBayesMultinomial (3.7), Weka Predictor (3.7), Random Forest Learner, Random Forest Predictor, Table Creator, Parameter Optimization Loop Start (Table), Parameter Optimization Loop End, Gradient Boosted Trees Learner, Gradient Boosted Trees Predictor, SVM Learner, SVM Predictor, RProp MLP Learner, MultiLayerPerceptron Predictor, Scorer, Model Writer, Execution Time, plus the "Enrichment and Preprocessing" and "BoW and Vector Space" metanodes.