Imbalanced Sentiment Analysis with XGBoost
The workflow shows how to use row weights in XGBoost to compensate for class imbalance. The dataset used here is a sample of IMDB movie reviews with 90% negative and only 10% positive reviews. The data is preprocessed and represented as a bit vector that models a Bag of Words. Before the models are trained (one without and one with row weights), we extract the relative class frequencies to calculate the class_weights, which are just the inverse of the relative class frequencies. After training, the workflow reports several performance metrics for the two models, which show how overall accuracy can be traded off for sensitivity for the minority class.

The dataset is an imbalanced random sample of the Large Movie Review Dataset v1.0:
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142-150. https://aclanthology.org/P11-1015/

Workflow diagram
Sections: Data Import and Preprocessing; Predictive Modeling and Scoring.
Step labels: Color by sentiment label; Class distribution; Preprocessing of documents; Training / test set; Calculate class_weight; Add class_weight; Unweighted; Weighted; Visualize model performance.
Nodes: Table Reader, Color Manager, Bar Chart, Preprocessing, Partitioning, Statistics, Math Formula, Joiner (2), XGBoost Tree Ensemble Learner (2), XGBoost Predictor (2), Binary Classification Inspector.
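The class_weights calculation from the workflow description can be sketched outside of KNIME in a few lines of Python. This is only an illustration under stated assumptions: the function names class_weights and row_weights and the toy label list are hypothetical, and the workflow itself performs these steps with nodes rather than code.

```python
from collections import Counter


def class_weights(labels):
    """Map each class to the inverse of its relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    # relative frequency = n / total, so its inverse is total / n
    return {cls: total / n for cls, n in counts.items()}


def row_weights(labels, weights):
    """Assign every row the weight of its class."""
    return [weights[y] for y in labels]


# Hypothetical toy labels mirroring the 90% / 10% split described above.
labels = ["negative"] * 90 + ["positive"] * 10

weights = class_weights(labels)
rows = row_weights(labels, weights)

# With inverse-frequency weights, each class carries the same total weight.
total_per_class = {
    cls: sum(w for y, w in zip(labels, rows) if y == cls) for cls in weights
}

print(weights)
print(total_per_class)
```

With a 90/10 split, a positive row is weighted nine times as heavily as a negative row, so both classes contribute equally to the weighted training loss. In the Python xgboost package, a list like rows could be passed as sample_weight to XGBClassifier.fit; that is an analogy to the weighted learner in the workflow, not a description of how the KNIME node is implemented.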
