The project was developed by a group of students from FORE School of Management.
The aim was to build a predictive model that classifies passengers' tweets about different US airlines into three categories: positive, negative, and neutral. The primary objective was to predict a tweet's category from its content alone.
I. Data Preprocessing
File Reader Node:
- The project commenced with the File Reader node, which loaded the dataset containing passengers' tweets.
- The dataset included textual content, airline-related comments, and labels indicating whether the tweets were positive, negative, or neutral.
Text/Data Cleaning:
- Data cleaning was a crucial step to prepare the text data for analysis. The following operations were performed:
- Case Converter: Text data was standardized to lowercase, ensuring uniformity and eliminating inconsistencies in capitalization.
- Number Filter: Any numerical values were filtered to remove numeric artifacts and ensure a cleaner text corpus.
- Punctuation Erasure: Punctuation marks were removed to focus on the text's content.
- POS Tagger: Part-of-speech tagging was applied to label the grammatical role of each word.
- Stop Word Filter: Common stop words were removed to reduce noise and enhance text analysis.
- Porter Stemmer: Words were stemmed to their root form to standardize variations and improve feature extraction.
- nChars Filter: Words shorter than a specified minimum number of characters were removed to discard uninformative short tokens.
II. Text Feature Engineering
Bag of Words Creator:
- The Bag of Words Creator node converted the cleaned text data into numerical representations, creating a feature vector for each tweet.
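Conceptually, the node maps each tokenized tweet to its term counts. A minimal Python equivalent using `collections.Counter` (the tokens are illustrative):

```python
from collections import Counter

# Tokenized tweets as they might look after the cleaning stage.
docs = [
    ["flight", "delay", "delay", "terribl"],
    ["great", "crew", "great", "flight"],
]

# One term-count "bag" per document.
bags = [Counter(doc) for doc in docs]
print(bags[0])  # Counter({'delay': 2, 'flight': 1, 'terribl': 1})
```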
TF-IDF Transformation:
- The TF-IDF (Term Frequency-Inverse Document Frequency) transformation was applied to the bag of words, giving more weight to unique terms within the context of each document.
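The weighting can be sketched as follows. This uses one common variant (raw term frequency normalized by document length, plain logarithmic inverse document frequency); KNIME offers several TF and IDF variants, so the exact numbers in the workflow may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weighted

docs = [["delay", "delay", "bad", "flight"], ["great", "flight"]]
weights = tf_idf(docs)
# "flight" appears in every document, so its IDF (and weight) is 0;
# "delay" appears twice in one document, so it outweighs "bad".
```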
Column Expressions:
- The Column Expressions node provided a hook for custom, script-defined column transformations where needed; the specific expressions depended on the project's requirements.
III. Vectorization
Document Vector:
- The Document Vector node further transformed the text data into a format suitable for machine learning, allowing numerical representation of each tweet.
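In spirit, the node turns each bag of words into a fixed-length numeric vector over a shared vocabulary, so every tweet has the same columns. A minimal sketch with illustrative documents:

```python
docs = [["delay", "delay", "bad"], ["great", "crew"]]

# A shared vocabulary fixes the column order across all documents.
vocab = sorted({term for doc in docs for term in doc})
index = {term: i for i, term in enumerate(vocab)}

def to_vector(doc):
    vec = [0] * len(vocab)
    for term in doc:
        vec[index[term]] += 1
    return vec

vectors = [to_vector(doc) for doc in docs]
print(vocab)    # ['bad', 'crew', 'delay', 'great']
print(vectors)  # [[1, 0, 2, 0], [0, 1, 0, 1]]
```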
IV. Data Labeling
Category to Class:
- The Category to Class node was utilized to convert the categorical labels ("positive," "negative," "neutral") into a numerical format suitable for classification (e.g., positive: 1, negative: 2, neutral: 3).
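The conversion is a simple lookup; a sketch using the example encoding above:

```python
# Encoding from the example above: positive → 1, negative → 2, neutral → 3.
label_map = {"positive": 1, "negative": 2, "neutral": 3}

labels = ["negative", "neutral", "positive", "negative"]
classes = [label_map[label] for label in labels]
print(classes)  # [2, 3, 1, 2]
```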
V. Model Development
Tree Ensemble Learner:
- The project employed a Random Forest approach using the Tree Ensemble Learner node. The learner was configured with parameters like the number of trees and maximum depth, allowing the model to be trained on the transformed data.
VI. Model Application
Tree Ensemble Predictor:
- The Tree Ensemble Predictor node used the trained model to make predictions on new data or test samples.
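KNIME's Tree Ensemble nodes implement a full random forest. As a toy illustration of the underlying idea only (bootstrap sampling plus majority voting, not the KNIME implementation), here is an ensemble of one-feature decision stumps; the data and labels are made up:

```python
import random
from collections import Counter

def train_stump(X, y):
    """Pick the single (feature, threshold) split with the lowest training error."""
    best = (None, None, None, None, len(y) + 1)
    for f in range(len(X[0])):
        for thr in {row[f] for row in X}:
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            if not left or not right:
                continue
            lp = Counter(left).most_common(1)[0][0]
            rp = Counter(right).most_common(1)[0][0]
            err = sum(lab != lp for lab in left) + sum(lab != rp for lab in right)
            if err < best[4]:
                best = (f, thr, lp, rp, err)
    if best[0] is None:  # no valid split: fall back to the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[:4]

def stump_predict(stump, row):
    f, thr, lp, rp = stump
    return lp if f is None or row[f] <= thr else rp

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample with replacement
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Toy data: one feature per row, labels 1 (positive) and 2 (negative).
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [2, 2, 2, 1, 1, 1]
forest = train_forest(X, y)
print(forest_predict(forest, [0.05]), forest_predict(forest, [0.95]))
```

A real random forest also grows each tree to full depth and samples a random feature subset at every split; the stumps here keep the bagging-and-vote mechanics visible in a few lines.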
VII. Model Evaluation
Scorer:
- A Scorer node was used to evaluate the model's performance. Common metrics like accuracy, precision, recall, and F1-score were calculated to assess the effectiveness of the predictive model.
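The metrics the Scorer reports can be reproduced by hand. A sketch for one class treated as "positive" (the per-class precision/recall view of the confusion matrix); the labels follow the numeric encoding above and the predictions are made up:

```python
def scores(actual, predicted, positive):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

actual    = [1, 1, 2, 2, 3]
predicted = [1, 2, 2, 2, 3]
acc, prec, rec, f1 = scores(actual, predicted, positive=1)
# accuracy 0.8; for class 1: precision 1.0, recall 0.5, F1 ≈ 0.667
```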
To use this workflow, download it from NodePit and open it in KNIME.