SECOND ASSIGNMENT 2 FINALLLLLL1

Import the dataset “Election.csv” which contains information about voters such as age, income, education, marital status, and voting decision.

Explore the dataset to check data types, distributions, and identify columns with missing values or potential data inconsistencies.

Handle missing data.
Missing values in numerical columns are replaced with the column median, and missing values in categorical columns with the most frequent value (mode). This avoids dropping rows and keeps the treatment consistent.
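As a rough illustration of the rule above, outside of KNIME, the following Python sketch computes one fill value per column and applies it. The rows and their values are made up for the example, and `fit_fill_values` and `apply_fill` are hypothetical helper names, not part of the workflow.

```python
from statistics import median, mode

# Made-up rows for illustration; None marks a missing value.
rows = [
    {"Age": 34, "Income": 52000, "Gender": "Female"},
    {"Age": None, "Income": 61000, "Gender": "Male"},
    {"Age": 58, "Income": None, "Gender": None},
    {"Age": 41, "Income": 45000, "Gender": "Female"},
]

def fit_fill_values(rows, numeric_cols, categorical_cols):
    """Compute one fill value per column from the non-missing values."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return fill

def apply_fill(rows, fill):
    """Return copies of the rows with missing values replaced."""
    return [
        {col: (fill[col] if value is None else value) for col, value in r.items()}
        for r in rows
    ]

fill = fit_fill_values(rows, ["Age", "Income"], ["Gender"])
cleaned = apply_fill(rows, fill)
# Count remaining missing values per column (should be zero everywhere).
missing_after = {col: sum(r[col] is None for r in cleaned) for col in fill}
```

Keeping the fill values in a separate dictionary makes it possible to reuse exactly the same values on rows that arrive later.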

Verify that all missing values have been handled correctly.
The dataset is now clean and ready for model training.

Convert categorical variables (Gender, Married, Religious) into numerical format using one-hot encoding so that they can be used in machine learning models.
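A minimal sketch of what one-hot encoding does, assuming made-up category values such as "Yes" and "No". The `Column_Value` naming of the new columns is a convention chosen for this example and may differ from the column names the One to Many node produces.

```python
# Made-up rows for illustration; the category values are assumptions.
rows = [
    {"Age": 34, "Gender": "Female", "Married": "Yes", "Religious": "No"},
    {"Age": 50, "Gender": "Male", "Married": "No", "Religious": "Yes"},
]

def fit_categories(rows, columns):
    """Record the sorted set of values seen in each categorical column."""
    return {col: sorted({r[col] for r in rows}) for col in columns}

def one_hot(rows, categories):
    """Replace each categorical column with one 0/1 column per known value."""
    encoded = []
    for r in rows:
        out = {col: value for col, value in r.items() if col not in categories}
        for col, values in categories.items():
            for value in values:
                out[f"{col}_{value}"] = 1 if r[col] == value else 0
        encoded.append(out)
    return encoded

categories = fit_categories(rows, ["Gender", "Married", "Religious"])
encoded = one_hot(rows, categories)
```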

Split the dataset into 70% training and 30% testing using stratified sampling based on the “Vote” column.
This ensures the same proportion of “Undecided” and “Decided” voters in both sets.
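The idea behind stratified sampling can be sketched in plain Python: split each class separately, then combine the pieces. The 100 rows below are made up, and `stratified_split` is a hypothetical helper, not the algorithm the Table Partitioner node uses internally.

```python
import random
from collections import defaultdict

def stratified_split(rows, target, train_fraction=0.7, seed=42):
    """Split rows so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[target]].append(r)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for group in by_class.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# Made-up data: 60 "Decided" and 40 "Undecided" voters.
rows = [{"id": i, "Vote": "Decided" if i < 60 else "Undecided"} for i in range(100)]
train, test = stratified_split(rows, "Vote")

decided_train = sum(r["Vote"] == "Decided" for r in train)
decided_test = sum(r["Vote"] == "Decided" for r in test)
```

In this toy case both parts keep the 60/40 class ratio of the full table, which a purely random split would only match approximately.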

Train a Logistic Regression model to classify voters as “Undecided” or “Decided”. This model is simple and interpretable, useful for understanding key relationships between variables.
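To show why logistic regression is considered interpretable, here is a small sketch that fits one weight and one bias by gradient descent on made-up data. This is not the solver used by the Logistic Regression Learner node; the data, learning rate, and number of epochs are assumptions chosen for illustration.

```python
import math

# Made-up toy data: one feature (age scaled to 0..1),
# label 1 = "Decided", label 0 = "Undecided".
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(row, weights, bias):
    """Probability of class 1 for a single row."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        errors = [predict_proba(row, weights, bias) - label for row, label in zip(X, y)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * row[j] for e, row in zip(errors, X)) / n
        bias -= lr * sum(errors) / n
    return weights, bias

weights, bias = train_logistic(X, y)
p_low = predict_proba([0.0], weights, bias)   # lowest feature value
p_high = predict_proba([1.0], weights, bias)  # highest feature value
```

The sign of each fitted weight indicates the direction of the relationship: here the weight is positive, so higher feature values push the prediction toward class 1.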

Train a Decision Tree model for classification.
It creates a tree structure based on features to predict the voting status and helps visualize decision rules.

Train a Random Forest model composed of multiple decision trees.
By averaging the predictions of many trees, this model typically gives higher accuracy and better generalization than a single decision tree, because it reduces overfitting.

Use the Table Creator node to manually enter one new voter profile (the 50‑year‑old man) with the same columns as the original dataset, so the model can make a prediction for this specific case.

Add a Missing Value node to handle any missing data in this new row using the same rules as for the training data, keeping preprocessing consistent.

Use One to Many to convert the categorical variables (Gender, Married, Religious) into dummy columns, matching the encoding used when training the models.
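The key point of these three steps is that the new row goes through the same preprocessing as the training data. A minimal sketch, assuming the fill values and category lists were learned from the training data beforehand; all concrete values below, including the new voter's fields other than age and gender, are hypothetical.

```python
# Hypothetical values learned from the training data (assumptions for illustration).
fill_values = {"Age": 41, "Income": 52000, "Gender": "Female", "Married": "Yes", "Religious": "No"}
categories = {"Gender": ["Female", "Male"], "Married": ["No", "Yes"], "Religious": ["No", "Yes"]}

def preprocess_new_row(row, fill_values, categories):
    """Fill missing values, then one-hot encode with the training categories."""
    filled = {col: (fill_values[col] if value is None else value) for col, value in row.items()}
    out = {col: value for col, value in filled.items() if col not in categories}
    for col, values in categories.items():
        for value in values:
            out[f"{col}_{value}"] = 1 if filled[col] == value else 0
    return out

# New profile: a 50-year-old man; the None entries stand for unknown values.
new_voter = {"Age": 50, "Income": None, "Gender": "Male", "Married": "Yes", "Religious": None}
encoded_voter = preprocess_new_row(new_voter, fill_values, categories)
```

Because the dummy columns are generated from the training categories, the new row always has the same columns as the training table, even when it contains only one value per category.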

The Random Forest Predictor applies the final Random Forest model (our best‑performing model) to the same profile to produce the main prediction used in our analysis.

Use a Value Counter node to measure the distribution of the target variable “Vote.” This step counts how many voters are labeled “Decided” and “Undecided,” allowing us to detect class imbalance in the dataset. The resulting counts (5373 decided vs. 3627 undecided, roughly 60% vs. 40%) show a moderate class imbalance that may affect model performance.
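The class shares can be worked out directly from the two counts reported above. The short sketch below also derives the accuracy of a trivial model that always predicts the majority class, which is a useful reference point when reading the model scores; the variable names are chosen for this example.

```python
from collections import Counter

# Counts of the target variable "Vote" as reported by the Value Counter node.
counts = Counter({"Decided": 5373, "Undecided": 3627})

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the majority class reaches this accuracy.
majority_label, majority_count = counts.most_common(1)[0]
majority_baseline = majority_count / total
```

With these counts the dataset has 9000 rows, and any model should be compared against the majority-class baseline of about 59.7% accuracy rather than against 50%.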

The Statistics node gives a more detailed summary of the data, split into numeric, nominal, and top/bottom views. For example, it lists the top 20 values and the missing values for Gender.

Nodes used in the workflow: CSV Reader, Statistics, Statistics View, Value Counter, Missing Value, One to Many, Table Partitioner, Logistic Regression Learner, Logistic Regression Predictor, Decision Tree Learner, Decision Tree Predictor, Random Forest Learner, Random Forest Predictor, Scorer, Line Plot, Table Creator.
