
Assignment 2, Group 18

Data preparation

  1. Remove all rows that have missing values

    • we do not try to "guess" (impute) these data points, since removing the rows with missing data has a negligible effect on the dataset (only 4 rows in total).

  2. Remove the Observation column, as it is only used as an index in the CSV file. After removing the rows with missing values it also contains gaps, which the model might pick up on, causing it to "overfit" the training data.

  3. Partition the dataset into training and test data. This is crucial, as the model needs to be evaluated on data it has not seen before. After testing a couple of different ratios, we settled on 70/30, where 70% is the training data and 30% is the test data. This amounts to 6 297 rows for training and 2 699 rows for testing.
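In a script-based environment, the three preparation steps above could be sketched as follows. This is a minimal illustration, not the actual KNIME workflow: the toy DataFrame stands in for the real CSV file, and only the Observation column name is taken from the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real CSV (in practice: df = pd.read_csv(...)).
df = pd.DataFrame({
    "Observation": range(1, 11),
    "age": [50, 34, None, 41, 29, 63, 55, None, 38, 47],
    "vote": ["Undecided", "A", "B", "A", "Undecided",
             "B", "A", "B", "Undecided", "A"],
})

# 1. Remove all rows that have at least one missing value.
df = df.dropna()

# 2. Drop the Observation column: it is only an index, and after the row
#    removal it contains gaps the model could latch onto (overfitting).
df = df.drop(columns=["Observation"])

# 3. Partition into 70% training data and 30% test data.
train, test = train_test_split(df, test_size=0.30, random_state=42)
print(len(train), len(test))  # 8 rows remain after dropna: 5 train, 3 test
```

Fixing `random_state` makes the split reproducible, mirroring how the KNIME Table Partitioner can use a fixed random seed.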

Candidate models comparison

  1. Use training data as input to learner node (the machine learning algorithm)

    • Pick the target column, in our case Vote. This is what the learner will try to predict later.

  2. Plug the trained model into a predictor node together with, importantly, the test data, which the learner node has never seen.

  3. Plug the predictor into a scorer node to compare the two columns: Vote and Predicted Vote.

    • this shows the "confusion" matrix as well as accuracy statistics, which we use to compare the models

      • the statistics used are overall Accuracy and Cohen's kappa, which takes into account the possibility of predicting the correct value at random

  4. For our chosen model (Gradient Boosted Trees) we used the ROC Curve node to show how much better the model is at predicting the class than random guessing.
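The learner → predictor → scorer chain above has a direct analog in scikit-learn. The sketch below is illustrative only: it uses synthetic data in place of the voting dataset, and the reported scores are not the workflow's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Synthetic stand-in for the real feature/target data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                  # "learner" node
    pred = model.predict(X_te)             # "predictor" node
    acc = accuracy_score(y_te, pred)       # "scorer" node: overall accuracy
    kappa = cohen_kappa_score(y_te, pred)  # corrects for chance agreement
    results[name] = (acc, kappa)
    print(f"{name}: accuracy={acc:.3f}, kappa={kappa:.3f}")
    # confusion_matrix(y_te, pred) gives the confusion matrix itself
```

Cohen's kappa is the more conservative of the two statistics: a classifier that always predicts the majority class can still achieve high accuracy on imbalanced data, but its kappa will be near zero.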

Based on the overall accuracy statistics of the models, we can deduce that the Gradient Boosted Trees model performs best.

Now, to test the model on a single data point, we simply swap the test data (previously 2 699 rows) for a single row representing the 50-year-old man. The model predicts that the person is Undecided with a confidence of 99% (0.998).
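In script form, swapping the test table for a single row amounts to calling the trained model on one sample and reading off the class probability. Again a hedged sketch on synthetic stand-in data; the real workflow would pass the encoded features of the 50-year-old man.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Fit on synthetic stand-in data (the real workflow trains on the survey data).
X, y = make_classification(n_samples=500, n_features=4, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# A single new data point takes the place of the whole test table.
new_point = X[:1]                            # stand-in for the encoded person
pred = model.predict(new_point)[0]           # predicted class
conf = model.predict_proba(new_point).max()  # confidence of that prediction
print(pred, round(conf, 3))
```

The confidence value is the predicted probability of the winning class, which is what the 0.998 figure above corresponds to.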

Strictly speaking, the most correct way to do this would be to connect the previously trained model to the new test data, because this way we are essentially creating a new model. However, we tested both approaches and they yield the same result, so for the clarity of the workflow we prefer to present it this way.

We could also use the entire dataset instead of the partitioned one, since we do not need the test data here. However, this actually made the model slightly more uncertain, as seen by the prediction confidence dropping from 0.998 (training partition only) to 0.997 (whole dataset).

[Workflow diagram (KNIME). Nodes and annotations:

  • CSV Reader (×2)
  • Missing Value: "Removes all rows that have at least 1 missing value" (a second Missing Value branch alternatively fills in the missing values)
  • Column Filter: "remove Observation column as it is just an index; ensures model is trained only on relevant data"
  • Table Partitioner: "Partition into training data (70%) and test data (30%)"
  • Decision Tree Learner / Predictor, Random Forest Learner / Predictor, Gradient Boosted Trees Learner / Predictor, Tree Ensemble Learner / Predictor
  • One Scorer per model (overall accuracies shown: 0.978, 0.983, 0.984, 0.984)
  • Column Filter: "filter out all but Overall Accuracy and Cohen's kappa"
  • Constant Value Column (deprecated): "adds model name column"
  • Number Rounder: "rounds to 3 decimal points"; Column Resorter: "puts column name first"
  • Concatenate: "Concatenate tables into one for comparison"; Row Filter nodes; Bar Chart; Line Plot
  • ROC Curve: "Positive class = Undecided"]
