
Assignment 2, Group 18

Data preparation

  1. Remove all rows that have missing values

    • We do not try to "guess" (impute) these data points, since removing the rows with missing data has a negligible effect on the dataset (only 4 rows in total).

  2. Remove the Observation column, as it is only used as an index in the CSV file. After removing rows with missing values it also contains gaps, which the model might pick up on, causing it to "overfit" the training data.

  3. Partition the dataset into training and testing data. This is crucial, as the model needs to be tested on data it hasn't seen before. After testing a couple of different ratios we landed on 70/30, where 70% is the training data and 30% is the testing data. This amounts to 6,297 rows for training and 2,699 rows for testing.
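The three preparation steps above can be sketched in plain Python. This is a minimal illustration of the same logic, not the KNIME implementation; the function name is ours, while the Observation column name and the 70/30 ratio come from the text above.

```python
import random

def prepare(rows, index_col="Observation", train_frac=0.7, seed=0):
    """Drop incomplete rows, drop the index column, split into train/test."""
    # 1. remove every row that has at least one missing value
    complete = [dict(r) for r in rows
                if all(v not in ("", None) for v in r.values())]
    # 2. drop the index column so the model never trains on it
    for r in complete:
        r.pop(index_col, None)
    # 3. shuffle, then partition 70/30 into training and test data
    random.Random(seed).shuffle(complete)
    cut = round(len(complete) * train_frac)
    return complete[:cut], complete[cut:]
```

The `rows` argument is a list of dicts, e.g. as produced by `csv.DictReader`; fixing the shuffle seed makes the partition reproducible between runs.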

Unsuited models

Candidate models comparison

  1. Use the training data as input to the learner node (the machine learning algorithm).

    • Pick the target column, in our case Vote. This is what the learner will try to predict later.

  2. Plug the trained model into a predictor node together with, importantly, the test data, which the learner node has never seen.

  3. Plug the predictor into a scorer node to compare the two columns: Vote and Predicted Vote.

    • This shows the confusion matrix as well as accuracy statistics, which we use to compare the models.

      • The statistics used are the overall accuracy and Cohen's kappa, which accounts for the possibility of predicting the correct value by chance.

  4. Alternatively, use the ROC Curve node to show how much better the model is at predicting the vote than random guessing.
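Step 3's scoring boils down to two numbers per model. The following sketch shows how the Scorer's overall accuracy and Cohen's kappa are computed (the function name is ours, not a KNIME API):

```python
from collections import Counter

def confusion_stats(actual, predicted):
    """Overall accuracy and Cohen's kappa from two label columns."""
    n = len(actual)
    # observed agreement: fraction of rows where Vote == Predicted Vote
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    # expected agreement by chance, from the confusion-matrix marginals
    act, pred = Counter(actual), Counter(predicted)
    p_e = sum(act[c] * pred.get(c, 0) for c in act) / n ** 2
    # kappa rescales accuracy so 0 = chance-level, 1 = perfect agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa
```

Because kappa subtracts the chance-agreement term, a model that always predicts the majority class scores near 0 even when its raw accuracy looks high.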

Based on the overall accuracy statistics of the models, we can conclude that the Gradient Boosted Trees model performs best.
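The ROC comparison from step 4 has a similarly simple interpretation: the area under the curve equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random guessing. A minimal sketch of that rank statistic (function name and toy data are ours):

```python
def roc_auc(labels, scores, positive="Undecided"):
    """AUC = P(random positive outranks random negative); 0.5 = chance."""
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    # count pairwise wins, giving half credit for ties
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model whose curve hugs the diagonal in the ROC view has an AUC near 0.5 and is no better than guessing the positive class at random.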

Now, to test the model on a single data point, we simply swap the test data (previously 2,699 rows) for a single row representing the 50-year-old man. The model predicts the person is Undecided with a confidence of 99.8% (0.998).

The most correct way to do this would be to connect the previously trained model to the new test data, because swapping the input essentially creates a new model. However, we tested both setups and they yield the same result, so for the clarity of the workflow we prefer to present it this way.

We could also use the entire dataset instead of the partitioned one, since we no longer need separate training data. However, this actually made the model slightly less certain: the prediction confidence dropped from 0.998 (partitioned) to 0.997 (whole dataset).

Workflow node labels (recovered from the KNIME diagram)

  • Preprocessing: CSV Reader → Missing Value ("removes all rows that have at least 1 missing value"; alternatively fill in missing values) → Column Filter ("remove Observation column as it is just an index; ensures model is just trained on relevant data") → Table Partitioner ("partition into training data (70%) and test data (30%)").

  • Learner/Predictor/Scorer chains for Gradient Boosted Trees, Tree Ensemble, Random Forest, Decision Tree, Logistic Regression, Naive Bayes, and a SOTA model; the Scorer values shown are 0.984, 0.984, 0.983, 0.978, 0.734, 0.724, and 0.708.

  • ROC Curve node with positive class = Undecided. One annotation reads "Denne krever et annet format" (Norwegian: "this one requires a different format").

  • Comparison table built with Row Filter, Column Filter ("filter out all but Overall Accuracy and Cohen's kappa"), Constant Value Column (deprecated; "adds model name column"), Concatenate ("concatenate tables into one for comparison"), Number Rounder ("rounds to 3 decimal points"), and Column Resorter ("puts column name first"); visualized with a Bar Chart and a Line Plot.
