Icon

Prediction within Spreadsheets

<p><strong>Prediction within Spreadsheets</strong></p><p>This workflow shows how to load and prepare data for training a predictive model using data from some Excel files.</p><p>In this example, we access various data about Olympics Games results and information about the athletes. We use this information to train a Gradient Boosted Trees model to predict, whether an athlete will win "No medal", "Bronze/Silver", or "Gold" medal.</p><p>💡 <em>To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.</em></p><p>Let's walk through the different nodes involved in this operation:</p><p><strong>Excel Reader nodes:</strong></p><ul><li><p>Since the folder with the data is already included when you download the workflow, in the "File and Sheet" tab, we choose to "Read from" the "Current workflow data area" and select the dataset.</p></li><li><p>In the "Data Area" tab, we select to read the "Whole sheet". This configuration allows us to have the sheet as it is. The intent is to respect its original structure.</p></li></ul><p><strong>Column Renamer node:</strong></p><ul><li><p>In the Summer Olympic Games information sheet, we rename the column "country_noc" to "country_edition".</p></li></ul><p><strong>Value Lookup nodes:</strong></p><ul><li><p>With the first Value Lookup node, we use "athlete_id" as lookup and key column and add the gender and date of birth of the athletes to the athlete event results.</p></li><li><p>In the second Value Lookup node, we use "edition_id" as lookup and key column and add the country and start date of each event to the athlete event results.</p></li></ul><p><strong>Missing Value node:</strong></p><p>After a Value Lookup operation, it is very likely that some missing rows might have originated when a correspondent key in the lookup table was not found. It's important to check for missing values as the lack of information could create issues during the model training later on.</p><ul><li><p>We exclude rows that have missing values in the columns "country_edition", "sex", "born", or "start_date"as not found.</p></li></ul><p><strong>Expression + Bar Chart nodes:</strong></p><ul><li><p><strong>Expression node:</strong> We define one expression to bin the target variable ("medal") into three categories: "No medal", "Bronze/Silver", and "Gold".</p></li><li><p><strong>Bar Chart node: </strong>We plot the distribution of the newly created target variable ("Medal - Target Variable") in a Bar Chart.</p></li></ul><p><strong>Create additional features metanode:</strong></p><p>This step is also referred to as Feature Engineering. We are creating two additional features:</p><ul><li><p><strong>Date&amp;Time Difference node:</strong> We calculate the difference (in years) between the athlete's date of birth and the start date of the competition ("Athlete Age").</p></li><li><p><strong>Expression node: </strong>We implement a logic to indicate whether the competition took place in the athlete's home country ("athlete_at_home").</p></li></ul><p><strong>Prepare data for model training metanode</strong></p><ul><li><p><strong>Column Resorter node: </strong>We resort the table so that the important columns are moved to the beginning of the table.</p></li><li><p><strong>Column Renamer node: </strong>We rename the column "country_noc" to "country_athlete" since we have two country columns in the dataset, one for the athlete and one for the game.</p></li><li><p><strong>Table Cropper node:</strong> We crop the table to keep only the columns that are needed in the model training (column range: "Medal - Target Variable" to "athlete_at_home").</p></li><li><p><strong>Domain Calculator node: </strong>We update all variables domains, a necessary operation to allow the Gradient Boosted Trees algorithm to do its job.</p></li></ul><p><strong>Table Partitioner node:</strong></p><p>To train and test the model appropriately, we need a training and testing set.</p><ul><li><p>Partition type: Relative - 80%/20% split</p></li><li><p>Sampling strategy: Stratified sampling on target column</p></li></ul><p><strong>Gradient Boosted Trees Learner + Gradient Boosted Trees Predictor nodes:</strong></p><ul><li><p><strong>Gradient Boosted Trees Learner node:</strong> We use the training set (top port of the Partitioning node) to train the model with the following configuration:</p><ul><li><p>Target column: Medal - Target Variable</p></li><li><p>We include all other columns in the training (we did the filtering already)</p></li><li><p>Tree depth: 3</p></li><li><p>Number of models: 400</p></li><li><p>Learning rate: 0.1 (default value)</p></li></ul></li><li><p><strong>Gradient Boosted Trees Predictor node: </strong>We connect the test set (bottom port of the Partitioning node) and the trained model (output of Learner node) to the Predictor node. Make sure to "append individual class probabilities".</p></li></ul><p><strong>Scorer + ROC Curve nodes:</strong></p><p>To evaluate the model's accuracy.</p><ul><li><p><strong>Scorer node:</strong> Provides a static performance evaluation based on the actual value vs. prediction.</p><ul><li><p>First column: "Medal - Target Variable"</p></li><li><p>Second column: "Prediction (Medal - Target Variable)"</p></li></ul></li><li><p><strong>ROC Curve node:</strong> Gives an objective model evaluation not affected by the subjective choice of choosing a cut-off value to the probabilities to assign the prediction.</p><ul><li><p>Target column: Medal - Target Variable</p></li><li><p>Positive class value: Gold</p></li><li><p>In the Include/Exclude panel only include P (Medal - Target Variable=Gold)</p></li></ul></li></ul>

Data access

  1. Athletes' bio information

  2. Athlete event results data of Summer Olympic Games

  3. Summer Olympic Games information

Preprocessing data

Prepare data for training:

  • Merge the datasets

  • Handle missing values

  • Create target variable and two additional features (Feature Engineering)

  • Cleaning the data

  • Partition into training and test set

Feature Enginerring

Create two additional features:

  • The athlete's age when participated in Olympic Games edition

  • Whether the game took place in the athlete's home country

Training Gradient Boosted Tree

Applying trained models

Evaluating the trained models

Medal vs. predicted medal

Visualization

ROC curves and decision tree

Visualizing plot distribution

Prediction with Spreadsheets


This workflow demonstrates how to train and evaluate a predictive model using spreadsheet data.

In this example, we access three datasets:

  1. The bio information of the athletes who participated in any of the Summer or Winter Olympics.

  2. The athlete event results of the 1896-2020 Summer Olympic Games.

  3. Additional information about the Summer Olympic Games.

The goal is to train a Gradient Boosted Trees to predict whether an athlete will win a medal based on features such as sport, home country of the athlete, gender, age of the athlete, and whether the game takes place in the athlete's home country. We start by preparing the data for model training, including merging the data, removing missing values, creating the target variable ("Medal - Target Variable") along with two additional features, and filtering the data to include only specific columns relevant for model training. We split the dataset into a training and a test set and use it to train a Gradient Boosted Trees model. Lastly, we evaluate the model's performance using a confusion matrix and plotting a ROC curve.

For a detailed overview of each node in this workflow, refer to the workflow description in the Info panel.


💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.

Read bio informationof athletes
Excel Reader
Read athlete event results
Excel Reader
Apply trained modelto test data
Gradient Boosted Trees Predictor
Read Summer OlympicGames information
Excel Reader
Add gender anddate of birth toathlete event results
Value Lookup
Rename column"country_noc"
Column Renamer
Add country andstart date toathlete event results
Value Lookup
Exclude rows withmissing values
Missing Value
Prepare data for model training
Plot ROC Curve
ROC Curve
Plot distributionof target variable
Bar Chart
Confusion matrix
Scorer
Create additional features
Train and test
Table Partitioner
Train model to predict whether athlete will win a medal
Gradient Boosted Trees Learner
Create target variable
Expression

Nodes

Extensions

Links