
Exam_20240122

MNIST CLASSIFICATION

1. Read (separately) the mnist_train_0.1.csv and mnist_test_0.5.csv files, which have neither column headers nor row IDs.
2. For both of them, rename the column “Column0” to “Label” and set its data type to StringValue.
3. Normalize the data from mnist_train_0.1.csv between 0.0 and 1.0.
4. Apply the same normalization model to the test subset (the data from mnist_test_0.5.csv).
5. Split the data from mnist_train_0.1.csv into training and validation subsets with an 80:20 ratio, stratified sampling based on the “Label” column, and seed equal to 42.
6. Generate 5 random numbers (to be used as seed values for the classification model definition) with these specifications:
   • the 5 numbers must be unique integer values between 0 and 100000
   • the output of the generator must be a column named “rnd_seed”
   • as seed for the random generation, use 0
7. Define a Group Loop based on the “rnd_seed” column to repeatedly train a classification model, taking into account the following:
   a) each random value in the “rnd_seed” column must be passed as a Flow Variable to the learner
   b) each Flow Variable must be created such that it is of type String and named “Seed”
   c) when creating a variable, an input port to get values is not given, so the Flow Variable Ports must be used for this purpose
8. Train a Random Forest model with default settings, the “Label” column as Target Column, and the seed value set through the variable “Seed”.
9. Use the trained model to perform classification on the validation and the test subsets (separately).
10. For both validation and test (separately), process the results so that the Accuracy Statistics are reduced to a table with only the “Overall” row and two columns:
   • “Accuracy” (with the classification result)
   • “Seed” (with the corresponding value of the “Seed” variable used for classification)
11. Before the end of the loop, concatenate the tables so that:
   • test results follow validation results
   • the “_test” suffix is used to label the test results in case of duplicated rows
12. End the loop (using default settings except for the addition of the iteration column).
13. Write the collected results into a .csv file using these settings:
   • name the file Exam_20240122_YOUR-STUDENT-ID-NUMBER_results.csv
   • save the file in the same folder as the workflow
   • append data if the file already exists
   • write column header unless the file exists
   • write row ID

QUESTION 1: What is the highest validation accuracy? What is the corresponding seed value?
QUESTION 2: What is the highest test accuracy? What is the corresponding seed value?

NODES TO BE USED

The connections between these nodes can be used as they are here:

   • CSV Reader
   • CSV Reader
   • Random Numbers Generator
   • Group Loop Start
   • Variable Creator
   • Column Rename
   • Column Rename
   • Normalizer
   • Normalizer (Apply)
   • Partitioning
   • Random Forest Learner
   • Random Forest Predictor
   • Random Forest Predictor
   • Scorer
   • Scorer
   • Column Filter
   • Column Filter
   • Row Filter
   • Row Filter
   • Column Expressions
   • Column Expressions
   • Concatenate
   • Loop End
   • CSV Writer
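
For readers who want to check the logic of steps 6 and 13 outside KNIME, the following is a minimal Python sketch, not part of the exam workflow. The function names make_seeds and append_results are hypothetical and chosen only for illustration. Python's generator is not the one used by the Random Numbers Generator node, so the five values it produces with seed 0 will differ from those obtained in KNIME. The sketch also does not write a row ID column.

```python
import csv
import os
import random


def make_seeds(n=5, low=0, high=100000, generator_seed=0):
    """Return n unique integers in [low, high], reproducible for a given seed."""
    rng = random.Random(generator_seed)
    # sample() draws without replacement, so the values cannot repeat
    return rng.sample(range(low, high + 1), n)


def append_results(path, rows, fieldnames):
    """Append rows to a CSV file, writing the header only if the file is new."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Step 6: five unique seed values in a column named "rnd_seed"
seeds = make_seeds()
rnd_seed_column = [{"rnd_seed": value} for value in seeds]

# Step 7 b): inside the loop, each value becomes a String variable named "Seed"
seed_variables = [str(value) for value in seeds]
```

Calling append_results twice on the same path reproduces the "append data if the file already exists" and "write column header unless the file exists" settings: the header appears once, followed by the rows from both calls.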
