04. Data Mining - solution

Activity I: Decision Trees
- Partition the fully joined data into a training set and a test set (50%, stratified sampling on Target)
- Train a Decision Tree on the training set to predict Target
- Use the trained model to predict Target in the test set
- Evaluate the accuracy of the model with the Scorer node
- What is the overall accuracy of your model?
- Optional: evaluate the accuracy and robustness of the model with the ROC Curve node

Activity II: Linear Regression
- Read weather.table data
- Split the data into rows up to 2016 (training set) and rows from 2017 on (test set)
- Train a linear regression model that predicts AIR_TEMP as a function of all other features in the dataset
- Use the model to predict the temperature in 2017 and evaluate the model with the Numeric Scorer node
- Optional:
  1. Calculate the mean temperature per month in the training data
  2. Join the mean temperature per month to the test set
  3. Use the Numeric Scorer to see if the average monthly temperature provides a better prediction than the Linear Regression model

Activity III: k-Means
- Read location_data.table data
- Filter the data to entries from California (region_code = CA)
- Perform k-means clustering with k=3. Use only latitude and longitude for clustering.
- Optional: plot latitude and longitude in a view (OSM Map or Scatter Plot) and use the view to visually optimize k

Workflow nodes: Table Reader, Partitioning, Decision Tree Learner, Decision Tree Predictor, Scorer, ROC Curve, Row Splitter, Linear Regression Learner, Regression Predictor, Numeric Scorer, GroupBy, Column Rename, Joiner, Column Appender, Row Filter, k-Means, Color Manager, OSM Map View, Scatter Plot
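For readers who want to see the partitioning and scoring steps of Activity I outside of KNIME, the following is a minimal Python sketch. It is not how the Partitioning or Scorer nodes are implemented; the function names `stratified_split` and `accuracy` and the toy rows are assumptions made for illustration. The Decision Tree itself is left out.

```python
import random
from collections import defaultdict


def stratified_split(rows, target_index, fraction=0.5, seed=0):
    """Split rows into (train, test), keeping the class proportions of the target."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_index]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is left untouched
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def accuracy(actual, predicted):
    """Share of rows where the predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Toy data (hypothetical): (feature, Target) with 8 rows of class "A" and 4 of class "B".
rows = [(i, "A") for i in range(8)] + [(i, "B") for i in range(4)]
train, test = stratified_split(rows, target_index=1)
```

Because the split is made per class, both partitions keep the 2:1 ratio of "A" to "B" that the full toy table has.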
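The optional part of Activity II compares the regression model against a simple baseline. A minimal Python sketch of that baseline could look as follows. It assumes each row has been reduced to a (month, AIR_TEMP) pair; the helper names `monthly_means` and `numeric_scores` and the toy values are invented for illustration. The statistics computed are of the kind a numeric scorer typically reports (mean absolute error, root mean squared error, R^2).

```python
import math
from collections import defaultdict


def monthly_means(train_rows):
    """Mean AIR_TEMP per month, computed on the training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for month, temp in train_rows:
        sums[month] += temp
        counts[month] += 1
    return {m: sums[m] / counts[m] for m in sums}


def numeric_scores(actual, predicted):
    """Mean absolute error, root mean squared error and R^2."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy (month, AIR_TEMP) rows, invented for illustration only.
train = [(1, 4.0), (1, 6.0), (7, 24.0), (7, 26.0)]
test = [(1, 6.0), (7, 24.0)]

means = monthly_means(train)
# "Join" step: look up the training mean for each test row's month.
baseline_pred = [means[month] for month, _ in test]
scores = numeric_scores([temp for _, temp in test], baseline_pred)
```

Running the same `numeric_scores` function on the regression model's predictions gives directly comparable numbers: the approach with the lower error and the higher R^2 is the better predictor on the 2017 rows.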
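The filtering and clustering steps of Activity III can be sketched in plain Python as well. This is an illustrative version of Lloyd's algorithm, not the k-Means node's implementation: the initialization (first k points), the fixed iteration count, and the toy coordinates are all assumptions. It uses Euclidean distance on raw latitude and longitude, which ignores the curvature of the Earth but is adequate for a small region.

```python
import math


def kmeans(points, k=3, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centers."""
    centers = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


# Toy rows (hypothetical values), mimicking the columns named in the activity.
rows = [
    {"region_code": "CA", "latitude": 37.7, "longitude": -122.4},
    {"region_code": "CA", "latitude": 34.0, "longitude": -118.2},
    {"region_code": "CA", "latitude": 32.7, "longitude": -117.2},
    {"region_code": "CA", "latitude": 37.8, "longitude": -122.3},
    {"region_code": "CA", "latitude": 34.1, "longitude": -118.3},
    {"region_code": "CA", "latitude": 32.8, "longitude": -117.1},
    {"region_code": "NY", "latitude": 40.7, "longitude": -74.0},
]

# Row filter: keep only entries from California.
california = [r for r in rows if r["region_code"] == "CA"]
# Cluster on latitude and longitude only.
points = [(r["latitude"], r["longitude"]) for r in california]
centers, assignment = kmeans(points, k=3)
```

Plotting `points` colored by `assignment` gives the same kind of visual check as the Scatter Plot or OSM Map View: if one cluster spans two obviously separate groups, a larger k may fit better.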
