Final_Project_Keller

Final Project - IT4015 Applied Business Intelligence
Jason Keller
4/19/2022

Proposed work
Using two datasets of wine quality (red & white wines), determine if the alcohol level is a good predictor of quality.

Data Source
https://www.kaggle.com/datasets/vishalkumbhar1997/wine-quality-prediction-with-logistic-regression

Techniques used

Data Cleansing:
* Duplicate Row Filter - removed duplicate rows within the dataset
* Row Filter - removed rows with a missing citric, fixed, or volatile acidity, pH, or alcohol level
* Missing Value - replaced missing values for three features (sulphates, chlorides, and residual sugar) with the column means
* Rule Engine - converted numeric quality to a textual label (quality > 6.5 is "good", otherwise "bad")
* Row Splitter - split the dataset by quality (top: "good", bottom: "bad")

Data Manipulation:
* Concatenate - unioned both datasets (red & white wines) into a single dataset
* GroupBy - grouped the "good" wines by type (red or white) to compute the number of wines, the average (mean) of fixed and volatile acidities, and the average (mean), minimum, and maximum of alcohol level for each type
* Partitioning - splits the data by relative size (%), sampling rows either by stratification or at random

Data Visualizations:
* Bar Chart - grouped averages (pre-grouped) and normal - displays the average (mean) of fixed and volatile acidities and alcohol level for each type (red or white)
* ROC Curve - displays the fixed and volatile acidities and alcohol level as predictors of quality ("good")
** Using a Decision Tree model, the AUC of alcohol is 0.913.
** Using a Random Forest model, the AUC of alcohol is 0.921.
* Scorer (JavaScript) - displays a confusion matrix for the accuracy of predicting quality
** Using a Decision Tree model, the overall accuracy is 84.16%.
** Using a Random Forest model, the overall accuracy is 85.44%.
* Numeric Scorer - displays statistics that describe how much of the variation around the mean the model explains
** Using a Linear Regression model, the alcohol level has an R^2 of 0.844.

Data Science:
* Decision Tree Learner / Decision Tree Predictor - predicts the quality of wine using the Gini Index and the MDL pruning method
* Random Forest Learner / Random Forest Predictor - predicts the quality of wine using the Information Gain Ratio
* Linear Regression Learner / Regression Predictor - predicts the alcohol level given all other features

Outcomes
Based on the AUCs (area under the ROC curve) from the two models, 0.913 and 0.921, alcohol level appears to be a strong predictor of wine quality. An AUC near 0.9 means that "good" wines are ranked above "bad" wines about 90% of the time; it is not a percentage of correct predictions. Using the confusion matrices, the two models are ~84-85% accurate in predicting the overall quality of the wine. As an aside, I used a linear regression model to determine how much of the variance in alcohol level the other features explain. The result was an R^2 of ~0.84, so the linear regression model fits the alcohol level well.

Comments
This project demonstrates my ability to apply the lessons and techniques that I have learned over these past few months. As I use the KNIME tool more, my comfort level increases, and I foresee applying this tool and its techniques within my work processes. The Data Science aspect is extremely beneficial. With regard to this specific dataset, the data is well organized and fairly clean (of good quality). Also of note, the Kaggle website is a useful source for obtaining data to experiment with.
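
The three metrics reported above (AUC, overall accuracy, and R^2) can be illustrated outside KNIME. The following is a minimal Python sketch, not the workflow's actual implementation: the function names, the sample rows, and the 10.75% alcohol cut-off are assumptions made up for illustration, and only the quality > 6.5 rule comes from the report.

```python
def label_quality(score, threshold=6.5):
    """Rule Engine step from the report: quality > 6.5 is "good", else "bad"."""
    return "good" if score > threshold else "bad"


def auc(scores, labels, positive="good"):
    """Rank-based AUC: the share of (positive, negative) pairs in which the
    positive row has the higher score; ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(actual, predicted):
    """Overall accuracy: correct predictions divided by all predictions."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R^2: 1 minus residual sum of squares over total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical rows (alcohol %, numeric quality); NOT taken from the Kaggle data.
rows = [(9.4, 5), (9.8, 5), (10.5, 6), (11.2, 7), (12.8, 8), (10.0, 7), (12.0, 6)]
alcohol = [a for a, _ in rows]
labels = [label_quality(q) for _, q in rows]

# Alcohol level used directly as the score for the "good" class.
alcohol_auc = auc(alcohol, labels)

# Hypothetical one-rule classifier: predict "good" above 10.75% alcohol.
predicted = ["good" if a > 10.75 else "bad" for a in alcohol]
overall_accuracy = accuracy(labels, predicted)
```

The sketch keeps AUC and accuracy as separate functions because they answer different questions: AUC scores how well alcohol ranks the wines across every possible cut-off, while accuracy scores one fixed set of predictions.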
Workflow annotations
* CSV Reader - wine_quality_red.csv
* Excel Reader - wine_quality_white.xlsx
* Data Cleansing - remove duplicates and missing values
* Concatenate - union both red & white datasets
* Row Splitter - split dataset by quality (top: "good", bottom: "bad")
* GroupBy - avg "fixed" acidity, avg "volatile" acidity, and avg, min & max alcohol, grouped by type
* Bar Chart - grouped averages (pre-grouped)
* Bar Chart - grouped averages (normal)
* Partitioning - stratify data on quality, 75%/25%
** Decision Tree Learner - train model to predict quality
** Decision Tree Predictor - apply model to test set
** ROC Curve and Scorer (JavaScript) - scoring metrics
* Partitioning - stratify data on quality, 75%/25%
** Random Forest Learner - train model to predict quality
** Random Forest Predictor - apply model to test set
** ROC Curve and Scorer (JavaScript) - scoring metrics
* Partitioning - randomly draw data, 70%/30%
** Linear Regression Learner - train model to predict alcohol level
** Regression Predictor - apply model to test set
** Numeric Scorer - evaluate linear regression model
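
The stratified 75%/25% partitioning used for the two classification models can be sketched in Python as follows. This is an illustration under assumptions, not KNIME's Partitioning node: the `stratified_split` helper, the rounding rule, the seed, and the sample rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, train_fraction=0.75, seed=0):
    """Split rows so each class keeps roughly the same share in both parts.

    `key` extracts the class label (here: the "good"/"bad" quality label).
    The per-class cut uses round(), which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    train, test = [], []
    for members in groups.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)          # random draw within each class
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical rows: 8 "good" and 32 "bad" wines, not taken from the dataset.
rows = [
    {"alcohol": 9.0 + i * 0.1, "quality": "good" if i % 5 == 0 else "bad"}
    for i in range(40)
]
train, test = stratified_split(rows, key=lambda r: r["quality"])
```

Stratifying on quality matters here because "good" wines are the minority class; a purely random draw could leave the test set with too few of them to score the model reliably.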
