Icon

Workflow Final Challenge

The Impact of Movie Posters and Data on Boxoffice Revenues

This workflow aims at performing inference on boxoffice revenues (adjusted for dollar inflation with CPI) by combining information from structured data about movies from IMDb and scraped posters retrieved through TMDB and its API, complying with the related Terms of Service.

Text mining on movie titles is performed with the lexicon approach, while image feature analysis is performed in two steps, face detection through a pre trained convolutional neural network (MTCNN) and image feature extraction (labels and colors) through Google Cloud Vision API.

Portability is ensured through Knime URL protocol and Conda Environment Propagation.

To make the workflow run, all the data and all the workflows present in its Knime Hub directory need to be present in the working directory (please check their description in order to check their requirements and in order to configure them in the right way). The posters folder is just a placeholder, every poster needs to be scraped again, and it will be saved there.

Additionally, the main dataset needs to be downloaded (and put in the working directory) from the given link in the external resources section of this page. Please remove the space in the name of the downloaded .csv file before execution.

CORRELATION MATRIX AND SCATTERPLOTIn order to have a first insight about the possible associations (positive or negative)between the numerical variables that we are going to consider in the model, we canbuild the correlation matrix. ATTENTION: always remember that here we can only talk about associations andnot about causation. We have also provided a scatterplot to visualize the datapoints in a 2 dimensionalspace. CORRELATION MATRIX AND SCATTERPLOTIn this section we use a Z-Score Normalization to normalize some of the numerical features present in the dataset. This operation is preliminar to the construction of the correlation matrix, in which we can see the correlation between the pairs ofnumerical variables, bringing useful insights about the association between them. The association, which does not implycausation, can be high/low but also positive and negative. Then, we also decided to use a scatterplot to visualize the customizable pairs of variables. The results obtained can be linked andsupport even further the results of the correlation matirix. WORLDWIDE INCOME ADJUSTED FOR INFLATIONAs anticipated before, we need to adjust the value of the worldwide income according toinflation, using the CPI index. First, to add the CPI value of the corresponding year to the Movies Datset, we perform aleft outer join based on the year. Then, we create a new variable (worldwide income adjusted) applying a very simple mathformula to adjust the value for inflation. The formula considered is the following: theincome value is multiplied by the CPI of 2020 (last year considered in dataset) and thendivided by the CPI of the year of movie distribution. DOUBLE FILTER ON GENREWe have already anticipated our choice of performing a double filtering on the movies genres. This will allow us toincrease the interpretability of the results that we will obtain in the models constructed to test our hypothesis. Thetwo filters are the following: FILTER 1: keep only movies having at least one of the top 5 genres. FILTER 2: from those remained, filter out all movies having more than one top5-genre. LANGUAGE(S) PRE-VISUALIZATION OPERATIONSIn the Data Visualization part we have decided to display also some plots for the languages of the movies. For this reason, we nowneed to perform some pre-visualizations operations. First, we notice that each movie can have more than one language and we group the dataset according to the language(s)counting the rows/movies for each of them. Then, we sort by the number of movies that each language(s) has and we keep only those language(s) having more than 600movies. In fact, we have decided to visualize only the top languages, meaning the most frequent ones. ATTENTION The node Continent Variable is the first time an Rnode is executed in the workflow, and all is ready tobe deployed with the propagation node. Unfortunately, the interactions library (used later inthe analysis) is not available in conda-forge, the mainconda repository, and it needs to be installed from Rconsole (AKA from CRAN repository).Thus, please enter the following commands in yourterminal:- conda activate r-environment- R- install.packages('interactions')- 51- quit()- conda deactivate r-environment POSTER AND ENGLISH TITLE RETRIEVAL THROUGH TMDB APIThe workflow called here is very articulated, and uses the TMDb API to retrieve the poster URLs from the website and to get theEnglish titles of the movies (present in TMDb and with poster URL), which we need for text mining, since the original dataset fromKaggle only has the original ones and the Italian ones. Once we have the URLs of the posters, the workflow performs imagescraping and downloads them all from the website, while saving them in a local folder. The output of the workflow is the joinbetween a table with the rows with NA in worldwide gross income filtered and a table with the title ID of each movies and its Englishtitle. The NAs for worldwide gross income are filtered because our inference is on this variable, as the dependent one. All this operations are possible because the TMDb API can use the IMDb ID (present in the original Kaggle dataset) as identifier fora movie. The complete poster scraping process is obviously not activated by default, since it would take days to run, and it needs to beactivated from inside the called workflow. Dependent Variable: Worldwide gross income adjustedIndependent Variable: movie poster's features and title featuresModerator Variables: Genres of the movieControl Variables: Average vote, number of votes, reviews from critics and usersInteractions: genre with each of the features in the independent variable Model 1: Main effect + ControlsModel 2: Main effect + Moderator Variables + ControlsModel 3: Main effect + Interactions + Controls To visualize the interactions access to themetanode is needed EXPLANATORY ANALYTICS: MODELS AND INTERACTIONSFInally, in this last section we test our hypothesis, referring both to the main effect and to the interaction effects. We are using an R Snipper node to include the R code to perform first a NegativeBinomial Regression and then a Linear Regression, that will reveal us if the variables considered are significant or not in predicting the Worldwide Gross Income (adjusted) value. Note that allnumerical variables are normalized (except for the dependent one). For this last part, we have also included the Interactions Plot, to visualize graphically the results of the moderator effect. LABELS: Hat, Plant, Gesture,Cloud, SkyGENRES: Comedy, Action, Crime,Drama, Romance OVERDISPERSION CHECK ATTENTION!If you use the workflow with other sample data you may probably have to check theR Snippet node. The labels identified may be different considering different images and, as aconsequence, the top 5 may vary. For this reason, check that the label names in the R Snippet node are the one thatyou obtained! MULTICOLLINEARITY CHECK: remove votes and reviews FREQUENCY TABLES FOR CATEGORICALVARIABLES LEXICON TEXT MINING AND DATA RETRIEVALIn this workflow we have performed a lexicon-based text mining on the movies’ titles using thethree features present in the VAD dictionary:Valence, Arousal and Dominance. After adding thetitle length as a new variable, we started the textmining process tagging the word with POS taggerand using the VAD dictionary. Then, in the pre-processing part – as we are not considering largeportion of text but just the movies’ titles, we justerased punctuation, filtered numbers, convertedeverything to lower case and filtered out all titleswithout any tag (for example, titles with only propernames). After creating the Bag of words (andadded a Term column to keep each term withoutthe tag attached), we used a joiner to add to eachterm tagged its value of Valence, Arousal andDominance through the VAD dictionary. Eventually,through a GroupBy node, we grouped the term bythe movies’ titles and keep the main data about thefilm and the value of Valence, Arousal andDominance of the title. DEPLOY OF FACE DETECTION MTCNN AND GOOGLECLOUD VISION APIThis workflow is used to count the number of faces in themovies’ posters through a pre-trained CNN, and to performcolor analysis and label detection (on the aforementionedimages) via the Google Cloud Vision API. From each poster,the dominant color is extracted, while the most popular andsignificant labels among the posters are selected andencoded with dummy variables for each observation. CPI(1947-2020) left outer joinbased on year(CPI added tomovies dataset)worldwide grossincome adjustedsplit genre column(movies may have more than 1 genre)We read the main datasetgroup by language(s)sort by # of movies in thatlanguage(s)keep only languageswith # movies > 600normalize some numerical variablescorrelation matrix fornumerical variablesscatter matrix customizablenormalize numericalvariables FILTER 1keep only movieswith at least one of the top5 genresFILTER 2keep only movieswith only 1 of top genresWe cast yearto number typeeliminate unneccessaryvariablesModel 2Main effect + Genres + ControlsModel 1Main effect + ControlsWe call TMDb API Posterand Data Retrieval and getits output tableupdateWe save the table to make the called workflow run by itselfWe call the workflow Lexicon Text Mining onTitles with VAD and get its outputWe save the table to make thecalled workflow run by itselfoverdispersionWe call the workflowDeploy of Face Detection MTCNNand Google Cloud Vision API and get its outputVariance InflationFactor (VIF) CSV Reader Joiner Math Formula Cell Splitter CSV Reader Binary Genre GroupBy Sorter Row Filter Normalizer Linear Correlation Scatter Matrix(local) Data Cleaning Movie Data Cleaning CPI Continent Variable Occurrencesof Genre Data Visualization Linear Correlation Normalizer Rule-basedRow Filter Processing:Top Genres Rule-basedRow Filter Column Aggregator Scatter Matrix(local) String To Number Column Filter R Snippet R Snippet Model 3:Interactions Call Workflow(Table Based) Linear Correlation Table Writer Call Workflow(Table Based) Table Writer R Snippet Frequency Column Filter Call Workflow(Table Based) CORRELATION MATRIX AND SCATTERPLOTIn order to have a first insight about the possible associations (positive or negative)between the numerical variables that we are going to consider in the model, we canbuild the correlation matrix. ATTENTION: always remember that here we can only talk about associations andnot about causation. We have also provided a scatterplot to visualize the datapoints in a 2 dimensionalspace. CORRELATION MATRIX AND SCATTERPLOTIn this section we use a Z-Score Normalization to normalize some of the numerical features present in the dataset. This operation is preliminar to the construction of the correlation matrix, in which we can see the correlation between the pairs ofnumerical variables, bringing useful insights about the association between them. The association, which does not implycausation, can be high/low but also positive and negative. Then, we also decided to use a scatterplot to visualize the customizable pairs of variables. The results obtained can be linked andsupport even further the results of the correlation matirix. WORLDWIDE INCOME ADJUSTED FOR INFLATIONAs anticipated before, we need to adjust the value of the worldwide income according toinflation, using the CPI index. First, to add the CPI value of the corresponding year to the Movies Datset, we perform aleft outer join based on the year. Then, we create a new variable (worldwide income adjusted) applying a very simple mathformula to adjust the value for inflation. The formula considered is the following: theincome value is multiplied by the CPI of 2020 (last year considered in dataset) and thendivided by the CPI of the year of movie distribution. DOUBLE FILTER ON GENREWe have already anticipated our choice of performing a double filtering on the movies genres. This will allow us toincrease the interpretability of the results that we will obtain in the models constructed to test our hypothesis. Thetwo filters are the following: FILTER 1: keep only movies having at least one of the top 5 genres. FILTER 2: from those remained, filter out all movies having more than one top5-genre. LANGUAGE(S) PRE-VISUALIZATION OPERATIONSIn the Data Visualization part we have decided to display also some plots for the languages of the movies. For this reason, we nowneed to perform some pre-visualizations operations. First, we notice that each movie can have more than one language and we group the dataset according to the language(s)counting the rows/movies for each of them. Then, we sort by the number of movies that each language(s) has and we keep only those language(s) having more than 600movies. In fact, we have decided to visualize only the top languages, meaning the most frequent ones. ATTENTION The node Continent Variable is the first time an Rnode is executed in the workflow, and all is ready tobe deployed with the propagation node. Unfortunately, the interactions library (used later inthe analysis) is not available in conda-forge, the mainconda repository, and it needs to be installed from Rconsole (AKA from CRAN repository).Thus, please enter the following commands in yourterminal:- conda activate r-environment- R- install.packages('interactions')- 51- quit()- conda deactivate r-environment POSTER AND ENGLISH TITLE RETRIEVAL THROUGH TMDB APIThe workflow called here is very articulated, and uses the TMDb API to retrieve the poster URLs from the website and to get theEnglish titles of the movies (present in TMDb and with poster URL), which we need for text mining, since the original dataset fromKaggle only has the original ones and the Italian ones. Once we have the URLs of the posters, the workflow performs imagescraping and downloads them all from the website, while saving them in a local folder. The output of the workflow is the joinbetween a table with the rows with NA in worldwide gross income filtered and a table with the title ID of each movies and its Englishtitle. The NAs for worldwide gross income are filtered because our inference is on this variable, as the dependent one. All this operations are possible because the TMDb API can use the IMDb ID (present in the original Kaggle dataset) as identifier fora movie. The complete poster scraping process is obviously not activated by default, since it would take days to run, and it needs to beactivated from inside the called workflow. Dependent Variable: Worldwide gross income adjustedIndependent Variable: movie poster's features and title featuresModerator Variables: Genres of the movieControl Variables: Average vote, number of votes, reviews from critics and usersInteractions: genre with each of the features in the independent variable Model 1: Main effect + ControlsModel 2: Main effect + Moderator Variables + ControlsModel 3: Main effect + Interactions + Controls To visualize the interactions access to themetanode is needed EXPLANATORY ANALYTICS: MODELS AND INTERACTIONSFInally, in this last section we test our hypothesis, referring both to the main effect and to the interaction effects. We are using an R Snipper node to include the R code to perform first a NegativeBinomial Regression and then a Linear Regression, that will reveal us if the variables considered are significant or not in predicting the Worldwide Gross Income (adjusted) value. Note that allnumerical variables are normalized (except for the dependent one). For this last part, we have also included the Interactions Plot, to visualize graphically the results of the moderator effect. LABELS: Hat, Plant, Gesture,Cloud, SkyGENRES: Comedy, Action, Crime,Drama, Romance OVERDISPERSION CHECK ATTENTION!If you use the workflow with other sample data you may probably have to check theR Snippet node. The labels identified may be different considering different images and, as aconsequence, the top 5 may vary. For this reason, check that the label names in the R Snippet node are the one thatyou obtained! MULTICOLLINEARITY CHECK: remove votes and reviews FREQUENCY TABLES FOR CATEGORICALVARIABLES LEXICON TEXT MINING AND DATA RETRIEVALIn this workflow we have performed a lexicon-based text mining on the movies’ titles using thethree features present in the VAD dictionary:Valence, Arousal and Dominance. After adding thetitle length as a new variable, we started the textmining process tagging the word with POS taggerand using the VAD dictionary. Then, in the pre-processing part – as we are not considering largeportion of text but just the movies’ titles, we justerased punctuation, filtered numbers, convertedeverything to lower case and filtered out all titleswithout any tag (for example, titles with only propernames). After creating the Bag of words (andadded a Term column to keep each term withoutthe tag attached), we used a joiner to add to eachterm tagged its value of Valence, Arousal andDominance through the VAD dictionary. Eventually,through a GroupBy node, we grouped the term bythe movies’ titles and keep the main data about thefilm and the value of Valence, Arousal andDominance of the title. DEPLOY OF FACE DETECTION MTCNN AND GOOGLECLOUD VISION APIThis workflow is used to count the number of faces in themovies’ posters through a pre-trained CNN, and to performcolor analysis and label detection (on the aforementionedimages) via the Google Cloud Vision API. From each poster,the dominant color is extracted, while the most popular andsignificant labels among the posters are selected andencoded with dummy variables for each observation. CPI(1947-2020) left outer joinbased on year(CPI added tomovies dataset)worldwide grossincome adjustedsplit genre column(movies may have more than 1 genre)We read the main datasetgroup by language(s)sort by # of movies in thatlanguage(s)keep only languageswith # movies > 600normalize some numerical variablescorrelation matrix fornumerical variablesscatter matrix customizablenormalize numericalvariables FILTER 1keep only movieswith at least one of the top5 genresFILTER 2keep only movieswith only 1 of top genresWe cast yearto number typeeliminate unneccessaryvariablesModel 2Main effect + Genres + ControlsModel 1Main effect + ControlsWe call TMDb API Posterand Data Retrieval and getits output tableupdateWe save the table to make the called workflow run by itselfWe call the workflow Lexicon Text Mining onTitles with VAD and get its outputWe save the table to make thecalled workflow run by itselfoverdispersionWe call the workflowDeploy of Face Detection MTCNNand Google Cloud Vision API and get its outputVariance InflationFactor (VIF) CSV Reader Joiner Math Formula Cell Splitter CSV Reader Binary Genre GroupBy Sorter Row Filter Normalizer Linear Correlation Scatter Matrix(local) Data Cleaning Movie Data Cleaning CPI Continent Variable Occurrencesof Genre Data Visualization Linear Correlation Normalizer Rule-basedRow Filter Processing:Top Genres Rule-basedRow Filter Column Aggregator Scatter Matrix(local) String To Number Column Filter R Snippet R Snippet Model 3:Interactions Call Workflow(Table Based) Linear Correlation Table Writer Call Workflow(Table Based) Table Writer R Snippet Frequency Column Filter Call Workflow(Table Based)

Nodes

Extensions

Links