Icon

KNIME_​Movie_​Final_​Project

Import

EDA

Data Cleaning

Justification

-EDA revealed missing values in columns such as homepage, tagline, keywords, production_companies and cast. Possible measures include deleting columns irrelevant to future analysis and applying 'unknown' to ensure all rows are categorized.

-The distribution analysis showed that budget and revenue contain a large number of zero values. Because this data is important for analysis, we need to consider removing data with a value of 0 during the data cleaning phase.

-The date column is being recognized as a string, which indicates a formatting issue. It needs to be converted to the correct date format.

-Box plot analysis confirms extreme outliers in revenue and vote_count, which need to be needled in data cleaning phase.

-The heatmap and linear correlation analysis suggest a positive relationship between revenue and vote_count, popularity and vote_count indicating that financially successful movies tend to receive higher audience engagement. This justified the use of clustering to segment performance levels.

-There are inconsistencyin in Text Columns. Columns affected are dirctors, genres, production_companies, cast, and keyword. They contain multiple values separated by "|". To facilitate future AI analysis, we will output all the information in each column into a separate table.

Data Preparation and Transformation

Data Preparation and Transformation

Data Export

Missing values: ​​in the text columns (genres, production_companies, cast, director) replace with the "unknown" so all rows have the category.
Missing Value
Missing values: Delete films with 0 runtime, considered as incomplete recordings.
Row Filter
Format: Converting release_date to Date format in a new column
String to Date&Time
Missing Values: because 0 represent missing financial information, so we Remove rows where both budget AND revenue are missing.
Rule-based Row Filter
Spelling: Standardize the text columns (title, genres, cast, director, production_companies), convert them to lowercase, and remove extra spaces.
String Manipulation (Multi Column)
Apply K-Means clustering (k=3) to segment movies based on vote and popularity.
k-Means
Standardize popularity and vote_count using Z-score normalization.
Normalizer
Assign meaningful labels to each cluster based on average better interpretability.
Rule Engine
Standardize revenue and vote_count using Z-score normalization.
Normalizer
Apply color coding to performance segments to enhance visualization and pattern recognition in plots.
Color Manager
Select relevant variables (id, revenue, vote_count) for clustering analysis while excluding unrelated columns.
Column Filter
Apply color coding to performance segments to enhance visualization and pattern recognition in plots.
Color Manager
Apply K-Means clustering (k=3) to segment movies based on financial performance and audience engagement.
k-Means
Assign meaningful labels to each cluster based on average better interpretability.
Rule Engine
Merge cluster assignments back into the full dataset using the movie ID to preserve all original attributes.
Joiner
Import dataset tmdb_movies_data.csv
CSV Reader
Merge cluster assignments back into the full dataset using the movie ID to preserve all original attributes.
Joiner
Export as movie2.csv file for the future use.
CSV Writer
Examining the dataset to understand the data types, detect missing values, and count the number of unique values in each column
Statistics View
Export as movie1.csv file for the future use.
CSV Writer
Explore the relationship between two numerical variables (Identify Correlation, Detect Outliers, Examine Distribution Patterns, Check Multicollinearity, etc.)
Scatter Plot
Export as movie4.csv file for the future use.
CSV Writer
View the distribution of data represented by each column
Histogram
Export as movie3.csv file for the future use.
CSV Writer
View the correlation between data
Heatmap
Export as movie6.csv file for the future use.
CSV Writer
Check the Outliers in columns
Box Plot
Export as movie5.csv file for the future use.
CSV Writer
Remove duplicated rows
Column Filter
Selecting the id and production_companies_transformed columns to prepare a table dedicated to the production companies associated with each film
Column Filter
Remove rowswith outliers
Numeric Outliers
The list of production companies (separated by “|”) is divided into a collection, to separate each production company in list.
Cell Splitter
Rename for the clearer view.
Column Renamer
Each collection is transformed to a pair id and genre.
Ungroup
Transformation of each collection into several lines, one per pair (id and production company), in order to obtain a standardized film-company table.
Ungroup
Rename for the clearer view.
Column Renamer
Selection of the id and genres_transformed columns in order to isolate only the information necessary to create a relational film-genre table.
Column Filter
Remove duplicated rows
Column Filter
Splitting the genres_transformed column into a collection.
Cell Splitter
Splitting the list of directors.
Cell Splitter
Selecting the id and director_transformed columns to prepare a table dedicated to directors for each film.
Column Filter
Examine the database contents to prepare for EDA.
Table View
Rename for the clearer view.
Column Renamer
Removal of columns not needed for a future AI project (imdb_id, homepage, tagline, keywords and overview), their content is difficult to analysis, have too many missing value, or they carry information challenging to verify.
Column Filter
Pair each row (id, director) in order to obtain a normalized film-director table.
Ungroup
Splitting the list of actors/actresses.
Cell Splitter
Selecting the id and cast_transformed columns to create a film-actor relational table
Column Filter
Rename for the clearer view.
Column Renamer
Each actor collection is transformed into several rows pair each movie.
Ungroup
Duplicates: Remove Deduplication based on the ID and original_title
Duplicate Row Filter
Select relevant variables (id, popularity, vote_count) for clustering analysis while excluding unrelated columns.
Column Filter
For better check correlations between each two values
Linear Correlation

Nodes

Extensions

Links