
Final Project - TMDB Dataset

Import the Data

Initial EDA (Exploratory Data Analysis) - Identifies the number of rows and columns.
Data Cleaning
Data Transformation
Data Analysis - Calculates Profit and ROI. Aggregates revenue performance by Genre.

Visualisation - EDA

Splitting compound categorical variables into lists of individual components for analysis.

The "Movies" Table : This metanode extracts and structures the main movies table, ensuring that each row represents a single movie with only one value per column, making it ready for analysis.

The "Genres" Table : Handles the list-valued genres field. Transforms the pipe-separated genres column into a format where each genre has its own row.


PROJECT: TMDB Movies Data Preparation Pipeline
Objective: Prepare raw movie data for AI Revenue Prediction.

tmdb_movies_data
CSV Reader
Identified that 52.4% (5696) of [budget] values are zero and 55.4% (6016) of [revenue] values are zero. Detected missing values in [cast], [tagline], [homepage], [director], [keywords], [overview], [genres], and [production_companies].
Statistics
Some fields contain lists
Table Manipulator
Displaying descriptive statistics for numerical data exploration.
Statistics View
Keeping only the rows that have a value and counting the occurrences of each unique value.
Value Counter
The scatter plot illustrates the relationship between budget and revenue, highlighting the overall trend and the presence of outliers.
Scatter Plot
Displaying only string columns for categorical data exploration.
Table View
Removed duplicates to ensure observation uniqueness.
Duplicate Row Filter
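
The duplicate removal above can be sketched in plain Python. This is only an illustration of the idea, not the node's implementation: it assumes a row counts as a duplicate when every column matches an earlier row, that the first occurrence is kept, and the sample rows are hypothetical.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        # Build a hashable key from all column/value pairs of the row.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample rows (not taken from tmdb_movies_data).
sample = [
    {"id": 1, "original_title": "A"},
    {"id": 1, "original_title": "A"},
    {"id": 2, "original_title": "B"},
]
deduplicated = drop_duplicates(sample)
```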
keywords
Cell Splitter
Top 10 most profitable movies
Bar Chart
Split the pipe-separated genre string into a list collection.
Cell Splitter
Joined genres with decades so that genres can be compared across decades.
Joiner
Top directors by average ROI
Bar Chart
Split the list so each genre gets its own row.
Ungroup
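
As a rough illustration of the split-then-ungroup steps for genres, the sketch below does the same thing in plain Python. The function name `explode_genres` and the sample rows are hypothetical; only the pipe separator and the `id` key come from the workflow notes.

```python
def explode_genres(rows):
    """Return one {"id", "genre"} row per genre of each movie."""
    out = []
    for row in rows:
        # Cell Splitter step: pipe-separated string -> list of genres.
        genre_list = [g.strip() for g in row["genres"].split("|") if g.strip()]
        # Ungroup step: one output row per list element.
        for genre in genre_list:
            out.append({"id": row["id"], "genre": genre})
    return out


# Hypothetical sample rows (not taken from tmdb_movies_data).
movies = [
    {"id": 1, "genres": "Action|Adventure"},
    {"id": 2, "genres": "Drama"},
]
genre_rows = explode_genres(movies)
```

Keeping `id` on every output row is what later allows the genres table to be joined back to the movies table.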
Movies-Profit Table to CSV
CSV Writer
production_companies
Cell Splitter
Created a new column [profit] by subtracting budget from revenue, providing a financial performance measure for each movie.
Math Formula
Aggregation: Computing average performance metrics per genre.
GroupBy
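
A minimal sketch of the per-genre aggregation, assuming the mean is the aggregation used and that missing values are skipped. The function name, column names, and sample values are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def average_by_genre(rows, metric):
    """Average a numeric column per genre, ignoring missing (None) values."""
    groups = defaultdict(list)
    for row in rows:
        if row[metric] is not None:
            groups[row["genre"]].append(row[metric])
    return {genre: mean(values) for genre, values in groups.items()}


# Hypothetical sample rows, one row per (movie, genre) pair.
joined = [
    {"genre": "Action", "profit": 100},
    {"genre": "Action", "profit": 300},
    {"genre": "Drama", "profit": 50},
    {"genre": "Drama", "profit": None},
]
avg_profit = average_by_genre(joined, "profit")
```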
genres table to CSV
CSV Writer
Joined Table Movies with profit and Table Genres
Joiner
Movies-ROI Table to CSV
CSV Writer
Filled missing financial values with the median, which is less sensitive to outliers than the mean, and filled missing categorical values with 'Unknown'.
Missing Value
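
The imputation described above could look roughly like this in plain Python. It is a sketch under assumptions: missing values are represented as `None`, and the helper names and sample values are hypothetical.

```python
from statistics import median


def fill_with_median(values):
    """Replace None with the median of the values that are present."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]


def fill_with_unknown(values):
    """Replace missing categorical values with the label 'Unknown'."""
    return ["Unknown" if v is None else v for v in values]


# Hypothetical sample columns; 1000 plays the role of an outlier.
budgets = [10, None, 30, 1000, None]
genres = ["Drama", None]

filled_budgets = fill_with_median(budgets)
filled_genres = fill_with_unknown(genres)
```

In this sample the median of the present values is 30, whereas the mean would be pulled up to about 347 by the single large value.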
Statistical test: Pearson correlation. Found a correlation of 0.708 between [Budget] and [Revenue], and a correlation of 0.073 between [Budget] and [vote_average].
Linear Correlation
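
For reference, the Pearson coefficient reported above is the covariance of two columns divided by the product of their standard deviations. The sketch below computes it from scratch on made-up numbers; it does not reproduce the 0.708 and 0.073 figures, which depend on the real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Made-up values: a perfectly linear increasing and decreasing relationship.
r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.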
Calculated Return on Investment (ROI) using revenue and budget.
Math Formula
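
A small sketch of the two financial measures. Profit follows the workflow note (revenue minus budget). The ROI formula is not spelled out in the notes, so the common definition (revenue - budget) / budget is assumed here, and returning `None` for a missing or zero budget is also an assumption.

```python
def profit(revenue, budget):
    """Profit as described in the workflow: revenue minus budget."""
    return revenue - budget


def roi(revenue, budget):
    """Assumed ROI definition: (revenue - budget) / budget.

    Returns None when either value is missing or the budget is zero,
    since the ratio is undefined in those cases.
    """
    if revenue is None or budget is None or budget == 0:
        return None
    return (revenue - budget) / budget


# Made-up figures for illustration.
example_profit = profit(300, 100)
example_roi = roi(300, 100)
```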
Converted string dates to Date&Time (yyyy-MM-dd) objects for temporal analysis.
String to Date&Time
Selected the [release_date] column to prepare for decade calculation.
Column Filter
Replaced zero values in the [budget] and [revenue] columns with missing values, writing the results to the new columns [Budget] and [Revenue].
Expression
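
The zero-to-missing step can be illustrated as follows. This is a sketch, assuming missing values are represented as `None` and that the original lowercase columns are left in place next to the new capitalised ones; the sample rows are hypothetical.

```python
def zero_to_missing(rows, columns=("budget", "revenue")):
    """Add capitalised copies of the columns with 0 replaced by None."""
    out = []
    for row in rows:
        new_row = dict(row)  # keep the original columns untouched
        for col in columns:
            # "budget" -> "Budget", "revenue" -> "Revenue"
            new_row[col.capitalize()] = None if row[col] == 0 else row[col]
        out.append(new_row)
    return out


# Hypothetical sample rows.
raw = [
    {"id": 1, "budget": 0, "revenue": 500},
    {"id": 2, "budget": 200, "revenue": 0},
]
cleaned = zero_to_missing(raw)
```

Treating zeros as missing keeps them from being counted as real budgets or revenues in later statistics.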
Modified the table structure by removing the original, untransformed columns and reordering the appended columns.
Table Manipulator
Removed rows with an "unknown" genre classification, which are not needed for the future AI project.
Row Filter
Compare Movies
Bar Chart
String cleaning: Normalized titles to lowercase and trimmed whitespace to ensure consistency.
String Manipulation
Converted [release_date] into its corresponding decade.
Math Formula
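
One way to derive a decade from a release date is to floor the year to a multiple of ten. The sketch below assumes that rule and the yyyy-MM-dd string format mentioned in the notes; the function names are hypothetical.

```python
from datetime import datetime


def parse_release_date(text):
    """Parse a yyyy-MM-dd string into a date object."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def decade_of(release_date):
    """Floor the year to its decade, e.g. 1999 -> 1990."""
    return (release_date.year // 10) * 10


decade_a = decade_of(parse_release_date("1999-12-31"))
decade_b = decade_of(parse_release_date("2015-06-09"))
```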
Isolated Genre data, retaining 'id' as the foreign key.
Column Filter
decade vs genre
Bar Chart
Cast
Cell Splitter
Writing the final table to a CSV file.
CSV Writer
Correlation plot: Confirms that higher budgets are generally associated with higher revenue.
Scatter Plot
Filtered out columns containing lists or multiple values to keep only columns with a single value per movie, creating a clean table for analysis.
Column Filter
Category Analysis: Identifying high-grossing genres
Bar Chart
