eda

Exploratory Data Analysis (EDA):

  • Data Inventory: Successfully ingested the raw dataset containing 10,866 rows and 21 columns.

  • Visual Verification: Initial profiling via Table View found no significant inconsistencies, spelling errors, or formatting errors.

  • Missing values:

    • High missing rate: homepage (73%), tagline (26%), and keywords (13%) were identified; they are not directly related to the analysis objective, so they were removed.

    • Minor missing rate: genres, production_companies, and cast, requiring imputation.

  • Outliers: The Box Plot reveals significant outliers in revenue and budget.
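The missing-value profiling above can be sketched outside KNIME as a per-column missing-rate calculation. This is a minimal illustration, not the workflow's own logic: the sample data, the helper names `missing_rates` and `split_by_threshold`, and the 30% threshold are all assumptions made for the example. The workflow itself removed columns on relevance grounds, not on a fixed cutoff.

```python
import csv
import io

# Hypothetical sample standing in for the raw CSV (the real file has 10,866 rows).
SAMPLE_CSV = """id,homepage,tagline,genres
1,http://example.com,A tagline,Drama
2,,Another tagline,Comedy
3,,,Action
4,,A third tagline,
"""


def missing_rates(rows, columns):
    """Return the fraction of empty cells for each column."""
    total = len(rows)
    return {
        col: sum(1 for r in rows if not (r.get(col) or "").strip()) / total
        for col in columns
    }


def split_by_threshold(rates, threshold):
    """Split columns into drop and impute candidates.

    The threshold is an illustrative assumption, not a value from the workflow.
    """
    drop = sorted(c for c, rate in rates.items() if rate > threshold)
    impute = sorted(c for c, rate in rates.items() if 0 < rate <= threshold)
    return drop, impute


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
rates = missing_rates(rows, reader.fieldnames)
drop, impute = split_by_threshold(rates, threshold=0.30)

for col, rate in rates.items():
    print(f"{col}: {rate:.0%} missing")
print("drop candidates:", drop)
print("impute candidates:", impute)
```

A cutoff like this only flags candidates; whether a sparse column is actually dropped still depends on how relevant it is to the analysis.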

Financial Analysis

  • Variable Isolation: Refined the dataset to focus exclusively on core financial metrics (budget, revenue) and their inflation-adjusted counterparts to ensure historical comparability.

  • Correlation: Confirmed a positive trend between budget and revenue.

  • Distribution: Validated a heavily right-skewed revenue distribution via Histogram; confirmed that the majority of films earn under $500e.

  • Final Output: Exported clean dataset for financial analysis.
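As a rough numeric counterpart to the correlation and histogram checks above, the sketch below computes a Pearson correlation coefficient and a population skewness value. The figures are invented for illustration, and the helpers `pearson` and `skewness` are hypothetical names; the KNIME nodes may use different estimators.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def skewness(values):
    """Population skewness (third standardized moment); positive means right-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up values (in millions), not taken from the dataset.
budget = [1.0, 2.0, 3.0, 4.0, 50.0]
revenue = [2.0, 3.0, 7.0, 9.0, 200.0]

r = pearson(budget, revenue)
g1 = skewness(revenue)
print(f"correlation(budget, revenue) = {r:.3f}")
print(f"skewness(revenue) = {g1:.3f}")
```

A positive skewness value corresponds to the long right tail seen in the histogram: a few very high earners pull the mean above the bulk of the films.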

Genre Analysis

  • Isolation: Used Column Filter to extract only id and genres.

  • Normalization: Applied Cell Splitter (splitting on '|') followed by Ungroup to expand multi-valued genre strings into one row per genre, so that each movie is counted under every category it belongs to.

  • Aggregation: Utilized Group By on the split genre results with a Count aggregation on id to determine the total volume of films produced per category.

  • Visualization: The Bar Chart reveals that Drama is the most frequent genre in the dataset (1,500), followed closely by Comedy, Thriller, and Action.

  • Final Output: Exported the dataset via CSV Writer for final reporting.
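The split, ungroup, and count steps above can be sketched in plain Python as follows. The sample records and the helper name `explode_genres` are assumptions for illustration; only the '|' delimiter and the id and genres columns come from the workflow.

```python
from collections import Counter

# Hypothetical records; the real input is the id and genres columns of the dataset.
movies = [
    {"id": 1, "genres": "Drama|Thriller"},
    {"id": 2, "genres": "Comedy|Drama"},
    {"id": 3, "genres": "Action"},
    {"id": 4, "genres": ""},  # a missing genre yields no rows
]


def explode_genres(rows, sep="|"):
    """Turn each multi-valued genre string into one (id, genre) pair per genre."""
    pairs = []
    for row in rows:
        for genre in row["genres"].split(sep):
            genre = genre.strip()
            if genre:
                pairs.append((row["id"], genre))
    return pairs


pairs = explode_genres(movies)                    # Cell Splitter + Ungroup analogue
counts = Counter(genre for _, genre in pairs)     # GroupBy with Count on id
ranked = counts.most_common()                     # ordering used for the bar chart

for genre, n in ranked:
    print(f"{genre}: {n}")
```

Because a film with several genres contributes one row to each, the per-genre counts add up to more than the number of films.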

Rating Analysis

  • Isolation: Used a Column Filter to isolate popularity, vote_average, and vote_count for analysis.

  • Distribution: The Scatter Plot shows no strong linear correlation between popularity and vote_average; high scores do not directly drive high revenue.

  • Outliers: The Box Plot identified significant outliers in vote_count, indicating that a small subset of films receives disproportionately high audience engagement.

  • Final Output: Exported the dataset via CSV Writer for final reporting.
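Box plots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up vote counts; the helper name `iqr_outliers` is hypothetical, and the quartile method used by Python's `statistics.quantiles` may differ slightly from the one the Box Plot node uses.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up vote counts: most films get few votes, one gets very many.
vote_count = [10, 12, 15, 18, 20, 22, 25, 30, 9000]
outliers = iqr_outliers(vote_count)
print("outliers:", outliers)
```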

Data Cleaning:

  • Integrity Enforcement: Applied Duplicate Row Filter to remove redundant records and ensure the uniqueness of each movie.

  • Text Normalization:

    • Utilized String Cleaner to strip unnecessary spaces.

    • Used String Manipulation (Multi Column) to capitalize categories and maintain consistency across text-based fields.

  • Temporal Formatting: Converted the release_date column from string format to a standardized Date&Time format.

  • Dimensionality Reduction: Implemented a Column Filter to remove low-relevance features such as imdb_id, homepage, and overview to focus on the analytical objectives.

  • Constraint Filtering: Applied a Row Filter to exclude records with unrealistic financial data, keeping only movies whose budget and revenue both exceed $10,000.

  • Missing values: Applied the Missing Value node, using median imputation for runtime and mean imputation for vote_average to maintain the statistical distribution.

  • Correlation Verification: Validated the cleaned dataset through Linear Correlation and Heatmap nodes, confirming strong relationships between revenue and budget, popularity and vote_count, and revenue and vote_count.
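The cleaning steps above can be approximated in plain Python as one small function. This is a sketch under stated assumptions, not a port of the workflow: the sample records, the `title` field, the date format `%m/%d/%Y`, the use of `str.title()` for capitalization, and the function name `clean` are all hypothetical. Only the order of steps, the $10,000 threshold, and the median/mean imputation choices come from the description.

```python
import statistics
from datetime import datetime

# Hypothetical records; field values are invented for illustration.
raw = [
    {"id": 1, "title": "  alpha  movie ", "release_date": "6/9/2015",
     "budget": 150_000_000, "revenue": 500_000_000, "runtime": 120, "vote_average": 6.0},
    {"id": 1, "title": "  alpha  movie ", "release_date": "6/9/2015",
     "budget": 150_000_000, "revenue": 500_000_000, "runtime": 120, "vote_average": 6.0},
    {"id": 2, "title": "beta", "release_date": "1/2/2010",
     "budget": 0, "revenue": 0, "runtime": 90, "vote_average": 5.0},
    {"id": 3, "title": "gamma", "release_date": "3/4/2012",
     "budget": 20_000_000, "revenue": 40_000_000, "runtime": None, "vote_average": None},
    {"id": 4, "title": "delta", "release_date": "5/6/2014",
     "budget": 5_000_000, "revenue": 10_000_000, "runtime": 100, "vote_average": 8.0},
]


def clean(rows, date_format="%m/%d/%Y", min_amount=10_000):
    """Deduplicate, normalize text, parse dates, filter rows, and impute gaps."""
    # 1. Duplicate Row Filter analogue: keep the first copy of each identical record.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    for row in unique:
        # 2. Text normalization: collapse extra spaces, then capitalize each word.
        row["title"] = " ".join(row["title"].split()).title()
        # 3. Temporal formatting: string to date (the format is an assumption).
        row["release_date"] = datetime.strptime(row["release_date"], date_format).date()

    # 4. Row Filter analogue: keep rows where budget and revenue both exceed the threshold.
    kept = [r for r in unique if r["budget"] > min_amount and r["revenue"] > min_amount]

    # 5. Missing Value analogue: median for runtime, mean for vote_average.
    median_runtime = statistics.median(r["runtime"] for r in kept if r["runtime"] is not None)
    mean_vote = statistics.fmean(r["vote_average"] for r in kept if r["vote_average"] is not None)
    for r in kept:
        if r["runtime"] is None:
            r["runtime"] = median_runtime
        if r["vote_average"] is None:
            r["vote_average"] = mean_vote
    return kept


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Imputing after the row filter means the median and mean are computed only from records that pass the financial threshold; running the steps in a different order would change the imputed values.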

Basic Information

  • Isolation: Used a Column Filter to select core descriptive variables, including runtime, release_year, release_date, production_companies, original_title, id, and director.

  • Purpose: This step isolates essential identification and contextual attributes of each movie, separating descriptive information from financial and engagement metrics.

  • Final Output: Exported the dataset via CSV Writer for final reporting.

Workflow node annotations

  • CSV Reader: Dataset input.
  • Statistics View: Identified missing values in homepage (73%) and tagline (26%); minor missing values were detected in genres and cast, so imputation is required.
  • Box Plot: Identified outliers in revenue and budget.
  • Table View: Initial profiling verified that the dataset contains 10,866 rows and 21 columns, with no significant formatting errors, inconsistencies, or spelling errors.
  • Duplicate Row Filter: Removed 1 duplicate row to ensure uniqueness.
  • Missing Value: Median imputation for runtime, mean imputation for vote_average.
  • Column Filter: Isolated core financial metrics (budget and revenue).
  • Histogram: Identified the distribution; confirmed most of the revenue is under 5e.
  • Linear Correlation
  • Statistics View: Confirmed no statistical errors.
  • Box Plot: Outliers detected in vote_count.
  • Column Filter: Isolated popularity and vote columns.
  • Column Filter: Selected basic information (title, director, cast, etc.).
  • Heatmap: Strong correlation between revenue and budget, popularity and vote_count, and revenue and vote_count.
  • Statistics View: Confirmed no statistical errors.
  • CSV Writer: Exported the clean genre dataset for further analysis.
  • CSV Writer: Exported clean financial data for further analysis.
  • CSV Writer: Exported clean rating data for further analysis.
  • Column Filter: Removed imdb_id, homepage, tagline, keywords, and overview due to incompleteness or low relevance to the analytical objective.
  • Scatter Plot: Visualized the relationship between budget and revenue to confirm a positive trend and identify outliers.
  • CSV Writer: Exported clean basic information data for further analysis.
  • String Cleaner: Standardized text by removing unnecessary spaces.
  • String to Date&Time: Standardized the Date&Time format of release_date.
  • String Manipulation (Multi Column): Standardized formatting (capitalization) to ensure categorical consistency.
  • Bar Chart: Genre distribution; Drama and Comedy are the top genres.
  • Row Filter: Kept rows with budget and revenue > 10000.
  • Scatter Plot: Visualized the relationship between popularity and vote_average (no strong linear correlation).
  • Ungroup: Ungrouped the list into rows, keeping id.
  • Statistics View: Confirmed no statistical errors.
  • Column Filter: Selected the new column genres_split.
  • Column Filter: Isolated id and genres.
  • GroupBy: Grouped by genres_split.
  • Cell Splitter: Split genres by '|'.
  • Statistics View: Confirmed no statistical errors.
