eda

Exploratory Data Analysis (EDA):

  • Data Inventory: Successfully ingested the raw dataset containing 10,866 rows and 21 columns.

  • Visual Verification: Initial profiling via Table View found no significant inconsistencies, spelling errors, or formatting errors.

  • Missing Values:

    • High missing rate: homepage (73%), tagline (26%), and keywords (13%) were identified as not directly related to the analysis objective and were removed.

    • Minor missing rate: genres, product_company, and cast have few missing values and require imputation.

  • Outliers: The Box Plot reveals significant outliers in revenue and budget.
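The missing-value triage above can be sketched outside KNIME in plain Python. This is only an illustration under stated assumptions: the function names, the toy rows, and the drop threshold are hypothetical and do not come from the workflow, which used the Statistics View node.

```python
def missing_rates(rows):
    """Return the fraction of missing (None or empty string) values per column."""
    if not rows:
        return {}
    return {
        col: sum(1 for r in rows if r.get(col) in (None, "")) / len(rows)
        for col in rows[0]
    }


def split_by_missing_rate(rates, drop_threshold):
    """Columns above the threshold are drop candidates; the rest with gaps need imputation."""
    drop = sorted(c for c, r in rates.items() if r > drop_threshold)
    impute = sorted(c for c, r in rates.items() if 0 < r <= drop_threshold)
    return drop, impute


# Toy rows for illustration only, not the real dataset.
rows = [
    {"id": "1", "homepage": "", "cast": "A", "genres": "Drama"},
    {"id": "2", "homepage": None, "cast": "", "genres": "Comedy"},
    {"id": "3", "homepage": "http://example.com", "cast": "C", "genres": "Drama"},
    {"id": "4", "homepage": "", "cast": "D", "genres": "Action"},
]

rates = missing_rates(rows)
# The 0.5 threshold is an assumption chosen to suit the toy data.
drop, impute = split_by_missing_rate(rates, drop_threshold=0.5)
print(rates, drop, impute)
```

The threshold is a judgment call. The workflow removed columns based on both missing rate and relevance, so a single numeric cutoff is a simplification.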

Financial Correlation and Distribution Analysis

  • Variable Isolation: Refined the dataset to focus exclusively on core financial metrics (budget, revenue) and their inflation-adjusted counterparts to ensure historical comparability.

  • Visual Insights:

    • Correlation: Confirmed a positive trend between budget and revenue.

    • Distribution: Validated a heavily right-skewed revenue distribution via Histogram; confirmed that the vast majority of films earn under $500e.

  • Final Output: Exported clean dataset for ROI analysis.
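The two visual findings above (a positive budget/revenue trend and a right-skewed distribution) can also be checked numerically. The sketch below is a minimal plain-Python illustration; the helper names and the sample values are hypothetical, and the workflow itself relied on the Scatter Plot, Linear Correlation, and Histogram nodes.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def skewness(values):
    """Population skewness (third standardized moment); positive means a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Made-up values (arbitrary units) with one very large film, for illustration only.
budget = [1.0, 2.0, 3.0, 4.0, 50.0]
revenue = [2.0, 3.0, 7.0, 8.0, 200.0]

print("correlation:", pearson(budget, revenue))
print("revenue skewness:", skewness(revenue))
```

A single extreme value dominates both statistics here, which is one reason to inspect the plots alongside the numbers.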

Genre Prevalence Analysis

  • Isolation: Used Column Filter to extract only id and genres.

  • Normalization: Applied Cell Splitter to split multi-valued genre strings into a list, then Ungroup to expand that list into one row per genre, ensuring each movie is correctly indexed under every applicable category.

  • Aggregation: Utilized Group By on the split genre results with a Count aggregation on id to determine the total volume of films produced per category.

  • Visualization: The Bar Chart reveals that Drama is the most frequent genre in the dataset (1,500), followed closely by Comedy and Thriller/Action.
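The split, ungroup, and count steps above can be sketched in plain Python as follows. The function names and the toy records are hypothetical; the '|' separator is the one noted in the workflow's Cell Splitter annotation.

```python
from collections import Counter


def explode_genres(movies, sep="|"):
    """Yield one (id, genre) pair per genre, mirroring Cell Splitter followed by Ungroup."""
    for movie_id, genres in movies:
        if not genres:
            continue  # skip movies with a missing genres value
        for genre in genres.split(sep):
            genre = genre.strip()
            if genre:
                yield movie_id, genre


def count_by_genre(movies):
    """Count movies per genre, mirroring GroupBy with a Count aggregation on id."""
    return Counter(genre for _, genre in explode_genres(movies))


# Toy records for illustration only, not the real dataset.
movies = [
    (1, "Drama|Thriller"),
    (2, "Comedy"),
    (3, "Drama|Comedy"),
    (4, None),
]

genre_counts = count_by_genre(movies)
print(genre_counts.most_common())
```

Because a movie is counted once under each of its genres, the per-genre totals sum to more than the number of movies.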

Rating Analysis

  • Isolation: Used a Column Filter to isolate popularity, vote_average, and vote_count for analysis.

  • Correlation:

    • The Scatter Plot reveals a weak linear correlation between popularity and vote_average.

    • Observation: High-rated films (score > 8) often maintain low popularity scores, suggesting that critical acclaim does not always drive mass-market viral success.

  • Outliers and Distribution: The Box Plot identified significant outliers in vote_count, indicating that a small subset of "blockbuster" films receives disproportionately high audience engagement.

  • Final Output: Exported the dataset via CSV Writer for final reporting.
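A box plot conventionally flags points beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule in plain Python. It assumes the common 1.5 × IQR convention and Python's default quartile method, either of which may differ from the KNIME Box Plot node; the function name and sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up vote counts with one "blockbuster", for illustration only.
vote_count = [10, 12, 14, 15, 16, 18, 20, 22, 9000]

print(iqr_outliers(vote_count))
```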

Data Cleaning:

  • Integrity Enforcement: Applied Duplicate Row Filter to remove redundant records and ensure the uniqueness of each movie.

  • Text Normalization:

    • Utilized String Cleaner to strip unnecessary whitespace.

    • Used String Manipulation (Multi Column) to capitalize categories and maintain consistency across text-based fields.

  • Temporal Formatting: Converted the release_date column from string format to a standardized Date&Time format.

  • Dimensionality Reduction: Implemented a Column Filter to remove low-relevance features such as imdb_id, homepage, and overview to focus on the analytical objectives.

  • Constraint Filtering: Applied a Row Filter to exclude records with unrealistic financial data, keeping only movies whose budget and revenue both exceed $10,000.

  • Missing Values: Applied the Missing Value node, using median imputation for runtime and mean imputation for vote_average to preserve each column's central tendency.

  • Correlation Verification: Validated the cleaned dataset through Linear Correlation and Heatmap nodes, confirming strong relationships between revenue and budget, popularity and vote_count, and revenue and vote_count.
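Several of the cleaning steps above (whitespace stripping, duplicate removal, the financial row filter, date parsing, and median imputation for runtime) can be sketched in plain Python. This is an illustration under assumptions, not the workflow itself: the function name, the toy rows, the step order, and the "%m/%d/%y" date format are all hypothetical. Capitalization and mean imputation for vote_average are left out for brevity.

```python
import statistics
from datetime import datetime


def clean(rows, date_format="%m/%d/%y", min_amount=10_000):
    """Strip strings, drop duplicates, filter on budget/revenue, parse dates, impute runtime."""
    seen, cleaned = set(), []
    for row in rows:
        # Strip surrounding whitespace from every string field.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        # Keep only rows where both budget and revenue exceed the threshold.
        if row["budget"] <= min_amount or row["revenue"] <= min_amount:
            continue
        # The date format is an assumption about how release_date is stored.
        row["release_date"] = datetime.strptime(row["release_date"], date_format).date()
        cleaned.append(row)

    # Median imputation for missing runtime values.
    known = [r["runtime"] for r in cleaned if r["runtime"] is not None]
    if known:
        median_runtime = statistics.median(known)
        for r in cleaned:
            if r["runtime"] is None:
                r["runtime"] = median_runtime
    return cleaned


# Toy rows for illustration only, not the real dataset.
rows = [
    {"id": 1, "title": " Movie A ", "budget": 150_000_000, "revenue": 500_000_000,
     "runtime": 120, "release_date": "6/9/15"},
    {"id": 1, "title": " Movie A ", "budget": 150_000_000, "revenue": 500_000_000,
     "runtime": 120, "release_date": "6/9/15"},
    {"id": 2, "title": "Movie B", "budget": 0, "revenue": 0,
     "runtime": 90, "release_date": "1/1/10"},
    {"id": 3, "title": "Movie C", "budget": 20_000_000, "revenue": 30_000_000,
     "runtime": None, "release_date": "12/25/09"},
    {"id": 4, "title": "Movie D", "budget": 50_000_000, "revenue": 80_000_000,
     "runtime": 100, "release_date": "3/14/12"},
]

cleaned = clean(rows)
for r in cleaned:
    print(r)
```

In this sketch the median is computed after filtering, so the imputed value reflects only the rows that survive the financial filter. Computing it earlier would be an equally defensible choice.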

CSV Reader: Dataset input.
Statistics View: Identified missing values in homepage (73%) and tagline (26%); minor missing values were detected in genres and cast, imputation required.
Box Plot: Identified outliers in revenue and budget.
Table View: Initial profiling verified that the dataset contains 10,866 rows and 21 columns, with no significant formatting errors, inconsistencies, or spelling errors.
Duplicate Row Filter: Removed 1 duplicate row to ensure uniqueness.
Missing Value: Median imputation for runtime, mean imputation for vote_average.
Column Filter: Isolated core financial metrics (budget and revenue).
Histogram: Identified the distribution; confirmed most of the revenue is under 5e.
Linear Correlation
Statistics View: No statistical errors.
Box Plot: Outliers detected in vote_count.
Column Filter: Isolated popularity and vote.
Column Filter: Selected basic information.
Heatmap: Strong correlation between revenue and budget, popularity and vote_count, and revenue and vote_count.
CSV Writer: Exported the genre dataset for further analysis.
CSV Writer: Exported clean data for further analysis.
CSV Writer: vote (rating).
Column Filter: Removed imdb_id, homepage, tagline, keywords, and overview due to incompleteness or low relevance to the analytical objective.
Scatter Plot: Visualized the relationship between budget and revenue to confirm a positive trend and identify outliers.
CSV Writer: information.
String Cleaner: Standardized text by removing unnecessary spaces.
String to Date&Time: Standardized the Date&Time format (release_date).
String Manipulation (Multi Column): Standardized formatting (Capitalize) to ensure categorical consistency.
Bar Chart: Genre distribution; Drama and Comedy are the top genres.
Row Filter: Kept rows with budget and revenue > 10000.
Scatter Plot: Visualized the relationship between popularity and vote_average (no strong linear correlation).
Ungroup: Ungrouped the list into rows, keeping id.
Statistics View: Confirmed no errors.
Column Filter: Selected the new column genres_split.
Column Filter: Isolated id and genres.
GroupBy: Grouped by genres_split.
Cell Splitter: Split genres by '|'.
Statistics View
