Justification
-EDA revealed missing values in columns such as homepage, tagline, keywords, production_companies and cast. Possible measures include dropping columns that are irrelevant to later analysis and filling the remaining gaps with 'unknown' so that every row is categorized.
-The distribution analysis showed that budget and revenue contain a large number of zero values. Because these columns are important for the analysis, we need to consider removing rows with a value of 0 during the data cleaning phase.
-The date column is read as a string, which indicates a formatting issue. It needs to be converted to a proper date type.
-Box plot analysis confirms extreme outliers in revenue and vote_count, which need to be handled in the data cleaning phase.
-The heatmap and linear correlation analysis suggest positive relationships between revenue and vote_count and between popularity and vote_count, indicating that financially successful movies tend to receive higher audience engagement. This justifies the use of clustering to segment performance levels.
-There are inconsistencies in the text columns. The columns affected are directors, genres, production_companies, cast, and keywords. They contain multiple values separated by "|". To facilitate future AI analysis, we will split the values in each of these columns out into a separate table.
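
The cleaning measures listed above could be sketched in pandas roughly as follows. This is a minimal illustration on invented toy rows, not the project's actual pipeline. The `id` and `release_date` column names, the ISO date format, and the 1.5 x IQR rule for flagging outliers are all assumptions made for the example; the real dataset may need different choices.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the movie data; every value here is invented.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "tagline": ["Tag A", None, "Tag C", None, "Tag E", "Tag F", "Tag G", "Tag H"],
    "budget": [50, 0, 60, 70, 80, 90, 100, 400],
    "revenue": [100, 0, 120, 0, 150, 160, 180, 5000],
    "release_date": ["2015-06-09", "2014-12-25", "bad value", "2013-01-01",
                     "2012-03-04", "2011-05-06", "2010-07-08", "2009-09-10"],
    "genres": ["Action|Adventure", "Drama", None, "Comedy|Drama",
               "Action", "Drama", "Comedy", "Action|Thriller"],
})

# 1. Fill missing text values with 'unknown' so every row is categorized.
df["tagline"] = df["tagline"].fillna("unknown")

# 2. Parse date strings (ISO format assumed); unparseable values become NaT.
df["release_date"] = pd.to_datetime(
    df["release_date"], format="%Y-%m-%d", errors="coerce"
)

# 3. Treat 0 in budget/revenue as missing, then drop rows lacking either value.
money_cols = ["budget", "revenue"]
df[money_cols] = df[money_cols].replace(0, np.nan)
clean = df.dropna(subset=money_cols).copy()

# 4. Flag revenue outliers with the 1.5 * IQR rule (one common option, assumed here).
q1, q3 = clean["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["revenue_outlier"] = ~clean["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 5. Split the "|"-separated genres into a separate long-format table.
genre_table = (
    clean[["id", "genres"]]
    .dropna(subset=["genres"])
    .assign(genres=lambda d: d["genres"].str.split("|"))
    .explode("genres")
    .rename(columns={"genres": "genre"})
    .reset_index(drop=True)
)
```

The same split-and-explode pattern would apply to the other "|"-separated columns. Flagging outliers rather than deleting them keeps the decision about how to treat them open until the clustering step.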