TMDb API Poster and Data Retrieval

This workflow can be run on its own with the dataset linked among the relevant materials (please remove the space from the name of the downloaded .csv file before execution), but it is really complementary to another workflow, which can be found in the same directory as this one on KNIME Hub.

This workflow uses the IMDb ID of each movie in the dataset to look up the corresponding poster URL and English title through the TMDB API. Optionally, it also scrapes the posters from TMDB, using the poster URLs found in the previous step.
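The lookup by IMDb ID can be sketched outside KNIME as follows. This is only an illustration, not the content of the metanode: it assumes the TMDB v3 "find" endpoint with `external_source=imdb_id` and the public image base URL, and the function names, the `w500` size and the output keys are choices made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.themoviedb.org/3"
IMAGE_BASE = "https://image.tmdb.org/t/p"


def build_find_url(imdb_id, api_key, language="en-US"):
    """Build the request URL that looks a movie up by its IMDb ID."""
    query = urlencode({
        "api_key": api_key,
        "external_source": "imdb_id",
        "language": language,
    })
    return f"{API_BASE}/find/{imdb_id}?{query}"


def extract_title_and_poster(payload, size="w500"):
    """Return the English title and poster URL, or None if either is unavailable.

    Movies that are missing from TMDB or have no poster are skipped,
    mirroring the "present in TMDB and with poster URL" condition.
    """
    results = payload.get("movie_results") or []
    if not results:
        return None
    movie = results[0]
    poster_path = movie.get("poster_path")
    if not poster_path:
        return None
    return {
        "english_title": movie.get("title"),
        "poster_url": f"{IMAGE_BASE}/{size}{poster_path}",
    }


def lookup(imdb_id, api_key):
    """Perform the network call (requires a valid TMDB API key)."""
    with urlopen(build_find_url(imdb_id, api_key), timeout=30) as response:
        return extract_title_and_poster(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes it possible to check the logic without an API key.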

Poster scraping takes days for this number of movies, so it has to be activated manually from inside the metanode. By default it runs for only one movie. The images are saved in a folder in the working directory.
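A rough Python equivalent of the scraping step, with the same "one movie by default" behaviour, might look like this. The folder name `posters`, the `.jpg` file naming, the `limit` parameter and the skip-if-already-downloaded check are assumptions made for this sketch, not details taken from the metanode.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download the raw bytes behind a URL."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_posters(rows, out_dir="posters", limit=1, fetch=fetch_bytes):
    """Save posters for (imdb_id, poster_url) pairs into out_dir.

    limit=1 processes only the first movie, like the default row filter;
    pass limit=None to process every row (this can take a very long time).
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    selected = rows if limit is None else rows[:limit]
    saved = []
    for imdb_id, poster_url in selected:
        target = folder / f"{imdb_id}.jpg"
        # Assumption: files already on disk are not downloaded again.
        if not target.exists():
            target.write_bytes(fetch(poster_url))
        saved.append(target)
    return saved
```

The `fetch` argument can be replaced with a stub to try the function without touching the network.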

A TMDB API key is required to run this workflow. Please read the TMDB Terms of Use linked in the relevant materials section.

POSTER AND ENGLISH TITLE RETRIEVAL THROUGH TMDB API

The metanode here is fairly elaborate. It uses the TMDB API to retrieve the poster URLs from the website and to get the English titles of the movies (those present in TMDB and with a poster URL), which we need for text mining, since the original dataset from Kaggle only has the original titles and the Italian ones. Once we have the poster URLs, the node performs image scraping and downloads all the posters from the website, saving them in a local folder. The output of the node is a table with the IMDb title ID of each movie and its English title.

The other nodes here read the main dataset from Kaggle again and filter out the rows that have NA in worldwide gross income, since our inference is on this variable, as the dependent one. In the end, we join this filtered table with the table output by the metanode. All these operations are possible because the TMDB API can use the IMDb ID (present in the original Kaggle dataset) as the identifier for a movie.

The complete poster scraping process is not activated by default, since it would take days to run. It needs to be activated from inside the metanode, by editing an internal row filter node.

Workflow annotations:

API Key Input
We read the main dataset
Filter the rows with missing values in worldwide income
We exclude Italian title and original title because we don't need them and to avoid conflicts in the namespace
We join the English titles to the rest of the dataset
Metanode to scrape images and get English titles through the TMDB API. Scraping of the images needs to be activated manually from inside the metanode by editing a row filter node (to speed up the workflow in case the posters are already downloaded)
We send out the output table for possible external workflows

Node labels:

API Key TMDb
CSV Reader
Row Filter
Column Filter
Joiner
Image Extraction from TMDb and TMDb API
Container Input (Table)
Container Output (Table)
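For readers who want to reproduce the Row Filter, Column Filter and Joiner steps outside KNIME, here is a minimal sketch on plain Python dictionaries. The column names (`imdb_title_id`, `worldwide_gross_income`, `title`, `original_title`, `english_title`) are placeholders and should be adapted to the actual CSV header. The inner join and the reading of "filter" as "drop rows with a missing value" are assumptions as well.

```python
def prepare_dataset(movies, english_titles,
                    id_col="imdb_title_id",
                    income_col="worldwide_gross_income",
                    drop_cols=("title", "original_title")):
    """Filter, trim and join the main dataset with the English titles.

    movies and english_titles are lists of dicts (e.g. from csv.DictReader).
    All column names are hypothetical defaults for this example.
    """
    titles_by_id = {row[id_col]: row["english_title"] for row in english_titles}
    joined = []
    for row in movies:
        # Row filter: skip movies with no worldwide gross income.
        if row.get(income_col) in (None, "", "NA"):
            continue
        # Inner join (assumed): skip movies without an English title from TMDB.
        if row[id_col] not in titles_by_id:
            continue
        # Column filter: drop the Italian and original titles.
        kept = {key: value for key, value in row.items() if key not in drop_cols}
        kept["english_title"] = titles_by_id[row[id_col]]
        joined.append(kept)
    return joined
```

Dropping the two title columns before adding `english_title` avoids ending up with several similarly named title columns after the join.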
