JKISeason 4-14 - News Article Scraper and Reader

Challenge 14: News Article Scraper and Reader

Level: Medium

Description: Julian works as a researcher at a media and journalism research institute in Dublin. Every morning, he scans RSS news feeds from authoritative outlets, e.g. BBC, to stay informed and gather content for his research. However, doing this manually is tedious and time-consuming. RSS feeds only provide limited summaries, and Julian often has to click through each article to extract the full text and fetch associated images. He needs a way to automate content and image retrieval and organize the results into an interactive news reader.

To help Julian, you decide to build a workflow that reads the BBC World RSS news feeds, scrapes full articles, and extracts the first image of each article along with its caption. The workflow should also allow Julian to view the full text of scraped news articles and the associated image interactively. Can you help Julian automate the process?

Beginner-friendly objectives:
1. Read the BBC World RSS news feeds, filter out news items whose URL contains "videos", and format the date & time info to your liking.
2. Scrape and extract the full text of each news article, the first image and its caption (if available).
3. Visualize news article details (e.g., titles, publication date, etc.), as well as the full scraped text, the associated first image and caption (remove ".webp" from image URLs and retain only .jpg files).

Intermediate-friendly objectives:
1. Make the selection of each news article and its corresponding image more flexible with widgets, creating an interactive data app that makes reading news more engaging.
2. Add beautification elements to your data app (e.g., a title, a subtitle, emoji, instructions on the intended use, etc.).
3. Web scraping can be prone to issues because website structures may change, sites may rate-limit traffic, or the Internet connection may temporarily drop or become unstable. Any of these can break the scraping logic and lead to errors or missing data if not handled carefully. Add error handling techniques to make sure that for each news article the scraper deals with errors or missing data gracefully (if the news scraper runs without errors, you can simulate them by temporarily disabling your Internet connection).
4. Log detailed errors for each failed article scraping attempt (e.g., reason for failure, the failing node, etc.). Make sure to also add the date and timestamp of when the error occurred.

Remember to upload your solution with tag JKISeason4-14 to your public space on KNIME Community Hub. To increase the visibility of your solution, also post it to this challenge thread on KNIME Forum.
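The challenge is meant to be solved with KNIME nodes, but the data-handling rules in the objectives can be sketched in a few lines of Python to make them concrete. The following is only an illustrative sketch, not the reference solution: all function names, the `fetch` callback, and the fields of the error record are assumptions introduced here, and no actual scraping is performed.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Optional


def keep_article(url: str) -> bool:
    """Beginner objective 1: drop feed items whose URL contains "videos"."""
    return "videos" not in url


def format_pub_date(pub_date: str) -> str:
    """Beginner objective 1: reformat an RSS-style (RFC 822) date string.

    The output pattern is an arbitrary choice ("to your liking").
    """
    return parsedate_to_datetime(pub_date).strftime("%Y-%m-%d %H:%M")


def clean_image_url(url: str) -> Optional[str]:
    """Beginner objective 3: strip ".webp" and retain only .jpg images."""
    cleaned = url.replace(".webp", "")
    return cleaned if cleaned.lower().endswith(".jpg") else None


def scrape_safely(url: str, fetch: Callable[[str], str], errors: list) -> Optional[str]:
    """Intermediate objectives 3-4: call fetch(url) and fail gracefully.

    `fetch` is a hypothetical scraping step supplied by the caller. On any
    exception, a record with the reason, the failing step, and a UTC
    timestamp is appended to `errors`, and None is returned instead of
    stopping the whole run.
    """
    try:
        return fetch(url)
    except Exception as exc:
        errors.append({
            "url": url,
            "step": getattr(fetch, "__name__", "fetch"),
            "reason": f"{type(exc).__name__}: {exc}",
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        })
        return None
```

In a loop over feed items, `scrape_safely` would be called once per article so that a single failure leaves a missing value and a log entry rather than aborting the remaining articles. In a KNIME workflow, the same idea would typically be expressed with error-handling and logging nodes rather than code.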

URL: Just KNIME It! https://www.knime.com/just-knime-it
