
Web Page Content Extractor

Streamable

Palladian for KNIME version 2.4.1.202103282119 by palladian.ws

This node provides different content extraction algorithms, which allow extracting textual content from web pages, discarding irrelevant elements like navigation, ads, footers, headers, etc. The extractors are generally optimized for typical news article and blog post web pages.

The node expects an “XML Document” cell input which contains the HTML page from which to extract the content. To properly parse a web page, use Palladian’s HTML Parser node.

Currently, the following algorithms are provided:

Readability

A port of the JavaScript browser bookmarklet “Readability” by Arc90, a great tool for extracting content from HTML pages. “Readability […] takes a crack at wiping out all that junk so you can have a more enjoyable reading experience. […] its success rate is pretty respectable (we’d guess over 90% of web sites are handled properly)”.

Readability operates on the document’s DOM tree. It assigns each element a score based on its contents; the scoring metrics are the length of the text content, the number of commas, and the link density. Class and id names are also taken into account: for example, an element with the class name sidebar is unlikely to contain actual content, whereas an element with the class name article probably does.

Website, JavaScript Source.
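The scoring idea can be illustrated with a short Python sketch. This is a simplified illustration, not the Arc90 code and not the Palladian port: the function names, the class-name patterns, and all numeric weights are assumptions chosen for readability, and the input is assumed to be well-formed XML, as the node itself expects.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical class/id patterns; the real Readability lists are longer and differ.
NEGATIVE = re.compile(r"sidebar|footer|nav|advert", re.I)
POSITIVE = re.compile(r"article|content|post", re.I)


def text_of(elem):
    """Concatenate all text inside an element, including nested elements."""
    return "".join(elem.itertext())


def link_density(elem):
    """Fraction of an element's text that sits inside <a> elements."""
    total = len(text_of(elem))
    if total == 0:
        return 0.0
    linked = sum(len(text_of(a)) for a in elem.iter("a"))
    return linked / total


def class_weight(elem):
    """Bonus or penalty derived from the class and id attribute values."""
    names = (elem.get("class") or "") + " " + (elem.get("id") or "")
    weight = 0
    if NEGATIVE.search(names):
        weight -= 25  # assumed penalty
    if POSITIVE.search(names):
        weight += 25  # assumed bonus
    return weight


def score(elem):
    """Combine text length, comma count, class weight, and link density."""
    text = text_of(elem)
    base = 1 + text.count(",") + min(len(text) // 100, 3)
    return (base + class_weight(elem)) * (1 - link_density(elem))


def best_candidate(root, tag="div"):
    """Return the highest-scoring element with the given tag name."""
    return max(root.iter(tag), key=score)
```

Applied to a page with one link-heavy sidebar block and one article block, `best_candidate` picks the article block, because the sidebar is penalized both for its class name and for its high link density.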

Palladian

The Palladian content extractor extracts clean sentences from (English) texts; short phrases are not included in the output. For general content, consider Readability instead. The main difference is that the Palladian extractor also finds sentences in the comment sections of web pages.
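The effect of keeping only clean sentences can be sketched as a simple filter. The heuristics Palladian actually uses are not described here, so everything below is an assumption made for illustration: the function name, the splitting rule, and the minimum word count are hypothetical.

```python
import re

MIN_WORDS = 5  # hypothetical threshold, not taken from Palladian


def clean_sentences(text, min_words=MIN_WORDS):
    """Keep fragments that end in sentence punctuation and are long enough."""
    # Split after '.', '!' or '?' when followed by whitespace.
    candidates = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        c for c in candidates
        if c[-1:] in ".!?" and len(c.split()) >= min_words
    ]
```

With input such as a short teaser, one full sentence, and a trailing button label, only the full sentence survives, which mirrors the behaviour described above: short phrases are dropped from the output.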

Options

Input
Column in the input table that contains the DOM documents to extract content from.
Remove input column
Enable to drop the input column from the result table.
Output column prefix
The name prefix for the added columns.
Extraction algorithm
The extraction algorithm to use.

Input Ports

Input with HTML documents as XML cells.

Output Ports

Table with appended columns (main content text, main content XML node, document title)


Installation

To use this node in KNIME, install Palladian for KNIME from the following update site:

KNIME 4.3

A zipped version of the software site can be downloaded here.

Not sure what to do with this link? Read the NodePit Product and Node Installation Guide, which explains in detail how to install nodes in your KNIME Analytics Platform.

