This node is currently not available for KNIME v5.12; this page shows the documentation for KNIME v4.6 instead.

Content Extractor

This node is deprecated. It is kept for backwards compatibility, but its use in new workflows is no longer recommended. The documentation below might contain more information.

Note: We recommend replacing this node with the Web Page Content Extractor, which does not use KNIME’s Textprocessing cells as its output format and is thus more flexible.

This node provides several content extraction algorithms that extract the textual content from web pages and discard irrelevant elements such as navigation, ads, footers, and headers. The extractors are generally optimized for typical news article and blog post pages. The following algorithms are documented here:

Readability

A port of the JavaScript browser bookmarklet “Readability” by Arc90, a great tool for extracting content from HTML pages. “Readability […] takes a crack at wiping out all that junk so you can have a more enjoyable reading experience. […] its success rate is pretty respectable (we’d guess over 90% of web sites are handled properly)”. Readability operates on the document’s DOM tree and assigns each element a score based on its content. The scoring metrics are the length of the text content, the number of commas, and the link density. Class and id names are also taken into account; for example, elements with the class name sidebar are unlikely to contain actual content, in contrast to elements with the class name article.
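The scoring idea described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: the `Element` type, the name patterns, and all weights are assumptions made up for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical name patterns, chosen for illustration only.
POSITIVE = re.compile(r"article|content|post|body", re.I)
NEGATIVE = re.compile(r"sidebar|footer|nav|ad-", re.I)


@dataclass
class Element:
    text: str            # all text inside the element
    link_text: str = ""  # the part of that text that sits inside <a> tags
    class_id: str = ""   # class and id names joined into one string


def link_density(el: Element) -> float:
    """Share of the element's text that is link text (0.0 to 1.0)."""
    return len(el.link_text) / len(el.text) if el.text else 0.0


def score(el: Element) -> float:
    """Score an element; the weights are illustrative assumptions."""
    points = 1.0
    points += el.text.count(",")           # commas hint at running prose
    points += min(len(el.text) // 100, 3)  # longer text scores higher, capped
    if POSITIVE.search(el.class_id):
        points += 25
    if NEGATIVE.search(el.class_id):
        points -= 25
    # Link-heavy blocks (menus, teaser lists) are scaled down.
    return points * (1.0 - link_density(el))


article = Element(
    text="A long sentence, with commas, and more text. " * 5,
    class_id="article",
)
sidebar = Element(
    text="Home News Sport",
    link_text="Home News Sport",
    class_id="sidebar",
)
best = max([article, sidebar], key=score)
```

In this sketch, `best` is the element with class name article, because the sidebar consists entirely of link text and carries a negatively weighted class name.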

Palladian

The Palladian content extractor extracts clean sentences from (English) texts; short phrases are not included in the output. For general content, consider Readability instead. The main difference is that this extractor also finds sentences in the comment sections of web pages.
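The idea of keeping only complete sentences and dropping short phrases might look like the sketch below. It is not Palladian's algorithm: the naive regex-based sentence splitting, the function name `clean_sentences`, and the `MIN_WORDS` threshold are all assumptions for illustration.

```python
import re

MIN_WORDS = 5  # hypothetical threshold, not taken from Palladian


def clean_sentences(text: str, min_words: int = MIN_WORDS) -> list[str]:
    """Keep only text pieces that look like complete sentences."""
    kept = []
    for line in text.splitlines():
        # Naive split: break after ., ! or ? when followed by whitespace.
        for candidate in re.split(r"(?<=[.!?])\s+", line.strip()):
            ends_like_sentence = candidate.endswith((".", "!", "?"))
            if ends_like_sentence and len(candidate.split()) >= min_words:
                kept.append(candidate)
    return kept


page_text = (
    "Home | News | Sport\n"
    "The council approved the new budget on Monday. Read more\n"
    "Great article, thanks for sharing it!"
)
sentences = clean_sentences(page_text)
```

Here the navigation line and the "Read more" fragment are dropped, while the article sentence and the reader comment are kept, which mirrors the point that sentences in comment sections are found as well.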

Options

Input: Column in the input table with the DOM documents to extract.
Extraction algorithm: The extraction algorithm to use.

Input Ports

Input with (X)HTML documents parsed as DOM/XML.

Output Ports

Text documents with extracted content.

Views

This node has no views

Workflows

meister_nlp (NodePit Space)

Installation

To use this node in KNIME, install the extension Palladian for KNIME: Deprecated Nodes (depends on KNIME Textprocessing) from the update site below, following our NodePit Product and Node Installation Guide:

v4.6

A zipped version of the software site can be downloaded here.

Plugin provider: palladian.ws

Plugin version: 2.9.0.202309281529

On NodePit since: 2022-06-11

Last update: 2026-06-07

Tags: Deprecated

KNIME versions: From v3.6 to v5.2
