This node provides different content extraction algorithms, which allow extracting textual content from web pages, discarding irrelevant elements like navigation, ads, footers, headers, etc. The extractors are generally optimized for typical news article and blog post web pages.
The node expects an “XML Document” cell input which contains the HTML page from which to extract the content. To properly parse a web page, use Palladian’s HTML Parser node.
Currently, the following algorithms are provided:
A port of the JavaScript browser bookmarklet “Readability” by Arc90, a great tool for extracting content from HTML pages. “Readability […] takes a crack at wiping out all that junk so you can have a more enjoyable reading experience. […] its success rate is pretty respectable (we’d guess over 90% of web sites are handled properly)”. Readability operates on the document’s DOM tree: it assigns each element a score based on its contents. The scoring metrics are the length of the text content, the number of commas, and the link density. Class and id names are also taken into consideration; for example, an element with the class name sidebar is unlikely to contain actual content, in contrast to an element with the class name article.
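The scoring idea described for Readability can be illustrated with a minimal Python sketch. This is not the code of the Readability port used by this node; the `Element` class, the hint patterns, and all weights are hypothetical and chosen only to show how text length, comma count, link density, and class/id names might be combined into a single score.

```python
import re
from dataclasses import dataclass

# Hypothetical hint patterns and weights, for illustration only;
# they are not taken from Readability or Palladian.
POSITIVE_HINTS = re.compile(r"article|content|post", re.IGNORECASE)
NEGATIVE_HINTS = re.compile(r"sidebar|footer|nav", re.IGNORECASE)
NAME_WEIGHT = 25.0


@dataclass
class Element:
    """Simplified stand-in for a DOM element."""
    text: str             # all text inside the element, including link text
    link_text: str = ""   # the portion of the text that sits inside links
    class_name: str = ""
    id_name: str = ""


def link_density(el: Element) -> float:
    """Fraction of the element's text that belongs to links (0.0 to 1.0)."""
    if not el.text:
        return 0.0
    return min(len(el.link_text) / len(el.text), 1.0)


def score(el: Element) -> float:
    """Combine text length, commas, link density, and class/id hints."""
    # Longer text and more commas suggest running prose.
    content = min(len(el.text) // 100, 3) + el.text.count(",")
    # Link-heavy elements (menus, link lists) are scaled down.
    content *= 1.0 - link_density(el)

    # Class and id names shift the score up or down.
    names = f"{el.class_name} {el.id_name}"
    bonus = 0.0
    if POSITIVE_HINTS.search(names):
        bonus += NAME_WEIGHT
    if NEGATIVE_HINTS.search(names):
        bonus -= NAME_WEIGHT
    return content + bonus
```

With these assumed weights, a long, comma-rich element with class name article scores well above a short, link-only element with class name sidebar, which is the contrast the description draws.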
The Palladian content extractor extracts clean sentences from (English) texts; short phrases are not included in the output. For general content, consider Readability instead. The main difference is that the Palladian extractor also finds sentences in comment sections of web pages.
To use this node in KNIME, install the extension Palladian for KNIME from its update site, following the NodePit Product and Node Installation Guide.