HTML Parser

Palladian Nodes for KNIME Workbench version 1.8.0.201907271536 by palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky.

This HTML parser is based on Validator.nu.

Quotation from the web page: The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)

Options

Input
Column in the input table with HTTP Result or paths to local files to parse.
Make absolute URLs
When enabled, all relative URLs in the document are converted to absolute URLs. This simplifies further processing of the URLs obtained from the document and, in some cases, is what makes it possible at all.
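To illustrate what converting relative URLs to absolute ones means, here is a minimal sketch that uses the JDK's `java.net.URI` class. It is not the node's actual implementation. The class name `AbsoluteUrls`, its methods, and the example URLs are hypothetical and chosen only for this illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Palladian or KNIME.
final class AbsoluteUrls {

    /** Resolves one link against the URL of the document it was found in. */
    static String resolve(String documentUrl, String link) {
        // URI.resolve leaves links that are already absolute unchanged
        // and resolves relative ones against the document URL.
        return URI.create(documentUrl).resolve(link).toString();
    }

    /** Resolves every link in the list against the same document URL. */
    static List<String> resolveAll(String documentUrl, List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            result.add(resolve(documentUrl, link));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example document URL and links.
        String base = "https://example.org/docs/index.html";
        List<String> links = Arrays.asList(
                "page.html", "../img/logo.png", "/about", "https://other.example/a");
        for (String link : links) {
            System.out.println(link + " -> " + resolve(base, link));
        }
    }
}
```

Without this step, a link such as `../img/logo.png` is of little use to a downstream node, because it no longer carries the information about which document it came from.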

Input Ports

Input table containing HTTP Results, binary object data, or file paths with (X)HTML data to be parsed. Note: Although technically possible, passing HTTP links directly to the parser is not recommended. Use the “HTTP Retriever” node to download the content instead, and pass the resulting HTTP Results to this node.

Output Ports

Output table with the parsed (X)HTML documents appended. If a document cannot be parsed, a “missing value” is appended instead.
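The row-by-row behaviour described above (one parsed document per input, or a missing value on failure) can be sketched as follows. This is only an illustration of the pattern, under clearly stated assumptions: it uses the JDK's built-in XML parser as a stand-in rather than the Validator.nu HTML parser, it models the “missing value” as an empty `Optional`, and the class name `ParseOrMissing` is hypothetical. A strict XML parser rejects markup that an HTML5 parser would normally recover from, so the failure case here is far easier to trigger than it would be in the node itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the node's implementation.
final class ParseOrMissing {

    /** Returns one entry per input: the parsed document, or empty if parsing failed. */
    static List<Optional<Document>> parseAll(List<String> inputs) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        List<Optional<Document>> results = new ArrayList<>();
        for (String input : inputs) {
            try {
                // Stand-in parser: the JDK XML parser, which only accepts well-formed XHTML.
                DocumentBuilder builder = factory.newDocumentBuilder();
                // DefaultHandler rethrows fatal errors without printing them to stderr.
                builder.setErrorHandler(new DefaultHandler());
                Document doc = builder.parse(new InputSource(new StringReader(input)));
                results.add(Optional.of(doc));
            } catch (SAXException | IOException e) {
                // Parsing failed: record a "missing value" and continue with the next row.
                results.add(Optional.empty());
            } catch (ParserConfigurationException e) {
                throw new IllegalStateException("XML parser is not available", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Optional<Document>> results = parseAll(Arrays.asList(
                "<html><body><p>ok</p></body></html>",
                "<html><body><p>unclosed</body></html>"));
        for (Optional<Document> result : results) {
            System.out.println(result.isPresent() ? "parsed" : "missing");
        }
    }
}
```

The point of the design is that one failing input does not abort the whole run: the output keeps exactly one row per input row, so downstream steps can filter on the missing values.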

Installation

To use this node in KNIME, install Palladian for KNIME from the following update site:

KNIME 4.0