Clean HTML Retriever

Streamable

MMI Data Analytics Node extensions for KNIME Workbench version 0.0.14.qualifier by MMI Agency

This node takes a URL from a column and retrieves its content (assumed to be HTML) for parsing. If the HTML content is already available in another column, the node can use that content directly instead of fetching it from the URL. The HTML is then parsed and cleaned up using HtmlCleaner and output as XHTML. The result can be configured to be output as either String or XML type.
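
The per-row behaviour described above can be sketched roughly as follows. This is an illustrative Python sketch, not the node's actual implementation (the node itself relies on HtmlCleaner); the names fetch_url and retrieve_html, the default user agent string, the timeout, and the reading of "retries" as attempts made after the first failure are all assumptions made for the example. The cleaning step is left out.

```python
import urllib.request
from typing import Callable, Optional


def fetch_url(url: str, user_agent: str = "Mozilla/5.0", timeout: float = 10.0) -> str:
    """Download a page, sending the given User-Agent header (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def retrieve_html(
    url: str,
    content: Optional[str] = None,
    retries: int = 3,
    fetch: Callable[[str], str] = fetch_url,
) -> str:
    """Return the HTML for one row: prefer existing content, else fetch with retries."""
    if content:
        # The content column is filled, so no HTTP request is made.
        return content
    last_error: Optional[Exception] = None
    # Assumption: one initial attempt plus `retries` further attempts.
    for _ in range(1 + retries):
        try:
            return fetch(url)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
    raise RuntimeError(f"giving up on {url}") from last_error
```

Passing the fetcher in as a parameter keeps the selection and retry logic testable without network access; a call such as retrieve_html(url, content=row_content) never touches the network when row_content is non-empty.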

Options

URL Column Name
Name of the column that contains the URLs.
Content Column Name
Name of the column that contains HTML content. If content is available, the node uses it instead of fetching from the URL.
Output Column Name
Name of the column that holds the resulting parsed XHTML content. The default name is "XHTML".
Output result as XML
Output the result as String or XML type. XML type is useful when this node is part of an XML analysis workflow.
User agent
User agent string sent in the header of each HTTP request.
Number of retries
Number of retries per URL request after a failure.
Make absolute URLs
Convert all relative URLs in the document into absolute URLs.
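
As an illustration of what the "Make absolute URLs" option might do, here is a minimal Python sketch that resolves relative links in an already well-formed XHTML string against the page URL. It is not the node's implementation: the function name make_absolute, the choice of the href and src attributes, and the use of the row's URL as the base are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumption: only these attributes carry URLs worth rewriting.
LINK_ATTRIBUTES = ("href", "src")


def make_absolute(xhtml: str, base_url: str) -> str:
    """Rewrite relative href/src values in well-formed XHTML as absolute URLs."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        for name in LINK_ATTRIBUTES:
            value = element.get(name)
            if value is not None:
                # urljoin leaves already-absolute URLs unchanged.
                element.set(name, urljoin(base_url, value))
    return ET.tostring(root, encoding="unicode")
```

For example, with the base URL https://example.com/docs/page.html, a link to ../about.html resolves to https://example.com/about.html, while a link that is already absolute is left as it is. The sketch only works on well-formed input, which is why it would run after the cleaning step rather than on raw HTML.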

Input Ports

An input table that contains URL / content columns

Output Ports

An output table with the URL and XHTML results

Installation

To use this node in KNIME, install MMI Data Analytics Nodes from the following update site:

KNIME 4.3

A zipped version of the software site can be downloaded here.
