Clean HTML Retriever

This node takes a URL from a column and retrieves its content (assumed to be HTML) for parsing. If the HTML content is already available in another column, the node can use that content directly instead of fetching it from the URL. The HTML is then parsed and cleaned up using HtmlCleaner and output as XHTML. The result can be configured to be output as either String or XML type.
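The content-or-URL decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the node's actual implementation: the function names `http_get` and `resolve_html`, the default user agent string, the `delay` parameter, and the reading of "retries" as "one initial attempt plus N further attempts" are all hypothetical. The HtmlCleaner clean-up step is not reproduced here.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_get(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Fetch a URL with a custom User-Agent header and return the decoded body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def resolve_html(
    url: str,
    content: Optional[str],
    user_agent: str = "example-agent/1.0",  # hypothetical default
    retries: int = 3,
    fetch: Callable[[str, str], str] = http_get,
    delay: float = 0.0,
) -> str:
    """Return the HTML for one table row (hypothetical helper).

    HTML already present in the content column wins; otherwise the URL
    is fetched, retrying after failures.
    """
    if content:
        return content

    last_error: Optional[OSError] = None
    # Assumption: one initial attempt plus `retries` further attempts.
    for attempt in range(retries + 1):
        try:
            return fetch(url, user_agent)
        except OSError as error:  # urllib's URLError is a subclass of OSError
            last_error = error
            if attempt < retries:
                time.sleep(delay)
    assert last_error is not None
    raise last_error
```

Passing the fetch function as a parameter keeps the sketch testable without network access; a caller would normally rely on the default `http_get`.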

Options

URL Column Name
Name of the column containing the URLs.
Content Column Name
Name of the column containing HTML content. If content is available, the node will use it instead of fetching from the URL.
Output Column Name
Column name for the resulting parsed XHTML content. The default name is "XHTML".
Output result as XML
Output the result as String or XML type. The XML type is useful when this node is part of an XML analysis workflow.
User agent
User agent to send in the header of the HTTP request.
Number of retries
Number of retries per URL request after a failure.
Make absolute URLs
Convert all relative URLs in the documents into absolute URLs.
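
As an illustration of the "Make absolute URLs" option, the following Python sketch resolves relative `href` and `src` attributes in an XHTML string against the URL the document came from. It is a sketch under assumptions, not the node's own code: the function name `make_absolute_urls` and the choice of attributes are hypothetical, and the input is assumed to be well-formed XHTML without a namespace declaration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumption: only these attributes carry URLs worth rewriting.
URL_ATTRIBUTES = ("href", "src")


def make_absolute_urls(xhtml: str, base_url: str) -> str:
    """Rewrite relative URLs in well-formed XHTML to absolute ones."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        for name in URL_ATTRIBUTES:
            value = element.get(name)
            if value is not None:
                # urljoin leaves already-absolute URLs unchanged.
                element.set(name, urljoin(base_url, value))
    return ET.tostring(root, encoding="unicode")
```

For example, with the base URL `https://example.com/dir/index.html`, a link to `/docs/page.html` becomes `https://example.com/docs/page.html`, and an image at `img/logo.png` becomes `https://example.com/dir/img/logo.png`.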

Input Ports

An input table that contains URL / content columns

Output Ports

An output table with URL and XHTML results

Views

This node has no views
