Webpage Retriever

Streamable

KNIME REST Client Nodes version 4.2.0.v202006301416 by KNIME AG, Zurich, Switzerland

This node retrieves webpages by issuing HTTP GET requests and parsing the returned HTML. Parsing is done with the jsoup library, which implements the WHATWG HTML5 specification. The parsed HTML is cleaned by removing comments and, optionally, replacing relative URLs with absolute ones.
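The cleaning step can be illustrated with a short Python sketch. It is not the node's implementation (the node uses jsoup, a Java library); the class `Cleaner`, the function `clean_html`, and the choice to rewrite only `href` and `src` attributes are assumptions made for illustration. Text inside `script` or `style` elements is not treated specially here.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only these attributes are treated as URLs in this sketch.
URL_ATTRS = {"href", "src"}


class Cleaner(HTMLParser):
    """Re-emits the markup, drops comments, and makes URLs absolute."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. "disabled"
                parts.append(f" {name}")
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        pass  # comments are removed from the output


def clean_html(html, base_url):
    parser = Cleaner(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p><!-- note --><a href="/docs">Docs</a></p>'
print(clean_html(page, "https://example.com/a/b.html"))
```

Here `urljoin` resolves each relative URL against the address of the requested page, which is the usual way to make links absolute.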

By default, the output table contains a column with the parsed HTML converted into XHTML. Alternatively, you can choose to output the parsed HTML as a string.

The node allows you to either send a request to a fixed URL (which is specified in the dialog) or to a list of URLs provided by an optional input table. Every URL will result in one request which in turn will result in one row in the output table. You can define custom request headers in the dialog.
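The one-URL, one-request, one-row relationship can be sketched in Python as follows. The function names, the `fetch` callable, and the column names `URL` and `Document` are hypothetical and chosen for illustration only; they are not taken from the node.

```python
from urllib.request import Request


def build_requests(urls, headers=None):
    """Return one GET request per URL, all sharing the same custom headers."""
    return [Request(url, headers=dict(headers or {}), method="GET") for url in urls]


def retrieve(urls, fetch, headers=None):
    """Call `fetch` once per request and collect one output row per URL."""
    rows = []
    for request in build_requests(urls, headers):
        rows.append({"URL": request.full_url, "Document": fetch(request)})
    return rows


# A stand-in fetcher keeps the example offline; a real one could wrap
# urllib.request.urlopen(request).read().
rows = retrieve(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda request: "<html></html>",
    headers={"Accept": "text/html"},
)
print(len(rows))
```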

The node supports several authentication methods, e.g. BASIC and DIGEST. Other authentication methods may be provided by additional extensions.

Cookies can be sent to the server via the Request Headers tab by setting the "Cookie" header. To receive cookies, enable the "Extract cookies" option. Any cookies sent by the server are then extracted and appended as a List Cell in the output.
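At the HTTP level, cookies travel to the server in a single "Cookie" request header and come back in one "Set-Cookie" response header per cookie. The following Python sketch shows both directions; the function names are hypothetical, and the exact format of the entries in the node's list cell is not specified here, so returning the raw header values is an assumption.

```python
from email.message import Message


def cookie_header(cookies):
    """Build the value of a "Cookie" request header from name/value pairs."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def extract_cookies(headers):
    """Return all "Set-Cookie" values as a list, or None if the server sent none."""
    return headers.get_all("Set-Cookie")


# Simulated response headers; a real response object from urllib exposes
# its headers through the same get_all() method.
response_headers = Message()
response_headers["Set-Cookie"] = "session=abc; Path=/"
response_headers["Set-Cookie"] = "theme=dark"

print(cookie_header({"session": "abc", "theme": "dark"}))
print(extract_cookies(response_headers))
```

Returning `None` when no cookie is present mirrors the missing value that is appended when the server does not send cookies.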

Options

General Settings

URL
Select a constant URL or a column from the input table that contains the URLs that you want to request and parse.
Delay (ms)
The delay in milliseconds between two consecutive requests, e.g. to avoid overloading the web service.
Concurrency
The number of concurrent requests.
Ignore hostname mismatches
If checked, the node trusts the server's SSL certificate even if it was generated for a different host.
Trust all certificates
If checked, the node trusts all certificates regardless of their origin or expiration date.
Fail on connection problems (e.g. timeout, certificate errors, …)
By default, the node will fail if a request fails. If this option is unchecked, connection problems will result in a missing value in the output containing the error message and the node will continue.
Fail on http errors (e.g. page not found)
By default, the node will fail if a request fails. If this option is unchecked, failed requests (HTTP status codes 4xx and 5xx) will result in a missing value in the output containing the appropriate status code as well as the error message and the node will continue.
Follow redirects
If checked, the node will follow redirects (HTTP status code 3xx).
Timeout (s)
Timeout for a single request in seconds.
Output column name
The name of the created output column.
Output as XML
If checked, the output will be an XML column containing the parsed HTML converted into XHTML. Otherwise, the output will be a String column containing the parsed HTML.
Replace relative URLs with absolute URLs
If checked, relative URLs in the HTML will be replaced by the absolute ones. This may simplify further processing.
Extract cookies
If checked, the cookies sent by the server are extracted from the response and appended as a list column. A missing value is appended if the server doesn't send cookies.
Cookie column name
The name of the column containing a list of cookies in the output table.
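The effect of the "Fail on http errors" option can be sketched in Python as below. The `Missing` class and the function `cell_for_response` are hypothetical stand-ins, and the exact wording of the error message stored in the missing value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Missing:
    """Stand-in for a missing value that carries an error message."""
    message: str


def cell_for_response(status, body, reason="", fail_on_http_errors=True):
    """Return the output cell for one response, or raise if failing is enabled."""
    if 400 <= status < 600:  # 4xx and 5xx count as failed requests
        message = f"{status} {reason}".strip()
        if fail_on_http_errors:
            raise RuntimeError(message)
        return Missing(message)
    return body


print(cell_for_response(200, "<html></html>"))
print(cell_for_response(404, "", "Not Found", fail_on_http_errors=False))
```

With the option unchecked, a failed request yields a placeholder and the remaining URLs can still be processed. The "Fail on connection problems" option could be handled the same way by catching the connection error instead of inspecting a status code.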

Authentication

Type
The authentication type, e.g. no authentication, BASIC, or DIGEST.
Use credentials
Uses the selected credentials instead of the username and password provided in the dialog.
Username
The username used for authentication.
Password
The corresponding password used for authentication.
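For BASIC authentication, the username and password end up in an "Authorization" request header as specified by RFC 7617. A minimal Python sketch, with the hypothetical helper name `basic_auth_header`, looks like this:

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP BASIC authentication."""
    # "username:password" is Base64-encoded, not encrypted.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(basic_auth_header("Aladdin", "open sesame"))
```

Because the encoding is trivially reversible, BASIC authentication is generally only safe over HTTPS. DIGEST involves a challenge from the server and is not covered by this sketch.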

Request Headers

Merge
If you click this button, the request headers from the selected template on the left are merged with the existing header definitions in the table below.
Replace
If you click this button, the request headers from the selected template on the left replace any existing header definitions in the table below.
Header key
The HTTP header key, e.g. Accept or X-Custom-Key.
Header value
The value for the header, which can be a constant value or a reference to a flow variable, a column, a credential name, or a credential password (see the Header kind option).
Header kind
The kind of the value specified, which is either a constant value or a reference to a flow variable, a column, a credential name, or a credential password.
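The difference between Merge and Replace can be sketched with plain Python dictionaries. The function names are hypothetical, and the rule that a template value wins when both sides define the same key is an assumption; the description above does not say how the dialog resolves such conflicts.

```python
def merge_headers(existing, template):
    """Keep existing headers and add those from the template."""
    merged = dict(existing)
    merged.update(template)  # assumption: the template wins on key conflicts
    return merged


def replace_headers(existing, template):
    """Discard existing headers and use only those from the template."""
    return dict(template)


existing = {"Accept": "text/html", "X-Custom-Key": "abc"}
template = {"Accept": "application/xhtml+xml", "Accept-Language": "en"}

print(merge_headers(existing, template))
print(replace_headers(existing, template))
```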

Input Ports

Optional data table containing the variable parameters of the requests.

Output Ports

Data table containing the parsed HTML either as a string or as XHTML and, optionally, the cookies as a list of strings.

Installation

To use this node in KNIME, install KNIME REST Client Extension from the following update site:

KNIME 4.2

A zipped version of the software site can be downloaded here. Read our FAQs to get instructions about how to install nodes from a zipped update site.
