Webpage Retriever

This node retrieves webpages by issuing HTTP GET requests and parsing the returned HTML. Parsing is done with the jsoup library, which implements the WHATWG HTML5 specification. The parsed HTML is cleaned by removing comments and, optionally, by replacing relative URLs with absolute ones.
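The replacement of relative URLs with absolute ones can be illustrated with a minimal sketch. It uses Python's standard library rather than jsoup, so it is not the node's implementation; the names `LinkCollector` and `absolute_links` are hypothetical, and only `href` and `src` attributes are considered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href/src attribute values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Assumption: only href and src are treated as URL attributes.
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return all href/src values in `html` as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Absolute URLs already present in the page pass through `urljoin` unchanged, so the function can be applied to every link without checking first.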

By default, the output table will contain a column with the parsed HTML converted into XHTML. Alternatively, you can choose to output the parsed HTML as a string.

The node allows you to either send a request to a fixed URL (which is specified in the dialog) or to a list of URLs provided by an optional input table. Every URL will result in one request which in turn will result in one row in the output table. You can define custom request headers in the dialog.

The node supports several authentication methods, e.g. BASIC and DIGEST. Other authentication methods may be provided by additional extensions.
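For orientation, BASIC authentication amounts to sending a base64-encoded "username:password" pair in the Authorization header. The sketch below shows that encoding with Python's standard library; the function name `basic_auth_header` is hypothetical and the code is not taken from the node.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because base64 is an encoding and not encryption, BASIC authentication should only be used over HTTPS.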

Cookies can be sent to the server via the Request Header tab by setting the "Cookie" header. In order to receive cookies, set the "Extract cookies" option. Any cookies sent by the server are then extracted and appended as a List Cell in the output.
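The two directions of cookie handling can be sketched as follows: building a value for the "Cookie" request header, and reducing the server's Set-Cookie response headers to a list of strings. This is an illustration with Python's standard library; the helper names and the "name=value" format of the list entries are assumptions, not the node's documented output format.

```python
from http.cookies import SimpleCookie


def cookie_header(cookies):
    """Format a dict of cookies as the value of a 'Cookie' request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def parse_set_cookie(header_values):
    """Turn raw Set-Cookie header values into a list of 'name=value' strings."""
    result = []
    for raw in header_values:
        jar = SimpleCookie()
        jar.load(raw)  # attributes such as Path or Max-Age are parsed separately
        for name, morsel in jar.items():
            result.append(f"{name}={morsel.value}")
    return result
```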

The node supports the Credential port as input (see dynamic input ports). If the port is added, it must supply a Credential that can be embedded into the HTTP Authorization header, and all requests made by the node will use the Credential from the port, regardless of other node settings. For example, the OAuth2 Authenticator nodes provide such a Credential.

Options

General Settings

URL
Select a constant URL or a column from the input table that contains the URLs that you want to request and parse.
Handling of invalid URLs
Specifies how invalid URLs are handled. Depending on the selected mode, this node either inserts missing values as responses, fails the node execution on encountering the first invalid URL, or omits such rows from the output. The last option filters rows based on the validity of the value in the URL column. For REST client nodes, all URLs conforming to RFC 1738 and using the HTTP or HTTPS protocol are considered valid.
Delay (ms)
Specifies the delay in milliseconds between two consecutive requests, e.g. in order to avoid overloading the web service.
Concurrency
The number of concurrent requests.
Ignore hostname mismatches
If checked, the node trusts the server's SSL certificate even if it was generated for a different host.
Trust all certificates
If checked, the node trusts all certificates regardless of their origin or expiration date.
Follow redirects
If checked, the node will follow redirects (HTTP status code 3xx).
Send large data in chunks
Specifies whether HTTP Chunked Transfer Encoding is allowed to be used by the node. If enabled, messages with a large body size are sent to the server in a series of chunks.
Timeout (s)
Timeout for a single request in seconds.
Output column name
The name of the created output column.
Output as XML
If checked, the output will be an XML column containing the parsed HTML converted into XHTML. Otherwise, the output will be a String column containing the parsed HTML.
Replace relative URLs with absolute URLs
If checked, relative URLs in the HTML will be replaced by absolute ones. This may simplify further processing.
Extract cookies
If checked, the cookies sent by the server are extracted from the response and appended as a list column. A missing value is appended if the server doesn't send cookies.
Cookie column name
The name of the column containing a list of cookies in the output table.
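The three modes described under "Handling of invalid URLs" can be pictured with the following sketch. The mode names ("missing", "fail", "skip") and both function names are hypothetical, and the validity check is a simplification that only looks at the scheme and host; it is not a full RFC 1738 check and not the node's own logic.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Simplified check: HTTP or HTTPS scheme and a non-empty host part."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlparse(url.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def handle_urls(urls, mode):
    """Apply one of the three (hypothetically named) modes to a URL list."""
    if mode not in ("missing", "fail", "skip"):
        raise ValueError(f"Unknown mode: {mode!r}")
    output = []
    for url in urls:
        if is_valid_url(url):
            output.append(url)
        elif mode == "missing":
            output.append(None)  # stands in for a missing value in the row
        elif mode == "fail":
            raise ValueError(f"Invalid URL: {url!r}")
        # mode == "skip": the row is omitted from the output
    return output
```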

Authentication

Type
The authentication type, e.g. no authentication, BASIC, or DIGEST.
Use credentials
Uses the selected credentials instead of username and password provided in the dialog.
Username
The username used for authentication.
Password
The corresponding password used for authentication.

Proxy

Direct connection (no proxy)
This option disables the proxy for this Webpage Retriever node. This will bypass KNIME-wide proxies as well.
Use KNIME-wide proxy settings
Uses the same proxy as the KNIME platform. In the KNIME Analytics Platform, this can be changed under File > Preferences > General > Network Connections.
Use node-specific proxy settings
This option enables the configuration fields in the "Proxy" tab. The proxy settings apply only to this Webpage Retriever node.
Proxy Protocol
This option describes the proxy protocol to use. HTTP, HTTPS and SOCKS can be selected.
Proxy Host
Specifies the proxy host address.
Proxy Port
Specifies the port that should be used at the proxy host.
Workflow Credentials
If enabled, this option lets you select credentials stored in the workflow to be used for the username and password. The "Username" and "Password" fields then do not need to be filled in.
Username
If the option "Proxy host needs authentication" is enabled, this field specifies the username to use. Authentication against proxy hosts always uses Basic authentication.
Password
If the option "Proxy host needs authentication" is enabled, this field specifies the password to use. Authentication against proxy hosts always uses Basic authentication.
Excluded Hosts
If the option "Exclude hosts from proxy" is enabled, this field specifies the hosts that will be ignored by the proxy connection. Requests to excluded hosts will use a direct connection. If multiple hosts are specified, they should be separated using a vertical bar ('|').
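The vertical-bar format of the "Excluded Hosts" field can be illustrated with a short sketch. Both function names are hypothetical, and the sketch assumes exact, case-insensitive host matching; whether the node also supports wildcards is not stated here.

```python
def parse_excluded_hosts(setting):
    """Split a '|'-separated host list, ignoring blanks and letter case."""
    return [host.strip().lower() for host in setting.split("|") if host.strip()]


def uses_proxy(host, excluded_hosts):
    """Return False if the host is excluded and should be reached directly."""
    # Assumption: exact match only, no wildcard patterns.
    return host.lower() not in excluded_hosts
```

For example, the field value `localhost|127.0.0.1` would make requests to both of these hosts use a direct connection.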

Error Handling

Fail on connection problems (e.g. timeout, certificate errors, …)
This option describes what should happen if there was a problem establishing the connection to the server. The node either fails in execution or outputs status code and error message in the corresponding row.
Server-side errors (HTTP 5XX)
These options describe what should happen if a response with a 5XX status code is received. These status codes usually describe errors on the server side.
Client-side errors (HTTP 4XX)
These options describe what should happen if a response with a 4XX status code is received. These status codes usually describe client-side errors such as an incorrect web address.
Fail node execution or output missing value
This option describes what should happen after a request has failed. The node either fails in execution or outputs a missing value in the row of the output table that corresponds to this request. This option can be set separately for server- and client-side errors.
Retry on error
Specifies whether the node should retry a request if the initial request received a response indicating a server-side error.
Number of retries
The maximum number of retries to perform for server errors (count does not include the initial request).
Retry delay
The delay to apply between the first request and the first retry. For each subsequent retry, the delay is doubled.
Rate-limiting error (HTTP 429)
This status code can be returned by a server to indicate that the rate of incoming requests has been too high.
Pause execution
In case of a rate-limiting error, wait for the set amount of time before retrying the request. Note that this delay is static and does not increase with subsequent attempts, neither does it count as retries for server-side errors.
Output additional column with error cause
If enabled, each output row corresponding to a request will contain an additional cell that, in case the request has failed, will provide a description of the error cause. If the request was successful, the cell will contain a missing value.
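The retry behaviour for server-side errors (an initial delay that doubles with each subsequent retry, with the retry count excluding the initial request) can be sketched as below. The function names and the injected `send` and `sleep` callables are hypothetical; the static pause for HTTP 429 is not modelled.

```python
import time


def retry_delays(initial_delay, retries):
    """Delays before each retry: the first delay, then doubled each time."""
    return [initial_delay * 2 ** i for i in range(retries)]


def request_with_retries(send, retries, initial_delay, sleep=time.sleep):
    """Call `send()` and retry while it returns a 5XX status code.

    `send` is any callable returning an HTTP status code. The initial
    request does not count towards `retries`.
    """
    status = send()
    for delay in retry_delays(initial_delay, retries):
        if not 500 <= status <= 599:
            break
        sleep(delay)
        status = send()
    return status
```

With an initial delay of 2 and three retries, the waits before the retries are 2, 4 and 8 (in whatever unit the delay is configured).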

Request Headers

Merge
If you click this button the request headers from the selected template on the left are merged with the already existing header definitions in the table below.
Replace
If you click this button the request headers from the selected template on the left will replace any existing header definitions in the table below.
Header key
The HTTP header key, e.g. Accept or X-Custom-Key. Note that some header keys such as Origin are silently ignored by default for security reasons. You can configure KNIME AP to allow any header key by setting the sun.net.http.allowRestrictedHeaders system property in the knime.ini configuration file to true.
Header value
The value for the header which can be a constant value or a reference to a flow variable, a column, a credential name, or a credential password (see the kind option).
Header kind
The kind of the value specified, which is either a constant value or a reference to a flow variable, a column, a credential name, or a credential password.
Fail on missing header value
Setting this option makes the node fail as soon as a header input value is not available, e.g. due to a missing value. Enabled by default.
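How header values of different kinds might be resolved for one request can be sketched as follows. This is an illustration only: the kind names, the function `resolve_headers`, and the behaviour when "Fail on missing header value" is disabled (the header is skipped here) are assumptions, and the credential kinds are left out.

```python
def resolve_headers(definitions, row, flow_variables, fail_on_missing=True):
    """Resolve (key, value, kind) header definitions for a single request.

    `row` maps column names to cell values, `flow_variables` maps variable
    names to values; None stands in for a missing value.
    """
    headers = {}
    for key, value, kind in definitions:
        if kind == "constant":
            resolved = value
        elif kind == "column":
            resolved = row.get(value)
        elif kind == "flow_variable":
            resolved = flow_variables.get(value)
        else:
            raise ValueError(f"Unsupported header kind: {kind!r}")
        if resolved is None:
            if fail_on_missing:
                raise ValueError(f"No value available for header {key!r}")
            continue  # assumption: the header is omitted when not failing
        headers[key] = str(resolved)
    return headers
```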

Input Ports

Optional data table containing the variable parameters of the requests.
A Credential that can be embedded into the HTTP Authorization header. If this port is added, all requests made by the node will always use the Credential from the port, regardless of other node settings. For example, the OAuth2 Authenticator nodes provide such a Credential.

Output Ports

Data table containing the parsed HTML either as a string or as XHTML and, optionally, the cookies as a list of strings.

Views

This node has no views
