Webpage Loader

This node loads webpages by issuing HTTP GET requests and parsing the returned HTML. Parsing is done with the jsoup library, which implements the WHATWG HTML5 specification. The parsed HTML is cleaned by removing comments and, optionally, replacing relative URLs with absolute ones.

By default, the output table will contain a column with the parsed HTML converted into XHTML. Alternatively, you can choose to output the parsed HTML as a string.

The node allows you to either send a request to a fixed URL (which is specified in the dialog) or to a list of URLs provided by an optional input table. Every URL will result in one request which in turn will result in one row in the output table. You can define custom request headers in the dialog.
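The one-request-per-URL, one-row-per-request behavior can be sketched in Python as follows. This is only an illustration under stated assumptions, not the node's implementation: load_pages, the fetch callable, fake_fetch, and the row keys are hypothetical names, and the fail_on_error flag loosely mirrors the fail options described under General Settings.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional


def load_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_ms: int = 0,
    fail_on_error: bool = True,
) -> List[Dict[str, Optional[str]]]:
    """Issue one request per URL and return one row per request.

    Hypothetical helper: `fetch` stands in for whatever performs the
    HTTP GET and returns the page content as a string.
    """
    rows: List[Dict[str, Optional[str]]] = []
    for index, url in enumerate(urls):
        # Wait between consecutive requests (not before the first one).
        if index > 0 and delay_ms > 0:
            time.sleep(delay_ms / 1000.0)
        try:
            rows.append({"url": url, "html": fetch(url), "error": None})
        except Exception as exc:
            if fail_on_error:
                raise
            # Assumed shape of a "missing value with error message" row.
            rows.append({"url": url, "html": None, "error": str(exc)})
    return rows


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP GET, used for demonstration only."""
    if "broken" in url:
        raise IOError("connection timed out")
    return "<html><body>" + url + "</body></html>"


rows = load_pages(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
    fail_on_error=False,
)
```

With fail_on_error set to False, the second URL still produces a row, but its html entry is empty and the error entry carries the message. With the default of True, the first failing request would abort the whole run.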

The node supports several authentication methods, e.g. BASIC and DIGEST. Other authentication methods may be provided by additional extensions.

Options

General Settings

URL
Select a constant URL or a column from the input table that contains the URLs that you want to request and parse.
Delay (ms)
The delay between two consecutive requests, e.g. to avoid overloading the web service.
Concurrency
The number of concurrent requests.
Ignore hostname mismatches
If checked, the node trusts the server's SSL certificate even if it was generated for a different host.
Trust all certificates
If checked, the node trusts all certificates regardless of their origin or expiration date.
Fail on connection problems (e.g. timeout, certificate errors, …)
By default, the node will fail if a request fails. If this option is unchecked, connection problems will result in a missing value in the output containing the error message and the node will continue.
Fail on http errors (e.g. page not found)
By default, the node will fail if a request returns an HTTP error. If this option is unchecked, failed requests (HTTP status codes 4xx and 5xx) will result in a missing value in the output containing the appropriate status code as well as the error message, and the node will continue.
Follow redirects
If checked, the node will follow redirects (HTTP status code 3xx).
Timeout (s)
Timeout for a single request in seconds.
Output column name
The name of the created output column.
Output as XML
If checked, the output will be an XML column containing the parsed HTML converted into XHTML. Otherwise, the output will be a String column containing the parsed HTML.
Replace relative URLs with absolute URLs
If checked, relative URLs in the HTML will be replaced by the absolute ones. This may simplify further processing.
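
The effect of replacing relative URLs can be illustrated with Python's standard urllib.parse.urljoin. This is a minimal sketch of URL resolution in general, not the node's own code (the node relies on jsoup); the absolutize helper and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each (possibly relative) link against the page's URL."""
    return [urljoin(base_url, href) for href in hrefs]


# Assumed URL of the loaded page; links are resolved relative to it.
base = "https://example.com/docs/index.html"
resolved = absolutize(
    base,
    ["page.html", "../img/logo.png", "/about", "https://other.org/x"],
)
```

Links that are already absolute pass through unchanged, so downstream processing can treat every link the same way.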

Authentication

Type
The authentication type, e.g. no authentication, BASIC, or DIGEST.
Use credentials
Uses the selected credentials instead of the username and password provided in the dialog.
Username
The username used for authentication.
Password
The corresponding password used for authentication.
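
For orientation, the standard HTTP Basic scheme (RFC 7617) sends the username and password as a Base64-encoded Authorization header. The sketch below shows that encoding only; how the node builds the header internally is not stated here, and basic_auth_header is a hypothetical helper name.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}


# Example credentials taken from the RFC's own illustration.
header = basic_auth_header("Aladdin", "open sesame")
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over HTTPS.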

Request Headers

Merge
If you click this button, the request headers from the selected template on the left are merged with the existing header definitions in the table below.
Replace
If you click this button, the request headers from the selected template on the left replace any existing header definitions in the table below.
Header key
The HTTP header key, e.g. Accept or X-Custom-Key.
Header value
The value for the header which can be a constant value or a reference to a flow variable, a column, a credential name, or a credential password (see the kind option).
Header kind
The kind of the value specified, which is either a constant value or a reference to a flow variable, a column, a credential name, or a credential password.
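
The difference between Merge and Replace can be sketched with plain dictionaries. This is an illustration only: the function names are hypothetical, and the rule that a template value wins when both sides define the same key is an assumption, since the description above does not say how conflicts are resolved.

```python
def merge_headers(existing, template):
    """Combine template headers with the existing definitions.

    Assumption: on a key conflict the template value overrides the
    existing one. Keys are compared exactly (case-sensitive) here.
    """
    merged = dict(existing)
    merged.update(template)
    return merged


def replace_headers(existing, template):
    """Discard the existing definitions and keep only the template."""
    return dict(template)


existing = {"Accept": "text/html", "X-Custom-Key": "abc"}
template = {"Accept": "application/xhtml+xml", "User-Agent": "demo"}

merged = merge_headers(existing, template)
replaced = replace_headers(existing, template)
```

After merging, X-Custom-Key survives alongside the template's headers. After replacing, only the template's two headers remain.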

Input Ports

Optional data table containing the variable parameters of the requests.

Output Ports

Data table containing the parsed HTML either as string or as XHTML.

Views

This node has no views
