Table Extractor

This node converts HTML tables into KNIME tables. It uses some simple heuristics to determine the column names. The result is three KNIME tables: the first contains the HTML table’s content, and the second and third contain the HTML table’s header and footer, respectively.

As KNIME supports no spanning rows or columns, rowspan and colspan attributes in the HTML table are mapped by simply copying the original cell’s content to the spanning cells.
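The copying behaviour can be illustrated with a short Python sketch. This is not the node's actual implementation; the function name `expand_spans` and the `(text, rowspan, colspan)` tuple layout are assumptions made for illustration only.

```python
def expand_spans(rows):
    """Expand rowspan/colspan by copying the cell text into every covered position.

    rows: list of rows, each a list of (text, rowspan, colspan) tuples
    (a hypothetical input format, not the node's own data model).
    """
    grid = {}  # (row_index, col_index) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the text into every position covered by the span.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max([len(rows)] + [r + 1 for r, _ in grid])
    n_cols = max([0] + [c + 1 for _, c in grid])
    # Positions that no cell covers are returned as None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


body = [
    [("Cell 1", 1, 1), ("Cell 2", 1, 1), ("Cell 3", 1, 1)],
    [("Cell 4", 2, 1), ("Cell 5", 1, 2)],  # Cell 4 spans two rows, Cell 5 two columns
    [("Cell 6", 1, 1), ("Cell 7", 1, 1)],
]
for line in expand_spans(body):
    print(line)
```

In this sketch, “Cell 4” appears in the first column of both the second and third row, and “Cell 5” appears twice in the second row.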

The HTML table’s header is detected by checking whether (1) all cells of a row are of type th, or (2) the cells are contained within a thead element. The HTML table’s footer is detected through the tfoot element.
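The two header rules and the footer rule can be sketched with Python's standard `html.parser` module. The class `RowClassifier` and the function `classify_rows` are hypothetical names, and the sketch assumes a single, non-nested table with explicit closing tags. Giving tfoot precedence over the “all cells are th” rule is also an assumption; the node may resolve that case differently.

```python
from html.parser import HTMLParser


class RowClassifier(HTMLParser):
    """Labels each <tr> as "header", "footer" or "content" (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.section = None    # enclosing "thead", "tbody", "tfoot", or None
        self.cell_tags = None  # cell tag names of the row currently being parsed
        self.kinds = []        # one label per completed row

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tbody", "tfoot"):
            self.section = tag
        elif tag == "tr":
            self.cell_tags = []
        elif tag in ("th", "td") and self.cell_tags is not None:
            self.cell_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in ("thead", "tbody", "tfoot"):
            self.section = None
        elif tag == "tr" and self.cell_tags is not None:
            self.kinds.append(self._classify())
            self.cell_tags = None

    def _classify(self):
        if self.section == "tfoot":
            return "footer"
        # Rule (1): every cell of the row is a th; rule (2): the row sits in thead.
        all_th = bool(self.cell_tags) and all(t == "th" for t in self.cell_tags)
        if all_th or self.section == "thead":
            return "header"
        return "content"


def classify_rows(html):
    parser = RowClassifier()
    parser.feed(html)
    parser.close()
    return parser.kinds


sample = (
    "<table>"
    "<thead><tr><td>Title</td></tr></thead>"
    "<tr><th>A</th><th>B</th></tr>"
    "<tr><th>row label</th><td>1</td></tr>"
    "<tfoot><tr><td>Total</td></tr></tfoot>"
    "</table>"
)
print(classify_rows(sample))
```

Note that a row mixing th and td cells outside of thead counts as content in this sketch, because not all of its cells are th.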

The KNIME table’s column names are generated from the HTML table’s header. If the header is a single row, the column names equal the HTML table’s header names. If the header has more than one row, the header cells of each column are concatenated from top to bottom with “ > ”. If the HTML table contains no header, the column names are synthetically generated (“column0”, “column1”, …).
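The naming rules can be written down in a few lines of Python. This is a sketch of the described behaviour, not the node's code; `column_names` is a hypothetical helper, and it assumes the header rows have already been expanded so that each row has one entry per column.

```python
def column_names(header_rows, n_columns):
    """Derive column names from already-expanded header rows (illustrative sketch)."""
    if not header_rows:
        # No header: fall back to synthetic names.
        return [f"column{i}" for i in range(n_columns)]
    # One or more header rows: join each column's cells from top to bottom.
    return [" > ".join(row[i] for row in header_rows) for i in range(n_columns)]


print(column_names([["Name", "Price"]], 2))
print(column_names([["Header A"] * 3, ["Header B", "Header C", "Header D"]], 3))
print(column_names([], 3))
```

With a single header row the join has nothing to concatenate, so the names are the header cells themselves.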

Here’s an example HTML table:

Header A (spans three columns)
Header B | Header C | Header D
Cell 1 | Cell 2 | Cell 3
Cell 4 (spans two rows) | Cell 5 (spans two columns)
Cell 6 | Cell 7

This will result in the following KNIME content table:

Header A > Header B | Header A > Header C | Header A > Header D
Cell 1 | Cell 2 | Cell 3
Cell 4 | Cell 5 | Cell 5
Cell 4 | Cell 6 | Cell 7

And the KNIME header table:

Header A > Header B | Header A > Header C | Header A > Header D
Header A | Header A | Header A
Header B | Header C | Header D

The footer table would be empty in this example, because the HTML table contains no footer.

Options

Search Mode
Select “Input Column” if your input table already contains the WebElement which you want to use. Otherwise select “Find Element(s)” to open a picker dialog which allows you to choose the WebElement(s). For more information about the dialog, please check the documentation of the “Find Elements” node.
Input
(when mode is “Input Column”) Input column which provides the WebElement(s)
Remove input column
(when mode is “Input Column”) Removes the input column from the result table
Selector
(when mode is “Find Element(s)”) Click the “Edit” button to open the picker dialog.
Property
Select which property to extract
  • innerHTML: The inner HTML markup of each cell
  • outerHTML: The outer HTML markup of each cell (same as innerHTML, but including the cell’s own markup as well)
  • innerText: The rendered text content of the cell (consider it as the result a user would get if they highlighted the cell with the cursor and copied the text to their clipboard)
  • textContent: The text content of the cell, which (in contrast to innerText) also includes e.g. parts which are invisible
Treat rowspan and colspan as missing cells
Enable to create empty strings or missing values (depending on the next setting) for cells covered by a rowspan or colspan instead of repeating the original value.
Create missing values for missing cells
If this option is checked, missing value cells (instead of empty string cells) will be created for those cells which are not explicitly defined within the HTML table.
Skip empty rows
Enable to omit entirely empty rows from the output tables.
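
How the last three options might interact can be sketched in Python. This is an interpretation of the option descriptions rather than the node's implementation: the function `extract_rows` and its parameter names are hypothetical, `None` stands in for a KNIME missing value, and treating a row as “empty” when every cell is an empty string or missing is an assumption.

```python
def extract_rows(rows, spans_as_missing=False, missing_values=False,
                 skip_empty_rows=False):
    """rows: list of rows, each a list of (text, rowspan, colspan) tuples."""
    filler = None if missing_values else ""  # None stands in for a missing value
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # position taken by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    is_origin = dr == 0 and dc == 0
                    # Either repeat the value or leave spanned positions empty.
                    keep = is_origin or not spans_as_missing
                    grid[(r + dr, c + dc)] = text if keep else filler
            c += colspan
    n_rows = max([len(rows)] + [r + 1 for r, _ in grid])
    n_cols = max([0] + [c + 1 for _, c in grid])
    # Cells that the HTML never defines also receive the filler.
    table = [[grid.get((r, c), filler) for c in range(n_cols)]
             for r in range(n_rows)]
    if skip_empty_rows:
        table = [row for row in table
                 if any(cell not in (None, "") for cell in row)]
    return table


rows = [
    [("Cell 4", 2, 1), ("Cell 5", 1, 2)],
    [("Cell 6", 1, 1), ("Cell 7", 1, 1)],
    [("", 1, 1)],  # a row whose only cell is empty
]
print(extract_rows(rows))
print(extract_rows(rows, spans_as_missing=True, missing_values=True))
print(extract_rows(rows, skip_empty_rows=True))
```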

Input Ports

Table with a column providing a WebElement which represents the table to extract

Output Ports

The table content
The table’s header data
The table’s footer data

Views

This node has no views
