XML Easy Reader 1

v1.10 - 10 January 2022
PLEASE NOTE THIS COMPONENT IS STILL PROTOTYPE AND SUBJECT TO SOME CHANGES - FEEDBACK WELCOME!

Python Script updated with some improvements, and also recoded so that it can be tested (in part) outside of KNIME (e.g. using VSCode) for easier development.

Reads the supplied XML file, using the specified path as a local file system file, but if that fails, attempts to read it as a URL.

This component uses Python 3 so you must have Python 3 installed and available in your KNIME environment. It makes use of the following Python modules: cElementTree, pandas, urllib

The XML data is output in grouped tabular format, which means that the rows should be ungrouped (use an upgroup node). Those data items that are expected to be repeated across all rows for a "group" should be excluded from the selection of columns to be ungrouped. In that way, repeated data is "copied down" where appropriate across row items.

Outputs of the columns and their paths is generated on the "Column Paths" port and on the "Path to Column Mapping" port. The Column Paths port is "by column name" and so if there is column-name clash (which can occur if more than one element in the XML has the same element-name) the resulting rows on this port will be deficient, as will the resulting data output.

The "Path to Column Mapping" port shows the same information, but is "path centric" and so will contain any columns for which "name clash" has occurred.

The "Column Name Clash" port will identify clashing names. This port should return no data if no name clash occurred, but can be used to quickly verify that all expected columns have been handled correctly.

The name of a csv "Column Name to Path" mapping file can be supplied, which allows you to specify which elements/columns to return, based on their paths. By specifying a different column name here, the column will be renamed on the output.

Paths follow a basic "pseudo xpath" format. No additional xpath syntax should be used as it will not be recognised, and will result in data in the file being ignored.

Element paths are defined by the format //element1/element2/element3
Attribute paths are defined by the format //element1/element2/element3/@attributename

Rows in the Column Name - Path mapping table can be "commented out". To do this, all that is necessary is that the path be "invalidated", and this can easily be achieved by, for example, adding a '#' to the end of the line

e.g. In the following example, the paths for the * and orderperson lines have been "invalidated" so are ignored
Column Name,Path
*,*#
Order Id,//shiporder/@orderid
orderperson,//shiporder/orderperson#

The path will change if you specify a different collection subtree, and/or root. If you are having difficulty working out the correct path, execute the node and take a look in the Column Paths output port to see what the paths are with the current configuration.

v 1.0 (Prototype) @takbb Brian Bates

This is a fully functioning prototype, but may well be suitable for your needs. If you wish to use it, please test it with your data to see that it works well for you before relying on it!
Please provide feedback on any issues found, or any suggestions for improvement, or usability.

Options

Auto remove empty columns: Automatically delete any empty columns that are generated from the xml. This removes any grouping columns that contain no data of their own. Note though that ticking this option may also remove other columns that simply happen to contain no data in this xml file.%%00010%%00010Uncheck this if you want to return empty columns (including grouping columns) and handle them later in the workflow
Use Full column names: Causes column names to be based in full path names rather than just element names, removing the column-name-clash.
Auto Group Pad: Automatically pad any elements that are found to contain more than one value within their parent-node. %%00010%%00010If set on, this may slow processing considerably on large data sets, as it keeps a count of the entries for every node and attempts to pad any optional entries for all children of the same node. The recommendation is to switch this off and to manually define any optional groups based on user-knowledge of the data set, or XSD.%%00010%%00010If set off, no automatic group detection occurs, and any groups that may contain optional elements will potentially have mismatches of what should otherwise be related data items within their child elements, unless the group is specified as an Optional Element Group.%%00010%%00010
Maximum Auto Unpack Threshold: The maximum threshold for which automatic unpacking of column data will take place. Any column for which there is a single row containing greater than than this number of entries will not be automatically unpacked.%%00010%%00010The higher the number the longer the process of unpacking will take, and the greater is the chance that this could lead to inconsistent results, or very large result sets containing large numbers of duplucated rows. (This will be especially the case if the XML contains many discrete sub trees each with multiple values, as the end result will be the product of the values from each subtree).%%00010%%00010Keep this number low unless you are satisfied that there is only a single tree, or that there are not large numbers of entries for single elements across multiple subtrees.%%00010%%00010This can be used in conjunction with the auto-unpack exclusion list so that specific problematic columns can be excluded from auto-unpacking, and then the threshold can be set higher.%%00010%%00010The default (-1) means no automatic unpacking takes place.%%00010%%00010Depending on the data, the maximum unpack depth may not unpack all elements. This is to protect against cartesian products of very large nested subtrees. If this is found to be case, and unpacking is required, the Maximum Auto Unpack Depth should be set to -1, and then a series of UnGroup nodes used to perform the unpacking in a defined sequence, or alternatively auto-unpack to a certain depth and then use other nodes to handle the nested data as appropriate.
Limit to First n Rows: Limit the XML reader to reading the first n rows. Default 0 means read entire file
Root Element: Enter the name of the root element, If left blank, the default will be the first element found
Row element: Enter the name of the row element. If left blank, the default will be the first child element of the root (or subtree) element
Sub Tree element: Enter the name of the element identifying the subtree to be read (or leave blank). The subtree defines the actual tree to be parsed and treated as if it were an individual row. (You can think of this in it treated a portion of the xml document as if it were actually the entire document for the purposes of retrieving the required information, or put another way, think of this as an "alternative root" element, if the document consists of multiple sub-trees below root). %%00010%%00010You should specifiy the subtree as being the smallest tree within the document that contains all the data required for collection (including any specified "key element").%%00010%%00010You can leave this cell blank, and simply change the row element if you prefer, but sometimes it is better for documentary purposes to note what the "actual" row element in an XML file is, and simply adjust the sub tree for specific processing.
Key element: Enter the path(s) for an identifiers or other context-provider elements and attributes within the returned structure. This may be left blank, and is only required if there are elements or attributes that are needed for context, that is not available from within the Collection Subtree. They key element MUST be below the element identified as the Row element (or Subtree Element if specified) if it is to have its data collected.
Collect from subtree: Enter the top level node for subtree for which data should be collected. Regardless of the other settings, if a value is entered here, it must appear as a substring of the path for an element in order that an element's value be collected for output. The only exception to this is an element defined as the "key element", which must be subordinate to the row element (or sub tree element, if a subtree is specified), but does not have to be in the collection subtree.%%00010%%00010The collection subtree must be subordinate to the row element (subordinate to the sub tree element, if one is specified)
Optional Element Groups: Enter the path(s) for any group elements that contain optional elements. Optional elements may cause mismatches of data within a group unless these are known about. Specify the full path to the group element. %%00010%%00010The immediate sub-element for the groups will be handled to pad missing elements as they arise
Auto Unpack Exclusions: Enter the output column names paths for all elements/attributes that should not be unpacked (one per line)%%00010%%00010A wildcard character * may be appended to the end of the path to indicate that the exclusion applies to all paths beginning with the sequence of characters. No other wildcards may be used, and the *, if used, may appear only at the end of the name.
XML File location (absolute file reference or URL): Enter the full path to the XML file. This can either be an absolute file path to the XML file, or a URL. If the location given cannot be interpreted as either one of those, the node will fail.
Column Name - Path Mapping File: Enter the full path to a CSV file containing two columns:%%00010%%00010Column Name%%00010Path%%00010%%00010If this is specified, it will be used in place of an XSD file, and therefore allows specific tailoring of the resultant output to include only specific elements from the XML, or to rename elements%%00010%%00010If the file contains a row%%00010*,*%%00010All available elements will be returned from the XML, otherwise only elements that can be identified from the paths given will be returned.%%00010%%00010The column name for a given path defines the custom column name to be returned for the given path. If *,* is specified, but there is no column name entry for the given path, the element name from the xml will be used. %%00010%%00010Note in this version, that if more than one element has the same name, "name clash" will occur, and the resulting column contents is undefined. This will be resolved in a future version.%%00010%%00010It is suggested, therefore that initially, no csv file is used for an initial "test run", and then the resultant "Column Path" data output is used as the basis of generating this file, and then edited to avoid column name clash.%%00010%%00010Paths are a simplied "xpath" format and contain only the following:%%00010basic element: \element\element2\element3 %%00010basic attribute: \element\element2\element3\@attribute%%00010 %%00010%%00010

Input Ports

This node has no input ports

Output Ports

: The data read from the XML file, based on the column names created either from the raw element/attribute names, or from the Column Name - Path mapping file (if supplied)
: A list of column names, with their associated paths that can be used to create a Column Name - Path mapping file.
: The mapping of paths to column names which can be used to manually check for column name clash. If two paths result in the same column name, you should create a Column Name - Path mapping file (using the output from Port 2) and edit the column names to be returned
: Shows any mappings for which "Column Name clash" has occurred. If there is any data on this port, steps should be taken to provide a suitable mapping file to resolve the issue.

XML Easy Reader 1

Options

Input Ports

Output Ports

Nodes

Extensions

Links

Download