CSV Reader (Labs)

Use this node to read CSV files into your workflow. The node produces a data table whose number and types of columns are guessed automatically.

Options

Source
Select the file location that stores the data you want to read. When you click the browse button, there are two default file system options to choose from:
  • The current Hub space: Allows you to select a file relative to the Hub space on which the workflow is run.
  • URL: Allows you to specify a URL (e.g. file://, http:// or knime:// protocol).
File encoding
Defines the character set used to read a CSV file that contains characters in a different encoding. You can choose from a list of character encodings (UTF-8, UTF-16, etc.), or specify any other encoding supported by your Java Virtual Machine (VM). The default value uses the default encoding of the Java VM, which may depend on the locale or the Java property "file.encoding".
  • OS default: Uses the default encoding set by the operating system.
  • ISO-8859-1: ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1.
  • US-ASCII: Seven-bit ASCII, also referred to as US-ASCII.
  • UTF-8: Eight-bit UCS Transformation Format.
  • UTF-16: Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark in the file.
  • UTF-16BE: Sixteen-bit UCS Transformation Format, big-endian byte order.
  • UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order.
  • Other: Enter a valid charset name supported by the Java Virtual Machine.
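The effect of the chosen character set can be reproduced outside KNIME with a few lines of Python (a minimal sketch, not the node's implementation; the sample text is an assumption):

    # Decode the same bytes with two different character sets.
    raw = "Größe;Ort\n1,5;Köln\n".encode("iso-8859-1")

    print(raw.decode("iso-8859-1"))               # correct: Größe;Ort ...
    print(raw.decode("utf-8", errors="replace"))  # mojibake: Gr��e;Ort ...

Picking the wrong character set does not necessarily fail; it silently produces garbled values, which is why the encoding should match the file.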
Custom encoding
A custom character set used to read a CSV file.
Skip first lines of file
Use this option to skip lines that do not fit the table structure (e.g. multi-line comments).
The specified number of lines is skipped in the input file before parsing starts. Skipping lines prevents parallel reading of individual files.
Comment line character
Defines the character indicating line comments.
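Taken together, "Skip first lines of file" and the comment line character behave roughly like this plain-Python sketch (not KNIME's implementation; the sample lines and the '#' comment character are assumptions):

    import csv, io

    data = io.StringIO(
        "exported by tool X\n"   # skipped via 'Skip first lines of file'
        "# a comment line\n"     # dropped via the comment character '#'
        "a,b\n"
        "1,2\n"
    )
    skip_first, comment_char = 1, "#"

    lines = data.read().splitlines()[skip_first:]
    lines = [ln for ln in lines if not ln.startswith(comment_char)]
    print(list(csv.reader(lines)))  # [['a', 'b'], ['1', '2']]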
Row delimiter
Defines the character string delimiting rows. It can be detected automatically.
  • Line break: Uses the line break character as row delimiter. This option is platform-agnostic.
  • Custom: Uses the provided string as row delimiter.
Custom row delimiter
Defines the string to be used as the custom row delimiter.
Column delimiter
Defines the character string delimiting columns. Use '\t' for tab characters. It can be detected automatically.
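As a rough illustration outside KNIME, a custom row delimiter amounts to splitting the raw text into rows before splitting each row into columns; Python's csv module, which only supports line-break rows, exposes the column delimiter directly (a sketch, with both delimiter choices assumed):

    import csv

    # Custom row delimiter ';' and column delimiter '|' (both assumptions).
    text = "a|b;1|2;3|4;"
    rows = [line.split("|") for line in text.split(";") if line]
    print(rows)  # [['a', 'b'], ['1', '2'], ['3', '4']]

    # With the standard line-break row delimiter, csv handles '\t' directly:
    print(list(csv.reader(["x\ty"], delimiter="\t")))  # [['x', 'y']]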
Quoted strings contain no row delimiters
Check this box if no quoted strings in the files contain row delimiters. Row delimiters must not occur inside quotes for individual files to be read in parallel.
Quote character
The character indicating quotes. It can be detected automatically.
Quote escape character
The character used for escaping quotes inside an already quoted value. It can be detected automatically.
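The two quoting options above can be mimicked with Python's csv module (a sketch; which escape style a file actually uses is an assumption you would normally leave to autodetection):

    import csv, io

    # Escape-character style: '\' escapes a quote inside a quoted value.
    data = io.StringIO('Ada,"she said \\"hi\\""\n')
    rows = list(csv.reader(data, quotechar='"', escapechar="\\", doublequote=False))
    print(rows)  # [['Ada', 'she said "hi"']]

    # Doubled-quote style (RFC 4180 default): "" stands for one quote.
    print(list(csv.reader(['Ada,"she said ""hi"""'])))  # [['Ada', 'she said "hi"']]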
Autodetect format
Pressing this button guesses the file format automatically. The detected values are not guaranteed to be correct.
Number of characters for autodetection
Specifies how many characters of the selected file should be used for autodetection. By default, autodetection is based on the first 1024 * 1024 characters.
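Python's csv.Sniffer performs a comparable detection on a bounded sample of the file, which makes the trade-off visible (a sketch, not the node's detection code; the sample is an assumption):

    import csv

    sample = "a;b;c\n1;2;3\n4;5;6\n"   # in practice: the first 1024 * 1024 characters
    dialect = csv.Sniffer().sniff(sample)
    print(dialect.delimiter)                 # ';'
    print(csv.Sniffer().has_header(sample))  # True (a heuristic, not a guarantee)

A larger sample makes detection more reliable but slower, and either way the result remains a guess.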
First row contains column names
Select this box if the first row contains column name headers.
Skip first data rows
Use this option to skip the specified number of valid data rows. This has no effect on which row will be chosen as a column header. Skipping rows prevents parallel reading of individual files.
Limit number of rows
If enabled, only the specified number of data rows are read. The column header row (if selected) is not taken into account. Limiting rows prevents parallel reading of individual files.
Maximum number of rows
Defines the maximum number of rows that are read.
First column contains RowIDs
Select this box if the first column contains RowIDs (no duplicates allowed).
If row has fewer columns
Specifies the behavior in case some rows are shorter than others.
  • Fail: If there are shorter rows in the input file, the node execution fails.
  • Insert missing: The shorter rows are completed with missing values.
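A minimal sketch of the two behaviors in plain Python (None stands in for a missing cell):

    rows = [["a", "b", "c"], ["1", "2"]]   # the second row is short
    width = max(len(r) for r in rows)

    # 'Insert missing': pad short rows with missing cells
    padded = [r + [None] * (width - len(r)) for r in rows]
    print(padded)  # [['a', 'b', 'c'], ['1', '2', None]]

    # 'Fail': raise instead of padding
    # if any(len(r) < width for r in rows): raise ValueError("row too short")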
Decimal separator
Specifies the decimal separator character for parsing numbers. The decimal separator is only used for the parsing of double values. Note that the decimal separator must differ from the thousands separator. You must always provide a decimal separator.
Thousands separator
Specifies the thousands separator character for parsing numbers. The thousands separator is used for integer, long and double parsing. Note that the thousands separator must differ from the decimal separator. It is possible to leave the thousands separator unspecified.
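Parsing with both separators boils down to stripping the thousands separator and normalizing the decimal separator, as in this sketch (German-style separators are assumed):

    def parse_double(text, decimal_sep=",", thousands_sep="."):
        # The two separators must differ, as the dialog requires.
        assert decimal_sep != thousands_sep
        return float(text.replace(thousands_sep, "").replace(decimal_sep, "."))

    print(parse_double("1.234,56"))  # 1234.56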
Replace empty quoted strings by missing values
Select this box if you want quoted empty strings to be replaced by missing value cells.
Quoted strings
Specifies the behavior in case there are quoted strings in the input table.
  • Remove quotes and trim whitespace: Quotes are removed from the value, and any leading/trailing whitespace is trimmed.
  • Keep quotes: Quotes of a value are kept. Note that no whitespace is trimmed inside the quotes.
Limit scanned rows
If enabled, only the specified number of input rows are used to analyze the file (i.e. to determine the column types). This option is recommended for long files where the first n rows are representative of the whole file. The "Skip first data rows" option has no effect on the scanning. Note also that this option and the "Limit number of rows" option are independent of each other: if the value in "Limit number of rows" is smaller than the value specified here, the node still reads as many rows as specified here.
If schema changes
Specifies the node behavior if the content of the configured file/folder changes between executions, i.e. columns are added to or removed from the file(s), or their types change. The following options are available:
  • Fail: If set, the node fails if the column names in the file have changed. Changes in column types will not be detected.
  • Use new schema: If set, the node will compute a new table specification for the current schema of the file at the time when the node is executed. Note that the node will not output a table specification before execution and that it will not apply transformations, therefore the transformation tab is disabled.
  • Ignore (deprecated): If set, the node tries to ignore the changes and outputs a table with the old table specification. This option is deprecated and should never be selected for new workflows, as it may lead to invalid data in the resulting table. Use one of the other options instead.
Maximum number of columns
Sets the maximum number of allowed columns (default: 8192) to prevent memory exhaustion. The node will fail if the number of columns exceeds the set limit.
Limit memory per column
If selected, the memory per column is restricted to 1 MB in order to prevent memory exhaustion. Uncheck this option to disable these memory restrictions.
How to combine columns
Specifies how to deal with reading multiple files in which not all column names are identical.
  • Fail if different: The node will fail if multiple files are read and not all files have the same column names.
  • Union: Any column that is part of any input file is considered. If a file is missing a column, it is filled up with missing values.
  • Intersection: Only columns that appear in all files are considered for the output table.
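The three strategies above can be mimicked with pandas (a sketch; the two frames stand in for two input files):

    import pandas as pd

    f1 = pd.DataFrame({"a": [1], "b": [2]})
    f2 = pd.DataFrame({"b": [3], "c": [4]})

    # 'Fail if different': compare the column lists up front
    # if list(f1.columns) != list(f2.columns): raise ValueError("column names differ")

    union = pd.concat([f1, f2], join="outer", ignore_index=True)  # missing -> NaN
    intersection = pd.concat([f1, f2], join="inner", ignore_index=True)
    print(list(union.columns), list(intersection.columns))  # ['a', 'b', 'c'] ['b']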
Prepend file index to RowID
Select this box if you want to prepend a prefix that depends on the index of the source file to the RowIDs. The prefix for the first file is "File_0_", for the second "File_1_" and so on. This option is useful if the RowIDs within a single file are unique but the same RowIDs appear in multiple files. Prepending the file index prevents parallel reading of individual files.
Append file path column
Select this box if you want to add a column containing the path of the file from which the row is read. The node will fail if adding the column with the provided name causes a name collision with any of the columns in the read table.
File path column name
The name of the column containing the file path.
Enforce types
Controls how columns whose type has changed are handled. If selected, the node attempts to map the column to the KNIME type you configured and fails if that is not possible. If unselected, the KNIME type corresponding to the new type is used.
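A sketch of the two behaviors in plain Python, with Integer as the assumed configured type and Double as the new type found in the file:

    def read_column(values, enforce, configured_type=int):
        # 'Enforce types' checked: try the configured type, fail otherwise.
        try:
            return [configured_type(v) for v in values]
        except ValueError:
            if enforce:
                raise                              # node execution fails
            return [float(v) for v in values]      # fall back to the new type

    print(read_column(["1", "2"], enforce=True))     # [1, 2]
    print(read_column(["1.5", "2"], enforce=False))  # [1.5, 2.0]
    # read_column(["1.5"], enforce=True) -> ValueError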
Transformations
Use this option to modify the structure of the table. You can deselect a column to filter it out of the output table, use the arrows to reorder columns, or change the name or type of each column. Note that the column positions are reset in the dialog if a new file or folder is selected. Whether and where to add unknown columns during execution is specified via the special row <any unknown new column>, where you can also select the type to which new columns should be converted. Note that the node will fail if this conversion is not possible, e.g. if the selected type is Integer but the new column is of type Double.

Input Ports

The file system connection.

Output Ports

Data table based on the file being read, with the number and types of columns guessed automatically.


Views

This node has no views
