Text Chunker

Text chunking is a technique for splitting larger documents into smaller chunks. Consecutive chunks overlap so that each chunk retains some of the preceding context. Both the chunk size and the overlap can be configured.

For generic text, the node tries to preserve semantic relations by keeping the sentences of a paragraph in the same chunk where possible. If a specific programming or markup language is selected, the node takes language-specific syntax into account when splitting the document.
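
The preference for keeping paragraphs and sentences together can be pictured as trying coarse separators first and falling back to finer ones only when a piece is still too long. The following is a minimal sketch of that idea, not the node's actual implementation: the function name, the separator list, and the greedy packing strategy are all assumptions made for illustration, and overlap is left out to keep the example short.

```python
def split_recursively(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters.

    Illustrative assumption: separators are tried from coarsest
    (paragraph break) to finest (space), so paragraphs and sentences
    stay together whenever they fit into one chunk.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next finer one.
        return split_recursively(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # The part still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: split it at finer separators.
            chunks.extend(split_recursively(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Called with a two-paragraph text and a chunk size smaller than the whole text, the sketch keeps a short paragraph intact and splits only the paragraph that does not fit, at word boundaries.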

Options

Document column

Select the column containing the documents to be chunked.

Chunk size

Specify the maximum size of a chunk, in characters.

Chunk overlap

Specify the number of characters by which consecutive chunks should overlap.
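
To illustrate how chunk size and overlap interact, here is a minimal sketch of a fixed-size sliding window over characters. It ignores separators entirely and is not the node's implementation; the function name and the validation rules are assumptions made for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into chunks of at most chunk_size characters, where each
    chunk starts with the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # Assumption for this sketch: the overlap must be smaller than
        # the chunk size, otherwise the window would never advance.
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With a chunk size of 4 and an overlap of 2, the text "abcdefghij" yields "abcd", "cdef", "efgh", and "ghij": the window advances by chunk size minus overlap, so a larger overlap produces more chunks for the same text.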

Separators

Select whether the document is split at common separators for generic text or at language-specific separators for code/markup.

Available options:

  • Text: The document will be split at common separators for generic text.
  • Code/Markup: Language-specific syntax will be used to split the document.
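
As a rough illustration of what splitting on language-specific syntax can look like, the sketch below cuts a document just before selected keywords or heading markers so that each piece begins with a complete definition or section. The separator table and the function name are hypothetical and chosen only for this example; the separators the node actually uses for each language are not documented here.

```python
import re

# Hypothetical separator table for illustration only.
SYNTAX_SEPARATORS = {
    "python": ["\nclass ", "\ndef "],
    "markdown": ["\n# ", "\n## "],
}


def split_at_syntax(text: str, language: str) -> list[str]:
    """Split text in front of each language-specific separator."""
    alternatives = "|".join(re.escape(s) for s in SYNTAX_SEPARATORS[language])
    # A lookahead splits *before* the separator, so the keyword stays
    # attached to the block it introduces.
    pieces = re.split(f"(?={alternatives})", text)
    return [piece for piece in pieces if piece.strip()]
```

For a Python source containing an import line, one function, and one class, this returns three pieces: the import line, the function, and the class. Pieces that still exceed the chunk size would then need further splitting at finer separators.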

Language

Select the language that will be considered when splitting the text.

Available options:

  • C#: C# syntax will be used to split the texts.
  • C++: C++ syntax will be used to split the texts.
  • COBOL: COBOL syntax will be used to split the texts.
  • Go: Go syntax will be used to split the texts.
  • Haskell: Haskell syntax will be used to split the texts.
  • HTML: HTML syntax will be used to split the texts.
  • Java: Java syntax will be used to split the texts.
  • JavaScript: JavaScript syntax will be used to split the texts.
  • Kotlin: Kotlin syntax will be used to split the texts.
  • LaTeX: LaTeX syntax will be used to split the texts.
  • Lua: Lua syntax will be used to split the texts.
  • Markdown: Markdown syntax will be used to split the texts.
  • PHP: PHP syntax will be used to split the texts.
  • Protobuf: Protobuf syntax will be used to split the texts.
  • Python: Python syntax will be used to split the texts.
  • RST: RST syntax will be used to split the texts.
  • Ruby: Ruby syntax will be used to split the texts.
  • Rust: Rust syntax will be used to split the texts.
  • Scala: Scala syntax will be used to split the texts.
  • SOL: Solidity (SOL) syntax will be used to split the texts.
  • Swift: Swift syntax will be used to split the texts.
  • TypeScript: TypeScript syntax will be used to split the texts.

Output column

Select whether the chunks should replace the original column or be appended to the table in a new column.

Available options:

  • Replace: The text chunks will replace the original texts.
  • Append: The text chunks will be appended to the table in a new column.

Output column name

Provide the name of the new column containing the chunks.

Input Ports

Table containing a string column.

Output Ports

Table containing the text chunks.

Views

This node has no views.