16_Tika_Parsing

Apache Tika integration

This workflow shows how to parse the content of files with the Tika nodes, detect the language of that content with the Tika Language Detector, and assign a part-of-speech (POS) tag to each English word found in the files. First, the Tika Parser reads files from a specified directory and parses their content; any detected attachments or embedded files are extracted as well. A Tika Language Detector node then detects the language of each file's content, and any file not written in English is filtered out. The remaining files are converted into documents, and a Stanford tagger assigns a POS tag to each term.
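The data flow described above (parse, detect language, filter, tag) can be sketched outside KNIME as follows. This is a minimal illustration, not the workflow's implementation: `ParsedFile`, `detect_language`, `keep_english`, and `tag_terms` are hypothetical names, the stopword-counting detector is a toy stand-in for the Tika language detector, and the tagger is passed in as a plain function rather than being the Stanford tagger.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ParsedFile:
    """One parsed file: its path and extracted text (hypothetical record type)."""
    path: str
    content: str


# Toy stopword lists (assumption). A real detector uses statistical
# language profiles rather than a handful of hand-picked words.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


def detect_language(text: str) -> str:
    """Return the language whose stopwords occur most often, or 'unknown'."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def keep_english(files: List[ParsedFile]) -> List[ParsedFile]:
    """Filter step: drop every file whose detected language is not English."""
    return [f for f in files if detect_language(f.content) == "en"]


def tag_terms(file: ParsedFile, tagger: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Tagging step: pair each whitespace-separated term with the tagger's label."""
    return [(term, tagger(term)) for term in file.content.split()]
```

In this sketch the filter runs before tagging, mirroring the workflow's order, so the tagger is only ever applied to text that was detected as English.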

[Workflow diagram] This workflow shows how to parse files of various formats, as well as their attachments if any exist, using the Tika Parser nodes, and how to detect the languages of the content using the Tika Language Detector.
Diagram sections: Attachments parsing; Language detection and filtering; Processing.
Node annotations: Read various file formats such as .docx, .pdf, .eml; Parse email attachments; Filter out email content (we only want the attachments); Detect languages used in the files; Filter out non-English texts; Convert to documents and assign POS tags.
Nodes shown: Tika Parser, Tika Parser URL Input, Tika Language Detector, Row Filter, Bag Of Words Creator.
