Challenge 15 - Extracting a Table from a PDF

Description: Given a text-based PDF document with a table, can you partially extract the table into a KNIME data table for further analysis? For this challenge we will extract the table from this PDF document and attempt to partially reconstruct it within KNIME. The corresponding KNIME table should contain the following columns: Day, Max, Min, Norm, Depart, Heat, and Cool.

Note 1: Your final output should be a table, not a single row with all the relevant data.
Note 2: The Tika Parser node is better suited for this task than the PDF Parser node.

We completed this task without components, regular expressions, or code-snippet nodes. In fact, our solution has a total of 10 nodes, but labeling the columns required a bit of manual effort.

Workflow (annotated stages: DATA PREPROCESSING, FINAL TABLE). Nodes and their annotations, in the order listed:

Tika Parser: Read PDF
Cell Splitter: Split content by \n
Column Filter: Keep desired columns (11 - [13-44])
Transpose: Convert columns to rows
Cell Splitter: Split header by \s into independent columns
String Manipulation: Check for double spaces
Row Splitter: Split header & table
String Manipulation: Fix spaces (Speed & Direction)
Cell Splitter: Split table by \s into independent columns
Column Filter: Exclude remaining column
String Manipulation: Fix spaces (% POSS)
Extract Column Header: Reset header name
Extract Column Header: Reset header name
Rule Engine: Redefine some columns
Transpose: Convert to lookup table
Insert Column Header: Build final table with headers
RowID: Flag column for renaming
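For readers who want to see the underlying logic outside KNIME, the steps named in the workflow annotations (split the parsed text by \n, keep the desired lines, split each line by whitespace while guarding against double spaces, keep only the wanted columns) can be sketched in Python. This is only an illustration and not the workflow itself: the sample text, the extra "Other" column, the line positions, and the function name extract_table are all made up for the example, and the real PDF's contents are not reproduced here.

```python
# Hypothetical stand-in for the text a PDF parser might return.
# The values and the "Other" column are invented for illustration only.
SAMPLE_TEXT = """\
SAMPLE REPORT TITLE
Day Max Min Norm Depart Heat Cool Other
1 45 30 38 -2 27 0 x
2 50  33 40 2 23 0 y
3 61 41 42  9 14 0 z
End of report
"""

# Columns requested in the challenge description.
WANTED = ["Day", "Max", "Min", "Norm", "Depart", "Heat", "Cool"]


def extract_table(text, header_index, first_row, last_row, wanted=WANTED):
    """Rebuild a table from parsed text, given assumed line positions."""
    lines = text.split("\n")              # one entry per line of text
    header = lines[header_index].split()  # header line -> column names
    rows = []
    for line in lines[first_row:last_row + 1]:  # keep only the table lines
        # str.split() with no argument treats runs of whitespace as a
        # single separator, so double spaces do not create empty cells.
        cells = line.split()
        row = dict(zip(header, cells))
        rows.append({name: row[name] for name in wanted})
    return rows


table = extract_table(SAMPLE_TEXT, header_index=1, first_row=2, last_row=4)
for row in table:
    print(row)
```

In the KNIME workflow the line positions are picked by hand in the Column Filter node; here they are passed as the assumed arguments header_index, first_row, and last_row. All cells stay strings in this sketch, so a type conversion would still be needed before numeric analysis.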
