PDF Table Parsing

From LemonWiki共筆
Revision as of 18:31, 15 June 2026 by Planetoid

Technical Notes on PDF Table Parsing

PDF table parsing usually cannot rely solely on the raw output produced by tools such as pdfplumber.extract_tables(), Camelot, Tabula, or similar libraries. Since PDF is primarily a layout-oriented format rather than a structured data format, practical implementations often require additional rules, state management, and post-processing steps in order to produce stable and usable datasets.

The following notes summarize common technical issues that can serve as a reference when developing a PDF table parser.

  1. Terminology
  2. Section Detection Requires Cross-Page State
  3. A Single Section May Contain Multiple Table Schemas
  4. Parent-Child Row Relationships Require Forward-Fill Support
  5. Cross-Page Continuation Is More Than Repeated Headers
  6. Headers, Notes, and Data Rows Should Be Detected in Stages
  7. Metadata Should Be Stored at the Record Level
  8. Implementation Recommendations

1. Terminology

Parser

A parser is a program responsible for parsing PDF content and converting it into structured data. It usually does more than simply read text or tables; it also needs to identify sections, table types, column positions, data rows, continuations, and source metadata.

Record

A record is a standardized data entry produced by the parser.

A single row in a PDF does not necessarily correspond to one record. The same record may be split across multiple rows, continue across pages, or require inherited information from a parent row before it becomes complete.

Section / Subsection

section and subsection identify the part of the document in which a table or record appears, as named by the document's own headings.

These fields are typically used to describe the source context of the data, such as which section or table block a particular record comes from.

Table Type

table_type is a label defined by the parser that indicates what kind of data structure the current table represents.

Examples include:

  • summary_table
  • detail_table
  • parent_child_table
  • cross_reference_table
  • unknown_table

The purpose of table_type is to help the parser determine which set of parsing rules should be applied.

It is not a built-in PDF field, nor is it information automatically provided by pdfplumber, Camelot, or Tabula.

Table Schema

A table schema refers to the column structure and parsing rules for a specific type of table.

It usually includes the number of columns, column order, standardized column names, column aliases, required fields, nullable fields, and whether special continuation handling or parent-child row handling is needed.

In PDF table parsing, a table schema is not necessarily a database schema. Instead, it is a set of rules used by the parser to align and standardize table data.
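As an illustration, a table schema of this kind could be held in a small data class. The sketch below is one possible shape, not a fixed format: the class name TableSchema, its field names, and the example columns are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class TableSchema:
    """Hypothetical container for the parsing rules of one table type."""
    table_type: str
    columns: Tuple[str, ...]                  # standardized names, in column order
    required: FrozenSet[str] = frozenset()    # fields that must not be empty
    aliases: Dict[str, str] = field(default_factory=dict)  # header text -> standardized name
    has_parent_rows: bool = False             # parent-child handling needed?
    needs_continuation: bool = False          # special continuation handling needed?

    @property
    def column_count(self) -> int:
        return len(self.columns)

    def missing_required(self, record: dict) -> List[str]:
        """Return required fields that are absent or empty in a record."""
        return [
            name for name in self.columns
            if name in self.required and record.get(name) in (None, "")
        ]


schema = TableSchema(
    table_type="detail_table",
    columns=("item_no", "item_name", "remarks"),
    required=frozenset({"item_no", "item_name"}),
)
```

Fields not listed in required are treated as nullable here, which keeps the schema short.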

Column Mapping

Column mapping is the process of mapping the columns of the original PDF table to standardized fields.

For example, different documents may use headers such as “Item Name,” “Object Name,” or “Target Name,” but all of them can be normalized into item_name in the output data.

Alias Mapping

An alias mapping is a lookup table that maps header aliases to standardized field names.

Because PDF table headers often vary across versions, formats, or reporting units, the parser needs to map multiple header names to the same standardized field.

Forward-Fill

Forward-fill is the process of letting later rows inherit field values from earlier rows.

In a parent-child row structure, a parent row may appear only once, while subsequent child rows omit repeated information. In this case, the parser needs to fill the parent-row information into the child rows.

Parent Row / Child Row

A parent row is a row that provides main category, group, or summary information.

A child row is a detail row that belongs to a parent row.

In PDF tables, it is common for a parent row to list primary information while child rows only list detailed items. Without handling parent-child row relationships, the output data may lose necessary context.

Cross-Page Continuation

Cross-page continuation is the situation where a single record is split at a page break and continues on the next page.

This may occur in names, descriptive text, category fields, remarks, or other long fields.

Cross-page continuation handling requires determining whether text on the next page belongs to an unfinished record from the previous page, instead of treating it directly as a new data row.

Metadata

Metadata is auxiliary information that describes the data source and the parsing state.

Common fields include page, table_index, table_type, section, subsection, source_file, raw_row, and parse_warning.

Metadata helps with debugging, auditing, data traceability, and data quality checks.

2. Section Detection Requires Cross-Page State

Section, subsection, and table category information in PDF documents may not be repeated on every page. Therefore, the parser needs to preserve the current section, subsection, or other document-level context and reuse it on subsequent pages.

Common scenarios include:

  • Continuation pages may contain only table content without repeating the section title.
  • Table of contents, summary, or explanatory pages may contain multiple section names at once. If the parser only checks whether a page contains certain keywords, it may misclassify the page.
  • Section titles may use different punctuation or spacing due to layout variations, such as full-width or half-width parentheses, enumeration marks, colons, full-width periods, or extra spaces.
  • When switching sections, subsections, or table types, inherited state from the previous table must be cleared to prevent data from being incorrectly carried over into the next block.

A recommended approach is to determine sections using a combination of text position, title patterns, and contextual state, rather than relying only on keyword matching across the entire page.
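A minimal sketch of this idea follows. It assumes the caller passes only the lines near the top of each page, and that section titles start with an enumeration mark such as "1." or "(2)". The pattern, the class name SectionState, and the parent_row attribute are hypothetical and would need to be adapted to the actual documents.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical title pattern: an enumeration mark followed by the title text.
SECTION_RE = re.compile(r"^(?:\d+\.|\(\d+\))\s*(?P<title>\S.*)$")


def normalize_title(line: str) -> str:
    """Fold full-width characters to half-width and collapse whitespace."""
    text = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", " ", text).strip()


@dataclass
class SectionState:
    section: Optional[str] = None
    parent_row: Optional[dict] = None  # inherited state tied to the current section

    def update(self, top_lines: List[str]) -> None:
        """Update the section from title-like lines; keep it if none is found."""
        for line in top_lines:
            match = SECTION_RE.match(normalize_title(line))
            if match:
                title = match.group("title")
                if title != self.section:
                    self.section = title
                    self.parent_row = None  # clear inherited state on a section change
                return


state = SectionState()
state.update(["1.  Overview"])          # a page with a section title
state.update(["12 Pump", "13 Valve"])   # a continuation page: section is kept
```

Because the state object outlives a single page, a continuation page without a title still reports the section found earlier.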

3. A Single Section May Contain Multiple Table Schemas

A single section may contain multiple types of tables. If the parser determines column mapping only from the section name, columns may be misaligned.

When parsing each table, the parser should first inspect the header or the first few rows of the table to determine:

  • table_type
  • Number of columns
  • Column order
  • Standardized column names
  • Whether parent-child row relationships exist
  • Whether special continuation handling is required

Header names may also contain aliases, abbreviations, or terms that vary across document versions. For example, “Item Name,” “Object Name,” and “Target Name” may refer to similar fields in different documents. The implementation should therefore maintain an alias mapping that normalizes different header names into consistent fields.
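The following sketch shows one way to normalize headers through an alias mapping and then choose a table_type from the fields that were recognized. The alias table, the required-field sets, and the function names are illustrative assumptions, not part of any library.

```python
import re
import unicodedata
from typing import List, Optional

# Hypothetical alias table: normalized header text -> standardized field name.
HEADER_ALIASES = {
    "item name": "item_name",
    "object name": "item_name",
    "target name": "item_name",
    "no.": "item_no",
    "remarks": "remarks",
    "notes": "remarks",
    "total": "total",
}

# Hypothetical rule: a table type is chosen when all of its key fields are present.
TABLE_TYPE_FIELDS = {
    "detail_table": {"item_no", "item_name"},
    "summary_table": {"item_name", "total"},
}


def normalize_header(cell: Optional[str]) -> str:
    """Fold width variants, collapse line breaks and spaces, and lowercase."""
    if cell is None:
        return ""
    text = unicodedata.normalize("NFKC", cell)
    return re.sub(r"\s+", " ", text).strip().lower()


def map_headers(header_row: List[Optional[str]]) -> List[Optional[str]]:
    """Map each header cell to a standardized field, or None if unknown."""
    return [HEADER_ALIASES.get(normalize_header(cell)) for cell in header_row]


def detect_table_type(fields: List[Optional[str]]) -> str:
    found = {name for name in fields if name}
    for table_type, required in TABLE_TYPE_FIELDS.items():
        if required <= found:
            return table_type
    return "unknown_table"


fields = map_headers(["No.", "Object\nName", None, "Notes"])
table_type = detect_table_type(fields)
```

Unknown headers map to None at their original position, so the column order of the source table is preserved for later steps.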

4. Parent-Child Row Relationships Require Forward-Fill Support

Some PDF tables use parent rows to list primary information, while subsequent child rows only list detail fields. For example, a parent row may contain a main category, group name, case number, or summary information, while child rows only list sub-item numbers, detail names, or detailed content.

These tables require forward-fill support, where child rows inherit field values from the parent row.

Design considerations include:

  • Parent row state should be isolated by context such as section, subsection, and table_type.
  • Parent row state may need to persist across pages.
  • A single text field should not be used as the parent row identifier, because different parent rows may have identical or similar names.
  • Forward-filling should be keyed on at least the section, subsection, table type, and parent-row identifier.

When the section or table changes, parent row state should be explicitly reset to avoid carrying data from the previous table into the next one.
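A forward-fill pass with explicit resets might look like the sketch below. It assumes rows have already been converted to dictionaries, that a parent row is recognized by a non-empty group_id (a hypothetical identifier field), and that group_id and group_name are the fields child rows inherit.

```python
from typing import Dict, List, Optional

# Hypothetical fields that child rows inherit from their parent row.
PARENT_FIELDS = ("group_id", "group_name")


def forward_fill(rows: List[Dict]) -> List[Dict]:
    """Copy parent fields into child rows, resetting state when the context changes."""
    context = None
    parent: Optional[Dict] = None
    filled_rows = []
    for row in rows:
        row_context = (row.get("section"), row.get("subsection"), row.get("table_type"))
        if row_context != context:
            context, parent = row_context, None   # explicit reset on a new block
        filled = dict(row)                        # do not mutate the input row
        if filled.get("group_id"):
            parent = {name: filled.get(name) for name in PARENT_FIELDS}
        elif parent is not None:
            for name in PARENT_FIELDS:
                filled[name] = parent[name]
        else:
            filled["parse_warning"] = "child row without a parent row"
        filled_rows.append(filled)
    return filled_rows


rows = [
    {"section": "A", "table_type": "parent_child_table",
     "group_id": "G1", "group_name": "Pumps", "detail": None},
    {"section": "A", "table_type": "parent_child_table",
     "group_id": None, "group_name": None, "detail": "valve"},
    {"section": "B", "table_type": "parent_child_table",
     "group_id": None, "group_name": None, "detail": "bolt"},
]
filled = forward_fill(rows)
```

The third row belongs to a different section, so it receives a warning instead of silently inheriting G1. Since page is not part of the context key, parent state persists across page breaks.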

5. Cross-Page Continuation Is More Than Repeated Headers

When PDF tables span multiple pages, data may be split in various ways. A parser that only handles repeated headers on the next page is usually insufficient for real-world documents.

Common cross-page continuation cases include:

  • The first row on the next page contains only the latter part of a field from the previous page.
  • Names, categories, descriptive text, or remarks are split across pages.
  • The next page starts with multiple repeated header rows before continuing an unfinished record from the previous page.
  • Continuation text and a new data-row identifier appear in the same row.
  • Empty columns are compressed, causing column positions to shift.
  • Table borders, line breaks, or OCR results split what was originally one row into multiple rows.

Therefore, cross-page processing should preserve original column positions and should not remove empty columns too early. When necessary, the parser should use column coordinates, column indexes, identifier patterns, and the previous record state to determine which field a continuation fragment belongs to.
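As a sketch of position-preserving continuation handling, the code below assumes that every row keeps its empty cells, that the identifier sits in column 0, and that identifiers are plain integers. ID_RE, ID_COL, and both function names are assumptions for illustration. Fragments are joined with a space here, which suits English text only.

```python
import re
from typing import List, Optional

ID_COL = 0                     # hypothetical position of the row identifier
ID_RE = re.compile(r"^\d+$")   # hypothetical identifier pattern

Row = List[Optional[str]]


def is_continuation(row: Row) -> bool:
    """A row with text but no identifier is treated as a continuation fragment."""
    first = (row[ID_COL] or "").strip()
    has_text = any((cell or "").strip() for cell in row)
    return has_text and not ID_RE.match(first)


def attach_continuation(previous: Row, row: Row) -> Row:
    """Append each non-empty cell to the same column index of the previous record."""
    merged = list(previous)
    for index, cell in enumerate(row):
        text = (cell or "").strip()
        if text:
            merged[index] = ((merged[index] or "") + " " + text).strip()
    return merged


last_row = ["12", "Quarterly maintenance", None]   # last row on the previous page
first_row = [None, "of pumps", "see appendix"]     # first row on the next page
if is_continuation(first_row):
    last_row = attach_continuation(last_row, first_row)
```

If empty cells had been dropped before this step, "of pumps" would have landed in column 0 and been mistaken for an identifier or a new row.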

Continuation merging should also depend on the type of text:

  • Chinese text fragments can usually be concatenated directly.
  • English text fragments usually require an inserted space.
  • Multi-value fields, category fields, or remarks may need to preserve line breaks, enumeration marks, or delimiters.
  • Text fragments should not be concatenated into numeric fields; a fragment that lands in a numeric column is more likely a misplaced note or a shifted column.
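The merging rules above can be sketched as a single function. The kind labels ("text", "multi_value", "numeric") and the CJK character ranges are assumptions chosen for this example, and real documents may need a finer distinction.

```python
import re
from typing import Optional, Tuple

# CJK punctuation, CJK ideographs, and full-width forms (an approximate range).
CJK_RE = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def merge_fragments(head: str, tail: str, kind: str = "text") -> Tuple[str, Optional[str]]:
    """Merge a continuation fragment into a field; return (value, warning)."""
    head, tail = head.rstrip(), tail.lstrip()
    if not tail:
        return head, None
    if kind == "numeric":
        # Never glue text onto a number; report the fragment instead.
        return head, f"ignored continuation fragment in numeric field: {tail!r}"
    if not head:
        return tail, None
    if kind == "multi_value":
        return head + "\n" + tail, None      # keep the line break as a delimiter
    if CJK_RE.match(head[-1]) or CJK_RE.match(tail[0]):
        return head + tail, None             # Chinese text: concatenate directly
    return head + " " + tail, None           # English text: insert a space


merged_cjk, _ = merge_fragments("資料表", "解析")
merged_en, _ = merge_fragments("annual report ", "summary")
```

Returning a warning alongside the value lets the caller store it in parse_warning rather than losing the fragment silently.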

6. Headers, Notes, and Data Rows Should Be Detected in Stages

PDF tables often contain headers, unit descriptions, notes, subtotals, totals, footers, or blank rows. If a parser treats any row containing text as a data row, it can easily produce incorrect records.

A recommended parsing flow is to separate the process into stages:

  1. Skip rows until header keywords appear.
  2. Detect the header area and column names.
  3. Determine where data rows begin.
  4. Process regular data rows.
  5. Process cross-page continuations.
  6. Exclude notes, units, subtotals, totals, and footers.
  7. Output standardized records.

Missing fields may be filled with null. However, if columns have shifted, the parser must not blindly apply the schema based on the order of a compressed array, as this can misalign the entire row.
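A compact sketch of staged row handling follows. It assumes rows arrive as lists that keep their empty cells, and that headers, notes, and totals can be recognized by the hypothetical keywords and patterns below. Continuation handling is left out to keep the example short.

```python
import re
from enum import Enum
from typing import Dict, List, Optional

Row = List[Optional[str]]


class RowKind(Enum):
    BLANK = "blank"
    HEADER = "header"
    NOTE = "note"
    TOTAL = "total"
    DATA = "data"


HEADER_KEYWORDS = {"no.", "item name"}                       # hypothetical
NOTE_RE = re.compile(r"^(notes?|units?)\b", re.IGNORECASE)   # hypothetical
TOTAL_RE = re.compile(r"^(sub)?total\b", re.IGNORECASE)      # hypothetical


def classify_row(row: Row, header_seen: bool) -> RowKind:
    cells = [(cell or "").strip() for cell in row]
    non_empty = [cell for cell in cells if cell]
    if not non_empty:
        return RowKind.BLANK
    if any(cell.lower() in HEADER_KEYWORDS for cell in non_empty):
        return RowKind.HEADER
    if not header_seen or NOTE_RE.match(non_empty[0]):
        return RowKind.NOTE          # text before the header is never data
    if TOTAL_RE.match(non_empty[0]):
        return RowKind.TOTAL
    return RowKind.DATA


def to_record(row: Row, fields: List[Optional[str]]) -> Dict[str, Optional[str]]:
    """Map cells to fields by original column index; empty cells become None."""
    return {
        name: ((row[i] or "").strip() or None) if i < len(row) else None
        for i, name in enumerate(fields) if name
    }


def parse_rows(rows: List[Row], fields: List[Optional[str]]) -> List[Dict]:
    header_seen, records = False, []
    for row in rows:
        kind = classify_row(row, header_seen)
        if kind is RowKind.HEADER:
            header_seen = True       # also absorbs repeated headers on later pages
        elif kind is RowKind.DATA:
            records.append(to_record(row, fields))
    return records


records = parse_rows(
    [["Unit: USD", None, None], ["No.", "Item Name", "Remarks"],
     ["1", "Pump", None], ["2", None, "late"], ["Subtotal", None, "2"]],
    ["item_no", "item_name", "remarks"],
)
```

In the second data row the empty item name stays in its own column, so "late" is stored under remarks. With a compressed array it would have shifted into item_name.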

7. Metadata Should Be Stored at the Record Level

When parsing PDF tables, source context should be preserved in addition to the extracted data itself. This metadata is important for debugging, auditing, data traceability, and downstream flattened outputs.

Each record should preferably include at least:

  • page
  • table_index
  • table_type
  • section
  • subsection
  • source_file
  • raw_row or original row data
  • parse_warning or parsing warning

Writing metadata at the record level helps prevent context loss when JSON is flattened into CSV, imported into a database, or passed around as individual records.

If the document itself contains a year, quarter, period, version, batch, or category information, that information should also be explicitly preserved to avoid mixing data from different batches.
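One way to carry this metadata is a record class that can flatten itself into a single dictionary for CSV or database output. This is a sketch only: the class name Record, the flatten method, and the example values are assumptions, and the metadata field names follow the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical record: parsed fields plus record-level source metadata."""
    data: dict
    page: int
    table_index: int
    table_type: str
    section: Optional[str]
    subsection: Optional[str]
    source_file: str
    raw_row: List[Optional[str]]
    parse_warning: Optional[str] = None

    def flatten(self) -> dict:
        """Return one flat dict holding both the data fields and the metadata."""
        meta = asdict(self)
        meta.pop("data")
        # Serialize the original row so it survives CSV export as one cell.
        meta["raw_row"] = json.dumps(self.raw_row, ensure_ascii=False)
        flat = dict(self.data)
        flat.update(meta)
        return flat


record = Record(
    data={"item_no": "1", "item_name": "Pump"},
    page=3, table_index=0, table_type="detail_table",
    section="A", subsection=None,
    source_file="report.pdf", raw_row=["1", "Pump", None],
)
flat = record.flatten()
```

In this sketch a data field with the same name as a metadata field would be overwritten. A real implementation might prefix the metadata keys to avoid that.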

8. Implementation Recommendations

A PDF table parser can use a layered design that separates extraction, classification, cleaning, state management, normalization, and validation.

A recommended workflow is:

  1. Extract text, tables, and coordinate information from the PDF.
  2. Detect page sections and table blocks.
  3. Determine table_type based on header keywords.
  4. Apply the corresponding schema and header alias mapping.
  5. Identify data rows, parent rows, child rows, and continuation rows.
  6. Manage cross-page state, parent row state, and other document-level state.
  7. Convert the data into standardized records.
  8. Add source metadata.
  9. Run regression checks and data quality checks.
  10. Output JSON, CSV, or database-ready records.

This design reduces the risk of coupling individual rules too tightly to a specific PDF layout and makes it easier to adjust only the affected parts when the document format changes.
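For the data quality step, a simple check over the flattened records could look like the sketch below. The required fields and the duplicate key (source_file, section, item_no) are assumptions for this example. Regression checks against a known-good output are not covered here.

```python
from collections import Counter
from typing import Dict, List

REQUIRED_FIELDS = ("item_no", "item_name")                            # hypothetical
METADATA_FIELDS = ("page", "table_index", "table_type", "source_file")


def check_records(records: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for index, record in enumerate(records):
        for name in REQUIRED_FIELDS + METADATA_FIELDS:
            if record.get(name) in (None, ""):
                problems.append(f"record {index}: missing {name}")
    keys = Counter(
        (r.get("source_file"), r.get("section"), r.get("item_no")) for r in records
    )
    for key, count in keys.items():
        if count > 1:
            problems.append(f"duplicate key {key}: {count} records")
    return problems


good = {"item_no": "1", "item_name": "Pump", "page": 3, "table_index": 0,
        "table_type": "detail_table", "source_file": "a.pdf", "section": "A"}
bad = dict(good, item_no="2", item_name=None)
problems = check_records([good, bad])
```

Running such a check on every parser change makes it easier to notice when a rule adjusted for one layout breaks another.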