
=== 1. Terminology ===

'''Parser'''

A <code>parser</code> is a program that reads PDF content and converts it into structured data. Beyond extracting text or tables, it also needs to identify sections, table types, column positions, data rows, continuations, and source metadata.

'''Record'''

A <code>record</code> is a standardized data entry produced by the parser.

A single row in a PDF does not necessarily correspond to one record. A record may be split across multiple rows, continue across pages, or need information inherited from a parent row before it is complete.

'''Section / Subsection'''

<code>section</code> and <code>subsection</code> identify the document section and subsection in which a table or record appears.

These fields are typically used to describe the source context of the data, such as which section or table block a particular record comes from.

'''Table Type'''

<code>table_type</code> is a label assigned by the parser to indicate what kind of data structure the current table represents.

Examples include:

* <code>summary_table</code>
* <code>detail_table</code>
* <code>parent_child_table</code>
* <code>cross_reference_table</code>
* <code>unknown_table</code>

The purpose of <code>table_type</code> is to help the parser determine which set of parsing rules should be applied.

It is not a built-in PDF field, nor is it information automatically provided by <code>pdfplumber</code>, Camelot, or Tabula.
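Because the label has to be inferred, one simple approach is to match the extracted header row against known header signatures. The following is a minimal sketch under stated assumptions: the <code>HEADER_SIGNATURES</code> table, its header names, and the <code>classify_table</code> function are hypothetical and not part of any library.

```python
from typing import Dict, List, Optional, Set

# Hypothetical header signatures (assumption): each table_type is recognized
# by a set of lowercase header names that must all be present.
HEADER_SIGNATURES: Dict[str, Set[str]] = {
    "summary_table": {"category", "total"},
    "detail_table": {"item name", "quantity", "unit"},
}


def classify_table(header: List[Optional[str]]) -> str:
    """Return the first table_type whose signature is contained in the header."""
    # Extraction tools may return None or padded strings for header cells.
    cells = {cell.strip().lower() for cell in header if cell}
    for table_type, signature in HEADER_SIGNATURES.items():
        if signature <= cells:  # every signature header appears in this table
            return table_type
    # Fall back to an explicit label instead of guessing.
    return "unknown_table"
```

Returning <code>unknown_table</code> for unmatched headers keeps unrecognized tables visible for review instead of forcing them through the wrong rules.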

'''Table Schema'''

A <code>table schema</code> refers to the column structure and parsing rules for a specific type of table.

It usually includes the number of columns, column order, standardized column names, column aliases, required fields, nullable fields, and whether special continuation handling or parent-child row handling is needed.

In PDF table parsing, a <code>table schema</code> is not necessarily a database schema. Instead, it is a set of rules used by the parser to align and standardize table data.
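The elements listed above could be captured in a small data structure. This is an illustrative sketch only: the <code>TableSchema</code> class, its attribute names, and the <code>validate</code> method are assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableSchema:
    """Hypothetical parsing rules for one table_type (names are assumptions)."""
    table_type: str
    columns: Tuple[str, ...]                  # standardized names, in column order
    required: Tuple[str, ...] = ()            # fields that must be non-empty
    aliases: Dict[str, str] = field(default_factory=dict)  # raw header -> standardized name
    has_parent_child_rows: bool = False       # enable parent-child row handling
    allows_continuation: bool = False         # enable continuation handling

    def validate(self, row: Dict[str, str]) -> List[str]:
        """Return one warning per required field that is missing or empty."""
        return [
            f"missing required field: {name}"
            for name in self.required
            if not row.get(name)
        ]


# Example schema for a detail table; fields not listed in `required` are nullable.
DETAIL_SCHEMA = TableSchema(
    table_type="detail_table",
    columns=("item_name", "quantity"),
    required=("item_name",),
)
```

Keeping the rules in data rather than in branching code makes it easier to add a new table type without touching the parsing loop.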

'''Column Mapping'''

<code>column mapping</code> refers to the process of mapping original PDF columns to standardized fields.

For example, different documents may use headers such as “Item Name,” “Object Name,” or “Target Name,” but all of them can be normalized into <code>item_name</code> in the output data.

'''Alias Mapping'''

An <code>alias mapping</code> is a lookup table that maps header aliases to standardized field names.

Because PDF table headers often vary across versions, formats, or reporting units, the parser needs to map multiple header names to the same standardized field.
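A column mapping driven by an alias table might look like the sketch below. It reuses the three example headers mentioned under Column Mapping; the <code>ALIASES</code> dictionary and the helper names <code>normalize_header</code> and <code>map_columns</code> are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Alias table (assumption): keys are normalized raw headers,
# values are standardized field names.
ALIASES: Dict[str, str] = {
    "item name": "item_name",
    "object name": "item_name",
    "target name": "item_name",
}


def normalize_header(cell: Optional[str]) -> str:
    """Lowercase and collapse whitespace, including line breaks inside a cell."""
    return re.sub(r"\s+", " ", cell or "").strip().lower()


def map_columns(header: List[Optional[str]]) -> List[Optional[str]]:
    """Return the standardized field per column, or None if unrecognized."""
    return [ALIASES.get(normalize_header(cell)) for cell in header]
```

Normalizing before the lookup matters because extracted headers often contain line breaks or doubled spaces where the PDF wrapped the text.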

'''Forward-Fill'''

<code>forward-fill</code> copies a field value from the most recent preceding row that has one into later rows where that field is empty.

In a parent-child row structure, a parent row may appear only once, while subsequent child rows omit repeated information. In this case, the parser needs to fill the parent-row information into the child rows.
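A minimal forward-fill over rows represented as dictionaries could be sketched as follows. The function name <code>forward_fill</code> and the row representation are assumptions; treating both empty strings and <code>None</code> as "missing" is also an assumption that may not suit every table.

```python
from typing import Any, Dict, List, Sequence


def forward_fill(rows: Sequence[Dict[str, Any]], fields: Sequence[str]) -> List[Dict[str, Any]]:
    """Copy the last non-empty value of each listed field into later empty cells."""
    filled: List[Dict[str, Any]] = []
    last_seen: Dict[str, Any] = {}
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name in fields:
            value = new_row.get(name)
            if value:
                last_seen[name] = value          # remember the newest value
            elif name in last_seen:
                new_row[name] = last_seen[name]  # inherit from an earlier row
        filled.append(new_row)
    return filled
```

Only the fields passed in <code>fields</code> are filled, so genuinely empty cells in other columns stay empty.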

'''Parent Row / Child Row'''

A <code>parent row</code> is a row that provides main category, group, or summary information.

A <code>child row</code> is a detail row that belongs to a parent row.

In PDF tables, it is common for a parent row to list primary information while child rows only list detailed items. Without handling parent-child row relationships, the output data may lose necessary context.
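One way to preserve that context is to consume parent rows and attach their information to each child record. The sketch below assumes a hypothetical two-column layout (<code>category</code>, <code>item_name</code>) and a simple rule for recognizing parent rows; real tables usually need a rule specific to their layout.

```python
from typing import Any, Dict, List, Optional, Sequence


def is_parent_row(row: Dict[str, Any]) -> bool:
    """Assumed rule: a parent row has a category but no item-level value."""
    return bool(row.get("category")) and not row.get("item_name")


def attach_children(rows: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Emit one record per child row, carrying the nearest preceding parent."""
    records: List[Dict[str, Any]] = []
    current_parent: Optional[str] = None
    for row in rows:
        if is_parent_row(row):
            current_parent = row["category"]  # start a new group
            continue                          # parent rows are not emitted
        records.append({
            "category": current_parent,
            "item_name": row.get("item_name"),
            "orphan": current_parent is None,  # child seen before any parent
        })
    return records
```

Flagging orphans instead of dropping them makes it possible to spot tables whose parent row was lost during extraction.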

'''Cross-Page Continuation'''

<code>cross-page continuation</code> occurs when a single record is split at a page break and continues on the next page.

This may occur in names, descriptive text, category fields, remarks, or other long fields.

Cross-page continuation handling requires determining whether text on the next page belongs to an unfinished record from the previous page, instead of treating it directly as a new data row.
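The decision described above can be approximated with a heuristic. In this sketch, the first row of a page is treated as a continuation when its key column is empty but its text column is not; the field names <code>item_id</code> and <code>description</code>, the function name, and the heuristic itself are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence


def merge_continuations(
    pages: Sequence[Sequence[Dict[str, Any]]],
    key_field: str = "item_id",
    text_field: str = "description",
) -> List[Dict[str, Any]]:
    """Merge a page's first row into the previous record when it looks like a fragment."""
    merged: List[Dict[str, Any]] = []
    for page_rows in pages:
        for position, row in enumerate(page_rows):
            is_continuation = bool(
                position == 0               # only the first row on a page
                and merged                  # an earlier record exists
                and not row.get(key_field)  # no key, so not a new record
                and row.get(text_field)     # but it carries leftover text
            )
            if is_continuation:
                previous = merged[-1]
                joined = f"{previous.get(text_field) or ''} {row[text_field]}"
                previous[text_field] = joined.strip()
            else:
                merged.append(dict(row))    # copy so the inputs stay unchanged
    return merged
```

A heuristic like this can misfire when a table legitimately starts a page with a keyless row, so recording a warning on merged records is a reasonable precaution.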

'''Metadata'''

<code>metadata</code> refers to auxiliary information that describes the data source and parsing state.

Common fields include <code>page</code>, <code>table_index</code>, <code>table_type</code>, <code>section</code>, <code>subsection</code>, <code>source_file</code>, <code>raw_row</code>, and <code>parse_warning</code>.

Metadata helps with debugging, auditing, data traceability, and data quality checks.
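The fields listed above could be carried alongside each record in a structure like the following sketch. The <code>ParsedRecord</code> class name, the <code>data</code> attribute, and the choice to store <code>parse_warning</code> as a list of strings are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ParsedRecord:
    """One standardized record plus the metadata needed to trace it back."""
    data: Dict[str, Any]            # standardized field values
    page: int                       # page the record was read from
    table_index: int                # position of the table on that page
    table_type: str
    source_file: str
    raw_row: List[Optional[str]]    # cells exactly as extracted, for auditing
    section: Optional[str] = None
    subsection: Optional[str] = None
    parse_warning: List[str] = field(default_factory=list)


record = ParsedRecord(
    data={"item_name": "Bolt"},
    page=3,
    table_index=0,
    table_type="detail_table",
    source_file="report.pdf",       # hypothetical file name
    raw_row=["Bolt", None],
)
exported = asdict(record)           # plain dict, ready for JSON or CSV export
```

Keeping <code>raw_row</code> next to the standardized values lets a reviewer compare the parser's output with what was actually extracted.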