
=== 6. Headers, Notes, and Data Rows Should Be Detected in Stages ===

PDF tables often contain more than data: headers, unit descriptions, notes, subtotals, totals, footers, and blank rows. A parser that treats every row containing text as a data row will emit notes, subtotals, and footers as records.

A more reliable approach is to parse in stages:

# Scan forward until header keywords appear.
# Detect the header area and column names.
# Determine where data rows begin.
# Process regular data rows.
# Process cross-page continuations.
# Exclude notes, units, subtotals, totals, and footers.
# Output standardized records.
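
The stages above can be sketched as a small state machine. This is a minimal illustration, not a definitive implementation: it assumes rows have already been extracted as lists of cell strings (with <code>None</code> for empty cells), and the header keywords, the exclusion pattern, and the function names are all hypothetical. Cross-page continuation is handled only in its simplest form, by skipping a header that repeats on a later page.

```python
import re
from enum import Enum, auto

class Stage(Enum):
    SEEK_HEADER = auto()  # nothing is emitted until a header row is seen
    DATA = auto()         # header found; rows are candidate data rows

# Hypothetical header keywords and exclusion pattern; adjust per document.
HEADER_KEYWORDS = {"item", "qty", "amount"}
EXCLUDE_RE = re.compile(r"^(notes?|units?|subtotal|total|page)\b", re.IGNORECASE)

def is_header(row):
    cells = {c.strip().lower() for c in row if c}
    return HEADER_KEYWORDS <= cells

def is_excluded(row):
    text = [c.strip() for c in row if c and c.strip()]
    if not text:
        return True  # blank row
    return EXCLUDE_RE.match(text[0]) is not None

def parse_rows(rows):
    stage = Stage.SEEK_HEADER
    columns = []
    records = []
    for row in rows:
        if is_header(row):
            if stage is Stage.SEEK_HEADER:
                columns = [(c or "").strip().lower() for c in row]
                stage = Stage.DATA
            # A repeated header (e.g. on a continuation page) is skipped.
            continue
        if stage is Stage.SEEK_HEADER or is_excluded(row):
            continue
        # Pad short rows so trailing missing fields become None.
        padded = list(row) + [None] * (len(columns) - len(row))
        records.append({
            col: (cell.strip() if cell and cell.strip() else None)
            for col, cell in zip(columns, padded)
        })
    return records
```

Matching the exclusion pattern on a word boundary keeps a data row such as "Notebook" from being mistaken for a note. Padding only covers fields missing at the end of a row; a gap in the middle requires positional information.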

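Missing fields may be filled with <code>null</code>. However, if a cell is missing or columns have shifted, the parser must not map values to the schema by their order in a compacted array (one from which empty cells have been removed): every value after the gap would land in the wrong field, misaligning the rest of the row.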
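
One way to avoid order-based mapping is to assign each cell to a column by its horizontal position. The sketch below assumes the extraction layer supplies a left x-coordinate for each cell and for each header, and that columns are left-aligned; the function name and data shapes are hypothetical. For right-aligned numeric columns, explicit column boundaries would likely be a better fit than header start positions.

```python
from bisect import bisect_right

def assign_by_position(cells, columns):
    """Map cells to columns by x-position instead of by array order.

    cells:   list of (x0, text) pairs for one row
    columns: list of (name, x_start) pairs sorted by x_start
    """
    starts = [x for _, x in columns]
    record = {name: None for name, _ in columns}
    for x0, text in cells:
        # Rightmost column whose start is at or left of the cell.
        idx = max(bisect_right(starts, x0) - 1, 0)
        name = columns[idx][0]
        if record[name] is not None:
            # Two cells in one column: flag the row rather than guess.
            raise ValueError(f"column {name!r} received more than one cell")
        record[name] = text
    return record
```

For a row whose quantity cell is empty, such as <code>[(52.0, "Widget"), (305.0, "10.00")]</code> against columns starting at 50, 200, and 300, the amount stays in the amount field and the quantity becomes <code>None</code>, whereas zipping the compacted values against the schema would place the amount in the quantity field.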