PDF Table Parsing

=== 1. Terminology ===


'''Parser'''

A <code>parser</code> is a program that reads PDF content and converts it into structured data. It usually does more than extract text or tables; it also identifies sections, table types, column positions, data rows, continuations, and source metadata.


'''Record'''

A <code>record</code> is a standardized data entry produced by the parser.
A single row in a PDF table does not necessarily correspond to one record. A record may be split across multiple rows, continue across pages, or need values inherited from a parent row before it is complete.


'''Section / Subsection'''

<code>section</code> and <code>subsection</code> are fields that record the document section and subsection a piece of data belongs to.
They typically describe the source context of the data, such as which section or table block a particular record comes from.


'''Table Type'''

<code>table_type</code> is a label defined by the parser that indicates what kind of data structure the current table represents.
It is not a built-in PDF field, nor is it provided automatically by <code>pdfplumber</code>, Camelot, or Tabula.


'''Table Schema'''

A <code>table schema</code> is the column structure and set of parsing rules for a specific type of table.
In PDF table parsing, a <code>table schema</code> is not necessarily a database schema; it is a set of rules the parser uses to align and standardize table data.
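
One way to picture a table schema is as a small data structure keyed by a parser-defined <code>table_type</code>. The sketch below is only an illustration under stated assumptions: the class names <code>ColumnRule</code> and <code>TableSchema</code>, the flags <code>required</code> and <code>inherit</code>, and the example fields are all hypothetical and do not come from any PDF library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """One column of a table schema (hypothetical structure)."""
    name: str               # standardized output field
    required: bool = True   # a complete record must have a value here
    inherit: bool = False   # child rows may inherit this value from a parent row


@dataclass(frozen=True)
class TableSchema:
    table_type: str                  # parser-defined label, not a PDF field
    columns: tuple[ColumnRule, ...]  # columns in left-to-right order

    def align(self, cells: list[str]) -> dict[str, str | None]:
        """Pair raw cells with column names by position.

        Short rows are padded with None; extra cells are ignored.
        """
        padded = list(cells) + [None] * (len(self.columns) - len(cells))
        row = {}
        for column, cell in zip(self.columns, padded):
            text = cell.strip() if cell else ""
            row[column.name] = text or None
        return row


# Hypothetical schema for an "item_list" table.
ITEM_LIST_SCHEMA = TableSchema(
    table_type="item_list",
    columns=(
        ColumnRule("category", inherit=True),
        ColumnRule("item_name"),
        ColumnRule("quantity", required=False),
    ),
)
```

Calling <code>ITEM_LIST_SCHEMA.align(["Tools", " Hammer "])</code> yields a dictionary with <code>category</code>, <code>item_name</code>, and a <code>quantity</code> of <code>None</code>. Purely positional alignment is the simplest possible rule; a real parser would likely also use header text or column coordinates.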


'''Column Mapping'''

<code>column mapping</code> is the process of mapping the original PDF columns to standardized fields.
For example, different documents may use headers such as “Item Name,” “Object Name,” or “Target Name,” all of which can be normalized to <code>item_name</code> in the output data.


'''Alias Mapping'''

<code>alias mapping</code> is a lookup table of header aliases.
Because PDF table headers often vary across versions, formats, or reporting units, the parser needs to map multiple header names to the same standardized field.
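
A minimal sketch of an alias lookup, assuming headers are matched after lower-casing and collapsing whitespace. The <code>item_name</code> aliases come from the example above; the <code>quantity</code> aliases and the names <code>HEADER_ALIASES</code>, <code>ALIAS_LOOKUP</code>, and <code>map_columns</code> are hypothetical.

```python
import re

# Hypothetical alias table: standardized field -> header variants seen in PDFs.
HEADER_ALIASES = {
    "item_name": ["Item Name", "Object Name", "Target Name"],
    "quantity": ["Qty", "Quantity"],  # illustrative only
}


def _normalize(header: str) -> str:
    """Lower-case and collapse whitespace (including line breaks inside a header cell)."""
    return re.sub(r"\s+", " ", header).strip().lower()


# Inverted table: normalized alias -> standardized field.
ALIAS_LOOKUP = {
    _normalize(alias): field
    for field, aliases in HEADER_ALIASES.items()
    for alias in aliases
}


def map_columns(headers):
    """Return the standardized field for each header, or None when it is unknown."""
    return [ALIAS_LOOKUP.get(_normalize(header)) for header in headers]
```

Unknown headers map to <code>None</code> rather than raising an error, so the caller can decide whether to skip the column or flag the table for review.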


'''Forward-Fill'''

<code>forward-fill</code> is the process of letting later rows inherit field values from earlier rows.
In a parent-child row structure, the parent row's information may appear only once, while the child rows that follow omit the repeated values. In this case, the parser needs to copy the parent-row information into the child rows.
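
The idea can be sketched in a few lines of plain Python, assuming each row is a dictionary and that an omitted value shows up as <code>None</code> or an empty string. The function name <code>forward_fill</code> and the example field names are hypothetical.

```python
def forward_fill(rows, fields):
    """Copy the last non-empty value of each listed field into rows that omit it.

    Returns new row dictionaries; the input rows are left unchanged.
    """
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            value = new_row.get(field)
            if value in (None, ""):
                # Inherit from the most recent row that had a value, if any.
                if field in last_seen:
                    new_row[field] = last_seen[field]
            else:
                last_seen[field] = value
        filled.append(new_row)
    return filled


rows = [
    {"category": "Tools", "item_name": "Hammer"},  # parent row
    {"category": "", "item_name": "Wrench"},       # child row, category omitted
    {"category": "Paint", "item_name": "Primer"},  # new parent row
    {"category": None, "item_name": "Brush"},      # child row
]
filled_rows = forward_fill(rows, fields=["category"])
```

Only fields that are meant to be inherited should be listed in <code>fields</code>; a cell that is genuinely blank in the source would otherwise be overwritten with the parent's value.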


'''Parent Row / Child Row'''

A <code>parent row</code> is a row that provides main category, group, or summary information.
In PDF tables, a parent row commonly lists the primary information while its child rows list only the detailed items. If parent-child relationships are not handled, the output data may lose necessary context.


'''Cross-Page Continuation'''

<code>cross-page continuation</code> is the situation in which a single record is split and continues onto the next page.
Handling it requires determining whether text on the next page belongs to an unfinished record from the previous page, rather than treating it directly as a new data row.
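
The sketch below shows one possible heuristic, not a general solution: it assumes that the first row of a page is a continuation when its key column is empty, and that the overflow text belongs in a single free-text column. The function name <code>merge_continuations</code> and the field names <code>item_name</code> and <code>description</code> are hypothetical.

```python
def merge_continuations(pages, key_field="item_name", text_field="description"):
    """Flatten per-page rows into records, merging rows that continue across a page break.

    Assumed heuristic: the first row of a page with an empty key_field
    continues the last record of the previous page.
    """
    records = []
    for page_rows in pages:
        for index, row in enumerate(page_rows):
            is_continuation = (
                index == 0 and bool(records) and not row.get(key_field)
            )
            if is_continuation:
                previous = records[-1]
                head = previous.get(text_field) or ""
                tail = row.get(text_field) or ""
                previous[text_field] = f"{head} {tail}".strip()
            else:
                records.append(dict(row))  # copy so the input is not modified
    return records


pages = [
    [{"item_name": "Pump", "description": "Installed in the north"}],
    [
        {"item_name": "", "description": "wing basement."},
        {"item_name": "Valve", "description": "Spare"},
    ],
]
merged = merge_continuations(pages)
```

Real documents often need stronger signals than an empty key column, such as repeated header rows, "continued" markers, or row positions on the page.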


'''Metadata'''

<code>metadata</code> is auxiliary information that describes the data source and parsing state.
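
As an illustration, metadata can be carried alongside each record in a separate structure. This is a sketch under assumptions: <code>section</code>, <code>subsection</code>, and <code>table_type</code> are the terms defined above, while <code>source_file</code>, <code>page_number</code>, <code>forward_filled_fields</code>, <code>continued_from_previous_page</code>, and the <code>_meta</code> key are hypothetical names chosen for the example.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class RecordMetadata:
    # Data source
    source_file: str
    page_number: int
    section: str | None = None
    subsection: str | None = None
    table_type: str | None = None
    # Parsing state (hypothetical fields)
    forward_filled_fields: tuple[str, ...] = ()
    continued_from_previous_page: bool = False


def with_metadata(record: dict, meta: RecordMetadata) -> dict:
    """Return a copy of the record with its metadata stored under the '_meta' key."""
    return {**record, "_meta": asdict(meta)}


example = with_metadata(
    {"category": "Tools", "item_name": "Hammer"},
    RecordMetadata(
        source_file="report.pdf",
        page_number=3,
        table_type="item_list",
        forward_filled_fields=("category",),
    ),
)
```

Keeping metadata under its own key separates the extracted values from the information about where they came from and how they were produced.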
