=== 1. Terminology ===
'''Parser'''
A <code>parser</code> is a program responsible for parsing PDF content and converting it into structured data. It usually does more than simply read text or tables; it also needs to identify sections, table types, column positions, data rows, continuations, and source metadata.
'''Record'''
A <code>record</code> is a standardized data entry produced by the parser.
A single row in a PDF does not necessarily correspond to one record. The same record may be split across multiple rows, continue across pages, or require inherited information from a parent row before it becomes complete.
'''Section / Subsection'''
<code>section</code> and <code>subsection</code> identify the section and subsection of the source document in which the data appears.
These fields are typically used to describe the source context of the data, such as which section or table block a particular record comes from.
'''Table Type'''
<code>table_type</code> is a label defined by the parser that indicates what kind of data structure the current table represents.
It is not a built-in PDF field, nor is it information automatically provided by <code>pdfplumber</code>, Camelot, or Tabula.
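
Because the label has to come from the parser itself, one possible approach is to derive it from the header row. The following is a minimal sketch under that assumption; the rule table <code>TABLE_TYPE_RULES</code>, the labels <code>item_list</code> and <code>price_list</code>, and the function <code>classify_table</code> are hypothetical names chosen for illustration, not part of any library.

```python
from typing import List, Optional

# Hypothetical labels and the header names that identify each one.
TABLE_TYPE_RULES = {
    "item_list": {"item name", "quantity"},
    "price_list": {"item name", "unit price"},
}


def classify_table(header: List[Optional[str]]) -> str:
    """Return the first table_type whose required headers all appear."""
    # Skip empty or missing cells, then compare case-insensitively.
    normalized = {cell.strip().lower() for cell in header if cell}
    for table_type, required in TABLE_TYPE_RULES.items():
        if required <= normalized:
            return table_type
    return "unknown"


label = classify_table(["Item Name", "Quantity", "Remarks"])
```

Returning <code>"unknown"</code> rather than raising lets the caller decide whether an unrecognized table should be skipped or reported.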
'''Table Schema'''
A <code>table schema</code> refers to the column structure and parsing rules for a specific type of table.
In PDF table parsing, a <code>table schema</code> is not necessarily a database schema. Instead, it is a set of rules used by the parser to align and standardize table data.
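
As an illustration of a schema in this sense, the sketch below pairs a column order with one simple rule (which fields must be non-empty). The class <code>TableSchema</code>, its attributes, and the example schema <code>ITEM_LIST</code> are assumptions made for this example; a real parser may need richer rules.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class TableSchema:
    """Hypothetical schema: column order plus a required-field rule."""
    table_type: str
    columns: Tuple[str, ...]      # standardized field names, in column order
    required: FrozenSet[str]      # fields that must not be empty

    def parse_row(self, cells: List[Optional[str]]) -> Dict[str, str]:
        """Align raw cells to the schema columns and check required fields."""
        if len(cells) != len(self.columns):
            raise ValueError(
                f"expected {len(self.columns)} cells, got {len(cells)}"
            )
        # Treat missing cells as empty strings and trim whitespace.
        row = {
            name: (cell or "").strip()
            for name, cell in zip(self.columns, cells)
        }
        missing = sorted(name for name in self.required if not row[name])
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        return row


ITEM_LIST = TableSchema(
    table_type="item_list",
    columns=("item_name", "quantity", "remarks"),
    required=frozenset({"item_name"}),
)
row = ITEM_LIST.parse_row(["Widget A ", "3", None])
```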
'''Column Mapping'''
<code>column mapping</code> refers to the process of mapping original PDF columns to standardized fields.
For example, different documents may use headers such as “Item Name,” “Object Name,” or “Target Name,” but all of them can be normalized into <code>item_name</code> in the output data.
'''Alias Mapping'''
<code>alias mapping</code> refers to a lookup table that maps header aliases to standardized field names.
Because PDF table headers often vary across versions, formats, or reporting units, the parser needs to map multiple header names to the same standardized field.
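
A minimal sketch of an alias lookup combined with column mapping is shown below. The three <code>item_name</code> aliases are the header variants mentioned above; the <code>quantity</code> aliases, the table <code>HEADER_ALIASES</code>, and the helpers <code>normalize_header</code> and <code>map_columns</code> are hypothetical additions for illustration.

```python
from typing import Dict, List

# Alias lookup: normalized header text -> standardized field name.
HEADER_ALIASES = {
    "item name": "item_name",
    "object name": "item_name",
    "target name": "item_name",
    "qty": "quantity",        # hypothetical alias, not from the source text
    "quantity": "quantity",   # hypothetical alias, not from the source text
}


def normalize_header(header: str) -> str:
    """Map a raw header to its standardized field name."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(header.split()).lower()
    # Unknown headers fall back to a snake_case version of themselves.
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def map_columns(header: List[str], row: List[str]) -> Dict[str, str]:
    """Pair each cell with the standardized name of its column."""
    return {normalize_header(h): cell for h, cell in zip(header, row)}


mapped = map_columns(["Object  Name", "Qty"], ["Widget A", "3"])
```

Normalizing whitespace and case before the lookup keeps the alias table short, since headers extracted from PDFs often differ only in spacing or capitalization.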
'''Forward-Fill'''
<code>forward-fill</code> refers to the process of allowing later rows to inherit field values from previous rows.
In a parent-child row structure, a parent row may appear only once, while subsequent child rows omit repeated information. In this case, the parser needs to fill the parent-row information into the child rows.
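
The inheritance described above can be sketched as follows, assuming each row is already a dictionary of standardized fields and that an omitted value appears as an empty string. The function <code>forward_fill</code> and the field name <code>category</code> are hypothetical.

```python
from typing import Dict, List


def forward_fill(
    rows: List[Dict[str, str]], fields: List[str]
) -> List[Dict[str, str]]:
    """Copy the last non-empty value of each field into later empty cells."""
    filled: List[Dict[str, str]] = []
    last_seen: Dict[str, str] = {}
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for name in fields:
            value = new_row.get(name, "")
            if value:
                # A non-empty value becomes the one inherited by later rows.
                last_seen[name] = value
            elif name in last_seen:
                new_row[name] = last_seen[name]
        filled.append(new_row)
    return filled


rows = [
    {"category": "Fasteners", "item_name": "Bolt M6"},
    {"category": "", "item_name": "Nut M6"},
    {"category": "Seals", "item_name": "O-ring"},
    {"category": "", "item_name": "Gasket"},
]
filled = forward_fill(rows, ["category"])
```

Only the fields listed in <code>fields</code> are inherited, so a genuinely empty cell in another column is not overwritten. If the rows are held in a pandas DataFrame, <code>DataFrame.ffill()</code> performs a comparable fill, provided empty cells are represented as missing values.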
'''Parent Row / Child Row'''
A <code>parent row</code> is a row that provides main category, group, or summary information.
In PDF tables, it is common for a parent row to list primary information while child rows only list detailed items. Without handling parent-child row relationships, the output data may lose necessary context.
'''Cross-Page Continuation'''
<code>cross-page continuation</code> refers to a situation where the same record is split and continues onto the next page.
Cross-page continuation handling requires determining whether text on the next page belongs to an unfinished record from the previous page, instead of treating it directly as a new data row.
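
How that decision is made depends on the document. The sketch below uses one simple heuristic, stated here as an assumption: if the first row on a page has an empty key column (the first cell), it is treated as the continuation of the last record from the previous page. The function <code>merge_pages</code> is a hypothetical name, and real documents may need additional signals.

```python
from typing import List


def merge_pages(pages: List[List[List[str]]]) -> List[List[str]]:
    """Join per-page rows, merging a page's first row when it continues
    the last record of the previous page."""
    merged: List[List[str]] = []
    for page in pages:
        for position, row in enumerate(page):
            # Assumed heuristic: first row on the page, a previous record
            # exists, and the key column (first cell) is empty.
            is_continuation = (
                position == 0
                and bool(merged)
                and not row[0].strip()
            )
            if is_continuation:
                previous = merged[-1]
                # Append each continuation cell to the matching earlier cell.
                merged[-1] = [
                    " ".join(p for p in (old.strip(), new.strip()) if p)
                    for old, new in zip(previous, row)
                ]
            else:
                merged.append([cell.strip() for cell in row])
    return merged


pages = [
    [["Widget A", "Steel housing"], ["Widget B", "Plastic cover with"]],
    [["", "reinforced hinge"], ["Widget C", "Rubber seal"]],
]
merged = merge_pages(pages)
```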
'''Metadata'''
<code>metadata</code> refers to auxiliary information that describes the data source and parsing state.