PDF Table Parsing
=== 5. Cross-Page Continuation Is More Than Repeated Headers ===

When a PDF table spans multiple pages, a record can be split at the page break in several different ways. A parser that only handles repeated headers on the next page is usually insufficient for real-world documents.

Common cross-page continuation cases include:
* The first row on the next page contains only the latter part of a field from the previous page.
* Names, categories, descriptive text, or remarks are split across pages.
* The next page starts with multiple repeated header rows before continuing an unfinished record from the previous page.
* Continuation text and a new data-row identifier appear in the same row.
* Empty columns are compressed, causing column positions to shift.
* Table borders, line breaks, or OCR results split what was originally one row into multiple rows.

Cross-page processing should therefore preserve the original column positions and should not remove empty columns too early. When necessary, the parser should use column coordinates, column indexes, identifier patterns, and the state of the previous record to determine which field a continuation fragment belongs to.

How fragments are merged should also depend on the type of text:
* Chinese text fragments can usually be concatenated directly.
* English text fragments usually require an inserted space.
* Multi-value fields, category fields, or remarks may need to preserve line breaks, enumeration marks, or delimiters.
* Numeric fields should not have text fragments concatenated onto them by accident.
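The merging rules above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names <code>join_fragments</code> and <code>merge_rows</code>, the field kinds <code>"text"</code>, <code>"multi"</code>, and <code>"numeric"</code>, and the rule that a row is a new record only when its first cell is a purely numeric identifier. It assumes rows have already been extracted with empty cells kept as empty strings, so that column indexes still line up. It does not handle a row that carries both continuation text and a new identifier.

```python
import re
from typing import List, Sequence

# CJK ideographs, CJK punctuation, and fullwidth forms.
_CJK = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")
# Hypothetical identifier pattern: a serial number such as "12".
ROW_ID = re.compile(r"\d+")


def join_fragments(prev: str, frag: str, kind: str = "text") -> str:
    """Join a continuation fragment onto the value of the same field."""
    prev, frag = prev.rstrip(), frag.strip()
    if not frag:
        return prev
    if kind == "numeric":
        # Refuse to glue non-numeric text onto a numeric field.
        if not re.fullmatch(r"[\d,.]+", frag):
            raise ValueError(f"text fragment {frag!r} in numeric field")
        return prev + frag
    if not prev:
        return frag
    if kind == "multi":
        # Keep the line break so list items or remarks stay separate.
        return prev + "\n" + frag
    if _CJK.match(prev[-1]) or _CJK.match(frag[0]):
        # Chinese text: concatenate directly.
        return prev + frag
    # English (or other space-delimited) text: insert one space.
    return prev + " " + frag


def merge_rows(
    rows: Sequence[Sequence[str]],
    kinds: Sequence[str],
    header: Sequence[str],
    id_col: int = 0,
) -> List[List[str]]:
    """Merge continuation rows into the previous record by column index."""
    records: List[List[str]] = []
    for row in rows:
        if list(row) == list(header):
            continue  # skip header rows, including repeats on later pages
        is_new = ROW_ID.fullmatch(row[id_col].strip()) is not None
        if is_new or not records:
            records.append(list(row))
            continue
        last = records[-1]
        for i, cell in enumerate(row):
            if cell:
                last[i] = join_fragments(last[i], cell, kinds[i])
    return records


header = ["No.", "Name", "Item", "Amount"]
rows = [
    header,
    ["1", "Acme", "Office", "1,200"],
    header,                       # repeated header on the next page
    ["", "", "supplies", ""],     # continuation of the "Item" field
    ["2", "台北", "文具", "300"],
    ["", "分公司", "", ""],        # continuation of the "Name" field
]
records = merge_rows(rows, ["text", "text", "text", "numeric"], header)
```

Because the empty cells are kept, the fragment <code>supplies</code> is still at index 2 and is merged into the item field rather than into the name field. If empty columns had been dropped first, the continuation row would have collapsed to a single cell and its target field could no longer be determined from its position.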