Mutool Glyph Index Error

From LemonWiki共筆
Revision as of 14:07, 3 February 2026 by Planetoid

mutool glyph index error is a failure that occurs when the mutool command from the MuPDF toolset processes a PDF file whose fonts contain invalid glyph index mappings, causing text extraction to fail. The problem is most common in scanned documents that have been processed with Optical Character Recognition (OCR).

Problem Description

When mutool processes certain PDF files, it emits the following warning and error messages:

warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index
Error!: Invalid PDF format

This error aborts text extraction, so the PDF content cannot be retrieved. It is especially common in scanned documents that contain a hidden OCR text layer[1].
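When triaging many files, it can help to pull the font name and glyph index out of the warning line. The following is a minimal sketch: the pattern is modeled only on the single warning quoted above, and the names `parse_glyph_warning` and `GlyphWarning` are hypothetical helpers, not part of MuPDF. Other MuPDF versions may word the message differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the warning line quoted in this article (assumption:
# other MuPDF versions may format the message differently).
_WARNING_RE = re.compile(
    r"FT_Get_Advance\((?P<font>[^,()]+),(?P<glyph>\d+)\): invalid glyph index"
)


class GlyphWarning(NamedTuple):
    """Font name and glyph index reported in one warning line."""
    font: str
    glyph_index: int


def parse_glyph_warning(line: str) -> Optional[GlyphWarning]:
    """Return the font and glyph index from a warning line, or None if the line does not match."""
    match = _WARNING_RE.search(line)
    if match is None:
        return None
    return GlyphWarning(match.group("font"), int(match.group("glyph")))


sample = "warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index"
print(parse_glyph_warning(sample))
```

Counting the parsed font names across a batch of documents shows whether failures cluster on one OCR font such as HiddenHorzOCR.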

Technical Background

PDF Font Embedding Mechanism

To ensure consistent cross-platform display, PDF files typically embed the fonts used within the document. To reduce file size, font subsetting technology is commonly employed, embedding only the glyphs actually used[2].

FreeType Library

mutool relies on the FreeType library for font rendering. The FT_Get_Advance() function is responsible for obtaining the advance width of glyphs, a critical parameter for text layout[3]. When a provided glyph index does not exist in the font, the function returns an error.

Root Cause

The root cause of this error can be summarized as follows:

  1. OCR Font Defects
    • Hidden fonts generated by OCR software (such as HiddenHorzOCR) often contain incomplete or erroneous glyph definitions
    • Glyph index mapping tables are inconsistent with actual glyph data
  2. Font Subsetting Errors
    • Font subsetting during PDF generation may produce invalid indices
    • Character encoding to glyph index correspondence is corrupted
  3. PDF Specification Compliance Differences
    • Some PDF generation tools do not strictly adhere to PDF/A or ISO 32000 standards
    • Font table structures do not conform to OpenType or TrueType specifications
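One way such an inconsistency can surface is a character code that maps to a glyph index at or beyond the number of glyphs actually present in the embedded font. The sketch below is a toy model of that situation, not MuPDF or FreeType code: the dictionary stands in for a font's character-to-glyph table, and `find_invalid_mappings` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def find_invalid_mappings(
    char_to_glyph: Dict[int, int], num_glyphs: int
) -> List[Tuple[int, int]]:
    """Return (character code, glyph index) pairs whose index falls outside 0..num_glyphs-1."""
    return sorted(
        (code, gid)
        for code, gid in char_to_glyph.items()
        if not 0 <= gid < num_glyphs
    )


# Toy subset font: 100 glyphs are present, but one entry points at glyph 126.
toy_map = {0x41: 36, 0x42: 37, 0x20AC: 126}
print(find_invalid_mappings(toy_map, num_glyphs=100))
```

In this model, any lookup that follows the third entry asks for a glyph that does not exist, which corresponds to the "invalid glyph index" condition in the warning.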

Software Error Tolerance Differences

PDF Processing Engine | Developer        | Font Error Handling Strategy           | Specification Compliance
MuPDF (mutool)        | Artifex Software | Strict validation, fail on error       | High
Xpdf (pdftotext)      | Glyph & Cog      | Lenient handling, continue with errors | Medium
Poppler               | freedesktop.org  | Moderate strictness                    | Medium-High

Because PDF engines handle font errors differently, the same document can extract cleanly with one tool and fail with another[4].

Solutions

Multiple Fallback Mechanism

A cascading fallback across converters is recommended: try each tool in order and move on to the next only when the previous one fails:

First Layer: mutool draw -F txt
   ↓ (failure)
Second Layer: pdftotext -enc UTF-8
   ↓ (failure)
Third Layer: pdfly extract-text
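The cascade above could be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes all three tools are installed on PATH, that each writes the extracted text to standard output when invoked as shown (pdftotext with `-` as the output file), and that a non-zero exit code or empty output counts as failure. The function names `build_commands` and `first_successful_output` are hypothetical.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple


def build_commands(pdf_path: str) -> List[List[str]]:
    """Converter command lines in fallback order (assumed to write text to stdout)."""
    return [
        ["mutool", "draw", "-F", "txt", pdf_path],
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        ["pdfly", "extract-text", pdf_path],
    ]


def first_successful_output(
    commands: Sequence[Sequence[str]], timeout: float = 120.0
) -> Optional[Tuple[int, str]]:
    """Run commands in order; return (layer index, text) for the first that succeeds, else None."""
    for index, argv in enumerate(commands):
        try:
            result = subprocess.run(
                list(argv),
                capture_output=True,
                text=True,
                encoding="utf-8",
                errors="replace",
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool missing, not executable, or hung: fall through to the next layer.
            print(f"layer {index + 1} ({argv[0]}) did not run: {exc}", file=sys.stderr)
            continue
        if result.returncode == 0 and result.stdout.strip():
            return index, result.stdout
        print(f"layer {index + 1} ({argv[0]}) failed", file=sys.stderr)
    return None
```

A caller would pass `build_commands("scan.pdf")` to `first_successful_output` and treat a `None` result as "all layers failed". Returning the layer index alongside the text makes it possible to log which converter finally handled each document.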


References

  1. PyMuPDF Development Team (2023). “Font errors during document processing”. GitHub Discussions. https://github.com/pymupdf/PyMuPDF/discussions/2385
  2. Adobe Systems (2008). PDF Reference, sixth edition: Adobe Portable Document Format version 1.7. https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
  3. The FreeType Project (2024). “FreeType-2 API Reference: Quick retrieval of advance values”. https://freetype.org/freetype2/docs/reference/ft2-quick_advance.html
  4. Artifex Software (2024). “MuPDF Documentation”. https://mupdf.readthedocs.io/_/downloads/en/1.26.1/pdf/

