Mutool Glyph Index Error
The mutool glyph index error occurs when the mutool command from the MuPDF toolset processes a PDF whose font contains an invalid glyph index mapping, causing text extraction to fail. It is most common in scanned documents that have been processed with Optical Character Recognition (OCR).
Problem Description
When mutool processes certain PDF files, it generates the following error message:
warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index
Error!: Invalid PDF format
This error aborts text extraction, so the PDF content cannot be retrieved. It is most likely to occur in scanned documents that contain a hidden OCR text layer[1].
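When diagnosing a batch of files, it can help to pull the font name and glyph index out of the warning text. The following is a minimal sketch that assumes the warning always follows the format quoted above; the function name `find_invalid_glyphs` and the regular expression are illustrative and not part of MuPDF.

```python
import re

# Pattern derived from the warning quoted above; other MuPDF versions may
# word the message differently, so treat this as an assumption.
WARNING_RE = re.compile(
    r"FT_Get_Advance\((?P<font>[^,()]+),(?P<glyph>\d+)\): invalid glyph index"
)


def find_invalid_glyphs(stderr_text):
    """Return (font_name, glyph_index) pairs found in captured warning output."""
    return [
        (match.group("font"), int(match.group("glyph")))
        for match in WARNING_RE.finditer(stderr_text)
    ]


sample = "warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index"
print(find_invalid_glyphs(sample))  # [('HiddenHorzOCR', 126)]
```

Grouping the results by font name shows quickly whether every failure traces back to the same OCR font.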
Technical Background
PDF Font Embedding Mechanism
To ensure consistent cross-platform display, PDF files typically embed the fonts used within the document. To reduce file size, font subsetting technology is commonly employed, embedding only the glyphs actually used[2].
FreeType Library
mutool relies on the FreeType library for font rendering. The FT_Get_Advance() function retrieves the advance width of a glyph, a critical parameter for text layout[3]. When the supplied glyph index does not exist in the font, the function returns an error (FT_Err_Invalid_Glyph_Index) instead of an advance value.
Root Cause
The root cause of this error can be summarized as follows:
- OCR Font Defects
  - Hidden fonts generated by OCR software (such as HiddenHorzOCR) often contain incomplete or erroneous glyph definitions
  - Glyph index mapping tables are inconsistent with actual glyph data
- Font Subsetting Errors
  - Font subsetting during PDF generation may produce invalid indices
  - Character encoding to glyph index correspondence is corrupted
- PDF Specification Compliance Differences
  - Some PDF generation tools do not strictly adhere to PDF/A or ISO 32000 standards
  - Font table structures do not conform to OpenType or TrueType specifications
Software Error Tolerance Differences
| PDF Processing Engine | Developer | Font Error Handling Strategy | Specification Compliance |
|---|---|---|---|
| MuPDF (mutool) | Artifex Software | Strict validation, fail on error | High |
| Xpdf (pdftotext) | Glyph & Cog | Lenient handling, continue with errors | Medium |
| Poppler | freedesktop.org | Moderate strictness | Medium-High |
Because each engine handles font errors differently, the same document can extract cleanly with one tool and fail with another[4].
Solutions
Multiple Fallback Mechanism
Implement a cascading fallback across converters, trying each tool in turn until one succeeds:
First Layer: mutool draw -F txt
↓ (failure)
Second Layer: pdftotext -enc UTF-8
↓ (failure)
Third Layer: pdfly extract-text
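The cascade can be sketched in Python by running each command-line tool in order and returning the first non-empty result. This is a minimal sketch under several assumptions: all three tools are installed and on the PATH, each writes the extracted text to standard output when invoked as shown (the trailing `-` for pdftotext and the absence of `-o` for mutool are assumptions to verify against the installed versions), and a non-zero exit code or empty output counts as failure. The names `CONVERTERS`, `run_command`, and `extract_text` are illustrative.

```python
import subprocess

# Commands mirror the three layers listed above. "{pdf}" is a placeholder
# that is replaced with the input path before the command runs.
CONVERTERS = [
    ("mutool", ["mutool", "draw", "-F", "txt", "{pdf}"]),
    ("pdftotext", ["pdftotext", "-enc", "UTF-8", "{pdf}", "-"]),
    ("pdfly", ["pdfly", "extract-text", "{pdf}"]),
]


def run_command(argv):
    """Run one converter and return its stdout as text, or None on failure."""
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired):
        return None  # tool missing, not executable, or hung
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", errors="replace")


def extract_text(pdf_path, runner=run_command):
    """Try each converter in order; return (tool_name, text) from the first success."""
    for name, template in CONVERTERS:
        argv = [pdf_path if arg == "{pdf}" else arg for arg in template]
        text = runner(argv)
        if text and text.strip():
            return name, text
    raise RuntimeError(f"all converters failed for {pdf_path}")
```

Returning the tool name alongside the text makes it possible to log which layer produced each result, and the `runner` parameter lets the cascade be exercised in tests without the tools installed.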
References
- PyMuPDF Development Team (2023). “Font errors during document processing”. GitHub Discussions. https://github.com/pymupdf/PyMuPDF/discussions/2385
- Adobe Systems (2008). PDF Reference, sixth edition: Adobe Portable Document Format version 1.7. https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
- The FreeType Project (2024). “FreeType-2 API Reference: Quick retrieval of advance values”. https://freetype.org/freetype2/docs/reference/ft2-quick_advance.html
- Artifex Software (2024). “MuPDF Documentation”. https://mupdf.readthedocs.io/_/downloads/en/1.26.1/pdf/