Mutool Glyph Index Error

The mutool glyph index error occurs when the mutool command from the MuPDF toolset processes a PDF whose fonts map characters to glyph indices that do not exist in the font, causing text extraction to fail. It is most common in scanned documents that have been processed with Optical Character Recognition (OCR).

Problem Description

When mutool processes certain PDF files, it generates the following error message:

warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index
Error!: Invalid PDF format

This error aborts text extraction, so the content of the PDF cannot be retrieved. Scanned documents that contain a hidden OCR text layer are especially prone to it[1].
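When the output of mutool is captured by a script, the font name and the offending glyph index can be pulled out of the warning line for logging or triage. The following is a minimal sketch, assuming the warning has exactly the shape quoted above; other MuPDF versions may word it differently, and the names GlyphWarning and parse_glyph_warning are illustrative, not part of MuPDF.

```python
import re
from typing import NamedTuple, Optional


class GlyphWarning(NamedTuple):
    """Font name and glyph index reported in an 'invalid glyph index' warning."""
    font: str
    glyph_index: int


# Pattern modelled on the single warning line quoted in this article (assumption:
# the message format is stable; adjust the pattern if your build prints it differently).
_WARNING_RE = re.compile(
    r"warning: FT_Get_Advance\((?P<font>[^,]+),(?P<gid>\d+)\): invalid glyph index"
)


def parse_glyph_warning(line: str) -> Optional[GlyphWarning]:
    """Return the parsed warning, or None if the line is not a glyph index warning."""
    match = _WARNING_RE.search(line)
    if match is None:
        return None
    return GlyphWarning(match.group("font"), int(match.group("gid")))


if __name__ == "__main__":
    sample = "warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index"
    print(parse_glyph_warning(sample))
```

Applied to each line of captured stderr, this makes it possible to count how many distinct fonts and indices are affected in a batch of documents.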

Technical Background

PDF Font Embedding Mechanism

To ensure consistent display across platforms, PDF files typically embed the fonts used in the document. To reduce file size, fonts are commonly subset so that only the glyphs actually used are embedded[2].

FreeType Library

mutool relies on the FreeType library for font rendering. The FT_Get_Advance() function is responsible for obtaining the advance width of glyphs, a critical parameter for text layout[3]. When a provided glyph index does not exist in the font, the function returns an error.

Root Cause

The root cause of this error can be summarized as follows:

OCR Font Defects
- Hidden fonts generated by OCR software (such as HiddenHorzOCR) often contain incomplete or erroneous glyph definitions
- Glyph index mapping tables are inconsistent with actual glyph data

Font Subsetting Errors
- Font subsetting during PDF generation may produce invalid indices
- Character encoding to glyph index correspondence is corrupted

PDF Specification Compliance Differences
- Some PDF generation tools do not strictly adhere to PDF/A or ISO 32000 standards
- Font table structures do not conform to OpenType or TrueType specifications
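The mismatch between a mapping table and the glyph data can be illustrated with a toy model. This is a simplified sketch, not FreeType or MuPDF code: it assumes a font is reduced to a character-to-glyph-index dictionary plus a glyph count, and the function name and sample values are hypothetical.

```python
from typing import Dict


def find_invalid_glyph_indices(char_to_gid: Dict[str, int], num_glyphs: int) -> Dict[str, int]:
    """Return the characters whose glyph index falls outside the range [0, num_glyphs)."""
    return {
        char: gid
        for char, gid in char_to_gid.items()
        if not 0 <= gid < num_glyphs
    }


if __name__ == "__main__":
    # Hypothetical subset font: only 3 glyphs are embedded, but the mapping
    # still points "~" at index 126, which no longer exists after subsetting.
    mapping = {"A": 1, "B": 2, "~": 126}
    print(find_invalid_glyph_indices(mapping, num_glyphs=3))  # {'~': 126}
```

Any index reported by such a check corresponds to a lookup that a font library would have to reject, which is the situation the warning describes.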

Software Error Tolerance Differences

PDF Processing Engine	Developer	Font Error Handling Strategy	Specification Compliance
MuPDF (mutool)	Artifex Software	Strict validation, fail on error	High
Xpdf (pdftotext)	Glyph & Cog	Lenient handling, continue with errors	Medium
Poppler	freedesktop.org	Moderate strictness	Medium-High

Because PDF engines handle font errors differently, the same document can produce different results depending on the tool used to process it[4].

Solutions

Multiple Fallback Mechanism

A cascading fallback across converters is recommended: each tool is tried in turn, and the next one is used only if the previous one fails.

First Layer: mutool draw -F txt
   ↓ (failure)
Second Layer: pdftotext -enc UTF-8
   ↓ (failure)
Third Layer: pdfly extract-text
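The cascade above can be sketched as a small Python wrapper around the three command-line tools. This is a minimal sketch under stated assumptions: all three tools are installed and on the PATH, each is asked to write text to standard output (mutool draw with no -o option, pdftotext with "-" as the output file, pdfly extract-text by default), and a non-zero exit status or empty output counts as failure. The function and variable names are illustrative.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# One argument-list builder per layer of the cascade, in order of preference.
# Assumption: each command prints the extracted text to standard output.
DEFAULT_COMMANDS: Sequence[Callable[[str], List[str]]] = (
    lambda pdf: ["mutool", "draw", "-F", "txt", pdf],
    lambda pdf: ["pdftotext", "-enc", "UTF-8", pdf, "-"],
    lambda pdf: ["pdfly", "extract-text", pdf],
)


def extract_text(
    pdf_path: str,
    commands: Sequence[Callable[[str], List[str]]] = DEFAULT_COMMANDS,
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> Optional[str]:
    """Try each converter in turn; return the first non-empty output, or None."""
    for build_argv in commands:
        argv = build_argv(pdf_path)
        try:
            result = run(
                argv,
                capture_output=True,
                text=True,
                encoding="utf-8",
                errors="replace",
                check=False,
            )
        except OSError:
            # The tool is missing or cannot be started: fall through to the next layer.
            continue
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
    return None


if __name__ == "__main__":
    import sys

    text = extract_text(sys.argv[1]) if len(sys.argv) > 1 else None
    print(text if text is not None else "all converters failed")
```

Passing the runner as a parameter keeps the cascade logic testable without the external tools installed. Treating empty output as a failure is a design choice of this sketch; it also catches the case where a tool exits successfully but extracts nothing.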

References

[1] PyMuPDF Development Team (2023). “Font errors during document processing”. GitHub Discussions. https://github.com/pymupdf/PyMuPDF/discussions/2385
[2] Adobe Systems (2008). PDF Reference, sixth edition: Adobe Portable Document Format version 1.7. https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
[3] The FreeType Project (2024). “FreeType-2 API Reference: Quick retrieval of advance values”. https://freetype.org/freetype2/docs/reference/ft2-quick_advance.html
[4] Artifex Software (2024). “MuPDF Documentation”. https://mupdf.readthedocs.io/_/downloads/en/1.26.1/pdf/
