Mutool Glyph Index Error: Difference between revisions

Latest revision as of 17:18, 3 February 2026

The mutool glyph index error occurs when the mutool command from the MuPDF toolset processes a PDF file whose font has an abnormal glyph index mapping, causing text extraction to fail. It is most common in scanned documents that have been processed with Optical Character Recognition (OCR).

Problem Description

When mutool processes certain PDF files, it generates the following error message:

warning: FT_Get_Advance(HiddenHorzOCR,126): invalid glyph index
Error!: Invalid PDF format

This error aborts text extraction, so the PDF content cannot be retrieved. It is especially likely in scanned documents that contain a hidden OCR text layer ^[1].
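To act on this failure automatically, it helps to recognize the warning in mutool's diagnostic output. The following is a minimal sketch, assuming the message always has the form shown above and that its two fields are the font name and the rejected glyph index; the pattern and the helper name find_glyph_index_warnings are illustrative, not part of MuPDF.

```python
import re

# Matches the warning format quoted above. This is an assumption based on that
# single sample; other MuPDF versions may word the message differently.
WARNING_RE = re.compile(
    r"FT_Get_Advance\((?P<font>[^,()]+),(?P<gid>\d+)\): invalid glyph index"
)


def find_glyph_index_warnings(stderr_text):
    """Return (font_name, glyph_index) pairs found in the diagnostic text."""
    return [
        (match.group("font"), int(match.group("gid")))
        for match in WARNING_RE.finditer(stderr_text)
    ]
```

Passing the captured standard error of a mutool run to this helper yields one pair per offending glyph, which can be logged before switching to another converter.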

Technical Background

PDF Font Embedding Mechanism

To ensure consistent display across platforms, PDF files typically embed the fonts they use. To reduce file size, font subsetting is commonly applied, so that only the glyphs the document actually uses are embedded ^[2].

FreeType Library

mutool relies on the FreeType library for font rendering. The FT_Get_Advance() function retrieves the advance width of a glyph, a critical parameter for text layout ^[3]. When the requested glyph index does not exist in the font, the function returns an error.

Root Cause

The root cause of this error can be summarized as follows:

OCR Font Defects
- Hidden fonts generated by OCR software (such as HiddenHorzOCR) often contain incomplete or erroneous glyph definitions
- Glyph index mapping tables are inconsistent with actual glyph data
Font Subsetting Errors
- Font subsetting during PDF generation may produce invalid indices
- Character encoding to glyph index correspondence is corrupted
PDF Specification Compliance Differences
- Some PDF generation tools do not strictly adhere to PDF/A or ISO 32000 standards
- Font table structures do not conform to OpenType or TrueType specifications
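The common thread in these causes is a character-to-glyph mapping that points at a glyph the embedded font does not contain. The sketch below is a toy model of that check, not MuPDF or FreeType code; the function name, the mapping, and the glyph count are all hypothetical values chosen for illustration.

```python
def find_dangling_mappings(char_to_glyph, glyph_count):
    """Return the mapping entries whose glyph index lies outside the font.

    char_to_glyph maps character codes to glyph indices; glyph_count is the
    number of glyphs actually embedded. Valid indices are 0 .. glyph_count - 1.
    """
    return {
        code: gid
        for code, gid in char_to_glyph.items()
        if not 0 <= gid < glyph_count
    }


# Hypothetical subset font: three glyphs are embedded, but one entry kept the
# index it had before subsetting.
mapping = {0x41: 1, 0x42: 2, 0x43: 126}
print(find_dangling_mappings(mapping, glyph_count=3))
```

Any entry this check reports corresponds to a lookup that a strict engine would reject with an invalid glyph index error.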

Software Error Tolerance Differences

PDF Processing Engine	Developer	Font Error Handling Strategy	Specification Compliance
MuPDF (mutool)	Artifex Software	Strict validation, fail on error	High
Xpdf (pdftotext)	Glyph & Cog	Lenient handling, continue with errors	Medium
Poppler	freedesktop.org	Moderate strictness	Medium-High

Because PDF engines handle font errors differently, the same document can be processed successfully by one tool and fail in another ^[4].

Solutions

Multiple Fallback Mechanism

The recommended approach is a cascade of converters, where each tool is tried only if the previous one fails:

First Layer: mutool draw -F txt
   ↓ (failure)
Second Layer: pdftotext -enc UTF-8
   ↓ (failure)
Third Layer: pdfly extract-text
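The cascade above could be sketched in Python as follows. This is a minimal illustration under stated assumptions: the exact command-line arguments, the expectation that each tool writes the extracted text to standard output, and the rule that a non-zero exit code or empty output counts as failure are all assumptions to verify against the installed tool versions. The names DEFAULT_CONVERTERS and extract_text_with_fallback are hypothetical.

```python
import subprocess

# Command templates for the three layers; "{pdf}" is replaced by the input path.
# The arguments are assumptions based on the commands listed above.
DEFAULT_CONVERTERS = [
    ["mutool", "draw", "-F", "txt", "{pdf}"],
    ["pdftotext", "-enc", "UTF-8", "{pdf}", "-"],
    ["pdfly", "extract-text", "{pdf}"],
]


def extract_text_with_fallback(pdf_path, converters=DEFAULT_CONVERTERS, timeout=120):
    """Try each converter in order; return (tool, text) from the first success."""
    failures = []
    for template in converters:
        command = [arg.replace("{pdf}", pdf_path) for arg in template]
        try:
            result = subprocess.run(
                command,
                capture_output=True,
                encoding="utf-8",
                errors="replace",
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool not installed, not executable, or took too long.
            failures.append(f"{command[0]}: {exc}")
            continue
        # Treat empty output as failure too, since a tool may exit with
        # status 0 even when it extracted nothing.
        if result.returncode == 0 and result.stdout.strip():
            return command[0], result.stdout
        failures.append(
            f"{command[0]}: exit {result.returncode}, {result.stderr.strip()[:200]}"
        )
    raise RuntimeError("all converters failed: " + "; ".join(failures))
```

Keeping the converters in a plain list makes the order easy to change and lets a caller substitute other tools without touching the loop.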

References

[1] PyMuPDF Development Team (2023). “Font errors during document processing”. GitHub Discussions. https://github.com/pymupdf/PyMuPDF/discussions/2385
[2] Adobe Systems (2008). PDF Reference, sixth edition: Adobe Portable Document Format version 1.7. https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
[3] The FreeType Project (2024). “FreeType-2 API Reference: Quick retrieval of advance values”. https://freetype.org/freetype2/docs/reference/ft2-quick_advance.html
[4] Artifex Software (2024). “MuPDF Documentation”. https://mupdf.readthedocs.io/_/downloads/en/1.26.1/pdf/

See Also

- PDF conversion
- Text Extraction

Categories: PDF Processing, Document Conversion, Revised with LLMs
