OCR: Difference between revisions

From LemonWiki共筆
== OCR scripts ==
Scripts

[https://github.com/ocropus/ocropy ocropus/ocropy: Python-based tools for document analysis and OCR]
* Script Language: Python
* Support Language: < 10. {{exclaim}} No Chinese model files are provided. {{access | date=2022-04-20}} More on [https://github.com/ocropus-archive/DUP-ocropy/wiki/Models Models · ocropus-archive/DUP-ocropy Wiki]
* License: [https://github.com/ocropus/ocropy/blob/master/LICENSE Apache License 2.0]

[https://github.com/tesseract-ocr/tesseract tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)] {{access | date=2022-06-19}}
* Script Language: C++; PHP wrapper: [https://github.com/thiagoalessio/tesseract-ocr-for-php thiagoalessio/tesseract-ocr-for-php: A wrapper to work with Tesseract OCR inside PHP.]
* Support Language: 100+, including a Traditional Chinese model file ({{kbd | key=chi_tra (Chinese traditional)}}) <ref>[https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages-and-scripts LANGUAGES AND SCRIPTS]</ref>, but the recognition results for Traditional Chinese are poor. {{access | date=2022-04-20}} More on [https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html Languages/Scripts supported in different versions of Tesseract | tessdoc]
* License: [https://github.com/tesseract-ocr/tesseract/blob/main/LICENSE Apache License 2.0]. PHP wrapper: [https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE MIT License]


== OCR API ==

Revision as of 11:52, 6 May 2024

OCR (optical character recognition), also known as image-to-text conversion


OCR tools

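Converting images to text

  • Google Photos: upload the image to Google Photos, then click "Copy text from image" [Last visited: 2022-09-30]
  • MS Office 2003: requires a separately installed Office tool, Microsoft Office Document Imaging (see "You can easily do text recognition (OCR) too")
    1. (Convert the .pdf file to .mdi) Print the PDF to MS Office 2003 Document Imaging
    2. (Convert the .mdi file to a Word file) In MS Office 2003 Document Imaging, open the .mdi file -> recognize text using OCR / send text to Word
Tip: free tiers of online services limit the number of pages per PDF file, so split the file first with PDF split and merge tools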
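
The page ranges to hand to a PDF splitting tool can be worked out ahead of time. The following is a minimal sketch, assuming the service accepts at most a fixed number of pages per file; the function name `page_chunks` and the limit of 10 pages are hypothetical and not taken from any particular service.

```python
def page_chunks(total_pages, max_pages_per_file):
    """Return 1-based, inclusive (first, last) page ranges.

    Each range covers at most max_pages_per_file pages, and together the
    ranges cover pages 1..total_pages exactly once.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages_per_file < 1:
        raise ValueError("max_pages_per_file must be at least 1")
    return [
        (first, min(first + max_pages_per_file - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages_per_file)
    ]


if __name__ == "__main__":
    # Hypothetical case: a 25-page PDF and a 10-page limit per upload.
    for first, last in page_chunks(25, 10):
        print(f"pages {first}-{last}")
```

Each resulting range can then be entered into whichever split tool is used.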

Converting PDF to text

OCR scripts

ocropus/ocropy: Python-based tools for document analysis and OCR

tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository) [Last visited: 2022-06-19]
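
Tesseract is usually driven from the command line, so a script can call it as a subprocess. The sketch below is one possible way to do that from Python, assuming the `tesseract` executable is on the PATH and the `chi_tra` (Traditional Chinese) language data is installed. The function names and the file name `scan.png` are hypothetical. As noted on this page, recognition quality for Traditional Chinese may be poor.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="chi_tra"):
    """Build the argument list for a Tesseract CLI call.

    Passing "stdout" as the output base makes tesseract print the
    recognized text to standard output instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="chi_tra"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # decode the output as UTF-8 so CJK text survives
        check=True,        # raise CalledProcessError if tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scanned image.
    print(ocr_image("scan.png"))
```

Keeping the command construction in its own function makes it easy to inspect or log the exact call before running it.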

OCR API

OCR API


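Related pages

Resolution settings for common documents

Resolution settings for common uses

  • Text recognition: 75~150 dpi
  • Mixed text and images: 100~150 dpi
  • Images (viewed on screen): 150~250 dpi. Personal experience: in scanned images of presentation slides, small text can be recognized at 300 dpi, but raising the setting to 600 dpi is recommended.
  • Images (to be printed): 300 dpi or higher
  • Business cards: 150~200 dpi

Source: PCHome 2005/8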
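
The ranges in the list above can be kept as a small lookup table so that a scanning script can check a chosen setting. This is only an illustrative sketch: the dictionary keys and the function name are hypothetical, and the numbers are copied from the list, not from any scanner's documentation.

```python
# Recommended scan resolutions in dpi as (minimum, maximum).
# A maximum of None means "this value or higher".
RECOMMENDED_DPI = {
    "text_recognition": (75, 150),
    "mixed_text_and_images": (100, 150),
    "image_on_screen": (150, 250),
    "image_for_print": (300, None),
    "business_card": (150, 200),
}


def is_within_recommendation(use, dpi):
    """Return True if dpi falls inside the recommended range for the given use."""
    low, high = RECOMMENDED_DPI[use]
    return dpi >= low and (high is None or dpi <= high)


if __name__ == "__main__":
    print(is_within_recommendation("text_recognition", 100))
    print(is_within_recommendation("business_card", 300))
```

The personal-experience note about small text (300 dpi works, 600 dpi recommended) falls outside the listed range for text recognition, so a check like this is best treated as a starting point and not a hard limit.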

References

Related articles