Speech to text: Difference between revisions

Speech to text (edit)

Revision as of 11:44, 6 June 2026

2,754 bytes added, 6 June

no edit summary

Planetoid

Bureaucrats, Administrators

15,049 edits

@@ Line 1: / Line 1: @@
-== Speech to text tools ==
+Comparison of speech to text (transcription) software
-{{Gd}} [https://github.com/openai/whisper openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision]
-* Support Language: 99 languages
+== Speech to text software ==
-* Input file: Audio files
+[https://notebooklm.google.com/ NotebookLM] {{access | date = 2026-06-06}}
-* Speaker identification: Need to integrate with (1) [https://github.com/m-bain/whisperX m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)] or (2) [https://github.com/pyannote/pyannote-audio pyannote/pyannote-audio: Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding]
+* Input file format: Audio file (max 200 MB)<ref>[https://support.google.com/notebooklm/answer/16269187?hl=en#zippy=%2Cfile-size-limit-for-sources-in-notebooklm Frequently asked questions - NotebookLM Help]: "The current limit is 500,000 words per source or up to 200MB for local uploads. There's no page limit."</ref>
-* Real-Time Subtitles or Translation: Not Available
+* Supported languages: 80+ languages<ref>[https://support.google.com/notebooklm/answer/16261963?hl=en&co=GENIE.Platform%3DDesktop#zippy= Change output language in NotebookLM - Computer - NotebookLM Help]</ref>
-* Related:
+* Speaker identification: Prompt-based
-** [https://huggingface.co/spaces/Xenova/whisper-webgpu Whisper WebGPU - a Hugging Face Space by Xenova]
+* Price: Free and paid tiers available
-** [https://github.com/aaaddress1/Whisper.py?fbclid=IwAR1rwZH-USj2NIt8pLYRGhIqWvQWUj1FQTx83qpBncno3ANWDUBI_duWr9M aaaddress1/Whisper.py: 白癡喔還要下 pip install 誰會用啦—隨開即用 Windows 版 OpenAI Whisper 逐字稿產生器] on {{Win}} Introduction: [https://www.playpcesor.com/2023/04/whisperdesktop-ai.html WhisperDesktop 語音轉文字免費單機軟體，AI 影片字幕實測比較]
+* Output file format: TXT (Prompt-based)
-** 🎙️ MacWhisper https://goodsnooze.gumroad.com/l/macwhisper on {{Mac}}
+* Notes: (1) Timestamp formatting is not supported {{exclaim}} (2) Mixed-language audio (e.g. Mandarin Chinese/English): Automatically translated into the target language as specified in the user prompt
-[https://cloud.google.com/speech/?hl=zh-tw Speech API - 語音辨識  |  Google Cloud] "Speech-to-text powered by machine learning"; the free tier includes 60 minutes of speech recognition, see [https://cloud.google.com/speech-to-text/pricing 定價  |  Cloud Speech API Documentation  |  Google Cloud] for details. {{access | date = 2018-09-04}}
+[https://asr.yating.tw/ 雅婷逐字稿]
-* Input: microphone & audio file (for audio files longer than 1 minute, upload the file to Google Cloud Storage)
+* Input file format: Audio file or video file
-* Language: 120 languages <ref>[https://cloud.google.com/speech-to-text/docs/languages?hl=zh-tw Language Support  |  Cloud Speech-to-Text API  |  Google Cloud]</ref>
+* Supported languages: (1) Mandarin Chinese & English; (2) Mandarin Chinese, English & Taiwanese; (3) English
-* Sample code:
+* Speaker identification: Yes {{Gd}}
-* Related: [[Troubleshooting of Google cloud speech to text]]
+* Price: Free and paid tiers available
+* Output file format: PDF, TXT, ODT, DOCX, SRT, CSV
+* Notes:
-[https://azure.microsoft.com/zh-tw/services/cognitive-services/speech/ Bing 語音 API - 語音辨識軟體 | Microsoft Azure]
+[https://gemini.google.com/app Gemini]
-* Input: Audio file. Format: wav & ogg<ref>[https://docs.microsoft.com/zh-tw/azure/cognitive-services/speech-service/rest-speech-to-text 語音轉換文字 API 參考（REST）-語音服務 - Azure Cognitive Services | Microsoft Docs]</ref>
+* Input file format: Audio file or video file<ref>[https://support.google.com/gemini/answer/14903178?hl=en&co=GENIE.Platform%3DDesktop&sjid=5876216419430700379-NC Upload & analyze files in Gemini Apps - Computer - Gemini Apps Help]</ref> The Gemini app doesn't support direct audio file uploads larger than 20 MB; instead, either use the File API or upload the file to Google Drive first and then link it from within the Gemini app.<ref>[https://ai.google.dev/gemini-api/docs/audio Audio understanding - generateContent API | Google AI for Developers]</ref>
-* Language: Traditional Chinese, Simplified Chinese & English and more on the list<ref>[https://docs.microsoft.com/zh-tw/azure/cognitive-services/speech-service/language-support#speech-to-text 語言支援-語音服務 - Azure Cognitive Services | Microsoft Docs]</ref>
+* Supported languages:
-* Sample code: [https://github.com/Azure-Samples/SpeechToText-REST Azure-Samples/SpeechToText-REST: REST Samples of Speech To Text API]
+* Speaker identification: Prompt-based
-* Related:
+* Price: Free and paid tiers available
+* Output file format: TXT (Prompt-based)
+* Notes:
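The upload limits quoted on this page (200 MB per local upload for NotebookLM, 20 MB for direct audio uploads in the Gemini app) can be checked before uploading. The following is a minimal sketch only: the dictionary keys and function names are hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, since the services may count sizes differently.

```python
import os

# Limits as quoted on this page; 1 MB is taken as 10**6 bytes (an assumption).
UPLOAD_LIMITS_BYTES = {
    "notebooklm": 200 * 10**6,
    "gemini_app": 20 * 10**6,
}


def fits_upload_limit(path, service):
    """Return True if the file at `path` is within the quoted limit for `service`."""
    return os.path.getsize(path) <= UPLOAD_LIMITS_BYTES[service]


def services_accepting(path):
    """Return the (hypothetical) service keys whose quoted limit the file fits."""
    return sorted(name for name in UPLOAD_LIMITS_BYTES if fits_upload_limit(path, name))
```

A file that is too large for the Gemini app may still fit NotebookLM; otherwise it would have to be split or compressed first.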
-[https://app.clipchamp.com/ Clipchamp] {{access | date = 2025-04-02}}
+[https://app.clipchamp.com/ Clipchamp] {{access | date = 2026-06-06}}
-* Input: audio or video file
+* Input file format: audio or video file
 * Support Language: 80+ languages<ref>[https://support.microsoft.com/en-us/topic/how-to-use-autocaptions-in-clipchamp-ccb0520b-38f6-4fa9-aca8-872c2964946a How to use autocaptions in Clipchamp - Microsoft Support]</ref><ref>[https://learn.microsoft.com/zh-tw/azure/ai-services/speech-service/language-support?tabs=stt 語言支援 - 語音服務 - Azure AI services | Microsoft Learn]</ref>
+* Speaker identification:
+* Output file format: SRT
 * Comments: The free version seems to have no limitation on video duration, and you can also use AI to convert videos or audio into transcripts for free. However, during testing, the subtitles displayed for each time code were not complete sentences.
 [https://ink.dwave.cc/en-US/pricing Meeting Ink - AI notetaker to transcribe and summarize your meetings and recordings.]
+* Input file: Audio files
 * Support Language:
-* Input file: Audio files
+* Speaker identification: Yes {{Gd}}
-* Speaker identification: Available {{Gd}}
 * Real-Time Subtitles or Translation: Pro plan only ''$''
 * Free limit: 30 minutes max
-[https://tw.olami.ai/open/website/apiandsolution/api_solution OLAMI 中文語音辨識 API｜歐拉蜜人工智慧開放平台（威盛電子）] {{access | date = 2018-09-05}}
+[https://huggingface.co/spaces/Xenova/whisper-web Whisper Web - a Hugging Face Space by Xenova]
-* Input: Audio file. Format: wav & speex <ref>[https://tw.olami.ai/wiki/?mp=api_asr&content=api_asr1.html 文件中心 - OLAMI - 歐拉蜜人工智慧開放平台]</ref>
+* Input file: Audio files
-* Language: Traditional Chinese & Simplified Chinese <ref>[https://github.com/olami-developers/olami-api-quickstart-curl-samples/tree/master/cloud-speech-recognition olami-api-quickstart-curl-samples/cloud-speech-recognition at master · olami-developers/olami-api-quickstart-curl-samples]</ref>
+* Support Language: English
-* Sample code: [https://github.com/olami-developers/olami-api-quickstart-curl-samples/tree/master/cloud-speech-recognition olami-developers/olami-api-quickstart-curl-samples]
+* Speaker identification: No {{exclaim}}
-* Related: [[Troubleshooting of Olami speech to text]]
+* Output file format: TXT or JSON (the JSON output contains timestamp information)
 To generate text from a video, you can use YouTube's [https://support.google.com/youtube/answer/6373554?hl=en Use automatic captioning - YouTube Help]; it takes about half a day {{access | date = 2018-09-04}} Tutorial: [https://www.techbang.com/posts/2107 YouTube超佛心，自動幫你加入字幕！ | T客邦]
 * Input: Video
 * Language:
-* Sample code:
-* Related:
-[https://www.xfyun.cn/doccenter/asr 语音识别 - 讯飞开放平台] {{access | date=2018-09-06}}
-* Input: speex audio file less than 1 minute <ref>[https://doc.xfyun.cn/rest_api/%E8%AF%AD%E9%9F%B3%E5%90%AC%E5%86%99.html 语音听写 · 科大讯飞REST_API开发指南]</ref>
-* Language: Chinese (Mandarin), English, Chinese (Cantonese), Chinese (Sichuanese)
-* Sample code:
-* Related:
-[https://aws.amazon.com/tw/transcribe/ Amazon Transcribe – 自動語音辨識 – AWS] (API documentation: [https://docs.aws.amazon.com/transcribe/latest/dg/what-is-transcribe.html What Is Amazon Transcribe? - Amazon Transcribe]) {{access | date=2018-09-05}}
-* Input: Audio file (stored in an S3 bucket). "Valid formats for the audio are mp3, mp4, wav and flac."<ref>[https://docs.aws.amazon.com/transcribe/latest/dg/API_StartTranscriptionJob.html StartTranscriptionJob - Amazon Transcribe]: "For best results, use a lossless format, such as FLAC or WAV with PCM 16-bit encoding. Your audio input can be sampled at any rate between 8000 and 48000 Hz. We suggest that you use 8000 Hz for low-quality audio and 16000 Hz for high-quality audio."</ref>
-* Language: English, Spanish
 * Sample code:
 * Related:
@@ Line 67: / Line 62: @@
 * Related:
 * Free limit: 5 minutes
-[https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file SYSTRAN/faster-whisper: Faster Whisper transcription with CTranslate2]
-* Language: Fork from OpenAI Whisper
-* Sample code: [https://colab.research.google.com/drive/1TqmzTY5ZXcYBoBGbwSVBtwxlFajMIcRc?usp=sharing]
-* Related:
-* Free limit:
-* Instruction: [https://gsyan888.blogspot.com/2023/11/faster-whisper.html 雄::gsyan: 以 Faster Whisper 將影音辨識為文字檔案(字幕或逐字稿)]
 [https://www.mygoodtape.com/ Good Tape]
@@ Line 89: / Line 77: @@
 * Free limit:
-[https://github.com/Const-me/Whisper Const-me/Whisper: High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model] on {{Win}}
+[https://web.itranscribe.co/#/homepage iTranscribe: Transcribe Audio & Video to Text]
 * Language:
 * Sample code:
@@ Line 95: / Line 83: @@
 * Free limit:
-[https://web.itranscribe.co/#/homepage iTranscribe: Transcribe Audio & Video to Text]
+[https://www.capcut.cn/ 剪映官網-全能易用的桌面端剪輯軟體-輕而易剪 上演大幕] Chinese software {{exclaim}}
 * Language:
 * Sample code:
@@ Line 101: / Line 89: @@
 * Free limit:
-[https://www.capcut.cn/ 剪映官網-全能易用的桌面端剪輯軟體-輕而易剪 上演大幕] Chinese software {{exclaim}}
+''$'' [https://goodsnooze.gumroad.com/l/macwhisper MacWhisper] on {{Mac}}
+* Input file format: Audio file or video file
+* Supported languages:
+* Speaker identification: Yes {{Gd}}
+* Price: Free or Pro plan
+* Output file format: TXT, DOCX, SRT, VTT, JSON and more
+* Notes:
+== Speech to text API ==
+{{Gd}} [https://github.com/openai/whisper openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision]
+* Support Language: 99 languages
+* Input file: Audio files
+* Speaker identification: Need to integrate with (1) [https://github.com/m-bain/whisperX m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)] or (2) [https://github.com/pyannote/pyannote-audio pyannote/pyannote-audio: Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding]
+* Real-Time Subtitles or Translation: Not Available
+* Related:
+** [https://github.com/aaaddress1/Whisper.py?fbclid=IwAR1rwZH-USj2NIt8pLYRGhIqWvQWUj1FQTx83qpBncno3ANWDUBI_duWr9M aaaddress1/Whisper.py: 白癡喔還要下 pip install 誰會用啦—隨開即用 Windows 版 OpenAI Whisper 逐字稿產生器] on {{Win}} Introduction: [https://www.playpcesor.com/2023/04/whisperdesktop-ai.html WhisperDesktop 語音轉文字免費單機軟體，AI 影片字幕實測比較]
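Whisper returns timed text segments without speaker labels, which is why a separate diarization tool is listed under speaker identification. One common way to combine the two outputs is to give each transcript segment the speaker whose diarization turn overlaps it the most. The sketch below assumes that approach; it does not call Whisper, whisperX or pyannote, and the tuple layouts and function names are hypothetical stand-ins for their outputs.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker that overlaps it most.

    segments: list of (start, end, text) from a transcription step.
    turns: list of (start, end, speaker) from a diarization step.
    Segments with no overlapping turn get the speaker None.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


# Made-up example data, for illustration only.
segments = [(0.0, 2.5, "Hello."), (2.5, 6.0, "Hi, how are you?"), (9.0, 10.0, "Fine.")]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 7.0, "SPEAKER_01")]
labeled = assign_speakers(segments, turns)
```

Here the last segment falls outside every speaker turn, so it stays unlabeled rather than being attributed by guesswork.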
+[https://cloud.google.com/speech/?hl=zh-tw Speech API - 語音辨識  |  Google Cloud] "Speech-to-text powered by machine learning"; the free tier includes 60 minutes of speech recognition, see [https://cloud.google.com/speech-to-text/pricing 定價  |  Cloud Speech API Documentation  |  Google Cloud] for details. {{access | date = 2018-09-04}}
+* Input: microphone & audio file (for audio files longer than 1 minute, upload the file to Google Cloud Storage)
+* Language: 120 languages <ref>[https://cloud.google.com/speech-to-text/docs/languages?hl=zh-tw Language Support  |  Cloud Speech-to-Text API  |  Google Cloud]</ref>
+* Sample code:
+* Related: [[Troubleshooting of Google cloud speech to text]]
+[https://azure.microsoft.com/zh-tw/services/cognitive-services/speech/ Bing 語音 API - 語音辨識軟體 | Microsoft Azure]
+* Input: Audio file. Format: wav & ogg<ref>[https://docs.microsoft.com/zh-tw/azure/cognitive-services/speech-service/rest-speech-to-text 語音轉換文字 API 參考（REST）-語音服務 - Azure Cognitive Services | Microsoft Docs]</ref>
+* Language: Traditional Chinese, Simplified Chinese & English and more on the list<ref>[https://docs.microsoft.com/zh-tw/azure/cognitive-services/speech-service/language-support#speech-to-text 語言支援-語音服務 - Azure Cognitive Services | Microsoft Docs]</ref>
+* Sample code: [https://github.com/Azure-Samples/SpeechToText-REST Azure-Samples/SpeechToText-REST: REST Samples of Speech To Text API]
+* Related:
+[https://tw.olami.ai/open/website/apiandsolution/api_solution OLAMI 中文語音辨識 API｜歐拉蜜人工智慧開放平台（威盛電子）] {{access | date = 2018-09-05}}
+* Input: Audio file. Format: wav & speex <ref>[https://tw.olami.ai/wiki/?mp=api_asr&content=api_asr1.html 文件中心 - OLAMI - 歐拉蜜人工智慧開放平台]</ref>
+* Language: Traditional Chinese & Simplified Chinese <ref>[https://github.com/olami-developers/olami-api-quickstart-curl-samples/tree/master/cloud-speech-recognition olami-api-quickstart-curl-samples/cloud-speech-recognition at master · olami-developers/olami-api-quickstart-curl-samples]</ref>
+* Sample code: [https://github.com/olami-developers/olami-api-quickstart-curl-samples/tree/master/cloud-speech-recognition olami-developers/olami-api-quickstart-curl-samples]
+* Related: [[Troubleshooting of Olami speech to text]]
+[https://www.xfyun.cn/doccenter/asr 语音识别 - 讯飞开放平台] {{access | date=2018-09-06}}
+* Input: speex audio file less than 1 minute <ref>[https://doc.xfyun.cn/rest_api/%E8%AF%AD%E9%9F%B3%E5%90%AC%E5%86%99.html 语音听写 · 科大讯飞REST_API开发指南]</ref>
+* Language: Chinese (Mandarin), English, Chinese (Cantonese), Chinese (Sichuanese)
+* Sample code:
+* Related:
+[https://aws.amazon.com/tw/transcribe/ Amazon Transcribe – 自動語音辨識 – AWS] (API documentation: [https://docs.aws.amazon.com/transcribe/latest/dg/what-is-transcribe.html What Is Amazon Transcribe? - Amazon Transcribe]) {{access | date=2018-09-05}}
+* Input: Audio file (stored in an S3 bucket). "Valid formats for the audio are mp3, mp4, wav and flac."<ref>[https://docs.aws.amazon.com/transcribe/latest/dg/API_StartTranscriptionJob.html StartTranscriptionJob - Amazon Transcribe]: "For best results, use a lossless format, such as FLAC or WAV with PCM 16-bit encoding. Your audio input can be sampled at any rate between 8000 and 48000 Hz. We suggest that you use 8000 Hz for low-quality audio and 16000 Hz for high-quality audio."</ref>
+* Language: English, Spanish
+* Sample code:
+* Related:
+[https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file SYSTRAN/faster-whisper: Faster Whisper transcription with CTranslate2]
+* Language: Fork from OpenAI Whisper
+* Sample code: [https://colab.research.google.com/drive/1TqmzTY5ZXcYBoBGbwSVBtwxlFajMIcRc?usp=sharing]
+* Related:
+* Free limit:
+* Instruction: [https://gsyan888.blogspot.com/2023/11/faster-whisper.html 雄::gsyan: 以 Faster Whisper 將影音辨識為文字檔案(字幕或逐字稿)]
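As a rough illustration of producing a subtitle file with faster-whisper, the sketch below transcribes an audio file and formats the resulting segments as SRT. The helper names, the `small` model size and the CPU/int8 settings are assumptions chosen for the example, not recommendations from the project; `faster_whisper` is imported inside the function so the formatting helpers can be used without it installed.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp form used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from an iterable of (start_seconds, end_seconds, text)."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path, model_size="small"):
    """Transcribe `audio_path` with faster-whisper and return SRT text."""
    # Lazy import: requires `pip install faster-whisper`.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a generator of segments plus an info object.
    segments, _info = model.transcribe(audio_path)
    return to_srt((s.start, s.end, s.text) for s in segments)
```

Calling `transcribe_to_srt("meeting.mp3")` (a hypothetical file name) would download the model on first use and return the subtitle text, which can then be written to a `.srt` file.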
+[https://github.com/Const-me/Whisper Const-me/Whisper: High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model] on {{Win}}
 * Language:
 * Sample code:
 * Related:
 * Free limit:
+== Related pages ==
+* [[Video to text]]
+* [[Taiwanese Mandarin Text-to-Speech Services]]
+* [[Troubleshooting of Google cloud speech to text]]
+* [[Troubleshooting of Olami speech to text]]
+* [[Troubleshooting of whisperX]]
+[[Category: Tool]]
