Comparison of common data file formats
How to Open Data Files: Common Formats, Limitations, and Notes


{{LanguageSwitcher | content = [[Comparison of common data file formats | EN]], [[Comparison of common data file formats in Mandarin | 漢字]] }}


== Common Data Formats and Limitations ==


=== CSV Files ===
CSV files: field values are separated by commas.
The CSV format itself has no row limit — the limit comes from the software used to open it.<ref>Row Zero — CSV Row Limits by Spreadsheet: https://rowzero.com/blog/csv-row-limit</ref>


Common issues arise because most people open CSV files with Microsoft Excel: (1) '''Garbled Chinese characters''': double-clicking may result in garbled text; a BOM character needs to be prepended to the file, or the file must be imported rather than opened directly. (2) '''Newline characters in field values''': if a field value contains a newline, Excel tends to misinterpret it, causing data from the same record to appear on the next row and resulting in column misalignment. (3) '''Forced type conversion''': a common issue is that the leading zero of a phone number is dropped because the value is coerced into an integer.
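The BOM fix can be sketched in Python (the filename is illustrative): writing with the `utf-8-sig` encoding prepends a UTF-8 BOM, so Excel detects UTF-8 when the file is double-clicked instead of falling back to a legacy codepage.

```python
import csv

# encoding="utf-8-sig" prepends the UTF-8 BOM (EF BB BF), which hints
# to Excel that the file is UTF-8 rather than a legacy codepage.
with open("contacts.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "phone"])
    writer.writerow(["王小明", "0912345678"])

# The first three bytes of the file are the BOM.
with open("contacts.csv", "rb") as f:
    assert f.read(3) == b"\xef\xbb\xbf"
```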


If you receive a CSV file containing newline characters in field values, it is not recommended to use Excel; use LibreOffice Calc instead.
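Another option is a short script: Python's standard <code>csv</code> module follows the quoting rules for embedded newlines, so a record round-trips as one row rather than splitting. A minimal sketch:

```python
import csv
import io

# A field value that contains a newline; the writer quotes it.
rows = [["id", "note"], ["1", "line one\nline two"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Reading it back keeps the newline inside a single field, rather than
# splitting the record across two rows.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == rows
```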


=== TSV Files ===
TSV files: field values are separated by tab characters. If field values contain commas, TSV can be used as an alternative.
TSV and CSV are both plain-text formats; the row limit likewise depends on the software used to open them, not the format itself.
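A minimal Python sketch of writing TSV with the standard <code>csv</code> module (values are illustrative); note that an embedded comma needs no quoting, because tab is the delimiter:

```python
import csv
import io

# Tab-delimited output: pass delimiter="\t" to the standard csv module.
rows = [["product", "price"], ["widget, large", "9.99"]]

buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)

# The comma inside "widget, large" needs no quoting in TSV.
first_data_line = buf.getvalue().splitlines()[1]
assert first_data_line == "widget, large\t9.99"
```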


=== Excel Files ===
Excel files: a widely used file format, but with limits on cell text length and column count: a single cell cannot contain more than 32,767 characters, and the maximum number of columns is 16,384.<ref>Microsoft Support Excel specifications and limits: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3</ref> Although the theoretical row limit is 1,048,576 rows (= 2²⁰), performance becomes noticeably sluggish beyond 100,000 rows, making analysis inconvenient. Older versions of Excel (2003 and earlier) allowed even fewer rows — only 65,536 rows (= 2¹⁶) × 256 columns.<ref>[Did You Know] Microsoft Excel has a Limit of Maximum 1,048,576 Rows and 16,384 Columns – AskVG: https://www.askvg.com/did-you-know-microsoft-excel-has-a-limit-of-maximum-1048576-rows-and-16384-columns/</ref><ref>What is the maximum allowed rows in a Microsoft Excel .xls or .xlsx - Super User: https://superuser.com/questions/366468/what-is-the-maximum-allowed-rows-in-a-microsoft-excel-xls-or-xlsx</ref>


Common issues include Excel automatically converting long numbers (e.g., order IDs) into scientific notation,<ref>Fixing the issue of long numbers being rounded or displayed in scientific notation in Excel (written in Mandarin): https://errerrors.blogspot.com/2023/04/how-to-fix-rounding-after-15-digits-on-excel.html</ref> or misidentifying certain strings (such as <code>1-2</code> or <code>MAR3</code>) as dates,<ref>Excel bug forces scientists to rename human genes — and the bug has now been fixed | T客邦 (written in Mandarin): https://www.techbang.com/posts/110723-excel-human-genome</ref> resulting in data loss or corruption.
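The leading-zero problem can be reproduced in a few lines of Python; keeping the column as text is what avoids it:

```python
import csv
import io

# What automatic type detection effectively does to a phone number:
raw = "0912345678"
coerced = int(raw)               # the leading zero is gone
assert str(coerced) == "912345678"

# Reading the column as text preserves it; with the csv module every
# field is a str unless you convert it yourself.
row = next(csv.reader(io.StringIO("0912345678,Taipei")))
assert row[0] == "0912345678"
```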


=== JSON Files ===
JSON files: well suited to complex or flexibly structured field definitions. In practice, a JSON file of roughly 250 MB containing many records has triggered read errors and had to be split into smaller pieces before processing.
The JSON format itself has no row limit, but the bottleneck is that the entire file must be loaded into memory when read — the larger the file, the more likely an out-of-memory error will occur.<ref>Quora — What is the practical limit on the size of a JSON message?: https://www.quora.com/What-is-the-practical-limit-on-the-size-of-a-JSON-message-for-an-internet-application</ref>


In practice, it is recommended not to store multiple records in a single JSON file, but to use JSONL instead: JSON with one record per line, using newline characters to delimit records. This format can be read by a program as a stream, line by line, preventing out-of-memory errors.
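A minimal sketch of the streaming pattern, using Python's standard <code>json</code> module on an in-memory JSONL string:

```python
import io
import json

# Three records, one JSON object per line (JSONL).
jsonl = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'

# Streaming: parse one line at a time instead of loading the whole
# file into memory with json.load().
ids = []
for line in io.StringIO(jsonl):
    record = json.loads(line)
    ids.append(record["id"])

assert ids == [1, 2, 3]
```

With a real file, `io.StringIO(jsonl)` would simply be `open("data.jsonl")`; memory use stays bounded by the longest line, not the file size.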


=== SQLite or Parquet Files ===
SQLite or Parquet files: when the data volume exceeds Excel's row limit, these two formats are worth considering.


The theoretical row limit per table in SQLite is 2⁶⁴ rows (approximately 1.8 × 10¹⁹), but this number is practically unreachable because the database file size limit of 281 TB will be hit first. At maximum database capacity, the number of storable rows is approximately 2 × 10¹³ (assuming no indexes and minimal row size).<ref>SQLite official documentation — Implementation Limits For SQLite: https://sqlite.org/limits.html</ref>
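A minimal Python sketch using the built-in <code>sqlite3</code> module (an in-memory database for illustration; pass a filename to persist to disk):

```python
import sqlite3

# An in-memory database; use a filename instead to persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, phone TEXT)")

# Storing the phone number as TEXT keeps the leading zero.
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Alice", "0912345678"), ("Bob", "0987654321")],
)

count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
assert count == 2
conn.close()
```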


Parquet has no hard row limit. The format stores data in units called Row Groups — the horizontal partitioning unit within a Parquet file. Each Row Group contains a contiguous segment of rows stored column by column. For example, a dataset with 1,000,000 rows and 10 columns would be split into 10 separate column chunks within a single Row Group, rather than stored row by row. This design allows queries to read only the required columns without scanning entire rows, greatly improving read performance. Each Row Group defaults to a limit of 1,000,000 rows, but a single file can contain any number of Row Groups, so there is no theoretical row limit for the file as a whole.<ref>Apache arrow-rs GitHub issue #5797 — "Row groups are limited to 1M rows by default": https://github.com/apache/arrow-rs/issues/5797</ref> In practice, people have successfully written 500 million to 1 billion rows,<ref>Andy Cutler — 10 Billion Rows: Parquet File Size and Distribution When using CETAS: https://www.serverlesssql.com/row-size-and-parquet-file-distribution/</ref> with the bottleneck typically being disk space rather than the format itself.


=== SQL Files ===
SQL files: the export format for relational databases, consisting of multiple lines of SQL statements (e.g., <code>INSERT INTO ...</code>), which can be imported directly into a database.


SQL files themselves have no row limit; the format is plain text, and limitations again come from the tools used to open or import them.


Common issues include:
* '''(1) Large file size''': bulk data exports can easily reach several gigabytes; opening such files in a text editor may cause it to freeze or become unresponsive.
* '''(2) Cross-database compatibility''': different database engines (such as MySQL, PostgreSQL, and SQLite) differ in SQL syntax; a <code>.sql</code> file exported from one engine may not be directly importable into another. See also: [[排除_MySQL_技術疑難問題#從_MySQL_資料轉移至_MSSQL | Migrating Data from MySQL to MSSQL]] (written in Mandarin).
If the goal is simply to migrate data rather than restore a complete database structure, it is recommended to use CSV or Parquet instead, in order to avoid the compatibility issues described above.
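For a small dump targeting SQLite, the statements in a <code>.sql</code> file can be run directly with <code>executescript</code>; a minimal Python sketch (the table and data are illustrative):

```python
import sqlite3

# A small .sql dump: plain text containing SQL statements.
sql_dump = """
CREATE TABLE orders (id INTEGER, item TEXT);
INSERT INTO orders VALUES (1, 'book');
INSERT INTO orders VALUES (2, 'pen');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(sql_dump)  # runs every statement in the dump

rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
assert rows == 2
conn.close()
```

With a file on disk, the dump would be read first, e.g. `conn.executescript(open("dump.sql").read())`; for multi-gigabyte dumps, the database engine's own import tool is the better choice.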


== References ==


[[Category: Software]]
[[Category: Spreadsheet]]
[[Category: MySQL]]
[[Category: Data Science]]
[[Category: Revised with LLMs]]
