Comparison of common data file formats: Difference between revisions

From LemonWiki共筆
Jump to navigation Jump to search
(Created page with "拿到資料檔案該怎麼開?各格式限制與注意事項 == 常見資料格式與限制 == === 1. CSV 檔案 === CSV 檔案:不同欄位值用逗號間隔。 CSV 格式本身沒有筆數上限,限制來自開啟它的軟體。<ref>Row Zero — CSV Row Limits by Spreadsheet: https://rowzero.com/blog/csv-row-limit</ref> 常見問題是 大部分人會使用 Microsoft Excel 開啟 CSV, (1) 中文亂碼:點選兩下發現中文亂碼,檔首需要額外加...")
 
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
拿到資料檔案該怎麼開?各格式限制與注意事項
How to Open Data Files: Common Formats, Limitations, and Notes


== 常見資料格式與限制 ==
{{LanguageSwitcher | content = [[Comparison of common data file formats | EN]], [[Comparison of common data file formats in Mandarin | 漢字]] }}
=== 1. CSV 檔案 ===
CSV 檔案:不同欄位值用逗號間隔。
CSV 格式本身沒有筆數上限,限制來自開啟它的軟體。<ref>Row Zero — CSV Row Limits by Spreadsheet: https://rowzero.com/blog/csv-row-limit</ref>


常見問題是
大部分人會使用 Microsoft Excel 開啟 CSV, (1) 中文亂碼:點選兩下發現中文亂碼,檔首需要額外加上 BOM 符號,或者改用匯入檔案方式,才不會遇到亂碼、(2) 如果欄位值包含換行符號的話,Excel 容易會出錯,導致明明是同一筆資料,卻出現在下一筆而出現欄位錯置的錯誤。(3) 資料類型強迫轉換:常見的問題是手機號碼的第一位號碼 0 被吃掉,因為被強制轉換成整數。


如果你拿到是包含換行符號欄位值的 CSV 檔案,則不建議使用 Excel,而是改用 LibreOffice Calc。
== Quick Comparison Table of Common Formats ==
<table class="wikitable">
    <tr>
        <th>Format</th>
        <th>Hierarchical</th>
        <th>Human-Readable</th>
        <th>Row Count / Size Limit</th>
        <th>Common Issues</th>
    </tr>
    <tr>
        <td>'''CSV'''</td>
        <td>No (tabular)</td>
        <td>High</td>
        <td>No inherent limit in the format itself; depends on the software used to open it (Excel: ~1.04 million rows)</td>
        <td>Garbled text in non-English characters; newline characters within field values can cause row misalignment; data type coercion (e.g., leading zero in phone numbers gets dropped)</td>
    </tr>
    <tr>
        <td>'''TSV'''</td>
        <td>No (tabular)</td>
        <td>High</td>
        <td>Same as CSV; no inherent limit in the format itself</td>
        <td>Same issues as CSV, but avoids delimiter misinterpretation when field values contain commas</td>
    </tr>
    <tr>
        <td>'''Excel'''</td>
        <td>No (tabular)</td>
        <td>High</td>
        <td>Theoretical limit of 1,048,576 rows × 16,384 columns; 32,767 character limit per cell; performance lags once row count exceeds ~100,000</td>
        <td>Long numbers get converted to scientific notation; certain strings get misinterpreted as dates</td>
    </tr>
    <tr>
        <td>'''JSON'''</td>
        <td>Yes</td>
        <td>Medium</td>
        <td>No inherent row limit in the format itself; the bottleneck is that the entire file must be loaded into memory when read</td>
        <td>Large files (250MB+) are prone to out-of-memory errors; JSONL is recommended for streaming/line-by-line reading instead</td>
    </tr>
    <tr>
        <td>'''SQLite'''</td>
        <td>No (relational)</td>
        <td>Low (requires a tool to open)</td>
        <td>Theoretical limit of 2⁶⁴ rows, but in practice constrained by the 281 TB database file size limit (roughly 2×10¹³ rows)</td>
        <td>Requires a database tool for reading/writing; not convenient for direct viewing</td>
    </tr>
    <tr>
        <td>'''Parquet'''</td>
        <td>No (columnar)</td>
        <td>Low (requires a tool to open)</td>
        <td>No theoretical row limit in the format itself (organized by Row Groups, default limit of 1 million rows per group, but a file can contain any number of Row Groups)</td>
        <td>The practical bottleneck is disk space rather than the format itself; there are documented cases of successfully writing 500 million to 1 billion rows</td>
    </tr>
    <tr>
        <td>'''SQL file'''</td>
        <td>No</td>
        <td>Medium</td>
        <td>No inherent limit as plain text; the limit comes from the tool used to import it</td>
        <td>Text editors can crash or become unresponsive when the file is too large; SQL syntax compatibility issues across different database engines</td>
    </tr>
</table>


=== 2. TSV 檔案 ===
'''Recommendations:'''
TSV 檔案:不同欄位值用定位鍵 (TAB) 間隔。如果欄位值包含逗號,可以改用 TSV 檔案。
# Need nested/free-form fields → JSON (use JSONL for large datasets)
TSV 與 CSV 同屬純文字格式,筆數上限同樣取決於開啟的軟體,格式本身無限制。
# Data exceeds a million rows and you only need to move the data → Parquet or CSV
# Need to restore the full database structure (including relations and indexes) → SQL or SQLite
# General use with spreadsheet software → CSV / Excel


=== 3. Excel 檔案===
== Common Data Formats and Limitations ==
Excel 檔案:常見的檔案類型,但是儲存格文字長度與欄位數量都有上限:單一儲存格內的文字不能超過 32,767 字,而欄位數上限是 16,384 欄 <ref>Microsoft Support — Excel specifications and limits: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3</ref>。其次雖然資料筆數理論上限是 1,048,576 筆(= 2²⁰),但如果筆數超過 10 萬筆,電腦操作 Excel 就會頓頓的,反而不方便分析。如果是舊版 Excel(2003 以前)允許的筆數更少,僅 65,536 筆(= 2¹⁶)× 256 欄。<ref>[Did You Know] Microsoft Excel has a Limit of Maximum 1,048,576 Rows and 16,384 Columns – AskVG https://www.askvg.com/did-you-know-microsoft-excel-has-a-limit-of-maximum-1048576-rows-and-16384-columns/</ref>


常見問題是
=== CSV Files ===
Excel 雖然方便但經常會把長數字(例如:訂單編號)自動轉成科學記號顯示,或把特定字串(如 1-2、MAR3)誤判為日期,造成資料遺失或資料錯誤。
CSV files: different field values are separated by commas.
The CSV format itself has no row limit — the limit comes from the software used to open it.<ref>Row Zero — CSV Row Limits by Spreadsheet: https://rowzero.com/blog/csv-row-limit</ref>


=== 4. JSON 檔案 ===
Common issues arise because most people open CSV files with Microsoft Excel: (1) '''Garbled Chinese characters''': double-clicking may result in garbled text; a BOM character needs to be prepended to the file, or the file must be imported rather than opened directly. (2) '''Newline characters in field values''': if a field value contains a newline, Excel tends to misinterpret it, causing data from the same record to appear on the next row and resulting in column misalignment. (3) '''Forced type conversion''': a common issue is that the leading zero of a phone number is dropped because the value is coerced into an integer.
JSON 檔案:適合複雜格式或者說自由風格的欄位定義。曾經拿到包含多筆資料,約 250 MB 檔案大小的 JSON 檔案。因為發生檔案讀取錯誤,需要額外切割檔案內容。
JSON 格式本身沒有筆數上限,瓶頸是讀取時必須將整個檔案載入記憶體,因此檔案愈大愈容易發生記憶體不足的錯誤。<ref>Quora — What is the practical limit on the size of a JSON message?: https://www.quora.com/What-is-the-practical-limit-on-the-size-of-a-JSON-message-for-an-internet-application</ref>


實務上建議不要將多筆資料寫在 JSON,而改成 JSONL,也就是 JSON 加上換行符號區別多筆資料。適合使用程式用串流方式逐行讀取,而不致於發生記憶體錯誤。
If you receive a CSV file containing newline characters in field values, it is not recommended to use Excel; use LibreOffice Calc instead.


=== 5. SQLite 或 Parquet 檔案 ===
=== TSV Files ===
SQLite 或 Parquet 檔案:當資料量超過 Excel 筆數上限,可以考慮使用這兩種格式。
TSV files: different field values are separated by tab characters. If field values contain commas, TSV can be used as an alternative.
TSV and CSV are both plain-text formats; the row limit likewise depends on the software used to open them, not the format itself.


SQLite 每張資料表的理論筆數上限是 2⁶⁴ 筆(約 1.8 × 10¹⁹),但這個數字實際上不可能達到,因為資料庫檔案大小上限 281 TB 會先被遇到。在最大資料庫容量下,可儲存的筆數約為 2 × 10¹³ 筆(前提是沒有索引且每筆資料極小)。<ref>SQLite 官方文件 — Implementation Limits For SQLite: https://sqlite.org/limits.html</ref>
=== Excel Files ===
Excel files: a widely used file format, but with limits on cell text length and column count: a single cell cannot contain more than 32,767 characters, and the maximum number of columns is 16,384.<ref>Microsoft Support — Excel specifications and limits: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3</ref> Although the theoretical row limit is 1,048,576 rows (= 2²⁰), performance becomes noticeably sluggish beyond 100,000 rows, making analysis inconvenient. Older versions of Excel (2003 and earlier) allowed even fewer rows — only 65,536 rows (= 2¹⁶) × 256 columns.<ref>[Did You Know] Microsoft Excel has a Limit of Maximum 1,048,576 Rows and 16,384 Columns – AskVG: https://www.askvg.com/did-you-know-microsoft-excel-has-a-limit-of-maximum-1048576-rows-and-16384-columns/</ref><ref>What is the maximum allowed rows in a Microsoft Excel .xls or .xlsx - Super User: https://superuser.com/questions/366468/what-is-the-maximum-allowed-rows-in-a-microsoft-excel-xls-or-xlsx</ref>


Parquet 沒有硬性的筆數上限。格式本身以 Row Group 為單位儲存資料。Row Group 是 Parquet 檔案內部的水平分割單位,每個 Row Group 包含一段連續的資料列,以欄位為單位分開存放——例如一個有 100 萬筆、10 個欄位的資料集,在一個 Row Group 裡會被切成 10 段欄位資料分別儲存,而非逐列存放。這樣的設計讓查詢時只需讀取需要的欄位,不必掃描整列,大幅提升讀取效率。每個 Row Group 預設上限為 100 萬筆,但一個檔案可以包含任意數量的 Row Group,因此整個檔案的筆數沒有理論上限 <ref>Apache arrow-rs GitHub issue #5797 — "Row groups are limited to 1M rows by default": https://github.com/apache/arrow-rs/issues/5797</ref>。實務上已有人成功寫入 5 億~ 10 億筆資料<ref>Andy Cutler 10 Billion Rows: Parquet File Size and Distribution When using CETAS: https://www.serverlesssql.com/row-size-and-parquet-file-distribution/</ref>,瓶頸通常是硬碟空間,而非格式本身。
Common issues include Excel automatically converting long numbers (e.g., order IDs) into scientific notation,<ref>Fixing the issue of long numbers being rounded or displayed in scientific notation in Excel (written in Mandarin): https://errerrors.blogspot.com/2023/04/how-to-fix-rounding-after-15-digits-on-excel.html</ref> or misidentifying certain strings (such as <code>1-2</code> or <code>MAR3</code>) as dates,<ref>Excel bug forces scientists to rename human genes — and the bug has now been fixed | T客邦 (written in Mandarin): https://www.techbang.com/posts/110723-excel-human-genome</ref> resulting in data loss or corruption.


=== 6. SQL 檔案 ===
=== JSON Files ===
SQL 檔案:關聯式資料庫的匯出格式,內容為多列的 SQL 語法(如 {{kbd | key=<nowiki>INSERT INTO ...</nowiki>}}),可直接匯入至資料庫。
JSON files: suitable for complex or flexibly structured field definitions. A JSON file containing multiple records with a size of approximately 250 MB has been encountered in practice; a read error occurred and the file had to be split before processing.
The JSON format itself has no row limit, but the bottleneck is that the entire file must be loaded into memory when read — the larger the file, the more likely an out-of-memory error will occur.<ref>Quora — What is the practical limit on the size of a JSON message?: https://www.quora.com/What-is-the-practical-limit-on-the-size-of-a-JSON-message-for-an-internet-application</ref>


SQL 檔案本身沒有筆數上限,格式本身只是純文字,限制同樣來自開啟或匯入它的工具。
In practice, it is recommended to avoid storing multiple records in a single JSON file and to use JSONL instead — that is, JSON with newline characters delimiting individual records. This format is suitable for streaming line-by-line with a program, preventing out-of-memory errors.


常見問題是
=== SQLite or Parquet Files ===
* (1) 檔案過大:大量資料匯出後動輒數 GB,用文字編輯器開啟會導致當機或無回應。
SQLite or Parquet files: when the data volume exceeds Excel's row limit, these two formats are worth considering.
* (2) 跨資料庫相容性:不同資料庫引擎(如 MySQL、PostgreSQL、SQLite)的 SQL 語法有差異,從 A 引擎匯出的 .sql 檔案,未必能直接匯入另一個引擎。可參考 [[排除_MySQL_技術疑難問題#從_MySQL_資料轉移至_MSSQL | 從_MySQL_資料轉移至_MSSQL]]


如果只需要搬移資料而非還原完整資料庫結構,建議改用 CSV 或 Parquet,避免上述相容性問題。
The theoretical row limit per table in SQLite is 2⁶⁴ rows (approximately 1.8 × 10¹⁹), but this number is practically unreachable because the database file size limit of 281 TB will be hit first. At maximum database capacity, the number of storable rows is approximately 2 × 10¹³ (assuming no indexes and minimal row size).<ref>SQLite official documentation — Implementation Limits For SQLite: https://sqlite.org/limits.html</ref>


Parquet has no hard row limit. The format stores data in units called Row Groups — the horizontal partitioning unit within a Parquet file. Each Row Group contains a contiguous segment of rows stored column by column. For example, a dataset with 1,000,000 rows and 10 columns would be split into 10 separate column chunks within a single Row Group, rather than stored row by row. This design allows queries to read only the required columns without scanning entire rows, greatly improving read performance. Each Row Group defaults to a limit of 1,000,000 rows, but a single file can contain any number of Row Groups, so there is no theoretical row limit for the file as a whole.<ref>Apache arrow-rs GitHub issue #5797 — "Row groups are limited to 1M rows by default": https://github.com/apache/arrow-rs/issues/5797</ref> In practice, people have successfully written 500 million to 1 billion rows,<ref>Andy Cutler — 10 Billion Rows: Parquet File Size and Distribution When using CETAS: https://www.serverlesssql.com/row-size-and-parquet-file-distribution/</ref> with the bottleneck typically being disk space rather than the format itself.
=== SQL Files ===
SQL files: the export format for relational databases, consisting of multiple lines of SQL statements (e.g., <code>INSERT INTO ...</code>), which can be imported directly into a database.
SQL files themselves have no row limit; the format is plain text, and limitations again come from the tools used to open or import them.
Common issues include:
* '''(1) Large file size''': bulk data exports can easily reach several gigabytes; opening such files in a text editor may cause it to freeze or become unresponsive.
* '''(2) Cross-database compatibility''': different database engines (such as MySQL, PostgreSQL, and SQLite) have differences in SQL syntax; a <code>.sql</code> file exported from engine A may not be directly importable into another engine. See also: Troubleshooting MySQL Technical Issues — Migrating Data from MySQL to MSSQL (written in Mandarin).
If the goal is simply to migrate data rather than restore a complete database structure, it is recommended to use CSV or Parquet instead, in order to avoid the compatibility issues described above.


== References ==
== References ==
Line 52: Line 119:


[[Category: Software]]
[[Category: Software]]
[[Category: Spreadsheet]]  
[[Category: Spreadsheet]]
[[Category: MySQL]]
[[Category: MySQL]]
[[Category: Data Science]]
[[Category: Data Science]]
[[Category: Revised with LLMs]]
[[Category: Revised with LLMs]]

Latest revision as of 11:35, 2 July 2026

How to Open Data Files: Common Formats, Limitations, and Notes

🌐 Switch language: EN, 漢字



Quick Comparison Table of Common Formats[edit]

Format Hierarchical Human-Readable Row Count / Size Limit Common Issues
CSV No (tabular) High No inherent limit in the format itself; depends on the software used to open it (Excel: ~1.04 million rows) Garbled text in non-English characters; newline characters within field values can cause row misalignment; data type coercion (e.g., leading zero in phone numbers gets dropped)
TSV No (tabular) High Same as CSV; no inherent limit in the format itself Same issues as CSV, but avoids delimiter misinterpretation when field values contain commas
Excel No (tabular) High Theoretical limit of 1,048,576 rows × 16,384 columns; 32,767 character limit per cell; performance lags once row count exceeds ~100,000 Long numbers get converted to scientific notation; certain strings get misinterpreted as dates
JSON Yes Medium No inherent row limit in the format itself; the bottleneck is that the entire file must be loaded into memory when read Large files (250MB+) are prone to out-of-memory errors; JSONL is recommended for streaming/line-by-line reading instead
SQLite No (relational) Low (requires a tool to open) Theoretical limit of 2⁶⁴ rows, but in practice constrained by the 281 TB database file size limit (roughly 2×10¹³ rows) Requires a database tool for reading/writing; not convenient for direct viewing
Parquet No (columnar) Low (requires a tool to open) No theoretical row limit in the format itself (organized by Row Groups, default limit of 1 million rows per group, but a file can contain any number of Row Groups) The practical bottleneck is disk space rather than the format itself; there are documented cases of successfully writing 500 million to 1 billion rows
SQL file No Medium No inherent limit as plain text; the limit comes from the tool used to import it Text editors can crash or become unresponsive when the file is too large; SQL syntax compatibility issues across different database engines

Recommendations:

  1. Need nested/free-form fields → JSON (use JSONL for large datasets)
  2. Data exceeds a million rows and you only need to move the data → Parquet or CSV
  3. Need to restore the full database structure (including relations and indexes) → SQL or SQLite
  4. General use with spreadsheet software → CSV / Excel

Common Data Formats and Limitations[edit]

CSV Files[edit]

CSV files: different field values are separated by commas. The CSV format itself has no row limit — the limit comes from the software used to open it.[1]

Common issues arise because most people open CSV files with Microsoft Excel: (1) Garbled Chinese characters: double-clicking may result in garbled text; a BOM character needs to be prepended to the file, or the file must be imported rather than opened directly. (2) Newline characters in field values: if a field value contains a newline, Excel tends to misinterpret it, causing data from the same record to appear on the next row and resulting in column misalignment. (3) Forced type conversion: a common issue is that the leading zero of a phone number is dropped because the value is coerced into an integer.

If you receive a CSV file containing newline characters in field values, it is not recommended to use Excel; use LibreOffice Calc instead.

TSV Files[edit]

TSV files: different field values are separated by tab characters. If field values contain commas, TSV can be used as an alternative. TSV and CSV are both plain-text formats; the row limit likewise depends on the software used to open them, not the format itself.

Excel Files[edit]

Excel files: a widely used file format, but with limits on cell text length and column count: a single cell cannot contain more than 32,767 characters, and the maximum number of columns is 16,384.[2] Although the theoretical row limit is 1,048,576 rows (= 2²⁰), performance becomes noticeably sluggish beyond 100,000 rows, making analysis inconvenient. Older versions of Excel (2003 and earlier) allowed even fewer rows — only 65,536 rows (= 2¹⁶) × 256 columns.[3][4]

Common issues include Excel automatically converting long numbers (e.g., order IDs) into scientific notation,[5] or misidentifying certain strings (such as 1-2 or MAR3) as dates,[6] resulting in data loss or corruption.

JSON Files[edit]

JSON files: suitable for complex or flexibly structured field definitions. A JSON file containing multiple records with a size of approximately 250 MB has been encountered in practice; a read error occurred and the file had to be split before processing. The JSON format itself has no row limit, but the bottleneck is that the entire file must be loaded into memory when read — the larger the file, the more likely an out-of-memory error will occur.[7]

In practice, it is recommended to avoid storing multiple records in a single JSON file and to use JSONL instead — that is, JSON with newline characters delimiting individual records. This format is suitable for streaming line-by-line with a program, preventing out-of-memory errors.

SQLite or Parquet Files[edit]

SQLite or Parquet files: when the data volume exceeds Excel's row limit, these two formats are worth considering.

The theoretical row limit per table in SQLite is 2⁶⁴ rows (approximately 1.8 × 10¹⁹), but this number is practically unreachable because the database file size limit of 281 TB will be hit first. At maximum database capacity, the number of storable rows is approximately 2 × 10¹³ (assuming no indexes and minimal row size).[8]

Parquet has no hard row limit. The format stores data in units called Row Groups — the horizontal partitioning unit within a Parquet file. Each Row Group contains a contiguous segment of rows stored column by column. For example, a dataset with 1,000,000 rows and 10 columns would be split into 10 separate column chunks within a single Row Group, rather than stored row by row. This design allows queries to read only the required columns without scanning entire rows, greatly improving read performance. Each Row Group defaults to a limit of 1,000,000 rows, but a single file can contain any number of Row Groups, so there is no theoretical row limit for the file as a whole.[9] In practice, people have successfully written 500 million to 1 billion rows,[10] with the bottleneck typically being disk space rather than the format itself.

SQL Files[edit]

SQL files: the export format for relational databases, consisting of multiple lines of SQL statements (e.g., INSERT INTO ...), which can be imported directly into a database.

SQL files themselves have no row limit; the format is plain text, and limitations again come from the tools used to open or import them.

Common issues include:

  • (1) Large file size: bulk data exports can easily reach several gigabytes; opening such files in a text editor may cause it to freeze or become unresponsive.
  • (2) Cross-database compatibility: different database engines (such as MySQL, PostgreSQL, and SQLite) have differences in SQL syntax; a .sql file exported from engine A may not be directly importable into another engine. See also: Troubleshooting MySQL Technical Issues — Migrating Data from MySQL to MSSQL (written in Mandarin).

If the goal is simply to migrate data rather than restore a complete database structure, it is recommended to use CSV or Parquet instead, in order to avoid the compatibility issues described above.

References[edit]

  1. Row Zero — CSV Row Limits by Spreadsheet: https://rowzero.com/blog/csv-row-limit
  2. Microsoft Support — Excel specifications and limits: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
  3. [Did You Know] Microsoft Excel has a Limit of Maximum 1,048,576 Rows and 16,384 Columns – AskVG: https://www.askvg.com/did-you-know-microsoft-excel-has-a-limit-of-maximum-1048576-rows-and-16384-columns/
  4. What is the maximum allowed rows in a Microsoft Excel .xls or .xlsx - Super User: https://superuser.com/questions/366468/what-is-the-maximum-allowed-rows-in-a-microsoft-excel-xls-or-xlsx
  5. Fixing the issue of long numbers being rounded or displayed in scientific notation in Excel (written in Mandarin): https://errerrors.blogspot.com/2023/04/how-to-fix-rounding-after-15-digits-on-excel.html
  6. Excel bug forces scientists to rename human genes — and the bug has now been fixed | T客邦 (written in Mandarin): https://www.techbang.com/posts/110723-excel-human-genome
  7. Quora — What is the practical limit on the size of a JSON message?: https://www.quora.com/What-is-the-practical-limit-on-the-size-of-a-JSON-message-for-an-internet-application
  8. SQLite official documentation — Implementation Limits For SQLite: https://sqlite.org/limits.html
  9. Apache arrow-rs GitHub issue #5797 — "Row groups are limited to 1M rows by default": https://github.com/apache/arrow-rs/issues/5797
  10. Andy Cutler — 10 Billion Rows: Parquet File Size and Distribution When using CETAS: https://www.serverlesssql.com/row-size-and-parquet-file-distribution/

Data factory flow