How to Open Data Files: Common Formats, Limitations, and Notes
{{LanguageSwitcher | content = [[Comparison of common data file formats | EN]], [[Comparison of common data file formats in Mandarin | 漢字]] }}
== Common Data Formats and Limitations ==
=== CSV Files ===
CSV files: different field values are separated by commas.
The CSV format itself has no row limit — the limit comes from the software used to open it.<ref>Row Zero — CSV Row Limits by Spreadsheet: https://rowzero.com/blog/csv-row-limit</ref>
Common issues arise because most people open CSV files with Microsoft Excel:
* '''(1) Garbled Chinese characters''': double-clicking the file may produce garbled text; either prepend a UTF-8 byte-order mark (BOM) to the file, or import the file instead of opening it directly.
* '''(2) Newline characters in field values''': if a field value contains a newline, Excel tends to misinterpret it, pushing data from the same record onto the next row and misaligning the columns.
* '''(3) Forced type conversion''': a common example is the leading zero of a phone number being dropped because the value is coerced into an integer.
If you receive a CSV file whose field values contain newline characters, Excel is not recommended; use LibreOffice Calc instead.
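The three pitfalls above are Excel behaviours, not properties of the CSV format. As a minimal sketch, Python's standard <code>csv</code> module parses quoted newlines correctly and never coerces types, so both problem values survive (when writing a file for Excel, <code>encoding="utf-8-sig"</code> is the usual way to prepend the BOM):

```python
import csv
import io

# One record whose phone number has a leading zero and whose note field
# contains a newline, both cases that trip up Excel.
data = 'name,phone,note\nAlice,0912345678,"line one\nline two"\n'

# The csv module honours quoted newlines and keeps every value a string,
# so the leading zero and the embedded newline both survive intact.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["phone"])  # leading zero preserved
print(rows[0]["note"])   # still one record, newline kept inside the field
```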
=== TSV Files ===
TSV files: different field values are separated by tab characters. If field values contain commas, TSV can be used as an alternative.
TSV and CSV are both plain-text formats; the row limit likewise depends on the software used to open them, not the format itself.
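A minimal sketch of the substitution described above, round-tripping a comma-laden field through Python's standard <code>csv</code> module with a tab delimiter:

```python
import csv
import io

# Write and read back a field value that contains commas, the case where
# TSV is the convenient alternative to CSV.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["id", "address"])
writer.writerow(["1", "No. 5, Xinyi Rd., Taipei"])

buf.seek(0)
rows = list(csv.reader(buf, delimiter="\t"))
print(rows[1][1])  # the commas stay inside a single field
```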
=== Excel Files ===
Excel files: a widely used file format, but with limits on cell text length and column count: a single cell cannot contain more than 32,767 characters, and the maximum number of columns is 16,384.<ref>Microsoft Support — Excel specifications and limits: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3</ref> Although the theoretical row limit is 1,048,576 rows (= 2²⁰), performance becomes noticeably sluggish beyond 100,000 rows, making analysis inconvenient. Older versions of Excel (2003 and earlier) allowed even fewer rows — only 65,536 rows (= 2¹⁶) × 256 columns.<ref>[Did You Know] Microsoft Excel has a Limit of Maximum 1,048,576 Rows and 16,384 Columns – AskVG: https://www.askvg.com/did-you-know-microsoft-excel-has-a-limit-of-maximum-1048576-rows-and-16384-columns/</ref><ref>What is the maximum allowed rows in a Microsoft Excel .xls or .xlsx - Super User: https://superuser.com/questions/366468/what-is-the-maximum-allowed-rows-in-a-microsoft-excel-xls-or-xlsx</ref>
Common issues include Excel automatically converting long numbers (e.g., order IDs) into scientific notation,<ref>Fixing the issue of long numbers being rounded or displayed in scientific notation in Excel (written in Mandarin): https://errerrors.blogspot.com/2023/04/how-to-fix-rounding-after-15-digits-on-excel.html</ref> or misidentifying certain strings (such as <code>1-2</code> or <code>MAR3</code>) as dates,<ref>Excel bug forces scientists to rename human genes — and the bug has now been fixed | T客邦 (written in Mandarin): https://www.techbang.com/posts/110723-excel-human-genome</ref> resulting in data loss or corruption.
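The root of the long-number problem is that Excel stores numeric cells as 64-bit floating-point values, which keep only about 15 significant digits. A small Python sketch of the same precision loss (the 17-digit order ID is a made-up example):

```python
# Simulate Excel's coercion: once a long ID becomes a 64-bit float,
# digits beyond roughly the 15th significant digit are lost.
order_id = "12345678901234567"  # a 17-digit order ID, kept as text

as_float = float(order_id)      # what happens when the cell turns numeric
as_text = order_id              # what happens when the cell stays text

print(f"{as_float:.0f}")  # trailing digits corrupted by float rounding
print(as_text)            # intact
```

When loading such data programmatically, forcing a string type (e.g. <code>dtype=str</code> in pandas) avoids the coercion entirely.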
=== JSON Files ===
JSON files: suitable for complex or flexibly structured field definitions. In practice, a JSON file of roughly 250 MB containing multiple records has caused a read error and had to be split before it could be processed.
The JSON format itself has no row limit, but the bottleneck is that the entire file must be loaded into memory when read — the larger the file, the more likely an out-of-memory error will occur.<ref>Quora — What is the practical limit on the size of a JSON message?: https://www.quora.com/What-is-the-practical-limit-on-the-size-of-a-JSON-message-for-an-internet-application</ref>
In practice, it is recommended to avoid storing multiple records in a single JSON file and to use JSONL instead — that is, JSON with newline characters delimiting individual records. This format is suitable for streaming line by line with a program, preventing out-of-memory errors.
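A minimal sketch of the streaming approach with Python's standard <code>json</code> module, here reading from an in-memory string standing in for a hypothetical <code>data.jsonl</code> file:

```python
import io
import json

# Three records in JSONL form: one complete JSON object per line.
jsonl = (
    '{"id": 1, "name": "Alice"}\n'
    '{"id": 2, "name": "Bob"}\n'
    '{"id": 3, "name": "Carol"}\n'
)

# Stream line by line: only one record is in memory at a time, which is
# what avoids the out-of-memory errors described above.
names = []
for line in io.StringIO(jsonl):  # in practice: open("data.jsonl", encoding="utf-8")
    record = json.loads(line)
    names.append(record["name"])

print(names)
```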
=== SQLite or Parquet Files ===
SQLite or Parquet files: when the data volume exceeds Excel's row limit, these two formats are worth considering.
The theoretical row limit per table in SQLite is 2⁶⁴ rows (approximately 1.8 × 10¹⁹), but this number is practically unreachable because the database file size limit of 281 TB will be hit first. At maximum database capacity, the number of storable rows is approximately 2 × 10¹³ (assuming no indexes and minimal row size).<ref>SQLite official documentation — Implementation Limits For SQLite: https://sqlite.org/limits.html</ref>
Parquet has no hard row limit. The format stores data in units called Row Groups — the horizontal partitioning unit within a Parquet file. Each Row Group contains a contiguous segment of rows stored column by column. For example, a dataset with 1,000,000 rows and 10 columns would be split into 10 separate column chunks within a single Row Group, rather than stored row by row. This design allows queries to read only the required columns without scanning entire rows, greatly improving read performance. Each Row Group defaults to a limit of 1,000,000 rows, but a single file can contain any number of Row Groups, so there is no theoretical row limit for the file as a whole.<ref>Apache arrow-rs GitHub issue #5797 — "Row groups are limited to 1M rows by default": https://github.com/apache/arrow-rs/issues/5797</ref> In practice, people have successfully written 500 million to 1 billion rows,<ref>Andy Cutler — 10 Billion Rows: Parquet File Size and Distribution When using CETAS: https://www.serverlesssql.com/row-size-and-parquet-file-distribution/</ref> with the bottleneck typically being disk space rather than the format itself.
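As a minimal sketch of the SQLite option, Python's built-in <code>sqlite3</code> module can insert more rows than Excel's 1,048,576-row ceiling without special handling (table and column names are made up for the example):

```python
import sqlite3

EXCEL_ROW_LIMIT = 1_048_576

# In-memory database for the sketch; in practice, pass a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# Bulk-insert one row more than a single Excel sheet can hold.
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    ((i, i * 1.5) for i in range(1, EXCEL_ROW_LIMIT + 2)),
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # one row beyond Excel's limit, stored without issue
```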
=== SQL Files ===
SQL files: the export format for relational databases, consisting of multiple lines of SQL statements (e.g., <code>INSERT INTO ...</code>), which can be imported directly into a database.
SQL files themselves have no row limit; the format is plain text, and limitations again come from the tools used to open or import them.
Common issues include:
* '''(1) Large file size''': bulk data exports can easily reach several gigabytes; opening such files in a text editor may cause it to freeze or become unresponsive.
* '''(2) Cross-database compatibility''': different database engines (such as MySQL, PostgreSQL, and SQLite) have differences in SQL syntax; a <code>.sql</code> file exported from engine A may not be directly importable into another engine. See also: Troubleshooting MySQL Technical Issues — Migrating Data from MySQL to MSSQL (written in Mandarin).
If the goal is simply to migrate data rather than restore a complete database structure, it is recommended to use CSV or Parquet instead, in order to avoid the compatibility issues described above.
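A minimal sketch of a direct import, using Python's built-in <code>sqlite3</code> module on a tiny made-up dump; note this only works when the dump's syntax matches the target engine's dialect, which is exactly the compatibility caveat above:

```python
import sqlite3

# A tiny SQL dump of the kind relational databases export.
dump = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users (id, name) VALUES (1, 'Alice');
INSERT INTO users (id, name) VALUES (2, 'Bob');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(dump)  # runs every statement in the dump, in order

names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]
print(names)
```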
== References ==
<references />
[[Category: Software]]
[[Category: Spreadsheet]]
[[Category: MySQL]]
[[Category: Data Science]]
[[Category: Revised with LLMs]]