Data science glossary: Difference between revisions

Latest revision as of 16:51, 22 April 2024

資料科學、生成式 AI (Generative AI) 相關詞彙

這篇文章「Data science glossary」內容還在撰寫中，如果有不完整的部分，歡迎你直接動手修改。

C[edit]

CSV (Comma-Separated Values) 逗號間隔的檔案。相關設定：
- 記錄分隔字元：通常每筆紀錄的記錄分隔字元使用換行符號
- 欄位分隔符號、欄位分隔字元 (field delimiter, column delimiter, separator)：通常使用逗號，只會有一個字元 (character)。部分廣義的 CSV 檔案會使用分號、定位鍵作為欄位分隔符號。
- 文字辨識符號 (field enclosure character, text qualifier)：通常使用雙引號符號，只會有一個字元 (character)。
- escape：當欄位值包含文字辨識符號，則需要 escape
- 欄位開始列、資料開始列。通常欄位開始列是第1列、資料開始列是第2列。有時候第2列會加上欄位說明，導致資料開始列是第3列。

D[edit]

data [繁] 資料 [簡] 数据。「指未經過處理的原始記錄。」(資料來源: 維基百科)
Data extraction [繁] 資料萃取、資料提取^[1] [簡] 数据提取、数据抽取。「從資料來源萃取資料的流程，通常資料來源是非結構化資料。以利進一步資料處理或資料儲存。^[2]」。相關詞彙: Extract, transform, load (ETL)
Data ingestion [繁] 資料擷取 [簡] 数据获取、数据摄取、数据接入。「將不同來源的資料，集中放置或匯入到同一目的地的流程^[3]^[4]」。
Data transformation [繁] 資料轉換、資料變換^[5] [簡] 数据转换。「將資料轉換成不同的格式或結構的流程。資料轉換是資料整合或資料管理的基礎，其任務包含了資料整理 (data wrangling)、資料倉儲 (Data warehouse) 等。^[6]」依據資料分析目的，「將原始資料轉換成乾淨的、檢核過的、可以使用的格式。 (cleansed, validated, and ready-to-use form) ^[7]」

E[edit]

Exploratory data analysis (EDA) [繁] 探索式資料分析 [簡] 探索性数据分析^[8]。

K[edit]

Knowledge discovery in databases (KDD) [繁] 資料庫的知識探索 [簡] 数据库的知识发现。KDD 處理程序包含「data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. 」從原始資料中萃取有價值的知識。(Fayyad, Piatetsky-Shapiro, and Smyth 1996^[9])

M[edit]

(data) model [繁] 資料模型、模型 [簡] 数据模型、模型。「在軟體工程中，資料模型是定義資料如何輸入和與輸出的一種模型。」(資料來源: 維基百科)

P[edit]

pattern [繁] 樣式 [簡] 模式。「從資料中找出隱藏的規則性或因果關係，即尋找樣式」(資料來源: 陳允傑, 2018^[10])

Q[edit]

Qualitative data [繁] 質性資料、定性資料 [簡] 定性数据。
Qualitative research [繁] 質性研究、質化研究、定性研究 [簡] 定性研究。
Quantitative data [繁] 量化資料、定量資料 [簡] 定量数据。
Quantitative research [繁] 量化研究 [簡] 定量研究。相關頁面: 量化分析與質化分析研究的整合

R[edit]

RAG (Retrieval Augmented Generation) [繁] 檢索增強生成、 [簡] 检索增强生成：「為了解決機器幻覺問題，Meta的研究人員發表了一篇關於一種名為「檢索增強生成」（Retrieval Augmented Generation，簡稱RAG）的技術論文。這種技術為文本生成模型增加了一個資訊檢索組件，這是大型語言模型（LLM）已經擅長的。這允許對LLM的內部知識進行微調和調整，使其更精準且更新。」^[11]

S[edit]

(database) schema [繁] (資料庫) 模式、架構 [簡] (数据库) 模式、架构。"Schema is a set of interrelated database objects, such as tables, table columns, data types of the columns, indexes, foreign keys, and so on." (MySQL^[12]) 相關文件: Create database schema document

system prompt, system message [繁] 系統提示、系統訊息、[簡] 系统提示：「系統訊息有助於設定助理的行為模式。例如，您可以修改助理的個性或提供關於其在對話過程中應如何行為的具體指示。」^[13]、「使用系統提示，您可以為對話設定基調 (stage)，指定角色、個性、語氣或其他相關資訊信息，以幫助更好地理解和回應用戶的輸入。系統提示可以包括： (1) 任務指示和目標、(2) 個性特徵、角色和語調指南、(3) 用戶輸入的情境資訊、(4) 創意限制和風格指導、(5) 外部知識、數據或參考材料、(6) 規則、指導方針和限定話題邊界 (guardrails)、(7) 輸出驗證標準和要求」^[14]

T[edit]

data type [繁] 資料類型 [簡] 数据类型。「資料類型描述了數值的表示法、解釋和結構，並以演算法操作，或是物件在記憶體中的儲存區，或者其它儲存裝置。」 (資料來源: 維基百科)。相關文件: 資料表欄位設計時，針對不同資料類型，建議的資料型態。

參考資料[edit]

[1] ta extraction - 資料提取

[2] Data extraction - Wikipedia

[3] What is data ingestion? - Definition from WhatIs.com

[4] What is Data Ingestion? | Alooma

[5] ta transformation - 資料轉換法

[6] Data transformation - Wikipedia

[7] Top 7 Best Practices for Data Transformation | Import.io

[8] 探索性数据分析(EDA),你会使用吗？

[9] Fayyad, Piatetsky-Shapiro, and Smyth (1996). From Data Mining to Knowledge Discovery in Databases | AI Magazine

[10] 博客來-Python 資料科學與人工智慧應用實務

[11] What is Retrieval Augmented Generation (RAG)?

[12] MySQL :: MySQL 5.7 Reference Manual :: MySQL Glossary

[13] Text generation - OpenAI API

[14] System prompts

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

@@ Line 1: / Line 1: @@
-資料科學相關詞彙
+資料科學、生成式 AI (Generative AI) 相關詞彙
 {{Template:Draft}}
+== C ==
+* [https://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC CSV] (Comma-Separated Values) 逗號間隔的檔案。相關設定：
+** 記錄分隔字元：通常每筆紀錄的記錄分隔字元使用[[Return symbol | 換行符號]]
+** 欄位分隔符號、欄位分隔字元 (field delimiter, column delimiter, separator)：通常使用逗號，只會有一個字元 (character)。部分廣義的 CSV 檔案會使用分號、定位鍵作為欄位分隔符號。
+** 文字辨識符號 (field enclosure character, text qualifier)：通常使用雙引號符號，只會有一個字元 (character)。
+** escape：當欄位值包含文字辨識符號，則需要 escape
+** 欄位開始列、資料開始列。通常欄位開始列是第1列、資料開始列是第2列。有時候第2列會加上欄位說明，導致資料開始列是第3列。
 == D ==
@@ Line 26: / Line 34: @@
 * Quantitative data [繁] 量化資料、定量資料 [簡] 定量数据。
 * Quantitative research [繁] [http://terms.naer.edu.tw/detail/1678721/ 量化研究] [簡] [https://baike.baidu.com/item/%E5%AE%9A%E9%87%8F%E7%A0%94%E7%A9%B6 定量研究]。相關頁面: [[Quantitative research and qualitative research integration | 量化分析與質化分析研究的整合]]
+== R ==
+* RAG (Retrieval Augmented Generation) [繁] 檢索增強生成、 [簡] 检索增强生成：「為了解決機器幻覺問題，Meta的研究人員發表了一篇關於一種名為「檢索增強生成」（Retrieval Augmented Generation，簡稱RAG）的技術論文。這種技術為文本生成模型增加了一個資訊檢索組件，這是大型語言模型（LLM）已經擅長的。這允許對LLM的內部知識進行微調和調整，使其更精準且更新。」<ref>[https://vercel.com/guides/retrieval-augmented-generation What is Retrieval Augmented Generation (RAG)?]</ref>
 == S ==
 * [https://en.wikipedia.org/wiki/Database_schema (database) schema] [繁] [https://zh.wikipedia.org/wiki/Schema_(%E6%95%B0%E6%8D%AE%E5%BA%93) (資料庫) 模式、架構] [簡] (数据库) 模式、架构。"Schema is a set of interrelated database objects, such as tables, table columns, data types of the columns, indexes, foreign keys, and so on." (MySQL<ref>[https://dev.mysql.com/doc/refman/5.7/en/glossary.html#glos_schema MySQL :: MySQL 5.7 Reference Manual :: MySQL Glossary]</ref>) 相關文件: [[Create database schema document]]
+* system prompt, system message [繁] 系統提示、系統訊息、[簡] 系统提示：「系統訊息有助於設定助理的行為模式。例如，您可以修改助理的個性或提供關於其在對話過程中應如何行為的具體指示。」<ref>[https://platform.openai.com/docs/guides/text-generation Text generation - OpenAI API]</ref>、「使用系統提示，您可以為對話設定基調 (stage)，指定角色、個性、語氣或其他相關資訊信息，以幫助更好地理解和回應用戶的輸入。系統提示可以包括： (1) 任務指示和目標、(2) 個性特徵、角色和語調指南、(3) 用戶輸入的情境資訊、(4) 創意限制和風格指導、(5) 外部知識、數據或參考材料、(6) 規則、指導方針和限定話題邊界 (guardrails)、(7) 輸出驗證標準和要求」<ref>[https://docs.anthropic.com/claude/docs/system-prompts#what-is-a-system-prompt System prompts]</ref>
 == T ==
@@ Line 37: / Line 50: @@
 <references/>
-[[Category:Academic]]
+[[Category: Academic]]
-[[Category:Glossary]]
+[[Category: Glossary]]
-[[Category:Data Science]]
+[[Category: Data Science]]
+[[Category: Artificial intelligence]]
+[[Category: Generative AI]]

Data science glossary: Difference between revisions

Latest revision as of 16:51, 22 April 2024

Contents

C[edit]

D[edit]

E[edit]

K[edit]

M[edit]

P[edit]

Q[edit]

R[edit]

S[edit]

T[edit]

參考資料[edit]

Navigation menu

Data science glossary: Difference between revisions

Latest revision as of 16:51, 22 April 2024

C[edit]

D[edit]

E[edit]

K[edit]

M[edit]

P[edit]

Q[edit]

R[edit]

S[edit]

T[edit]

參考資料[edit]

Navigation menu

Search