Named entity recognition tools: Difference between revisions

From LemonWiki共筆
 
Named entity recognition (NER) is also known in Chinese as [https://zh.wikipedia.org/wiki/%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB 命名實體識別] (named entity recognition), 實體識別 (entity recognition), or 專有名詞辨識 (proper noun recognition).
 
== Amazon Comprehend ==
[https://aws.amazon.com/tw/comprehend/ Amazon Comprehend – 自然語言處理(NLP) 和機器學習 (ML)]
* license:
* language support:
* programming language:
* Score: Available. "Each entity also has a score that indicates the level of confidence that Amazon Comprehend has that it correctly detected the entity type. You can filter out the entities with lower scores to reduce the risk of using incorrect detections.<ref>[https://docs.aws.amazon.com/zh_tw/comprehend/latest/dg/how-entities.html Detect Entities - Amazon Comprehend]</ref>"
* classes of entity: "COMMERCIAL_ITEM, DATE, EVENT, LOCATION, ORGANIZATION, OTHER, PERSON, QUANTITY and TITLE"<ref>[https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html Detect Entities - Amazon Comprehend]</ref> as follows:
 
<table class="wikitable sortable" style="border:1;">
<tr>
    <th>Type</th>
    <th>Description</th>
    <th>Type (Chinese)</th>
</tr>
  <tr>
    <td>COMMERCIAL_ITEM</td>
    <td>A branded product</td>
    <td>商品</td>
  </tr>
  <tr>
    <td>DATE</td>
    <td>A full date (for example, 11/25/2017), day (Tuesday), month (May), or time (8:30 a.m.)</td>
    <td>日期</td>
  </tr>
  <tr>
    <td>EVENT</td>
    <td>An event, such as a festival, concert, election, etc.</td>
    <td>事件</td>
  </tr>
  <tr>
    <td>LOCATION</td>
    <td>A specific location, such as a country, city, lake, building, etc.</td>
    <td>地點</td>
  </tr>
  <tr>
    <td>ORGANIZATION</td>
    <td>Large organizations, such as a government, company, religion, sports team, etc.</td>
    <td>機構</td>
  </tr>
  <tr>
    <td>OTHER</td>
    <td>Entities that don't fit into any of the other entity categories</td>
    <td>其他</td>
  </tr>
  <tr>
    <td>PERSON</td>
    <td>Individuals, groups of people, nicknames, fictional characters</td>
    <td>人名</td>
  </tr>
  <tr>
    <td>QUANTITY</td>
    <td>A quantified amount, such as currency, percentages, numbers, bytes, etc.</td>
    <td>量詞</td>
  </tr>
  <tr>
    <td>TITLE</td>
    <td>An official name given to any creation or creative work, such as movies, books, songs, etc.</td>
    <td>抬頭</td>
  </tr>
</table>
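
The score-based filtering described in the quotation above can be sketched as follows. This is only a minimal sketch: it assumes a response shaped like the <code>Entities</code> list returned by Comprehend's DetectEntities operation (each item carrying <code>Text</code>, <code>Type</code> and <code>Score</code>). The sample data, the <code>filter_entities</code> helper and the 0.8 threshold are illustrative assumptions, not values taken from the Amazon documentation.

```python
# Hypothetical sample shaped like a DetectEntities response. In practice it
# would come from a call such as
# boto3.client("comprehend").detect_entities(Text=..., LanguageCode="en").
sample_response = {
    "Entities": [
        {"Text": "Amazon", "Type": "ORGANIZATION", "Score": 0.98},
        {"Text": "Tuesday", "Type": "DATE", "Score": 0.91},
        {"Text": "Echo", "Type": "COMMERCIAL_ITEM", "Score": 0.42},
    ]
}


def filter_entities(response, min_score=0.8):
    """Keep only entities whose confidence score is at least min_score."""
    return [e for e in response.get("Entities", []) if e["Score"] >= min_score]


kept = filter_entities(sample_response)
for entity in kept:
    print(entity["Type"], entity["Text"], entity["Score"])
```

A suitable threshold depends on the application; a higher value trades recall for fewer incorrect detections.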
 
 
== Apache OpenNLP ==
[https://opennlp.apache.org/ Apache OpenNLP]
* license: Apache License, Version 2.0
* language support: English, French, German, Italian and Dutch. Chinese is not supported. [https://opennlp.apache.org/models.html Models Download - Apache OpenNLP]
* programming language: Java
* Score: Available at github: [https://github.com/apache/opennlp apache/opennlp: Mirror of Apache OpenNLP].
* classes of entity:
 
== Baidu 百度AI开放平台 ==
[https://ai.baidu.com/tech/nlp 语言处理基础技术-百度AI开放平台] "专名识别"<ref>[https://ai.baidu.com/docs#/NLP-Basic-API/63eec4cf 词法分析接口]</ref> / [https://github.com/baidu/lac baidu/lac: 百度NLP:分词,词性标注,命名实体识别]
* license:
* language support: simplified Chinese
* programming language: multiple
* Score:
* classes of entity:
<table border="1" class="wikitable sortable">
<tr><th>Class name in English (abbreviation)</th><th>Class name in Simplified Chinese</th><th>Class name in Traditional Chinese</th></tr>
<tr><td>PER</td><td>人名</td><td>人名</td></tr>
<tr><td>LOC</td><td>地名</td><td>地名</td></tr>
<tr><td>ORG</td><td>机构名</td><td>機構名</td></tr>
<tr><td>TIME</td><td>时间</td><td>時間</td></tr>
</table>
 


== CKIP Neural Chinese Word Segmentation, POS Tagging, and NER ==
* license: [https://github.com/ckiplab/ckiptagger/blob/master/LICENSE GNU General Public License v3.0] {{Gd}}
* language support: Traditional Chinese
* programming language: Python
* Score:
* classes of entity<ref>[https://iptt.sinica.edu.tw/uploads/datas/2019/4/a251a61991139dc023d3559e93cd8d65.pdf 中文專有名詞辨識系統  簡報]</ref>


<table border="1" class="wikitable sortable">
<tr><th>Class name in English</th><th>Class name in Traditional Chinese</th></tr>
<tr><td>person</td><td>人名</td></tr>
<tr><td>norp</td><td>團體</td></tr>
<tr><td>FAC</td><td>設施</td></tr>
<tr><td>facility</td><td>設施*</td></tr>
<tr><td>ORG</td><td>組織</td></tr>
<tr><td>organization</td><td>組織*</td></tr>
<tr><td>gpe</td><td>地理</td></tr>
<tr><td>LOC</td><td>地點</td></tr>
<tr><td>location</td><td>地點*</td></tr>
<tr><td>product</td><td>商品</td></tr>
<tr><td>event</td><td>事件</td></tr>
<tr><td>WORK</td><td>藝術品</td></tr>
<tr><td>work of art</td><td>藝術品*</td></tr>
<tr><td>law</td><td>法律</td></tr>
<tr><td>language</td><td>語言</td></tr>
<tr><td>date</td><td>日期</td></tr>
<tr><td>time</td><td>時間</td></tr>
<tr><td>percent</td><td>比例</td></tr>
<tr><td>money</td><td></td></tr>
<tr><td>quantity</td><td>數量</td></tr>
<tr><td>ordinal</td><td>序數</td></tr>
<tr><td>cardinal</td><td>數詞</td></tr>
</table>


: [[Image:Owl icon.jpg]] Note: an asterisk marks a class that has a different English name but the same Chinese name as another class.


== Google Cloud Natural Language ==
* license:
* language support: [https://cloud.google.com/natural-language/docs/languages 語言支援  |  Cloud Natural Language API  |  Google Cloud], including Traditional Chinese
* programming language: multiple
* Score: Available. '''salience score''' in the [0, 1.0] range. "The salience score for an entity provides information about the importance or centrality of that entity to the entire document text. Scores closer to 0 are less salient, while scores closer to 1.0 are highly salient.<ref>[https://cloud.google.com/natural-language/docs/reference/rest/v1/Entity Entity  |  Cloud Natural Language API  |  Google Cloud]</ref>"
* classes of entity: Details on [https://cloud.google.com/natural-language/docs/reference/rest/v1/Entity Entity  |  Cloud Natural Language API  |  Google Cloud] -> Type of the entity e.g. "UNKNOWN, PERSON, LOCATION, ORGANIZATION, EVENT, WORK_OF_ART, CONSUMER_GOOD, OTHER, PHONE_NUMBER, ADDRESS, DATE, NUMBER and PRICE"
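
The salience score described above lends itself to ranking entities by how central they are to a document. The following is a minimal sketch that works on plain dictionaries shaped like the API's Entity objects (<code>name</code>, <code>type</code>, <code>salience</code>). The sample values and the <code>most_salient</code> helper are hypothetical and do not come from the Google documentation.

```python
# Hypothetical entities shaped like Entity objects (name, type, salience).
entities = [
    {"name": "Taipei", "type": "LOCATION", "salience": 0.12},
    {"name": "Academia Sinica", "type": "ORGANIZATION", "salience": 0.71},
    {"name": "2019", "type": "DATE", "salience": 0.05},
]


def most_salient(items, top_n=2):
    """Return the top_n entities ordered from most to least salient."""
    return sorted(items, key=lambda e: e["salience"], reverse=True)[:top_n]


top = most_salient(entities)
for entity in top:
    print(entity["name"], entity["type"], entity["salience"])
```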




== IBM Watson ==
* license:  
* language support:
* programming language:
* Score:
* classes of entity: "Date, Duration, EmailAddress, Facility, GeographicFeature, Hashtag, IPAddress, JobTitle, Location and more ..."<ref>[https://cloud.ibm.com/docs/services/natural-language-understanding?topic=natural-language-understanding-entity-types-version-2&locale=en Entity types (Version 2)]</ref>
== Microsoft Azure Cognitive Services ==
[https://docs.microsoft.com/zh-tw/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking 搭配文字分析 API 使用實體辨識 - Azure Cognitive Services | Microsoft Docs] / [https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking?tabs=version-3 Use entity recognition with the Text Analytics API - Azure Cognitive Services | Microsoft Docs]
* license:
* language support: English & Chinese. See details on [https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/language-support?tabs=named-entity-recognition Language support - Text Analytics API - Azure Cognitive Services | Microsoft Docs].
* programming language: Any language that supports sending a REST API request. See details on [https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking?tabs=version-3#sending-a-rest-api-request Use entity recognition with the Text Analytics API - Azure Cognitive Services | Microsoft Docs]
* Score: Available.
* classes of entity: Person, PersonType, Location, Organization, Event, Product and more. See details on [https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/named-entity-types?tabs=general Supported Categories for Named Entity Recognition - Azure Cognitive Services | Microsoft Docs].
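
Because the service is reached through a REST API, a request can be prepared with nothing but a standard library. The sketch below only builds the request and does not send it. It assumes the Text Analytics v3.0 route <code>/text/analytics/v3.0/entities/recognition/general</code> and the <code>Ocp-Apim-Subscription-Key</code> header; both should be checked against the linked documentation. The endpoint host, the key and the <code>build_ner_request</code> helper are hypothetical placeholders.

```python
import json
import urllib.request

# Assumed v3.0 route for general named entity recognition; verify it against
# the documentation for the API version actually in use.
NER_PATH = "/text/analytics/v3.0/entities/recognition/general"


def build_ner_request(endpoint, key, text, language="en"):
    """Build (but do not send) a POST request carrying one document."""
    body = {"documents": [{"id": "1", "language": language, "text": text}]}
    return urllib.request.Request(
        url=endpoint.rstrip("/") + NER_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical endpoint and key, used only to show the call shape.
request = build_ner_request(
    "https://example.cognitiveservices.azure.com/",
    "YOUR-KEY",
    "Microsoft was founded by Bill Gates and Paul Allen.",
)
print(request.get_method(), request.full_url)
```

Sending the request would then be a matter of passing it to <code>urllib.request.urlopen</code> and decoding the JSON response.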
== MONPA ==
[https://github.com/monpa-team/monpa monpa-team/monpa: MONPA 罔拍是一個提供正體中文斷詞、詞性標註以及命名實體辨識的多任務模型]
* License: CC-BY-NC-SA 4.0 License
* language support: Chinese
* programming language: Python
* Score: Available at [https://github.com/monpa-team/monpa monpa-team/monpa github]
* classes of entity: PER ...


== spaCy ==
[https://spacy.io/ spaCy · Industrial-strength Natural Language Processing in Python]
* license: [https://github.com/explosion/spaCy/blob/master/LICENSE MIT License] {{Gd}}
* language support:
* programming language: Python
* Score:
* classes of entity: "PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL <ref>[https://spacy.io/api/annotation#named-entities Annotation Specifications · spaCy API Documentation]</ref>"
== Natural Language Toolkit (NLTK) ne_chunk classifier ==
[https://www.nltk.org/ NLTK :: Natural Language Toolkit]
NE Type Examples<ref>[https://www.nltk.org/book/ch07.html 7. Extracting Information from Text]: 5 Named Entity Recognition</ref>
* ORGANIZATION: Georgia-Pacific Corp., WHO
* PERSON: Eddy Bonte, President Obama
* LOCATION: Murray River, Mount Everest
* DATE: June, 2008-06-29
* TIME: two fifty a m, 1:30 p.m.
* MONEY: 175 million Canadian Dollars, GBP 10.40
* PERCENT: twenty pct, 18.75 %
* FACILITY: Washington Monument, Stonehenge
* GPE: South East Asia, Midlothian
== Stanford CoreNLP ==
[https://stanfordnlp.github.io/CoreNLP/index.html Stanford CoreNLP – Natural language software | Stanford CoreNLP]
* license: GNU General Public License v3 {{Gd}}
* language support: English, Chinese ..
* programming language: Java
* Score: Available
* classes of entity: "For English, by default, this annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities (12 classes). <ref>[https://stanfordnlp.github.io/CoreNLP/ner.html#description Named Entity Recognition – NERClassifierCombiner | Stanford CoreNLP]</ref>"
== 卓騰語言科技中文斷詞 ==
[https://api.droidtown.co/ 卓騰語言科技中文斷詞 API]
* license:
* language support: Traditional Chinese
* programming language:
* Score:
* classes of entity: "person, location, time, measurement and more ... <ref>[https://api.droidtown.co/document/ 卓騰語言科技中文斷詞 API]</ref>"
== BosonNLP (out of service) ==
[https://bosonnlp.com/ BosonNLP]
* license:
* language support: simplified Chinese
* programming language: multiple
* Score:
* classes of entity: "time, location, person_name, org_name, company_name, product_name and job_title <ref>[http://docs.bosonnlp.com/ner.html 命名实体识别 — BosonNLP HTTP API 1.0 documentation]</ref>"
<table border="1" class="wikitable sortable">
<tr><th>Class name in English</th><th>Class name in Simplified Chinese</th><th>Class name in Traditional Chinese</th></tr>
<tr><td>time</td><td>时间</td><td>時間</td></tr>
<tr><td>location</td><td>地点</td><td>地點</td></tr>
<tr><td>person_name</td><td>人名</td><td>人名</td></tr>
<tr><td>org_name</td><td>组织名</td><td>組織名</td></tr>
<tr><td>company_name</td><td>公司名</td><td>公司名</td></tr>
<tr><td>product_name</td><td>产品名</td><td>產品名</td></tr>
<tr><td>job_title</td><td>职位</td><td>職位</td></tr>
</table>
== Other similar NER tools ==
* ''$'' [https://www.diffbot.com/ Diffbot]: [https://www.diffbot.com/dev/docs/article/ Article Extraction API Documentation - Diffbot] "Array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Language-specific tags will be returned if the source text is in English, Chinese, French, German, Spanish or Russian."
* [https://products.wolframalpha.com/api/ Wolfram|Alpha APIs: Computational Knowledge Integration] "Wolfram|Alpha makes numerous assumptions when analyzing a query and deciding how to present its results. A simple example is a word that can refer to multiple things, like "pi", which is a well-known mathematical constant but is also the name of a movie." [https://products.wolframalpha.com/docs/WolframAlpha-API-Reference.pdf]
* [https://www.wikidata.org/wiki/Wikidata:Data_access Wikidata:Data access - Wikidata]
* [https://api.duckduckgo.com/api DuckDuckGo Instant Answer API]


== References ==

Latest revision as of 13:49, 17 January 2023

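Named entity recognition (NER) is also known in Chinese as 命名實體識別 (named entity recognition), 實體識別 (entity recognition), or 專有名詞辨識 (proper noun recognition).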

Amazon Comprehend

Amazon Comprehend – 自然語言處理(NLP) 和機器學習 (ML)

  • license:
  • language support:
  • programming language:
  • Score: Available. "Each entity also has a score that indicates the level of confidence that Amazon Comprehend has that it correctly detected the entity type. You can filter out the entities with lower scores to reduce the risk of using incorrect detections.[1]"
  • classes of entity: "COMMERCIAL_ITEM, DATE, EVENT, LOCATION, ORGANIZATION, OTHER, PERSON, QUANTITY and TITLE"[2] as follows:
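<table class="wikitable sortable">
<tr><th>Type</th><th>Description</th><th>Type (Chinese)</th></tr>
<tr><td>COMMERCIAL_ITEM</td><td>A branded product</td><td>商品</td></tr>
<tr><td>DATE</td><td>A full date (for example, 11/25/2017), day (Tuesday), month (May), or time (8:30 a.m.)</td><td>日期</td></tr>
<tr><td>EVENT</td><td>An event, such as a festival, concert, election, etc.</td><td>事件</td></tr>
<tr><td>LOCATION</td><td>A specific location, such as a country, city, lake, building, etc.</td><td>地點</td></tr>
<tr><td>ORGANIZATION</td><td>Large organizations, such as a government, company, religion, sports team, etc.</td><td>機構</td></tr>
<tr><td>OTHER</td><td>Entities that don't fit into any of the other entity categories</td><td>其他</td></tr>
<tr><td>PERSON</td><td>Individuals, groups of people, nicknames, fictional characters</td><td>人名</td></tr>
<tr><td>QUANTITY</td><td>A quantified amount, such as currency, percentages, numbers, bytes, etc.</td><td>量詞</td></tr>
<tr><td>TITLE</td><td>An official name given to any creation or creative work, such as movies, books, songs, etc.</td><td>抬頭</td></tr>
</table>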


Apache OpenNLP

Apache OpenNLP

Baidu 百度AI开放平台

语言处理基础技术-百度AI开放平台 "专名识别"[3] / baidu/lac: 百度NLP:分词,词性标注,命名实体识别

  • license:
  • language support: simplified Chinese
  • programming language: multiple
  • Score:
  • classes of entity:
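<table border="1" class="wikitable sortable">
<tr><th>Class name in English (abbreviation)</th><th>Class name in Simplified Chinese</th><th>Class name in Traditional Chinese</th></tr>
<tr><td>PER</td><td>人名</td><td>人名</td></tr>
<tr><td>LOC</td><td>地名</td><td>地名</td></tr>
<tr><td>ORG</td><td>机构名</td><td>機構名</td></tr>
<tr><td>TIME</td><td>时间</td><td>時間</td></tr>
</table>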


CKIP Neural Chinese Word Segmentation, POS Tagging, and NER

ckiplab/ckiptagger: CKIP Neural Chinese Word Segmentation, POS Tagging, and NER

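<table border="1" class="wikitable sortable">
<tr><th>Class name in English</th><th>Class name in Traditional Chinese</th></tr>
<tr><td>person</td><td>人名</td></tr>
<tr><td>norp</td><td>團體</td></tr>
<tr><td>FAC</td><td>設施</td></tr>
<tr><td>facility</td><td>設施*</td></tr>
<tr><td>ORG</td><td>組織</td></tr>
<tr><td>organization</td><td>組織*</td></tr>
<tr><td>gpe</td><td>地理</td></tr>
<tr><td>LOC</td><td>地點</td></tr>
<tr><td>location</td><td>地點*</td></tr>
<tr><td>product</td><td>商品</td></tr>
<tr><td>event</td><td>事件</td></tr>
<tr><td>WORK</td><td>藝術品</td></tr>
<tr><td>work of art</td><td>藝術品*</td></tr>
<tr><td>law</td><td>法律</td></tr>
<tr><td>language</td><td>語言</td></tr>
<tr><td>date</td><td>日期</td></tr>
<tr><td>time</td><td>時間</td></tr>
<tr><td>percent</td><td>比例</td></tr>
<tr><td>money</td><td></td></tr>
<tr><td>quantity</td><td>數量</td></tr>
<tr><td>ordinal</td><td>序數</td></tr>
<tr><td>cardinal</td><td>數詞</td></tr>
</table>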
Note: an asterisk marks a class that has a different English name but the same Chinese name as another class.
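
Since several classes in the table above appear under two English names, code that consumes these labels may want to fold the variants together. The following is a minimal sketch of such a normalization. The alias pairs are read off the table, but treating the short form as canonical and the <code>canonical_label</code> helper itself are assumptions made for illustration, not part of the CKIP tools.

```python
# Alias pairs read off the table; choosing the short form as the canonical
# name is an assumption made for this sketch.
ALIASES = {
    "facility": "FAC",
    "organization": "ORG",
    "location": "LOC",
    "work of art": "WORK",
}


def canonical_label(label):
    """Map a long-form class name to its short form; pass others through."""
    cleaned = label.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


labels = ["facility", "FAC", "person", "Work of Art"]
normalized = [canonical_label(name) for name in labels]
print(normalized)
```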


Google Cloud Natural Language

Cloud Natural Language  |  Cloud Natural Language API  |  Google Cloud

  • license:
  • language support: 語言支援  |  Cloud Natural Language API  |  Google Cloud, including Traditional Chinese
  • programming language: multiple
  • Score: Available. salience score in the [0, 1.0] range. "The salience score for an entity provides information about the importance or centrality of that entity to the entire document text. Scores closer to 0 are less salient, while scores closer to 1.0 are highly salient.[5]"
  • classes of entity: Details on Entity  |  Cloud Natural Language API  |  Google Cloud -> Type of the entity e.g. "UNKNOWN, PERSON, LOCATION, ORGANIZATION, EVENT, WORK_OF_ART, CONSUMER_GOOD, OTHER, PHONE_NUMBER, ADDRESS, DATE, NUMBER and PRICE"


IBM Watson

Watson Natural Language Understanding

  • license:
  • language support:
  • programming language:
  • Score:
  • classes of entity: "Date, Duration, EmailAddress, Facility, GeographicFeature, Hashtag, IPAddress, JobTitle, Location and more ..."[6]


Microsoft Azure Cognitive Services

搭配文字分析 API 使用實體辨識 - Azure Cognitive Services | Microsoft Docs / Use entity recognition with the Text Analytics API - Azure Cognitive Services | Microsoft Docs

MONPA

monpa-team/monpa: MONPA 罔拍是一個提供正體中文斷詞、詞性標註以及命名實體辨識的多任務模型

  • License: CC-BY-NC-SA 4.0 License
  • language support: Chinese
  • programming language: Python
  • Score: Available at monpa-team/monpa github
  • classes of entity: PER ...

spaCy

spaCy · Industrial-strength Natural Language Processing in Python

  • license: MIT License
  • language support:
  • programming language: Python
  • Score:
  • classes of entity: "PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL [7]"

Natural Language Toolkit (NLTK) ne_chunk classifier

NLTK :: Natural Language Toolkit

NE Type Examples[8]

  • ORGANIZATION: Georgia-Pacific Corp., WHO
  • PERSON: Eddy Bonte, President Obama
  • LOCATION: Murray River, Mount Everest
  • DATE: June, 2008-06-29
  • TIME: two fifty a m, 1:30 p.m.
  • MONEY: 175 million Canadian Dollars, GBP 10.40
  • PERCENT: twenty pct, 18.75 %
  • FACILITY: Washington Monument, Stonehenge
  • GPE: South East Asia, Midlothian

Stanford CoreNLP

Stanford CoreNLP – Natural language software | Stanford CoreNLP

  • license: GNU General Public License v3
  • language support: English, Chinese ..
  • programming language: Java
  • Score: Available
  • classes of entity: "For English, by default, this annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities (12 classes). [9]"


卓騰語言科技中文斷詞

卓騰語言科技中文斷詞 API

  • license:
  • language support: Traditional Chinese
  • programming language:
  • Score:
  • classes of entity: "person, location, time, measurement and more ... [10]"


BosonNLP (out of service)

BosonNLP

  • license:
  • language support: simplified Chinese
  • programming language: multiple
  • Score:
  • classes of entity: "time, location, person_name, org_name, company_name, product_name and job_title [11]"
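<table border="1" class="wikitable sortable">
<tr><th>Class name in English</th><th>Class name in Simplified Chinese</th><th>Class name in Traditional Chinese</th></tr>
<tr><td>time</td><td>时间</td><td>時間</td></tr>
<tr><td>location</td><td>地点</td><td>地點</td></tr>
<tr><td>person_name</td><td>人名</td><td>人名</td></tr>
<tr><td>org_name</td><td>组织名</td><td>組織名</td></tr>
<tr><td>company_name</td><td>公司名</td><td>公司名</td></tr>
<tr><td>product_name</td><td>产品名</td><td>產品名</td></tr>
<tr><td>job_title</td><td>职位</td><td>職位</td></tr>
</table>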


Other similar NER tools

  • $ Diffbot: Article Extraction API Documentation - Diffbot "Array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Language-specific tags will be returned if the source text is in English, Chinese, French, German, Spanish or Russian."
  • Wolfram|Alpha APIs: Computational Knowledge Integration "Wolfram|Alpha makes numerous assumptions when analyzing a query and deciding how to present its results. A simple example is a word that can refer to multiple things, like "pi", which is a well-known mathematical constant but is also the name of a movie." [1]

References