How to extract content from websites: Difference between revisions

Revision as of 11:30, 15 July 2025

719 bytes added, 15 July 2025

m

no edit summary

Planetoid

Bureaucrats, Administrators

15,063

edits

@@ Line 2: / Line 2: @@
 == Methods for article extraction ==
-[https://github.com/mozilla/readability mozilla/readability: A standalone version of the readability lib]<ref>[https://videoinu.com/blog/firefox-reader-view-heuristics/ How does Firefox's Reader View work?]</ref>
+=== Open source solution for article extraction ===
+{{Gd}} [https://github.com/mozilla/readability mozilla/readability: A standalone version of the readability lib]<ref>[https://videoinu.com/blog/firefox-reader-view-heuristics/ How does Firefox's Reader View work?]</ref>
 * Demo:
 * Pricing / Free Limit: free
@@ Line 8: / Line 9: @@
 * License: Apache License, Version 2.0 {{Gd}}
 * Source code of client: Available on GitHub & Container: [https://hub.docker.com/r/phpdockerio/readability-js-server phpdockerio/readability-js-server - Docker Image | Docker Hub]
+{{Gd}} [https://github.com/postlight/parser postlight/parser: 📜 Extract meaningful content from the chaos of a web page] (Replacement of [https://github.com/postlight/mercury-parser postlight/mercury-parser])
+* Demo:
+* Requirement: Node.js
+* License: Apache License, Version 2.0 or MIT license {{Gd}}
+* Container
 [https://github.com/timothytylee/full-text-rss timothytylee/full-text-rss: Fork of Full-Text RSS to improve handling of non UTF-8 sites]
@@ Line 13: / Line 20: @@
 * Requirement: PHP
 * License: GNU Affero General Public License v3.0 {{Gd}}
-* Container
-[https://github.com/postlight/mercury-parser postlight/mercury-parser: 📜 Extract meaningful content from the chaos of a web page]
-* Demo:
-* Requirement: Node.js
-* License: Apache License, Version 2.0 or MIT license {{Gd}}
 * Container
@@ Line 32: / Line 33: @@
 * License:
 * Container
+[https://totheweb.com/learning_center/tools-convert-html-text-to-plain-text-for-content-review/ Free Tool: Convert Your Webpage to Plain Text » ToTheWeb]
+* Demo:
+* Requirement:
+* License: GPL + Apache License Version 2.0 {{Gd}}
+* Container
+[https://github.com/crscheid/php-article-extractor crscheid/php-article-extractor: A PHP library to extract article text from web pages]
+* Pricing / Free Limit: free
+* Source code of client: Available on GitHub
+* License:
+=== Commercial solution for article extraction ===
 ''$'' [https://www.diffbot.com/products/extract/ Diffbot | Extract Content From Websites Automatically] two weeks free trial
@@ Line 38: / Line 52: @@
 * Requirement:
 * License:
-* Container
+* Source code of client: [https://www.diffbot.com/dev/docs/libraries/ Diffbot Libraries - Diffbot]
 ''$'' [https://www.fivefilters.org/full-text-rss/ Full-Text RSS - FiveFilters.org]
@@ Line 44: / Line 58: @@
 * Requirement: PHP
 * License:
-* Source code of client: [https://www.diffbot.com/dev/docs/libraries/ Diffbot Libraries - Diffbot]
+* Source code of client:
-[https://totheweb.com/learning_center/tools-convert-html-text-to-plain-text-for-content-review/ Free Tool: Convert Your Webpage to Plain Text » ToTheWeb]
-* Demo:
-* Requirement:
-* License: GPL + Apache License Version 2.0 {{Gd}}
-* Container
-[https://github.com/crscheid/php-article-extractor crscheid/php-article-extractor: A PHP library to extract article text from web pages]
-* Pricing / Free Limit: free
-* Source code of client: Available on GitHub
-* License:
 [https://getpocket.com/developer/docs/v3/article-view Pocket Developer Program: Pocket API: Article View] not available {{exclaim}}
@@ Line 61: / Line 64: @@
 * Source code of client: n/a
 * License:
+''$'' [https://newsblur.com/ NewsBlur] > [https://newsblur.com/api The NewsBlur API] > GET /rss_feeds/original_story
+* Pricing:
+* License:
+* Source code of client:
+''$'' [https://feedbin.com/ Feedbin] > API > [https://github.com/feedbin/feedbin-api/blob/master/content/extract-full-content.md Extracting Full Content]
+* Pricing:
+* License:
+* Source code of client:
 == References ==
@@ Line 69: / Line 82: @@
 == Related pages ==
 * [[Named entity recognition tools]]
+* ''$'' [https://www.producthunt.com/products/diffbot/alternatives Best Diffbot Alternatives - 2024 | Product Hunt]
+{{Template:Data factory flow}}
 [[Category:Tool]]
