Web scrape troubleshooting

From LemonWiki共筆
* [[Skill tree of web scraping]]
* [https://blog.gslin.org/archives/2022/08/22/10850/%e5%8d%97%e9%9f%93%e6%9c%80%e9%ab%98%e6%b3%95%e9%99%a2%e4%b9%9f%e5%b0%8d-web-scraping-%e7%b5%a6%e5%87%ba%e4%ba%86%e9%a1%9e%e4%bc%bc%e7%be%8e%e5%9c%8b%e7%9a%84%e5%88%a4%e4%be%8b/ South Korea's Supreme Court also issued a ruling on web scraping similar to the US precedent – Gea-Suan Lin's BLOG]
* [https://www.yzu.edu.tw/library/index.php/tw/news-tw/943-20181226-1 Yuan Ze University - Library - Suspected infringement: tips you cannot afford to miss]


== References ==

Latest revision as of 15:25, 9 June 2024

Before starting to write a web scraping (crawler) script

  • Does the website offer datasets or files, e.g. open data?
  • Does the website offer an API (application programming interface)?


List of technical issues

  1. The content of the web page changed (revision): the expected content of the specified DOM element became empty.
    • Use multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
    • Back up the HTML text of the parent DOM element.
    • (optional) Back up the complete HTML file.
  2. The IP address was banned by the server.
  3. CAPTCHA
  4. AJAX
  5. The web page requires signing in.
  6. The server blocks requests that lack a Referer or other headers.
  7. Connection timeout during an HTTP request, e.g. in PHP the default value of default_socket_timeout is 60 seconds[1][2].
  8. Language and URL-encoded strings
  9. How to extract content from websites
  10. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
  11. Is the link a permanent link?
  12. Enabling or disabling CSS or JavaScript
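
Two of the issues above, URL-encoded strings and non-breaking spaces, can be illustrated with a minimal Python sketch that uses only the standard library. The function names `clean_text` and `encode_path_segment` and the sample strings are hypothetical and chosen only for illustration; real pages may need more cleaning rules than this.

```python
from urllib.parse import quote, unquote


def clean_text(raw: str) -> str:
    """Collapse every run of Unicode whitespace into a single ASCII space.

    str.split() without arguments splits on all Unicode whitespace,
    including the non-breaking space (U+00A0) and the ideographic
    space (U+3000), and drops leading and trailing whitespace.
    """
    return " ".join(raw.split())


def encode_path_segment(text: str) -> str:
    """Percent-encode a string (as UTF-8) for use inside a URL path."""
    # safe="" also encodes "/" so the text stays within one path segment.
    return quote(text, safe="")


if __name__ == "__main__":
    # Hypothetical scraped cell containing a non-breaking space.
    print(clean_text("  Price:\xa0100\u3000NTD\n"))
    # Percent-encoded and decoded forms of a non-ASCII string.
    print(encode_path_segment("南韓"))
    print(unquote("%e5%8d%97%e9%9f%93"))
```

Percent-encoding is case-insensitive in its hex digits, so the lowercase form found in some URLs decodes to the same string as the uppercase form that `quote` produces.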

Difficulty in implementing | Description | Approach | Comments
-------------------------- | ----------- | -------- | --------
Easy | Well-formatted HTML elements | The URL is the resource of the dataset. |
Advanced | Interactive websites | The URL is the resource of the dataset. Requires simulating a POST form submission with the form data or user agent. | Use an HTTP request and response data tool or PHP: cURL
More difficult | Interactive websites | Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file. | Use Selenium or Headless Chrome
Difficult | Interactive websites | Ajax |
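
The "Advanced" case, a simulated form POST that carries a Referer and a user agent and sets an explicit timeout, might look like the following Python sketch. It uses `urllib` from the standard library instead of the PHP cURL tool named in the table. The URL, form field, user agent string and function names are all assumptions made for illustration, not values taken from any real site.

```python
import urllib.parse
import urllib.request

# Hypothetical user agent string, for illustration only.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper/0.1)"


def build_form_request(url, form_data, referer, user_agent=DEFAULT_USER_AGENT):
    """Build a POST request that looks like a browser form submission."""
    # Form fields are sent URL-encoded in the request body.
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    headers = {"Referer": referer, "User-Agent": user_agent}
    # Supplying data makes urllib send the request as POST.
    return urllib.request.Request(url, data=body, headers=headers)


def fetch(request, timeout=30.0):
    """Send the request; raise an error if the server is too slow."""
    # An explicit timeout avoids waiting indefinitely on a stalled connection.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    req = build_form_request(
        "https://example.com/search",      # hypothetical form action URL
        {"q": "open data"},                # hypothetical form field
        referer="https://example.com/",
    )
    print(req.get_method(), req.full_url, req.data)
    # fetch(req) would perform the actual network call.
```

Separating request construction from sending keeps the headers and body easy to inspect before any traffic reaches the target server.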


Search keyword strategy

How can you find an unofficial (third-party) web crawler? The suggested search keywords are as follows:

  • target website + crawler site:github.com
  • target website + scraper site:github.com
  • target website + bot site:github.com
  • target website + download / downloader site:github.com
  • target website + browser client site:github.com
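
The keyword combinations above can be generated for any target site with a short Python sketch. The function name `build_search_queries` and the sample target name are hypothetical; the keyword list simply mirrors the bullets.

```python
# Keywords taken from the list above; "download / downloader" becomes two entries.
KEYWORDS = ["crawler", "scraper", "bot", "download", "downloader", "browser client"]


def build_search_queries(target, site="github.com"):
    """Return one search query per keyword, restricted to the given site."""
    return [f"{target} {keyword} site:{site}" for keyword in KEYWORDS]


if __name__ == "__main__":
    # "example news site" is a placeholder for the real target website name.
    for query in build_search_queries("example news site"):
        print(query)
```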

Further reading

References

