Web scrape troubleshooting

From LemonWiki共筆
== Before starting to write the web scraping script (crawler) ==
* Does the website offer datasets or files? e.g. open data
* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application Programming Interface)?
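As a sketch of the checks above: many sites advertise their official data exports in <code>robots.txt</code> via <code>Sitemap:</code> lines, which is worth inspecting before writing any crawler. This is a minimal standard-library sketch; <code>https://example.com</code> is only a placeholder host.

```python
import urllib.request

def sitemap_urls_from_robots(robots_txt: str) -> list:
    """Extract the URLs declared on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

def fetch_robots(base_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt for a site (network access; not called here)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/robots.txt",
                                timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")

# e.g. sitemap_urls_from_robots(fetch_robots("https://example.com"))
```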
== List of technical issues ==
# Content of the web page was changed (revision): the expected web content (of the specified DOM element) became empty.
#* Multiple sources of the same column, such as different HTML DOM elements that hold the same column value.
#* Back up the HTML text of the parent DOM element.
#* (optional) Complete HTML file backup
# The IP was banned by the server
#* Set a temporization (sleep time) between each request, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
#* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP
# [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]
# AJAX
#* [https://chrome.google.com/webstore/detail/autoscroll/kgkaecolmndcecnchojbndeanmiokofl/related Autoscroll] on {{Chrome}} or {{Edge}} written by [https://twitter.com/PeterLegierski Peter Legierski (@PeterLegierski) / Twitter]
# The web page requires signing in
# Blocking requests without a {{kbd | key= Referer}} or other headers.
# Connection timeout during an HTTP request, e.g. in PHP {{kbd | key=default_socket_timeout}} is 30 seconds<ref>[https://www.php.net/manual/en/filesystem.configuration.php PHP: Runtime Configuration - Manual]</ref><ref>[https://curl.haxx.se/libcurl/c/libcurl-errors.html libcurl - Error Codes]</ref>.
# Language and [http://php.net/manual/en/function.urlencode.php URL-encoded strings]
# [[How to extract content from websites]]
# [[Data cleaning#Data_handling | Data cleaning]] issues e.g. [https://en.wikipedia.org/wiki/Non-breaking_space Non-breaking space] or other [https://en.wikipedia.org/wiki/Whitespace_character Whitespace character]
# Is the link a permanent link?
# Enable/Disable the CSS or JavaScript
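Several of the issues above (temporization between requests, requests blocked without a Referer header, connection timeouts) can be sketched together. This is a minimal standard-library sketch, not a definitive implementation: <code>example.com</code> and the <code>ExampleCrawler/0.1</code> user agent are placeholders.

```python
import random
import urllib.request

def polite_delay(base: float = 2.0, jitter: float = 3.0) -> float:
    """Seconds to wait before the next request: a base delay plus random jitter."""
    return base + random.uniform(0.0, jitter)

def build_headers(referer: str, user_agent: str = "ExampleCrawler/0.1") -> dict:
    """Some servers block requests that lack a Referer or User-Agent header."""
    return {"Referer": referer, "User-Agent": user_agent}

def fetch(url: str, referer: str, timeout: float = 30.0) -> bytes:
    """Fetch one page with explicit headers and an explicit timeout.

    Network access; not called here. urlopen raises HTTPError on 403 etc.,
    and socket.timeout if the server stalls longer than `timeout` seconds.
    """
    req = urllib.request.Request(url, headers=build_headers(referer))
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

# Example crawl loop (sleep polite_delay() seconds between pages with time.sleep):
#   for page in ("https://example.com/a", "https://example.com/b"):
#       html = fetch(page, referer="https://example.com/")
```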


<div class="table-responsive" style="width:100%; min-height: .01%; overflow-x: auto;">
<table>
  <tr>
    <th>Difficulty in implementing</th>
    <th>Description</th>
    <th>Approach</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>Easy</td>
    <td>Well-formatted HTML elements</td>
    <td>URL is the resource of the dataset.</td>
    <td></td>
  </tr>
  <tr>
    <td>Advanced</td>
    <td>Interactive websites</td>
    <td>URL is the resource of the dataset. Requires simulating a POST form submission with the form data or a [[User agent|user agent]].</td>
    <td>Using [[HTTP request and response data tool]] or [http://php.net/curl PHP: cURL]</td>
  </tr>
  <tr>
    <td>More difficult</td>
    <td>Interactive websites</td>
    <td>Requires simulating user behavior in the browser, such as clicking a button, submitting the form, and finally obtaining the file.</td>
    <td>Using [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome]</td>
  </tr>
  <tr>
    <td>Difficult</td>
    <td>Interactive websites</td>
    <td>AJAX</td>
    <td></td>
  </tr>
</table>
</div>
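The "simulate a POST form submit" approach in the table above can be sketched with the Python standard library alone. This is a hedged sketch, not a recipe for any particular site: the <code>https://example.com/search</code> endpoint and form fields are hypothetical.

```python
import urllib.parse
import urllib.request

def build_form_request(url: str, fields: dict,
                       user_agent: str) -> urllib.request.Request:
    """Build a POST request carrying URL-encoded form data."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    req = urllib.request.Request(url, data=data)  # supplying data= makes it POST
    req.add_header("User-Agent", user_agent)
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req

# e.g. (network access, not run here):
#   req = build_form_request("https://example.com/search",
#                            {"q": "open data"}, "Mozilla/5.0")
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       body = resp.read()
```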


== Skill tree of web scraping ==
Data extraction
* How they build the website & [[Information Architecture | information architecture]]
** Understanding the navigation system ★★☆☆☆
*** Understanding the classification ★★☆☆☆
*** Parsing the sitemap XML file ★★☆☆☆
* Understanding the web technology
** HTTP GET/POST ★★☆☆☆
** HTML/CSS/JavaScript ★★☆☆☆
** CSS selectors and DOM (Document Object Model) elements ★★☆☆☆ Related page: [[Xpath tools]]
** AJAX (Asynchronous JavaScript and XML) ★★★★☆
* Using an API to retrieve data ★★☆☆☆
* Parsing remote files to retrieve data ★★☆☆☆
* Using unofficial methods to retrieve data
** [https://curl.haxx.se/ curl] command or [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆ Related page: [[Troubleshooting of curl errors]]
** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆
* Form submit
** Submitting the form without logging in ★★☆☆☆
** Submitting the form after logging in to an account ★★★☆☆
* Detection of abnormal data
** [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes HTTP status codes] ★★☆☆☆
** Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
* Etiquette of web scraping
** Limiting the rate of web requests ★★☆☆☆
* Tom and Jerry
** VPN and proxy ★★☆☆☆
** Decoding the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆
** Decentralized web scraping ★★★★☆

Data transforming
* Character encoding ★☆☆☆☆
* Data cleaning e.g. unprintable characters ★★☆☆☆
* [https://en.wikipedia.org/wiki/Regular_expression Regular expression] ★★★☆☆
* Selection of database engine ★★★★☆

=== Search keyword strategy ===
'''How to find an unofficial (3rd-party) web crawler?''' Suggested search keyword strategies:
* ''target website'' + crawler site:github.com
* ''target website'' + scraper site:github.com
* ''target website'' + bot site:github.com
* ''target website'' + download / downloader site:github.com
* ''target website'' + browser client site:github.com
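The data-cleaning and regular-expression steps listed under ''Data transforming'' can be sketched as follows: a small normalizer for scraped cell text that handles the non-breaking space and other whitespace characters mentioned earlier.

```python
import re

def clean_cell(text: str) -> str:
    """Collapse NBSP and runs of whitespace in scraped text into single spaces."""
    text = text.replace("\u00a0", " ")        # U+00A0 non-breaking space
    return re.sub(r"\s+", " ", text).strip()  # tabs, newlines, repeated spaces
```

For example, <code>clean_cell("price:\xa0 100\n")</code> returns <code>"price: 100"</code>.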


== Further reading ==
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]
* [https://blog.gslin.org/archives/2022/08/22/10850/%e5%8d%97%e9%9f%93%e6%9c%80%e9%ab%98%e6%b3%95%e9%99%a2%e4%b9%9f%e5%b0%8d-web-scraping-%e7%b5%a6%e5%87%ba%e4%ba%86%e9%a1%9e%e4%bc%bc%e7%be%8e%e5%9c%8b%e7%9a%84%e5%88%a4%e4%be%8b/ South Korea's Supreme Court also handed down a web-scraping ruling similar to the US precedent – Gea-Suan Lin's BLOG]


== References ==
<references />


[[Category:Programming]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:web scraping]]

Latest revision as of 18:03, 16 January 2024
