Web scrape troubleshooting: Difference between revisions

From LemonWiki共筆
(38 intermediate revisions by the same user not shown)
== List of technical issues ==

# The content of the web page changed (site revision): the expected content of the specified DOM element comes back empty.
#* Define multiple sources for the same column, e.g. different HTML DOM elements that carry the same column value.
#* Back up the HTML text of the parent DOM element.
#* (Optional) Back up the complete HTML file.
# The IP address was banned by the server.
#* Set a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension - Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
#* The server responded with status 403 ('[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]'): change the network IP address.
# CAPTCHA
# AJAX
# The web page requires signing in.
# The server blocks requests that lack the {{kbd | key= Referer}} header or other headers.
# Connection timeout during an HTTP request, e.g. in PHP {{kbd | key=default_socket_timeout}} defaults to 60 seconds<ref>[https://www.php.net/manual/en/filesystem.configuration.php PHP: Runtime Configuration - Manual]</ref><ref>[https://curl.haxx.se/libcurl/c/libcurl-errors.html libcurl - Error Codes]</ref>.
# Language and URL encoding of strings: [http://php.net/manual/en/function.urlencode.php PHP: urlencode]
# [[Data cleaning#Data_handling | Data cleaning]] issues, e.g. [https://en.wikipedia.org/wiki/Non-breaking_space non-breaking spaces] or other [https://en.wikipedia.org/wiki/Whitespace_character whitespace characters]
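A few of the issues in the list above (delay between requests, URL encoding, whitespace cleaning) can be sketched with the Python standard library alone. This is a minimal sketch and not a complete scraper: the function names <code>polite_sleep</code>, <code>build_query_url</code> and <code>normalize_whitespace</code>, the delay range, and the <code>example.com</code> URL are all assumptions made for illustration.

```python
import random
import re
import time
from urllib.parse import urlencode


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random interval between two requests and return its length.

    The 1-3 second range is an arbitrary assumption; tune it per site.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def build_query_url(base_url, params):
    """Append query parameters to base_url, percent-encoded as UTF-8."""
    return base_url + "?" + urlencode(params)


def normalize_whitespace(text):
    """Collapse runs of Unicode whitespace (including U+00A0) into single spaces."""
    # For str patterns, \s matches Unicode whitespace, so the non-breaking
    # space is handled together with tabs and newlines.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Hypothetical URL and values, for illustration only.
    print(build_query_url("https://example.com/search", {"q": "檸檬 茶"}))
    print(normalize_whitespace("price:\u00a0100\n\tTWD "))
```

Passing the <code>sleep</code> function as a parameter is a design choice that lets the delay be tested without actually waiting.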
<div class="table-responsive" style="width:100%;    min-height: .01%;    overflow-x: auto;">
<table class="wikitable sortable" style="width:100%">
  <tr>
    <th>Implementation difficulty</th>
    <th>Approach</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>easy</td>
    <td>The URL points directly to the dataset.</td>
    <td></td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>The URL points to the dataset, but the request must simulate a POST form submission with the form data or send a [[User agent|user agent]] header.</td>
    <td>Use an [[HTTP request and response data tool]] or [http://php.net/curl PHP: cURL]</td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>The scraper must simulate user behavior in a browser, such as clicking a button and submitting a form, before the file can be obtained.</td>
    <td>Use [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome]</td>
  </tr>
  <tr>
    <td>difficult</td>
    <td>AJAX</td>
    <td></td>
  </tr>
</table>
</div>
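The second row of the table (simulating a POST form submission and sending headers such as User-Agent or Referer) might look like the following with Python's standard <code>urllib</code>. This is only a sketch under stated assumptions: <code>build_form_request</code> and <code>fetch</code> are hypothetical helper names, and the URL, form fields and header values are placeholders, not taken from any real site.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(url, form_data, user_agent, referer):
    """Build a POST request carrying form data plus User-Agent and Referer headers."""
    # Supplying a body makes urllib send the request as POST.
    body = urlencode(form_data).encode("ascii")
    headers = {"User-Agent": user_agent, "Referer": referer}
    return Request(url, data=body, headers=headers)


def fetch(request, timeout_seconds=30):
    """Send the request and return the raw response body.

    The timeout applies to blocking socket operations such as connecting,
    so a stalled server raises an exception instead of hanging forever.
    """
    with urlopen(request, timeout=timeout_seconds) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder values for illustration; nothing is sent over the network here.
    req = build_form_request(
        "https://example.com/query",
        {"year": "2020", "page": "1"},
        user_agent="my-scraper/0.1",
        referer="https://example.com/",
    )
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it possible to inspect the method, body and headers before any traffic reaches the server.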
== Before starting to web scrape ==
* Do they offer datasets?
* Do they offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (application programming interface)?
== Further reading ==
* Stateless: [https://stackoverflow.com/questions/13200152/why-say-that-http-is-a-stateless-protocol Why say that HTTP is a stateless protocol? - Stack Overflow]
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]
== References ==
<references />


{{Template:Troubleshooting}}


[[Category:Programming]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:web scraping]]

Revision as of 11:16, 25 August 2020
