Web scrape troubleshooting

== List of technical issues ==
# The content of the web page was changed (revised): the expected content of the specified DOM element became empty.
#* Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
#* Back up the HTML text of the parent DOM element.
#* (optional) Back up the complete HTML file.
# The IP address was banned by the server
#* Set a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
#* If the server responds with status 403 ('[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]'), change the network IP address.
# CAPTCHA
# AJAX
# The web page requires signing in
# The server blocks requests that lack the {{kbd | key= Referer}} header or other headers.
# Connection timeout during an HTTP request, e.g. in PHP {{kbd | key=default_socket_timeout}} defaults to 60 seconds<ref>[https://www.php.net/manual/en/filesystem.configuration.php PHP: Runtime Configuration - Manual]</ref><ref>[https://curl.haxx.se/libcurl/c/libcurl-errors.html libcurl - Error Codes]</ref>.
# Language and [http://php.net/manual/en/function.urlencode.php URL-encoded strings]
# [[Data cleaning#Data_handling | Data cleaning]] issues, e.g. [https://en.wikipedia.org/wiki/Non-breaking_space non-breaking spaces] or other [https://en.wikipedia.org/wiki/Whitespace_character whitespace characters]
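
Three items from the list above (a pause between requests, URL-encoded strings and non-breaking spaces) can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names <code>polite_sleep</code> and <code>clean_whitespace</code>, the delay bounds and the injectable <code>sleep</code> parameter are hypothetical and chosen only for illustration. The calls <code>random.uniform</code>, <code>time.sleep</code>, <code>re.sub</code> and <code>urllib.parse.quote_plus</code> are from the Python standard library.

```python
import random
import re
import time
from urllib.parse import quote_plus


def polite_sleep(min_seconds=1.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random interval between two requests.

    The bounds are arbitrary illustration values. `sleep` is a hypothetical
    hook so that the pause can be replaced (for example in tests).
    Returns the chosen delay in seconds.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def clean_whitespace(text):
    """Replace non-breaking spaces, collapse whitespace runs and trim the ends."""
    text = text.replace("\u00a0", " ")  # U+00A0 is the non-breaking space
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Space becomes "+" and "&" becomes "%26", similar to PHP's urlencode().
    print(quote_plus("a b&c"))
    print(clean_whitespace("  price:\u00a0\u00a042 \n"))
    print(polite_sleep(0.1, 0.2))
```

A random rather than fixed delay is used here because it matches the "sleep random seconds" approach linked in the list; whether it is enough to avoid a ban depends on the server.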
<div class="table-responsive" style="width:100%;    min-height: .01%;    overflow-x: auto;">
<table class="wikitable sortable" style="width:100%">
  <tr>
    <th>Difficulty in implementing</th>
    <th>Approach</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>easy</td>
    <td>The URL is the dataset resource</td>
    <td></td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>The URL is the dataset resource, but the request must simulate a POST form submission with the form data or a [[User agent|user agent]]</td>
    <td>Use the [[HTTP request and response data tool]] or [http://php.net/curl PHP: cURL]</td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>Requires simulating user behavior in a browser, such as clicking a button, submitting a form and finally obtaining the file.</td>
    <td>Use [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome]</td>
  </tr>
  <tr>
    <td>difficult</td>
    <td>AJAX</td>
    <td></td>
  </tr>
</table>
</div>
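
The table row about simulating a POST form submission with form data and a user agent can be sketched with the Python standard library alone. This is a sketch under stated assumptions: the function names <code>build_form_post</code> and <code>fetch</code>, the <code>example.com</code> URLs, the form field names, the user agent string and the 10-second timeout are all hypothetical placeholders, not values taken from any particular site.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_post(url, form_data, user_agent, referer=None):
    """Build (but do not send) a POST request that mimics a form submission."""
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    if referer:
        # Some servers block requests that arrive without a Referer header.
        headers["Referer"] = referer
    return Request(url, data=body, headers=headers, method="POST")


def fetch(request, timeout=10.0):
    """Send the request; the timeout (seconds) is an arbitrary illustration value."""
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


if __name__ == "__main__":
    request = build_form_post(
        "https://example.com/search",            # hypothetical form target
        {"q": "web scraping", "page": 1},        # hypothetical form fields
        user_agent="example-scraper/0.1",        # hypothetical user agent
        referer="https://example.com/",
    )
    print(request.get_method(), request.full_url)
    print(request.data)
    # fetch(request) would perform the actual network call.
```

Building the request separately from sending it makes it possible to inspect the headers and body before any traffic reaches the server.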
== Before starting to web scrape ==


* Does the site offer datasets?
* Does the site offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (application programming interface)?
 
 
== Further reading ==
* Stateless: [https://stackoverflow.com/questions/13200152/why-say-that-http-is-a-stateless-protocol Why say that HTTP is a stateless protocol? - Stack Overflow]
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]
== References ==
<references />


{{Template:Troubleshooting}}


[[Category:Programming]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:web scraping]]