Web scrape troubleshooting
Revision as of 11:16, 25 August 2020
List of technical issues
- The content of the web page changed (revision): the expected content of the specified DOM element became empty.
  - Read the same column from multiple sources, such as different HTML DOM elements that hold the same column value.
  - Back up the HTML text of the parent DOM element.
  - (Optional) Back up the complete HTML file.
- The IP address was banned by the server.
  - Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension — Scrapy 1.0.3 documentation, or Sleep random seconds in programming.
  - The server responded with status 403 ('403 Forbidden') --> change the network IP.
- CAPTCHA
- AJAX
- The web page requires signing in.
- The server blocks requests that lack a Referer header or other headers.
- Connection timeout during an HTTP request, e.g. in PHP default_socket_timeout defaults to 60 seconds[1][2].
- Language and character-encoding issues when URL-encoding strings
- Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
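
Several items in the list above can be sketched with the Python standard library. The block below is a minimal illustration, not a definitive implementation: the helper names (polite_sleep, clean_text, first_non_empty, build_search_url), the "q" query parameter and the example URL are all assumptions made for this example.

```python
import random
import time
from urllib.parse import urlencode


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval between two requests and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def clean_text(raw):
    """Collapse every whitespace run, including non-breaking spaces, to one space."""
    # str.split() without arguments splits on any Unicode whitespace,
    # which includes U+00A0 (non-breaking space), tabs and newlines.
    return " ".join(raw.split())


def first_non_empty(candidates):
    """Return the first candidate that still has text after cleaning, else None.

    The candidates stand for the same column read from different DOM elements.
    """
    for candidate in candidates:
        if candidate is None:
            continue
        cleaned = clean_text(candidate)
        if cleaned:
            return cleaned
    return None


def build_search_url(base_url, keyword):
    """Percent-encode a (possibly non-ASCII) keyword as UTF-8 in the query string."""
    # The parameter name "q" is hypothetical; use whatever the target site expects.
    return base_url + "?" + urlencode({"q": keyword})


if __name__ == "__main__":
    print(clean_text("  price:\u00a0100\n\t USD "))       # price: 100 USD
    print(first_non_empty(["\u00a0", None, " 42 "]))      # 42
    print(build_search_url("https://example.com/search", "café au lait"))
    polite_sleep(0.1, 0.3)  # wait a little before the next request
```

If the first DOM element comes back empty after a page revision, first_non_empty falls through to the next source, which is the point of keeping more than one selector per column.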
Difficulty of implementation | Approach | Comments |
---|---|---|
Easy | The URL is the dataset resource | |
More difficult | The URL is the dataset resource, but the request must simulate a POST form submission with the form data or a user agent | Use HTTP request and response data tool or PHP: cURL |
More difficult | The scraper must simulate user behavior in the browser, such as clicking a button or submitting a form, to finally obtain the file | Use Selenium or Headless Chrome |
Difficult | AJAX | |
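
The second row of the table (simulating a POST form submission with a user agent) could look roughly like the following sketch, which uses Python's urllib from the standard library. The function names, the form field, the URLs and the user agent string are hypothetical placeholders, and the 30-second timeout is an arbitrary choice for the example.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(url, form_data, referer, user_agent):
    """Build a POST request that carries form data plus Referer and User-Agent headers."""
    body = urlencode(form_data).encode("utf-8")
    # Passing data makes urllib send the request as POST instead of GET.
    request = Request(url, data=body)
    request.add_header("User-Agent", user_agent)
    request.add_header("Referer", referer)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


def fetch(request, timeout_seconds=30):
    """Send the request with an explicit timeout; return (status code, body bytes)."""
    try:
        with urlopen(request, timeout=timeout_seconds) as response:
            return response.status, response.read()
    except HTTPError as error:
        # For example 403 when the server refuses the request.
        return error.code, error.read()


if __name__ == "__main__":
    req = build_form_request(
        "https://example.com/form",       # hypothetical form endpoint
        {"year": "2020"},                 # hypothetical form field
        referer="https://example.com/",
        user_agent="my-scraper/0.1",
    )
    print(req.get_method())  # POST
    # fetch(req) would perform the actual network call.
```

Setting the timeout explicitly avoids relying on a library default, and returning the status code lets the caller tell a 403 apart from a successful response.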
Before starting to web scrape
- Does the site offer datasets?
- Does the site offer an API (application programming interface)?
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
- Skill tree of web scraping
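References
1. PHP: Runtime Configuration - Manual, https://www.php.net/manual/en/filesystem.configuration.php
2. libcurl - Error Codes, https://curl.haxx.se/libcurl/c/libcurl-errors.html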