Web scrape troubleshooting
Revision as of 13:54, 19 September 2018
List of technical issues
- The content of the web page changed (revision): the expected content of the specified DOM element became empty.
  - Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
  - Back up the HTML text of the parent DOM element.
  - (optional) Back up the complete HTML file.
- The IP was banned by the server.
  - Set a delay (sleep time) between requests, e.g. PHP sleep (PHP Manual) or the AutoThrottle extension (Scrapy 1.0.3 documentation).
  - The server responded with status 403 ('403 Forbidden'): change the network IP.
- CAPTCHA
- AJAX
- The web page requires signing in.
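The first two issues in the list above (an element that became empty, and a banned IP) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation: the names `TextById`, `extract_with_fallbacks` and `fetch_all`, the element ids, and the 2-second default delay are all assumptions made for the example. The caller supplies the actual `fetch` function.

```python
import time
from html.parser import HTMLParser
from pathlib import Path

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # > 0 while the parser is inside the target element
        self.found = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_id(html, element_id):
    """Return the stripped text of the element with the given id, or ''."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def extract_with_fallbacks(html, element_ids, backup_path=None):
    """Return (element_id, text) for the first id that yields non-empty text.

    If every candidate is empty, save the HTML to backup_path (when given)
    so the changed page can be inspected later, and return (None, "").
    """
    for element_id in element_ids:
        text = text_by_id(html, element_id)
        if text:
            return element_id, text
    if backup_path is not None:
        Path(backup_path).write_text(html, encoding="utf-8")
    return None, ""


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)   # no pause before the very first request
        results[url] = fetch(url)
    return results
```

Listing several element ids gives the scraper a second source for the same column, and the saved HTML makes it possible to see what changed once every source comes back empty. A fixed delay is the simplest form of throttling; an adaptive scheme such as Scrapy's AutoThrottle adjusts the delay to the server's response times instead.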
| Difficulty of implementation | Approach | Comments |
|---|---|---|
| easy | The URL points directly to the dataset. | |
| more difficult | The URL points directly to the dataset, but the request must simulate a POST form submission with the form data, or send a user agent. | |
| more difficult | The scraper must simulate user behavior in the browser, such as clicking a button or submitting a form, before it finally obtains the file. | Use Selenium or Headless Chrome |
| difficult | The content is loaded via AJAX. | |
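The second row of the table (simulating a POST form submission and sending a user agent) might look like the sketch below, written with Python's standard `urllib`. The function names, the example URL, the form fields and the user agent string are hypothetical placeholders; the real values depend on the form of the target site.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(url, form_data, user_agent):
    """Build a POST request that mimics a browser submitting an HTML form."""
    body = urlencode(form_data).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def submit_form(url, form_data, user_agent, timeout=30):
    """Send the form and return the raw response body as bytes."""
    request = build_form_request(url, form_data, user_agent)
    with urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical endpoint and fields, shown only to illustrate the call.
    request = build_form_request(
        "https://example.com/export",
        {"year": "2018", "format": "csv"},
        "Mozilla/5.0 (compatible; ExampleScraper/0.1)",
    )
    print(request.get_method(), request.full_url, request.data)
```

Building the request separately from sending it makes it easy to inspect the encoded form body and headers before any traffic reaches the server.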
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia