Web scrape troubleshooting
Revision as of 11:18, 25 May 2021
List of technical issues
- The content of the web page changed (revision): the expected content of the specified DOM element became empty.
  - Multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
  - Back up the HTML text of the parent DOM element.
  - (optional) Back up the complete HTML file.
- The IP was banned by the server.
  - Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension - Scrapy 1.0.3 documentation, or sleep for a random number of seconds in code.
  - The server responded with status 403 ('403 Forbidden') --> change the network IP.
- CAPTCHA
- AJAX
- The web page requires signing in.
- The server blocks requests that lack a Referer or other headers.
- Connection timeout during an HTTP request, e.g. in PHP, default_socket_timeout defaults to 60 seconds[1][2].
- Language and URL-encoded strings
- Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
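
Several of the mitigations listed above can be sketched in Python using only the standard library: a random delay between requests, a Referer and User-Agent header, an explicit timeout, URL-encoding of query strings, and whitespace cleanup. This is a minimal sketch, not a definitive implementation. The function names, header values, and example.com URLs are hypothetical and do not come from the original page.

```python
import random
import time
import urllib.parse
import urllib.request

# Hypothetical header values for illustration; adjust for the target site.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)",
    "Referer": "https://example.com/",
}


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Return a random sleep time so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def build_request(url, headers=None):
    """Build a GET request that carries Referer and User-Agent headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def fetch(url, timeout=30.0):
    """Fetch one page with an explicit timeout, then pause before returning."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        body = resp.read()
    time.sleep(random_delay())
    return body


def encode_query(base_url, params):
    """Append URL-encoded query parameters (non-ASCII text becomes %XX)."""
    return base_url + "?" + urllib.parse.urlencode(params)


def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace("\u00a0", " ").split())
```

Only `fetch` touches the network. The other helpers can be checked offline, for example `clean_text("a\u00a0 b")` returns `"a b"`.
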
| Implementation difficulty | Description | Approach | Comments |
|---|---|---|---|
| Easy | Well-formatted HTML elements | The URL is the resource of the dataset. | |
| Advanced | | The URL is the resource of the dataset, but the request must simulate a POST form submission with the form data or a user agent. | Use the HTTP request and response data tool or PHP: cURL. |
| More difficult | | Requires simulating user behavior in the browser, such as clicking a button and submitting a form, to finally obtain the file. | Use Selenium or Headless Chrome. |
| Difficult | | Ajax | |
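
The "Advanced" row in the table above (simulating a POST form submission with form data and a user agent) can be sketched in Python with the standard library. This is only an illustration under stated assumptions. The endpoint, the form field names, and the user agent string are hypothetical, and the table itself suggests PHP: cURL rather than this approach.

```python
import urllib.parse
import urllib.request


def build_form_post(url, form_data, user_agent):
    """Build a POST request whose body is URL-encoded form data."""
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Hypothetical endpoint and field names, for illustration only.
request = build_form_post(
    "https://example.com/export",
    {"year": "2021", "format": "csv"},
    "Mozilla/5.0 (compatible; example-scraper/0.1)",
)
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

Building the request separately from sending it makes it possible to inspect the method, headers, and body before anything reaches the server.
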
Before starting to write the web scraping (crawler) script
- Does the website offer datasets or files? e.g. open data
- Does the website offer an API (application programming interface)?
How to find an unofficial (3rd-party) web crawler? The suggested search keyword strategy is as follows:
- target website + crawler
- target website + bot
- target website + download / downloader
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
- Skill tree of web scraping
References