Web scrape troubleshooting: Difference between revisions

From LemonWiki共筆

Revision as of 11:18, 25 May 2021

List of technical issues

  1. The content of the web page changed (revision): the expected content of the specified DOM element became empty.
    • Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
    • Back up the HTML text of the parent DOM element.
    • (optional) Back up the complete HTML file.
  2. The IP address was banned by the server.
  3. CAPTCHA
  4. AJAX
  5. The web page requires signing in.
  6. The server blocks requests that lack a Referer or other headers.
  7. Connection timeout during an HTTP request, e.g. in PHP the default value of default_socket_timeout is 60 seconds[1][2].
  8. Language and URL-encoded strings
  9. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
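
Items 8 and 9 above can be illustrated with a minimal Python sketch that uses only the standard library. The function names clean_text and encode_url_segment and the sample strings are hypothetical, chosen for illustration; they do not come from the original page.

```python
from urllib.parse import quote, unquote


def clean_text(raw: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends.

    str.split() with no argument splits on any Unicode whitespace, which
    includes the non-breaking space (U+00A0) often found in scraped HTML.
    """
    return " ".join(raw.split())


def encode_url_segment(text: str) -> str:
    """Percent-encode a string (as UTF-8) so it can be placed in a URL."""
    # safe="" also encodes "/" so the result is a single path segment.
    return quote(text, safe="")


if __name__ == "__main__":
    # Hypothetical scraped value with non-breaking spaces, a tab and a newline.
    print(clean_text("\u00a0Taipei\u00a0 City\t\n"))  # Taipei City

    # Hypothetical non-ASCII keyword; unquote() reverses the encoding.
    encoded = encode_url_segment("資料 集")
    print(encoded)
    print(unquote(encoded))
```

Note that zero-width characters such as U+200B are not classified as whitespace, so a scraper that meets them would need an extra replacement step.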
Difficulty in implementing | Description | Approach | Comments
Easy | Well-formatted HTML elements | The URL is the dataset resource itself. | -
Advanced | - | The URL is the dataset resource itself, but the request must simulate a POST form submission with the form data, or send a user agent. | Use an HTTP request and response data tool or PHP: cURL (http://php.net/curl)
More difficult | - | The scraper must simulate user behavior in the browser, such as clicking a button and submitting a form, to finally obtain the file. | Use Selenium (https://www.seleniumhq.org/) or Headless Chrome (https://developers.google.com/web/updates/2017/04/headless-chrome)
Difficult | - | Ajax | -
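
The "Advanced" case (a POST form submission with a user agent) and the Referer and timeout issues listed earlier might look like the following Python sketch. The page itself points to PHP cURL; Python's urllib is swapped in here only for illustration. The URL, form fields, user agent string and the helper names build_form_request and fetch are assumptions, not part of the original page.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(url, form_data, user_agent, referer=None):
    """Build (but do not send) a POST request that mimics a form submission."""
    # Form fields are sent URL-encoded in the request body.
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    if referer:
        # Some servers reject requests that arrive without a Referer header.
        headers["Referer"] = referer
    return Request(url, data=body, headers=headers, method="POST")


def fetch(request, timeout=10.0):
    """Send the request with an explicit timeout (in seconds)."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Hypothetical target and form fields; nothing is sent until fetch() is called.
req = build_form_request(
    "https://example.com/search",
    {"keyword": "open data", "page": 1},
    user_agent="Mozilla/5.0 (compatible; example-scraper)",
    referer="https://example.com/",
)
print(req.get_method(), req.data)
```

Keeping request construction separate from sending makes it possible to inspect the headers and body before any traffic reaches the server.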

Before starting to write the web scraping (crawler) script

  • Does the website offer datasets or files, e.g. open data?
  • Does the website offer an API (application programming interface)?

How to find an unofficial (third-party) web crawler? Suggested search keyword strategies:

  • target website + crawler
  • target website + bot
  • target website + download / downloader

Further reading

References

