Web scrape troubleshooting

From LemonWiki共筆
Revision as of 13:54, 19 September 2018

List of technical issues

  1. The content of the web page changed (revision): the expected content of the specified DOM element became empty.
    • Use multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
    • Back up the HTML text of the parent DOM element.
    • (Optional) Back up the complete HTML file.
  2. The IP address was banned by the server.
  3. CAPTCHA
  4. AJAX
  5. The web page requires signing in.
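The fallback idea in item 1 can be sketched as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation: the element ids (`price`, `price-alt`), the helper names and the backup file name `failed_page.html` are all hypothetical, and the sketch backs up the complete HTML (the optional step) rather than only the parent element.

```python
from html.parser import HTMLParser
from pathlib import Path

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while we are inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, element_id):
    """Return the stripped text of the element with this id, or ''."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def extract_with_fallback(html, candidate_ids, backup_dir):
    """Try several DOM sources for the same column; back up the page on failure."""
    for element_id in candidate_ids:
        value = text_by_id(html, element_id)
        if value:
            return value
    # Every source was empty: keep the raw HTML so the change can be inspected.
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    (backup_dir / "failed_page.html").write_text(html, encoding="utf-8")
    return None
```

For example, `extract_with_fallback(page_html, ["price", "price-alt"], "backups")` returns the first non-empty value, and writes the page to `backups/failed_page.html` only when both sources are empty.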
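<div class="table-responsive" style="width:100%;    min-height: .01%;    overflow-x: auto;">
<table class="wikitable" style="width:100%">
  <tr>
    <th>Difficulty of implementation</th>
    <th>Approach</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>easy</td>
    <td>The URL points directly to the dataset resource.</td>
    <td></td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>The URL points to the dataset resource, but the request must simulate a POST form submission with the form data, or set the [[User agent|user agent]].</td>
    <td></td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>Requires simulating user behavior in the browser, such as clicking a button or submitting a form, to finally obtain the file.</td>
    <td>Use [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome].</td>
  </tr>
  <tr>
    <td>difficult</td>
    <td>AJAX</td>
    <td></td>
  </tr>
</table>
</div>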
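The second approach in the table (simulating a POST form submission and setting the user agent) might look like the sketch below. It uses only Python's standard `urllib`; the URL, form field names and user agent string in the usage note are placeholders, and `build_form_request` and `fetch` are hypothetical helper names. A real site may also require cookies or hidden form fields.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(url, form_data, user_agent):
    """Build a POST request that mimics a browser submitting an HTML form."""
    # Encode the fields the way a browser encodes a standard form submission.
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


def fetch(request, timeout=30):
    """Send the request and return the raw response body as bytes."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

A call such as `fetch(build_form_request("https://example.com/export", {"year": "2018", "format": "csv"}, "Mozilla/5.0 (compatible; example-scraper)"))` would then return the downloaded file content. Building the request separately from sending it makes it possible to inspect the method, body and headers without touching the network.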

Further reading

