Web scrape troubleshooting

list of technical issues
# The content of the web page changed (revision): the expected content of the specified DOM element became empty.
#* Use multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
#* Back up the HTML text of the parent DOM element.
# AJAX
# The web page requires signing in
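The two mitigations listed for changed page content (several sources for the same column, and a backup of the HTML) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a reference implementation: the element ids, the helper names <code>extract_first_non_empty</code> and <code>backup_html</code>, and the snapshot file naming are all hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextExtractor(HTMLParser):
    """Collect the text inside the first element that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.done = False   # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_first_non_empty(html, candidate_ids):
    """Try each candidate element id in order; return (text, id) of the first non-empty one."""
    for element_id in candidate_ids:
        parser = IdTextExtractor(element_id)
        parser.feed(html)
        text = "".join(parser.chunks).strip()
        if text:
            return text, element_id
    return None, None


def backup_html(html, directory):
    """Save the raw HTML so a failed extraction can be inspected later."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"snapshot-{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path


if __name__ == "__main__":
    page = '<div id="price-new"></div><span id="price-old"> 42 USD </span>'
    value, source = extract_first_non_empty(page, ["price-new", "price-old"])
    if value is None:
        print("all sources empty, snapshot saved to", backup_html(page, "snapshots"))
    else:
        print(value, "from", source)
```

Trying the candidates in a fixed order keeps the preferred source first while still recording which element actually supplied the value, which helps when the page layout changes again.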
<div class="table-responsive" style="width:100%;    min-height: .01%;    overflow-x: auto;">
<table class="wikitable" style="width:100%">
  <tr>
    <th>Implementation difficulty</th>
    <th>Approach</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>easy</td>
    <td>The URL is the dataset resource itself.</td>
    <td></td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>The URL is the dataset resource, but the request must simulate a POST form submission with the form data, or set the [[User agent|user agent]].</td>
    <td></td>
  </tr>
  <tr>
    <td>more difficult</td>
    <td>Requires simulating user behavior in the browser, such as clicking a button or submitting a form, in order to finally obtain the file.</td>
    <td>Use [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome].</td>
  </tr>
  <tr>
    <td>difficult</td>
    <td>AJAX</td>
    <td></td>
  </tr>
</table>
</div>
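For the second row of the table (a POST form submission with form data and a chosen user agent), a minimal sketch with the Python standard library might look like the following. The URL, the form field names, and the user agent string are placeholders invented for illustration, and <code>build_form_post</code> and <code>fetch</code> are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_post(url, form_data, user_agent):
    """Build a POST request that mimics an HTML form submission."""
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


def fetch(request, timeout=30):
    """Send the request and return the raw response body as bytes."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder values; replace them with the real form action and fields.
    request = build_form_post(
        "https://example.com/export",
        {"year": "2020", "format": "csv"},
        "Mozilla/5.0 (compatible; example-scraper)",
    )
    print(request.get_method(), request.full_url, request.data)
```

Building the request separately from sending it makes it possible to inspect the method, headers, and encoded body before any network traffic happens.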


Further reading