Web scraping troubleshooting
List of technical issues
- The content of the web page changed (revision): the expected content of the specified DOM element became empty.
- Multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
- Back up the HTML text of the parent DOM element
- (optional) Back up the complete HTML file
- The IP was banned by the server
- Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, the AutoThrottle extension (Scrapy 1.0.3 documentation), or sleep a random number of seconds in code.
- The server responded with status 403 ('403 Forbidden'): change the network IP
- The web page requires signing in
- The server blocks requests sent without the expected request headers.
- Connection timeout during an HTTP request, e.g. in PHP . is 30 seconds
- Language and URL-encoded strings
- Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
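Several of the issues above (request throttling, request headers, timeouts, URL encoding, and whitespace cleaning) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names (`random_delay`, `build_request`, `polite_get`, `clean_text`), the User-Agent string, the delay bounds, and the 30-second timeout are all hypothetical values chosen for illustration.

```python
import random
import time
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical defaults for illustration; tune them for the target site.
DEFAULT_HEADERS = {"User-Agent": "example-scraper/0.1"}
TIMEOUT_SECONDS = 30


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Return a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def build_request(base_url, params=None, headers=None):
    """Build a GET request with URL-encoded parameters and explicit headers."""
    url = base_url
    if params:
        # urlencode percent-encodes non-ASCII text as UTF-8 by default.
        url = base_url + "?" + urllib.parse.urlencode(params)
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def polite_get(base_url, params=None, headers=None):
    """Sleep a random interval, then fetch the page; return (status, html)."""
    time.sleep(random_delay())
    request = build_request(base_url, params, headers)
    try:
        # A timeout or connection failure raises an exception to the caller.
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            html = response.read().decode("utf-8", errors="replace")
            return response.status, html
    except urllib.error.HTTPError as error:
        # A 403 here often means the client or its IP has been blocked.
        return error.code, ""


def clean_text(raw):
    """Collapse non-breaking spaces and other whitespace into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space (U+00A0).
    return " ".join(raw.split())
```

A caller would typically loop over URLs with `polite_get`, check the returned status before parsing, and pass each extracted string through `clean_text` before storing it.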
|Difficulty of implementation||Approach||Comments|
|easy||The URL itself is the dataset resource|| |
|more difficult||The URL is the dataset resource, but the request must simulate a form POST submission with form data or a user agent||Use an HTTP request and response inspection tool or PHP: cURL|
|more difficult||Requires simulating user behavior in a browser, such as clicking a button, submitting a form, and finally obtaining the file||Use Selenium or Headless Chrome|
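The second row of the table (simulating a form POST with form data and a user agent) can be sketched in Python with the standard library. This is only an illustrative sketch: the function names `build_form_post` and `submit_form`, the User-Agent string, and the form fields in the usage note are hypothetical, and a real site may also require cookies or hidden form tokens. The browser-automation row is not covered here because it depends on an external browser driver.

```python
import urllib.parse
import urllib.request


def build_form_post(url, form_data, user_agent="example-scraper/0.1"):
    """Build a POST request that mimics submitting an HTML form."""
    # Browsers send form fields URL-encoded in the request body.
    body = urllib.parse.urlencode(form_data).encode("ascii")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def submit_form(url, form_data, timeout=30):
    """Send the form and return the response body as bytes (e.g. a dataset file)."""
    request = build_form_post(url, form_data)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `submit_form("https://example.com/export", {"year": "2015", "format": "csv"})` would post two hypothetical form fields and return whatever the server sends back; the URL and field names are placeholders.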
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia