I am trying to scrape the following page:
http://www.flipkart.com/computers/compo ... 183d0f480f
I have created a type for the "Show More" link as well, however the scraper is only able to get the first 20 (of the 196) items on the page.
Also, once the "Show More" link is clicked, the rest of the page loads as one scrolls down - i.e. it auto-loads the content through ajax.
Please advise on how this can be scraped!
PS: I am attaching the project I have created so far. This works great for the 20 products on the page (link above), but is unable to get the rest.
Thanks,
Saahil
Scraping a Page with AutoLoad
- Posts: 8
- Joined: Wed May 09, 2012 9:04 am
- Attachments
- flipkart_computer_components_try1.zip
- (40.44 KiB) Downloaded 522 times
Re: Scraping a Page with AutoLoad
Hi,
Here is a sample of how you can do that (look at the Load All Results actions tree). I imported the If / While premade from New Action button -> Execute Actions Tree -> More.... If you double-click it you'll see how it is configured. It will repeat while the no_more_results kind is not found. Note that I'm using a Navigate action instead of a Navigate Each one. The problem with a Navigate Each is that it assumes you're actually navigating away from the page, so it returns to the original page after it is done, which in this case means going back to the 20-results page.
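For readers who think in code, the loop logic of the Load All Results tree can be sketched roughly as below. This is only an illustration in Python, not how Helium Scraper works internally: `load_all_results`, `FakePage`, `load_more`, `at_end` and the `max_rounds` safety cap are all hypothetical names, and the fake page simply assumes 20 items are revealed per load.

```python
from typing import Callable


def load_all_results(
    load_more: Callable[[], None],
    no_more_results: Callable[[], bool],
    max_rounds: int = 500,
) -> int:
    """Repeat the load step while the end marker is not found.

    Returns the number of load steps performed. max_rounds is an
    assumed safety cap so a missing end marker cannot loop forever.
    """
    rounds = 0
    while not no_more_results() and rounds < max_rounds:
        load_more()  # stay on the same page; do not navigate back
        rounds += 1
    return rounds


class FakePage:
    """Hypothetical in-memory stand-in for a page that auto-loads items."""

    def __init__(self, total: int, page_size: int = 20) -> None:
        self.total = total
        self.page_size = page_size
        self.visible = min(page_size, total)

    def load_more(self) -> None:
        # Reveal one more batch, never exceeding the total.
        self.visible = min(self.visible + self.page_size, self.total)

    def at_end(self) -> bool:
        # Plays the role of the no_more_results kind being found.
        return self.visible >= self.total


page = FakePage(total=196)
rounds = load_all_results(page.load_more, page.at_end)
```

The point of the sketch is that the loop keeps acting on the same page state, which is why a plain Navigate fits here and a Navigate Each (which goes back to the starting page) does not.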
Also, I would recommend extracting some URLs to a table and then using a Navigate URLs action to navigate through them, so you break your project into steps instead of attempting to do it all at once. You could, for instance, extract all the links to categories into a table and then use these URLs with a Navigate URLs action, and do the same thing with the links to product details. Note that with the Navigate URLs action you can also use IDs, the same way you are doing with the Extract actions, to keep track of parent-child relationships. If your URLs table has an ID column, you get to select this column when setting up the Navigate URLs action, and then you can extract it from a child Extract action as you'd normally do.
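The idea of carrying an ID column through a URLs table can be pictured with a small sketch. Again this is plain Python for illustration, not Helium Scraper's own mechanism: `navigate_urls`, `FAKE_SITE`, `fake_extract` and the example.com URLs are all made up, and the extraction step is faked with a dictionary lookup.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]


def navigate_urls(
    url_table: Iterable[Tuple[int, str]],
    extract: Callable[[str], List[Row]],
) -> List[Row]:
    """Visit each (id, url) row and tag every extracted row with that id."""
    rows: List[Row] = []
    for parent_id, url in url_table:
        for child in extract(url):
            # The parent's ID travels with each child row.
            rows.append({"parent_id": parent_id, **child})
    return rows


# Hypothetical site: each category URL lists some product URLs.
FAKE_SITE: Dict[str, List[Row]] = {
    "https://example.com/cat/ram": [
        {"url": "https://example.com/p/1"},
        {"url": "https://example.com/p/2"},
    ],
    "https://example.com/cat/cpu": [
        {"url": "https://example.com/p/3"},
    ],
}


def fake_extract(url: str) -> List[Row]:
    """Stand-in for an Extract action run on the page at url."""
    return FAKE_SITE.get(url, [])


# Step 1: a table of category URLs with an ID column.
category_table = [
    (1, "https://example.com/cat/ram"),
    (2, "https://example.com/cat/cpu"),
]

# Step 2: product links, each remembering which category it came from.
product_links = navigate_urls(category_table, fake_extract)
```

The resulting product rows could then be given IDs of their own and fed into a second pass for the product detail pages, which mirrors splitting the project into separate steps.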
- Attachments
- sample.hsp
- (746.94 KiB) Downloaded 587 times
Juan Soldi
The Helium Scraper Team