Scraping a Page with AutoLoad

saahilgoel
Posts: 8
Joined: Wed May 09, 2012 9:04 am


Post by saahilgoel » Wed May 09, 2012 10:10 am

I am trying to scrape the following page:

http://www.flipkart.com/computers/compo ... 183d0f480f

I have created a type for the "Show More" link as well; however, the scraper is only able to get the first 20 (of the 196) items on the page.

Also, once the "Show More" link is clicked, the rest of the page loads as one scrolls down, i.e. it auto-loads the content through AJAX.

Please advise on how this can be scraped!

PS: I am attaching the project I have created so far. This works great for the 20 products on the page (link above), but is unable to get the rest.

Thanks,
Saahil
Attachments
flipkart_computer_components_try1.zip
(40.44 KiB) Downloaded 522 times

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Scraping a Page with AutoLoad

Post by webmaster » Sat May 12, 2012 8:15 am

Hi,

Here is a sample of how you can do that (look at the Load All Results actions tree). I imported the If / While premade from the New Action button -> Execute Actions Tree -> More.... If you double-click it, you'll see how it is configured. It repeats while the no_more_results kind is not found. Note that I'm using a Navigate action instead of a Navigate Each one. The problem with Navigate Each is that it assumes you're actually navigating away from the page, so it returns to the original page after it is done, which in this case means going back to the page with only 20 results.
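
For readers who think in code, the loop logic can be sketched roughly as follows. This is only an illustration in Python, not what the attached sample contains and not a Helium Scraper API: FakePage, load_next_batch and load_all_results are hypothetical names, and the page is simulated in memory with an assumed batch size of 20.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical in-memory stand-in for a results page (not a real browser)."""
    total_items: int
    batch_size: int = 20   # assumed number of items added per load
    loaded: int = 20       # items visible when the page first opens

    def no_more_results(self) -> bool:
        # Plays the role of the no_more_results kind being found.
        return self.loaded >= self.total_items

    def load_next_batch(self) -> None:
        # Plays the role of a Navigate action that stays on the same page
        # (clicking "Show More" or triggering the auto-load).
        self.loaded = min(self.loaded + self.batch_size, self.total_items)


def load_all_results(page: FakePage, max_rounds: int = 1000) -> int:
    """Repeat while no_more_results is not found; return the number of rounds."""
    rounds = 0
    # max_rounds is a safety cap so a page that never finishes cannot loop forever.
    while not page.no_more_results() and rounds < max_rounds:
        page.load_next_batch()
        rounds += 1
    return rounds


page = FakePage(total_items=196)
rounds = load_all_results(page)
```

The point of the sketch is that the page object is never reset between rounds, which is what a Navigate Each would effectively do by returning to the original page.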

Also, I would recommend extracting some URLs to a table and then using a Navigate URLs action to navigate through them, so that you break your project into steps instead of attempting to do it all at once. You could, for instance, extract all the links to categories into a table and then use these URLs with a Navigate URLs action, and do the same thing with links to product details. Note that with the Navigate URLs action you can also use IDs, the same way you are doing with the Extract actions, to keep track of parent-child relationships. If your URLs table has an ID column, you get to select this column when setting up the Navigate URLs action, and then you can extract it from a child Extract action as you'd normally do.
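
As a rough picture of the table layout this leads to, here is a small Python sketch. Everything in it is an assumption for illustration: the example.com URLs, the FAKE_SITE dictionary standing in for real pages, and the column names id, parent_id and url are made up, and the function is not part of Helium Scraper.

```python
# Step 1 output: a table of category URLs, each row with its own ID.
categories = [
    {"id": 1, "url": "https://example.com/category/a"},
    {"id": 2, "url": "https://example.com/category/b"},
]

# Hypothetical canned site: maps a category URL to the product links found on it.
FAKE_SITE = {
    "https://example.com/category/a": [
        "https://example.com/product/1",
        "https://example.com/product/2",
    ],
    "https://example.com/category/b": [
        "https://example.com/product/3",
    ],
}


def extract_product_links(category_rows, site):
    """Step 2: visit each category URL and record its product links.

    Each product row keeps the ID of the category it came from, which is
    the parent-child relationship described above.
    """
    products = []
    for row in category_rows:
        for link in site.get(row["url"], []):
            products.append(
                {"id": len(products) + 1, "parent_id": row["id"], "url": link}
            )
    return products


products = extract_product_links(categories, FAKE_SITE)
```

The resulting products table could then feed a third step that visits each product URL, again carrying the row ID along as the parent of whatever is extracted there.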
Attachments
sample.hsp
(746.94 KiB) Downloaded 587 times
Juan Soldi
The Helium Scraper Team
