Scraping Manta

Questions and answers about anything related to Helium Scraper
Nate
Posts: 4
Joined: Tue May 22, 2012 6:30 pm

Scraping Manta

Post by Nate » Tue May 22, 2012 7:38 pm

I put together a simple scrape to gather information from Manta.com, but I have found that the site is running exceptionally slow (I was getting about one entry every five minutes or so). I also noticed that when you load the site in the Helium Scraper browser, the page never stops loading: the "stop" button at the top of the browser never changes to the "refresh" button, as it normally would after the page has completely finished loading.

I ran the SAVE URL feature for the pages I am interested in, and I would like to run a process to batch everything once I can figure out a way to speed up each scrape in "Main".

I attached the project to help you visualize what I am doing. What advice do you have to speed up the scraping?

Thanks!
Attachments
manta day spas.zip
(36.5 KiB) Downloaded 525 times

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am
Contact:

Re: Scraping Manta

Post by webmaster » Thu May 24, 2012 2:26 am

Hi,

Try entering a smaller Navigation Timeout at Project -> Options.
Juan Soldi
The Helium Scraper Team

Nate
Posts: 4
Joined: Tue May 22, 2012 6:30 pm

Re: Scraping Manta

Post by Nate » Thu May 24, 2012 2:12 pm

That still doesn't seem to change anything.

I have an idea for how to address this, but maybe you can help with the actual implementation. Could we set up a script (maybe a JS Execution) to force the page to stop loading after a certain period of time? Do you think that could work?

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am
Contact:

Re: Scraping Manta

Post by webmaster » Thu May 24, 2012 3:29 pm

How low did you set it? Try setting it to something like 5 to 10 seconds. It isn't possible to run a script until the page has finished loading, so what you're suggesting couldn't be done.

To further improve speed, you can use a Start Processes action, which runs more than one instance extracting at the same time: just extract the URLs of every results page (or, even better, generate them with the premade URL Variations, since the URL contains the page number), and then use a Start Processes action that goes through these URLs. Then, in your Main actions tree, navigate through the SERP-level links and extract the content inside each of them.
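Since the results URL contains the page number, generating the page URLs yourself is straightforward. A minimal sketch in JavaScript (the URL template below is hypothetical; substitute the actual pattern from Manta's results pages):

```javascript
// Build the list of results-page URLs by substituting the page number
// into a template. "{page}" marks where the page number goes; the
// template string itself is an assumption, not Manta's real URL format.
function generatePageUrls(template, firstPage, lastPage) {
  const urls = [];
  for (let page = firstPage; page <= lastPage; page++) {
    urls.push(template.replace("{page}", String(page)));
  }
  return urls;
}

// Example: pages 1 through 3 of a hypothetical results listing.
const urls = generatePageUrls("http://www.manta.com/day_spas?pg={page}", 1, 3);
```

The resulting list can then be fed to the Start Processes action so each instance works through its share of the pages.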

Note that in order for this to work, you'll have to export and connect to your database from the database panel (Export Database -> Export and Connect) so that all instances use the same database.
Juan Soldi
The Helium Scraper Team
