Recursive Web Site Scrape

Questions and answers about anything related to Helium Scraper
nevrec
Posts: 3
Joined: Tue May 29, 2012 3:01 am

Recursive Web Site Scrape

Post by nevrec » Tue May 29, 2012 3:04 am

How does one automatically scrape all the pages on a website?
Is there a website crawl option?

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Recursive Web Site Scrape

Post by webmaster » Tue May 29, 2012 7:47 pm

I guess I could write a premade that follows every link, as long as it points to a URL inside a given domain and as long as it hasn't already been followed. But then how would you determine what to extract from each page if you don't know what kind of layout each page will have? Are you just trying to extract URLs, or the whole HTML?
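
In rough terms, such a premade would boil down to something like this (a plain JavaScript sketch of the idea, not Helium Scraper code; the domain, the depth, and the crude regex-based link extraction are placeholders just to show the logic):

    // Follow links that stay inside one domain, skipping URLs already visited.
    // Assumes Node 18+ so that fetch() is available globally.
    const prefix = /^http:\/\/www\.example\.com\//;  // only follow URLs matching this
    const seen = new Set();                          // URLs already followed

    async function crawl(url, depth) {
      if (depth === 0 || seen.has(url)) return;      // stop at the depth limit or on repeats
      seen.add(url);
      const html = await (await fetch(url)).text();
      const links = [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1]);
      for (const link of links) {
        if (prefix.test(link)) await crawl(link, depth - 1);  // relative links are skipped for brevity
      }
    }

    crawl("http://www.example.com/", 3)
      .then(() => console.log(seen.size + " pages visited"));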
Juan Soldi
The Helium Scraper Team

nevrec
Posts: 3
Joined: Tue May 29, 2012 3:01 am

Re: Recursive Web Site Scrape

Post by nevrec » Mon Jun 04, 2012 2:34 am

I'm data mining. I want to select from a representative page the information to extract, such as beginning and ending search criteria that apply to all pages, and extract that information into a table, spreadsheet, or CSV file.
For instance:
Let's say the extraction table would contain the following:
Description
UPC
Price

For the Description field I would look for the following delimiters on each page:
<div class="productDescriptionWrapper">
APC BACK-UPS ES BE550G 8-Outlet 550VA 330W UPS System
<div class="emptyClear">

It would then take the info between the delimiters and place it in the Description column.


Next would be the UPC on that page, and so on. If nothing was found, it would leave the column blank and move on to the next item; when all items are parsed, it would then go to the next page.
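
Roughly, what I mean for each field is something like this (illustrative JavaScript only, using the Description delimiters above, just to show the idea):

    // Take whatever sits between the opening and closing delimiters.
    // The sample HTML is the snippet quoted above; a real page would be fetched instead.
    const html = '<div class="productDescriptionWrapper">' +
                 'APC BACK-UPS ES BE550G 8-Outlet 550VA 330W UPS System' +
                 '<div class="emptyClear">';
    const m = html.match(/<div class="productDescriptionWrapper">([\s\S]*?)<div class="emptyClear">/);
    const description = m ? m[1].trim() : "";   // blank when the delimiters are not found
    console.log(description);                   // APC BACK-UPS ES BE550G 8-Outlet 550VA 330W UPS System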

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Recursive Web Site Scrape

Post by webmaster » Tue Jun 05, 2012 3:23 am

Have you watched our getting started video?

I think what you're trying to do is just a normal extraction. I wouldn't attempt to write a crawler that follows every link because this would follow any kind of link to any kind of page. What I'd do is follow only links to the kind of page I'm interested in extracting data from, which is what the video tutorial above shows how to do.
Juan Soldi
The Helium Scraper Team

nevrec
Posts: 3
Joined: Tue May 29, 2012 3:01 am

Re: Recursive Web Site Scrape

Post by nevrec » Tue Jun 05, 2012 3:33 pm

webmaster wrote:
I think what you're trying to do is just a normal extraction. I wouldn't attempt to write a crawler that follows every link because this would follow any kind of link to any kind of page. What I'd do is follow only links to the kind of page I'm interested in extracting data from, which is what the video tutorial above shows how to do.
Yes, I have. The problem is that there are too many links to follow, so some may be missed by performing the operation manually, and the time involved would be very long.

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Recursive Web Site Scrape

Post by webmaster » Fri Jun 08, 2012 5:22 am

Hi,

The attached project should help you get started. It's a variation of the Auto Distribute URLs premade. Here is what you need to do:
  • Open the JS_IsGoodLink JavaScript gatherer from Project -> JavaScript Gatherers and change the value of var pattern from "^http://www.example.com/" to whatever site you're crawling, keeping the "^" symbol at the beginning. This just tells Helium Scraper what a link's URL needs to start with to be considered a valid link (see the sketch after this list). You can test which links will be considered valid by clicking on the Select kind in browser button in the GoodLinks kind.
  • Go to the database panel, click Export Database, select Export and Connect, and save the file.
  • Expand the Start actions tree and note the inner Repeat 3 times action. The number of repetitions translates into how deep into the links the extraction will go, so try a small number first. Roughly, the number of links extracted will be the average number of links per page raised to the number of repetitions (for example, around 50 links per page and 3 repetitions gives on the order of 50³ = 125,000 URLs), so it grows exponentially until you start getting repeated URLs, which are ignored.
  • Finally, save the project and run the Start actions tree. Note that other instances of Helium Scraper will be created. Make sure not to close them.
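The check the gatherer performs amounts to a regular expression test along these lines (a sketch only; the actual gatherer boilerplate is in the attached project and may differ slightly, and the wrapper function here is just illustrative):

    // Return true when a link's URL starts with the configured prefix.
    var pattern = "^http://www.example.com/";   // change this to your site, keep the "^"
    function isGoodLink(url) {                  // url = the link being tested
      return new RegExp(pattern).test(url);     // true = the link will be followed
    }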
The URLs found will be extracted to the Links table. If you want to extract anything other than each page's URL, you can place your extraction logic in the Extract actions tree, right underneath the Extract to table: 'Links' action. But what I would do is extract as many URLs as possible, and then use a Navigate URLs action, or even better, the Auto Distribute URLs premade I mentioned above, to extract whatever you need from them.

Since this project is based on the Auto Distribute URLs one you might want to take a look at it as well.
Attachments
Crawler.hsp
(539.49 KiB) Downloaded 610 times
Juan Soldi
The Helium Scraper Team
