How does one automatically scrape all the pages in a web?
Is there a website crawl option?
Recursive Web Site Scrape
Re: Recursive Web Site Scrape
I guess I could write a premade that follows every link, as long as it points to a URL inside a given domain and hasn't already been followed. But then how would you determine what to extract from each page if you don't know what kind of layout each page will have? Are you just trying to extract URLs, or the whole HTML?
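The follow-a-link-once rule described above can be sketched as a small JavaScript predicate. This is an illustrative stand-alone version, not the actual premade; the `shouldFollow` name and the way the domain is passed in are assumptions:

```javascript
// Decide whether a crawler should follow a link: the URL must point
// inside the given domain and must not have been followed already.
// (Hypothetical helper; a Helium Scraper premade would express this
// with its own gatherers rather than a plain function.)
function shouldFollow(url, domain, visited) {
  if (!url.startsWith(domain)) return false; // external link, skip
  if (visited.has(url)) return false;        // already followed, skip
  visited.add(url);                          // remember it for next time
  return true;
}
```

The `visited` set is what keeps the recursion from looping forever on sites whose pages link back to each other.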
Juan Soldi
The Helium Scraper Team
Re: Recursive Web Site Scrape
I'm data mining. I want to select, from a representative page, the information to extract.
Such as beginning and ending search criteria for all pages, and extract that information into a table, spreadsheet, or CSV file.
For instance:
Let's say the extraction table would contain the following:
Description
UPC
Price
For the description field I would look for the following delimiters on each page:
<div class="productDescriptionWrapper">
APC BACK-UPS ES BE550G 8-Outlet 550VA 330W UPS System
<div class="emptyClear">
It would then take the info between the delimiters and place it in the Description column.
Next would be the UPC on that page, and so on. If nothing was found it would leave the column blank and move on to the next item; when all items are parsed it would then go to the next page.
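The take-the-text-between-two-delimiters idea can be sketched in JavaScript. This is only an illustration of the request, with a hypothetical `between` helper; Helium Scraper itself selects content by example rather than by raw delimiters:

```javascript
// Return the text found between a start and an end delimiter, or ""
// when either delimiter is missing, so the table column is left blank.
function between(html, startDelim, endDelim) {
  const start = html.indexOf(startDelim);
  if (start === -1) return "";               // start delimiter not found
  const from = start + startDelim.length;
  const end = html.indexOf(endDelim, from);
  if (end === -1) return "";                 // end delimiter not found
  return html.substring(from, end).trim();   // text between the two
}
```

Note that with the delimiters shown above, the captured text would still include the closing `</div>` of the description wrapper, which would need to be stripped as well.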
Re: Recursive Web Site Scrape
Have you watched our getting started video?
I think what you're trying to do is just a normal extraction. I wouldn't attempt to write a crawler that follows every link because this would follow any kind of link to any kind of page. What I'd do is follow only links to the kind of page I'm interested in extracting data from, which is what the video tutorial above shows how to do.
Juan Soldi
The Helium Scraper Team
Re: Recursive Web Site Scrape
Yes I have. The problem is that there are too many links to follow; some may be missed by performing the operation manually, and the time involved would be very long.
Re: Recursive Web Site Scrape
Hi,
The attached project should help you get started. It's a variation of the Auto Distribute URLs premade. Here is what you need to do:
- Open the JS_IsGoodLink JavaScript gatherer from Project -> JavaScript Gatherers and change the value of var pattern from "^http://www.example.com/" to whatever site you're crawling, keeping the "^" symbol at the beginning. This just tells Helium Scraper what a link's URL needs to start with to be considered a valid link. You can test which links will be considered valid by clicking on the Select kind in browser button in the GoodLinks kind.
- Go to the database panel, click Export Database, select Export and Connect and save the file.
- Expand the Start actions tree and note the inner Repeat 3 times action. The number of repetitions determines how deep into the links the extraction will go. Try a small number first. Roughly, the number of links to be extracted will be the average number of links per page raised to the number of repetitions, so it grows exponentially until you start getting repeated URLs, which are ignored.
- Finally, save the project and run the Start actions tree. Note that other instances of Helium Scraper will be created. Make sure not to close them.
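The pattern check in the first step works like the following stand-alone sketch. The real JS_IsGoodLink gatherer runs inside Helium Scraper, so the `isGoodLink` function name here is hypothetical; only the `var pattern` line mirrors what the post describes:

```javascript
// A link is "good" only when its URL matches the pattern; the "^"
// anchors the match to the start of the URL, so links on other
// domains are rejected even if they contain the site's address.
var pattern = "^http://www.example.com/"; // change to the site you're crawling
function isGoodLink(url) {
  return new RegExp(pattern).test(url);
}
```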
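The exponential growth mentioned for the Repeat action is easy to see with a rough estimate. The numbers below are illustrative, not measured values:

```javascript
// Rough upper bound on extracted links: the average number of links
// per page raised to the number of repetitions. The real count is
// lower because repeated URLs are ignored.
function roughLinkCount(avgLinksPerPage, repetitions) {
  return Math.pow(avgLinksPerPage, repetitions);
}

// e.g. 20 links per page, crawled 3 levels deep -> up to 20^3 = 8000 links
```

This is why trying a small number of repetitions first is a good idea.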
Since this project is based on the Auto Distribute URLs one you might want to take a look at it as well.
- Attachments
-
- Crawler.hsp
- (539.49 KiB) Downloaded 610 times
Juan Soldi
The Helium Scraper Team