Helium Scraper

Posted: **Sun May 06, 2012 5:12 am**

Hello, I have a huge list of URLs and I am using the "Navigate URLs" action. I grabbed the info from 20% of them today and I want to grab the rest tomorrow. How can I start HS up from the last URL I already scraped?

Thanks

Posted: **Tue May 08, 2012 3:29 am**

The easiest way to do this would be to keep one table with all your URLs and a second, working table into which you paste the records you want to process next, and then use that second table in your Navigate URLs action.

If you need a more automated way of doing this, you'll need to use a bit of SQL. Here is an example of how to do it. All you need is two tables: one called URLs with two columns (Id and URL), and another called ExtractedURLs with only one column (UrlId). The UrlId column will contain the IDs of all the URLs that have already been extracted. In the attached project, there is an Extract action that extracts the current ID_URLs property of the BODY kind to the ExtractedURLs table. This property extracts the Id of the current URL. The Navigate URLs action then uses this query:

Code:

SELECT [URL], [Id]
	FROM  [URLs]
	WHERE [Id] NOT IN
		(SELECT [UrlId] FROM [ExtractedURLs])

All this query does is select only the URLs from the URLs table whose Id has not yet been written to the ExtractedURLs table. So all you need to do is fill the URLs table with your URLs (you can leave the Id column blank; the Ids will be auto-generated when you save the table) and add your own Extract action(s) inside the Navigate URLs action, right before the Extract action that extracts to the ExtractedURLs table.
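To see the resume logic outside of Helium Scraper, here is a minimal sketch in Python. It is only an illustration under assumptions: it uses an in-memory SQLite database as a stand-in for the project's tables, the example.com URLs are made up, and the helper name pending_urls is hypothetical. The table and column names are the ones described above. An ORDER BY is added only so the output is predictable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the project's tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE URLs (Id INTEGER PRIMARY KEY, URL TEXT)")
conn.execute("CREATE TABLE ExtractedURLs (UrlId INTEGER)")

# Hypothetical URLs; Id is omitted so it gets auto-generated (1..5).
conn.executemany(
    "INSERT INTO URLs (URL) VALUES (?)",
    [(f"http://example.com/page{n}",) for n in range(1, 6)],
)

# Pretend the first two URLs were already scraped in an earlier session.
conn.executemany("INSERT INTO ExtractedURLs (UrlId) VALUES (?)", [(1,), (2,)])


def pending_urls(connection):
    """Return (URL, Id) rows whose Id is not yet listed in ExtractedURLs."""
    query = """
        SELECT [URL], [Id]
        FROM   [URLs]
        WHERE  [Id] NOT IN (SELECT [UrlId] FROM [ExtractedURLs])
        ORDER BY [Id]
    """
    return connection.execute(query).fetchall()


remaining = pending_urls(conn)
print(remaining)
```

Only pages 3 to 5 come back, because Ids 1 and 2 are already in ExtractedURLs. One thing to watch for in general SQL: if the UrlId column ever contains an empty (NULL) value, NOT IN will match nothing, so it is worth making sure every row written to ExtractedURLs has an Id.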

If you also want to limit the number of URLs to be extracted each time, you can use this query instead (just replace 100 with whatever number you want to use):

Code:

SELECT TOP 100 [URL], [Id]
	FROM  [URLs]
	WHERE [Id] NOT IN
		(SELECT [UrlId] FROM [ExtractedURLs])
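As a rough illustration of how the batched version behaves across several runs, here is a small Python sketch. The assumptions are the same as before: SQLite stands in for the project's database, the URLs are made up, and run_session is a hypothetical helper, not something Helium Scraper provides. SQLite has no TOP keyword, so the sketch uses LIMIT for the same purpose, with an ORDER BY so the batches are predictable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the project's tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE URLs (Id INTEGER PRIMARY KEY, URL TEXT)")
conn.execute("CREATE TABLE ExtractedURLs (UrlId INTEGER)")

# Seven hypothetical URLs with auto-generated Ids 1..7.
conn.executemany(
    "INSERT INTO URLs (URL) VALUES (?)",
    [(f"http://example.com/item{n}",) for n in range(1, 8)],
)


def run_session(connection, batch_size):
    """Take up to batch_size pending URLs and record their Ids as extracted."""
    rows = connection.execute(
        """
        SELECT [URL], [Id]
        FROM   [URLs]
        WHERE  [Id] NOT IN (SELECT [UrlId] FROM [ExtractedURLs])
        ORDER BY [Id]
        LIMIT ?
        """,
        (batch_size,),
    ).fetchall()
    for url, url_id in rows:
        # A real run would scrape `url` here; this sketch only records the Id.
        connection.execute(
            "INSERT INTO ExtractedURLs (UrlId) VALUES (?)", (url_id,)
        )
    connection.commit()
    return [url_id for _, url_id in rows]


# Four sessions with a batch size of 3 over seven URLs.
sessions = [run_session(conn, 3) for _ in range(4)]
print(sessions)
```

Each session picks up where the previous one stopped, and once every Id is in ExtractedURLs the query returns nothing, so further runs do no work.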

Navigate URLS Question
