Only scrape changes

Questions and answers about anything related to Helium Scraper
Darrylglenn
Posts: 11
Joined: Wed Jun 29, 2011 8:16 pm

Only scrape changes

Post by Darrylglenn » Fri Jun 15, 2012 6:23 am

Hi Juan,

Last question... I think :lol:

I'm scraping products from webpages. Is it possible to scrape only the changes?
So, I scraped the website once... But I don't want it to fill up the database all over again.
Can HS do the following:

- If kinds are updated on a page (e.g. price, stock), update them in the database
- If products cannot be found, delete them from the database
- If new products are found, add them to the database

Thanks a lot.

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Only scrape changes

Post by webmaster » Mon Jun 18, 2012 5:41 pm

You cannot delete non-found items with the current built-in functionality, but you could do this with JavaScript.
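
To make the deletion step concrete, here is a rough sketch of the logic outside Helium Scraper. It is written in Python against an in-memory SQLite database purely as a stand-in; the `products` table, its columns, and the `delete_missing` helper are hypothetical names, not part of Helium Scraper or its JavaScript support.

```python
import sqlite3

def delete_missing(conn, seen_ids):
    """Delete rows whose id was not seen in the latest scrape; return the deleted ids."""
    existing = {row[0] for row in conn.execute("SELECT product_id FROM products")}
    missing = existing - set(seen_ids)
    conn.executemany(
        "DELETE FROM products WHERE product_id = ?",
        [(pid,) for pid in missing],
    )
    conn.commit()
    return missing

# Hypothetical data: three stored products, only two found on the latest run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id TEXT PRIMARY KEY, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A1", 9.99), ("B2", 4.50), ("C3", 12.00)],
)
removed = delete_missing(conn, ["A1", "C3"])
remaining = sorted(row[0] for row in conn.execute("SELECT product_id FROM products"))
```

The same set difference (stored ids minus ids seen this run) is what a script would need to compute, whatever language it is written in.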

Regarding updating existing items, I assume you want to prevent it from navigating into items that have not been updated (otherwise the result would be equivalent to clearing the database and extracting the whole thing again). If so, you'll need to find an identifier for each item (such as a product number) and set it as Unique in your Extract action. Then, on your Navigate Each action (which needs to occur after the Extract action) select Only if modified on whatever data table you've just extracted to. This will cause it to navigate only inside items that have just been added or updated so that you can get whatever extra details you need.
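
As an illustration of what a unique identifier plus a "modified" check amounts to, here is a minimal Python/SQLite sketch. It is not how Helium Scraper implements Unique or Only if modified on; the table layout and the `upsert_items` helper are assumptions made for the example.

```python
import sqlite3

def upsert_items(conn, items):
    """Insert new rows, update changed rows, and return the ids added or modified."""
    modified = set()
    for product_id, price, stock in items:
        row = conn.execute(
            "SELECT price, stock FROM products WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is None:
            # Identifier not seen before: this is a new item.
            conn.execute(
                "INSERT INTO products VALUES (?, ?, ?)",
                (product_id, price, stock),
            )
            modified.add(product_id)
        elif row != (price, stock):
            # Identifier exists but its data changed: update in place.
            conn.execute(
                "UPDATE products SET price = ?, stock = ? WHERE product_id = ?",
                (price, stock, product_id),
            )
            modified.add(product_id)
    conn.commit()
    return modified

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id TEXT PRIMARY KEY, price REAL, stock INTEGER)"
)
first_run = upsert_items(conn, [("A1", 9.99, 5), ("B2", 4.50, 0)])
second_run = upsert_items(conn, [("A1", 9.99, 5), ("B2", 3.99, 2), ("C3", 12.0, 1)])
```

Only the ids returned by the second call would be worth navigating into for extra details; the unchanged item is skipped.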

If you're not using a Navigate Each but a Navigate URLs then you'll need to do the filtering with SQL.
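
For the SQL route, the filter could look something like the query below, shown here through Python's `sqlite3` module so it can be run as is. The `listing` and `details` tables and their columns are hypothetical, and the SQL dialect in your project database may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listing (product_id TEXT PRIMARY KEY, url TEXT, price REAL);
CREATE TABLE details (product_id TEXT PRIMARY KEY, price REAL, description TEXT);
INSERT INTO listing VALUES ('A1', 'http://example.com/a1', 9.99),
                           ('B2', 'http://example.com/b2', 3.99),
                           ('C3', 'http://example.com/c3', 12.00);
INSERT INTO details VALUES ('A1', 9.99, 'unchanged item'),
                           ('B2', 4.50, 'price has since changed');
""")

# Keep only URLs whose item has no details row yet, or whose listed price
# no longer matches the price stored with the details.
urls_to_visit = [row[0] for row in conn.execute("""
    SELECT l.url
    FROM listing AS l
    LEFT JOIN details AS d ON d.product_id = l.product_id
    WHERE d.product_id IS NULL OR d.price <> l.price
    ORDER BY l.product_id
""")]
```
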
Juan Soldi
The Helium Scraper Team

crookedleaf
Posts: 38
Joined: Tue Dec 11, 2012 6:44 pm

Re: Only scrape changes

Post by crookedleaf » Tue Feb 19, 2013 6:45 am

sorry to revive an old thread, but i had a question about scraping changes.

the scrape i currently have is set up to go to a page that lists confirmation numbers. it does a "navigate each" on the confirmation numbers, then scrapes the data off that page (confirmation number, name, date made, status, etc). when i set the "navigate each" to "only if modified on" the table it's extracting to, then set "unique" on the confirmation number, it never navigates into them.

i am essentially trying to get the scrape to only scrape confirmation numbers that are not already scraped, i.e., new confirmations. but i am having no luck. can this be done?

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Only scrape changes

Post by webmaster » Wed Feb 20, 2013 3:49 pm

The thing is, they have not been modified in that table because they have not been extracted. You need to extract these links (either the text or the HTML or whatever is unique to them) to another table right before navigating into them. This is the one that needs to have the Unique column checked. Then on your Navigate Each do Only if modified on this table.
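
The two-table idea can be sketched as follows, again in Python with SQLite as a stand-in rather than anything Helium Scraper specific. The `links` table and the `record_new_links` helper are hypothetical; the point is that the unique column lives on the table of links, and only links that were actually inserted get navigated.

```python
import sqlite3

def record_new_links(conn, confirmation_numbers):
    """Store each number in the links table; return only those not seen before."""
    new = []
    for number in confirmation_numbers:
        cur = conn.execute(
            "INSERT OR IGNORE INTO links (confirmation_number) VALUES (?)",
            (number,),
        )
        # rowcount is 1 when the row was inserted, 0 when the UNIQUE
        # constraint caused the insert to be ignored.
        if cur.rowcount == 1:
            new.append(number)
    conn.commit()
    return new

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (confirmation_number TEXT UNIQUE)")
first_run = record_new_links(conn, ["1001", "1002"])
second_run = record_new_links(conn, ["1001", "1002", "1003"])
```

On the second run only the unseen confirmation number comes back, so that is the only page that would be opened and scraped into the details table.
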
Juan Soldi
The Helium Scraper Team

crookedleaf
Posts: 38
Joined: Tue Dec 11, 2012 6:44 pm

Re: Only scrape changes

Post by crookedleaf » Wed Feb 20, 2013 7:50 pm

Oooooh, okay, perfect! Got it running. Thank you so much!
