Only scrape changes

Posted: Fri Jun 15, 2012 6:23 am
by Darrylglenn
Hi Juan,

Last question... I think :lol:

I'm scraping products from webpages. Is it possible to scrape only the changes?
I've already scraped the website once, but I don't want it to fill up the database all over again.
Can HS do the following:

- If there are updates on a page (e.g. price, stock), update them in the database
- If products cannot be found, delete them from the database
- If new products are found, add them to the database

Thanks a lot.

Re: Only scrape changes

Posted: Mon Jun 18, 2012 5:41 pm
by webmaster
You cannot delete non-found items with the current built-in functionality, but you could do this with JavaScript.
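
As a rough illustration of the delete step only, here is a minimal sketch. It is written in Python against a plain SQLite database rather than in the scraper's own JavaScript, and it does not use any Helium Scraper API. The table name `products`, the column `product_id`, and the function `delete_missing` are all made up for the example.

```python
import sqlite3

def delete_missing(conn, scraped_ids):
    """Delete rows whose product_id was not seen in the latest scrape.

    Assumes a hypothetical table products(product_id TEXT PRIMARY KEY, ...).
    Returns the number of rows removed.
    """
    # Load the freshly scraped ids into a temporary table so the comparison
    # is done in SQL instead of with a long list of bound parameters.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS seen (product_id TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM seen")
    conn.executemany(
        "INSERT OR IGNORE INTO seen (product_id) VALUES (?)",
        [(pid,) for pid in scraped_ids],
    )
    # Anything stored earlier that is absent from this scrape gets removed.
    cur = conn.execute(
        "DELETE FROM products WHERE product_id NOT IN (SELECT product_id FROM seen)"
    )
    conn.commit()
    return cur.rowcount
```

Note that this only makes sense after a complete pass over the site: if the scrape stops halfway, every product it did not reach would look "not found" and be deleted.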

Regarding updating existing items, I assume you want to prevent it from navigating into items that have not been updated (otherwise the result would be equivalent to clearing the database and extracting the whole thing again). If so, you'll need to find an identifier for each item (such as a product number) and set it as Unique in your Extract action. Then, on your Navigate Each action (which needs to occur after the Extract action) select Only if modified on whatever data table you've just extracted to. This will cause it to navigate only inside items that have just been added or updated so that you can get whatever extra details you need.
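
To make the idea concrete, the combination of a Unique identifier and Only if modified behaves roughly like the sketch below. This is a Python/SQLite model of the logic, not how the program is implemented internally; the table `listing`, its columns, and the function `upsert_listing` are assumptions made for the example.

```python
import sqlite3

def upsert_listing(conn, items):
    """Insert or update listing rows keyed by product_id.

    Assumes a hypothetical table
    listing(product_id TEXT PRIMARY KEY, price REAL, stock INTEGER).
    Returns the ids that were added or changed, i.e. the only items
    that would be worth navigating into for extra details.
    """
    modified = []
    for product_id, price, stock in items:
        row = conn.execute(
            "SELECT price, stock FROM listing WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is None:
            # New product: add it and mark it as modified.
            conn.execute(
                "INSERT INTO listing (product_id, price, stock) VALUES (?, ?, ?)",
                (product_id, price, stock),
            )
            modified.append(product_id)
        elif row != (price, stock):
            # Known product whose price or stock changed: update it.
            conn.execute(
                "UPDATE listing SET price = ?, stock = ? WHERE product_id = ?",
                (price, stock, product_id),
            )
            modified.append(product_id)
        # Unchanged products are skipped entirely.
    conn.commit()
    return modified
```

Running it twice with the same data returns an empty list the second time, which corresponds to no navigation taking place.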

If you're not using a Navigate Each but a Navigate URLs then you'll need to do the filtering with SQL.
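
The exact query depends on your tables, but the filter would be along these lines. This is only a sketch, run here through Python's `sqlite3` module so it can be tried on its own; the tables `scraped_links` and `products` and their columns are hypothetical names, not ones the program creates for you.

```python
import sqlite3

# Anti-join: keep only the links whose product_id is not in products yet.
NEW_URLS_SQL = """
SELECT s.url
FROM scraped_links AS s
LEFT JOIN products AS p ON p.product_id = s.product_id
WHERE p.product_id IS NULL
ORDER BY s.url
"""

def urls_to_visit(conn):
    """Return the URLs of scraped links that have no matching product row."""
    return [row[0] for row in conn.execute(NEW_URLS_SQL)]
```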

Re: Only scrape changes

Posted: Tue Feb 19, 2013 6:45 am
by crookedleaf
sorry to revive an old thread, but i had a question about scraping changes.

the scrape i currently have is set up to go to a page that lists confirmation numbers. it does a "navigate each" on the confirmation numbers, then scrapes the data off each page (confirmation number, name, date made, status, etc). when i set the "navigate each" to "only if modified on" the table it's extracting to, and set "unique" on the confirmation number, it never navigates into them.

i am essentially trying to get the scrape to only scrape confirmation numbers that are not already scraped, i.e. new confirmations. but i am having no luck. can this be done?

Re: Only scrape changes

Posted: Wed Feb 20, 2013 3:49 pm
by webmaster
The thing is, they have not been modified in that table because they have not been extracted. You need to extract these links (either the text or the HTML or whatever is unique to them) to another table right before navigating into them. This is the one that needs to have the Unique column checked. Then on your Navigate Each do Only if modified on this table.
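
In other words, the extra table is what tells new links apart from ones already seen. Modeled in Python with SQLite (again just a sketch of the logic, not the program's API; the table `links`, the column `confirmation_number`, and the function `record_new_links` are hypothetical), it works something like this:

```python
import sqlite3

def record_new_links(conn, confirmation_numbers):
    """Store each number in the links table; return only the unseen ones.

    Assumes a hypothetical table links(confirmation_number TEXT UNIQUE).
    The UNIQUE constraint plays the role of the Unique checkbox: a number
    that is already stored is ignored and therefore not "modified".
    """
    new = []
    for number in confirmation_numbers:
        cur = conn.execute(
            "INSERT OR IGNORE INTO links (confirmation_number) VALUES (?)",
            (number,),
        )
        # rowcount is 1 when the row was inserted, 0 when it was ignored.
        if cur.rowcount == 1:
            new.append(number)
    conn.commit()
    return new
```

Only the numbers returned here would be navigated into. The details table never enters the decision, which is why filtering on it alone does nothing.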

Re: Only scrape changes

Posted: Wed Feb 20, 2013 7:50 pm
by crookedleaf
Oooooh, okay, perfect! Got it running. Thank you so much!