Hi.
I'm trying to scrape threads from a target forum. It uses IDs as thread indexes.
Let's say that today I successfully scraped from thread 100 down to thread 50 (threads are shown in descending order).
I want to run the Helium job again tomorrow, so my SQLite database holds thread IDs starting from threadId 50.
The forum page shows all threads, running from thread 200 down to 0.
I'm trying to write a function with a while loop that does, more or less, the following:
- Navigate the forum and find the most recent thread. Save its ID to a variable.
- Query the local database for the first threadId record.
- If they are not equal, execute the Scrape global; otherwise return.
Instead of using an if function to do the comparison, I'd like to use a while loop that iterates over each thread ID in the forum. If the ID is different from my first database record, proceed to scrape; otherwise return.
What would be the best way of achieving this?
Also: Is there a way to avoid duplicates in the database while scraping?
Maybe related: Is there a way, from inside the software, to do a sort of scheduled run?
Thanks! Helium is really a wonderful piece of software!
While Loop Function in recurring actions
Re: While Loop Function in recurring actions
That's now easy to do (since version 3.2.7.9) with the WhileAny function. The documentation has an example showing how to stop the extraction when a post with a certain text is found.
In your case, you could first create a query at Project Explorer > Data Flow > Queries that gets the latest ID, and then compare each post's ID to that. Something like this:
Code:
Query.LatestId
as (latestId)
Sequence.WhileAny
· Browser.Load
· "https://www.example.com"
Browser.TurnPages
· Select.NextButton
Select.RowContainer
· Select.ThreadId
as threadId
if
· =
· threadId
· latestId
· Sequence.Empty
· Sequence.Default
Select.ThreadLink
Browser.Navigate
Note that the RowContainer is selected on top. It must select each row in the list of threads, and each row must contain both the thread ID and the thread link (the one that visits the actual thread).
Regarding duplicates, I wouldn't worry about them during extraction; you can just remove them in a query. If you right-click a table set and select Create Query, there's a Distinct option that removes duplicates.
Regarding scheduled runs, the best way is to use the command line, but you probably already know that. If you just need to run an extraction every X minutes, you could add a global called LoopAction with this code:
Code:
function (action delayMinutes)
action
Action.Run
· Browser.Load
· "helium://start"
Browser.Wait
· *
· delayMinutes
· 1000
· 60
LoopAction
· action
· delayMinutes
Then, to run an extraction every 30 minutes (or, more precisely, with 30-minute delays between runs), you'd do this:
Code:
LoopAction
· Action.Extract
· MyGlobal
· "MyGlobal"
· 30
Note that this will keep running until you stop it. Here, MyGlobal is the name of the global you'd normally run manually to start the extraction.
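If it helps to see the stop condition and the duplicate handling outside Helium Scraper, here is a rough Python sketch of the same two ideas: walk the forum's thread IDs from newest to oldest until the latest stored ID is reached, and let the database reject repeated IDs. This is only an illustration, not Helium Scraper code. The table name threads, the column thread_id, and the helper names are made up for the example, and it assumes newer threads always have higher IDs.

```python
import sqlite3
from itertools import takewhile


def latest_stored_id(conn):
    """Return the highest thread ID already stored, or None if the table is empty."""
    row = conn.execute("SELECT MAX(thread_id) FROM threads").fetchone()
    return row[0]


def new_thread_ids(forum_ids, latest_id):
    """Collect IDs from the listing (newest first) until the stored ID is reached."""
    return list(takewhile(lambda thread_id: thread_id != latest_id, forum_ids))


def store_threads(conn, thread_ids):
    """Insert IDs; PRIMARY KEY plus INSERT OR IGNORE silently skips duplicates."""
    conn.executemany(
        "INSERT OR IGNORE INTO threads (thread_id) VALUES (?)",
        [(thread_id,) for thread_id in thread_ids],
    )
    conn.commit()


# Hypothetical in-memory database standing in for the real SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (thread_id INTEGER PRIMARY KEY)")
store_threads(conn, range(100, 49, -1))  # yesterday's run: threads 100 down to 50

forum_ids = range(200, -1, -1)           # today's listing: threads 200 down to 0
todo = new_thread_ids(forum_ids, latest_stored_id(conn))
store_threads(conn, todo)
store_threads(conn, todo)                # a repeated run adds nothing new
```

With the numbers from the question, todo ends up holding threads 200 down to 101, and the second store_threads call leaves the table unchanged. Removing duplicates afterwards with a Distinct query, as suggested above, gives the same end result without needing a key on the table.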
Juan Soldi
The Helium Scraper Team