Hello.
First of all, thank you for a great app.
My question is whether I can download the results of my query as HTML files. I was able to retrieve the data from Common Crawl using your tool, but I need the extracted HTML pages. Is this possible? If so, how?
Thank you.
Common Crawler: is it possible to download the found html files?
- Posts: 2
- Joined: Mon Jun 08, 2020 2:56 pm
Re: Common Crawler: is it possible to download the found html files?
We've just updated Common Crawler to include the Sequence.WriteFile function. If you don't get an update prompt, this may be because we've migrated the publish location to AWS. If so, just uninstall it and reinstall it from here.
Once you have the latest version (3.2.4.9) you can do this to save the HTML into files:
Code:
Crawl.LoadAll
· "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
fileName
fileName
file
Gather.HTML
as html
Sequence.WriteFile
· html
· +
· fileName
· ".html"
· false
Or if you just want to extract the full HTML into a table, it's even simpler:
Code:
Crawl.LoadAll
· "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
fileName
fileName
file
Gather.HTML
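For anyone unsure what the percent-encoded query string passed to Crawl.LoadAll actually asks for, here is a rough Python sketch that decodes it. It is not part of Common Crawler; the `describe_query` helper is hypothetical, and reading the "2018-2019" prefix as a crawl-year selector and the leading "=" in each filter as an exact-match operator are assumptions.

```python
from urllib.parse import parse_qs

# Query string copied verbatim from the Crawl.LoadAll call above.
QUERY = (
    "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A"
    "&limit=100&filter==status:200&filter==mime:text%2Fhtml"
)


def describe_query(query: str) -> dict:
    """Split the query into its prefix and percent-decoded parameters."""
    # Assumption: the part before "?" selects which crawls to search.
    prefix, _, params = query.partition("?")
    decoded = parse_qs(params)
    return {
        "prefix": prefix,
        "url": decoded["url"][0],
        "limit": int(decoded["limit"][0]),
        # parse_qs splits each pair on the first "=" only, so every
        # filter value keeps its own leading "=".
        "filters": decoded["filter"],
    }


if __name__ == "__main__":
    for key, value in describe_query(QUERY).items():
        print(f"{key}: {value}")
```

Read this way, the query matches up to 100 captures of URLs under https://www.imdb.com/title/ that returned status 200 with a text/html MIME type.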
Juan Soldi
The Helium Scraper Team
Re: Common Crawler: is it possible to download the found html files?
Thank you very much for your help! I've just manually updated the app, followed your guide and it worked perfectly.