Common Crawler: is it possible to download the found html files?

Post by **mangowuvvr69** » Mon Jun 08, 2020 3:02 pm

Hello.

First of all, thank you for a great app.

My question is whether I can download the results of my query as HTML files. I was able to retrieve the data from Common Crawl using your tool, but I need the extracted HTML files. Is this possible? If so, how?

Thank you.

Post by **webmaster** » Tue Jun 09, 2020 9:12 pm

We've just updated Common Crawler to include the Sequence.WriteFile function. If you don't get an update prompt, this may be because we've migrated the publish location to AWS. If so, just uninstall it and reinstall it from here.

Once you have the latest version (3.2.4.9) you can do this to save the HTML into files:

Code:

Crawl.LoadAll
   ·  "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
   fileName
      fileName
   file
      Gather.HTML
      as html
      Sequence.WriteFile
         ·  html
         ·  +
               ·  fileName
               ·  ".html"
         ·  false
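
For readers unsure what the URL-encoded string passed to Crawl.LoadAll contains, here is a minimal Python sketch that decodes it with the standard library. It only illustrates the encoding and is not part of Helium Scraper. The helper name `decode_query` is hypothetical, and reading the part before the `?` as a crawl year range is an assumption.

```python
from urllib.parse import parse_qs

# Query string copied verbatim from the Crawl.LoadAll call above.
QUERY = (
    "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A"
    "&limit=100&filter==status:200&filter==mime:text%2Fhtml"
)


def decode_query(query):
    """Split the string at '?' and percent-decode the parameters after it."""
    prefix, _, params = query.partition("?")
    # parse_qs splits each pair on the first '=', so the second '=' in
    # 'filter==status:200' stays at the start of the value.
    return prefix, parse_qs(params)


prefix, params = decode_query(QUERY)
print(prefix)
for name, values in params.items():
    print(name, values)
```

Decoded, the query asks for pages matching `https://www.imdb.com/title/*`, limited to 100 results, keeping only records whose status is 200 and whose MIME type is `text/html`.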

Or if you just want to extract the full HTML into a table, it's even simpler:

Code:

Crawl.LoadAll
   ·  "2018-2019?url=https%3A%2F%2Fwww.imdb.com%2Ftitle%2F%2A&limit=100&filter==status:200&filter==mime:text%2Fhtml"
as (digest fileName length mime mimeDetected offset status timestamp url urlKey crawlUrl)
extract
   fileName
      fileName
   file
      Gather.HTML
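
As background on the fileName, offset and length columns listed in the `as (...)` line: in the Common Crawl index they locate a single record inside a large archive file, so one page can be fetched with an HTTP Range request. Below is a rough Python sketch of that calculation, outside Helium Scraper. The function name `warc_range_request`, the sample values, and the base URL `https://data.commoncrawl.org/` are assumptions for illustration. This is not necessarily how Gather.HTML works internally.

```python
def warc_range_request(file_name, offset, length,
                       base_url="https://data.commoncrawl.org/"):
    """Return (url, headers) for fetching one record via an HTTP Range request.

    base_url is an assumed public download host, not taken from this thread.
    """
    start = int(offset)
    end = start + int(length) - 1  # the end of an HTTP byte range is inclusive
    return base_url + file_name, {"Range": "bytes={}-{}".format(start, end)}


# Hypothetical values standing in for one row of the result table.
url, headers = warc_range_request("crawl-data/example.warc.gz", 1000, 500)
print(url)
print(headers)
```

The bytes returned by such a request would normally be a gzip-compressed record that still needs decompressing before the HTML can be read. Gather.HTML handles all of this for you.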

Post by **mangowuvvr69** » Wed Jun 10, 2020 10:11 am

Thank you very much for your help! I've just manually updated the app, followed your guide and it worked perfectly.
