Dealing with broken HTML

Plenor
Posts: 3
Joined: Mon Dec 10, 2012 8:00 am

Dealing with broken HTML

Post by Plenor » Mon Dec 10, 2012 8:10 am

I'm trying to scrape a page that has terrible markup. For instance, none of the P tags are closed. How can I deal with this?

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Dealing with broken HTML

Post by webmaster » Tue Dec 11, 2012 6:09 pm

It depends on the page. Are you unable to select kinds? Do you have a sample URL? If so, which elements are you trying to select and extract?
Juan Soldi
The Helium Scraper Team

Plenor
Posts: 3
Joined: Mon Dec 10, 2012 8:00 am

Re: Dealing with broken HTML

Post by Plenor » Tue Dec 11, 2012 7:02 pm

I can't link the actual page, but here's a sample you can test with: http://helium.staticloud.com/

Plenor
Posts: 3
Joined: Mon Dec 10, 2012 8:00 am

Re: Dealing with broken HTML

Post by Plenor » Fri Dec 14, 2012 5:40 am

I tried running it through this beautifier (being sure to redo the kinds) with no luck: https://github.com/einars/js-beautify/b ... fy-html.js

webmaster
Site Admin
Posts: 521
Joined: Mon Dec 06, 2010 8:39 am

Re: Dealing with broken HTML

Post by webmaster » Mon Dec 17, 2012 2:42 am

Hi,

I'm assuming you're at least able to create kinds properly in these broken pages.

I'm not sure whether the pages you're trying to scrape are as broken as the one you sent me, but first try the Force elements into same row premade at Files -> Online Premade. Please follow the instructions in the premade's description carefully and see if this helps.

If not, you'll need both that premade and the attached premade, FlattenPage.hsp. It contains an actions tree called FlattenPage that, when run (either directly or from an Execute Actions Tree action), flattens the HTML of the page and turns it into one-level-deep HTML. You need to run Flatten Page on every web page before doing anything else at all. So, before creating any kind (including the Heading kind you'll need to create for the Force elements into same row premade), run the Flatten Page actions tree.

Note that the Force elements into same row premade requires you to run an actions tree called Do Wrap (you'll see this in the description) before creating your other kinds. In this case you'll need to call Flatten Page before calling Do Wrap (and before creating your kinds).
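
To give a rough idea of what "flattening" means here, below is a small Python sketch using only the standard library. It is not the contents of FlattenPage.hsp and not how Helium Scraper does it internally; it only assumes that flattening means turning every text run into its own top-level element, whether or not the original tags were closed. The names Flattener, flatten and to_flat_html are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Flattener(HTMLParser):
    """Collect every non-empty text run as a (tag, text) pair, ignoring nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # tags opened so far and not yet closed
        self.rows = []       # flat output: one (tag, text) pair per text run

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Tolerate broken markup: close back to the nearest matching open tag
        # and ignore end tags that never had a matching start tag.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            tag = self.open_tags[-1] if self.open_tags else "body"
            self.rows.append((tag, text))


def flatten(html):
    """Return the page as a flat list of (tag, text) pairs."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_flat_html(rows):
    """Render the flat rows as one-level-deep HTML, one element per line."""
    return "\n".join(f"<{tag}>{escape(text)}</{tag}>" for tag, text in rows)


if __name__ == "__main__":
    broken = "<div><p>One<p>Two <b>bold</b> tail<p>Three</div>"
    print(to_flat_html(flatten(broken)))
```

With the sample string above, none of the P tags are closed, yet each text run still ends up as its own sibling element (`<p>One</p>`, `<p>Two</p>`, `<b>bold</b>`, `<p>tail</p>`, `<p>Three</p>`). That is the property that makes the later row-grouping step workable on broken pages.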

I'm also attaching a project that extracts data from the sample page you sent by doing what I describe above.
Attachments
Sample.hsp
(550.76 KiB) Downloaded 572 times
FlattenPage.hsp
(489.91 KiB) Downloaded 581 times
Juan Soldi
The Helium Scraper Team
