Dealing with broken HTML
I'm trying to scrape a page that has terrible markup. For instance, none of the P tags are closed. How can I deal with this?
Re: Dealing with broken HTML
It depends on the page. Are you unable to select kinds? Do you have a sample URL? If so, which elements are you trying to select and extract?
Juan Soldi
The Helium Scraper Team
Re: Dealing with broken HTML
I can't link the actual page but here's a sample you can test with http://helium.staticloud.com/
Re: Dealing with broken HTML
I tried running it through this beautifier (being sure to redo the kinds) with no luck: https://github.com/einars/js-beautify/b ... fy-html.js
Re: Dealing with broken HTML
Hi,
I'm assuming you're at least able to create kinds properly in these broken pages.
I'm not sure if the pages you're trying to scrape are as broken as the one you sent me, but first, try using the Force elements into same row premade at Files -> Online Premade. Please carefully follow the instructions in the premade's description and see if this helps.
If not, then you'll need both the premade mentioned above and the attached premade called FlattenPage.hsp. This premade contains an actions tree called FlattenPage that, when run (either directly or from an Execute Actions Tree action), flattens the HTML of the page and turns it into one-level-deep HTML. What you'd need to do is run the Flatten Page actions tree on every web page before doing anything else at all. So, before creating any kind (including the Heading kind you'll need to create for the Force elements into same row premade), run the Flatten Page actions tree.
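To give a rough idea of what "flattening" means here, below is a minimal Python sketch. It is not the code inside FlattenPage.hsp, and the names `Flattener`, `flatten` and `to_html` are made up for illustration. It assumes it is enough to keep each piece of text together with the most recently opened tag and to ignore nesting entirely, which also means unclosed P tags stop mattering.

```python
from html import escape
from html.parser import HTMLParser


class Flattener(HTMLParser):
    """Collect (tag, text) pairs and ignore nesting (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.current = None   # most recently opened tag
        self.items = []       # flat list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        # A new start tag simply replaces the current one; missing
        # closing tags therefore never cause deeper nesting.
        self.current = tag

    def handle_data(self, data):
        text = data.strip()
        if text and self.current is not None:
            self.items.append((self.current, text))


def flatten(html):
    """Return the page content as a flat list of (tag, text) pairs."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.items


def to_html(items):
    """Render the flat list as one-level-deep, properly closed HTML."""
    return "".join(f"<{tag}>{escape(text)}</{tag}>" for tag, text in items)


broken = "<div><h2>Title<p>First<p>Second</div>"
items = flatten(broken)
print(items)
print(to_html(items))
```

This sketch attributes any text that follows a closing tag to the last opened tag, which is a simplification a real page may not tolerate.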
Note that the Force elements into same row premade requires you to run an actions tree called Do Wrap (you'll see this in the description) before creating your other kinds. In this case you'll need to call Flatten Page before calling Do Wrap (and before creating your kinds).
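The reason the order matters can be illustrated with a small Python sketch of the wrapping idea. This is only my illustration of the concept, not what Do Wrap actually executes. It assumes the page is already a flat list of (tag, text) pairs and that each row starts at a heading element; `group_rows` and the `heading_tag` parameter are hypothetical names.

```python
def group_rows(items, heading_tag="h2"):
    """Group a flat list of (tag, text) pairs into rows.

    Each heading starts a new row and every following item is
    attached to it until the next heading appears. Items that come
    before the first heading are dropped.
    """
    rows = []
    for tag, text in items:
        if tag == heading_tag:
            rows.append({"heading": text, "fields": []})
        elif rows:
            rows[-1]["fields"].append(text)
    return rows


flat = [("h2", "A"), ("p", "1"), ("p", "2"), ("h2", "B"), ("p", "3")]
for row in group_rows(flat):
    print(row["heading"], row["fields"])
```

If the elements were still nested unpredictably because of missing closing tags, a grouping step like this would have no reliable sequence of siblings to walk through, which is why the flattening has to come first.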
I'm also attaching a project that extracts data from the sample page you sent by doing what I describe above.
Attachments
- Sample.hsp (550.76 KiB)
- FlattenPage.hsp (489.91 KiB)
Juan Soldi
The Helium Scraper Team