Example JavaScript for manipulating HTML Tables

Post by **cw42** » Wed Mar 28, 2012 5:34 pm

Good afternoon/evening HS Team!

I started evaluating Helium Scraper today and have come across an issue with extracting information from different rows of an HTML table. Specifically, the DB table that HS generates creates a new row for each piece of information (i.e., name, phone number, title, etc.), despite the fact that I am collecting each one as a different kind.

It doesn't help that the website in question has all the information in one giant table - though the type of information repeats itself every 7 rows.
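As a rough illustration of the regrouping described here, this is a minimal sketch in plain JavaScript that folds a flat list of extracted values back into one record per 7 values. It assumes every record really does occupy exactly 7 rows, and `groupIntoRecords` and the sample data are made up for the example; none of this is Helium Scraper's own API.

```javascript
// Assumption from the post: the pattern repeats every 7 rows.
const ROWS_PER_RECORD = 7;

// Split a flat array of extracted values into consecutive chunks,
// one chunk per record. A short trailing chunk is kept as-is.
function groupIntoRecords(values, size = ROWS_PER_RECORD) {
  const records = [];
  for (let i = 0; i < values.length; i += size) {
    records.push(values.slice(i, i + size));
  }
  return records;
}

// Made-up data: 14 flat values become 2 records of 7 values each.
const flatValues = Array.from({ length: 14 }, (_, i) => `value ${i + 1}`);
const records = groupIntoRecords(flatValues);
console.log(records.length); // 2
console.log(records[1][0]);  // "value 8"
```

If some records take fewer than 7 rows, fixed-size chunking like this will drift out of alignment, and a marker row (for example, a row that always holds the name) would be a safer place to split.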

Anyway, now that I have the back-story out of the way: do you have any sample JavaScript for controlling how extracted information is stored in the HS DB?

Thanks!

Post by **webmaster** » Thu Mar 29, 2012 2:57 am

You could use a Navigate URLs action. Look in the documentation at Actions -> Actions List -> Navigate URLs.

Post by **rmbraaten** » Thu Mar 29, 2012 2:14 pm

cw42,

It looks like the reply you got from Webmaster was meant to answer my question, "How do I use a table of URLs as my starting point?" and mistakenly posted as a reply to yours.

Interestingly, I'm now experiencing the same thing you are, with rows for each type rather than for each "record," which is NOT how it usually works. Something is definitely amiss. I just ran into this problem setting up a "Search in any search engine" project.

Matt

Post by **cw42** » Thu Mar 29, 2012 2:59 pm

Here's an update for everyone!

So after looking at the HTML for my target site in detail, it became apparent that the result set is an HTML table with roughly 300 rows per table.

So I ended up capturing the Text of Column 1 and Column 2, as well as the Inner HTML of Column 2 (websites and phone numbers are buried in there).
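For anyone who would rather do that last step in JavaScript, here is a minimal sketch of pulling links and phone numbers out of a captured Inner HTML string with regular expressions. The sample markup, the function names, and the North American phone format are all assumptions, since the real cell contents are not shown in this thread.

```javascript
// Made-up cell content standing in for the Inner HTML of Column 2.
const innerHtml =
  'Acme Corp<br><a href="http://www.example.com">Website</a><br>Phone: (555) 123-4567';

// Collect the value of every double-quoted href attribute.
function extractLinks(html) {
  const links = [];
  const hrefPattern = /href\s*=\s*"([^"]*)"/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Strip tags first so digits inside attributes are not matched, then
// look for numbers shaped like (555) 123-4567, 555-123-4567 or 555.123.4567.
function extractPhones(html) {
  const text = html.replace(/<[^>]*>/g, " ");
  return text.match(/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g) || [];
}

console.log(extractLinks(innerHtml));  // [ 'http://www.example.com' ]
console.log(extractPhones(innerHtml)); // [ '(555) 123-4567' ]
```

Regular expressions are fine for a quick pass over small, predictable fragments like this; messier markup would be better handled by a real HTML parser.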

So Matt, as for me, I am going to push this to SSIS and have it do the transformation and loading into my DB environment.

I might still play with HS and JavaScript to manipulate the tables for fun, but I have a full suite of RegEx tools coupled with parsing and programming capability that I can take advantage of in SSIS.

The fun part is that now I have to scale the solution to review 1.4 million entities that take up to 7 rows each in the DB (comes to 28K pages and some change). Time for some Helium Scraper performance testing.

Thanks!

Chuck

Post by **rmbraaten** » Thu Mar 29, 2012 3:33 pm

Thanks Chuck,

Hah! Hah! Sounds like you have a lot of workaround tools in your toolbelt. Handy.

The problem does seem related to search results formatted into multiple tables. Not all sites, of course, run into this problem. Scraping Google search results works fine for me. Scraping search results from my institution's website, on the other hand, doesn't.

I'm still hopeful for a less-intensive fix, but until then, I'll likely just modify my scraping to only 1 type for an index table of search result links and then use the Navigate Each to build a separate but related table for the other types/fields.

Good luck!

Matt

Post by **webmaster** » Fri Mar 30, 2012 2:55 am

Hi guys,

Can any of you post a link to the pages that are causing data to spread across many rows? Helium Scraper's Extract action organizes data into rows according to the HTML structure. If the data is structured as an HTML table and each kind corresponds to one column, then the output must come out as a table similar to the HTML table, since each cell has the same row as its parent. If you're positive this is an HTML table, make sure all your kinds are selecting the right elements (one kind selecting, say, an extra item outside the table could mess up the whole thing). Perhaps you could try extracting any two kinds, in different combinations, then three, and so on until you figure out which kind is messing things up.
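To make the parent-row idea concrete, here is a small sketch in plain JavaScript. It is only an illustration of the principle, not Helium Scraper's actual implementation: each object stands in for one selected element, and the `kind`, `parent` and `text` fields and all sample values are made up.

```javascript
// Each item stands in for one element selected by a kind, tagged with an
// identifier for its parent row.
const selected = [
  { kind: "Name",  parent: "tr1",    text: "Alice" },
  { kind: "Phone", parent: "tr1",    text: "555-0100" },
  { kind: "Name",  parent: "tr2",    text: "Bob" },
  { kind: "Phone", parent: "tr2",    text: "555-0101" },
  { kind: "Phone", parent: "footer", text: "555-0199" }, // stray match outside the table
];

// Elements that share a parent end up in the same output row.
function groupByParent(items) {
  const byParent = new Map(); // a Map keeps insertion order
  for (const { kind, parent, text } of items) {
    if (!byParent.has(parent)) byParent.set(parent, {});
    byParent.get(parent)[kind] = text;
  }
  return [...byParent.values()];
}

const outputRows = groupByParent(selected);
console.log(outputRows.length); // 3
console.log(outputRows[2]);     // { Phone: '555-0199' }
```

The stray `footer` element gets a row of its own with no Name, which is the kind of damage a single badly targeted kind can do.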

I've seen some pages where the data looks organized to the eye, but the HTML gives you no clue about the structure of the data (think of a bunch of divs, all with the same direct parent, absolutely positioned to look like a table), and the Extract action will have no clue about how to organize it either. Perhaps an interesting thing to do would be to write an Extract action that uses the elements' x and y positions instead of the HTML structure to figure out how to organize the output data. But again, a link would help figure this stuff out. I myself haven't seen this happen for a while.
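A position-based grouping along those lines might look roughly like the sketch below. It is a hypothetical illustration in plain JavaScript, not an existing Helium Scraper action: the `x`, `y` and `text` fields, the 5-pixel tolerance and the sample boxes are all assumptions. In a browser the coordinates could come from each element's `getBoundingClientRect()`.

```javascript
// Group elements into rows by vertical position, then order each row
// left to right. Elements whose y is within `tolerance` pixels of the
// first element of the current row are treated as the same row.
function groupByPosition(elements, tolerance = 5) {
  const sorted = [...elements].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const el of sorted) {
    const current = rows[rows.length - 1];
    if (current && Math.abs(el.y - current.y) <= tolerance) {
      current.cells.push(el);
    } else {
      rows.push({ y: el.y, cells: [el] });
    }
  }
  return rows.map((row) =>
    row.cells.sort((a, b) => a.x - b.x).map((cell) => cell.text)
  );
}

// Made-up boxes, listed out of order as they might appear in the HTML.
const boxes = [
  { x: 200, y: 12, text: "555-0100" },
  { x: 10,  y: 10, text: "Alice" },
  { x: 10,  y: 40, text: "Bob" },
  { x: 200, y: 41, text: "555-0101" },
];

const table = groupByPosition(boxes);
console.log(table); // [ [ 'Alice', '555-0100' ], [ 'Bob', '555-0101' ] ]
```

The tolerance is there because cells in the same visual row rarely line up to the exact pixel; the right value depends on the page.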
