Helium Scraper's Blog All kind of useful and useless stuff related to Helium Scraper

15 May 2011

The often overlooked JavaScript Gatherers

Gatherers are the eyes of Helium Scraper. And JavaScript gatherers are its user-customized eyes. Let me give you a quick example.

I had a user who was having trouble with a kind that was supposed to select a "next" button in a page. It worked fine on the first page, but when he added the "next" button on the second page to the kind, his kind also started selecting the "back" button. Helium Scraper couldn't find any difference between the "back" and the "next" buttons, given the set of properties that defined his kind. But if he and I could tell the difference just by looking at them, then Helium Scraper should be able to do so too.

The difference was in the image of the buttons. One of them was a little red left arrow and the other one a right arrow. So all he needed to do was activate the "SrcAttribute" gatherer from Project -> Options -> Select Active Properties. This property gatherer gets the "src" attribute of the element, which contains the URL of the element's image. After doing this, Helium Scraper started selecting only the "next" button on every page.

This is how property gatherers work. When creating a kind, Helium Scraper gathers every active property from every element in a webpage and generates a list of the properties that are common to every element we have added to this kind. This list is the definition of the kind. So, for instance, if we tell Helium Scraper to take the color of the elements into consideration when creating kinds (by activating a gatherer that gets the color of the element, such as the "BackgroundColor" one), and we create a kind using elements that are all red, then this kind will only select red elements. But if we use elements with different colors, this property will be removed from the kind's definition and the kind will select elements of any color.
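The idea of keeping only the shared properties can be sketched in plain JavaScript. This is just an illustration of the logic described above, not Helium Scraper's actual code: the names gatherProperties, buildKind and matches are made up for the example, and plain objects stand in for page elements.

```javascript
// Hypothetical gatherers: each one maps an element to a value.
var gatherers = {
    TagName: function (element) { return element.tagName; },
    BackgroundColor: function (element) { return element.backgroundColor; }
};

// Run every active gatherer on one element.
function gatherProperties(element) {
    var props = {};
    for (var name in gatherers) {
        props[name] = gatherers[name](element);
    }
    return props;
}

// Keep only the properties whose value is the same for every sample.
function buildKind(samples) {
    var kind = gatherProperties(samples[0]);
    for (var i = 1; i < samples.length; i++) {
        var props = gatherProperties(samples[i]);
        for (var name in kind) {
            if (kind[name] !== props[name]) delete kind[name];
        }
    }
    return kind;
}

// An element belongs to the kind if it agrees on every remaining property.
function matches(kind, element) {
    var props = gatherProperties(element);
    for (var name in kind) {
        if (kind[name] !== props[name]) return false;
    }
    return true;
}

// All samples are red, so the color stays in the definition.
var redKind = buildKind([
    { tagName: "A", backgroundColor: "red" },
    { tagName: "A", backgroundColor: "red" }
]);

// The samples differ in color, so the color is dropped from the definition.
var anyColorKind = buildKind([
    { tagName: "A", backgroundColor: "red" },
    { tagName: "A", backgroundColor: "blue" }
]);
```

Here redKind rejects a blue link, while anyColorKind accepts a link of any color but still rejects elements with a different tag name.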

Now, JavaScript Gatherers are the ultimate way to tell Helium Scraper how to look at elements in a webpage. And they work in a straightforward way. When you create one of these, you get to write the body of a function that receives a parameter called "element". This function, as long as the gatherer is active, will be called for every single element in a webpage whenever you create a kind, and it must return a value. This value will be what Helium Scraper "sees" in the element when looking at it through your gatherer.
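To picture what "writing the body of a function" means, here is a rough sketch in plain JavaScript. Wrapping the body with new Function is only my assumption about how such a body could be turned into a function, not a description of Helium Scraper's internals, and the objects below are simple stand-ins for real page elements.

```javascript
// The text you would type into the gatherer: a function body that uses
// "element" and returns a value.
var gathererBody = 'return element.tagName + ":" + (element.className || "");';

// Assumption: the body becomes a function with one parameter named "element".
var gatherer = new Function("element", gathererBody);

// Stand-ins for the elements of a page.
var page = [
    { tagName: "A", className: "next" },
    { tagName: "A", className: "back" },
    { tagName: "IMG" }
];

// The gatherer is called once per element; each result is what is "seen".
var seen = page.map(function (element) { return gatherer(element); });
console.log(seen);
```

Note that the body has to return something sensible for every element, including those that lack the attribute it looks at, since it is called for all of them.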

So let's say we have a website from which we want to extract a bunch of links, but we only want the links that point to webpages in one or a few domains. Here is what I would do. I'd create a JavaScript gatherer that gets the domain of the URL of the links. Here is the code for that gatherer:

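  // Returns the domain part of an absolute URL, or "" when there is none.
  function getDomain(url)
  {
      // Elements without an "href" attribute give null here.
      if (!url) return "";
      var index = url.indexOf("://");
      // Relative links have no "://", so there is no domain to return.
      if (index < 0) return "";
      return url.substring(index + 3).split(/\/+/g)[0];
  }
  return getDomain(element.getAttribute("href"));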

This will return a domain given a link. I basically just googled the code by searching for something like "javascript get domain from url". For just about every small task such as this one, there will be some forum with a dude asking for the code and some good guy below posting it. But don't just copy and paste the code without having a clue what it does. Most of the time these code snippets will require some modification. Hey, if nothing else, at least test it.

So, to test the code, after creating the JavaScript gatherer, click on the "Select active properties" button in the selection panel at the bottom, deselect all, and then select only the gatherer you want to test. Then select a few elements in a webpage and the result will show up in the selection list.
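A snippet can also be checked outside Helium Scraper first, for instance in a browser console or in Node.js. The sketch below uses a getDomain function along the lines of the one above; fakeLink and linkDomain are hypothetical helpers, and fakeLink only imitates the getAttribute method of a real element.

```javascript
// Domain part of an absolute URL, or "" when there is none.
function getDomain(url) {
    if (!url) return "";
    var index = url.indexOf("://");
    if (index < 0) return "";
    return url.substring(index + 3).split(/\/+/g)[0];
}

// Hypothetical stand-in for a link element: only getAttribute is imitated.
function fakeLink(href) {
    return {
        getAttribute: function (name) { return name === "href" ? href : null; }
    };
}

// The gatherer body, wrapped in a function so it can be called directly.
function linkDomain(element) {
    return getDomain(element.getAttribute("href"));
}

// A few typical links plus two awkward ones: a relative link and no href.
var samples = [
    "http://www.example.com/",
    "http://www.example.com/products/page2.html",
    "https://forum.example.org/thread?id=7",
    "page2.html",
    null
];

var results = samples.map(function (href) { return linkDomain(fakeLink(href)); });
console.log(results);
```

The last two samples are the kind of input a copied snippet often mishandles, which is why they are worth including in a quick test.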

Now, going back to my example, if I wanted to create a kind that selects only links to the "www.example.com" domain, I would select a few links that point to more than one page in that domain and create a kind called "LinksToExample". This kind will now select links that point to any page in that domain. Now, if I didn't have any links that point to that domain to use as samples, I could always edit my kind manually by clicking on the "Edit kind" button in the kind editor. It takes you to an XML editor that displays the XML representation of the kind. If you know nothing about XML, don't panic. It's just the list of properties that define our kind. Each item in this list starts with the <Item> tag and ends with the </Item> tag.

So, if I only had links that point to domains I don't care about, I would create a kind that selects links to one of them, then, in the kind's XML, find this line (remember, my gatherer is called "JS_LinkDomain"):

  <Property>JS_LinkDomain</Property>

And right underneath, supposing I created my kind by selecting links that pointed to pages in the "www.DomainIDoNotWant.com" domain, change this line:

  <Value xsi:type="xsd:string">www.DomainIDoNotWant.com</Value>

to this one:

  <Value xsi:type="xsd:string">www.example.com</Value>

Now, in order for the "JS_LinkDomain" property to be listed in my kind definition's XML, I must have selected links that all point to the same domain when creating my kind. This is because, as I said before, when creating a kind, only properties that are common to every element used when creating it are listed in the kind's definition. If, for some reason, I had been forced to select links to different domains, I would just add this code right below the <Items> (note the "s") tag:

  <Item>
    <Property>JS_LinkDomain</Property>
    <Value xsi:type="xsd:string">www.example.com</Value>
  </Item>
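If several such entries are needed, the fragment can be generated rather than typed by hand. The following is a small sketch in plain JavaScript run outside Helium Scraper. The helper names escapeXml and itemXml are hypothetical, the element names and the xsi:type attribute are copied from the XML shown in this post, and only string values are assumed.

```javascript
// Escape the characters that are special inside XML text.
function escapeXml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Build one <Item> entry using the element names shown in this post.
// Assumption: the value is a plain string (xsd:string).
function itemXml(property, value) {
    return [
        "<Item>",
        "  <Property>" + escapeXml(property) + "</Property>",
        '  <Value xsi:type="xsd:string">' + escapeXml(value) + "</Value>",
        "</Item>"
    ].join("\n");
}

console.log(itemXml("JS_LinkDomain", "www.example.com"));
```

The printed text can then be pasted below the <Items> tag in the XML editor.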

Another important use for JavaScript gatherers is to transform our data before it is extracted. If I wanted to extract the URLs that a set of links point to, but just the domain part of each URL, all I'd need to do is set the property being extracted to "JS_LinkDomain" when creating my "Extract" action.
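As a sketch of the same idea, a gatherer could clean up the value a bit further before it is extracted, for example by lower-casing the domain and dropping a leading "www.". The name normalizeDomain is hypothetical and the code is written so it can be tried in a browser console or Node.js. Inside a gatherer, the body would end with the return statement shown in the comment.

```javascript
// Domain part of an absolute URL, or "" when there is none.
function getDomain(url) {
    if (!url) return "";
    var index = url.indexOf("://");
    if (index < 0) return "";
    return url.substring(index + 3).split(/\/+/g)[0];
}

// Hypothetical clean-up step: lower-case and strip a leading "www.".
function normalizeDomain(domain) {
    return domain.toLowerCase().replace(/^www\./, "");
}

// Inside a gatherer, the body would end with something like:
//     return normalizeDomain(getDomain(element.getAttribute("href")));

var hrefs = [
    "http://WWW.Example.com/about",
    "https://example.com/contact",
    "http://shop.example.com/"
];

var extracted = hrefs.map(function (href) {
    return normalizeDomain(getDomain(href));
});
console.log(extracted);
```

With this, links to "WWW.Example.com" and "example.com" come out as the same value, while "shop.example.com" stays distinct.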


Comments (1) Trackbacks (0)
  1. Nice tutorial bro!!
    Can you post more video tutorials?

