Helium Scraper's Blog All kind of useful and useless stuff related to Helium Scraper

8 Oct 2011

New features video tutorials

Here are two of the main new features implemented in Helium Scraper since version 2.2.

This video shows how to extract data to many tables and maintain the relations among the extracted elements in the output database:

 

This video uses the data extracted in the first video to show how you can export your extracted data to virtually any kind of document:

 

 

Filed under: Miscellaneous No Comments
15 May 2011

The often overlooked JavaScript Gatherers

Gatherers are the eyes of Helium Scraper. And JavaScript gatherers are its user-customized eyes. Let me give you a quick example.

I had a user having trouble with a kind that was supposed to select a "next" button in a page. It worked fine on the first page, but when he added the "next" button on the second page, his kind also started selecting the "back" button. Helium Scraper couldn't find any difference between the "back" and the "next" buttons, given the set of properties that defined his kind. But if he and I could tell the difference just by looking at them, then Helium Scraper should be able to do so too.

This difference was in the image of the buttons. One of them was a little red left arrow and the other one a right arrow. So all he needed to do was activate the "SrcAttribute" gatherer from Project -> Options -> Select Active Properties. This property gatherer gets the "src" attribute of the element, which contains the URL of the element's image. After doing this, Helium Scraper started selecting only the "next" button on every page.

This is how property gatherers work. When creating a kind, Helium Scraper will gather every active property from every element in a webpage, and generate a list of properties that are common to every element we have added to this kind. This list will be the definition of the kind. So, for instance, if we tell Helium Scraper to take the color of the elements into consideration, among other properties, when creating kinds (by activating a gatherer that gets the color of the element, such as the "BackgroundColor" one), and we create a kind using elements that are all red, then this kind will only select red elements. But if we use elements with different colors, this property will be removed from the kind's definition and this kind will select elements of any color.
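As a rough mental model of the paragraph above (an illustration only, not Helium Scraper's actual implementation), a kind's definition can be pictured as the set of property values shared by every sample element. The function name commonProperties, the "TagName" property and the sample data are hypothetical; only "BackgroundColor" comes from the text.

```javascript
// Keep only the properties whose value is identical in every sample.
// Each sample is a plain object mapping a property name to a gathered value.
function commonProperties(samples) {
    var definition = {};
    if (samples.length === 0) return definition;

    for (var name in samples[0]) {
        if (!samples[0].hasOwnProperty(name)) continue;
        var shared = true;
        for (var i = 1; i < samples.length; i++) {
            if (samples[i][name] !== samples[0][name]) {
                shared = false;
                break;
            }
        }
        if (shared) definition[name] = samples[0][name];
    }
    return definition;
}

// Two red links: the color stays in the definition.
var allRed = commonProperties([
    { TagName: "A", BackgroundColor: "red" },
    { TagName: "A", BackgroundColor: "red" }
]);

// One red and one blue link: the color is dropped, so any color matches.
var mixed = commonProperties([
    { TagName: "A", BackgroundColor: "red" },
    { TagName: "A", BackgroundColor: "blue" }
]);
```

In this model, allRed still lists BackgroundColor while mixed only keeps TagName, which mirrors why a kind built from differently colored samples stops caring about color.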

Now, JavaScript Gatherers are the ultimate way to tell Helium Scraper how to look at elements in a webpage. And they work in a straightforward way. When you create one of these, you get to write the body of a function that receives a parameter called "element". This function, as long as the gatherer is active, will be called for every single element in a webpage whenever you create a kind, and it must return a value. This value will be what Helium Scraper "sees" in the element when looking at it through your gatherer.

So let's say we have a website from which we want to extract a bunch of links, but we only want the links that point to webpages in one or a few domains. Here is what I would do. I'd create a JavaScript gatherer that gets the domain of the URL of the links. Here is the code for that gatherer:

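    // Returns the domain part of an absolute URL such as "http://www.example.com/page".
    function getDomain(url)
    {
        // The gatherer runs on every element, and most elements have no "href".
        // Returning an empty string for those is an assumption.
        if (!url) return "";
        var index = url.indexOf("://");
        return url.substring(index + 3).split(/\/+/g)[0];
    }
    return getDomain(element.getAttribute("href"));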

This will return a domain given a link. I basically just googled the code by searching for something like "javascript get domain from url". For almost every small task such as this one, there will be some forum with a dude asking for the code and some good guy below posting it. But don't just copy and paste the code without having a clue of what it does. Most of the time these code snippets will require some modification. Hey, if nothing else, at least test it.

So, to test the code, after creating the JavaScript gatherer, click on the "Select active properties" button in the selection panel at the bottom, deselect all, and then select only the gatherer you want to test. Then select a few elements in a webpage and the result will show up in the selection list.
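If you prefer to try the function in a plain JavaScript console first, a minimal sketch could look like the following. The fakeLink helper is a hypothetical stand-in for a link element (only getAttribute is stubbed), and the gatherer body is wrapped in a function called gatherer just so it can be called directly.

```javascript
function getDomain(url) {
    var index = url.indexOf("://");
    return url.substring(index + 3).split(/\/+/g)[0];
}

// Hypothetical stand-in for a link element; only getAttribute is stubbed.
function fakeLink(href) {
    return {
        getAttribute: function (name) {
            return name === "href" ? href : null;
        }
    };
}

// The gatherer body, wrapped in a function so it can be called directly.
function gatherer(element) {
    return getDomain(element.getAttribute("href"));
}

var first = gatherer(fakeLink("http://www.example.com/products/list.html"));
var second = gatherer(fakeLink("https://www.example.com"));
// Both calls give "www.example.com".

// A relative URL has no "://", so the first two characters get cut off.
var relative = gatherer(fakeLink("page.html"));
// relative is "ge.html"
```

The last call shows the kind of modification such snippets tend to need: this version only makes sense for absolute URLs.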

Now, going back to my example, if I wanted to create a kind that selects only links to the "www.example.com" domain, I would select a few links that point to more than one page in that domain and create a kind called "LinksToExample". This kind will now select links that point to any page in that domain. Now, if I didn't have any links that point to that domain to take as samples, I could always edit my kind manually by clicking on the "Edit kind" button in the kind editor. It will take you to an XML editor that displays the XML representation of the kind. If you know nothing about XML, don't panic. It's just the list of properties that define our kind. Each item in this list starts with the <Item> tag and ends with the </Item> tag.

So, if I only had links that point to domains I don't care about, I would create a kind that selects links to any of them, then, in the kind's XML, find this line (remember, my gatherer is called "JS_LinkDomain"):

    <Property>JS_LinkDomain</Property>

And right underneath, supposing I created my kind by selecting links that pointed to pages in the "www.DomainIDoNotWant.com" domain, change this line:

    <Value xsi:type="xsd:string">www.DomainIDoNotWant.com</Value>

to this one:

    <Value xsi:type="xsd:string">www.example.com</Value>

Now, in order for the "JS_LinkDomain" property to be listed in my kind definition's XML, I must have selected links that all point to the same domain when creating my kind. This is because, as I said before, when creating a kind, only properties that are common to every element used when creating it are listed in the kind's definition. If, for some reason, I had been forced to select links to different domains, I would just add this code right below the <Items> (note the "s") tag:

    <Item>
        <Property>JS_LinkDomain</Property>
        <Value xsi:type="xsd:string">www.example.com</Value>
    </Item>

Another important use for JavaScript gatherers is to transform our data before it is extracted. If I wanted to extract the URLs that a set of links point to, but just the domain part of each URL, all I'd need to do is set the property being extracted to "JS_LinkDomain" when creating my "Extract" action.


Filed under: Miscellaneous 1 Comment
14 May 2011

Programming Helium Scraper

I'm assuming you already have a little JavaScript knowledge. If not, here is a quick JavaScript tutorial that covers all you need to know for the purpose of this tutorial. I'm also assuming you have already used Helium Scraper before and have a basic idea of how to use it.

In Helium Scraper's "Execute JS" action, all the JavaScript code will be injected into the current webpage as a function and then this function will be called. This means that all your code will have full access to all the elements inside the current page. All the information you will ever need regarding javascript as related to Helium Scraper can be found in the documentation at Actions -> Actions List -> Execute JavaScript. You might find Helium Scraper's log, at Project -> View Log, useful when coding, since javascript errors will appear there.

So let's do some coding. First, make sure Helium Scraper's browser (the tab on the left with a little padlock) is on a webpage, any webpage. Go to an actions tree and add an "Execute JS" action. Remove the default line of code, and paste the following code:

    var currentUrl = window.location.href;
    alert(currentUrl);

Now press play to get a message that shows the current URL. The window object is a global object that represents the window in which the webpage resides, and contains information about it. Here are some more details about it. One of the most frequently used objects inside the window object is the document object.

Now let's mix this with kinds. Let's use this very page as our guinea pig. Navigate here with Helium Scraper, and create a kind called "Items" (do call it "Items" please!) that selects the following 3 elements:

  1. This is just random text
  2. Txet modnar tsuj si siht
  3. Hey, what's that in number 2?

Now go back to our code editor, delete any code if present and paste this:

    Global.Browser.SelectKind("Items");
    var selectedItemsCount = Global.Browser.Selection.Count;
    alert(selectedItemsCount);

Now press play to get a message box that shows "3". This code selects the kind "Items" in the first line, then assigns the number of selected elements to the selectedItemsCount variable, and then shows the value of that variable. Global here is a built-in Helium Scraper object that is passed as a parameter to our code (which, as I said above, is the code inside a function and can therefore receive parameters). The parameters that are passed to our code will always be Global, Tree and Node (unless there is some major update to Helium Scraper). See Actions -> Actions List -> Execute JavaScript in the documentation for more information about them.
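One way to picture this (a simplified model, not Helium Scraper's real internals) is that the text of the action becomes the body of a function whose parameters are Global, Tree and Node. The runAction helper and the stub objects below are hypothetical; only the parameter names and the Browser members come from the text above.

```javascript
// Hypothetical helper: turn the action's source text into a function
// that receives Global, Tree and Node, then call it.
function runAction(source, Global, Tree, Node) {
    var action = new Function("Global", "Tree", "Node", source);
    return action(Global, Tree, Node);
}

// Minimal stand-in for Global with an initially empty selection.
var fakeGlobal = {
    Browser: {
        Selection: { Count: 0 },
        SelectKind: function (kindName) {
            // Pretend every kind selects three elements.
            this.Selection.Count = 3;
        }
    }
};

var source =
    'Global.Browser.SelectKind("Items");\n' +
    'return Global.Browser.Selection.Count;';

var count = runAction(source, fakeGlobal, {}, {});
// count is 3
```

Because the code is a function body, it can use return, which becomes important later when an action needs to report true or false.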

Now delete the previous code and paste this one:

    Global.Browser.SelectKind("Items");

    for(var i = 0; i < Global.Browser.Selection.Count; i++)
    {
        alert(Global.Browser.Selection.GetItem(i).innerText);
    }

Press play and you will get three message boxes showing the text in each of our three items. Global.Browser.Selection.Count in this case will be 3, so this for loop will run three times and the values of the i variable will be 0, 1 and 2. The GetItem method of the Selection object gets the selected element at the specified zero-based index (this means the first item is item 0). Most elements contain a property called innerText that contains the text of the element. Here is a complete list of properties used by elements. They are under the "Property" column of the "Members Table". Note that this is Microsoft's table, so some of these properties might not be compatible with other browsers. Helium Scraper uses Internet Explorer's JavaScript interpreter, so we are good to go. Also, note that not all these properties apply to every kind of element.

Now let's play a little with the database. Create a table called "Table1" with two columns (a column in the table corresponds to a row in the "New Table" dialog where you create your table after pressing the "Create table" button) called "Column1" and "Column2". Leave everything else as is and press OK. Now double click the table, add three rows with random values, and press the "Save changes" button. Now go back to the code editor and replace the code with this:

    var results = Global.DataBase.Query("SELECT * FROM [Table1]").ToObjects();

    for(var i = 0; i < results.length; i++)
    {
        var row = results[i];
        var col1 = row.Column1;
        var col2 = row.Column2;
        alert("In row " + i + ", Column1 = " + col1 + " and Column2 = " + col2);
    }

This will access the content of our table and show it to us. The string "SELECT * FROM [Table1]" is a simple SQL query that selects all the records in the "Table1" table. Check out this page for a quick explanation of the SQL query in case you are still curious. The Query method runs the given SQL query and returns a DataBaseReader object, which contains a method called ToObjects. This method returns the results as an array of objects where each object corresponds to a row in the table, and each property of each object corresponds to a column. The documentation for the Query method is at Actions -> Actions List -> Execute JavaScript -> Class List -> DataBaseObject in Helium Scraper's documentation.

Now I'll put kinds and database together. Replace the previous code with this one:

    var results = Global.DataBase.Query("SELECT * FROM [Table1]").ToObjects();
    Global.Browser.SelectKind("Items");

    for(var i = 0; i < 3; i++)
    {
        var row = results[i];
        Global.Browser.Selection.GetItem(i).innerText = row.Column1;
    }

This is not remarkably useful code, but it does illustrate a point. It is also not robust at all, because it won't work properly unless we have exactly 3 rows in our table (don't count the last empty row in the data table editor) and our kind selects exactly 3 items. As you can see, after you press play, it will change the text of the items that our kind selects. And this is how easily you can hack a website with... nah, just kidding, you only changed the page in your local copy. You can also try changing row.Column1 to row.Column2 to see what happens.
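A slightly more defensive variant, sketched under the assumption that Selection.Count and the ToObjects() array behave as in the examples above, loops only over the pairs that exist on both sides. The helper name fillItems is hypothetical, and the stub objects exist only so the snippet can run outside Helium Scraper.

```javascript
// Copy Column1 of each row into the matching selected element,
// stopping at whichever runs out first: rows or selected elements.
function fillItems(selection, rows) {
    var count = Math.min(selection.Count, rows.length);
    for (var i = 0; i < count; i++) {
        selection.GetItem(i).innerText = rows[i].Column1;
    }
    return count;
}

// Stand-ins for Global.Browser.Selection and the ToObjects() result.
var elements = [{ innerText: "a" }, { innerText: "b" }, { innerText: "c" }];
var fakeSelection = {
    Count: elements.length,
    GetItem: function (index) { return elements[index]; }
};
var fakeRows = [{ Column1: "first" }, { Column1: "second" }];

var written = fillItems(fakeSelection, fakeRows);
// written is 2; the third element keeps its original text.
```

Inside an "Execute JS" action the same function could be called as fillItems(Global.Browser.Selection, results) right after selecting the kind.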

So the code above is useless, but if your kind selected an input box in a web page, you could use similar code to write, for instance, a search query in a search engine. Note that you would need to replace innerText with value, which represents the text that is written in an input box.

Now let's go ahead and do something less useless than everything I just did. Create a new project and navigate to Google (by the way, make sure "Google instant" is off if it's there). Select the input box (where you would type your search) and create a kind with it called "Input Box". Then select the "Search" button and create another kind with it called "Search Button". Type anything and press Search. Now select the input box at the top again and add it to the "Input Box" kind, then select the "Search" button and add it to the "Search Button" kind. Now go to your database and create a table called "Queries" with a single column called "Query" and then add a few rows with random values, but not too random, because these are going to be our search queries and we want Google to return results for them.

Now go to "Actions tree 1" and add an "Execute JS" action with this code:

    function TreeData()
    {
        this.CurrentRow = -1;
        this.Data = null;
    }

    Tree.UserData = new TreeData();
    Tree.UserData.Data = Global.DataBase.Query("SELECT * FROM [Queries]").ToObjects();

The function TreeData() line and the code below it between braces are the definition of an object called TreeData that will store the current row being read in the CurrentRow property and all the rows in the Data property. In the Tree.UserData = new TreeData(); line we are creating an instance of the object and assigning it to the UserData property of the Tree object. The Tree object is accessible from within the whole actions tree. On the last line we are assigning an array of objects, each of which represents a row in our data table, to the Data property of our object.

Now create another "Execute JS" action below (not inside) the previous one. (It is a good idea to name these "Execute JS" actions by typing in the Comment box so you know which is which. I will name the previous one "Init" and this last one "Read".) Then paste this code:

    Tree.UserData.CurrentRow++;

    if(Tree.UserData.Data.length > Tree.UserData.CurrentRow) return true;
    else return false;

If you were wondering why we set CurrentRow to -1 in the previous code, here is the answer. The first line in this code increases the value of CurrentRow by one, so the first value that it will have on the subsequent lines will be 0. Then, if the value of CurrentRow is less than the number of rows, we return true, and otherwise false. Returning true tells Helium Scraper to execute the child nodes (this is also explained in the documentation for the "Execute JavaScript" action). So let's add these child nodes now. Select the last "Execute JS" action you created and add inside it a "Select Kind" action that selects the "Input Box" kind and requires at least 1 element. Underneath this action, add an "Execute JS" action (I'll name it "Write") with this code:

    Global.Browser.Selection.GetItem(0).value = Tree.UserData.Data[Tree.UserData.CurrentRow].Query;

This will write the content of our current row to the currently selected element, which is the input box that our "Select Kind" action above just selected. Now under this action add a "Navigate" one with the "Search Button" kind. Check "Simulate click" and require at least 1 item. If you press play now, this will perform searches for each of our rows in our "Queries" data table.
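To see how the counter logic of "Init", "Read" and "Write" fits together, here is a small simulation that runs outside Helium Scraper. It assumes that "Read" and its child actions are repeated for as long as "Read" returns true (how the repetition is set up in the actions tree is not spelled out here), and it replaces the database query and the input box with hard-coded, hypothetical stand-ins.

```javascript
// Stand-in for the Tree object shared by all actions in the tree.
var Tree = { UserData: null };

function TreeData() {
    this.CurrentRow = -1;
    this.Data = null;
}

// "Init": load the rows (hard-coded here instead of a database query).
function init() {
    Tree.UserData = new TreeData();
    Tree.UserData.Data = [{ Query: "helium" }, { Query: "neon" }, { Query: "argon" }];
}

// "Read": advance to the next row and report whether one exists.
function read() {
    Tree.UserData.CurrentRow++;
    return Tree.UserData.Data.length > Tree.UserData.CurrentRow;
}

// "Write": return the query that would be typed into the input box.
function write() {
    return Tree.UserData.Data[Tree.UserData.CurrentRow].Query;
}

init();
var searched = [];
while (read()) {
    searched.push(write());
}
// searched is ["helium", "neon", "argon"] and CurrentRow ends at 3.
```

Starting CurrentRow at -1 is what lets the very first call to read() land on row 0.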

Now let's create another kind called "Titles" that selects the titles in the results page so we have something to extract. The titles are the blue links at the top of each result that you would click to navigate to the page. Then go back to your actions tree and, under the "Navigate: Search Button" action, add an "Extract" action that extracts our "Titles". In the "New Table" dialog, add another column called "Link" and set its kind to "Titles" and its property to "Link". Set the requirements of both items to "At Least" 1.

Now before we proceed to extract, add a "Wait" action right between the "Navigate" and the "Extract" action (you can use the up and down arrows to move your actions up and down) and set it to 1000 ms. This is because Google uses AJAX to load its results so we need to wait a little for the results to load. You could use a smaller wait time, such as 500 ms, but I prefer to be safe. Now press play and let the magic begin!

Here is the final result. If you have any questions related to programming Helium Scraper, don't hesitate to use our forums.


14 May 2011

Minimal JavaScript tutorial for non programmers

This is a quick JavaScript tutorial for total non programmers. I won't focus on JavaScript as applied to webpages, which is the case for most tutorials, because I'm mainly considering Helium Scraper users. So this tutorial comes in handy if you want to learn JavaScript without necessarily caring about how to design web pages.

First off, you'll need a place to test your code. If you don't have Helium Scraper installed, go to this page, place your test code in the box that says "JavaScript" and press "Run" to run it. If you do have Helium Scraper, go to any actions tree, add an "Execute JS" action and the JavaScript editor will appear. Erase the default line of code and press play to run your code. Let's do a simple test to make sure everything is ready for further coding. Paste this code in your editor:

    alert("Hello world!");

Run it, and a message box that says "Hello world!" should appear. If it did, then we are ready to start coding. I'll be giving you a bunch of code examples. I encourage you to play and experiment with them. As a quick tip, to see the value of a variable called, say, myVariable, just add the following line at the end of your code:

    alert(myVariable);



Variables

Variables are symbols we use to store information. We can name a variable anything we want as long as:

  • They start with a letter or the underscore ("_") symbol.
  • They contain only letters, numbers and underscore characters.
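The two rules above can be written down as a small check. This is only a sketch of those two rules (the function name isValidName is hypothetical); JavaScript itself also accepts the dollar sign in names and rejects reserved words such as var, which this check ignores.

```javascript
// Check a name against the two rules listed above:
// first character is a letter or "_", the rest are letters, digits or "_".
function isValidName(name) {
    return /^[A-Za-z_][A-Za-z0-9_]*$/.test(name);
}

var ok1 = isValidName("myVariable");   // true
var ok2 = isValidName("_count2");      // true
var bad1 = isValidName("2fast");       // false: starts with a digit
var bad2 = isValidName("my-variable"); // false: contains "-"
```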

JavaScript is case sensitive, which means a variable named myVariable is a different variable from one named myvariable. Even though you can create variables in more than one way, I'll show you the way that I consider the most intuitive. To use a variable you need to first create it like this:

    var myVariable;

This line of code is creating a variable called myVariable. The semicolon at the end is a way to tell the engine (the "engine" or "JavaScript engine" is what interprets our JavaScript code and executes it) that this is the end of our statement. It is optional, but it will prevent ambiguities later on when writing more complex code.

To assign a value to our variable we need to add a line that uses the assignment ("=") operator, as in this code:

    var myVariable;
    myVariable = "some text";


The assignment operator assigns whatever is on the right to whatever is on the left (note that it is not the same as the equality symbol in math). So in the code above, the second line assigns the text "some text" to the variable myVariable. Now, this code could have been written a little bit differently, even though it means the very same thing:

    var myVariable = "some text";

The information we store in a variable can be changed at any point, as in this code:

    var myVariable = "some text";
    myVariable = "some other text";
    myVariable = "and some other text";

In these examples we are storing strings (text) in the variable, but strings are not the only kind of data, or data type, we can store in them. We can also store numbers, boolean values (true and false), and objects (I'll talk about these later on):

    var myOtherVariable;

    // Store numbers:
    myOtherVariable = 12;
    myOtherVariable = -12;
    myOtherVariable = 1.02;

    // Store boolean values:
    myOtherVariable = true;
    myOtherVariable = false;

The lines that start with "//" are comments. The engine ignores everything from the "//" to the end of the line, so you can write whatever you want there.

You can also store the value of another variable in a variable, as in this code:

    var myVar1 = "value of myVar1";
    var myVar2 = "value of myVar2";
    myVar1 = myVar2;
    // the value of myVar1 is now "value of myVar2"


Operators

Assignment operator

The most important operator is by far the assignment operator ("="). We've been using it throughout the whole tutorial and all it does is assign whatever you put on its right side to whatever you put on its left side. You cannot put literals on the left of an assignment operator because they cannot receive a value. Literals are the values I have been writing directly on the right side throughout the Variables section above, such as "some text", 1.02 and false.

Mathematical operators

We can use mathematical operators with literals and variables as long as they hold numbers. These operators are + (addition), - (subtraction), * (multiplication) and / (division). We can also use parentheses as we would in a mathematical expression. In JavaScript an operation behaves just like a variable, except that it cannot receive a value (it cannot be on the left side of an assignment operator). So just like we can assign the value of myVar2 to myVar1 by writing myVar1 = myVar2, we can assign the result of an operation to a variable, as in the following code:

    var n1 = 10;
    var n2 = 20;
    var result = n1 + n2; // assigns the result of 'n1 + n2' to 'result'
    // result is 30!

We can also write literal numbers in the operations such as in this example:

    var n1 = 10;
    var n2 = 20;
    var result = (n1 + n2) / 2;
    // result is 15!

There are a few shorthands for simple operations that I'll summarize in the following table:

    Shorthand    Meaning
    x++;         x = x + 1;
    x--;         x = x - 1;
    x += y;      x = x + y;
    x -= y;      x = x - y;
    x *= y;      x = x * y;
    x /= y;      x = x / y;
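Here is a short run through all six shorthands, with the value of x after each step written in a comment. The starting values 10 and 4 are arbitrary.

```javascript
var x = 10;
var y = 4;

x++;      // x is now 11
x--;      // x is now 10
x += y;   // x is now 14
x -= y;   // x is now 10
x *= y;   // x is now 40
x /= y;   // x is now 10
```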

There is a special use for the addition operator when combined with strings: it joins the operands together and turns them into a longer string. The following example illustrates this:

    var dividend = 20;
    var divisor = 5;
    var quotient = dividend / divisor;
    var text = dividend + " divided by " + divisor + " is " + quotient + ".";
    alert(text); // Shows "20 divided by 5 is 4."

Comparison operators

A comparison operator compares its operands and returns a boolean value that indicates whether the comparison is true or false. Here is the table of the basic comparison operators and their meanings:

    Operator                    Description                                                              Example
    Equal (==)                  Returns true if the operands are equal.                                  x == y returns true if x equals y.
    Not equal (!=)              Returns true if the operands are not equal.                              x != y returns true if x is not equal to y.
    Greater than (>)            Returns true if left operand is greater than right operand.              x > y returns true if x is greater than y.
    Greater than or equal (>=)  Returns true if left operand is greater than or equal to right operand.  x >= y returns true if x is greater than or equal to y.
    Less than (<)               Returns true if left operand is less than right operand.                 x < y returns true if x is less than y.
    Less than or equal (<=)     Returns true if left operand is less than or equal to right operand.     x <= y returns true if x is less than or equal to y.

For instance, the following code:

    var myVar = 10 == 11;
    alert(myVar);

will show a message that says "false". These operators and the logical operators below will be very useful when combined with if/else statements and loops.

Logical operators

Logical operators use boolean values as operands and return a boolean value. Here is a list of them:

    Operator  Usage           Description
    and (&&)  expr1 && expr2  True if both logical expressions expr1 and expr2 are true. False otherwise.
    or (||)   expr1 || expr2  True if either logical expression expr1 or expr2 is true. False if both expr1 and expr2 are false.
    not (!)   !expr           False if expr is true; true if expr is false.

So, for instance, the following code will show a message that says "true":

    var myVar1 = 10 == 11;
    var myVar2 = false;
    alert((!myVar1) && (!myVar2));
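Comparison and logical operators are usually combined. The following sketch uses a made-up age variable to show one example of each logical operator; the variable names are only for illustration.

```javascript
var age = 25;

var isAdult = age >= 18;                          // true
var isTeenager = (age >= 13) && (age <= 19);      // false: 25 is above 19
var isChildOrSenior = (age < 13) || (age >= 65);  // false: neither side is true
var isNotAdult = !isAdult;                        // false
```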


Functions

Functions are pieces of code that perform a specific task. Later on, I'll show you how to write your own functions, but first let's see how to call or invoke them. JavaScript provides many useful built-in functions. To call a function, all you need to do is write the function name, followed by opening and closing parentheses, and if the function receives parameters, put them inside these parentheses separated by commas. For instance, the function alert receives an object, usually a string, as a parameter and shows the value of the given object in a message box. This is how we would show a message that says "Hello world":

    alert("Hello world");

Note that we could have used a variable as a parameter such as in this example:

    var someVar = "Hello world";
    alert(someVar); // Shows "Hello world"
    var someNumber = 32.5;
    alert(someNumber); // Shows 32.5

Now let's see how to write our own functions. To write a function you use the function keyword, followed by the function name, then opening and closing parentheses with optional parameters in between them, and then opening and closing braces with the code to be executed when the function is called in between them. Here is an example:

    function MyFunction()
    {
        alert("Hello from inside a function!");
    }

When this function is called, it will show a message that says "Hello from inside a function!". You would call this function in the same fashion we called the alert function:

    MyFunction();

Note that we are not passing any parameter to the function because this function doesn't receive any parameter. To write a function that receives parameters, you would put them in between the parentheses, as in this function:

    function ShowTwoThings(sayFirst, sayLast)
    {
        alert(sayFirst);
        alert(sayLast);
    }

    ShowTwoThings("say this", "and then this");

The code above defines a function called ShowTwoThings that receives two parameters. Not particularly useful, but illustrative enough. A concept to take into consideration here is the concept of a variable's scope. The scope is the section of code in which a variable exists. If I declare a variable inside a function, nothing outside the function knows that the variable exists at all. Also, parameters, such as sayFirst in the function above, are only accessible from within the function's body.
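A small sketch makes the idea of scope concrete. All the names are made up, and typeof is a built-in operator that gives the text "undefined" when asked about a name that does not exist in the current scope.

```javascript
var outsideVar = "I live outside";
var copyMadeInside;

function ShowScope() {
    // insideVar exists only while the code inside ShowScope is running.
    var insideVar = "I live inside";
    // Code in here can see both variables.
    copyMadeInside = outsideVar + " / " + insideVar;
}

ShowScope();

// Out here insideVar is unknown, so typeof reports "undefined".
var seenOutside = typeof insideVar;
```

After this runs, copyMadeInside holds "I live outside / I live inside" while seenOutside holds "undefined".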

Another important thing to know about functions is that they can return values by using the return keyword. The return keyword causes the code in the function to terminate (this is why it is usually, but not always, at the end of the function), and the function call then takes the value at the right of the return keyword. The following is a simple example of this:

    function AreaOfSquare(sideSize)
    {
        return sideSize * sideSize;
    }

    var area = AreaOfSquare(5);
    alert(area); // 25


Objects

The object is the main concept in object oriented programming. The goal of OOP is to make code more intuitive and easier to work with by organizing a program into objects that behave a lot like real life objects because they have properties and perform certain actions. So, for instance, a programmer writing a desktop application would write the code for an object that represents a button, name it something intuitive like "Button", put it in another file and forget about all the messy code inside it and just think of it as a button that has properties such as size and color, and performs actions such as changing its appearance from non pushed to pushed.

Before we start coding, I want you to keep in mind the difference between two concepts: a class and an object (also called an instance of a class). An object is a particular case of a class. For instance, "car" is a class. But my black car parked downstairs is an instance of the class "car": an object. We will use this distinction when dealing with objects in JavaScript.

Instances or objects can be assigned to variables just like we assign numbers or strings to them, but instead of representing a value, they contain other variables (called properties) and functions (called methods). In order to access a property or a method in an object, we use the accessor operator ("."). So if we have an object called myObject that contains a property called myProperty, we would write myObject.myProperty to access the myProperty property of the myObject instance. If the value of myProperty were also an object that contains a property called myDeeperProperty, we could access it this way: myObject.myProperty.myDeeperProperty.
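As a quick illustration of the accessor operator, the sketch below builds the objects by hand using the built-in Object class. The names are the ones used in the paragraph above; the stored text is made up.

```javascript
// Build an object that holds another object in one of its properties.
var myObject = new Object();
myObject.myProperty = new Object();
myObject.myProperty.myDeeperProperty = "deep value";

var inner = myObject.myProperty;                  // the inner object
var deep = myObject.myProperty.myDeeperProperty;  // "deep value"
var sameThing = inner.myDeeperProperty;           // also "deep value"
```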

To define a class, you write it just like you would write a function:

    function Person()
    {
    }

The difference between this and a function will be in the way we use it. The following code creates two instances of the class Person:

    var mrBob = new Person();
    var mrTim = new Person();

We use the new keyword to tell the engine we want to use Person as a class and create an instance of it. The code inside Person will still be called, just as if we were calling it as a function, and we can pass parameters to it, but we don't return a value from it (the result of new is the newly created object). The code inside Person, if used as a class, is what we would call the constructor of the class, because it is where we set the initial values of the properties of the object and do any other start-up stuff that needs to be done every time an instance of our class is created. Here is an improved version of the class Person:

    function Person(firstName, lastName)
    {
        this.FirstName = firstName;
        this.LastName = lastName;
    }

    var bob = new Person("Bob", "Smith");
    var tim = new Person("Tim", "Burton");

This Person now has a FirstName and a LastName. As you can see, when I create my two objects, I'm passing a name and a last name to each of them. I use the this keyword in the constructor to define two properties (FirstName and LastName) and assign the values passed as parameters to them. The this keyword represents the object itself from inside the object's code, and in this case, from the object's constructor. If we add the following code after the code above and run it, it will show us "Bob" and then "Burton":

    alert(bob.FirstName); // "Bob"
    alert(tim.LastName); // "Burton"

We can also change the object's properties any time such as in this code:

    alert(bob.FirstName); // "Bob"
    alert(tim.LastName); // "Burton"
    bob.FirstName = "Bobby";
    alert(bob.FirstName); // "Bobby"
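Methods were mentioned earlier as the functions an object contains. One simple way to give Person a method, sketched here with a hypothetical GetFullName method, is to store a function in a property from inside the constructor:

```javascript
function Person(firstName, lastName) {
    this.FirstName = firstName;
    this.LastName = lastName;

    // A method: a function stored in a property of the object.
    this.GetFullName = function () {
        return this.FirstName + " " + this.LastName;
    };
}

var bob = new Person("Bob", "Smith");
var before = bob.GetFullName();  // "Bob Smith"
bob.FirstName = "Bobby";
var after = bob.GetFullName();   // "Bobby Smith"
```

Because the method reads the properties through this, it always uses their current values, which is why the second call reflects the changed first name.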


Arrays

Arrays are a special kind of object that can contain more than one value. You cannot define arrays as you would define a class. You just create and use them. This is how to create an array:

    var myArray = new Array();

An array contains a numbered list of storage spots, starting at 0 and with virtually no upper limit, and we can store numbers, strings, boolean values and objects in them. To access the elements of an array you use the square bracket operators ("[" and "]"), as in this example:

    var myArray = new Array();

    // We can assign text to some element in myArray
    myArray[0] = "This is the value of the 0th element of myArray";
    myArray[1] = "This is the value of the 1st element of myArray";
    myArray[10] = "This is the value of the 10th element of myArray";

    // We can also assign numbers
    myArray[11] = 200;

    // And we can assign a new array!
    myArray[12] = new Array();

    // "myArray[12]" is itself an array, so we
    // can use square brackets to access items in it
    myArray[12][1] = "This is the value of the 1st element of an array that is in the 12th element of myArray";

The array object has a property called length that tells us how many spots the array currently spans, that is, the highest index in use plus one. The following example illustrates how to use this property:

    var someArray = new Array();
    alert(someArray.length); // 0
    someArray[0] = "zero";
    alert(someArray.length); // 1
    someArray[1] = "one";
    alert(someArray.length); // 2

    // This will expand our array to span 11 spots (0 to 10);
    // the spots from 2 to 9 were never assigned and are undefined
    someArray[10] = "ten";
    alert(someArray.length); // 11
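
Here is a short sketch of what length counts when spots are skipped, plus the push function, which appends a value at position length and which the scripts later in this blog rely on. The variable names are only for illustration.

```javascript
var someArray = new Array();
someArray[0] = "zero";
someArray[10] = "ten";

var lengthAfterGap = someArray.length; // 11, although only 2 values were stored
var skippedSpot = someArray[5];        // undefined, nothing was stored here

// push() appends at index 'length' and returns the new length
someArray.push("eleven");
var lastIndex = someArray.length - 1;  // 11
var lastValue = someArray[lastIndex];  // "eleven"
```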


If/Else statements

If/Else statements execute a piece of code if a given condition evaluates to true. Besides the boolean value true, any number other than 0 and any non-empty string evaluate to true when used as conditions. If the condition evaluates to false (the boolean value false, 0 or an empty string, among a few other values such as null and undefined), another, optional piece of code is executed. This is the syntax:

    if(condition)
    {
        // if condition is true, all the code in here will be executed
    }

or

    if(condition)
    {
        // if condition is true, all the code in here will be executed
    }
    else
    {
        // if condition is false, all the code in here will be executed
    }

This condition will normally be a boolean value such as the result of a logical operation. The curly brackets are optional if the code to be executed consists of a single statement. The following example will help illustrate this:

    var myVar = false;

    if(myVar)
    {
        alert("myVar is true"); // This won't be executed
    }
    else
    {
        alert("myVar is false"); // This will be executed
    }

    if(-1) alert("-1 evaluates to true");
    else alert("-1 evaluates to false");

    var var1 = false;
    var var2 = true;

    if(var1 && var2) alert("Both var1 and var2 are true");
    else alert("Not both var1 and var2 are true");
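
To make the rule about conditions more concrete, here is a small sketch that records how a few values behave when used as a condition. The asCondition function is a made-up helper for this illustration only.

```javascript
// Returns "true" or "false" depending on how 'value' behaves as a condition
function asCondition(value)
{
    if(value) return "true";
    else return "false";
}

var results = new Array();
results.push(asCondition(0));           // "false"
results.push(asCondition(""));          // "false"
results.push(asCondition(null));        // "false"
results.push(asCondition(undefined));   // "false"
results.push(asCondition(-1));          // "true", any number other than 0
results.push(asCondition("0"));         // "true", a non-empty string
results.push(asCondition(new Array())); // "true", objects always are
```

Note that the string "0" counts as true because it is not empty, even though the number 0 counts as false.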


Loops

Loops are pieces of code that are repeated until a certain condition is met.

While loops

A while loop executes a piece of code repeatedly while a certain condition evaluates to true. This is the syntax:

    while(condition)
    {
        // Code to be executed
    }

The curly brackets are optional if the code to be executed consists of a single statement. The condition works exactly like the condition in an if/else statement. We normally want this condition to change inside the code between the brackets (otherwise we would loop forever!). This example will illustrate this:

    var v = 0;

    while(v < 3)
    {
        alert(v);
        v++; // Shorthand for: v = v + 1;
    }

This code will show us the value of v and increment it as long as it is less than 3. Therefore, it will show us 0, 1 and 2.

For loops

A for loop is a fancier way to write a while loop. Every while loop can be translated to a for loop and vice versa. In many circumstances, particularly when it is used with an incrementing variable, a for loop is easier to read than a while loop. Here is the syntax:

    for(initial-expression; condition; increment-expression)
    {
        // Code to be executed
    }

Again, the curly brackets are optional as long as the code to be executed consists of a single statement. This is how it works. First, initial-expression is executed no matter what. Then, while condition is true, two things happen repeatedly: the code between the curly brackets is executed and then increment-expression is executed. The following code has exactly the same meaning, and therefore produces the same result, as the while loop above:

    for(var v = 0; v < 3; v++)
    {
        alert(v);
    }

This kind of loop is often combined with arrays, as in this example:

    var fib = new Array();

    fib[0] = 0;
    fib[1] = 1;

    // Calculate the first 10 Fibonacci numbers
    // and store them in the 'fib' array
    for(var i = 2; i < 10; i++) fib[i] = fib[i - 1] + fib[i - 2];

    // Show the Fibonacci numbers stored in the
    // 'fib' array. Note the use of 'fib.length'
    for(var i = 0; i < fib.length; i++) alert(fib[i]);
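
Putting loops, arrays and if statements together, here is a small sketch (the evens array and the j variable are only for illustration) that walks the first 10 Fibonacci numbers and keeps the even ones.

```javascript
var fib = new Array();
fib[0] = 0;
fib[1] = 1;
for(var i = 2; i < 10; i++) fib[i] = fib[i - 1] + fib[i - 2];

// Collect the even numbers into a second array
var evens = new Array();
for(var j = 0; j < fib.length; j++)
{
    // '%' gives the remainder of a division, so this is true for even numbers
    if(fib[j] % 2 == 0) evens.push(fib[j]);
}

// evens now holds 0, 2, 8 and 34
```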


I kept this tutorial as brief as I could. If you would like to read further about JavaScript, here is a good guide you might want to take a look at.


Filed under: Miscellaneous No Comments
8May/113

SEO: Creating a project to find non “nofollow” backlinks

For those of you who just want to get those backlinking sites without reading anything, well you will still have to read this paragraph, but that's it. In this post, there are two Helium Scraper files attached: one that extracts backlinking sites given a competitor URL, and one that also extracts PageRank for these sites. These projects are basically enhanced versions of the project I will be creating here.

So if you're still here, let's move forward. I'll be creating a Helium Scraper project that will extract a bunch of potential non "nofollow" backlinks to my imaginary software downloading site. I'm assuming you have an idea of how to use Helium Scraper. If not, I recommend this simple tutorial.

First of all, I'll choose a competitor website, not one of the very big guys because that would be unrealistic, given that my imaginary site is imaginarily just starting. So I'll pick this one: http://www.ixdownload.com. Let's open Helium Scraper and navigate to http://siteexplorer.search.yahoo.com. Here I'll search for any URL (it must be a URL, such as google.com; otherwise you will be taken to Yahoo Search instead of Yahoo Site Explorer) so that the "Next" button (the one that turns the page) appears, and create a kind with it called "Next Button". Make sure it works on two or three pages.

Now I'll import a premade project from Helium Scraper's forum that will make the job a lot easier. This project navigates through all the pages in a set of result pages by using the "Go Through All Pages" actions tree that comes with it. Here is the project and here is the forum thread where the project is attached, in case you want to know more details about it. I think what this project does will become clear in a little bit just by following this tutorial anyway.

Now I'll go to my "Actions tree 1" in the actions panel and add an "Execute Actions Tree" action that executes the "Go Through All Pages" tree. Set the "Next Button Kind" to the "Next Button" kind and leave everything else the way it is.

Now we need to create another kind called "Competitor Links" that will select the links at the top of each result such as in this picture:

[Picture: competitor links selected]

Again, make sure it works on more than one page. Now, I would normally extract the "href" property of the link, because it contains the URL of the destination page. But this time, this is not the case. If you click on the "Choose visible properties" button on the selection panel and select the "Link" property, and then select one of these links, you will see a weird looking URL that contains, around the middle, this text: "**http%3a". We need to keep only the part of the URL that comes right after the two asterisks, because that's the actual (URL-encoded) target URL. So I'll create a JavaScript gatherer that will do just that.

Open the JavaScript gatherers from the menu Project -> JavaScript Gatherers, create a new gatherer called "FixedLink" and paste this code in it:

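    // The link's "href" holds a tracking URL with the real target after "**"
    var text = element.getAttribute("href");
    // Keep everything after the "**" marker
    var encoded = text.substring(text.indexOf("**") + 2);
    // Decode escape sequences such as "%3a" back into ":"
    return unescape(encoded);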

Save and close. It is always a good idea to make sure our JavaScript gatherers are working by selecting them with the "Choose visible properties" button in the selection panel and then selecting a few elements to which the gatherer applies. In this case those elements are the links we used to create our "Competitor Links" kind. Notice that the gatherer will now have the "JS_" prefix, so instead of "FixedLink" it will be called "JS_FixedLink".
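
To see what the gatherer does without opening Helium Scraper, here is a rough standalone sketch of the same idea. The function name extractTargetUrl and the sample URL are made up for illustration, and the sketch swaps in the standard decodeURIComponent function in place of unescape. Inside a real gatherer the value would come from element.getAttribute("href") as shown above.

```javascript
// Hypothetical helper: returns the decoded URL found after the "**" marker
function extractTargetUrl(href)
{
    var marker = href.indexOf("**");
    // No marker: assume the link already points straight at its target
    if(marker == -1) return href;
    return decodeURIComponent(href.substring(marker + 2));
}

// Made-up tracking URL in the same shape as the ones described above
var sample = "http://tracker.example.com/click/id=123/**http%3a//www.example.org/some/page.html";

var target = extractTargetUrl(sample);                   // "http://www.example.org/some/page.html"
var plain = extractTargetUrl("http://www.example.org/"); // returned unchanged
```

Unlike the gatherer, this sketch returns the link unchanged when the marker is missing, which is just one possible way to handle that case.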

Now add an "Extract" action inside the "Execute tree: Go Through All Pages" action and select the "Competitor Links" kind. Change the table name to "Links" and the "Property" from "InnerText" to "JS_FixedLink". Also, change the "Req. Mode" to "At Least" and set the "Req. Amount" to 1. This will let us know if no links are found on a page.

Now type "www.ixdownload.com" (without quotes) in Yahoo's search box and press the "Explore URL" button. Then click the "Inlinks" button so it shows links to the "www.ixdownload.com" page, and change the "Show Inlinks" field to "Except from this domain" so we only get external backlinks. Make sure you are on the first page and press play.

Now we have our links, but we don't need more than one URL per domain, because if the links to our competitor are "nofollow" on one page, they will almost certainly be "nofollow" across the whole site. So let's filter out the duplicate domains. First, create another data table by clicking on the "Create table" button in the database panel. Call it "LessLinks" and add a single field called "Url" (make sure you enter these names correctly, otherwise you will have problems later). Then create another data table called "Backlinks" and also add a single field called "Url". Then create an actions tree called "Fill up LessLinks", add an "Execute JavaScript" action and paste this code in it (after removing the default line of code):

    // Returns true if the array 'a' contains the value 'obj'
    function contains(a, obj)
    {
        var i = a.length;
        while (i--)
        {
            if (a[i] === obj)
            {
                return true;
            }
        }
        return false;
    }

    // Returns the domain part of a URL, with or without the "http://" prefix
    function getDomain(url)
    {
        var index = url.indexOf("://");
        if(index != -1) return url.substring(index + 3).split(/\/+/g)[0];
        else return url.split(/\/+/g)[0];
    }

    // Empty the target table before filling it again
    Global.DataBase.Query("DELETE * FROM LessLinks");

    var links = Global.DataBase.Query("SELECT [Competitor Links] FROM Links").ToMatrix();

    var visitedDomains = new Array();

    for(var row in links)
    {
        var url = links[row][0];
        var domain = getDomain(url);
        if(!contains(visitedDomains, domain))
        {
            visitedDomains.push(domain);
            // Double any single quotes so the URL cannot break the SQL string
            Global.DataBase.Query("INSERT INTO [LessLinks] VALUES ('" + url.replace(/'/g, "''") + "')");
        }
    }
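
The domain filtering can also be tried on a plain array, outside Helium Scraper. The following is only a sketch with made-up sample URLs: keepOnePerDomain is a hypothetical name, and the Global.DataBase calls are left out so that the code runs anywhere.

```javascript
// Returns true if the array 'list' contains 'value'
function contains(list, value)
{
    for(var i = 0; i < list.length; i++)
    {
        if(list[i] === value) return true;
    }
    return false;
}

// Returns the domain part of a URL, with or without the "http://" prefix
function getDomain(url)
{
    var index = url.indexOf("://");
    if(index != -1) url = url.substring(index + 3);
    return url.split("/")[0];
}

// Hypothetical helper: keeps only the first URL seen for each domain
function keepOnePerDomain(urls)
{
    var visitedDomains = new Array();
    var kept = new Array();
    for(var i = 0; i < urls.length; i++)
    {
        var domain = getDomain(urls[i]);
        if(!contains(visitedDomains, domain))
        {
            visitedDomains.push(domain);
            kept.push(urls[i]);
        }
    }
    return kept;
}

var sampleUrls = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "https://blog.example.org/post/1",
    "www.example.com/c.html"
];

var lessLinks = keepOnePerDomain(sampleUrls);
// lessLinks holds the first and the third URL only
```

Note that the comparison is case sensitive, so "WWW.Example.com" and "www.example.com" would count as two different domains here.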

The "Fill up LessLinks" script takes the URLs from the "Links" table and inserts them into the "LessLinks" table, ignoring the ones whose domain has already been seen. If you press play now, about 200 links should appear in the "LessLinks" table. Now create another actions tree called "Extract NON nofollow sites", add a "Navigate URLs" action and set it to navigate the URLs in the "Url" column of the "LessLinks" table. Then create, inside this action, another "Execute JavaScript" action with this code:

    var competitorLink = "www.ixdownload.com";

    competitorLink = competitorLink.toLowerCase();

    // document.links is a numbered collection, so loop over it by index
    for (var i = 0; i < document.links.length; i++)
    {
        var link = document.links[i];
        var href = link.href;
        if(href && href.toLowerCase().indexOf(competitorLink) != -1)
        {
            var rel = link.getAttribute("rel");
            // "rel" may hold several space-separated values, e.g. "external nofollow"
            if(!rel || !/(^|\s)nofollow(\s|$)/i.test(rel))
            {
                // Double any single quotes so the URL cannot break the SQL string
                Global.DataBase.Query("INSERT INTO [Backlinks] VALUES ('" + window.location.href.replace(/'/g, "''") + "')");
                // One followed link is enough for this page
                return;
            }
        }
    }

The script above tries to find a non "nofollow" link to our competitor's site on each page and, if it finds one, adds the page's URL to the "Backlinks" table.

We are almost done, except for one small detail. Open the Project -> Options item in the main menu and notice there is a "Navigation Timeout" there. This aborts any navigation during our extraction that takes longer than the given amount of time. Helium Scraper will still consider the page as loaded, so it will try to extract data or perform any other action in it. This way, we won't get stuck on pages that never finish loading, or take too long to do so. Now, calculating the optimal amount of time precisely would take another post, another project and some math. So I'll just enter 20 because I can tell, from experience, that if a page takes longer than 20 seconds to load, something is wrong with it. This will depend on your internet connection as well. I'm also considering the fact that I don't necessarily need every single URL, but a good bunch of them gathered in a timely manner.
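
The "nofollow" check in the "Extract NON nofollow sites" script can be tried on plain data as well. In this sketch, which is an illustration rather than the project's actual code, each link is a simple object with made-up href and rel values, and isFollowedLinkTo is a hypothetical helper name. It treats rel as a space-separated list, so values such as "external nofollow" and "nofollow external" are both recognized.

```javascript
// Hypothetical helper: true if 'link' points to 'competitor' and is not "nofollow"
function isFollowedLinkTo(link, competitor)
{
    var href = link.href;
    if(!href || href.toLowerCase().indexOf(competitor.toLowerCase()) == -1) return false;

    var rel = link.rel;
    if(!rel) return true;

    // Look for "nofollow" as a whole word anywhere in the rel value
    return !/(^|\s)nofollow(\s|$)/i.test(rel);
}

// Made-up stand-ins for the entries of document.links
var sampleLinks = [
    { href: "http://www.ixdownload.com/", rel: "nofollow" },
    { href: "http://www.example.org/", rel: "" },
    { href: "http://WWW.IXDOWNLOAD.COM/tools", rel: "external" }
];

var pageQualifies = false;
for(var i = 0; i < sampleLinks.length; i++)
{
    if(isFollowedLinkTo(sampleLinks[i], "www.ixdownload.com")) pageQualifies = true;
}
// pageQualifies is true because of the third link
```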

So now we are good to go. Press play and, if everything was set up properly, you should start getting potential backlink URLs in your "Backlinks" table. Remember that, if you have the table open, you need to press "Refresh" to see the latest results. Here is the final product.

 

Filed under: Miscellaneous 3 Comments