Helium Scraper's Blog: All kinds of useful and useless stuff related to Helium Scraper


SEO: Creating a project to find non “nofollow” backlinks

For those of you who just want to grab those backlinking sites without reading anything: you'll still have to read this paragraph, but that's it. Two Helium Scraper project files are attached to this post: one that extracts backlinking sites given a competitor URL, and one that also extracts the PageRank of each of those sites. Both are essentially enhanced versions of the project I'll be building here.

So, if you're still here, let's move forward. I'll be creating a Helium Scraper project that extracts a set of potential non-"nofollow" backlinks for my imaginary software download site. I'm assuming you already have an idea of how to use Helium Scraper; if not, I recommend this simple tutorial.

First of all, I'll choose a competitor website, not one of the very big players, because that would be unrealistic given that my imaginary site is imaginarily just starting out. So I'll pick this one: http://www.ixdownload.com. Let's open Helium Scraper and navigate to http://siteexplorer.search.yahoo.com. Here I'll search for any URL (it must be a URL, such as google.com, otherwise you will be taken to Yahoo Search instead of Yahoo Site Explorer) so that the "Next" button (the one that turns the page) appears, and create a kind from it called "Next Button". Make sure it works on two or three pages.

Now I'll import a premade project from Helium Scraper's forum that makes the job a lot easier. What this project does is navigate through every page in a set of result pages by using the "Go Through All Pages" actions tree that comes with it. Here is the project, and here is the forum thread where the project is attached, in case you want more details. In any case, what it does should become clear in a little bit just by following this tutorial.

Now I'll go to "Actions tree 1" in the actions panel and add an "Execute Actions Tree" action that executes the "Go Through All Pages" tree. Set "Next Button Kind" to the "Next Button" kind and leave everything else as it is.

Now we need to create another kind called "Competitor Links" that will select the links at the top of each result such as in this picture:

[Image: competitor links selected]

Again, make sure it works on more than one page. Normally, I would extract the link's "href" property, because it contains the URL of the destination page, but that's not the case this time. If you click the "Choose visible properties" button in the selection panel, select the "Link" property, and then select one of these links, you will see a weird-looking URL containing, around the middle, the text "**http%3a". The part of the URL that starts right after the "**" is the actual target URL, percent-encoded, so we need to strip everything before it and decode the rest. I'll create a JavaScript gatherer that does just that.

Open the JavaScript gatherers from the menu Project -> JavaScript Gatherers, create a new gatherer called "FixedLink", and paste this code into it:

var text = element.getAttribute("href");
// Everything after the "**" marker is the percent-encoded target URL
var encoded = text.substring(text.indexOf("**") + 2);
// Decode it back into a regular URL
return unescape(encoded);
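If you want to see what this gatherer does before wiring it into the project, here is a standalone sketch of the same logic that runs in any JavaScript engine. The sample href below is a made-up example of Yahoo's redirect URL format, not a real one, but the "**" marker and percent-encoding work the same way:

```javascript
// Standalone version of the FixedLink gatherer logic.
function fixedLink(href) {
    // Everything after "**" is the percent-encoded target URL
    var encoded = href.substring(href.indexOf("**") + 2);
    // Decode percent-escapes such as %3a back into ":"
    return unescape(encoded);
}

// Hypothetical href in the Yahoo redirect style
var sample = "http://rds.yahoo.com/_ylt=A0abc/SIG=1xyz/**http%3a//www.example.com/page";
console.log(fixedLink(sample)); // http://www.example.com/page
```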

Save and close. It's always a good idea to make sure our JavaScript gatherers are working by selecting the gatherer with the "Choose visible properties" button in the selection panel and then selecting a few elements the gatherer applies to. In this case, those elements are the links we used to create our "Competitor Links" kind. Notice that the gatherer will now have a "JS_" prefix, so instead of "FixedLink" it will be called "JS_FixedLink".

Now add an "Extract" action inside the "Execute tree: Go Through All Pages" action and select the "Competitor Links" kind. Change the table name to "Links" and the "Property" from "InnerText" to "JS_FixedLink". Also, change the "Req. Mode" to "At Least" and set the "Req. Amount" to 1. This will let us know if no links are found on any page.

Now type "www.ixdownload.com" (without quotes) in Yahoo's search box and press the "Explore URL" button. Then click the "Inlinks" button so it shows links to the www.ixdownload.com page, and change the "Show Inlinks" field to "Except from this domain" so we only get external backlinks. Make sure you are on the first page, then press play.

Now we have our links, but we don't need more than one URL per domain: if the links to our competitor are "nofollow" on one page, they will almost certainly be "nofollow" throughout the whole site. So let's filter out duplicate domains. First, create another data table by clicking the "Create table" button in the database panel. Call it "LessLinks" and add a single field called "Url" (make sure you enter these names exactly, otherwise you will have problems later). Then create another data table called "Backlinks", also with a single field called "Url". Then create an actions tree called "Fill up LessLinks", add an "Execute JavaScript" action, and paste this code into it (after removing the default line of code):

// Returns true if the array "a" contains "obj"
function contains(a, obj)
{
    var i = a.length;
    while (i--)
    {
        if (a[i] === obj)
        {
            return true;
        }
    }
    return false;
}

// Extracts the domain part of a URL, with or without the scheme
function getDomain(url)
{
    var index = url.indexOf("://");
    if (index != -1) return url.substring(index + 3).split(/\/+/g)[0];
    else return url.split(/\/+/g)[0];
}

// Start from a clean table
Global.DataBase.Query("DELETE * FROM LessLinks");

var links = Global.DataBase.Query("SELECT [Competitor Links] FROM Links").ToMatrix();

var visitedDomains = new Array();

// Keep only the first URL seen for each domain
for (var row in links)
{
    var url = links[row][0];
    var domain = getDomain(url);
    if (!contains(visitedDomains, domain))
    {
        visitedDomains.push(domain);
        // Double up single quotes so they don't break the SQL statement
        Global.DataBase.Query("INSERT INTO [LessLinks] VALUES ('" + url.replace(/'/g, "''") + "')");
    }
}
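The getDomain helper is the part doing the deduplication work, so it's worth checking in isolation. This quick sketch runs in any JavaScript engine, outside Helium Scraper, using made-up example URLs:

```javascript
// Standalone copy of the getDomain helper used above.
function getDomain(url) {
    var index = url.indexOf("://");
    // Drop the scheme if present, then keep everything before the first slash
    if (index != -1) return url.substring(index + 3).split(/\/+/g)[0];
    else return url.split(/\/+/g)[0];
}

console.log(getDomain("http://www.example.com/some/page")); // www.example.com
console.log(getDomain("www.example.com/other"));            // www.example.com
```

Two URLs from the same site therefore map to the same domain string, which is exactly what lets the visitedDomains check above skip duplicates.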

What this code does is take the URLs from the "Links" table and insert them into the "LessLinks" table, skipping any whose domain has already been seen. If you press play now, around 200 links should appear in the "LessLinks" table. Now create another actions tree called "Extract NON nofollow sites", add a "Navigate URLs" action, and set it to navigate the URLs in the "Url" column of the "LessLinks" table. Then, inside this action, create another "Execute JavaScript" action with this code:

var competitorLink = "www.ixdownload.com";

competitorLink = competitorLink.toLowerCase();

// Look through every link on the page for a dofollow link to the competitor
for (var i in document.links)
{
    var link = document.links[i];
    var href = link.href;
    if (href && href.toLowerCase().indexOf(competitorLink) != -1)
    {
        var rel = link.getAttribute("rel");
        // A link is dofollow if it has no rel attribute, or a rel value that
        // doesn't contain "nofollow" (this also covers "external nofollow")
        if (!rel || rel.toLowerCase().indexOf("nofollow") == -1)
        {
            // Record the page we found the dofollow link on
            Global.DataBase.Query("INSERT INTO [Backlinks] VALUES ('" + window.location.href.replace(/'/g, "''") + "')");
            return;
        }
    }
}

This code will try to find non-"nofollow" links to our competitor's site on each page and, if one is found, extract the page's URL into the "Backlinks" table. We are almost done, except for one small detail. Open Project -> Options in the main menu and notice the "Navigation Timeout" setting there. This aborts any navigation during our extraction that takes longer than the given amount of time. The page is still considered loaded, so Helium Scraper will still try to extract data or perform any other action on it. This way, we won't get stuck on pages that never finish loading, or take too long to do so. Now, precisely calculating the optimal timeout would take another post, another project, and some math. So I'll just enter 20, because I can tell from experience that if a page takes longer than 20 seconds to load, something is wrong with it. The right value also depends on your internet connection, and I'm taking into account that I don't necessarily need every single URL, just a good bunch of them gathered in a timely manner.
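As an aside, the dofollow test at the heart of that script can be exercised on its own. This standalone sketch treats any rel value containing "nofollow" as nofollow, and the example values below are made up for illustration:

```javascript
// Standalone version of the "is this link dofollow?" test.
function isDofollow(rel) {
    // No rel attribute, or a rel value without "nofollow", counts as dofollow
    return !rel || rel.toLowerCase().indexOf("nofollow") == -1;
}

console.log(isDofollow(null));                // true  (no rel attribute)
console.log(isDofollow("external"));          // true
console.log(isDofollow("nofollow"));          // false
console.log(isDofollow("external nofollow")); // false
```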

So now we are good to go. Press play, and if everything was set up properly, you should start getting potential backlink URLs in your "Backlinks" table. Remember that if you have the table open, you need to press "Refresh" to see the latest results. Here is the final product.


Comments (3) Trackbacks (0)
  1. That’s really thinking out of the box. Thanks!

  2. Does this take the nofollow backlinks of the competitors?

    • Hi, make sure you are looking at the Backlinks table and not either of the other two. This table is populated after the other two, and contains only do-follow links.
