How to use the Alexa Top 1M domain list for analysis

If you want to do some basic analytics on domain names, e.g. count the popular domains per country-level domain, here is how to create such an overview:

1. Download the top 1,000,000 sites .csv from Alexa: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip. Alexa sold this list in the past, but some years ago made it available for free. For production use you would probably use the Alexa web services on Amazon, which give you real-time information.

2. This CSV file contains domains in the form of e.g. “bancobrasil.com.br” along with their ranking. You can easily import it into e.g. a SQLite database using the Firefox SQLite plugin, the sqlite command-line tool, the pro version of DbVisualizer, or a similar tool.
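If you prefer to script the import, the following is a minimal sketch using PDO with an in-memory SQLite database. The table name `alexa`, the column names `alexa_rank` and `domain`, and the sample rows are assumptions made for illustration; the sketch also assumes each line has the plain "rank,domain" layout described above.

```php
<?php
// Sample rows in the "rank,domain" layout (illustrative values, not real Alexa data).
$csv = "1,example.com\n2,example.org\n3,bancobrasil.com.br\n";

// In-memory database for the sketch; point the DSN at a file to persist it.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE alexa (alexa_rank INTEGER PRIMARY KEY, domain TEXT NOT NULL)');

$insert = $db->prepare('INSERT INTO alexa (alexa_rank, domain) VALUES (?, ?)');

// One transaction keeps a large import fast.
$db->beginTransaction();
foreach (explode("\n", trim($csv)) as $line) {
    // Split on the first comma only: rank on the left, domain on the right.
    [$rank, $domain] = explode(',', trim($line), 2);
    $insert->execute([(int) $rank, strtolower($domain)]);
}
$db->commit();

$rowCount = (int) $db->query('SELECT COUNT(*) FROM alexa')->fetchColumn();
```

For the real file you would read the unzipped CSV line by line instead of using the `$csv` string.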


3. If you now want to match this against an existing set of URLs, e.g. a set of bookmarks, you will have to figure out what is “top level” and what is not. For example, if you have a bookmark http://apps.facebook.com/blabla you would use

 parse_url($bookmarkurl, PHP_URL_HOST) 

to get the host, which gives you apps.facebook.com.
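Bookmarks are not always clean URLs, so a small wrapper can help. This is only a sketch: the helper name `bookmarkHost` is hypothetical, and it assumes that a bookmark without a scheme should be treated as http. It relies on the fact that parse_url() only reports a host when a scheme (or a leading "//") is present.

```php
<?php
/**
 * Return the lower-cased host of a bookmark URL, or null if none can be found.
 * Hypothetical helper, not part of any library.
 */
function bookmarkHost(string $url): ?string
{
    // Without a scheme, parse_url() treats the whole string as a path,
    // so prepend one when it is missing.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() returns false or null when no host could be extracted.
    return is_string($host) ? strtolower($host) : null;
}

$host = bookmarkHost('http://apps.facebook.com/blabla');
```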

But in Alexa you would want to match this with “facebook.com”, not “apps.facebook.com” (otherwise it cannot be matched). The problem is how to know that “apps.” has to be stripped, since in e.g. “apps.gov.edu.yk” you would have to cut at a different place. This cannot be solved by a general rule: you need a list of all country-level domains.

So, to make this match there are several alternatives, but the best known is the effective_tld_names list (the public suffix list): http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1


This list is used in e.g. Firefox, Opera and Chrome.
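To give an idea of what a lookup against such a list involves, here is a simplified sketch. The function names `loadSuffixRules` and `registeredDomainSketch` are hypothetical, the rule excerpt is a tiny hand-picked subset in the list's format, and internationalized domain names are not handled. It is meant to illustrate the matching idea (plain rules, `*.` wildcard rules and `!` exception rules), not to replace a maintained library.

```php
<?php
// Tiny excerpt in the public suffix list format (illustrative subset, not the full list).
$listExcerpt = <<<TXT
// comment lines start with two slashes
com
br
com.br
uk
co.uk
*.ck
!www.ck
TXT;

/** Parse list text into a lookup array of rule => true. */
function loadSuffixRules(string $text): array
{
    $rules = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || strpos($line, '//') === 0) {
            continue; // skip blank lines and comments
        }
        $rules[strtolower($line)] = true;
    }
    return $rules;
}

/** Return the registered domain of $host, or null if $host is itself a public suffix. */
function registeredDomainSketch(string $host, array $rules): ?string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);
    $suffixLength = 1; // fallback when no rule matches: the last label is the suffix

    // Try candidates from longest to shortest, so the first hit is the longest match.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($rules['!' . $candidate])) {
            // Exception rule: the suffix is the candidate minus its leftmost label.
            $suffixLength = $count - $i - 1;
            break;
        }
        $wildcard = '*.' . implode('.', array_slice($labels, $i + 1));
        if (isset($rules[$candidate]) || ($i + 1 < $count && isset($rules[$wildcard]))) {
            $suffixLength = $count - $i;
            break;
        }
    }

    if ($count <= $suffixLength) {
        return null; // the host is a public suffix, nothing is registered under it
    }
    // Registered domain = public suffix plus one more label to the left.
    return implode('.', array_slice($labels, $count - $suffixLength - 1));
}

$rules = loadSuffixRules($listExcerpt);
$registered = registeredDomainSketch('apps.facebook.com', $rules);
```

With the excerpt above, `apps.facebook.com` is reduced to `facebook.com`, while `www.bancobrasil.com.br` keeps three labels because `com.br` is itself a listed suffix.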

For PHP there is an existing class by DKIM-reputation.org that you can reuse for this; you can include it in a /lib directory.

4. You then simply use

 $registeredDomain = getRegisteredDomain($base_href_host, $tldTree); 

($tldTree is provided by DKIM-reputation as a global array which you can pass to a constructor, and $base_href_host is the PHP_URL_HOST result.)

5. Next you can add two columns to your SQLite table: the registered domain and the country-level part (filled by an SQL update run using the $registeredDomain from above).
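As a rough sketch of this step, the snippet below adds the two columns and fills them row by row. The table and column names (`alexa`, `alexa_rank`, `domain`, `registered_domain`, `country_part`) are assumptions, the sample rows are made up, and `$toRegistered` is only a placeholder for the registered-domain lookup from step 4. The country-level part is taken here simply as the last label of the domain.

```php
<?php
// In-memory table with illustrative rows; table and column names are assumptions.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE alexa (alexa_rank INTEGER PRIMARY KEY, domain TEXT NOT NULL)');
$db->exec("INSERT INTO alexa VALUES (1, 'example.com'), (2, 'bancobrasil.com.br'), (3, 'example.co.uk')");

// The two extra columns (hypothetical names).
$db->exec('ALTER TABLE alexa ADD COLUMN registered_domain TEXT');
$db->exec('ALTER TABLE alexa ADD COLUMN country_part TEXT');

// Placeholder: swap in the real registered-domain lookup from step 4 here.
$toRegistered = function (string $host): string {
    return $host;
};

// Read all rows first, then update them inside one transaction.
$rows = $db->query('SELECT alexa_rank, domain FROM alexa')->fetchAll(PDO::FETCH_ASSOC);
$update = $db->prepare(
    'UPDATE alexa SET registered_domain = ?, country_part = ? WHERE alexa_rank = ?'
);
$db->beginTransaction();
foreach ($rows as $row) {
    $labels = explode('.', $row['domain']);
    // end() returns the last label, e.g. "br" for "bancobrasil.com.br".
    $update->execute([$toRegistered($row['domain']), end($labels), $row['alexa_rank']]);
}
$db->commit();
```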

6. Based on this table you can now run lots of analytics on the top 1,000,000 domain names; you can even add new meta information to your own bookmarks collection.
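One such analysis, the count of domains per country-level part mentioned at the start, could look like the sketch below. It assumes a table `alexa` with a filled `country_part` column (hypothetical names) and uses made-up rows so that it runs on its own.

```php
<?php
// Self-contained sample table; names and rows are illustrative assumptions.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE alexa (alexa_rank INTEGER PRIMARY KEY, domain TEXT, country_part TEXT)');
$db->exec("INSERT INTO alexa VALUES
    (1, 'example.com', 'com'),
    (2, 'example.net.br', 'br'),
    (3, 'another-example.com', 'com'),
    (4, 'bancobrasil.com.br', 'br'),
    (5, 'example.co.uk', 'uk'),
    (6, 'third-example.com', 'com')");

// Count domains per country-level part, largest group first.
$sql = 'SELECT country_part, COUNT(*) AS domains
        FROM alexa
        GROUP BY country_part
        ORDER BY domains DESC, country_part ASC';

// FETCH_KEY_PAIR turns the two result columns into a key => value array.
$counts = array_map('intval', $db->query($sql)->fetchAll(PDO::FETCH_KEY_PAIR));

foreach ($counts as $part => $domains) {
    printf("%-4s %d\n", $part, $domains);
}
```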

However please note:

- Alexa gives you only the top-level registered domain, so you cannot distinguish between the popularity of e.g. app.facebook.com/game1 and app.facebook.com/game2: both will get the general facebook rank. What actually happens here is that facebook acts as a further domain hierarchy: you will have to make a list of those kinds of sites and assign further popularity scores to their individual members, based on statistics from e.g. wordpress.org, facebook.com or blog platforms. I have not yet seen a list of domains which you have to “take out of this Alexa list to analyze further” (someone out there has probably done this already, but I have not found it yet). Basically these sites act as N-level domains and should be put under their own category, not under .com.
If such a list has really not been created by anyone else (which I doubt), I will add a GPL project on e.g. Google Code with at least a good start on these types of N-level registers.

- If you want to add whois record information, http://www.phpwhois.org/ may also be handy.