Archive for the ‘OWS’ Category

OWS Analysis: Digg vs. Stumbleupon vs. Reddit

Wednesday, November 21st, 2007

So I added some new features to OWS, which allow you to filter any results manually by any field you want (you have to enable obsessively excessive options). I then applied this to my recent traffic and found these interesting (though, not entirely surprising) results:

Summary:

ows_sum.png

Stumbleupon:

ows_stumble.png

Digg:

ows_digg.png

Reddit:

ows_reddit.png

Not surprisingly, Digg and Reddit have almost no tail traffic: there is one really huge burst, and then it dies out. Reddit’s tail is even smaller than Digg’s. Stumbleupon, however, doesn’t really have much of a burst, but provides a really long tail of traffic, which over time ends up exceeding the amount of traffic from Digg or Reddit.

So what’s the lesson here?

Because of its nature, I think Stumbleupon will keep providing high-quality traffic for a while. Not only that, but it also provides repeat visitors. It should definitely be on your radar when advertising through social bookmarking sites, even though it’s often ignored.

While traffic is good, traffic from the right places is even better, and can provide more traffic over time. I’ll have to put a stumbleupon button on my blog posts here… hehe.

OWS Analysis of traffic to previous post

Sunday, November 11th, 2007

Looks like I’m getting a lot of traffic to my previous post about the MBR love note, which is pretty awesome. Here’s a screenshot of the traffic so far in my open source website traffic analysis tool, Obsessive Website Statistics (I’ll update it later tonight, when it should be more interesting). It has definitely eclipsed any of my other recent traffic…

Edit: this was the pre-digg screenshot

ows_ss.png

This is the post-digg screenshot. I bet you can’t tell when the page was on the front page of Digg AND Reddit.. 

ows_ss2.png

More semi-amusing referrer spam

Friday, November 9th, 2007

As you may know, I have my awesome computer engineering javascript and CSS resume posted on this site. And actually, I’ve gotten quite a few responses from random companies and people all across the US… which has been really encouraging for me, especially since I haven’t applied anywhere! But onto the amusing referrer spam.

Using OWS, I’ve noticed a lot of random referrer spam to my resume. And it’s pretty consistent: each instance points to a (valid) resume for some totally random person. This is one instance:

hostname Date Time Referrer URL User-Agent
65.91.101.XXX 2007-11-02 10:37:26 http://lalaland.msu.edu/~vanhoose/resume/resume.html Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; SCF – Mean & Nasty; T312461)
65.91.101.XXX 2007-11-02 10:37:25 http://lalaland.msu.edu/~vanhoose/resume/resume.html Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; SCF – Mean & Nasty; T312461)
65.91.101.XXX 2007-11-02 10:37:22 http://www.dalehollowmarketing.com/Htm%20pages/resume.htm Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)

I’ve seen around 10 or 20 of these referrer spam URLs over the last few months. The weird part is that the URLs always belong to people of varying experience and totally different career fields (teachers, businessmen, researchers), and even international students and such. And it has been originating from a few different IP addresses.. but who knows with how much adware/botware is out there.

What I can’t imagine is the purpose.. I mean, is someone paying for someone to spam their resume around? Totally untargeted referrer spam like that can’t possibly be effective — I mean look, they caught my attention, but is an HR person really going to look at website logs? Doubtful. And if they did, I’m pretty sure they would NOT hire that person for being so spammy. Maybe the spammers are just picking resumes at random…. but that doesn’t even make sense either! Heh..

Anyways, I really need to refocus my resume in the very near future (like.. this weekend) to a more engineering type of job, since right now it makes it look like I want a web developer/related job. Which don’t get me wrong, I think I might enjoy a job doing web related stuff, but I think I’m really looking for somewhere I can innovate and contribute as a computer engineer, or something close. I’ll have to write about this soon.. since, the one thing keeping me looking for a job right now is that I’m waiting to see whether I’m accepted into the graduate school I wanted to get into or not. Wish me luck.

Microsoft is referrer spamming me

Wednesday, October 31st, 2007

Using my open source website statistics program, Obsessive Website Statistics, I monitor my traffic on a regular basis mostly to see if there are any links from anywhere interesting. Well, lately I’ve been noticing a large number of referrer links from search.live.com with random single-term search terms in them. So I looked at the hostnames and noticed this (this is just a sampling):

msn_spam.PNG

The IPs and non-standard domain names all belong to Microsoft. The agent is always the same. It always browses maybe one or two pages deep, possibly grabbing some CSS files in the process. It’s definitely a parallel bot of some kind, since it’s all from the same IP ranges, and they all have similar queries. Additionally, it NEVER checks robots.txt.
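One way to check the robots.txt observation against raw logs is to list every host that made requests but never asked for /robots.txt. The sketch below is only an illustration: it assumes the log lines are in Apache combined format, and the function name hosts_skipping_robots() and the sample lines (which use documentation-only IP addresses) are made up for this example and are not part of OWS.

```php
<?php
// Hypothetical helper, not part of OWS: given raw access-log lines
// (assumed to be in Apache combined format), return the hosts that
// made requests but never asked for /robots.txt.
function hosts_skipping_robots(array $lines) {
    $seen = array();   // host => true for every host that made a request
    $polite = array(); // host => true once it has fetched /robots.txt

    foreach ($lines as $line) {
        // host ident user [date] "METHOD path protocol" ...
        if (!preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)/', $line, $m)) {
            continue; // skip lines that do not look like log entries
        }
        $host = $m[1];
        $path = parse_url($m[2], PHP_URL_PATH);
        $seen[$host] = true;
        if ($path === '/robots.txt') {
            $polite[$host] = true;
        }
    }

    // Hosts that were seen but never fetched robots.txt.
    return array_keys(array_diff_key($seen, $polite));
}

// Made-up sample lines, for illustration only.
$sample = array(
    '192.0.2.10 - - [31/Oct/2007:10:00:00 -0400] "GET /robots.txt HTTP/1.1" 200 25 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [31/Oct/2007:10:00:01 -0400] "GET /index.php HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [31/Oct/2007:10:05:00 -0400] "GET /index.php HTTP/1.1" 200 512 "http://search.example/?q=term" "Mozilla/4.0"',
);

print_r(hosts_skipping_robots($sample));
```

A host that shows up in this list is not necessarily a bot, since normal browsers never request robots.txt either. The list is only useful when combined with the other signs described above.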

Researching this on the internet (using Google, of course) shows I’m not the only one affected by this. Apparently MSN has claimed that it’s just a “quality test”, but that’s BS. They’ve been quality testing since August 15 for me, in that case.

My first reaction has been to agree with Josh Cohen, who posted in his blog that he thinks MSN is doing good old-fashioned referrer spam to try and entice people to use Live Search since the normal methods of getting people to use it have failed.

It’s certainly enticed me to click through a number of times, only to discover that my site isn’t anywhere near the top of the listings for those terms. In fact, originally the links weren’t even resolving to proper pages, and it would give me an error when I actually got to Live Search. Go figure.
If anything, this has convinced me to NOT use Live Search. Thank you Microsoft, for making my log files a little more annoying. Microsoft needs to stop doing this, and do a “quality check” on their own procedures.

Sampling from my log files containing these accesses:

msn_referrer_spam.txt

Related links:

http://artific.com/articles/2005/12/27/a_practically_u/
http://pocketseo.com/analytics/146
http://www.webmasterworld.com/msn_microsoft_search/3424476.htm
http://andrewu.co.uk/webtech/archive/?odd_referrer_spamlike_behaviour_from_microsoft

Anatomy of a Boing Boing link using OWS

Tuesday, September 11th, 2007

My roommate just got a page of his linked to by Boing Boing, so I just added a better heatmap function to OWS to do some better visual analysis of the hit. As you can see, the results are quite nice.

OWS Heatmap of jonathanryan.org

As you can see, the initial traffic spike peaked at 1:00pm on 9/8 with 639 visitors or so that hour. Traffic then fell off until the end of the day, spiked again as people woke up around 9 or 10 the next day, and then fell back to mostly normal levels.
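The data behind a heatmap like this boils down to counting hits per day and hour. Here is a rough sketch of that bucketing step. It is not taken from the OWS source: hits_per_hour() and the sample timestamps are hypothetical and only meant to show the idea.

```php
<?php
// Hypothetical illustration, not OWS code: bucket hit timestamps into a
// [date][hour] => count grid, the kind of data a heatmap is drawn from.
date_default_timezone_set('UTC'); // parse and format in one fixed timezone

function hits_per_hour(array $timestamps) {
    $grid = array();
    foreach ($timestamps as $ts) {
        $time = strtotime($ts);
        if ($time === false) {
            continue; // ignore timestamps that cannot be parsed
        }
        $day  = date('Y-m-d', $time);
        $hour = (int) date('G', $time); // hour of day, 0-23
        if (!isset($grid[$day][$hour])) {
            $grid[$day][$hour] = 0;
        }
        $grid[$day][$hour]++;
    }
    return $grid;
}

// Made-up timestamps, for illustration only.
$grid = hits_per_hour(array(
    '2007-09-08 13:05:00',
    '2007-09-08 13:40:00',
    '2007-09-09 09:15:00',
));
print_r($grid);
```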

A more interesting observation is the spike in bot/crawler traffic, as shown below.

OWS Heatmap (bot) jonathanryan.org

Apparently people aren’t the only ones who follow links on popular sites such as Boing Boing. 🙂

This heatmap is not yet in the latest version of OWS, but it is stable and available in SVN right now. OWS 0.8.0.4 will have this, which will hopefully be introduced by the weekend if I have time..

Obsessive Web Statistics (OWS) analysis plugin tutorial

Saturday, September 1st, 2007

This is a short tutorial on how you can write an analysis plugin for Obsessive Website Statistics (OWS). OWS is designed first and foremost to be plugin friendly, and as you will see, adding useful functionality in the form of plugins is not hard at all, and can be done in just a few lines of code. We are going to add DNS hostname resolving to OWS.

What is an analysis plugin?

An analysis plugin performs analysis on the parsed logfile data, and stores that information in the database dimensions. OWS has wrapped all of this stuff in a nice easy to use abstraction layer so that you won’t need to make actual SQL queries if you don’t want to.

Implementation

All OWS plugins are implemented as PHP classes. This is the bare skeleton that all OWS plugins should define.

class OWSDNS implements iPlugin{

	// this should return a unique ID identifying the plugin, should start with an alpha,
	// should use basename instead of just __FILE__ otherwise it could expose path information
	public function getPluginId(){
		return 'p'. md5(basename(__FILE__) . get_class());
	}

	// returns an associative array describing the plugin
	public function getPluginInformation(){

		return array(

			'pluginName' => 'Name of plugin',
			'aboutUrl' => 'http://information.about.plugin',

			'author' => 'author',
			'url' => 'http://developers.website',

			'description' => 'Description of what plugin does'
		);
	}
}

You should notice we define two functions — getPluginId() and getPluginInformation(). These must be defined by any OWS plugin, and are used to identify the plugin in a number of instances. This plugin also implements iPlugin. All interfaces are defined (with plenty of comments) in include/plugin_interfaces.inc.php. A plugin can implement as many interfaces as it needs to. There are a few types, but the one we are going to implement is iAnalysisPlugin. We will do so by changing the first part to:

class OWSDNS implements iPlugin, iAnalysisPlugin {

Additionally, we need to register the plugin with OWS so that it knows what kind of plugin you are defining. Add this to the end of your source file:

register_plugin('analysis',new OWSDNS());

An analysis plugin needs to implement the following functions:

define_dimensions
InitializeAnalysis
preAnalysis
getPrimaryNode
getAttributes
postAnalysis

All of these functions are documented in include/plugin_interfaces.inc.php if you need more comprehensive information.

Now, OWS stores data in multiple dimensions. Each dimension has a ‘primary node’, which is the main data element of the dimension and always has the same name as the dimension. Each primary node can have multiple attributes defined about it. Plugins can define new dimensions or extend existing dimensions.

Right now, OWS stores only the host address — which is an IP address representing the visitor. What our plugin needs to do is resolve this address, and store it as an attribute of the dimension. So, we need to extend the dimension ‘host’, which we can do using the function define_dimensions().

// this function should return a set of arrays that define the dimensions
// and attributes that this plugin defines. You should not specify an attribute
// that another plugin defines. This is not website dependent.
public function define_dimensions(){

	return array(
		'host' => array(
			'hostname' => attribute_defn('varchar',254,16)
		)
	);
}

Pretty simple, eh? See, the array returned means that we are defining inside dimension ‘host’, an attribute named ‘hostname’. The function attribute_defn is used to define the SQL type that our attribute has, so the installer can create it for us. Now, we can write the actual analysis part.

At the beginning of analysis, the function InitializeAnalysis is called in case the plugin needs to do something before the analysis begins. This function is called once per website analyzed. Our plugin isn’t going to need this, so we just return true.

public function InitializeAnalysis($website){
	return true;
}

Now, after all plugins are initialized, the logfile lines are read from the logfile (or from the database in the case of an install or a reanalysis). They are read in phases, each of which consists of 4 steps:

preAnalysis
getPrimaryNode
getAttributes
postAnalysis

Now, preAnalysis and postAnalysis are only called once per phase, but getPrimaryNode is typically called at least once per logfile line. Our plugin doesn’t use getPrimaryNode — getPrimaryNode is only used for plugins that define new dimensions in define_dimensions. If you don’t define a primary node, then you should return false and show an error.

It should also be noted that our plugin doesn’t need to do any preAnalysis or postAnalysis, so we can just return true.

public function preAnalysis($website,&$ids){
	return true;
}

public function getPrimaryNode($website, $dimension, $line){
	return show_error("Invalid dimension passed to plugin \"" . get_class() . "\"");
}

public function postAnalysis($website,&$ids){
	return true;
}
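To make the call order concrete, here is a rough sketch of how a driver might walk one phase. It is not OWS’s actual loop: run_phase(), the RecordingPlugin class, and the sample arguments are all invented for this illustration, and only the four method names and their parameter lists come from this tutorial. Unlike our DNS plugin, the stand-in pretends to own its dimension, so its getPrimaryNode returns a value instead of an error.

```php
<?php
// Stand-in plugin (hypothetical) that just records which method was called.
class RecordingPlugin {
    public $calls = array();

    public function preAnalysis($website, &$ids) {
        $this->calls[] = 'preAnalysis';
        return true;
    }
    public function getPrimaryNode($website, $dimension, $line) {
        $this->calls[] = 'getPrimaryNode';
        return $line; // pretend the whole line is the primary node
    }
    public function getAttributes($website, $dimension, $pnode) {
        $this->calls[] = 'getAttributes';
        return array('length' => strlen($pnode));
    }
    public function postAnalysis($website, &$ids) {
        $this->calls[] = 'postAnalysis';
        return true;
    }
}

// Hypothetical driver: pre/post run once per phase, the other two per line.
function run_phase($plugin, $website, $dimension, array $lines) {
    $ids = array();
    if (!$plugin->preAnalysis($website, $ids)) {
        return false;
    }
    foreach ($lines as $line) {
        $pnode = $plugin->getPrimaryNode($website, $dimension, $line);
        if ($pnode === false) {
            return false;
        }
        if ($plugin->getAttributes($website, $dimension, $pnode) === false) {
            return false;
        }
    }
    return $plugin->postAnalysis($website, $ids);
}

$plugin = new RecordingPlugin();
run_phase($plugin, 'example.com', 'demo', array('line one', 'line two'));
print_r($plugin->calls);
```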

Now we get to the part that actually does the work. The function getAttributes needs to return an array representing the attributes that the plugin defines per dimension. The $dimension argument is passed in to the function, and we should only do analysis on the primary node. The contents of the primary node are passed in to the function as well. This makes sense, because attributes of the primary node should be discernible by only looking at the primary node itself. If this is not the case, then you should probably be defining a new dimension instead.

This function should return an array of attributes/values in the form of:

	array('attribute' => 'value', ...)

Note: The returned values can be cached (for performance reasons), so this function may NOT always be called for each row. You should ALWAYS return an array with the same keys each time, in the same order that you defined them in define_dimensions. Of course, if you do not define any attributes in the dimension passed in the $dimension parameter, or if there is an error, then return false.

Anyways, here’s the code for this function:


public function getAttributes($website, $dimension, $pnode){

	if ($dimension != 'host')
		return show_error("Invalid dimension passed to plugin \"" . get_class() . "\" in getAttributes!");

	// return the hostname
	return array('hostname' => gethostbyaddr($pnode));

}
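One thing to keep in mind is that PHP’s gethostbyaddr() returns the address unchanged when the lookup fails, and false when the input is malformed. A slightly more defensive version of the lookup could look like the sketch below. The function lookup_hostname_attribute() and its injectable $resolver parameter are hypothetical, added here only so the example can run without a network connection. They are not part of OWS.

```php
<?php
// Hypothetical helper, not part of OWS: build the 'hostname' attribute
// for an address, returning false on bad input as the tutorial suggests.
function lookup_hostname_attribute($address, $resolver = 'gethostbyaddr') {
    // Reject anything that is not a valid IPv4/IPv6 address up front.
    if (filter_var($address, FILTER_VALIDATE_IP) === false) {
        return false;
    }

    $hostname = call_user_func($resolver, $address);
    if ($hostname === false) {
        return false;
    }

    // On a failed lookup the address itself comes back; store it as-is so
    // the returned array always has the same key.
    return array('hostname' => $hostname);
}

// Usage with a fake resolver so the example runs without network access.
$fake = function ($ip) { return 'host.example'; };
var_dump(lookup_hostname_attribute('192.0.2.1', $fake));
var_dump(lookup_hostname_attribute('not-an-ip', $fake));
```

Inside the plugin, getAttributes could call such a helper with the default resolver after the dimension check.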

And that’s it! Wasn’t that easy? Of course, there are a lot more useful things we could implement to make this more polished. Now, after you install the plugin and run the analysis, the only filter you’ll be able to use on your new dimension attribute in the web interface is the manual analysis, since it allows analysis on all defined dimensions. But it would be a pretty trivial matter to either modify an existing filter plugin or create a new filter plugin. We’ll discuss this in the future.

Hope this helps you out. If you need help with OWS, or developing for OWS, don’t hesitate to ask! Leave your comments, or join the obsessive-compulsive mailing list!

Download this
Obsessive Website Statistics Website

Obsessive Web Stats Demo on the Virtual Roadside!

Friday, August 17th, 2007

A demo of OWS is now on the virtual roadside! If you’re interested in seeing OWS in action, you can visit it at http://obsessive.virtualroadside.com/. It details the traffic to the OWS sourceforge site. The only limitation is that it only tracks the main page… so you can’t really do any in-depth analysis. But it shows you the key concepts behind OWS in any case.

MySQL Cluster Tips

Friday, August 17th, 2007

Well, I set up a 9-computer MySQL cluster to do some experimentation with OWS. It’s pretty neat: I have DDNS set up with DHCP, and a neat thing set up with rsync where every single machine syncs its configuration to the ‘primary’ machine each hour. It’s pretty cool, I’ll have to write some more posts about it.

Anyways, if you ever use MySQL cluster, there’s one important tip that they don’t really mention in the manual:

MAKE SURE ALL OF YOUR STORAGE NODES ARE UP, OTHERWISE THE CLUSTER WON’T START.

See, I had an issue with one of the network cards on the machines, so I decided to just try and get things working without messing with that machine. That worked pretty well until I got around to screwing with the MySQL cluster. You would think this is perfectly obvious, but it’s not. So that’s my tip.

Of course, after talking to the guys on efnet #mysql, turns out that MySQL cluster probably won’t benefit OWS anyways. But, we shall see, right? 🙂

OWS v0.8.0.1 released

Tuesday, August 14th, 2007

There was a huge issue with the ows_aggregate plugin in v0.8: sorting just did not work at all. v0.8.0.1 has been released to resolve this issue. Thanks to Jon for pointing this out.

OWS Download Link 

Major Release of Obsessive Website Statistics

Tuesday, August 14th, 2007

Note: This announcement can also be found in the obsessive-compulsive mailing list and the OWS news archives at sourceforge.

The first open source Web 2.0 website log analyzer, Obsessive Website Statistics (OWS) uses PHP and jQuery to provide a powerful and intuitive interface to manipulate website log data stored in a MySQL database via easy to create plugins.

This is a major release of OWS. All users are strongly encouraged to upgrade. v0.7.x is not compatible with v0.8 at all, as the database structure has totally changed for performance and flexibility reasons. You will need to totally delete your old databases and upload logfiles from scratch. This is not expected to happen again in the future.

OWS v0.8 now stores its data in a multidimensional OLAP-style data schema that has shown huge performance gains for data retrieval in our initial testing, and also promises to scale better than the previous releases of OWS. Additionally, OWS plugins have been enhanced to take advantage of the new data schema, and the manual analysis option is now much more intuitive to use for individuals not familiar with SQL.

Download link for OWS
Sourceforge Project Page