Three Internet Privacy Misconceptions

August 11th, 2007

I was having a conversation with my dad last night, and he had a lot of misconceptions about privacy and the Internet. Even simple things like IP addresses and firewalls… people don’t quite understand them. So, I thought it would be a great subject to discuss. Of course, this is NOT a comprehensive list, just some basic thoughts.

Misconception #1:
Everything you do on the internet is anonymous

NO NO NO! This is such a lie… any time you go to a website and use its resources, the operator has the opportunity to record information about you (which may not necessarily be personally identifying). Visiting a website is kind of like visiting a department store: if you walk in, they have a right to record you to safeguard their assets. Same thing with a website: when you contact a server, you are visiting their “department store”, and they can (and will) record information about your visit. If someone gets enough information about you, they can potentially identify you — the AOL search fiasco is a great example of this.

Misconception #2: Firewalls safeguard your privacy

Not completely true. Using the internet is sort of like calling people on a telephone and talking back and forth. Generally speaking, a firewall does not interfere with you calling people; it is designed to stop people from calling YOU. While a firewall is definitely a good thing to have and can stop some types of viruses/spyware, I find that people often give them too much credit, just like antivirus. Personally, I advocate a hardware firewall such as a router or similar device; in my opinion they tend to be more reliable at protecting you than software firewalls (and consume fewer resources)… but that’s a whole different story…

Misconception #3: You have an implicit right to privacy on the Internet

The thing is, you don’t. The internet is a public network, and when you do something on the Internet, you are doing it in public where anyone can ‘see’ you. This is because when you connect to a site, you’re really using a number of different computers to reach it, and any of them can potentially record information about your connection. Expecting privacy on the internet is like getting naked in the middle of a field and expecting that nobody can watch: you only get privacy if there are walls or some other barriers. Keeping with that analogy, THERE ARE NO WALLS ON THE INTERNET (by default, at least). The biggest problem is that people assume the walls exist because they can’t see them, and they don’t.

With all of this said, there are definitely ways to safeguard your privacy on the internet; it’s just a matter of how paranoid you are. Programs like Tor or certain Firefox features can make your Internet experience more anonymous and secure — but the Internet is not secure or anonymous by default.

Pseudo-XKCD

August 9th, 2007

My roommate Jon and I are huge fans of the webcomic XKCD, so a few nights ago we randomly created a really crappy pseudo-XKCD style comic of our own. See for yourself.

Pseudo-XKCD

Highly Scalable Websites

August 8th, 2007

My roommate found a neat site today, http://highscalability.com/, which gives an (unfortunately non-comprehensive) overview of a lot of successful websites and how they coped with growing hordes of traffic. It’s way neat. 🙂

Improving the speed of large MySQL inserts

August 6th, 2007

Obsessive Web Stats (OWS) has been a surprisingly addictive project for me. I had started doing some (limited) work on it before Boys State in June, and I figured it would be something simple that would be fun to work on… then I started trying to optimize for performance.

The next version of OWS (v0.8) will definitely be using a star schema to store its data. I’ve found that implementing this reduces the database size by 50-75%, which is definitely a positive thing, and it has cut some query times by 75% as well, which I’m pretty excited about. Check out this console screenshot from show_info.php:

Name                          Rows    Data      Idx
==========================================================
virtualroadside_com           105798  7.9 MB    26.9 MB
virtualroadside_com_agent     1558    196.6 KB  114.7 KB
virtualroadside_com_bytes     6336    229.4 KB  163.8 KB
virtualroadside_com_config    4       16.4 KB   16.4 KB
virtualroadside_com_date      292     16.4 KB   16.4 KB
virtualroadside_com_host      5517    245.8 KB  180.2 KB
virtualroadside_com_method    5       16.4 KB   16.4 KB
virtualroadside_com_protocol  2       16.4 KB   16.4 KB
virtualroadside_com_referrer  2773    409.6 KB  491.5 KB
virtualroadside_com_request   3893    540.7 KB  786.4 KB
virtualroadside_com_status    1       16.4 KB   16.4 KB
virtualroadside_com_time      46409   1.6 MB    3.2 MB
virtualroadside_com_user      2       16.4 KB   16.4 KB

Total Data:     11.2 MB
Total Indexes:  31.9 MB

Checking dimensions..............OK.

Dimension  Rows   Unique  Status
==================================
host       5605   5605    OK
user       2      2       OK
date       292    292     OK
time       45710  45710   OK
method     5      5       OK
request    3768   3768    OK
protocol   2      2       OK
status     1      1       OK
bytes      6769   6769    OK
referrer   2690   2690    OK
agent      1470   1470    OK
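The layout behind those numbers looks roughly like this (a minimal sketch with made-up table and column names, not the actual OWS schema): each hit becomes one narrow fact row of integer keys, and every repeated string lives exactly once in a dimension table.

```sql
-- A dimension table: each distinct value is stored once,
-- keyed by a small integer id.
CREATE TABLE dim_agent (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    val VARCHAR(255) NOT NULL,
    UNIQUE KEY (val)
) ENGINE=InnoDB;

-- The fact table: one row per log line, holding only integer
-- keys into the dimension tables plus the numeric measure.
CREATE TABLE fact_hits (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    host_id  INT UNSIGNED NOT NULL,
    date_id  INT UNSIGNED NOT NULL,
    agent_id INT UNSIGNED NOT NULL,
    bytes    INT UNSIGNED NOT NULL
) ENGINE=InnoDB;
```

Since a repeated user-agent string then costs 4 bytes per hit instead of a couple hundred, a 50-75% size reduction is exactly the kind of result you would expect.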

However, inserts are currently horribly slow. As in, almost unbearably slow. Actually, it’s not so bad initially: 36 seconds for 10,000 logfile lines, which would normally mean around 100,000 SQL queries to insert/retrieve data. However, in-memory caching reduces the number of actual SQL queries to around 5,000-10,000 or so. Not too shabby for my Pentium III.

Once the main ‘fact table’ gets to around 100,000 rows, the insert times start degrading… by the 1 millionth row, inserts were taking around 900 seconds.

Right now, I’m inserting data like so: there are 12 tables holding the different dimensions of the data. For each new row I insert, I check whether each dimension value already exists, and if it does, I reuse its key. Of course, this brings up the question of whether I’m denormalizing the data properly or not. One of the most useful things I found was the following magic command:
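The check-then-reuse step can be collapsed into a single round trip per dimension by putting a UNIQUE key on the value column and leaning on INSERT … ON DUPLICATE KEY UPDATE. A minimal sketch, with hypothetical table and column names rather than OWS’s actual schema:

```sql
-- The UNIQUE key lets MySQL detect an existing value for us,
-- replacing the separate SELECT-then-INSERT dance.
CREATE TABLE dim_request (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    val VARCHAR(255) NOT NULL,
    UNIQUE KEY (val)
) ENGINE=InnoDB;

-- Insert-or-reuse in one statement: if the value already exists,
-- LAST_INSERT_ID(id) stashes the existing row's key.
INSERT INTO dim_request (val) VALUES ('/index.php')
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

-- Either way, LAST_INSERT_ID() now holds the dimension key,
-- ready to go into the fact table row.
SELECT LAST_INSERT_ID();
```

This still issues one statement per dimension per row, but it removes the extra lookup query and works nicely alongside an in-memory cache.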

SET UNIQUE_CHECKS=0

This cut my insert time by about 1/6 or so. Pretty awesome. I also tried

ALTER TABLE table DISABLE KEYS

But unfortunately, this makes zero difference on an InnoDB table. I did try switching to a MyISAM-based table, but that didn’t seem to make much of a difference either. I’ve been busy scouring the web for performance tips, but I think one of the biggest barriers at the moment is my hardware: dual Pentium III 500 MHz with 1 GB of RAM and 20 GB/40 GB IDE disks. I moved the MySQL database to my roommate’s computer (Pentium D, 1 GB RAM, RAID 4) and retrieval performance went way up, but the inserts are still pretty slow — though not as slow as on the PIII.
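For what it’s worth, the usual advice is to apply settings like UNIQUE_CHECKS as a bracket around the whole batch, inside a single transaction, and to use multi-row VALUES lists to cut down round trips. A sketch (the table here is hypothetical, not OWS’s schema):

```sql
-- Relax checks and defer commits for the duration of the load.
SET unique_checks = 0;
SET autocommit = 0;

-- Multi-row inserts: many log lines per statement instead of
-- one statement per line.
INSERT INTO fact_hits (host_id, date_id, agent_id, bytes)
VALUES (17, 292, 3, 4096),
       (17, 292, 3, 1024),
       (42, 292, 9, 512);

-- Commit once at the end, then restore normal behavior.
COMMIT;
SET unique_checks = 1;
SET autocommit = 1;
```

With InnoDB, committing once per batch rather than once per statement avoids a log flush on every insert, which on IDE disks like mine could plausibly be a big part of the slowdown.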

In conclusion, I’ve been able to get pretty decent data retrieval speeds from switching to the OLAP data layout, but inserts suck — and as such, OWS still doesn’t scale well. If you use OWS, let me know how it’s working for you! I’m always interested in hearing other opinions. 🙂

Note: I hate the tag feature of WordPress. It should ask me to tag the article AFTER I’ve written it; otherwise I just forget to do so. There are some auto-tagging plugins, but I haven’t tried them yet. Yes, I realize it’s open source, but I really don’t feel like hacking on WordPress right now… lol.

PHP Snippet: Padded table on CLI

August 2nd, 2007

While working on OWS, I created this neat little code snippet. While it only took a few minutes to write, it could be useful for someone just looking for a routine to display a simplistic table on the command line in PHP. Here’s the code:

/*
	Pass this function an array of rows and it displays a simple
	space-padded table. No borders.
*/
function show_console_table($rows, $prepend = '', $header = true){

	$max = array();

	// find the maximum width of each column first
	foreach ($rows as $r)
		for ($i = 0; $i < count($r); $i++)
			$max[$i] = max(array_key_exists($i, $max) ? $max[$i] : 0, strlen($r[$i]));

	// add a header?
	if ($header){

		// the first row is the header
		$row = array_shift($rows);

		echo $prepend;
		for ($i = 0; $i < count($row); $i++)
			echo str_pad($row[$i], $max[$i]) . "  ";

		// underline: each column is padded to $max[$i] plus two spaces
		echo "\n$prepend" . str_repeat('=', array_sum($max) + count($max)*2) . "\n";
	}

	foreach ($rows as $row){
		echo $prepend;
		for ($i = 0; $i < count($row); $i++)
			echo str_pad($row[$i], $max[$i]) . "  ";
		echo "\n";
	}
}

Like I said, pretty simple, but quite useful too. Just pass the function an array, and it outputs a space-padded table with an optional header. It’s probably been done already, but that’s my implementation. 🙂

Why so-called “Web Apps” bug me

August 1st, 2007

I’ve been really, really busy lately working on my latest upgrades to OWS. The problem was that after putting a few million rows into the database, queries took FOREVER. So… I got myself a few books/articles about OLAP and star schemas, and I think I’ve devised a high-performance way of storing/retrieving the data. It’s not done yet, but once it is I’ll probably write a series of blog posts about it — it’s some really neat stuff.

But really, what I wanted to mention today is the silliness of a lot of enterprise-level so-called “Web Applications”. Prominent examples I can think of that we use/maintain at my workplace are things like Blackboard WebCT, Sungard Banner, and Kronos Workforce Central.

These silly applications call themselves “Web Apps”, but when you go to use them on a brand new computer, they all have the same problem — you need a runtime of some kind installed on the computer, or applets, or some other such nonsense. Which really, that’s fine and all, but I hate that they call themselves web apps, since they’re really just ‘fat’ applications that happen to run from a web browser. If you need to install something other than a web browser, it’s NOT a web application anymore (at least, in my opinion). Of course, don’t get me started on the complexity of installing some of the Oracle ‘web’ stuff that some of our people use… gah.

You know, it’s funny: most of the capabilities that these applications implement can easily be done using AJAX and JavaScript — look at Google Maps, or any one of a number of Google products. Or OWS. All of these products act like a desktop application, except that they work in the web browser with no strings attached, and will work on a Windows PC or a Mac or Linux… or even a Wii.

As for these other apps — it’s not a web application if it only runs in Internet Explorer on Windows (granted, some have Firefox plugins too… or use Java, but that’s a whole other realm right there). At least, not in my book.

Faceball

July 26th, 2007

Heh.. this is amusing. Faceball…

“At its simplest level Faceball involves two people hitting beachballs at each other’s faces. At a deeper level it’s a vehicle for the release of personal animosity, and the Shaming of the Weak.”

I think that’s all the introduction it needs.

http://faceball.org/

http://flickr.com/groups/faceball/

MySQL and indexes

July 26th, 2007

I was able to obtain a 1.6 GB Apache combined logfile from a colleague, and have been using it to see how well OWS performs on logs of this size. Unfortunately, it looks like OWS does not handle data this size well. What makes it really disappointing is that the site in question only gets around 1,000-5,000 unique visitors a day.

The biggest performance-related problem right now is that MySQL is ignoring the indexes that I have set up. From some research, this is apparently an InnoDB-related problem where it tries to use the primary key for everything instead of the secondary indexes. I’ve seen this myself: my tables with only 100,000 rows or so use the indexes normally, while the table with 6.5 million rows falls back to the primary key (and performs a table scan, which is definitely BAD). When I use FORCE INDEX, things seem to work better, but I can’t imagine that’s the proper way to do it. What I’m going to try is using the clustered index to my advantage by making the date the primary key (since almost every single query deals with the date in some way), and see what kind of performance increase I get.
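As a sketch of the two workarounds (the table, columns, and index names here are made up for illustration, not the actual OWS schema):

```sql
-- Workaround 1: force the optimizer onto a secondary index
-- instead of letting it scan the clustered primary key.
SELECT COUNT(*)
FROM access_log FORCE INDEX (idx_date)
WHERE date_id BETWEEN 20070701 AND 20070731;

-- Workaround 2: since nearly every query filters on the date,
-- make the date the leading column of the primary key so the
-- clustered index itself is ordered by date. The extra KEY (id)
-- is needed if id is AUTO_INCREMENT.
ALTER TABLE access_log
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (date_id, id),
    ADD KEY (id);
```

Because InnoDB stores rows in primary-key order, the second approach should make date-range scans sequential, though every secondary index also carries a copy of the (now wider) primary key.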

I think when it comes right down to it though, using a flat MySQL table ends up having the same types of problems you have with flat files — browsing gigabytes of data is slow. Of course, some of this can probably be eliminated with better queries, but I haven’t quite figured out how to do that.

All of this analysis stuff has brought me to examining OLAP and other multidimensional ways of representing these kinds of data. Right now, I’m thinking I want to redo the backend storage model of OWS so it’s more efficient and fast using a different type of data representation (still using MySQL), while maintaining the same easy-to-use interface.

By the way, a great link I’ve found that’s helped me with some of these random issues is http://www.xaprb.com/blog/, though most of the useful articles were published last year when he wasn’t so busy. I encourage you to check it out.

Apparently Microsoft uses Firefox too!

July 24th, 2007

Haha… I was browsing the Facebook developer site and they linked to Microsoft’s website in relation to MS’s Popfly application. Apparently it has support for Facebook application development or some such thing. Well, if you take one good look at the screenshot they were using…

popfly.jpg

Recognize those tabs? It’s Firefox! Apparently Microsoft likes Firefox too. 😀 Of course, we probably already knew that, but this is further proof!

Link: http://msdn.microsoft.com/vstudio/express/showcase/

Image Link: http://msdn.microsoft.com/vstudio/express/images/facebook/popfly.jpg

Optimizing a really nasty SQL query

July 23rd, 2007

Ok, so while working on Obsessive Website Statistics (OWS), I’ve hit a snag. See, OWS tries to be semi-intelligent and combines the parameters of all the installed plugins into really giant/nasty SQL queries that make you shudder, but tend to work.

So right now, I’m trying to select the following (at the same time):

  • All pages, grouped
  • COUNT() of all hosts
  • COUNT() of all pages that end with .html, .htm, .php, /
  • COUNT() of all pages
  • SUM() of filesizes

So of course, getting the 1st, 2nd, 4th, and 5th items is pretty easy. However, the third item is getting annoying. I tried using a subquery, but considering the table has 100,000 rows, this is a particularly slow query:

SELECT 
	virtualroadside_com.request_str,
	COUNT(DISTINCT virtualroadside_com.host),
	c.b,
	COUNT(virtualroadside_com.filename),
	SUM(virtualroadside_com.bytes) 
FROM 
	(	
		SELECT 
			virtualroadside_com.request_str AS a,
			COUNT(virtualroadside_com.id) AS b 
		FROM 
			virtualroadside_com 
		WHERE 
			(virtualroadside_com.filename LIKE '%.html' 
			OR virtualroadside_com.filename LIKE '%/' 
			OR virtualroadside_com.filename LIKE '%.htm' 
			OR virtualroadside_com.filename LIKE '%.php') 
		GROUP BY virtualroadside_com.request_str 
		ORDER BY virtualroadside_com.request_str DESC 
		LIMIT 0,100
	) c,
	virtualroadside_com 
WHERE 
	virtualroadside_com.request_str = c.a
GROUP BY virtualroadside_com.request_str 
ORDER BY virtualroadside_com.request_str DESC 
LIMIT 0,100;

So, the big question here is: is there a better way to accomplish this sort of thing without using subqueries? This particular query takes around 40 seconds on 105,000 rows on my computer (dual PIII, 500 MHz). I’m positive there’s a way to do this with a JOIN of some kind, but I can’t get any of those to work correctly either. Let me know if you have any good ideas! I’ll publish a better way to do this, hopefully in the next few days once I figure it out. 🙂
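One subquery-free approach is conditional aggregation: count the html-ish pages with a CASE expression inside SUM(), so everything happens in a single pass and a single GROUP BY. Note this changes the semantics slightly — every request_str in the top 100 appears whether or not it has any html-ish hits, whereas the subquery version limits to request_strs that do — but if that’s acceptable, the sketch would be:

```sql
-- Single-pass version: the CASE turns the "html-ish page" test
-- into a per-row 0/1, and SUM() counts the matches, so no
-- derived table or self-join is needed.
SELECT
    request_str,
    COUNT(DISTINCT host),
    SUM(CASE WHEN filename LIKE '%.html'
              OR filename LIKE '%/'
              OR filename LIKE '%.htm'
              OR filename LIKE '%.php'
        THEN 1 ELSE 0 END) AS html_hits,
    COUNT(filename),
    SUM(bytes)
FROM virtualroadside_com
GROUP BY request_str
ORDER BY request_str DESC
LIMIT 0, 100;
```

Since the table is only scanned once instead of twice (once for the derived table, once for the join), this should be substantially faster on 105,000 rows, though I haven’t benchmarked it against the subquery version.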