My roommate found a neat site today, http://highscalability.com/, which gives a (unfortunately non-comprehensive) overview of a lot of successful websites and how they coped with growing hordes of traffic. Its way neat. 🙂
Archive for August, 2007
Highly Scalable Websites
Wednesday, August 8th, 2007Improving the speed of large MySQL inserts
Monday, August 6th, 2007Obsessive Web Stats (OWS) has been a surprisingly addictive project for me. I had started doing some (limited) work on it before Boys State in June, and I figured it would be something simple that would be fun to work on… then I started trying to optimize for performance.
The next version of OWS (v0.8) will definitely be using a Star Schema to store its data. I’ve found that by implementing this, it reduces the database size by up to 50-75%. Which, is definitely a positive thing. And its cut some query times by 75% as well, which I’m pretty excited about. Check out this console screenshot from show_info.php:
Name Rows Data Idx ========================================================== virtualroadside_com 105798 7.9 MB 26.9 MB virtualroadside_com_agent 1558 196.6 KB 114.7 KB virtualroadside_com_bytes 6336 229.4 KB 163.8 KB virtualroadside_com_config 4 16.4 KB 16.4 KB virtualroadside_com_date 292 16.4 KB 16.4 KB virtualroadside_com_host 5517 245.8 KB 180.2 KB virtualroadside_com_method 5 16.4 KB 16.4 KB virtualroadside_com_protocol 2 16.4 KB 16.4 KB virtualroadside_com_referrer 2773 409.6 KB 491.5 KB virtualroadside_com_request 3893 540.7 KB 786.4 KB virtualroadside_com_status 1 16.4 KB 16.4 KB virtualroadside_com_time 46409 1.6 MB 3.2 MB virtualroadside_com_user 2 16.4 KB 16.4 KB Total Data: 11.2 MB Total Indexes: 31.9 MB Checking dimensions..............OK. Dimension Rows Unique Status ================================== host 5605 5605 OK user 2 2 OK date 292 292 OK time 45710 45710 OK method 5 5 OK request 3768 3768 OK protocol 2 2 OK status 1 1 OK bytes 6769 6769 OK referrer 2690 2690 OK agent 1470 1470 OK
However, inserts are currently horribly slow. As in, almost unbearably slow. Actually, its not so bad initially: 36 seconds for 10,000 logfile lines, which ends up being around 100,000 SQL queries to insert/retrieve data. However, in-memory caching measures reduce the number of actual SQL queries to around 5-10,000 or so. Not too shabby for my Pentium III.
Once the main ‘fact table’ gets to around 100,000 rows, the insert times start declining… the insert times were around 900 seconds at the 1 millionth row.
Right now, I’m inserting data like so: There are 12 tables holding the different dimensions of the data. Each insert to a new row, I check to see if the dimension key already exists, in which case I reuse it. Of course, this brings up the question of whether I’m properly denormalizing the data or not. One of the useful things I found was the following magic command:
SET UNIQUE_CHECKS=0
This cut my insert time by about 1/6 or so. Pretty awesome. I also tried
ALTER TABLE table DISABLE KEYS
But unfortunately, this makes zero difference on an InnoDB table. I did try switching to a MyISAM based table, but that didn’t seem to make much of a difference either. I’ve been busy scouring the web for performance tips, but I think one of the biggest barriers at the moment is my hardware: Dual Pentium III 500 Mhz with 1GB of ram, and 20gb/40gb IDE disks. I moved the MySQL database to my roommates computer (Pentium D, 1GB RAM, Raid IV) and performance for retrievals went way up, but the inserts are still pretty slow — though not as slow as on the PIII.
In conclusion, I’ve been able to get pretty decent data retrieval speeds from switching to the OLAP data layout, but inserts suck — and as such, OWS still doesn’t scale well. If you use OWS, let me know how its working for you! I’m always interested in hearing other opinions. 🙂
Note: I hate the tag feature of WordPress. It should ask me to tag the article AFTER I’ve written it, otherwise I just forget to do so. Theres some auto-tagging plugins but I haven’t tried them yet. Yes, I realize its open source, but I really don’t feel like hacking on WordPress right now… lol.
PHP Snippet: Padded table on CLI
Thursday, August 2nd, 2007While working on OWS, I created this neat little code snippet, which while it only took a few minutes to code, could be useful for someone just looking for a routine to display a simplistic table on the command line in PHP. Heres the code:
/* Pass this function an array of stuff and it displays a simple padded table. No borders. */
function show_console_table($rows, $prepend = '', $header = true){ $max = array(); // find max first foreach ($rows as $r) for ($i = 0;$i < count($r);$i++) $max[$i] = max(array_key_exists($i,$max) ? $max[$i] : 0 ,strlen($r[$i])); // add a header? if ($header){ // remove the first element $row = array_shift($rows); echo "$prepend"; for ($i = 0;$i < count($row);$i++) echo str_pad($row[$i],$max[$i]) . " "; echo "\\n$prepend" . str_repeat('=',array_sum($max) + count($max)*2) . "\\n"; } foreach($rows as $row){ echo "$prepend"; for ($i = 0;$i < count($row);$i++) echo str_pad($row[$i],$max[$i]) . " "; echo "\\n"; } }
Like I said, pretty simple, but quite useful too. Just pass the function an array, and it outputs a space-padded table with an optional header. Its probably been done already, but thats my implementation. 🙂
Why so-called “Web Apps” bug me
Wednesday, August 1st, 2007I’ve been really really busy lately working on my latest upgrades to OWS. The problem was that after putting in a few million rows into the database, queries took FOREVER. So.. I got myself a few books/articles about OLAP and star schemas and I think I’ve devised a high performance way of storing/retrieving the data. Its not done yet, but once it is I’ll probably write a series of blog posts about it — its some really neat stuff.
But really, what I wanted to mention today is the silliness of a lot of enterprise-level so called “Web Applications”. Prominent examples I can think of that we use/maintain at my workplace are things like Blackboard WebCT, Sungard Banner, and Kronos Workforce Central.
These silly applications call themselves “Web Apps”, but when you go to use them on a brand new computer, they all have the same problem — you need a runtime of some kind to be installed on the computer, or applets, or some other such nonsense. Which really, thats fine and all, but I hate that they call themselves web apps, since they’re just really ‘fat’ applications that happen to run from a web browser. If you need to install something other than a web browser, its NOT a web application anymore (at least, in my opinion). Of course, don’t get me started on the complexity of installing some of the Oracle ‘web’ stuff that some of our people use…. gah.
You know, its funny, most of the capabilities that these applications implement can easily be done using AJAX and javascript technologies — look at google maps, or any one of a number of google products. Or OWS. All of these products act like an application on a desktop, except that they work in the web browser with no strings attached, and will work on a Windows PC or a Mac or Linux… or even a Wii.
These other apps — its not a web application if it only runs on Internet Explorer on Windows (granted, some have firefox plugins too… or use Java, but thats a whole other realm right there ) . At least, not in my book.