I obtained a 1.6GB Apache combined-format logfile from a colleague and have been using it to see how OWS performs on logs of this size. Unfortunately, it looks like OWS does not cope well with this much data. What makes it really disappointing is that the site in question only gets around 1000-5000 unique visitors a day.
The biggest performance problem right now is that MySQL is ignoring the indexes I have set up. From some research, this appears to be InnoDB-related: the optimizer tries to use the primary key for everything instead of using the secondary indexes. The evidence is that on my tables with only 100,000 rows or so the indexes are used normally, while on the table with 6.5 million rows MySQL tries to use the primary key and ends up performing a table scan, which is definitely BAD. When I add FORCE INDEX things seem to work better, but I can’t imagine that’s the proper way to do it. What I’m going to try is to use clustered indexes, with the date as the primary key (since almost every single query deals with the date in some way), and see what kind of performance increase I get.
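To make the clustered-key idea concrete, here is a minimal sketch. It uses SQLite from Python's standard library as a stand-in so it runs without a server; it is not MySQL, and the table and column names (`hits`, `hit_date`, `id`, `url`) are made up for illustration rather than taken from the OWS schema. A SQLite `WITHOUT ROWID` table stores its rows in primary-key order, which is roughly what InnoDB does with its primary key, so putting the date first in the key should let a date-range query read one contiguous slice instead of scanning the whole table.

```python
import sqlite3


def build_demo(conn):
    # Hypothetical schema: the date leads the primary key, so rows are
    # physically ordered by date (id is only there to keep the key unique).
    conn.execute("""
        CREATE TABLE hits (
            hit_date TEXT    NOT NULL,
            id       INTEGER NOT NULL,
            url      TEXT    NOT NULL,
            PRIMARY KEY (hit_date, id)
        ) WITHOUT ROWID
    """)
    # 1000 fake hits spread over 2007-03-01 .. 2007-03-28.
    rows = [("2007-03-%02d" % (1 + i % 28), i, "/page/%d" % (i % 10))
            for i in range(1000)]
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)


def plan_for_range(conn, start, end):
    # Ask SQLite how it would run a date-range count; the last column of
    # each EXPLAIN QUERY PLAN row is the human-readable description.
    cur = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT COUNT(*) FROM hits WHERE hit_date BETWEEN ? AND ?",
        (start, end),
    )
    return " ".join(row[3] for row in cur.fetchall())


conn = sqlite3.connect(":memory:")
build_demo(conn)
plan = plan_for_range(conn, "2007-03-01", "2007-03-07")
print(plan)
```

The printed plan should report a search on the primary key rather than a scan of the table. In MySQL the rough equivalent would be declaring `PRIMARY KEY (hit_date, id)` on an InnoDB table and checking the result with `EXPLAIN`; whether that actually fixes the 6.5-million-row case is something I would still have to measure.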
I think when it comes right down to it, though, a flat MySQL table ends up having the same kind of problem you have with flat files: browsing gigabytes of data is slow. Some of this can probably be eliminated with better queries, but I haven’t quite figured out how to do that yet.
All of this analysis has led me to look into OLAP and other multidimensional ways of representing this kind of data. Right now, I’m thinking I want to redo the backend storage model of OWS so that it’s faster and more efficient, using a different type of data representation (still in MySQL) while keeping the same easy-to-use interface.
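One common way to get OLAP-like behaviour out of a plain relational database is a pre-aggregated summary table: roll the raw hits up once per day and per URL, and have the reports read the small summary instead of the raw log rows. The sketch below is only an illustration of that idea, not the design OWS will end up with. It again uses SQLite from Python as a stand-in for MySQL, and the names `raw_hits`, `daily_summary`, `hits` and `visitors` are hypothetical.

```python
import sqlite3


def build_daily_summary(conn):
    # One row per (date, url) instead of one row per request.
    conn.execute("""
        CREATE TABLE daily_summary (
            hit_date TEXT    NOT NULL,
            url      TEXT    NOT NULL,
            hits     INTEGER NOT NULL,
            visitors INTEGER NOT NULL,
            PRIMARY KEY (hit_date, url)
        )
    """)
    # Roll the raw rows up in a single pass; distinct IPs serve as a
    # crude approximation of unique visitors.
    conn.execute("""
        INSERT INTO daily_summary (hit_date, url, hits, visitors)
        SELECT hit_date, url, COUNT(*), COUNT(DISTINCT ip)
        FROM raw_hits
        GROUP BY hit_date, url
    """)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_hits (hit_date TEXT, ip TEXT, url TEXT)")
conn.executemany("INSERT INTO raw_hits VALUES (?, ?, ?)", [
    ("2007-03-01", "10.0.0.1", "/"),
    ("2007-03-01", "10.0.0.1", "/"),
    ("2007-03-01", "10.0.0.2", "/"),
    ("2007-03-01", "10.0.0.2", "/about"),
    ("2007-03-02", "10.0.0.3", "/"),
])
build_daily_summary(conn)

# Reports now read the summary table, not the raw hits.
summary = conn.execute(
    "SELECT hit_date, url, hits, visitors FROM daily_summary "
    "ORDER BY hit_date, url"
).fetchall()
for row in summary:
    print(row)
```

One caveat with this approach: hit counts can be summed across rows of the summary, but unique-visitor counts cannot, since the same visitor may appear under several URLs or days. Totals like "unique visitors this month" would need their own rollup.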
—
By the way, a great link I’ve found that’s helped me with some of these random issues is http://www.xaprb.com/blog/ , though most of the useful articles were published last year, when he wasn’t so busy. I encourage you to check it out.