Tuesday, July 1, 2008

How-To: SQL query IIS Logs with Log Parser

Face it: the built-in reports in Community Server are not good enough if you want detailed information about your traffic. Google Analytics is great, but you can't export the data in order to use it in your own queries. A better way is to analyze the IIS log files directly and join the information with the CS database tables. But how can you find useful data in these huge files?
First of all: what do you really want to know?


I wanted to answer the following question: how many requests do we get per day and blog post?
I'm only interested in "real people" requests, so we need to eliminate all requests from bots and feed readers.
I only want to count page requests, so we need to eliminate all requests to images and other files.
I only want to count blog page requests, not navigation on tags or other pages.
I only want to count successful requests.
I want to join the data with the CS database tables in order to query additional meta-information (author, blog name, post creation date, etc.), so some transformation of the standard IISW3C fields has to take place.

Since we are database specialists here at Trivadis, SQL is the natural choice for querying data. But how do you query an IIS log file, which comes in its own special format (not CSV)?

No, don't even try to do it with Excel or Access. Even SQL Server can't easily import dozens of files in a proprietary format. You could try to play with SSIS, but it would kill you, believe me :-). The files can have breaks where the header is repeated somewhere in the middle of a file, and none of the tools I mentioned can deal with that. Besides, the files are simply too big for office tools.
Use Log Parser! This tool is just great, simple to use and extremely powerful. Its only purpose is to query log files (not only IIS), and it does it very well: with a SQL-like query language you can run queries against complete folders of files, and the performance is just amazing. It takes only a few seconds to query gigabytes of data. But it is only a command line tool. The icing on the cake is the UI that the fine people at SerialCoder wrote: Visual Log Parser GUI. It's quite simple, but has all the features you need if you are a GUI-spoiled developer like me :-). Just think of it as a SQL Server Management Studio, but for files.
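
Just to give you an idea how you work with it, a minimal Log Parser run from the command line could look like this (a sketch assuming Log Parser 2.2 and the same log folder I use further down; adjust the path to your environment):

logparser -i:IISW3C "select top 10 cs-uri-stem, count(*) as Hits from 'E:\Blog\*.log' group by cs-uri-stem order by Hits desc"

The -i:IISW3C switch tells Log Parser to parse the files as IIS W3C extended logs, so you can reference the log fields (cs-uri-stem, cs(User-Agent), sc-status and so on) directly in the query.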

A query to provide data for the use case I mentioned above could look like this:

select
ViewDate, PostName, KlickCount, BlogName
using
date as ViewDate
, coalesce( replace_str(extract_prefix(extract_token(cs-uri-stem, 7,'/'),0,'.'),'-',' '),'default') as PostName
, to_uppercase( extract_token(cs-uri-stem, 2 ,'/')) as BlogName
, count(*) as KlickCount
from 'E:\Blog\*.log'
where cs-uri-stem like '/blogs/%'
and cs-uri-stem not like '%.png'
and cs-uri-stem not like '%.jpg'
and cs-uri-stem not like '%.gif'
and cs-uri-stem not like '%rss%'
and cs-uri-stem not like '%/tags/%'
and cs-uri-stem not like '%MainFeed.aspx'
and cs(User-Agent) not like '%bot%'
and cs(User-Agent) not like '%slurp%'
and cs(User-Agent) not like '%crawl%'
and [date] between to_timestamp('2007-12-01' ,'yyyy-MM-dd') and to_timestamp('2007-12-17','yyyy-MM-dd')
and extract_token(cs-uri-stem, 3 ,'/') in ('archive', 'default.aspx')
and sc-status = 200
group by date, PostName, BlogName
order by KlickCount desc

The output should look something like this (one example row):
12.09.2007 00:00:00;mvc castle subsonic bye bye webforms;111;MICHAELKOENINGS
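
If you want to feed the result into another tool, Log Parser can also write it out as CSV. A sketch (assuming the query above is saved in a file called blogviews.sql; the file names are just examples):

logparser file:blogviews.sql -i:IISW3C -o:CSV > blogviews.csv

The -o:CSV switch formats the output as comma separated values, and the redirection puts it into a file that you can import into Access or Excel.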


For Community Server admins:
The PostName column matches the PostName column in the cs_Posts table. The BlogName column matches the ApplicationKey column in the sections table. I used MS Access to bring the data together. This is the perfect starting point for Excel pivot tables about the performance of your site that make every manager happy :-)
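
Just as an illustration, the join in Access could look roughly like this (a sketch only: BlogViews is a made-up name for the imported Log Parser output, and cs_Sections is my guess for the name of the sections table; check the names in your own database):

select v.ViewDate, v.KlickCount, p.PostName, s.ApplicationKey as BlogName
from (BlogViews as v
inner join cs_Posts as p on v.PostName = p.PostName)
inner join cs_Sections as s on v.BlogName = s.ApplicationKey
order by v.KlickCount desc

Access wants the parentheses around nested joins, and since string comparisons in Access are case-insensitive by default, the upper-cased BlogName from the log still matches the ApplicationKey.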


Some notes:


The hits on the default.aspx page cannot be matched with a particular post. I will try to find a solution to match them by date: the last post that was saved before the page was displayed should be the one that gets counted.
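
One idea, sketched as an untested Access query (the DefaultHits table and the PostDate column are just placeholder names for illustration): for every default.aspx hit, pick the newest post that already existed at view time.

select d.ViewDate, d.BlogName,
(select top 1 p.PostName from cs_Posts as p where p.PostDate <= d.ViewDate order by p.PostDate desc) as PostName
from DefaultHits as d

This still ignores which blog the post belongs to, so the subquery would have to be restricted to the right section as well.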


More cool stuff to come: there is a free IP-to-region database available on the internet. With it you can resolve the country, region and city of your visitors' IP addresses. Since it cannot be queried with SQL (it is a proprietary binary format that can only be read via an API), we will use LINQ to help us with that. That's right: a real-world LINQ example! So stay tuned!
