Quick Log File Processing with Perl

A common thing to want to do as a sysadmin is match and print text from a file in a particular output format. There are lots of ways to do this using shell tools - grep, sed and awk are used frequently - but I'd like to show you a common Perl idiom for doing this type of task.

Perl was originally designed to be a replacement for the various shell tools, and while it has grown into much more over the years, it is still a great tool to have in your command line toolbox. Here's an example. Let's say you want to print the date, time, IP address and URL each time your website is crawled by a Googlebot. The Apache access log will look something like this:

...
10.249.66.234 - - [12/Sep/2011:19:22:51 -0400] "GET /robots.txt HTTP/1.1" 404 424 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.249.66.234 - - [12/Sep/2011:19:22:51 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...

A quick solution is this, all in one line:

serenity:~# perl -wnle 'print "Googlebot accessed \"$4\" from $1 on $2 at $3" if (/^ (\d+\.\d+\.\d+\.\d+) .+? \[ (.+?) : (.+?) \s .+? GET\s+(.+?)\s+HTTP .+ Googlebot/x)' /var/log/apache2/access.log
Googlebot accessed "/robots.txt" from 10.249.66.234 on 12/Sep/2011 at 19:22:51
Googlebot accessed "/" from 10.249.66.234 on 12/Sep/2011 at 19:22:51
serenity:~#

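There are four command line options used here:

-w enables warnings.
-n wraps the code in a loop that reads the input one line at a time, without printing each line automatically (that would be -p).
-l strips the newline from each input line and appends one to every print.
-e supplies the code to run on the command line rather than in a script file.

See the perlrun manpage for details; there is much more to Perl's command line processing.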

I build the regular expression by picking a target line and going through it from left to right, adding expressions as I go. I use the /x modifier to make it easier to read: it tells Perl to ignore unescaped whitespace in the regexp. I also use Perl's non-greedy quantifier quite a bit; this is the question mark in expressions like .+? \[. This little snippet matches one or more of any character, followed by a left bracket. The question mark ensures that the first such left bracket is matched. Without it, Perl's regexp engine would happily consume as many characters as it could and match the last left bracket it found in the line. The greedy form .+ \[ would still work for us, since there is only one left bracket in each line, but the non-greedy form can be faster when parsing large text files, because the engine does not have to run to the end of the line and then backtrack. (For more info, I encourage you to read Mastering Regular Expressions by Jeffrey Friedl, or start with the Regular Expression Tutorial.)

This method has a few advantages. For one, it relies on just one tool, not a few disparate ones. Perl is portable to many operating systems, so you could use this to parse text files on Windows, for example. You also have the ability to load modules on the command line with the '-M' switch. This gives you access to all of CPAN, potentially a huge time-saver.