Please note that all blog posts before 8 April 2007 were automatically imported from LiveJournal. To see the comments and any LiveJournal-specific extras such as polls and user icons, please find the source posting at http://brianenigma.livejournal.com/2007/01/
* IRC Freq
* Word frequency analysis of IRC log files.
* Copyright 2007, Brian Enigma
* This work is licensed under a Creative Commons Attribution-Noncommercial-
* Share Alike 2.5 License (http://creativecommons.org/licenses/by-nc-sa/2.5/)
* Some rights reserved.
* freq [-d days] [-u user] [-l] filename
* days = number of days to go back (otherwise, all days)
* user = specific user (otherwise, all users)
* -l = force all words lowercase (so LOL==lol)
* Originally, Ehsan of Unfiction was trying to prove a point about
* a particular new IRC channel (one with a lot of newbies to both
* Alternate Reality Games as well as IRC in general) and used some
* statistics like the frequency count of the term "LOL." Because I
* almost constantly log several ARG-related channels, I thought it
* would be relatively simple to parse those log files and perform a
* more detailed lexical analysis. This, here, is the program that
* does that parsing and analysis.
* The day filter goes back only the specified number of days. For
* instance, if you had a sudden influx of a certain class of people
* into the channel recently, you can filter down to the relevant
* days (because averaging against previous days could skew results).
* The username filter will perform analysis only against the things
* that particular user says, instead of the channel as a whole. If
* there are specific people who are grammatically or stylistically bad,
* you can run stats for just them.
* The lowercase filter forces everything to be lowercase before
* processing. There are some instances where you want to use this
* filter and some where you do not. For instance, without the filter,
* it may be interesting to see how often the incorrect lowercase form
* of "i" is used, as compared to the correct uppercase "I." On the
* other hand, if you want to see how often "LOL" is used, as compared
* to other words, you may want to include the forced-lowercase filter
* because some people use "lol" and others use "LOL," which are logged
* as two separate words without the lowercase filter.
* Note 1: This program works for me. It may not work for you. I use
* X-Chat Aqua and have tweaked the log file's date stamp format to
* include the last two digits of the year.
* Note 2: This is not the most memory-efficient method. It uses some
* STL objects that may be a bit more heavyweight than they need to be,
* plus it sorts a container object by copying the contents into a second
* container object, rather than sorting in-place or moving the items.
* It was a quick proof-of-concept first pass, so I didn't care so much
* about memory efficiency. Plus, we all have computers with 2 gigs in
* them these days. Don't we?
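To make the lowercase filter and the copy-then-sort approach from Note 2 concrete, here is a minimal sketch. It is not the program's actual source: the names toLower, countWords, and sortByFrequency are hypothetical, and it assumes words are split on whitespace only, with no log-line parsing, day filter, or user filter.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper standing in for the -l flag: returns a lowercase copy.
std::string toLower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// Tally whitespace-separated words from one message into `counts`.
// With forceLower set, "LOL" and "lol" land in the same bucket.
void countWords(const std::string& text, bool forceLower,
                std::map<std::string, int>& counts) {
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (forceLower) {
            word = toLower(word);
        }
        ++counts[word];
    }
}

// Copy the map into a second container and sort that copy by descending
// count (ties broken alphabetically). This is the memory-hungry approach
// Note 2 describes: the data briefly exists twice.
std::vector<std::pair<std::string, int>> sortByFrequency(
        const std::map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) {
                      return a.second > b.second;
                  }
                  return a.first < b.first;
              });
    return sorted;
}
```

Calling countWords on "LOL lol i I lol" with forceLower set to false yields four separate entries (LOL, lol, i, I). With it set to true, they collapse to two: "lol" with a count of 3 and "i" with a count of 2.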
Sample output: 7 day filter, 7 day + lowercase filters