Please note that all blog posts before 8 April 2007 were automatically imported from LiveJournal. To see the comments and any LiveJournal-specific extras such as polls and user icons, please find the source posting at http://brianenigma.livejournal.com/2007/01/
https://netninja.com/files/ircfreq/freq.cpp
/**
 * IRC Freq
 * Word frequency analysis of IRC log files.
 * Copyright 2007, Brian Enigma
 * This work is licensed under a Creative Commons Attribution-Noncommercial-
 * Share Alike 2.5 License (http://creativecommons.org/licenses/by-nc-sa/2.5/)
 * Some rights reserved.
 *
 * Usage:
 *   freq [-d days] [-u user] [-l] filename
 *     days = number of days to go back (otherwise, all days)
 *     user = specific user (otherwise, all users)
 *     -l   = force all words lowercase (so LOL==lol)
 *
 * Originally, Ehsan of Unfiction was trying to prove a point about
 * a particular new IRC channel (one with a lot of newbies to both
 * Alternate Reality Games as well as IRC in general) and used some
 * statistics like the frequency count of the term "LOL." Because I
 * almost constantly log several ARG-related channels, I thought it
 * would be relatively simple to parse those log files and perform a
 * more detailed lexical analysis. This, here, is the program that
 * does that parsing and analysis.
 *
 * The day filter goes back only the specified number of days. For
 * instance, if you had a sudden influx of a certain class of people
 * into the channel recently, you can filter down to the relevant
 * days (because averaging against previous days could skew results.)
 *
 * The username filter will perform analysis only against the things
 * that particular user says, instead of the channel as a whole. If
 * there are specific people who are grammatically or stylistically bad,
 * you can run stats for just them.
 *
 * The lowercase filter forces everything to be lowercase before
 * processing. There are some instances where you want to use this
 * filter and some where you do not. For instance, without the filter,
 * it may be interesting to see how often the incorrect lowercase form
 * of "i" is used, as compared to the correct uppercase "I." On the
 * other hand, if you want to see how often "LOL" is used, as compared
 * to other words, you may want to include the forced-lowercase filter
 * because some people use "lol" and others use "LOL," which are logged
 * as two separate words without the lowercase filter.
 *
 * Note 1: This program works for me. It may not work for you. I use
 * X-Chat Aqua and have tweaked the log file's date stamp format to
 * include the last two digits of the year.
 * Note 2: This is not the most memory-efficient method. It uses some
 * STL objects that may be a bit more heavyweight than they need to be,
 * plus it sorts a container object by copying the contents into a second
 * container object, rather than sorting in-place or moving the items.
 * It was a quick proof-of-concept first pass, so I didn't care so much
 * about memory efficiency. Plus, we all have computers with 2 gigs in
 * them these days. Don't we?
 */
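For anyone who only wants the gist without downloading the file, here is a minimal sketch of the counting step, the lowercase filter, and the sort-by-copying approach mentioned in Note 2. It is not the code from freq.cpp: the function names `countWords` and `sortByFrequency` are made up for illustration, words are split on whitespace only (so punctuation stays attached), and the log parsing behind the day and user filters is left out because it depends on your client's log format.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from freq.cpp): tally whitespace-separated words
// from one chunk of chat text. With forceLower set, "LOL" and "lol" are
// counted as the same word; without it they are separate entries.
void countWords(const std::string& text, bool forceLower,
                std::map<std::string, int>& counts) {
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (forceLower) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::tolower(c));
                           });
        }
        ++counts[word];
    }
}

// Hypothetical helper (not from freq.cpp): copy the map into a second
// container and sort that copy, most frequent word first. Ties are broken
// alphabetically so the order is deterministic.
std::vector<std::pair<std::string, int>> sortByFrequency(
        const std::map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> sorted(counts.begin(),
                                                    counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return sorted;
}
```

To use it, call `countWords` once per line of chat text with the same map, then pass the map to `sortByFrequency` and print the pairs in order. The copy into a vector is the memory cost that Note 2 owns up to; it is there because a `std::map` is ordered by key and cannot be re-sorted by value in place.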
Sample output: 7 day filter, 7 day + lowercase filters