ircfreq

Please note that all blog posts before 8 April 2007 were automatically imported from LiveJournal.  To see the comments and any LiveJournal-specific extras such as polls and user icons, please find the source posting at http://brianenigma.livejournal.com/2007/01/

https://netninja.com/files/ircfreq/freq.cpp

/**
 * IRC Freq
 * Word frequency analysis of IRC log files.
 * Copyright 2007, Brian Enigma 
 * This work is licensed under a Creative Commons Attribution-Noncommercial-
 * Share Alike 2.5  License (http://creativecommons.org/licenses/by-nc-sa/2.5/)
 * Some rights reserved.
 * 
 * Usage:
 * freq [-d days] [-u user] [-l] filename
 * days = number of days to go back (otherwise, all days)
 * user = specific user (otherwise, all users)
 * -l = force all words lowercase (so LOL==lol)
 *
 * Originally, Ehsan of Unfiction was trying to prove a point about
 * a particular new IRC channel (one with a lot of newbies to both
 * Alternate Reality Games as well as IRC in general) and used some
 * statistics like the frequency count of the term "LOL."  Because I
 * almost constantly log several ARG-related channels, I thought it
 * would be relatively simple to parse those log files and perform a
 * more detailed lexical analysis.  This, here, is the program that
 * does that parsing and analysis.
 * 
 * The day filter goes back only the specified number of days.  For 
 * instance, if you had a sudden influx of a certain class of people
 * into the channel recently, you can filter down to the relevant
 * days (because averaging against previous days could skew results).
 *
 * The username filter will perform analysis only against the things
 * that particular user says, instead of the channel as a whole.  If
 * specific people write with notably poor grammar or style, you can
 * run stats for just them.
 *
 * The lowercase filter forces everything to be lowercase before 
 * processing.  There are some instances where you want to use this
 * filter and some where you do not.  For instance, without the filter,
 * it may be interesting to see how often the incorrect lowercase form
 * of "i" is used, as compared to the correct uppercase "I."  On the 
 * other hand, if you want to see how often "LOL" is used, as compared
 * to other words, you may want to include the forced-lowercase filter
 * because some people use "lol" and others use "LOL," which are logged
 * as two separate words without the lowercase filter.
 * 
 * Note 1: This program works for me.  It may not work for you.  I use
 * X-Chat Aqua and have tweaked the log file's date stamp format to
 * include the last two digits of the year.  
 * Note 2: This is not the most memory-efficient method.  It uses some
 * STL objects that may be a bit more heavyweight than they need to be,
 * plus it sorts a container object by copying the contents into a second
 * container object, rather than sorting in-place or moving the items.
 * It was a quick proof-of-concept first pass, so I didn't care so much
 * about memory efficiency.  Plus, we all have computers with 2 gigs in 
 * them these days.  Don't we?
 */
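
The header comment describes two mechanics that are easy to picture in code: the forced-lowercase option, and sorting the counts by copying them into a second container. What follows is only a minimal sketch of that idea, not the code in freq.cpp. The function names `countWords` and `sortByFrequency` are hypothetical, and it assumes words are simply separated by whitespace, with no log-line parsing, date filtering, or user filtering.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: count whitespace-separated words in `text`.
// When forceLower is true, each word is lowercased first, so "LOL"
// and "lol" are tallied as the same word.
std::map<std::string, int> countWords(const std::string &text, bool forceLower)
{
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (forceLower) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::tolower(c));
                           });
        }
        ++counts[word];
    }
    return counts;
}

// Hypothetical helper: copy the map into a second container (a vector)
// and sort that copy by descending count.  std::stable_sort keeps the
// map's alphabetical order among words with equal counts.
std::vector<std::pair<std::string, int>>
sortByFrequency(const std::map<std::string, int> &counts)
{
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const std::pair<std::string, int> &a,
                        const std::pair<std::string, int> &b) {
                         return a.second > b.second;
                     });
    return sorted;
}
```

Calling `countWords("LOL lol I i lol", false)` keeps "LOL" and "lol" (and "I" and "i") as separate entries, while passing `true` merges each pair. The copy in `sortByFrequency` is the memory trade-off Note 2 mentions: the map and the vector both hold every word at once.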

Sample output: 7-day filter, 7-day + lowercase filters

Posted in: Code Projects
