Generating a 2016 Word Cloud

Over the years, Netninja has shifted from a LiveJournal-era personal blog to a chronicle of DIY maker projects. I thought I’d generate a word cloud for a year-end 2016 post, but realized halfway through that it’d be even better if I showed my work.

But first, the word cloud!


This year has been all about design, code, the Arduino, Alexa, and stickers. If you look closely, you might even find a [puzzled] pint.

So how did I make it? Simple. It was a three-step process: export the year of posts, extract the article text, and run it through a cloud generator.

Exporting was easy. WordPress has a built-in export feature that lets you select a date range, then download the resulting file.

The file is XML, specifically a variation of RSS that includes some WordPress-specific metadata such as post categories. There are a ton of words in there, but they’re not all content. There’s too much markup and metadata to form a useful cloud straight from the export. What we really want is a text file of just the article content.

Extracting the text requires some command-line skills. I used a tool called xmlstarlet, which works as a sort of XML/XPath Swiss Army Knife. For the word cloud, plain text is required, so I needed to select just the content of all //item/content:encoded elements and write the result to a text file. The specific command looks like this:

xmlstarlet sel -T -t -c "//item/content:encoded" export.xml |
    sed 's/<[^>]*>//g' |
    sed 's/\[ gallery .*]//g' |
    sed 's/\[ caption .*]//' > output.txt

…except you’ll want to remove the space between the open square bracket and the terms “gallery” and “caption” since those are special reserved shortcodes that WordPress uses to render images.

For those curious about the details, the sequence of operations that command performs is:

  • sel puts xmlstarlet in select mode, as opposed to edit, transform, etc.
  • The -T says we want to output plaintext, not escaped XML. We want actual angle-brackets and not ones encoded as “&lt;”.
  • The -t starts a template, and the -c tells that template to copy whatever the XPath expression that follows selects.
  • The remainder of that line is the XPath to all the elements we want to select and the WordPress export XML we want to perform these actions on.
  • The first sed command strips out HTML tags.
  • The second and third sed commands strip out the gallery and caption WordPress shortcodes.
  • The results get written to output.txt.
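If you want to see what the three sed stages do without running a full export, here is a minimal sketch. It assumes the shortcode patterns have had their extra space removed, as noted above. The strip_markup function name and the two sample lines are made up for illustration; they are not from a real export.

```shell
#!/bin/sh
# strip_markup: hypothetical helper wrapping the three sed stages,
# with the space after each "[" removed.
strip_markup() {
    sed 's/<[^>]*>//g' |
    sed 's/\[gallery .*]//g' |
    sed 's/\[caption .*]//'
}

# Made-up sample lines standing in for real post content.
cleaned=$(printf '%s\n' \
    '<p>Built an <strong>Arduino</strong> lamp.</p>[gallery ids="1,2"]' \
    '[caption id="x"]<img src="lamp.png" /> The finished lamp[/caption]Stickers arrived.' |
    strip_markup)

printf '%s\n' "$cleaned"
```

One thing to notice: because .* is greedy, the caption pattern matches everything from the opening shortcode through the last closing bracket on the line, so the caption text itself is dropped along with the shortcode. For a word cloud that is probably acceptable, but it does mean caption words are not counted.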

Finally, generate your word cloud. I used wordclouds.com, but you’re free to use whatever cloud generator you desire. Although they have a setting to remove stop words (common words such as “and” and “the”), I didn’t think the word list backing that feature was large enough. It left a few words like “I’ve” and “you’re” that didn’t seem to add much value to the resulting cloud. I manually pruned them from the list.
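The pruning above was done by hand in the generator's word list. If you would rather filter the text before uploading it, something like the following sketch could work. It is an alternative, not what was done for this post. The prune_words function, the temporary stop-word file, and the sample sentence are all hypothetical.

```shell
#!/bin/sh
# prune_words: hypothetical helper. Splits stdin into one word per line,
# then drops every word that exactly matches (ignoring case) a line in
# the stop-word file passed as the first argument.
prune_words() {
    tr -s '[:space:]' '\n' | grep -v -i -x -F -f "$1"
}

# Made-up stop-word list holding the leftovers the built-in filter missed.
stopfile=$(mktemp)
printf '%s\n' "I've" "you're" > "$stopfile"

pruned=$(printf '%s\n' "I've built Arduino stickers and you're invited" |
    prune_words "$stopfile")
rm -f "$stopfile"

printf '%s\n' "$pruned"
```

This only catches whole words separated by whitespace, so a word with punctuation attached (a trailing comma, say) or one typed with a curly apostrophe would slip through unless the stop-word file lists that exact form.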

Cheers to a new year of projects and making cool things!

Posted in: Dear Diary

Published by

Brian Enigma

Brian Enigma is a Portlander, manipulator of atoms & bits, minor-league blogger, and all-around great guy. He typically writes about the interesting “maker” projects he’s working on, but sometimes veers off into puzzles, software, games, local news, and current events.
