One of our TF2 server admins grabbed the log files from one of our trade servers, pulled out the chat lines, and turned them into a pretty neat word cloud using Wordle.  Our users thought it was a fun representation of the server.  But 10,000 lines?  That’s not very reliable data!

Since we log all chat across all servers to a database, I thought it would be pretty easy to turn a query into a more accurate representation.  I ran the query and spat out a 19MB text file for one of our servers (1.22 million chat lines).  Now all I had to do was find a service that would turn it into a word cloud.  Turns out that’s easier said than done.  I looked at 15 different services and none of them allowed files larger than 5MB.  I tried a downloadable solution, but it took about 5 hours to parse the file (and if you selected any filtering options, it started over from scratch).  Sure, I could cut the data down to only the first 100,000 lines, but I didn’t like that solution.  Instead I used some command line tools to ease the burden on the word cloud generator.

First, the query (if you want the smallest file, terminate lines with '' instead of '\n'):

SELECT `message` FROM `chat` WHERE `server` = "ABDB" INTO OUTFILE '/tmp/dustbowlchat.txt' FIELDS TERMINATED BY ' ' ESCAPED BY '' LINES TERMINATED BY '\n';
Query OK, 1227593 rows affected (10.19 sec)
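
As an aside, INTO OUTFILE needs the FILE privilege and writes on the database host itself. If you only have a remote client, the mysql command line tool can produce the same dump; a minimal sketch, assuming a hypothetical database name of gameservers:

mysql -N -B -e 'SELECT `message` FROM `chat` WHERE `server` = "ABDB"' gameservers > dustbowlchat.txt

-N suppresses the column header and -B gives plain batch output, so you still end up with one message per line.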

Out of curiosity, I wanted a word count as well.

[root@fp-ashburn-1 tmp]# wc dustbowlchat.txt
 1227593  3844828 19882426 dustbowlchat.txt

Cool, just over 3.1 words per line on average (3,844,828 words / 1,227,593 lines ≈ 3.13). About what I expected from a bunch of gamers likely saying “lol” and “rekt” all the time.
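
If you’d rather have the shell do that division for you, awk can compute the same average directly (just a quick sketch against the same file, separate from the pipeline below):

awk '{ words += NF } END { printf "%.2f words per line\n", words / NR }' dustbowlchat.txt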

Now here’s where the magic happens. Remember how it took about 5 hours to parse all the data using a pre-built Java application? Using tr (translate), sort, and uniq we can do the entire job in about 6 seconds. Quite a difference!

tr '[:upper:]' '[:lower:]' < dustbowlchat.txt | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | sort -nr | head -300 | awk '{print $2 ":" $1}' > dustbowlsummary.txt

Let’s break down this one-liner (a toy run on sample input follows the list):

  1. Convert all text from upper to lowercase
  2. Delete all punctuation
  3. Convert spaces into newlines (so each word is now on its own line)
  4. Sort the document
  5. Collapse duplicate words and prefix each with a count (a line now looks like: “ 1003145 lol”)
  6. Sort by the count column in reverse order
  7. Only return the top 300 results
  8. Format the data for Wordle (so lines look like: “lol:1003145”)
  9. Save to a file.  You can drop the redirect and copy/paste straight from the terminal if you want to save a step.
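
To sanity-check the pipeline, here’s a toy run on three fake chat lines (the order of tied counts can vary between sort implementations):

printf 'LOL gg\nlol!! Rekt\ngg\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{print $2 ":" $1}'

which prints:

lol:2
gg:2
rekt:1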

Now just paste that output into Wordle and you have a relatively accurate word cloud! In my case, though, removing punctuation also meant things like “:)” and “;P” weren’t included, which wasn’t a huge deal. UTF-8 emoticons were also a bit mangled due to encoding issues, but Wordle doesn’t support them anyway. Sorry, Lenny face spammers!
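
If you were curious how often a particular emoticon showed up anyway, grep can count occurrences before the punctuation strip wipes them out. For example:

grep -o ':)' dustbowlchat.txt | wc -l

grep -o prints each match on its own line, so wc -l gives the total number of occurrences rather than just the number of matching lines.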