Rcjp's Weblog

October 3, 2006

Word Frequency Comparison

Filed under: c, lisp, python — rcjp @ 2:49 pm

Quite a common programming exercise is to write a word frequency counter. For a bit of fun I thought I’d code it in a few languages for comparison. There are essentially two parts to the problem: splitting the text up into real words and keeping a tally of the occurrences. Before we look at the Python, here is an interesting piece of shell script from O’Reilly’s Classic Shell Scripting book:

#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
#       wf [n]
tr -cs A-Za-z\' '\n' |        # Replace nonletters with newlines
  tr A-Z a-z |                # Map uppercase to lowercase
    sort |                    # Sort the words in ascending order
      uniq -c |               # Eliminate duplicates, showing their counts
        sort -k1,1nr -k2 |    # Sort by descending count, and then by ascending word
          sed ${1:-25}q       # Print only the first n (default: 25) lines;


You can then use e.g. head -n10 to get the top ten words, or wf 999999 < textfile | wc -l to find the number of unique words, etc. It is impressive that it can be done, but I think I’d forget what those args to sort did pretty quickly.

Python

There was an entry in the Python shootout (which I can’t seem to find online now):

import sys

def main():
    i_r = map(chr, range(256))

    trans = [' '] * 256
    o_a, o_z = ord('a'), (ord('z')+1)
    trans[ord('A'):ord('Z')+1] = i_r[o_a:o_z]
    trans[o_a:o_z] = i_r[o_a:o_z]
    trans = ''.join(trans)

    count = {}
    for line in sys.stdin:
        for word in line.translate(trans).split():
            try:
                count[word] += 1
            except KeyError:
                count[word] = 1

    l = sorted(zip(count.itervalues(), count.iterkeys()), reverse=True)

    print '\n'.join(["%7s %s" % (count, word) for count, word in l])

main()


and a sample run would be…

    r@laptop:~$ python py/work/wordfreq.py < testwords
   5621 the
   2739 a
   2585 of
   2242 and
   2155 it
   2086 to
   1938 he
   1750 said
   1495 you
   1387 was
   1221 in
   1132 that


But I don’t think I would have come up with that translate solution. My, probably slower, effort was:

import re

def wordfreq_fromfile(filename):
    wordfreq(file(filename).read())

def wordfreq(str):
    freq={}
    words = re.compile(r"[\w'-]+")   # a word is a run of letters, digits, underscores, apostrophes or hyphens
    for word in words.finditer(str):
        w = word.group(0).lower()
        freq[w] = freq.get(w,0) + 1
    # make a list of most common words
    common = sorted(freq, key=freq.get, reverse=True)
    print '\n'.join("%7s %s" % (freq[word], word) for word in common)


and I think it is more obvious what that code is doing.
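
For example, on a little made-up string it prints something like this (the ordering of the count-1 words depends on the dict, so it may vary):

>>> wordfreq("The cat sat on the mat, the end.")
      3 the
      1 cat
      1 sat
      1 on
      1 mat
      1 end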

C++

The standard library’s map container is handy for keeping the word/occurrence pairs, but you can’t re-sort a map by its values, so we copy it into a vector at the end:

#include <iostream>
#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
#include <iomanip>
#include <cstdlib>   // for exit()

typedef std::map<std::string,int>  Dict;
typedef std::pair<std::string,int> DictItem;
typedef std::vector<DictItem>      VDict;

bool wordfreq_greater(const DictItem& a, const DictItem& b)
{
    return a.second > b.second;
}

// operator<< for a DictItem is found by argument-dependent lookup, which
// searches namespace std because std::pair lives there
// (could've just defined a format function in print_words rather than <<)

namespace std{
    std::ostream& operator << (std::ostream& out, const DictItem& d)
    {
        return out << std::setw(7) << d.second << " : " << d.first;
    }
}

template <typename T>
void print_words(T iter, std::string title, int maxcount)
{
    std::cout << title << std::endl
              << std::string(title.size(), '-') << std::endl;
    for(int i = 0; i < maxcount; ++iter, ++i)
        std::cout << *iter << std::endl;
    std::cout << std::endl;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::cerr << "please supply a filename for input" << std::endl;
        exit(1);
    }

    std::ifstream essay(argv[1]);
    if (!essay) {
        std::cerr << "couldn't open " << argv[1] << std::endl;
        exit(1);
    }

    Dict dict;
    std::string word;
    while (essay >> word)    // count up frequency of words
        ++dict[word];

    // print the first 10 words; the map keeps its keys sorted alphabetically
    // by default, i.e. map< std::string, int, std::less<std::string> > dict;

    print_words<Dict::const_iterator> (dict.begin(), "Alphabetically", 10);

    // copy the dict to a vector so we can sort it

    VDict vdict(dict.begin(), dict.end());
    std::sort(vdict.begin(), vdict.end(), wordfreq_greater);

    print_words<VDict::const_iterator>(vdict.begin(), "Frequency", 10);
}


when run with a sample text file it prints something like:

Alphabetically
--------------
      3 : !
    230 : "
      1 : ""allo,
      1 : "'E's
      1 : "'Ha'
      1 : "'Yet
      2 : "...
      2 : "/
      2 : "7
     38 : "A

Frequency
---------
   4836 : the
   2529 : a
   2509 : of
   1997 : to
   1843 : and
   1342 : was
   1304 : said
   1156 : he
   1093 : in
    917 : it


so it could probably do with some work on getting punctuation right.

Common Lisp

Common Lisp doesn’t have regular expressions included, but it’s usually not too much trouble to just write a quick bit of parsing code:

    (defvar *freq* (make-hash-table))

    (defun word-letter-p (c)
      (and (characterp c) (alpha-char-p c)))

    (defun gobble-punctuation (stream)
      (loop for c = (peek-char t stream nil)
         until (word-letter-p c)
         while c do (read-char stream)))

    (defun getword (stream)
      (gobble-punctuation stream)
      (loop for c = (read-char-no-hang stream nil nil)
         while (word-letter-p c)
         collect c into letters
         finally (return (format nil "~{~C~}" letters))))

    (defun wordfreq (stream)
      (clrhash *freq*)
      (loop for word simple-string = (getword stream)
           while (string-not-equal word "")
           do (incf (the fixnum (gethash (intern (string-downcase word)) *freq* 0))))
      (let ((freqlist nil))
        (maphash #'(lambda (k v) (push (cons v k) freqlist)) *freq*)
        (loop for (k . v) in (sort freqlist  #'> :key #'car)
           repeat 15 ;; just dump out the top 15 words
           do (format t "~4A, ~A~%" k v))))

    (defun wordfreq-string (str)
      (with-input-from-string (s str)
        (wordfreq s)))

    (defun wordfreq-file (filename)
      (with-open-file (s filename :direction :input)
        (wordfreq s)))

    (wordfreq-file "/home/r/py/finished/testwords")


of course there are regexp libraries for Common Lisp like Edi Weitz’s cl-ppcre:

    (eval-when (:compile-toplevel :load-toplevel :execute)
      (require :cl-ppcre))

    (defvar *freq* (make-hash-table))

    (defun wordfreq (stream)
      (clrhash *freq*)
      (loop for line simple-string = (read-line stream nil nil)
         while line
         do (cl-ppcre:do-matches-as-strings (word "[\\w'-]+" line)
              (incf (the fixnum (gethash (intern (string-downcase word)) *freq* 0)))))
      (let ((freqlist nil))
        (maphash #'(lambda (k v) (push (cons v k) freqlist)) *freq*)
        (loop for (k . v) of-type (fixnum . symbol) in (sort freqlist  #'> :key #'car)
           repeat 15 ;; just dump out the top 15 words
           do (format t "~4A, ~A~%" k v))))

    (defun wordfreq-string (str)
      (with-input-from-string (s str)
        (wordfreq s)))

    (defun wordfreq-file (filename)
      (with-open-file (s filename :direction :input)
        (wordfreq s)))

    (wordfreq-file "/home/r/py/finished/testwords")


None of these programs are really finished… it’d be very fiddly to get them to deal correctly with all kinds of punctuation marks, and I haven’t bothered to profile any of them; computers are so quick these days that speed is almost irrelevant for this kind of stuff.
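
If anyone did want to compare speeds, a rough timing of the Python version is only a couple of lines. This is just a sketch: it assumes the second Python listing has been saved as a module called wordfreq.py and that a testwords file is to hand.

import time
import wordfreq                           # hypothetical module holding the regex version above

start = time.time()
wordfreq.wordfreq_fromfile('testwords')   # any reasonably sized text file will do
print 'took %.3f seconds' % (time.time() - start)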
