Friday, January 4, 2008

The difference a review can make

One of my previous classes involved building a search engine for documents. The professor demonstrated different ways of indexing them. One that really jumped out at me was n-gramming. Ada's ability to call Enumeration_Type'Value on a string holding a literal's name and get back a value of Enumeration_Type really made this attractive to me. It seemed like a pretty straightforward idea that should be easy to implement.
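As a minimal sketch of the feature that made this look attractive, here is 'Value applied to a tiny stand-in type. The procedure name Value_Demo and the three-literal Ngram type are made up for illustration; the real type had one literal per trigram.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Value_Demo is
   --  Hypothetical three-literal stand-in for the full trigram type.
   type Ngram is (aaa, aab, aac);

   N : constant Ngram := Ngram'Value ("aab");
begin
   --  'Value turns the text of a literal into the literal itself,
   --  ignoring case and leading or trailing blanks.
   Put_Line (Ngram'Image (N) & " is at position"
             & Integer'Image (Ngram'Pos (N)));

   --  Text that names no literal of the type raises Constraint_Error.
   begin
      Put_Line (Ngram'Image (Ngram'Value ("zzz")));
   exception
      when Constraint_Error =>
         Put_Line ("zzz is not a literal of this type");
   end;
end Value_Demo;
```

With the full type declared, turning a three-character slice of a document into an index key would be a single attribute call, which is why the idea looked so cheap at first.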

It wasn't.

I wrote a script to define the ngram type as every combination of 3 letters. The script wrote out a file of the form "aaa, aab, aac, .. zzx, zzy, zzz". There are 17576 combinations of 3 letters. Ada has 12 reserved words that are 3 letters long (abs, all, and, end, for, mod, new, not, out, rem, use and xor), so I had to code around that. My once neat "abs" became "absa" and I had to add a case statement with 7 choices and at least one if-else in every choice of that case. Yipe. What had been a very close representation of the data became a burden. The conversion function and the accompanying type declaration make for a nearly 90 KB file. Ugh.
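To make the workaround concrete, here is a much simplified sketch of that kind of conversion. It is not the original function, which used a case statement. It assumes lowercase input and assumes that every reserved word gets the same "a" suffix that "abs" did. The names Reserved_Demo, Is_Reserved and To_Ngram are made up for illustration.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Reserved_Demo is
   --  Tiny stand-in for the generated type. "abs" is reserved, so its
   --  literal carries an extra "a".
   type Ngram is (abr, absa, abt);

   --  True for the twelve three-letter reserved words (lowercase only).
   function Is_Reserved (S : String) return Boolean is
   begin
      return S = "abs" or else S = "all" or else S = "and" or else
             S = "end" or else S = "for" or else S = "mod" or else
             S = "new" or else S = "not" or else S = "out" or else
             S = "rem" or else S = "use" or else S = "xor";
   end Is_Reserved;

   --  Assumed rule: reserved words map to the literal with "a" appended.
   function To_Ngram (S : String) return Ngram is
   begin
      if Is_Reserved (S) then
         return Ngram'Value (S & "a");
      else
         return Ngram'Value (S);
      end if;
   end To_Ngram;
begin
   Put_Line (Ngram'Image (To_Ngram ("abs")));  --  the absa literal
   Put_Line (Ngram'Image (To_Ngram ("abt")));  --  an ordinary literal
end Reserved_Demo;
```

Even in this reduced form, the special case leaks into every conversion, and the literal names no longer match the data one to one.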

Using a script to write the data type for me should have been a clue.

I've been looking over that code recently and came up with a much neater solution. Here is the declaration and test of what should be a nearly drop-in replacement, and it is probably faster to boot:
(trigram_test.adb)

In comparison, the original enumeration type declaration took 704 lines. This new type might not be as representative, but I never used the representation in the old implementation. It is far easier to understand and maintain.
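For readers without trigram_test.adb at hand, here is one way a trigram type can be declared without listing every literal. This is an assumption about the general shape of such a type, not a copy of that file: it packs the three letters into a base-26 number, and the names Trigram_Sketch, To_Trigram and To_Text are made up for illustration.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Trigram_Sketch is
   subtype Letter is Character range 'a' .. 'z';
   subtype Trigram_Text is String (1 .. 3);

   --  One value per three-letter combination: 26**3 = 17576 values.
   type Trigram is range 0 .. 26 ** 3 - 1;

   --  Read the three letters as digits of a base-26 number.
   function To_Trigram (S : Trigram_Text) return Trigram is
      Result : Trigram := 0;
   begin
      for I in S'Range loop
         --  The qualification raises Constraint_Error for any
         --  character outside 'a' .. 'z'.
         Result := Result * 26
           + Trigram (Character'Pos (Letter'(S (I))) - Character'Pos ('a'));
      end loop;
      return Result;
   end To_Trigram;

   --  Inverse conversion, peeling off base-26 digits from the right.
   function To_Text (T : Trigram) return Trigram_Text is
      Rest   : Trigram := T;
      Result : Trigram_Text;
   begin
      for I in reverse Result'Range loop
         Result (I) :=
           Character'Val (Character'Pos ('a') + Integer (Rest mod 26));
         Rest := Rest / 26;
      end loop;
      return Result;
   end To_Text;
begin
   Put_Line (Trigram'Image (To_Trigram ("aaa")));
   --  Reserved words such as "abs" need no special case here.
   Put_Line (Trigram'Image (To_Trigram ("abs")));
   Put_Line (To_Text (To_Trigram ("zzz")));
end Trigram_Sketch;
```

A declaration along these lines fits in a few dozen lines, and the reserved words stop mattering because the trigrams are never Ada identifiers.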
