We've Moved


The blog has been retired - it's up for legacy reasons, but these days I'm blogging at blog.theodox.com. All of the content from this site has been replicated there, and that's where all of the new content will be posted. The new feed is here . I'm experimenting with crossposting from the live site, but if you want to keep up to date use blog.theodox.com or just theodox.com

Saturday, June 6, 2015

Porting Spelchek to Boo


What could be more ghostly than a post mortem?

If my last post about Boo piqued your interest, but you haven’t had time to do a deep dive into the language to see for yourself, I’ve posted a version of the Spelchek Python spell checker module converted to Boo so you can see the similarities and differences between the two languages.
The original Python version is here and the Boo port is here. As a good indication of what I’ve been saying about the economy of Boo syntax, the Boo version comes in at almost the same size as the Python original (5.05 kb for Boo and 4.95kb for Python) and pretty much the same number of lines – I haven’t done the excersize of converting it to C# for comparison but I’d guess the C# version would come in at about half again as much typing.
Looking at the code, significant chunks are almost identical: the logic is pretty much the same and the type annotations are the only real difference.
Thus
def add(word as string, pri as int):
    """
    Adds <word> to the dictionary with the specified priority. 
    """
    _DICTIONARY[word.ToLower()] = pri
becomes
def add(word, priority=4):
"""
Adds <word> to the dictionary with the specified priority (default is 4)
"""
    _DICTIONARY[word.lower().strip()] = priority
which is pretty much identical.
The tricky bit of the conversion was the routine which generates possible variants of the word - it generates variants of a word by transposition and deletions. In Python:
def first_order_variants(word):
    """
    return the obvious spelling variants of <word> with missing words, transpositions, or misplaced characters
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in _ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in _ALPHABET]
    return set(deletes + transposes + replaces + inserts)
As you can see the first list comprehension, splits, generates a lists of pairs representing places where the word could be broken up, so that ‘cat’ produces [("c","at"), ("ca", "t")]. The other comprehensions use that list to try inserting, deleting or transposing letters to guess what the user might have really been typing.
In Boo, the tricky bit was getting the compiler to recognize that the splits list contained a pair of strings and that all the lists produced by it would also be lists of strings. Porting the python code directly wouldn’t work because Boo would see splits as a list of type object instead of deducing that it was a set of string pairs.
Here’s the Boo version, which as you can see is recognizably the same but is clunkier than the Python, due to the need for typing,
def first_order_variants(word as string):
"""
return the obvious spelling variants of <word> with missing words, transpositions, or misplaced characters
"""
    _stringList = Boo.Lang.List[of (string)]
    _strings = Boo.Lang.List[of string]
    pair = {w as string, i as int | (w[:i] cast string, w[i:] cast string)}
    splits = _stringList((pair(word, i) for i in range(len(word) + 1)))
    deletes  = _strings((a + b[1:] for a as string, b as string in splits if b))
    transposes  = _strings((a + b[1] + b[0] + b[2:] for a as string, b as string in splits if len(b) > 1))
    replaces  = _strings((a + c + b[1:] for a as string, b as string in splits for c in _ALPHABET if b))
    inserts  = _strings((a + c + b for a as string, b as string in splits for c in _ALPHABET))  

    result = HashSet[of string]()
    for chunk in (deletes, transposes, replaces, inserts):
        result.UnionWith(chunk)

    return result
To clean it up I added two ‘aliases’ up at the top, since the Boo syntax for declaring typed containers is hard to read (‘List[of string]’): so _stringList is a shortcut for ‘list of string arrays’ and _strings is a shortcut for ‘list of strings’.
The variable pair contains a lambda (ie, an inline function) using Boo’s idiosyncratic syntax: you could mentally rewrite it as
def pair(w as string, i as int) of (string):
    return (w[:i], w(i:))
or in other words “give me a string and an integer, I’ll return a pair of strings split at the index you gave me.”
With those helpers in place the logic is identical, but it is harder to follow because of all the type-mongering. I’m pretty sure there are more elegant ways to do this withgout being so wordy, but I’m not an expert.

So…

The point of the experiment was to see how hard the Python > Boo translation would be. This is an application where types actually matter a good deal, since all my values are strings and I need to be able to do string operations like joins on them – if all I was doing as asking questions of them things would have been more Pythonic (though probably slower as well: one of the reasons we need those types is to get the compiler to help us speed the code up).
While this is hardly a demanding application, it is at least a proof-of-concept for the idea of prototyping in Python and then selectively porting to Boo isn’t completely nuts.

No comments:

Post a Comment