I’m planning one of the worst things that can happen to a TA: a massive file move-and-rename operation. Much as I love my team, we have a poor record as a company when it comes to spelling, and it occurred to me that I’d like to have at least some degree of automatic spell checking on the names of the new files, folders and assets.
It turns out that there’s no good spell checker for Python that doesn’t come with some kind of extension module (BTW, I’d love to be wrong about that: if you know one, definitely post it in the comments). PyEnchant, for example, is great, but it’s got 32-bit-only Windows extensions that I can’t distribute without a hassle.
I shamelessly borrowed the structure of Peter Norvig’s pure-Python spelling corrector, with a couple of minor and not very creative tweaks. Peter’s original is built around Bayesian analysis: it guesses the correct word by looking at the relative frequencies with which variants show up. If ‘meet’ shows up 1000 times in your database but ‘mete’ shows up 5 times, that’s a good indication that ‘meet’ is the correct first guess.
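The frequency idea is easy to sketch. What follows is a minimal illustration and not spelchek's actual code: the word counts are made up (they just echo the ‘meet’/‘mete’ numbers above), and the function names `edits1`, `ranked_guesses` and `best_guess` are hypothetical. It also simplifies things by ranking every known candidate purely by count, including the input word itself.

```python
import string

# Toy frequency table; the counts are invented for illustration only.
WORD_COUNTS = {"meet": 1000, "meat": 300, "mete": 5, "vehicle": 40}


def edits1(word):
    """Return every string one edit away from `word`:
    a single deletion, transposition, replacement or insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def ranked_guesses(word):
    """Known words within one edit of `word` (plus `word` itself if known),
    most frequent first, ties broken alphabetically."""
    candidates = {w for w in edits1(word) if w in WORD_COUNTS}
    if word in WORD_COUNTS:
        candidates.add(word)
    return sorted(candidates, key=lambda w: (-WORD_COUNTS[w], w))


def best_guess(word):
    """Most frequent candidate, or the word unchanged if nothing matches."""
    ranked = ranked_guesses(word)
    return ranked[0] if ranked else word
```

With this table, `best_guess('mete')` comes back as `'meet'`, because ‘meet’ is one transposition away and has by far the higher count.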
Since I’m in a rush, I didn’t use that functionality very much. I scrounged around for as many sources of correctly scored words as I could find. Unfortunately the only free source I could rely on turned out to be the venerable ‘GSL’ or ‘General Service List’, which has great data but only for about 2000 words. I used the version found here, by John Bauman, as the core of the list, and then scrounged the internet for other free sources. Since all of these were less common words than the ones in the GSL, I gave them pretty arbitrary Bayes scores (4’s and 5’s for common words, 3’s for variants, plurals and participles). This is not sophisticated linguistics, but it’s close enough for horseshoes.
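In code, that kind of merge might look roughly like the sketch below. It is a sketch under assumptions, not the script that built the real dictionary: `build_scores` is a hypothetical helper, the sample words and counts are invented, and only the scoring scheme (real counts for the core list, 4 for common extras, 3 for variants) comes from the description above.

```python
def build_scores(core_counts, common_words=(), variant_words=(),
                 common_score=4, variant_score=3):
    """Merge a core frequency list with extra word lists.

    Extra words get small arbitrary scores; a real count from the
    core list always wins when a word appears in more than one place.
    """
    scores = {}
    # Lowest priority first, so later sources overwrite earlier ones.
    for word in variant_words:
        scores[word.lower()] = variant_score
    for word in common_words:
        scores[word.lower()] = common_score
    for word, count in core_counts.items():
        scores[word.lower()] = count
    return scores


# Invented sample data, for illustration only.
scores = build_scores(
    core_counts={"meet": 1000},
    common_words=["asset", "meet"],
    variant_words=["assets"],
)
```

Here ‘meet’ keeps its core count of 1000 even though it also shows up in the extra list.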
The result is up on github as spelchek, which I affectionately refer to as the cheap-ass spell checker.
It is hardly rocket science, but it does work. You can do something like:
import spelchek
spelchek.correct('vhicle')  # 'vehicle'
spelchek.guesses('flied')   # ['filed', 'flied', 'flies', 'lied']
As always, MIT licensed so go to town.
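For the file-renaming job that started all this, a checker like this could be wrapped along the following lines. This is a hedged sketch, not part of spelchek: `name_words` and `suspicious_words` are hypothetical helpers, the splitting rules (underscores, digits, camelCase) are an assumption about how assets are named, and the `correct` argument is any function that returns a corrected word and hands correctly spelled words back unchanged.

```python
import re


def name_words(filename):
    """Split a file name into lowercase words, dropping the extension.
    Breaks on underscores, digits, punctuation and camelCase humps."""
    stem = filename.rsplit(".", 1)[0]
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", stem)
    return [part.lower() for part in parts]


def suspicious_words(filename, correct):
    """Map each word the checker would change to its suggested fix."""
    flagged = {}
    for word in name_words(filename):
        suggestion = correct(word)
        if suggestion != word:
            flagged[word] = suggestion
    return flagged


# Stand-in corrector so the example runs without any dictionary installed.
def fake_correct(word):
    return {"vhicle": "vehicle"}.get(word, word)


report = suspicious_words("vhicle_Wheel_01.fbx", fake_correct)
```

Passing `spelchek.correct` in place of `fake_correct` should work the same way, assuming it leaves known words untouched. Expect abbreviations and studio jargon to get flagged until they are added to the dictionary.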