Monday, September 2, 2013

Medicine names and Magic cards

It's Theros spoiler season!!! Excited about Polukranos? Or is it Polkranas? Or Polocranus? Shit...

How can we design a data collection algorithm that can search for misspellings of made-up words?

When querying Twitter for tweets containing the names of new cards, we also need to consider what SEO people refer to as "generated misspellings" (in the context of SEO, these are used to generate web keywords that artificially increase a page's rank when searchers misspell their search).

But what does this have to do with medicine...

Just like a corpus of tweets, these medical documents lack a comprehensive vocabulary, so a standard spell corrector is inappropriate for this task. Simply put, we don't want to download all tweets and spell-correct them; we want to search for tweets that use one of our generated erroneous terms.

Ways to err

Here are three obvious kinds of misspelling (see approximate string matching):

1) insertion, e.g. Bolivia Voldaren
2) deletion, e.g. Counterfux
3) substitution, e.g. Knightly Vapor

Algorithmically, it is easy to see how to generate each of these from our starting phrase.
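
Here is a minimal sketch of that generation step in Python. The function name `single_edits` and the alphabet (lowercase letters plus space) are my own assumptions for illustration; card names with apostrophes, commas or hyphens would need a wider alphabet.

```python
import string

# Assumed alphabet: lowercase letters plus the space character.
ALPHABET = string.ascii_lowercase + " "


def single_edits(name, alphabet=ALPHABET):
    """Return every string one insertion, deletion or substitution away from name."""
    name = name.lower()
    # Every way of cutting the name into a left part and a right part.
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]

    # Insertion: put one extra character at each cut point.
    insertions = {left + c + right for left, right in splits for c in alphabet}
    # Deletion: drop the first character of the right part.
    deletions = {left + right[1:] for left, right in splits if right}
    # Substitution: replace the first character of the right part.
    substitutions = {
        left + c + right[1:] for left, right in splits if right for c in alphabet
    }

    variants = insertions | deletions | substitutions
    # Substituting a character for itself reproduces the original, so drop it.
    variants.discard(name)
    return variants
```

With this sketch, `single_edits("Olivia Voldaren")` contains "bolivia voldaren", `single_edits("Counterflux")` contains "counterfux", and `single_edits("Knightly Valor")` contains "knightly vapor". Everything is lowercased first, on the assumption that the tweet search is case-insensitive.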

One additional misspelling comes from bi-gram transposition, e.g. Angle of Despair.
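
A rough sketch of the transposition case, reading "bi-gram transposition" as swapping two adjacent characters (that reading, and the function name `transpositions`, are my assumptions):

```python
def transpositions(name):
    """Return every string obtained by swapping one pair of adjacent characters."""
    name = name.lower()
    variants = set()
    for i in range(len(name) - 1):
        # Swapping two identical characters would just give the original back.
        if name[i] != name[i + 1]:
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    return variants
```

For example, `transpositions("Angel of Despair")` contains "angle of despair". Swaps across a word boundary (a letter trading places with a space) are included too, since those are plausible typing slips.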

These four methods should generate a rich library of misspellings for a single Magic card name, but to what degree should we consider double mistakes? More on this later.