Thursday, December 5, 2013

Decoding A Facebook Interview Question


I read about this Facebook question on Glassdoor today and thought it'd be fun to try.

The question:

 A message containing letters from A-Z is encoded to numbers using the mapping a -> 1, b -> 2, ..., z -> 26. Given an encoded number, how many distinct decodings does it have?


It seemed immediately obvious to me that this was a recursion problem. So I started there. The first attempt was almost right, but missed the oddity of zeros. After accounting for those, it was just as straightforward as expected.

Here is the code:
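The code block didn't survive in this copy of the post, so here is a minimal recursive sketch in Python of the approach described above (my reconstruction, not necessarily the original):

```python
def num_decodings(s):
    """Count decodings of a digit string under a -> 1, ..., z -> 26."""
    if not s:
        return 1  # empty suffix: exactly one way (decode nothing)
    if s[0] == '0':
        return 0  # no letter maps to 0 -- this is the zero oddity
    # Decode the first digit alone...
    count = num_decodings(s[1:])
    # ...or the first two digits together, if they form 10-26.
    if len(s) >= 2 and 10 <= int(s[:2]) <= 26:
        count += num_decodings(s[2:])
    return count
```

For example, "226" has three decodings ("bbf", "bz", "vf"), while "100" has none, since neither "0" nor "00" maps to a letter.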

Friday, October 25, 2013

Clever Ad...

Sorry, it's hard to resist an opportunity to post that clip.

Today I saw one of the best advertising techniques I've seen all year.

What's the big deal?

I couldn't give a single shit about the insurance, because in reality it's not that exciting. A funny idea and all, but whatever.

Now look at this picture:

Did you see it? 

That's how you claim the freebie! By posting on Facebook. 

This simple warranty policy is a very elegant advertising mechanism for a company like Betabrand, which seems to be targeting social media advertising directly.

So, Betabrand, kudos to you!

Tuesday, October 22, 2013

Noncommutative Webforms

A story:

Today I was filling out a webform. It asked for some address information.

Pretty normal. 

First it asks for state, with a blank text field. I type Kansas.

Still fine.

Then it asks for zip, with a blank text field. I type mine.

First alert goes off.

Then it asks for country, with a drop down. I select USA. It changes the state field to a drop down, erasing my previous entry!

Second alert goes off. Third alert goes off.

This webform is for a job doing NLP at a major tech company.

At this point, I nearly stopped applying.

Why does this matter?

The first paper I read when I started getting interested in data science was by DJ Patil. I'll admit I was skeptical as hell, and I don't mean in the haute Cathy O'Neil way.

One of my favorite points from his paper, and the one that convinced me that smart people were doing data science, was his little observation about building smarter webforms for data entry.

So, this company isn't even using the basic advice from one of the most popular data science papers of all time?


This is a perfect example of noncommutativity. It is NOT the same to:

  • ask first for a state, then ask for the country which changes the entry field for the state
  • ask first for the country which changes the entry field for the state, then ask for the state
So which is better?

The astute reader (or even a beginner web programmer) will say... ask for the zip code first.
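The idea can be sketched in a few lines of Python: if you ask for the zip first, both country and state can be prefilled, and nothing the user typed ever gets erased. The prefix table here is a tiny illustrative sample, not a real lookup:

```python
# A tiny illustrative sample of ZIP-prefix -> state; a real form would use
# a full lookup table or a postal-code service.
ZIP_PREFIXES = {
    "66": "Kansas", "67": "Kansas",
    "90": "California",
    "10": "New York",
}

def infer_location(zip_code):
    """Guess (country, state) from a US ZIP, so the user never retypes them."""
    state = ZIP_PREFIXES.get(zip_code[:2])
    if state is not None:
        return ("USA", state)
    return (None, None)  # unknown prefix: fall back to asking the user
```

Asking the most informative field first makes the later fields derivable instead of destructive.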

Friday, October 11, 2013

Evolutive Sentiment Analysis

How to go beyond AFINN for highly specialized lexicons.

If you've heard of sentiment analysis, you've heard of AFINN...

For a game like MtG, the AFINN model, frankly speaking, will probably suck. To illustrate, let me provide an example:

Notice here that "fattiest" and "fatty" just refer to power/toughness, and are positive characteristics. Etc...

In niche areas, the AFINN model is going to be especially weak for lexicons with lots of buzzwords. Companies and organizations working in niche areas start their analysis by building stronger models than AFINN.

For MtG, how can we build a better model? Using sentiment analysis actually...

A training set

If you need to build a more refined model, you need to incorporate words in this niche into your corpus.

For our project, we use 50,000 tweets about MtG, gathered from the hashtags #MtG, #Theros, #PTTheros, and #PTTHS. And now we use AFINN on these. Wait. What? I thought AFINN was bad?!

AFINN provides a first approximation to our niche words. With 50,000 tweets, we can look at words that appear more than a few times and add their sentiment scores (built from the average AFINN sentiment of the tweets they appear in) to our model. We call this model AFINN-MTG to reflect its bias toward MTG lingo.
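A minimal sketch of that bootstrapping step, assuming `afinn` is a word-to-score dict and the tweets are pre-lowercased strings (the function name and thresholds are mine, not from any library):

```python
from collections import defaultdict

def extend_lexicon(tweets, afinn, min_count=5):
    """Score niche words by the average AFINN sentiment of the tweets
    containing them, and merge them into the base lexicon."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tweet in tweets:
        words = tweet.split()
        # AFINN score of the whole tweet: sum of known word scores.
        tweet_score = sum(afinn.get(w, 0) for w in words)
        for w in set(words):
            if w not in afinn:          # only score genuinely new words
                totals[w] += tweet_score
                counts[w] += 1
    # Keep words seen often enough, scored by their average tweet sentiment.
    niche = {w: totals[w] / counts[w]
             for w in totals if counts[w] >= min_count}
    return {**afinn, **niche}
```

So if "fatty" keeps showing up in tweets AFINN already rates as positive, "fatty" inherits a positive score in AFINN-MTG.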

Note: It is important to remove any Theros card names from this model to not accidentally skew our later results. We could do this just by removing these words, or by removing tweets from the training set which contain these card names.

Test set

The next step in any machine learning application is to run on a test set. Here is where we will use our card name tweets. This set will be run with the new model to build our first primitive sentiment scores.

Doing better

What about sarcasm? Or negation? These are common issues for sentiment analysis, and we'll talk about them later, but one additional improvement is to look at word bigrams. Next time we'll see what kinds of word bigrams appear and which are noteworthy.
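A quick sketch of the bigram counting, using only the standard library:

```python
from collections import Counter

def top_bigrams(tweets, n=10):
    """Count adjacent word pairs across tweets; frequent pairs like
    ('not', 'good') hint at negation that a unigram model misses."""
    counts = Counter()
    for tweet in tweets:
        words = tweet.lower().split()
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)
```

Pairing each word with its successor via `zip(words, words[1:])` gives all adjacent bigrams in one pass.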

Monday, September 2, 2013

Medicine names and Magic cards

It's Theros spoiler season!!! Excited about Polukranos? Or is it Polkranas? Or Polocranus? Shit...

How can we design a data collection algorithm which can search for misspellings of made up words?

When querying Twitter for tweets containing the names of new cards, we must also consider what SEO people call "generated misspellings" (in SEO, these are keyword variants generated to capture traffic from searchers who misspell their query, artificially boosting pagerank).

But what does this have to do with medicine...

Just like a corpus of tweets, these medical documents lack a comprehensive vocabulary, so a standard spell corrector is inappropriate for this task. Simply put, we don't want to download all tweets and spell-correct them; we want to search for tweets that use one of our generated erroneous terms.

Ways to err

Here are three obvious misspelling types (approximate string matching):

1) insertion, e.g. Bolivia Voldaren
2) deletion, e.g. Counterfux
3) substitution, e.g. Knightly Vapor

and algorithmically, we see how to generate each of these from our starting phrase.

One additional misspelling comes from bi-gram (adjacent-character) transposition, e.g. Angle of Despair.

These four methods should generate a rich library of misspellings for a single Magic card name, but to what degree should we consider double mistakes? More on this later.
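All four edit operations can be generated in a few lines, in the style of Norvig's spelling corrector (a sketch: it lowercases the name and doesn't insert spaces or punctuation, only letters):

```python
import string

def misspellings(name):
    """All single-edit misspellings of a card name: insertion, deletion,
    substitution, and adjacent-character (bi-gram) transposition."""
    letters = string.ascii_lowercase
    s = name.lower()
    # Every way to split the name into a prefix and a suffix.
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    inserts = {a + c + b for a, b in splits for c in letters}
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in letters}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (inserts | deletes | substitutions | transposes) - {s}
```

Feeding these variants into the Twitter search, rather than spell-correcting every tweet, keeps the query set small and targeted.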

Tuesday, August 13, 2013

NWO in MtG and Twitter Sentiment Analysis 1: The Plan

The New World Order:

For those of you unfamiliar, NWO is synonymous in Magic: the Gathering with dumbing the game down. For older players, NWO is the winds of change chapping our faces in near mockery.

Here is the MaRo article that started it all.

In MaRo's article we see some fantastically ad hoc graphs:

Enlightening eh?

We see some gedanken experiments:
For those that are worried that this is boring, I ask you to simply try the following experiment. Make two decks of just vanilla creatures and common sorceries. Find a player of a similar skill level to your own and play the decks against each other. What you will find is that these games are actually quite interesting.

And in general the article has a strong flavor of "we're doing it, deal with it."

My experience is that the players I speak with hate NWO, but don't necessarily have anything to back up whether it is actually good or not. The mothership surely hasn't provided us with any resources to see what analysis they've actually done on card enjoyment. That's where sentiment analysis comes in.

Twitter and Sentiment Analysis:

Sentiment analysis is exactly what it sounds like. You use text mining techniques to analyze the positivity or negativity of a statement by giving values to each word. This technique is extremely popular for quickly analyzing huge text data sets to try to extract insights about a particular subject.
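As a minimal sketch of the word-scoring idea, with a toy five-word lexicon standing in for a real one like AFINN:

```python
# A toy AFINN-style lexicon; the real AFINN list has thousands of scored words.
LEXICON = {"love": 3, "great": 3, "fun": 2, "boring": -2, "hate": -3}

def sentiment(text):
    """Sum the scores of known words; a positive total means a positive tweet."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())
```

So "I love this great card" scores +6 and "boring and I hate it" scores -5; unknown words simply contribute zero.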

This past summer I did some sentiment analysis on tweets for fun, and realized how powerful and efficient this technique is.

Here's why it is a good fit for Twitter and Magic:
  1. Tweets are short and numerous, and the character limit keeps them concise and on topic.
  2. The MtG Twitter community is fairly large, including pro level, in house, and casual players.
  3. The Twitter API makes it easy to search for specific card names.
The Plan:

It seems invaluable to know which cards from the new sets players like. We can start to think about the strengths and weaknesses of NWO if we can see what people are saying about each new card. Like all NLP (or anything, really), it's not a perfect model, but we can see, in a broad sense, which cards people are enjoying.

Here is an outline:
  1. Beginning with Theros spoilers, follow the daily tweets using new card names.
  2. Compile these tweets by individual card name.
  3. After release, roughly group cards into categories by card function.
  4. Run sentiment analysis on card groups and new keywords.


The goal: roughly evaluate the hits and misses in Theros based on Twitter chatter.

Check back here for updates on how this is done and what we can learn about the new cards in Theros.

Suggestions Welcome!

Monday, August 12, 2013

My text editor of choice, and why general questions are bad.

J: "What is your text editor of choice?"
B: "LaTeX."
J: "That's not a text editor."
B: "I know, but your question was so general, this was the only reasonable answer."

Why the question above is terrible:

The question above fails to identify the motivation for asking. Here are some better questions:

  • What text editor do you code in?
  • What text editor do you write academic papers in?
  • What text editor do you do warp-speed text editing in?
  • What text editor do you blog in?
  • What text editor do you leave open for your cat to type silly nonsense that you can post on Facebook?
  • What is the absolute best text editor?
Hopefully this clears up the issues with the question.
BTW: ʇxǝʇ ǝɯıןqns (ɟ ʇxǝʇ ǝɯıןqns (ǝ ɹǝƃƃoןq (p ʇıpǝƃ (ɔ ǝʌıןxǝʇ (q  ǝuıן puɐɯɯoɔ puɐ 'ʇıpǝƃ (ɐ

Some thoughts on text editors:

Long ago I was introduced to Emacs and Vim, and I'll admit I had a torrid love affair with Vim for a while as an undergraduate. I eventually had a falling out with Vim and reverted to gedit (coinciding with my hiatus from coding). Sometime last year I was introduced by Kevin Lin to Sublime Text. Since then, I've been dying to get familiar with it but haven't had a good opportunity. I'm looking forward to the upcoming Kaggle project to force myself to use it. But this is a super high-level editor, and far too serious-business for normal everyday text editing...

Why Gedit is my most commonly used editor:

Two days ago, a friend sent me a data set that she needed cleaned up in about 5 minutes. She (a biologist) was having Excel issues and really needed an 11th-hour data cleanup. The data had some real issues: some percentages were integers, others were decimals, some were even written as ratios, and a couple of entries were even written out in words, because Excel was autocorrecting them as dates and she was in such a hurry that this was the fastest solution (I had to fight the urge to send her this link).

How long did it take me to clean the data set? About two minutes. Because I "learned to stop worrying and love plaintext editors".

Find/replace is powerful and fast, and if you're clever, it can more than likely perform that little regex you know awk can do so nicely.

Coding, and pretty colors:

I'm not ashamed to admit that, as a CS minor, I learned how to program in Java, using Eclipse. Nowadays, coder friends give me a look of scorn when I mention this, but as someone who was already reindexing his music collection's folder structure with awk in sophomore year, I don't feel this requires me to turn in my nerd badge. Looking at Sublime Text, I like that it strikes a balance between the two extremes: it has the pretty colors and language-specific capabilities, but still preserves that narcissistic love of masturbatory keyboard shortcutting. Additionally, green-text-black-background-what! A notable exclusion from Sublime Text is the incredible testing suite that comes with Eclipse, but maybe that's a crutch anyhow (get off my lawn).


The point of all this is to say: aggressively pursue getting the damn problem done, and try new text editors on your own time.