Tuesday, February 4, 2014

Digging more broadly for Timeless, Goldilocks names: MADness

In my initial search for Timeless, Goldilocks names, I defined Timeless as "staying within a fixed range over a long period of time". One problem with this "stays-in-range" definition is that the range is arbitrary.  This can introduce shelves in the data, or make it impossible to visualize data just out of range that might make a qualitative difference.  Another problem is what to do with names that simply haven't been around for the full 133 years in the dataset.  Presumably they are less timeless, but how much less?  Is there a better definition of Timeless?

Gus suggested MAD: Median Absolute Deviation, which was first mentioned in print by Gauss, so you know it's legit.  MAD is the median of how far a curve deviates from its own median.  For example, here's a curve:
Here's the median:
And here's the deviation from the median at each point:
Treating each of those bars as an absolute value (so that -1 is just 1), the eight bars are 0.5, 0.5, 0.5, 4.5, 5.5, 5.5, 5.5, and 7.5.  With eight values, the median is the average of the two middle ones, 4.5 and 5.5, so the Median of the Absolute Deviations is 5.  So half of this curve is closer than 5 to its median and half is further away.  One thing to notice is that this method is not very sensitive to narrow spikes.  If this curve started at 50, instead of 5.5, the MAD would still be 5.  I don't know if that's good or bad, because I'm still not totally sure what I mean by Timeless.  This is one of the limitations of statistical analysis: you can't precisely analyze something that you can't precisely define.  I'll know it when I see it.
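If you'd rather check that arithmetic than trust it, here's a minimal Python sketch.  The `mad` helper is my own name for it, not something from a library.  The `curve` values are made up for illustration: they were picked so that they reproduce the eight deviations listed above and start at 5.5, but they aren't necessarily the curve in the picture.

```python
from statistics import median

def mad(values):
    """Median Absolute Deviation: the median distance of each point from the curve's own median."""
    center = median(values)
    return median(abs(v - center) for v in values)

# The eight absolute deviations quoted in the text.
listed_deviations = [0.5, 0.5, 0.5, 4.5, 5.5, 5.5, 5.5, 7.5]
print(median(listed_deviations))  # 5.0

# Hypothetical curve (an assumption, not the plotted data): its median is 10
# and its absolute deviations match the eight values above.
curve = [5.5, 4.5, 2.5, 9.5, 10.5, 10.5, 15.5, 15.5]
print(mad(curve))  # 5.0

# Replace the first point with a narrow spike up to 50.
spiked = [50.0] + curve[1:]
print(mad(spiked))  # 5.0
```

For this made-up curve, the spike nudges the median from 10 to 10.5, but the MAD doesn't move.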

So how does MAD compare to "stays-in-range"?  Here are three curves; they could be the relative popularity of three names.  I've highlighted the 8 to 12 range as the Goldilocks range ("stays-in-range" ties Timeless very closely to Goldilocks, whereas MAD tells you how closely a curve sticks to itself).  Which of these names do you think is most Timeless?
All three of these names have the same median, or "popularity".  By MAD, where lower is better, Alice and Chang score the same (1.5), and Bob scores much worse (5).  Chang's two big excursions out of the zone aren't counted any worse than Alice's much smaller deviations.  By stays-in-range, where higher is better, Alice is the clear winner with 6, and Bob and Chang tie with 3.  Does that match what your eyeballs are telling you?
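To see the two scores pull in different directions, here's a rough Python sketch that computes both for a pair of curves.  Everything in it is an assumption for illustration: the curves `spiky` and `wobbly` are invented (they are not Alice, Bob, or Chang), `in_range_count` is a hypothetical helper, and I'm treating the stays-in-range score as a plain count of points inside the 8 to 12 band, endpoints included.

```python
from statistics import median

def mad(values):
    """Median Absolute Deviation: the median distance of each point from the curve's own median."""
    center = median(values)
    return median(abs(v - center) for v in values)

def in_range_count(values, low=8, high=12):
    """Assumed stays-in-range score: how many points fall inside [low, high]."""
    return sum(1 for v in values if low <= v <= high)

# Invented curves, both with a median of 10.
spiky = [10, 11, 30, 10, 9, 10, 1, 11]    # hugs 10, with two big excursions
wobbly = [8, 12, 8, 12, 9, 11, 8, 12]     # never leaves the band, but bounces around in it

for name, curve in [("spiky", spiky), ("wobbly", wobbly)]:
    print(name, "MAD:", mad(curve), "in range:", in_range_count(curve))
```

With these made-up numbers, `spiky` wins on MAD (1 versus 2) while `wobbly` wins on stays-in-range (8 versus 6), which is the same kind of disagreement as in the three names above.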

I declare this inconclusive as a tie-breaker for a few names.  But does MAD help us pick out a few Timeless names out of the tens of thousands in the data set?  And does it do that any better than stays-in-range?  In our next installment, we'll look at the data to find out.
