[EL] McDonald study, birthdate distribution in real voter list
Paul Lehto
lehto.paul at gmail.com
Sun Sep 11 14:29:25 PDT 2011
On 9/11/11, Michael McDonald <mmcdon at gmu.edu> wrote:
> This is the surprising counter-intuitive result the Birthday Problem
> reveals. You have to test each Robert Smith against all other Robert Smiths
> to properly calculate the probabilities. Once you have one Robert Smith
> without a match, his birth date is taken up, so you have to remove that
> birth date from the matching search for the remaining Robert Smiths.
These probabilistic calculations all assume that dates of birth are evenly
distributed over 365 days. But that is not the case. The most common
US date of birth is October 5, almost exactly one gestation period
beyond New Year's Eve -- suggesting that more than mere kissing goes
on around the stroke of midnight. :) According to www.anybirthday.com,
an average of 12,756 Americans are born on October 5.
May 22 is the least frequent date of birth in the USA, with an
average of 10,259 persons born on that date each year. Id. Nine
months prior to May 22 is the hottest part of the dog days of
August. These facts undermine the uniformity assumption behind the
probabilistic calculations, making them mere estimates. I suppose
this is why folks are "gaming" the date of birth match problem.
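To see why the skew matters, here is a minimal sketch (not the actual experiment). For two people chosen at random, the chance their birthdays coincide is the sum of p_i^2 over the 365 dates: exactly 1/365 under the uniform assumption, and strictly larger under any skew. The skewed frequencies below are hypothetical, only loosely shaped around the anybirthday.com figures quoted above; the baseline of 11,500 births per date and the window widths are invented for illustration.

```python
DAYS = 365
uniform = [1.0 / DAYS] * DAYS

# Hypothetical skew: a bump of dates around October 5 (day ~278) at the
# quoted 12,756 births, a dip around May 22 (day ~142) at 10,259, and an
# invented baseline of 11,500 everywhere else.
counts = [11500.0] * DAYS
for d in range(263, 293):
    counts[d] = 12756.0
for d in range(127, 157):
    counts[d] = 10259.0
total = sum(counts)
skewed = [c / total for c in counts]

# P(two random people share a birthday) = sum of squared date frequencies.
p_uniform = sum(p * p for p in uniform)   # exactly 1/365
p_skewed = sum(p * p for p in skewed)     # strictly larger under any skew
print(f"uniform P(match) = {p_uniform:.6f}")
print(f"skewed  P(match) = {p_skewed:.6f}")
```

The sum of squares is minimized when all dates are equally likely, so any real-world skew can only push the true match probability above the textbook 1/365 figure.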
My query is whether the simulation or gaming of the problem is being
done correctly, or whether I've misunderstood:
My question about the portion of Michael McDonald's email quoted above
is why one would "have to remove that [non-matching] birth date
from the [remainder of the] matching search." Doing so would appear
to artificially inflate the experimental data being used to compensate
for the inability of probabilistic calculation to give us solid
numbers, given that birthdays are not uniformly distributed over 365
or 366 days in a year.
Perhaps this is compensated for in the actual experiments and I'm just
reading ambiguity into the language above? But I'm thinking that if
the computer or process does not carefully credit each remaining
person with a non-match against each removed person it has been tested
with (so that the later-performed match counts start at a number
above zero, reflecting the matches that have already failed), then the
resulting data will generate probabilities that are artificially high.
This is because the fraction expressed as Matches/Population will
report the correct number of matches even when some names are removed
after being tested, but the population denominator needed to produce
that number of matches will have shrunk, artificially inflating the
overall value of the fraction.
It may be that results on this point are generated properly in the
actual experiments, provided the population denominator remains static
even after subjects are tossed out upon obtaining no match. But
throwing them out does not seem necessary to a proper calculation; it
is only necessary to avoid redundant work. No?
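The denominator concern above can be sketched in a toy simulation. This is my own illustration, not the actual matching experiment: birthdays are drawn uniformly, people are tested in sequence against everyone still in the pool, and non-matching subjects are dropped from further searching. Dropping them is harmless for *counting* matches, but dividing by the shrunken pool instead of the full population inflates the reported rate.

```python
import random

random.seed(42)
DAYS, N = 365, 500  # illustrative sizes; N > DAYS guarantees some matches
bdays = [random.randrange(DAYS) for _ in range(N)]

pool = list(range(N))
matches = 0
removed = 0
while pool:
    i = pool.pop(0)
    # Test person i against everyone remaining in the pool.
    partner = next((j for j in pool if bdays[j] == bdays[i]), None)
    if partner is None:
        removed += 1   # dropped from further searching -- saves work only
    else:
        matches += 1

rate_static = matches / N              # denominator: the full population
rate_shrunk = matches / (N - removed)  # denominator shrunk by removals
print(f"matches={matches}, removed={removed}")
print(f"static rate = {rate_static:.3f}, shrunken rate = {rate_shrunk:.3f}")
```

The shrunken-denominator rate is always at least as large as the static one, and strictly larger whenever anyone was removed, which supports the point that removal is a work-saving device and must not leak into the denominator of the reported fraction.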
--
Paul R Lehto, J.D.
P.O. Box 1
Ishpeming, MI 49849
lehto.paul at gmail.com
906-204-4026 (cell)