[EL] Princeton researchers deanonymize optical scan forms/ballots
Joseph Lorenzo Hall
joehall at gmail.com
Wed Jun 8 08:31:59 PDT 2011
(Here is the other example I wanted to share with you all last week,
one that counsels us to limit disclosure of scanned ballot images.
This work will be presented at the 2011 USENIX Security Symposium in
August.)
http://www.freedom-to-tinker.com/blog/wclarkso/new-research-result-bubble-forms-not-so-anonymous
New Research Result: Bubble Forms Not So Anonymous
By Will Clarkson - Posted on June 7th, 2011 at 8:27 pm
Today, Joe Calandrino, Ed Felten and I are releasing a new result
regarding the anonymity of fill-in-the-bubble forms. These forms,
popular for their use with standardized tests, require respondents to
select answer choices by filling in a corresponding bubble.
Contradicting a widespread implicit assumption, we show that
individuals create distinctive marks on these forms, allowing use of
the marks as a biometric. Using a sample of 92 surveys, we show that
an individual's markings enable unique re-identification within the
sample set more than half of the time. The potential impact of this
work is as diverse as use of the forms themselves, ranging from
cheating detection on standardized tests to identifying the
individuals behind “anonymous” surveys or election ballots.
If you've taken a standardized test or voted in a recent election,
you’ve likely used a bubble form. Filling in a bubble doesn't provide
much room for inadvertent variation. As a result, the marks on these
forms superficially appear to be largely identical, and minor
differences may look random and not replicable. Nevertheless, our work
suggests that individuals may complete bubbles in a sufficiently
distinctive and consistent manner to allow re-identification. Consider
the following bubbles from two different individuals:
(images in original)
These individuals have visibly different stroke directions, suggesting
a means of distinguishing between the two individuals. While variation
between bubbles may be limited, stroke direction and other subtle
features permit differentiation between respondents. If we can learn
an individual's characteristic features, we may use those features to
identify that individual's forms in the future.
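
To make the idea of a stroke-direction feature concrete, here is a
minimal sketch that estimates the dominant stroke orientation of a
small grayscale patch from its image gradients (a structure-tensor
estimate). This is an illustrative assumption, not the feature
extraction used in the paper; the function name and the tiny synthetic
"striped" patches are hypothetical stand-ins for scanned bubbles.

```python
import math

def dominant_stroke_angle(image):
    """Estimate stroke orientation in degrees, in [0, 180).

    `image` is a list of rows of grayscale values. The angle is measured
    from the x-axis toward increasing row index. Strokes run
    perpendicular to the dominant intensity gradient.
    """
    jxx = jyy = jxy = 0.0
    height, width = len(image), len(image[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central-difference gradients at each interior pixel.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            jxx += gx * gx
            jyy += gy * gy
            jxy += gx * gy
    # Orientation of the dominant gradient, then rotate by 90 degrees
    # to get the orientation of the strokes themselves.
    gradient_angle = 0.5 * math.atan2(2.0 * jxy, jxx - jyy)
    return (math.degrees(gradient_angle) + 90.0) % 180.0

# Synthetic patches: two-pixel-wide stripes, horizontal and vertical.
horizontal = [[1.0 if (y // 2) % 2 == 0 else 0.0 for x in range(8)]
              for y in range(8)]
vertical = [[1.0 if (x // 2) % 2 == 0 else 0.0 for x in range(8)]
            for y in range(8)]

print(dominant_stroke_angle(horizontal))
print(dominant_stroke_angle(vertical))
```

The horizontal patch should yield an angle near 0 degrees and the
vertical patch one near 90 degrees. A single angle per bubble is far
cruder than what a real classifier would need, but it shows how a
visible difference in stroke direction can become a number.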
To test the limits of our analysis approach, we obtained a set of 92
surveys and extracted 20 bubbles from each of those surveys. We set
aside 8 bubbles per survey to test our identification accuracy and
trained our model on the remaining 12 bubbles per survey. Using image
processing techniques, we identified the unique characteristics of
each training bubble and trained a classifier to distinguish between
the surveys’ respondents. We applied this classifier to the remaining
test bubbles from a respondent. The classifier orders the candidate
respondents based on the perceived likelihood that they created the
test markings. We repeated this test for each of the 92 respondents,
recording where the correct respondent fell in the classifier’s
ordered list of candidate respondents.
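
The procedure above can be sketched roughly as follows. This is only
an illustration under loud assumptions: the "bubbles" are synthetic
feature vectors rather than image features, and the ranking uses a
simple nearest-centroid rule rather than the classifier from the
paper. All function names are hypothetical. Only the 92 respondents
and the 12/8 train/test split per survey come from the text.

```python
import random

def make_respondent_bubbles(rng, n_bubbles, n_features=4):
    """Synthetic stand-in: a personal 'style' vector plus per-bubble noise."""
    style = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    return [[s + rng.gauss(0.0, 0.5) for s in style] for _ in range(n_bubbles)]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank_of_true_respondent(train_centroids, test_bubbles, true_id):
    """Order candidates by closeness to the test bubbles and return the
    1-based position of the true respondent in that ordering."""
    probe = centroid(test_bubbles)
    ordered = sorted(train_centroids,
                     key=lambda rid: sq_dist(train_centroids[rid], probe))
    return ordered.index(true_id) + 1

def run_experiment(n_respondents=92, n_train=12, n_test=8, seed=0):
    rng = random.Random(seed)
    data = {rid: make_respondent_bubbles(rng, n_train + n_test)
            for rid in range(n_respondents)}
    # "Train" on the first 12 bubbles of every survey.
    train_centroids = {rid: centroid(b[:n_train]) for rid, b in data.items()}
    # Test on the 8 held-out bubbles, one respondent at a time.
    return [rank_of_true_respondent(train_centroids, b[n_train:], rid)
            for rid, b in data.items()]

ranks = run_experiment()
print(sum(1 for r in ranks if r == 1), "of", len(ranks), "ranked first")
```

The list `ranks` plays the role of the recorded positions described
above: a rank of 1 means the correct respondent was the top guess.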
If bubble marking patterns were completely random, a classifier could
do no better than randomly guessing a test set’s creator, with an
expected accuracy of 1/92 ≈ 1%. Our classifier achieves over 51%
accuracy. The classifier is rarely far off: the correct answer falls
in the classifier’s top three guesses 75% of the time (vs. 3% for
random guessing) and its top ten guesses more than 92% of the time
(vs. 11% for random guessing). We conducted a number of additional
experiments exploring the information available from marked bubbles
and potential uses of that information. See our paper for details.
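
The random-guessing baselines quoted above follow from simple
arithmetic: if a classifier orders the 92 candidates uniformly at
random, the correct respondent lands in the top k positions with
probability k/92. A small sketch, with hypothetical function names and
a made-up list of ranks used purely to show the calculation:

```python
def chance_top_k(k, n_candidates):
    """Probability that a uniformly random ordering places the correct
    candidate within the first k positions."""
    return k / n_candidates

def top_k_accuracy(ranks, k):
    """Fraction of trials whose correct candidate ranked k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

N_RESPONDENTS = 92  # from the text

for k in (1, 3, 10):
    print(f"top-{k} by chance: {chance_top_k(k, N_RESPONDENTS):.1%}")
# prints 1.1%, 3.3%, and 10.9%

# Hypothetical ranks for eight trials, NOT data from the paper.
example_ranks = [1, 2, 1, 5, 12, 1, 3, 40]
print(top_k_accuracy(example_ranks, 1))   # 0.375
print(top_k_accuracy(example_ranks, 3))   # 0.625
print(top_k_accuracy(example_ranks, 10))  # 0.75
```

Rounded to whole percentages, the chance figures match the 1%, 3%, and
11% quoted above.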
Additional testing, particularly using forms completed at different
times, is necessary to assess the real-world impact of this work.
Nevertheless, the strength of these preliminary results suggests both
positive and negative implications depending on the application. For
standardized tests, the potential impact is largely positive. Imagine
that a student takes a standardized test, performs poorly, and pays
someone to repeat the test on his behalf. Comparing the bubble marks
on both answer sheets could provide evidence of such cheating. A
similar approach could detect third-party modification of certain
answers on a single test.
The possible impact on elections using optical scan ballots is more
mixed. One positive use is to detect ballot box stuffing: our methods
could help identify whether someone replaced a subset of the
legitimate ballots with a set of fraudulent ballots that she completed
herself. On the other hand, our approach could help an adversary with
access to the physical ballots or scans of them to undermine ballot
secrecy. Suppose an unscrupulous employer uses a bubble-form
employment application. That employer could compare the markings on an
employee’s application against ballots from the employee’s
jurisdiction to locate the employee’s ballot. This threat is more
realistic in jurisdictions that release
scans of ballots.
Appropriate mitigation of this issue is somewhat application-specific.
One option is to treat surveys and ballots as if they contain
identifying information and avoid releasing them more widely than
necessary. Alternatively, modifying the forms to mask marked bubbles
can remove identifying information but, among other risks, may remove
evidence of respondent intent. Any application demanding anonymity
requires careful consideration of options for preventing creation or
disclosure of identifying information. Election officials in
particular should carefully examine trade-offs and mitigation
techniques if releasing ballot scans.
This work provides another example in which implicit assumptions
resulted in a failure to recognize a link between the output of a
system (in this case, bubble forms or their scans) and potentially
sensitive input (the choices made by individuals completing the
forms). Joe discussed a similar link between recommendations and
underlying user transactions two weeks ago. As technologies advance or
new functionality is added to systems, we must explicitly re-evaluate
these connections. The release of scanned forms combined with advances
in image analysis raises the possibility that individuals may
inadvertently tie themselves to their choices merely by how they
complete bubbles. Identifying such connections is a critical first
step in exploiting their positive uses and mitigating negative ones.
--
Joseph Lorenzo Hall
ACCURATE Postdoctoral Research Associate
UC Berkeley School of Information
Princeton Center for Information Technology Policy
http://josephhall.org/