Coming Captcha Crisis
by Richard DeMillo
If you can read the text strings below, chances are pretty good you’re human.


These five graphics were plucked a few days ago from the sign-up pages of the most visited sites on the Internet: Google, Yahoo, MySpace, YouTube, and Facebook. These sites, and virtually every other high-traffic site on the Web, use “Captcha” (Completely Automated Public Turing test to tell Computers and Humans Apart) as the first line of defense against the onslaught of Virtual Blight from black hat marketers and distributors of malware.
The problem? Captchas don’t work anymore. Over the past six months, programs have been designed to automatically solve Captchas. This is exactly what the tests are designed to avoid: automatic solutions. A group of programmers in Russia announced a technology that could solve some of the most difficult Captchas 35% of the time. Another system, EZ-Gimply, developed at UC Berkeley, claims to solve a tough class of Captchas 92% of the time. Recently, it was reported that spammers have cracked the Captcha protecting Microsoft Live’s email account creation.

Why do Captchas matter? In data streams that do not use Captchas, such as email and blog comments, the signal-to-noise ratio is unbearable: 9 out of 10 bytes of email traffic and 91% of comments on blogs are Spam. By blocking the automated creation of accounts and postings, Captchas throttle back the volume of blight by an order of magnitude. Instead of grappling with a 9:1 ratio of Spam to legitimate traffic (let’s call it “Ham”), websites that gate participation with Captchas typically see only 1 byte of Spam for every 9 bytes of Ham. This is a manageable level, small enough to be further reduced by more labor-intensive strategies, like community flagging systems, volunteer editors and special-purpose redaction bots.
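To make those ratios concrete, here is a minimal Python sketch of the arithmetic. The function name is an illustrative assumption; only the 9:1 and 1:9 ratios come from the figures above.

```python
from fractions import Fraction


def spam_share(spam: int, ham: int) -> Fraction:
    """Fraction of total traffic that is Spam, given Spam and Ham volumes."""
    return Fraction(spam, spam + ham)


# Ungated streams (email, blog comments): 9 bytes of Spam per byte of Ham.
ungated = spam_share(spam=9, ham=1)

# Captcha-gated sites: 1 byte of Spam per 9 bytes of Ham.
gated = spam_share(spam=1, ham=9)

print(f"ungated Spam share: {float(ungated):.0%}")
print(f"gated Spam share:   {float(gated):.0%}")
```

Under these numbers the share of traffic that is Spam falls from nine-tenths to one-tenth, which is the "order of magnitude" drop described above.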
“Captcha plays an essential role as the first line of defense for Web 2.0 sites against vandals and marketers trying to exploit user-generated content sites, blogs and opinion sites to hawk their products. Sophisticated spammers have already developed reliable ways to get around Captcha for about $0.01 each, and it is only a matter of time before these techniques trickle down and Captcha becomes less of a nuisance to hackers than it is to legitimate users. Unless we develop a new solution, Social Media, Social Networking and Community sites will be as filled with blight as your inbox without a Spam filter.”
Captchas are like a leaking levee facing a hundred-year storm. As mentioned, bots with sophisticated character recognition and parsing components are able to solve over 35% of them. Because bots are so scalable, that’s a frightening success rate. It means that roughly three times the volume of attacks (3 × 35% ≈ 105%) would let through as many bots as having no Captcha at all, essentially erasing the protection that Captchas currently provide. Worse, the roadmap for improving Captchas is bleak. Internet security in general is a cat-and-mouse game, with security systems and professionals trying to stay a step ahead of the black hats. But staying ahead with Captchas is tricky and may be impossible.
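The scaling argument can be checked with a few lines of Python. This is only a sketch: the 35% solve rate is the figure quoted above, while the baseline of 1,000 attempts and the function name are hypothetical values chosen for illustration.

```python
def successful_signups(attempts: int, solve_rate: float) -> float:
    """Expected number of bot attempts that get past the gate."""
    return attempts * solve_rate


SOLVE_RATE = 0.35  # bot success rate quoted in the article
BASELINE = 1_000   # hypothetical number of bot attempts (assumption)

# With no Captcha, every attempt succeeds.
unprotected = successful_signups(BASELINE, 1.0)

# With a Captcha, only the solved fraction gets through.
protected = successful_signups(BASELINE, SOLVE_RATE)

# Tripling the attack volume against the same Captcha.
tripled = successful_signups(3 * BASELINE, SOLVE_RATE)

# Attack multiplier at which the Captcha stops reducing bot sign-ups.
break_even = 1 / SOLVE_RATE

print(f"no Captcha:        {unprotected:.0f} bot sign-ups")
print(f"with Captcha:      {protected:.0f} bot sign-ups")
print(f"3x attacks:        {tripled:.0f} bot sign-ups")
print(f"break-even factor: {break_even:.2f}x")
```

Since bots can multiply their attempts cheaply, a break-even factor just under three is not much of a margin.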
No Admittance. To work, Captchas need to do more than foil the bots; they also need to let in the people. Look again at the samples at the top of this article. The second and fourth ones, “V3YG” and “swable,” are pretty easy to read. The other three are much harder. In making Captchas more difficult, site owners create “false negatives”: instances where a real human being fails to read the Captcha correctly. Currently, major websites see a false negative rate of about 20%. In other words, two out of ten Captchas foil real users, while fake users can foil the Captchas more than three out of ten times.
The levee is starting to fail.
The greatest threat from Virtual Blight is to Web 2.0, the community web. Sites like YouTube, Wikipedia, MySpace, Facebook and Digg are based on users creating the content. The value of Web 2.0 sites is based on trust. Users trust that other users are legitimate community members: posts are from people with opinions, polls reveal what people think, and friend requests are from people who want to be your online friend.
What’s needed is an alternative type of Automated Turing Test. A test that is less intrusive and more reliable. A new line of defense that guarantees a human presence at the other end of the click, be it on a submit button, a link or banner, a seat at a poker table or a profile in a social network. Without a new approach, nine-tenths of the signal could soon be noise, and trust and usability will be washed away in a flood of automated marketing blight from sock puppet accounts.
Richard DeMillo is Dean of Computer Science at Georgia Tech and former Chief Technology Officer for Hewlett-Packard. Prior to HP, DeMillo directed the Computer and Computation Research division of the National Science Foundation.




March 7th, 2008 at 5:32 pm
The news that some Russian hackers have effectively beaten gmail’s captcha (http://www.websense.com/securitylabs/blog/blog.php?BlogID=174) kicks this up to a whole new level.
March 28th, 2008 at 12:29 pm
I wonder if a captcha-thesaurus strategy would improve the situation; i.e., instead of having to fill in the word you see in the image, you could fill in the word that’s described by the text in the image. E.g., the captcha could say: four-legged animal that likes bones (three letters) and the user types in ‘dog’…
August 18th, 2008 at 10:47 am
@djtip:
That’s a good idea, but much like the difficult CAPTCHAS, we don’t want it to be too hard for humans to figure out either. The example you gave was simple enough, but how far can that go before it gets too complicated? You don’t want to start asking obscure trivia to average users.
A simple name-that-object test might do well, like showing a picture of an apple, an orange and a lemon and you have to enter “apple, orange, lemon” as verification.
I don’t think just having a new verification method is going to be enough. I think the key is to mix it up a bit, have a bunch of different methods in use, and consistently roll out new ones. So far we’ve always had the same one concept for verification, and this has allowed the black hats to focus on one task: beat the CAPTCHA. If new systems employ a bunch of different methods of verification, we’ll keep them guessing. They won’t have one thing to crack, they’d be one step behind with all the different verification methods. And we can’t stop there, we should (somewhat regularly) introduce new methods as the old ones get weaker.
We’ve been lazy so far, assuming that the CAPTCHA would cover us forever. It was once considered all but unbeatable, and has now been beaten. What’s stopping the next “unbeatable” system from eventually getting beaten? We can’t be lazy, we have to take a new approach altogether, which is to not give black hats the simplicity of one system to crack.
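A rough Python sketch of the mix-it-up idea, just to show the shape of it. Everything here (the function names, the two tiny challenge pools) is made up for illustration, and the picture test uses text placeholders where a real system would render images.

```python
import random

# Hypothetical challenge pools; the entries are illustrative only.
RIDDLES = [("four-legged animal that likes bones (three letters)", "dog")]
OBJECT_SETS = [["apple", "orange", "lemon"]]


def riddle_challenge(rng: random.Random) -> tuple[str, str]:
    """Text riddle: the user types the word being described."""
    return rng.choice(RIDDLES)


def object_challenge(rng: random.Random) -> tuple[str, str]:
    """Name-that-object: the user lists the pictured objects in order."""
    names = rng.choice(OBJECT_SETS)
    # Placeholders stand in for the actual pictures.
    prompt = "name these objects: " + " ".join(
        f"[picture {i + 1}]" for i in range(len(names))
    )
    return prompt, ", ".join(names)


CHALLENGE_TYPES = [riddle_challenge, object_challenge]


def new_challenge(rng: random.Random) -> tuple[str, str]:
    """Pick a challenge type at random so no single scheme is the only target."""
    make = rng.choice(CHALLENGE_TYPES)
    return make(rng)


def check(answer: str, expected: str) -> bool:
    """Compare answers, ignoring case and spacing around commas."""
    def normalize(text: str) -> str:
        return ", ".join(part.strip().lower() for part in text.split(","))
    return normalize(answer) == normalize(expected)


prompt, expected = new_challenge(random.Random())
print(prompt)
```

Rolling out a new method then means appending one more function to CHALLENGE_TYPES, and retiring a weakened one means removing it from the list.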
August 29th, 2008 at 2:24 am
If we can read them, eventually machines will too. But at that point will we really care about the difference?
October 7th, 2008 at 3:47 pm
AAAA PANIC! THE MACHINES ARE OUT TO GET US!!!!
Feh. Lots of FUD and fail in this post. “The problem? Captchas don’t work anymore.” So wrong. People just need to actually start thinking about how to make good CAPTCHAs. EZ-Gimply (or spelled correctly “EZ-Gimpy”) was one of the first CAPTCHA systems. IT WAS NOT A SYSTEM FOR BREAKING CAPTCHAS. Besides, it’s very old news. Also, the article gives the impression bots can break 35% of all CAPTCHAs. The article talks about YAHOO’s CAPTCHAs… Please try to read your own references. What the hell are you using?
I want to see you make a program for these: http://ocr-research.org.ua/teabag.html (Also has a list of weak CAPTCHAs)
Sorry for the hostility, but this is one of the worst articles I’ve read in a while… I wonder why I am reading this anyways…