CAPTCHAs as protection against malicious automated software
This article reviews the general use of image validation on websites to prevent automated software from performing actions that degrade the quality of a service, along with recommended implementation practices and known weaknesses to avoid. It also presents sample image validation solutions for PHP and ASP.NET (C#) websites.
Definition of terms used
The acronym CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart", a term trademarked by Carnegie Mellon University. In essence it is a challenge-response test generated and graded by a computer (a server) in order to determine whether the user is a human or another computer. A common type of CAPTCHA consists of an image containing some distorted characters, which the user is required to type in and submit to the server.
At present CAPTCHA implementations apply to numerous areas of practical website security including:
- Preventing comment spam in blogs, forums and wikis;
- Protecting website registration (e.g. on websites that offer free e-mail services);
- Protecting e-mail addresses posted online from scrapers (obfuscating them until the CAPTCHA test is solved);
- Online polls and votes;
- Preventing dictionary attacks in password systems (instead of locking accounts after a certain number of failed login attempts);
- Search engine bots (CAPTCHAs are used to prevent search engine bots from reading web pages which are not desired to be indexed);
- E-mail worms and spam (an e-mail is accepted only if the recipient knows there is a human behind the sending computer).
Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans.
The term automated software / scripts in our case incorporates the following:
- Internet bots (software applications that run simple, structurally repetitive tasks over the Internet) that submit numerous bogus comments, usually to raise the search engine rank of some website, or for commercial promotion, harassment or vandalism;
- Spambots - programs designed to collect addresses from the Internet in order to build mailing lists for sending unsolicited e-mail, also known as spam;
- Programs iterating through a possible space of passwords and performing dictionary or brute-force attacks on a login system;
- Bots that sign up for a great number of accounts on sites that offer free e-mail services;
- Worms and spam in e-mails.
A typical CAPTCHA is an image with some characters (or symbols) rendered into it, which is placed in a web form and presented to the user. Solving it requires identifying all characters in the correct order, typing them into the provided textbox and submitting the web form back to the server. There the submitted input is validated before the authentication process (or any other process initiated by the user's form submission) continues. Since the characters are part of the image and not plain text, a robot presumably cannot discover them by examining the source of the page; they are perceivable to the human eye only.
Typical examples of the use of CAPTCHAs are major websites offering free e-mail services (such as Yahoo, Hotmail and Google/Gmail), sites hosting online communities (such as LiveJournal) and sites using open source software such as phpBB. Currently a version called reCAPTCHA is recommended by the original creators as the official CAPTCHA implementation.
General guidelines
Constructing CAPTCHAs of practical value is difficult because it is not sufficient to develop challenges at which humans are more successful than machines: the cost of failure for an automated attacker is minimal compared to the cost of failure for a human. The following guidelines are recommended for any CAPTCHA implementation:
- Accessibility to users
Without neglecting security against automated attacks, the final result should still be legible to most humans. It is recommended that it also offers audio or other alternatives so that disabled users are able to solve it. It should also provide a smooth experience to users (e.g. display a relevant message when they do not pass the CAPTCHA).
- Image security
CAPTCHA images should be distorted randomly before being presented to the user. Images that display undistorted text are vulnerable to simple automated attacks.
- Script security
Developers should make sure that the image validation cannot be bypassed at script level. Examples of script weaknesses include:
  - passing the answer to the CAPTCHA as plain text to the browser;
  - a solution to a CAPTCHA which is usable multiple times.
- Independence of usage scale
The solution should remain secure even if it becomes widely adopted (and therefore worth the time and effort bot authors would invest to break it).
- Own vs. ready-made solutions
When a website is to feature image validation, developers need to decide whether to implement it themselves or use an existing solution. The official CAPTCHA site recommends using the solution its authors have developed, as it is less failure-prone and thoroughly tested to be difficult for specialized optical recognition software. On the other hand, a great many ready-made examples are available on the web, though many of them are vulnerable to automated attacks. A developer could write or reuse an implementation suitable for the programming language used, refining it according to recommended development practices.
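Both script weaknesses named in the guidelines (an answer that reaches the browser, and a solution that can be used more than once) can be avoided by keeping the answer on the server and discarding it on the first attempt. The following is a minimal Python sketch of that idea, not a complete implementation; the ChallengeStore class and its method names are hypothetical, and a real site would keep the answers in its session storage rather than in an in-memory dictionary.

```python
import secrets
import string

# Assumption for illustration: letters and digits, as in the article's examples.
ALPHABET = string.ascii_letters + string.digits


class ChallengeStore:
    """Keeps CAPTCHA answers on the server, keyed by an opaque token."""

    def __init__(self):
        self._answers = {}

    def issue(self, length=5):
        """Create a challenge; only the token may be sent to the browser."""
        answer = "".join(secrets.choice(ALPHABET) for _ in range(length))
        token = secrets.token_urlsafe(16)
        self._answers[token] = answer
        # The answer is returned only so the server can render it into an image.
        return token, answer

    def verify(self, token, guess):
        """Check a guess; the challenge is removed whether or not it is right."""
        answer = self._answers.pop(token, None)
        if answer is None:
            return False
        return secrets.compare_digest(answer.encode(), guess.encode())


store = ChallengeStore()
token, answer = store.issue()
print(store.verify(token, answer))  # first attempt with the right answer
print(store.verify(token, answer))  # replay of the same token
```

Because verify() removes the entry before comparing, neither a replayed correct answer nor a second guess after a failed attempt can succeed.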
Weaknesses to be avoided
As use of CAPTCHAs has become a general practice, research projects have emerged, focused on the development of automated optical recognition systems. Several of them have achieved reasonable effectiveness in breaking CAPTCHAs used in popular sites, such as Yahoo's "EZ-Gimpy". This is one of the real-world obstacles that actual introduction of CAPTCHAs could encounter. Developers should make sure their solutions are not susceptible to the following weaknesses:
- Reusability
Some CAPTCHA protection systems can be bypassed simply by re-using the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does not allow multiple solution attempts at one CAPTCHA. This prevents both the reuse of a correct CAPTCHA solution and a second guess after an incorrect OCR attempt.
- Weak hash
Other CAPTCHA implementations use a hash (such as an MD5 hash) of the solution as a key passed to the client to validate the CAPTCHA. Often the space of possible solutions is small enough that this hash can be cracked by brute force.
- High success rate of computer character recognition
OCR (optical character recognition) is the translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text.
One of the existing implementations of OCR is GOCR (or JOCR), a free optical character recognition program initially written by Jorg Schulenburg. It can be used to convert or scan image files (portable pixmap or PCX) into text files.
The algorithm of an OCR system would usually feature the following steps:
  - Extraction of the image from the web page;
  - Processing of the image, for example converting it to grayscale and then thresholding to black and white, or removing background clutter with color filters and detection of thin lines;
  - Segmentation, i.e. splitting the image into regions each containing a single character (so-called connected components);
  - Recognition of the character in each region.
- Human solvers
Theoretically CAPTCHAs are vulnerable to relay attacks using human solvers. One approach is relaying the images to a "sweatshop" of human operators who solve them. Another approach involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site owned by the attacker. A real-world example cited by the sources used is that of CAPTCHA images being sent to an "adult" site, whose users are asked to solve the CAPTCHA before being able to see a selected image. According to the official CAPTCHA site such cases are not a concern, since the damage done by this technique is negligible: it only allows spammers to abuse systems a limited number of times (unlike an automated bot attack). Moreover, by adding the extra step these sites risk losing visitors to other sites that do not have it, so the technique could be less than beneficial for the attacker's site itself.
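The weak hash problem mentioned in this section is easy to demonstrate: when the hash of the solution is visible to the client and the solution is short, every candidate can simply be hashed and compared. The Python sketch below is illustrative only; the function name crack_md5 and the three-character sample value are assumptions chosen to keep the run short.

```python
import hashlib
import itertools
import string


def crack_md5(target_hex, alphabet, length):
    """Try every string of the given length; return the one whose MD5 matches."""
    for candidate in itertools.product(alphabet, repeat=length):
        text = "".join(candidate)
        if hashlib.md5(text.encode()).hexdigest() == target_hex:
            return text
    return None


# Hypothetical hash value that a page might expose to the browser.
leaked = hashlib.md5(b"k7q").hexdigest()
alphabet = string.ascii_lowercase + string.digits

# 36 symbols and 3 positions give only 36**3 = 46,656 candidates.
print(crack_md5(leaked, alphabet, 3))
```

A five-character solution over 62 letters and digits has 62^5 = 916,132,832 candidates, which is still within reach of an offline search, so the hash should stay on the server rather than be sent to the client.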
Practical implementation recommendations
If you as a developer intend to implement your own CAPTCHA solution, here is a list of practical recommendations for making it more secure and reliable:
- Random generation of characters from a broad range: lowercase and capital Latin letters plus digits. This provides far more combinations than a list of real-world words (such as a subset of the English vocabulary). Varying the string length is another strong point;
- Random (not fixed) initial position of the text, so that the position where the text starts cannot be predicted;
- Alternation of the color used for text: varying the color between characters hinders isolating the text by filtering on a single predictable color;
- Rotation / distortion of text / background, aiming to keep it legible to humans while making it unrecognizable to software;
- Random (not fixed) font size of characters, selected from a range;
- Random font face from a range (possibly for every character in the string);
- Use of background: a random background image from a range, a pattern that imitates characters in the same color as the text, randomly placed pixels in the color of the text;
- Extra elements: use connected lines and arcs as clutter (both in background and foreground) to make segmentation difficult, for example randomly positioned lines intersecting and connecting the characters.
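To see why lines that connect the characters hinder an attacker, consider the segmentation step of an OCR system, which looks for connected components. The Python sketch below counts connected groups of "ink" cells in a tiny hand-made bitmap; the function name and the sample grids are assumptions made for illustration, not part of any OCR package.

```python
from collections import deque


def count_components(grid):
    """Count 4-connected groups of '#' cells in a list of equal-length strings."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            # A new, unvisited ink cell starts a new component.
            count += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == "#" and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return count


# Three separate "characters".
clean = [
    "##..##..##",
    "##..##..##",
    "##..##..##",
]
# The same characters with one horizontal line drawn through them.
cluttered = [
    "##..##..##",
    "##########",
    "##..##..##",
]
print(count_components(clean))      # 3
print(count_components(cluttered))  # 1
```

With the line in place the three characters merge into a single region, so a segmenter that relies on connected components can no longer isolate them one by one.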
Examples
Examples below are an attempt to present similar functionality in different programming languages. Please note that these are just sample implementations of CAPTCHA, using some of the recommendations listed above.
I. Implementation in PHP
Two files are used:
- image.php, which outputs an image containing a character string;
- index.php, which contains a web form where the solution to the CAPTCHA is submitted and verified.
The validation image is created from a JPEG canvas image file, standard Windows TrueType fonts and PHP image functions (these require the GD library to be installed).
It features a randomly generated alphanumeric string displayed with random position, size, font, color and angle for each character. Additionally four askew lines with random position, color and width are drawn across the image and multiple random-colored pixels are set at random positions.
The MD5 hash of the displayed string is stored as a SESSION variable and the image is output to the browser.
<?php
session_start();
$font_path = "C:/WINDOWS/Fonts/";
//Generate a random 5-character string (random_bytes() requires PHP 7 or later)
$string = substr(bin2hex(random_bytes(3)), 0, 5);
$image = imagecreatefromjpeg("images/1.jpg");
//Allocate a color for an image
//black
$black = imagecolorallocate($image, 0, 0, 0);
//random dark colors
$line = imagecolorallocate($image, rand(0,70), rand(0, 70), rand(0, 70));
$line2 = imagecolorallocate($image, rand(0,70), rand(0, 70), rand(0, 70));
//draw askew random lines
imagesetthickness($image, rand(1, 3));
imageline($image, 10, 1, rand(30,200), rand(25,58), $line2);
imagesetthickness($image, rand(1, 3));
imageline($image, 1, 60, rand(40, 200), rand(10, 60), $line);
imagesetthickness($image, rand(1, 3));
imageline($image, 5 ,8 , rand(120, 200), rand(20, 60),$line2);
imagesetthickness($image, rand(1, 3));
imageline($image, rand(6, 60), rand(40, 60), rand(20, 200) , rand(15, 60), $line);
//set random pixels with random color
for ($i = 0; $i <= 256; $i++)
{
$point_color = imagecolorallocate ($image, rand(0,255), rand(0,255), rand(0,255));
imagesetpixel($image, rand(6,200), rand(2,58), $point_color);
}
//Write text to the image using TrueType fonts
$y1 = rand(25, 50);
ImageTTFText ($image, rand(17, 20), rand(-30, 30), rand(10, 20), $y1, $black, $font_path.'verdana.ttf', substr($string, 0, 1));
$y2 = rand(25, 50);
ImageTTFText ($image, rand(17, 20), rand(0, 30), rand(45, 55), $y2, $line2, $font_path.'trebuc.ttf', substr($string, 1, 1));
$y3 = rand(25, 50);
ImageTTFText ($image, rand(17, 25), rand(-20, 0), rand(80, 90), $y3, $black, $font_path.'trebuc.ttf', substr($string, 2, 1));
$y4 = rand(25, 50);
ImageTTFText ($image, rand(16, 25), rand(0, 30), rand(115, 125), $y4, $black, $font_path.'verdana.ttf', substr($string, 3, 1));
$y5 = rand(25, 50);
ImageTTFText ($image, rand(18, 20), rand(-25, 0), rand(150, 160), $y5, $line2, $font_path.'trebuc.ttf', substr($string, 4, 1));
//Store only the hash of the string in the session, then output the image
$_SESSION['key'] = md5($string);
header("Content-type: image/jpeg");
imagejpeg($image);
imagedestroy($image);
?>
The validation page contains a simple form with the image displayed, a textbox where the text is to be typed and a submit button.
When the user submits the form, the MD5 hash of the text typed into the textbox is calculated and compared to the value stored as a SESSION variable. After that a message is displayed, notifying if the input text matches the text in the image or not.
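<?php
session_start();
$result = "";
if (isset($_POST["key"]) && is_string($_POST["key"]))
{
    //hash_equals() compares the hashes as strings; == could treat two "0e..." hashes as equal numbers
    if (isset($_SESSION["key"]) && hash_equals($_SESSION["key"], md5($_POST["key"])))
        $result = "Correct!";
    else
        $result = "Not correct!";
    //Discard the stored hash so that each CAPTCHA can be attempted only once
    unset($_SESSION["key"]);
}
?>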
<html>
<head></head>
<body>
<form action="" method="POST">
<font color="#FF0000"><?php echo $result; ?></font><br /><br />
<img src="image.php" alt="" /><br /><br />
<input type="text" name="key" /><br /><br />
<input type="submit" name="submit" value="Validate" />
</form>
</body>
</html>
[Screenshot: the resulting PHP validation form]
II. Implementation in ASP.NET (C#)
Two files are used:
- image.aspx, which outputs an image containing a character string;
- index.aspx, which contains a web form where the solution to the CAPTCHA is submitted and verified.
The validation image is created using classes from the System.Drawing, System.Drawing.Imaging and System.Drawing.Text namespaces. It features a randomly generated alphanumeric string (lowercase and capital letters and digits) with random initial position, color and font size. Additionally three askew lines and a curve with random position, color and width are drawn across the image and multiple random-colored pixels are set at random positions.
The hash code of the displayed string is stored as a SESSION variable and the image is output to the browser.
<%@ Page Language="C#" %>
<%@ Import Namespace="System.Drawing" %>
<%@ Import Namespace="System.Drawing.Text" %>
<%@ Import Namespace="System.Drawing.Imaging" %>
<script runat="server">
private Random rnd;
protected void Page_Load(object sender, EventArgs e)
{
int x, y;
string strKey = null;
rnd = new Random();
Response.ContentType = "image/jpeg";
Response.Clear();
Response.BufferOutput = true;
strKey = GenerateString(5);
Session["key"] = strKey.GetHashCode();
Font font = new Font("Arial", (float)rnd.Next(18, 24), FontStyle.Bold);
Bitmap bitmap = new Bitmap(200, 50);
Graphics gr = Graphics.FromImage(bitmap);
Color black = Color.Black;
Color line = Color.FromArgb(rnd.Next(0, 70), rnd.Next(0, 70), rnd.Next(0, 70));
SolidBrush brush = new SolidBrush(line);
gr.FillRectangle(Brushes.White, new Rectangle(0, 0, bitmap.Width, bitmap.Height));
gr.DrawString(strKey, font, brush, (float)rnd.Next(70), (float)rnd.Next(20));
gr.DrawCurve(new Pen(line, (float)rnd.Next(1, 3)), new Point[] { new Point(0, rnd.Next(50)), new Point(rnd.Next(200), rnd.Next(50)), new Point(rnd.Next(200), rnd.Next(50)), new Point(rnd.Next(200), rnd.Next(50)), new Point(200, rnd.Next(50)) });
gr.DrawLine(new Pen(line, (float)rnd.Next(1, 3)), new Point(0, rnd.Next(50)), new Point(200, rnd.Next(50)));
gr.DrawLine(new Pen(black, (float)rnd.Next(1, 3)), new Point(0, rnd.Next(50)), new Point(200, rnd.Next(50)));
gr.DrawLine(new Pen(line, (float)rnd.Next(1, 3)), new Point(0, rnd.Next(50)), new Point(200, rnd.Next(50)));
for (x = 0; x < bitmap.Width; x++)
for (y = 0; y < bitmap.Height; y++)
if (rnd.Next(6) == 1)
bitmap.SetPixel(x, y, Color.FromArgb(rnd.Next(0, 255), rnd.Next(0, 255), rnd.Next(0, 255)));
font.Dispose();
gr.Dispose();
bitmap.Save(Response.OutputStream, ImageFormat.Jpeg);
bitmap.Dispose();
}
private string GenerateString(int length)
{
string validatingText = String.Empty;
Int32 seed = new Int32();
rnd = new Random();
for (int a = 0; a < length; a++)
{
do
{
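//The upper bound of Random.Next is exclusive, so 62 yields indexes 0-61 ('0'-'9', 'A'-'Z', 'a'-'z')
seed = rnd.Next(0, 62);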
}
while (seed == 0 || seed == 1 || seed == 18 || seed == 24 || seed == 44 || seed == 50);
if (seed < 10)
seed += 48;
else if (seed < 36)
seed += 55;
else
seed += 61;
char chr = (char)seed;
validatingText += chr.ToString();
}
return validatingText;
}
</script>
<html xmlns="http://www.w3.org/1999/xhtml" >
<body></body>
</html>
The validation page contains a simple form with the image displayed, a textbox where the text is to be typed and a submit button.
When the user submits the form, the hash code of the text typed into the textbox is calculated and compared to the value stored as a SESSION variable. After that a message is displayed, notifying if the input text matches the text in the image or not.
<%@ Page Language="C#" %>
<script runat="server">
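protected void btnValidate_Click(object sender, EventArgs e)
{
    // The stored value is the Int32 hash code set by image.aspx; it is null if no image was requested
    object stored = Session["key"];
    // Remove it so that each CAPTCHA can be attempted only once
    Session.Remove("key");
    int inputKey = txtKey.Text.GetHashCode();
    if (stored is int && (int)stored == inputKey)
    {
        lblResult.Text = "Correct!";
    }
    else
    {
        lblResult.Text = "Not correct!";
    }
}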
</script>
<html>
<body>
<form id="form1" runat="server">
<div>
<asp:Label ID="lblResult" runat="server" /><br />
<asp:Image runat="server" ID="imgVal" ImageUrl="~/image.aspx" />
<br /><br />
<asp:TextBox ID="txtKey" runat="server" />
<br /><br />
<asp:Button ID="btnValidate" runat="server" Text="Validate" OnClick="btnValidate_Click" />
</div>
</form>
</body>
</html>
[Screenshot: the resulting ASP.NET validation form]
Other methods
There are also other methods for protecting web sites from bots. Some of them are applicable for certain purposes such as:
- manual moderation / approval - tedious and time-consuming, but usable when human interaction is required anyway (for example when registered users must be contacted to verify their registration details). The process can be made smoother by sending mail notifications for special events and by presenting items awaiting approval in a designated section of the site;
- applying text filters - this method is appropriate for blogs and wikis where comments could be checked for a list of common spam words. Presence of links (spammers are aiming to improve their backlinks for Search Engine Optimization purposes) can be another reason for adding entries to the approval queue or deleting them;
- e-mail validation - posting an entry or creating an account requires users to provide their e-mail. In order for the account to be approved, they must click a link that is sent to their email address. This ensures they are using a valid email address;
- logging the IP address and username of login attempts and temporarily locking the account for a period of time after a given number of unsuccessful attempts (as a preventive measure against dictionary and brute-force attacks on login systems).
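The last measure in the list, logging login attempts and locking the account temporarily, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the LoginThrottle class, its default limits (5 failures, 900 seconds) and the in-memory dictionaries are hypothetical, and a real site would persist this state.

```python
import time


class LoginThrottle:
    """Counts failed logins per username and locks the account temporarily."""

    def __init__(self, max_failures=5, lock_seconds=900, clock=time.monotonic):
        self.max_failures = max_failures
        self.lock_seconds = lock_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._failures = {}          # username -> consecutive failures
        self._locked_until = {}      # username -> time when the lock expires
        self.log = []                # (time, username, ip) of failed attempts

    def is_locked(self, username):
        until = self._locked_until.get(username)
        if until is not None and self._clock() < until:
            return True
        # The lock has expired or never existed.
        self._locked_until.pop(username, None)
        return False

    def record_failure(self, username, ip):
        self.log.append((self._clock(), username, ip))
        count = self._failures.get(username, 0) + 1
        if count >= self.max_failures:
            self._locked_until[username] = self._clock() + self.lock_seconds
            count = 0
        self._failures[username] = count

    def record_success(self, username):
        self._failures.pop(username, None)
```

A login handler would call is_locked() before checking the password, record_failure() on a wrong password and record_success() on a correct one. Counting per username rather than per IP address means that attempts spread over many machines still trigger the lock.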
Other measures at a more basic level include:
- blacklisting IP addresses based on detected malicious activity - even though in case of a geographically dispersed botnet it becomes difficult to identify a pattern of offending machines, and the volume of IP addresses cannot be covered by filtering individual cases;
- passive OS fingerprinting can identify attacks originating from a botnet: network administrators can configure newer firewall equipment to take action on a botnet attack by using information obtained from Passive OS Fingerprinting;
- utilization of rate-based intrusion prevention systems.
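Rate-based intrusion prevention systems work at the network level, but the underlying idea can be illustrated at application level with a sliding-window counter per IP address. The Python sketch below is only an approximation of the principle; the SlidingWindowLimiter class and its parameters are hypothetical and do not describe any particular product.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per IP within any window_seconds interval."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
for second in (0, 1, 2, 3):
    print(second, limiter.allow("192.0.2.1", second))
```

The first three requests are accepted and the fourth is rejected, because it arrives while three earlier requests are still inside the ten-second window.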
Conclusion
Experience and wide use have proved CAPTCHAs to be an effective means of preventing automated software agents from performing undesirable activities against websites. Developers should strongly consider them as a solution to the problem, but should make sure their implementation is reliable and robust and should combine it with other prevention methods for even greater security.
References:
- http://www.captcha.net - the official CAPTCHA site
- http://www.recaptcha.net - reCAPTCHA project
- http://www.en.wikipedia.org/CAPTCHA - a good source for a general overview and more
- http://research.microsoft.com/~kumarc/pubs/chellapilla_nips04.pdf - K. Chellapilla, P. Simard - Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)