You are currently viewing the personal blog of Wiras Adi, a web designer and application developer, located in Jakarta, Indonesia.

Simple Text-based CAPTCHA Implementation

Spambots are automated scripts that crawl the net looking for URLs that contain some kind of submission form, such as forums, guestbooks, or comment forms on popular blogs, and then automatically post whatever their operator (the spammer) wants everybody to see. The posts usually carry commercial messages, offers, or plain site promotion. This annoying practice has been one of the biggest problems of the Internet since its early days.

There are several known ways to fight this kind of spambot. One is moderation: the site’s moderators manually check and validate every submitted post. This is effective at preventing spam, but not very efficient. A more popular method is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). As the word “Automated” in the name indicates, this kind of test attempts to eliminate the manual validation that a moderated system requires, which makes it considerably more efficient.

Different approaches have been developed to implement CAPTCHA. The most popular one challenges users to retype a text or word presented as a distorted image, on the assumption that such text is difficult for a computer to read but still recognizable to a human. Another approach plays a sound and challenges users to write down what they’ve heard. But my favorite CAPTCHA implementation is the old and simple text-based challenge. It works by asking users to answer a randomly selected question, like “What is the color of the sky at night?” or a simple math question like “What is twenty divided by five?”. Personally, I prefer this kind of question-and-answer interaction to a system that asks me to copy down something it shows. It feels more “human”, and in my view it offers roughly the same level of security as the other methods.

How does it work?

To implement this text-based CAPTCHA, you’ll need a collection of questions stored in some sort of data store, whether it’s an RDBMS, a file, or a specially built web service. Generally the questions should be simple enough that everybody knows the answer, unless you only expect a limited group of users to submit responses. In that case you can present questions that are esoteric to that group, which adds a degree of security.

To validate users’ answers, you also need to enable sessions for your pages. The session tracks which question has been presented to the user, so that the submitted answer can be matched against the correct answer stored in the database. Depending on the question, there might be more than one correct answer.

Consider the following question: “What is six multiplied by five?”. Users might type the answer as:

  • 30, or
  • thirty

Since both answers are correct, you will most likely use a regular expression in the matching process.

PHP Implementation

Now I’ll show you how easy it is to implement this CAPTCHA system in PHP. We’ll use MySQL to store the questions. The SQL script to generate the table looks like this:

-- Table structure for table `captchas`
CREATE TABLE IF NOT EXISTS `captchas` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `question` varchar(255) NOT NULL,
  `answer` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) AUTO_INCREMENT=1;

-- Dumping data for table `captchas`
-- Alternative correct answers are separated by a colon
INSERT INTO `captchas` (`id`, `question`, `answer`) VALUES
(1, 'What is six multiplied by five?', '30:thirty');
Here we created a table named captchas with three columns, id, question, and answer, which are quite self-explanatory. The SQL script also inserts one question-answer pair as an example.

One thing to notice is how we use a colon (:) to separate all possible correct answers in the answer column. You can store an arbitrary number of correct answers this way, although it is best to keep the list as short as possible. These answers are later split on the colon during the matching process.

// Match $answer (the text submitted by the user)
// $db_answer is the colon-separated answer string obtained from the database
$answer_ok = false;
$arr_answers = explode(':', $db_answer);
foreach ($arr_answers as $check_against) {
  // preg_quote() escapes any regex metacharacters in the stored answer
  if (preg_match('/\b' . preg_quote($check_against, '/') . '\b/i', $answer)) {
    $answer_ok = true;
    break;
  }
}

This snippet matches the answer submitted by the user ($answer) against the collection of correct answers obtained from the database ($db_answer). It first splits the possible answers into an array ($arr_answers), then iterates over it and tests the submitted answer against each entry using whole-word matching. As soon as one entry matches, the iteration stops and the boolean flag $answer_ok is set to true, indicating that the user has passed the CAPTCHA validation.

With this whole-word matching, our example accepts answers like “30”, “the number 30”, or “it’s thirty”, and rejects answers in which the expected word is only part of a longer word or number, like “130” or “thirtieth”. This small degree of flexibility gives the system a human feel, which I like and which cannot be found in image- or sound-based CAPTCHA.
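As a rough sketch, the matching logic above could be wrapped in a small helper so it can be tried against a few sample answers. The function name answer_matches() and the sample inputs are illustrative assumptions and are not taken from the demo files.

```php
<?php
// Hypothetical helper (not part of the original demo): returns true when
// $answer contains any of the colon-separated alternatives in $db_answer
// as a whole word, ignoring case.
function answer_matches(string $answer, string $db_answer): bool
{
    foreach (explode(':', $db_answer) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '') {
            continue; // ignore empty alternatives such as a trailing colon
        }
        // preg_quote() keeps characters like "." or "?" in a stored answer literal.
        $pattern = '/\b' . preg_quote($candidate, '/') . '\b/i';
        if (preg_match($pattern, $answer) === 1) {
            return true;
        }
    }
    return false;
}

// Try a few sample submissions against the stored answer string.
$db_answer = '30:thirty';
foreach (['30', 'the number 30', "it's THIRTY", '130', 'thirtieth'] as $try) {
    $verdict = answer_matches($try, $db_answer) ? 'accepted' : 'rejected';
    printf("%-15s => %s\n", $try, $verdict);
}
```

The first three samples are accepted and the last two are rejected, because \b only matches at the edge of a run of letters, digits, or underscores.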

The question presented to users is randomly selected from the database using the following snippet:

// Query string to pick a single question at random.
// $q_count is the total number of rows in `captchas` (e.g. from SELECT COUNT(*)).
$query = 'SELECT * FROM captchas LIMIT ' . rand(0, $q_count - 1) . ', 1';
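The query-building step can be sketched as a small function that also guards against an empty table. The name random_question_query() is hypothetical, and the row count is assumed to have been fetched beforehand, for example with SELECT COUNT(*) FROM captchas.

```php
<?php
// Hypothetical helper (not part of the original demo): builds the query
// that fetches one randomly chosen row from `captchas`.
// $q_count is assumed to be the row count obtained from a prior COUNT(*) query.
function random_question_query(int $q_count): string
{
    if ($q_count < 1) {
        throw new InvalidArgumentException('The captchas table is empty.');
    }
    // Zero-based row offset between 0 and $q_count - 1, inclusive.
    $offset = mt_rand(0, $q_count - 1);
    // The offset is an integer produced here, so concatenating it is safe.
    return 'SELECT id, question FROM captchas LIMIT ' . $offset . ', 1';
}

echo random_question_query(5), "\n";
```

Selecting only id and question at this stage is a design choice of the sketch: the answer column is not needed until the submitted form is validated.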

To explore the rest of the code, I’ve made the demo files of this simple text-based CAPTCHA implementation available for download, free for you to use and improve.
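For readers who want to see the steps side by side before opening the demo, here is a minimal sketch of the flow: present a question, remember its ID, then validate the submitted answer. It is not the demo code. The $captchas array stands in for the database table, the $session array stands in for $_SESSION, and the function names are assumptions made for illustration.

```php
<?php
// In-memory stand-in for the `captchas` table (an assumption, so the
// sketch runs without a database). Keys play the role of the id column.
$captchas = [
    1 => ['question' => 'What is six multiplied by five?', 'answer' => '30:thirty'],
    2 => ['question' => 'What is twenty divided by five?', 'answer' => '4:four'],
];

// Step 1 (page render): pick a question at random and remember its ID.
function present_question(array $captchas, array &$session): string
{
    $id = array_rand($captchas);
    $session['captcha_id'] = $id;
    return $captchas[$id]['question'];
}

// Step 2 (form submit): look the question up again and compare answers.
function validate_answer(array $captchas, array &$session, string $answer): bool
{
    $id = $session['captcha_id'] ?? null;
    if ($id === null || !isset($captchas[$id])) {
        return false; // no question was presented in this session
    }
    foreach (explode(':', $captchas[$id]['answer']) as $candidate) {
        $pattern = '/\b' . preg_quote($candidate, '/') . '\b/i';
        if (preg_match($pattern, $answer) === 1) {
            return true;
        }
    }
    return false;
}

$session = []; // stands in for $_SESSION after session_start()
echo present_question($captchas, $session), "\n";
```

In a real page, the two functions would run in separate requests, with the session carrying the question ID from the first request to the second.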

Further Considerations

  • To keep the example brief, the demo uses a $_SESSION variable to persist the CAPTCHA ID, so that the application can track which question was presented to the user earlier.
    function persist_id($id) {
      $_SESSION['captcha_id'] = $id;
    }
    
    function persisted_id() {
      return $_SESSION['captcha_id'];
    }

    However, in a cookie-based session system this practice is open to a session reuse attack, because the CAPTCHA ID is stored on the client side. An attacker can answer one of your questions, grab the session, and then repeatedly reuse that same answer and session to spam your pages. It is better and safer to persist the CAPTCHA ID on the server side, using either database or file session storage, or a hand-built persistence mechanism such as a temporary table. All you need to do is override the setter and getter (persist_id() and persisted_id()).

  • To add another layer of security, you could combine the CAPTCHA validation with a form authenticity token to guarantee that your session hasn’t been tampered with, and to prevent users from submitting spam messages without visiting your page (hotposting).
  • You may also want to limit users’ answers to a certain number of characters. Most of the time you’ll only need to look at the first 10 or 15 characters, which keeps malicious users from flooding your form with overly long answers.
  • Unlike other types of CAPTCHA, a text-based CAPTCHA challenges users to answer a question, so it requires them to think rather than just copy what is shown or heard. This lets you target a specific group of users who are allowed to post. Say you’ve written a review of the Middle-earth stories by J.R.R. Tolkien, and you want only readers who actually follow the stories to post replies or give their own reviews. You only need to write a set of questions that you expect only those readers to be able to answer, which narrows the number of potential posters.
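Several of the considerations above can be sketched together: a form token issued with the question, a cap on the answer length, and a CAPTCHA ID that is consumed on first use so a replayed submission finds nothing to validate against. This is only an illustration under stated assumptions. The function names, the MAX_ANSWER_LENGTH constant, and the $session array (standing in for server-side $_SESSION storage) are hypothetical and do not appear in the demo files.

```php
<?php
// Assumed cap, following the 10 to 15 characters suggested above.
const MAX_ANSWER_LENGTH = 15;

// Called when the form is rendered: remember which question was shown
// and issue a random token to embed in a hidden form field.
function issue_challenge(array &$session, int $captcha_id): string
{
    $session['captcha_id'] = $captcha_id;
    $session['form_token'] = bin2hex(random_bytes(16));
    return $session['form_token'];
}

// Called on submit: returns the pending CAPTCHA ID and the truncated
// answer, or null when the token is missing, wrong, or already used.
function accept_submission(array &$session, string $token, string $answer): ?array
{
    $expected = $session['form_token'] ?? '';
    $id = $session['captcha_id'] ?? null;
    // Consume both values so the same token and question cannot be replayed.
    unset($session['form_token'], $session['captcha_id']);

    if ($expected === '' || $id === null || !hash_equals($expected, $token)) {
        return null;
    }
    // substr() counts bytes; a multibyte-aware cut would need mb_substr().
    return ['captcha_id' => $id, 'answer' => substr($answer, 0, MAX_ANSWER_LENGTH)];
}

$session = []; // stands in for $_SESSION
$token = issue_challenge($session, 1);
var_dump(accept_submission($session, $token, 'thirty') !== null); // bool(true)
var_dump(accept_submission($session, $token, 'thirty') !== null); // bool(false): replay
```

The returned answer would then go through the same whole-word matching as before. Consuming the ID on every attempt, successful or not, means each page load allows one guess.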

Update history

Feb 07, 2008
Added download files for the ASP implementation of this text-based CAPTCHA.
 

4 responses to this entry so far.

  1. Luc

    Hi there, after searching the net to get an opinion on what captcha to use, your solution is the most interesting I found.
    However, I have a hard time integrating it with tectite formmail (www.tectite.com). I have adapted the form like in the demo page, but even if I select a wrong answer, formmail still processes the form.

    Is it possible to help me out ?

    Thank you very much.

    Luc

    #1
  2. aj

    your so good…thank you for sharing the information :) i’m glad i found your site it’s very helpful for the research i’m doing:) thank you again :)

    #2
  3. shoes

    Can I simply say such a relief to search out person who actually realizes just what they are discussing on the web. You certainly realize how to take a problem to light and enable it to be important. More people need to study it all and understand this side of the story. I can’t believe you are not more popular because you really have the gift.

    #3
  4. garwil

    Wow… a great piece that is simple to implement.
    I would like to make some changes, and I’m not even sure if this is advisable, but going on your comment about personalising the script, I thought about using a form field as part of the question… e.g. I have a travel site and people need to register in order to post their product on the site. During this process they have to select the country they are from via a drop-down select. Those countries are contained within a variable inside the web page but can just as easily be placed in a MySQL db.
    What I thought of was making the visitor input the same country name that they selected previously… how could that be set up, or is it even advisable to do so?
    In the meantime I’ll implement what you have provided here…
    Many thanks

    #4