HTML 5 Forms – a spammer's paradise

HTML 5 form spam
Did you know that HTML 5 – the spec that won't be completed until 2022, though parts of it are usable now – will have a whole new set of form elements designed to make complex forms available natively in the browser? I've been to a few talks where Opera's Bruce Lawson has demoed and talked about these upcoming features, which have already been implemented in the Opera browser. From an accessibility standpoint it looks great: no longer will screen readers have to rely on labels to infer the type of data to be entered into a form. From a developer's standpoint, you won't have to code JavaScript date pickers any more, nor rely on JavaScript for validation.

So, all of this makes it easier to enter data on the web – a great thing. I asked a question this morning: "who enters the most data on the internet?" The answer is spammers. It is generally thought that 90% of all e-mail sent is spam, and a quick glance at my blog's spam counter shows 7,300 fake comments caught compared to 56 real comments.

So, why will HTML 5 forms be such a problem? Well, at the moment, spammers use automated tools to crawl the internet, looking for forms to fill in so they can spread their advertising links or perform XSS attacks. To bypass most validation, the crawlers look for labelled form fields to fill in. Quite simply, HTML 5 forms will make this job easier.

Instead of labelling forms with "e-mail", there's now a specific input type, <input type="email">, which validates an e-mail address. Common anti-spam methods, such as adding a second e-mail field hidden from normal users, can simply be ignored, as there is one clear (and CSS-visible) e-mail address field.
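To make that concrete, here's a sketch of the two fields side by side (the field names are made up for illustration). The new typed input announces its purpose to any crawler, while the traditional hidden-decoy countermeasure depends on a bot being unable to tell the fields apart:

```html
<!-- HTML 5: the field's purpose is machine-readable from the type attribute -->
<label for="email">E-mail</label>
<input type="email" id="email" name="email">

<!-- Traditional honeypot: a decoy field hidden with CSS. A bot that fills
     every "e-mail"-labelled field gives itself away - but with a clearly
     typed email input on offer, it can now skip the decoy entirely. -->
<div style="display: none">
  <label for="email2">E-mail (leave this blank)</label>
  <input type="text" id="email2" name="email2">
</div>
```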

Form validation may be useful for the normal user, but it's even more useful for the spammer. With the limits of an input field now declared in plain text on the input element itself, it becomes trivial for bots to enter data that passes validation.
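As a sketch of what "limits in plain text" means (the attribute values here are invented examples, not from the spec): everything a bot needs in order to fabricate a passing value is sitting right there in the markup:

```html
<!-- A crawler can read min/max/pattern/maxlength directly and generate
     a value that is guaranteed to pass client-side validation -->
<input type="number" name="age" min="18" max="99">
<input type="text" name="postcode"
       pattern="[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}" maxlength="8">
```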

So, what can be done about this? Well, I'm not sure. Some anti-spam methods will still work; for instance, timing how long a visitor takes between arriving on the page and submitting the form. Very short times are spam, short times are sent for moderation, and normal times are approved. There's CAPTCHA, which is inaccessible, and then there's blacklisting, which hasn't worked for years.
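A minimal sketch of that timing idea (the function name and the threshold values are my own assumptions, and you'd want to tune the cut-offs against real traffic):

```javascript
// Classify a form submission by how long the visitor spent on the page.
// On the server, compare the timestamp recorded when the form page was
// served against the timestamp of the POST.
function classifyByTiming(secondsOnPage) {
  if (secondsOnPage < 3) {
    return 'spam';      // nobody writes a real comment in under a few seconds
  }
  if (secondsOnPage < 10) {
    return 'moderate';  // suspiciously quick - hold for human review
  }
  return 'approve';     // plausible human timing
}
```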

If you have any theories, please share them here. If there's a solution, or something the working group can do to make spam more difficult rather than easier, it should get into the spec sooner rather than later.

If you enjoyed this post, leave a comment or subscribe to the RSS feed to have future articles delivered to your feed reader.

Steve Workman

Steve is an engineering manager at Maersk, and organises BerkshireJS. He has also worked at Yell and PA Consulting and is a former organiser of London Web Standards.



  1. André Luís said:

    You really need to add negative captchas to your blog. 😉

    Also, it’s far easier to catch a spammer red-handed. Label a field “email” but hide it, along with a message: “If you’re a human, please do not fill in this field”. If it’s filled in, it’s a spam bot. If it’s filled in with a valid e-mail address, it’s a clever spam bot. Both caught.
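The server-side half of the trick André describes might look something like this (the honeypot field name is hypothetical):

```javascript
// Reject any submission where the hidden honeypot field has been filled in.
// Humans never see the field, so a non-empty value means an automated bot
// filled every field it could find.
function isSpamBot(formFields) {
  const honeypot = formFields['email2'] || '';
  return honeypot.trim() !== '';
}
```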

  2. jl said:

    I’m more worried about a perfect HTML parser becoming available. Bots could use it, and you wouldn’t be able to catch them parsing comments or entities wrong.

    All these hidden field tricks are just a simple technical arms race, and it’s inevitable that bots will catch up sooner or later. Robust methods like Akismet/Defensio/Sblam are needed.

  3. James said:

    One solution may be to require the correct answer to a simple question in each form. For example, adding the question “what colour is grass?” and filtering all incorrect answers into a moderation queue might work. Although, if enough people used the same questions over and over, the spammers would eventually be able to predict the correct answers. The questions would also need to cater for the intended audience of each form and not be so obscure or difficult to answer that they put off real users…

  4. Steve Workman said:

    @Andre – It all sounds great in theory, but can you rely on humans not filling that in? How much attention do you actually pay to forms?

    @Mathias – how does Akismet work? What does it do that is so special?

    @james – what colour is the grass on your side of London? Mine is green but it’s quite brown or burnt in other places. Questions don’t have to work just locally, but internationally too (i.e. the answer to what colour is the grass in Tunisia is “what grass?”)

  5. designer boots said:

    I just couldn’t leave your webpage without letting you know that I really enjoyed the quality information you offer to your visitors… Beats wasting all day at tmz reading about Tiger Woods lol.. Will be back often to check up on new stuff you post!

  6. James said:

    @Steve – “questions don’t have to work just locally, but internationally” too true, but the principle works. It would just be a matter of working out how to word a question that was understandable for the whole audience.

  7. rimmer333 said:

    I think client-side validation is not to be confused with spam filtering. HTML5 doesn’t cancel any other filtering you could apply otherwise. If spammers are going to use HTML5 features to ease their job, then they have to use a compatible browser (or other user agent). Who prohibits them from using the same browser today? And if they use it, will it be any harder to parse modern, valid pre-HTML5 markup with JS/jQuery than to parse valid HTML5 and use some of its sweetness? Successfully parsing anything – be it binary images or super-semantic HTML5 forms – is just a question of time; it will be done someday, and the mission of preventing spam is definitely not to postpone the issue to the day when spammers learn how.
    The idea of all server-side security (fronted by HTML5 or anything else) is not to trust the user input anyway. Client-side validated, coming from HTML5 – so what (and, by the way, how exactly would you know that)? Spammers will fill in more forms, sure, but as long as the server-side guys keep their eyes open, it won’t hurt much. Worried about traffic? Traffic increases every other minute, so there’s not much of a problem there either.

  8. Stave said:

    Would it be possible to get permission to use some of your entries on websites with a link back?

  9. Steve Workman said:

    Sure, as long as the article is reprinted in full, with a link back and the complete copyright notice on the article.
