How I built an effective blog comment spam blocker

Mention comment spam and most people, in particular those crazy WordPress users, mention Akismet. It's a great tool and I have nothing against it, but I wanted to build my own and avoid the external call to the Akismet service. What has been interesting to see is just how effective it is. Turns out, my spammers are quite obvious.

As you might see, I don't use CAPTCHAs and I don't use JavaScript detection. I just use a number of rules that validate each comment on the server. Oh, and I don't use nofollow.

Points System

I use a points system, an idea I got from Movable Type, whose spam protection is also based on points. For everything in a comment that I like, you get a point. For everything I don't like, you lose a point (or two, or three). If you get a 1 or higher, you've made it on the site as a valid comment. If you get a 0, it's set for moderation and I'll take a look at it. If it's below 0, it's marked as spam and I'll never see it (although I check every couple of weeks in case a legitimate comment needs to be unflagged). If it falls below -10, I don't even bother saving it to the database since it is so obviously spam.
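To make the thresholds concrete, here is a minimal sketch in Python. The function name and the action labels are made up for illustration; only the cut-off values come from the description above.

```python
def classify(score: int) -> str:
    """Map a comment's total score to an action.

    The labels are hypothetical; the thresholds follow the text:
    1 or higher is published, 0 is held for moderation, below 0 is
    stored as spam, and below -10 is not stored at all.
    """
    if score >= 1:
        return "approved"
    if score == 0:
        return "moderate"
    if score < -10:
        return "discard"
    return "spam"
```

Note that a score of exactly -10 is still saved as spam; only scores below that are thrown away.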

Types of Spam

There are two main types of spam: automated and manual.

Automated spam is the most obvious. Bots try a number of tricks, and they stand out when you see the same message posted a dozen times within seconds of each other. Automated spam is also the easiest to catch, so insanely simple that just a few rules would catch about 95% of all comment spam hitting a server. (That percentage may even be higher; I'm just guessing.)

Manual spam, on the other hand, is more devious. People actually try and respond to the article at hand, which makes it slightly harder to catch. I say slightly because the vast majority of manual spammers do such a poor job at leaving a comment that they stand out like a sore thumb. The remaining few are usually the ones you end up filtering by hand.

Quick Solution

The quickest way to reduce the amount of comment spam you get, one that doesn't require any server-side programming and is built into almost all blogging tools, is to simply turn off comments on a post after a certain amount of time. It works quite well, and here are the two major reasons why:

Automated spammers keep a database of pages to submit to. If the form is no longer there, you don't get spam. Spammers are forced to discover new pages to spam.
Manual spam often tries to hit pages that have higher page ranks. There are plenty of search engine tools to help people look this information up. (I'd actually see referrers from these search tools, followed shortly by a new blog comment.) Higher page ranks happen on older and popular posts. Once the comment form is shut down, manual spammers are left to target newer pages in the hopes of getting missed until the page gets a higher ranking.

I've had old posts where I left comments open for years and would still see users come across them and add to the discussion in meaningful ways. I loved that. However, that almost never happens now. So, I finally gave in and just close comments.
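Most blogging tools have this as a setting, but if you run your own code the check is tiny. This is a sketch only: the two-week window and the names are assumptions for illustration, not a description of any particular tool.

```python
from datetime import datetime, timedelta

# Assumption: a two-week window. Pick whatever suits your site.
COMMENT_WINDOW = timedelta(days=14)


def comments_open(published_at: datetime, now: datetime) -> bool:
    """Return True while the post is still inside its comment window."""
    return now - published_at <= COMMENT_WINDOW
```

The same check belongs in two places: when deciding whether to render the form, and again when a comment is submitted, since a bot working from a stored list of pages can post without ever loading the form.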

The Rules

In a blog comment, there are 5 fields and I test each one separately and in various combinations for various rules. The fields are: body, email, author name, url, and ip.

Here now are my rules for filtering blog comments.

Rule	Condition	Points
How many links are in the body	More than 2	-1 point per link
How many links are in the body	Less than 2	+2 points
How long is the body	More than 20 characters and there's no links	+2 points
How long is the body	Less than 20 characters	-1 point
Number of previous comments from email	Approved comments	+1 point per
Number of previous comments from email	Marked as spam	-1 point per
Keyword search	Levitra, viagra, casino, etc.	-1 point per
URLs that have certain words or characters in them	.html, .info, ?, & or free	-1 point per
URLs that have certain TLDs	.de, .pl, or .cn (sorry guys)	-1 point
URL length	More than 30 characters	-1 point
Body starts with	Interesting, Sorry, Nice or Cool	-10 points
Author name	Has http:// in it	-2 points per
Body used in previous comment		-1 point per
Random character match	5 consonants	-1 point per
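As a rough illustration of how a few of these rows turn into code, here is a sketch in Python. It is not the site's actual implementation. The link pattern, the keyword list, the choice to count every keyword occurrence, and all function names are assumptions; the history counts would come from database queries that are not shown.

```python
import re

# Assumption: a "link" is anything that starts with http:// or https://.
LINK_RE = re.compile(r"https?://", re.IGNORECASE)

# Assumption: an illustrative list; the table names only these three.
SPAM_KEYWORDS = ("levitra", "viagra", "casino")


def score_body(body: str) -> int:
    """Apply the link-count, body-length, and keyword rows of the table."""
    points = 0
    links = len(LINK_RE.findall(body))

    if links > 2:
        points -= links  # -1 point per link
    elif links < 2:
        points += 2

    if len(body) > 20 and links == 0:
        points += 2
    elif len(body) < 20:
        points -= 1

    lowered = body.lower()
    # -1 point per keyword hit (counting every occurrence is an assumption)
    points -= sum(lowered.count(word) for word in SPAM_KEYWORDS)
    return points


def score_author(name: str) -> int:
    """-2 points for each http:// found in the author name."""
    return -2 * name.lower().count("http://")


def score_history(approved: int, spam: int, duplicate_bodies: int) -> int:
    """Combine counts looked up elsewhere: previously approved comments,
    previous spam from the same email, and earlier comments with the
    same body."""
    return approved - spam - duplicate_bodies
```

The total for a comment is simply the sum of the individual rule scores, which keeps each rule easy to tune on its own.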

Once you have a database of spam messages, you can observe certain patterns. In checking some information from time to time, I discovered some interesting stats:

Body length

Write something of consequence. If it's less than 20 characters, you obviously don't have much to say.

URL matches

Most people who include a URL usually have a top-level domain or a subdomain that they use. They're not using querystring parameters or any other crazy URL structures. And I'm sorry to all the German, Polish, and Chinese commenters, but a few of your fellow countrymen aren't being very nice.

URL Length

URLs that are longer than 30 characters are almost always spam. This ties in with the last filter. If you've got a URL, it's short, sweet and sexy. It's not crazy long — although I have seen some crazy long, perfectly legitimate URLs.
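The URL checks from the last two sections could be sketched like this in Python. The names are hypothetical and the constants are lifted straight from the rules table; how a real implementation parses the URL field is not something the article spells out.

```python
from urllib.parse import urlparse

SUSPICIOUS_PARTS = (".html", ".info", "?", "&", "free")
PENALIZED_TLDS = (".de", ".pl", ".cn")
MAX_URL_LENGTH = 30


def score_url(url: str) -> int:
    """Score the URL field; an empty field scores 0."""
    if not url:
        return 0

    lowered = url.lower()
    # -1 point for each suspicious word or character that appears
    points = -sum(1 for part in SUSPICIOUS_PARTS if part in lowered)

    # Add "//" when the scheme is missing so urlparse still finds the host.
    parsed = urlparse(lowered if "//" in lowered else "//" + lowered)
    host = parsed.hostname or ""
    if host.endswith(PENALIZED_TLDS):
        points -= 1

    if len(url) > MAX_URL_LENGTH:
        points -= 1
    return points
```

A short personal domain scores 0 here, while a long URL with a querystring quickly picks up several negative points.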

Body matches

It may seem like I'm being overly severe on people who start their comments like this but it's a very specific pattern that I'm matching. I was getting 10 to 20 hits of the same message coming in. It was just easier to match the messages and essentially ban them.
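For illustration only, the simplest version of this rule checks the first word of the body. That is deliberately broader than the specific pattern described above, which isn't published here, so treat the pattern and the names below as assumptions.

```python
import re

# Assumption: match on the first word only. The real pattern is narrower.
OPENERS_RE = re.compile(r"^\s*(interesting|sorry|nice|cool)\b", re.IGNORECASE)


def score_opening(body: str) -> int:
    """Return -10 when the body starts with one of the flagged openers."""
    return -10 if OPENERS_RE.match(body) else 0
```

A first-word check like this would also penalize a legitimate "Interesting post, but I disagree...", which is exactly why the pattern needs to be tightened against real spam samples before it carries a -10.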

Random character matches

The other thing I noticed was email addresses or author names that were just a random string of characters. If there are no vowels, sure, you might be Polish, but it's more likely that you're spam. Rarely do even the Polish have 5 consonants in a row!
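A check like this fits in one regular expression. The sketch below is in Python with made-up names; it treats Y as a vowel, so Y breaks a run, and it counts non-overlapping runs of five.

```python
import re

# Five consonants in a row. Y is left out, so it does not count toward a run.
CONSONANT_RUN_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{5}", re.IGNORECASE)


def score_random_characters(text: str) -> int:
    """-1 point for each non-overlapping run of five consonants."""
    return -len(CONSONANT_RUN_RE.findall(text))
```

The same function can be run over both the author name and the part of the email address before the @ sign.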

Effective?

How effective has it been? These days, I only see a new spam message get through maybe once every week or two. It's usually a message that somebody has hand-typed to be relevant to the page, but the comment is nearly useless and the author name is evidently spam.

I've also reworded the disclaimer text under the submit box to let people know that I'm actively on the lookout for spam and that even legitimate comments will get edited or marked as spam if they abuse the system. This lets people who like to leave signatures on a blog comment, or who like to use their company name as their author name, know that being underhanded will not be rewarded.

Despite my past frustration with spam, things are at a point now where I'm happy to leave comments open on recent posts for a couple weeks and then just close them up and never have to worry about them again. It certainly isn't the death of comments I thought it might need to come to.

Published February 05, 2008

Conversation

92 Comments

chrispian said on February 05, 2008

Very cool. I wonder how hard it would be to hook into SpamAssassin's Bayesian filter to add some additional checks. Still, I love the idea.

I wrote a text-based challenge/response. It asks questions like what's 2+2, etc. I wrote it for nucleuscms and added it to php form mail. I'm currently adapting it to WP as well. It uses a rotating key for the question id, so that every time a question is answered a new key is generated and no two keys are ever the same. I only have 4 questions and in a year of running it stopped 100% of the automated spam. I'd love to add some rules to catch the hand-written stuff. I'll certainly be looking at your idea here for that.

chrispian said on February 05, 2008

I almost tripped your filter, I think with "very cool", lol. Great, now I'm paranoid!

kyle said on February 05, 2008

so, if i were to start a comment with "interesting post, but i disagree that..." i would be permabanned from commenting again and my comments wouldn't even go to the database in the future? that hardly seems fair.

also, by posting this aren't you helping the spammers learn what type of comments they need to post in order to become trusted users?

J Richmond said on February 05, 2008

I hope the manual spammers don't see this article ;)

Great article... thanks!

Jonathan Snook said on February 05, 2008

@kyle: I tried to qualify the word filter by saying "there's a very specific pattern I'm matching." Suffice it to say, 99.9% of average users will not set it off. As to whether I make it easier for spammers, it's possible but very few of the rules would really make much of a difference. There's only a couple and they were added to handle edge cases (like URLs longer than 30 characters).

The keyword matching is probably the easiest to catch and the hardest for spammers to get around since they need to use keywords to gain google juice.

Aaron B said on February 05, 2008

Wait, but there's no mathematical thesis accompanying your anti-spam efforts! You mean you can just block comments that seem obnoxious and you've destroyed almost all spam? Way too easy.

James Finley said on February 05, 2008

I find this very interesting. I have been looking for a method to block comment spam for a while, and though my developer partner and I have had some success, it is relatively limited. Our new CMS, Ministry(Starter), will have an Akismet plug-in, but it would be nice to have an in house system too. Thank you for your relevant articles, Snook.

Jeff Croft said on February 05, 2008

I've got a pretty effective system I built (which includes Akismet, but also some local stuff), but this is freaking great. I think I'll definitely build a very similar scoring system into my Django-powered CMS. Thanks so much for the inspiration, Snookums! :)

Philip said on February 05, 2008

I knew there was reason for keeping a subscription to your feed! Awesome post, thanks :) I'll be looking to implement something similar soon, your write-up will be a great help when I do. Nice one.

Olivier D. said on February 05, 2008

A simple way is to add a field to your form like this one:
<input class="hfrm" type="text" name="body_2" value="">

Then add this class to your CSS:
.hfrm { display:none; }
(it will hide the field for normal users)

When you validate the data, you simply have to check that the field 'body_2' is empty. Most of the current robots will fill it.

This trick eliminated robot posts on my blog.

Nathan Smith said on February 05, 2008

"Interesting" post. :) I've always been curious about the inner workings of your hand-built CMS. I'd love to hear more about the Admin interface, how you manage dynamic vs. static pages, and/or anything else you'd like to divulge!

Jonathan Snook said on February 05, 2008

@Olivier: I don't like to do that for accessibility and aesthetic reasons. This page is still just as perfectly usable without CSS or JavaScript. Basically, I never put the onus on the user to solve the spam problem, I put it on myself as the operator of this site.

Jonathan Snook said on February 05, 2008

@Nathan: It's pretty hodge-podge, actually. Not as fancy as it could be or even should be. I've been meaning to set up a proper admin but just never got around to it.

Voyagerfan5761 said on February 06, 2008

I like this post. The algorithm you've put together here is quite intriguing, though I'm a bit worried about a couple of the filter rules.

For instance, my URL is more than 30 characters, simply because I'm using a subdomain at a free blog host. Does that necessarily mean I'm a spammer? No. I've been thinking about getting my own domain, but -- and this is where it gets interesting -- my ideal registered domain would be 7+3+1+14+1+3+1=30 characters exactly, including http:// and a trailing slash.

I suppose I should be thankful that you don't seem to be filtering out *.blogspot.com completely, as I've seen suggested elsewhere on the 'Net.

Regarding your filtering URLs containing .html, ?, or &, does that apply to the comment body or just the URL field? Some blog platforms and content sites (PC World for example) end their pages in .html, and/or use query strings to retrieve the correct page, so I'm just curious about that.

Voyagerfan5761 said on February 06, 2008

Yep, and my comment got sent into moderation, after all that. Oh, well...

Michael Siebert said on February 06, 2008

Great Article, this should be a library :)

But I disagree with your blocking of .de TLDs (maybe since I am German?), I never saw .de URLs in my blog's comments... But it is what it is: if you saw spammers like this, it's perfectly OK, and the rest of your rules seem like a great starting point.

Luke L said on February 06, 2008

I used to have terrible problems with comment spam (upwards of 400 messages a day on my homebrew CMS). Akismet didn't help: it flagged as many genuine messages as spam as it let actual spam through. I've been slowly tweaking my own system that uses local page elements, which I know you didn't want to use. Having said that, in the past 6 months I've had over 25,000 spam attempts and 12 have gotten through (http://hybridlogic.co.uk/journal/63/comment-spam-follow-up). The only spam that gets through now is the really short messages; however, I have a lot of visitors leaving short, often one-word, replies so I didn't want to add a character count limit.

I like your filter combinations though, I'm tempted to update mine to assign points to each check now. Thanks for the article.

Michael Siebert said on February 06, 2008

What? my comment's gone into moderation? Even though I know the rules? Slick...

Riddle said on February 06, 2008

If there's no vowels, sure you might be Polish but more likely that you're spam. Rarely do even the Polish have 5 consonants in a row!

Polish? Never heard of a word with 5 consonants in a row!

Paul Decowski said on February 06, 2008

Polish? Never heard of a word with 5 consonants in a row!

Wstrzyknąć - to inject.

Eugene Sutula said on February 06, 2008

You know it's a really great idea to use points system. But I think it can be even better if rules won't be so simple. For example: if comment contains 2 spam-keywords it will reduce not 2 points but already 3. The more rules are broken the bigger is a coefficient.
You know, it's like 2+2=5 (synergy), but in opposite way.
Maybe fuzzy logic can be good for that purpose too.
Well... thanks for great article. Now I know what will be my first thing to do after passing exams :)

Jacek Becela said on February 06, 2008

Cool. I'm Polish but I will try to pass thru. We use Akismet and so far it works as advertized. I know it doesn't work for some people. I'm curious how it is possible for a company which specializes in fighting spam to develop a product which is worse than one man's work.

On the other hand, there is no way to build 100% effective automated spam filter. Building an automated bot which bypasses any (even unknown for the bot) filter with say 10% efficiency is not a hard task. And 10% is ok for spammers these days (take a look at email spam which has circa 1% or less). The key is (and always was) to mimic a real person. There are generally two things you have to consider:

1. Using a browser engine (you have to be able to execute js in order to bypass filters based on execution of js), all the bad guys should learn WatiR / Selenium and similar testing tools.

2. Writing a message that is statistically legit.

There are not a lot of these advanced bots in the wild now and the only reason is that old-school one-file .php bot-scripts are still effective.

My conclusion is that, no matter what, we will spend some time in our life marking those viagra and levitra messages as spam by hand and mark messages like this humble as non spam. How many points did I get? -9? :)

Fredrik W said on February 06, 2008

I came to think. If a certain e-mail address has made lots of comments (like yourself, maybe you don't even run your own comments through the filter), wouldn't the hit on the db be fairly noticeable? (At least if you're using active record; if you manually search for it and use mysql_num_rows() I would assume the effect gets smaller.)

If you were to expand this idea you could set up a database with emails and a karma score where you add the total karma for a certain e-mail address. If you aren't in the database and make a spam comment you instantly get ip-banned.

minute44 said on February 06, 2008

Nice work Jon! I'm one of those "Crazy wordpress users" you talk about and at the moment I don't use any spam blocking but am looking around for the best option. I only get a couple of pieces a week but it is on the rise.

What's weird is, on my site, it seems like only one or two posts attract 99% of the spam. There's nothing special I can see about those posts... I've looked and they're not fundamentally different from all my others. It must be the subject matter.

Oh well. Hope this spam filter works out for you. I'm sure it will.

Jonathan Snook said on February 06, 2008

@Voyagerfan5761: Yeah, the lengthy blogspot domain did get you moderated. :) A lot of spam comes from blogspot, which is unfortunate but that's why I try not to outright ban. On the plus side, now you won't get moderated.

@Michael Siebert: most spammers coming from a .de domain were manual spammers who were targeting the site. I know 456bereastreet.com has a similar .de rule. Admittedly, I haven't noticed it to be much of a problem as of late so it might be something I reconsider.

Yaggs said on February 06, 2008

Looks like a good approach to me, although being German I find it sad that a .de domain lowers the score. Plus, I think -10 points for a body starting with the wrong words seems a little extreme. Anyway, I was looking for a good scoring system, might give this a try with some minor modifications.

Neil said on February 06, 2008

Never heard of a word with 5 consonants in a row!

Rhythm?

Carl Camera said on February 06, 2008

Jonathan, these seem to be good heuristics -- and they must be if they're working so well on your site. Phil Haack has written several articles about invisible captcha techniques including a honeypot captcha technique that hides an input field to visitors via CSS then catches bots when they fill it out.

rb said on February 06, 2008

interesting, actually the whole first thing about time constrained comment periods does in fact eliminate 95% of it I have found. I would also like to add that restricting the number of comments an IP can make to 1 per 5 minutes helps a lot too.

Mislav said on February 06, 2008

Great writeup, Jonathan. I really like the +1 per previously approved comment rule. However, don't you fear that being a potential backdoor for manual spammers? They could use the frequent commenter's email address.

Carl, Snook has already expressed dislike for that technique in his 2nd comment.

rb, I strongly oppose that rule. Often visitors make 2 comments in a row to add something they've forgotten or to correct themselves if they notice something after they posted.

Robin said on February 06, 2008

Hey, great article you wrote here! It was certainly worth reading and I will keep following your blog for more great posts like this one in the future. In the meantime, check out my site. (*grin*)

Robin said on February 06, 2008

Thank you for sharing this. Keep up the great work! :)

Robin said on February 06, 2008

Ok, now for real: my previous two comments went into moderation. This method is pretty effective, although I think you still have quite a lot of comments to moderate, especially when you post something that attracts a lot of new users.

Mohammad Arfeen said on February 06, 2008

ZOMG

Carl Camera said on February 06, 2008

@Mislav Oops. I didn't catch that. Honeypot captcha works without CSS as long as you place a "leave this blank" message next to the field. I don't see that as placing an onus on the user, but since brain cells are actually invoked to bypass the field I suppose that counts as an onus to some.

Jonathan Snook said on February 06, 2008

@Neil: I think the Y technically counts as a vowel. (you know, "sometimes Y") I don't include Y as a consonant in my checks.

@Mislav: An email address of a popular poster could be used as a backdoor attack but a spammer would need to know a popular email address being used specifically on this site. A spammer could post a couple valid messages and then use that but they're still limited based on a number of the other rules. And once I see a spam message and flag it as spam, they're done. Basically, it's a lot of work for little gain.

@Robin: Generally i don't have to moderate many messages. With this post, of course, some people are trying to check the filter and are getting moderated accordingly but for the most part, it's pretty reliable.

Neal G said on February 06, 2008

Snook, I have a few methods of my own which have been really successful (some which you aren't currently using) but I'm afraid to reveal my methods in fear that the spammers will "learn".

Over the past year I have successfully blocked over 16,000 attempts at comment spam on my website.

Daniel Marino said on February 06, 2008

Your method seems so simple, but it makes so much sense. As a TextPattern user, I've always relied on spam blacklists (such as Akismet or Spamhaus). For the most part they seem to work alright, but every once in a while I find a domain blocked (such as my workplace off and on) that seems like it shouldn't. I don't receive nearly as much traffic as you do so it's rarely a problem, but with a method like this, I'd no longer have to rely on other black lists to take care of the situation. Can someone say TXP Plugin...

Matthew Keefe said on February 06, 2008

Very insightful article Jon. I always have been using Akismet but may have to work on my own version for the pure enjoyment of learning and experimentation.

Hamish M said on February 06, 2008

DIY Spam-prevention. I like it.

I must admit, I've only used Akismet, but it's always worked well for me. Of course, my site is fairly low-traffic, so I imagine my conditions aren't as intense as some.

Well, I'm glad you've come to this resolution, especially if the alternative was no comments at all.

Joe said on February 06, 2008

I don't know if it's really possible to eliminate human generated spam (since they could just tweak a comment until it passes), but I've had similar success against bot generated spam, with a WordPress plugin I wrote that uses a similar set of rules.

At the moment, I'm trying out a slightly different methodology to blocking the spam, and that is to be slightly more aggressive in blocking the comment spam. Then if a comment gets flagged as spam, a "second chance" screen is presented with a captcha that allows the comment to be verified. So far, it seems to be working quite well, and it basically eliminates false positives (tho not necessarily human generated spam).

Brad Touesnard said on February 06, 2008

I'm a crazy Wordpress user/developer myself and although Akismet works great for my own blog, I can imagine with a higher volume of SPAM an annoying number of SPAM comments would still get through. However, you could always write a Wordpress plugin that hooks after Akismet and applies further filters to comments that pass Akismet. Or if you'd rather not reinvent the wheel, you could use an existing strike-counting plugin like Spaminator.

If you wanted to get really crazy, you could feed comments to Spamassassin and you could just tweak Spamassassin's settings to work against your SPAM. In fact, I wouldn't be surprised if Automattic leverages Spamassassin for Akismet.

Jos Hirth said on February 06, 2008

Without knowing the rules I would have triggered the -10 points "starts with" rule. Heh.

The honeypot thing Oliver D. mentioned works pretty well. In order to keep it accessible you can label it accordingly. You can also do it the other way around (i.e. one field which contains some random garbage to begin with and shouldn't be changed).

The ".de" rule surprised me a bit. With a German domain everyone can look up who you are and where you're living. I would just give em a call (it's almost 2 am right now haha). Additionally, I would file a complaint.

Another thing you can check is the existence of invalid markup. They will often try to use bb code and html at the same time.

Ben Hirsch said on February 06, 2008

Snook, thanks for the nice solution. Any chance you'd be willing to open-source this ranking system as a php class? I've been trying to get off Akismet now for awhile.

Jonathan Snook said on February 06, 2008

@Ben Hirsch: my class is tied into my own blog structure and CakePHP since it makes a few DB calls to determine things like previous email usage counts so I have no plans to open this up beyond what I have detailed here. Sorry.

Jermayn said on February 06, 2008

Interesting read! Oh hang on, I cannot use that word!!
Jokes aside it was good to read your rules and also that I was not the only special person who has spammers targeting specific posts :)

You going to release this spam filter or keep the goodness to yourself?

John said on February 06, 2008

What exactly are you giving the points to? Their IP, name, e-mail address they submitted?

Jonathan Snook said on February 07, 2008

@John: The points are tallied for the comment as a whole, and only to that single comment. Points aren't tallied across multiple comments.

Jason Beaird said on February 07, 2008

Pure, unadulterated, Snook Genius. :) I see you've put a lot more thought into blocking spam than I have. When the classic MT-Blacklist started to fail for me (and I got tired of adding variations of the word v1agra) I set up my silly "Type the following letter" captcha - if you can call it that. Somehow that simple trick has almost completely eliminated spam for me.

Jonathan E said on February 08, 2008

I know we've briefly traded emails about this before, but it's really nice to get a breakdown on the specifics of the scoring system you're using now - thanks for the insight!

I've personally been using a combination of the Akismet library, but I think I may have to use your system as some inspiration to create my own filter... thus negating the need to connect to the Akismet server.

Bryan Migliorisi said on February 08, 2008

I kinda like this idea, but it seems like it would take quite a bit of work to keep up to date on all the different ways people spam you - different domains, keywords, etc. I enable Akismet on all client blogs and websites because I know it works, and requires no work on my part of the clients. It is just easier for the clients and that means i have less to deal with.

However, your idea is still rather interesting to me and I wonder if such an algorithm could ever become public.

Eddie said on February 08, 2008

This is probably the most efficient looking system of blocking spam I have ever seen. On my site right now, I have gotten over 3000 spam messages left in my comments on one day.

In my current redesign of my site, I am going to incorporate your method for blocking spam bots. That is a brilliant idea to look at the patterns to, I never would have thought of something like this!!

Steve said on February 10, 2008

I thought I've seen all the ways of dealing with spam but this is definitely a new one to add to the list. On my blog I've been using reCAPTCHA, yeah it makes the user type the captchas but it also contributes to digitizing books so I don't feel as bad about using it.

I want to commend you on all the custom solutions you create, it's very easy to get caught up in all the plugins and frameworks out there.

Greg K Nicholson said on February 10, 2008

Never mind the Polish; the Welsh (bless 'em) use w and y as vowels, so what looks like a string of five vowels is actually perfectly sensible. ...so they say.

Also, the prevalence of "captchas" (a misnomer: if it were a real Turing test, there'd be no accessibility problems), amongst other things, makes me glad I'm not deafblind.

(Do I get bonus points for discussing both the Welsh language and Turing tests in one comment?)

Eric Meyer said on February 11, 2008

Kind of like a home-brewed SpamAssassin, eh? Love it. I might've done something like this in my WordPress install (yes, I'm one of the crazy ones) as a followup to WP-Gatekeeper but I've had phenomenal results with a combination of Akismet (which has blocked, as of the moment I post this, 678,928 bits of spam), a plugin that warns users when their comment is Akismetted, and the "hold posts from new e-mail addresses" setting built into WordPress. The convenience is worth the external call to the Akismet service for me. It's worked so well that I don't even use Gatekeeper on my own blog any more.

Andy Kant said on February 11, 2008

This is definitely some cool stuff that I might look at integrating into my own blog platform (it would be interesting to see a low level design since you have decided not to release it though ;-).

I've been looking at a custom solution for a while now, mainly because I think that my own email address might be blacklisted in Akismet. After posting on 37signals on a regular basis, my comments stopped showing up there (I'm guessing because I'm pro-Microsoft, I can't think of any other reason that they might have blocked me since my comments usually contribute to the discussion), and after that happened my comments stopped showing up on other blogs too (I think one of them was Jeff Croft's) leading me to believe that I got blacklisted for no apparent reason.

Julian Schrader said on February 11, 2008

This seems to be a very nice set of rules, maybe I'll incorporate this into the blog/CMS I'm going to build for my website. Thanks!

But I have to support Michael Siebert (#16) - I also do not know much spam coming from .de-TLDs ;-)

Robin Fisher said on February 12, 2008

I've just built my own custom CMS in Rails for my own site and have been thinking about preventing comment spam. I had seen an article on integrating Akismet with Rails but this has inspired me to develop my own filter.

Is there any rationale behind the points you've assigned to each rule or is it nothing beyond analysis of spam you've received previously? (Not to demean the time and effort that went into such analysis of course!)

Jonathan Snook said on February 12, 2008

@Robin: the points were tweaked based on spam analysis. It's nice when you get a large collection of data and can perform various queries on it to see what comes of it. The less than 20 characters and the URLs longer than 30 came out of that analysis. I could count how many comments would get flagged based on that criteria. If a particular pattern was almost likely to be spam, then I can give it more weight by increasing the points. If it's more of a grey area, it uses less points.

Jim Whitesell said on February 12, 2008

This is the most enlightening post I've read in quite a while. I'm still in the middle-ages since I don't keep an active blog, but I have several contact forms on sites that get hit by spammers.

Jon, between your original post and many of the responses I've come up with some really great ideas for my own uses. Thanks so much!!

Andreas said on February 12, 2008

Amazingly cool. Been thinking about how to do this best and this is just so... simple yet genius. Impressive. Would also love to see some of the code =)

EJ said on February 12, 2008

A very interesting read. I love the simplicity of the rule set you've laid out, although I'm a bit wary of the body match rule: I feel that might falsely trash good comments. I don't know if someone has mentioned this or not (don't have the time to read all 60 previous comments - you're becoming so popular snook!) but what would be cool is if your blog you've built from CakePHP could keep a record of a user's accumulated points, so a less strict rule set could be applied against them for previous good behavior. In any case you've got me thinking! Thanks.

Maikel González said on February 13, 2008

Thanks for your ideas.
Other points to consider:
- First, make sure the form was posted from a browser.
- Make sure the form was indeed sent via POST.
- Check the host name the form was submitted from against the authorized ones.
- Defend against header injection attempts such as "Content-Type:", "MIME-Version:", "Content-Transfer-Encoding:", "bcc:", "cc:".
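Two of the checks in this list (the POST check and the header injection check) could look something like the sketch below. It is a minimal Python illustration under stated assumptions: the field names in `SINGLE_LINE_FIELDS` and the function names are hypothetical, and the checks are applied only to single-line fields, since line breaks are normal in a comment body.

```python
# Header names commonly used in mail header injection attempts (from the list above).
HEADER_MARKERS = (
    "content-type:",
    "mime-version:",
    "content-transfer-encoding:",
    "bcc:",
    "cc:",
)

# Assumed form fields that should never contain a line break.
SINGLE_LINE_FIELDS = ("name", "email", "url")


def has_header_injection(value):
    """True if a single-line field contains a line break or a mail header marker."""
    lowered = value.lower()
    if "\r" in value or "\n" in value:
        return True
    return any(marker in lowered for marker in HEADER_MARKERS)


def passes_basic_checks(method, fields):
    """Reject anything that is not a POST or that carries injected headers."""
    if method.upper() != "POST":
        return False
    return not any(
        has_header_injection(fields.get(name, "")) for name in SINGLE_LINE_FIELDS
    )
```

The substring match is deliberately blunt, so a name or URL that happens to contain "cc:" would also be rejected. In a points system like the one in this post, such a hit could subtract points instead of rejecting the submission outright.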

Andy Kant said on February 13, 2008

@EJ
Jonathan has that already, see "Number of previous comments from email: approved comments = +1 per".

Alexander said on February 13, 2008

And here I was thinking of starting to use a new email address (since the old one is quite outdated) ;) Well, the algorithm for catching spam is really interesting. It's quite different from getting CAPTCHAs to work for you (like the 2+2 or the enter-the-characters-from-the-image kind), but I do wonder about the limits on link length and on the intro text. I find it quite unfair to entirely ban "Interesting, sorry, cool" because I feel that spammers would soon skip those words and just start with "I think..." Maybe -5 instead?

I like the idea of +1 per approved comment, because that's what the internet has become more about: earning trust, fame, etc. over time instead of getting everything instantly for a couple of days, after which your fame either continues, well earned, or sinks. Great system overall though!

gurde said on February 14, 2008

I had the same problem with spam comments. I have a good solution for automated spam... AJAX.

Erik Sagen said on February 14, 2008

Okay, I just need to make this longer than 20 words and I'll deflect Snook's spam patrol. Let me just put on this deflection suit (with +5 agility and defense).

The point-based system of moderating comments is great, especially the way in which you've broken it down; clear as salt water. As for spammers becoming smarter, sure they will, they do every day and I wonder if they'll ever create a script that'll add up the mathematical captchas to push the comment through? Those are easier to break (maybe) than your run-of-the-mill captcha.

I am still in love with reCaptcha.

Whatever the case you've built some great tools to circumvent the issue and every day someone keeps at it means sending these goldminers to the hills, not permanently but better than what we were dealing with just 2 years ago.

Fight the power!

David Racho said on February 15, 2008

Okay, so let me give this a try. I have opted not to use my blogger's address as that's a possible filter. This is longer than it should be.

But I really honestly like the idea and am considering stuff like this to put into my own site - which right now is nothing, really, it's nothing, I just got the domain and sat on it for a few years - I'm still sitting on it.

I'm a programmer at heart but I've been trying to learn php and cms since, maybe, 5 to 10 years ago. And I like "english-code" like this, or pseudo-code so a human being can understand it, and use whatever server side thing to actually implement it.

Math captchas are funny though. You should have Jeopardy style captchas or "Are you smarter than a 5th grader" questions. Sorry, you need a higher IQ to make a comment.

The intelligent spammers will figure out your database of questions, but it will take them awhile.

george said on February 22, 2008

Just a question Jonathan: Is it really worth it? Maybe I'm missing something, but why in the world would you need to go through all of this to keep some "spam" comments out? If it is manual comments, on recent blogs, why would you feel the need to make sure that "nobody slips through"? I'm not asking this to be a jerk. Obviously I don't understand something. There must be a good reason for people spending hours every day to keep spam out. What is that reason??

Chris Amini said on February 23, 2008

This is awesome, John.

I am going to implement a similar item into the custom CMS that I and a few co-workers are making. Obviously crediting you with the idea.

Keep up the awesome job!

thorsten said on February 23, 2008

You forgot some very typical words. If the words cheap, cheapest or buy are in the URL, the score must be -20. I delete them all without reading. But you are right about the German guys. It is very sad what's going on here.
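A rule along the lines thorsten suggests could be sketched like this. It is a Python illustration only: the URL pattern, the name `url_keyword_penalty`, and the idea of returning the penalty as a number to add to the comment's score are assumptions for the example.

```python
import re

# Hypothetical URL pattern; only the URLs are inspected, not the surrounding text.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

# Words from the suggestion above ("cheapest" is already covered by "cheap").
SPAM_URL_WORDS = ("cheap", "cheapest", "buy")


def url_keyword_penalty(body, penalty=-20):
    """Return `penalty` if any URL in the body contains a listed word, else 0."""
    for url in URL_RE.findall(body):
        lowered = url.lower()
        if any(word in lowered for word in SPAM_URL_WORDS):
            return penalty
    return 0
```

Because this is a plain substring match, a legitimate link such as a "buyers-guide" page would be penalised too, which is one reason to check the numbers against real data before settling on -20.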

James said on October 06, 2008

While I must admit closed comments are annoying, I totally understand why you do it now. I think the way you rate spam is quite logical, although, depending on the blog and its topics, a few of the rules would have to be altered (such as the number of links in a post).

I currently use Akismet for my WordPress blogs. It's okay, but sometimes a couple of legitimate posts get in there somehow.

Gary Adams said on January 13, 2009

You are god! My blog used to be flooded with spam like you wouldnt believe! Thanks for this very interesting article! Much appreciated! Gary

Slone said on January 16, 2009

Excuse me. What more felicity can fall to creature, than to enjoy delight with liberty.
I am from Canada and now study English, tell me whether I wrote the following sentence: "The independent service for resolving disputes between consumers and financial firms."

With respect :o, Slone.

Oakes said on January 16, 2009

Hey. If you really do put a small value upon yourself, rest assured that the world will not raise your price.
I am from Ukraine and learning to write in English, give true I wrote the following sentence: "Resume? Learn about effective design and layouts, what to include and what to leave off your resume."

With love :D, Oakes.

Mark said on January 18, 2009

Y'know, I actually had to build a sort of spam filter for my Machine Learning class at school. We had a big sample of spam messages and non-spam messages, with word frequencies (like, how many times the v-word appears -- I'm afraid to write it here). We then used an SVM (http://en.wikipedia.org/wiki/Support_vector_machine -- oh God, am I over the 30 char limit? :p) to split the messages into spam and non-spam. Using your idea for moderation, you can flag things close to the line as awaiting moderation, and things further from the line as clearly spam or non-spam. I'd also use other dimensions like the ones you mentioned -- author's name, and URLs and the characters within... you just need to collect a bunch of spam, and then you can train it off-line. You might have a little fun writing a classifier in PHP though. Would be a fun little project for me to try some day....

Miles Johnson said on February 06, 2009

I just built a CakePHP Behavior, that checks/processes and deletes the comment during an afterSave().

The only point system I didn't add is the "5 consonant" one. How did you do that, regex?
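For what it's worth, a regex is one straightforward way to do it. The sketch below is in Python and is only an assumption about how such a rule might be written, not the code used on this site; in particular, leaving "y" out of the consonant class is a choice made for the example.

```python
import re

# Five or more consecutive consonants; "y" is treated as a vowel here (an assumption).
CONSONANT_RUN_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)


def has_consonant_run(text):
    """True if the text contains a run of at least five consonants in a row."""
    return CONSONANT_RUN_RE.search(text) is not None


print(has_consonant_run("great post jkwtrpl"))  # random-looking token
print(has_consonant_run("hello dcdecbe site"))  # vowels break up the runs
print(has_consonant_run("strengths"))           # real word containing "ngths"
```

Ordinary English words such as "strengths" and "lengths" also contain five consonants in a row, so a rule like this is probably better as a small deduction than as an outright block.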

Vivek said on February 08, 2009

The points system is good, I think, but has anyone planned to offer API-based access to something like this (apart from Akismet)?

Manual approval of comments takes lots of time, and auto approval will kill your site.

SammyTerO said on February 16, 2009

:) Wow. nice arcticle

HomlallAmax said on February 18, 2009

Create pure drinking water with our Atmospheric Water Generators
http://tinyurl.com/aerl6u

Belajar Blogging said on February 18, 2009

Thanks for sharing this Snook.

Now I've learned that there's another way to fight spam than just installing the Akismet plugin.

Recently, I just don't think that Akismet works efficiently. Several spam comments with 5-6 paragraphs keep coming in to my blog every day and I have to delete them manually.

Will try to use your method.
Thanks.. :D

Taudiazepam said on February 07, 2011

cheap diazepam, ERIDAN, SEDUXEN, VALIUM, buy online

sohbet said on February 08, 2011

Bu sohbet sitesi tek kelimeyle Muhtes. Sohbet Etmek ve Arkadas, Olmak için Arad? Seçmenin Faydalar?

download sRs Trend Rider FREE said on February 14, 2011

This really solved my problem, thank you!

download Forex MegaDroid FREE said on February 15, 2011

We're a group of volunteers and starting a brand new initiative in a community. Your weblog supplied us valuable information to work on. You have done a marvellous work!

dirt bike parts said on February 23, 2011

I wanted to make public in to repulse you in quittance looking tailored this capacious look during the course of!! I some conditions ago enjoying every dialect trig soupçon of it I gomerel you bookmarked to juxtapose turn up broken of the closet far-off pronounced screw up you recording

Hello! dcdecbe interesting dcdecbe site! said on February 25, 2011

Hello! dcdecbe interesting dcdecbe site!

Very nice site! said on February 25, 2011

Very nice site!

Plavuse said on March 06, 2011

Ues, but not everthing black and white, something is gray :)

Miranda

Theceiciape said on March 15, 2011

Yes, that's probably how it is.

Janoszen said on March 27, 2011

To address the SpamAssassin question: SA is made for e-mail. A lot of checks are only valid for e-mail, so even if you'd use it, you'd have to handpick your ruleset. You'll end up with a handful of rules the most powerful being the Bayesian classifier, which can be very easily implemented in any web programming language as well. Drupal even has a module for it.
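To give a feel for how little code a Bayesian classifier needs, here is a minimal naive Bayes sketch in Python. It is not SpamAssassin's or Drupal's implementation: the tokenizer, the class name `NaiveBayes`, the "spam"/"ham" labels and the add-one smoothing are all choices made for this illustration, and it assumes at least one training comment exists for each label.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Tiny two-class naive Bayes classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one labelled comment; label must be "spam" or "ham"."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | text); needs training data for both labels."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert log scores to a probability without underflow.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


classifier = NaiveBayes()
classifier.train("cheap diazepam buy online", "spam")
classifier.train("buy cheap pills online", "spam")
classifier.train("interesting article about spam filtering", "ham")
classifier.train("I like the points system in this article", "ham")
print(classifier.spam_probability("buy cheap diazepam"))
```

The resulting probability could feed into a points system as one more rule, with values near 0.5 sent to moderation instead of being decided automatically.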

Anyway, the ruleset is very nice. An update service for such rules (like sa-update) would be a good addition.

Noor said on May 05, 2011

Where did you get the database of spam messages?
Did you collect them yourself?
I need a dataset of spam and non-spam comments. Any idea where I can get one?

Sorry, comments are closed for this post. If you have any further questions or comments, feel free to send them to me directly.
