Combating Website Spam

Back toward the start of February, I wrote about the implementation and ramifications of the rel=”nofollow” attribute on user-submitted content. Shortly afterward, I was prompted to write the rel=”nofollow” follow-up entry as well. The idea back then was that website spam, in particular weblog spam, was rife and we (read: search engines) needed an answer. As a *cough*solution*cough*, some bright people thought that if they removed the reward (PageRank in Google, for instance), the spammers would stop.

I’m here today to tell you that it hasn’t stopped; in fact, I’d nearly say that it has increased. On the grander scale of things, I don’t get a lot of spam. A useful by-product of that is that it’s easy for me to gauge how much spam I’m getting. If I were receiving hundreds or thousands of spam messages daily, I’d be handling them through ‘mass editing’ methods – “select all, delete”. Since I’m not, though, I get to see, glance at and sometimes even read the spam!

Over the past months, I’ve tried various spam defense mechanisms on the site. Some people have gone to extremes to implement some of the things I mentioned (i.e. a ‘set’ of mechanisms that change each time you attempt to post) so as to require the human factor to make the post. These systems no doubt work very well; however, the problem I see with some of them is that they restrict users from commenting on your site. A friend of mine has a small blog hosted at Blogger; the problem is that her site requires you to be a member before you can post. This single thing alone has stopped me to date from leaving comments – I want to, but I just can’t be bothered signing up to leave a comment. I’d consider myself someone who will go to fairly long lengths to get between A and B, so if I won’t sign up for a dummy account on Blogger to post, that to me proves that some anti-spam techniques aren’t just stopping the spammers.

One of the anti-spam techniques that I find works very well is keyword detection. If a comment mentions a certain phrase, it is flagged as requiring moderation before it will go live. I think one of the primary reasons this method is so easy to implement and so effective is that the spammers utilise SEO techniques in their spamming. By that I mean that they, for instance, use the same word or phrase repeatedly to help build their keyword importance and visibility. A practical example might be having 100 inbound links to a particular page on your site, all with different link text, versus the same 100 links all with the same link text. Since we know that they use these SEO techniques, it plays into our favour for simple detection. I’ve got a list of fewer than 50 words that I use to capture spam and so far they are working out wonderfully.
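
For anyone who wants to try the keyword approach, here is a rough sketch in Python of what such a check might look like. It is not the code running on this site: the terms in FLAGGED_TERMS are placeholders I made up for illustration, and the function names (build_pattern, needs_moderation, triage) are hypothetical. It assumes a comment is just a text body plus an optional author URL.

```python
import re

# Hypothetical watch list for illustration only; a real list would be
# built up from the spam a particular site actually receives.
FLAGGED_TERMS = ["cheap pills", "online casino", "poker"]


def build_pattern(terms):
    """Compile one case-insensitive regex matching any term as a whole word or phrase."""
    escaped = (re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(FLAGGED_TERMS)


def needs_moderation(comment_text, author_url=""):
    """Return True if the comment body or the author URL contains a flagged term."""
    return bool(PATTERN.search(comment_text) or PATTERN.search(author_url))


def triage(comment_text, author_url=""):
    """Hold flagged comments for moderation; let everything else go live."""
    return "hold" if needs_moderation(comment_text, author_url) else "publish"
```

Calling triage("Play at our ONLINE CASINO today") would hold the comment, while an ordinary comment goes straight through. Matching on whole words keeps a short list from tripping over innocent words that merely contain a flagged term, and holding rather than deleting means a false positive only costs the commenter a short wait.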

This isn’t earth shattering news for most but it might remind some people that simplicity is a beautiful thing. I often think we get caught up in overly complex systems to get between A and Z when there is often a shorter simpler path available that we’ve overlooked or discounted previously.

6 thoughts on “Combating Website Spam”

  1. When you read around on some of the big sites, it is one of the most effective mechanisms. I think its flexibility, as opposed to a rigid “if it has a,b,c then it is spam” method, is what makes it work well. You do need to have a good vault of spam for it to work on initially though, so either keep the spam you get between now and when you implement it or go and grab a ‘sample block’ of spam from somewhere.

  2. Good idea. I volunteer you to be the official spam collector so that everyone else can get a good filter set up. Open your blog back up to spam, and in a few months we’ll come back and get started :P

    I liked your post though – in the last day or two I’ve been reminded at least four times that the simplest option can often be the best!

  3. I don’t think I’m up for being the collector at all really Paul! How about you write a comment system for your blogging package tonight and just open it up. I suspect that because your new site (beta.paulstovell.net) isn’t going to be well linked just yet, it might take quite a bit of time before you’ll get spammed regardless.

    I was talking to some folk on the #wordpress/irc.freenode.net channel the other day and we somehow got onto the topic of spam. Some of them say that if they turn off their filtering (whatever method(s) they use), they’ll pull in between 500 and 1000 spam comments per day. With that in mind, it is no surprise that good, accurate spam filtering is always going to be an issue.

  4. How do you filter spam?

    I mean, processed ham is quite chunky and I suspect it makes a great deal of mess.

    How does one dispose of said filtered spam?

    Another dilemma.
