Unsolicited Commercial Email (UCE), or spam, continues to be a growing problem. Early on, I would deal with spam messages one by one - tracking down the source and the paths to hosted URLs, then working to get them cancelled.

Over the past couple of years, the volume of spam has made that impossible, so I began to look for automated ways to deal with it. Tools I've found valuable include SpamCop - their free trial is worth a look, if for nothing more than seeing the technical details involved in tracking down the real source of the spam you receive - and Mailwasher, a tool to automatically bounce mail.

Most recently, I came across an article describing Bayesian Spam Filtering.

The link to the article describing the idea is:

http://www.paulgraham.com/spam.html

The implementation I have been fiddling with for a bit, and which has been just outrageously accurate, is located at:

http://popfile.sourceforge.net/

I've done so much work with Eudora filtering that I figured all I really need POPfile to do is peel the spam off; Eudora can then handle my real mail with ease. When I first read Paul Graham's article, I began saving spam in hopes of training software such as POPfile once I found it. So, I fed POPfile the 2000 pieces of spam I had accumulated. I also fed it the contents of my various saved mailboxes as "good mail".

For now, I have spam filtered off into its own mailbox to be inspected before I delete it - but the temptation is getting stronger day by day to just let POPfile and Eudora automatically take out the trash.
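To make the training idea concrete, here is a minimal sketch of a two-bucket naive Bayes filter in Python. It is not POPfile's code and not the exact scoring from Paul Graham's article; the class name TinyBayes, the tokenizer, and the add-one smoothing are all assumptions chosen for illustration. It only shows the general shape: count words per bucket while training, then pick the bucket that makes a new message most probable.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters, digits, $, ' and -
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TinyBayes:
    """Hypothetical two-bucket naive Bayes filter, for illustration only."""

    def __init__(self):
        self.words = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, bucket, text):
        # Training is just counting words per bucket.
        self.words[bucket].update(tokenize(text))
        self.messages[bucket] += 1

    def classify(self, text):
        if min(self.messages.values()) == 0:
            raise ValueError("train both buckets before classifying")
        vocab = len(set(self.words["spam"]) | set(self.words["good"]))
        total_messages = sum(self.messages.values())
        tokens = tokenize(text)
        scores = {}
        for bucket, counts in self.words.items():
            total_words = sum(counts.values())
            # Start from the log prior: how common this bucket is overall.
            score = math.log(self.messages[bucket] / total_messages)
            for word in tokens:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((counts[word] + 1) / (total_words + vocab))
            scores[bucket] = score
        return max(scores, key=scores.get)
```

Reclassifying a misfiled message would amount to calling train again with the correct bucket, which is how a filter of this kind learns from its mistakes.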

If you don't have a couple thousand pieces of spam saved up to train your filtering software, you can borrow mine. Spam (was) packed 1000 per zip. All my email addresses have one word in common - and that word has been replaced in these archives with the word spambait.
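If you want to scrub your own spam collection the same way before sharing it, a few lines of Python will do. This is only a sketch of the idea, not the script used for these archives; the function name redact and the sample word are made up for illustration.

```python
import re


def redact(text, secret_word, replacement="spambait"):
    # Replace every occurrence of secret_word, ignoring case.
    # re.escape keeps characters such as '.' in the word from acting as wildcards.
    pattern = re.compile(re.escape(secret_word), re.IGNORECASE)
    # A function replacement avoids any backslash processing in the replacement text.
    return pattern.sub(lambda match: replacement, text)


# "example" stands in for the real word shared by the addresses.
print(redact("To: Example@example.org", "example"))
# prints: To: spambait@spambait.org
```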

Since I last reset the statistics that POPfile keeps, it has processed 10747 pieces of mail. I've had to reclassify 59 pieces (meaning it made 59 errors). By reclassifying the mail, it learns from its mistakes. That's an accuracy rating of 99.45%. It has never failed me on mail from real people. The mistakes it does make are things like spam about registering domain names slipping through, since I manage a couple of domains.
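The accuracy figure is just the share of messages that did not need reclassifying. A quick check in Python, using the two counts above (the function name accuracy_percent is my own, not something POPfile provides):

```python
def accuracy_percent(processed, reclassified):
    # Share of messages classified correctly, as a percentage.
    if processed <= 0:
        raise ValueError("processed must be positive")
    return 100.0 * (processed - reclassified) / processed


print(f"{accuracy_percent(10747, 59):.2f}%")
# prints: 99.45%
```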

The first and second 10,000-piece chunks of spam have been repackaged in new economy-sized zips. Having passed the 20,000 mark, it seems like the right time to have POPfile and Eudora start dumping spam directly into the trash bin. For now, then, the plan is for spam10k1.zip and spam10k2.zip to be the last collections of spam available here.

As I look at the counter page and review my logs, there are a few spots where having these collections available appears to have value. For example, it looks like a popular maker of anti-virus software downloaded this collection in the same time frame that they were filing a lawsuit against some spammers for software piracy.

On the other hand, a number of the downloads are from search engines. I cringe at the thought of dornbos.com rising in the search engine ranks for all the wrong terms as I upload another 1000 pieces of spam containing another hundred or so references to buying viagra online from the privacy of blah blah blah blah blah.

spam10k1.zip 12.6 meg 11.18.02 - 01.28.03

spam10k2.zip 11.1 meg 02.21.03 - 04.21.03

If you'd like more information on the battle against spam, you can start here at cauce.org.
