azurelunatic: Dreamwidth antispam: a dreamsheep holding a hammer, the better to smack spammers with. (spamhammer)
[personal profile] azurelunatic
It's been an interesting year in [site community profile] dw_antispam!

Every week, more or less, I pull the statistics for all items reported as spam sitewide. (Eventually that part of my job may be replaced by a very small script.) In some weeks I am able to review every report; in others I only have a chance to look over the top few. When possible, I note how many of the reports were actual spam and how many were other things that made their way into the spam reporting system. (For example, anonymous insults are certainly unpleasant and deserve deletion, but they are not the commercially motivated, high-volume sort of thing that the antispam system is designed for, and thus are not actionable by the antispam team. Some reports, while not spam as such, were forwarded to developers who were better able to address the specific problem, such as comments that "broke" the page for other readers.)

The numbers for each item here are, in order: valid reports, invalid reports, and total reports. (When exact numbers were unavailable and the old reports had been cleared, I skewed in the direction of counting unknown/uncertain items as valid; if entirely unknown, I left the invalid number as 0.)

During some weeks, for one reason or another, I was not able to pull the reports as usual; in the interests of not having the numbers wildly out of whack, I kept the numbers the same as the previous or next week. I have noted in my source data which weeks were the result of estimates, and made a note with each total.
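
For the curious, the gap-filling rule described above could be sketched roughly like this. It is only an illustration in Python, not the tooling actually used: the `WeekCount` record, the `fill_gaps` function, and the sample figures are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WeekCount:
    """One week's tally (hypothetical record, not the real source data)."""
    valid: int
    invalid: int
    estimated: bool = False  # True when copied from a neighbouring week

    @property
    def total(self) -> int:
        return self.valid + self.invalid


def fill_gaps(weeks: List[Optional[WeekCount]]) -> List[WeekCount]:
    """Replace missing weeks (None) with a copy of the previous week's
    counts, or of the next available week if there is no previous one.
    Copied weeks are flagged as estimates."""
    filled: List[WeekCount] = []
    for i, week in enumerate(weeks):
        if week is not None:
            filled.append(week)
            continue
        # Prefer the previous week; otherwise look ahead for real data.
        source = filled[-1] if filled else next(
            (w for w in weeks[i + 1:] if w is not None), None
        )
        if source is None:
            raise ValueError("no real data to estimate from")
        filled.append(WeekCount(source.valid, source.invalid, estimated=True))
    return filled


# Made-up sample: the second week's reports could not be pulled.
sample = [WeekCount(90, 4), None, WeekCount(80, 6)]
result = fill_gaps(sample)
for n, w in enumerate(result, start=1):
    flag = " (estimate)" if w.estimated else ""
    # Printed in the same order as the post: valid, invalid, total.
    print(f"week {n}: {w.valid}, {w.invalid}, {w.total}{flag}")
```

Keeping the `estimated` flag on each week makes it easy to note alongside any total how much of it rests on copied numbers.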

These numbers only take into account the spam that is deleted-and-reported, so the numbers for spam actually received across the service are assuredly higher, due to spam in abandoned journals, spam that is being deliberately saved, and spam that the journal owner either hasn't yet found the time/energy to delete or is unlikely to find the time/energy to remove at all.

Valid spam reports sitewide in 2011: ~4,800
Invalid (non-spam) reports in 2011: ~200
Total spam reports sitewide in 2011: ~5,000

Total registered user spammers in 2011: 16

Weekly average for the year:
Valid: 90
Invalid: 4
Total: 94
Maximum reported registered user spammers in any week: 4

In an average week, a single user accounts for 10-20 of the spam reports. This does mean that spammers single out some users to barrage more than others. A rise in your personal spam does not necessarily mean that spam is up for the whole site, just that you are the unlucky user who is getting a lot of it this week.

The vast majority of spam reports are of anonymous comments. The breakdown (weeks without data were excluded from this):

Anonymous comments: 3735
OpenID comments: 62
Registered user comments, entries, and private messages: 122, of which 71 were valid; that is, 58% of these reports were valid and 42% were not actual spam.
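
For anyone who wants to check the arithmetic, the percentages follow directly from the counts listed above. This is a minimal Python sketch; the variable names and the `pct` helper are mine, and the overall total simply adds the three listed categories (weeks without data excluded, as noted).

```python
# Counts copied from the breakdown above.
anonymous = 3735
openid = 62
registered = 122
registered_valid = 71

# Total across the three listed categories only.
total = anonymous + openid + registered


def pct(part: int, whole: int) -> int:
    """Percentage of `whole` represented by `part`, rounded to a whole number."""
    return round(100 * part / whole)


print(f"anonymous share of reports: {pct(anonymous, total)}%")
print(f"registered-user reports that were valid: {pct(registered_valid, registered)}%")
print(f"registered-user reports that were not spam: "
      f"{pct(registered - registered_valid, registered)}%")
```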

The vast majority of anonymous spammers are defeated by CAPTCHAs.
Most OpenID spammers originate from LiveJournal. Many of their spam comments are not left on Dreamwidth directly, but imported along with a journal.
A relatively significant proportion of the registered user spammers (most of whom are from open registration periods) were caught due to what I like to call "flagrantly notable" spamming -- spam directed at official areas of the site, where it comes directly to the attention of people who will issue the smackdown.

I've pulled the numbers from my weekly reports into a spreadsheet, for the curious, with some commentary:
azurelunatic: Warning: participating in #dw may result in blacking out and discovering yourself as head of a project team. (#dw warning: department head)
[personal profile] azurelunatic
Quarter 1:

Closed beta. Site owners handled all spam.

Quarter 2:

[staff profile] denise got tired of handling all the spam and appointed [personal profile] azurelunatic head of antispam. [personal profile] invisionary became co-head.

Open beta.

Internal tools and policy were worked out.

Quarter 3:

[personal profile] exor674 discovered a bug that was letting through spam that ought to have been blocked. She and Mark got this squared away very quickly, which lessened the load on the team and the entire site.

Dreamwidth picked up the Spamhaus drop list.

There were continued improvements to internal tools and policy.

Quarter 4:

There was a notable rise in erroneous reporting of non-spam but unwanted comments, followed by a fall back to previously established levels.

Weekly reporting started. There was a drop in spam over Christmas and the new year.

One of the very few advantages to seeing this much spam is being able to note the highlights.
  • There is a surprising amount of spam that does not actually include directly profitable content (links at which you may be persuaded to partake of their dodgy goods and/or services, or links to drive up their search engine credibility). The working theory is that these are test runs so they can see what's being left unguarded. The occasional quotes from a variety of dead philosophers are just a bonus.
  • Santa Claus and Viagra in the same sentence makes me run screaming.
  • Styles designed for non-LiveJournal-based blogs are not likely to work on Dreamwidth.
The spammers would also like you to know that there are many fine establishments on the internet where you can obtain
  • adult goods and/or services
  • pharmaceuticals, recreational and otherwise
  • genuine fakes
  • shoes
Perhaps you will choose one of theirs?

Policy and internal tools, as ever, improved in response to c'thia.