Thursday 3 June 2010

SpamAssassin (spamd) deadlocks, runnning slow, Postgres

The usual way to use spamassassin is to implement the Bayesian filtering database in flat files. That's the default configuration and most distros ship with this option.

But that's a bad option. If you want to implement a database, use a proper database.

The advantages are numerous: Proper caching and cache management, ability to tune the database and tables, better locking capabilities (as in the locking capabilities exist), clustering and fail-over capabilities which all amount to better resource usage, better throughput, less server strain, better maintenance and a much better service.

There is no better enterprise class OpenSource database than Postgres (or it's twin, Ingres). Frankly, your decision to NOT use Postgres for your database needs has to be very well justified as PG is not only truly enterprise class, but it's also easy to set up, easy to admin and, most importantly. you can tune it properly.

Which is why you should be using Postgres with SpamAssassin.

Unfortunately, though, the latest version (and previous versions), 3.3.1 have some pretty bad SQL in them. Rather than utilise the PG strengths and keys, the SQL has not been optimised from a performance point of view. Which, in an email system, is one of the most important things to consider!

The biggest hole is the use of the SQL IN operator on the bayes_token table. This effectively forces a full table scan because the unique key is id, token. On a system-wide implementation, the ID column is a particularly weak key (i.e. not a key at all because it's always the same value) so this is a real deal-breaker.

The solution is to use the primary key wherever possible, which, it turns out, is nearly all the time.

On a system with a large spam database, this is the difference between a powerful server grinding to its knees v.s. the same server flying at vast throughput.

The biggest deal-breaker is in the update of the atime column, which is about the most regularly performed task. So it's the hottest of the hot spots in the spamd PG code and also the worst implemented. The fix, however, is very easy.

Simply edit this file (note the path will be different on your machine:
/usr/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/BayesStore/PgSQL.pm

and make these changes:


Original code fragment:
-------------
sub tok_touch_all {
.
.
.
  my $sql = "UPDATE bayes_token SET atime = ? WHERE id = ? AND token IN (";

  my @bindings;
  foreach my $token (sort @{$tokens}) {
    $sql .= "?,";
    push(@bindings, $token);
  }
  chop($sql); # get rid of trailing ,

  $sql .= ") AND atime < ?";

--------------



Amendments:
----------

sub tok_touch_all {
.
.
.

  foreach my $token (sort @{$tokens}) {
  my $sql = "UPDATE bayes_token SET atime = ? WHERE id = ? AND token =";

  my @bindings;
    $sql .= "?,";
    push(@bindings, $token);
  chop($sql); # get rid of trailing ,

  $sql .= " AND atime < ?";

.
.
.
  $self->{_dbh}->commit();
  }

  return 1;
}

----------------------------

Note that I insert the closing } before the "return 1".

I.e. I have converted this into a line by line update, so that the DB can use the very strong primary key of id,token.

The performance difference that this makes is absolutely enormous on a busy system.

In case you want to simply cut and paste the entire function, here is the tok_touch_all function with the amendments in it:

sub tok_touch_all {
  my ($self, $tokens, $atime) = @_;

  return 0 unless (defined($self->{_dbh}));

  return 1 unless (scalar(@{$tokens}));

  foreach my $token (sort @{$tokens}) {
  my $sql = "UPDATE bayes_token SET atime = ? WHERE id = ? AND token =";

  my @bindings;
    $sql .= "?,";
    push(@bindings, $token);
  chop($sql); # get rid of trailing ,

  $sql .= " AND atime < ?";

  $self->{_dbh}->begin_work();

  my $sth = $self->{_dbh}->prepare_cached($sql);

  unless (defined($sth)) {
    dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
    $self->{_dbh}->rollback();
    return 0;
  }

  my $bindcount = 1;

  $sth->bind_param($bindcount++, $atime);
  $sth->bind_param($bindcount++, $self->{_userid});

  foreach my $binding (@bindings) {
    $sth->bind_param($bindcount++, $binding, { pg_type => DBD::Pg::PG_BYTEA });
  }

  $sth->bind_param($bindcount, $atime);

  my $rc = $sth->execute();

  unless ($rc) {
    dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
    $self->{_dbh}->rollback();
    return 0;
  }

  my $rows = $sth->rows;

  unless (defined($rows)) {
    dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
    $self->{_dbh}->rollback();
    return 0;
  }

  # if we didn't update a row then no need to update newest_token_age
  if ($rows eq '0E0') {
    $self->{_dbh}->commit();
    return 1;
  }

  # need to check newest_token_age
  # no need to check oldest_token_age since we would only update if the
  # atime was newer than what is in the database
  $sql = "UPDATE bayes_vars
             SET newest_token_age = ?
           WHERE id = ?
             AND newest_token_age < ?";

  $rows = $self->{_dbh}->do($sql, undef, $atime, $self->{_userid}, $atime);

  unless (defined($rows)) {
    dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
    $self->{_dbh}->rollback();
    return 0;
  }

  $self->{_dbh}->commit();
  }

  return 1;
}