Thursday, 16 September 2010

Converting PST files to Linux MBOX format

You want to upgrade from Outlook to Thunderbird and you've changed from Windows to Linux as a desktop. If you have control over a mail server running dovecot, you are in luck.

The tool that you need is readpst, which comes with the very handy libpst library. If you run Kubuntu or Ubuntu, it can be installed through Synaptic.

Simply export your Outlook mail into a pst file. Let's call the file emailBackup.pst. Now copy the file to your email server or Linux desktop.

It is important to create a directory to work in because, for each folder in the PST (Outlook) file, an mbox file will be created with the same name. Cd into this directory and keep emailBackup.pst in the directory above it.

Now simply run:
   readpst ../emailBackup.pst

The Outlook file will be converted into a series of mbox files. If you run Dovecot, these can be put into your mail directory (be careful not to overwrite existing files!) and you will be able to access them with Thunderbird or your webmail software (I recommend horde).

You are also able to open mbox files in Thunderbird.
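
The steps above can be collected into a small shell helper. This is only a sketch under my own assumptions: the function name convert_pst and the READPST override variable are made up for illustration, and it uses readpst's -o option to name the output directory rather than cd-ing into it first.

```shell
#!/bin/sh
# Sketch: convert a PST file into mbox files inside a dedicated directory.
# convert_pst and READPST are illustrative names, not part of libpst.
convert_pst() {
    pst_file=$1
    out_dir=$2

    # Refuse to run without a readable input file.
    if [ ! -r "$pst_file" ]; then
        echo "convert_pst: cannot read $pst_file" >&2
        return 1
    fi

    # A separate directory keeps the generated mbox files together.
    mkdir -p "$out_dir" || return 1

    # -o tells readpst where to write its output.
    "${READPST:-readpst}" -o "$out_dir" "$pst_file"
}
```

Something like convert_pst emailBackup.pst mbox_out would then leave one mbox file per Outlook folder in mbox_out.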

Thursday, 3 June 2010

SpamAssassin (spamd) deadlocks, running slow, Postgres

The usual way to use spamassassin is to implement the Bayesian filtering database in flat files. That's the default configuration and most distros ship with this option.

But that's a bad option. If you want to implement a database, use a proper database.

The advantages are numerous: Proper caching and cache management, ability to tune the database and tables, better locking capabilities (as in the locking capabilities exist), clustering and fail-over capabilities which all amount to better resource usage, better throughput, less server strain, better maintenance and a much better service.

There is no better enterprise class OpenSource database than Postgres (or its twin, Ingres). Frankly, your decision to NOT use Postgres for your database needs has to be very well justified as PG is not only truly enterprise class, but it's also easy to set up, easy to admin and, most importantly, you can tune it properly.

Which is why you should be using Postgres with SpamAssassin.

Unfortunately, though, the latest version, 3.3.1 (and previous versions), has some pretty bad SQL in it. Rather than utilising the PG strengths and keys, the SQL has not been optimised from a performance point of view. Which, in an email system, is one of the most important things to consider!

The biggest hole is the use of the SQL IN operator on the bayes_token table. This effectively forces a full table scan because the unique key is id, token. On a system-wide implementation, the ID column is a particularly weak key (i.e. not a key at all because it's always the same value) so this is a real deal-breaker.

The solution is to use the primary key wherever possible, which, it turns out, is nearly all the time.

On a system with a large spam database, this is the difference between a powerful server being brought to its knees and the same server flying at vast throughput.

The biggest deal-breaker is in the update of the atime column, which is about the most regularly performed task. So it's the hottest of the hot spots in the spamd PG code and also the worst implemented. The fix, however, is very easy.

Simply edit this file (note the path will be different on your machine):

and make these changes:

Original code fragment:
sub tok_touch_all {
  my $sql = "UPDATE bayes_token SET atime = ? WHERE id = ? AND token IN (";

  my @bindings;
  foreach my $token (sort @{$tokens}) {
    $sql .= "?,";
    push(@bindings, $token);
  }
  chop($sql); # get rid of trailing ,

  $sql .= ") AND atime < ?";

  # (rest of the function body)

Amended code fragment:
sub tok_touch_all {

  foreach my $token (sort @{$tokens}) {
    my $sql = "UPDATE bayes_token SET atime = ? WHERE id = ? AND token =";

    my @bindings;
    $sql .= "?,";
    push(@bindings, $token);
    chop($sql); # get rid of trailing ,

    $sql .= " AND atime < ?";

    # (rest of the function body, now inside the loop)

  }
  return 1;
}


Note that I insert the closing } before the "return 1".

I.e. I have converted this into a line by line update, so that the DB can use the very strong primary key of id,token.

The performance difference that this makes is absolutely enormous on a busy system.

In case you want to simply cut and paste the entire function, here is the tok_touch_all function with the amendments in it:

sub tok_touch_all {
  my ($self, $tokens, $atime) = @_;

  return 0 unless (defined($self->{_dbh}));

  return 1 unless (scalar(@{$tokens}));

  # one UPDATE per token so that the id,token key can be used
  foreach my $token (sort @{$tokens}) {
    my $sql = "UPDATE bayes_token SET atime = ? WHERE id = ? AND token =";

    my @bindings;
    $sql .= "?,";
    push(@bindings, $token);
    chop($sql); # get rid of trailing ,

    $sql .= " AND atime < ?";

    my $sth = $self->{_dbh}->prepare_cached($sql);

    unless (defined($sth)) {
      dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
      return 0;
    }

    my $bindcount = 1;

    $sth->bind_param($bindcount++, $atime);
    $sth->bind_param($bindcount++, $self->{_userid});

    foreach my $binding (@bindings) {
      $sth->bind_param($bindcount++, $binding, { pg_type => DBD::Pg::PG_BYTEA });
    }

    $sth->bind_param($bindcount, $atime);

    my $rc = $sth->execute();

    unless ($rc) {
      dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
      return 0;
    }

    my $rows = $sth->rows;

    unless (defined($rows)) {
      dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
      return 0;
    }

    # if we didn't update a row then no need to update newest_token_age;
    # move on to the next token rather than leaving the loop early
    if ($rows eq '0E0') {
      next;
    }

    # need to check newest_token_age
    # no need to check oldest_token_age since we would only update if the
    # atime was newer than what is in the database
    $sql = "UPDATE bayes_vars
               SET newest_token_age = ?
             WHERE id = ?
               AND newest_token_age < ?";

    $rows = $self->{_dbh}->do($sql, undef, $atime, $self->{_userid}, $atime);

    unless (defined($rows)) {
      dbg("bayes: tok_touch_all: SQL error: ".$self->{_dbh}->errstr());
      return 0;
    }

  }
  return 1;
}

Monday, 29 March 2010

No sound in flash on AMD 64 Kubuntu

The standard Ubuntu/Kubuntu installation breaks sound in Flash. There are two reasons for this. The first is that Kubuntu/Ubuntu uses PulseAudio, and Flash (along with many other applications) doesn't play nicely with PulseAudio. The second is that Adobe's 32 bit Flash cannot share audio devices and output sound in the shared environment. For that you need the 64 bit version.

So the first port of call in Kubuntu is to go to system settings/multimedia/device preference/audio output/Music and bump PulseAudio to the preferred (top) device.

Click on "test" to make sure that pulseaudio is actually working on your system. If you don't have a sound coming out, you need to first get pulseaudio working properly.

Next, you need to download the 64bit version of flash from Adobe and you can do this here:

(note the above is for version 10, you may wish to check that it's still the latest 64bit version).

Unzip the file. It gives you a file.

Now you need to overwrite this for firefox (and other applications that use flash).

For example in my system, in my user home directory, I have ~/.mozilla/plugins/
This is the directory that firefox/mozilla/chrome loads flash from. So I copy the file into this directory, overwriting the existing 32 bit version.

You may want to

sudo find / -name -print

to find out where this file is elsewhere on your system and update it there too.

Tuesday, 19 January 2010

KNetworkManager and WEP encryption

The KDE desktop environment is excellent. KDE is famous for its well integrated, powerful set of system utilities and applications. It's stable, fast and fully featured yet remains easy to use and highly customisable. In short, it rocks.

KDE is so widely used and supported that there is usually more than one utility to do any given task. However, there is usually an official tool and then alternatives.

For managing networks, especially wireless and bluetooth networks, the knetworkmanager utility is the official tool to use. However, although knetworkmanager is very promising and integrates nicely into the KDE bar and is well designed and thought out, at present it seems to not be able to handle WEP encryption very well. Many people have reported problems with knetworkmanager whereas other network managers work well with the same settings on the same wireless network and hardware.

We tested knetworkmanager with Kubuntu Heron and a 10 character ASCII WEP key. While WiFi Radar worked well, connected immediately and scanned the range of WiFi networks accurately, on the same laptop (Dell XPS M1730) knetworkmanager was simply unable to connect to the same WiFi connection that WiFi Radar connected to. We ensured that kwallet had the correct key in it and, to make extra sure, we also tested in the config file mode that knetworkmanager offers (storing the WEP passphrase in unencrypted text format).

In /var/log/syslog there were numerous lines with the following entries:

wlan0: AP denied authentication (auth_alg=1 code=15)

NetworkManager: Old device 'wlan0' activating, won't change.

wlan0: RX authentication from XX:XX:XX:XX:XX:XX (alg=1 transaction=4 status=15)

wlan0: unexpected authentication frame (alg=1 transaction=2)

wlan0: replying to auth challenge

wlan0: authentication with AP XX:XX:XX:XX:XX:XX timed out

It seems that for some reason, whilst other network management tools are able to configure the WEP passphrase correctly, KNetworkManager cannot. However, when we tested on unprotected WiFi networks, KNetworkManager worked a treat, reinforcing the notion that it only struggles with encryption.

We also discovered during testing that sometimes other WiFi tools such as WiFi Radar scanned and reported more WiFi networks in the same area with the same laptop at the same time than KNetworkManager did. To be fair, we tried several scanning iterations, starting up WiFi Radar and then KNetworkManager alternately to ensure that the laptop hardware could still see all the WiFi networks in the area. WiFi Radar was consistent in its reports, but KNetworkManager was not, sometimes reporting the same number of WiFi networks as WiFi Radar, other times not seeing several of the networks.

Browsing around the 'net, it seems that some people have KNetworkManager working and others do not. So at least part of KNetworkManager is functional. However, if you are roaming between networks and encounter a wide range of passphrases and WiFi configurations, KNetworkManager is, for now, practically unusable.

So for now, it seems that unfortunately the KDE default network management tool should not be used. Instead, we would recommend that you try other tools. We found WiFi Radar to be excellent. However, other tools are also available.

Spamassassin tips and tricks

Spamassassin is a powerful antispam tool. However, it consumes a lot of processing power, so a good idea is to install Amavis. This is a lightweight Perl script that pre-scans emails and rejects many of them based on rules that you set up within the Amavis configuration file.

NOTE: This page won't attempt to teach you how to install and configure Spamassassin or Amavis. Other tutorials exist online. This tutorial is here to give you tips that you may not find elsewhere.

Spamassassin uses Bayesian filters (think of this as a form of artificial intelligence) that can learn about what sort of emails are spam (bad) and what sort are ham (good). The key to this is a tool called sa-learn which you run against mailbox files that either contain only ham or only spam emails. This allows Spamassassin to learn which emails you think are spam. Spamassassin uses several files to store this information, kept in a hidden directory (.spamassassin) for each mail user.

To teach Spamassassin about spam, you pass the --spam parameter to sa-learn. For ham, the parameter is --ham.

In the examples below we will assume that Spamassassin is running under the user account spamd and that a mailbox file (in the mbox format common with IMAP servers) that contains only sample spam emails is called Junk and is in the /tmp directory.

Tip 1: Spamassassin with Amavis uses the .spamassassin directory in the Amavis working directory (usually /var/spool/amavis). Therefore when you are teaching a Spamassassin that is called by Amavis, you need to use the --dbpath parameter. E.g.:

sa-learn --dbpath /var/spool/amavis/.spamassassin --mbox --spam -u spamd /tmp/Junk

sa-learn will look at the emails and will teach Spamassassin that the emails are spam. However, Spamassassin needs to be told to reload its bayesian knowledge files in order to gain this new-found knowledge.

Tip 2: After running sa-learn, issue a kill -HUP to the spamd parent process to force a reload of the bayesian  knowledge base. E.g.:

kill -HUP `cat /var/run/`

On a very active system the spam flies in, quickly filling the Junk file. This can slow down the sa-learn processing dramatically, so a good idea is to clear it down. A common way in Linux to truncate a file is to issue a command such as:

> /tmp/Junk

However, for some IMAP servers, this can produce some nasty lockups in client email software when the mail user tries to add spam emails to the folder.

Tip 3: Clear down the Junk file(s) in an IMAP-friendly way. This means moving the file somewhere else for processing and recreating the user file rather than truncating it. Note that we mv and recreate first, before running sa-learn, so that the IMAP “folder” disappears for only a fraction of a second rather than for the whole of a potentially very long sa-learn run:

mv /home/username/mail/Junk /tmp/Junk

touch /home/username/mail/Junk

chown username /home/username/mail/Junk

chmod 700 /home/username/mail/Junk

sa-learn --dbpath /var/spool/amavis/.spamassassin --mbox --spam -u spamd /tmp/Junk
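
The move-and-recreate step in Tip 3 can be wrapped in a small function so that it is done the same way for every user. This is a sketch only: rotate_junk is a name I have made up, and the three arguments (live mailbox, destination for the moved copy, owning user) are my own choice of interface.

```shell
#!/bin/sh
# Sketch: rotate an mbox file in an IMAP-friendly way, as in Tip 3.
# rotate_junk is an illustrative name. Arguments: the live mailbox,
# the destination for the moved copy, and the owning user.
rotate_junk() {
    mailbox=$1
    moved=$2
    owner=$3

    # Nothing to do if the mailbox does not exist.
    [ -f "$mailbox" ] || return 1

    # Move first, then recreate at once, so that the folder is
    # missing for only a moment.
    mv "$mailbox" "$moved" || return 1
    touch "$mailbox" &&
        chown "$owner" "$mailbox" &&
        chmod 700 "$mailbox"
}
```

A call such as rotate_junk /home/username/mail/Junk /tmp/Junk username would be followed by the sa-learn run against /tmp/Junk.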

Spammers use automated tools to harvest email addresses. Publishing an email address online is a magnet for spam. This can be to your advantage if you want Spamassassin to learn about new spam messages before they arrive at your legitimate email addresses. The trick is to make spammers send spam to honeypot email addresses first:

Tip 4: Create honeypot email addresses that route all email received at those addresses into a spam email file. This can then be used to teach Spamassassin about new forms of spam before the spammers send to your  legitimate email addresses. Seed the spam email addresses on the Internet. Put them into web pages where email address harvesting software will find them but ensure that humans will not send legitimate email to them by putting up suitable messages around the email addresses.

Of course, you want Spamassassin to learn about spam automatically. This means that you will want sa-learn to run periodically. 

Tip 5: Create a cron job to run sa-learn periodically, letting it learn what is spam from the honeypot email  addresses as well as the Junk folders maintained by your email users. To do this, you need a suitable cron script. Below is a template for you to use. You will need to adjust the paths to the executables and files applicable on your system. In the example below, we have called the file where the emails from the honeypots are stored honeypot which we store in /var/spool/mail.

We have assumed that users move spam that they receive into (an IMAP) file on the server called Junk. In the example we show two techniques for processing this Junk user file. For username we truncate the file in an IMAP friendly manner by moving it and recreating the user file before sa-learn processes the moved file.
For usernameX we don't truncate the file. This means that the file will continue to grow in size until it's truncated by some other means. Sa-learn will ignore spam emails that it has already learned about so it is safe to not truncate a file provided that it doesn't grow to a point that sa-learn takes a long time to process it. If in doubt, truncate.

Also in the example below, we show how sa-learn can simply take a list of filenames on the command line which is handy if you have more than one file building up a store of spam emails:


#!/bin/sh

# Move the Junk file aside and recreate it in an IMAP-friendly way
/bin/mv /home/username/mail/Junk /tmp/Junk
/bin/touch /home/username/mail/Junk
/bin/chown username /home/username/mail/Junk
/bin/chmod 700 /home/username/mail/Junk

/usr/bin/sa-learn --dbpath /var/spool/amavis/.spamassassin --mbox --spam -u spamd /tmp/Junk /home/usernameX/mail/Junk /var/spool/mail/honeypot >/tmp/sa-learn.log 2>&1

# Truncate the honeypot file
> /var/spool/mail/honeypot

rm -f /tmp/Junk

# Tell spamd to reload its Bayesian data
/bin/kill -HUP `/bin/cat /var/run/`


xen "unpack list of wrong size" error

One of the fantastic features of Xen is that when you build a new Xen virtual machine (VM), you can specify a file on domain0 as a physical device for the VM (domU). Here is an example from a Xen machine configuration file (typically found in /etc/xen):
disk = ['file:/xen_files/215_main_disk.img,hda1,w','file:/xen_files/215_swap.img,hda2,w']

The above example shows a correctly configured definition for device hda1 (which is mounted from the file /xen_files/215_main_disk.img) and hda2 (which is mounted from the file /xen_files/215_swap.img).

However, a frequent newbie error is to forget the “file:” tag. If you entered this, for example:
disk = ['file:/xen_files/215_main_disk.img,hda1,w','/xen_files/215_swap.img,hda2,w']

then when you run xm create for that machine, it will output the error “unpack list of wrong size” which isn't very helpful in telling you that you forgot the “file:” tag!
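
A quick pre-flight check can catch this before xm create does. The sketch below is my own illustration and not part of Xen: the function name find_unprefixed_disks is made up, and it assumes that the disk = [...] definition sits on one line with single-quoted entries, as in the examples above. It prints any entry that does not start with a type: prefix such as file:.

```shell
#!/bin/sh
# Sketch: list disk entries in a Xen config file that lack a "type:" prefix.
# find_unprefixed_disks is an illustrative name, not a Xen tool.
find_unprefixed_disks() {
    # 1. keep only the disk = [...] line
    # 2. pull out each single-quoted entry, one per line
    # 3. strip the quotes
    # 4. drop entries that begin with a prefix such as file:
    grep -E '^[[:space:]]*disk[[:space:]]*=' "$1" \
        | grep -oE "'[^']*'" \
        | tr -d "'" \
        | grep -Ev '^[a-z]+:' || true
}
```

Run against the broken example above, this would print /xen_files/215_swap.img,hda2,w, which is the entry missing its file: tag.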

Please remember to link to this page if you found this useful so that others can find it too!


Disabling password expiry for specific accounts in msec

These notes are written specifically for Mandrake 10.1, however they can apply equally well to many other releases and distributions that use the msec security package.

The msec package is a powerful tool for establishing tight security controls on your Linux machine. It is highly customisable and comes with six pre-defined security settings that can be further customised to your requirements. However, there is a catch. The most useful setting is the higher level of security. With this level, though, comes a vicious password expiry regime that includes the root password. Worse still, there is a bug that sets password expiry to be immediate under certain conditions. This affects all user accounts in addition to the root account.

The result is that your computer can be locked out to all users needing a reboot into stand-alone mode (failsafe) in order to unlock it. Not exactly the best scenario especially if your machine happens to be a server in a remote location!

There is a solution to this problem though. The file /etc/security/msec/level.local allows you to fine tune the security settings in the msec package. You can add a call to no_password_aging_for() to /etc/security/msec/level.local to disable password expiry for an account. In fact, you can call this multiple times to add any number of accounts, so for example
no_password_aging_for('sales')
will disable password expiry for the sales login. However, there is another gotcha. The chances are that if you found this page you already have a problem with password expiry. Setting the above will not unset an expiry that is permanently expiring an account. For that you need to log into the machine and su - to root. Then you meet your new best friend, the chage command. This changes the password aging setting for an existing entry. So, to make sales never expire, you simply run:
chage -M 99999 'sales'

This sets sales to expire in 99999 days' time. And with the no_password_aging_for('sales') setting above, this will not be reset next time msec runs.

Of course, you need to take careful note of which accounts you turn off password expiry and ensure that these passwords are changed at regular intervals when it suits you, otherwise you may be compromising the security on your machine, especially if it is online.
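
To keep track of which accounts no longer expire, something like the following sketch may help. It is my own illustration, not part of msec: the function name list_non_expiring is made up, and it assumes the standard shadow file layout, in which field 2 is the password hash and field 5 is the maximum password age in days. Reading /etc/shadow itself needs root.

```shell
#!/bin/sh
# Sketch: list accounts with a usable password whose maximum password
# age is unset or 99999 days and above, i.e. effectively never expiring.
# list_non_expiring is an illustrative name, not an msec command.
list_non_expiring() {
    awk -F: '$2 !~ /^[*!]/ && ($5 == "" || $5 + 0 >= 99999) { print $1 }' \
        "${1:-/etc/shadow}"
}
```

The names it prints are the accounts whose passwords you have to remember to change by hand.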


Preserving postgres default values on tables that have views and update rules

The Postgres database has many strengths, one of the most powerful being the rules and triggers system. Combined with views, rules and triggers allow you to control access to data in underlying tables, restricting users to seeing only the data that they are allowed to see, and to enforce business logic. Even complex views with data that comes from many tables through complex joins can be made updatable (insert, update and delete) through using update rules. This can dramatically simplify and speed up application development and makes rapid application development (for example using Borland's JBuilder) practical.

To enable views to be used for updating the underlying datasets, you have to create update rules. This implies creating one or more rules for each update action: update, delete and insert.

The Postgres manual covers rule creation in detail.

Here is an example of creating a table, a view and an updatable rule, in this case for inserts:
create table test1 (id serial, col1 integer not null default 10, col2 text not null);
create view test1v as select * from test1;
create rule testins as on insert to test1v do instead (
insert into test1 (col1,col2) values (NEW.col1,NEW.col2);
);

However, the default value for col1 in test1 will not be preserved on insert into the view test1v. An insert that leaves col1 out passes NEW.col1 to the rule as null, which causes a not-null violation. Postgres does not propagate the defaults of the table beneath the view into the update rules on the view. To get them back you need to explicitly add these defaults to the view using alter table.

In this case we have the col1 default value constraint to apply:
alter table test1v alter column col1 set default 10;

We now have the default values added to the view. In this way you can build up very complex views, abstracting a good underlying database design, adding strong security and maintaining the sort of database interface that RAD tools such as JBuilder and Delphi excel at using.


Enabling ping, NFS and ssh in Mandrake at server-grade security levels

You have installed Mandrake Linux and have discovered that you cannot ping the machine or ssh onto it.

To enable pings, do this:
Add/Edit /etc/security/msec/level.local
add the line: accept_icmp_echo(yes)

Edit /etc/sysctl.conf

change the line:

and then run sysctl -p

To enable ssh, ensure that you have ssh installed (urpmi ssh). Mandrake does not automatically enable ssh at server-grade security levels. The key here is the /etc/hosts.allow file. Ensure that you have this line in /etc/hosts.allow:
sshd : ALL

There is a similar problem if you run NFS mounts on your machine. Your portmap is disabled by default at certain security levels. The key here is to enable NFS ONLY for those IPs that need access to that machine. Here is an example of enabling portmap for a subnet and also the server itself (LOCAL) within the /etc/hosts.allow file:
portmap : 111.222.333.444/, LOCAL


How do I search for keywords in OpenOffice word document files?

This is the tool that you need:

Compiling PHP versions >= 4.3.0 fails

You downloaded the latest PHP version with its built in GD library code and tried to compile it. All goes well until it fails with this not so meaningful message:

In file included from gdft.c:37:
/usr/local/include/freetype2/freetype/ftglyph.h:104: parse error
before `FT_Library'

You consider crying, but then you remember that the guys know their stuff and maybe they have the answer? Well, you are in luck. This error normally means that your /usr/local/include directory has an old freetype subdirectory as well as the new freetype2 directory that contains both freetype 1 and 2 include files in it. The solution is simple:
mv /usr/local/include/freetype /usr/local/include/old.freetype
and try again! It should work now! ;-)


NFS mounts that hang with status "D"


You got a great Linux office going or maybe a network of servers working together. They share their drives and data through NFS mounts.

Then disaster happens! A NFS server goes down and every machine locks up! The machines trying to mount the NFS partition hang. The processes trying to access the NFS disk (probably your root partition) freeze and won't die!

The reason this happens is simple: A disastrous decision taken by Sun who developed NFS in the first place, compounded with distributions that simply haven't figured it out yet, means that NFS partitions are mounted, as a default, with the hard option. This tells the computer that it cannot function without that NFS mount ... so it hangs, trying and retrying forever to get that mount up and running.

Great for diskless work stations.

Horribly stupid for the real world!

Fortunately the solution is as easy as drinking a cappuccino. Simply add the option soft into the /etc/fstab entry for the NFS mount. This tells NFS to try but not hang up. Adding bg into the options tells it to background the retries, meaning that reboots and mounts will keep the retries in the background, allowing the machine to continue processing as normal.

Here is an example NFS line with suitable options set:
/sms_team    nfs soft,bg,intr,timeo=10,retrans=2,retry=2,user,owner,exec,dev,suid,rw 0 0
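
To find out which of your existing NFS mounts would still hang, a small check along the following lines can be run over /etc/fstab. It is a sketch only: find_hard_nfs_mounts is a name I have made up, and it assumes the usual six-column fstab layout (device, mount point, type, options, dump, pass).

```shell
#!/bin/sh
# Sketch: print the mount points of NFS entries in an fstab-format file
# whose option list does not include soft.
# find_hard_nfs_mounts is an illustrative name.
find_hard_nfs_mounts() {
    # Skip comments, keep rows whose type starts with nfs, and wrap the
    # option list in commas so that "soft" is matched as a whole option.
    awk '$1 !~ /^#/ && $3 ~ /^nfs/ && ("," $4 ",") !~ /,soft,/ { print $2 }' \
        "${1:-/etc/fstab}"
}
```

Any mount point it prints is a candidate for the soft,bg treatment described above.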


Unicode, PostgreSQL, JDBC and truncated data

The problem

Unicode is a universal character set that is destined to replace ASCII as the universal character encoding standard (Java stores it internally in 16 bit units). Unlike ASCII, which was designed for (and is largely limited to) the American character set, Unicode caters for the entire world's character sets, even the awesome array of Chinese characters! If you can type it on a keyboard, you can store it in Unicode. At last! A truly global, properly usable character encoding that doesn't assume that the entire world is an additional state of the USA!

Er, right.

Problem number one is that ASCII is a de facto standard across most of the world's machines and software, and things get seriously tricky when you start trying to map local incarnations of character sets that have been shoe-horned into the basic ASCII encoding (limited to 7 or 8 bits) to the universal big brother of them all: Unicode.


PostgreSQL is the best database around. Yes, we are biased. Yes, there are other excellent databases (you may think Oracle or Ingres, but we think of MySQL, which very seriously rocks as well). PostgreSQL supports many different types of character sets (see the create database command for more details of its encoding option).


The problem is that when you connect to PostgreSQL via JDBC and you select text rows as a string from a table, if those rows contain 8 bit characters (for example a pound (£) sign for the UK), then you may find that the data for that column gets truncated just before that character. I.e., a plain string read of that column fails.

Trawling the news groups, it seems that this is caused by Java, which also uses Unicode internally, not being able to map the character coming back from Postgres into a valid Unicode character. Often, the problem lies with the original insert into the Postgres table. The character in question is sent from some client software (with its own character set) to Postgres (which stores it in the character set for that database). On insert, Postgres doesn't check that the value stored in the table is a legal map from the client character encoding set, it merely stores the raw byte value for that character (there is a mapping option that you can turn on when you build Postgres, see the Postgres Manual for more information on this).

It appears that the data is truncated due to a fault in the error handling of the JDBC software.

We don't really agree with this analysis. Although the common consensus is that this is a Postgres fault, we're not actually convinced about this. There are numerous reports of the same problem with different databases, including Oracle, MySQL and DB2. We think that the problem lies with the JDBC system and it not being able to determine what the character set is that the data is stored in. This may well come down to the database supplying more information to JDBC, but it may equally be that JDBC needs to examine the environment or use some other resolution mechanism.

The solution

The best solution is to force Java to read the column as a set of byte values and then explicitly tell Java which character set the table is stored in, so that it can do the translation with no problem! So, instead of reading the column directly as a string, using this line:
  new String(read_rs.getBytes(1),"ISO-8859-1")
does the trick. Of course, in the above example the table was in ISO-8859-1 format. This is the most likely format if Unicode translations are failing, but you do need to check which character set is used in your local software! See here for a list of possible character codes, what they are, and a very handy discussion of the codes and the characters in each coding.
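
The underlying principle, that bytes only become characters once you name the character set they were written in, can be seen from the shell as well. The sketch below illustrates the principle and is not the JDBC fix itself: it assumes that the iconv utility is installed, and latin1_to_utf8 is a name I have made up.

```shell
#!/bin/sh
# Sketch: decode ISO-8859-1 bytes into UTF-8 by naming the source
# character set explicitly. latin1_to_utf8 is an illustrative name.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Byte 0xA3 (octal 243) is the pound sign in ISO-8859-1, so on a UTF-8
# terminal this would show a pound sign followed by 10:
#   printf '\24310\n' | latin1_to_utf8
```

Just as with the getBytes call above, nothing is guessed: the source character set is stated, so the pound sign survives the trip.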


Cron job to check for RAID disk failure

Linux's software RAID handling is fantastic, but how do you know if one of the disks has failed? RAID is designed to not have a single point of failure, which means that if one of your disks goes West, you won't know about it. In a busy server environment you probably don't have the time to keep checking your Linux kit. Linux has a habit of running reliably for years and years until the hardware fails or you need to upgrade the system.

Well, fear not! We have this neat little script that does an elementary check for disk failure and then emails you if it detects a failure. You should install it on your server as root, and chmod 500 so that it is executable by cron. Of course, you also need to use crontab -e to make cron run it at a sensible frequency.

Here is the script for you to cut and paste into a suitable file:


#!/bin/sh

# Placeholder values: set these to suit your own system
MAILTO=root
LOG_FILE=/tmp/raid_check.$$

SYSTEM=`uname --nodename`

echo "The $SYSTEM system has RAID failures on it." >>$LOG_FILE
echo "Below is the output from /proc/mdstat" >> $LOG_FILE
echo "===========================================" >> $LOG_FILE

# Look for a failed disk, marked (F), on any RAID device line
cat /proc/mdstat | egrep 'md.*raid' | fgrep -i '(f)' >> $LOG_FILE

# fgrep exits with 0 only if it found a failed disk
if [ $? -eq 0 ]
then
cat /proc/mdstat >> $LOG_FILE

echo "===========================================" >> $LOG_FILE
mail -s 'URGENT: RAID disk failure detected' $MAILTO < $LOG_FILE
fi

rm -f $LOG_FILE
exit 0
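
The script above looks only for the (F) marker. An array that has had a disk kicked out shows no (F) at all, just an underscore in its status map such as [_U]. A variation that catches both cases might look like the sketch below. The function name mdstat_problems is my own invention, and the pattern assumes the /proc/mdstat layout shown in the examples on this page.

```shell
#!/bin/sh
# Sketch: print the lines of an mdstat-format file that show a failed
# disk, marked (F), or a missing member, shown as _ in a map like [_U].
# mdstat_problems is an illustrative name.
# Exit status is 0 if a problem line was found and 1 if not.
mdstat_problems() {
    grep -E '\(F\)|\[[U_]*_[U_]*\]' "${1:-/proc/mdstat}"
}
```

In the script, a call to mdstat_problems could take the place of the egrep and fgrep pipeline, with the rest left as it is.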


Recovering a RAID disk back into a RAID device

Okay, so you have been clever! You figured that with Linux you can build a RAID using nice cheap IDE disks. Linux's fantastic software RAID feature allows you to do this, saving loads of money on hardware RAID and expensive SCSI disks. Maybe you did the easy thing and used a distribution like Mandrake Linux that makes it oh so easy to set up.

Then disaster happened! Maybe you did a forced reboot, maybe something else happened, but when the reboot had finished you did

dmesg | less

and you saw something like this in the log:
hdf7's event counter: 00000006
hde5's event counter: 00000003
md: superblock update time inconsistency -- using the most recent one
freshest: hdf7

md: kicking non-fresh hde5 from array!

Oh boy! Quick as a flash you look into the status of the array:
cat /proc/mdstat

and it looks bad:
# cat /proc/mdstat
Personalities : [raid0] [raid1]
read_ahead 1024 sectors

md2 : active raid1 hdf7[1]
39262720 blocks [2/1] [_U]
md1 : active raid0 hde2[0] hdf6[1]
497792 blocks 64k chunks
md0 : active raid1 hde1[0] hdf5[1]
505920 blocks [2/2] [UU]

Now, in the above, /dev/md2 is the root partition on your machine (of course this is only an example and it may NOT be this device but some other /dev/md* device). It should be a RAID level 1 (mirrored) but there is now only one disk in that array!

What to do?

Well, you need to reinstate the kicked out disk (in this case, /dev/hde5). There is a useful command to do this:
raidhotadd /dev/md2 /dev/hde5
(NOTE: you need to substitute your own correct devices. The above is an example only)

That will rebuild the dirty mirror disk from the main mirror disk. It will bring the RAID back to a fully flying 2-disk mirrored setup provided, of course, that the disk doesn't have a fault making it fail. While the rebuild is happening, you can monitor the rebuild by:
cat /proc/mdstat

It may be that your disk fails to join the array and after raidhotadd completes, you see something like this:

# cat /proc/mdstat
Personalities : [raid0] [raid1]
read_ahead 1024 sectors
md2 : active raid1 hde5[0](F) hdf7[1]
39262720 blocks [2/1] [_U]

Note the (F) which means that the disk failed. Now hard drives are extremely reliable and it is unlikely that your disk is toasted (although you can always assume this to be safe). There is a great Linux command, badblocks, that will scan your disk and report the bad blocks on it. You can then safely add it back into the array. Please note though:

Only run this on unmounted disks.

It takes a LONG time to run.

Simply run:
badblocks -f /dev/hd*
where /dev/hd* is the device name for your drive. In the example above this would be /dev/hde5. After the badblocks has run, try to raidhotadd the disk back into the array again.

You have to admit it: Linux is HOT!