[MLB-WIRELESS] Details on the server outage.

Dan Flett conhoolio at hotmail.com
Mon Feb 27 08:49:34 EST 2006

Nice work Steve,

Looks like we had our server move forced on us sooner than we expected!




> -----Original Message-----
> From: melbwireless-bounces at wireless.org.au 
> [mailto:melbwireless-bounces at wireless.org.au] On Behalf Of 
> Steven Haigh
> Sent: Monday, 27 February 2006 12:04 AM
> To: melbwireless at wireless.org.au
> Subject: [MLB-WIRELESS] Details on the server outage.
> Ok. I said I'd write something up, so, here it is.
> On Friday around 3pm, I started to upgrade various packages 
> on the server. The distro in use at the time was Fedora Core 
> 2 - which had been out of the whole update scene for quite a 
> while. This is something I wanted to correct.
> I started installing a few packages that would have little 
> impact on operations, when everything stopped. As only my ssh 
> session was responding (no web, no new ssh sessions etc) I 
> told the box to reboot. At this point, the server did a 
> kernel crash and refused to do anything.
> I called someone onsite to hard reset the server, and they 
> watched the screen as it booted, however the kernel panicked 
> on all reboots with an error in ext3.ko. This is where things 
> get fun. It seems something (still unknown as to what at this 
> point) corrupted around 91MB total of files on the 
> filesystem. One of these was ext3.ko - which made the box 
> unbootable. At this point, I also redelegated the multiple 
> domains on the server to another primary nameserver to stop 
> having DNS issues with the primary nameserver offline.
> I then pulled the server out and brought it home to work on - 
> and I figured that as it wouldn't boot at all, I'd upgrade it 
> to the latest Fedora Core 4 packages as I went. It then 
> turned out that the journal was also corrupt on the ext3 
> filesystems - and as it was the root drive, the system would 
> not let me fsck it without major hassles - and 91MB worth of 
> lost data.
> So, I booted off the FC 'panic' DVD and copied as much data as 
> possible off the system, reformatted the whole thing and 
> installed FC4. The rest went without a hitch. The fairly 
> recent tape backup (done on 19/2) restored cleanly, 
> and 95% of things were back to normal. This took 
> between 7pm and around 4:30am Friday night/ Sat morning.
> I had to work at 9am, so I did my 9->5 shift and then came 
> home to work more on the server. As most of the data was 
> repaired, I spent the time punishing the server to see if I 
> could make it crash again - with no luck.
> Today, the server went with me to work (for another 9am -> 
> 5pm shift) where I tried harder to make it crash (no 
> success!) and finally tonight at around 7:30pm the server was 
> put back online in its new home in Collins St.
> Sorry to all for the outage. It wasn't something planned, 
> however the backup system worked flawlessly to get the 
> machine fully rebuilt and back online in a shade over 48 
> hours with minimal data loss.
> I'm currently working on improving the backup to a nightly 
> setup at a remote location, to minimise data loss to under a 
> day - however this is currently in the planning/testing stage.
> --
> Steven Haigh
> Email: netwiz at crc.id.au
> Web: http://www.crc.id.au
> Phone: (03) 9017 0597 - 0412 935 897
> _______________________________________________
> Melbwireless mailing list
> Melbwireless at wireless.org.au
> http://wireless.org.au/mailman/listinfo/melbwireless
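
The report above mentions roughly 91MB of files being silently corrupted, one of which was ext3.ko. Purely as an illustration (this is not what Steven used, and every function name here is made up for the example), a checksum manifest taken while the system is known to be good gives you a quick way to list which files have since changed or gone missing. A minimal sketch in Python, assuming the manifest is stored somewhere off the box:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(manifest, root):
    """List manifest entries that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in sorted(manifest.items()):
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

Run build_manifest() right after a good backup, keep the result with the backup, and call changed_files() against the live tree (or a mounted copy of it) when something looks wrong. It only tells you which files differ, not why, and files that changed legitimately since the manifest was taken will show up too.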
