[MLB-WIRELESS] Details on the server outage.

Steven Haigh netwiz at crc.id.au
Mon Feb 27 00:03:31 EST 2006


Ok. I said I'd write something up, so, here it is.

On Friday around 3pm, I started to upgrade various packages on the  
server. The distro in use at the time was Fedora Core 2 - which had  
been out of the whole update scene for quite a while. This is  
something I wanted to correct.

I started installing a few packages that would have little impact on
operations, when everything stopped. As only my existing ssh session
was responding (no web, no new ssh sessions etc) I told the box to
reboot. At this point, the kernel crashed and the server refused to
do anything.

I called someone onsite to hard reset the server, and they watched
the screen as it booted, but the kernel panicked on every reboot
with an error in ext3.ko. This is where things get fun. It seems
something (still unknown at this point) corrupted around 91MB of
files in total on the filesystem. One of these was ext3.ko, which
made the box unbootable. At this point, I also redelegated the
multiple domains on the server to another primary nameserver, to
avoid DNS issues while the usual primary nameserver was offline.

I then pulled the server out and brought it home to work on - and I
figured that as it wouldn't boot at all, I'd upgrade it to the latest
Fedora Core 4 packages as I went. It then turned out that the journal
was also corrupt on the ext3 filesystems - and as it was the root
drive, the system would not let me fsck it without major hassles -
and around 91MB worth of lost data.

So, I booted off the FC 'panic' DVD and copied as much data as
possible off the system, reformatted the whole thing and installed
FC4. The rest went smoothly. The fairly recent tape backup
(done on the 19/2) restored without a hitch, and 95% of things were
back to normal. This took from 7pm until around 4:30am Friday
night/Sat morning.

I had to work at 9am, so I did my 9->5 shift and then came home to
work more on the server. As most of the data was repaired, I spent
my time punishing the server to see if I could make it crash again -
with no luck.

Today, the server went with me to work (for another 9am -> 5pm shift)
where I tried harder to make it crash (no success!) and finally
tonight at around 7:30pm the server was put back online in its new
home in Collins St.

Sorry to all for the outage. It wasn't something planned, but the
backup system worked flawlessly to get the machine fully rebuilt and
back online in a shade over 48 hours with minimal data lost.

I'm currently working on improving the backup to a nightly setup at a  
remote location, to minimise data loss to under a day - however this  
is currently in the planning/testing stage.
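
As a rough illustration only (this is not the actual setup, which is
still being planned), a nightly job could write a dated archive plus a
checksum, so a damaged backup is noticed before it is needed. The
function names, the archive naming scheme and the idea that `dest_dir`
is a directory mounted from the remote location are all assumptions
made for this sketch.

```python
import hashlib
import tarfile
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def nightly_backup(source: Path, dest_dir: Path, day: date) -> Path:
    """Archive `source` into `dest_dir` as a dated .tar.gz with a checksum file.

    `dest_dir` stands in for the remote location (hypothetical: assumed
    to be reachable as an ordinary directory, e.g. a network mount).
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"backup-{day.isoformat()}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    # Record the checksum next to the archive so it can be verified later.
    checksum_file = archive.parent / (archive.name + ".sha256")
    checksum_file.write_text(f"{sha256_of(archive)}  {archive.name}\n")
    return archive


def verify_backup(archive: Path) -> bool:
    """Return True if the archive still matches its recorded checksum."""
    checksum_file = archive.parent / (archive.name + ".sha256")
    recorded = checksum_file.read_text().split()[0]
    return recorded == sha256_of(archive)
```

Run once a night (from cron, say), one archive per day would keep the
worst-case data loss to under a day, and `verify_backup` gives a cheap
check that an archive has not been corrupted since it was written.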

--
Steven Haigh

Email: netwiz at crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9017 0597 - 0412 935 897
