[MLB-WIRELESS] Details on the server outage.
Steven Haigh
netwiz at crc.id.au
Mon Feb 27 00:03:31 EST 2006
Ok. I said I'd write something up, so, here it is.
On Friday around 3pm, I started to upgrade various packages on the
server. The distro in use at the time was Fedora Core 2 - which had
been out of the whole update scene for quite a while. This is
something I wanted to correct.
I started installing a few packages that would have little impact on
operations, when everything stopped. As only my ssh session was
responding (no web, no new ssh sessions etc) I told the box to
reboot. At this point, the kernel crashed and the server refused to
do anything.
I called someone onsite to hard reset the server, and they watched
the screen as it booted. However, the kernel panicked on every reboot
with an error in ext3.ko. This is where things get fun. It seems
something (still unknown at this point) corrupted around 91MB of
files in total on the filesystem. One of these was ext3.ko -
which made the box unbootable. At this point, I also redelegated the
multiple domains on the server to another primary nameserver to stop
having DNS issues with the primary nameserver offline.
I then pulled the server out and brought it home to work on - and I
figured that as it wouldn't boot at all, I'd upgrade it to the latest
Fedora Core 4 packages as I went. It then turned out that the journal
was also corrupt on the ext3 filesystems - and as it was the root
drive, the system would not let me fsck it without major hassles -
and 91MB worth of lost data.
So, I booted off the FC 'panic' DVD and copied as much data as
possible off the system, reformatted the whole thing and installed
FC4. The rest went without a hitch. The fairly recent tape backup
(done on the 19/2) restored cleanly, and 95% of things were back to
normal. This took from 7pm Friday until around 4:30am Saturday
morning.
I had to work at 9am, so I did my 9->5 shift and then came home to
work more on the server. As most of the data was repaired, I spent
my time punishing the server to see if I could make it crash again -
with no luck.
Today, the server went with me to work (for another 9am -> 5pm shift)
where I tried harder to make it crash (no success!) and finally
tonight at around 7:30pm the server was put back online in its new
home in Collins St.
Sorry to all for the outage. It wasn't something planned, however the
backup system worked flawlessly to get the machine fully rebuilt and
back online in a shade over 48 hours with minimal data lost.
I'm currently working on improving the backup to a nightly setup at a
remote location, to minimise data loss to under a day - however this
is currently in the planning/testing stage.
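
For anyone curious what a nightly remote backup could look like, here
is a rough sketch only, not what is being tested at the moment. It
assumes rsync over ssh is available at both ends. The host name
backup.example.org and both directory paths are made up for
illustration.

```python
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir):
    """Build the rsync argument list that mirrors source_dir to a remote host.

    All three arguments are hypothetical placeholders chosen by the caller.
    """
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    dest = remote_host + ":" + remote_dir.rstrip("/") + "/"
    # -a        archive mode (permissions, times, symlinks, recursion)
    # -z        compress data in transit
    # --delete  remove files on the remote side that no longer exist locally
    # -e ssh    use ssh as the transport
    return ["rsync", "-az", "--delete", "-e", "ssh", source, dest]


def run_backup(source_dir, remote_host, remote_dir):
    """Run one backup pass and return True if rsync exited cleanly."""
    cmd = build_rsync_command(source_dir, remote_host, remote_dir)
    result = subprocess.run(cmd)
    return result.returncode == 0
```

Calling run_backup("/home", "backup.example.org", "/backups/server")
from a nightly cron job would keep the remote copy at most a day
behind. Note that --delete also mirrors deletions, so this alone does
not protect against files being removed or corrupted before the next
run; keeping the tape backups alongside it would still make sense.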
--
Steven Haigh
Email: netwiz at crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9017 0597 - 0412 935 897