[MLB-WIRELESS] Details on the server outage.
conhoolio at hotmail.com
Mon Feb 27 08:49:34 EST 2006
Nice work Steve,
Looks like we had our server move forced on us sooner than we expected!
> -----Original Message-----
> From: melbwireless-bounces at wireless.org.au
> [mailto:melbwireless-bounces at wireless.org.au] On Behalf Of
> Steven Haigh
> Sent: Monday, 27 February 2006 12:04 AM
> To: melbwireless at wireless.org.au
> Subject: [MLB-WIRELESS] Details on the server outage.
> Ok. I said I'd write something up, so, here it is.
> On Friday around 3pm, I started to upgrade various packages
> on the server. The distro in use at the time was Fedora Core
> 2 - which had been out of the whole update scene for quite a
> while. This is something I wanted to correct.
> I started installing a few packages that would have little
> impact on operations, when everything stopped. As only my ssh
> session was responding (no web, no new ssh sessions etc) I
> told the box to reboot. At this point, the kernel crashed
> and the server refused to do anything.
> I called someone onsite to hard reset the server, and they
> watched the screen as it booted, but the kernel panicked
> on all reboots with an error in ext3.ko. This is where things
> get fun. It seems something (what exactly is still unknown)
> corrupted around 91MB of files in total on the
> filesystem. One of these was ext3.ko - which made the box
> unbootable. At this point, I also redelegated the multiple
> domains on the server to another primary nameserver to stop
> having DNS issues with the primary nameserver offline.
> I then pulled the server out and brought it home to work on -
> and I figured that as it wouldn't boot at all, I'd upgrade it
> to the latest Fedora Core 4 packages as I went. It then
> turned out that the journal was also corrupt on the ext3
> filesystems - and as it was the root drive, the system would
> not let me fsck it without major hassles - and 91MB worth of
> lost data.
> So, I booted off the FC 'panic' DVD and copied as much data as
> possible off the system, reformatted the whole thing and
> installed FC4. The rest went without a hitch. The fairly
> recent tape backup (done on the 19/2) restored without a
> hitch, and 95% of things were back to normal. This took
> between 7pm and around 4:30am Friday night/ Sat morning.
> I had to work at 9am, so I did my 9->5 shift and then came
> home to work more on the server. As most of the data was
> repaired, I spent the time punishing the server to see if I
> could make it crash again - with no luck.
> Today, the server went with me to work (for another 9am ->
> 5pm shift) where I tried harder to make it crash (no
> success!) and finally tonight at around 7:30pm the server was
> put back online in its new home in Collins St.
> Sorry to all for the outage. It wasn't something planned,
> however the backup system worked flawlessly to get the
> machine fully rebuilt and back online in a shade over 48
> hours with minimal data lost.
> I'm currently working on improving the backup to a nightly
> setup at a remote location, to minimise data loss to under a
> day - however this is currently in the planning/testing stage.
> Steven Haigh
> Email: netwiz at crc.id.au
> Web: http://www.crc.id.au
> Phone: (03) 9017 0597 - 0412 935 897