TFL status (original post was about the THR disruption early A.M. on July 26)


tyme
July 26, 2004, 07:16 AM
THR was inaccessible for the better part of an hour ending around 6 A.M. Eastern.

The upgrade to PHP 4.3.8 took longer than expected due to PHP developers being jerks and not applying obviously correct patches, and due to FreeBSD jerks who altered the mod_php install procedure.

Edit... the people who write Turck MMCache are jerks, too.

Sorry there was a disturbance, but everything's fine now. :)

If you enjoyed reading about "TFL status (original post was about the THR disruption early A.M. on July 26)" here in TheHighRoad.org archive, you'll LOVE our community. Come join TheHighRoad.org today for the full version!
Mal H
July 26, 2004, 10:55 AM
Didn't notice a thing, tyme. :) Thanks for keeping the night watch.

While we're here and talking about outages, do you know what is going on over at TFL? I haven't been able to get in for 2 days.

tyme
July 26, 2004, 04:46 PM
Yes, there should be a message visible, but here's the situation.

There have been nightly crashes (coinciding with the nightly backups) for quite a while. They seem to be preventable only by disabling the nightly backups :( . The meager evidence left in the kernel logs points strongly toward either a RAID or drive failure, or a kernel aacraid driver problem. Right before a freeze, there's typically a SCSI bus reset.

The SCSI RAID card is an Adaptec 2120S Crusader; the following kernels have all crashed under heavy load: 2.4.27-pre5, 2.6.5, 2.6.6, 2.6.6mm2, 2.6.7bk13. It's strange that crashes have continued with the last kernel, because Alan Cox claimed in the changelog to have fixed an aacraid freezing problem similar to what TFL has experienced.

Then, Sunday morning, there was another crash. The system didn't boot up, and from what we can tell from someone on-site, the RAID controller is complaining about a disk missing from the RAID volume. (That's sketchy... we don't have a verbatim error message.) Since the controller then won't boot the RAID volume, the system doesn't even get to LILO, much less boot a kernel.

I've had a shell open several times when the server has had one of these nightly "freezes." The ssh and zsh processes survive and are interactive, but anything requiring disk access results in an error (I don't recall exactly what, but it's something like an I/O error). The first time that happened I was trying to get something to run in that situation, and either the sshd process or the shell received an illegal instruction error/exception (and died, obviously).
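
For anyone who wants to look for the same pattern on their own box, here's a rough Python sketch of pulling lines that mention aacraid, a bus reset, or an I/O error out of a kernel log. The keyword list is a guess based on the description above, not the actual message text from TFL's logs (no verbatim messages are quoted in this thread), and the function name is made up for illustration.

```python
import re

# Hypothetical keywords: the real wording of the messages in TFL's kernel
# logs isn't quoted in this thread, so adjust these to what your logs show.
SUSPECT = re.compile(r"aacraid|bus reset|i/o error", re.IGNORECASE)

def find_suspect_lines(lines, pattern=SUSPECT):
    """Return (line_number, text) pairs for every line matching the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Pass it any iterable of lines, such as an open file handle on wherever your system keeps its kernel log. It only filters text; it won't tell you whether the controller, a drive, or the driver is at fault.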

Mal H
July 26, 2004, 05:25 PM
Thanks for the update. The outage message did show up a few minutes after I posted above, even though it appeared to be timestamped 0330 EDT.
