THR Upgrade scheduled (ish)


Derek Zeanah
January 26, 2013, 12:21 PM
The recent performance problems were triggered by a spike in users following the recent tragedy, but the underlying cause was an unnoticed degradation in the write speed of our storage array. After a few phone calls and some testing, it was determined that the degradation came from a change in behavior introduced by a recent firmware update. We lost 80% of our write speed as a result, and it's not coming back.

After quite a bit of research I've chosen to replace that array with one from a much better-known vendor. For those who care, the new system is a 16-drive RAID-10 array of 15k RPM SAS drives, housed in a box with redundant controllers and power supplies. I won't call the system overkill, but I will say I don't anticipate using more than a fraction of its capacity in the near term, and it gives us plenty of room to grow.
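For anyone who likes rough numbers, here's the back-of-the-envelope math behind that "plenty of room to grow" claim. The per-drive figures below (600 GB and roughly 175 random IOPS for a 15k SAS drive) are my own assumptions for illustration, not the specs of the actual box:

  # Back-of-envelope RAID-10 sizing -- assumed per-drive numbers, not vendor specs.
  DRIVES = 16
  DRIVE_SIZE_GB = 600     # assumed capacity of one 15k SAS drive
  DRIVE_IOPS = 175        # assumed random IOPS of one 15k SAS drive

  usable_gb = (DRIVES // 2) * DRIVE_SIZE_GB    # mirrored pairs halve raw capacity
  read_iops = DRIVES * DRIVE_IOPS              # reads can be served from any spindle
  write_iops = (DRIVES * DRIVE_IOPS) // 2      # every write hits both halves of a mirror

  print("Usable space: %d GB" % usable_gb)
  print("Approx. random reads:  %d IOPS" % read_iops)
  print("Approx. random writes: %d IOPS" % write_iops)

Swap in the real drive size and a measured per-drive IOPS figure once the array is in the rack and the numbers get a lot more honest.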

So, in the near future I'll be performing the following tasks. I believe they can be performed without interrupting THR, but errors on my part (or unanticipated events) can certainly bring things down for an hour or two:
Install the new array and configure it properly.
Tweak both storage network switches to support the full functionality of the new array.
Configure network connections differently on the servers so that redundant connections work properly with the new array.
Rewire those same network connections with both redundancy and performance in mind.
Test the new array before any normal load is placed on it to understand its performance.
Test the old array once all normal loads are removed to understand where it was falling down. If I can find someone to buy it then they might care about this.
Possibly use this time to install a different load balancing solution so we don't get weird redirect errors or security certificate errors when a web server is flagged as down for some reason. A more resilient check for web server availability would also help; there's a rough sketch of what I mean after this list.
Pull excess equipment from the rack.
Possibly install another power run - I don't know how many amps we'll be pulling with the new stack of fast drives. This might require a shutdown and restart, but I hope I can avoid it. Redundant power supplies and all that.
Change the backup solution we're using. Right now backups run hourly and capture everything, but they only go to an off-site backup server. If I can get things working correctly we'll have everything backed up both locally at the datacenter and off-site, so recoveries will take minutes instead of hours, and we'll still be able to recover should a meteor strike the data center or something.
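On the web server availability point above: the idea is to only call a server "up" if it actually returns the page we expect, not merely because it answers on port 80. Here's a rough Python sketch of that kind of check — the URL and the string it looks for are placeholders I made up, and whatever load balancer we end up with will have its own built-in way of doing this:

  # Rough HTTP health-check sketch -- URL and expected text are placeholders.
  import urllib.request

  CHECK_URL = "http://web1.example.org/index.php"   # hypothetical server to probe
  EXPECTED = "The High Road"                        # text a healthy page should contain
  TIMEOUT = 5                                       # seconds before we call it down

  def server_is_healthy(url=CHECK_URL):
      """True only if the server answers 200 and the page looks like the forum."""
      try:
          with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
              if resp.getcode() != 200:
                  return False
              body = resp.read(65536).decode("utf-8", errors="replace")
              return EXPECTED in body
      except Exception:
          # Connection refused, timeout, bad certificate, garbage response: all count as down.
          return False

  if __name__ == "__main__":
      print("healthy" if server_is_healthy() else "down")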
I've got a pen and paper, so I'll probably add more to the list.

Timing is up in the air, though. I would have said Feb 4 due to scheduling conflicts, but it turns out my schedule opened up while I was typing this, so theoretically I can do this in 3 days or so.

The questions are: will I be prepared by then, do I have all the equipment I need, and will I have kicked the lingering remnants of the flu? (I got a flu shot, but we all know it wasn't as effective this year as was hoped.)

Anyway, now you know what I know, so when something happens I can claim I gave advance notice. ;)
