AAR: Recent Performance Problems


Derek Zeanah
January 12, 2013, 10:37 AM
The Problem: We've been really slow of late, with page loads taking up to 40 seconds at times for non-logged-in users (i.e., those whose responses should be the fastest). We've been trying to remedy this.

The Trigger: Our load, as measured by people online at any time, jumped from ~1,900 to over 5,200 in less than two weeks. As a result, the database server was under greater load than it could handle.

Efforts to Solve the Problem: Our primary effort focused on the database server itself, on the assumption that tuning and/or changes to the structure of the database would remove the bottlenecks we were seeing. Those changes did make things better, just not enough.

It took bringing in a database consultant to find the problem. A simple test showed that writes to the disk were horrendously slow. This prompted a whole new area of investigation.
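
For readers curious what a test like this can look like, here is a minimal sketch in Python. It is not the consultant's actual test, which isn't described here. The function name, data size, and block size are all assumptions chosen for illustration. It writes a fixed amount of data to a temporary file on the storage being tested, forces it to disk, and reports the throughput.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_mb=64, block_kb=1024):
    """Write about total_mb of data to a temp file in `directory`, fsync it,
    and return the observed write throughput in MB/s.

    Hypothetical helper for illustration; sizes are arbitrary defaults.
    """
    # Random data, so storage that compresses or dedupes can't cheat.
    block = os.urandom(block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure the data really reached the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the scratch file

    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed
```

Pointing `measure_write_throughput` at a directory on the networked storage and then at a directory on a local disk gives two numbers to compare. A large gap between them is the kind of result that would prompt a closer look at the storage layer.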

The Real Cause: Database writes happen on a networked storage appliance that was chosen for both redundancy and speed. Unfortunately, a firmware update a few months back dropped the write speed we were experiencing by somewhere around 80%. THR was in the slow season at the time, so we didn't see any appreciable symptoms early enough to address this before the recent unexpected flood of users.

To resolve the issue, I reconfigured things and changed how writes happen. The transition yesterday took a few hours, but now we're seeing write speeds on the order of three times faster than they were one week ago. This appears to be fast enough to keep up with load, and I am hopeful that we have the problem fixed now.

Why do I think it's fixed? Well, here's a chart of the load times THR was seeing that captures the transition to the new setup:

http://www.thehighroad.org/attachment.php?attachmentid=177543&stc=1&d=1358000382

Pretty drastic.

Here's a representative graph from post-migration:

http://www.thehighroad.org/attachment.php?attachmentid=177544&stc=1&d=1358001241

You won't see those kinds of speeds yourself, as I'm measuring inside the network and the measurement doesn't take into account the time for packets to travel over the Internet, pass the firewall, get assigned by the load balancer, etc. But from where I sit, the problem seems resolved.
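
As a rough illustration of this kind of measurement, the sketch below times a few page fetches and summarizes them. It is not the tool that produced the graphs above. The names `fetch` and `time_page_loads` are hypothetical, and any URL you pass in is up to you.

```python
import statistics
import time
import urllib.request


def fetch(url, timeout=60):
    """Download the full response body for `url` and return it as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def time_page_loads(url, samples=5, fetch_fn=fetch):
    """Fetch `url` `samples` times and return min/median/max seconds.

    `fetch_fn` can be swapped out, e.g. for a stub when testing offline.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch_fn(url)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Run from a machine inside the network, something like `time_page_loads("http://10.0.0.5/")` (a placeholder address) measures only the web and database servers. The same call from a home connection also includes the Internet transit, firewall, and load balancer time, which is why the two sets of numbers differ.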

Other Issues Still Affecting Some Users:

Errors connecting to the site: We use a load balancer to distribute load against a few web servers. When all web servers are down, there's something about the load balancer implementation that causes some browsers to refuse to connect to THR. This can be fixed by closing and reopening the browser, or by clearing the cache created since the load balancer issue, or by connecting to a page/forum within THR.

The load balancer went down a couple of days ago for no good reason. This was the cause of the above: for some reason the load balancer thought the web server pool was completely unavailable, and stopped allowing connections. The problem happened even though the pool was available, and was kind of weird. My best guess as to the cause is that our CDN changed primary interfaces on us, the old interface wasn't showing account usage correctly (updates weren't happening), and the bandwidth we'd purchased expired the day of the weirdness. I can't say for sure that this was the cause, but if so we should be good for another year. I'd simply replace the load balancer with another system or with a single beefy web server, but since SHOT is coming up I'm unwilling to make any unnecessary changes. Changes break things. We'll save these kinds of changes for when I can be at the data center with hours to spare, just in case.

Web Servers Now Cache More: The way the web servers interact with the database server has changed. The "who's online" listing on the main page, for instance, now only updates every 4 minutes rather than instantly. We also turned off some features that are known to have a negative impact on performance while diagnosing this whole problem, and we'll be slowly turning them back on over time.
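
The general idea behind the "who's online" change can be sketched as a small time-based cache. This is only an illustration of the technique, not how the forum software actually implements it. The `TTLCache` class and the loader function it wraps are hypothetical names.

```python
import time


class TTLCache:
    """Hold one computed value and recompute it at most once per `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # expensive call, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can fake the time
        self._value = None
        self._expires_at = None      # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cached copy is missing or stale: hit the backend once and
            # remember the result for the next ttl_seconds.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With `ttl_seconds=240`, every page view inside a 4-minute window reuses the same cached listing, so the database sees one query per window instead of one per page load. The tradeoff is that the listing can be up to 4 minutes out of date.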

That's all I've got off the top of my head. I'll be searching for a redundant network storage solution once SHOT is over, and then transitioning to it. I'm hopeful I can do this without bringing THR down as part of the process, but no guarantees.

Now, everyone cross your fingers, and hope my assessment is correct. These issues are simply no fun, especially with a new gun control push coming and SHOT around the corner. We need to be talking, not fighting stupid technical problems.

As always, thanks for your patience while this problem was slowly being worked through. I wish I'd seen the problem sooner, but this was the best I was able to do.

If you enjoyed reading about "AAR: Recent Performance Problems" here in TheHighRoad.org archive, you'll LOVE our community. Come join TheHighRoad.org today for the full version!