Yesterday was our first major “bad day” for Canvas, and it couldn’t have come at a worse time, since many of you were starting Fall classes. We had some serious operational issues, and around a third of our customers experienced very slow page loads, poor response times, and timeouts for most of the day. Our internal alerts went off, and shortly after that, we emailed our Canvas Administrators and began posting regular updates to our support forum.
As of right now, everything is running normally. Last night we completed the first steps of the cluster reconfiguration and hardware upgrades, monitored the system carefully overnight, and this morning everything appears to be back to normal.
So what happened?
The short version is this: one of the databases in our cluster came under abnormal load, and that slowed down everything that depended on it.
The longer version is this: first, understand that our architecture is not a simple app-server-to-database configuration. It’s a clustered architecture with layers of app servers, caching, and connection pooling in front of a cluster of database servers. It’s built this way to manage scale and, more importantly, to contain the damage when scaling limits are hit. In this case it worked the way it should, protecting the majority of our customers from the slowdowns and timeouts – but a significant, unacceptable number were still affected. The near-term fix is in place and is keeping things in check, but the true, long-term fix, which adds parallelism to the database cluster, is being worked on as I write this and should be deployed before the end of the week.
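For those curious what that kind of layering looks like in practice, here’s a rough sketch of one common pattern: a router that sends writes to a primary database and spreads reads across replicas. The names and logic here are purely illustrative, not our actual code, but they show how a layer between the app servers and the database cluster can keep one overloaded node from taking down everything.

```python
from itertools import cycle

class ClusterRouter:
    """Illustrative sketch (not our production code): send writes to a
    primary database and spread reads round-robin across replicas, so
    load on any one node is contained."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = cycle(replicas)

    def route(self, query):
        # Naive heuristic for the sketch: anything that isn't a SELECT
        # is treated as a write and goes to the primary.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ClusterRouter("db-primary", ["db-replica-1", "db-replica-2"])
reads = [router.route("SELECT 1") for _ in range(4)]
write = router.route("UPDATE courses SET term = 'Fall'")
```

The catch, of course, is that a router like this only helps if the database cluster itself can actually run work in parallel – which is exactly what the long-term fix is about.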
What about the Cloud and Automated Provisioning?
The Cloud was the hero in this case – it allowed us to immediately provision bigger and better hardware to act as a stopgap while we proceeded with the cluster reconfiguration. Think about the way the world used to work, when you’d be on the phone trying to get a new beefier server shipped overnight to your data center and you’d pray that it would even arrive in 48 hours, let alone be set up and ready to take load. Not much consolation, I know, but just some perspective on the benefits of the Cloud.
Automated Provisioning allows us to scale our app servers to manage dynamic load during the peak hours and days of the week – but it does not extend to the database cluster layer, which is where we experienced our failure on Monday.
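The kind of rule behind that app-server scaling is simple to sketch. This is a hypothetical target-tracking policy, with made-up names and thresholds rather than our actual configuration: size the stateless app tier so average CPU lands near a target utilization.

```python
import math

def desired_app_servers(current, cpu_util, target=0.6, minimum=2, maximum=50):
    """Hypothetical autoscaling rule (illustrative numbers): pick the app
    server count that would bring average CPU back to the target.
    Note this only works for the stateless app layer -- a stateful
    database cluster can't simply be scaled out this way."""
    desired = math.ceil(current * cpu_util / target)
    return max(minimum, min(maximum, desired))
```

So if 10 app servers are running at 90% CPU against a 60% target, the rule asks for 15 servers; if they drop to 30%, it shrinks the tier to 5. That elasticity is what kept the app layer healthy on Monday even while the database layer struggled.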
Will this happen again?
No, probably not this specific problem – but Canvas will experience some less-than-perfect minutes, hours, and yes, even days sometime in the future. It’s not magic – it’s software. While we have an excellent cloud-native architecture, and we have incredibly brilliant and experienced operations and engineering teams, we will never achieve 100% uptime. We’re really good, better than most – but we’re not perfect.
Again, I apologize to those of you who experienced these awful slowdowns and timeouts. I know it’s not acceptable. We remain committed to giving you an awesome experience.
Onward to normalcy,