I want to address last Friday’s downtime, which left the majority of our Americas-based customers unable to use our application in the middle of a workday.
First off, we’re very sorry for disrupting your workday. I want to assure you that we take this very seriously and find it absolutely unacceptable. As a team we feel bad for letting you down.
Secondly, we’ve learned some important lessons and have already made some changes to improve issue detection, technical processes and communication. While this doesn’t reduce the frustration you must have felt last Friday, we’ll be both quicker and more efficient in responding to any issues in the future.
What happened on Friday
For background: for some months now we’ve been working on significantly improving the infrastructure underlying our application. This project is ongoing and is meant to dramatically improve stability and performance of our app – along with giving us enough headroom to support our rapid growth.
Some configuration elements (namely, switch ports) in our new infrastructure setup failed on Friday, and a rare technical event (a spanning tree failure) took down our network.
Our operations team was aware immediately, as we monitor the performance of Pipedrive 24×7 through many automated tests and alarms. We urgently coordinated a response with our network hosting providers. While the response was immediate, parts of our application take time to recover from a complete shutdown, so some customers faced unreliable functionality for up to two hours.
Steps taken to avoid such situations in the future
Together with our hosting providers we have taken extra precautions in migrating to our new infrastructure setup. In the last couple of days we’ve also discussed and agreed upon several new internal workflows that will help to identify and fix any issues faster.
As I mentioned, there were already quite a few things on our roadmap that will reduce our exposure to issues like this, for example reducing the size of databases and using multiple hosting locations. We’ll continue executing on these plans.
I hope this explains the reasons behind the downtime on Friday. I hope it also sheds light on our commitment to avoid outages and keep improving the speed and reliability of our app. Last but not least, I hope you’ll accept our apologies.
I expect to be writing more upbeat blog posts in the future.
PS. If you have questions, or if you’d like to know more about our infrastructure improvements, please contact us via our support email.