On Friday, Sept 26, 2014 at 6:32 PM GMT (11:32 AM PDT), a network failure made the Pipedrive app unavailable. Our infrastructure team was immediately notified. Because the network failure affected our entire physical infrastructure, we got on the phone with Rackspace, our hosting provider, who immediately started investigating in order to restore access to our services.
By 6:57 PM GMT (11:57 AM PDT), our central physical networking layer had been restored and we began cleaning up the effects of the network failure. However, the external storage volume mounts had gone into read-only mode, so our databases were not operating correctly.
Update #1: As of 7:21 PM GMT (12:21 PM PDT) we are actively working on regaining read/write access to the storage volumes, which will allow us to bring the app fully up again. We are still on a direct line with Rackspace engineers and are working to get the services fully restored.
Update #2: As of 7:36 PM GMT (12:36 PM PDT) we have regained network access to our environment and are currently restarting core services as fast as possible.
Update #3: As of 7:50 PM GMT (12:50 PM PDT) we have restored functionality across roughly 65% of our database clusters. The remaining DBs are restarting now. Side services are being started as well, but some of these, such as Google sync, may continue to be unreliable until they are all up and running again.
Update #4: As of 8:01 PM GMT (1:01 PM PDT) we have restored functionality across roughly 90% of our database clusters. The remaining DBs are restarting now. Side services continue to progress toward restoration, and our operations team will begin validating performance across the application soon.
Update #5: As of 8:16 PM GMT (1:16 PM PDT) we have restored functionality across all database clusters. Side services are generally available too, but some users may continue to experience issues with search and mailbox-beta, as the components behind these features are still being restored. Pipedrive operations is now starting performance validation and monitoring the situation closely.
Update #6: As of 9:24 PM GMT (2:24 PM PDT) search functionality has been restored for all users. Our Elasticsearch clusters are now syncing to their replica shards, and this will improve search performance throughout the day. Mailbox beta users should see delayed mail delivered within the next few minutes, and mail will return to real-time delivery shortly thereafter.
Fortunately, for those of you who need to use search today, we do have a proposed workaround. You can use Pipedrive Filters as a form of advanced search, a best practice we often recommend for people who need to search deeper than the search bar normally allows.
To take advantage of this, all you need to do is create a Filter once and then edit it each time you want to search for something different. An example screenshot is below:
- Using filters for advanced search
Below is our support center documentation on Filters as well, in case you want to brush up on searching across item records.
Update #7: As of 11:24 PM GMT (4:24 PM PDT) we’ve completed validation of the app and all services are fully operational. Search performance will continue to improve throughout the day as our Elasticsearch clusters sync back to their replica shards. I want to assure you that we take matters like these very seriously. Today’s events should simply never happen, and we will be conducting a full root cause analysis in concert with our datacenter provider to ensure precisely that.
We sincerely apologize to all our customers for today’s interruption and appreciate the patience you’ve shown us. For the technical crowd out there, initial reports indicate today’s outage was caused by a rare and unexpected spanning tree event during a planned network extension that is part of our ongoing infrastructure migration project.
We have temporarily suspended any further network extension efforts today and will cautiously proceed with the scheduled maintenance tomorrow morning. However, the recent events have prompted us to extend tomorrow’s maintenance window to run from 2:00 AM to 9:00 AM PDT.