This is Part 1 in a two-part series on ridiculously fast MongoDB replica set recovery. In Part 1, we will demonstrate how recovering a 5.6 GB MongoDB replica set from a persistent external volume (e.g. Amazon EBS) speeds up cluster recovery by 3446% (1 minute and 9 seconds vs 40 minutes and 47 seconds) compared to using MongoDB’s built-in replication feature.
In Part 2, we will show you step-by-step how to automate Mongo replica recovery using Kubernetes and Flocker so you can take advantage of this awesome improvement in cluster recovery yourself.
Introduction to MongoDB Replica Sets
Replication of a database creates multiple copies of the data. With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server. This built-in replication is one of the reasons that MongoDB is so popular with developers.
Replication can be used with or without sharding. Sharding is the act of splitting a single database across multiple servers using a shard key. If you use replica sets with sharding, each shard of your database is itself replicated. For example, if you have a three-shard MongoDB database and set up a three-node replica set on each shard, you will have a total of nine database instances (3 replicas for each of the 3 shards).
In a nutshell, here is how MongoDB replicas work:
- A replica set consists of a primary replica and multiple secondaries.
- When a replica fails, a new primary is elected from the remaining secondaries.
- When you lose too many secondaries to be able to elect a new primary, recovery will require manual intervention which will result in downtime.
- In all instances, when a new secondary is added to the cluster, it will have to do a full re-sync from your primary to get your data onto the new node.
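For context, here is a hedged sketch of how a three-member replica set is first initiated from the mongo shell. The set name "rs0" and the hostnames are illustrative placeholders, not from this test:

```javascript
// Run once in the mongo shell against the member that should become primary.
// Set name and hostnames below are placeholders.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-1.example.internal:27017" },
    { _id: 1, host: "mongo-2.example.internal:27017" },
    { _id: 2, host: "mongo-3.example.internal:27017" }
  ]
})
```

Once initiated, the set elects a primary on its own and the failover behavior described above takes over.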
Here is a diagram from the MongoDB documentation that illustrates some important replication concepts.
Recovering from replica set failure

Here is a typical workflow when a node goes down.
- Engineer receives a page and needs to get to a terminal.
- Engineer diagnoses the issue.
- Engineer stands up a fresh new replica node or manually attempts to recover old replica node.
- CYFAP (cross your fingers and pray), as you wait for the new replica member to sync the entire dataset from another member of your replica set.
Bringing up a fresh new replica member requires a full initial resync of the entire existing dataset, even if you’ve built automation to handle this task. This is because when you bring up a new secondary, the primary must replicate its data to the new host.
The time it takes to replicate this data becomes longer as your dataset grows. Production datasets range from several hundred megabytes to several terabytes. Sharding can help make these datasets smaller since each replica has less of your total data, but even with sharding, the replication time can be significant.
Moreover, during this time when you are replicating your data to the new secondary, you can experience the following negative consequences:
- Strain is put on the internal network of your cluster and other MongoDB replica members as a portion of their resources are consumed with the replication process
- Depending on your level of fault tolerance, if you lose another secondary during the recovery process, you might have to intervene manually.
Ain’t no body got time for that! -me
This blog post examines two methods for recovering from a MongoDB replica failure and compares how long it takes to recover in each case.
Method 1 - Native MongoDB replication
In this method, when either a primary or secondary replica is lost, it is brought back into the cluster by a replication process controlled by MongoDB. If, for instance, one of two secondary replicas is lost and a new secondary replica is returned to the cluster, MongoDB will replicate all the data from the primary to that new secondary.
Method 2 - Persistent volume recovery
In this method, when either a primary or secondary replica is lost, it is brought back into the cluster from a persistent volume stored on network-attached shared storage like Amazon EBS. This method takes advantage of the fact that even though a server running the Mongo replica might die, the volume of that server persists, and can be used to jump start the recovery process. Rather than having to replicate the entire dataset in order to bring a new replica into the cluster, MongoDB only has to sync the most recent changes to oplog since the replica died, usually a significantly smaller amount of data.
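You can check how much catch-up headroom the oplog gives you from the mongo shell. If the data on the reattached volume is newer than the oldest oplog entry on the surviving members, the returning replica can catch up from the oplog rather than doing a full initial sync:

```javascript
// In the mongo shell on any replica set member:
rs.printReplicationInfo()
// Prints the configured oplog size and the time window it covers
// ("oplog first event time" / "oplog last event time").
```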
Here is the methodology that we used to arrive at the conclusion that the persistent volume recovery method significantly outperforms native MongoDB replication for replica recovery:
Test Cluster Architecture
- 5 host machines consisting of AWS EC2 m3.medium or m3.2xlarge instances
- 1 Host machine to operate Kubernetes master and Flocker Control Service
- Flocker v1.10 installed on all three instances
- Kubernetes 1.1 to schedule our MongoDB replica member containers, along with a small Node.js sidecar application that configures new members into the cluster.
- MongoDB version 3.2.3 from official MongoDB image on DockerHub
- 1.4GB dataset, by Bryan Nehl, loaded manually using the mongorestore tool. Download this dump file.
Method 1 implementation - Native MongoDB replication
A three-node MongoDB replica set consisting of one primary and two secondaries was deployed by Kubernetes using an empty directory (EmptyDir) on the host machine. Using an EmptyDir means that nodes are forced to re-sync from another member upon being added to the cluster.
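A minimal sketch of what such a pod spec might look like. The pod name, paths, and replica set name are illustrative assumptions, not the exact configuration used in the test:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mongo-replica-1
spec:
  containers:
  - name: mongo
    image: mongo:3.2.3
    command: ["mongod", "--replSet", "rs0"]
    volumeMounts:
    - name: mongo-data
      mountPath: /data/db
  volumes:
  - name: mongo-data
    emptyDir: {}   # node-local scratch space: lost when the pod leaves the node,
                   # so a rescheduled replica must do a full re-sync
```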
Instigating node failure and recovery
We removed the underlying node that was hosting a member in the replica set. This was done via the following command:
kubectl delete no ip-10-0-0-XX.ec2.internal
Cluster recovery time was measured by polling rs.status() using the script below.
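The original polling script is not reproduced here; a minimal sketch of such a loop, run in the mongo shell on a surviving member, might look like this (the one-second interval is an assumption):

```javascript
// Sketch of a recovery-time polling loop for the mongo shell.
while (true) {
  var status = rs.status();
  status.members.forEach(function (m) {
    // stateStr moves through STARTUP2 -> RECOVERING -> SECONDARY
    // as the new member catches up.
    print(new Date().toISOString() + " " + m.name + " " + m.stateStr);
  });
  sleep(1000); // mongo shell built-in: pause for one second
}
```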
This polling allows us to precisely measure:
- Time when the member left the replica set
- Status state changes of the newly added member going from “STARTUP2” -> “RECOVERING” -> “SECONDARY”
The test is complete when the script displays the replica back up with a status of “SECONDARY”.
Time to recovery is simply calculated as [T2 (new replica online)] - [T1 (original replica destroyed)]
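For example, using the 5.6 GB m3.medium numbers quoted in the introduction (40:47 for native replication vs 1:09 for persistent volume recovery), the improvement works out as follows:

```javascript
// Recovery times from the 5.6 GB m3.medium test quoted in the introduction.
const nativeSecs = 40 * 60 + 47; // 40:47 with native MongoDB replication
const volumeSecs = 1 * 60 + 9;   // 1:09 with persistent volume recovery
const improvementPct = Math.round((nativeSecs - volumeSecs) / volumeSecs * 100);
console.log(improvementPct + "%"); // → 3446%
```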
Test cases

In order to understand how server performance and data volume size impact time-to-recovery, the test was run multiple times using:
- AWS EC2 m3.medium and AWS EC2 m3.2xlarge instances.
- Dataset sizes of 1.4 GB, 5.6 GB, and 7.75 GB.
Method 2 implementation - Persistent volume recovery
Setup

The setup is identical to Method 1 except that instead of having Kubernetes deploy the containers with an EmptyDir, a Flocker volume is used. This Flocker-managed volume is created on the Amazon EBS block storage backend. When the secondary is removed from the replica set during the test and Kubernetes reschedules the MongoDB container to a new host, instead of having to do a full re-sync in order to return to functional status as a secondary, Flocker can reattach the data volume previously used by the replica to the new MongoDB container.
All Flocker Volumes in this test were instantiated prior to running the test as Kubernetes does not create volumes on pod instantiation.
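With Flocker, the pod's volume definition references a pre-created dataset by name rather than node-local storage. Kubernetes 1.1 ships a flocker volume plugin with a datasetName field; the dataset name below is an illustrative assumption:

```yaml
volumes:
- name: mongo-data
  flocker:
    datasetName: mongo-replica-1   # Flocker dataset created ahead of time,
                                   # backed by an EBS volume that survives node loss
```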
Instigating node failure and recovery, measurement and test cases were identical for Method 2 as described in Method 1 above.
Prior to the deletion of a member’s node, this script was running in another member’s mongo shell.
Based on 51 tests run using the methodology above, time to recovery of a MongoDB replica was dramatically faster using the persistent volume method than using MongoDB native replication.
Here is a table summarizing these test results.
| Instance size | Method | 1.4 GB | 5.6 GB | 7.7 GB |
| --- | --- | --- | --- | --- |
*Because recovering the 5.6 GB dataset took ~40 minutes on m3.medium instances, the 7.7 GB test was not run on m3.mediums.
Not only was time-to-recovery faster in every instance using the persistent volume method, it was also highly consistent, varying less than +/- 10 seconds regardless of instance size or dataset size. While we didn’t test it, we believe that this consistent, fast performance would continue even with datasets in the hundreds of gigabytes or even the terabyte range, because the only variable impacting time-to-recovery is the speed of the AWS EBS API for detaching and attaching block devices.
The native replication method, on the other hand, is highly sensitive to dataset size and instance size. On an m3.medium instance, going from 1.4 GB to 5.6 GB increased recovery time by 718%!
While using larger instance sizes narrowed the gap between time-to-recover for both methods, the persistent volume method was always faster.
The closest the two methods ever came was when the persistent volume method was used on a 1.4 GB dataset running on an m3.medium (avg time to recovery 1:10) compared to the native method, also on a 1.4 GB dataset, running on an m3.2xlarge (1:16). Arguably, this 6-second improvement is not dramatic, until you consider that the price of the m3.2xlarge instance ($0.532 per hour) is 694% more than an m3.medium ($0.067 per hour). The recovery performance per dollar is much better using the persistent volume method, something to consider when designing your application.
Let’s look a little deeper at these results broken out by instance type.
EC2 m3.medium instances
As you can see, the bulk of the time in the native MongoDB replication recovery method is spent replicating data between the primary and the secondary replicas.
EC2 m3.2xlarge instances
You can see in the chart above that the results are less dramatic when compared to the results in our prior example, but the benefit with Flocker still holds true. You can expect even longer recovery times for datasets larger than 1.4 GB because re-sync time increases with the size of the dataset. In limited tests, 2.8 GB and 5.6 GB datasets took ~13 min and ~40 min, respectively.
Just as important, the length of the “re-attach existing data” process used in the persistent volume recovery method does not change with the size of the dataset. So recovery time for a 1 GB, 100 GB, or 1 TB replica will be roughly constant using the persistent volume method.
You can view the results in this Google spreadsheet.
Choosing the best cluster architecture and infrastructure
Notwithstanding the performance improvements of the persistent volume recovery method, how can you decide which recovery method is right for you?
Start by answering these questions:
Can a replica be down for 10min? 30min? 1hr?
If you can tolerate a long recovery time, relying on MongoDB’s replication may suffice. If not, you might want to investigate the persistent volume method.
What level of fault tolerance do you want to enable?
The greater the number of replicas you want to be able to lose, the greater the number of replicas you need in your cluster (please read this documentation on fault tolerance and MongoDB replica sets if you haven’t yet already).
With more replicas, what you gain in fault tolerance, you might lose by introducing increased complexity and cost for your cluster.
By reducing the impact of losing a replica, the persistent volume recovery method might allow you to run a smaller cluster without any drawbacks.
How much stress does a completely new replica put on my cluster?
You should carefully consider how CPU, RAM, network I/O, and disk performance are affected by replica recovery. A new replica can put enough stress on your cluster to adversely affect the performance of your primary node, along with increasing the amount of network traffic as a result of a full re-sync. If you are monitoring your MongoDB cluster, it’s worth doing a quick fire drill and benchmarking how much time it takes to stand up and sync a completely new replica for your specific cluster’s dataset size.
How redundant does my cluster need to be in the future?
Smaller datasets will recover much quicker than larger ones and may not need persistent external storage. As your dataset grows you can create replicas with different persistent strategies to migrate from local host storage to more durable external storage. You can even implement replicas in other regions and data center providers to further increase redundancy and fail-over capabilities.
How is read/write performance affected by using shared storage?
One potential drawback of the persistent volume recovery option is I/O performance. While most cloud block storage providers offer a dedicated IOPS option, this is something to consider when picking your recovery strategy.
Thanks for reading! Special thanks to leportlabs for the MongoDB replica sidecar, and to Sandeep Dinesh, thesandlord, for the advice and Makefile on the Kubernetes Slack group. Don’t forget to look out for Part 2 of this post, in which we will show you step-by-step how to use the persistent volume recovery method.
We’d love to hear your feedback!