7.4.2.5 Failover
In this section, you will look at how
primary and secondary member failovers are handled in replica sets.
All members of a replica set are connected to each other. As shown
in Figure 7-5,
they exchange a heartbeat message amongst each other.
If the Node Is a Secondary Node
If the member that stops responding to heartbeats is a secondary, the remaining members mark it as unreachable and continue operating normally; when the member comes back online, it catches up on the missed operations from the oplog and rejoins the set as a secondary.
If the Node Is the Primary Node
If the unreachable node is the primary, then provided a majority of the members of the original replica set can still connect to each other, those members elect a new primary, in keeping with the automatic failover capability of the replica set.
The new primary is elected by a majority of the replica set nodes. Arbiters can be used to break ties in scenarios such as when a network partition splits the participating nodes into two halves and a majority cannot otherwise be reached.
7.4.2.6 Rollbacks
When the primary changes, the data on the new primary is assumed to be the latest data in the system. When the former primary rejoins the set, any operations applied on it that were never replicated are rolled back, and it is then synced with the new primary.
The rollback operation reverts all the
write operations that were not replicated across the replica set.
This is done in order to maintain database consistency across the
replica set.
When connecting to the new primary, all nodes go through a resync process to ensure the rollback is accomplished. Each node identifies the operations it has that are not present on the new primary, and then queries the new primary for an updated copy of the documents affected by those operations. While the nodes are resyncing they are said to be recovering, and until the process is complete they are not eligible for primary election.
This happens very rarely, and when it does, it is often due to a network partition combined with replication lag, where the secondaries cannot keep up with the throughput of operations on the former primary.
It should be noted that if the write operations replicate to other members before the primary steps down, and those members remain accessible to a majority of the nodes of the replica set, no rollback occurs.
The rollback data is written to BSON files with names such as <database>.<collection>.<timestamp>.bson in the database's dbpath directory.
The administrator can decide to either
ignore or apply the rollback data. Applying the rollback data can
only begin when all the nodes are in sync with the new primary and
have rolled back to a consistent state.
The content of the rollback files can be read using bsondump; the operations then need to be applied manually to the new primary using mongorestore.
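A rollback file can first be inspected with bsondump and, once verified, replayed against the new primary with mongorestore. The file name, path, database, and collection below are hypothetical; a minimal sketch, assuming the new primary is reachable on the default port:
c:\practicalmongodb\bin>bsondump c:\db1\data\test.users.2015-07-13T22-30-15.bson
c:\practicalmongodb\bin>mongorestore --host <new primary host> --port 27017 --db test --collection users c:\db1\data\test.users.2015-07-13T22-30-15.bson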
MongoDB provides no method for handling rollback situations automatically, so manual intervention is required to apply the rollback data. While applying it, it's vital to ensure that the writes are replicated to all, or at least a majority of, the members in the set so that rollbacks can be avoided in the event of another failover.
7.4.2.7 Consistency
You have seen that the replica set members
keep on replicating data among each other by reading the oplog. How
is the consistency of data maintained? In this section, you will
look at how MongoDB ensures that you always access consistent data.
In MongoDB, although the reads can be
routed to the secondaries, the writes are always routed to the
primary, eradicating the scenario where two nodes are
simultaneously trying to update the same data set. The data set on
the primary node is always consistent.
If the read requests are routed to the
primary node, it will always see the up-to-date changes, which
means the read operations are always consistent with the last write
operations.
However, if the application has changed the read preference to read from secondaries, there is a possibility of the user not seeing the latest changes or seeing previous states. This is because the writes are replicated asynchronously to the secondaries.
7.4.2.8 Possible Replication Deployment
The
architecture you chose to deploy a replica set affects its
capability and capacity. In this section, you will look at few
strategies that you need to be aware of while deciding on the
architecture. We will also be discussing the deployment
architecture .
1.
Odd number of members: This ensures that there is no tie when electing a primary. If the number of nodes is even, an arbiter can be added so that the total number of nodes participating in the election is odd, as shown in Figure 7-6.
2.
Replica set fault tolerance is the number of members that can go down while still leaving enough members to elect a primary in case of a failure. Table 7-1 indicates the relationship between the member count of the replica set and its fault tolerance. Fault tolerance should be considered when deciding on the number of members.
3.
If the
application has specific dedicated requirements, such as for
reporting or backups, then delayed or hidden members can be
considered as part of the replica set, as shown in Figure 7-7.
4.
The members should be distributed geographically in order to guard against a failure of the main data center. As shown in Figure 7-8, the members kept at a geographically different location from the main data center can have their priority set to 0 so that they cannot be elected primary and act as standbys only.
7.4.2.9 Scaling Reads
Although the primary purpose of the
secondaries is to ensure data availability in case of downtime of
the primary node, there are other valid use cases for secondaries.
They can be dedicated to backup operations or data-processing jobs, or used to scale out reads. One of the ways to scale reads is to issue the read queries against the secondary nodes; by doing so, the workload on the master is reduced.
One important point that you need to
consider when using secondaries for scaling read operations is that
in MongoDB the replication is asynchronous, which means if any
write or update operation is performed on the master’s data, the
secondary data will be momentarily out-of-date. If the application
in question is read-heavy and is accessed over a network and does
not need up-to-date data, the secondaries can be used to scale out
the read in order to provide a good read throughput. Although by
default the read requests are routed to the primary node, the
requests can be distributed over secondary nodes by specifying the
read preferences. Figure 7-9
depicts the default read preference.
Figure 7-9.
Default read preference
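For illustration, the read preference can be changed from the mongo shell either for the whole connection or for an individual query; the users collection and filter below are hypothetical:
// route reads from this connection to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")
// or set the preference for a single query only
db.users.find({ state: "active" }).readPref("nearest")
Drivers expose the same read preference modes (primary, primaryPreferred, secondary, secondaryPreferred, and nearest) through their connection options.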
The following
are ideal use cases whereby routing the reads on secondary node can
help gain a significant improvement in the read throughput and can
also help reduce the latency:
1.
Applications that are geographically distributed: In such cases, you can have a replica set that is distributed across geographies. The read preference should be set to read from the nearest secondary node. This helps reduce the latency caused by reading over the network and improves the read performance. See Figure 7-10.
Figure 7-10.
Read Preference – Nearest
2.
If the application always requires up-to-date data, it can use the primaryPreferred option, which under normal circumstances always reads from the primary node but in emergencies routes reads to the secondaries. This is useful during failovers. See Figure 7-11.
Figure 7-11.
Read Preference –
primaryPreferred
3.
Suppose you have an application that supports two types of operations: a main workload that involves reading and doing some processing on the data, and a second operation that generates reports from the data. In such a scenario, you can have the reporting reads directed to the secondaries.
In addition to scaling reads, the other ideal use case for secondaries is to offload intensive processing, aggregation, and administration tasks in order to avoid degrading the primary's performance. Blocking operations can be performed on a secondary without affecting the primary node's performance.
7.4.2.10 Application Write Concerns
When the client application interacts with MongoDB, it is generally not aware of whether the database is a standalone deployment or a replica set. However, when dealing with replica sets, the client should be aware of write concerns and read concerns.
Since a replica set duplicates the data
and stores it across multiple nodes, these two concerns give a
client application the flexibility to enforce data consistency
across nodes while performing read or write operations.
When used in a replica set deployment of
MongoDB, the write concern sends a confirmation from the server to
the application that the write has succeeded on the primary node.
However, this can be configured so that the write concern returns
success only when the write is replicated to all the nodes
maintaining the data.
In a practical scenario this isn't feasible, because it would reduce write performance. Ideally, the client can use a write concern to ensure that the data is replicated to one more node in addition to the primary, so that the data is not lost even if the primary steps down.
The w option ensures that the write has been replicated to the specified number of members. Either a number or "majority" can be specified as the value of the w option. If a number is specified, the write is replicated to that many nodes before success is returned. If "majority" is specified, the write is replicated to a majority of the members before the result is returned.
If the number specified is greater than the number of nodes that actually hold the data, the command will keep waiting until the members become available. To avoid this indefinite wait, wtimeout should be used along with w; it sets a time limit, and if the write has not succeeded within that period, the operation times out.
How Writes Happen with Write Concern
In order to ensure that the written data is present on, say, at least two members, the write is issued with a write concern of w: 2, as sketched below.
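A minimal sketch from the mongo shell, using a hypothetical products collection; the insert returns success only after the write has been acknowledged by two members (the primary plus one secondary), or fails with a timeout error after five seconds:
db.products.insert(
    { item: "card", qty: 15 },
    { writeConcern: { w: 2, wtimeout: 5000 } }
)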
To understand how this command executes, say you have two members, one the primary and the other a secondary that syncs its data from the primary.
But how will the primary know the point up to which the secondary is synced? Since the secondary queries the primary's oplog for the operations it needs to apply, when the secondary requests an op written at, say, time t, it implies to the primary that the secondary has already replicated all ops written before t.
7.4.3 Implementing Advanced Clustering with Replica Sets
Having learned the architecture and inner workings of replica sets, you will now focus on their administration and usage. You will be looking at the following:
1.
Setting up a replica set
2.
Removing a server
3.
Adding a server
4.
Adding an arbiter to a replica set
5.
Inspecting the status using rs.status()
6.
Forcing a new election of a primary
7.
Inspecting the status of the replica set using a web interface
The following examples assume a replica
set named testset that has the configuration shown in Table 7-2.
In the following examples, [hostname] needs to be substituted with the value that the hostname command returns on your system. In our case, the value returned is ANOC9, which is used throughout.
7.4.3.1 Setting Up a Replica Set
In order to get the replica set up and running, you need to bring all of its members up and running.
The first step is to start the first
active member. Open a terminal window and create the data
directory :
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset/ANOC9:27021 --rest
2015-07-13T23:48:40.543-0700 I CONTROL
Hotfix KB2731284 or later update is installed, no need to zero-out
data files
2015-07-13T23:48:40.564-0700 I JOURNAL
[initandlisten] recover : no journal files present, no recovery
needed
2015-07-13T23:48:40.614-0700 I CONTROL
[initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
As you can see, the --replSet option specifies the name of the replica set the instance is joining and the name of one more member of the set, which in the above example is Active_Member_2.
Although you have specified only one member in the above example, multiple members can be provided as comma-separated addresses, like so:
mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset/[hostname]:27022,[hostname]:27023 --rest
In the next step, you get the second
active member up and running. Create the data directory for the
second active member in a new terminal window.
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active2\data --port 27022 --replSet testset/ANOC9:27021 --rest
2015-07-13T00:39:11.604-0700 I CONTROL
Hotfix KB2731284 or later update is installed, no need to zero-out
data files
2015-07-13T00:39:11.615-0700 I JOURNAL
[initandlisten] recover : no journal files present, no recovery
needed
2015-07-13T00:39:11.664-0700 I JOURNAL [journal writer] Journal writer thread started
... replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Finally, you need to start the passive
member. Open a separate window and create the data directory for
the passive member.
c:\practicalmongodb\bin>mongod --dbpath C:\db1\passive1\data --port 27023 --replSet testset/ANOC9:27021 --rest
2015-07-13T05:11:43.746-0700 I CONTROL
Hotfix KB2731284 or later update is installed, no need to zero-out
data files
2015-07-13T05:11:43.808-0700 I CONTROL [initandlisten] MongoDB starting : pid=620 port=27023 dbpath=C:\db1\passive1\data 64-bit host=ANOC9
In the preceding examples, the --rest option is used to activate a REST interface on a port 1,000 higher than the mongod port. Activating REST enables you to inspect the replica set status using a web interface.
By the end of the above steps, you have three servers that are up and running and communicating with each other; however, the replica set has still not been initialized. In the next step, you initialize the replica set and instruct each member about its responsibilities and roles.
In order to initialize the replica set,
you connect to one of the servers. In this example, it is the
first server, which is running on port 27021.
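The configuration document itself is not reproduced in the text; a minimal sketch of connecting to the first member and initiating the set could look like the following (the host name and member IDs assume the testset layout used in these examples):
c:\practicalmongodb\bin>mongo --port 27021
> cfg = {
      _id: "testset",
      members: [
          { _id: 0, host: "ANOC9:27021" },
          { _id: 1, host: "ANOC9:27022" },
          { _id: 2, host: "ANOC9:27023", priority: 0 }
      ]
  }
> rs.initiate(cfg)
> rs.status()
The third member is the passive one, which is why its priority is set to 0 in the configuration document.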
You have used 0 priority when defining the
role for the passive member. This means that the member cannot be
promoted to primary.
The output indicates that all is OK. The
replica set is now successfully configured and initialized.
7.4.3.2 Removing a Server
In this example, you will remove the
secondary active member from the set. Let’s connect to the
secondary member mongo instance. Open a new command prompt, like
so:
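The exact commands are not reproduced in the text; one way to do this, sketched below, is to shut down the member being removed from its own shell and then remove it from the configuration from the primary's shell (the ports follow the testset layout used in these examples):
c:\practicalmongodb\bin>mongo --port 27022
testset:SECONDARY> use admin
testset:SECONDARY> db.shutdownServer()

c:\practicalmongodb\bin>mongo --port 27021
testset:PRIMARY> rs.remove("ANOC9:27022")
Running rs.status() afterward should show that the member on port 27022 is no longer part of the set.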
7.4.3.3 Adding a Server
You will next add a new active member to
the replica set. As with other members, you begin by opening a new
command prompt and creating the data directory first:
c:\practicalmongodb\bin>mongod
--dbpath C:\db1\active3\data --port 27024 --replSet
testset/ANOC9:27021 --rest
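Once this new mongod is up, it still has to be added to the replica set's configuration; a minimal sketch, run from the primary's shell:
c:\practicalmongodb\bin>mongo --port 27021
testset:PRIMARY> rs.add("ANOC9:27024")
testset:PRIMARY> rs.status()
The new member starts out in the recovering state while it performs its initial sync and then becomes a normal secondary.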
7.4.3.4 Adding an Arbiter to a Replica Set
In this example, you will add an arbiter
member to the set. As with the other members, you begin by
creating the data directory for the MongoDB instance:
c:\practicalmongodb\bin>mongod --dbpath
c:\db1\arbiter\data --port 30000 --replSet testset/ANOC9:27021
--rest
2015-07-13T22:05:10.205-0700 I CONTROL
[initandlisten] MongoDB starting : pid=3700 port=30000
dbpath=c:\db1\arbiter\data 64-bit host=ANOC9
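After the arbiter's mongod is started, it is added to the set with rs.addArb() rather than rs.add(); a minimal sketch from the primary's shell:
c:\practicalmongodb\bin>mongo --port 27021
testset:PRIMARY> rs.addArb("ANOC9:30000")
The arbiter takes part in elections but holds no data, so it is reported with the arbiter state in rs.status() rather than as a primary or secondary.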
7.4.3.5 Inspecting the Status Using rs.status()
We have been referring to rs.status()
throughout the examples above to check the replica set status. In
this section, you will learn what this command is all about.
It enables you to check the status of the member whose console you are connected to and to view its role within the replica set.
The myState field’s value indicates the
status of the member and it can have the values shown in Table
7-3.
Table 7-3. Replica Set Status

myState | Description
--- | ---
0 | Phase 1, starting up
1 | Primary member
2 | Secondary member
3 | Recovering state
4 | Fatal error state
5 | Phase 2, starting up
6 | Unknown state
7 | Arbiter member
8 | Down or unreachable
9 | Rollback; this state is reached when a write operation is rolled back by the secondary after transitioning from primary
10 | Removed; members enter this state when removed from the replica set
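As a quick illustration, the fields most often checked can be read straight off the document that rs.status() returns; a minimal sketch from any member's shell:
testset:PRIMARY> rs.status().myState
testset:PRIMARY> rs.status().members.forEach(function (m) { print(m.name + " : " + m.stateStr); })
The first expression returns the myState code from Table 7-3 for the member you are connected to, and the second prints each member's address along with a human-readable state string.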
7.4.3.6 Forcing a New Election
The current primary server can be forced to step down using the rs.stepDown() command. This forces the start of an election for a new primary.
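A minimal sketch of forcing the election from the primary's shell is shown below; the argument is the number of seconds for which the stepped-down member will not seek re-election:
testset:PRIMARY> rs.stepDown(120)
After the command completes, running rs.status() on any member should show one of the other eligible members as the new primary, while the former primary rejoins as a secondary.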
7.4.3.7 Inspecting Status of the Replica Set Using a Web Interface
MongoDB maintains a web-based console for viewing the system status. In your example, the console can be accessed via http://localhost:28021 (the first member's port, 27021, plus 1,000).
7.5 Sharding
You saw in the previous section how replica
sets in MongoDB are used to duplicate the data in order to protect
against any adversity and to distribute the read load in order to
increase the read efficiency.
MongoDB uses memory extensively for low
latency database operations. When you compare the speed of reading
data from memory to reading data from disk, reading from memory is
approximately 100,000 times faster than reading from the disk.
In MongoDB, ideally the working set should
fit in memory. The working set consists of the most frequently
accessed data and indexes.
A page fault happens when MongoDB accesses data that is not in memory. If free memory is available, the OS loads the requested page directly into memory; in the absence of free memory, however, a page in memory is written out to disk and then the requested page is loaded into memory, slowing the process down. A few operations can accidentally purge a large portion of the working set from memory, with an adverse effect on performance. One example is a query that scans all documents of a database whose size exceeds the server memory; this loads those documents into memory and pushes the working set out to disk.
Ensuring you have defined the appropriate
index coverage for your queries during the schema design phase of
the project will minimize the risk of this happening. The MongoDB
explain operation can be used to provide information on your query
plan and the indexes used.
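For example, whether a query is served by an index can be checked in the mongo shell as follows (the orders collection, filter, and index here are hypothetical):
// create an index matching the query's filter
db.orders.createIndex({ status: 1 })
// examine the winning plan and the number of documents examined
db.orders.find({ status: "shipped" }).explain("executionStats")
If the winning plan shows an index scan (IXSCAN) and totalDocsExamined stays close to the number of documents returned, the query will not drag large parts of the collection into memory.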
MongoDB's serverStatus command returns a workingSet document that provides an estimate of the instance's working set size. The operations team can track how many pages the instance accessed over a given period of time and the elapsed time between the working set's oldest and newest documents. By tracking these metrics, it is possible to detect when the working set is approaching the current memory limit, so proactive action can be taken to ensure the system scales well enough to handle it.
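In the MongoDB releases of this era (MMAPv1 with the working set estimator available), the estimate has to be requested explicitly; a sketch:
db.serverStatus({ workingSet: 1 }).workingSet
The returned subdocument includes fields such as pagesInMemory and overSeconds, which correspond to the metrics referred to above.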
In MongoDB, the scaling is handled by
scaling out the data horizontally (i.e. partitioning the data across
multiple commodity servers), which is also called sharding
(horizontal scaling).
Sharding addresses the challenges of scaling to support large data sets and high throughput by horizontally dividing the data set across servers, where each server is responsible for handling its part of the data and no single server is overburdened. These servers are also called shards.
Every shard is an independent database. All the shards collectively make up a single logical database.
Sharding reduces the number of operations each shard handles. For example, when data is inserted, only the shard responsible for storing those records needs to be accessed. As the cluster grows, the subset of data held by each shard shrinks, so the work each shard must handle decreases. This leads to a horizontal increase in throughput and capacity.
Let's assume you have a database that is 1TB in size. If the number of shards is 4, each shard will handle approximately 250GB of data, whereas if the number of shards is increased to 40, only 25GB of data will be held on each shard.
Figure 7-15
depicts how a collection that is sharded will appear when
distributed across three shards.
Figure 7-15.
Sharded collection across three
shards
Although sharding is a compelling and
powerful feature, it has significant infrastructure requirements and
it increases the complexity of the overall deployment. So you need
to understand the scenarios where you might consider using sharding.
7.5.1 Sharding Components
You will next look at the components that
enable sharding in MongoDB. Sharding is enabled in MongoDB via
sharded clusters.
The shard is the component where the actual data is stored. For the sharded cluster, it holds a subset of the data and can be either a mongod or a replica set. All the shards' data combined forms the complete data set for the sharded cluster.
Sharding is enabled on a per-collection basis, so there might be collections that are not sharded. Every sharded cluster has a primary shard, where all the unsharded collections are placed in addition to that shard's portion of the sharded collection data.
When deploying a sharded cluster, the first shard becomes the primary shard by default, although this is configurable. See Figure 7-16.
Figure 7-16.
Primary shard
Config servers are special mongods that
hold the sharded cluster’s metadata. This metadata depicts the
sharded system state and organization.
The config servers store the metadata for a single sharded cluster and must be available for the proper functioning of the cluster.
A single config server is a single point of failure for the cluster. For production deployments, it is recommended to have at least three config servers, so that the cluster keeps functioning even if one config server is not accessible.
A config server stores the data in the config database, which enables routing of the client requests to the respective data. This database should not be updated manually.
MongoDB writes data to the config servers only when the data distribution has changed, in order to keep the cluster balanced.
The mongos act as routers. They are responsible for routing the read and write requests from the application to the shards.
An application interacting with a MongoDB database need not worry about how the data is stored internally on the shards. To the application this is transparent, since it interacts only with the mongos, which in turn route the reads and writes to the relevant shards.
The mongos cache the metadata from the config servers so that they do not overburden the config servers with every read and write request.
However, in the following cases the data is read from the config servers: when a mongos starts (or restarts) and needs to load the cluster metadata, and when the cluster metadata has changed, for example after a chunk split or chunk migration, and the mongos needs the updated state.