7.4.2.5 Failover
In this section, you will look at how
primary and secondary member failovers are handled in replica sets.
All members of a replica set are connected to each other. As shown
in Figure 7-5,
they exchange a heartbeat message amongst each other.
If the Node Is a Secondary Node
If the member that stops responding to heartbeats is a secondary, the remaining members mark it as unreachable and continue operating normally; when the member comes back online, it catches up on the missed operations from the oplog and rejoins the set as a secondary.
If the Node Is the Primary Node
If the unreachable node is the primary, then provided a majority of the members of the original replica set can still connect to each other, those members elect a new primary, in keeping with the automatic failover capability of the replica set.
The new primary is elected by a majority of the replica set nodes. Arbiters can be used to break ties in scenarios such as when a network partition splits the participating nodes into two halves and a majority cannot otherwise be reached.
7.4.2.6 Rollbacks
When the primary changes, the data on the new primary is assumed to be the latest data in the system. When the former primary rejoins the set, any operations applied on it that were never replicated are rolled back, and it is then synced with the new primary.
The rollback operation reverts all the
write operations that were not replicated across the replica set.
This is done in order to maintain database consistency across the
replica set.
When connecting to the new primary, all nodes go through a resync process to ensure the rollback is accomplished. Each node identifies the operations it has that are not present on the new primary, and then queries the new primary for an updated copy of the documents affected by those operations. While the nodes are resyncing they are said to be recovering, and until the process is complete they are not eligible for primary election.
This happens very rarely, and when it does, it is often due to a network partition combined with replication lag, where the secondaries cannot keep up with the throughput of operations on the former primary.
It should be noted that if the write operations replicate to other members before the primary steps down, and those members remain accessible to a majority of the nodes of the replica set, no rollback occurs.
The rollback data is written to BSON files with names such as <database>.<collection>.<timestamp>.bson in the database's dbpath directory.
The administrator can decide to either
ignore or apply the rollback data. Applying the rollback data can
only begin when all the nodes are in sync with the new primary and
have rolled back to a consistent state.
The content of the rollback files can be read using bsondump; the operations then need to be applied manually to the new primary using mongorestore.
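A rollback file can first be inspected with bsondump and, once verified, replayed against the new primary with mongorestore. The file name, path, database, and collection below are hypothetical; a minimal sketch, assuming the new primary is reachable on the default port:
c:\practicalmongodb\bin>bsondump c:\db1\data\test.users.2015-07-13T22-30-15.bson
c:\practicalmongodb\bin>mongorestore --host <new primary host> --port 27017 --db test --collection users c:\db1\data\test.users.2015-07-13T22-30-15.bson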
MongoDB provides no method for handling rollback situations automatically, so manual intervention is required to apply the rollback data. While applying it, it's vital to ensure that the writes are replicated to all, or at least a majority of, the members in the set so that rollbacks can be avoided in the event of another failover.
7.4.2.7 Consistency
You have seen that the replica set members
keep on replicating data among each other by reading the oplog. How
is the consistency of data maintained? In this section, you will
look at how MongoDB ensures that you always access consistent data.
In MongoDB, although the reads can be
routed to the secondaries, the writes are always routed to the
primary, eradicating the scenario where two nodes are
simultaneously trying to update the same data set. The data set on
the primary node is always consistent.
If the read requests are routed to the
primary node, it will always see the up-to-date changes, which
means the read operations are always consistent with the last write
operations.
However, if the application has changed the read preference to read from secondaries, there is a possibility of the user not seeing the latest changes or seeing previous states. This is because the writes are replicated asynchronously to the secondaries.
7.4.2.8 Possible Replication Deployment
The
architecture you chose to deploy a replica set affects its
capability and capacity. In this section, you will look at few
strategies that you need to be aware of while deciding on the
architecture. We will also be discussing the deployment
architecture .
1.
Odd number of members: This ensures that there is no tie when electing a primary. If the number of nodes is even, an arbiter can be added so that the total number of nodes participating in the election is odd, as shown in Figure 7-6.
2.
Replica set fault tolerance is the number of members that can go down while still leaving enough members to elect a primary in case of a failure. Table 7-1 indicates the relationship between the member count of the replica set and its fault tolerance. Fault tolerance should be considered when deciding on the number of members.
3.
If the
application has specific dedicated requirements, such as for
reporting or backups, then delayed or hidden members can be
considered as part of the replica set, as shown in Figure 7-7.
4.
The members should be distributed geographically in order to guard against a failure of the main data center. As shown in Figure 7-8, the members kept at a geographically different location from the main data center can have their priority set to 0 so that they cannot be elected primary and act as standbys only.
7.4.2.9 Scaling Reads
Although the primary purpose of the
secondaries is to ensure data availability in case of downtime of
the primary node, there are other valid use cases for secondaries.
They can be dedicated to backup operations or data-processing jobs, or used to scale out reads. One of the ways to scale reads is to issue the read queries against the secondary nodes; by doing so, the workload on the master is reduced.
One important point that you need to
consider when using secondaries for scaling read operations is that
in MongoDB the replication is asynchronous, which means if any
write or update operation is performed on the master’s data, the
secondary data will be momentarily out-of-date. If the application
in question is read-heavy and is accessed over a network and does
not need up-to-date data, the secondaries can be used to scale out
the read in order to provide a good read throughput. Although by
default the read requests are routed to the primary node, the
requests can be distributed over secondary nodes by specifying the
read preferences. Figure 7-9
depicts the default read preference.
Figure 7-9.
Default read preference
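For illustration, the read preference can be changed from the mongo shell either for the whole connection or for an individual query; the users collection and filter below are hypothetical:
// route reads from this connection to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")
// or set the preference for a single query only
db.users.find({ state: "active" }).readPref("nearest")
Drivers expose the same read preference modes (primary, primaryPreferred, secondary, secondaryPreferred, and nearest) through their connection options.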
The following
are ideal use cases whereby routing the reads on secondary node can
help gain a significant improvement in the read throughput and can
also help reduce the latency:
1.
Applications that are geographically distributed: In such cases, you can have a replica set that is distributed across geographies. The read preference should be set to read from the nearest secondary node. This helps reduce the latency caused by reading over the network and improves the read performance. See Figure 7-10.
Figure 7-10.
Read Preference – Nearest
2.
If the application always requires up-to-date data, it can use the primaryPreferred option, which under normal circumstances always reads from the primary node but in emergencies routes reads to the secondaries. This is useful during failovers. See Figure 7-11.
Figure 7-11.
Read Preference –
primaryPreferred
3.
Suppose you have an application that supports two types of operations: a main workload that involves reading and doing some processing on the data, and a second operation that generates reports from the data. In such a scenario, you can have the reporting reads directed to the secondaries.
In addition to scaling reads, the other ideal use case for secondaries is to offload intensive processing, aggregation, and administration tasks in order to avoid degrading the primary's performance. Blocking operations can be performed on a secondary without affecting the primary node's performance.
7.4.2.10 Application Write Concerns
When the client application interacts with MongoDB, it is generally not aware of whether the database is a standalone deployment or a replica set. However, when dealing with replica sets, the client should be aware of write concerns and read concerns.
Since a replica set duplicates the data
and stores it across multiple nodes, these two concerns give a
client application the flexibility to enforce data consistency
across nodes while performing read or write operations.
When used in a replica set deployment of
MongoDB, the write concern sends a confirmation from the server to
the application that the write has succeeded on the primary node.
However, this can be configured so that the write concern returns
success only when the write is replicated to all the nodes
maintaining the data.
In a practical scenario this isn't feasible, because it would reduce write performance. Ideally, the client can use a write concern to ensure that the data is replicated to one more node in addition to the primary, so that the data is not lost even if the primary steps down.
The w option ensures that the write has been replicated to the specified number of members. Either a number or "majority" can be specified as the value of the w option. If a number is specified, the write is replicated to that many nodes before success is returned. If "majority" is specified, the write is replicated to a majority of the members before the result is returned.
If the number specified is greater than the number of nodes that actually hold the data, the command will keep waiting until the members become available. To avoid this indefinite wait, wtimeout should be used along with w; it sets a time limit, and if the write has not succeeded within that period, the operation times out.
How Writes Happen with Write Concern
In order to ensure that the written data is present on, say, at least two members, the write is issued with a write concern of w: 2, as sketched below.
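A minimal sketch from the mongo shell, using a hypothetical products collection; the insert returns success only after the write has been acknowledged by two members (the primary plus one secondary), or fails with a timeout error after five seconds:
db.products.insert(
    { item: "card", qty: 15 },
    { writeConcern: { w: 2, wtimeout: 5000 } }
)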
To understand how this command executes, say you have two members, one the primary and the other a secondary that syncs its data from the primary.
But how will the primary know the point up to which the secondary is synced? Since the secondary queries the primary's oplog for the operations it needs to apply, when the secondary requests an op written at, say, time t, it implies to the primary that the secondary has already replicated all ops written before t.
7.4.3 Implementing Advanced Clustering with Replica Sets
Having learned the architecture and inner workings of replica sets, you will now focus on their administration and usage. You will be looking at the following:
1.
Setting up a replica set
2.
Removing a server
3.
Adding a server
4.
Adding an arbiter to a replica set
5.
Inspecting the status using rs.status()
6.
Forcing a new election of a primary
7.
Inspecting the status of the replica set using a web interface
The following examples assume a replica
set named testset that has the configuration shown in Table 7-2.
In the following examples, [hostname] needs to be substituted with the value that the hostname command returns on your system. In our case, the value returned is ANOC9, which is used throughout.
7.4.3.1 Setting Up a Replica Set
In order to get the replica set up and running, you need to bring all of its members up and running.
The first step is to start the first
active member. Open a terminal window and create the data
directory :
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset/ANOC9:27021 --rest
2015-07-13T23:48:40.543-0700 I CONTROL
Hotfix KB2731284 or later update is installed, no need to zero-out
data files
2015-07-13T23:48:40.564-0700 I JOURNAL
[initandlisten] recover : no journal files present, no recovery
needed
2015-07-13T23:48:40.614-0700 I CONTROL
[initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
As you can see, the --replSet option specifies the name of the replica set the instance is joining and the name of one more member of the set, which in the above example is Active_Member_2.
Although you have specified only one member in the above example, multiple members can be provided as comma-separated addresses, like so:
mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset/[hostname]:27022,[hostname]:27023 --rest
In the next step, you get the second
active member up and running. Create the data directory for the
second active member in a new terminal window.
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active2\data --port 27022 --replSet testset/ANOC9:27021 --rest
2015-07-13T00:39:11.604-0700 I CONTROL
Hotfix KB2731284 or later update is installed, no need to zero-out
data files
2015-07-13T00:39:11.615-0700 I JOURNAL
[initandlisten] recover : no journal files present, no recovery
needed
2015-07-13T00:39:11.664-0700 I JOURNAL [journal writer] Journal writer thread started
... replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Finally, you need to start the passive
member. Open a separate window and create the data directory for
the passive member.
c:\practicalmongodb\bin>mongod --dbpath C:\db1\passive1\data --port 27023 --replSet testset/ANOC9:27021 --rest
2015-07-13T05:11:43.746-0700 I CONTROL
Hotfix KB2731284 or later update is installed, no need to zero-out
data files
2015-07-13T05:11:43.808-0700 I CONTROL [initandlisten] MongoDB starting : pid=620 port=27023 dbpath=C:\db1\passive1\data 64-bit host=ANOC9
In the preceding examples, the --rest option is used to activate a REST interface on a port 1,000 higher than the mongod port. Activating REST enables you to inspect the replica set status using a web interface.
By the end of the above steps, you have three servers that are up and running and communicating with each other; however, the replica set has still not been initialized. In the next step, you initialize the replica set and instruct each member about its responsibilities and roles.
In order to initialize the replica set,
you connect to one of the servers. In this example, it is the
first server, which is running on port 27021.
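The configuration document itself is not reproduced in the text; a minimal sketch of connecting to the first member and initiating the set could look like the following (the host name and member IDs assume the testset layout used in these examples):
c:\practicalmongodb\bin>mongo --port 27021
> cfg = {
      _id: "testset",
      members: [
          { _id: 0, host: "ANOC9:27021" },
          { _id: 1, host: "ANOC9:27022" },
          { _id: 2, host: "ANOC9:27023", priority: 0 }
      ]
  }
> rs.initiate(cfg)
> rs.status()
The third member is the passive one, which is why its priority is set to 0 in the configuration document.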
You have used 0 priority when defining the
role for the passive member. This means that the member cannot be
promoted to primary.
The output indicates that all is OK. The
replica set is now successfully configured and initialized.
7.4.3.2 Removing a Server
In this example, you will remove the
secondary active member from the set. Let’s connect to the
secondary member mongo instance. Open a new command prompt, like
so:
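The exact commands are not reproduced in the text; one way to do this, sketched below, is to shut down the member being removed from its own shell and then remove it from the configuration from the primary's shell (the ports follow the testset layout used in these examples):
c:\practicalmongodb\bin>mongo --port 27022
testset:SECONDARY> use admin
testset:SECONDARY> db.shutdownServer()

c:\practicalmongodb\bin>mongo --port 27021
testset:PRIMARY> rs.remove("ANOC9:27022")
Running rs.status() afterward should show that the member on port 27022 is no longer part of the set.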
7.4.3.3 Adding a Server
You will next add a new active member to
the replica set. As with other members, you begin by opening a new
command prompt and creating the data directory first:
c:\practicalmongodb\bin>mongod
--dbpath C:\db1\active3\data --port 27024 --replSet
testset/ANOC9:27021 --rest
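Once this new mongod is up, it still has to be added to the replica set's configuration; a minimal sketch, run from the primary's shell:
c:\practicalmongodb\bin>mongo --port 27021
testset:PRIMARY> rs.add("ANOC9:27024")
testset:PRIMARY> rs.status()
The new member starts out in the recovering state while it performs its initial sync and then becomes a normal secondary.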
7.4.3.4 Adding an Arbiter to a Replica Set
In this example, you will add an arbiter
member to the set. As with the other members, you begin by
creating the data directory for the MongoDB instance:
c:\practicalmongodb\bin>mongod --dbpath
c:\db1\arbiter\data --port 30000 --replSet testset/ANOC9:27021
--rest
2015-07-13T22:05:10.205-0700 I CONTROL
[initandlisten] MongoDB starting : pid=3700 port=30000
dbpath=c:\db1\arbiter\data 64-bit host=ANOC9
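After the arbiter's mongod is started, it is added to the set with rs.addArb() rather than rs.add(); a minimal sketch from the primary's shell:
c:\practicalmongodb\bin>mongo --port 27021
testset:PRIMARY> rs.addArb("ANOC9:30000")
The arbiter takes part in elections but holds no data, so it is reported with the arbiter state in rs.status() rather than as a primary or secondary.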
7.4.3.5 Inspecting the Status Using rs.status()
We have been referring to rs.status()
throughout the examples above to check the replica set status. In
this section, you will learn what this command is all about.
It enables you to check the status of the member whose console you are connected to and to view its role within the replica set.
The myState field’s value indicates the
status of the member and it can have the values shown in Table
7-3.
Table 7-3. Replica Set Status

myState | Description
--- | ---
0 | Phase 1, starting up
1 | Primary member
2 | Secondary member
3 | Recovering state
4 | Fatal error state
5 | Phase 2, starting up
6 | Unknown state
7 | Arbiter member
8 | Down or unreachable
9 | Rollback; this state is reached when a write operation is rolled back by the secondary after transitioning from primary
10 | Removed; members enter this state when removed from the replica set
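As a quick illustration, the fields most often checked can be read straight off the document that rs.status() returns; a minimal sketch from any member's shell:
testset:PRIMARY> rs.status().myState
testset:PRIMARY> rs.status().members.forEach(function (m) { print(m.name + " : " + m.stateStr); })
The first expression returns the myState code from Table 7-3 for the member you are connected to, and the second prints each member's address along with a human-readable state string.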
7.4.3.6 Forcing a New Election
The current primary server can be forced to step down using the rs.stepDown() command. This forces the start of an election for a new primary.
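A minimal sketch of forcing the election from the primary's shell is shown below; the argument is the number of seconds for which the stepped-down member will not seek re-election:
testset:PRIMARY> rs.stepDown(120)
After the command completes, running rs.status() on any member should show one of the other eligible members as the new primary, while the former primary rejoins as a secondary.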
7.4.3.7 Inspecting Status of the Replica Set Using a Web Interface
MongoDB maintains a web-based console for viewing the system status. In your example, the console can be accessed via http://localhost:28021 (the first member's port, 27021, plus 1,000).
7.5 Sharding
You saw in the previous section how replica
sets in MongoDB are used to duplicate the data in order to protect
against any adversity and to distribute the read load in order to
increase the read efficiency.
MongoDB uses memory extensively for low
latency database operations. When you compare the speed of reading
data from memory to reading data from disk, reading from memory is
approximately 100,000 times faster than reading from the disk.
In MongoDB, ideally the working set should
fit in memory. The working set consists of the most frequently
accessed data and indexes.
A page fault happens when MongoDB accesses data that is not in memory. If free memory is available, the OS loads the requested page directly into memory; in the absence of free memory, however, a page in memory is written out to disk and then the requested page is loaded into memory, slowing the process down. A few operations can accidentally purge a large portion of the working set from memory, with an adverse effect on performance. One example is a query that scans all documents of a database whose size exceeds the server memory; this loads those documents into memory and pushes the working set out to disk.
Ensuring you have defined the appropriate
index coverage for your queries during the schema design phase of
the project will minimize the risk of this happening. The MongoDB
explain operation can be used to provide information on your query
plan and the indexes used.
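For example, whether a query is served by an index can be checked in the mongo shell as follows (the orders collection, filter, and index here are hypothetical):
// create an index matching the query's filter
db.orders.createIndex({ status: 1 })
// examine the winning plan and the number of documents examined
db.orders.find({ status: "shipped" }).explain("executionStats")
If the winning plan shows an index scan (IXSCAN) and totalDocsExamined stays close to the number of documents returned, the query will not drag large parts of the collection into memory.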
MongoDB's serverStatus command returns a workingSet document that provides an estimate of the instance's working set size. The operations team can track how many pages the instance accessed over a given period of time and the elapsed time between the working set's oldest and newest documents. By tracking these metrics, it is possible to detect when the working set is approaching the current memory limit, so proactive action can be taken to ensure the system scales well enough to handle it.
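In the MongoDB releases of this era (MMAPv1 with the working set estimator available), the estimate has to be requested explicitly; a sketch:
db.serverStatus({ workingSet: 1 }).workingSet
The returned subdocument includes fields such as pagesInMemory and overSeconds, which correspond to the metrics referred to above.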
In MongoDB, the scaling is handled by
scaling out the data horizontally (i.e. partitioning the data across
multiple commodity servers), which is also called sharding
(horizontal scaling).
Sharding addresses the challenges of scaling to support large data sets and high throughput by horizontally dividing the data set across servers, where each server is responsible for handling its part of the data and no single server is overburdened. These servers are also called shards.
Every shard is an independent database. All the shards collectively make up a single logical database.
Sharding reduces the number of operations each shard handles. For example, when data is inserted, only the shard responsible for storing those records needs to be accessed. As the cluster grows, the subset of data held by each shard shrinks, so the work each shard must handle decreases. This leads to a horizontal increase in throughput and capacity.
Let's assume you have a database that is 1TB in size. If the number of shards is 4, each shard will handle approximately 250GB of data, whereas if the number of shards is increased to 40, only 25GB of data will be held on each shard.
Figure 7-15
depicts how a collection that is sharded will appear when
distributed across three shards.
Figure 7-15.
Sharded collection across three
shards
Although sharding is a compelling and
powerful feature, it has significant infrastructure requirements and
it increases the complexity of the overall deployment. So you need
to understand the scenarios where you might consider using sharding.
7.5.1 Sharding Components
You will next look at the components that
enable sharding in MongoDB. Sharding is enabled in MongoDB via
sharded clusters.
The shard is the component where the actual data is stored. For the sharded cluster, it holds a subset of the data and can be either a mongod or a replica set. All the shards' data combined forms the complete data set for the sharded cluster.
Sharding is enabled on a per-collection basis, so there might be collections that are not sharded. Every sharded cluster has a primary shard, where all the unsharded collections are placed in addition to that shard's portion of the sharded collection data.
When deploying a sharded cluster, the first shard becomes the primary shard by default, although this is configurable. See Figure 7-16.
Figure 7-16.
Primary shard
Config servers are special mongods that
hold the sharded cluster’s metadata. This metadata depicts the
sharded system state and organization.
The config servers store the metadata for a single sharded cluster and must be available for the proper functioning of the cluster.
A single config server is a single point of failure for the cluster. For production deployments, it is recommended to have at least three config servers, so that the cluster keeps functioning even if one config server is not accessible.
A config server stores the data in the config database, which enables routing of the client requests to the respective data. This database should not be updated manually.
MongoDB writes data to the config servers only when the data distribution has changed, in order to keep the cluster balanced.
The mongos act as routers. They are responsible for routing the read and write requests from the application to the shards.
An application interacting with a MongoDB database need not worry about how the data is stored internally on the shards. To the application this is transparent, since it interacts only with the mongos, which in turn route the reads and writes to the relevant shards.
The mongos cache the metadata from the config servers so that they do not overburden the config servers with every read and write request.
However, in the following cases the data is read from the config servers: when a mongos starts (or restarts) and needs to load the cluster metadata, and when the cluster metadata has changed, for example after a chunk split or chunk migration, and the mongos needs the updated state.