Monday 5 March 2018

MongoDB concept part 3

The people viewing the blog can comment on the blog posts.

6.3.1 Relational Data Modeling and Normalization

Before jumping into MongoDB’s approach, let’s take a little detour into how you would model this in a relational (SQL) database.
In relational databases, the data modelling typically progresses by defining the tables and gradually removing data redundancy to achieve a normal form.

6.3.1.1 What Is a Normal Form?

In relational databases, normalization typically begins by creating tables as per the application requirements and then gradually removing redundancy to achieve the highest normal form, also termed the third normal form or 3NF. In order to understand this better, let’s put the blogging application data in tabular form. The initial data is shown in Figure 6-2.



Figure 6-2. Blogging application initial data
This data is actually in the first normal form. It contains lots of redundancy because a post can have multiple comments and multiple tags can be associated with the post. The problem with redundancy, of course, is that it introduces the possibility of inconsistency, where various copies of the same data may have different values. To remove this redundancy, you need to further normalize the data by splitting it into multiple tables. As part of this step, you must identify a key column that uniquely identifies each row in a table so that you can create links between the tables. The above scenario, when modeled in 3NF, will look like the RDBMS diagram shown in Figure 6-3.



Figure 6-3. RDBMS diagram
In this case, you have a data model that is free of redundancy, allowing you to update it without having to worry about updating multiple rows. In particular, you no longer need to worry about inconsistency in the data model.

6.3.1.2 The Problem with Normal Forms

As mentioned, the nice thing about normalization is that it allows for easy updating without any redundancy (i.e. it helps keep the data consistent). Updating a user name means updating the name in the Users table.
However, a problem arises when you try to get the data back out. For instance, to find all tags and comments associated with posts by a specific user, the relational database programmer uses a JOIN. By using a JOIN, the database returns all data as per the application screen design, but the real problem is what operation the database performs to get that result set.
Generally, when an RDBMS reads a row from disk, the seek accounts for well over 99% of the time spent. When it comes to disk access, random seeks are the enemy. The reason this is so important in this context is that JOINs typically require random seeks. The JOIN operation is one of the most expensive operations within a relational database. Additionally, if you end up needing to scale your database to multiple servers, you introduce the problem of generating a distributed join, a complex and generally slow operation.

6.3.2 MongoDB Document Data Model Approach

As you know, in MongoDB, data is stored in documents. Fortunately for us as application designers, this opens up some new possibilities in schema design. Unfortunately for us, it also complicates our schema design process. Now when faced with a schema design problem there’s no longer a fixed path of normalized database design, as there is with relational databases. In MongoDB, the schema design depends on the problem you are trying to solve.
If you have to model the above using the MongoDB document model, you might store the blog data in a document as follows:
{
    "_id" : ObjectId("509d27069cc1ae293b36928d"),
    "title" : "Sample title",
    "body" : "Sample text.",
    "tags" : [
        "Tag1",
        "Tag2",
        "Tag3",
        "Tag4"
    ],
    "created_date" : ISODate("2015-07-06T12:41:39.110Z"),
    "author" : "Author 1",
    "category_id" : ObjectId("509d29709cc1ae293b369295"),
    "comments" : [
        {
            "subject" : "Sample comment",
            "body" : "Comment Body",
            "author" : "author 2",
            "created_date" : ISODate("2015-07-06T13:34:23.929Z")
        }
    ]
}
As you can see, you have embedded the comments and tags within a single document only. Alternatively, you could “normalize” the model a bit by referencing the comments and tags by the _id field:
// Authors document:
{
    "_id": ObjectId("509d280e9cc1ae293b36928e"),
    "name": "Author 1",
    .....
}
// Tags document:
{
    "_id": ObjectId("509d35349cc1ae293b369299"),
    "TagName": "Tag1",
    .....
}
// Comments document:
{
    "_id": ObjectId("509d359a9cc1ae293b3692a0"),
    "Author": ObjectId("508d27069cc1ae293b36928d"),
    .....
    "created_date" : ISODate("2015-07-06T13:34:59.336Z")
}
// Category document:
{
    "_id": ObjectId("509d29709cc1ae293b369295"),
    "Category": "Category1",
    .....
}
// Posts document:
{
    "_id" : ObjectId("509d27069cc1ae293b36928d"),
    "title" : "Sample title",
    "body" : "Sample text.",
    "tags" : [
        ObjectId("509d35349cc1ae293b369299"),
        ObjectId("509d35349cc1ae293b36929c")
    ],
    "created_date" : ISODate("2015-07-06T13:41:39.110Z"),
    "author_id" : ObjectId("509d280e9cc1ae293b36928e"),
    "category_id" : ObjectId("509d29709cc1ae293b369295"),
    "comments" : [
        ObjectId("509d359a9cc1ae293b3692a0")
    ]
}
The remainder of this chapter is devoted to identifying which solution will work in your context (i.e. whether to use referencing or whether to embed).

6.3.2.1 Embedding

In this section, you will see whether embedding will have a positive impact on performance. Embedding can be useful when you want to fetch some set of data and display it on the screen, such as a page that displays the comments associated with a blog post; in this case, the comments can be embedded in the Blogs document.
The benefit of this approach is that since MongoDB stores the documents contiguously on disk, all the related data can be fetched in a single seek.
Apart from this, since JOINs are not supported, if you instead used referencing, the application would have to do something like the following to fetch the comments data associated with the blog.
1.
Fetch the associated comments_id from the blogs document.

2.
Fetch the comments document based on the comments_id found in the first step.

If you take this approach, which is referencing, not only does the database have to do multiple seeks to find your data, but additional latency is introduced into the lookup since it now takes two round trips to the database to retrieve your data.
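For illustration, the two round trips in such a referenced design might look like the following mongo shell sketch (the collection and field names here are assumptions, not a prescribed schema):
// Step 1: fetch the blog post along with its list of comment references
var post = db.posts.findOne({"_id": ObjectId("509d27069cc1ae293b36928d")});
// Step 2: fetch the referenced comments in a second round trip
var comments = db.comments.find({"_id": {$in: post.comments}}).toArray();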
If the application frequently accesses the comments data along with the blogs, then almost certainly embedding the comments within the blog documents will have a positive impact on the performance.
Another concern that weighs in favor of embedding is the desire for atomicity and isolation when writing data. MongoDB is designed without multi-document transactions. In MongoDB, atomicity is provided only at the single-document level, so data that needs to be updated together atomically needs to be placed together in a single document.
When you update data in your database, you must ensure that your update either succeeds or fails entirely, never having a “partial success,” and that no other database reader ever sees an incomplete write operation.

6.3.2.2 Referencing

You have seen that embedding is the approach that will provide the best performance in many cases; it also provides data consistency guarantees. However, in some cases, a more normalized model works better in MongoDB.
One reason for having multiple collections and adding references is the increased flexibility it gives when querying the data. Let’s understand this with the blogging example mentioned above.
You saw how to use embedded schema, which will work very well when displaying all the data together on a single page (i.e. the page that displays the blog post followed by all of the associated comments).
Now suppose you have a requirement to search for the comments posted by a particular user. The query (using this embedded schema) would be as follows:
db.posts.find({'comments.author': 'author2'},{'comments': 1})
The result of this query, then, would be documents of the following form:
{
    "_id" : ObjectId("509d27069cc1ae293b36928d"),
    "comments" : [
        {
            "subject" : "Sample Comment 1",
            "body" : "Comment1 Body.",
            "author" : "author2",
            "created_date" : ISODate("2015-07-06T13:34:23.929Z")
        }, ...
    ]
}
{
    "_id" : ObjectId("509d27069cc1ae293b36928d"),
    "comments" : [
        {
            "subject" : "Sample Comment 2",
            "body" : "Comment2 Body.",
            "author" : "author2",
            "created_date" : ISODate("2015-07-06T13:34:23.929Z")
        }, ...
    ]
}
The major drawback to this approach is that you get back much more data than you actually need. In particular, you can’t ask for just author2’s comments; you have to ask for posts that author2 has commented on, which includes all of the other comments on those posts as well. This data will require further filtering within the application code.
On the other hand, suppose you decide to use a normalized schema. In this case you will have three documents: “Authors,” “Posts,” and “Comments.”
The “Authors” document will have Author-specific content such as Name, Age, Gender, etc., and the “Posts” document will have posts-specific details such as post creation time, author of the post, actual content, and the subject of the post.
The “Comments” document will have the post’s comments such as CommentedOn date time, created by author, and the text of the comment. This is depicted as follows:
// Authors document:
{
    "_id": ObjectId("508d280e9cc1ae293b36928e"),
    "name": "Jenny",
    ..........
}
// Posts document:
{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    ....................
}
// Comments document:
{
    "_id": ObjectId("508d359a9cc1ae293b3692a0"),
    "Author": ObjectId("508d27069cc1ae293b36928d"),
    "created_date" : ISODate("2015-07-06T13:34:59.336Z"),
    "Post_id": ObjectId("508d27069cc1ae293b36928d"),
    ..........
}
In this scenario, the query to find the comments by “author2” can be fulfilled by a simple find() on the comments collection:
db.comments.find({"author": "author2"})
In general, if your application’s query pattern is well known, and data tends to be accessed in only one way, an embedded approach works well. Alternatively, if your application may query data in many different ways, or you are not able to anticipate the patterns in which data may be queried, a more “normalized” approach may be better.
For instance, in the above schema, you will be able to sort the comments or return a more restricted set of comments using the limit, skip operators. In the embedded case, you’re stuck retrieving all the comments in the same order in which they are stored in the post.
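For example, with a separate comments collection, a query roughly like the following (the field names are illustrative) can sort and page through a single author's comments:
// Return the second page of 10 comments by author2, newest first
db.comments.find({"author": "author2"}).sort({"created_date": -1}).skip(10).limit(10)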
Another factor that may weigh in favor of using document references is when you have one-to-many relationships.
For instance, a popular blog with a large amount of reader engagement may have hundreds or even thousands of comments for a given post. In this case, embedding carries significant penalties with it:
  • Effect on read performance: As the document size increases, it will occupy more memory. The problem with memory is that a MongoDB database caches frequently accessed documents in memory, and the larger the documents become, the lesser the probability of them fitting into memory. This will lead to more page faults while retrieving the documents, which will lead to random disk I/O, which will further slow down the performance.
  • Effect on update performance: As the size increases and an update operation is performed on such documents to append data, eventually MongoDB is going to need to move the document to an area with more space available. This movement, when it happens, significantly slows update performance.
Apart from this, MongoDB documents have a hard size limit of 16MB. Although this is something to be aware of, you will usually run into problems due to memory pressure and document copying well before you reach the 16MB size limit.
One final factor that weighs in favor of using document references is the case of many-to-many or M:N relationships.
For instance, in the above example, there are tags. Each blog can have multiple tags and each tag can be associated to multiple blog entries.
One approach to implement the blogs-tags M:N relationship is to have the following three collections:
  • The Tags collection, which will store the tags details
  • The Blogs collection, which will have blogs details
  • A third collection, called Tag-To-Blog Mapping, which will map between the tags and the blogs
This approach is similar to the one in relational databases, but this will negatively impact the application’s performance because the queries will end up doing a lot of application-level “joins.”
Alternatively, you can use the embedding model where you embed the tags within the blogs document, but this will lead to data duplication. Although this will simplify the read operation a bit, it will increase the complexity of the update operation, because while updating a tag detail, the user needs to ensure that the updated tag is updated at each and every place where it has been embedded in other blog documents.
Hence for many-to-many joins, a compromise approach is often best, embedding a list of _id values rather than the full document:
// Tags document:
{
    "_id": ObjectId("508d35349cc1ae293b369299"),
    "TagName": "Tag1",
    ..........
}
// Posts document with tag IDs added as references:
{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    "tags" : [
        ObjectId("509d35349cc1ae293b369299"),
        ObjectId("509d35349cc1ae293b36929a"),
        ObjectId("509d35349cc1ae293b36929b"),
        ObjectId("509d35349cc1ae293b36929c")
    ],
    ....................................
}
Although querying will be a bit complicated, you no longer need to worry about updating a tag everywhere.
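For illustration, resolving the referenced tags then becomes an application-level lookup, sketched below (the collection names are assumptions):
// Fetch the post, then fetch the tag documents it references
var post = db.posts.findOne({"_id": ObjectId("508d27069cc1ae293b36928d")});
var tagDocs = db.tags.find({"_id": {$in: post.tags}}).toArray();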
In summary, schema design in MongoDB is one of the very early decisions that you need to make, and it is dependent on the application requirements and queries.
As you have seen, when you need to access the data together or you need to make atomic updates, embedding will have a positive impact. However, if you need more flexibility while querying or if you have a many-to-many relationships, using references is a good choice.
Ultimately, the decision depends on the access patterns of your application, and there are no hard-and-fast rules in MongoDB. In the next section, you will learn about various data modelling considerations.

6.3.2.3 Decisions of Data Modelling

This involves deciding how to structure the documents so that the data is modeled effectively. An important point to decide is whether you need to embed the data or use references to the data (i.e. whether to use embedding or referencing).
This point is best demonstrated with an example. Suppose you have a book review site which has authors and books as well as reviews with threaded comments.
Now the question is how to structure the collections. The decision depends on the number of comments expected per book and how frequently read vs. write operations will be performed.

6.3.2.4 Operational Considerations

In addition to the way the elements interact with each other (i.e. whether to store the documents in an embedded manner or use references), a number of other operational factors are important when designing a data model for the application. These factors are covered in the following sections.
Data Lifecycle Management
This feature needs to be used if your application has datasets that need to be persisted in the database only for a limited time period.
Say you need to retain the data related to reviews and comments for only a month; this is where the feature comes into play.
This is implemented by using the Time to Live (TTL) feature of the collection. The TTL feature of the collection ensures that the documents are expired after a period of time.
Additionally, if the application requirement is to work with only the recently inserted documents, using capped collections will help optimize the performance.
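As a rough sketch, a TTL index and a capped collection could be set up as follows (the collection names, field name, and sizes are assumptions for this example):
// Expire comments roughly 30 days after their created_date
db.comments.createIndex({"created_date": 1}, {expireAfterSeconds: 2592000})
// A capped collection that retains only the most recently inserted documents
db.createCollection("recent_reviews", {capped: true, size: 1048576})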
Indexes
Indexes can be created to support commonly used queries to increase the performance. By default, an index is created by MongoDB on the _id field.
The following are a few points to consider when creating indexes:
  • At least 8KB of data space is required by each index.
  • For write operations, an index addition has some negative performance impact. Hence for collections with heavy writes, indexes might be expensive because for each insert, the keys must be added to all the indexes.
  • Indexes are beneficial for collections with heavy read operations, such as where the proportion of read to write operations is high. Read operations that do not use an index are not affected by it.
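For example, an index to support the earlier query for comments by a particular author could be created as follows (the field name is assumed from the earlier examples):
db.comments.createIndex({"author": 1})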
Sharding
One of the important factors when designing the application model is whether to partition the data or not. This is implemented using sharding in MongoDB.
Sharding is also referred to as partitioning of data. In MongoDB, a collection is partitioned with its documents distributed across a cluster of machines, which are referred to as shards. This can have a significant impact on performance. We will discuss sharding in more detail in the next chapter.
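As a preview, sharding a collection involves enabling sharding on the database and choosing a shard key, roughly as follows (the database, collection, and shard key shown are illustrative):
sh.enableSharding("blogdb")
sh.shardCollection("blogdb.posts", {"author_id": 1})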
A Large Number of Collections
The design considerations for having multiple collections vs. storing data in a single collection are the following:
  • There is no performance penalty in choosing multiple collections for storing data.
  • Having distinct collections for different types of data can have performance improvements in high-throughput batch processing applications.
When you are designing models that have a large number of collections, you need to take into consideration the following behaviors:
  • A certain minimum overhead of a few kilobytes is associated with each collection.
  • At least 8KB of data space is required by each index, including the _id index.
You know by now that the metadata for each database is stored in the <database>.ns file. Each collection and index has its own entry in the namespace file, so you need to consider the limits on the size of namespace files when deciding to implement a large number of collections.
Growth of the Document
Some updates, such as pushing an element to an array, adding new fields, and so on, can increase the document size, which can lead to the document being moved from one slot to another in order to fit. This process of document relocation is both resource and time consuming. Although MongoDB provides padding to minimize the relocation occurrences, you may need to handle document growth manually.

6.4 Summary

In this chapter you learned the basic CRUD operations plus advanced querying capabilities. You also examined the two ways of storing and retrieving data: embedding and referencing.
In the following chapter, you will learn about the MongoDB architecture, its core components, and features.















7. MongoDB Architecture




“This chapter covers the deep-dive architectural concepts of MongoDB.”
In this chapter, you will learn about the MongoDB architecture, especially core processes and tools, standalone deployment, sharding concepts, replication concepts, and production deployment.

7.1 Core Processes

The core components in the MongoDB package are
  • mongod, which is the core database process
  • mongos, which is the controller and query router for sharded clusters
  • mongo, which is the interactive MongoDB shell
These components are available as applications under the bin folder. Let’s discuss these components in detail.

7.1.1 mongod

The primary daemon in a MongoDB system is known as mongod . This daemon handles all the data requests, manages the data format, and performs operations for background management.
When mongod is run without any arguments, it uses the default data directory, which is C:\data\db on Windows or /data/db on Linux, and the default port 27017, where it listens for socket connections.
It’s important to ensure that the data directory exists and you have write permissions to the directory before the mongod process is started.
If the directory doesn’t exist or you don’t have write permissions on the directory, the start of this process will fail. If the default port 27017 is not available, the server will fail to start.
mongod also has an HTTP server that listens on a port 1,000 higher than the main port, so if you start mongod on the default port 27017, the HTTP server will be on port 28017 and accessible using the URL http://localhost:28017. This basic HTTP server provides administrative information about the database.

7.1.2 mongo

mongo provides an interactive JavaScript interface for developers to test queries and operations directly on the database and for system administrators to manage the database. This is all done via the command line. When the mongo shell is started, it connects to the default database, called test. This database connection value is assigned to the global variable db.
As a developer or administrator, you need to switch from test to your own database after the first connection is made. You can do this with the use <databasename> command.
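For example (the database name here is illustrative):
use myblogdb
db    // prints the name of the current database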

7.1.3 mongos

mongos is used in MongoDB sharding. It acts as a routing service that processes queries from the application layer and determines where in the sharded cluster the requested data is located.
We will discuss mongos in more detail in the sharding section. Right now you can think of mongos as the process that routes the queries to the correct server holding the data.

7.2 MongoDB Tools

Apart from the core services, there are various tools that are available as part of the MongoDB installation:
  • mongodump: This utility is used as part of an effective backup strategy. It creates a binary export of the database contents.
  • mongorestore: The binary database dump created by the mongodump utility is imported to a new or an existing database using the mongorestore utility.
  • bsondump: This utility converts BSON files into a human-readable format such as JSON. For example, this utility can be used to read the output files generated by mongodump.
  • mongoimport, mongoexport: mongoimport provides a method for taking data in JSON, CSV, or TSV formats and importing it into a mongod instance. mongoexport provides a method to export data from a mongod instance into JSON, CSV, or TSV formats.
  • mongostat, mongotop, mongosniff: These utilities provide diagnostic information related to the current operation of a mongod instance.

7.3 Standalone Deployment

Standalone deployment is used for development purposes; it doesn’t provide any redundancy of data, nor does it provide recovery in case of failures, so it’s not recommended for use in a production environment. A standalone deployment has the following components: a single mongod and a client connecting to the mongod, as shown in Figure 7-1.



Figure 7-1. Standalone deployment
MongoDB uses sharding and replication to provide a highly available system by distributing and duplicating the data. In the coming sections, you will look at sharding and replication. Following that you’ll look at the recommended production deployment architecture.

7.4 Replication

In a standalone deployment, if the mongod is not available, you risk losing all the data, which is not acceptable in a production environment. Replication is used to offer safety against such data loss.
Replication provides for data redundancy by replicating data on different nodes, thereby providing protection of data in case of node failure. Replication provides high availability in a MongoDB deployment.
Replication also simplifies certain administrative tasks where the routine tasks such as backups can be offloaded to the replica copies, freeing the main copy to handle the important application requests.
In some scenarios, it can also help in scaling the reads by enabling the client to read from the different copies of data.
In this section, you will learn how replication works in MongoDB and its various components. There are two types of replication supported in MongoDB: traditional master/slave replication and replica set.

7.4.1 Master/Slave Replication

In MongoDB, the traditional master/slave replication is still available, but it is recommended only for deployments that need more nodes than a replica set supports (more than 50). The preferred replication approach is replica sets, which we will explain later. In this type of replication, there is one master and a number of slaves that replicate the data from the master. The only advantage of this type of replication is that there’s no restriction on the number of slaves within a cluster. However, thousands of slaves would overburden the master node, so in practical scenarios it’s better to have fewer than a dozen slaves. In addition, this type of replication doesn’t automate failover and provides less redundancy.
In a basic master/slave setup , you have two types of mongod instances: one instance is in the master mode and the remaining are in the slave mode, as shown in Figure 7-2. Since the slaves are replicating from the master, all slaves need to be aware of the master’s address.



Figure 7-2. Master/slave replication
The master node maintains a capped collection (oplog) that stores an ordered history of logical writes to the database.
The slaves replicate the data using this oplog collection. Since the oplog is a capped collection, if the slave’s state is far behind the master’s state, the slave may become out of sync. In that scenario, the replication will stop and manual intervention will be needed to re-establish the replication.
There are two main reasons behind a slave becoming out of sync:
  • The slave shuts down or stops and restarts later. During this time, the operations required to bring the slave up to date may have rolled off the master’s oplog.
  • The slave is slow in executing the updates that are available from the master.

7.4.2 Replica Set

The replica set is a sophisticated form of the traditional master-slave replication and is a recommended method in MongoDB deployments.
Replica sets are basically a type of master-slave replication, but they provide automatic failover. A replica set has one master, termed the primary, and multiple slaves, termed secondaries in the replica set context; however, unlike master-slave replication, no single node is fixed to be the primary in a replica set.
If the primary goes down in a replica set, one of the secondary nodes is automatically promoted to primary. The clients start connecting to the new primary, and both data and application remain available. In a replica set, this failover happens in an automated fashion. We will explain the details of how this process happens later.
The primary node is selected through an election mechanism. If the primary goes down, a new primary is elected from the remaining members.
Figure 7-3 shows how a two-member replica set failover happens. Let’s discuss the various steps that happen for a two-member replica set in failover .



Figure 7-3. Two-member replica set failover
1.
The primary goes down, and the secondary is promoted as primary.

2.
The original primary comes back up, rejoins the set, and acts as a secondary node.

The points to be noted are
  • A replica set is a cluster of mongod instances that replicate among one another and ensure automatic failover.
  • In the replica set, one mongod will be the primary member and the others will be secondary members.
  • The primary member is elected by the members of the replica set. All writes are directed to the primary member whereas the secondary members replicate from the primary asynchronously using oplog.
  • The secondary’s data sets reflect the primary data sets, enabling them to be promoted to primary in case of unavailability of the current primary.
Replica set replication has a limitation on the number of members. Prior to version 3.0, the limit was 12, but this has been changed to 50 in version 3.0. So a replica set can now have a maximum of 50 members, and at any given point of time only 7 of them can be voting members. We will explain the voting concept in a replica set in detail.
Starting from Version 3.0, replica set members can use different storage engines. For example, the WiredTiger storage engine might be used by the secondary members whereas the MMAPv1 engine could be used by the primary. In the coming sections, you will look at the different storage engines available with MongoDB.

7.4.2.1 Primary and Secondary Members

Before you move ahead and look at how the replica set functions, let’s look at the type of members that a replica set can have. There are two types of members: primary members and secondary members .
  • Primary member: A replica set can have only one primary, which is elected by the voting nodes in the replica set. Any node with a nonzero priority can be elected as primary. The client directs all write operations to the primary member, and they are later replicated to the secondary members.
  • Secondary member: A normal secondary member holds the copy of the data. The secondary member can vote and also can be a candidate for being promoted to primary in case of failover of the current primary.
In addition to this, a replica set can have other types of secondary members.

7.4.2.2 Types of Secondary Members

Priority 0 members are secondary members that maintain the primary’s data copy but can never become a primary in case of a failover. Apart from that, they function as a normal secondary node, and they can participate in voting and can accept read requests. The Priority 0 members are created by setting the priority to 0.
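A minimal sketch of such a reconfiguration in the mongo shell (the member index is illustrative):
cfg = rs.conf()
cfg.members[2].priority = 0
rs.reconfig(cfg)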
Such types of members are specifically useful for the following reasons:
1.
They can serve as a cold standby.

2.
In replica sets with varied hardware or geographic distribution, this configuration ensures that only the qualified members get elected as primary.

3.
In a replica set that spans multiple data centers across network partitioning, this configuration can help ensure that the main data center has the eligible primary. This is used to ensure that the failover is quick.

Hidden members are priority-0 members that are hidden from the client applications. Like other priority-0 members, a hidden member maintains a copy of the primary’s data, cannot become the primary, and can participate in voting, but unlike ordinary priority-0 members, it can’t serve any read requests or receive any traffic beyond what replication requires. A node can be set as a hidden member by setting its hidden property to true. In a replica set, these members can be dedicated to reporting needs or backups.
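A hidden member might be configured roughly as follows (the member index is illustrative):
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
rs.reconfig(cfg)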
Delayed members are secondary members that replicate data with a delay from the primary’s oplog. This helps to recover from human errors, such as accidentally dropped databases or errors that were caused by unsuccessful application upgrades.
When deciding on the delay time, consider your maintenance period and the size of the oplog. The delay time should be either equal to or greater than the maintenance window and the oplog size should be set in a manner to ensure that no operations are lost while replicating.
Note that since the delayed members will not have up-to-date data as the primary node, the priority should be set to 0 so that they cannot become primary. Also, the hidden property should be true in order to avoid any read requests.
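A sketch of a delayed member configuration, lagging one hour behind the primary (the member index and delay value are illustrative; slaveDelay is the setting name used by the MongoDB versions this text covers):
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].slaveDelay = 3600
rs.reconfig(cfg)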
Arbiters are members that do not hold a copy of the primary’s data, so they can never become the primary. They are used solely to participate in voting. This enables the replica set to have an odd number of voting members without incurring the storage and replication cost of an additional data-bearing node.
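An arbiter can be added to a running replica set with the rs.addArb() helper (the hostname and port are illustrative):
rs.addArb("arbiter.example.net:30000")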
Non-voting members hold the primary’s data copy, they can accept client read operations, and they can also become the primary, but they cannot vote in an election.
The voting ability of a member can be disabled by setting its votes to 0. By default every member has one vote. Say you have a replica set with seven members. Using the following commands in mongo shell, the votes for fourth, fifth, and sixth member are set to 0:
cfg_1 = rs.conf()
cfg_1.members[3].votes = 0
cfg_1.members[4].votes = 0
cfg_1.members[5].votes = 0
rs.reconfig(cfg_1)
Although this setting allows the fourth, fifth, and sixth members to be elected as primary, when voting their votes will not be counted. They become non-voting members, which means they can stand for election but cannot vote themselves.
You will see how the members can be configured later in this chapter.

7.4.2.3 Elections

In this section, you will look at the election process for selecting a primary member. In order to get elected, a server needs not just a majority of the votes cast, but a majority of the total votes in the set.
If there are X servers with each server having 1 vote, then a server can become primary only when it has at least [(X/2) + 1] votes.
If a server gets the required number of votes or more, then it will become primary.
The primary that went down still remains part of the set; when it is up, it will act as a secondary server until the time it gets a majority of votes again.
The complication with this type of voting system is that you cannot have just two nodes acting as master and slave. In this scenario, you will have total of two votes, and to become a master, a node will need the majority of votes, which will be both of the votes in this case. If one of the servers goes down, the other server will end up having one vote out of two, and it will never be promoted as master, so it will remain a slave.
In the case of a network partition, the master will lose the majority of votes, since it will have only its own vote, and it will be demoted to a slave. The node acting as a slave will also remain a slave in the absence of a majority of the votes. You will end up with two slaves until both servers can reach each other again.
A replica set has number of ways to avoid such situations. The simplest way is to use an arbiter to help resolve such conflicts. It’s very lightweight and is just a voter, so it can run on either of the servers itself.
Let’s now see how the above scenario will change with the use of an arbiter. Let’s first consider the network partitioning scenario. If you have a master, a slave, and an arbiter, each has one vote, totalling three votes. If a network partition occurs with the master and arbiter in one data center and the slave in another data center, the master will remain master since it will still have the majority of votes.
If the master fails with no network partitioning, the slave can be promoted to master because it will have two votes (slave + arbiter).
This three-server setup provides a robust failover deployment.
Example - Working of Election Process in More Details
This section will explain how the election happens.
Let’s assume you have a replica set with the following three members: A1, B1, and C1. Each member exchanges a heartbeat request with the other members every few seconds, and the members respond with their current status. A1 sends out heartbeat requests to B1 and C1; B1 and C1 respond with their current status information, such as the state they are in (primary or secondary), their current clock time, their eligibility to be promoted to primary, and so on. A1 receives all this information and updates its “map” of the set, which tracks details such as members that have changed state, members that have gone down or come up, and the round-trip time.
While updating its map, A1 will check a few things depending on its state:
  • If A1 is primary and one of the members has gone down, then it will ensure that it’s still able to reach the majority of the set. If it’s not able to do so, it will demote itself to secondary state.
  • Demotions: There’s a problem when A1 undergoes a demotion. By default in MongoDB writes are fire-and-forget (i.e. the client issues the writes but doesn’t wait for a response). If an application is doing the default writes when the primary is stepping down, it will never realize that the writes are actually not happening and might end up losing data. Hence it’s recommended to use safe writes. In this scenario, when the primary is stepping down, it closes all its client connections, which will result in socket errors to the clients. The client libraries then need to recheck who the new primary is and will be saved from losing their write operations data.
  • If A1 is a secondary and if the map has not changed, it will occasionally check whether it should elect itself.
  • The first task A1 will do is run a sanity check, where it asks a few questions such as: Does A1 already think it’s primary? Does another member think it’s primary? Is A1 ineligible for election? If the answer to any of these questions is yes, A1 will continue idling as is; otherwise, it will proceed with the election process:
    • A1 sends a message to the other members of the set, which in this case are B1 and C1, saying “I am planning to become a primary. Please suggest”
    • When B1 and C1 receive the message, they will check the view around them. They will run through a big list of sanity checks, such as, Is there any other node that can be primary? Does A1 have the most recent data or is there any other node that has the most recent data? If all the checks seem ok, they send a “go-ahead” message; however, if any of the checks fail, a “stop election” message is sent.
    • If any of the members send a “stop election” reply, the election is cancelled and A1 remains a secondary member.
    • If the “go-ahead” is received from all, A1 goes to the election process final phase.
  • In the second (final) phase,
    • A1 resends a message declaring its candidacy for the election to the remaining members.
    • Members B1 and C1 do a final check to ensure that all the answers still hold true as before.
    • If yes, A1 is allowed to take its election lock, which prevents its voting capabilities for 30 seconds and sends back a vote.
    • If any of the checks fail to hold true, a veto is sent.
    • If any veto is received, the election stops.
    • If no one vetoes and A1 gets a majority of the votes, it becomes a primary.
The election is affected by the priority settings. A 0 priority member can never become a primary.

7.4.2.4 Data Replication Process

Let’s look at how the data replication works. The members of a replica set replicate data continuously. Every member, including the primary member, maintains an oplog. An oplog is a capped collection where the members maintain a record of all the operations that are performed on the data set.
The secondary members copy the primary member’s oplog and apply all the operations in an asynchronous manner.
Oplog
Oplog stands for the operation log . An oplog is a capped collection where all the operations that modify the data are recorded.
The oplog is maintained in a special database named local, in the collection oplog.$main (for master/slave replication) or oplog.rs (for replica sets). Every operation is maintained as a document, where each document corresponds to one operation performed on the primary (or master) server. The document contains various keys, including the following:
  • ts: This stores the timestamp when the operations are performed. It’s an internal type and is composed of a 4-byte timestamp and a 4-byte incrementing counter.
  • op: This stores information about the type of operation performed. The value is stored as a 1-byte code (e.g. it will store an “i” for an insert operation).
  • ns: This key stores the collection namespace on which the operation was performed.
  • o: This key specifies the operation that is performed. In case of an insert, this will store the document to insert.
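For illustration, an insert’s oplog entry might look roughly like this (the values are made up, and real entries carry additional internal fields):
{
    "ts" : Timestamp(1436190883, 1),
    "op" : "i",
    "ns" : "blogdb.posts",
    "o" : { "_id" : ObjectId("509d27069cc1ae293b36928d"), "title" : "Sample title" }
}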
Only operations that change the data are maintained in the oplog because it’s a mechanism for ensuring that the secondary node data is in sync with the primary node data.
The operations that are stored in the oplog are transformed so that they remain idempotent, which means that even if it’s applied multiple times on the secondary, the secondary node data will remain consistent. Since the oplog is a capped collection, with every new addition of an operation, the oldest operations are automatically moved out. This is done to ensure that it does not grow beyond a pre-set bound, which is the oplog size.
Whenever a replica set member first starts up, MongoDB creates an oplog of a default size, which depends on the OS. By default, on 64-bit Linux and Windows systems, MongoDB allocates 5% of the available free disk space to the oplog; if that amount is smaller than 1GB, then 1GB of space is allocated.
Although the default size is sufficient in most cases, you can use the --oplogSize option to specify the oplog size in MB when starting the server.
The following workloads might require you to reconsider the oplog size:
  • Updates to multiple documents simultaneously: Since the operations need to be translated into idempotent operations, this scenario can end up requiring a great deal of oplog space.
  • Deletes and insertions happening at the same rate and involving the same amount of data: In this scenario, although the database size will not increase, the translation of the operations into idempotent operations can lead to a bigger oplog.
  • A large number of in-place updates: Although these updates will not change the database size, the recording of the updates as idempotent operations in the oplog can lead to a bigger oplog.
Initial Sync and Replication
Initial sync is done when the member is in either of the following two cases:
1.
The node has started for the first time (i.e. it’s a new node and has no data).

2.
The node has become stale, where the primary has overwritten the oplog and the node has not replicated the data. In this case, the data will be removed.

In both cases, the initial sync involves the following steps:
1.
First, all databases are cloned.

2.
Using oplog of the source node, the changes are applied to its dataset.

3.
Finally, the indexes are built on all the collections.

Post the initial sync, the replica set members continuously replicate the changes in order to be up-to-date.
Most of the synchronization happens from the primary, but chained replication can be enabled, in which case the sync happens from a secondary (i.e. the sync targets are chosen based on the ping time and the state of the other members’ replication).
Syncing – Normal Operation
In normal operations, the secondary chooses a member from where it will sync its data, and then the operations are pulled from the chosen source’s oplog collection (local.oplog.rs).
Once an operation (op) is fetched, the secondary does the following:
1.
It first applies the op to its data copy.

2.
Then it writes the op to its local oplog.

3.
Once the op is written to the oplog, it requests the next op.

Suppose it crashes between step 1 and step 2, and then it comes back again. In this scenario, it’ll assume the operation has not been performed and will re-apply it.
Since oplog ops are idempotent, the same operation can be applied any number of times, and every time the result document will be same.
For example, suppose you have the document {I:11} and an increment operation such as {$inc:{I:1}} is performed on it on the primary. Rather than recording the increment itself, the primary records the resulting value in its oplog, effectively {$set:{I:12}}. This entry is what the secondaries replicate, so the value remains the same even if the log entry is applied multiple times.
Starting Up
When a node is started, it checks its local collection to find out the lastOpTimeWritten. This is the time of the latest op that was applied on the secondary.
The following shell helper can be used to find the latest op in the shell:
> rs.debug.getLastOpWritten()
The output returns a field named ts, which depicts the last op time.
If a member starts up and finds the ts entry, it starts by choosing a target to sync from and it will start syncing as in a normal operation. However, if no entry is found, the node will begin the initial sync process.
Whom to Sync From?
In this section, you will look at how the source to sync from is chosen. As of version 2.0, servers automatically sync from the “nearest” node based on the average ping time.
When you bring up a new node, it sends heartbeats to all nodes and monitors the response time. Based on the data received, it then decides the member to sync from using the following algorithm:
for each healthy member loop:
    if the member's state is PRIMARY
        add the member to the possible sync target set
    if the member's lastOpTimeWritten is greater than the local lastOpTimeWritten
        add the member to the possible sync target set
set sync_from = MIN(ping time to members of the sync target set)
Note: A “healthy member” can be thought of as a “normal” primary or secondary member.
In version 2.0, delayed nodes were (debatably) included among the “healthy” nodes. Starting from version 2.2, delayed nodes and hidden nodes are excluded from the “healthy” nodes.
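For illustration only, the selection logic above can be sketched in JavaScript as follows (the member objects and their fields are assumptions, not MongoDB internals):
function chooseSyncSource(self, members) {
    // consider only healthy members that are either primary or ahead of us
    var candidates = members.filter(function (m) {
        return m.healthy &&
               (m.state === "PRIMARY" ||
                m.lastOpTimeWritten > self.lastOpTimeWritten);
    });
    if (candidates.length === 0) return null;
    // choose the candidate with the lowest average ping time
    return candidates.reduce(function (best, m) {
        return m.pingTime < best.pingTime ? m : best;
    });
}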
Running the following command will show the server that is chosen as the source for syncing:
db.adminCommand({replSetGetStatus:1})
The output field of syncingTo is present only on secondary nodes and provides information on the node from which it is syncing.
Making Writes Work with Chaining Slaves
You have seen that the above algorithm for choosing a source to sync from implies that slave chaining is semi-automatic. When a server is started, it’ll most probably choose a server within the same data center to sync from, thus reducing the WAN traffic.
However, this will never lead to a loop because the nodes will sync only from a secondary that has a latest value of lastOpTimeWritten which is greater than its own. You will never end up in a scenario where N1 is syncing from N2 and N2 is syncing from N1. It will always be either N1 is syncing from N2 or N2 is syncing from N1.
In this section, you will see how w (the write concern) works with slave chaining. If N1 is syncing from N2, which is in turn syncing from N3, how will N3 know up to which point N1 has synced?
When N1 starts its sync from N2, a special “handshake” message is sent, which informs N2 that N1 will be syncing from its oplog. Since N2 is not primary, it forwards the message to the node it is syncing from (i.e. it opens a connection to N3 pretending to be N1). By the end of this step, N2 has two connections open with N3: one connection for itself and the other for N1.
Whenever an op request is made by N1 to N2, the op is sent by N2 from its oplog and a dummy request is forwarded on the link of N1 to N3, as shown in Figure 7-4.



Figure 7-4. Writes via chaining slaves
Although this minimizes network traffic, it increases the absolute time for the write to reach all of the members.
