7.5.2 Data Distribution Process
You will next look at how the data is
distributed among the shards for the collections where sharding is
enabled. In MongoDB, the data is sharded or distributed at the
collection level. The collection is partitioned by the shard key.
7.5.2.1 Shard Key
Any indexed single or compound field that
exists within all documents of the collection can be a shard key.
You specify the field on whose basis the documents of the
collection are to be distributed. Internally, MongoDB divides the
documents into chunks based on the value of that field and
distributes the chunks across the shards.
There are two ways MongoDB enables
distribution of the data: range-based partitioning and hash-based
partitioning.
Range-Based Partitioning
In range-based partitioning, the shard key values are divided
into ranges. Say you consider a timestamp field as the shard key.
In this way of partitioning, the values are treated as a
straight line from a Min value to a Max value, where Min is
the starting period (say, 01/01/1970) and Max is the end period
(say, 12/31/9999). Every document in the collection will have a
timestamp value within this range only, and it will represent some
point on the line.
Based on the number of shards available,
the line will be divided into ranges, and documents will be
distributed based on them.
In this scheme of partitioning, shown in
Figure 7-17,
documents with nearby shard key values are
likely to fall on the same shard. This can significantly improve
the performance of range queries.
Figure 7-17.
Range-based partitioning
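For instance, sharding a collection on a timestamp-like field from the mongos shell could look like the following sketch (the namespace mydb.events and the field ts are hypothetical):

mongos> sh.enableSharding("mydb")                              // enable sharding for the database
mongos> db.getSiblingDB("mydb").events.createIndex({ ts: 1 })  // the shard key must be indexed
mongos> sh.shardCollection("mydb.events", { ts: 1 })           // range-based partitioning on ts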
Hash-Based Partitioning
In hash-based partitioning, the data is
distributed on the basis of the hash value of the shard key field. If
selected, this will lead to a more random distribution compared to
range-based partitioning.
It’s unlikely that documents with
close shard key values will be part of the same chunk. For example, for
ranges based on the hash of the _id field, there will be a
straight line of hash values, which will again be partitioned on the
basis of the number of shards. Each document will lie on one shard
or another depending on the hash of its value. See Figure 7-18.
Figure 7-18.
Hash-based partitioning
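A hashed shard key is declared in a similar way; a minimal sketch (the namespace mydb.logs is hypothetical):

mongos> sh.shardCollection("mydb.logs", { _id: "hashed" })   // documents are distributed by the hash of _id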
Chunks
The data is moved between the shards in
the form of chunks. The shard key range is further partitioned into
sub-ranges, which are termed chunks. See Figure 7-19.
Figure 7-19.
Chunks
For a sharded cluster, 64MB is the default
chunk size. In most situations, this is an apt size for chunk
splitting and migration.
Let’s discuss the execution of sharding
and chunks with an example. Say you have a blog posts collection
which is sharded on the field date. This implies that the
collection will be split up on the basis of the date field values.
Let’s assume further that you have three shards. In this
scenario the data might be distributed across shards as follows:
1.
2.
3.
4.
The application doesn’t need to be
sharding-aware. It can query the mongos as though it’s a normal
mongod.
Role of Config Servers in the Above Scenario
Consider a scenario where you start
getting insert requests for millions of documents with the date of
September 2009. In this case, Shard #2 begins to get overloaded.
The config server steps in once it
realizes that Shard #2 is becoming too big. It will split the data
on the shard and start migrating it to other shards. After the
migration is completed, it sends the updated status to the mongos.
So now Shard #2 has data from August 2009 until September 18, 2009
and Shard #3 contains data from September 19, 2009 until the end
of time.
When a new shard is added to the cluster,
it’s the config server’s responsibility to figure out what to
do with it. The data may need to be immediately migrated to the
new shard, or the new shard may need to be in reserve for some
time. In summary, the config servers are the brains. Whenever any
data is moved around, the config servers let the mongos know about
the final configuration so that the mongos can continue doing
proper routing.
7.5.3 Data Balancing Process
You will next look at how the cluster is
kept balanced (i.e. how MongoDB ensures that all the shards are
equally loaded).
The addition of new data, the modification
of existing data, or the addition or removal of servers can lead
to an imbalance in the data distribution. This means either one shard
is overloaded with more chunks while the other shards have fewer
chunks, or a chunk grows significantly larger than the other chunks.
7.5.3.1 Chunk Splitting
Chunk splitting is one of the processes
that ensures the chunks are of the specified size. As you have
seen, a shard key is chosen and it is used to identify how the
documents will be distributed across the shards. The documents are
further grouped into chunks of 64MB (the default, which is configurable)
and are stored on the shards based on the range each shard is hosting.
If the size of the chunk changes due to
an insert or update operation, and exceeds the default chunk size,
then the chunk is split into two smaller chunks by the mongos. See
Figure 7-20.
Figure 7-20.
Chunk splitting
This process keeps the chunks within a
shard at or below the specified size (i.e. it ensures
that no chunk exceeds the configured size).
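If a different chunk size is needed, it can be changed cluster-wide through the config database; a minimal sketch from the mongos shell (the 32MB value is just an example):

mongos> use config
mongos> db.settings.save({ _id: "chunksize", value: 32 })   // chunk size in MB for the whole cluster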
Insert and update operations trigger
splits. The split operation leads to modification of the data in
the config server, as the metadata is modified. Although splits
don't lead to migration of data, the operation can leave the
cluster unbalanced, with one shard having more chunks
than another.
7.5.3.2 Balancer
The balancer is the background process that
is used to ensure that all of the shards are equally loaded, or are
in a balanced state. This process manages chunk migrations.
Splitting of a chunk can cause an
imbalance, as can the addition or removal of documents. When the
cluster becomes imbalanced, the balancer is used to distribute the
data evenly again.
When one shard has more chunks than the
other shards, MongoDB balances the chunks across the shards
automatically. This process is
transparent to the application and to you.
Any of the
mongos within the cluster can initiate the balancer process. They
do so by acquiring a lock on the config database of the config
server, since balancing involves migration of chunks from one shard to
another, which changes the metadata and therefore the config server
database. The balancer process
can have a huge impact on database performance, so it can either
1.
Be configured
to start the migration only when the migration threshold has been
reached. The migration threshold is the difference between the
maximum and minimum number of chunks on the shards. The thresholds are shown
in Table 7-4.
2.
Be scheduled
to run only during a specific time window (the balancing window), so
that migrations do not interfere with production traffic.
The balancer
migrates one chunk at a time (see Figure 7-21)
and follows these steps:
1.
The moveChunk command is sent to the source shard.
2.
An internal
moveChunk command is started on the source where it creates the
copy of the documents within the chunk and queues it. In the
meantime, any operations for that chunk are routed to the source
by the mongos because the config database is not yet changed and
the source will be responsible for serving any read/write request
on that chunk.
3.
The destination shard builds any indexes required by the source
that do not already exist on the destination.
4.
The destination shard begins requesting documents in the chunk
and starts receiving copies of the data.
5.
After receiving the final document, a synchronization process
ensures that the destination has the changes made to the documents
while the migration was in progress.
6.
Once fully synchronized, the metadata on the config server is
updated with the chunk's new location, and the source shard
deletes its copy of the documents.
Figure 7-21.
Chunk migration
If, in the meantime, the balancer needs
to migrate additional chunks from the source shard, it can start
the new migration without waiting for the deletion step
of the current migration to finish.
In case of any error during the migration
process, the process is aborted by the balancer, leaving the
chunks on the original shard. On successful completion of the
process, the chunk data is removed from the original shard by
MongoDB.
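The balancer itself can be inspected and controlled from the mongos shell; a sketch (the balancing window times are illustrative):

mongos> sh.getBalancerState()         // true if the balancer is enabled
mongos> sh.setBalancerState(false)    // disable balancing, e.g. during maintenance
mongos> sh.setBalancerState(true)     // enable it again
mongos> use config
mongos> db.settings.update({ _id: "balancer" }, { $set: { activeWindow: { start: "23:00", stop: "06:00" } } }, { upsert: true })   // restrict chunk migrations to an off-peak window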
7.5.4 Operations
You will next look at how the read and
write operations are performed on the sharded cluster. As
mentioned, the config servers maintain the cluster metadata. This
data is stored in the config database. This data of the config
database is used by the mongos to service the application read and
write requests.
The data is cached by the mongos
instances, which is then used for routing write and read operations
to the shards. This way the config servers are not overburdened.
Whenever any operation is issued, the first
thing the mongos needs to do is identify the shards that will
serve the request. Since the shard key is used to distribute
data across the sharded cluster, if the operation uses the
shard key field, then specific shards can be targeted based on
it.
1.
2.
If a single update operation uses
employeeid for updating the document, the request will be routed
to the shard holding that employee data.
However, if the operation does not use the
shard key, then the request is broadcast to all the shards.
Generally, a multi-update or remove operation is targeted across the
entire cluster.
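As a sketch, assuming a collection sharded on employeeid (the collection name employees and the values are hypothetical):

mongos> db.employees.find({ employeeid: 1001 })             // uses the shard key: routed only to the shard owning that range
mongos> db.employees.find({ name: "John" })                 // no shard key in the query: broadcast to all shards
mongos> db.employees.find({ employeeid: 1001 }).explain()   // shows which shard(s) served the query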
While querying the data, there might be
scenarios where in addition to identifying the shards and getting
the data from them, the mongos might need to work on the data
returned from various shards before sending the final output to the
client.
Say an application has issued a find()
request with sort(). In this scenario, the mongos will pass the
$orderby option to the shards. The shards will fetch the data from
their data set and will send the result in an ordered manner. Once
the mongos has all the shard’s sorted data, it will perform an
incremental merge sort on the entire data and then return the final
output to the client.
Similar to sort() are operations such as
limit(), skip(), and the aggregation functions, which require the mongos to
perform additional processing after receiving the data from the shards and
before returning the final result set to the client.
The mongos consumes minimal system
resources and has no persistent state. So if the application's
requirement is simple find() queries that can be met solely by
the shards and need no manipulation at the mongos level, you can
run the mongos on the same system where your application servers
are running.
7.5.5 Implementing Sharding
You will keep the example simple by using
only two shards. In this configuration, you will be using the
services listed in Table 7-5.
Table 7-5.
Sharding Cluster Configuration
Component | Type | Port | Datafile path
---|---|---|---
Shard Controller | mongos | 27021 | -
Config Server | mongod | 27022 | C:\db1\config\data
Shard0 | mongod | 27023 | C:\db1\shard0\data
Shard1 | mongod | 27024 | C:\db1\shard1\data
1.
2.
3.
4.
5.
7.5.5.1 Setting Up the Shard Cluster
In order to set up the cluster , the
first step is to set up the configuration server. Enter the
following code in a new terminal window to create the data
directory for the config server and start the mongod:
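Based on the port and data file path in Table 7-5, the commands would look roughly like this sketch (exact flags can vary by MongoDB version):

C:\> mkdir C:\db1\config\data
C:\> cd c:\practicalmongodb\bin
C:\practicalmongodb\bin> mongod --configsvr --port 27022 --dbpath C:\db1\config\data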
2015-07-13T23:02:41.984-0700 I CONTROL
[initandlisten] MongoDB starting : pid=3084 port=27022
dbpath=C:\db1\config\data master=1 64-bit host=ANOC9
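Next, the mongos (shard controller) is started on port 27021 and pointed at the config server; a sketch, assuming the 1MB chunk size used in this example:

C:\practicalmongodb\bin> mongos --port 27021 --configdb ANOC9:27022 --chunkSize 1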
2015-07-13T23:06:07.246-0700 W SHARDING
running with 1 config server should be done only for testing
purposes and is not recommended for production
2015-07-13T23:09:07.464-0700 I SHARDING
[Balancer] distributed lock 'balancer/ ANOC9:27021:1429783567:41'
unlocked
If you switch to the window where the
config server has been started, you will find a registration of
the shard server to the config server.
In this example you have used a chunk size
of 1MB. Note that this is not ideal in a real-life scenario since
the size is less than 16MB (a document's maximum size). However,
this is just for demonstration purposes, since it creates the
necessary number of chunks without loading a large amount of data.
The chunkSize is 64MB by default unless otherwise specified.
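The two shard mongod instances would then be started along these lines (ports and data paths as in the startup logs below; a sketch, each in its own terminal window):

C:\> mkdir c:\db1\shard0\data
C:\practicalmongodb\bin> mongod --port 27023 --dbpath c:\db1\shard0\data

C:\> mkdir C:\db1\shard1\data
C:\practicalmongodb\bin> mongod --port 27024 --dbpath C:\db1\shard1\data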
2015-07-13T23:14:58.076-0700 I CONTROL
[initandlisten] MongoDB starting : pid=1996 port=27023
dbpath=c:\db1\shard0\data 64-bit host=ANOC9
2015-07-13T23:17:01.704-0700 I CONTROL
[initandlisten] MongoDB starting : pid=3672 port=27024
dbpath=C:\db1\shard1\data 64-bit host=ANOC9
All the servers relevant for the setup are
up and running by the end of the above step. The next step is to
add the shards information to the shard controller.
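Registering the two shards from the mongos console might look like this sketch (host name ANOC9 and ports from Table 7-5):

C:\practicalmongodb\bin> mongo --port 27021
mongos> sh.addShard("ANOC9:27023")   // add Shard0
mongos> sh.addShard("ANOC9:27024")   // add Shard1
mongos> sh.status()                  // verify that both shards are listed in the cluster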
7.5.5.2 Creating a Database and Shard Collection
In order to continue further with the
example, you will create a database named testdb and a collection
named testcollection, which you will be sharding on the key
testkey.
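Creating the database and enabling sharding on it from the mongos console could look like this sketch:

mongos> use testdb
mongos> db.createCollection("testcollection")   // create the collection that will be sharded
mongos> sh.enableSharding("testdb")              // enable sharding at the database level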
Next, specify the collection that needs to
be sharded and the key on which the collection will be sharded:
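A sketch of the shardCollection step (an index on the shard key is required, so one is created first):

mongos> db.testcollection.createIndex({ testkey: 1 })
mongos> sh.shardCollection("testdb.testcollection", { testkey: 1 })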
With the completion of the above steps you
now have a sharded cluster set up with all components up and
running. You have also created a database and enabled sharding on
the collection.
You will be using the import command to
load data in the testcollection. Connect to a new terminal window
and execute the following:
C:\practicalmongodb\bin>mongoimport
--host ANOC9 --port 27021 --db testdb --collection testcollection
--type csv --file c:\mongoimport.csv --headerline
The mongoimport.csv consists of two
fields. The first is the testkey, which is a randomly generated
number. The second field is a text field; it is used to ensure
that the documents occupy a sufficient number of chunks, making it
feasible to use the sharding mechanism.
This inserts 100,000 objects in the collection.
In order to verify whether the records were
inserted, connect to the mongo console of the mongos and
issue the following command:
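For instance, a count issued from the mongos console should reflect the import (a sketch):

C:\practicalmongodb\bin> mongo --port 27021
mongos> use testdb
mongos> db.testcollection.count()   // 100000 if all the imported documents were inserted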
Next, connect to the consoles of the two
shards (Shard0 and Shard1) and look at how the data is
distributed. Open a new terminal window and connect to Shard0’s
console:
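A sketch of such a check against Shard0 (the same can be repeated against Shard1 on port 27024):

C:\practicalmongodb\bin> mongo --port 27023
> use testdb
> db.testcollection.count()   // counts only the documents whose chunks reside on Shard0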
7.5.5.3 Adding a New Shard
You have a sharded cluster set up and you
also have sharded a collection and looked at how the data is
distributed amongst the shards. Next, you’ll add a new shard to
the cluster so that the load is spread out a little more.
You will be repeating the steps mentioned
above. Begin by creating a data directory for the new shard in a
new terminal window:
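Using port 27025 and the data path shown in the startup log below, the commands would look roughly like this sketch:

C:\> mkdir C:\db1\shard2\data
C:\practicalmongodb\bin> mongod --port 27025 --dbpath C:\db1\shard2\data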
2015-07-13T23:25:49.103-0700 I CONTROL
[initandlisten] MongoDB starting : pid=3744 port=27025
dbpath=C:\db1\shard2\data 64-bit host=ANOC9
Next, the new shard server will be added
to the shard cluster. In order to configure it, open the mongos
mongo console in a new terminal window:
Switch to the admin database and run the
addshard command. This command adds the shard server to the
sharded cluster.
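A sketch of that step (host name and port as used for the new mongod above):

C:\practicalmongodb\bin> mongo --port 27021
mongos> use admin
mongos> db.runCommand({ addShard: "ANOC9:27025" })   // register the new shard server with the cluster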
Next, check how the testcollection data is
distributed. Connect to the new shard’s console in a new
terminal window:
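A sketch of such a check on the new shard:

C:\practicalmongodb\bin> mongo --port 27025
> use testdb
> db.testcollection.count()   // grows over time as chunks are migrated to this shard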
Interestingly, the number of items in the
collection is slowly going up. The mongos is rebalancing the
cluster.
With time, the chunks will be migrated
from the shard servers Shard0 and Shard1 to the newly added shard
server, Shard2, so that the data is evenly distributed across all
the servers. Post completion of this process the config server
metadata is updated. This is an automatic process and it happens
even if there’s no new data addition in the testcollection. This
is one of the important factors you need to consider when deciding
on the chunk size.
7.5.5.4 Removing a Shard
In the following example, you will see
how to remove a shard server. For this example, you will be
removing the server you added in the above example.
In order to initiate the process, you
need to log on to the mongos console, switch to the admin db, and
execute the following command to remove the shard from the shard
cluster:
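A sketch of that step (the shard is identified by the name the cluster assigned to it, as shown by sh.status(); shard0002 is assumed here):

mongos> use admin
mongos> db.runCommand({ removeShard: "shard0002" })   // starts draining the shard's chunks to the remaining shards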
As you can see, the removeShard command
returns a message. One of the message fields is state, which
indicates the process state. The message also states that the
draining process has started. This is indicated by the field msg.
The response tells you the number of
chunks and databases that still need to be drained from the
server. If you reissue the command after the process has completed,
the output of the command will indicate the same.