8. MongoDB Explained
Shakuntala Gupta Edward and Navin Sabharwal
8.1 Data Storage Engine
In the previous chapter, you looked at the
core services that are deployed as part of MongoDB; you also looked
at replica sets and sharding. In this section, we will talk about
the data storage engine.
MongoDB uses MMAP as its default storage
engine. This engine works with memory-mapped files. Memory-mapped
files are data files that the operating system places in
memory using the mmap() system call. mmap is an OS feature that
maps a file on disk into virtual memory.
Virtual memory is not equivalent to
physical memory. Virtual memory is space on the computer’s hard
disk that is used in conjunction with physical RAM.
MongoDB uses memory-mapped files for all
data interaction and data management activity. As documents are
accessed, the corresponding data files are mapped into memory.
MongoDB lets the OS control the memory mapping and allocate the
maximum amount of RAM, which keeps the effort and code required at
the MongoDB level to a minimum. Caching follows LRU behavior,
wherein the least recently used pages are moved out of the working
set to disk, making room for newly and frequently used pages.
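You can observe this mapping from the mongo
shell: under MMAP-based engines, the mem section of the serverStatus
command reports (in MB) how much of the data is currently memory
mapped.

db.serverStatus().mem
// Illustrative shape of the output under MMAPv1 (values in MB):
// { "bits" : 64, "resident" : ..., "virtual" : ...,
//   "mapped" : ..., "mappedWithJournal" : ... }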
However, this method comes with its own
drawbacks. For instance, MongoDB has no control over what data to
keep in memory and what to remove. So every server restart leads to
page faults, because none of the pages being accessed are yet
available in the working set, resulting in long data retrieval
times.
MongoDB also has no control over
prioritizing the content of the memory. When pages must be evicted,
it cannot specify which content should be kept in the cache and
which can be removed. For example, a read fired on a large
collection that is not indexed might load the entire collection
into memory, evicting the existing RAM contents, including the
indexes of other collections that might be very important. This
lack of control can also shrink the cache available to MongoDB
whenever a process outside MongoDB accesses a large portion of
memory, which eventually leads to slower MongoDB responses.
With the release of version 3.0, MongoDB
comes with a pluggable storage engine API that enables you to
select a storage engine based on the workload, the application's
needs, and the available infrastructure.
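The engine is selected when mongod starts.
For example, to run with WiredTiger rather than the default, either
on the command line or in the YAML configuration file:

mongod --storageEngine wiredTiger --dbpath /data/db

# Equivalent setting in the mongod configuration file:
storage:
  dbPath: /data/db
  engine: wiredTiger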
The vision behind the pluggable storage
engine layer is to have one data model, one querying language, and
one set of operational concerns, but under the hood many storage
engine options optimized for different use cases, as shown in
Figure 8-1.
The pluggable storage engine feature also
provides flexibility in terms of deployments wherein multiple types
of storage engines can coexist in the same deployment.
The default engine, MMAPv1, is an improved
version of the MMAP engine used in prior versions. The updated
MMAPv1 storage engine implements collection-level concurrency
control. This storage engine excels at workloads with high-volume
reads, inserts, and in-place updates.
The new WiredTiger storage engine was
developed by the architects of Berkeley DB, the most widely
deployed embedded data management software in the world. WiredTiger
is designed to take advantage of modern hardware, scaling across
multi-core CPUs and making use of larger amounts of RAM.
WiredTiger stores data in compressed
format on disk. Compression reduces data size by up to 70%
(disk only) and index size by up to 50% (both disk and memory),
depending on the compression algorithm used. In addition to reducing
storage space, compression enables much higher I/O scalability, as
fewer bits are read from disk. It provides significant benefits in
the areas of greater hardware utilization, lower storage costs, and
more predictable performance.
Administrators can modify the default
compression settings for all collections and indexes. Compression
is also configurable on a per-collection and per-index basis during
collection and index creation.
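For example, the engine-wide defaults can be
set in the mongod configuration file. Here the block compressor for
collections is switched from the default (snappy) to zlib, which
trades extra CPU for a higher compression ratio, and prefix
compression is kept on for indexes:

storage:
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib
    indexConfig:
      prefixCompression: true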
WiredTiger also provides granular
document-level concurrency. Writes are no longer blocked by other
writes unless they are accessing the same document, so readers and
writers can access the documents in a collection concurrently.
Clients can read documents while write operations are in progress,
and multiple threads can modify different documents in a collection
at the same time. As a result, WiredTiger excels at write-intensive
workloads (a 7-10x improvement in write performance).
Higher concurrency also drives
infrastructure simplification. Applications can fully utilize
available server resources, simplifying the architecture needed to
meet performance SLAs. With the more coarse-grained database-level
locking of previous MongoDB generations, users often had to
implement sharding in order to scale workloads stalled by a single
write lock to the database, even when sufficient memory, I/O
bandwidth, and disk capacity were still available in the host
system. The greater system utilization enabled by fine-grained
concurrency reduces this overhead, eliminating unnecessary cost and
management load.
This storage engine gives you control at
the per-collection and per-index level to decide what to compress
and what not to compress.
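A sketch of this from the mongo shell,
assuming a hypothetical users collection (the options are WiredTiger
configuration strings and are fixed at creation time):

// Create a collection whose data blocks are compressed with zlib
// instead of the engine-wide default:
db.createCollection("users", {
  storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})

// Indexes accept the same option; here prefix compression is turned
// off for one index while the others keep the default:
db.users.createIndex(
  { name: 1 },
  { storageEngine: { wiredTiger: { configString: "prefix_compression=false" } } }
)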
8.2 Data File (Relevant for MMAPv1)
First, let’s examine the data file. As
you saw when looking at the core services, the default data
directory used by mongod is /data/db/.
Under this directory there are separate
files for every database. Each database has a single .ns file and
multiple data files with monotonically increasing numeric
extensions.
For example, if you create a database
called mydbpoc, it will be stored in the files mydbpoc.ns,
mydbpoc.0, mydbpoc.1, and so on, as shown in Figure 8-2.
Each new numeric data file for a database
is double the size of the previous data file, up to a limit of 2GB.
Once the file size has reached 2GB, all subsequent numbered files
remain 2GB in size. This behavior is by design: it ensures that
small databases do not waste too much space on disk, while large
databases are mostly kept in contiguous regions on the disk.
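As an illustration, assuming the default
allocation in which the first data file starts at 64MB, the files
of a steadily growing database would look something like this:

mydbpoc.0    64MB
mydbpoc.1   128MB
mydbpoc.2   256MB
mydbpoc.3   512MB
mydbpoc.4     1GB
mydbpoc.5     2GB
mydbpoc.6     2GB   (capped; every subsequent file is also 2GB)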
Note that in order to ensure consistent
performance, MongoDB preallocates data files. The preallocation
happens in the background and is initiated every time a data file
is filled. This means that the MongoDB server always attempts to
keep an extra, empty data file for every database in order to avoid
blocking on file allocation.
If multiple small databases exist on disk,
using the storage.mmapv1.smallFiles option will reduce the size of
these files.
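In the configuration file, this looks as
follows. With smallFiles enabled, the data files start at 16MB and
are capped at 512MB instead of 2GB:

storage:
  mmapv1:
    smallFiles: true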
Next, you will see how the data is
actually stored under the hood. Doubly linked lists are the key
data structure used for storing the data.
8.2.1 Namespace (.ns File)
Within the data files, the data space is
divided into namespaces, where a namespace can correspond to
either a collection or an index.
The metadata of these namespaces is
stored in the .ns file. If you check your data directory, you will
find a file named [dbname].ns.
The size of the .ns file that is used for
storing the metadata is 16MB. This file can be thought of as a big
hash table that is partitioned into small buckets, which are
approximately 1KB in size.
8.2.1.1 Collection Namespace
Extent
Extent refers to a group of data
records within a data file, so a group of extents forms the
complete data for a namespace. An extent uses disk locations to
refer to the location on the disk where the data is actually
residing. A disk location consists of two parts: a file number and
an offset.
The offset is the position within the file
(how deep within the file you need to look for the data). The
offset is stored in 4 bytes, so its maximum value is 2^31-1, which
is also the maximum size a data file can grow to (2048MB, or 2GB).
As shown in Figure 8-5,
an extent data structure consists of the following things:
-
The location of the extent itself, that is, the number of the
data file it belongs to and its offset within that file.
-
Pointers to the previous and next extents, since the extents of
a namespace are maintained as a doubly linked list.
-
Pointers to the first data record and the last data record of
the data block it's pointing to. Once it has the file number
it's referring to, the group of data records within that file
is itself stored as a doubly linked list, so these pointers are
nothing but offsets within the file (how deep within the file
the data is stored).
Data Record
Next you will look at the data record
structure. A data record consists of a header (pointers to the
previous and next records, which chain the records of an extent
into a doubly linked list, along with the record's length) and the
data block itself.
The data block can contain either a BTree
bucket (in the case of an index namespace) or a BSON object. You
will be looking into the BTree structure in a while.
The BSON object corresponds to the
actual data of the collection. The size of the BSON object need
not be the same as that of the data block. Power-of-2 sized
allocation is used by default, so every document is stored in a
space that contains the document plus extra space, or padding. This
design decision helps avoid moving an object from one block
to another whenever an update leads to a change in the object's
size.
MongoDB supports multiple allocation
strategies, which determine how to add padding to a document
(Figure 8-6).
Since in-place updates are more efficient than relocations, all
padding strategies trade extra space for increased efficiency
and decreased fragmentation. Different strategies suit different
workloads: for instance, exact-fit allocation is ideal for
collections with insert-only workloads where the document size is
fixed and never varies, whereas power-of-2 allocation is efficient
for insert/update/delete workloads.
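From the shell, the strategy can be changed
per collection. A brief sketch, assuming a hypothetical insert-only
logs collection for which exact fit is preferable (noPadding, which
disables power-of-2 padding, is available from MongoDB 3.0):

// Switch an existing collection to exact-fit allocation:
db.runCommand({ collMod: "logs", noPadding: true })

// Or create the collection without padding from the start:
db.createCollection("logs", { noPadding: true })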
Deleted List
The deleted list stores details of
extents whose data has been deleted or moved (a move happens
whenever an update changes the size of a document such that it no
longer fits in its allocated space).
The size of the record determines the
bucket in which the freed extent is placed. Basically, these are
bucketed singly linked lists. When a new extent is required to fit
the data of a namespace, MongoDB first searches the free list to
check whether an extent of an appropriate size is available.
In Summary
Hence you can think of the data files
(the files with numeric extensions) as divided across different
collection namespaces, where the extents of a namespace specify the
ranges of data within the data files belonging to that respective
collection.
Consider, for example, what happens when a
query scans the users collection of the mydbpoc database. MongoDB
will first check the mydbpoc.ns file to reach the users namespace
and find the first extent it points to. It will follow the first
extent's link to the first record and, following each record's next
pointer, read the data records of the first extent until the last
record is reached. Then it will follow the next extent pointer and
read its data records in a similar fashion. This pattern is
followed until the data records of the last extent have been read.
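A conceptual sketch of this traversal in
shell-style JavaScript (illustrative pseudocode only, not actual
server code; nsFile and processBSON are hypothetical names):

var ns = nsFile["mydbpoc.users"];  // namespace entry from the .ns file
// Outer loop: walk the doubly linked list of extents.
for (var e = ns.firstExtent; e !== null; e = e.xnext) {
  // Inner loop: walk the linked list of records within the extent.
  for (var r = e.firstRecord; r !== null; r = r.next) {
    processBSON(r.data);  // each record's data block holds one BSON document
  }
}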
8.2.1.3 Indexes BTree
Now let’s look at how the indexes are
stored. The BTree structure is used for storing the indexes. A
BTree is shown in Figure 8-7.
In a standard implementation of a BTree,
whenever a new key is inserted and a bucket overflows, the default
behavior is a 50/50 split of the bucket, as shown in Figure 8-8.
In the above scenario, if you have
ever-increasing keys such as a timestamp, an ObjectID, or an
incrementing number, then the buckets will always remain half full,
leading to lots of wasted space.
In order to overcome this, MongoDB
modifies this behavior slightly. Whenever it identifies that the
index key is an incrementing key, rather than doing a 50/50 split
it does a 90/10 split, as shown in Figure 8-9.
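The effect can be seen with a small
simulation in shell JavaScript (a conceptual sketch of bucket
splitting, not MongoDB's actual BTree code):

// Insert monotonically increasing keys into fixed-size buckets; on
// overflow, keep `ratio` of the keys in the old bucket and move the
// rest into a new right-hand bucket.
function simulate(ratio, numKeys, bucketSize) {
  var buckets = [[]];
  for (var key = 0; key < numKeys; key++) {
    var last = buckets[buckets.length - 1];
    if (last.length === bucketSize) {
      last = last.splice(Math.floor(bucketSize * ratio));  // split
      buckets.push(last);
    }
    last.push(key);  // increasing keys always land in the right-most bucket
  }
  return buckets.length;
}

print(simulate(0.5, 10000, 100));  // ~200 buckets, each old one ~50% full
print(simulate(0.9, 10000, 100));  // ~110 buckets, each old one ~90% full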