Monday 5 March 2018

MongoDB Concept Part 7


8. MongoDB Explained

Shakuntala Gupta Edward (Ghaziabad, Uttar Pradesh, India) and Navin Sabharwal (New Delhi, Delhi, India)



“MongoDB explained covers deep-dive concepts of what happens under the hood in MongoDB.”
In this chapter, you will learn how data is stored in MongoDB and how writes happen using journaling. Finally, you will learn about GridFS and the different types of indexes available in MongoDB.

8.1 Data Storage Engine

In the previous chapter, you looked at the core services that are deployed as part of MongoDB; you also looked at replica sets and sharding. In this section, we will talk about the data storage engine .
MongoDB uses MMAP as its default storage engine. This engine works with memory-mapped files. Memory-mapped files are data files that are placed in memory by the operating system using the mmap() system call. mmap is an OS feature that maps a file on disk into virtual memory.
Virtual memory is not equivalent to physical memory. Virtual memory is space on the computer’s hard disk that is used in conjunction with physical RAM.
MongoDB uses memory-mapped files for all data interaction and data management activity. As documents are accessed, the corresponding data files are mapped into memory. MongoDB lets the OS control the memory mapping and allocate the maximum amount of RAM, which keeps the effort and code required at the MongoDB level minimal. Caching follows LRU behavior: the least recently used pages are moved out of the working set to disk, making space for recently and frequently used pages.
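You can observe this in practice from the mongo shell: the serverStatus command reports memory figures for a mongod running the MMAP-based engine. A minimal check, assuming a running MMAPv1 instance:

    // Inspect memory usage of a mongod running the MMAP-based engine.
    // "mapped" is the amount of data (in MB) mapped into virtual memory,
    // "resident" is the portion currently held in physical RAM, and
    // "virtual" is the total virtual address space used by the process.
    db.serverStatus().mem
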
However, this method comes with its own drawbacks. For instance, MongoDB has no control over what data to keep in memory and what to remove. So every server restart leads to page faults, because no accessed page is yet available in the working set, which results in long data retrieval times.
MongoDB also has no control over prioritizing the contents of memory: in an eviction situation, it cannot specify which content should be kept in the cache and which can be removed. For example, if a read is fired on a large collection that is not indexed, it might load the entire collection into memory, which might evict other RAM contents, including the indexes of other collections that may be very important. This lack of control also means that the cache available to MongoDB can shrink when a process outside MongoDB accesses a large portion of memory, which eventually leads to slow MongoDB responses.
With the release of version 3.0, MongoDB ships with a pluggable storage engine API that enables you to select a storage engine based on the workload, application needs, and available infrastructure.
The vision behind the pluggable storage engine layer is to have one data model, one querying language, and one set of operational concerns, but under the hood many storage engine options optimized for different use cases, as shown in Figure 8-1.



Figure 8-1. Pluggable storage engine API
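If you want to confirm which engine a running instance was started with, serverStatus exposes it. A minimal check from the mongo shell:

    // Reports the storage engine of the running mongod,
    // e.g. { "name" : "mmapv1" } or { "name" : "wiredTiger" }.
    db.serverStatus().storageEngine
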
The pluggable storage engine feature also provides flexibility in terms of deployments wherein multiple types of storage engines can coexist in the same deployment.
MongoDB version 3.0 ships with two storage engines.
The default, MMAPv1, is an improved version of the MMAP engine used in the prior versions. The updated MongoDB MMAPv1 storage engine implements collection-level concurrency control. This storage engine excels at workloads with high volume reads, inserts, and in-place updates.
The new WiredTiger storage engine was developed by the architects of Berkeley DB, the most widely deployed embedded data management software in the world. WiredTiger scales on modern multi-CPU architectures. It is designed to take advantage of modern hardware with multi-core CPUs and more RAM.
WiredTiger stores data in compressed format on disk. Compression reduces the data size by up to 70% (disk only) and the index size by up to 50% (both disk and memory), depending on the compression algorithm used. In addition to reduced storage space, compression enables much higher I/O scalability as fewer bits are read from disk. It provides significant benefits in the areas of greater hardware utilization, lower storage costs, and more predictable performance.
The following compression algorithms are available to choose from:
  • Snappy is the default, which is used for documents and journals. It provides a good compression ratio with little CPU overhead. Depending on data types, the compression ratio is somewhere around 70%.
  • zlib provides extremely good compression but at the expense of extra CPU overhead.
  • Prefix compression is the default used for indexes, reducing the in-memory footprint of index storage by around 50% (workload dependent) and freeing up more of the working set for frequently accessed documents.
Administrators can modify the default compression settings for all collections and indexes. Compression is also configurable on a per-collection and per-index basis during collection and index creation.
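As a sketch of the per-collection configuration mentioned above, the createCollection command accepts storage engine options that are passed through to WiredTiger; the collection name logs here is purely illustrative.

    // Create a collection whose documents are compressed with zlib instead of
    // the default snappy (only takes effect when mongod runs WiredTiger).
    db.createCollection("logs", {
        storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
    })
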
WiredTiger also provides granular document-level concurrency. Writes are no longer blocked by other writes unless they are accessing the same document. Thus it supports concurrent access by readers and writers to the documents in a collection. Clients can read documents while write operations are in progress, and multiple threads can modify different documents in a collection at the same time. Thus it excels for write-intensive workloads (7-10X improvement in write performance).
Higher concurrency also drives infrastructure simplification. Applications can fully utilize available server resources, simplifying the architecture needed to meet performance SLAs. With the coarser-grained database-level locking of previous MongoDB generations, users often had to implement sharding in order to scale workloads stalled by a single write lock to the database, even when sufficient memory, I/O bandwidth, and disk capacity were still available in the host system. Greater system utilization enabled by fine-grained concurrency reduces this overhead, eliminating unnecessary cost and management load.
This storage engine gives you control, at the per-collection and per-index level, to decide what to compress and what not to compress.
The WiredTiger storage engine is only available with 64-bit MongoDB.
WiredTiger manages data through its cache. The WiredTiger storage engine gives more control of memory by allowing you to configure how much RAM to allocate to the WiredTiger cache, defaulting to either 1GB or 50% of available memory, whichever is larger.
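The configured cache size can be inspected at runtime through serverStatus. The snippet below is a minimal check, assuming a mongod started with the WiredTiger engine (the size itself is set at startup, for example with the --wiredTigerCacheSizeGB option).

    // Shows the WiredTiger cache size (in bytes) the server was started with;
    // the same section also exposes "bytes currently in the cache" for usage.
    db.serverStatus().wiredTiger.cache["maximum bytes configured"]
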
You will next briefly look at how the data is stored on the disk.

8.2 Data File (Relevant for MMAPv1)

First, let’s examine the data file. As you saw in the discussion of the core services, the default data directory used by mongod is /data/db/.
Under this directory there are separate files for every database. Each database has a single .ns file and multiple data files with monotonically increasing numeric extensions.
For example, if you create a database called mydbpoc, it will be stored in the following files: mydbpoc.ns, mydbpoc.0, mydbpoc.1, and so on, as shown in Figure 8-2.



Figure 8-2. Data files
Each new numeric data file is double the size of the previous one, up to a limit of 2GB; once a file reaches 2GB, all subsequent numbered files remain 2GB in size. This behavior is by design: it ensures that small databases do not waste too much space on disk, while large databases are mostly kept in contiguous regions on the disk.
Note that in order to ensure consistent performance, MongoDB preallocates data files. The preallocation happens in the background and is initiated every time a data file is filled. This means that the MongoDB server always attempts to keep an extra, empty data file for every database in order to avoid blocking on file allocation.
If multiple small databases exist on disk, using the storage.mmapv1.smallFiles option will reduce the size of these files.
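To see the effect of doubling and preallocation on an actual database, db.stats() contrasts the logical data size with the space allocated on disk. A quick illustration against the mydbpoc database used above:

    use mydbpoc
    // Under MMAPv1, "fileSize" covers all allocated data files (including the
    // preallocated empty one), so it is normally larger than "storageSize"
    // and "dataSize", which reflect the space actually used by documents.
    db.stats()
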
Next, you will see how the data is actually stored under the hood. Doubly linked lists are the key data structure used for storing the data.

8.2.1 Namespace (.ns File)

Within the data files, the data space is divided into namespaces, where a namespace can correspond to either a collection or an index.
The metadata of these namespaces is stored in the .ns file. If you check your data directory, you will find a file named [dbname].ns.
The size of the .ns file that is used for storing the metadata is 16MB. This file can be thought of as a big hash table that is partitioned into small buckets, which are approximately 1KB in size.
Each bucket stores metadata specific to a namespace (Figure 8-3).



Figure 8-3. Namespace data structure
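Under MMAPv1, the namespaces recorded in the .ns file can also be listed from the shell through the system.namespaces collection. A quick look:

    // Lists one entry per namespace: the collections of the current database,
    // their indexes (e.g. mydbpoc.users.$_id_), and internal namespaces.
    db.system.namespaces.find()
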

8.2.1.1 Collection Namespace

As shown in Figure 8-4, the collection namespace bucket contains metadata such as
  • Name of the collection
  • A few statistics on the collection such as count, size, etc. (This is why whenever a count is issued against the collection, it returns a quick response; see the example after Figure 8-4.)
  • Index details, so it can maintain links to each index created
  • A deleted list
  • A doubly linked list storing the extent details (it stores pointers to the first and the last extent)



Figure 8-4. Collection namespace details
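Because statistics such as count and size live in the collection’s namespace metadata, they can be returned without scanning the documents. A quick illustration, assuming a users collection exists:

    // Answered from the namespace metadata rather than by walking the documents.
    db.users.count()
    // stats() exposes the same metadata: count, size, storageSize,
    // numExtents, lastExtentSize, and the index details.
    db.users.stats()
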
Extent
An extent is a group of data records within a data file, so a group of extents forms the complete data for a namespace. An extent uses a disk location to refer to the place on disk where the data actually resides. A disk location consists of two parts: a file number and an offset.
The file number specifies which data file it points to (0, 1, etc.).
The offset is the position within the file (how deep within the file you need to look for the data). The offset is stored as a 4-byte value, so its maximum value is 2^31 - 1, which corresponds to the maximum size a data file can grow to (2048MB, or 2GB).
As shown in Figure 8-5, an extent data structure consists of the following things:
  • Location on the disk, which is the file number it is pointing to.
  • Since an extent is stored as a doubly linked list element, it has a pointer to the next and the previous extent.
  • Once it has the file number it refers to, the group of data records within that file is itself stored as a doubly linked list. Hence the extent maintains pointers to the first and the last data record of the data block it points to; these are simply offsets within the file (how deep within the file the data is stored).



Figure 8-5. Extent
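The extent chain described above can be observed with the validate command, which under MMAPv1 walks the extents of a collection. A sketch, again assuming a users collection:

    // Full validation traverses every extent and record; the output includes
    // details such as the disk locations of the first and last extent,
    // the extent count, and the number of records found.
    db.users.validate(true)
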
Data Record
Next you will look at the data record structure . The data structure consists of the following details:
  • Since the data record structure is an element of the extent’s doubly linked list, it stores pointers to the previous and the next record.
  • Its length, including the header.
  • The data block.
The data block can have either a BTree Bucket (in case of an index namespace) or a BSON object. You will be looking into the BTree structure in a while.
The BSON object corresponds to the actual data for the collection. The size of the BSON object need not be the same as that of the data block. Power-of-2 sized allocation is used by default, so every document is stored in a space that contains the document plus extra space, or padding. This design decision helps avoid moving an object from one block to another whenever an update changes the object’s size.
MongoDB supports multiple allocation strategies, which determine how to add padding to a document (Figure 8-6). As in-place updates are more efficient than relocations, all padding strategies trade extra space for increased efficiency and decreased fragmentation. Different workloads are supported by different strategies. For instance, exact fit allocation is ideal for collections with insert-only workloads where the size is fixed and never varies, whereas power of 2 allocations are efficient for insert/update/delete workloads.



Figure 8-6. Record data structure
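The allocation strategy can be chosen per collection. A minimal sketch, using a hypothetical auditTrail collection for an insert-only workload:

    // Power-of-2 sized allocation is the default, so nothing needs to be set
    // for typical insert/update/delete workloads.
    // For insert-only workloads with fixed-size documents, exact-fit allocation
    // can be requested at creation time (MMAPv1, MongoDB 3.0+):
    db.createCollection("auditTrail", { noPadding: true })
    // The same flag can be toggled later on an existing collection:
    db.runCommand({ collMod: "auditTrail", noPadding: true })
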
Deleted List
The deleted list stores details of extents whose data has been deleted or moved (a record is moved whenever an update changes its size so that the data no longer fits in the allocated space).
The size of the record determines the bucket in which the free extent is placed; these buckets are essentially singly linked lists. When a new extent is required to fit data for the namespace, MongoDB first searches the free list to check whether an extent of the appropriate size is available.
In Summary
You can therefore think of the data files (the files with numeric extensions) as divided across different collection namespaces, where the extents of a namespace specify the ranges of data within the data files that belong to that collection.
Having understood how the data is stored, now let’s see how db.users.find() works.
It first checks the mydbpoc.ns file to reach the users namespace and find the first extent it points to. It follows the first extent link to the first record and, following the next-record pointers, reads the data records of the first extent until the last record is reached. It then follows the next-extent pointer and reads that extent’s data records in a similar fashion. This pattern continues until the data records of the last extent have been read.
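You can see this full walk reflected in the query planner: without an index, the winning plan is a collection scan. A quick check, with the age filter purely illustrative:

    // "COLLSCAN" in the winning plan means the extents and records of the
    // collection are read one after another, as described above.
    db.users.find({ age: { $gt: 30 } }).explain()
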

8.2.1.2 $freelist

The .ns file has a special namespace called $freelist for extents. $freelist keeps track of the extents that are no longer used, such as extents of a dropped index or collection.

8.2.1.3 Indexes BTree

Now let’s look at how indexes are stored. MongoDB uses a BTree structure for storing indexes; a BTree is shown in Figure 8-7.



Figure 8-7. BTree
In a standard BTree implementation, whenever a new key is inserted, the default behavior is a 50/50 split of the bucket, as shown in Figure 8-8.



Figure 8-8. B-Tree standard implementation
There’s a slight variation in the way MongoDB implements the BTree.
In the above scenario, if you have keys such as Timestamp, ObjectID, or an incrementing number, then the buckets will always be half full, leading to lots of wasted space.
In order to overcome this, MongoDB has modified this slightly. Whenever it identifies that the index key is an incrementing key, rather than doing a 50/50 split, it does a 90/10 split as shown in Figure 8-9.



Figure 8-9. MongoDB’s B-Tree 90/10 split
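Indexes on monotonically increasing keys, such as timestamps or ObjectIDs, are exactly the case where the 90/10 split pays off. A small sketch, with the createdAt field as an assumed example:

    // New keys always arrive at the right-most bucket, so the 90/10 split keeps
    // the buckets close to full instead of half empty.
    db.users.createIndex({ createdAt: 1 })
    // The default _id index on ObjectIDs benefits in the same way.
    db.users.getIndexes()
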
Figure 8-10 shows the bucket structure. Each bucket of the BTree is 8KB in size.



Figure 8-10. BTree bucket data structure
The bucket consists of the following:
  • Pointer to the parent
  • Pointer to the right child
  • Pointer to key nodes
  • A list of key objects (these objects are of varying size and are stored in an unsorted manner; they are actually the values of the index keys)
Key Nodes
Key nodes are nodes of a fixed size and are stored in a sorted manner. They enable easy split and movement of the elements between different nodes of the BTree.
A key node contains the following:
  • A pointer to the left child
  • The data record the index key belongs to
  • A key offset (the offset of the key objects, which basically tells where in the bucket the key’s value is stored)



