8.3 Data File (Relevant for WiredTiger)
In this section, you will look at the contents of the data directory when mongod is started with the WiredTiger storage engine.
When WiredTiger is the selected storage engine, data, journals, and indexes are compressed on disk. The compression is done based on the compression algorithm specified when starting the mongod.
Under the data directory, there are separate compressed .wt files corresponding to each collection and each index. Journals have their own folder under the data directory.
The compressed files are actually created
when data is inserted in the collection (the files are allocated at
write time, no preallocation).
For example, if you create a collection called users, it will be stored in a file such as collection-0--2259994602858926461.wt, and the associated indexes will be stored in index-1--2259994602858926461.wt, index-2--2259994602858926461.wt, and so on.
In addition to the compressed collection and index files, there is a _mdb_catalog file that stores metadata mapping collections and indexes to the files in the data directory. In the above example, it stores the mapping of the users collection to the .wt file collection-0--2259994602858926461. See Figure 8-11.
When specifying the dbPath, you need to ensure that the directory corresponds to the storage engine, which is specified using the --storageEngine option when starting the mongod. The mongod will fail to start if the dbPath contains files created by a storage engine other than the one specified with --storageEngine. For example, if MMAPv1 files are found in the dbPath, a mongod started with WiredTiger will fail to start.
Internally, WiredTiger uses the traditional B+ tree structure for storing and managing data, but that is where the similarity ends. Unlike a conventional B+ tree implementation, it does not perform in-place updates.
The WiredTiger cache is used for any read/write operations on the data. The trees in the cache are optimized for in-memory access.
8.4 Reads and Writes
You will briefly look at how the reads and
writes happen. As mentioned, when MongoDB updates and reads from
the DB, it is actually reading and writing to memory.
If a modification operation in the MongoDB MMAPv1 storage engine increases the record size beyond the space allocated for it, the entire record is moved to a larger space with extra padding bytes. By default, MongoDB uses power-of-2-sized allocations, so every document in MongoDB is stored in a record that contains the document itself plus extra space (padding). Padding allows the document to grow as the result of updates while minimizing the likelihood of reallocations. Once the record is moved, the space it originally occupied is freed up and is tracked in free lists of different sizes. As mentioned, this is the $freelist namespace in the .ns file.
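The power-of-2 allocation strategy can be illustrated with a small sketch. The following Python snippet is a conceptual illustration only (the minimum allocation size of 32 bytes is an assumption, not MongoDB's actual value): it rounds a document's size up to the next power of two, which is the record size MMAPv1 would allocate, and the difference is the padding available for in-place growth.

def power_of_2_record_size(document_size_bytes):
    """Return the next power of two >= the document size (conceptual sketch)."""
    size = 32  # assumed small minimum allocation, for illustration only
    while size < document_size_bytes:
        size *= 2
    return size

doc_size = 700                       # a 700-byte document...
record = power_of_2_record_size(doc_size)
padding = record - doc_size          # ...gets a 1024-byte record with 324 bytes of padding
print(record, padding)               # 1024 324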
In the MMAPv1 storage engine, as objects are deleted, modified, or created, fragmentation will occur over time, which will affect performance. The compact command should be executed to move the fragmented data into contiguous space.
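As a hedged example, the compact command can be issued from PyMongo through a database command; the database and collection names used here (practicaldb, users) are placeholders.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client.practicaldb

# Rewrites and defragments the users collection. Under MMAPv1 this blocks
# operations on the database while it runs, so schedule it in a maintenance window.
result = db.command("compact", "users")
print(result)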
Every 60 seconds the files in RAM are flushed to disk. To prevent data loss in the event of a power failure, the default is to run with journaling switched on. The behavior of the journal is dependent on the configured storage engine. The MMAPv1 journal file is flushed to disk every 100 ms, and if there is a power loss, it is used to bring the database back to a consistent state.
In WiredTiger, the data in the cache is stored in a B+ tree structure that is optimized for in-memory use. The cache maintains an on-disk page image in association with an index, which is used to identify where the data being asked for actually resides within the page (see Figure 8-12).
Whenever an operation is issued to WiredTiger, internally it is broken into multiple transactions wherein each transaction works within the context of an in-memory snapshot. The snapshot is of the committed version before the transaction started. Writers can create new versions concurrently with the readers.
Write operations do not change the page; instead, the updates are layered on top of the page. A skiplist data structure is used to maintain all the updates, with the most recent update at the top. Thus, whenever a user reads or writes the data, the index checks whether a skiplist exists. If no skiplist exists, data is returned from the on-disk page image. If a skiplist exists, the data at the head of the list is returned to the threads, which then update the data. Once a commit is performed, the updated data is added to the head of the list and the pointers are adjusted accordingly. This way multiple users can access data concurrently without any conflict. A conflict occurs only when multiple threads are trying to update the same record; in that case, one update wins and the other concurrent update needs to retry.
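To make this read/write path concrete, here is a minimal Python sketch of the idea (not WiredTiger's actual implementation): a page keeps its clean on-disk image, and committed updates are layered in a most-recent-first list that stands in for the skiplist.

class Page:
    def __init__(self, on_disk_image):
        self.on_disk_image = on_disk_image  # clean page image read from disk
        self.updates = []                   # stand-in for the skiplist, newest first

    def read(self):
        # If no updates have been layered on, serve the on-disk image;
        # otherwise the head of the list holds the most recent committed value.
        return self.updates[0] if self.updates else self.on_disk_image

    def commit_update(self, new_value):
        # A commit places the new version at the head of the list;
        # the page image itself is never modified in place.
        self.updates.insert(0, new_value)

page = Page({"_id": 1, "name": "Alice"})
page.commit_update({"_id": 1, "name": "Alice", "city": "Pune"})
print(page.read())   # the latest committed version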
Any changes to the tree structure due to the updates, such as splitting pages when they grow too large, relocating data, and so on, are later reconciled by a background process. This accounts for the fast write operations of the WiredTiger engine; the task of data arrangement is left to the background process. See Figure 8-13.
WiredTiger uses the MVCC approach to
ensure concurrency control wherein multiple versions of the data
are maintained. It also ensures that every thread that is trying to
access the data sees the most consistent version of the data. As
you have seen, the writes are not in place; instead they are
appended on top of the data in a skipList data structure with the
most recent update on the top. Threads accessing the data get the
latest copy, and they continue to work with that copy uninterrupted
until the time they commit. Once they commit, the update is
appended at the top of the list and thereafter any thread accessing
the data will see that latest update.
This enables multiple threads to access
the same data concurrently without any locking or contention. This
also enables the writers to create new versions concurrently with
the readers. The conflict occurs only when multiple threads are
trying to update the same record. In that case, one update wins and
the other concurrent update needs to retry.
The WiredTiger journal ensures that
writes are persisted to disk between checkpoints. WiredTiger uses
checkpoints to flush data to disk by default every 60 seconds or
after 2GB of data has been written. Thus, by default, WiredTiger
can lose up to 60 seconds of writes if running without journaling,
although the risk of this loss will typically be much less if using
replication for durability. The WiredTiger transaction log is not
necessary to keep the data files in a consistent state in the event
of an unclean shutdown, and so it is safe to run without journaling
enabled, although to ensure durability the “replica safe” write
concern should be configured. Another feature of the WiredTiger
storage engine is the ability to compress the journal on disk,
thereby reducing storage space.
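As a hedged illustration of configuring a replica-acknowledged write concern from PyMongo (the exact settings depend on your durability requirements; requiring majority acknowledgement plus a journal commit is a common choice, and the database/collection names here are placeholders):

from pymongo import MongoClient, WriteConcern

client = MongoClient("localhost", 27017)
db = client.practicaldb

# Require acknowledgement from a majority of replica set members and a journal
# commit before the write is considered successful.
users = db.get_collection("users",
                          write_concern=WriteConcern(w="majority", j=True))
users.insert_one({"name": "Alice"})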
8.5 How Data Is Written Using Journaling
MongoDB disk writes are lazy, which means that if there are 1,000 increments in one second, the data will only be written once. The physical write occurs a few seconds after the operation.
As you know, in the MongoDB system, mongod is the primary daemon process. So the disk has the data files and the journal files. See Figure 8-15.
When the mongod is started, the data files
are mapped to a shared view. In other words, the data file is
mapped to a virtual address space. See Figure 8-16.
Basically, the OS recognizes that your data file is 2000 bytes on disk, so it maps this to the memory address range 1,000,000 - 1,002,000. Note that the data will not actually be loaded until it is accessed; the OS just maps it and keeps the mapping.
Up to this point, the memory is still backed by the data files. Thus, any change in memory will be flushed to the underlying files by the OS. This is how the mongod works when journaling is not enabled: every 60 seconds the in-memory changes are flushed by the OS.
Now let's look at writes with journaling enabled. When journaling is enabled, a second mapping of the data files is made to a private view by the mongod. That's why the amount of virtual memory used by mongod doubles when journaling is enabled. See Figure 8-17.
You can see in Figure 8-17
how the data file is not directly connected to the private view, so
the changes will not be flushed from the private view to the disk
by the OS.
Let's see what sequence of events happens when a write operation is initiated. When a write operation is initiated, it first writes to the private view (Figure 8-18).
Next, the changes are written to the
journal file, appending a brief description of what’s changed in
the files (Figure 8-19).
The journal keeps appending the change descriptions as and when it receives changes. If the mongod fails at this point, the journal can replay all the changes even though the data file has not yet been modified, which makes the write safe.
Finally, at a very fast rate, the changes are written to the disk. By default, the mongod requests the OS to do this every 60 seconds (Figure 8-21).
In the last step, the shared view is
remapped to the private view by the mongod. This is done to prevent
the private view from getting too dirty (Figure 8-22).
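The write path just described can be summarized in a small conceptual sketch (plain Python, not mongod internals): writes go to the private view, a change description is appended to the journal, the journal is applied to the shared view, the OS flushes the shared view to disk, and the private view is then remapped.

shared_view = {}    # memory mapped to the data files (flushed by the OS)
private_view = {}   # mongod's second mapping; never flushed directly
journal = []        # append-only log of change descriptions

def write(key, value):
    private_view[key] = value                 # step 1: write to the private view
    journal.append((key, value))              # step 2: append a change description

def flush_to_disk(view):
    pass  # placeholder; the real flush is done by the OS against the mapped files

def group_commit():
    for key, value in journal:                # step 3: apply the journal to the shared view
        shared_view[key] = value
    flush_to_disk(shared_view)                # step 4: OS flushes the shared view (roughly every 60s)
    journal.clear()
    private_view.clear()
    private_view.update(shared_view)          # step 5: remap the private view

write("user:1", {"name": "Alice"})
group_commit()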
8.6 GridFS – The MongoDB File System
You have looked at what happens under the hood. You saw that MongoDB stores data in BSON documents, which have a size limit of 16MB.
GridFS is MongoDB’s specification for
handling large files that exceed BSON’s document size limit. This
section will briefly cover GridFS.
Here “specification” means that it is
not a MongoDB feature by itself, so there is no code in MongoDB
that implements it. It just specifies how large files need to be
handled. The language drivers such as PHP, Python, etc. implement
this specification and expose an API to the user of that driver,
enabling them to store/retrieve large files in MongoDB.
8.6.1 The Rationale of GridFS
By design, a MongoDB document (i.e. a
BSON object) cannot be larger than 16MB. This is to keep
performance at an optimum level, and the size is well suited for
our needs. For example, 4MB of space might be sufficient for
storing a sound clip or a profile picture. However, if the
requirement is to store high quality audio or movie clips, or even
files that are more than several hundred megabytes in size,
MongoDB has you covered by using GridFS.
GridFS specifies a mechanism for dividing
a large file among multiple documents. The language driver that
implements it, for example, the PHP driver, takes care of the
splitting of the stored files (or merging the split chunks when
files are to be retrieved) under the hood. The developer using the
driver does not need to know of such internal details. This way
GridFS allows the developer to store and manipulate files in a
transparent and efficient way.
GridFS uses two collections for storing
the file. One collection maintains the metadata of the file and
the other collection stores the file’s data by breaking it into
small pieces called chunks. This means the file is divided into
smaller chunks and each chunk is stored as a separate document. By
default the chunk size is limited to 255KB.
This approach not only makes the storing of data scalable and easy but also makes range queries easier to use when a specific portion of a file is retrieved.
8.6.2 GridFS Under the Hood
There's no "special case" handling done at the MongoDB server for GridFS requests. All the work is done on the client side.
GridFS enables you to store large files
by splitting them up into smaller chunks and storing each of the
chunks as separate documents. In addition to these chunks, there’s
one more document that contains the metadata about the file. Using
this metadata information, the chunks are grouped together,
forming the complete file.
The storage overhead for the chunks can
be kept to a minimum, as MongoDB supports storing binary data in
documents.
The two collections used by GridFS for storing large files are by default named fs.files and fs.chunks, although a bucket name different from fs can be chosen. The chunks are stored by default in the fs.chunks collection; if required, this can be overridden. All of the file data is contained in the fs.chunks collection.
The fs.files collection stores the
metadata for each file. Each document within this collection
represents a single file in GridFS. In addition to the general
metadata information, each document of the collection can contain
custom metadata specific to the file it’s representing.
The following are the keys that are mandated by the GridFS specification:
-
_id: The unique identifier of the stored file.
-
length: The size of the file's contents in bytes.
-
chunkSize: The size of each chunk in bytes (255KB by default).
-
uploadDate: The date when the file was first stored.
-
md5: This is generated on the server side and is the md5 checksum of the file's contents. The MongoDB server generates its value by using the filemd5 command, which computes the md5 checksum of the uploaded chunks. This implies that the user can check this value to ensure that the file was uploaded correctly.
8.6.3 Using GridFS
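The original code figures are not reproduced in this text; the following PyMongo sketch (the database name practicaldb and the filename are illustrative) shows how a small file might be stored with GridFS and how the resulting fs.files and fs.chunks documents can be inspected, producing output similar to what is shown next.

from pymongo import MongoClient
import gridfs

client = MongoClient("localhost", 27017)
db = client.practicaldb
fs = gridfs.GridFS(db)

# Store a small text file in GridFS.
file_id = fs.put(b"This is my new sample file. It is just grand!",
                 filename="samplefile.txt")

# Inspect the metadata document and the chunk document that GridFS created.
print(list(db.fs.files.find()))
print(list(db.fs.chunks.find()))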
[{u'length': 38, u'_id': ObjectId('52fdd6189cd2fd08288d5f5c'), u'uploadDate': datetime.datetime(2014, 11, 4, 4, 20, 41, 800000), u'md5': u'332de5ca08b73218a8777da69293576a', u'chunkSize': 262144}]
[{u'files_id':
ObjectId('52fdd6189cd2fd08288d5f5c'), u'_id':
ObjectId('52fdd6189cd2fd08288d5f5d'), u'data': Binary('This is my
new sample file. It is just grand!', 0), u'n': 0}]
Let's force split the file. This is done by specifying a small chunkSize when creating the file, like so:
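A hedged PyMongo sketch of this (the 10-byte chunk size is deliberately tiny just to force multiple chunks; collection setup as before):

from pymongo import MongoClient
import gridfs

client = MongoClient("localhost", 27017)
db = client.practicaldb
fs = gridfs.GridFS(db)

# Store the same content with a very small chunk size so it is split
# across several fs.chunks documents.
file_id = fs.put(b"This is my new sample file. It is just grand!",
                 filename="samplefile.txt",
                 chunkSize=10)

print(db.fs.chunks.count_documents({"files_id": file_id}))  # several chunks now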
You now know how the file is actually stored in the database. Next, using the client driver, you will read the file back:
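A hedged PyMongo sketch of reading the file back by its filename (the driver fetches the chunks from fs.chunks and reassembles them transparently):

from pymongo import MongoClient
import gridfs

client = MongoClient("localhost", 27017)
db = client.practicaldb
fs = gridfs.GridFS(db)

# Read the most recent version of the file stored under this name.
grid_out = fs.get_last_version("samplefile.txt")
print(grid_out.read())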
The user need not be aware of the chunks at all. You simply use the APIs exposed by the client driver to read and write files from GridFS.
8.6.3.1 Treating GridFS More Like a File System
You can pass any number of keyword arguments to new_file(). These will be added to the fs.files document:
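A hedged sketch of the kind of call that could produce a document like the one shown next (the content written here is a stand-in, chosen only so that its length matches the output; the extra attribute names mirror that output):

from pymongo import MongoClient
import gridfs

client = MongoClient("localhost", 27017)
db = client.practicaldb
fs = gridfs.GridFS(db)

# Any extra keyword arguments are stored on the file's fs.files document.
f = fs.new_file(filename="practicalfile.txt",
                contentType="text/plain",
                my_other_attribute=42)
f.write(b"8 bytes.")   # stand-in payload; the original content is not shown in the book
f.close()

print(db.fs.files.find_one({"filename": "practicalfile.txt"}))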
{u'contentType': u'text/plain', u'chunkSize': 262144, u'my_other_attribute': 42, u'filename': u'practicalfile.txt', u'length': 8, u'uploadDate': datetime.datetime(2014, 11, 4, 9, 1, 32, 800000), u'_id': ObjectId('52fdd8db9cd2fd08288d5f66'), u'md5': u'681e10aecbafd7dd385fa51798ca0fd6'}
A file can be overwritten using its filename. Since _id is used for indexing files in GridFS, the old file is not removed; instead, a version of the file is maintained. In this case, get_version or get_last_version can be used to retrieve the file by its filename.
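A hedged sketch of this versioning behavior (collection setup as before): the second put creates a new version rather than replacing the first, and deleting by _id removes only that one version.

from pymongo import MongoClient
import gridfs

client = MongoClient("localhost", 27017)
db = client.practicaldb
fs = gridfs.GridFS(db)

# Writing twice with the same filename keeps both documents; each has its own _id.
old_id = fs.put(b"version one", filename="practicalfile.txt")
new_id = fs.put(b"version two", filename="practicalfile.txt")

print(fs.get_last_version("practicalfile.txt").read())   # b'version two'
print(fs.get_version("practicalfile.txt", 0).read())     # b'version one' (oldest)

# Deleting by _id removes only that version of the file.
fs.delete(old_id)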
Note that only one version of practicalfile.txt was removed. You still have a file named practicalfile.txt in GridFS.
8.7 Indexing
In this part of the book, you will briefly examine what an index is in the MongoDB context. Following that, we will highlight the various types of indexes available in MongoDB, concluding the section with their behaviors and limitations.
An index is a data structure that speeds up read operations. In layman's terms, it is comparable to a book index, where you can reach any chapter by looking it up in the index and jumping directly to the page number, rather than scanning the entire book to reach the chapter, which would be the case if no index existed.
Similarly, an index is defined on fields, which can help in searching for information in a more efficient manner.
As in other databases, indexes in MongoDB serve a similar purpose: they are used to speed up the find() operation. The types of queries you run help determine efficient indexes for the databases. For example, if most of the queries use a Date field, it would be beneficial to create an index on the Date field. It can be tricky to figure out which index is optimal for your query, but it is worth the effort because queries that otherwise take minutes will return results instantaneously if a proper index is in place.
In MongoDB, an index can be created on any field or sub-field of a document. Let's now look at the various types of indexes that can be created in MongoDB.
8.7.1 Types of Indexes
8.7.1.2 Secondary Indexes
MongoDB indexes maintain references to the fields. The references are maintained in either ascending or descending order. This is done by specifying a number with the key when creating an index; this number indicates the index direction. The possible options are 1 and -1, where 1 stands for ascending and -1 stands for descending.
In a single-key index, the direction might not be too important; however, it is very important in compound indexes.
Consider an Events collection that
includes both username and timestamp. Your query is to return
events ordered by username first and then with the most recent
event first. The following index will be used:
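The book's original shell command is not reproduced in this text; a PyMongo equivalent (assuming an events collection in a placeholder database) would look something like this:

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# username ascending, timestamp descending: events are grouped by username,
# and within each username the most recent event comes first.
db.events.create_index([("username", ASCENDING), ("timestamp", DESCENDING)])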
At times you may need to ensure that the values being stored in the indexed field are unique. In such cases, you can create indexes with the unique property set to true (by default it's false).
Say you want a unique index on the field user_id. The following command can be run to create the unique index:
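A PyMongo sketch of the command (the users collection name is an assumption; the shell equivalent passes {unique: true} to ensureIndex/createIndex):

from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# Reject any insert or update that would duplicate an existing user_id value.
db.users.create_index([("user_id", ASCENDING)], unique=True)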
This command ensures that you have unique values in the user_id field. A few points that you need to note about the uniqueness constraint follow.
If you are creating a unique index on a collection that already has documents, the creation might fail because some documents may contain duplicate values in the indexed field. In such scenarios, the dropDups option can be used to force creation of the unique index. It works by keeping the first occurrence of the key value and deleting all the documents with subsequent values. By default, dropDups is false.
A sparse index is an index that holds entries only for the documents within a collection that have the field on which the index is created. If you want to create a sparse index on the LastName field of the User collection, the following command can be issued:
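A PyMongo sketch of the command (the shell equivalent passes {sparse: true} to ensureIndex):

from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# Only documents that actually contain a LastName field get an index entry.
db.User.create_index([("LastName", ASCENDING)], sparse=True)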
The index is said to be sparse because it contains entries only for documents that have the indexed field and skips documents in which the field is missing. Due to this nature, sparse indexes provide significant space savings.
In contrast, a non-sparse index includes all documents irrespective of whether the indexed field is available in the document or not; a null value is stored when the field is missing.
A new index property was introduced in version 2.2 that enables you to remove documents from the collection automatically after the specified time period has elapsed. This property (a TTL index) is ideal for scenarios such as logs, session information, and machine-generated event data, where the data needs to persist only for a limited period.
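As a hedged example, a TTL index on a hypothetical session_log collection might look like this in PyMongo; documents are removed roughly an hour after the time stored in their createdAt field.

from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# A background task removes documents once createdAt is more than 3600 seconds old.
db.session_log.create_index([("createdAt", ASCENDING)], expireAfterSeconds=3600)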
With the rise of the smartphone, it’s
becoming very common to query for things near a current location.
In order to support such location-based queries, MongoDB provides
geospatial indexes.
A geospatial index assumes that the
values will range from -180 to 180 by default. If this needs to
be changed, it can be specified along with ensureIndex as
follows:
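A hedged PyMongo equivalent (the places collection, the loc field, and the -500/500 range are illustrative; the shell version passes min and max to ensureIndex):

from pymongo import MongoClient, GEO2D

client = MongoClient("localhost", 27017)
db = client.practicaldb

# Override the default -180..180 coordinate range for the 2d index.
db.places.create_index([("loc", GEO2D)], min=-500, max=500)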
Any documents with values beyond the
maximum and the minimum values will be rejected. You can also
create compound geospatial indexes.
Let's understand how this index works with an example. Say you have documents of the following type:
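The book's sample documents are not reproduced in this text; they might look something like the following (the collection and field names are assumptions, chosen to be consistent with the index that follows):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client.practicaldb

db.shops.insert_one({"name": "Star Coffee",
                     "type": "coffee shop",
                     "loc": [55.48, 42.57]})   # a coordinate pair within the default -180..180 range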
If the user's query is to find all coffee shops near her location, the following compound index can help:
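A hedged PyMongo sketch of such a compound index, using the field names from the sample document above:

from pymongo import MongoClient, GEO2D, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# The 2d key supports $near queries on loc; the additional type key lets the
# query filter for coffee shops using the same index.
db.shops.create_index([("loc", GEO2D), ("type", ASCENDING)])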
Geohaystack indexes are bucket-based geospatial indexes (also called geospatial haystack indexes). They are useful for queries that need to find locations in a small area and also need to be filtered along another dimension, such as finding documents with coordinates within 10 miles and a type field value of restaurant.
While defining the index, it's mandatory to specify the bucketSize parameter, as it determines the granularity of the haystack index. For example,
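a hedged PyMongo sketch might look like this (the collection and field names are illustrative; note that geoHaystack indexes require MongoDB versions before 5.0, where they were removed):

from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# bucketSize=1 groups keys within 1 unit of longitude/latitude into the same bucket;
# type is the additional field searched within each bucket.
db.shops.create_index([("loc", "geoHaystack"), ("type", ASCENDING)],
                      bucketSize=1)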
This example creates an index wherein
keys within 1 unit of latitude or longitude are stored together
in the same bucket. You can also include an additional category
in the index, which means that information will be looked up at
the same time as finding the location details.
If your use case typically searches for
"nearby" locations (i.e. "restaurants within 25
miles"), a haystack index can be more efficient.
The matches for the additional indexed
field (e.g. category) can be found and counted within each
bucket.
If, instead, you are searching for
"nearest restaurant" and would like to return results
regardless of distance, a normal 2d index will be more efficient.
8.7.1.3 Index Intersection
Index intersection was introduced in version 2.6, wherein multiple indexes can be intersected to satisfy a query. To explain it a bit further, let's consider a products collection that holds documents of the following format:
{ "_id": ObjectId(...), "category": ["food", "grocery"], "item": "Apple", "location": "16th Floor Store", "arrival": Date(...) }
With two single-field indexes in place, you can run explain() on a query that filters on both fields to determine whether index intersection is used; a sketch follows this paragraph. The explain output will include either of the following stages: AND_SORTED or AND_HASH. When doing index intersection, either the entire index or only the index prefix can be used.
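The original example indexes and query are not reproduced in this text; the following hedged PyMongo sketch creates two single-field indexes that MongoDB could intersect to satisfy a query on both fields:

from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

db.products.create_index([("item", ASCENDING)])
db.products.create_index([("location", ASCENDING)])

# A query filtering on both fields is a candidate for index intersection.
cursor = db.products.find({"item": "Apple", "location": "16th Floor Store"})
print(cursor.explain())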
Next, you need to understand how the index intersection feature impacts compound index creation.
When creating a compound index, both the order in which the keys are listed in the index and the sort order (ascending or descending) of each key matter. Thus, a compound index may not support a query that does not use the index prefix or that uses keys with a different sort order.
To explain it a bit further, let’s
consider a products collection that has the following compound
index:
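A hedged sketch of such a compound index in PyMongo (the key order shown matches the discussion that follows):

from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.practicaldb

# Compound index with item as the prefix key, followed by location.
db.products.create_index([("item", ASCENDING), ("location", ASCENDING)])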
In addition to queries that refer to all the fields of the compound index, the above compound index can also support queries that use an index prefix (for example, queries that use only the item field). But it won't be able to support queries that use only the location field, or that use the item key with a different sort order.
Instead, if you create two separate indexes, one on item and the other on location, these two indexes can, either individually or through intersection, support the queries mentioned above. Thus, the choice between creating a compound index and relying on intersection of indexes depends on the system's needs.
Note that index intersection will not
apply when the sort() operation needs an index that is completely
separate from the query predicate.
That is, MongoDB will not use the { item: 1 } index for the query, and the separate { location: 1 } or { location: 1, arrival_date: -1 } index for the sort.
8.7.2 Behaviors and Limitations
Finally, the following are a few
behaviors and limitations that you need to be aware of:
-
A collection cannot have more than 64 indexes.
-
Index keys cannot be larger than 1024 bytes.
-
An index name (including the namespace) must be less than 128 characters.
-
Since each clause of an $or query executes in parallel, each can use a different index.
-
Queries that use the $or operator are not supported by 2d geospatial queries.