12.1.1 Hardware Suggestions from the MongoDB Site
The following suggestions are intended only as high-level guidance for a MongoDB deployment. The actual hardware configuration depends on your data, availability requirements, queries, performance criteria, and the capabilities of the selected hardware components.
- Storage: MongoDB can use SSDs (solid state drives) or locally attached storage. Because MongoDB's disk access patterns are not sequential, using SSDs can provide substantial performance gains. Another benefit of SSDs is that they degrade performance gracefully when the working set no longer fits in memory.
- CPU: Workloads on the MMAPv1 storage engine rarely need a large number of cores, so it's preferable to use servers with a faster clock speed over servers with many cores but a slower clock speed. However, the WiredTiger storage engine is CPU bound, so using a server with multiple cores will offer a significant performance improvement.
12.1.2 Few Points to be Noted
To summarize this section, when choosing hardware for MongoDB, consider the following important points:
- MongoDB on NUMA Hardware: This point applies only to mongod running on Linux, not to instances running on Windows or other Unix-like systems. NUMA (non-uniform memory access) and MongoDB don't work well together; NUMA causes a number of operational problems for MongoDB, including periods of slow performance and high system process usage. So when running MongoDB on NUMA hardware, disable NUMA for MongoDB and run with an interleave memory policy, as shown in the sketch below.
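A minimal sketch of starting mongod with an interleaved memory policy on a NUMA host; the configuration file path is an assumption and should be replaced with your own:

# Disable zone reclaim, which is recommended on NUMA systems running mongod
echo 0 | sudo tee /proc/sys/vm/zone_reclaim_mode

# Start mongod with memory allocations interleaved across all NUMA nodes
numactl --interleave=all mongod --config /etc/mongod.conf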
12.2 Coding
Once the hardware is acquired, consider
the following tips when coding with the database:
- The first point is to think about the data model to be used for the given application requirements and to decide between embedding, referencing, or a mix of both. For more on this, please look at Chapter tk. There's a trade-off between fast performance and guaranteed immediate consistency, so decide based on your application.
- Avoid application patterns that lead to unbounded document growth. For instance, an application should not update documents in a way that causes them to grow significantly. When the document size exceeds the allocated size, MongoDB will relocate the document. This process is not only time consuming, but is also resource intensive and can unnecessarily slow down other database operations. In addition, it can lead to inefficient use of storage.
For example, let's consider a blogging application. In this application, it's difficult to estimate how many responses a blog post will receive, and the application is set to display only a subset of comments to the user, say the most recent comment or the first 10 comments. In this case, rather than creating an embedded model where the blog post and the user responses are maintained as a single document, you should create a referencing model where each response or group of responses is maintained as a separate document carrying a reference back to the blog post. In this way you avoid the unbounded document growth that the embedded model would suffer from. A sketch of such a referencing model follows.
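A minimal mongo shell sketch of the referencing model; the collection and field names (posts, comments, post_id) are assumptions for illustration:

// The blog post is stored once and never grows as comments arrive.
db.posts.insert({ _id: 1, title: "Why MongoDB?", body: "..." })

// Each comment is a separate document holding a reference to its post.
db.comments.insert({ post_id: 1, author: "alice", text: "Nice post!", created: new Date() })

// Fetch the post plus its 10 most recent comments.
db.posts.findOne({ _id: 1 })
db.comments.find({ post_id: 1 }).sort({ created: -1 }).limit(10)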
- You can also design documents for the future. Although MongoDB provides the option of appending new fields to documents as and when required, doing so has a drawback. When new fields are introduced, the document might no longer fit in its current space, forcing MongoDB to find new space for the document and move it there, which takes time. So if you are aware of the structure, it is more efficient to create all the fields at the start, irrespective of whether you have data for them at that time or not. As highlighted above, space will then be allotted to the document up front, and when a value arrives only an update is needed. MongoDB will not have to look for space; it merely updates the values, which is much faster. A sketch of this pre-allocation approach follows.
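A minimal mongo shell sketch of pre-allocating fields, assuming a hypothetical users collection whose final structure is known up front:

// Create the document with every known field, even if some values are not available yet.
db.users.insert({ _id: 100, name: "bob", email: null, phone: null, address: null })

// Later, fill in the values with an in-place update.
db.users.update({ _id: 100 }, { $set: { email: "bob@example.com", phone: "555-0100" } })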
- If you want to query for information that must be computed and is not explicitly present in the document, the best choice is to make the information explicit in the document. Since MongoDB is designed to simply store and retrieve data, it does no such computation for you; the computation is pushed to the client, which can lead to performance issues. See the sketch below.
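A minimal sketch of keeping a computed value explicit, assuming the hypothetical posts and comments collections from the earlier example plus a comment_count field:

// Instead of counting comments at query time, maintain an explicit counter on the post.
db.comments.insert({ post_id: 1, author: "carol", text: "Agreed." })
db.posts.update({ _id: 1 }, { $inc: { comment_count: 1 } })

// Queries can now read the count directly, with no client-side computation.
db.posts.find({ comment_count: { $gt: 100 } })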
- Use TTL indexes when documents need to expire automatically after a period of time. Say you have a collection that maintains documents containing details of the user and system interaction. The documents have a date field called lastActivity, which tracks the user and system interaction. Let's say you have a requirement to maintain the user session for only an hour. In this scenario, you can set the TTL to 3600 seconds on the lastActivity field. A background thread will run automatically, checking for and deleting documents that have been idle for more than 3600 seconds. A sketch follows.
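A minimal sketch of the TTL index described above, assuming a hypothetical sessions collection with a lastActivity date field; sessionId is a placeholder for the application's session identifier:

// Documents are removed automatically once lastActivity is more than 3600 seconds old.
db.sessions.createIndex({ lastActivity: 1 }, { expireAfterSeconds: 3600 })

// Each time the user interacts with the system, refresh lastActivity to keep the session alive.
db.sessions.update({ _id: sessionId }, { $set: { lastActivity: new Date() } })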
- Use capped collections if you require high throughput based on insertion order. In some scenarios you need to maintain a rolling window of data in the system based on data size. For example, a capped collection can be used to store a high-volume system's log information so that the most recent log entries can be retrieved quickly. A sketch follows.
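A minimal sketch of a capped collection for log entries; the collection name and size are assumptions:

// A fixed-size (roughly 100 MB) collection that overwrites the oldest entries once full,
// preserving insertion order.
db.createCollection("syslog", { capped: true, size: 100 * 1024 * 1024 })

// A reverse natural-order sort returns the most recent entries first.
db.syslog.find().sort({ $natural: -1 }).limit(20)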
12.3 Application Response Time Optimization
Once you start developing the application,
one of the most important requirements is to have an acceptable
response time. In other words, the application should respond
instantly. You can use the following tips for optimization
purposes:
- Avoid disk access and page faults as much as possible. Proactively figure out the data set size the application will be expected to deal with, and add enough memory to avoid page faults and disk reads. Also, program your application so that it mostly accesses data already in memory, so that page faults happen infrequently. The sketch below shows one way to watch for page faults.
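A minimal mongo shell sketch for checking page-fault activity; the extra_info.page_faults counter is reported on Linux and may be absent or approximate on other platforms:

// Cumulative page faults since the mongod process started.
db.serverStatus().extra_info.page_faults

// Resident and virtual memory figures, useful for judging whether the working set fits in RAM.
db.serverStatus().mem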
- Indexing is not always good. You need to maintain an optimal balance of indexes. Although you should create indexes to support your queries, you should also remember to delete indexes that are no longer used, because every index has a cost for insert/update operations. An index that exists but is never used can still have an adverse effect on the overall database capacity. This is especially important for insert-heavy workloads. See the sketch after this point.
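A minimal sketch of reviewing and removing an unused index, assuming a hypothetical orders collection with an index on a status field that queries no longer use:

// List the indexes that currently exist on the collection.
db.orders.getIndexes()

// Drop the unused index so inserts and updates stop paying its maintenance cost.
db.orders.dropIndex({ status: 1 })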
12.4 Data Safety
You learned what you need to keep in mind
when deciding on your deployment; you also learned a few important
tips for good performance. Now let’s look at some tips for data
safety and consistency:
- Replication and journaling are the two approaches provided for data safety. Generally, it's recommended to run the production setup with replication rather than on a single server, and at least one of the servers should run with journaling enabled. When replication is not possible and you are running on a single server, journaling provides data safety. Chapter tk explains how writes work when journaling is enabled. The sketch below shows how to request a journaled write from the application.
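A minimal sketch of requesting journal acknowledgment for an important write; the orders collection and document are assumptions:

// Ask the server to acknowledge the write only after it has been committed to the journal.
db.orders.insert(
  { item: "book", qty: 1 },
  { writeConcern: { w: 1, j: true } }
)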
12.5 Administration
The following are some administration
tips:
- Take point-in-time backups of durable servers. To take a backup of a database with journaling enabled, you can take a file system snapshot or do a normal fsync+lock and then dump. Note that you can't just copy all of the files without fsync and locking, because copying is not an instantaneous operation. A sketch of the fsync+lock approach follows.
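A minimal sketch of the fsync+lock approach from the mongo shell; the dump command and output path shown in the comment are assumptions:

// Flush pending writes to disk and block further writes while the copy is taken.
db.fsyncLock()

// From another terminal, take the dump or snapshot, for example:
//   mongodump --out /backups/mydb-2016-01-01

// Release the lock once the dump or snapshot has completed.
db.fsyncUnlock()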
- An explain plan can be used to see how a query is being resolved. This includes information such as which index is used, how many documents are returned, whether the index covers the query, how many index entries were scanned, and the time the query took to return results in milliseconds. When a query is resolved in less than 1 ms, the explain plan shows 0. When you make a call to the explain plan, it discards the old plan and initiates the process of testing available indexes to ensure that the best possible plan is used. A sketch follows.
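A minimal sketch of running an explain plan in the mongo shell; the users collection and name field are assumptions:

// executionStats reports the chosen index, documents returned,
// index entries examined, and execution time in milliseconds.
db.users.find({ name: "alice" }).explain("executionStats")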
12.6 Replication Lag
Replication lag is the primary administrative concern when monitoring replica sets. Replication lag for a given secondary is the difference between the time an operation is written on the primary and the time it is replicated to the secondary. Often, the replication lag is transient and remedies itself. However, if it remains high and continues to rise, there might be a problem with the system. You might end up shutting down the system until the problem is resolved, the mismatch might require manual intervention to reconcile, or you might end up running the system with outdated data.
MongoDB Cloud Manager can also be used to view recent and historical replication lag information. The repl lag graph is available from the Status tab of each SECONDARY node. You can also check lag from the shell, as in the sketch below.
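A minimal sketch of checking replication lag from the mongo shell, run while connected to a member of the replica set:

// Reports, for each secondary, how far it is behind the primary's oplog.
rs.printSlaveReplicationInfo()

// rs.status() also shows each member's optimeDate, from which lag can be derived.
rs.status()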
Here are some tips to help reduce this
time:
- In scenarios with a heavy write load, the secondary should be as powerful as the primary node so that it can keep up with the primary and apply writes at the same rate. You should also have enough network bandwidth so that the ops can be retrieved from the primary at the same rate at which they are created.
12.7 Sharding
12.8 Monitoring
The MongoDB system should be proactively monitored to detect unusual behavior so that the necessary actions can be taken to resolve issues. Several tools are available for monitoring a MongoDB deployment.
A free hosted monitoring service named MongoDB Cloud Manager is provided by the MongoDB developers. MongoDB Cloud Manager offers a dashboard view of the entire cluster's metrics. Alternatively, you can use Nagios, SNMP, or Munin to build your own monitoring tool.
MongoDB also provides several tools, such as mongostat and mongotop, to gain insight into performance; the sketch below shows how they are invoked.
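A minimal sketch of the command-line invocations; the host and refresh interval are assumptions:

# Report counters such as inserts, queries, updates, and queue lengths every 5 seconds.
mongostat --host localhost:27017 5

# Show per-collection read/write activity, refreshed every 5 seconds.
mongotop --host localhost:27017 5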
When using monitoring services, the following should be watched
closely:
- Queues: Prior to the release of MongoDB 3.0, a reader-writer lock allowed simultaneous reads but gave writes exclusive access. In such a scenario, you might end up with queues behind a single writer, and those queues may contain read/write queries. Queue metrics need to be monitored along with the lock percentage. If the queues and the lock percentage are trending upwards, that implies contention within the database. Changing the operation to batch mode or changing the data model can have a significant, positive impact on concurrency. Starting with version 3.0, collection-level locking (in the MMAPv1 storage engine) and document-level locking (in the WiredTiger storage engine) were introduced. This improves concurrency, since no write lock with exclusive access is required at the database level, so from this version onward you just need to measure the queue metric, as in the sketch below.
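A minimal sketch of reading the queue metric from the mongo shell:

// Number of operations currently queued waiting for a lock, split into readers and writers.
db.serverStatus().globalLock.currentQueue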
- It's recommended to run the entire performance test against a full-size database, such as a copy of the production database, because performance characteristics often only become apparent when dealing with the actual data. This also lets you avoid unpleasant surprises that might otherwise crop up against the actual production database.