Monday, 5 March 2018

MongoDB Concept part1

What you need for this book

MongoDB supports the most popular platforms.
Download the latest stable production release of MongoDB from the MongoDB downloads page ( http://www.mongodb.org/downloads/ ).

In this book we have focused on using MongoDB on a 64-bit Windows platform and, where relevant, have cited references on how to work with MongoDB running on Linux.
We will be using 64-bit Windows 2008 R2 and Linux for the examples of the installation process.
Who this book is for
This book will be of interest to programmers, big data architects, application architects, technology enthusiasts, students, solution experts, and those wishing to choose the right big data products for their needs.
The book covers aspects of big data and NoSQL and provides details on architecture and development with MongoDB. Thus it serves the use cases of developers, architects, and operations teams who work on MongoDB.

Contents
Chapter 1: Big Data
Getting Started
Big Data
Facts About Big Data
Big Data Sources
Three Vs of Big Data
Volume
Variety
Velocity
Usage of Big Data
Visibility
Discover and Analyze Information
Segmentation and Customizations
Aiding Decision Making
Innovation
Big Data Challenges
Policies and Procedures
Access to Data
Technology and Techniques
Legacy Systems and Big Data
Structure of Big Data
Data Storage
Data Processing
Big Data Technologies
Summary
Chapter 2: NoSQL
SQL
NoSQL
Definition
A Brief History of NoSQL
ACID vs. BASE
CAP Theorem (Brewer’s Theorem)
The BASE
NoSQL Advantages and Disadvantages
Advantages of NoSQL
Disadvantages of NoSQL
SQL vs. NoSQL Databases
Categories of NoSQL Databases
Summary
Chapter 3: Introducing MongoDB
History
MongoDB Design Philosophy
Speed, Scalability, and Agility
Non-Relational Approach
JSON-Based Document Store
Performance vs. Features
Running the Database Anywhere
SQL Comparison
Summary
Chapter 4: The MongoDB Data Model
The Data Model
JSON and BSON
The Identifier (_id)
Capped Collection
Polymorphic Schemas
Object-Oriented Programming
Schema Evolution
Summary
Chapter 5: MongoDB - Installation and Configuration
Select Your Version
Installing MongoDB on Linux
Installing Using Repositories
Installing Manually
Installing MongoDB on Windows
Running MongoDB
Preconditions
Starting the Service
Verifying the Installation
MongoDB Shell
Securing the Deployment
Using Authentication and Authorization
Controlling Access to a Network
Provisioning Using MongoDB Cloud Manager
Summary
Chapter 6: Using MongoDB Shell
Basic Querying
Create and Insert
Explicitly Creating Collections
Inserting Documents Using Loop
Inserting by Explicitly Specifying _id
Update
Delete
Read
Using Indexes
Stepping Beyond the Basics
Using Conditional Operators
Regular Expressions
MapReduce
aggregate()
Designing an Application’s Data Model
Relational Data Modeling and Normalization
MongoDB Document Data Model Approach
Summary
Chapter 7: MongoDB Architecture
Core Processes
mongod
mongo
mongos
MongoDB Tools
Standalone Deployment
Replication
Master/Slave Replication
Replica Set
Implementing Advanced Clustering with Replica Sets
Sharding
Sharding Components
Data Distribution Process
Data Balancing Process
Operations
Implementing Sharding
Controlling Collection Distribution (Tag-Based Sharding)
Points to Remember When Importing Data in a Sharded Environment
Monitoring for Sharding
Monitoring the Config Servers
Production Cluster Architecture
Scenario 1
Scenario 2
Scenario 3
Scenario 4
Summary
Chapter 8: MongoDB Explained
Data Storage Engine
Data File (Relevant for MMAPv1)
Namespace (.ns File)
Data File (Relevant for WiredTiger)
Reads and Writes
How Data Is Written Using Journaling
GridFS – The MongoDB File System
The Rationale of GridFS
GridFS under the Hood
Using GridFS
Indexing
Types of Indexes
Behaviors and Limitations
Summary
Chapter 9: Administering MongoDB
Administration Tools
mongo
Third-Party Administration Tools
Backup and Recovery
Data File Backup
mongodump and mongorestore
fsync and Lock
Slave Backups
Importing and Exporting
mongoimport
mongoexport
Managing the Server
Starting a Server
Stopping a Server
Viewing Log Files
Server Status
Identifying and Repairing MongoDB
Identifying and Repairing Collection Level Data
Monitoring MongoDB
mongostat
mongod Web Interface
Third-Party Plug-Ins
MongoDB Cloud Manager
Summary
Chapter 10: MongoDB Use Cases
Use Case 1 - Performance Monitoring
Schema Design
Operations
Sharding
Managing the Data
Use Case 2 – Social Networking
Schema Design
Operations
Sharding
Summary
Chapter 11: MongoDB Limitations
MongoDB Space Is Too Large (Applicable for MMAPv1)
Memory Issues (Applicable for Storage Engine MMAPv1)
32-bit vs. 64-bit
BSON Documents
Namespaces Limits
Indexes Limit
Capped Collections Limit - Maximum Number of Documents in a Capped Collection
Sharding Limitations
Shard Early to Avoid Any Issues
Shard Key Can’t Be Updated
Shard Collection Limit
Select the Correct Shard Key
Security Limitations
No Authentication by Default
Traffic to and from MongoDB Isn’t Encrypted
Write and Read Limitations
Case-Sensitive Queries
Type-Sensitive Fields
No JOIN
Transactions
MongoDB Not Applicable Range
Summary
Chapter 12: MongoDB Best Practices
Deployment
Hardware Suggestions from the MongoDB Site
Few Points to be Noted
Coding
Application Response Time Optimization
Data Safety
Administration
Replication Lag
Sharding
Monitoring
Summary
Index


1. Big Data

“Big data is a term used to describe data that has massive volume, comes in a variety of structures, and is generated at high velocity. This kind of data poses challenges to the traditional RDBMS systems used for storing and processing data. Big data is paving the way for newer approaches to processing and storing data.”
In this chapter, we will talk about big data basics, sources, and challenges. We will introduce you to the three Vs (volume, velocity, and variety) of big data and the limitations that traditional technologies face when it comes to handling big data.
1.1 Getting Started
Big data, along with cloud, social, analytics, and mobility, are buzz words today in the information technology world. The availability of the Internet and electronic devices for the masses is increasing every day. Specifically, smartphones, social networking sites, and other data-generating devices such as tablets and sensors are creating an explosion of data. Data is generated from various sources in various formats such as video, text, speech, log files, and images. A single second of high-definition (HD) video generates roughly 2,000 times as many bytes as a single page of text.
Consider the following statistics about Facebook, as reported on the company’s web site:
1.There were 968 million daily active users on average for June of 2015. There were 844 million mobile daily active users on average for June of 2015.

2.There were 1.49 billion monthly active users as of June 30, 2015. There were 1.31 billion mobile monthly active users as of June 30, 2015.

3.There were 4.5 billion likes generated daily as of May 2013, which is a 67 percent increase from August 2012.

Figure 1-1 depicts statistics about Twitter.


Figure 1-1.
If you printed Twitter…
Here’s another example: consider the amount of data that a simple event like going to a movie can generate. You start by searching for a movie on movie review sites, reading reviews about that movie, and posting queries. You may tweet about the movie or post photographs of going to the movie on Facebook. While travelling to the theater, your GPS system tracks your course and generates data.
You get the picture: smartphones, social networking sites, and other media are creating a flood of data for companies to process and store. When the size of data poses challenges to the ability of typical software tools to capture, process, store, and manage data, then we have big data in hand. Figure 1-2 graphically defines big data.


Figure 1-2.
Definition of Big Data
1.2 Big Data
Big data is data that has high volume, is generated at high velocity, and has multiple varieties. Let’s look at a few facts and figures about big data.
1.2.1 Facts About Big Data
Various research teams around the world have analyzed the amount of data being generated. For example, IDC’s analysis revealed that the amount of digital data generated in a single year (2007) was larger than the world’s total capacity to store it, which means there is no way to store all of the data being generated. Also, the rate at which data is generated will soon outgrow the rate at which data storage capacity is expanding.
The following sections cover insights from the MGI (McKinsey Global Institute) report ( www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation ) that was published in May 2011. The study makes the case that the business and economic possibilities of big data and its wider implications are important issues that business leaders and policy makers must tackle.
1.2.1.1 The Size of Big Data Varies Across Sectors
The growth of big data is a phenomenon that is observed in every sector. MGI estimates that enterprises around the world used more than 7 exabytes of incremental disk drive data storage capacity in 2010; what’s interesting is that nearly 80 percent of that total seemed to duplicate data that was stored elsewhere. MGI also estimated that, by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data per company and that many sectors had more than 1 petabyte in mean stored data per company.
Some sectors exhibited far higher levels of data intensity than others; in this case, data intensity refers to the average amount of data getting accumulated across companies/firms of that sector, implying that they have more potential to capture value from big data.
Financial services sectors, including banking, investment, and securities services, are highly transaction-oriented; they are also required by regulations to store data. The analysis shows that they have the most digital data stored per firm on average.
Communications and media firms, utilities, and government also have significant digital data stored per enterprise or organization, which appears to reflect the fact that such entities have a high volume of operations and multimedia data.
Discrete and process manufacturing have the highest aggregate data stored in bytes. However, these sectors rank much lower in intensity terms, since they are fragmented into a large number of firms.
1.2.1.2 The Big Data Type Varies Across Sectors
The MGI research also shows that the type of data stored varies by sector. For instance, retail and wholesale, administrative parts of government, and financial services all generate significant amounts of text and numerical data, including customer data, transaction information, and mathematical modeling and simulations. Sectors such as manufacturing, health care, and media and communications are responsible for higher percentages of multimedia data. And image data in the form of X-rays, CT scans, and other scans dominates data storage volumes in health care.
In terms of geographic spread of big data, North America and Europe have 70% of the global total currently. Thanks to cloud computing, data generated in one region can be stored in another country’s datacenter. As a result, countries with significant cloud and hosting provider offerings tend to have high storage of data.
1.3 Big Data Sources
In this section, we will cover the major factors that are contributing to the ever increasing size of data. Figure 1-3 depicts the major contributing sources.


Figure 1-3.
Sources of data
As highlighted in the MGI report, the major sources of this data are
Enterprises, which now collect data at a finer granularity, attaching more detail to every transaction in order to understand consumer behavior.
Increase in multimedia usage across industries such as health care, product companies, etc.
Increased popularity of social media sites such as Facebook, Twitter, etc.
Rapid adoption of smartphones, which enable users to actively use social media sites and other Internet applications.
Increased usage of sensors and devices in the day-to-day world, which are connected by networks to computing resources.

The MGI report also projects that the number of machine-to-machine devices such as sensors (also referred to as the Internet of Things, or IoT) will grow at a rate exceeding 30 percent annually over the next five years.
Thus, the rate of growth of data is increasing, and so is its diversity. Also, the model of data generation has changed from a few companies generating data and others consuming it to everyone generating data and everyone consuming it. This is due to the penetration of consumer IT and Internet technologies, along with trends like social media. Figure 1-4 depicts the change in the data generation model.


Figure 1-4.
Data model
1.4 Three Vs of Big Data
We have defined big data as data with three Vs: volume, velocity, and variety, as shown in Figure 1-5. Let’s look at each of the three Vs. It is imperative that organizations and IT leaders focus on these aspects.


Figure 1-5.
The three Vs of big data. The “big” isn’t just the volume
1.4.1 Volume
Volume in big data means the size of the data. As discussed in the previous sections, various factors contribute to the size of big data: as businesses become more transaction-oriented, we see ever-increasing numbers of transactions; more devices are getting connected to the Internet, which adds to the volume; there is increased usage of the Internet; and there is an increase in the digitization of content. Figure 1-6 depicts the growth of the digital universe since 2009.


Figure 1-6.
Digital universe size
In today’s scenario, data is not just generated from within the enterprise; it’s also generated based on transactions with the extended enterprise and customers. This requires extensive maintenance of customer data by the enterprises. A petabyte scale is becoming commonplace these days. Figure 1-7 depicts the data growth rate.


Figure 1-7.
Growth rate
This huge volume of data is the biggest challenge for big data technologies. The storage and processing power needed to store, process, and make accessible the data in a timely and cost effective manner is massive.
1.4.2 Variety
The data generated from various devices and sources follows no fixed format or structure. Unlike traditional text, CSV, or RDBMS data, it ranges from text files, log files, streaming videos, photos, meter readings, stock ticker data, and PDFs to audio and various other unstructured formats.
There is no control over the structure of the data these days. New sources and structures of data are being created at a rapid pace. So the onus is on technology to find a solution to analyze and visualize the huge variety of data that is out there. As an example, to provide alternate routes for commuters, a traffic analysis application needs data feeds from millions of smartphones and sensors to provide accurate analytics on traffic conditions and alternate routes.
1.4.3 Velocity
Velocity in big data is the speed at which data is created and the speed at which it is required to be processed. If data cannot be processed at the required speed, it loses its significance. Due to data streaming in from social media sites, sensors, tickers, metering, and monitoring, it is important for organizations to speedily process data both when it is in motion and when it is at rest (see Figure 1-8). Reacting and processing quickly enough to deal with the velocity of data is one more challenge for big data technology.


Figure 1-8.
The three aspects of data
Real-time insight is essential in many big data use cases. For example, an algorithmic trading system takes real-time feeds from the market and social media sites like Twitter to make decisions on stock trading. Any delay in processing this data can mean millions of dollars in lost opportunities on a stock trade.
There is a fourth V that is talked about whenever big data is discussed. The fourth V is veracity, which means not all the data out there is important, so it’s essential to identify what will provide meaningful insight, and what should be ignored.
1.5 Usage of Big Data
This section will focus on ways of using big data for creating value for organizations. Before we delve into how big data can be made usable to the organizations, let’s first look at why big data is important.
Big data is a completely new source of data; it’s data that is generated when you post on a blog, like a product, or travel. Previously, such minute details were not captured. Now they are, and organizations that embrace such data can pursue innovations, improve their agility, and increase their profitability.
Big data can create value for any organization in a variety of ways. As listed in the MGI report, this can be broadly categorized into five ways of usage of big data.
1.5.1 Visibility
Accessibility to data in a timely fashion to relevant stakeholders generates a tremendous amount of value. Let’s understand this with an example. Consider a manufacturing company that has R&D, engineering, and manufacturing departments dispersed geographically. If the data is accessible across all these departments and can be readily integrated, it can not only reduce the search and processing time but will also help in improving the product quality according to the present needs.
1.5.2 Discover and Analyze Information
Most of the value of big data comes when data collected from outside sources can be merged with the organization’s internal data. Organizations are capturing detailed data on inventories, employees, and customers. Using all of this data, they can discover and analyze new information and patterns; as a result, this information and knowledge can be used to improve processes and performance.
1.5.3 Segmentation and Customizations
Big data enables organizations to create tailor-made products and services to meet specific segment needs. This can also be used in the social sector to accurately segment populations and target benefit schemes for specific needs. Segmentation of customers based on various parameters can aid in targeted marketing campaigns and tailoring of products to suit the needs of customers.
1.5.4 Aiding Decision Making
Big data can substantially minimize risks, improve decision making , and uncover valuable insights. Automated fraud alert systems in credit card processing and automatic fine-tuning of inventory are examples of systems that aid or automate decision-making based on big data analytics.
1.5.5 Innovation
Big data enables innovation of new ideas in the form of products and services. It enables innovation in the existing ones in order to reach out to large segments of people. Using data gathered for actual products, the manufacturers can not only innovate to create the next generation product but they can also innovate sales offerings.
As an example, real-time data from machines and vehicles can be analyzed to provide insight into maintenance schedules; wear and tear on machines can be monitored to make more resilient machines; fuel consumption monitoring can lead to higher efficiency engines. Real-time traffic information is already making life easier for commuters by providing them options to take alternate routes.
Thus, big data is not just the volume of data. It’s the opportunities in finding meaningful insights from the ever-increasing pool of data. It’s helping organizations make more informed decisions, which makes them more agile. It not only provides the opportunity for organizations to strengthen existing business by making informed decisions, it also helps in identifying new opportunities.
1.6 Big Data Challenges
Big data also poses some challenges. In this section, we will highlight a few of them.
1.6.1 Policies and Procedures
As more and more data is gathered, digitized, and moved around the globe, the policy and compliance issues become increasingly important. Data privacy, security, intellectual property, and protection are of immense importance to organizations.
Compliance with various statutory and legal requirements poses a challenge in data handling. Issues around ownership and liabilities around data are important legal aspects that need to be dealt with in cases of big data.
Moreover, many big data projects leverage the scalability features of public cloud computing providers. This poses a challenge for compliance.
Policy questions on who owns the data, what is defined as fair use of data, and who is responsible for accuracy and confidentiality of data also need to be answered.
1.6.2 Access to Data
Accessing data for consumption is a challenge for big data projects. Some of the data may be available to third parties, and gaining access can be a legal, contractual challenge.
Data about a product or service is available on Facebook, Twitter feeds, reviews, and blogs, so how does the product owner access this data from various sources owned by various providers?
Likewise, contractual clauses and economic incentives for accessing big data need to be put in place to enable the data to be made available to consumers.
1.6.3 Technology and Techniques
New tools and technologies built specifically to address the needs of big data must be leveraged, rather than trying to address the aforementioned issues through legacy systems. The inadequacy of legacy systems in dealing with big data on the one hand, and the lack of experienced resources in the newer technologies on the other, is a challenge that any big data project has to manage.
1.7 Legacy Systems and Big Data
In this section, we will discuss the challenges that organizations are facing when managing big data using legacy systems .
1.7.1 Structure of Big Data
Legacy systems are designed to work with structured data where tables with columns are defined. The format of the data held in the columns is also known.
However, big data is data with many structures. It’s basically unstructured data such as images, videos, logs, etc.
Since big data can be unstructured, legacy systems created to perform fast queries and analysis through techniques like indexing based on particular data types held in various columns cannot be used to hold or process big data.
1.7.2 Data Storage
Legacy systems use big servers and NAS and SAN systems to store data. As the data increases, the server size and the backend storage size have to be increased. Traditional legacy systems typically work in a scale-up model, where more and more compute, memory, and storage must be added to a single server to meet the increased data needs. As a result, processing time increases dramatically, which defeats the other important requirement of big data: velocity.
1.7.3 Data Processing
The algorithms in legacy systems are designed to work with structured data such as strings and integers. They are also limited by the size of data. Thus, legacy systems are not capable of handling the processing of unstructured data, huge volumes of such data, or the speed at which the processing needs to be performed.
As a result, to capture value from big data, we need to deploy newer technologies in the field of storing, computing, and retrieving, and we need new techniques for analyzing the data.
1.8 Big Data Technologies
You have seen what big data is. In this section we will briefly look at what technologies can handle this humongous source of data. The technologies in discussion need to efficiently accept and process different types of data.
The recent technology advancements that enable organizations to make the most of their big data are the following:
1.New storage and processing technologies designed specifically for large unstructured data

2.Parallel processing

3.Clustering

4.Large grid environments

5.High connectivity and high throughput

6.Cloud computing and scale-out architectures

There are a growing number of technologies that are making use of these technological advancements. In this book, we will be discussing MongoDB, one of the technologies that can be used to store and process big data.
1.9 Summary
In this chapter you learned about big data. You looked into the various sources that generate big data, and the usage and challenges posed by big data. You also looked at why newer technologies are needed to store and process big data.
In the following chapters, you will look into a few of the technologies that help organizations manage big data and enable them to get meaningful insights from big data.



2. NoSQL

“NoSQL is a new way of designing Internet-scale database solutions. It is not a product or technology but a term that defines a set of database technologies that are not based on the traditional RDBMS principles.”
In this chapter, we will cover the definition and basics of NoSQL. We will introduce you to the CAP theorem and will talk about the NRW notations. We will compare the ACID and BASE approaches and finally conclude the chapter by comparing NoSQL and SQL database technologies.
2.1 SQL
The idea of the RDBMS was born from E.F. Codd’s 1970 paper titled “A Relational Model of Data for Large Shared Data Banks.” The language used to query RDBMS systems is SQL (Structured Query Language).
RDBMS systems are well suited for structured data held in columns and rows, which can be queried using SQL. The RDBMS systems are based on the concept of ACID transactions. ACID stands for Atomic, Consistent, Isolated, and Durable, where
Atomic implies either all changes of a transaction are applied completely or not applied at all.
Consistent means the data is in a consistent state after the transaction is applied. This means that after a transaction is committed, queries fetching a particular piece of data will see the same result.
Isolated means the transactions that are applied to the same set of data are independent of each other. Thus, one transaction will not interfere with another transaction.
Durable means the changes are permanent in the system and will not be lost in case of any failures.

2.2 NoSQL
NoSQL is a term used to refer to non-relational databases. Thus, it encompasses the majority of data stores that are not based on conventional RDBMS principles and that are used for handling large data sets at Internet scale.
Big data, as discussed in the previous chapter, is posing challenges to the traditional ways of storing and processing data, such as the RDBMS systems. As a result, we see the rise of NoSQL databases, which are designed to process this huge amount and variety of data within time and cost constraints.
Thus NoSQL databases evolved from the need to handle big data; traditional RDBMS technologies could not provide adequate solutions. Figure 2-1 shows the rise of un/semi-structured data over the years as compared to structured data .


Figure 2-1.
Structured vs. un/Semi-Structured data
Some examples of big data use cases that are a good fit for NoSQL databases are the following:
Social Network Graph: Who is connected to whom? Whose post should be visible on the user’s wall or homepage on a social network site?
Search and Retrieve: Search all relevant pages with a particular keyword ranked by the number of times a keyword appears on a page.

2.2.1 Definition
NoSQL doesn’t have a formal definition . It represents a form of persistence/data storage mechanism that is fundamentally different from RDBMS. But if pushed to define NoSQL, here it is: NoSQL is an umbrella term for data stores that don’t follow the RDBMS principles.
Note
The term was used initially to mean “do not use SQL if you want to scale.” Later it was redefined to mean “not only SQL,” meaning that in addition to SQL, other complementary database solutions exist.
2.2.2 A Brief History of NoSQL
In 1998, Carlo Strozzi coined the term NoSQL. He used this term to identify his database because the database didn’t have a SQL interface. The term resurfaced in early 2009 when Eric Evans (a Rackspace employee) used this term in an event on open source distributed databases to refer to distributed databases that were non-relational and did not follow the ACID features of relational databases.
2.3 ACID vs. BASE
In the introduction, we mentioned that traditional RDBMS applications have focused on ACID transactions. However essential these qualities may seem, they are quite incompatible with the availability and performance requirements of web-scale applications.
Let’s say, for example, that you have a company like OLX, which sells products such as used household goods (old furniture, vehicles, etc.) and uses an RDBMS as its database. Let’s consider two scenarios.
First scenario: Consider an e-commerce shopping site where a user is buying a product. During the transaction, the user locks a part of the database (the inventory), and every other user must wait until the user holding the lock completes the transaction.
Second scenario: The application might end up using cached data or even unlocked records, resulting in inconsistency. In this case, two users might end up buying the product when the inventory was actually zero.
In either case, the system may become slow, impacting scalability and user experience.
In contrast to the ACID approach of traditional RDBMS systems, NoSQL solves the problem using an approach popularly called BASE. Before explaining BASE, let’s explore the concept of the CAP theorem.
2.3.1 CAP Theorem (Brewer’s Theorem )
Eric Brewer outlined the CAP theorem in 2000. This is an important concept that needs to be well understood by developers and architects dealing with distributed databases. The theorem states that when designing an application in a distributed environment there are three basic requirements that exist, namely consistency, availability, and partition tolerance.
Consistency means that the data remains consistent after any operation is performed that changes the data, and that all users or clients accessing the application see the same updated data.
Availability means that the system is always available.
Partition Tolerance means that the system will continue to function even if it is partitioned into groups of servers that are not able to communicate with one another.

The CAP theorem states that at any point in time a distributed system can fulfil only two of the above three guarantees (Figure 2-2).


Figure 2-2.
CAP Theorem
2.3.2 The BASE
Eric Brewer coined the BASE acronym . BASE can be explained as
Basically Available means the system will be available in terms of the CAP theorem.
Soft state indicates that even if no input is provided to the system, the state will change over time. This is in accordance with eventual consistency.
Eventual consistency means the system will attain consistency in the long run, provided no input is sent to the system during that time.

Hence BASE is in contrast with the RDBMS ACID transactions.
You have seen that NoSQL databases are eventually consistent but the eventual consistency implementation may vary across different NoSQL databases.
NRW is the notation used to describe how the eventual consistency model is implemented across NoSQL databases where
N is the number of data copies that the database has maintained.
R is the number of copies that an application needs to refer to before returning a read request’s output.
W is the number of data copies that need to be written to before a write operation is marked as completed successfully.

Using these notation configurations , the databases implement the model of eventual consistency.
Consistency can be implemented at both read and write operation levels.
Write Operations
N=W implies that the write operation will update all data copies before returning control to the client and marking the write operation as successful. This is similar to how traditional RDBMS databases work when implementing synchronous replication. This setting slows down write performance.
If write performance is a concern, which means you want writes to happen fast, you can set W=1, R=N. This implies that the write updates just one copy and is marked successful, but whenever the user issues a read request, all N copies are read before returning the result. If any copy is outdated, it is brought up to date first, and only then is the read returned as successful. This implementation slows down read performance.
Hence most NoSQL implementations use N>W>1. This implies that more than one node needs to be updated successfully; however, not all nodes need to be updated at the same time.
Read Operations
If R is set to 1, the read operation reads any single data copy, which may be outdated. If R>1, more than one copy is read, and the most recent value is returned. However, this can slow down the read operation.
Using W+R>N always ensures that a read operation retrieves the latest value. This is because the set of copies written and the set of copies read always overlap in at least one copy, ensuring that at least one copy read has the latest version. This is known as quorum assembly (see the sketch below).
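
As a rough, hypothetical illustration (not from the original text), MongoDB exposes a similar trade-off through its write concern: the w value controls how many replica set members must acknowledge a write before it is reported as successful, much like W above. The collection, field, and timeout values below are made up.

// Assume a hypothetical replica set with three data-bearing members (N = 3).
// w: "majority" plays a role similar to W: the insert is acknowledged only
// after a majority of the members (2 of 3) have recorded it.
db.products.insert(
    { sku: "A-100", qty: 25 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
);

// w: 1 (the default) acknowledges as soon as the primary has the write;
// it is faster but relies on replication catching up afterwards.
db.products.insert(
    { sku: "A-101", qty: 10 },
    { writeConcern: { w: 1 } }
);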

Table 2-1 compares ACID vs. BASE.
Table 2-1. ACID vs. BASE

ACID          BASE
Atomicity     Basically Available
Consistency   Eventual Consistency
Isolation     Soft State
Durable

2.4 NoSQL Advantages and Disadvantages
In this section, you will look at the advantages and disadvantages of NoSQL databases.
2.4.1 Advantages of NoSQL
Let’s talk about the advantages of NoSQL databases.
High scalability: Traditional RDBMS systems rely on scaling up (using bigger, more powerful servers), an approach that fails when transaction rates and fast-response requirements increase. In contrast to this, the new generation of NoSQL databases is designed to scale out (i.e., to expand horizontally using low-end commodity servers).
Manageability and administration: NoSQL databases are designed to mostly work with automated repairs, distributed data, and simpler data models, leading to low manageability and administration.
Low cost: NoSQL databases are typically designed to work with a cluster of cheap commodity servers, enabling the users to store and process more data at a low cost.
Flexible data models: NoSQL databases have a very flexible data model, enabling them to work with any type of data; they don’t comply with the rigid RDBMS data models. As a result, any application changes that involve updating the database schema can be easily implemented.

2.4.2 Disadvantages of NoSQL
In addition to the above mentioned advantages, there are many impediments that you need to be aware of before you start developing applications using these platforms.
Maturity: Most NoSQL databases are pre-production versions with key features still to be implemented. Thus, when deciding on a NoSQL database, you should analyze the product properly to ensure the features are fully implemented and not still on the to-do list.
Support: Support is one limitation that you need to consider. Most NoSQL databases are from start-ups and were open sourced. As a result, support is very minimal compared to enterprise software companies and may not have global reach or support resources.
Limited Query Capabilities: Since NoSQL databases are generally developed to meet the scaling requirement of the web-scale applications, they provide limited querying capabilities. A simple querying requirement may involve significant programming expertise.
Administration: Although NoSQL is designed to provide a no-admin solution, it still requires skill and effort for installing and maintaining the solution.
Expertise: Since NoSQL is an evolving area, expertise on the technology is limited in the developer and administrator community.

Although NoSQL is becoming an important part of the database landscape, you need to be aware of the limitations and advantages of the products to make the correct choice of the NoSQL database platform.
2.5 SQL vs. NoSQL Databases
Now you know the details regarding NoSQL databases. Although NoSQL is increasingly getting adopted as a database solution, it’s not here to replace SQL or RDBMS databases . In this section, you will look at the differences between SQL and NoSQL databases.
Let’s do a quick recap of the RDBMS system. RDBMS systems have prevailed for about 30 years, and even now they are the default choice of solution architects for data storage in an application. If we list a few of the good points of an RDBMS system, the first and foremost is the use of SQL, a rich declarative query language for data processing that is well understood by users. In addition, RDBMS systems offer ACID support for transactions, which is a must in many sectors, such as banking applications.
However, the biggest drawbacks of the RDBMS system are its difficulty in handling schema changes and its scaling issues as data increases. As data grows, read/write performance degrades. You face scaling issues with RDBMS systems because they are mostly designed to scale up and not scale out.
In contrast to SQL RDBMS databases, NoSQL promotes data stores that break away from the RDBMS paradigm.
Let’s talk about technical scenarios and how they compare in RDBMS vs. NoSQL :
Schema flexibility : This is a must for easy future enhancements and integration with external applications (outbound or inbound).
RDBMS are quite inflexible in their design. Adding a column is an absolute no-no, especially if the table has some data; the reasons range from default values to indexes and performance implications. More often than not, you end up creating new tables and increasing the complexity by introducing relationships across tables.
Complex queries : The traditional designing of the tables leads to developers writing complex JOIN queries, which are not only difficult to implement and maintain but also take substantial database resources to execute.
Data update : Updating data across tables is probably one of the more complex scenarios, especially if they are a part of the transaction. Note that keeping the transaction open for a long duration hampers the performance. You also have to plan for propagating the updates to multiple nodes across the system. And if the system does not support multiple masters or writing to multiple nodes simultaneously, there is a risk of node failure and the entire application moving to read-only mode.
Scalability : Often the only scalability that may be required is for read operations. However, several factors impact this speed as operations grow. Some of the key questions to ask are:
What is the time taken to synchronize the data across physical database instances?
What is the time taken to synchronize the data across datacenters?
What is the bandwidth requirement to synchronize data?
Is the data exchanged optimized?
What is the latency when any update is synchronized across servers? Typically, the records will be locked during an update.

NoSQL-based solutions provide answers to most of the challenges listed above.
Let’s now see what NoSQL has to offer against each technical question mentioned above.
Schema flexibility : Column-oriented databases store data as columns as opposed to rows in RDBMS. This allows the flexibility of adding one or more columns as required, on the fly. Similarly, document stores that allow storing semi-structured data are also good options.
Complex queries : NoSQL databases do not have support for relationships or foreign keys. There are no complex queries. There are no JOIN statements.
Is that a drawback? How does one query across tables?
It is a functional drawback, definitely. To query across tables, multiple queries must be executed. A database is a shared resource, used across application servers, and must be released from use as quickly as possible. The options involve a combination of simplifying the queries to be executed, caching data, and performing complex operations in the application tier. A lot of databases provide built-in entity-level caching. This means that when a record is accessed, it may be automatically cached transparently by the database. The cache may be an in-memory distributed cache, for performance and scale.
Data update : Data updating and synchronization across physical instances are difficult engineering problems to solve. Synchronization across nodes within a datacenter has a different set of requirements compared to synchronizing across multiple datacenters. One would want the latency within a couple of milliseconds or tens of milliseconds at the best. NoSQL solutions offer great synchronization options.
MongoDB, for example, allows concurrent updates across nodes, synchronization with conflict resolution, and, eventually, consistency across datacenters within an acceptable time, typically a few milliseconds. As such, MongoDB has no concept of isolation. Note that because the complexity of managing the transaction may be moved out of the database, the application has to do some of the hard work.
An example of this is a two-phase commit while implementing transactions ( http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/ ).
A plethora of databases offer multiversion concurrency control (MVCC) to achieve transactional consistency.
Well, as Dan Pritchett ( www.addsimplicity.com/ ), Technical Fellow at eBay puts it, eBay.com does not use transactions. Note that PayPal does use transactions.
Scalability: NoSQL solutions provide greater scalability for obvious reasons. A lot of the complexity that is required for transaction-oriented RDBMS does not exist in ACID-non-compliant NoSQL databases. Interestingly, since NoSQL does not provide cross-table references and no JOIN queries are possible, and because you can’t write a single query to collate data across multiple tables, one simple and logical solution is, at times, to duplicate the data across collections. In some scenarios, embedding the information within the primary entity, especially in one-to-one mapping cases, may be a great idea (see the sketch just below).
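
As a small, hypothetical sketch of this embedding approach (the collection and field names are invented and are not from the original text), a MongoDB document can carry its related one-to-one data inside itself, so a single query returns everything:

// Hypothetical "orders" collection: the customer's details are embedded
// (duplicated) inside each order rather than joined in from another table.
db.orders.insert({
    orderId: 1001,
    item: "Widget",
    qty: 2,
    customer: {                        // embedded one-to-one data
        name: "Test User",
        shippingAddress: "Address1"
    }
});

// One query, no JOIN: the order and its customer data come back together.
db.orders.findOne({ orderId: 1001 });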

Table 2-2 compares SQL and NoSQL technologies.
Table 2-2. SQL vs. NoSQL

Types
SQL databases: All types support the SQL standard.
NoSQL databases: Multiple types exist, such as document stores, key-value stores, column databases, etc.

Development History
SQL databases: Developed in the 1970s.
NoSQL databases: Developed in the 2000s.

Examples
SQL databases: SQL Server, Oracle, MySQL.
NoSQL databases: MongoDB, HBase, Cassandra.

Data Storage Model
SQL databases: Data is stored in rows and columns in a table, where each column is of a specific type. The tables are generally created on principles of normalization, and joins are used to retrieve data from multiple tables.
NoSQL databases: The data model depends on the database type; for example, data is stored as key-value pairs in key-value stores and as documents in document-based databases. The data model is flexible, in contrast to the rigid table model of the RDBMS.

Schemas
SQL databases: Fixed structure and schema, so any change to the schema involves altering the database.
NoSQL databases: Dynamic schema; new data types or structures can be accommodated by expanding or altering the current schema, and new fields can be added dynamically.

Scalability
SQL databases: Scale-up approach; as the load increases, bigger, more expensive servers are bought to accommodate the data.
NoSQL databases: Scale-out approach; the data load is distributed across inexpensive commodity servers.

Supports Transactions
SQL databases: Supports ACID and transactions.
NoSQL databases: Supports partitioning and availability, and compromises on transactions; transactions exist only at a certain level, such as the database level or the document level.

Consistency
SQL databases: Strong consistency.
NoSQL databases: Depends on the product; a few provide strong consistency whereas others provide eventual consistency.

Support
SQL databases: A high level of enterprise support is provided.
NoSQL databases: Open source model; support is provided through third parties or companies building the open source products.

Maturity
SQL databases: Have been around for a long time.
NoSQL databases: Some of them are mature; others are evolving.

Querying Capabilities
SQL databases: Available through easy-to-use GUI interfaces.
NoSQL databases: Querying may require programming expertise and knowledge; rather than a UI, the focus is on functionality and programming interfaces.

Expertise
SQL databases: A large community of developers has been leveraging the SQL language and RDBMS concepts to architect and develop applications.
NoSQL databases: A smaller community of developers is working on these open source tools.
2.6 Categories of NoSQL Databases
In this section, you will quickly explore the NoSQL landscape. You will look at the emerging categories of NoSQL databases. Table 2-3 shows a few of the projects in the NoSQL landscape, with the types and the players in each category.
Table 2-3. NoSQL Categories

Document-based
Description: Data is stored in the form of documents, for instance {Name: "Test User", Address: "Address1", Age: 8}.
Examples: MongoDB

XML databases
Description: XML is used for storing data.
Examples: MarkLogic

Graph databases
Description: Data is stored as collections of nodes, which are connected via edges. A node is comparable to an object in a programming language.
Examples: GraphDB

Key-value stores
Description: Data is stored as key-value pairs.
Examples: Cassandra, Redis, memcached
The NoSQL databases are categorized on the basis of how the data is stored. NoSQL mostly follows a horizontal structure because of the need to provide curated information from large volumes, generally in near real-time. They are optimized for insert and retrieve operations on a large scale with built-in capabilities for replication and clustering.
Table 2-4 briefly provides a feature comparison between the various categories of NoSQL databases.
Table 2-4. Feature Comparison

Feature                               Column-Oriented   Document Store   Key-Value Store   Graph
Table-like schema support (columns)   Yes               No               No                Yes
Complete update/fetch                 Yes               Yes              Yes               Yes
Partial update/fetch                  Yes               Yes              Yes               No
Query/filter on value                 Yes               Yes              No                Yes
Aggregate across rows                 Yes               No               No                No
Relationship between entities         No                No               No                Yes
Cross-entity view support             No                Yes              No                No
Batch fetch                           Yes               Yes              Yes               Yes
Batch update                          Yes               Yes              Yes               No
The important thing when considering a NoSQL project is the feature set you are interested in. When deciding on a NoSQL product, first you need to understand the problem requirements very carefully, and then you should look at other people who have already used the NoSQL product to solve similar problems. Remember that NoSQL is still maturing, so this will enable you to learn from peers and previous deployments and make better choices.
In addition, you also need to consider the following questions.
How big is the data that needs to be handled?
What throughput is acceptable for read and write?
How is consistency achieved in the system?
Does the system need to support high write performance or high read performance?
How easy is the maintainability and administration?
What needs to be queried?
What is the benefit of using NoSQL?

We recommend that you start small but significant, and consider a hybrid approach wherever possible.
2.7 Summary
In this chapter, you learned about NoSQL. You should now understand what NoSQL is and how it is different from SQL. You also looked into the various categories of NoSQL.
In the following chapters, you will look into MongoDB, which is a document-based NoSQL database.


3. Introducing MongoDB

“MongoDB is one of the leading NoSQL document store databases. It enables organizations to handle and gain meaningful insights from Big Data.”
Some leading enterprises and consumer IT companies have leveraged the capabilities of MongoDB in their products and solutions. The MongoDB 3.0 release introduced a pluggable storage engine and the Ops Manager, which has extended the set of applications that are best suited for MongoDB.
MongoDB derives its name from the word “humongous.” Like other NoSQL databases, MongoDB doesn’t comply with RDBMS principles. It doesn’t have the concepts of tables, rows, and columns, and it doesn’t provide features such as ACID compliance, JOINs, foreign keys, etc.
MongoDB stores data as Binary JSON documents (also known as BSON). The documents can have different schemas, which means that the schema can change as the application evolves. MongoDB is built for scalability, performance, and high availability.
In this chapter, we will talk a bit about MongoDB’s creation and the design decisions. We will look at the key features, components, and architecture of MongoDB in the following chapters.
3.1 History
In the later part of 2007, Dwight Merriman, Eliot Horowitz, and their team decided to develop an online service. The intent of the service was to provide a platform for developing, hosting, and auto-scaling web applications, much in line with products such as the Google App Engine or Microsoft Azure. Soon they realized that no open source database platform suited the requirements of the service.
“We felt like a lot of existing databases didn’t really have the ‘cloud computing’ principles you want them to have: elasticity, scalability, and … easy administration, but also ease of use for developers and operators,” Merriman said. “[MySQL] doesn’t have all those properties.” So they decided to build a database that would not comply with the RDBMS model.
A year later, the database for the service was ready to use. The service itself was never released, but the team decided in 2009 to open source the database as MongoDB. In March 2010, the release of MongoDB 1.4.0 was considered production-ready. The latest production release is 3.0, which was released in March 2015. MongoDB was built under the sponsorship of 10gen, a New York–based startup.
3.2 MongoDB Design Philosophy
In one of his talks, Eliot Horowitz mentioned that MongoDB wasn’t designed in a lab and is instead built from the experiences of building large scale, high availability, and robust systems. In this section, we will briefly look at some of the design decisions that led to what MongoDB is today.
3.2.1 Speed, Scalability, and Agility
The design team’s goal when designing MongoDB was to create a database that was fast, massively scalable, and easy to use. To achieve speed and horizontal scalability in a partitioned database, as explained in the CAP theorem, the consistency and transactional support have to be compromised. Thus, per this theorem, MongoDB provides high availability, scalability, and partitioning at the cost of consistency and transactional support. In practical terms, this means that instead of tables and rows, MongoDB uses documents to make it flexible, scalable, and fast.
3.2.2 Non-Relational Approach
Traditional RDBMS platforms provide scalability using a scale-up approach, which requires a faster server to increase performance. The following issues in RDBMS systems led to why MongoDB and other NoSQL databases were designed the way they are designed:
In order to scale out, the RDBMS database needs to link the data available in two or more systems in order to report back the result. This is difficult to achieve in RDBMS systems since they are designed to work when all the data is available for computation together. Thus the data has to be available for processing at a single location.
In case of multiple Active-Active servers, when both are getting updated from multiple sources there is a challenge in determining which update is correct.
When an application tries to read data from the second server, and the information has been updated on the first server but has yet to be synchronized with the second server, the information returned may be stale.

The MongoDB team decided to take a non-relational approach to solving these problems. As mentioned, MongoDB stores its data in BSON documents where all the related data is placed together, which means everything is in one place. The queries in MongoDB are based on keys in the document, so the documents can be spread across multiple servers. Querying each server means it will check its own set of documents and return the result. This enables linear scalability and improved performance.
MongoDB has a primary-secondary replication where the primary accepts the write requests. If the write performance needs to be improved, then sharding can be used; this splits the data across multiple machines and enables these multiple machines to update different parts of the datasets. Sharding is automatic in MongoDB; as more machines are added, data is distributed automatically.
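As a rough, hypothetical sketch (the database, collection, and shard key names are made up; sharding is covered in detail in a later chapter), enabling sharding from the mongo shell looks roughly like this:

// Run against the mongos router of an already configured sharded cluster.
// Enable sharding for a database, then shard one collection on a chosen
// shard key; MongoDB distributes and balances the data automatically as
// shards are added.
sh.enableSharding("mydb");
sh.shardCollection("mydb.users", { userId: 1 });
sh.status();    // shows how the data (chunks) is spread across the shards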
3.2.3 JSON-Based Document Store
MongoDB uses a JSON-based (JavaScript Object Notation) document store for the data. JSON/BSON offers a schema-less model, which provides flexibility in terms of database design. Unlike in RDBMSs, changes can be done to the schema seamlessly.
This design also makes for high performance by providing for grouping of relevant data together internally and making it easily searchable.
A JSON document contains the actual data and is comparable to a row in SQL. However, in contrast to RDBMS rows, documents can have dynamic schemas. This means documents within a collection can have different fields or structures, and common fields can have different types of data.
A document contains data in the form of key-value pairs. Let’s understand this with an example:
{
    "Name": "ABC",
    "Phone": ["1111111", "222222"],
    "Fax": null
}
As mentioned, keys and values come in pairs. The value of a key in a document can be left empty. In the above example, the document has three keys, namely “Name,” “Phone,” and “Fax.” The “Fax” key has no value (null).
3.2.4 Performance vs. Features
In order to make MongoDB high performance and fast, certain features commonly available in RDBMS systems are not available in MongoDB. MongoDB is a document-oriented DBMS where data is stored as documents. It does not support JOINs, and it does not have fully generalized transactions. However, it does provide support for secondary indexes, it enables users to query using query documents, and it provides support for atomic updates at a per document level. It provides a replica set, which is a form of master-slave replication with automated failover, and it has built-in horizontal scaling.
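As a brief, hypothetical sketch of these capabilities (the collection and field names are invented, not taken from the original text):

// A secondary index on a field other than _id.
db.users.createIndex({ email: 1 });

// Querying with a query document.
db.users.find({ email: "abc@example.com" });

// An update scoped to a single matching document; the modification is
// applied atomically at the document level.
db.users.update(
    { email: "abc@example.com" },
    { $inc: { loginCount: 1 } }
);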
3.2.5 Running the Database Anywhere
One of the main design decisions was the ability to run the database from anywhere, which means it should be able to run on servers, VMs, or even on the cloud using the pay-for-what-you-use service. The language used for implementing MongoDB is C++, which enables MongoDB to achieve this goal. The 10gen site provides binaries for different OS platforms, enabling MongoDB to run on almost any type of machine.
3.3 SQL Comparison
The following are the ways in which MongoDB is different from SQL.
1.MongoDB uses documents for storing its data, which offer a flexible schema (documents in same collection can have different fields). This enables the users to store nested or multi-value fields such as arrays, hashes, etc. In contrast, RDBMS systems offer a fixed schema where a column’s value should have a similar data type. Also, it’s not possible to store arrays or nested values in a cell.

2.MongoDB doesn’t provide support for JOIN operations the way SQL does. However, it enables the user to store all relevant data together in a single document, which largely avoids the need for JOINs. It also has workarounds to overcome this issue; we will discuss them in more detail in a later chapter.

3.MongoDB doesn’t provide support for transactions in the same way as SQL. However, it guarantees atomicity at the document level. Also, it uses an isolation operator to isolate write operations that affect multiple documents, but it does not provide “all-or-nothing” atomicity for multi-document write operations.

3.4 Summary
In this chapter, you got to know MongoDB, its history, and brief details on design of the MongoDB system. In the next chapters, you will learn more about MongoDB’s data model.
4. The MongoDB Data Model

“MongoDB is designed to work with documents without any need of predefined columns or data types (unlike relational databases), making the data model extremely flexible.”
In this chapter, you will learn about the MongoDB data model. You will also learn what a flexible schema (polymorphic schema) means and why it's a significant consideration in the MongoDB data model.
4.1 The Data Model
In the previous chapter, you saw that MongoDB is a document-based database system where the documents can have a flexible schema. This means that documents within a collection can have different (or same) sets of fields. This affords you more flexibility when dealing with data.
In this chapter, you will explore MongoDB’s flexible data model. Wherever required, we will demonstrate the difference in the approach compared to RDBMS systems.
A MongoDB deployment can have many databases. Each database is a set of collections. Collections are similar to the concept of tables in SQL; however, they are schemaless. Each collection can have multiple documents. Think of a document as a row in SQL. Figure 4-1 depicts the MongoDB database model .


Figure 4-1.
MongoDB database model
In an RDBMS system, since the table structures and the data types for each column are fixed, you can only add data of a particular data type in a column. In MongoDB, a collection is a collection of documents where data is stored as key-value pairs.
Let’s understand with an example how data is stored in a document. The following document holds the name and phone numbers of the users:
{"Name": "ABC", "Phone": ["1111111", "222222" ] }
Dynamic schema means that documents within the same collection can have the same or different sets of fields or structure, and even common fields can store different types of values across documents. There’s no rigidness in the way data is stored in the documents of a collection.
Let’s see an example of a Region collection:
{ "R_ID" : "REG001", "Name" : "United States" }
{ "R_ID" :1234, "Name" : "New York", "Country" : "United States" }
In this code, you have two documents in the Region collection. Although both documents are part of a single collection, they have different structures: the second document has an additional field of information, Country. In fact, if you look at the "R_ID" field, it stores a STRING value in the first document whereas it's a number in the second document.
Thus a collection's documents can have entirely different schemas. It is up to the application to decide whether to store documents with different structures together in a single collection or to use multiple collections.
4.1.1 JSON and BSON
MongoDB is a document-based database. It uses Binary JSON for storing its data.
In this section, you will learn about JSON and Binary-JSON (BSON) . JSON stands for JavaScript Object Notation. It’s a standard used for data interchange in today’s modern Web (along with XML). The format is human and machine readable. It is not only a great way to exchange data but also a nice way to store data.
All the basic data types (such as strings, numbers, Boolean values, and arrays) are supported by JSON.
The following code shows what a JSON document looks like:
{
"_id" : 1,
"name" : { "first" : "John", "last" : "Doe" },
"publications" : [
{
"title" : "First Book",
"year" : 1989,
"publisher" : "publisher1"
},
{ "title" : "Second Book",
"year" : 1999,
"publisher" : "publisher2"
}
]
}
JSON lets you keep all the related pieces of information together in one place, which provides excellent performance. It also enables each document to be updated independently of the others. It is schemaless.
4.1.1.1 Binary JSON (BSON)
MongoDB stores the JSON document in a binary-encoded format. This is termed as BSON. The BSON data model is an extended form of the JSON data model.
MongoDB’s implementation of a BSON document is fast, highly traversable, and lightweight. It supports embedding of arrays and objects within other arrays, and also enables MongoDB to reach inside the objects to build indexes and match objects against queried expressions, both on top-level and nested BSON keys.
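For example, assuming the publications document shown earlier is stored in a hypothetical users collection, an index can be created on the nested "name.last" key and then used by queries (a minimal sketch):
> db.users.createIndex({"name.last": 1})    // index on a nested BSON key
> db.users.find({"name.last": "Doe"})       // this query can now use the index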
4.1.2 The Identifier (_id)
You have seen that MongoDB stores data in documents. Documents are made up of key-value pairs. Although a document can be compared to a row in RDBMS, unlike a row, documents have flexible schema. A key, which is nothing but a label, can be roughly compared to the column name in RDBMS. A key is used for querying data from the documents. Hence, like a RDBMS primary key (used to uniquely identify each row), you need to have a key that uniquely identifies each document within a collection. This is referred to as _id in MongoDB.
If you have not explicitly specified a value for the _id key, a unique value will be automatically generated and assigned to it by MongoDB. This key value is immutable and can be of any data type except arrays.
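A quick sketch (the users collection is illustrative only): when _id is omitted, MongoDB generates an ObjectId; an explicitly supplied value is stored as-is.
> db.users.insert({"name": "ABC"})            // _id is generated automatically as an ObjectId
> db.users.insert({"_id": 101, "name": "XYZ"})  // the supplied value 101 becomes this document's _id
> db.users.findOne({"name": "ABC"})           // the returned document shows the generated _id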
4.1.3 Capped Collection
You are now well versed with collections and documents. Let’s talk about a special type of collection called a capped collection.
MongoDB has a concept of capping the collection. This means it stores the documents in the collection in insertion order. As the collection reaches its limit, documents are removed from it in FIFO (first in, first out) order, meaning the least recently inserted documents are removed first.
This is good for use cases where the order of insertion needs to be maintained automatically, and deletion of records after a fixed size is required. One such use case is log files that get automatically truncated after reaching a certain size.
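A capped collection has to be created explicitly with a maximum size in bytes (and optionally a maximum number of documents); the collection name and limits below are only illustrative:
> db.createCollection("applog", {capped: true, size: 1048576, max: 1000})
{ "ok" : 1 }
> db.applog.isCapped()
true
Once the 1 MB size limit (or the 1,000 document limit) is reached, the oldest documents are removed automatically as new ones are inserted.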
Note
MongoDB itself uses capped collections for maintaining its replication logs. Capped collection guarantees preservation of the data in insertion order, so queries retrieving data in the insertion order return results quickly and don’t need an index. Updates that change the document size are not allowed.
4.2 Polymorphic Schemas
As you are already conversant with the schemaless nature of MongoDB data structure, let’s now explore polymorphic schemas and use cases.
A polymorphic schema is a schema where a collection has documents of different types or schemas. A good example of this schema is a collection named Users. Some user documents might have an extra fax number or email address, while others might have only phone numbers, yet all these documents coexist within the same Users collection. This schema is generally referred to as a polymorphic schema.
In this part of the chapter, you’ll explore the various reasons for using a polymorphic schema .
4.2.1 Object-Oriented Programming
Object-oriented programming enables you to have classes share data and behaviors using inheritance. It also lets you define functions in the parent class that can be overridden in the child class and thus will function differently in a different context. In other words, you can use the same function name to manipulate the child as well as the parent class, although under the hood the implementations might be different. This feature is referred to as polymorphism.
The requirement in this case is the ability to have a schema wherein all of the related sets of objects or objects within a hierarchy can fit in together and can also be retrieved identically.
Let’s consider an example. Suppose you have an application that lets the user upload and share different content types such as HTML pages, documents, images, videos, etc. Although many of the fields are common across all of the above-mentioned content types (such as Name, ID, Author, Upload Date, and Time), not all fields are identical. For example, in the case of images, you have a binary field that holds the image content, whereas an HTML page has a large text field to hold the HTML content.
In this scenario, the MongoDB polymorphic schema can be used wherein all of the content node types are stored in the same collection, such as LoadContent, and each document has relevant fields only.
// "Document collections" - "HTMLPage" document
{
_id: 1,
title: "Hello",
type: "HTMLpage",
text: "<html>Hi..Welcome to my world</html>"
}
...
// Document collection also has a "Picture" document
{
_id: 3,
title: "Family Photo",
type: "JPEG",
sizeInMB: 10, ...
}
This schema not only enables you to store related data with different structures together in the same collection, it also simplifies querying. The same collection can be used to perform queries on common fields, such as fetching all content uploaded on a particular date and time, as well as queries on specific fields, such as finding images with a size greater than X MB.
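For instance, both kinds of queries below run against the same hypothetical LoadContent collection (the uploadDate field name is assumed for this sketch):
// Query on a field common to every content type
> db.LoadContent.find({"uploadDate": {$gte: ISODate("2015-07-01")}})
// Query on a field that only picture documents carry
> db.LoadContent.find({"type": "JPEG", "sizeInMB": {$gt: 5}})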
Thus object-oriented programming is one of the use cases where having a polymorphic schema makes sense.
4.2.2 Schema Evolution
When you are working with databases, one of the most important considerations you need to account for is schema evolution (i.e., the impact of schema changes on the running application). The design should be done in a way that has minimal or no impact on the application, meaning no or minimal downtime, no or very minimal code changes, and so on.
Typically, schema evolution happens by executing a migration script that upgrades the database schema from the old version to the new one. If the database is not in production, the script can be a simple drop and re-creation of the database. However, if the database is in a production environment and contains live data, the migration script will be more complex because the data needs to be preserved, and the script must take this into consideration. Although MongoDB offers an update operation that can be used to add a new field to every document in a collection, imagine the impact of doing this when the collection holds thousands of documents: it would be slow and would hurt the underlying application's performance. A better way is to include the new structure only in new documents being added to the collection and then gradually migrate the existing documents in the background while the application is still running. This is one of the many use cases where having a polymorphic schema is advantageous.
For example, say you are working with a Tickets collection where you have documents with ticket details, like so:
// "Ticket1" document (stored in "Tickets" collection")
{
_id: 1,
Priority: "High",
type: "Incident",
text: "Printer not working"
}
...
At some point, the application team decides to introduce a "short description" field in the ticket document structure, so the best alternative is to add this new field only to new ticket documents. Within the application, you embed a piece of code that handles retrieving both "old style" documents (without a short description field) and "new style" documents (with one). Gradually, the old style documents are migrated to the new style. Once the migration is complete, the code that handles the missing field can be removed if required.
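The following is a minimal sketch of this approach, reusing the Tickets example; the shortDescription field name and the fallback logic are assumptions for illustration:
// Application read path: tolerate both old style and new style documents
> var t = db.Tickets.findOne({"_id": 1})
> var shortDesc = t.shortDescription || t.text   // fall back to the full text for old documents
// Background migration: add the missing field to the remaining old style documents
// (this can be run in batches while the application keeps serving requests)
> db.Tickets.update(
... {"shortDescription": {$exists: false}},
... {$set: {"shortDescription": ""}},
... {multi: true}
... )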
4.3 Summary
In this chapter, you learned about the MongoDB data model. You also looked at identifiers and capped collections. You concluded the chapter with an understanding of how the flexible schema helps.
In the following chapter, you will get started with MongoDB. You will perform the installation and configuration of MongoDB.




5. MongoDB - Installation and Configuration



“MongoDB is a cross-platform database.”
In this chapter, you will go over the process of installing MongoDB on Windows and Linux.
5.1 Select Your Version
MongoDB runs on most platforms. A list of all the available packages is available on the MongoDB downloads page at www.mongodb.org/downloads .
The correct version for your environment will depend on your server’s operating system and the kind of processor. MongoDB supports both 32-bit and 64-bit architecture but it’s recommended to use 64-bit in your production environment.
32-bit limitation
This is due to the usage of memory mapped files in MongoDB. This limits the 32-bit builds to around 2GB of data. It’s recommended to use a 64-bit build for a production environment for performance reasons.
The latest MongoDB production release is 3.0.4 at the time of writing this book. Downloads for MongoDB are available for Linux, Windows, Solaris, and Mac OS X.
The MongoDB download page is divided into the following sections:
Current Stable Release (3.0.4) – 6/16/2015
Previous Releases (stable)
Development Releases (unstable)

The current release is the most stable recent version available, which at the time of writing is 3.0.4. When a new version is released, the prior stable release is moved to the Previous Releases section.
The development releases, as the name suggests, are the versions that are still under development and hence are tagged as unstable. These versions can have additional features but they may not be stable since they are still in the development phase. You can use the development versions to try out new features and provide feedback to 10gen regarding the features and issues faced.
5.2 Installing MongoDB on Linux
This section covers installing MongoDB on a LINUX system . For the following demonstration, we will be using an Ubuntu Linux distribution. You can install MongoDB either manually or via repositories. We will walk you through both options.
5.2.1 Installing Using Repositories
In LINUX, repositories are the online directories that contain software. Aptitude is the program used to install software on Ubuntu. Although MongoDB might be present in the default repositories, there is the possibility of an out-of-date version, so the first step is to configure Aptitude to look at the custom repository .
1.Issue the following to import the public GPG key for MongoDB:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

2.Next, use the following command to create the /etc/apt/sources.list.d/mongodb-org-3.0.list file:
echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list

3.Finally, use the following command to reload the repository:
sudo apt-get update
Now Aptitude is aware of the manually added repository.

4.Next, you need to install the software. The following command should be issued in the shell to install MongoDB’s current stable version:
sudo apt-get install -y mongodb-org

You’ve successfully installed MongoDB, and that’s all there is to it.
5.2.2 Installing Manually
In this section, you will see how MongoDB can be installed manually. This knowledge is important in the following cases:
When the Linux distribution doesn’t use Aptitude.
When the version you require is not available through repositories or is not part of the repository.
When you need to run multiple MongoDB versions simultaneously.

The first step in a manual installation is to decide on the version of MongoDB to use and then download from the site. Next, the package needs to be extracted using the following command:
$ tar -xvf mongodb-linux-x86_64-3.0.4.tgz
mongodb-linux-x86_64-3.0.4/THIRD-PARTY-NOTICES
mongodb-linux-x86_64-3.0.4/GNU-AGPL-3.0
mongodb-linux-x86_64-3.0.4/bin/mongodump
............
mongodb-linux-x86_64-3.0.4/bin/mongosniff
mongodb-linux-x86_64-3.0.4/bin/mongod
mongodb-linux-x86_64-3.0.4/bin/mongos
mongodb-linux-x86_64-3.0.4/bin/mongo
This extracts the package content to a new directory, namely mongodb-linux-x86_64-3.0.4 (which is located under your current directory). The directory contains many subdirectories and files. The main executable files are under the subdirectory bin.
This completes the MongoDB installation successfully.
5.3 Installing MongoDB on Windows
Installing MongoDB on Windows is a simple matter of downloading the msi file for the selected build of Windows and running the installer.
The installer will guide you through installation of MongoDB.
Following the wizard, you will reach the Choose Setup Type screen. There are two setup types available wherein you can customize your installation. In this example, select the setup type as Custom.
An installation directory needs to be specified when selecting Custom, so specify the directory to C:\PracticalMongoDB.
Note that MongoDB can be run from any folder selected by the user because it is self-contained and has no dependency on the system. If the setup type of Complete is selected, the default folder selected is C:\Program Files\MongoDB.
Clicking Next will take you to the Ready to Install screen. Click Install.
This will start the installation and will show the progress on a screen. Once the installation is complete, the wizard will take you to the completion screen.
Clicking Finish completes the setup. After successful completion of the above steps, you have a directory called C:\PracticalMongoDB with all the relevant applications in the bin folder. That’s all there is to it.
5.4 Running MongoDB
Let’s see how to start running and using MongoDB.
5.4.1 Preconditions
A data folder is required for storing the files. This by default is C:\data\db in Windows and /data/db in LINUX systems.
These data directories are not created by MongoDB, so before starting MongoDB the data directory needs to be manually created and you need to ensure that proper permissions are set (such as that MongoDB has read, write, and directory creation permissions).
If MongoDB is started before you create the folder, it will throw an error message and fail to run.
5.4.2 Starting the Service
Once the directories are created and permissions are in place, execute the mongod application (placed under the bin directory ) to start the MongoDB core database service.
Continuing from the above installation, the service can be started by opening the command prompt in Windows (which needs to be run as administrator) and executing the following:
c:\> c:\practicalmongodb\bin\mongod.exe
In case of Linux, the mongod process is started in the shell.
This will start the MongoDB database on the localhost interface. It will listen for connections from the mongo shell on port 27017.
As mentioned, the folder path needs to be created before starting the database, which by default is c:\data\db. An alternative path can also be provided when starting the database service by using the --dbpath parameter.
C:\> C:\practicalmongodb\bin\mongod.exe --dbpath C:\NewDBPath\DBContents
5.5 Verifying the Installation

The relevant executables will be present under the bin subdirectory. The following can be checked under the bin directory to verify that the installation succeeded:
mongod: the core database server
mongo: the database shell
mongos: the auto-sharding process
mongoexport: the export utility
mongoimport: the import utility

Apart from the above, there are other applications available in the bin folder.
The mongo application launches the mongo shell, which supplies access to the database contents and lets you fire selective queries or execute aggregation against the data in MongoDB.
The mongod application, as you saw above, is used to start the database service, or daemon.
Multiple flags can be set when launching the applications. For example, --dbpath can be used to specify an alternative path for where the database files should be stored. To get the list of all available options, include the --help flag when launching the service.
5.6 MongoDB Shell
The mongo shell comes as part of the standard distribution of MongoDB. The shell provides a full database interface for MongoDB, enabling you to play around with the data stored in MongoDB using a JavaScript environment, which has complete access to the language and all the standard functions.
Once database services have started, you can fire up the mongo shell and start using MongoDB. This can be done using Shell in Linux or the command prompt in Windows (run as administrator).
You must refer to the exact location of the executable, such as in the C:\practicalmongodb\bin\ folder in a Windows environment.
Open the command prompt (run as administrator) and type mongo.exe. Press the Enter key. This will start the mongo shell.
C:\> C:\practicalmongodb\bin\mongo.exe
MongoDB shell version: 3.0.4
connecting to: test
>
If no parameters are specified when starting the service, it connects to the default database named test on the localhost instance.
The database will be created automatically when you connect to it. MongoDB offers this feature of automatically creating a database if an attempt is made to access one that does not exist.
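For example (the database and collection names are arbitrary), switching to a database that doesn't exist and inserting a document is all it takes to create it:
> use mynewdb
switched to db mynewdb
> db.items.insert({"name": "first document"})
WriteResult({ "nInserted" : 1 })
> show dbs    // mynewdb now appears in the list of databases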
The next chapter offers more information on working with the mongo shell.
5.7 Securing the Deployment
You know how to install and start using MongoDB via the default configurations. Next, you need to ensure that the data that is stored within the database is secure in all aspects.
In this section, you will look at how to secure your data. You will change the configuration of the default installation to ensure that your database is more secure.
5.7.1 Using Authentication and Authorization
Authentication verifies the user’s identity, and authorization determines the level of actions that the user can perform on the authenticated database.
This means the users will be able to access the database only if they log in using the credentials that have access on the database. This disables anonymous access to the database. After the user is authenticated, authorization can be used to ensure that the user has only the required amount of access needed to accomplish the tasks at hand.
Both authentication and authorization exist at a per-database level. The users exist in the context of a single logical database.
The information on the users is maintained in a collection named system.users, which exists in the admin database. This collection maintains the credentials needed for authenticating the user wherein it stores the user id, password, and the database against which it is created, plus privileges needed for authorizing the user.
MongoDB uses a role-based approach for authorization (the roles of read, readWrite, readAnyDatabase, etc.). If needed, the user administrator can create custom roles.
A privilege document within the system.users collection is used for storing each user's roles. The same document maintains the user's authentication credentials.
An example of a document in the system.users collection is as follows:
{
_id : "practicaldb.Shaks",
user : "Shaks",
db : "practicaldb",
credentials : {.......},
roles : [
{ role: "read", db: "practicaldb" },
{ role: "readWrite", db: "MyDB" }
],
......
}
This document tells us that the user Shaks is associated with the database practicaldb and has the read role in the practicaldb database and the readWrite role in the MyDB database. Note that the user name and the associated database together uniquely identify a user within MongoDB, so if you have two users with the same name but associated with different databases, they are considered two distinct users. Thus a user can have multiple roles with different authorization levels on different databases.
The available roles are
read: This provides read-only access to all the collections of the specified database.
readWrite: This provides a read-write access to any collection within the specified database.
dbAdmin: This enables the user to perform administrative actions within the specified database such as index management using ensureIndex, dropIndexes, reIndex, indexStats, renaming collections, creating collections, etc.
userAdmin: This enables the user to perform readWrite operations on the system.users collection of the specified database. It also enables altering permissions of existing users or creating new users. This is effectively the SuperUser role for the specified database.
clusterAdmin: This role enables the user to grant access to administrative operations that alter or display information about the complete system. clusterAdmin is applicable only on the admin database.
readAnyDatabase: This role enables the user to read from any database in the MongoDB environment.
readWriteAnyDatabase: This role is similar to readWrite except it is for all databases.
userAdminAnyDatabase: This role is similar to the userAdmin role except it applies to all databases.
dbAdminAnyDatabase: This role is the same as dbAdmin, except it applies to all databases.
Starting from version 2.6, a user admin can also create user-defined roles to adhere to the policy of least privilege by providing access at collection level and command level. A user-defined role is scoped to the database in which it’s created and is uniquely identified by the combination of the database and the role name. All the user defined roles are stored in the system.roles collection.
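As an illustration (the role, database, and collection names are made up for this sketch), a user administrator could define a role that allows only find and insert on a single collection:
> use practicaldb
switched to db practicaldb
> db.createRole({
... role: "ticketEntry",
... privileges: [
...     { resource: { db: "practicaldb", collection: "Tickets" },
...       actions: [ "find", "insert" ] }
... ],
... roles: []
... })
The new role can then be granted to users with db.grantRolesToUser(), just like the built-in roles.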

5.7.1.1 Enabling Authentication
Authentication is disabled by default; to enable it, start mongod with the --auth option (mongod --auth). Before enabling authentication, you need to have at least one admin user. As you saw above, an admin user is a user who is responsible for creating and managing other users.
It is recommended that in production deployments such users are created solely for managing users and should not be used for any other roles. In a MongoDB deployment, this user is the first user that needs to be created; other users of the system can be created by this user.
The admin user can be created either before or after enabling authentication.
In this example, you will first create the admin user and then enable the auth settings. The below steps should be executed on the Windows platform.
Start the mongod with default settings:
C:\>C:\practicalmongodb\bin\mongod.exe
C:\practicalmongodb\bin\mongod.exe --help for help and startup options
2015-07-03T23:11:10.716-0700 I CONTROL Hotfix KB2731284 or later update is installed, no need to zero out data files
2015-07-03T23:11:10.716-0700 I JOURNAL [initandlisten] journal dir=C:\data\db\journal
...................................................
2015-07-03T23:11:10.763-0700 I CONTROL [initandlisten] MongoDB starting : pid=2776 port=27017 dbpath=C:\data\db\ 64-bit host=ANOC9
2015-07-03T23:11:10.763-0700 I CONTROL [initandlisten] targetMinOS: Windows 7/W
indows Server 2008 R2
2015-07-03T23:11:10.763-0700 I CONTROL [initandlisten] db version v3.0.4
2015-07-03T23:11:10.764-0700 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1j-fips 19 Mar 2015
2015-07-03T23:11:10.764-0700 I CONTROL [initandlisten] build info: windows sys. getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2015-07-03T23:11:10.771-0700 I NETWORK [initandlisten] waiting for connections
on port 27017
5.7.1.2 Creating the Admin User
Run another instance of command prompt by running it as an administrator and execute the mongo application:
C:\> C:\practicalmongodb\bin\mongo.exe
MongoDB shell version: 3.0.4
connecting to: test
>
Switch to the admin database.
Note
The admin db is a privileged database that the user needs access to in order to execute certain administrative commands, such as creating an admin user.
> db = db.getSiblingDB('admin')
admin
The user needs to be created with either of the roles: userAdminAnyDatabase or userAdmin:
>db.createUser({user: "AdminUser", pwd: "password", roles:["userAdminAnyDatabase"]})
Successfully added user: { "user" : "AdminUser", "roles" : [ "userAdminAnyDatabase" ] }
Next, authenticate using this user. Restart the mongod with auth settings:
C:\> C:\practicalmongodb\bin\mongod.exe --auth
C:\practicalmongodb\bin\mongod.exe --help for help and startup options
2015-07-03T23:11:10.716-0700 I CONTROL Hotfix KB2731284 or later update is installed, no need to zero out data files
2015-07-03T23:11:10.716-0700 I JOURNAL [initandlisten] journal dir=C:\data\db\journal
...................................................
2015-07-03T23:11:10.763-0700 I CONTROL [initandlisten] MongoDB starting : pid=2776 port=27017 dbpath=C:\data\db\ 64-bit host=ANOC9
2015-07-03T23:11:10.763-0700 I CONTROL [initandlisten] targetMinOS: Windows 7/W
indows Server 2008 R2
2015-07-03T23:11:10.763-0700 I CONTROL [initandlisten] db version v3.0.4
2015-07-03T23:11:10.764-0700 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1j-fips 19 Mar 2015
2015-07-03T23:11:10.764-0700 I CONTROL [initandlisten] build info: windows sys. getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2015-07-03T23:11:10.771-0700 I NETWORK [initandlisten] waiting for connections
on port 27017
Start the mongo console and authenticate against the admin database using the AdminUser user created above:
C:\>c:\practicalmongodb\bin\mongo.exe
MongoDB shell version: 3.0.4
connecting to: test
>use admin
switched to db admin
>db.auth("AdminUser", "password")
1
>
5.7.1.3 Creating a User and Enabling Authorization
In this section, you will create a user and assign a role to the newly created user. You have already authenticated using the admin user, as shown:
C:\>c:\practicalmongodb\bin\mongo.exe
MongoDB shell version: 3.0.4
connecting to: test
>use admin
switched to db admin
>db.auth("AdminUser", "password")
1
>
Switch to the Product database and create user Alice and assign read access on the product database, like so:
> use product
switched to db product
>db.createUser({user: "Alice"
... , pwd:"Moon1234"
... , roles: ["read"]
... }
... )
Successfully added user: { "user" : "Alice", "roles" : [ "read" ] }
Next, validate that the user has read-only access on the database:
>db
product
>show users
{
"_id" : "product.Alice",
"user" : "Alice",
"db" : "product",
"roles" : [
{
"role" : "read",
"db" : "product"
}
]
}
Next, connect to a new mongo console and log in as Alice to the product database to issue read-only commands:
C:\> c:\practicalmongodb\bin\mongo.exe -u Alice -p Moon1234 product
2015-07-03T23:11:10.716-0700 I CONTROL Hotfix KB2731284 or later update is installed, no need to zero-out data files
MongoDB shell version: 3.0.4
connecting to: product
After successful authentication, the following entry will be seen on the mongod console.
2015-07-03T23:11:26.742-0700 I ACCESS [conn2] Successfully authenticated as principal Alice on product
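Because Alice has only the read role on the product database, queries succeed from her session while write attempts are rejected. A sketch of what you would see (the collection name is illustrative, and the exact error text may differ):
> db.books.find()                        // allowed: the read role covers queries
> db.books.insert({"name": "test"})      // rejected: Alice has no write privilege on product
WriteResult({
        "writeError" : {
                "code" : 13,
                "errmsg" : "not authorized on product to execute command ..."
        }
})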
5.7.2 Controlling Access to a Network
By default, mongod and mongos bind to all the available IP addresses on a system. In this section, you will look at configuration options for restricting network exposure. The code below is executed on the Windows platform:
C:\> c:\practicalmongodb\bin\mongod.exe --bind_ip 127.0.0.1 --port 27017 --rest
2015-07-03T00:33:49.929-0700 I CONTROL Hotfix KB2731284 or later update is installed, no need to zero out data files
2015-07-03T00:33:49.946-0700 I JOURNAL [initandlisten] journal dir=C:\data\db\journal
2015-07-03T00:33:49.980-0700 I CONTROL [initandlisten] MongoDB starting : pid=1144 port=27017 dbpath=C:\data\db\ 64-bit host=ANOC9
2015-07-03T00:33:49.980-0700 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2015-07-03T00:33:49.980-0700 I CONTROL [initandlisten] db version v3.0.4
2015-07-03T00:33:49.980-0700 I CONTROL [initandlisten] OpenSSL version: OpenSSL1.0.1j-fips 19 Mar 2015
2015-07-03T00:33:49.980-0700 I CONTROL [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2015-07-03T00:33:49.981-0700 I CONTROL [initandlisten] allocator: system
2015-07-03T00:33:49.981-0700 I CONTROL [initandlisten] options: { net: { bindIp: "127.0.0.1", http: { RESTInterfaceEnabled: true, enabled: true }, port: 27017} }
2015-07-03T00:33:49.990-0700 I NETWORK [initandlisten] waiting for connections on port 27017
2015-07-03T00:33:49.990-0700 I NETWORK [websvr] admin web console waiting for connections on port 28017
2015-07-03T00:34:22.277-0700 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49164 #1 (1 connection now open)
You have started the server with bind_ip, which has one value set as 127.0.0.1, which is the localhost interface.
The bind_ip limits the network interfaces of the incoming connections for which the program will listen. Comma-separated IP addresses can be specified. In your case, you have restricted the mongod to listen to only the localhost interface.
When the mongod instance is started, by default it waits for any incoming connection on port 27017. You can change this using --port.
Just changing the port does not reduce the risk much. In order to completely secure the environment, you need to allow only trusted clients to connect to the port using firewall settings.
Changing this port also changes the HTTP status interface port, which by default is 28017. The status interface is always available on the port that is 1,000 higher than the connection port (X + 1000, where X is the connection port).
This web page exposes diagnostic and monitoring information, including operational data, a variety of logs, and status reports regarding the database instances. It provides management-level statistics that can be used for administration purposes. The page is read-only by default; enabling the REST setting makes it fully interactive, which helps administrators troubleshoot performance issues. Only trusted client access should be allowed on this port using firewalls.
It is recommended to disable the HTTP Status page as well as the REST configuration in the production environment.
5.7.2.1 Use Firewalls
Firewalls are used to control access within a network. They can be used to allow access from a specific IP address to specific IP ports, or to stop any access from any untrusted hosts. They can be used to create a trusted environment for your mongod instance where you can specify what IP addresses or hosts can connect to which ports or interfaces of the mongod.
On the Windows platform, use netsh to configure the incoming traffic for port 27017:
C:\> netsh advfirewall firewall add rule name="Open mongod port 27017" dir=in action=allow protocol=TCP localport=27017
Ok.
C:\>
This code says that all of the incoming traffic is allowed on port 27017, so any application servers can connect to the mongod.
5.7.2.2 Encrypting Data
You have seen that MongoDB stores all its data in a data directory that in Windows defaults to C:\data\db and /data/db in Linux. The files are stored unencrypted in the directory because there’s no provisioning of methods for automatically encrypting the files in Mongo. Any attacker with file system access can read the data stored in the files. It’s the application’s responsibility to ensure that sensitive information is encrypted before it’s written to the database.
Additionally, operating system-level mechanisms such as file system-level encryption and permissions should be implemented in order to prevent unauthorized access to the files.
5.7.2.3 Encrypting Communication
It’s often a requirement that communication between the mongod and the client (mongo shell, for instance) is encrypted. In this setup, you will see how to add one more level of security to the above installation by configuring SSL, so that the communication between the mongod and mongo shell (client) happens using a SSL certificate and key.
It is recommended to use SSL for communication between the server and the client.
Starting from Version 3.0, most of the MongoDB distributions now have support included for SSL. The below commands are executed on a Windows platform.
The first step is to generate the .pem file that will contain the public key certificate and the private key. MongoDB can use either a self-signed certificate or any valid certificate issued by a certificate authority.
In this book, you will use the following commands to generate a self-signed certificate and private key.
1.
Install OpenSSL and Microsoft Visual C++ 2008 redistributable as per the MongoDB distribution and the Windows platform. In this book, you have installed the 64-bit version.

2.
Run the following command to create a public key certificate and a private key:
C:\> cd c:\OpenSSL-Win64\bin
C:\OpenSSL-Win64\bin\>openssl
This opens the OpenSSL shell where you need to enter the following command:
OpenSSL>req -new -x509 -days 365 -nodes -out C:\practicalmongodb\mongodb-cert.crt -keyout C:\practicalmongodb\mongodb-cert.key
The above step generates a certificate file named mongodb-cert.crt and a private key named mongodb-cert.key and places them in the C:\practicalmongodb folder.

3.
Next, you need to concatenate the certificate and the private key to the .pem file. In order to achieve this, run the following commands at the command prompt:
C:\> more C:\practicalmongodb\mongodb-cert.key > temp
C:\> copy /b temp + C:\practicalmongodb\mongodb-cert.crt C:\practicalmongodb\mongodb.pem

Now you have a .pem file. Use the following runtime options when starting the mongod:
C:\> C:\practicalmongodb\bin\mongod --sslMode requireSSL --sslPEMKeyFile C:\practicalmongodb\mongodb.pem
2015-07-03T03:45:33.248-0700 I CONTROL Hotfix KB2731284 or later update is installed, no need to zero-out data files
2015-07-03T02:54:30.630-0700 I JOURNAL [initandlisten] journal dir=C:\data\db\journal
2015-07-03T02:54:30.670-0700 I CONTROL [initandlisten] MongoDB starting : pid=2
816 port=27017 dbpath=C:\data\db\ 64-bit host=ANOC9
2015-07-03T02:54:30.670-0700 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2015-07-03T02:54:30.670-0700 I CONTROL [initandlisten] db version v3.0.4
2015-07-03T02:54:30.670-0700 I CONTROL [initandlisten] OpenSSL version: OpenSSL1.0.1j-fips 19 Mar 2015
2015-07-03T02:54:30.670-0700 I CONTROL [initandlisten] build info: windows sys. getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2015-07-03T02:54:30.671-0700 I CONTROL [initandlisten] allocator: system
2015-07-03T02:54:30.671-0700 I CONTROL [initandlisten] options: { net: { ssl: {
PEMKeyFile: "c:\practicalmongodb\mongodb.pem", mode: "requireSSL" } } }
2015-07-03T02:54:30.680-0700 I NETWORK [initandlisten] waiting for connections
on port 27017 ssl
2015-07-03T03:33:43.708-0700 I NETWORK [initandlisten] connection accepted from
127.0.0.1:49194 #2 (1 connection now open)
Note
Using a self-signed certificate is not recommended in a production environment unless it’s a trusted network because it will leave you vulnerable to man-in-the-middle attacks.
You will next connect to the above mongod using the mongo shell. When you run mongo with a --ssl option, you need to either specify --sslAllowInvalidCertificates or --sslCAFile. Let's use --sslAllowInvalidCertificates.
Open a terminal window and enter the following:
C:\> C:\practicalmongodb\bin>mongo --ssl --sslAllowInvalidCertificates
2015-07-03T02:30:10.774-0700 I CONTROL Hotfix KB2731284 or later update is installed, no need to zero-out data files
MongoDB shell version: 3.0.4
connecting to: test
5.8 Provisioning Using MongoDB Cloud Manager
At the start of the chapter, you learned how to install and configure MongoDB on Windows and Linux. In this part of the chapter, you will look at how to use MongoDB Cloud Manager.
MongoDB Cloud Manager is a monitoring solution built by the developers of the database. Prior to version 2.6, MongoDB Cloud Manager (formerly known as MongoDB Monitoring Service, or MMS) was used only for monitoring and administering MongoDB. Starting from version 2.6, major enhancements have been introduced to MongoDB Cloud Manager, including backup, point-in-time recovery, and an automation feature, making the task of operating MongoDB simpler than before. The automation feature gives administrators powerful capabilities to quickly create, upgrade, scale, or shut down MongoDB instances in a few clicks.
In this part of the book, you will see how to get started with MongoDB Cloud Manager. You will deploy a standalone MongoDB instance on AWS using MongoDB Cloud Manager.
When you start with MongoDB Cloud Manager, it asks to install an automation agent on each server, which is then used by the MongoDB Cloud Manager for communicating with the server.
In order to start provisioning, you first need to create your profile on MongoDB Cloud Manager.
Enter the following URL: https://cloud.mongodb.com . Click the Login or Sign up for Free button, based on whether you have an account or not.
Since you are starting for the first time, click the Sign up for Free button. This sends you to the page depicted in Figure 5-1.


Figure 5-1.
Account Profile
You will be creating a new profile. However, MongoDB provides an option for joining an existing Cloud Manager group.
Enter all the relevant details, as shown in Figure 5-1, and click Continue. This sends you to the page for providing company information. Once you complete the profile and company information, accept the terms and click the Create Account button. This completes the profile creation. The next step is to create a group (Figure 5-2).


Figure 5-2.
Create Group
Provide a unique name for the group, and click Create Group. Next is the deployment selection page shown in Figure 5-3, where you have the option to build a new deployment or manage an existing deployment.


Figure 5-3.
Deployment
Select to build a new deployment. Next, you’ll be prompted for the location of where to build the deployment (i.e. Local, AWS, or other remote environment). In this example, select AWS. Clicking the Deploy in AWS option leads you to choose between provision on your own and using Cloud Manager to provision.
Select the “I will Provision” option, which means you will be using a machine that is already provisioned to you on AWS.
The next screen provides options for the deployment type (i.e. standalone, replica set, or sharded cluster). You are doing a standalone deployment, so click the Create Standalone box. This sends you to the screen shown in Figure 5-4.


Figure 5-4.
Details for a standalone instance
Provide the instance name and data directory prefix, and click Continue. Next is the screen shown in Figure 5-5, which prompts you to install an automation agent on each server.


Figure 5-5.
Installing an automation agent
This screen has an option for specifying the number of servers. In this example, you specify 1.
Next, you need to specify the platform. Choose Ubuntu. Then the screen in Figure 5-6 appears.


Figure 5-6.
Automation agent installation instructions
Follow the steps.
Before you implement the step where you start the agent, you need to ensure that all the relevant ports are open (443, 4949, 27000 to 27018).
Once all the steps are completed, click the Verify Agent button. Post verification, if everything is working as needed, you’ll see a Continue button.
When you click Continue, you will go to the Review and Deploy page shown in Figure 5-7 where you can see all of the processes that are going to get deployed. Here an automation agent downloads and installs the monitoring and backup agent.


Figure 5-7.
Review and deploy
Clicking the Deploy button takes you to the deployment page with the deploying changes status as "In progress." When the installation is complete, the deployment status changes to "Goal State" and the provisioned server appears in the topology view.
If your deployment uses SSL or any authentication mechanism, you need to download and install a monitoring agent manually.
To verify that all the agents are working properly, you can click the Administration tab on the console.
The Cloud Manager can deploy MongoDB replica sets, sharded clusters, and standalones on any Internet-connected server. The servers need only be able to make outbound TCP connections to the Cloud Manager.
5.9 Summary
In this chapter, you learned how to install MongoDB on the Windows and Linux platforms. You also looked at some important configurations that are necessary to ensure secure and safe usage of the database. You concluded the chapter by provisioning using MongoDB Cloud Manager.
In the following chapter, you will get started with MongoDB Shell.
