July 29, 2017

MongoDB Learning Notes

1. Introduction

1.1 Intro to SQL and NoSQL

Two types of database solutions:

SQL: relational
- Provides strong persistence, integration, transactions, and reporting
- Suffers from the impedance mismatch problem
NoSQL: non-relational

Differences between SQL and NoSQL:

Storage:
- RDB : Structured; data is stored in rows and columns (MySQL, Oracle, PostgreSQL, etc.). If the data is structured and its shape stays constant, an RDB is appropriate.
- NoSQL : Unstructured; no limits on the types of data to store, and new types can be added as needs change
  - 4 common types of NoSQL:
    - Key-Value Stores: data is stored in key-value pairs (Redis, Dynamo)
    - Document Databases: data is stored in documents (MongoDB, CouchDB)
    - Column Databases: use column families (HBase, Cassandra)
    - Graph Databases: relations are represented in graphs (Neo4J, InfiniteGraph)
Schema:
- RDB : Has predefined schemas
- NoSQL : Without predefined schemas (schema-less) / have a dynamic schema
Querying:
- RDB : SQL (structured query language), defining and manipulating the data
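- NoSQL : no single standard query language; each database has its own query interface (MongoDB, for example, queries documents with JSON-like query documents)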
Scalability:
- RDB :
  - vertically scalable: increasing performance of hardware: higher Memory, CPU, etc.
  - Disadvantages: expensive, time-consuming, etc.
- NoSQL :
  - horizontally scalable: add more servers in the infrastructure.
  - Distribute data across servers automatically
  - Used mainly for big data: NoSQL handles data differently than RDB
  - One advantage: scaling across multiple data centers -> makes the most of cloud computing and storage -> a cost-saving solution
Reliability:
- RDB : ACID compliant, with good data reliability and guaranteed transactions -> if we need to ensure ACID compliance, we should choose an RDB
- NoSQL : sacrifices ACID for performance and scalability
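
As a rough illustration of the schema difference above, the sketch below models a document collection as a plain Python list of dictionaries. The `users` list, its field names, and the `find` helper are made up for illustration; no MongoDB driver is involved, so this only mimics the idea of a dynamic schema.

```python
# Hypothetical data: a "collection" modelled as a list of dict documents.
users = [
    {"name": "Ann", "email": "ann@example.com"},
    # A later document adds new fields without any schema migration.
    {"name": "Bob", "email": "bob@example.com", "tags": ["admin"], "age": 31},
]


def find(collection, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Documents that lack a field simply do not match a query on that field.
aged_31 = find(users, age=31)

# Each document carries its own set of fields.
field_sets = [sorted(doc) for doc in users]
```

In a relational table, adding `tags` or `age` would normally mean altering the table definition first; here each document just carries whatever fields it has.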

1.2 Introduction to MongoDB and its advantages

Intro of MongoDB:
- Open source DB
- Document-oriented (JSON like documents, schemaless)
- One of the most popular databases in the world
Advantages:
- Easy scalability:
  - highly scalable horizontally, including sharding and partitioning
  - fault tolerance and auto-sharding
  - Cloud deployment, unlimited growth, higher throughput, lower latency
  - flexible schema
  - includes support for MapReduce
- Developer agility:
  - Easy to set up
  - Deploy quickly / in a variety of ways / on multiple servers
  - Synchronize data across servers
- Oriented toward programmers, MongoDB drivers support most programming languages
- Rich query language: MongoDB is not a key-value DB
- Mongo might be the closest to an RDBMS, refer to: NoSQL: If Only It Was That Easy
  - Like most NoSQL databases, MongoDB sacrifices some capabilities.
  - MongoDB maintains features of relational DBs, built for CRUD.
  - MongoDB is ACID compliant at the document level.

1.3 Why companies use MongoDB/ Why companies shift from DBMS to MongoDB

Who uses MongoDB? See:
- https://www.mongodb.com/who-uses-mongodb
- https://www.mongodb.com/industries/high-tech
- In tech industries: Expedia, BuzzFeed, eBay, Foursquare, Adobe…
Some third party tools that enhance interaction:
- Robomongo: http://robomongo.org
- Other MongoDB management tools: http://docs.mongodb.org/ecosystem/tools/administration-interfaces

2. Scalability

What scalability options are available, and what trade-offs are associated with choosing a particular scaling method?

Intro
- MongoDB is strongly consistent by default
- MongoDB is a single-master system
- MongoDB can build high-performance systems at scale
- MongoDB Ops Manager makes it easy to deploy, monitor, back up, and scale.
Three scaling metrics from MongoDB, refer to: https://www.mongodb.com/mongodb-scale
- Cluster Scale: distributing the database across nodes
- Performance Scale:
- Data Scale:
How does MongoDB scale?
- MongoDB scales horizontally (scales out) through replication and sharding, both of which are built into the database.

3. Replication

A master-slave cluster is the most general mode. To set one up, we start a master node and one or more slave nodes.
A replica set is a form of asynchronous master/slave replication, done through MongoDB’s native replication. One replica set is a group of mongod processes that together hold multiple replicas of the data. To set one up, we can start with the smallest set of two servers.
- Has a primary replica node (called the primary; exactly one member, essentially the master) that handles normal client requests and receives all write operations. Operations are recorded in the primary’s oplog.
- Has one or more secondary replica nodes (called secondaries; the rest of the members) that mirror the data on the primary. Secondaries replicate the oplog and apply the operations to their own data sets. A secondary becomes the new primary if the primary becomes unavailable (this triggers a replica set election).
- The difference between a master-slave cluster and a replica set is that a master-slave cluster has a single fixed master, while a replica set does not: its primary is elected by the members and can change over time.
- Multiple copies of data on multiple servers; read operations on multiple servers; self-healing → (for distributed systems) redundancy, high availability, fault tolerance, data locality, and increased read capacity
- Automated: a replica set provides an automated method to promote slaves, store copies of data, and handle failover → strong consistency
- Failover mechanism: takes effect when the master node is down
- Allows in-memory storage.
Advantages:
- High availability: if the master fails, the replica set elects another member to become the master.
- Distributed read load: the replica set is accessed for all reads and writes, and reads can be spread across its members (useful for read scaling).
- Disaster recovery: a delayed secondary node makes it possible to recover from disastrous events.
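
The oplog replication and failover described in this section can be sketched with a toy model. This is a simplified simulation under stated assumptions, not MongoDB's actual protocol: the `Member` and `ReplicaSet` classes are hypothetical, and the election rule (a majority of members must be alive, and the live member with the longest oplog wins) is a deliberate simplification.

```python
class Member:
    """One node of the toy replica set."""

    def __init__(self, name):
        self.name = name
        self.oplog = []   # ordered log of applied operations
        self.data = {}    # this member's copy of the data set
        self.alive = True

    def apply(self, op):
        key, value = op
        self.data[key] = value
        self.oplog.append(op)


class ReplicaSet:
    def __init__(self, names):
        self.members = [Member(name) for name in names]
        self.primary = self.members[0]

    def write(self, key, value):
        # All writes go to the primary and are recorded in its oplog.
        self.primary.apply((key, value))

    def replicate(self):
        # Each live secondary copies the oplog entries it has not applied yet.
        for member in self.members:
            if member is not self.primary and member.alive:
                for op in self.primary.oplog[len(member.oplog):]:
                    member.apply(op)

    def elect(self):
        # Simplified election: requires a strict majority of members alive,
        # then picks the live member with the longest (most recent) oplog.
        alive = [m for m in self.members if m.alive]
        if len(alive) * 2 <= len(self.members):
            self.primary = None   # no majority, so no primary
        else:
            self.primary = max(alive, key=lambda m: len(m.oplog))
        return self.primary


rs = ReplicaSet(["a", "b", "c"])
rs.write("x", 1)
rs.replicate()
rs.write("y", 2)           # not replicated yet: replication is asynchronous
rs.primary.alive = False   # the primary fails
new_primary = rs.elect()   # a secondary takes over
```

In this run the write of `y` never reached a secondary before the failure, so the new primary only knows about `x`. That is the cost of asynchronous replication which the write concern options in section 5 are meant to control.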

4. Sharding

MongoDB achieves sharding through auto-sharding: the database itself handles splitting up the data. Sharding is built into the database.

A MongoDB sharded cluster has three components working together:
- Shard (or replica set) : a shard is a container holding a subset of the data. A shard can be deployed as a single mongod server for development, or as a replica set for production.
- Mongos server / query router : a routing process; it routes requests and aggregates responses.
- Config server : stores the metadata and configuration of the cluster
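
To make the division of labour between the three components concrete, here is a minimal sketch in plain Python. The chunk boundaries, shard names, and the `user_id` shard key are all hypothetical, and real clusters handle metadata and routing in far more detail; the point is only that a query containing the shard key can be sent to one shard, while any other query has to be sent to all shards and merged.

```python
import bisect

# Hypothetical metadata, as a config server might hold it: upper bounds
# (exclusive) of each chunk on the shard key, and the shard owning each chunk.
# Chunks here are (-inf, 100), [100, 200) and [200, +inf).
CHUNK_UPPER_BOUNDS = [100, 200]
CHUNK_OWNERS = ["shard0", "shard1", "shard2"]

# Each shard holds only its own subset of the documents.
shards = {"shard0": [], "shard1": [], "shard2": []}


def shard_for(user_id):
    """Look up which shard owns the chunk containing this shard key value."""
    return CHUNK_OWNERS[bisect.bisect_right(CHUNK_UPPER_BOUNDS, user_id)]


def insert(doc):
    shards[shard_for(doc["user_id"])].append(doc)


def find_by_user_id(user_id):
    # Targeted query: the shard key is known, so only one shard is contacted.
    target = shard_for(user_id)
    return [target], [d for d in shards[target] if d["user_id"] == user_id]


def find_by_name(name):
    # Scatter-gather: without the shard key, every shard is asked and the
    # router merges the partial results.
    results = []
    for docs in shards.values():
        results.extend(d for d in docs if d["name"] == name)
    return list(shards), results


for uid, name in [(5, "ann"), (150, "bob"), (250, "ann")]:
    insert({"user_id": uid, "name": name})

contacted_one, by_id = find_by_user_id(150)
contacted_all, by_name = find_by_name("ann")
```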
Multiple sharding methods are available in MongoDB:
- Range-based Sharding : partitioned by the shard key value.
- Hash-based Sharding : partitioned by a hash of the key’s value.
- Location-aware Sharding : partitioned by specified configuration.
Shard Key: To shard a collection, we need to choose a key from the collection and use it to split up the data. The choice of shard key affects how operations perform.
- Range-based partitioning: supports more efficient range queries, but can result in an uneven distribution of data.
- Hash-based partitioning: computes a hash value of the key, which ensures an even distribution of data at the expense of efficient range queries.
- Tag-aware sharding: tags are associated with ranges of the shard key.
- For the performance distinctions between these methods, compare the descriptions above.
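
The trade-off between range-based and hash-based partitioning can be checked with a small experiment. This is a sketch under assumptions: three shards, integer keys in a made-up key space of 0 to 2999, and MD5 from the standard library standing in for whatever hash function the database uses internally.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3


def range_shard(key, max_key=3000):
    # Range-based: split the key space [0, max_key) into equal contiguous ranges.
    return min(key * NUM_SHARDS // max_key, NUM_SHARDS - 1)


def hash_shard(key):
    # Hash-based: hash the key first, then pick a shard from the hash value.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Monotonically increasing keys (for example an auto-increment id),
# where all recent inserts fall at the top of the key space.
keys = range(2000, 3000)

range_counts = Counter(range_shard(k) for k in keys)
hash_counts = Counter(hash_shard(k) for k in keys)

# Shards touched by a range query over [2000, 2100) under each scheme.
range_query_shards = {range_shard(k) for k in range(2000, 2100)}
hash_query_shards = {hash_shard(k) for k in range(2000, 2100)}
```

With range partitioning all 1000 recent keys land on one shard (uneven load), but the range query touches only that shard. With hash partitioning the keys spread over all shards, and the same range query has to visit every shard.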
Sharding provides:
- Automatic balancing of load and data distribution
- Adding new machines easily
- Scaling out to many nodes
- Automatic failover / no single points of failure
What are the trade-offs?
- Additional costs: duplicated replica sets; setting up configuration servers and router servers
- Forecasting: to make the right decision, we can collect performance metrics over time and see which choice is cheaper, e.g. forecast resource requirements such as RAM, storage, processor speed, and disk I/O speed

5. CAP Theorem

Many people think that MongoDB is CP in terms of the CAP theorem. MongoDB’s CAP behavior with regard to consistency and availability:

Consistency : MongoDB is strongly consistent by default because it is a single-master system. MongoDB selects consistency over availability.
- How does MongoDB maintain strong consistency?
  - Each piece of data lives in one shard only
Availability : MongoDB has a single master, so it is not CAP-available. At the sharding level, MongoDB has weak availability. However, MongoDB can also be deployed for high availability.
- How to deploy MongoDB for high availability?
  - Master-slave replication with replica sets
  - Automatic failover with replica set elections: Secondary will become primary if primary is not available.
Partition Tolerance: strong partition tolerance for a distributed system

In MongoDB, there are options for tuning the trade-off between availability and consistency:

Write concerns: choose how many nodes must store the data
Read preference: choose which node to read data from.
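
The two options above can be modelled in a few lines. This is an illustrative sketch, not driver code: the functions `write_acknowledged` and `pick_node` are hypothetical, only the write concern values `1`, other integers, and `"majority"` are modelled, and only the `primary` and `secondaryPreferred` read preference modes are handled.

```python
def majority(n_members):
    """Smallest number of members that forms a strict majority."""
    return n_members // 2 + 1


def write_acknowledged(acks, w, n_members):
    """True once `acks` members have stored the write, given write concern `w`."""
    required = majority(n_members) if w == "majority" else w
    return acks >= required


def pick_node(read_preference, primary, secondaries):
    """Choose which member serves a read (two modes modelled only)."""
    if read_preference == "primary":
        return primary
    if read_preference == "secondaryPreferred":
        # Fall back to the primary when no secondary is available.
        return secondaries[0] if secondaries else primary
    raise ValueError(f"mode not modelled in this sketch: {read_preference}")


# In a 3-member replica set, w=1 is satisfied by the primary alone,
# while "majority" waits until 2 members have the write.
fast_but_riskier = write_acknowledged(acks=1, w=1, n_members=3)
not_yet_durable = write_acknowledged(acks=1, w="majority", n_members=3)
durable = write_acknowledged(acks=2, w="majority", n_members=3)
```

A higher write concern favours consistency and durability at the cost of latency and availability; reading from secondaries spreads load but may return data that lags behind the primary.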

References

MongoDB documentation, available at: https://docs.mongodb.com/manual/core/replica-set-elections
Sharded Clusters in Mongodb - the Key Considerations: http://blog.scottlogic.com/2014/08/08/sharded-clusters-mongodb-considerations.html
MongoDB Architecture Guide [book]
MongoDB - The Definitive Guide [book]
Scaling MongoDB [book]