July 29, 2017

Replication and Sharding in MongoDB

1. Introduction

1.1 Intro to SQL and NoSQL

  1. Two type of database solutions:
  • SQL: relational
    • Provide strong persistence, integration, transactions reporting
    • Impedance Mismatch problem
  • NoSQL: non-relational
  1. Differences between SQL and NoSQL:
  • Storage:
    • RDB : Structured, store in rows and columns (MySQL, Oracle, PostgreSQL, etc) If the data is structured/constant, RDB is appropriate.
    • NoSQL : Unstructured - No limits on the types of data to store/ add new types as needs are changed
      • 4 common types of NoSQL:
        • Key-Value Stores: data is stored in key-value pairs (Redis, Dynamo)
        • Document Databases: data is stored in documents (MongoDB, CouchDB)
        • Column Databases: use column families (HBase, Cassandra)
        • Graph Databases: relations are represented in graphs (Neo4J, InfiniteGraph)
  • Schema:
    • RDB : Has predefined schemas
    • NoSQL : Without predefined schemas (schema-less) / have a dynamic schema
  • Querying:
    • RDB : SQL (structured query language), defining and manipulating the data
    • NoSQL :
  • Scalability:
    • RDB :
      • vertically scalable: increasing performance of hardware: higher Memory, CPU, etc.
      • Disadvantages: expensive, time-consuming,etc
    • NoSQL :
      • horizontally scalable: add more servers in the infrastructure.
      • Distribute data across servers automatically
      • Used mainly for big data: NoSQL handles data differently than RDB
      • One advantages: scale across multiple data centers -> make most use of cloud computing and storage -> a cost-saving solution
  • Reliability:
    • RDB : ACID compliant, good data reliability, performing transactions guarantee -> If we need to ensure ACID compliance (ACIDity), we should choose RDB
    • NoSQL : sacrifice ACID for performance and scalability ● Introduction to

1.2 Introduction to MongoDB and its advantages

  1. Intro of MongoDB:
    • Open source DB
    • Document-oriented (JSON like documents, schemaless)
    • One of the most popular databases in the world
  2. Advantages:
    • Easy scalability:
      • highly scalable horizontally, include sharding and partitioning
      • fault tolerance and auto-sharding
      • Cloud deployment, unlimited growth, higher throughput, lower latency
      • flexible schema
      • include support for MapReduce
    • Developer agility:
      • Easy to setup
      • deployquickly/inavarietyofways/onmultipleservers
      • synchronizedataacrossservers
    • Oriented toward programmers, MongoDB drivers support most programming languages
    • Rich query language: MongoDB is not a key-value DB
    • Mongo might be the closest to a RDBMS, refer to: NoSQL: If Only It Was That Easy
      • MongoDB does like most NoSQL databases that sacrifice capacities
      • MongoDB maintains features of relational DBs, built for CRUD.
      • MongoDB is ACID compliant at the document level.

1.3 Why companies use MongoDB/ Why companies shift from DBMS to MongoDB

  1. Who use MongoDB? see:
  2. Some third party tools that enhance interaction:

2. Scalability

What scalability options are available, and what trade-offs are associated with choosing a particular scaling method?

  • Intro
    • MongoDB is strongly consistent by default
    • MongoDB is a single-master system
    • MongoDB can build high-performance systems at scale
    • MongoDB Ops Manager makes it easy to deploy/monitor/backup/ scale .
  • Three Scaling matrix by MongoDB, refer to: https://www.mongodb.com/mongodb-scale - Cluster Scale: distributed database across nodes - Performance Scale: - Data Scale:

  • How does MongoDB scale?
    • MongoDB scales horizontally (scale out) by replication and sharding. (a.k.a. MongoDB is built in replication and sharding)

3. Replication

  • A master-slave cluster is a most general mode. To set up, we need to start a master node and one or more slave nodes.

  • A Replica set is an asynchronous master/slave replication. Replication can be done through replica set (using native replication). One replica set is a group of mongod processes. A replica consists of multiple replicas. To set up, we can start by setting the smallest set of two servers.
    • Has a primary replica node (called primary , with only one member, handling normal client requests, essentially the master → receive all write operations ) and one or more secondary replica nodes (called secondary, containing the rest of the members, mirroring data on the master → operations are recorded on the primary’s oplog. )
    • Secondaries replicate the log and apply the operations to their data sets ). Secondaries will become new primary node if primary becomes unavailable (triggers an replica sets election).
    • The difference between a master-slave cluster and a replica set is that a master-slave cluster have a single master, while a replica set do not.
    • Multiple copies of data on multiple servers; read operations on multiple servers; Self-healing →(for distributed systems:) redundancy & high availability & fault tolerance & data locality & increased read capacity
    • Automated: A replica set provides an automated method to promote slaves/store copies of data/handle failover → strongly consistency
    • Failover mechanism: when master node is down
    • Allows in-memory storage.
  • Advantages:
    • High availability: if master fails, replica will select another to become the master.
    • Distributed read load: replica set is accessed for all reads/writes. (read scaling is useful)
    • Disaster recovery: delayed secondary node: recover disastrous events

4. Sharding

MongoDB achieves sharding through auto-sharding - it handles splitting up. Sharding is already built into the database.

  1. MongoDB sharded cluster with three components working together:
    • A shard / Or Replica Sets : A shard is a container holding a collection of data sets. A shard can be deployed as a single mongod server for development, or a replica set for production.
    • Mongos Server / Query router : a routing process; routes requests and aggregate responses.
    • Config Server : store metadata and configuration of a cluster
  2. Multiple sharding methods are available at MongoDB:
    • Range-based Sharding : partitioned by the shard key value.
    • Hash-based Sharding : partitioned by a hash of the key’s value.
    • Location-aware Sharding : partitioned by specified configuration.
  3. Shard Key: To shard a collection, we need to choose a key from collection and use it to split up the data. Shard keys affect operations.
    • Range based partitioning → supports more efficient range queries/ result in uneven distribution of data
    • Hash based partitioning : compute a hash value → ensure an even distribution of data/ at the expense of efficient range queries.
      • Compare performance distinctions: see above
      • Tag aware sharding: tag with ranges of the shard key
  4. Sharding provides: - Automatic balancing in load/data distribution - Adding new machines easily - Scaling out to many nodes - Automatic failover/ No single points of failure

    1. What are the tradeoffs ?
      • Additional costs: duplicate replica set; setup configuration server and router server
      • Forecasting: To make the right decision, we can collect performance metrics over time and to see which choice is cheaper. E.g: forecast resource requirements: RAM, storage requirements, processor speeds, disk I/O speed

5. CAP Theorem

Many people think that MongoDB is CP in CAP theorem. MongoDB’s CAP performance in regards of consistency and availability:

  1. Consistency : MongoDB is strongly consistent by default . → because it’s a single-master system. MongoDB selects consistency over availability.
    • How do MongoDB persist strong consistency?
      • One piece of data in one shard only
  2. Availability : MongoDB has a single master, so it’s not CAP-available. At sharding level, MongoDB has a weak availability. However, MongoDB can also support high availability.
    • How to deploy MongoDB for high availability?
      • Master-slave replication with replica sets
      • Automatic failover with replica set elections: Secondary will become primary if primary is not available.
  3. Partition Tolerance: strong partition tolerance for a distributed system

In mongoDB CAP has different options (there’s tradeoff between availability and consistency):

  • Write concerns: choose how many nodes to store data
  • Read preference: choose which node to read data.

References

  1. MongoDB documentations, available at: https://docs.mongodb.com/manual/core/replica-set-elections
  2. Sharded Clusters in Mongodb - the Key Considerations: http://blog.scottlogic.com/2014/08/08/sharded-clusters-mongodb-considerations.html
  3. MongoDB Architecture Guide [book]
  4. MongoDB - The Definitive Guide [book]
  5. Scaling MongoDB [book]
comments powered by Disqus