1. Introduction
1.1 Intro to SQL and NoSQL
- Two types of database solutions:
- SQL: relational
- Provides strong persistence, integration, transactions, and reporting
- Impedance Mismatch problem
- NoSQL: non-relational
- Differences between SQL and NoSQL:
- Storage:
- RDB : Structured; data is stored in rows and columns (MySQL, Oracle, PostgreSQL, etc.). If the data is structured and its shape is stable, an RDB is appropriate.
- NoSQL : Unstructured; no limit on the types of data stored, and new types can be added as needs change
- 4 common types of NoSQL:
- Key-Value Stores: data is stored in key-value pairs (Redis, Dynamo)
- Document Databases: data is stored in documents (MongoDB, CouchDB)
- Column Databases: use column families (HBase, Cassandra)
- Graph Databases: relations are represented in graphs (Neo4J, InfiniteGraph)
- Schema:
- RDB : Has predefined schemas
- NoSQL : No predefined schema (schema-less); the schema is dynamic
- Querying:
- RDB : SQL (Structured Query Language), used for defining and manipulating the data
- NoSQL :
- Scalability:
- RDB :
- vertically scalable: performance is increased by upgrading the hardware (more memory, faster CPU, etc.)
- Disadvantages: expensive, time-consuming, etc.
- NoSQL :
- horizontally scalable: capacity is increased by adding more servers to the infrastructure.
- Data is distributed across servers automatically
- Used mainly for big data: NoSQL handles data differently than an RDB
- One advantage: scaling across multiple data centers -> makes the most of cloud computing and storage -> a cost-saving solution
- Reliability:
- RDB : ACID compliant, with good data reliability and transactional guarantees -> if we need to ensure ACID compliance, we should choose an RDB
- NoSQL : often sacrifices ACID guarantees for performance and scalability
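The storage and schema differences above can be sketched in a few lines. This is only an illustration: the relational side uses Python's built-in sqlite3 module as a stand-in for any RDB, the document side uses plain dictionaries as a stand-in for a document store, and the table, field names, and values are hypothetical.

```python
import json
import sqlite3

# Relational side: the schema must be declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
)
conn.execute(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    (1, "Ada", "ada@example.com"),
)

# A field that is not part of the schema is rejected until the table is altered.
try:
    conn.execute(
        "INSERT INTO users (id, name, nickname) VALUES (?, ?, ?)",
        (2, "Bob", "bobby"),
    )
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# Document side: each record is a self-describing, JSON-like document.
# Two documents in the same collection may carry different fields.
documents = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Bob", "nickname": "bobby", "tags": ["admin", "beta"]},
]
serialized = [json.dumps(doc) for doc in documents]

print("relational insert with unknown column rejected:", schema_rejected)
print("document with extra fields stored:", serialized[1])
```

The nested "tags" list also hints at the impedance mismatch problem: in the relational model it would normally need a second table and a join, while the document keeps it inline.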
1.2 Introduction to MongoDB and its advantages
- Intro of MongoDB:
- Open source DB
- Document-oriented (JSON-like documents, schemaless)
- One of the most popular databases in the world
- Advantages:
- Easy scalability:
- highly scalable horizontally, including sharding and partitioning
- fault tolerance and auto-sharding
- Cloud deployment, unlimited growth, higher throughput, lower latency
- flexible schema
- includes support for MapReduce
- Developer agility:
- Easy to set up
- deploy quickly / in a variety of ways / on multiple servers
- synchronize data across servers
- Oriented toward programmers, MongoDB drivers support most programming languages
- Rich query language: MongoDB is not a key-value DB
- MongoDB might be the NoSQL database closest to an RDBMS; refer to: "NoSQL: If Only It Was That Easy"
- Unlike most NoSQL databases, MongoDB does not sacrifice capabilities
- MongoDB maintains features of relational DBs, built for CRUD.
- MongoDB is ACID compliant at the document level.
1.3 Why companies use MongoDB / Why companies shift from RDBMS to MongoDB
- Who uses MongoDB? See:
- Some third party tools that enhance interaction:
2. Scalability
What scalability options are available, and what trade-offs are associated with choosing a particular scaling method?
MongoDB implements sharding as auto-sharding: the database itself handles splitting up the data across shards. Sharding is built into the database.
- A MongoDB sharded cluster has three components working together:
- Shard (or replica set) : A shard holds a subset of the cluster's data. A shard can be deployed as a single mongod server for development, or as a replica set for production.
- Mongos server / query router : a routing process; it routes requests and aggregates responses.
- Config server : stores the metadata and configuration of the cluster
- Multiple sharding methods are available in MongoDB:
- Range-based Sharding : partitioned by the shard key value.
- Hash-based Sharding : partitioned by a hash of the key’s value.
- Location-aware Sharding : partitioned by specified configuration.
- Shard Key: To shard a collection, we need to choose a key from the collection and use it to split up the data. The choice of shard key affects how operations are distributed across shards.
- Range-based partitioning → supports more efficient range queries, but can result in an uneven distribution of data
- Hash-based partitioning : computes a hash of the key value → ensures an even distribution of data, at the expense of efficient range queries.
- Compare performance distinctions: see above
- Tag-aware sharding: ranges of the shard key are tagged and assigned to specific shards
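The trade-off between range-based and hash-based partitioning can be sketched as a small simulation. This is not MongoDB code: the number of shards, the split points, and the use of md5 as the hash function are assumptions made for illustration only.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_SHARDS = 3

# Hypothetical split points: keys < 100 go to shard 0,
# keys 100..199 to shard 1, keys >= 200 to shard 2.
RANGE_BOUNDARIES = [100, 200]


def range_shard(key: int) -> int:
    """Range-based: the shard is chosen by where the key falls between split points."""
    return bisect_right(RANGE_BOUNDARIES, key)


def hash_shard(key: int) -> int:
    """Hash-based: the shard is chosen by a hash of the key (md5 as a stand-in)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_for_range(lo: int, hi: int, shard_fn) -> set:
    """Which shards must be contacted to answer a query for lo <= key <= hi."""
    return {shard_fn(k) for k in range(lo, hi + 1)}


# Monotonically increasing keys (e.g. counters) pile up on few shards
# under range partitioning but spread out under hash partitioning.
keys = range(0, 120)
range_load = Counter(range_shard(k) for k in keys)
hash_load = Counter(hash_shard(k) for k in keys)

print("range-based load per shard:", dict(range_load))
print("hash-based load per shard: ", dict(hash_load))
print("shards hit by range query 10..30 (range-based):", shards_for_range(10, 30, range_shard))
print("shards hit by range query 10..30 (hash-based): ", shards_for_range(10, 30, hash_shard))
```

With range-based sharding the query for keys 10..30 is answered by a single shard, but all 120 keys land on only two shards. With hash-based sharding the load is spread more evenly, but the same range query has to be sent to several shards.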
Sharding provides:
- Automatic balancing in load/data distribution
- Adding new machines easily
- Scaling out to many nodes
- Automatic failover / no single point of failure
- What are the trade-offs?
- Additional costs: each shard needs its own replica set, and config servers and router (mongos) servers must be set up
- Forecasting: To make the right decision, we can collect performance metrics over time and see which choice is cheaper. E.g., forecast resource requirements: RAM, storage, processor speed, disk I/O speed
5. CAP Theorem
MongoDB is commonly classified as CP in terms of the CAP theorem. Its behavior with regard to consistency, availability, and partition tolerance:
- Consistency : MongoDB is strongly consistent by default → because it is a single-master system. MongoDB chooses consistency over availability.
- How does MongoDB maintain strong consistency?
- One piece of data in one shard only
- Availability : MongoDB has a single master, so it is not available in the CAP sense. At the sharding level, MongoDB has weak availability. However, MongoDB can still be deployed for high availability.
- How to deploy MongoDB for high availability?
- Master-slave replication with replica sets
- Automatic failover with replica set elections: a secondary becomes primary if the current primary is unavailable.
- Partition Tolerance: strong partition tolerance for a distributed system
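The failover behavior above can be sketched as a toy election. This is a simplified model, not MongoDB's actual election protocol: it assumes that a new primary can only be chosen when a majority of members is reachable, and that the most up-to-date reachable member wins. The member names and the `last_applied` field are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    name: str
    reachable: bool
    last_applied: int  # position of the last replicated write (higher = more up to date)


def elect_primary(members: List[Member]) -> Optional[Member]:
    """Return the new primary, or None if no majority of members is reachable."""
    reachable = [m for m in members if m.reachable]
    if len(reachable) * 2 <= len(members):
        # A minority partition cannot elect a primary, so it stops accepting
        # writes: consistency is preferred over availability.
        return None
    return max(reachable, key=lambda m: m.last_applied)


members = [
    Member("node-a", reachable=False, last_applied=42),  # the failed primary
    Member("node-b", reachable=True, last_applied=41),
    Member("node-c", reachable=True, last_applied=40),
]
new_primary = elect_primary(members)
print("new primary:", new_primary.name if new_primary else None)

# If a second node also becomes unreachable, no majority is left.
members[2].reachable = False
print("primary without majority:", elect_primary(members))
```

The second call shows why a single-master system is not CAP-available: the surviving node is healthy, but without a majority it cannot become primary.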
MongoDB offers options for tuning the trade-off between availability and consistency:
- Write concern: chooses how many nodes must store the data before a write is acknowledged
- Read preference: chooses which node to read data from.
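The effect of these two options can be sketched with a toy replica set. This is a model, not a MongoDB driver call: the class `ReplicaSetModel` and its methods are hypothetical, secondaries here only receive a write when the write concern `w` forces it, and the asynchronous catch-up of lagging secondaries in a real deployment is left out.

```python
from typing import Dict, List, Optional


class ReplicaSetModel:
    """Toy model: one primary plus secondaries that replicate only on demand."""

    def __init__(self, num_secondaries: int) -> None:
        self.primary: Dict[str, str] = {}
        self.secondaries: List[Dict[str, str]] = [{} for _ in range(num_secondaries)]

    def write(self, key: str, value: str, w: int) -> int:
        """Apply a write on the primary and on w - 1 secondaries before acknowledging."""
        if not 1 <= w <= 1 + len(self.secondaries):
            raise ValueError("w must be between 1 and the number of nodes")
        self.primary[key] = value
        for secondary in self.secondaries[: w - 1]:
            secondary[key] = value
        return w  # number of nodes that hold the write at acknowledgement time

    def read(self, key: str, preference: str = "primary",
             secondary_index: int = 0) -> Optional[str]:
        """Read from the primary, or from one chosen secondary."""
        if preference == "primary":
            return self.primary.get(key)
        return self.secondaries[secondary_index].get(key)


rs = ReplicaSetModel(num_secondaries=2)

# w=1: fast acknowledgement, but a secondary read may not see the write yet.
rs.write("status", "shipped", w=1)
stale = rs.read("status", preference="secondary", secondary_index=1)

# w=3: slower acknowledgement, but every node holds the write.
rs.write("status", "delivered", w=3)
fresh = rs.read("status", preference="secondary", secondary_index=1)

print("secondary read after w=1:", stale)
print("secondary read after w=3:", fresh)
print("primary read:", rs.read("status"))
```

A low write concern combined with secondary reads favors latency and availability, while a high write concern or primary-only reads favors consistency.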
References
- MongoDB documentation, available at:
https://docs.mongodb.com/manual/core/replica-set-elections
- Sharded Clusters in Mongodb - the Key Considerations:
http://blog.scottlogic.com/2014/08/08/sharded-clusters-mongodb-considerations.html
- MongoDB Architecture Guide [book]
- MongoDB - The Definitive Guide [book]
- Scaling MongoDB [book]