April 12, 2017

NoSQL Notes

1. Introduction

1.1 Relational Databases

Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration model.

1.2 Drivers of Change

1.3 Common characteristics of NoSQL Databases

1.4 NoSQL Data Models

Four categories are widely used in the NoSQL ecosystem: key-value, document, column-family, and graph.

The most important result of the rise of NoSQL is polyglot persistence: using different data stores in different circumstances.

2. Clustering Optimizations

The primary driver of interest in NoSQL has been its ability to run databases on a large cluster. As data volumes increase, it becomes more difficult and expensive to scale up, so the more appealing option is to scale out. There are two main approaches: sharding and replication.

2.1 Sharding

Sharding puts different data on separate nodes, each of which does its own reads and writes. Because the load is spread across nodes, sharding can improve both read and write performance.

In relational databases, querying, referential integrity, transactions, and consistency controls are lost across shards.

Some NoSQL databases provide auto-sharding.
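As a minimal sketch of how a client might route a record to a shard, here is hash-based routing in Python (the shard_for_key name and the four-shard cluster are illustrative assumptions, not any particular database's API):

```python
import hashlib

NUM_SHARDS = 4  # assumption: a fixed cluster of four shard nodes

def shard_for_key(key: str) -> int:
    """Deterministically map a record key to a shard index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every client computes the same shard for a given key, so reads and
# writes for that key always land on the same node.
print(shard_for_key("user:42"))
```

A fixed modulo like this forces most keys to move when the cluster grows or shrinks; real systems typically use consistent hashing or range partitioning instead, which is part of what auto-sharding automates.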

2.2 Replication

Replication puts multiple copies of data on separate nodes. There are two main types of replication:

  1. Master-slave: The master services all writes; reads can come from the master or from slaves. Data is replicated from the master to the slaves. Master-slave replication provides read resilience: should the master fail, the slaves can still handle read requests.

  2. Peer-to-peer: All nodes accept reads and writes for all data, and they coordinate to synchronize their copies of the data.

Replication can be combined with sharding or implemented on its own. The dark side of replication is inconsistency.
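A minimal sketch of master-slave routing, assuming hypothetical node objects with put/get methods (this is not any specific database's API):

```python
import random

class MasterSlaveRouter:
    """Send writes to the master; serve reads from the master or any slave."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # All writes go through the single master, which then
        # replicates the change to the slaves.
        self.master.put(key, value)

    def read(self, key):
        # Reads can be served by any node. A slave may return slightly
        # stale data until replication catches up (the inconsistency
        # mentioned above).
        return random.choice([self.master] + self.slaves).get(key)
```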

3. The CAP Theorem

It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.

Consistency includes update consistency, read consistency, logical consistency, and replication consistency.

Sometimes we have to sacrifice consistency for availability. There are several ways to relax consistency:

To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade off consistency versus latency. Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.

Availability means that if you can talk to a node in a cluster, it can read and write data.

Partition tolerance means that the cluster can survive communication breakages in the cluster that separate the cluster into multiple partitions unable to communicate with each other.

The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency. CAP is not a binary decision. There are varying degrees of relaxing consistency and availability.

4. Quorums

A quorum is a strategy for maintaining consistency in peer-to-peer replicated databases, ensuring that no two transactions concurrently read or write inconsistent copies of a record.

To perform a read or write operation, a read quorum or write quorum must be obtained.

  1. Write quorum: W > N/2, where W is the number of nodes participating in a write, and N is the number of nodes the data record is replicated to (the replication factor).

  2. Read quorum: R + W > N, where R is the number of nodes that need to be contacted to confirm a read.

Some sample quorum configurations, with replication factor N = 3 (both satisfy the two rules above):

  1. W = 2, R = 2: balanced; both reads and writes contact a majority of nodes.

  2. W = 3, R = 1: writes go to every replica, so a read from a single node is enough.
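A one-line check makes the two rules concrete; this is just a sketch of the arithmetic, not any database's configuration API:

```python
def is_strict_quorum(n: int, w: int, r: int) -> bool:
    """True if W > N/2 (no two conflicting writes can both succeed)
    and R + W > N (every read overlaps the latest write)."""
    return w > n / 2 and r + w > n

assert is_strict_quorum(3, 2, 2)      # balanced configuration
assert is_strict_quorum(3, 3, 1)      # write-all, read-one
assert not is_strict_quorum(3, 1, 1)  # quorums don't overlap: stale reads
```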

Version stamps are used to determine which copy of a record is the correct (most recent) one.
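A sketch of how a quorum read might use version stamps, assuming each reply is a (stamp, value) pair with a monotonically increasing counter as the stamp (real systems may use timestamps or vector clocks instead):

```python
def resolve(replies):
    """Pick the copy with the highest version stamp from quorum replies."""
    return max(replies, key=lambda reply: reply[0])

# Two replicas already have version 3; one is still on version 2.
print(resolve([(3, "new"), (2, "old"), (3, "new")]))  # -> (3, 'new')
```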

5. Map-Reduce

Map-Reduce is a programming pattern for processing (and generating) large data sets on a distributed system using parallel algorithms.

The map function generates a list of (key, value) pairs from an input chunk. Map function calls are independent of one another, so they can be run in parallel.

The reduce function performs a series of operations on the (key, value) pairs that share a key and returns a single output. Each call to the reduce function operates on the values of only one key, so calls for different keys can be run in parallel.

Besides the map and reduce functions, a Map-Reduce framework needs three supporting pieces:

  1. Input processor: divides the input dataset into chunks that can be sent to parallel calls of the map function;

  2. Shuffler: groups the (key, value) pairs from different map function calls by key;

  3. Output writer: combines the outputs from different reduce function calls into one final output.
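The classic word-count example, written as a minimal single-process sketch in Python: the dictionary stands in for the shuffler, and the loops run sequentially here, though a real framework would run the map and reduce calls in parallel across the cluster:

```python
from collections import defaultdict

def map_fn(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: combine all values for a single key into one output."""
    return (word, sum(counts))

# Input processor: here each string plays the role of one chunk.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffler: group the (key, value) pairs from all map calls by key.
groups = defaultdict(list)
for chunk in chunks:
    for key, value in map_fn(chunk):
        groups[key].append(value)

# Output writer: merge the per-key reduce results into one final output.
print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```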

Map-Reduce cannot be used on problems where the computation of the next value depends on previously computed values. It also cannot be used with algorithms that depend on a shared global state during processing.
