April 12, 2017

NoSQL Learning Notes

1. Introduction

1.1 Relational Databases

Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration model.

1.2 Drivers of Change

  • Impedance Mismatch: Application developers have long been frustrated by the mismatch between the relational model and the in-memory data structures their applications actually use.

  • Web Services: Web services (applications communicating over HTTP), which emerged in the 2000s, allow more flexibility in the structure of the data exchanged per request / response.

  • Big Data: There are two approaches to system design for handling growing data volumes: scale up or scale out.
    Scaling up (vertical scaling) means bigger machines: more processors, more disk storage, more memory. Scaling out (horizontal scaling) means using lots of smaller machines in a cluster.

  • Clustered Computing: The vital factor for a change in data storage was the need to support large volumes of data by running on clusters. Relational databases are not designed to run on clusters.
    To run large clusters and capture huge amounts of data, Amazon produced Dynamo (its paper was published in 2007) and Google produced BigTable (its paper was published in 2006).

1.3 Common characteristics of NoSQL Databases

  • Not using the relational model
  • Not using SQL as a query language
  • Schemaless
  • Mostly running well on clusters
  • Generally open-source
  • Built for the 21st century web estates

1.4 NoSQL Data Models

Four categories are widely used in the NoSQL ecosystem: key-value, document, column-family, and graph.
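As an illustration, the same hypothetical user record can be sketched in each family using plain Python structures (the store names, keys, and fields here are invented for the example, not any product's API):

```python
# Key-value: the value is an opaque blob looked up only by key.
kv_store = {"user:42": '{"name": "Ada", "city": "London"}'}

# Document: the store understands the value's structure, so fields are queryable.
doc_store = {"users": [{"_id": 42, "name": "Ada", "city": "London"}]}

# Column-family: rows group related columns under a named family.
column_family = {"users": {42: {"profile": {"name": "Ada", "city": "London"}}}}

# Graph: entities are nodes and relationships are first-class edges.
graph = {"nodes": {42: {"name": "Ada"}}, "edges": [(42, "LIVES_IN", "London")]}
```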

The most important result of the rise of NoSQL is polyglot persistence - using different data stores in different circumstances.

2. Clustering Optimizations

The primary driver of interest in NoSQL has been its ability to run databases on a large cluster. As data volumes increase, it becomes more difficult and expensive to scale up. A more appealing option is to scale out. There are two broad techniques for distributing data across a cluster: replication and sharding.

2.1 Sharding

Sharding puts different data on separate nodes, each of which does its own reads and writes. It can improve DB read and write performance.

In relational databases, querying, referential integrity, transactions, and consistency controls are lost across shards.

Some NoSQL databases provide auto-sharding.
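A minimal sketch of hash-based shard routing, using Python dicts as stand-in nodes (`shard_for`, `put`, and `get` are hypothetical helpers for the example, not any database's API):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key; a stable hash keeps routing consistent."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Four in-memory "nodes": each dict stands in for one shard.
shards = [dict() for _ in range(4)]

def put(key, value):
    # Each shard handles its own writes...
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    # ...and its own reads; only one node is touched per operation.
    return shards[shard_for(key, len(shards))].get(key)
```

Note that because each key lives on exactly one node, a cross-shard query or transaction would have to touch several of these dicts at once, which is exactly what relational databases lose across shards.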

2.2 Replication

Replication puts multiple copies of data on separate nodes. There are two main types of replication:

  1. Master-slave: The master services all writes; reads can come from the master or the slaves. Data is replicated from the master to the slaves. Master-slave replication provides read resilience: should the master fail, the slaves can still handle read requests.

  2. Peer-to-peer: All nodes accept both reads and writes for all data, and the nodes coordinate to synchronize their copies of the data.
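The master-slave scheme can be sketched with a toy class (`MasterSlaveStore` is hypothetical, and replication here is synchronous for simplicity; real systems often replicate asynchronously, which is where inconsistency creeps in):

```python
import random

class MasterSlaveStore:
    """Sketch: writes go to the master, reads can hit any replica."""

    def __init__(self, num_slaves: int = 2):
        self.master = {}
        self.slaves = [dict() for _ in range(num_slaves)]

    def write(self, key, value):
        # All writes are serviced by the master...
        self.master[key] = value
        # ...then propagated to every slave (synchronously, in this sketch).
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Reads may be served by the master or any slave.
        replica = random.choice([self.master] + self.slaves)
        return replica.get(key)
```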

Replication can be combined with sharding or implemented on its own. The dark side of replication is inconsistency.

3. The CAP Theorem

It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.

Consistency includes update consistency, read consistency, logical consistency, and replication consistency.

  • Update consistency: the database can handle concurrent updates, and all nodes apply operations in the same order.
  • Read consistency: readers of the database get consistent responses to their requests.
  • Logical consistency: database operations mirror business rules.
  • Replication consistency: the same data item has the same value when read from different nodes.

Sometimes we have to sacrifice consistency for availability. Some examples of relaxed consistency are:

  • The internet
  • Amazon shopping carts (referring to: Amazon’s Dynamo)
  • Some ATMs
  • Version control systems
  • Any extensive use of caching

To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade off consistency versus latency. Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.

Availability means that if you can talk to a node in a cluster, it can read and write data.

Partition tolerance means that the cluster can survive communication breakages that separate it into multiple partitions unable to communicate with each other.

The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency. CAP is not a binary decision. There are varying degrees of relaxing consistency and availability.

4. Quorums

A quorum is a strategy for maintaining consistency in peer-to-peer replicated databases, ensuring that no two inconsistent copies of a record are read or written by two transactions concurrently.

To perform a read or write operation, a read quorum or write quorum must be obtained.

  1. Write quorum: W > N/2, where W is the number of nodes participating in a write and N is the number of nodes the data record is replicated to (the replication factor). A majority write ensures that two conflicting writes cannot both succeed.

  2. Read quorum: R + W > N, where R is the number of nodes that need to be contacted to confirm a read. This guarantees that the set of nodes read overlaps the set of nodes that acknowledged the most recent write.

Some sample quorum configurations:

  • N = 3, W = 3, R = 1: Slower writes, faster reads

  • N = 3, W = 2, R = 2: Reads and writes perform about the same

Version stamps are used to determine which record is the correct (most recent) record.
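A sketch of how a read quorum and version stamps combine, assuming each replica stores a (version, value) pair and that contacting the first R replicas stands in for contacting any R nodes:

```python
def quorum_read(replicas, key, r):
    # Contact R replicas (here simply the first R) and keep the value
    # carrying the highest version stamp.
    responses = [replica[key] for replica in replicas[:r]]
    return max(responses, key=lambda stamped: stamped[0])[1]

# N = 3 replicas, each storing (version_stamp, value). A write with W = 2
# reached the first two nodes; the third is stale.
replicas = [
    {"cart": (2, ["book", "pen"])},
    {"cart": (2, ["book", "pen"])},
    {"cart": (1, ["book"])},
]
```

Because R + W = 4 > N = 3, any choice of two replicas must include at least one node that saw the latest write, so the version stamp lets the reader discard the stale copy.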

5. Map-Reduce

Map-Reduce is a programming pattern for processing (and generating) large data sets over a distributed system using parallel computation.

  • Map-Reduce Fundamentals:

The map function generates a list of (key, value) pairs from an input dataset. Map function calls are independent of one another, so they can be run in parallel.

The reduce function performs some operation on the (key, value) pairs that share a key and returns a single output. Each reduce call operates on the values for only one key, so calls for different keys can be run in parallel.

  • More Map-Reduce Components:

The input processor divides the input dataset into chunks that can be sent to parallel calls of the map function;

The shuffler groups (key, value) pairs from different map function calls by key;

The output writer combines the outputs from different reduce function calls into one final output.
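The classic word-count example sketches how these components fit together in plain Python (single-process, so the parallelism is only implied):

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair per word; calls are independent."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffler: group values from all map calls by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: each call sees the values for exactly one key."""
    return key, sum(values)

# Input processor: here, the dataset is already split into two "chunks".
documents = ["to be or not to be", "be here now"]
mapped = [pair for doc in documents for pair in map_fn(doc)]
# Output writer: collect every reduce call's result into one dict.
counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
```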

  • Limitations of Map-Reduce:

Map-Reduce cannot be used on problems where the computation of the next value depends on previously computed values. It also cannot be used with algorithms that depend on a shared global state during processing.
