Tuesday, June 20, 2017

Databases - RDBMS and NoSQL

If you are reading this blog post then you must have used at least one database in your career and wondering what's up with these new trend of "NoSQL".

Before getting into NoSQL databases, let’s first look at the history of RDBMS (Relational Data Base Management System) and its main concepts.

RBDMS has a history of 40 years with many enhancements. It all started with E.F. Codd and his 12 rules to make a database as RDBMS.

Beyond a lot of intended benefits, the relational model is well-suited to client-server programming and today it is the predominant technology for storing structured data in web and business applications. Classical relational databases follow the ACID property. That is, a database transaction must be Atomic, Consistent, Isolated and Durable. The details of ACID are as follows:


ACID - Atomicity, Consistency, Isolation, Durability
Atomicity
Consistency
Isolation
Durability
Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, then the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.
The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code), but merely that any programming errors cannot result in the violation of any defined rules.
The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially, i.e., one after the other. Providing isolation is the main goal of concurrency control. Depending on the concurrency control method (i.e., if it uses strict - as opposed to relaxed - serializability), the effects of an incomplete transaction might not even be visible to another transaction.
The durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.

Apart from these ACID properties, there are some basic characteristics due to which Relational DBMS become popular. Some of them are:
·         Data is stored in a set of Tables or data is stored in the format of row and column in a table.
·         Relationships are represented by data.
·         Tables are joined by relational links.
·         Reduced duplication of data in database can be achieved by normalization.
·         They allow greater flexibility and efficiency
Shortcomings of RDBMS
RDBMS is sufficient to store and manipulate all the structured data efficiently but in today’s world the velocity and nature of data used/generated over the Internet is growing exponentially. As we can often see in areas like social media, the data used has no specific structure boundary. This makes unavoidable the need to handle unstructured data which is non-relational and schema-less in nature. For RDBMS it becomes a real challenge to provide the cost effective and fast Create, Read, Update and Delete (CRUD) operation as it has to deal with the overhead of joins and maintaining relationships amongst various data.

Therefore a new mechanism is required to deal with such data in an easy and efficient way. This is where NoSQL comes into the picture to handle unstructured BIG data in an efficient way to provide maximum business value and customer satisfaction.
NoSQL

NoSQL is not a campaign against the SQL language. NoSQL stands for “Not Only SQL.” It provides more possibilities beyond the classic relational approach of data persistence to the developers.

NoSQL refers to a broad class of non-relational databases that differ from classical RDBMS in some significant aspects, most notably because they do not use SQL as their primary query language, instead providing access by means of Application Programming Interfaces (APIs).

The reason behind such a big switch or in other words the advantages of NoSQL are the following:
·         High scalability
·         Distributed Computing
·         Lower cost
·         Schema flexibility
·         Un/semi-structured data
·         No complex relationships
As RDBMS follows the ACID property, NoSQL databases are “BASE” Systems. The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem whose properties are used by BASE System.

The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at the same time:

CAP theorem


Consistency

Every read receives the most recent write or an error
Every request receives a (non-error) response – without guarantee that it contains the most recent write
The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes


A BASE system gives up on consistency so as to have greater Availability and Partition tolerance. A BASE can be defined as following:
Basically Available indicates that the system does guarantee availability.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time, given that the system doesn’t receive input during that time.

With that said here is the simple pictorial representation of above mentioned concepts and which one to choose from the given business needs.


Types of NoSQL:
As of today there are four types of NoSQL Databases available:
Key-Value: Data is stored in an array of Key-Value pair. It has a Big Hash Table of keys & values. It is Fast scalable but cannot handle complex data.
Columnar: Unlike RDBMS, Each storage block contains data from only one column. Fast, scalable, denormalized, map reduce, good for series, not efficient for complex data.
Document Database: It stores documents made up of tagged elements. A Good operational system to store XML, JSON object related content. Not Analytical and moderately scalable. Its Object Oriented Database

Graph: A network database that uses edges and nodes to represent and store data. Great for complex data, transactional. Less scalable.
Detailed Description:
Key-Value:
These type of databases work by matching keys with values, similar to a dictionary. There is no structure nor relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42) which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic ones after performing, for example, a CPU and memory intensive computation. They are extremely performant, efficient and usually easily scalable.
Columnar:
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply by creating collections of one or more key / value pairs that match a record.
Unlike the traditional defines schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information and each column of each record can be different.

Basically, column-based NoSQL databases are two dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it and these management systems allow very large and un-structured data to be kept and used (e.g. a record with tons of information).
Document Database:
Document based NoSQL database management systems can be considered the latest craze that managed to take a lot of people by storm. These DBMS work in a similar fashion to column-based ones; however, they allow much deeper nesting and complex structures to be achieved (e.g. a document, a document within a document).
Documents overcome the constraints of one or two level of key / value nesting of columnar databases. Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems.
Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downfalls compared to others. For example, retrieving a value of a record means getting the whole lot of it and same goes for updates, all of which affect the performance.
Graph:
Finally, the very interesting flavor of NoSQL database management systems is the graph based ones.


The graph based DBMS models represent the data in a completely different way than the previous three models. They use tree-like structures (i.e. graphs) with nodes and edges connecting each other through relations.


Similarly to mathematics, certain operations are much simpler to perform using these type of models thanks to their nature of linking and grouping related pieces of information (e.g. connected people).


These databases are commonly used by applications whereby clear boundaries for connections are necessary to establish. For example, when you register to a social network of any sort, your friends' connection to you and their friends' friends' relation to you are much easier to work with using graph-based database management systems.


NoSQL Advantages
NoSQL Disadvantages
High Scalability
Too many options (> 250), which one to pick.
Schema Flexibility
Limited query capabilities (so far)

Distributed Computing (Reliability, Scalability, Sharing of Resources, Speed)
Eventual consistency is not intuitive to program for strict scenarios like banking applications.
No complicated relationships
Lacks Joins, Group by, Order by facilities
Lower cost (Hardware Costs)
ACID transactions
Open Source – All of the NoSQL options with the exceptions of Amazon S3 (Amazon Dynamo) are open-source solutions. This provides a low-cost entry point.
Limited or no guarantee of support – Open source

Feature
NoSQL
RDBMS
Data Volume
Handles Huge Data Volumes
Handles Limited Data Volumes
Data Validity
Highly Guaranteed
Less Guaranteed
Scalability
Horizontally
Horizontally & Vertically
Query Language
No declarative query language
Structured Query Language (SQL)
Schema
No predefined schema or less rigid schemas
Predefined Schema (Data Definition Language & Data Manipulation Language)
Data Type
Supports unstructured and unpredictable data
Supports relational data and its relationships are stored in separate tables
ACID/BASE
Based on BASE principle (Basically, Available, Soft State, Eventually Consistent)
Based on ACID principle (Atomicity, Consistency, Isolation and Durability)
Transaction Management
Weaker transactional guarantee
Strong transactional guarantees
Data Storage Technique
Schema-free collections are utilized to store different types and document structures, such as {“color”, “blue”} and {“price”, “23.5”} can be stored within a single collection.
No collections are used for data storage; instead use DML for it.

Conclusion
Here is another quick comparison between NoSQL and RDBMS:
Opt NoSQL
1.     If data is huge, unstructured, sparse/growing
2.     Less rigid schema
3.     Performance & Availability preferred over Redundancy
4.      While scaling out is an out-of-the-box feature, it does not prevent scale up,
5.     Cost Effective- uses clusters of cheap commodity servers to manage the exploding data and transaction volumes
Opt RDBMS
1.    If Analytics, BI or Reporting is required.
2.     For Benefits of ACID
3.     Rigid Schema
4.     No redundancy allowed
5.     Allows Scale up & limited Scale-out (sharding)
6.     Expensive- rely on expensive proprietary servers and storage systems
There are more than 250+ NoSQL Products available in the market today. Below are few of them with their characteristics.

DB Name/Type
Year Started
Document Database
Columner Database
Graph Database
Key-Value
Multi-Model
Dead
ACID
MongoDB
2009 Feb
Y






CouchDB
2010 July
Y






ArangoDB
2011
Y

Y
Y
Y


Cloud Datastore (Google)
2013 May
Y





Y
DocumentDB (Microsoft Azure)
2015
Y





Y
RethinkDB
2009 June
Y






SequoiaDB
2013
Y






NosDB (for .NET)
2016 July
Y






RavenDB (for .NET)
2010 April
Y






MarkLogic
2001
Y



Y

Y
Clusterpoint Database
2006
Y



Y

Y
Terrastore
2009
Y




Y !!!

JasDB
2013 June
Y






RaptorDB (for .NET)
2012 April
Y






EJDB
2012 Nov
Y






Cassandra
2010 Apr

Y

Y



Scylla
2015 Sep

Y





Hbase
2008 !!!

Y





BigTable (Google)
2005 Feb

Y





Accumulo
2012 March

Y

Y
Y


MonetDB
2016 Dec

Y





Riak
2009 Aug



Y



Redis
2009 May



Y



Dynamo (Amazon)
2013



Y



MemcacheDB
2008 Dec



Y



Neo4j
2010 Feb


Y




OrientDB
2010


Y

Y


FlockDB
2010 April


Y




InfoGrid
2010 !!


Y




InfiniteGraph
2010


Y




gunDB
2015 Jan
Y

Y

Y


LucidDB






Y

Hypertable
2015 Dec





Y