NoSQL is not the name of any particular database. Instead, it refers to a broad class of non-relational databases that differ from classical relational database management systems (RDBMS) in some significant respects, most notably in that they do not use SQL as their primary query language and instead provide access by means of application programming interfaces (APIs).
NoSQL databases and data-processing frameworks are used primarily for their speed, scalability and flexibility. Adoption of NoSQL at the enterprise level, however, is still emerging. Some consider it the absolute apogee of achievement, while others place it at the peak of the Inflated Expectations phase of Gartner’s Hype Cycle, which characterizes the over-enthusiasm or “hype” and subsequent disappointment that typically accompany the introduction of new technologies. Still others relegate it to an inferior and inconspicuous position in favor of columnar relational databases such as Sybase IQ or Oracle 11g.
NoSQL...can be considered "Internet age" databases that are being used by Amazon, Facebook, Google and the like to address performance and scalability requirements that cannot be met by traditional relational databases.
Features of NoSQL databases
One major difference between traditional relational databases and NoSQL is that the latter do not generally provide guarantees for atomicity, consistency, isolation and durability (commonly known as the ACID properties), although some support is beginning to emerge. Instead of ACID, NoSQL databases more or less follow something called "BASE", which is discussed in more detail later in the article.
ACID is a set of properties that guarantees that database transactions are processed reliably. To learn more about ACID, read "What is a database? A question for both pro and newbie".
The other major difference is that NoSQL databases are generally schema-less; that is, records in these databases are not required to conform to a pre-defined storage schema.
In a relational database, the schema is the structure of the database described in a formal language supported by the DBMS. It defines how the database is constructed and divided into database objects such as tables, fields, relationships, views, indexes, packages, procedures, functions, queues, triggers and other elements.
In NoSQL databases, schema-free collections are used instead, so that documents of different types and structures, such as {“color”, “blue”} and {“price”, “23.5”}, can be stored within a single collection.
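As a rough illustration of a schema-free collection, the sketch below models a collection as a plain Python list of dictionaries. The `collection` variable, the `find` helper and the sample values are hypothetical and do not correspond to any particular product's API.

```python
# A "collection" modelled as a plain list of dictionaries (hypothetical data).
collection = []

# Documents with entirely different fields can sit side by side.
collection.append({"color": "blue"})
collection.append({"price": 23.5})
collection.append({"color": "red", "price": 12.0})


def find(docs, field):
    """Return every document that happens to contain `field`."""
    return [doc for doc in docs if field in doc]


with_price = find(collection, "price")
with_color = find(collection, "color")
```

Because nothing enforces a common structure, it is the application (here, `find`) that has to cope with fields being present in some records and absent in others.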
The table below lists the major characteristic features of NoSQL databases.
Feature | Description |
Schema-less | "Tables" don't have a pre-defined schema. Records have a variable number of fields that can vary from record to record. Record contents and semantics are enforced by applications. |
Shared nothing architecture | Instead of using a common storage pool (e.g., SAN), each server uses only its own local storage. This allows storage to be accessed at local disk speeds instead of network speeds, and it allows capacity to be increased by adding more nodes. Cost is also reduced since commodity hardware can be used. |
Elasticity | Both storage and server capacity can be added on-the-fly by merely adding more servers. No downtime is required. When a new node is added, the database begins giving it something to do and requests to fulfill. |
Sharding | Instead of viewing the storage as a monolithic space, records are partitioned into shards. Usually, a shard is small enough to be managed by a single server, though shards are usually replicated. Sharding can be automatic (e.g., an existing shard splits when it gets too big), or applications can assist in data sharding by assigning each record a partition ID. |
Asynchronous replication | In contrast to RAID storage (mirroring and/or striping) or synchronous replication, NoSQL databases employ asynchronous replication. This allows writes to complete more quickly since they don't depend on extra network traffic. One side effect of this strategy is that data is not immediately replicated and could be lost in certain windows. Also, locking is usually not available to protect all copies of a specific unit of data. |
BASE instead of ACID | NoSQL databases emphasize performance and availability. This requires prioritizing among the components of the CAP theorem (described later in this article), which tends to make true ACID transactions implausible. |
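The sharding row in the table above can be sketched as follows. This is a minimal illustration that assumes a fixed number of shards and hash-based partitioning; `NUM_SHARDS`, `shard_for`, `put` and `get` are hypothetical names, and real systems typically add replication and shard splitting on top.

```python
import hashlib

NUM_SHARDS = 4  # assumed, fixed shard count for the illustration


def shard_for(record_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard number using a stable hash."""
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each shard is modelled as its own dictionary, standing in for one server.
shards = {i: {} for i in range(NUM_SHARDS)}


def put(record_key: str, record: dict) -> int:
    """Store the record on the shard chosen by its key; return the shard id."""
    shard_id = shard_for(record_key)
    shards[shard_id][record_key] = record
    return shard_id


def get(record_key: str):
    """Look the record up on the one shard that can hold it."""
    return shards[shard_for(record_key)].get(record_key)
```

A stable hash is used so that the same key always maps to the same shard, which is what lets a read go straight to one server instead of asking all of them.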
Types of NoSQL databases
NoSQL database systems were created by some of the major internet players, such as Google, Facebook, LinkedIn and others, which faced significantly different challenges in dealing with data than those addressed by traditional RDBMS solutions. They needed to extract information from large volumes of data that adhered, to a greater or lesser degree, to similar horizontal structures. These companies realized that performance and real-time behavior were more important than consistency, to which much of the processing time in a traditional RDBMS had been devoted.
As such, NoSQL databases are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage. The reduced run-time flexibility compared to full SQL systems is counterbalanced by significant gains in scalability and performance for certain data models. NoSQL databases demonstrate their strengths above all with regard to the flexible handling of variable data by document-oriented databases, in the representation of relationships by graph databases and in the reduction of a database to a container with key-value pairs provided by key-value databases.
Consequently, NoSQL databases are often categorized according to the way they store data and fall under the following major categories:
- Key-value stores
- Columnar (or column-oriented) databases
- Graph databases
- Document databases
Key-value stores
Key-value stores allow the application to store its data as schema-less (key, value) pairs. The data can be stored in a structure resembling the hash table datatype of a programming language, so that each value can be accessed by its key. Although such storage offers only a single way to access the values, which can make other kinds of lookup inefficient, it eliminates the need for a fixed data model.
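A minimal in-memory sketch of the idea, assuming a Python dictionary as the underlying hash table. The `KeyValueStore` class and its method names are hypothetical and only illustrate the access pattern.

```python
class KeyValueStore:
    """Toy key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # The only efficient access path: look a value up by its key.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

    def keys_for_value(self, value):
        # Anything other than a key lookup means scanning every pair.
        return [k for k, v in self._data.items() if v == value]


store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
store.put("session:9", "user:42")
```

`get` is a single hash lookup, whereas `keys_for_value` has to walk the whole store, which is the single-access-path limitation mentioned above.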
Columnar databases
A column-oriented DBMS stores its content by column rather than by row. It contains predefined families of columns and is more accomplished at scaling and updating at relatively high speeds, which offers advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.
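The difference between row and column layout can be sketched with ordinary Python structures. The catalogue data and field names below are made up for illustration; the point is only that an aggregate over one attribute touches a single contiguous list in the columnar layout.

```python
# Row-oriented layout: one dictionary per record (hypothetical catalogue data).
rows = [
    {"title": "A", "year": 2001, "pages": 100},
    {"title": "B", "year": 2005, "pages": 250},
    {"title": "C", "year": 2005, "pages": 75},
]

# Column-oriented layout: one list per attribute, in the same record order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating over rows has to visit every whole record ...
total_pages_from_rows = sum(row["pages"] for row in rows)

# ... while the columnar layout only reads the one column it needs.
total_pages_from_columns = sum(columns["pages"])
```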
Graph databases
Graph databases optimize the storage of networks, or “graphs”, of related nodal data as a single logical unit. A graph database uses graph structures with nodes, edges and properties to represent and store data and provides index-free adjacency, meaning that every element contains a direct pointer to its adjacent elements and no index lookups are necessary. This can be useful for finding degrees of separation, where SQL would require extremely complex queries. A popular movie service, for example, shows the logged-in user a “Best Guess for You” rating for each film based on how similar people rated it, while other services such as LinkedIn, Facebook or Netflix show people in a network at various degrees of separation. Although such queries become simple in graph databases, the relevance of this technology in a financial enterprise is difficult to determine.
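A degrees-of-separation query can be sketched as a breadth-first search over an adjacency list, which loosely mirrors the idea of each node pointing directly at its neighbours. The `graph` data and the `degrees_of_separation` function are hypothetical and are not the query language of any graph database.

```python
from collections import deque

# Each person points directly at their neighbours (hypothetical network).
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
    "erin": [],
}


def degrees_of_separation(graph, start, target):
    """Return the number of hops between two people, or None if unconnected."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbour in graph.get(person, ()):
            if neighbour == target:
                return distance + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None
```

Each step follows the neighbour lists directly, whereas the equivalent relational query would need one self-join per degree of separation.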
Document databases
Document stores are used for large, unstructured or semi-structured records. Data is organized in documents that can contain any number of fields of any length. All document-oriented database implementations assume that documents encapsulate and encode data in some standard format, known as an encoding, and are ideal for MS Office or PDF documents. Document databases should not be confused with document management systems, however. The documents referred to are not necessarily actual documents as such, although they can be.
Documents inside a document-oriented database are similar in some ways to records or rows in relational databases, but they are less rigid because they are not required to adhere to a standard schema. Unlike a relational database, where each record has the same set of fields and unused fields may be kept empty, document records have no empty fields. New information can therefore be added to or removed from any record without wasting space by creating empty fields on all other records.
In contrast to key-value and columnar databases, which view each record as a list of attributes that are updated one at a time, document stores allow insertion, updates and queries of entire records using a JavaScript Object Notation (JSON) format.
The concept of a join is less relevant in document databases than in a traditional RDBMS. As a result, records that might be joined in a traditional RDBMS are generally denormalized into wide records. Denormalization is a process by which the read performance of a database is optimized by adding redundant or grouped data. Some NoSQL vendors, most notably MongoDB, do in fact feature add-on join capabilities as well.
Many of these database categories are beginning to blur, however. As all of them support the association of values with keys, they are all fundamentally key-value stores; document databases, moreover, can provide all of the capabilities of columnar databases from a semantic point of view. As a result, the distinguishing factors must be evaluated in terms of performance and ease of use for a particular solution.
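To make the denormalization described for document databases more concrete, the sketch below folds two relational-style structures into one wide JSON document. The `customers` and `orders` data and the `denormalize` helper are invented for illustration only.

```python
import json

# Relational-style data: two "tables" linked by customer_id (hypothetical).
customers = {1: {"name": "Ada"}}
orders = [
    {"order_id": 10, "customer_id": 1, "total": 23.5},
    {"order_id": 11, "customer_id": 1, "total": 12.0},
]


def denormalize(customer_id):
    """Build one wide document holding the customer and all of their orders."""
    document = dict(customers[customer_id])
    document["customer_id"] = customer_id
    document["orders"] = [
        {"order_id": o["order_id"], "total": o["total"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]
    return document


customer_document = denormalize(1)
encoded = json.dumps(customer_document)  # what would be sent to a document store
```

Reading the customer together with their orders now takes one document fetch instead of a join, at the cost of storing the order data inside the customer record.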
Popular incarnations of NoSQL databases
Most implemented solutions cannot be strictly assigned to a specific type and contain features from two or more categories. We should also recognize that each NoSQL implementation has its own special nuances. Popular offerings include the following:
Apache Cassandra
Apache Cassandra is an open-source, distributed database-management system designed to handle very large amounts of data spread out across many commodity servers while providing a high degree of service availability with no single point of failure. It is particularly fast at write operations as opposed to reads and might therefore lend itself best to applications that require analysis of large sets of data with write-backs.
HBase
HBase is also an open-source, distributed database modeled after Google’s BigTable. It runs on top of the Hadoop Distributed File System (HDFS) and generally works closely with the Hadoop framework to accomplish highly scalable analyses. HBase scales linearly with the number of nodes and can quickly return queries on tables consisting of billions of rows and millions of columns.
BigTable
BigTable can be defined as a sparse, distributed, multi-dimensional sorted map. BigTable is designed to scale into the petabyte range – a petabyte is equivalent to 1 million gigabytes - across hundreds or thousands of machines and to make it easy to add more machines to the system and start taking advantage of those resources automatically without any reconfiguration.
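The phrase "sparse, multi-dimensional sorted map" can be illustrated with a small in-memory sketch. It assumes that cells are addressed by a (row key, column key, timestamp) triple and ignores distribution entirely; the `table`, `put`, `read_latest` and `row_keys_in_order` names are hypothetical and are not BigTable's API.

```python
# Sparse map: only cells that were actually written take up any space.
table = {}


def put(row, column, timestamp, value):
    """Store one cell version under its (row, column, timestamp) key."""
    table[(row, column, timestamp)] = value


def read_latest(row, column):
    """Return the value with the highest timestamp for a cell, if any."""
    versions = [
        (ts, value)
        for (r, c, ts), value in table.items()
        if r == row and c == column
    ]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]


def row_keys_in_order():
    """Row keys in sorted order, as a sorted map would expose them."""
    return sorted({row for (row, _column, _ts) in table})


put("com.example/index", "contents", 1, "<html>v1</html>")
put("com.example/index", "contents", 2, "<html>v2</html>")
put("com.another/home", "language", 1, "en")
```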
Coherence and Ehcache
Coherence and Ehcache provide in-memory caches. Coherence is in heavy use in financial industries where network latency, defined as the time it takes to cross a network connection from sender to receiver, is a factor.
NoSQL versus relational columnar databases – Is NoSQL right for you?
Relational columnar databases such as Sybase IQ continue to use a relational model and are accessed via traditional SQL. Their physical storage structure is very different from that of non-relational NoSQL columnar stores, which store data as rows whose structure may vary and which are organized by the developer into families of columns according to the application use case.
Relational columnar databases, on the other hand, require a fixed schema with each column physically distinct from the others, which makes it impossible to declaratively optimize retrievals by organizing logical units or families. Because a NoSQL database retrieval can specify one or more column families while ignoring others, NoSQL databases can offer a significant advantage when performing individual row queries. NoSQL databases cannot meet the performance characteristics of relational columnar databases when it comes to retrieving aggregated results from groups of underlying records, however.
This distinction is a litmus test when deciding between NoSQL and traditional SQL databases. NoSQL databases are not as flexible but are exceptional at speedily returning individual rows from a query. Traditional SQL databases, on the other hand, forfeit some storage capacity and scalability but provide extra flexibility with a standard, more familiar SQL interface.
Since relational databases must adhere to a schema, they typically need to reserve space even for unused columns. NoSQL databases have a dense per-row schema and so tend to be better at optimizing the storage of sparse data, although the relational databases often use sophisticated storage-optimization techniques to mitigate this perceived shortcoming.
Most importantly, relational columnar databases are generally intended for the read-only access found in conjunction with data warehouses, which provide data that was loaded collectively from conventional data stores. This can be contrasted with NoSQL columnar tables, which can handle a much higher rate of updates.
The CAP Theorem
Despite the high demand in recent years for massively distributed databases with high partition fault-tolerance, the CAP theorem stipulates that it is actually impossible for a distributed system to provide consistency, availability and partition fault-tolerance guarantees simultaneously; a distributed system can satisfy at most any two of these guarantees at the same time, but not all three. These guarantees can be understood as follows:
Consistency – Concurrently executing queries see the same valid and consistent data at the same time.
Availability – This is a guarantee that every request receives a response about whether it succeeded or failed.
Partition-tolerance – Also known as fault-tolerance, this is a guarantee that the system continues to operate despite arbitrary message loss.
Because no distributed system is capable of satisfying all three guarantees at the same time, a tradeoff must be made. While traditional databases make that decision for us, NoSQL databases provide these guarantees as tuning options. Database vendors must always decide which two to prioritize. The options are as follows:
Availability is compromised in favor of consistency and partition-tolerance.
Partition-tolerance is forfeited in favor of consistency and availability.
Consistency is compromised but systems are always available and can work when parts are partitioned.
Traditional SQL databases place a high priority on consistency and fault-tolerance and, as a result, have generally chosen the first option above, forfeiting high availability. NoSQL databases frequently leave that decision to the application operations team and provide configuration options so that the preferred trade-off can be chosen based on the application use case.
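One common way such a tuning option is exposed is through replica quorums. This is an assumption brought in for illustration, not something specific to any product named here. With N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap in at least one replica whenever R + W > N. The sketch below states that rule and checks it by brute force; all names are hypothetical.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The usual rule of thumb: reads see the latest write when R + W > N."""
    return r + w > n


def every_read_meets_every_write(n: int, w: int, r: int) -> bool:
    """Brute-force check: does every possible read set share a replica
    with every possible write set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With three replicas: W=2, R=2 favours consistency, W=1, R=1 favours speed.
consistent_setting = quorums_overlap(3, 2, 2)
fast_setting = quorums_overlap(3, 1, 1)
```

Lowering W and R makes individual operations faster and more available but allows a read to miss the most recent write, which is the trade-off left to the operations team.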
Concepts of BASE - Basically Available, Soft state, Eventually consistent
Sometimes, however, perfect consistency is not a requirement and “eventual consistency” will suffice. Consequently, many NoSQL databases use eventual consistency to provide both availability and partition-tolerance guarantees with a maximum level of data consistency. In contrast to immediate consistency, which guarantees that updates are visible to all as soon as an update operation returns to the user with a successful result, eventual consistency means that, given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate through the system and all the replicas will eventually be consistent.
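The behaviour described above can be simulated with a toy store in which a write lands on one replica immediately and reaches the others only when a separate propagation step runs. This is a deliberately simplified sketch; the `EventuallyConsistentStore` class and its methods are hypothetical.

```python
class EventuallyConsistentStore:
    """Toy model: writes hit replica 0 at once and the rest later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) updates

    def write(self, key, value):
        # The write succeeds as soon as one replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read may be served by any replica, stale or not.
        return self.replicas[replica_index].get(key)

    def propagate(self):
        # Stands in for asynchronous replication catching up.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("color", "blue")
stale_read = store.read("color", replica_index=2)   # replica not yet updated
store.propagate()
fresh_read = store.read("color", replica_index=2)   # replicas have converged
```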
In database terminology, this is known as “Basically Available, Soft state, Eventually consistent” (BASE), as opposed to the database concept of ACID. No doubt the juxtaposition of the terms ACID and BASE was more than a mere coincidence.
Apache CouchDB, for example, uses a versioning system similar to software version control systems such as Subversion (SVN). An update to a record does not overwrite the old value, but rather creates a new version of that record. If two clients are operating on the same record and client A updates the record before client B, then client B will be notified that the version being modified is out of date and will have the option to requery the revised record and make the change there in a manner similar to an “update and merge” operation in SVN.
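The client A and client B scenario can be sketched with a generic revision check. This is not CouchDB's actual API; `VersionedStore`, `ConflictError` and the integer revision numbers are assumptions made for the illustration.

```python
class ConflictError(Exception):
    """Raised when an update is based on an out-of-date revision."""


class VersionedStore:
    def __init__(self):
        self._history = {}  # key -> list of (revision, value), oldest first

    def get(self, key):
        """Return the current (revision, value) pair for a key."""
        return self._history[key][-1]

    def put(self, key, value, expected_rev=None):
        """Append a new version if the caller saw the latest revision."""
        versions = self._history.setdefault(key, [])
        current_rev = versions[-1][0] if versions else None
        if expected_rev != current_rev:
            raise ConflictError(f"{key}: expected {expected_rev}, found {current_rev}")
        new_rev = (current_rev or 0) + 1
        versions.append((new_rev, value))  # old versions are kept, not overwritten
        return new_rev

    def history(self, key):
        return list(self._history[key])


store = VersionedStore()
store.put("doc", {"status": "draft"})              # revision 1

rev_seen_by_a, _ = store.get("doc")                # both clients read revision 1
rev_seen_by_b, _ = store.get("doc")

store.put("doc", {"status": "reviewed"}, rev_seen_by_a)   # client A wins

try:
    store.put("doc", {"status": "published"}, rev_seen_by_b)
    b_was_rejected = False
except ConflictError:
    b_was_rejected = True
    latest_rev, _ = store.get("doc")               # client B requeries ...
    store.put("doc", {"status": "published"}, latest_rev)  # ... and retries
```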
In order to use NoSQL databases at the present time, an understanding of the API language is required and queries must be written in that language. This is, however, greatly facilitated by the fact that Java is supported in every case. Work has also been done recently to create a unified NoSQL language called Unstructured Query Language (UNQL), which is semantically a superset of SQL Data Manipulation Language (DML). There is also an Apache incubator project called Thrift which involves an interface-definition language particularly well-suited to NoSQL use cases. Thrift is reminiscent of CORBA IDL and provides a means by which language-specific interfaces can be generated for most popular languages. Originally developed at Facebook, it has been shared as an open-source project since 2007.