The CAP theorem describes three characteristics of distributed database systems: Consistency, Availability and Partition tolerance. These three characteristics define the capabilities and limitations of your distributed database system. The CAP theorem is also known as Brewer’s Theorem, after Professor Eric A. Brewer.
A distributed system is a system of more than one node, where data is stored across multiple nodes. Nodes can be physical or virtual machines. Today, many enterprise applications already run in the cloud or are migrating to it, and cloud platforms encourage the use of distributed systems.
Nodes communicate with each other to sync up the latest updates/writes/deletes.
Distributed database systems are designed to scale horizontally by adding nodes (as is typical of microservice architectures) rather than vertically by adding resources to a single machine (as is typical of monolithic architectures). The CAP theorem describes the trade-offs that come with a horizontally scaled architecture. Let’s discuss each characteristic of the CAP theorem:
Consistency:
Consistency means every read gets a correct, up-to-date response from the server. In a distributed system there are multiple nodes, so reads and writes are spread across different nodes. Here, consistency means that all nodes return the same response for a read, and that response reflects the most recent successful write. Inconsistent results can hurt a business.
For example, suppose there are five nodes: one node receives an update request and the other four receive read requests. As soon as the update succeeds on that node, every other node returns the updated value for every subsequent request.
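The five-node example can be sketched in a few lines of Python. This is only a toy model: the `Node` and `Cluster` classes are hypothetical names made up for illustration, and the sketch assumes synchronous replication, where a write is acknowledged only after every replica has applied it.

```python
class Node:
    """A hypothetical in-memory replica (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Cluster:
    """Toy cluster that copies each write to every node before acknowledging it."""

    def __init__(self, size):
        self.nodes = [Node(f"node-{i}") for i in range(size)]

    def write(self, key, value):
        # Assumption: synchronous replication to all replicas.
        for node in self.nodes:
            node.data[key] = value
        return True


cluster = Cluster(5)
cluster.write("balance", 100)
cluster.write("balance", 80)  # the most recent successful write

# Whichever node a client reaches, it sees the same, latest value.
reads = [node.read("balance") for node in cluster.nodes]
print(reads)  # [80, 80, 80, 80, 80]
```

Real databases use more elaborate protocols than this loop, but the property being illustrated is the same: no node answers with an older value once the write has succeeded.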
Availability:
Availability means the system is up (that is, it responds to every request) even if one or more of its nodes, which may be located in different parts of the world, go down. Suppose there are 20 nodes situated around the world and five of them go down for any reason (power failure, natural disaster, etc.), but 15 nodes are still up and working fine. The system is degraded but still responds to your queries.
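As a rough illustration of the 20-node scenario, the sketch below marks five nodes as down and lets a client fall through to the first node that answers. The `Node` class, the `send` helper and the request string are all hypothetical, and real systems usually put a load balancer or client driver in this role.

```python
class Node:
    """A hypothetical node that either answers or raises when it is down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send(nodes, request):
    """Try each node in turn and return the first successful response."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this node is down, try the next one
    raise RuntimeError("no node available")


nodes = [Node(f"node-{i}") for i in range(20)]
for node in nodes[:5]:
    node.up = False  # five of the twenty nodes go down

response = send(nodes, "GET /orders")
print(response)  # node-5 handled GET /orders
```

The request only fails if every node is down, which is what keeps the system available while part of it is broken.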
Partition Tolerance:
Communication between nodes is the most essential part of a distributed system, and it is at the heart of CAP. A distributed system is made up of several nodes, and the nodes communicate with each other to sync up the most recent writes. A partition is a break in that communication, and partition tolerance means the system keeps working even when some nodes cannot reach each other. To understand partition tolerance better, we have to go through the combinations below.
A distributed database system can provide only two of the three capabilities, at the cost of the third. For example, a system can support consistency and partition tolerance at the cost of availability. In CAP, P (Partition tolerance) is not optional for a distributed system; it is a necessity, because network failures cannot be ruled out. In practice, the choice is between C (Consistency) and A (Availability) when a partition occurs.
Let’s go through the combinations (CP, AP, CA):
CP – (Consistency and Partition tolerance):
To maintain consistency, data must be the same on all nodes for every request, and we must also tolerate partitions, because data can only be synced while nodes are able to communicate with each other. Consider a case where a node de-syncs from the other nodes, which means it no longer has the most recent writes. Such de-synced nodes are made unavailable so that they do not accept any requests. To maintain consistency and partition tolerance, we have to give up the availability of the de-synced nodes.
Instead of returning inconsistent results, it’s better to make de-synced nodes unavailable. For example, email systems cannot be inconsistent.
Database examples: MongoDB, HBase, Memcached, Redis.
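The CP behaviour described in this section can be sketched as a replica that refuses to answer once it is out of sync. This is a simplified model, not how any of the databases listed here is actually implemented: `CPReplica`, `Unavailable`, `replicate` and the `in_sync` flag are hypothetical names, and the partition is simulated by flipping the flag by hand.

```python
class Unavailable(Exception):
    """Raised when a replica refuses a request to protect consistency."""


class CPReplica:
    """A hypothetical replica that prefers refusing over answering with stale data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.in_sync = True  # set to False once the replica is cut off from its peers

    def read(self, key):
        if not self.in_sync:
            raise Unavailable(f"{self.name} is out of sync")
        return self.data.get(key)


def replicate(replicas, key, value):
    """Apply a write to every replica that is still reachable."""
    for replica in replicas:
        if replica.in_sync:
            replica.data[key] = value


a, b, c = CPReplica("a"), CPReplica("b"), CPReplica("c")
replicate([a, b, c], "stock", 10)

c.in_sync = False  # simulate a partition that isolates replica c
replicate([a, b, c], "stock", 9)

print(a.read("stock"))  # 9
try:
    c.read("stock")
except Unavailable as err:
    print(err)  # c is out of sync
```

Replica `c` still holds the old value 10, but because it rejects the read, no client ever sees it.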
AP – (Availability and Partition tolerance):
Availability means our system is up and responding, whether the results are consistent or not. Partition tolerance is compulsory here too, because the nodes of a distributed system communicate over a network that can fail. In some cases availability has a higher priority than consistent results, but the system still tries to maintain consistency, and that requires communication between nodes.
When that communication breaks, availability takes priority over consistent results. By choosing availability over consistency, the system stays up and keeps responding, even though some nodes may return stale data. When the partition is resolved, the nodes re-sync their data, but results are not guaranteed to be consistent in the meantime.
In short, the system stays available at the cost of consistent results. We still have to make sure the nodes resume communicating with each other so that they can re-sync.
Database examples: Cassandra, CouchDB, DynamoDB.
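A small sketch can make the AP trade-off concrete: both replicas keep accepting writes during a partition, disagree for a while, and converge after they re-sync. The `APReplica` class is hypothetical, and the "last write wins" rule with hand-assigned timestamps is an assumption chosen for brevity; the databases listed here each have their own conflict-resolution mechanisms.

```python
class APReplica:
    """A hypothetical replica that always answers, even if its data may be stale."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def merge_from(self, other):
        """Pull newer entries from another replica (assumed rule: last write wins)."""
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)


east, west = APReplica("east"), APReplica("west")

# Before the partition both replicas hold the same value.
east.write("cart", ["book"], timestamp=1)
west.merge_from(east)

# During the partition each side keeps accepting writes on its own.
east.write("cart", ["book", "pen"], timestamp=2)
west.write("cart", ["book", "lamp"], timestamp=3)
during = (east.read("cart"), west.read("cart"))

# After the partition is resolved, the replicas exchange state and converge.
east.merge_from(west)
west.merge_from(east)
after = (east.read("cart"), west.read("cart"))

print(during)  # (['book', 'pen'], ['book', 'lamp'])
print(after)   # (['book', 'lamp'], ['book', 'lamp'])
```

Both replicas answered every request, but they disagreed during the partition, and the write made on `east` was overwritten during re-sync. That is the price of choosing availability over consistency.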
CA – (Consistency and Availability):
Consistency and availability together are not possible in a distributed system, because it must also tolerate partitions. If nodes cannot communicate with each other, there is no way to keep data consistent while still answering every request. Consistency and availability together are only possible in a non-distributed (single-node) architecture, which is monolithic. Relational databases are a common choice when both consistency and availability are required.
Database examples: MySQL, Oracle.