Cassandra has the option to enable "ReadRepair": a read is sent to all replicas, and if one is stale it gets fixed/updated. But because all replicas receive the read, there will come a point where the nodes reach IO saturation. Since ALL replica nodes always receive the read, won't adding further nodes fail to help, as they would also receive all reads (and be saturated just as quickly)?
Or does Cassandra offer some "tunability" to configure ReadRepair so that it goes to only some of the nodes (or offer any other "replication" mode that allows true read scaling)?
thanks!!
jens
Update:
A concrete example, as I still do not understand how this will work in practice.
9 Cassandra "Boxes/Severs"
3 Replicas (N=3) => Every "Row" is
written to 2 additinal Nodes = 3
Boxes hold the data in total)
ReadRepair Enabled
The row in question (let's say Customer1) is highly trafficked.
1.) The first time I write the row "Customer1" to Cassandra, it will eventually be available on all 3 nodes.
2.) Now I query the system with thousands of requests per second for Customer1 (and, to make it clearer, with any caching disabled).
3.) The read will always be dispatched to all 3 nodes. (The first request (to the nearest node) will be a full request for the data; the two additional requests will only be "checksum requests".)
4.) As we are querying with thousands of requests, we reach the IO limit of all replicas! (The IO is the same on all 3 nodes!! Only the bandwidth is much smaller on the checksum nodes.)
5.) I add 3 further boxes (so we have 12 boxes in total):
A) These boxes do NOT have the data yet (which they would need in order to help scale linearly). I first have to get the Customer1 record onto at least one of these new boxes.
=> This means I have to change the replication factor to 4
(OR is there any other option to get the data to another box?)
And now we have the same problem: the replication factor is now 4, and all 4 boxes will receive the Read(Repair) request for this highly trafficked Customer1 row. It does not scale this way. Scaling would only work if we had a copy that does NOT receive the ReadRepair request.
What is wrong in my understanding?? My conclusion: with standard ReadRepair the system will NOT scale linearly (for a single highly trafficked row), as adding further boxes also means these boxes receive the ReadRepair requests (for this trafficked row)...
Thanks very much!!! Jens
Adding further nodes will help (in most situations). There will only be N read repair "requests" during a read, where N is the ReplicationFactor (the number of replicas, n.b. not the number of nodes in the entire cluster). So the new node(s) will only be included in a read / read repair if the data you request falls in that node's key range (or it is holding a replica of the data).
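To make the "only N replicas per read" point concrete, here is a minimal Python sketch of replica placement on a consistent-hash ring (an illustration only, not Cassandra's actual partitioner; the node names and hash function are made up). For a fixed ReplicationFactor, a read for one key involves the same number of nodes whether the cluster has 9 or 12 boxes; adding boxes only shrinks each node's key range:

import hashlib
from bisect import bisect_right

def token(value):
    # Hash a node name or row key onto a 0..2**32 ring (simplified partitioner).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

def replicas_for(key, nodes, rf=3):
    # Primary replica = first node clockwise from the key's token, then the next rf-1 nodes.
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(rf)]

nine_boxes = ["box%d" % i for i in range(1, 10)]
twelve_boxes = ["box%d" % i for i in range(1, 13)]
print(replicas_for("Customer1", nine_boxes))    # a read touches exactly 3 boxes
print(replicas_for("Customer1", twelve_boxes))  # still exactly 3 boxes after adding nodes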
There is also the read_repair_chance tunable per ColumnFamily, but that is a more advanced topic and doesn't change the fundamental equation that you should scale reads by adding more nodes, rather than de-tuning read repair.
You can read more about replication and consistency in Ben's slides.
Related
I'm studying up for system design interviews and have run into this pattern in several different problems. Imagine I have a large volume of work that needs to be repeatedly processed at some cadence. For example, I have a large number of alert configurations that need to be checked every 5 min to see if the alert threshold has been breached.
The general approach is to split the work across a cluster of servers for scalability and fault tolerance. Each server would work as follows:
start up
read assigned shard
while true:
    process the assigned shard
    sleep 5 min
Based on this answer (Zookeeper for assigning shard indexes), I came up with the following approach using ZooKeeper (a rough code sketch follows the steps below):
When a server starts up, it adds itself as a child node /service/{server-id} under /service and watches the children of /service. ZooKeeper assigns a unique sequence number to the server.
Server reads its unique sequence number i from ZooKeeper. It also reads the total number of children n under the /service node.
Server identifies its shard by dividing the total volume of work into n pieces and locating the ith piece.
While true:
If the watch triggers (because servers have been added to or removed from the cluster), server recalculates its shard.
Server processes its shard.
Sleep 5 min.
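Here is the rough code sketch promised above, using the kazoo Python client. Everything specific in it is an assumption for illustration (the ensemble address, the /service path, the placeholder process_shard(), and the workload size); it only shows the ephemeral sequential registration, the child watch, and the shard recalculation loop:

import time
from kazoo.client import KazooClient

TOTAL_WORK_ITEMS = 10000        # hypothetical size of the full workload
rebalance_needed = True         # set by the watch when cluster membership changes

def on_membership_change(event):
    global rebalance_needed
    rebalance_needed = True     # recompute the shard on the next loop pass

def process_shard(shard):
    pass                        # placeholder for the real work (e.g. checking the alert configs in this range)

zk = KazooClient(hosts="localhost:2181")
zk.start()
zk.ensure_path("/service")

# Ephemeral + sequential: ZooKeeper appends a unique sequence number and
# deletes the node automatically if this server dies.
my_path = zk.create("/service/server-", ephemeral=True, sequence=True)
my_name = my_path.rsplit("/", 1)[1]

my_shard = None
while True:
    if rebalance_needed:
        rebalance_needed = False
        # ZooKeeper watches are one-shot, so re-arm the watch on every read.
        members = sorted(zk.get_children("/service", watch=on_membership_change))
        i, n = members.index(my_name), len(members)
        chunk = TOTAL_WORK_ITEMS // n
        end = TOTAL_WORK_ITEMS if i == n - 1 else (i + 1) * chunk
        my_shard = range(i * chunk, end)   # this server's slice of the work
    process_shard(my_shard)
    time.sleep(300)             # sleep 5 min

This sketch deliberately ignores the settling and overlap concerns raised in the questions just below; it only shows the registration/watch/recompute mechanics.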
Does this sound reasonable? Is this generally the way that it is done in real world systems? A few questions:
In step #2, when the server reads the number of children, does it need to wait a period of time to let things settle down? What if every server is joining at the same time?
I'm not sure how timely the watch would be. Seems like there would be a time period where the server is still processing its shard and reassignment of shards might cause another server to pick up a shard that overlaps with what this server is processing, causing duplicate processing (which may or may not be ok). Is there any way to solve this?
Thanks!
Sharding with replication
I have a multi-tenant database with 3 tables (store, products, purchases) on 5 server nodes. Suppose I have 3 stores in my store table and I am going to shard it by storeId.
I need all data for all shards (1, 2, 3) available on nodes 1 and 2, but node 3 would contain only the shard for store #1, node 4 only the shard for store #2, and node 5 only shard #3. It is like sharding with 3 replicas.
Is this possible at all? What database engines can be used for this purpose (preferably SQL DBs)? Do you have any experience with this?
Regards
I have a feeling you have not adequately explained why you are trying this strange topology.
Anyway, I will point out several things relating to MySQL/MariaDB.
A Galera cluster already embodies multiple nodes (minimum of 3), but does not directly support "sharding". You can have multiple Galera clusters, one per "shard".
As with my comment about Galera, other forms of MySQL/MariaDB can have replication between nodes of each shard.
If you are thinking of having a server with all data, but replicate only parts to readonly Replicas, there are settings for replicate_do/ignore_database. I emphasize "readonly" because changes to these pseudo-shards cannot easily be sent back to the Primary server. (However see "multi-source replication")
Sharding is used primarily when there is simply too much traffic to handle on a single server. Are you saying that the 3 tenants cannot coexist because of excessive writes? (Excessive reads can be handled by replication.)
A tentative solution:
Have all data on all servers. Use the same Galera cluster for all nodes.
Advantage: When "most" or all of the network is working all data is quickly replicated bidirectionally.
Potential disadvantage: If half or more of the nodes go down, you have to manually step in to get the cluster going again.
Likely solution for the 'disadvantage': "Weight" the nodes differently. Give a high weight to the 3 in HQ; give a much smaller (but non-zero) weight to each branch node. That way, most of the branches could go offline without losing the system as a whole.
But... I fear that an offline branch node will automatically become readonly.
Another plan:
Switch to NDB. The network is allowed to be fragile. Consistency is maintained by "eventual consistency" instead of the "[virtually] synchronous replication" of Galera+InnoDB.
NDB allows you to immediately write on any node. Then the write is sent to the other nodes. If there is a conflict one of the values is declared the "winner". You choose which algorithm for determining the winner. An easy-to-understand one is "whichever write was 'first'".
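The winner-selection idea is easy to sketch. This is a generic illustration in Python of "whichever write was first" based on timestamps, not NDB's actual conflict-resolution API:

def resolve_conflict(write_a, write_b):
    # Each write is a (value, timestamp) pair; the earlier timestamp wins ("first write wins").
    return write_a if write_a[1] <= write_b[1] else write_b

print(resolve_conflict(("price=10", 1700000100), ("price=12", 1700000105)))
# -> ('price=10', 1700000100): the first write is declared the winner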
I would like to know about the existing approaches available for running ZooKeeper across data centers.
One approach that I found after doing some research is to use observers: have a single ensemble, with the leader and followers in the main data center and observers in the backup data center. When the main data center crashes, we select the other data center as the new main data center and convert the observers to leader/followers manually.
I would like to know about better approaches to achieve the same.
Thanks
First I would like to point out the cons of your solution, which hopefully my solution solves:
a) in case of main data center failure the recovery process is manual (I quote you: "convert observers to leader/follower manually")
b) only the main data center accepts writes -> in case of failure, all data is lost (when the observers don't write logs), or only the last updates are lost (when the observers do write logs)
Because the question is about data centerS, I'll assume that we have enough DCs to reach our objective: solving a. and b. while having a usable multi-data-center distributed ZK.
So, when having an even number of data centers (DCs), one could use an additional DC just to get an odd number of ZK nodes in the ensemble. When having e.g. 2 DCs, a 3rd one could be added; each DC could contain 1 rwZK (read-write ZK node) or, for better tolerance against failures, each DC could contain 3 rwZKs organized as hierarchical quorums (both cases could benefit from ZK observers). Inside a DC, all ZK clients should point only to that DC's ZK group, so the only traffic remaining between DCs would be for e.g. leader elections and writes. With this kind of setup one solves both a. and b. but loses write/recovery performance, because writes/elections must be agreed between data centers: at least 2 DCs must agree on writes/elections, with agreement from 2 ZK nodes per DC (see hierarchical quorums). The intra-DC agreement should be fast enough, hence it won't matter much for the overall write-agreement process; bottom line, approximately only the delay between DCs matters. The disadvantages of this approach are:
- additional cost for the 3rd data center: this could be mitigated by using the company office (a guy did that) as the 3rd data center
- lost sessions because of inter-DC network latency and/or throughput: with high enough timeouts one can reach the maximum possible write throughput (depending on the average inter-DC network speed), so this solution is valid only when that maximum is acceptable. Still, when using 1 rw-ZK per DC, I guess there would not be much difference from your solution, because writes from the backup DC to the main DC must travel between DCs too; but in your solution there are no inter-DC write agreements or leader-election communication, so it's faster.
Other consideration:
Regardless of the chosen solution, the inter-DC communication should be secured, and ZK offers no solution for this, so tunneling or another approach must be implemented.
UPDATE
Another solution would be to still use an additional 3rd DC (or the company office) but keep only the rw-ZKs there (1, 3 or another odd number) while the other 2 DCs have only observer ZKs. The clients should still connect only to their own DC's ZK servers, but we no longer need hierarchical quorums. The gain here is that write agreements and leader elections happen only inside the DC with the rw-ZKs (let's call it the arbiter DC). The disadvantages are:
- the arbiter DC is a single point of failure
- the write requests will still have to travel from observer DCs to arbiter DC
I have been reading about MongoDB and Cassandra. MongoDB is master/slave, whereas Cassandra is masterless (all nodes are equal). My doubt is about how the data is stored in both.
Let's say a user sends a write request to MongoDB (a cluster with a master and several slaves, each on a separate machine). This means the master will decide (or through some application implementation) to which slave this update should be written. That is, the same data will not be available on all the nodes in MongoDB, and each node's size may vary. Am I right? Also, when queried, will the master know to which node this request should be sent?
In the case of Cassandra, the same data will be written to all the nodes, i.e. effectively if one node's size is 10 GB, then every other node's size is also 10 GB. Because only if this is the case will the user not lose any data when one node fails, by querying another node. Am I right here? If I am right and the same data is available on all the nodes, then what is the advantage of using the map/reduce function in Cassandra? If I am wrong, then how is availability maintained in Cassandra, since the same data will not be available on the other nodes?
I was searching Stack Overflow about MongoDB vs Cassandra and have read some 10 posts, but my questions could not be cleared up by the answers in those posts. Please clear my doubts, and if I have assumed wrongly, also correct me.
Regarding MongoDB, yep you're right, there is only one primary.
Any secondary can become primary as long as everything is in sync, as this means the secondary has all the data. The nodes don't have to be the same on-disk size, and this can vary depending on when replication was done; however, they do hold the same data (as long as they're in sync).
I don't know much about Cassandra, sorry!
I've written a thesis about NoSQL stores, and therefore I hope that I remember most of the parts correctly for Cassandra:
Cassandra is a mixture of Amazon Dynamo, from which it inherits replication and sharding, and Google's BigTable, from which it got the data model. So Cassandra basically shards your data while keeping copies of it on other nodes. Let's take a five-node cluster, with nodes called A to E. Your keys are hashed onto the key ring through consistent hashing, where contiguous areas of the ring are stored on a given node. So if we have a value range from 1 to 100, by default each node gets 1/5 of the ring: A covers [1,20), B [20,40), and so on.
An important concept from Dynamo is the triple (R, W, N), which tells how many nodes have to read, write, and keep a given value.
By default you have N = 3 copies of your data, which are stored on the primary node and the two following nodes, which hold backups. If I remember the Dynamo paper correctly, your writes go by default to the first W nodes of your N copies; the other nodes are eventually updated through a gossip protocol.
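A toy sketch of that write path (made-up names, not Dynamo's or Cassandra's real code): the write goes to the key's N replicas, but the client can be acknowledged as soon as W of them have confirmed, while the remaining copies are still being written:

class Replica:
    def __init__(self, name):
        self.name, self.data = name, {}
    def store(self, key, value):
        self.data[key] = value   # pretend this is a network round trip
        return True

N, W = 3, 2                      # 3 copies, acknowledge the client after 2 confirmations

def write(replicas, key, value):
    acks = 0
    for node in replicas[:N]:
        node.store(key, value)
        acks += 1
        if acks == W:
            print("client acknowledged after %d of %d replicas" % (W, N))
    return acks                  # all N copies get written; the ack already happened at W

replicas = [Replica(name) for name in "ABC"]
write(replicas, "user:42", "some value")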
As long as everything is going fine you'll get consistent results; if your primary node is down for some time, another node takes your data through a hinted hand-off. Once the primary comes back, your data will be merged, or a merge will be attempted (this part I can't really remember, but check out vector clocks, which are used to track the update history).
So if not too large a part of your cluster goes down, you'll have a consistent view of your data. If larger parts of your cluster are down, or you read from only a small part of your copies, you may see inconsistencies, which (may) eventually become consistent.
Hope that helped. I can highly recommend reading the original papers about Amazon Dynamo and Google BigTable, but I think you're mostly interested in Amazon Dynamo. Additionally, this post from Werner Vogels may come in handy as well.
As for the shard sizes, I think those can vary depending on your machines and how hot given areas of your key ring are.
Cassandra does not, typically, keep all data on all nodes. As you suggest, this would defeat some of the advantages offered by its distributed data model (in particular, fast writes would be hampered). The amount of replication desired (how many nodes should keep copies of your data) is customizable by the client. As such, you can set it up to replicate across all nodes, or just keep your data on a single node with no replication. It's up to you. The specific node(s) to which the data gets written is determined by the hash value of the key. Each node is assigned a range of hash values it will store, so when you go to look up a value, the key is hashed again and that indicates on which node to find the data.
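To make "customizable" concrete, here is a hedged sketch using the much newer DataStax Python driver (this thread predates CQL, so take it as an illustration of the same ideas rather than the API the answer had in mind; the contact point, keyspace and table names are invented). The replication factor is chosen per keyspace, and the partition key's hash decides which nodes store each row:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# How many nodes keep a copy of the data is set via the replication factor.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("shop")
session.execute("CREATE TABLE IF NOT EXISTS customers (id text PRIMARY KEY, name text)")

# The partition key ('id') is hashed; the hash decides which 3 nodes store this row.
session.execute("INSERT INTO customers (id, name) VALUES ('Customer1', 'Jens')")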
I am currently thinking about how to implement authentication for a web application with a NoSQL solution. The problem I encounter is that most NoSQL solutions (e.g. Cassandra, MongoDB) probably have delayed writes. For example, we write on node A, but it is not guaranteed that the write appears on node B at the same time. This is a logical consequence of the approaches behind these NoSQL solutions.
Now one idea would be to do no secondary reads (so everything goes through a master). This would probably work in MongoDB (where you actually have a master) but not in Cassandra (where all nodes are equal). But our application runs at several independent points all over the world, so we need multi-master capability.
At the moment I am not aware of a solution with Cassandra where I can update data and be sure that subsequent reads (on any of the nodes) see the change. So how can one build authentication on top of these NoSQL solutions, where the authentication request (a read) may arrive at several nodes in parallel?
Thanks for your help!
With respect to Apache Cassandra:
http://wiki.apache.org/cassandra/API#ConsistencyLevel
The ConsistencyLevel is an enum that controls both read and write behavior based on the ReplicationFactor in your schema definition. The different consistency levels have different meanings depending on whether you're doing a write or a read operation. Note that if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write. Of these, the most interesting is to do QUORUM reads and writes, which gives you consistency while still allowing availability in the face of node failures up to half of ReplicationFactor. Of course, if latency is more important than consistency, then you can use lower values for either or both.
This is managed on the application side. To your question specifically, it comes down to how you design your Cassandra implementation, the replication factor across the Cassandra nodes, and how your application behaves on reads/writes.
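As a hedged sketch of what "managed on the application side" looks like (using the newer DataStax Python driver rather than the Thrift API referenced above; the keyspace and table names are invented): with ReplicationFactor = 3, QUORUM writes (W = 2) and QUORUM reads (R = 2) give W + R = 4 > 3, so an authentication read is guaranteed to overlap the latest credential write:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# 'auth' is a hypothetical keyspace created with replication_factor = 3.
session = Cluster(["127.0.0.1"]).connect("auth")

write = SimpleStatement(
    "INSERT INTO users (username, password_hash) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)           # W = 2 of 3 replicas
session.execute(write, ("jens", "<bcrypt-hash>"))

read = SimpleStatement(
    "SELECT password_hash FROM users WHERE username = %s",
    consistency_level=ConsistencyLevel.QUORUM)            # R = 2 of 3 replicas
row = session.execute(read, ("jens",)).one()               # guaranteed to see the write above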
Write
ANY: Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
ONE: Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
QUORUM: Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
LOCAL_QUORUM: Ensure that the write has been written to N / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy)
EACH_QUORUM: Ensure that the write has been written to N / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy)
ALL: Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
Read
ANY: Not supported. You probably want ONE instead.
ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
QUORUM: Will query all replicas and return the record with the most recent timestamp once it has at least a majority of replicas (N / 2 + 1) reported. Again, the remaining replicas will be checked in the background.
LOCAL_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
EACH_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.
ALL: Will query all replicas and return the record with the most recent timestamp once all replicas have replied. Any unresponsive replicas will fail the operation.