I wonder about this statement about config servers in MongoDB (from the documentation):
If any of the config servers is down, the cluster's meta-data goes
read only. However, even in such a failure state, the MongoDB cluster
can still be read from and written to.
We can use 1 or 3 config servers. If we use 3 config servers and one of them goes down, why does the cluster go into read-only mode?
As you can see from the link above, each config server has a complete copy of all chunk information.
If each server has a complete copy of all chunk information, why does the metadata go read-only after one config server is down?
So, the reason for this is the way that config servers do their 2-phase commits. If you have one config server and it fails, then your whole system fails. If you have 3 and one fails, all of the metadata is still available, but you lose the resiliency factor of the 2-phase commits: you cannot do those 2-phase commits without all 3 members.
So you may still run off of the other two for reads, but the balancer is essentially turned off so that no chunk migrations or splits happen (hence the metadata becomes read-only). This is because you cannot commit splits or migrations using the commit process a 3-node config setup uses, so they simply don't happen.
Running with 1 config server is not recommended. Basically if it goes down, you don't know where any of your data is.
A 2-phase commit only works with 3 machines because it can make sure your data stays in a consistent state. It means that if a machine dies in the middle of an update, that update will either fail or persist depending on whether it was committed to at least one other node, which will then update the third (hence "2-phase" commit). So it is safe to read a sharded cluster using the 2 config servers that are left.
You can't do that with 2 nodes. An update might have gone through, it might not have; you can't tell, because you cannot compare the last remaining node to anything since the other one is down. So the safe thing to do is not to take any updates until you get the third node back up, otherwise you may be reading out-of-date data.
If you want seamless failure resistance, then it doesn't make sense to use 2, because of the way the 2-phase commit works. It really has no more durability than 1 node if you would rather have nothing than potentially incorrect data. And in a sharded cluster, nothing and incorrect data go hand in hand, since either way you don't know where to find your chunks.
It's basically done to protect you from potential config data corruption and inconsistencies.
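To make that concrete, here is a minimal Python sketch (purely illustrative; the class and function names are my own, not MongoDB internals): a metadata change only goes through when every config server acknowledges the prepare phase, so with one of the three down the change is refused and the metadata is effectively read-only.

```python
# Minimal two-phase-commit sketch (illustrative only, not MongoDB's code).
# A metadata change is applied only if every config server acknowledges
# the prepare phase; with one server down, the change is refused.

class ConfigServer:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.metadata = {}
        self.pending = None

    def prepare(self, change):
        if not self.up:
            return False          # a down server can never acknowledge
        self.pending = change
        return True

    def commit(self):
        if self.pending is not None:
            self.metadata.update(self.pending)
            self.pending = None

def apply_metadata_change(servers, change):
    # Phase 1: every config server must acknowledge the prepare.
    if not all(s.prepare(change) for s in servers):
        for s in servers:         # abort: drop any pending change
            s.pending = None
        return False              # the write is refused; metadata stays read-only
    # Phase 2: commit everywhere.
    for s in servers:
        s.commit()
    return True

servers = [ConfigServer("cfg1"), ConfigServer("cfg2"), ConfigServer("cfg3", up=False)]
print(apply_metadata_change(servers, {"chunk42": "shard-B"}))  # False: one server is down
```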
Sharding with replication
I have a multi-tenant database with 3 tables (store, products, purchases) across 5 server nodes. Suppose I have 3 stores in my store table and I am going to shard it by storeId.
I need the data for all shards (1, 2, 3) available on nodes 1 and 2. But node 3 would contain only the shard for store #1, node 4 only the shard for store #2, and node 5 only the shard for store #3. It is like sharding with 3 replicas.
Is this possible at all? Which database engines can be used for this purpose (preferably SQL databases)? Do you have any experience with this?
Regards
I have a feeling you have not adequately explained why you are trying this strange topology.
Anyway, I will point out several things relating to MySQL/MariaDB.
A Galera cluster already embodies multiple nodes (minimum of 3), but does not directly support "sharding". You can have multiple Galera clusters, one per "shard".
As with my comment about Galera, other forms of MySQL/MariaDB can have replication between nodes of each shard.
If you are thinking of having a server with all data, but replicating only parts of it to read-only Replicas, there are settings for replicate_do/ignore_database. I emphasize "read-only" because changes to these pseudo-shards cannot easily be sent back to the Primary server. (However, see "multi-source replication".)
Sharding is used primarily when there is simply too much traffic to handle on a single server. Are you saying that the 3 tenants cannot coexist because of excessive writes? (Excessive reads can be handled by replication.)
A tentative solution:
Have all data on all servers. Use the same Galera cluster for all nodes.
Advantage: When "most" or all of the network is working all data is quickly replicated bidirectionally.
Potential disadvantage: If half or more of the nodes go down, you have to manually step in to get the cluster going again.
Likely solution for the 'disadvantage': "Weight" the nodes differently. Give a higher weight to the 3 in HQ; give a much smaller (but non-zero) weight to each branch node. That way, most of the branches could go offline without losing the system as a whole.
But... I fear that an offline branch node will automatically become readonly.
Another plan:
Switch to NDB. The network is allowed to be fragile. Consistency is maintained by "eventual consistency" instead of the "[virtually] synchronous replication" of Galera+InnoDB.
NDB allows you to immediately write on any node. Then the write is sent to the other nodes. If there is a conflict, one of the values is declared the "winner". You choose the algorithm for determining the winner; an easy-to-understand one is "whichever write was 'first'".
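As a toy illustration of that "whichever write was first" rule (a hedged Python sketch under my own assumptions, not NDB's actual conflict-resolution code):

```python
# Toy "first write wins" conflict resolution (not NDB's actual algorithm):
# each node timestamps its local write, and when two nodes later discover
# they disagree, the value with the earlier timestamp is declared the winner.

def resolve_first_write_wins(a, b):
    """a and b are (timestamp, value) pairs for the same row."""
    return a if a[0] <= b[0] else b

write_on_node1 = (1700000000.10, "qty=10")   # accepted locally on node 1
write_on_node2 = (1700000000.25, "qty=12")   # conflicting write on node 2

print(resolve_first_write_wins(write_on_node1, write_on_node2))
# (1700000000.1, 'qty=10') -- the earlier write wins on every node
```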
In a distributed system, if a write only succeeds on some of the nodes, a client that subsequently reads from a node that missed the write will see inconsistent data. How can this situation be avoided?
client write --> Node1  ✓
             --> Node2  ✓
client read  --> Node3  ✗  (the latest data was not read)
My plan:
Compare the data version with other nodes when reading data
If the current node's version is found to be lower, the read is routed to other nodes (see the sketch below).
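A minimal Python sketch of that plan (the Node class and the version numbers are my own assumptions, just to show the idea of routing the read to the freshest replica):

```python
# Sketch of the plan above: ask the reachable nodes for (version, value)
# and serve the value held by the node with the highest version.

class Node:
    def __init__(self, name, value=None, version=0):
        self.name, self.value, self.version = name, value, version

def versioned_read(nodes):
    newest = max(nodes, key=lambda n: n.version)   # highest version wins
    return newest.value, newest.version

nodes = [Node("node1", "v2", version=2),   # received the latest write
         Node("node2", "v2", version=2),
         Node("node3", "v1", version=1)]   # missed the latest write

print(versioned_read(nodes))   # ('v2', 2), even if the client first hit node3
```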
I am going to ignore tags [mongo and elastic] :)
What you are planning to do is called Dynamo style replication. That system is eventually consistent by design. (I read a while ago that it could get strongly consistent with some effort, but I don't remember if that paper was correct.)
Back to Dynamo and quorums: with three nodes you want at least 2 nodes to acknowledge a write before you assume the write has succeeded. The important point is that only two nodes need to report success back to the client, but the data should still be sent to all three nodes.
Let's assume the data is written to two nodes and the third failed but came back online later. To read the data, you have to read from any two nodes as well. You will send read requests to all three, but only two are needed to report back to the client. This gives you a quorum: 2+2>3, which guarantees that the write set and the read set intersect.
This will work OK when the network is good and the nodes are healthy. But you will run into major challenges, lost updates and conflict resolution to name a few. Either way, the system will not be strongly consistent, by design.
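Here is a minimal Python sketch of that quorum arithmetic (my own toy model, not any particular database): with N = 3, W = 2 and R = 2, the write set and the read set must overlap in at least one replica.

```python
# Minimal quorum sketch: with N = 3 replicas, requiring W = 2 write acks
# and R = 2 read responses guarantees that the read set and the write set
# overlap in at least one node (W + R > N).

N, W, R = 3, 2, 2
assert W + R > N   # the overlap condition referred to above (2 + 2 > 3)

def quorum_write(replicas, key, value, version):
    acks = 0
    for r in replicas:
        if r.get("up", True):
            r[key] = (version, value)
            acks += 1
    return acks >= W           # success only if a write quorum acknowledged

def quorum_read(replicas, key):
    answers = [r[key] for r in replicas if r.get("up", True) and key in r]
    if len(answers) < R:
        raise RuntimeError("no read quorum")
    return max(answers)        # newest version among the responding quorum

replicas = [{"up": True}, {"up": True}, {"up": False}]   # third node is down
print(quorum_write(replicas, "k", "v1", 1))              # True: 2 acks >= W
print(quorum_read(replicas, "k"))                        # (1, 'v1')
```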
Let me describe another interesting issue to illustrate weak consistency:
node 1 gets the write
the rest of the process fails; node 1 has the new data, but nodes 2 and 3 don't
now, when you read, under the quorum condition, you may or may not see the value from node 1 - since you are picking any two nodes for a read, node 1 may not be in that set.
Long story short, Dynamo is not good for strong consistency, and we get to the Raft part of the solution.
Raft will get you what you need: a consistent system. There is a catch to watch for, though. Most examples are focused on writing - Raft maintains a log of messages, and consensus is used to agree on the order (and content) of these messages.
But when you do a read, you can't just go to a node, or any two nodes, or three, and read the value. You will have to do the read via Raft as well, by attaching the read operation to Raft's log. This is called a linearizable read.
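To show the shape of that idea, here is a conceptual, single-process Python sketch (all of the actual consensus machinery is omitted, and TinyLog is my own stand-in, not a real Raft library): the read is appended to the same log as the writes, so it can only return after every earlier committed write has been applied.

```python
# Conceptual sketch of a linearizable read (not a real Raft implementation):
# reads go through the same log as writes, so a read only returns once every
# earlier committed write has been applied to the state machine.

class TinyLog:
    """Stand-in for a replicated Raft log (consensus details omitted)."""
    def __init__(self):
        self.entries = []
        self.state = {}

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1   # log index of this entry

    def apply_up_to(self, index):
        # In real Raft this happens only after a majority commits the entries.
        for entry in self.entries[:index + 1]:
            if entry["op"] == "write":
                self.state[entry["key"]] = entry["value"]

def linearizable_read(log, key):
    index = log.append({"op": "read", "key": key})   # the read goes through the log
    log.apply_up_to(index)                           # all earlier writes applied first
    return log.state.get(key)

log = TinyLog()
log.append({"op": "write", "key": "x", "value": 1})
print(linearizable_read(log, "x"))   # 1
```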
I'll stop here, as this is a pretty complicated topic (but not an impossible one to learn).
Hope this gave you enough ideas to explore.
I see that both mongodb and elasticsearch are tagged; I don't know which one you have in mind, but the two databases are very different.
For MongoDB, replicas are not used by default to increase read throughput, see https://docs.mongodb.com/manual/core/read-preference; the default read preference only looks at the primary and excludes all replicas. Writes in MongoDB also go to the primary first, and replication happens asynchronously, possibly after the write to the primary finishes, see https://docs.mongodb.com/manual/core/replica-set-members/. Because of that, if you force a read from a secondary, you are not guaranteed to get the newest data.
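For example, with the Python driver (a hedged sketch; the host names, replica-set name, database and collection names are made up), the default read preference targets the primary, and reading from a secondary has to be requested explicitly and may return stale data:

```python
from pymongo import MongoClient, ReadPreference

# Host names and replica-set name below are placeholders.
client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")

primary_only = client.get_database("mydb")   # default: reads go to the primary

maybe_stale = client.get_database(
    "mydb", read_preference=ReadPreference.SECONDARY_PREFERRED)

doc = maybe_stale.mycoll.find_one({"_id": 1})   # may lag behind the primary
```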
For Elasticsearch, it naturally does not guarantee that you always read the most recent data, see https://www.elastic.co/guide/en/elasticsearch/reference/current/near-real-time.html, so either way, even if there is only one node, you may get data that is out of date.
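A hedged sketch with the Python client (the index name is made up, and the body/document parameter name differs between client versions): by default a freshly indexed document only becomes searchable after the next refresh, and refresh="wait_for" trades write latency for read-your-write visibility.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A search issued immediately after this may not see the document yet.
es.index(index="orders", id="1", document={"total": 42})

# Blocks until the document is visible to search (slower write, fresher read).
es.index(index="orders", id="2", document={"total": 43}, refresh="wait_for")
```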
I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering: if either of them is never repaired, is there any challenge while reading data from it when the nodes are inconsistent? Will the correct data still be read, and if yes, then why do we need to repair them?
Yes, you can get incorrect data if repair is not done. It also depends on the consistency level you are reading or writing with. Generally, in production systems writes are done with LOCAL_ONE or LOCAL_QUORUM and reads with LOCAL_QUORUM.
If you are writing with a weak consistency level, then repair becomes important, as some of the nodes might not have received the mutations, and those nodes may be selected during reads.
For example, suppose you write with consistency level ONE to a table TABLE1 with a replication factor of 3. It may happen that your write landed on NodeA only, while NodeB and NodeC missed the mutation. Now if you read with consistency level LOCAL_QUORUM, it may happen that NodeB and NodeC get selected, and they do not return the written data.
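With the Python driver, the consistency level is set per statement (a hedged sketch; the keyspace, table and column names are made up). A write at ONE combined with a read at LOCAL_QUORUM can reproduce exactly the miss described above, because the single replica that took the write may not be among the quorum that answers the read:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

write = SimpleStatement(
    "INSERT INTO table1 (id, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)           # one replica ack is enough
session.execute(write, (1, "hello"))

read = SimpleStatement(
    "SELECT value FROM table1 WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)  # two of three must answer
row = session.execute(read, (1,)).one()               # may still miss the write
```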
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep the data in a healthy state.
As others have noted in other answers, different consistency levels make repair more or less important for different reasons. So I'll focus on the consistency level that you said in a comment you are using: LOCAL_ONE for reading and LOCAL_QUORUM for writing:
Successfully writing with LOCAL_QUORUM only guarantees that two replicas have been written. If the third replica is temporarily down and later comes back up, then at that point one third of the read requests for this data - reads done from only one node (this is what LOCAL_ONE means) - will miss the new data! Moreover, there isn't even a guarantee of so-called monotonic consistency: you can get the new data in one read (from one node), and the old data in a later read (from another node).
However, it isn't completely accurate that only a repair can fix this problem. Another feature, enabled by default on both Cassandra and Scylla, is called Hinted Handoff: when a node is down for a relatively short time (up to three hours, but also depending on the amount of traffic in that period), the other nodes which tried to send it updates remember those updates and retry the send when the dead node comes back up. If you are only faced with such relatively short downtimes, repair isn't necessary and Hinted Handoff is actually enough.
That being said, Hinted Handoff isn't guaranteed perfect and might miss some inconsistencies. E.g., the node wishing to save a hint might itself be rebooted before it managed to save the hint, or replaced after saving it. So this mechanism isn't completely foolproof.
By the way, there is another thing you need to be aware of: if you ever intend to do a repair (e.g., perhaps after some node was down for too long for Hinted Handoff to have worked, or perhaps because a QUORUM read causes a read repair), you must do it at least once every gc_grace_seconds (this defaults to 10 days).
The reason for this is the risk of data resurrection caused by repairs that are too infrequent. The thing is, after gc_grace_seconds, the tombstones marking deleted items are removed forever ("garbage collected"). At that point, if you do a repair and one of the nodes happens to have an old version of this data (prior to the delete), the old data will be "resurrected" and copied to all replicas.
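For reference, gc_grace_seconds is a per-table setting you can inspect and change (a hedged sketch; the keyspace and table names are made up, and you should only lower the value if repairs really do run more often than the new window):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

row = session.execute(
    "SELECT gc_grace_seconds FROM system_schema.tables "
    "WHERE keyspace_name = 'my_keyspace' AND table_name = 'table1'").one()
print(row.gc_grace_seconds)   # 864000 by default, i.e. 10 days

# Every replica should be repaired at least once within this window,
# otherwise tombstones may be collected before the delete reaches all nodes.
session.execute("ALTER TABLE my_keyspace.table1 WITH gc_grace_seconds = 432000")
```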
In addition to Manish's great answer, I'll just add that read operations run at consistency levels higher than *_ONE have a (small, 10% by default) chance of invoking a read repair. I have seen that applications running at a higher consistency level for reads will have fewer issues with inconsistent replicas.
That said, writing at *_QUORUM should ensure that the majority (quorum) of replicas are indeed consistent. Once it's written successfully, data should not "go bad" over time.
That all being said, running periodic (weekly) repairs is a good idea. I highly recommend using Cassandra Reaper to manage repairs, especially if you have multiple clusters.
I'm using Cassandra 2.0.4 (with vnodes) and 2 days ago I added 2 nodes (.210 and .195.) I expected Cassandra to redistribute the existing data automatically, but today I still find this nodetool status
Issuing a nodetool repair on any of the nodes doesn't do anything either (the repair finishes within seconds.) The logs state that the repair is being executed as expected, but after preparing the repair plan it pretty much instantly finishes executing said plan.
Was I wrong to assume the existing data would be redistributed at all, or is something wrong? And if that isn't the case; how do I manually 'rebalance' the data?
Worth noting: I seem to have lost some data after adding these new nodes. Issuing a select on certain keys only returns data from the last couple of days rather than weeks, which makes me think the data is saved on .92 while Cassandra queries for it on one of the new servers. But that's really just an uneducated guess; I may have simply broken something during all of my trial & error tests, meaning the data is actually gone (even though I never issue deletes).
Can anyone enlighten me?
There is currently no manual rebalance option for vnode-enabled clusters.
But your cluster doesn't look unbalanced based on the nodetool status output you show. I'm curious as to why node .88 has only 64 tokens compared to the others but that isn't a problem per se. When a cluster is smaller there will be a slight variance in the balance of data across the nodes.
As for the data issues, you can try running nodetool repair -pr around the nodes in the ring and then nodetool cleanup and see if that helps.
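If you want to script that suggestion, something like the following works (a hedged sketch: the host names are placeholders, nodetool must be on the PATH, and each node's JMX port has to be reachable from wherever this runs):

```python
import subprocess

# Placeholder host names; substitute the addresses from your nodetool status.
nodes = ["cass-node1", "cass-node2", "cass-node3", "cass-node4"]

# First pass: repair only the primary ranges of each node (-pr avoids
# repairing every range multiple times across the cluster).
for host in nodes:
    subprocess.run(["nodetool", "-h", host, "repair", "-pr"], check=True)

# Second pass: drop data that each node no longer owns after the topology change.
for host in nodes:
    subprocess.run(["nodetool", "-h", host, "cleanup"], check=True)
```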
Consider the following setup:
There are 2 physical servers which are set up as a regular MongoDB replica set (including an arbiter process, so automatic failover will work correctly).
Now, as far as I understand, most actual work will be done on the primary server, while the slave will mostly just do the work needed to keep its dataset in sync.
Would it be reasonable to introduce sharding into this setup by setting up another replica set on the same 2 servers, so that each of them has one mongod process running as a primary and one process running as a secondary?
The expected result would be that both servers share the workload of actual queries/inserts while both are up. If one server fails, the whole setup should elegantly fail over and continue running until the other server is restored.
Are there any downsides to this setup, except the overall overhead in setup and number of processes (mongos/configservers/arbiters)?
That would definitely work. I'd asked a question in the #mongodb IRC channel a bit ago as to whether or not it was a bad idea to run multiple mongod processes on a single machine. The answer was "as long as you have the RAM/CPU/bandwidth, go nuts".
It's worth noting that if you're looking for high-performance reads, and don't mind writes being a bit slower, you could:
Do your writes in "safe mode", where the write doesn't return until it's been propagated to N servers (in this case, where N is the number of servers in the replica set, so all of them)
Set the driver-appropriate flag in your connection code to allow reading from slaves.
This would get you a clustered setup similar to MySQL - write once on the master, but any of the slaves is eligible for a read. In a circumstance where you have many more reads than writes (say, an order of magnitude), this may be higher performance, but I don't know how it'd behave when a node goes down (since writes may stall trying to write to 3 nodes, but only 2 are up, etc - that would need testing).
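In today's Python driver that combination looks roughly like this (a hedged sketch; the host names, database and collection names are made up, and what the answer calls "safe mode" maps to a write concern that waits for all three replica-set members):

```python
from pymongo import MongoClient, ReadPreference, WriteConcern

# Placeholder host names and replica-set name.
client = MongoClient("mongodb://serverA:27017,serverB:27017/?replicaSet=rs0")

orders = client.mydb.get_collection(
    "orders",
    write_concern=WriteConcern(w=3),                      # wait for all 3 members
    read_preference=ReadPreference.SECONDARY_PREFERRED)   # reads may hit a slave

orders.insert_one({"sku": "abc", "qty": 1})   # returns only after 3 acknowledgements
doc = orders.find_one({"sku": "abc"})         # may be served by a secondary
```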
One thing to note is that while both machines are up, your queries are being split between them. When one goes down, all queries will go to the remaining machine thus doubling the demands placed on it. You'd have to make sure your machines could withstand a sudden doubling of queries.
In that situation, I'd reconsider sharding in the first place, and just make it an un-sharded replica set of 2 machines (+1 arbiter).
You are missing one crucial detail: if you have a sharded setup with only two physical nodes and one dies, all your data is gone. This is because you don't have any redundancy below the sharding layer (the recommended way is for each shard to be composed of a replica set).
What you said about the replica set, however, is true: you can run it on two shared-nothing nodes and have an additional arbiter. However, the recommended setup would be 3 nodes: one primary and two secondaries.
http://www.markus-gattol.name/ws/mongodb.html#do_i_need_an_arbiter