Rebalance data after adding nodes - nosql

I'm using Cassandra 2.0.4 (with vnodes) and 2 days ago I added 2 nodes (.210 and .195). I expected Cassandra to redistribute the existing data automatically, but today I still find this nodetool status:
Issuing a nodetool repair on any of the nodes doesn't do anything either (the repair finishes within seconds.) The logs state that the repair is being executed as expected, but after preparing the repair plan it pretty much instantly finishes executing said plan.
Was I wrong to assume the existing data would be redistributed at all, or is something wrong? And if that isn't the case, how do I manually 'rebalance' the data?
Worth noting: I seem to have lost some data after adding these new nodes. Issuing a select on certain keys only returns data from the last couple of days rather than weeks, which makes me think the data is saved on .92 while Cassandra queries for it on one of the new servers. But that's really just an uneducated guess; I may have simply broken something during all of my trial & error tests, meaning the data is actually gone (even though I don't issue deletes, ever).
Can anyone enlighten me?

There is currently no manual rebalance option for vnode-enabled clusters.
But your cluster doesn't look unbalanced based on the nodetool status output you show. I'm curious as to why node .88 has only 64 tokens compared to the others but that isn't a problem per se. When a cluster is smaller there will be a slight variance in the balance of data across the nodes.
As for the data issues, you can try running nodetool repair -pr around the nodes in the ring and then nodetool cleanup and see if that helps.

Related

How to ensure consistent reading in distributed system?

In a distributed system, if only half of the nodes are successfully written, the subsequent nodes that read the unwritten data will be inconsistent. How to avoid this situation?
client write --> Node1 v
--> Node2 v
client read --> Node3 x(The latest data was not read)
My plan:
Compare the data version with other nodes when reading data
If the current node version is found to be lower, it will be routed to other nodes to read data.
I am going to ignore tags [mongo and elastic] :)
What you are planning to do is called Dynamo style replication. That system is eventually consistent by design. (I read a while ago that it could get strongly consistent with some effort, but I don't remember if that paper was correct.)
Back to dynamo and quorum: with three nodes you want at least 2 nodes to save a write before you assume the write has succeeded. The important point is that you need two nodes to report success back to the customer, but the data should still be sent to all three nodes.
Let's assume that data is written to two nodes, and the third failed but came back online later. To read the data, you have to read it from any two nodes as well. You will send read requests to all three, but only two responses are needed to report back to the customer. This gives you a quorum: 2+2>3. This guarantees that there is an intersection between writes and reads.
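The 2+2>3 condition can be checked by brute force. The following is only a minimal sketch: the function name quorums_always_overlap is made up for illustration, and it models nodes as plain integers rather than real replicas.

```python
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )

print(quorums_always_overlap(3, 2, 2))  # W + R = 4 > 3
print(quorums_always_overlap(3, 1, 2))  # W + R = 3, which is not > 3
```

The first call prints True and the second prints False: with a single acknowledged write, a read of the other two nodes can miss it entirely.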
This will work OK when the network is good and the nodes are healthy. But you will run into major challenges, lost updates and conflict resolution to name a few. Either way, the system will not be strongly consistent, by design.
Let me describe another interesting issue to illustrate weak consistency:
node 1 gets the write
the rest of process fails; node 1 has new data, but node 2 and 3 don't
now, when you read, under the quorum condition, you may or may not see the value from node 1 - since you are picking any two nodes for a read, node 1 may not be in that set.
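To make the scenario above concrete, here is a small sketch that enumerates every possible two-node read after a write that reached node 1 only. The node names are placeholders, and the model ignores timing and retries.

```python
from itertools import combinations

NODES = ("node1", "node2", "node3")
has_new_value = {"node1"}  # the write reached node 1 only, then the process failed

# Every pair of nodes that a quorum read might pick.
read_sets = list(combinations(NODES, 2))
seeing_new = [rs for rs in read_sets if has_new_value & set(rs)]

print(f"{len(seeing_new)} of {len(read_sets)} quorum reads see the new value")
```

One of the three possible read sets (node 2 and node 3) never sees the value, even though the write was never reported as successful or rolled back.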
Long story short, dynamo is not good for strong consistency, and we get to the Raft part of the solution.
Raft will get you what you need. A consistent system. There is a catch to watch for. Most examples are focused on writing - raft maintains a log of messages and consensus is used to agree on the order (and content) of these messages.
But when you do a read, you can't just go to a node, or any two nodes, or three and read the value. You will have to do read via Raft as well, by attaching a read operation to raft's log. This is called linearizable read.
I'll stop here, as this is a pretty complicated topic (but not an impossible one to learn).
Hope this gave you enough ideas to explore.
I saw that both mongodb and elasticsearch are tagged. I don't know which case you are thinking of, but the two databases are very different.
For mongo, replicas are not used by default to increase read speed (see https://docs.mongodb.com/manual/core/read-preference): the default read preference only reads from the primary and excludes all replicas. Writes in Mongo also go to the primary first, and replication happens asynchronously, possibly after the write to the primary finishes (see https://docs.mongodb.com/manual/core/replica-set-members/). Because of that, if you force a read from a secondary, you are not guaranteed to get the newest data.
For elasticsearch, it naturally does not guarantee that you always read the most recent data (see https://www.elastic.co/guide/en/elasticsearch/reference/current/near-real-time.html), so either way, even if there is only one node, you may get data that is out of date.

Can Cassandra or ScyllaDB give incomplete data while reading with PySpark if either cluster is left un-repaired forever?

I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering: if either of them is never repaired, is there any problem when reading data from it if there are inconsistencies between nodes? Will the correct data be read, and if yes, then why do we need to repair them?
Yes, you can get incorrect data if repair is not done. It also depends on the consistency level you are reading and writing with. Generally, in production systems writes are done with LOCAL_ONE or LOCAL_QUORUM and reads with LOCAL_QUORUM.
If you are writing with weak consistency level, then repair becomes important as some of the nodes might not have got the mutations and while reading those nodes may get selected.
For example, suppose you write with consistency level ONE on a table TABLE1 with a replication factor of 3. It may happen that your write reached NodeA only, and NodeB and NodeC missed the mutation. If you then read with consistency level LOCAL_QUORUM, it may happen that NodeB and NodeC get selected, and they do not return the written data.
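The example above comes down to counting acknowledgements. The sketch below is an illustration only: the helper names are invented, it assumes a single datacenter, and it covers just the consistency levels mentioned in this thread.

```python
def replicas_required(level: str, rf: int) -> int:
    """Replica acknowledgements a consistency level waits for (single datacenter assumed)."""
    level = level.upper()
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level in ("QUORUM", "LOCAL_QUORUM"):
        return rf // 2 + 1
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {level}")

def read_overlaps_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True if any read replica set must include a replica that acknowledged the write."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf

print(read_overlaps_write("ONE", "LOCAL_QUORUM", 3))           # 1 + 2 is not > 3
print(read_overlaps_write("LOCAL_QUORUM", "LOCAL_QUORUM", 3))  # 2 + 2 > 3
```

With a write at ONE and a read at LOCAL_QUORUM the check fails, which matches the NodeB and NodeC case described above.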
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep data in healthy state.
As others have noted in other answers, different consistency levels make repair more or less important for different reasons. So I'll focus on the consistency level that you said in a comment you are using: LOCAL_ONE for reading and LOCAL_QUORUM for writing:
Successfully writing with LOCAL_QUORUM only guarantees that two replicas have been written. If the third replica is temporarily down, and will later come up - at that point one third of the read requests for this data, reads done from only one node (this is what LOCAL_ONE means) will miss the new data! Moreover, there isn't even a guarantee of so-called monotonic consistency - you can get new data in one read (from one node), and the old data in a later read (from another node).
However, it isn't completely accurate that only a repair can fix this problem. Another feature - enabled by default on both Cassandra and Scylla - is called Hinted Handoff - where when a node is down for relatively short time (up to three hours, but also depending on the amount of traffic in that period), other nodes which tried to send it updates remember those updates - and retry the send when the dead node comes back up. If you are faced only with such relatively short downtimes, repair isn't necessary and Hinted Handoff is actually enough.
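A toy model of the idea, not how Cassandra or Scylla actually implement it: the Replica and Coordinator classes and the per-replica hint list are assumptions made for illustration, and only the three-hour window comes from the text above.

```python
from dataclasses import dataclass, field

MAX_HINT_WINDOW_SECONDS = 3 * 3600  # the three-hour window mentioned above

@dataclass
class Replica:
    up: bool = True
    down_since: float = 0.0
    data: dict = field(default_factory=dict)

@dataclass
class Coordinator:
    hints: list = field(default_factory=list)

    def write(self, replica: Replica, key, value, now: float) -> None:
        if replica.up:
            replica.data[key] = value
        elif now - replica.down_since <= MAX_HINT_WINDOW_SECONDS:
            self.hints.append((key, value))  # remember the update for later
        # Otherwise the update is dropped for this replica; only repair restores it.

    def replay_hints(self, replica: Replica) -> None:
        for key, value in self.hints:
            replica.data[key] = value
        self.hints.clear()

replica = Replica(up=False, down_since=0.0)
coordinator = Coordinator()
coordinator.write(replica, "a", 1, now=600)       # inside the window: hinted
coordinator.write(replica, "b", 2, now=4 * 3600)  # outside the window: not hinted
replica.up = True
coordinator.replay_hints(replica)
print(replica.data)  # only key "a" arrived
```

The write that happened after the window closed never reaches the replica, which is the case where a repair is still needed.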
That being said, Hinted Handoff isn't guaranteed perfect and might miss some inconsistencies. E.g., the node wishing to save a hint might itself be rebooted before it managed to save the hint, or replaced after saving it. So this mechanism isn't completely foolproof.
By the way, there is another thing you need to be aware of: if you ever intend to do a repair (e.g., perhaps after some node was down for too long for Hinted Handoff to have worked, or perhaps because a QUORUM read causes a read repair), you must do it at least once every gc_grace_seconds (this defaults to 10 days).
The reason for this statement is the risk of data resurrection by repair which is too infrequent. The thing is, after gc_grace_seconds, the tombstones marking deleted items are removed forever ("garbage collected"). At that point, if you do a repair and one of the nodes happens to have an old version of this data (prior to the delete), the old data will be "resurrected" - copied to all replicas.
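The resurrection effect can be reproduced with a toy model. Everything here is a simplification assumed for illustration: replicas are plain dicts, compact() and repair() are invented helpers, and real tombstone purging depends on more conditions than age alone.

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # 864000 seconds, the 10-day default cited above

def compact(replica: dict, now: int) -> dict:
    """Toy garbage collection: drop tombstones older than gc_grace_seconds."""
    return {
        key: cell for key, cell in replica.items()
        if not (cell["deleted"] and now - cell["ts"] > GC_GRACE_SECONDS)
    }

def repair(replicas: list) -> list:
    """Toy repair: every replica ends up with the newest cell for each key."""
    merged = {}
    for replica in replicas:
        for key, cell in replica.items():
            if key not in merged or cell["ts"] > merged[key]["ts"]:
                merged[key] = cell
    return [dict(merged) for _ in replicas]

def scenario(repair_before_gc: bool) -> list:
    live = {"ts": 50, "value": "old", "deleted": False}
    tomb = {"ts": 100, "value": None, "deleted": True}
    replicas = [{"k": tomb}, {"k": tomb}, {"k": live}]  # third replica missed the delete
    if repair_before_gc:
        replicas = repair(replicas)
    now = 100 + GC_GRACE_SECONDS + 1  # tombstones are now eligible for removal
    replicas = [compact(r, now) for r in replicas]
    return repair(replicas)

print(scenario(repair_before_gc=True))   # key stays deleted everywhere
print(scenario(repair_before_gc=False))  # the old value comes back on all replicas
```

When the repair runs before the tombstones expire, the delete reaches the third replica first and the key stays gone. When it runs only afterwards, the stale value is the only surviving version and gets copied everywhere.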
In addition to Manish's great answer, I'll just add that read operations run at consistency levels higher than *_ONE have a (small...10% default) chance to invoke a read repair. I have seen that applications running at a higher consistency level for reads will have fewer issues with inconsistent replicas.
Although, writing at *_QUORUM should ensure that the majority (quorum) of replicas are indeed consistent. Once it's written successfully, data should not "go bad" over time.
That all being said, running periodic (weekly) repairs is a good idea. I highly recommend using Cassandra Reaper to manage repairs, especially if you have multiple clusters.

Postgres VACUUM and replication

I have a master postgres with 2 async replication slaves.
I run VACUUM FULL VERBOSE ANALYSE my_table on all tables. After vacuuming, the slaves get out of sync.
My application reads from the slaves, and currently everything is wrong!
How can I force a sync or run a re-sync?
What's the problem here? Why running vacuum issued a problem?!
What's the problem here?
Your server log files can probably answer that much more accurately than random strangers without access to your computer can. What do the log files say? The replica logs are probably more interesting than the master logs, but check both.
Do you get messages about requested WAL segment %s has already been removed? If so, you will have to recreate your replicas. (Unless you have a WAL archive someplace which the replicas aren't currently configured to use, but even then, recreating may be faster and easier.)
If you are using replication slots, the master should be retaining all the necessary WAL. In that case the replicas would still be trying to catch up, it might just take them a long time to do so. Either wait, or re-create them if you think that that will be faster.
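To judge whether waiting is realistic, it helps to know how far behind a replica is. The sketch below assumes you have already fetched two LSN strings yourself, for example with pg_current_wal_lsn() on the master and pg_last_wal_replay_lsn() on the replica (PostgreSQL 10 and later; older releases use differently named functions). The helper names are made up, and the sample values are invented.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def replay_lag_bytes(master_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay to reach the master's position."""
    return lsn_to_int(master_lsn) - lsn_to_int(replica_lsn)

# Invented sample values, not taken from a real server.
print(replay_lag_bytes("1/00000000", "0/FF000000"))  # 16777216 bytes, i.e. 16 MiB
```

Sampling this a few minutes apart gives a rough replay rate, which tells you whether the replicas are catching up or falling further behind.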
Why running vacuum issued a problem?!
The key here is the FULL. Doing that basically rewrote your entire database, generating massive amounts of WAL which need to be fetched over the network and then replayed. The bottleneck could be anything from the network to the CPU to the disk drive.
Don't do VACUUM FULL without a darn good reason.

Is it practical to keep logging messages in a group communication service or Paxos?

In the case of a network partition or node crash, most distributed atomic broadcast protocols (like Extended Virtual Synchrony or Paxos) require the running nodes to keep logging messages until the crashed or partitioned node rejoins the cluster. When a node rejoins the cluster, replaying the logged messages is enough to regain the current state.
My question is: if the partitioned/crashed node takes a really long time to join the cluster again, then eventually the logs will overflow. This seems to be a very practical issue, but still no one talks about it in their papers. Is there a very obvious solution to this which I am missing? Or is my understanding incorrect?
You don't really need to remember the whole log. Imagine for example that the state you were synchronizing between the nodes was something like an SQL table with a row of the form (id: int, name: string) and the commands that would be written into the logs were in a form "insert row with id=x and name=y", "delete row where id=z", "set name=a where id=1000",...
Once such commands were committed, all you really care about is the final table. Then once a node which was offline for a long time goes online, it would only need to download the table + few entries from the log that were committed while the table was being downloaded.
This is called "log compaction"; check out Section 7 of the Raft paper for more info.
There are a few potential solutions to the infinite log problem, but one of the more popular ones for replicated state machines is to periodically snapshot the full replicated state machine and delete all history prior to that point. A node that has been offline too long would then just discard all of its information, download the snapshot, and start replaying the replicated log from that point.

Slony-I replication CPU usage

I have recently had to install slony (version 2.0.2) at work. Everything works fine, however, my boss would like to lower the cpu usage on slave nodes during replication. Searching on the net does not reveal any blatantly obvious answers to this. Any suggestions that would help reduce CPU usage (or spread the update out over a longer period) would be very much appreciated!
Have you looked into general PostgreSQL tuning here? The server can waste a lot of CPU cycles doing redundant work if it's not given enough resources to work with, and the default config is extremely small. Tuning Your PostgreSQL Server is a useful guide here, shared_buffers and checkpoint_segments are the two parameters you might get some significant improvement from on a slave (many of the rest only really help for improving query time).
Magnus might be right, this could very well just be a symptom of the fact that your database has very high traffic. Slony effectively multiplies the resource usage of any given DML operation: not only is data CRUD'ed to the replication master, but every time that happens, a Slony trigger (think of it as a change listener) generates an identical transaction and forwards it to the Slon process, which runs it on other members of the cluster.
However, there are two other possible explanations/solutions to this issue:
A possible solution might be to run the slon processes on a separate machine from your database hosts. Even if you have a single-master/single-slave replication scheme, it is advantageous in terms of stability, role-segregation, and performance (that’s you) to run the slon replication daemons on a physically different set of hardware (on the same LAN segment, ideally). There is nothing about Slony that says it has to run on the same machine as a given database host, so putting it in a different location (think “traffic controller”) might relieve some of the resource load on your database hosts. This is also a good idea in terms of both machine stability and scalability.
There's also a chance that this is only a temporary problem caused by the fact that you recently started using Slony. When you first subscribe a new node to a replication set, that node (and, to some extent, its parent) experiences VERY heavy CPU load (and possibly disk load as well) during the subscription process. I'm not sure how it works under the covers, but, depending on how much data was already on the node subscribed, Slony will check the master’s data against every single piece of data present on the slave in tables that are replicated, and copy data down to the slave if it is missing or different. These are potentially CPU-intensive operations. Especially in large databases, the process of subscription can take a very long time (it took over a day for me, but our database is over 20GB), during which CPU load will be very high. A simple way to see what Slony is up to is to use pgAdmin’s Server Status viewer, which, while limited, will give you some useful info here. If there are a lot of “prepare table for replication” or “cleanup table after replication” operations in progress on the node that has a high CPU load, it’s probably because a subscription isn’t complete. pgAdmin’s status viewer isn’t too informative, however; there are more reliable ways of checking subscription progress using Slony directly. Section 4.7.6.4 in the Slony log-analysis documentation might help with that, as would reading the doc for SUBSCRIBE SET (pay special attention to the boxed warning message and the "Dangerous/Unintuitive Behavior" section). A simple yet definitive hack to tell whether a set is still in the process of subscribing is to run a MERGE SET and try to merge it with an empty (or not) other set. MERGE SET will fail with a "subscriptions in progress" error if a subscription is still running. However, that hack won't work on Slony 2.1; MERGE SET will just wait until subscriptions are finished.
The best way to reduce the CPU usage would be to put less data into the database :-)
Other than that, you can experiment with sync_interval. It may be what you're looking for.