Ceph: What happens when enough disks fail to cause data loss?

Imagine that we have enough disk failures in Ceph to cause actual loss of data (e.g. all 3 replicas fail with 3-way replication, or more than m chunks fail with k+m erasure coding). What happens now?
Is the cluster still stable? That is, we've obviously lost that data, but will other data, and new data, still work well?
Is there any way to get a list of the lost object ids?
In our use case, we could recover the lost data from offline backups. But, to do that, we'd need to know which data was actually lost - that is, get a list of the object ids that were lost.

Answer 1: what happens?
Ceph distributes your data in placement groups (PGs). Think of them as shards of your data pool. By default a PG is stored as 3 copies spread over your storage devices (OSDs, i.e. disks). Again by default, a minimum of 2 copies must be known to Ceph for the PG to remain accessible. Should only 1 copy be available (because 2 OSDs are offline), writes to that PG are blocked until the minimum number of copies (2) is online again. If all copies of a PG are offline, your reads will be blocked until one copy comes back online. All other PGs remain freely accessible as long as they are online with enough copies.
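A quick way to see which PGs are in trouble is to ask the cluster directly. A minimal sketch, assuming the ceph CLI is installed with admin access; the JSON layout differs slightly between releases, hence the defensive parsing:

    # Sketch: list every PG that is not active+clean, using the ceph CLI.
    import json
    import subprocess

    def pgs_not_active_clean():
        out = subprocess.check_output(
            ["ceph", "pg", "dump", "pgs_brief", "--format", "json"]
        )
        dump = json.loads(out)
        # Depending on the Ceph release the PG list is either top-level or under "pg_stats".
        pgs = dump.get("pg_stats", dump) if isinstance(dump, dict) else dump
        return [(pg["pgid"], pg["state"]) for pg in pgs if pg["state"] != "active+clean"]

    for pgid, state in pgs_not_active_clean():
        print(pgid, state)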
Answer 2: what is affected?
You are probably referring to the S3-like object storage (RadosGW). This is modelled on top of the RADOS object store, which is the core storage layer of Ceph. Problematic PGs can be traced and associated with the corresponding RADOS objects. There is documentation about identifying blocked RadosGW requests, and another section about getting from defective PGs to the RADOS objects.
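For the original question (a list of the lost object ids): when Ceph knows objects are missing it reports them as "unfound", and they can be listed per PG. A minimal sketch, assuming the ceph CLI; the PG id is a placeholder you would take from ceph health detail, and the exact JSON nesting can vary between releases:

    # Sketch: list object ids reported as "unfound" for one PG.
    import json
    import subprocess

    def list_unfound(pgid):
        out = subprocess.check_output(["ceph", "pg", pgid, "list_unfound", "--format", "json"])
        data = json.loads(out)
        names = []
        for obj in data.get("objects", []):
            oid = obj.get("oid")
            # The oid is usually a nested dict whose "oid" field holds the object name.
            names.append(oid.get("oid") if isinstance(oid, dict) else oid)
        return names

    print(list_unfound("2.5"))  # hypothetical PG id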

Related

How to ensure consistent reading in a distributed system?

In a distributed system, if a write only succeeds on half of the nodes, a subsequent read from a node that missed the write will return inconsistent data. How can this situation be avoided?
client write --> Node1 v
             --> Node2 v
client read  --> Node3 x (the latest data was not read)
My plan:
Compare the data version with other nodes when reading data.
If the current node's version is found to be lower, route the read to another node.
I am going to ignore tags [mongo and elastic] :)
What you are planning to do is called Dynamo style replication. That system is eventually consistent by design. (I read a while ago that it could get strongly consistent with some effort, but I don't remember if that paper was correct.)
Back to Dynamo and quorums: with three nodes you want at least 2 nodes to acknowledge a write before you consider it successful. The important point is that you only need two nodes to report success back to the client, but the data should still be sent to all three nodes.
Let's assume the data is written to two nodes, while the third failed but came back online later. To read the data, you have to read it from any two nodes as well. You will send read requests to all three, but only two are needed to report back to the client. This gives you a quorum: 2 + 2 > 3. It guarantees that there is an intersection between the nodes involved in writes and reads.
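Here is a toy sketch of that quorum idea (N=3, W=2, R=2), with the version comparison from the question used to pick the newest value among the replies. It is an in-memory illustration only, not a real replication protocol:

    # Toy quorum store: N=3 replicas, writes need W=2 acks, reads need R=2 replies.
    # Each replica stores (version, value); a read returns the highest version it sees.
    N, W, R = 3, 2, 2
    replicas = [dict() for _ in range(N)]  # replica index -> {key: (version, value)}

    def write(key, value, version):
        acks = 0
        for rep in replicas:
            rep[key] = (version, value)  # in reality an RPC that may fail or time out
            acks += 1
        if acks < W:
            raise RuntimeError("write failed: quorum not reached")

    def read(key):
        replies = [rep[key] for rep in replicas if key in rep]
        if len(replies) < R:
            raise RuntimeError("read failed: quorum not reached")
        # W + R > N guarantees at least one reply overlaps the last successful write,
        # so taking the highest version returns the latest written value.
        return max(replies)[1]

    write("x", "hello", version=1)
    replicas[2].pop("x", None)  # simulate the third node missing the write
    print(read("x"))            # still "hello", because 2 + 2 > 3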
This will work okay when the network is good and the nodes are healthy. But you will run into major challenges, lost updates and conflict resolution to name a few. Either way, the system will not be strongly consistent, by design.
Let me describe another interesting issue to illustrate weak consistency:
node 1 gets the write
the rest of the process fails; node 1 has the new data, but nodes 2 and 3 don't
now, when you read under the quorum condition, you may or may not see the value from node 1 - since you are picking any two nodes for a read, node 1 may not be in that set.
Long story short, dynamo is not good for strong consistency, and we get to the Raft part of the solution.
Raft will get you what you need: a consistent system. There is a catch to watch for. Most examples are focused on writing - Raft maintains a log of messages, and consensus is used to agree on the order (and content) of those messages.
But when you do a read, you can't just go to one node, or any two nodes, or all three and read the value. You have to do reads via Raft as well, by attaching the read operation to Raft's log. This is called a linearizable read.
I'll stop here, as this is a pretty complicated topic (but not an impossible one to learn).
Hope this gave you enough ideas to explore.
I saw that both mongodb and elasticsearch are tagged; I don't know which case you have in mind, but the two databases are very different.
For MongoDB, replicas are not used by default to increase read speed (see https://docs.mongodb.com/manual/core/read-preference): the default read preference only looks at the primary and excludes all replicas. Writes in MongoDB also go to the primary first, and replication happens asynchronously, possibly after the write to the primary finishes (see https://docs.mongodb.com/manual/core/replica-set-members/). Because of that, if you force a read from a secondary, you are not guaranteed to get the newest data.
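For example, with the pymongo driver the default targets the primary, and reading from a secondary has to be requested explicitly (the URI and names here are placeholders); such a read may return stale data:

    # Sketch with pymongo: default reads go to the primary; secondary reads are opt-in.
    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
    coll = client.mydb.mycoll

    coll.insert_one({"_id": 1, "v": "new"})   # written to the primary

    primary_read = coll.find_one({"_id": 1})  # default read preference: PRIMARY

    secondary_coll = coll.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
    secondary_read = secondary_coll.find_one({"_id": 1})  # may lag behind the primary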
For Elasticsearch: by design, Elasticsearch does not guarantee that you always read the most recent data (see https://www.elastic.co/guide/en/elasticsearch/reference/current/near-real-time.html), so even with only one node you may get data that is out of date.
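You can see that behaviour directly from the Python client (assuming the 8.x elasticsearch package and a local node; both are placeholders here): a freshly indexed document only becomes searchable after a refresh, which you can wait for explicitly.

    # Sketch with elasticsearch-py: a document is only searchable after a refresh.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    # refresh="wait_for" blocks until the document is visible to search;
    # without it, an immediate search may not find the document yet.
    es.index(index="demo", id="1", document={"msg": "hello"}, refresh="wait_for")

    hits = es.search(index="demo", query={"match": {"msg": "hello"}})
    print(hits["hits"]["total"])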

Can Cassandra or ScyllaDB return incomplete data when reading with PySpark if either cluster is left un-repaired forever?

I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering: if either of them is never repaired, is there any problem reading data once the nodes have become inconsistent? Will the correct data still be read, and if so, why do we need to repair them at all?
Yes, you can get incorrect data if repair is not done. It also depends on what consistency level you are reading and writing with. Generally, in production systems writes are done with LOCAL_ONE or LOCAL_QUORUM and reads with LOCAL_QUORUM.
If you are writing with a weak consistency level, then repair becomes important, because some of the nodes might not have received the mutations, and those nodes may get selected while reading.
For example, say you write with consistency level ONE to a table TABLE1 with a replication factor of 3. It may happen that your write lands on NodeA only, while NodeB and NodeC miss the mutation. If you then read with consistency level LOCAL_QUORUM, NodeB and NodeC may get selected and they will not return the written data.
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep the data in a healthy state.
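For illustration, consistency levels are chosen per statement in the DataStax Python driver; the common pattern of writing and reading at LOCAL_QUORUM looks roughly like this (contact point, keyspace and table are placeholders):

    # Sketch: per-statement consistency levels with the DataStax Python driver.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])          # placeholder contact point
    session = cluster.connect("my_keyspace")  # placeholder keyspace

    write = SimpleStatement(
        "INSERT INTO table1 (id, val) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    session.execute(write, (1, "hello"))

    read = SimpleStatement(
        "SELECT val FROM table1 WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    row = session.execute(read, (1,)).one()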
As others have noted in other answers, different consistency levels make repair more or less important for different reasons. So I'll focus on the consistency level that you said in a comment you are using: LOCAL_ONE for reading and LOCAL_QUORUM for writing:
Successfully writing with LOCAL_QUORUM only guarantees that two replicas have been written. If the third replica is temporarily down and later comes back up, then at that point one third of the read requests for this data (reads done from only one node, which is what LOCAL_ONE means) will miss the new data! Moreover, there isn't even a guarantee of so-called monotonic consistency: you can get the new data in one read (from one node), and the old data in a later read (from another node).
However, it isn't completely accurate that only a repair can fix this problem. Another feature, enabled by default on both Cassandra and Scylla, is called Hinted Handoff: when a node is down for a relatively short time (up to three hours, but also depending on the amount of traffic in that period), the other nodes which tried to send it updates remember those updates and retry the send when the dead node comes back up. If you are faced only with such relatively short downtimes, repair isn't necessary and Hinted Handoff is actually enough.
That being said, Hinted Handoff isn't guaranteed perfect and might miss some inconsistencies. E.g., the node wishing to save a hint might itself be rebooted before it managed to save the hint, or replaced after saving it. So this mechanism isn't completely foolproof.
By the way, there is another thing you need to be aware of: if you ever intend to do a repair (e.g., perhaps after some node was down for too long for Hinted Handoff to have worked, or perhaps because a QUORUM read causes a read repair), you must do it at least once every gc_grace_seconds (this defaults to 10 days).
The reason for this is the risk of data resurrection caused by repairs that run too infrequently. The thing is, after gc_grace_seconds, the tombstones marking deleted items are removed forever ("garbage collected"). At that point, if you do a repair and one of the nodes happens to have an old version of this data (from before the delete), the old data will be "resurrected", i.e. copied to all replicas.
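gc_grace_seconds is a per-table setting, so you can check it and adjust it if your repair schedule needs more headroom; a small sketch with the Python driver (keyspace and table names are placeholders):

    # Sketch: inspect and adjust gc_grace_seconds (default 864000 s = 10 days).
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # placeholder contact point
    session = cluster.connect()

    row = session.execute(
        "SELECT gc_grace_seconds FROM system_schema.tables "
        "WHERE keyspace_name = 'my_keyspace' AND table_name = 'table1'"
    ).one()
    print(row.gc_grace_seconds)

    # Repairs must complete within this window, otherwise tombstones may be collected
    # before all replicas have seen the delete, which risks data resurrection.
    session.execute("ALTER TABLE my_keyspace.table1 WITH gc_grace_seconds = 864000")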
In addition to Manish's great answer, I'll just add that read operations run at consistency levels higher than *_ONE have a small (10% by default) chance of invoking a read repair. I have seen that applications running at a higher consistency level for reads have fewer issues with inconsistent replicas.
That said, writing at *_QUORUM should ensure that a majority (quorum) of replicas are indeed consistent. Once data is written successfully, it should not "go bad" over time.
That all being said, running periodic (weekly) repairs is a good idea. I highly recommend using Cassandra Reaper to manage repairs, especially if you have multiple clusters.

Changing osd while there are inconsistent objects inside

I have a disk in my cluster that started giving PG read errors after scrubs over the last 2 days. Manual repairs are taking too much time and the disk's smartctl output does not look good. My question is: can I change disks before the errors are fixed? Will it cause some corruption across the cluster? Every PG is active+clean; only some PGs (erasure coded) are active+clean+inconsistent. Should I start the recovery process and then replace the OSD?
PS: none of the data chunks with read errors are the primary chunk.
Your question is not entirely clear to me, so I'll answer it as I understand it.
The safest way to replace a failed disk is to let Ceph rebalance after the disk has been taken out, and then let it remap after the new disk has been deployed. This causes a lot of network traffic and can take quite some time, but it reduces the risk of data loss.
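The usual manual sequence for that looks roughly like the sketch below (the OSD id is a placeholder, and the HEALTH_OK check is only a crude way to wait for backfill to finish; adapt it to your cluster before running anything):

    # Sketch: mark an OSD out, wait for rebalancing, then remove it from the cluster.
    import subprocess
    import time

    OSD_ID = "7"  # hypothetical OSD id

    # 1. Mark the OSD out so Ceph starts backfilling its data onto other OSDs.
    subprocess.check_call(["ceph", "osd", "out", OSD_ID])

    # 2. Crude wait: poll until the cluster reports HEALTH_OK again.
    while not subprocess.check_output(["ceph", "health"]).startswith(b"HEALTH_OK"):
        time.sleep(60)

    # 3. Remove the old OSD, then deploy the new disk as a fresh OSD and let Ceph remap.
    subprocess.check_call(["ceph", "osd", "crush", "remove", "osd." + OSD_ID])
    subprocess.check_call(["ceph", "auth", "del", "osd." + OSD_ID])
    subprocess.check_call(["ceph", "osd", "rm", OSD_ID])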
There are ways to prevent Ceph from backfilling and to deploy a new disk while in a degraded state, but this increases the risk of another disk failing during that degraded state, which can lead to data loss. I would advise against it.
You write
"can I change disks before the errors are fixed?"
How many disks have failed? PG availability depends on having at least k + 1 active chunks for erasure-coded pools (the pool's min_size).
"Manual repairs are taking too much time"
You mean ceph pg repair <PG_ID>? Usually I would wait for it; it can help, and not all scrub errors mean corrupted data. But in your case, given the smartctl output, it's probably just a temporary fix, if it even succeeds.
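Before (or instead of) repairing blindly, you can also look at exactly which objects the scrub flagged; a small sketch, with the PG id as a placeholder taken from ceph health detail:

    # Sketch: list the objects a scrub flagged as inconsistent in one PG.
    import json
    import subprocess

    PGID = "2.1f"  # hypothetical PG id

    out = subprocess.check_output(["rados", "list-inconsistent-obj", PGID, "--format=json"])
    report = json.loads(out)
    for item in report.get("inconsistents", []):
        print(item["object"]["name"], item.get("errors"), item.get("union_shard_errors"))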

Is Journal + Majority 100% safe?

I have been reading a few articles on MongoDB fault tolerance, and some people complain it can never really be achieved with MongoDB (like in this article: http://hackingdistributed.com/2013/01/29/mongo-ft/), and this got me confused.
Can someone confirm (and if possible point me to the appropriate docs) that using the write concern "Journal + Majority" is enough to make sure that 100% of the writes reported as successful by my driver are durably written and won't be lost even if any replica fails just after the write?
I'm talking about a 3 replica setup. I'm ok with the system no longer accepting writes in case of failure, but when a write is reported as successful by the driver, I need it to be durably committed (regardless of the number of replica failing after that).
Right, so if you choose a journaled write you are basically ensuring that the write has made it to the disk of a single node. If you choose to do a majority write, you are ensuring that the write has made it to the memory of a majority of the nodes in your replica set.
By default, MongoDB will flush from memory to the journal every 100 ms. By having your replica nodes on different machines (physical or virtual), ideally in different data centres, you are very unlikely to ever see ALL nodes in a geographically distributed replica set go down within the same 100 ms before one of them gets the write to disk.
Alternatively, to guarantee that the write made it to the disk of a single node, use a journaled write.
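Concretely, that write concern is requested per collection (or per operation) in the driver; a sketch with pymongo, where the URI and names are placeholders:

    # Sketch with pymongo: require acknowledgement from a majority of nodes AND
    # journaling before the driver reports the write as successful.
    from pymongo import MongoClient, WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
    coll = client.mydb.mycoll.with_options(
        write_concern=WriteConcern(w="majority", j=True, wtimeout=5000)
    )

    # If this returns without raising, the write was acknowledged by a majority and
    # journaled, subject to the caveats discussed in the answer above.
    coll.insert_one({"_id": 1, "payload": "important"})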

Do NoSQL datacenter-aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but I imagine that it's similarly possible.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it is possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "NetworkTopologyStrategy"). For example, you specify that there should always be two replicas in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).
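A rough sketch of what that looks like with the Python driver: a keyspace replicated to two replicas in each datacenter, and DC-local consistency so requests never wait on the remote side (contact point, DC and keyspace names are placeholders):

    # Sketch: NetworkTopologyStrategy replication plus datacenter-local consistency.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy

    profile = ExecutionProfile(
        load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="DC1"),  # placeholder DC
        consistency_level=ConsistencyLevel.LOCAL_ONE,  # never waits on the remote DC
    )
    cluster = Cluster(["127.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS geo_ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 2}
    """)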