MarkLogic: How to speed up rebalancing process when adding a new forest to an existing database? - merge

Our production MarkLogic DB is having 1.2 TB data divided among 6 forests. We plan to add 2 new forests to reduce stands per forest count.
Now, adding new forests starts rebalancing the data. That's okay, it takes time. But this rebalancing time keeps shooting up whenever merges start alongside rebalancing. Sometimes it takes, estimated 8 hours to suddenly to 16 hours. So, on average the whole process is taking approximately 24 hours.
My question is - If we disable the merge before adding the new forests and enable the manual merge soon after rebalancing completes( after adding forests), would the combined process be faster? And, will it be safe to do this?

Anything that affects disk IO will affect the speed of rebalancing, including merging and standard database activity, however care should be taken if you are disabling merging.
The risk of disabling merging, is that you prevent the system from pruning stands, so if too many stands accumulate you may hit the hard limit, which will impact server operation.
If merging is having such a heavy impact, then you can look at tuning the merge configurations. More information can be found in the documentation.
Dangers of Disabling Merges
Understanding and Controlling Database Merges
Setting Merge Policy
Configuring Merge Policy Rules

In addition to the other info provided, assignment policy may affect how much work is done. See for example: https://docs.marklogic.com/guide/admin/database-rebalancing#id_81616 . You can also set the rebalancer throttle to make it work slower if the system is getting overwhelmed. But if you turn off merging while rebalancing I'm going to bet you'll hit a TOOMANYSTANDS error pretty quickly since the small stands will need to be written because of the rebalancer, but won't be able to merge to larger+fewer stands.

Related

Can Cassandra or ScyllaDB give incomplete data while reading with PySpark if either clusters are left un-repaired forever?

I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering if any of them are not repaired forever, is there any challenge while reading data from either if there are inconsistencies in nodes. Will the correct data be read and if yes, then why do we need to repair them?
Yes you can get incorrect data if reapir is not done. It also depends on with what consistency you are reading or writing. Generally in production systems writes are done with (Local_one/Local_quorum) and read with Local_quorum.
If you are writing with weak consistency level, then repair becomes important as some of the nodes might not have got the mutations and while reading those nodes may get selected.
For example if you write with consistency level ONE on a table TABLE1 with a replication of 3. Now it may happen your write was written to NodeA only and NodeB and NodeC might have missed the mutation. Now if you are reading with Consistency level LOCAL_QUORUM, it may happen that NodeB and 'NodeC' get selected and they do not return the written data.
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep data in healthy state.
As others have noted in other answers, different consistency levels make repair more or less important for different reasons. So I'll focus on the consistency level that you said in a comment you are using: LOCAL_ONE for reading and LOCAL_QUORUM for writing:
Successfully writing with LOCAL_QUORUM only guarantees that two replicas have been written. If the third replica is temporarily down, and will later come up - at that point one third of the read requests for this data, reads done from only one node (this is what LOCAL_ONE means) will miss the new data! Moreover, there isn't even a guarantee of so-called monotonic consistency - you can get new data in one read (from one node), and the old data in a later read (from another node).
However, it isn't completely accurate that only a repair can fix this problem. Another feature - enabled by default on both Cassandra and Scylla - is called Hinted Handoff - where when a node is down for relatively short time (up to three hours, but also depending on the amount of traffic in that period), other nodes which tried to send it updates remember those updates - and retry the send when the dead node comes back up. If you are faced only with such relatively short downtimes, repair isn't necessary and Hinted Handoff is actually enough.
That being said, Hinted Handoff isn't guaranteed perfect and might miss some inconsistencies. E.g., the node wishing to save a hint might itself be rebooted before it managed to save the hint, or replaced after saving it. So this mechanism isn't completely foolproof.
By the way, there another thing you need to be aware of: If you ever intend to do a repair (e.g., perhaps after some node was down for too long for Hinted Handoff to have worked, or perhaps because a QUORUM read causes a read repair), you must do it at least once every gc_grace_seconds (this defaults to 10 days).
The reason for this statement is the risk of data resurrection by repair which is too infrequent. The thing is, after gc_grace_seconds, the tombstones marking deleted items are removed forever ("garbage collected"). At that point, if you do a repair and one of the nodes happens to have an old version of this data (prior to the delete), the old data will be "resurrected" - copied to all replicas.
In addition to Manish's great answer, I'll just add that read operations run consistency levels higher than *_ONE have a (small...10% default) chance to invoke a read repair. I have seen that applications running at a higher consistency level for reads, will have less issues with inconsistent replicas.
Although, writing at *_QUORUM should ensure that the majority (quorum) of replicas are indeed consistent. Once it's written successfully, data should not "go bad" over time.
That all being said, running periodic (weekly) repairs is a good idea. I highly recommend using Cassandra Reaper to manage repairs, especially if you have multiple clusters.

How to run a Kafka Canary Consumer

We have a Kafka queue with two consumers, both read from the same partition (fan-out scenario). One of those consumers should be the canary and process 1% of the messages, while the other processes the 99% remaining ones.
The idea is to make the decision based on a property of the message, eg the message ID or timestamp (e.g. mod 100), and accept or drop based on that, just with a reversed logic for canary and non-canary.
Now we are facing the issue of how to do so robustly, e.g. reconfigure percentages while running and avoid loosing messages or processing them twice. It appears this escalates to a distributed consensus problem to keep the decision logic in sync, which we would very much like to avoid, even though we could just use ZooKeeper for that.
Is this a viable strategy, or are there better ways to do this? Possibly one that avoids consensus?
Update: Unfortunately the Kafka Cluster is not under our control, and we cannot make any changes.
Update 2 Latency of messages is not a huge issues, a few hundred 100ms added are okay and won't be noticed.
I dont see any way to change the "sampling strategy" across 2 machines without "ignoring" or double-processing records. Since different Kafka consumers could be in different positions in the partition, and could also get the new config at different times, you'd inevitably run into one of 2 scenarios:
Double processing of the same record by both machines
"Skipping" a record because neither machine thinks it should "own" it when it sees it.
I'd suggest a small change to your architecture instead:
Have the 99% machine (the non-canary) pick up all records, then decide for every record if it wants to handle it, or if it belongs to the canary
If it belongs to the canary, send the record to a 2nd topic (from the 99% machine)
Canary machine only listens on the 2nd topic, and processes every arriving record
And now you have a pipeline setup where decisions are only ever made in one point and no records are missed or double processed.
The obvious downside is somewhat higher latency on the canary machine. If you absolutely cannot tolerate the latency push the decision of which topic to produce to upstream to producers? (I don't know how feasible that is to you)
Variant in case a 2nd topic isnt allowed
If (as youve stated above) you cant have a 2nd topic, you could still make the decision only on the 99% machine, then for records that need to go to the canary, re-produce them into the origin partition with some sort of "marker" (either in the payload or as a kafka header, up to you).
The 99% machine will ignore any incoming records with a marker, and the canary machine will only process records with a marker.
Again, the major downside is added latency.

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partion Tolerant: the system can continue upholding A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, then something is bother me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.
It's the partition tolerance, that you got wrong.
As long as there isn't any partitioning happening, systems can be consistent and available. There are CA systems which say, we don't care about partitions. You can have them running inside racks with server hardware and make partitioning extremely unlikely. The problem is, what if partitions occur?
The system can either choose to
continue providing the service, hoping the other server is down rather than providing the same service and serving different data - choosing availability (AP)
stop providing the service, because it couldn't guarantee consistency anymore, since it doesn't know if the other server is down or in fact up and running and just the communication between these two broke off - choosing consistency (CP)
The idea of the CAP theorem is that you cannot provide both Availability AND Consistency, once partitioning occurs, you can either go for availability and hope for the best, or play it safe and be unavailable, but consistent.
Here are 2 great posts, which should make it clear:
You Can’t Sacrifice Partition Tolerance shows the idea, that every truly distributed system needs to deal with partitioning now and than and hence CA systems will break instantly at the first occurrence of a partition
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and shows the CAP theorem more flexible, where developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability, ...
So to finally answer your question, if you take a CP system and replicate it more often, you might either run into overhead of messages sent between the nodes of the system to keep it consistent, or - in case a substantial part of the nodes fails or network partitioning occurs without any part having a clear majority, it won't be able to continue operation as it wouldn't be able to guarantee consistency anymore. But yes, these lines are getting more blurred now and I think the references I've provided will give you a much better understanding.

A program resistent to power/hardware/OS failures

I need to write a program that performs a parallel search in a large space of possible states, with new areas being discovered (and their exploration started) in the process, and exploration of some areas being terminated early as intermediate results obtained elsewhere eliminate a possibility of discovering new useful results in them. The search is performed using multiple threads running in a heavy cooperation with each other to avoid recalculation of intermediate data.
A complex internal state (including call stacks of several threads and state synchronization primitives they use) has to be maintained and updated during the whole process, and there is no apparent way to split the computation into isolated chunks that can be executed sequentially, each saving and passing a small intermediate result to the next. Also, there is no way to split the computation into independent parallel threads not communicating with each other, without imposing a prohibitive overhead due to recalculation of large amount of intermediate data.
Because of the large search domain, the program would possibly run for months before producing a final result. Hence, there is a significant risk of power, hardware or OS failure during the program execution that can lead to a complete loss of all work that has been done to the moment. In such a case the program will need to restart all its computations from scratch.
I need a solution that can prevent a complete data loss in such cases. I thought of an execution engine/platform that continuously saves the current state of the process to a failure-resistant storage like a redundant disk array or database. But I understand that this approach can significantly slow down the process, even to a degree when there would be no benefit compared to an expected computation time including restarts due to possible failures.
In fact, I do not need an ideal solution that continuously saves the program state, and I can easily bear a loss of hours or maybe even days of work. A possible heavyweight solution that comes to my mind is to run the program inside a virtual machine, saving its snapshots from time to time, and restoring the machine after a possible host failure from a recent snapshot. This approach can also help to recover the program state after a random or preventable guest OS failure.
Is there a similar, but more lightweight solution limited to preserving a state of a single process? Or could you suggest any other approaches that can solve my problem?
You may want to look at using Erlang which allows large numbers of threads to run at relatively low cost. Because the thread cost is low, redundancy can be used to achieve increased reliability.
For the problem you present, a triple-redundancy scheme may be the way to go, where periodic checks for synchronization across the three (or more) systems would determine by vote who has failed.

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in cassandra as a queue, I don't think the strategy I use in mysql works, ie given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on datastax that describes the pitfalls and how to get around it, however Im not sure what the impact of having lots of tombstones lying around is, how long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue, us a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead to live rations of tens of thousands or more. And even with a GC grace of zero tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill upp with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages after time of arrival (for example with TIMEUUID clustering key) and can somehow keep track of the newest messages that has been delivered, you can do a query to find all messages after that message. That would mean less thrawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and it's compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do atomically, but at the cost of performance.
There are a couple of more answers here on StackOverflow about Cassandra and queues, read them (start with this: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds.
The grace period can be defined. Per default it is 10 days:
gc_grace_seconds¶
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the
documentation
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker to process one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.