How to avoid publishing duplicate data to Kafka via Kafka Connect and Couchbase Eventing when replicating Couchbase data across multiple data centers with XDCR - apache-kafka

My buckets are:
MyDataBucket: application saves its data on this bucket.
MyEventingBucket: A Couchbase Eventing function extracts the 'currentState' field from MyDataBucket and saves it in this bucket.
Also, I have a Couchbase Kafka connector that pushes data from MyEventingBucket to a Kafka topic.
When we had a single data center, there wasn't any problem. Now, we have three data centers. We replicate our data with XDCR between data centers and we work active-active, so write requests can arrive at any data center.
When data is replicated to the other data centers, the Eventing service runs in every data center, and the same data is pushed to Kafka three times (because we have three data centers) by the Kafka connector.
How can we avoid pushing duplicate data to Kafka?
PS: Of course, we could run the Eventing service or the Kafka connector in only one data center, so that data is published to Kafka just once. But this is not a good solution, because we would be affected whenever a problem occurs in that data center, and avoiding that was the main reason for using multiple data centers.

Obviously in a perfect world XDCR would just work with Eventing on the replicated bucket.
I put together an Eventing-based workaround to overcome the issues in an active/active XDCR configuration. It is a bit complex, so I thought working code would be best; this is one way to implement the solution that Matthew Groves alluded to.
Documents are tagged, and a "cluster_state" document shared via XDCR (see the comments in the code) coordinates which cluster is "primary", since you only want one cluster to fire the Eventing function.
I will give the code for an Eventing function "xcdr_supression_700" for version 7.0.0; with a minor change it will also work for 6.6.5.
Note, newer Couchbase releases have more functionality WRT Eventing and allow the Eventing function to be simplified, for example:
Advanced Bucket Accessors in 6.6+, specifically couchbase.replace(), can use CAS and prevent potential races (note Eventing does not allow locking).
Timers have been improved and can be overwritten in 6.6+, simplifying the logic needed to determine whether a timer is an orphan.
Constant Alias bindings in 7.X allow the JavaScript Eventing code to be identical between clusters, changing just a setting for each cluster.
Setting up XDCR and Eventing
The following code will successfully suppress all extra Eventing mutations on a bucket called "common" or in 7.0.X a keyspace of "common._default._default" with an active/active XDCR replication.
The example is for two (2) clusters but may be extended. This code is 7.0 specific (I can supply a 6.5.1 variant if needed - please DM me).
PS: The only thing it does is log a message (in the cluster that is processing the function). You can just set up two one-node clusters; I named my clusters "couch01" and "couch03". It is pretty easy to set up and test, to ensure that mutations in your bucket are only processed once across two clusters with active/active XDCR.
The Eventing Function is generic WRT the JavaScript BUT it does require a different constant alias on each cluster; see the comment just under the OnUpdate(doc, meta) entry point.
/*
PURPOSE suppress duplicate mutations by Eventing when we use an Active/Active XDCR setup
Make two clusters "couch01" and "couch03" each with bucket "common" (if 7.0.0 keyspace "common._default._default")
On cluster "couch01", setup XDCR replication of common from "couch01" => "couch03"
On cluster "couch03", setup XDCR replication of common from "couch03" => "couch01"
This is an active / active XDCR configuration.
We process all documents in "common" except those with "type": "cluster_state"; the documents can contain anything:
{
"data": "...something..."
}
We add "owner": "cluster" to every document, in this sample I have two clusters "couch01" and "couch03"
We add "crc": "crc" to every document, in this sample I have two clusters "couch01" and "couch03"
If either the "owner" or "crc" property does not exist we will add the properties ourselves to the document
{
"data": "...something...",
"owner": "couch01",
"crc": "a63a0af9428f6d2d"
}
A document must exist with KEY "cluster_state"; when things are perfect it looks like the following:
{
"type": "cluster_state",
"ci_offline": {"couch01": false, "couch03": false },
"ci_backups": {"couch03": "couch01", "couch01": "couch03" }
}
Note ci_offline is an indicator that the cluster is down; for example if a document has an "owner": "couch01"
and "ci_offline": {"couch01": true, "couch03": false } then the cluster "couch03" will take ownership and the
documents will be updated accordingly. An external process (ping/verify CB is running, etc.) runs every minute
or so and then updates the "cluster_state" if a change in cluster state occurs; however, prior to updating
ci_offline to "true" the Eventing function on that cluster should either be undeployed or paused. In addition,
re-enabling the cluster by setting the flag ci_offline back to "false" must be done before the Function is resumed or
re-deployed.
The ci_backups tells which cluster is a backup for which cluster, pretty simple for two clusters.
If you have timers, then when a timer fires you MUST check whether doc.owner is correct; if not, ignore the timer, i.e.
do nothing. In addition, when you "take ownership" you will need to create a new timer. Finally, all timers should
have an id such that if we ping-pong ci_offline the timer will be overwritten; this implies 6.6.0+, else you
need to do even more work to suppress orphaned timers.
The 'near' identical Function will be deployed on both clusters "couch01" and "couch03"; make sure you have
a constant binding for 7.0.0 of THIS_CLUSTER "couch01" or THIS_CLUSTER "couch03", or for 6.6.0 uncomment the
appropriate var statement at the top of OnUpdate(). Next you should have a bucket binding of src_bkt to
keyspace "common._default._default" for 7.0.0 or to bucket "common" in 6.6.0 in mode read+write.
*/
function OnUpdate(doc, meta) {
    // ********************************
    // MUST MATCH THE CLUSTER AND ALSO THE DOC "cluster_state"
    // *********
    // var THIS_CLUSTER = "couch01"; // this could be a constant binding in 7.0.0, in 6.X we uncomment one of these to match the cluster name
    // var THIS_CLUSTER = "couch03"; // this could be a constant binding in 7.0.0, in 6.X we uncomment one of these to match the cluster name
    // ********************************
    if (doc.type === "cluster_state") return;
    var cs = src_bkt["cluster_state"]; // extra bucket op: read the state of the clusters
    if (cs.ci_offline[THIS_CLUSTER] === true) return; // this cluster is marked offline, do nothing.
    // ^^^^^^^^
    // IMPORTANT: when an external process marks the cs.ci_offline[THIS_CLUSTER] back to false (as
    // in this cluster becomes online) it is assumed that the Eventing function was undeployed
    // (or was paused) when it was set "true" and will be redeployed or resumed AFTER it is set "false".
    // The order of this procedure is very important else mutations will be lost.
    var orig_owner = doc.owner;
    var fallback_cluster = cs.ci_backups[THIS_CLUSTER]; // this cluster is the fallback for the fallback_cluster
    /*
    if (!doc.crc && !doc.owner) {
        doc.owner = fallback_cluster;
        src_bkt[meta.id] = doc;
        return; // the fallback cluster NOT THIS CLUSTER is now the owner, the fallback
                // cluster will then add the crc property, as we just made a mutation in that
                // cluster via XDCR
    }
    */
    if (!doc.crc && !doc.owner) {
        doc.owner = THIS_CLUSTER;
        orig_owner = doc.owner;
        // use CAS to avoid a potential 'race' between clusters
        var result = couchbase.replace(src_bkt, meta, doc);
        if (result.success) {
            // log('success adv. replace: result', result);
        } else {
            // log('lost to other cluster failure adv. replace: id', meta.id, 'result', result);
            // re-read
            doc = src_bkt[meta.id];
            orig_owner = doc.owner;
        }
    }
    // logic to take over a failed cluster's data, requires updating "cluster_state"
    if (orig_owner !== THIS_CLUSTER) {
        if (orig_owner === fallback_cluster && cs.ci_offline[fallback_cluster] === true) {
            doc.owner = THIS_CLUSTER; // Here update the doc's owner
            src_bkt[meta.id] = doc;   // This cluster will now process this doc's mutations.
        } else {
            return; // this isn't the fallback cluster.
        }
    }
    var crc_changed = false;
    if (!doc.crc) {
        var cur_owner = doc.owner;
        delete doc.owner;
        doc.crc = crc64(doc); // crc DOES NOT include doc.owner && doc.crc
        doc.owner = cur_owner;
        crc_changed = true;
    } else {
        var cur_owner = doc.owner;
        var cur_crc = doc.crc;
        delete doc.owner;
        delete doc.crc;
        doc.crc = crc64(doc); // crc DOES NOT include doc.owner && doc.crc
        doc.owner = cur_owner;
        if (cur_crc != doc.crc) {
            crc_changed = true;
        } else {
            return;
        }
    }
    if (crc_changed) {
        // update the data with the new crc, to suppress duplicate XDCR processing and a re-deploy from Everything
        // we could use CAS here, but at this point only one cluster will update the doc, so we cannot have races.
        src_bkt[meta.id] = doc;
    }
    // This is the action on a fresh unprocessed mutation, here it is just a log message.
    log("A. Doc created/updated", meta.id, 'THIS_CLUSTER', THIS_CLUSTER, 'offline', cs.ci_offline[THIS_CLUSTER],
        'orig_owner', orig_owner, 'owner', doc.owner, 'crc_changed', crc_changed, doc.crc);
}
Make sure you have two buckets prior to importing "xcdr_supression_700.json" or "xcdr_supression_660.json".
For the 1st cluster's (couch01) setup, pay attention to the constant alias: you need to ensure THIS_CLUSTER is set to "couch01".
For the 2nd cluster's (couch03) setup, pay attention to the constant alias: you need to ensure THIS_CLUSTER is set to "couch03".
Now if you are running version 6.6.5 you do not have Constant Alias bindings (which act as globals in your Eventing function's JavaScript), hence the requirement to uncomment the appropriate variable, as in this example for cluster couch01:
function OnUpdate(doc, meta) {
    // ********************************
    // MUST MATCH THE CLUSTER AND ALSO THE DOC "cluster_state"
    // *********
    var THIS_CLUSTER = "couch01"; // this could be a constant binding in 7.0.0, in 6.X we uncomment one of these to match the cluster name
    // var THIS_CLUSTER = "couch03"; // this could be a constant binding in 7.0.0, in 6.X we uncomment one of these to match the cluster name
    // ********************************
    // .... code removed (see prior code example) ....
}
Some comments/details:
You may wonder why we need to use a CRC function and store its value in the document undergoing XDCR.
The CRC function, crc64(), built into Eventing is used to detect a non-change, i.e. a mutation that is merely due to an XDCR document update. The use of the CRC together with the properties "owner" and "crc" allows a) determining the owning cluster and b) suppressing the Eventing function when the mutation is due to an XDCR cluster-to-cluster copy from the "active" cluster.
Note that when the CRC in the document is updated as part of a timer function, the OnUpdate(doc, meta) entry point of the Eventing function will be triggered again. If you have timers, then when a timer fires you MUST check whether doc.owner is correct; if it is not, ignore the timer, i.e. do nothing. In addition, when you "take ownership" you will need to create a new timer. Finally, all timers should have an id such that if cluster_state.ci_offline ping-pongs the timer will be overwritten; this implies you must use version 6.6.0+, else you need to do even more work to determine, when a timer fires, whether it is orphaned and then suppress any action. Be very careful on older Couchbase versions, because in 6.5 you cannot overwrite a timer by its id and all timer ids should be unique.
Any mutation made to the source bucket by an Eventing function is suppressed, i.e. not seen by that same Eventing function, whether the document is mutated by the main JavaScript code or by a timer callback. Yet these mutations will be seen, via XDCR active/active replication, in the other cluster.
As to using Eventing timers, pay attention to the comment I put in the prior paragraph about overwriting and suppressing, especially if you insist on using Couchbase Server 6.5, which is getting a bit long in the tooth, so to speak.
Concerning the responsibility to update the cluster_state document, it is envisioned that this would be a periodic script outside of Couchbase, run from a Linux cron, that does "aliveness" tests with a manual override. Be careful here, as you can easily go "split brain" due to a network-partitioning issue.
A comment about cluster_state: this document is itself subject to XDCR; it is a persistent document that the active/active replication makes appear to be a single inter-cluster document. If a cluster is "down", changing it on the live cluster will result in it being replicated once the "down" cluster is recovered.
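To make that external piece a little more concrete, here is a minimal sketch (not part of the original answer) of what such a cron-driven aliveness check could look like, written in Scala against the Couchbase Java SDK 3.x. The host names, credentials, bucket name and the pass/fail rule are all assumptions, and the "pause/undeploy the Eventing function first" coordination described above is deliberately left out:

import com.couchbase.client.java.Cluster
import java.time.Duration
import scala.util.Try

// Hypothetical aliveness checker: flips ci_offline for ONE monitored cluster inside the
// shared "cluster_state" document. Intended to be run from cron on a host outside both clusters.
object ClusterStateUpdater {
  def main(args: Array[String]): Unit = {
    val monitored = "couch01" // the cluster whose health we are reporting on

    // 1) "aliveness" test: can we open the bucket on the monitored cluster within 10 seconds?
    val alive = Try {
      val probe = Cluster.connect("couch01.example.com", "Administrator", "password")
      probe.bucket("common").waitUntilReady(Duration.ofSeconds(10)) // throws on timeout
      probe.disconnect()
    }.isSuccess

    // 2) record the result in cluster_state on a cluster we can still reach;
    //    active/active XDCR will propagate the change to the surviving peer(s).
    val writer = Cluster.connect("couch03.example.com", "Administrator", "password")
    val coll   = writer.bucket("common").defaultCollection()
    val cs     = coll.get("cluster_state").contentAsObject()
    cs.getObject("ci_offline").put(monitored, !alive)
    coll.upsert("cluster_state", cs)
    writer.disconnect()
    // NOTE: per the procedure above, before flipping ci_offline to true you should also
    // pause or undeploy the Eventing function on the dead cluster (manually or via REST).
  }
}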
Deploy/Undeploy: a deploy will either process all current documents via the DCP mutation stream all over again (feed boundary == Everything) or only process mutations occurring after the time of deployment (feed boundary == From now). So you need careful coding in the first case to prevent acting on the same document twice, and you will miss mutations in the second case.
It is best to design your Eventing Functions to be idempotent, so there is no additional effect if they are called more than once with the same input parameters. This can be achieved by storing state in the documents that are processed, so you never reprocess them on a re-deploy.
Pause/Resume: invoking Pause will create a checkpoint and shut down the Eventing processing. Then, on a Resume, the DCP stream will start from the checkpoint (for each vBucket), so you will not miss a single mutation, subject to DCP dedup. Furthermore, all "active" timers that would have fired during the "pause" will fire as soon as possible (typically within the next 7-second timer scan interval).
Best
Jon Strabala
Principal Product Manager - Couchbase

Related

How to wait until replicas have caught up with the master

There is a MongoDB cluster (1 master, 2 replicas).
I am updating a large number of records using BulkWrite, and I need to call the next BulkWrite only after the replicas have caught up with the master, i.e. I need to make sure that the replicas have already caught up with the master for this request. I am using go.mongodb.org/mongo-driver/mongo.
Write propagation can be "controlled" with a writeconcern.WriteConcern.
WriteConcern can be created using different writeconcern.Options. The writeconcern.W() function can be used to create a writeconcern.Option that controls how many instances writes must be propagated to (or more specifically, the write operation will wait for acknowledgements from the given number of instances):
func W(w int) Option
W requests acknowledgement that write operations propagate to the specified number of mongod instances.
So first you need to create a writeconcern.WriteConcern that is properly configured to wait for acknowledgements from 2 nodes in your case (you have 1 master and 2 replicas):
wc := writeconcern.New(writeconcern.W(2))
You now have to choose the scope of this write concern: it may be applied on the entire mongo.Client, it may be applied on a mongo.Database or just applied on a mongo.Collection.
Creating a client, obtaining a database or just a collection all have options which allow specifying the write concern, so this is really up to you where you want it to be applied.
If you just want this write concern to have effect on certain updates, it's enough to apply it on mongo.Collection which you use to perform the updates:
var client *mongo.Client // Initialize / connect client
wc := writeconcern.New(writeconcern.W(2))
c := client.Database("<your db name>").
    Collection("<your collection name>",
        options.Collection().SetWriteConcern(wc))
Now if you use this c collection to perform the writes (updates), the wc write concern will be used / applied, that is, each write operation will wait for acknowledgements of the writes propagated to 2 instances.
If you would apply the wc write concern on the mongo.Client, then that would be the default write concern for everything you do with that client; if you'd apply it on a mongo.Database, then it would be the default for everything you do with that database. Of course the default can be overridden, just as we applied wc on the c collection in the above example.

What are the limits on actorevents in service fabric?

I am currently testing the scaling of my application and I ran into something I did not expect.
The application is running on a 5 node cluster, it has multiple services/actortypes and is using a shared process model.
For some component it uses actor events as a best effort pubsub system (There are fallbacks in place so if a notification is dropped there is no issue).
The problem arises when the number of actors (i.e. subscription topics) grows. The actor service is partitioned into 100 partitions at the moment.
The number of topics at that point is around 160,000, where each topic is subscribed to 1-5 times (on the nodes where it is needed), with an average of 2.5 subscriptions (roughly 400k subscriptions in total).
At that point communications in the cluster start breaking down, new subscriptions are not created, unsubscribes are timing out.
But it is also affecting other services: internal calls to a diagnostics service are timing out (asking each of the 5 replicas). This is probably due to the resolving of partition/replica endpoints, as the outside calls to the web page are fine (these endpoints use the same technology/code stack).
The event viewer is full of warnings and errors like:
EventName: ReplicatorFaulted Category: Health EventInstanceId {c4b35124-4997-4de2-9e58-2359665f2fe7} PartitionId {a8b49c25-8a5f-442e-8284-9ebccc7be746} ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc
I've tried scaling the application without this subscribe model active, and I easily reach a workload twice as large without any issues.
So there are a couple of questions
Are there limits known/advised for actor events?
Would increasing the partition count or/and node count help here?
Is the communication interference logical? Why are other service endpoints having issues as well?
After time spent on the support ticket we found some info, so I will post my findings here in case it helps someone.
The actor events use a resubscription model to make sure they are still connected to the actor. By default this is done every 20 seconds. This meant a lot of resources were being used, and eventually the whole system got overloaded with loads of idle threads waiting to resubscribe.
You can decrease the load by setting resubscriptionInterval to a higher value when subscribing. The drawback is that the client will potentially miss events in the meantime (if a partition is moved).
To counteract the delay in resubscribing it is possible to hook into the lower-level Service Fabric events. The following pseudo code was offered to me in the support call.
Register for endpoint change notifications for the actor service
fabricClient.ServiceManager.ServiceNotificationFilterMatched += (o, e) =>
{
    var notification = ((FabricClient.ServiceManagementClient.ServiceNotificationEventArgs)e).Notification;
    /*
     * Add additional logic for optimizations
     * - check if the endpoint is not empty
     * - If multiple listeners are registered, check if the endpoint change notification is for the desired endpoint
     * Please note, all the endpoints are sent in the notification. User code should have the logic to cache the
     *   endpoint seen during the subscription call and compare it with the newer one
     */
    List<long> keys;
    if (resubscriptions.TryGetValue(notification.PartitionId, out keys))
    {
        foreach (var key in keys)
        {
            // 1. Unsubscribe the previous subscription by calling ActorProxy.UnsubscribeAsync()
            // 2. Resubscribe by calling ActorProxy.SubscribeAsync()
        }
    }
};

await fabricClient.ServiceManager.RegisterServiceNotificationFilterAsync(new ServiceNotificationFilterDescription(new Uri("<service name>"), true, true));
Change the resubscription interval to a value which fits your need.
Cache the partition id to actor id mapping. This cache will be used to resubscribe when the replica’s primary endpoint changes(ref #1)
await actor.SubscribeAsync(handler, TimeSpan.FromHours(2) /*Tune the value according to the need*/);
ResolvedServicePartition rsp;
((ActorProxy)actor).ActorServicePartitionClientV2.TryGetLastResolvedServicePartition(out rsp);
var keys = resubscriptions.GetOrAdd(rsp.Info.Id, key => new List<long>());
keys.Add(communicationId);
The above approach ensures the following:
The subscriptions are resubscribed at regular intervals.
If the primary endpoint changes in between, the ActorProxy resubscribes from the service notification callback.
This ends the pseudo code from the support call.
Answering my original questions:
Are there limits known/advised for actor events?
No hard limits, only resource usage.
Would increasing the partition count and/or node count help here?
Partition count: no. Node count: maybe, but only if it means there are fewer subscribing entities on each node.
Is the communication interference logical? Why are other service endpoints having issues as well?
Yes, resource contention is the reason.

K8s - Node alerts

How can I configure GCP to send me alerts when node events (create/shutdown) happen?
I would like to receive an email alerting me about the cluster scaling.
Thanks.
First, note that you can retrieve such events in Stackdriver Logging by using the following filter:
logName="projects/[PROJECT_NAME]/logs/cloudaudit.googleapis.com%2Factivity" AND
(
protoPayload.methodName="io.k8s.core.v1.nodes.create" OR
protoPayload.methodName="io.k8s.core.v1.nodes.delete"
)
This filter will retrieve only audit activity log entries (cloudaudit.googleapis.com%2Factivity) in your project [PROJECT_NAME], corresponding to a node creation event (io.k8s.core.v1.nodes.create) or deletion (io.k8s.core.v1.nodes.delete).
To be alerted when such a log is generated, there are multiple possibilities.
You could configure a sink to a Pub/Sub topic based on this filter, and then trigger a Cloud Function when a filtered log entry is created. This Cloud Function will define the logic to send you a mail. This is probably the solution I'd choose, since this use case is described in the documentation.
Otherwise, you could define a logs-based metric based on this filter (or one logs-based metric for creation and another for deletion), and configure an alert in Stackdriver Monitoring when this logs-based metric increases. This alert could be configured to send an email. However, I won't suggest implementing this, because it is not a real "alert" (in the sense of "something went wrong"), but rather information. You probably don't want incidents opened in Stackdriver Monitoring every time a node is created or deleted. But you can keep the idea of one or multiple logs-based metrics and process them with a custom application.
For a faster way than using GCP sinks, you may also consider using an internal Kubernetes node watcher.
You can see an example in https://github.com/notify17/k8s-node-watcher-example/blob/5fc3f802de69f65866cc8f37c4b0e721835ea5b9/main.go#L83.
This example uses Notify17 to generate notifications directly to your browser or mobile phone.
The relevant code is:
// Sets up the nodes watcher
watcher, err := api.Nodes().Watch(listOptions)
// ...
ch := watcher.ResultChan()

for event := range ch {
    node, ok := event.Object.(*v1.Node)
    // ...
    switch event.Type {
    case watch.Added:
        // ...
        // Triggers a Notify17 notification for the ADDED event
        notify17(httpClient,
            "Node added", fmt.Sprintf("Node %s has been added", node.Name))
    case watch.Deleted:
        // ...
        // Triggers a Notify17 notification for the DELETED event
        notify17(httpClient,
            "Node deleted", fmt.Sprintf("Node %s has been deleted", node.Name))
    }
    // ...
You can test out this approach by following the instructions provided in the README.
Note: the drawback of this method is that, if the node the watcher pod is running on gets deleted/killed unsafely, there is a chance the event will not be triggered for that node. If the node is deleted gracefully instead, as in the case of a cluster autoscaler, then the pod will probably be recreated on a new node before the old node gets deleted, therefore triggering the notification.

How check that Cluster sharding is started properly?

I want to check whether ClusterSharding has started or not for one region. Here is the code:
def someMethod: Unit = {
  val system = ActorSystem("ClusterSystem", ConfigFactory.load())
  val region: ActorRef = ClusterSharding(system).shardRegion("someActorName")
}
The method akka.contrib.pattern.ClusterSharding#shardRegion throws an IllegalArgumentException if it does not find the shard region. I do not like the approach of catching an IllegalArgumentException just to check that ClusterSharding has not started.
Is there another approach like ClusterSharding(system).isStarted(shardRegionName = "someActorName")?
Or is it assumed that I should start all shard regions at ActorSystem startup?
You should indeed start all regions as soon as possible. According to the docs:
"When using the sharding extension you are first, typically at system startup on each node in the cluster, supposed to register the supported entry types with the ClusterSharding.start method."
Startup of a region is not immediate. In particular, even in local cases, it would take at the very least the time specified in the akka.contrib.cluster.sharding.retry-interval (the name is misleading: this value is both the initial delay of registration and the retry interval) parameter of your configuration before your sharded actors can effectively receive messages (the messages sent in that period are not lost, but not delivered until after a while).
If you want to be 100% sure that your region has started, you should have one of your sharded actors respond to an identification message after you call ClusterSharding.start. Once it replies, you are guaranteed that your region is up and running. You can use an ask pattern if you want to block and await on the ask future, as sketched below.
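Here is a minimal sketch of that readiness check using the classic ask pattern. Ping and Pong are hypothetical application messages, not part of the ClusterSharding API; your idExtractor/shardResolver must already know how to route Ping to some entity, and the entity must reply with Pong:

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical probe messages handled by your sharded entity actor.
case class Ping(entityId: String)
case object Pong

// Blocks until some sharded entity answers, proving the region can deliver messages.
def awaitRegionStarted(region: ActorRef, retries: Int = 12): Boolean = {
  implicit val timeout: Timeout = Timeout(5.seconds)
  (1 to retries).exists { _ =>
    Try(Await.result(region ? Ping("warmup-entity"), timeout.duration) == Pong)
      .getOrElse(false)
  }
}

// Usage: val ok = awaitRegionStarted(ClusterSharding(system).shardRegion("someActorName"))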

Service with background jobs, how to ensure jobs only run periodically ONCE per cluster

I have a play framework based service that is stateless and intended to be deployed across many machines for horizontal scaling.
This service is handling HTTP JSON requests and responses, and is using CouchDB as its data store again for maximum scalability.
We have a small number of background jobs that need to be run every X seconds across the whole cluster. It is vital that the jobs do not execute concurrently on more than one machine.
To execute the jobs we're using Actors and the Akka Scheduler (since we're using Scala):
Akka.system().scheduler.schedule(
  Duration.create(0, TimeUnit.MILLISECONDS),
  Duration.create(10, TimeUnit.SECONDS),
  Akka.system().actorOf(LoggingJob.props),
  "tick")

(etc)

object LoggingJob {
  def props = Props[LoggingJob]
}

class LoggingJob extends UntypedActor {
  override def onReceive(message: Any) {
    Logger.info("Job executed! " + message.toString())
  }
}
Is there:
any built-in trickery in Akka/Actors/Play that I've missed that will do this for me?
OR a recognised algorithm that I can put on top of Couchbase (distributed mutex? not quite?) to do this?
I do not want to make any of the instances 'special' as it needs to be very simple to deploy and manage.
Check out Akka's Cluster Singleton Pattern.
For some use cases it is convenient and sometimes also mandatory to
ensure that you have exactly one actor of a certain type running
somewhere in the cluster.
Some examples:
single point of responsibility for certain cluster-wide consistent decisions, or coordination of actions across the cluster system
single entry point to an external system
single master, many workers
centralized naming service, or routing logic
Using a singleton should not be the first design choice. It has
several drawbacks, such as single-point of bottleneck. Single-point of
failure is also a relevant concern, but for some cases this feature
takes care of that by making sure that another singleton instance will
eventually be started.
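As a rough illustration of how the scheduled job above could be wrapped in a cluster singleton, here is a minimal sketch using the akka.cluster.singleton API from akka-cluster-tools (Akka 2.4+, with clustering configured); the actor and names here are assumptions, not taken from the question:

import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings}
import scala.concurrent.duration._

// The singleton actor owns the schedule, so the periodic tick runs on exactly one node.
class LoggingJobScheduler extends Actor {
  import context.dispatcher
  private val tick =
    context.system.scheduler.schedule(0.millis, 10.seconds, self, "tick")

  override def postStop(): Unit = tick.cancel()

  def receive = {
    case "tick" => println("Job executed once per cluster: tick") // or Play's Logger.info(...)
  }
}

val system: ActorSystem = Akka.system() // the Play-provided actor system, as in the question

system.actorOf(
  ClusterSingletonManager.props(
    singletonProps     = Props[LoggingJobScheduler],
    terminationMessage = PoisonPill,
    settings           = ClusterSingletonManagerSettings(system)),
  name = "loggingJobSingleton")

If other nodes need to send messages to the job, they can reach it through a ClusterSingletonProxy; for a purely self-scheduled job the manager alone is enough, and if the node hosting the singleton dies, another node takes over automatically.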