keepalived transitions not happening as expected - centos

I am trying to implement keepalived based failover for my service. Please find below my configurations for the master and backup nodes.
Master node:
vrrp_script chk_splunkd {
script "pidof splunkd"
interval 2
fall 2
rise 2
}
vrrp_instance VI_1 {
interface eth0
state MASTER
advert_int 1
virtual_router_id 51
priority 200
nopreempt
smtp_alert
authentication {
auth_type PASS
auth_pass passme
}
virtual_ipaddress {
10.126.246.245
}
track_script {
chk_splunkd
}
notify_master /etc/keepalived/scripts/master.sh
notify_backup /etc/keepalived/scripts/stop_service.sh
notify_fault /etc/keepalived/scripts/stop_service.sh
}
Back up node:
vrrp_script chk_splunkd {
script "pidof splunkd"
interval 2
fall 2
rise 2
}
vrrp_instance VI_1 {
interface eth0
state BACKUP
advert_int 1
virtual_router_id 51
priority 100
nopreempt
smtp_alert
authentication {
auth_type PASS
auth_pass passme
}
virtual_ipaddress {
10.126.246.245
}
track_script {
chk_splunkd
}
notify_master /etc/keepalived/scripts/master.sh
notify_backup /etc/keepalived/scripts/stop_service.sh
notify_fault /etc/keepalived/scripts/stop_service.sh
}
However, I find that even when one node goes into fault state and stops sending VRRP advertisements, the other node doesn't automatically transition to master state. When I tried to monitor the VRRP advertisement packets using tcpdump -vv -i eth0 vrrp I find that even after the advertisement from one node stops, the other node doesn't automatically start sending the advertisements indicating that it has now become the master.
Please help me find out what I'm missing.
Thanks,
Keerthana

The issue was that during startup when one node became the master, the other one went into fault mode due to the pidof splunkd command which will return 1 as my splunk service should be up on only the master node. Once I edited the notify script to write current state to an external file and read the state to take action in my notify scripts, things started working fine.

Related

Akka cluster sharding configuration

Ok, I have 2 instances of my backend, hosted on 2 difference centos servers. What I want to do using Akka Cluster Sharding is to divide the work done by each of these instances:
I have data for 4 countries, which is retrieved from db at every 10 seconds by both backend instances, which update a Redis instance. So, multiple times, I have duplicated requests, because both backends get data for same country;
Using Akka Cluster Sharding, I try to divide the work dinamically, instance1 to get data for ES and EN, instance2 to get data for DE and IT. In case of instance1 is down, instance2 will take the jobs and will get data even for ES/EN.
I tought this is simple...but not.
All jobs are done by Akka Actors, so using Cluster Sharding, I thought all declared actors (from both instances) will be centralized somewhere, to can manipulate which do whatever job.
On localhost, all works fine, because I have an instance for my app with port 9001 and 2 cluster nodes with ports 2551 and 2552. But for production, I can't understand how to configure the hostnames
application.conf
"clusterRegistration" {
akka {
actor {
allow-java-serialization = on
provider = cluster
}
remote.artery {
enabled = on
transport = aeron-udp
}
cluster {
jmx.multi-mbeans-in-same-jvm = on
seed-nodes = [
"akka://ClusterService#instance1:8083",
"akka://ClusterService#instance1:2551"
]
}
}
}
class
object ClusterSharding {
def createNode(hostname: String, port: Int, role: String, props: Props, actorName: String) = {
val config = ConfigFactory.parseString(
s"""
|akka.cluster.roles = ["$role"]
|akka.remote.artery.canonical.hostname = $hostname
|akka.remote.artery.canonical.port = $port
|""".stripMargin
).withFallback(ConfigFactory.load
.getConfig("clusterRegistration"))
val system = ActorSystem("ClusterService", config)
system.actorOf(props, actorName)
}
val master = createNode("instance1", 8083, "master", Props[Master], "master")
createNode("instance1", 2551, "worker", Props[Worker], "worker")
createNode("instance2", 8083, "worker", Props[Worker], "worker")
Future {
while (true) {
master ! Proceed // this will fire an Actor Resolver case
Thread.sleep(5000)
}
}
}
master actor
class Master extends Actor {
var workers: Map[Address, ActorRef] = Map()
val cluster = Cluster(context.system)
override def preStart(): Unit = {
cluster.subscribe(
self,
initialStateMode = InitialStateAsEvents,
classOf[MemberEvent],
classOf[UnreachableMember]
)
}
override def postStop(): Unit = {
cluster.unsubscribe(self)
}
def receive = handleClusterEvents // cluster events
.orElse(handleWorkerRegistration) // worker registered to cluster
.orElse(handleJob) // give jobs to workers
def handleJob: Receive = {
case Proceed => {
// Here I must be able to use all workers from both instances
// (centos1 and centos2) and give work for each dinamically
if (workers.length == 2) {
worker1 ! List("EN", "ES")
worker2 ! List("DE", "IT")
} else if (workers.length == 1) {
worker ! List("EN", "ES", "DE", "IT")
} else {
execQueries() // if no worker is available, each backend instance will exec queries on his own way
}
}
}
}
Both instances are hosted with port 8083 (centos1: instance1:8083, centos2: instance2:8083). If I use settings just for one of the instances in application.conf and in createNode (instance1 for example), I can see in logs that the workers are created, but there is no communication with the second instance.
Where I'm wrong? thx
Your approach to configuring the hostnames is viable. There are better ways to do it (depending on how you're deploying the service: manual deploy vs. ansible/chef/puppet vs. docker vs. kubernetes/nomad/mesos will be different), but setting the hostname isn't likely your actual problem.
Your current approach will give you a master and 2 workers on every node and you're not actually using Cluster Sharding (you're using Cluster, but Cluster Sharding is something you opt into on top of Cluster). From the code you've posted, I strongly suspect that using Cluster Sharding will entail a dramatic redesign (though without posting the Worker and more complete Master code, it's hard to say).
The broad approach I'd take with this would be to have the process of updating Redis for a given country be owned by a sharded entity (keyed by that country). A cluster singleton actor would trigger the update process for each country every 10 seconds. Because we're using sharding and singleton, I'd probably actually have at least 3 instances of the service, or alternatively make use of a strongly consistent external lease system (the other split-brain resolution strategies (note that cluster sharding and cluster singleton basically force you to resolve split-brains) will all boil down, at least half the time, to losing one node is the same as losing both in a 2-node cluster). Because sharding implies that the actor for a process could be stopped arbitrarily (and possibly resumed on a different node), you'll also want to think about how the process can be resumed in a way that makes sense for the application.
Starting multiple ActorSystems in the same JVM process is generally only a good idea in fairly specific circumstances.

Kafka connector's task status are different when queried against different kafka connect nodes in a clustered enviroment

We have a 3 node Kafka connect cluster running version 5.5.4 in distributed mode. We are observing a strange issue regarding connector's task status.
The REST calls to node 1 and 2 are returning different results.
The first node returned this result:
{
"connector":{
"state":"RUNNING",
"worker_id":"x.com:8083"
},
"name":"connector",
"type":"source",
"tasks":[
]
}
Yes the task is empty where as the other node returned this result:
{
"connector":{
"state":"RUNNING",
"worker_id":"x.com:8083"
},
"name":"connector...",
"type":"source",
"tasks":[
{
"id":0,
"state":"RUNNING",
"worker_id":"x.com:8083"
}
]
}
As mentioned in this doc https://docs.confluent.io/home/connect/userguide.html#kconnect-internal-topics, I have configured group.id, config.storage.topic, offset.storage.topic and status.storage.topic with identical values in all 3 nodes.
I did go through connect-statuses-0 data directory and the file sizes for log, index and timestamp are all identical in node 1 and node 2. I don't know what is the .snapshot file but I see only one with root user/group in first node where as I see 2 of them in the 2nd node. One owned by root user/group and the other owned by our custom created user. Not sure this has anything to do with this problem.
Please guide me in identifying the root cause for this problem. If I do need to check any configuration, please let me know.

Akka Singleton Cluster: resolveOnce of a worker by master fails after restart

I am using Akka Cluster 2.4.3 and trying to setup a simple cluster in my machine to understand its working better. I have a singleton cluster with remoting enabled with primary and standby master and one worker node. Each of these 3 run in separate JVMs
Things work fine when all the nodes are started the first time. If I kill and restart the worker, I see following issues happening
Restart Worker
When the worker comes back after restart, the master on receiving MemberUp event tries to resolve for the actorRef from the member address the following way
context.actorSelection(member.address.toString).resolveOne(15 seconds)
This fails with an exception saying ActorNotFound. This works with no problem when all the nodes are coming up for the first time in the cluster.
Restart worker again
This time, the worker comes up with the following message
[WARN] [04/15/2016 18:24:24.991] [clustersystem-akka.remote.default-remote-dispatcher-5] [akka.remote.Remoting] Tried to associate with unreachable remote address [akka.tcp://clustersystem#host1:2551]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
Restart worker again
This time the resolveOne on a MemberUp event works.
I am having a bit of difficulty in understanding what is happening here, I have looked into the docs but I did not find anything that will help me in there.
application.conf
akka {
actor {
provider = "akka.cluster.ClusterActorRefProvider"
}
remote {
enabled-transports = ["akka.remote.netty.tcp"]
}
log-dead-letters = off
jvm-exit-on-fatal-error = on
loglevel = "DEBUG"
remote {
log-remote-lifecycle-events = off
netty.tcp {
hostname = "host1"
port = 0
}
}
cluster {
seed-nodes = [
"akka.tcp://clustersystem#host1:2551",
"akka.tcp://clustersystem#host1:2552"]
auto-down-unreachable-after = 10s
}
extensions = ["akka.cluster.metrics.ClusterMetricsExtension"]
}
I start master nodes at ports 2551 and 2552 (provide the ports as command line args) and I start the worker on port 3551

akka remote actor running on local actor

I am learning akka-remote and trying to re-do http://www.typesafe.com/activator/template/akka-sample-remote-scala myself.
When I try to run the project in two separate JVMs, I see
$ clear;java -jar akkaio-remote/target/akka-remote-jar-with-dependencies.jar com.harit.akkaio.remote.RemoteApp ProcessingActor
ProcessingActorSystem Started
and
$ clear;java -jar akkaio-remote/target/akka-remote-jar-with-dependencies.jar com.harit.akkaio.remote.RemoteApp WatchingActor
WatchingActorSystem Started
asking processor to process
processing big things
I asked my Processing System to run on port 2552
include "common"
akka {
# LISTEN on tcp port 2552
remote.netty.tcp.port = 2552
}
and I told my other system (WatchingSystem) to run on port 2554 but start processingActor on port 2552
include "common"
akka {
actor {
deployment {
"/processingActor/*" {
remote = "akka.tcp://ProcessingActorSystem#127.0.0.1:2552"
}
}
}
remote.netty.tcp.port = 2554
}
and common is about using the right provider
akka {
actor {
provider = "akka.remote.RemoteActorRefProvider"
}
remote {
netty.tcp {
hostname = "127.0.0.1"
}
}
}
Questions/Concerns
From logs, I see that the processingActor is running on WatchingActorSystem and not on ProcessingActorSystem, what is wrong going on?
How can I see that the two ActorSystems are connecting to each other. I do not see logging happening. However, in the example, I shared the logging happens. What am I missing?
The entire code is posted on Github and runs as well
1) Your deployment configuration is set up to have all the children of processingActor being remote, as described in the akka configuration docs
You should set it to this instead:
deployment {
"/processingActor" {
remote = "akka.tcp://ProcessingActorSystem#127.0.0.1:2552"
}
2) You need to set your log level to something useful as described in the akka logging documentation

How debug akka association porcess?

Here is a scenario:
I have packaged scala project with spray into jar file.
Launch jar file on RedHat 6.5 on Virtual Box (ip - 192.168.1.38)
Launch jar file on RedHat 6.5 on Virtual Box (ip - 192.168.1.41)
Everything works locally - I can send REST request to each virtual machine and get response.
Problem
Akka systems can not became to cluster. I run 192.168.1.38 with default settings, but 192.168.1.41 have an additional property - akka.cluster.seed-nodes which is set to akka.tcp://mySystem#192.168.1.38:2551. So I get:
[WARN] [12/09/2014 17:10:24.043] [mySystem-akka.remote.default-remote-dispatcher-8] [akka.tcp://mySystem#192.168.1.41:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FmySystem%40192.168.1.38%3A2551-0] Association with remote system [akka.tcp://mySystem#192.168.1.38:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://mySystem#192.168.1.38:2551]].
No other errors or warning. Also how can I test akka association or print debug akka association settings?
Also can linux settings influence to akka association?
Most probably iptables is blocking particular port, if it's your test configuration just disable iptables.
service iptables save
service iptables stop
chkconfig iptables off
service ip6tables save
service ip6tables stop
chkconfig ip6tables off
If it will not help try to check you SELinux configuration using command getenforce and the same for test purposes you can completely disable it. SELinux manual
In case of your application.conf, try using following configuration for each node:
akka {
log-dead-letters = on
loglevel = "debug"
actor
{
provider = "akka.cluster.ClusterActorRefProvider"
}
extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
remote {
log-remote-lifecycle-events = off
netty.tcp {
port = 6001
}
}
cluster {
seed-nodes = [
"akka.tcp://ActorSystem#192.168.1.38:6001",
"akka.tcp://ActorSystem#192.168.1.41:6001"
]
auto-down-unreachable-after = 10s
}
}
All the logs related to the cluster nodes are logged as info but having debug log level in test environment is in general good idea.
When the second, node will join the cluster, you should notice following log:
INFO [ActorSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ActorSystem)] - Cluster Node [akka.tcp://ActorSystem#10.0.1.41:6001] - Marking node(s) as REACHABLE [Member(address = akka.tcp://ActorSystem#10.0.1.41:6001, status = Up)]
Cluster state could be also monitored using jmx akka.Cluster MXBean
{ "self-address": "akka.tcp://ActorSystem#10.0.1.82:6001", "members": [ { "address": "akka.tcp://ActorSystem#10.0.1.82:6001", "status": "Up" } ], "unreachable": [ ] }