Akka cluster sharding configuration - scala

Ok, I have 2 instances of my backend, hosted on 2 different CentOS servers. What I want to do with Akka Cluster Sharding is to divide the work between these instances:
I have data for 4 countries, which both backend instances retrieve from the DB every 10 seconds and then write to a Redis instance. So I end up with duplicated requests, because both backends fetch data for the same country.
Using Akka Cluster Sharding, I try to divide the work dynamically: instance1 gets the data for ES and EN, instance2 gets the data for DE and IT. If instance1 goes down, instance2 takes over its jobs and fetches the data for ES/EN as well.
I thought this would be simple... but it isn't.
All jobs are done by Akka actors, so with Cluster Sharding I expected all the declared actors (from both instances) to be centralized somewhere, so I could control which one does which job.
On localhost everything works fine, because I have one instance of my app on port 9001 and 2 cluster nodes on ports 2551 and 2552. But for production I can't figure out how to configure the hostnames.
application.conf
"clusterRegistration" {
akka {
actor {
allow-java-serialization = on
provider = cluster
}
remote.artery {
enabled = on
transport = aeron-udp
}
cluster {
jmx.multi-mbeans-in-same-jvm = on
seed-nodes = [
"akka://ClusterService#instance1:8083",
"akka://ClusterService#instance1:2551"
]
}
}
}
ClusterSharding object
object ClusterSharding {
  def createNode(hostname: String, port: Int, role: String, props: Props, actorName: String) = {
    val config = ConfigFactory.parseString(
      s"""
         |akka.cluster.roles = ["$role"]
         |akka.remote.artery.canonical.hostname = $hostname
         |akka.remote.artery.canonical.port = $port
         |""".stripMargin
    ).withFallback(ConfigFactory.load.getConfig("clusterRegistration"))
    val system = ActorSystem("ClusterService", config)
    system.actorOf(props, actorName)
  }

  val master = createNode("instance1", 8083, "master", Props[Master], "master")
  createNode("instance1", 2551, "worker", Props[Worker], "worker")
  createNode("instance2", 8083, "worker", Props[Worker], "worker")

  Future {
    while (true) {
      master ! Proceed // this will fire an Actor Resolver case
      Thread.sleep(5000)
    }
  }
}
master actor
class Master extends Actor {
  var workers: Map[Address, ActorRef] = Map()
  val cluster = Cluster(context.system)

  override def preStart(): Unit = {
    cluster.subscribe(
      self,
      initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent],
      classOf[UnreachableMember]
    )
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive = handleClusterEvents       // cluster events
    .orElse(handleWorkerRegistration)     // worker registered to cluster
    .orElse(handleJob)                    // give jobs to workers

  def handleJob: Receive = {
    case Proceed =>
      // Here I must be able to use all workers from both instances
      // (centos1 and centos2) and give work to each dynamically
      if (workers.size == 2) {
        worker1 ! List("EN", "ES")
        worker2 ! List("DE", "IT")
      } else if (workers.size == 1) {
        worker ! List("EN", "ES", "DE", "IT")
      } else {
        execQueries() // if no worker is available, each backend instance runs the queries on its own
      }
  }
}
Both instances are hosted on port 8083 (centos1: instance1:8083, centos2: instance2:8083). If I use the settings of just one of the instances in application.conf and in createNode (instance1, for example), I can see in the logs that the workers are created, but there is no communication with the second instance.
Where am I wrong? Thanks.

Your approach to configuring the hostnames is viable. There are better ways to do it (depending on how you're deploying the service: manual deploy vs. ansible/chef/puppet vs. docker vs. kubernetes/nomad/mesos will be different), but setting the hostname isn't likely your actual problem.
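For example, if you're deploying manually to the two CentOS hosts, a minimal sketch would be to read the canonical address from per-host environment variables instead of hardcoding instance1/instance2 in code (CLUSTER_HOSTNAME and CLUSTER_PORT are made-up names here, not standard settings):
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object NodeBootstrap extends App {
  // Illustrative only: each host exports CLUSTER_HOSTNAME (its externally
  // reachable name/IP) and CLUSTER_PORT before starting the service.
  val hostname = sys.env.getOrElse("CLUSTER_HOSTNAME", "127.0.0.1")
  val port = sys.env.getOrElse("CLUSTER_PORT", "2551").toInt

  val config = ConfigFactory.parseString(
    s"""
       |akka.remote.artery.canonical.hostname = "$hostname"
       |akka.remote.artery.canonical.port = $port
       |""".stripMargin
  ).withFallback(ConfigFactory.load().getConfig("clusterRegistration"))

  val system = ActorSystem("ClusterService", config)
}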
Your current approach will give you a master and 2 workers on every node and you're not actually using Cluster Sharding (you're using Cluster, but Cluster Sharding is something you opt into on top of Cluster). From the code you've posted, I strongly suspect that using Cluster Sharding will entail a dramatic redesign (though without posting the Worker and more complete Master code, it's hard to say).
The broad approach I'd take with this would be to have the process of updating Redis for a given country owned by a sharded entity, keyed by that country. A cluster singleton actor would trigger the update process for each country every 10 seconds. Note that Cluster Sharding and Cluster Singleton basically force you to resolve split-brains, so I'd probably run at least 3 instances of the service, or alternatively make use of a strongly consistent external lease: in a 2-node cluster, the other split-brain resolution strategies all boil down, at least half the time, to losing one node being the same as losing both. Because sharding implies that the actor for a process could be stopped arbitrarily (and possibly resumed on a different node), you'll also want to think about how the process can be resumed in a way that makes sense for the application.
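To make that concrete, here's a rough sketch of the shape I mean, using classic Cluster Sharding plus Cluster Singleton. It's not a drop-in replacement for your code: CountryUpdater, UpdateScheduler and UpdateCountry are names I'm inventing, the DB/Redis work is omitted, and it assumes Akka 2.6 with akka-cluster-sharding and akka-cluster-tools on the classpath.
import akka.actor.{Actor, ActorRef, ActorSystem, PoisonPill, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings}
import scala.concurrent.duration._

// Message routed to the entity that owns one country's refresh process.
final case class UpdateCountry(country: String)

// Sharded entity: owns "query the DB and update Redis" for a single country.
class CountryUpdater extends Actor {
  def receive: Receive = {
    case UpdateCountry(country) =>
      println(s"refreshing Redis for $country") // real DB query + Redis write go here
  }
}

// Cluster singleton: ticks every 10 seconds and fans the work out to the entities.
class UpdateScheduler(region: ActorRef) extends Actor {
  import context.dispatcher
  private val countries = List("EN", "ES", "DE", "IT")
  private val tick =
    context.system.scheduler.scheduleWithFixedDelay(0.seconds, 10.seconds, self, "tick")

  override def postStop(): Unit = tick.cancel()

  def receive: Receive = {
    case "tick" => countries.foreach(c => region ! UpdateCountry(c))
  }
}

object CountrySharding {
  private val extractEntityId: ShardRegion.ExtractEntityId = {
    case msg @ UpdateCountry(country) => (country, msg)
  }
  private val extractShardId: ShardRegion.ExtractShardId = {
    case UpdateCountry(country) => (math.abs(country.hashCode) % 10).toString
  }

  // Call this once per node, after its ActorSystem has joined the cluster.
  def start(system: ActorSystem): Unit = {
    val region: ActorRef = ClusterSharding(system).start(
      typeName = "CountryUpdater",
      entityProps = Props[CountryUpdater](),
      settings = ClusterShardingSettings(system),
      extractEntityId = extractEntityId,
      extractShardId = extractShardId)

    system.actorOf(
      ClusterSingletonManager.props(
        singletonProps = Props(new UpdateScheduler(region)),
        terminationMessage = PoisonPill,
        settings = ClusterSingletonManagerSettings(system)),
      name = "updateScheduler")
  }
}
With that shape, sharding decides which node owns each country's entity and moves it if that node leaves, the singleton guarantees only one scheduler is ticking at a time, and each service instance just starts one ActorSystem and joins the cluster.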
Starting multiple ActorSystems in the same JVM process is generally only a good idea in fairly specific circumstances.

Related

How to connect searchkick (in a Rails app and/or Sidekiq job) to multiple Elasticsearch clusters without stomping on the global searchkick config?

Upon startup my app sets my (global?) searchkick client to point at my default Elasticsearch cluster.
Searchkick.client = Elasticsearch::Client.new(
  hosts: default_cluster, # this is the list of hosts in my default cluster
  retry_on_failure: true,
)
However, I am upgrading my cluster (again), and while I'd like to be able to have my app read/search from that default cluster,
/search?q="some term"
# =>
Model.search("some term")
continue to work against the default_cluster
Where it starts to get a bit tricky is that:
I'd also like (via some specific Sidekiq background jobs) to fill an alternate (alt) cluster's index, something like:
Model.connect_to(alternate_cluster) {|client|
  Searchkick.client = client
  Model.reindex
}
Without causing all other background jobs to interact with the alternate cluster.
And, of course:
I'd like some way to verify that the alternate_cluster is working well (i.e. for search) before making it my default_cluster. And presumably via some admin route:
/admin/search?q="some search term"&cluster=alternate
# =>
Model.connect_to(alternate_cluster) {|client|
  Searchkick.client = client
  Model.search("some term")
}
And finally:
I'd like to avoid having to reconnect before every search/reindex action, i.e. I'd prefer not to have the overhead of switching clients on each call (also because that probably implies that long-running tasks that keep reconnecting to searchkick will be swapping back and forth from one cluster to the other):
Model.search("some term")
# =>
Model.connect_to(alternate_cluster) {|client|
  Searchkick.client = client
  Model.search("some term")
}
^ I don't want that
FWIW, the best I've been able to come up with so far is something like:
def self.connect_to(current_cluster, &block)
  previous_es_client = Searchkick.client
  current_es_client = Elasticsearch::Client.new(
    hosts: current_cluster,
    retry_on_failure: true,
  )
  block.call(current_es_client)
rescue Exception => e
  logger.warn(e)
ensure
  Searchkick.client = previous_es_client
end
But, I suspect that will cause every other interaction within my system (via the same web-worker or other background jobs running in the same background-worker-instance) to (temporarily) point at the alternate cluster.
Thanks in advance for your assistance...

Vertx http server instance number does not improve throughput

I am using Vert.x 3.8.0 to build an HTTP server. The CPU cannot be fully utilized (only about 25% of the CPU is used), even when I set the number of verticle instances to more than 1. The interesting thing is that the best performance I can get is when I set the instance number to 1.
public class Runner {
    public static void main(String[] args) {
        VertxOptions vertxOptions = new VertxOptions().setPreferNativeTransport(true);
        vertxOptions.setEventLoopPoolSize(6);
        final HttpServerOptions options = new HttpServerOptions()
                .setTcpFastOpen(true)
                .setTcpNoDelay(true)
                .setTcpQuickAck(true);
        final Vertx vertx = Vertx.vertx(vertxOptions);
        DeploymentOptions deploymentOptions = new DeploymentOptions().setInstances(3);
        vertx.deployVerticle(() -> new AbstractVerticle() {
            @Override
            public void start(Future<Void> startFuture) {
                vertx.createHttpServer(options)
                        .requestHandler(req -> {
                            req.response().end("1");
                        })
                        .listen(8080, "0.0.0.0");
            }
        }, deploymentOptions);
        System.out.println("Deployment done with pooling");
    }
}
I used apache benchmark to test throughput of the server.
ab -c 150 -n 100000 http://10.32.31.35:8080/api/values/
The throughput comes out at about 8k requests per second, and the server only utilizes about 25% of the CPU.
If I use HTTP keep-alive, the throughput is about 48k with about 50% CPU.
I used JMX to monitor the server program. It seems the instance-number setting actually worked: there is more than one event loop processing the requests, but it's likely the acceptor event loop is the bottleneck.
Is there any way to improve this?
I think multiple instances of Vert.x would help (like with Docker), but isn't there a more elegant way to utilize the computing resources?
There are some invalid assumptions with this test:
You think you are deploying 3 servers, but they're deployed on the same port, so only one actually listens. And deploying more servers doesn't increase your concurrency anyway.
Your test doesn't utilize the event loop that much. Most of your time is wasted on establishing new connections. That's why you see an "improvement" while using keepalive. It's pure networking, not Vert.x.
Make sure you run ab on a separate machine, or you're competing on the same resources
Don't expect to see some kind of 100% CPU utilization anyway, as you're not doing anything CPU intensive, actually

akka stream: how to reconnect TCP after disconnected

I have a simple balancer that dispatches jobs to several external worker processes via TCP:
val sinkBalance =
  Sink.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._ // needed for the ~> operator
    val balancer = b.add(Balance[ByteString](workerCount))
    for (i <- 0 until workerCount) {
      val codec = Framing.simpleFramingProtocol(500).reversed
      val connection = Tcp(actorSystem).outgoingConnection("127.0.0.1", 3333 + i)
      val extConnector = connection.join(codec)
      balancer.out(i) ~> extConnector.async ~> Sink.onComplete(_ => println(s"complete: $i"))
    }
    SinkShape(balancer.in)
  })
Normally it works great. However, when one of those external worker processes gets killed, the sub-flow for it completes (the println statement is triggered), but the balancer somehow doesn't know that: for the next request it still dispatches jobs to the dead worker, which leads to a timeout for that request. From the second request onwards things are back to normal, and the balancer never uses the dead worker again.
So I would like to ask 2 questions:
How can I reconnect the broken sub-flow? Maybe automatically, or manually via some event, after the dead worker has been restarted.
How can I modify the code above so that the balancer will no longer use any dead worker?
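One direction that might address question 1 (my own sketch, not from the original post; it assumes Akka 2.5.4+ where akka.stream.scaladsl.RestartFlow is available, and reuses workerCount and actorSystem from the snippet above) is to wrap each per-worker connection in RestartFlow.withBackoff, so a killed worker's sub-flow is re-established with exponential backoff instead of staying completed:
import akka.stream.SinkShape
import akka.stream.scaladsl.{Balance, Framing, GraphDSL, RestartFlow, Sink, Tcp}
import akka.util.ByteString
import scala.concurrent.duration._

val sinkBalanceWithRestart =
  Sink.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val balancer = b.add(Balance[ByteString](workerCount))
    for (i <- 0 until workerCount) {
      val codec = Framing.simpleFramingProtocol(500).reversed
      // Recreate the outgoing connection whenever it fails or completes,
      // backing off between attempts, so a down worker is retried rather than abandoned.
      val restartingConnector = RestartFlow.withBackoff(
        minBackoff = 1.second,
        maxBackoff = 30.seconds,
        randomFactor = 0.2) { () =>
        Tcp(actorSystem).outgoingConnection("127.0.0.1", 3333 + i).join(codec)
      }
      balancer.out(i) ~> restartingConnector.async ~> Sink.onComplete(_ => println(s"complete: $i"))
    }
    SinkShape(balancer.in)
  })
Note that with the restart wrapper the sub-flow stays alive from the balancer's point of view; any element already pushed into a connection that dies may be lost across the reconnect, so you'll still want to think about retry semantics for in-flight jobs.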

Akka Singleton Cluster: resolveOnce of a worker by master fails after restart

I am using Akka Cluster 2.4.3 and trying to set up a simple cluster on my machine to understand better how it works. I have a cluster singleton with remoting enabled, a primary and a standby master, and one worker node. Each of these 3 runs in a separate JVM.
Things work fine when all the nodes are started the first time. If I kill and restart the worker, I see the following issues happening:
Restart Worker
When the worker comes back after the restart, the master, on receiving the MemberUp event, tries to resolve the ActorRef from the member address in the following way:
context.actorSelection(member.address.toString).resolveOne(15 seconds)
This fails with an ActorNotFound exception. It works with no problem when all the nodes are coming up in the cluster for the first time.
Restart worker again
This time, the worker comes up with the following message
[WARN] [04/15/2016 18:24:24.991] [clustersystem-akka.remote.default-remote-dispatcher-5] [akka.remote.Remoting] Tried to associate with unreachable remote address [akka.tcp://clustersystem@host1:2551]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
Restart worker again
This time the resolveOne on a MemberUp event works.
I am having a bit of difficulty understanding what is happening here; I have looked through the docs but did not find anything there that helps.
application.conf
akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
  }
  log-dead-letters = off
  jvm-exit-on-fatal-error = on
  loglevel = "DEBUG"
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "host1"
      port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://clustersystem@host1:2551",
      "akka.tcp://clustersystem@host1:2552"]
    auto-down-unreachable-after = 10s
  }
  extensions = ["akka.cluster.metrics.ClusterMetricsExtension"]
}
I start the master nodes on ports 2551 and 2552 (the ports are provided as command-line args), and I start the worker on port 3551.

Does Akka clustering (2.4) use any ports by default other than 2551?

Does Akka use ports (by default) other than port 2551 for clustering?
I have a 3-node Akka cluster, each node running in a Docker container using 2.4's bind-hostname/port. I have a seed running outside Docker in some test code. I can successfully send messages to the nodes point-to-point, directly, so basic Akka messaging works fine for the Dockerized nodes.
My Seed code looks like this:
class Seed() extends Actor {
  def receive = {
    case "report" =>
      mediator ! DistributedPubSubMediator.SendToAll("/user/sender", ReportCommand(), false)
    case r: ReportCommand => println("Report, please!")
  }
}

val seed = system.actorOf(Props(new Seed()), "sender")
val mediator = DistributedPubSub(system).mediator
mediator ! DistributedPubSubMediator.Put(seed)
My worker nodes look like this:
class SenderActor(senderLike: SenderLike) extends Actor {
  val mediator = DistributedPubSub(context.system).mediator
  mediator ! Put(self)

  def receive = {
    case report: ReportCommand => println("REPORT CMD!")
  }
}
When I run this and send a "report" message to the Seed, I see the Seed's "Report, please!" message, so it received its own broadcast, but the 3 workers in the Docker containers don't register having received anything (no output on receive). I'm not sure what's wrong, so I'm wondering whether there is another port besides 2551 that I need to EXPOSE in my Docker containers for clustering?
You'll need to configure Akka using port and bind-port, since in docker the "local port" is different from the port "the outside world can reach me at".
In order to do this see this documentation page: Peer to Peer vs Client Server
And this FAQ section Why are replies not received from a remote actor?
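As an illustration of what those pages describe, here's a hedged sketch of the relevant settings for the classic Netty remoting used in Akka 2.4, expressed the same way the question above builds its config (the hostname, ports, and system name are placeholders, not values from this post):
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object DockerNode extends App {
  // hostname/port: the address other cluster nodes dial (visible outside the container).
  // bind-hostname/bind-port: what Akka actually binds to inside the container.
  val config = ConfigFactory.parseString(
    """
      |akka.remote.netty.tcp {
      |  hostname = "203.0.113.10"
      |  port = 2551
      |  bind-hostname = "0.0.0.0"
      |  bind-port = 2551
      |}
      |""".stripMargin
  ).withFallback(ConfigFactory.load())

  val system = ActorSystem("ClusterSystem", config)
}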