akka stream: how to reconnect TCP after disconnected - scala

I have a simple balancer that dispatches jobs to several external worker processes via TCP:
val sinkBalance =
  Sink.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val balancer = b.add(Balance[ByteString](workerCount))
    for (i <- 0 until workerCount) {
      val codec = Framing.simpleFramingProtocol(500).reversed
      val connection = Tcp(actorSystem).outgoingConnection("127.0.0.1", 3333 + i)
      val extConnector = connection.join(codec)
      balancer.out(i) ~> extConnector.async ~> Sink.onComplete(_ => println(s"complete: $i"))
    }
    SinkShape(balancer.in)
  })
Normally it works great. However, when one of those external worker processes gets killed, its sub-flow completes (the println statement fires), but the balancer somehow doesn't notice: on the next request it still dispatches a job to the dead worker, leading to a timeout for that request. From the second request onward things go back to normal, and the balancer never uses the dead worker again.
So I would like to ask 2 questions:
How can I reconnect the broken sub-flow, either automatically or manually on some event, after restarting the dead worker?
How can I modify the code above so that the balancer no longer uses any dead worker?
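One way to approach question 1 (a sketch only, not tested against this setup): wrap each worker's connection flow in `RestartFlow.withBackoff`, which rebuilds the inner flow, and thus re-establishes the TCP connection, with exponential backoff whenever it fails or completes. The signature shown is the pre-2.6 one; from Akka 2.6 on it takes a `RestartSettings` argument instead.

```scala
// Sketch: each worker connection is wrapped so that a dropped TCP connection
// is re-established with backoff instead of completing the sub-flow for good.
// Pre-Akka-2.6 signature; newer versions use RestartSettings.
import scala.concurrent.duration._
import akka.stream.scaladsl.{Framing, RestartFlow, Tcp}

def restartingWorker(i: Int) =
  RestartFlow.withBackoff(minBackoff = 1.second, maxBackoff = 30.seconds, randomFactor = 0.2) { () =>
    val codec = Framing.simpleFramingProtocol(500).reversed
    Tcp(actorSystem).outgoingConnection("127.0.0.1", 3333 + i).join(codec)
  }
```

This would replace `extConnector` in the graph (`balancer.out(i) ~> restartingWorker(i).async ~> ...`). One caveat: while a worker is down, elements already routed to its restarting flow buffer rather than being redistributed, so this alone does not answer question 2.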

Related

Akka cluster sharding configuration

Ok, I have 2 instances of my backend, hosted on 2 different CentOS servers. What I want to do using Akka Cluster Sharding is divide the work done by these instances:
I have data for 4 countries, which is retrieved from the DB every 10 seconds by both backend instances, each of which updates a Redis instance. So I repeatedly get duplicated requests, because both backends fetch data for the same country.
Using Akka Cluster Sharding, I'm trying to divide the work dynamically: instance1 gets data for ES and EN, instance2 gets data for DE and IT. If instance1 goes down, instance2 takes over its jobs and fetches data for ES/EN as well.
I thought this would be simple... but it isn't.
All jobs are done by Akka actors, so using Cluster Sharding I assumed all declared actors (from both instances) would be centralized somewhere, so that I could control which one does which job.
On localhost everything works fine, because I have one instance of my app on port 9001 and 2 cluster nodes on ports 2551 and 2552. But for production I can't work out how to configure the hostnames.
application.conf
"clusterRegistration" {
  akka {
    actor {
      allow-java-serialization = on
      provider = cluster
    }
    remote.artery {
      enabled = on
      transport = aeron-udp
    }
    cluster {
      jmx.multi-mbeans-in-same-jvm = on
      seed-nodes = [
        "akka://ClusterService@instance1:8083",
        "akka://ClusterService@instance1:2551"
      ]
    }
  }
}
class
object ClusterSharding {
  def createNode(hostname: String, port: Int, role: String, props: Props, actorName: String) = {
    val config = ConfigFactory.parseString(
      s"""
         |akka.cluster.roles = ["$role"]
         |akka.remote.artery.canonical.hostname = $hostname
         |akka.remote.artery.canonical.port = $port
         |""".stripMargin
    ).withFallback(ConfigFactory.load.getConfig("clusterRegistration"))
    val system = ActorSystem("ClusterService", config)
    system.actorOf(props, actorName)
  }

  val master = createNode("instance1", 8083, "master", Props[Master], "master")
  createNode("instance1", 2551, "worker", Props[Worker], "worker")
  createNode("instance2", 8083, "worker", Props[Worker], "worker")

  Future {
    while (true) {
      master ! Proceed // this will fire an Actor Resolver case
      Thread.sleep(5000)
    }
  }
}
master actor
class Master extends Actor {
  var workers: Map[Address, ActorRef] = Map()
  val cluster = Cluster(context.system)

  override def preStart(): Unit = {
    cluster.subscribe(
      self,
      initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent],
      classOf[UnreachableMember]
    )
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive = handleClusterEvents     // cluster events
    .orElse(handleWorkerRegistration)   // worker registered to cluster
    .orElse(handleJob)                  // give jobs to workers

  def handleJob: Receive = {
    case Proceed =>
      // Here I must be able to use all workers from both instances
      // (centos1 and centos2) and give work to each dynamically
      if (workers.size == 2) {
        worker1 ! List("EN", "ES")
        worker2 ! List("DE", "IT")
      } else if (workers.size == 1) {
        worker ! List("EN", "ES", "DE", "IT")
      } else {
        execQueries() // if no worker is available, each backend instance will exec queries on its own
      }
  }
}
Both instances are hosted on port 8083 (centos1: instance1:8083, centos2: instance2:8083). If I use settings for just one of the instances in application.conf and in createNode (instance1, for example), I can see in the logs that the workers are created, but there is no communication with the second instance.
Where am I going wrong? Thanks.
Your approach to configuring the hostnames is viable. There are better ways to do it (depending on how you're deploying the service: manual deploy vs. ansible/chef/puppet vs. docker vs. kubernetes/nomad/mesos will be different), but setting the hostname isn't likely your actual problem.
Your current approach will give you a master and 2 workers on every node and you're not actually using Cluster Sharding (you're using Cluster, but Cluster Sharding is something you opt into on top of Cluster). From the code you've posted, I strongly suspect that using Cluster Sharding will entail a dramatic redesign (though without posting the Worker and more complete Master code, it's hard to say).
The broad approach I'd take with this would be to have the process of updating Redis for a given country be owned by a sharded entity (keyed by that country), with a cluster singleton actor triggering the update process for each country every 10 seconds. Because we're using sharding and singleton, I'd probably run at least 3 instances of the service, or alternatively make use of a strongly consistent external lease system: note that Cluster Sharding and Cluster Singleton basically force you to resolve split-brains, and in a 2-node cluster the other split-brain resolution strategies all boil down, at least half the time, to losing one node being the same as losing both. Because sharding implies that the actor for a process could be stopped arbitrarily (and possibly resumed on a different node), you'll also want to think about how the process can be resumed in a way that makes sense for the application.
Starting multiple ActorSystems in the same JVM process is generally only a good idea in fairly specific circumstances.
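For illustration, a minimal sketch of that design using the Akka 2.6 typed APIs. `CountryUpdater`, its `Update` command, and the country list are hypothetical names, not taken from the question's code; the actual DB/Redis work is left as a comment.

```scala
import scala.concurrent.duration._
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}
import akka.cluster.typed.{ClusterSingleton, SingletonActor}

// One sharded entity per country owns the "query DB, update Redis" work,
// so each country's update runs on exactly one node at a time.
object CountryUpdater {
  sealed trait Command
  case object Update extends Command
  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("CountryUpdater")

  def apply(country: String): Behavior[Command] = Behaviors.receiveMessage {
    case Update =>
      // fetch data for `country` from the DB and write it to Redis here
      Behaviors.same
  }
}

// A cluster singleton ticks every 10 seconds and pokes each country's entity;
// sharding rebalances the entities if a node goes down.
object UpdateTicker {
  case object Tick
  def apply(): Behavior[Tick.type] = Behaviors.setup { ctx =>
    val sharding = ClusterSharding(ctx.system)
    Behaviors.withTimers { timers =>
      timers.startTimerAtFixedRate(Tick, 10.seconds)
      Behaviors.receiveMessage { _ =>
        for (country <- List("ES", "EN", "DE", "IT"))
          sharding.entityRefFor(CountryUpdater.TypeKey, country) ! CountryUpdater.Update
        Behaviors.same
      }
    }
  }
}

// Wiring, once per ActorSystem:
// ClusterSharding(system).init(Entity(CountryUpdater.TypeKey)(ec => CountryUpdater(ec.entityId)))
// ClusterSingleton(system).init(SingletonActor(UpdateTicker(), "updateTicker"))
```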

StreamTcpException while under server stress

I have a service using Akka HTTP that I have been load testing. Under stress, I've found that my service will occasionally run into a StreamTcpException when calling other service endpoints.
I create one flow for each endpoint which is shared by all of my actors. I am using something like this:
//this is done only once
val connectionFlow = Http(sys).outgoingConnection("host_name")
...
//each actor does this
val response = Source.single(HttpRequest(...)).via(connectionFlow).runWith(Sink.head)
I use Apache JMeter to load test my service, and with 40 threads, it typically takes 2000-4000 requests before I see my first error message. With 10 threads, it took me 9000 requests before I saw it.
The message looks like:
akka.stream.StreamTcpException: Tcp command [Connect(<host_here>/<ip_here>,None,List(),Some(10 seconds),true)] failed
I actually have 4 separate flows for 4 different endpoints my service relies on. I usually see StreamTcpException from all four if my service fails.
Anyone have any ideas why this is happening? Thanks in advance.
I faced the same issue previously; the following worked for me.
The problem was here:
Source.single(request).via(httpClient)
The client was unable to connect to my host directly, so I routed the connection through a proxy server:
val httpsProxyTransport: ClientTransport =
  ClientTransport.httpsProxy(InetSocketAddress.createUnresolved("proxyservicesHost", 90))
val connSetts: ClientConnectionSettings =
  ClientConnectionSettings(system)
    .withTransport(httpsProxyTransport)
    .withIdleTimeout(Duration(180, "second"))
val httpClient: Flow[HttpRequest, HttpResponse, Future[Http.OutgoingConnection]] =
  Http().outgoingConnection(host, port, None, connSetts)
Then I pass httpClient in .via(httpClient), and it works.

Akka Singleton Cluster: resolveOnce of a worker by master fails after restart

I am using Akka Cluster 2.4.3 and trying to setup a simple cluster in my machine to understand its working better. I have a singleton cluster with remoting enabled with primary and standby master and one worker node. Each of these 3 run in separate JVMs
Things work fine when all the nodes are started the first time. If I kill and restart the worker, I see following issues happening
Restart Worker
When the worker comes back after restart, the master on receiving MemberUp event tries to resolve for the actorRef from the member address the following way
context.actorSelection(member.address.toString).resolveOne(15 seconds)
This fails with an exception saying ActorNotFound. This works with no problem when all the nodes are coming up for the first time in the cluster.
Restart worker again
This time, the worker comes up with the following message
[WARN] [04/15/2016 18:24:24.991] [clustersystem-akka.remote.default-remote-dispatcher-5] [akka.remote.Remoting] Tried to associate with unreachable remote address [akka.tcp://clustersystem#host1:2551]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
Restart worker again
This time the resolveOne on a MemberUp event works.
I am having a bit of difficulty understanding what is happening here; I have looked into the docs but did not find anything there that helps.
application.conf
akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
  }
  log-dead-letters = off
  jvm-exit-on-fatal-error = on
  loglevel = "DEBUG"
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "host1"
      port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://clustersystem@host1:2551",
      "akka.tcp://clustersystem@host1:2552"
    ]
    auto-down-unreachable-after = 10s
  }
  extensions = ["akka.cluster.metrics.ClusterMetricsExtension"]
}
I start master nodes at ports 2551 and 2552 (provide the ports as command line args) and I start the worker on port 3551
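Separately from the quarantine behaviour, note that context.actorSelection(member.address.toString) selects the root of the remote system rather than a concrete user actor, which can make resolution fragile. A sketch of resolving a named user actor instead, assuming the worker node started its actor as system.actorOf(Props[Worker], "worker") (the name "worker" is an assumption, not from the posted code):

```scala
// Sketch: resolve a specific user actor on the newly joined member instead
// of the bare member address. Would live inside the master actor.
import akka.actor.{ActorRef, RootActorPath}
import akka.cluster.Member
import scala.concurrent.Future
import scala.concurrent.duration._

def resolveWorker(member: Member): Future[ActorRef] =
  context
    .actorSelection(RootActorPath(member.address) / "user" / "worker")
    .resolveOne(15.seconds)
```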

Does Akka clustering (2.4) use any ports by default other than 2551?

Does Akka use ports (by default) other than port 2551 for clustering?
I have a 3-node Akka cluster--each node running in a Docker using 2.4's bind-hostname/port. I have a seed running outside a Docker in some test code. I can successfully send messages to the nodes point-to-point, directly, so basic Akka messaging works fine for the Docker-ized nodes.
My Seed code looks like this:
class Seed() extends Actor {
  def receive = {
    case "report" =>
      mediator ! DistributedPubSubMediator.SendToAll("/user/sender", ReportCommand(), false)
    case r: ReportCommand => println("Report, please!")
  }
}

val seed = system.actorOf(Props(new Seed()), "sender")
val mediator = DistributedPubSub(system).mediator
mediator ! DistributedPubSubMediator.Put(seed)
My worker nodes look like this:
class SenderActor(senderLike: SenderLike) extends Actor {
  val mediator = DistributedPubSub(context.system).mediator
  mediator ! Put(self)

  def receive = {
    case report: ReportCommand => println("REPORT CMD!")
  }
}
When I run this and send a "report" message to the Seed, I see the Seed's "Report, please!" message, so it received its own broadcast, but the 3 workers in the Dockers don't register having received anything (no output on receive). Not sure what's wrong so I'm wondering if there is another port besides 2551 I need to EXPOSE in my Dockers for clustering?
You'll need to configure Akka with both the canonical hostname/port and the bind hostname/port, since inside Docker the local port a process binds to is different from the port the outside world can reach it at.
In order to do this see this documentation page: Peer to Peer vs Client Server
And this FAQ section Why are replies not received from a remote actor?
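As a sketch, the bind settings for classic remoting (Akka 2.4's netty.tcp transport) look like this; the addresses are placeholders for your own setup:

```
akka.remote.netty.tcp {
  hostname = "203.0.113.10"   # address other nodes use to reach this container
  port = 2551
  bind-hostname = "0.0.0.0"   # address actually bound inside the container
  bind-port = 2551
}
```

With Docker you would also need to EXPOSE/publish the remoting port (2551 here), not just the ports you use for point-to-point messaging.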

Akka IO(Tcp) get reason of CommandFailed

I have the following example of Actor using IO(Tcp)
https://gist.github.com/flirtomania/a36c50bd5989efb69a5f
For the sake of the experiment I ran it twice, so the second instance was also trying to bind to port 803. Obviously I got an error.
Question: how can I get the reason why the command failed? In application.conf I have enabled slf4j and the debug log level, and then I saw the error in my logs:
DEBUG akka.io.TcpListener - Bind failed for TCP channel on endpoint [localhost/127.0.0.1:803]: java.net.BindException: Address already in use: bind
But why is that only at debug level? I do not want to enable debug logging for the whole ActorSystem; I want to get the reason for the CommandFailed event (e.g. a java.lang.Exception instance on which I could call e.printStackTrace()).
Something like:
case c @ CommandFailed => val e: Exception = c.getReason()
Maybe that's not the Akka way? How do I get diagnostic info then?
Here's what you can do: find the PID of the process that is still holding the port, then kill it.
On a Mac:
lsof -i :portNumber
then
kill -9 pidNumber
I understood that you have 2 questions.
If you run the same code twice, both actors try to bind to the same port (803 in your case), which is not possible unless the first one unbinds and closes the connection so that the other can bind.
You can import akka.event.Logging and put val log = Logging(context.system, this) at the beginning of your actors, which will log all their activity, and it also shows the actor name, the corresponding actor system, and host+port (if you are using akka-cluster).
Hope that helps.
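On the original question of getting the failure reason: in the Akka version the asker was using, CommandFailed only carries the failed command, which is why the cause was visible only in the TCP layer's debug logs. Later Akka releases (around 2.5.14, if I recall correctly) added a cause to CommandFailed; a hedged sketch:

```scala
// Sketch: handle CommandFailed in the actor that sent Bind.
// `cause` is only available on newer Akka versions; on older ones the
// CommandFailed message carries nothing beyond the failed command itself.
import akka.io.Tcp

def receive: Receive = {
  case f @ Tcp.CommandFailed(b: Tcp.Bind) =>
    println(s"Bind to ${b.localAddress} failed")
    f.cause.foreach(_.printStackTrace()) // newer Akka only
    context.stop(self)
}
```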