Scenario
Right now we only have a single node running the whole system. What we want is to make a distinction between "frontend" nodes, and a single "backend" node.
"Frontend" nodes (N nodes): Maintains a persistent connection with the clients through a WebSocket connection
"Backend" node (1 node): Processes all the requests coming form all the frontend nodes querying to database, and handling the needed domain logic.
This distinction is needed due to some reasons:
Do not reach the limit of 70-100k persistent connections per frontend node
Avoid disconnecting the clients while deploying changes only affecting the backend
Work done
We have connected the actors living on the frontend node with the ones living on the backend. We've done so instantiating the backend node ActorRefs from the frontend using the akka.cluster.singleton.ClusterSingletonProxy, and the ClusterSingletonManager while really instantiating them in the backend.
Question
How we do the deploy taking into account the Akka cluster node downing notification?
As far as I understood by the Akka Cluster documentation about downing, and some comments on the akka mailing list, the recommended approach while dealing with that process would be something like:
Download the akka distribution from http://akka.io/downloads/
Copy and paste the akka-cluster bash script together with the jmxsh-R5.jar on a resources/bin/ folder (for instance)
Include that folder on the distributed package (I've added the following lines on the build.sbt):
mappings in Universal ++=
(baseDirectory.value / "resources" / "bin" * "*" get) map
(bin => bin -> ("bin/" + bin.getName))
While deploying, set the node to be deployed as down manually calling the bash script like:
Execute bin/akka-cluster %node_to_be_deployed:port% down
Deploy the new code version
Execute bin/akka-cluster %deployed_node:port% join
Doubts:
Is this step by step procedure correct?
If the node to be deployed will have the very same IP and port after the deploy, is it needed to make the down and join?
We're planning to set one frontend and the backend nodes as the seed ones. This way, all the cluster could be reconstructed while making a deploy only of the frontend nodes, or only to the backend one. Is it correct?
Thanks!
To avoid downing manually, cleanup when a node is terminated, see:
http://doc.akka.io/docs/akka/current/scala/cluster-usage.html#How_To_Cleanup_when_Member_is_Removed
Regarding your points:
You do not need this procedure when the JVM is restarted and the cleanup code is executed. Only when the cleanup code somehow failed, you need to down manually as described in the procedure.
When the node is marked as removed by other nodes (after the cleanup code is executed), the same ip and port combination can be used to re-join the cluster.
Yes, you can just re-deploy a frontend node.
PS.:
- Coordinated shutdown will be improved in akka 2.5, see:
https://github.com/akka/akka-meta/issues/38
- If you want to manage your cluster using a http API, see: http://developer.lightbend.com/docs/akka-cluster-management/current/
Related
I have some general problems/questions regarding self managed Microservices (in Kubernetes).
The Situation:
I have a provider (Discord API) for my desired state, which tells me the count (or multiples of the count) of sharded connections (websocket -> stateful in some way) I should establish with the provider.
Currently a have a "monolithic" microservice (it can't be deployed in an autoscaling service and has to be stateful), which determines the count of connections i should have and a factor based on the currently active pods, that can establish a connection to this API.
It further (by heartbeating and updating the connection target of all those pods) manages the state of every pod and achieves this target configuration.
It also handles the case of a pod being removed from the service and a change of target configuration, by rolling out the updated target and only after updating the target discontinuing the old connections.
The Cons:
This does not in any way resemble a good microservice architecture
A failure of the manager (even when persisting the current state in a cache or db of some sort) results in the target of the target provider not being achieved and maybe one of the pods having a failure without graceful handling of the manager
The Pros:
Its "easy" to understand and maintain a centrally managed system
There is no case (assuming a running manager system) where a pod can fail and it wont be handled -> connection resumed on another pod
My Plan:
I would like this websocket connection pods to manage themselves in some way.
Theoretically there has to be a way in which a "swarm" (swarm here is just a descriptive word for pods within a service) can determine a swarm wide accepted target.
The tasks to achieve this target (or change of target) should then be allocated across the swarm by the swarm itself.
Every failure of a member of the swarm has to be recognized, and the now unhandled tasks (in my case websocket connections) have to be resumed on different members of the swarm.
Also updates of the target have to be rolled out across the swarm in a distinct manner, retaining the tasks for the old target till all tasks for the new target are handled.
My ideas so far:
As a general syncing point a cache like redis or a db like mongodb could be used.
Here the current target (and the old target, for creating earlier mentioned smooth target changes) could be stored, along with all tasks that have to be handled to achieve this desired target.
This should be relatively easy to set up and also a "voting process" for the current target could be possible - if even necessary (every swarm member checks the current target of the target provider and the target that is determined by most of the swarm members is set as the vote outcome).
But now we face the problem already mentioned in the pros for the managed system, I currently cant think of a way the failure of a swarm member can be recognized and managed by the swarm consistently.
How should a failure be determined without a constant exchange between swarm members, which i think should be avoided because of the:
swarms should operate entirely target driven and interact with each other as litte as possible
kubernetes itself isn't really designed to have easy intra service communication
Every contribution, idea or further question here helps.
My tech stack would be but isn't limited to:
Java with Micronaut for the application
Grpc as the only exchange protocol
Kubernetes as the orchestrator
Since you're on the JVM, you could use Akka Cluster to take care of failure detection between the pods (even in Kubernetes, though there's some care needed with service meshes to exempt the pod-to-pod communications from being routed through the mesh) and use (as one of many possibilities for this) Distributed Data's implementations of CRDTs to distribute state (in this case the target) among the pods.
This wouldn't require you to use Akka HTTP or Akka's gRPC implementations, so you could still use Micronaut for external interactions. It would effectively create a stateful self-organizing service which presents to Kubernetes as a regular stateless service.
If for some reason Akka isn't appealing, looking through the code and docs for its failure detection (phi-accrual) might provide some ideas for implementing a failure detector using (e.g.) periodic updates to a DB.
Disclaimer: I am employed by Lightbend, which provides commercial support for Akka and employs or has employed at some point most of the contributors to and maintainers of Akka.
Deployed Rundeck (rundeck/rundeck:4.2.0) importing and discovering my inventory using Ansible Resource Model Source. Having 300 nodes, out of which statistically ~150 are accessible/online, the rest is offline (IOT devices). All working fine.
My challenge is when creating jobs i can assign only those nodes which are online, while i wanted to assign ALL nodes (including those offline) and keep retrying the job for the failed ones only. Only this way i could track the completeness of my deployment. Ideally i would love rundeck to be intelligent enough to automatically deploy the job as soon as my node goes back online.
Any ideas/hints how to achieve that ?
Thanks,
The easiest way is to use the health checks feature (only available on PagerDuty Process Automation On-Prem, formerly "Rundeck Enterprise"), in that way you can use a node filter only for "healthy" (up) nodes.
Using this approach (e.g: configuring a command health check against all nodes) you can dispatch your jobs only for "up" nodes (from a global set of nodes), this is possible using the .* as node filter and !healthcheck:status: HEALTHY as exclude node filter. If any "offline" node "turns on", the filter/exclude filter should work automatically.
On Ansible/Rundeck integration it works using the following environment variable: ANSIBLE_HOST_KEY_CHECKING=False or using host_key_checking=false on the ansible.cfg file (at [defaults] section).
In that way, you can see all ansible hosts in your Rundeck nodes, and your commands/jobs should be dispatched only for online nodes, if any "offline" node changes their status, the filter should work.
I have an application deployed in kubernetes, it consists of cassandra, a go client, and a java client (and other things, but they are not relevant for this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service dns as a contact point for cluster creation.
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes going down, I am simulating it by deleting all pods using kubectl delete in succession on all of the cassandra nodes.
When I do this the clients throw NoHostAvailableException
in java its
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
in go its
"gocql: no hosts available in the pool"
I can query cassandra using cqlsh, the node seems fine using nodetool status, all of the new ips are there
the image I am using doesnt have netstat so I have not yet confirmed its listening on the expected port.
Via executing bash on the two client pods I can see the dns makes sense using nslookup, but...
netstat does not show any established connections to cassandra (they are present before I take the nodes down)
If I restart my clients everything works fine.
I have googled a lot (I mean a lot), most of what I have found is related to never having a working connection, the most relevant things seem very old (like 2014, 2016).
So a node going down is very basic and I would expect everything to work, the cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc. etc.
If I take my all of my cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works)
So, is there a point where this behaviour is expected? ie I have taken everything down, nothing was up and running before the last from the first cluster was taken down.. is this behaviour expected?
To me it seems like it should be an easy issue to resolve, not sure whats missing / incorrect, I am surprised that both clients show the same symptoms, makes me think something is not happening with our statefulset and service
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!
Recently I've discovered such a thing as a Apache Mesos.
It all looks amazingly in all that demos and examples. I could easily imagine how one would run for stateless jobs - that fits to the whole idea naturally.
Bot how to deal with long running jobs that are stateful?
Say, I have a cluster that consists of N machines (and that is scheduled via Marathon). And I want to run a postgresql server there.
That's it - at first I don't even want it to be highly available, but just simply a single job (actually Dockerized) that hosts a postgresql server.
1- How would one organize it? Constraint a server to a particular cluster node? Use some distributed FS?
2- DRBD, MooseFS, GlusterFS, NFS, CephFS, which one of those play well with Mesos and services like postgres? (I'm thinking here on the possibility that Mesos/marathon could relocate the service if goes down)
3- Please tell if my approach is wrong in terms of philosophy (DFS for data servers and some kind of switchover for servers like postgres on the top of Mesos)
Question largely copied from Persistent storage for Apache Mesos, asked by zerkms on Programmers Stack Exchange.
Excellent question. Here are a few upcoming features in Mesos to improve support for stateful services, and corresponding current workarounds.
Persistent volumes (0.23): When launching a task, you can create a volume that exists outside of the task's sandbox and will persist on the node even after the task dies/completes. When the task exits, its resources -- including the persistent volume -- can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task's output as its input.
Current workaround: Persist your state in some known location outside the sandbox, and have your tasks try to recover it manually. Maybe persist it in a distributed filesystem/database, so that it can be accessed from any node.
Disk Isolation (0.22): Enforce disk quota limits on sandboxes as well as persistent volumes. This ensures that your storage-heavy framework won't be able to clog up the disk and prevent other tasks from running.
Current workaround: Monitor disk usage out of band, and run periodic cleanup jobs.
Dynamic Reservations (0.23): Upon launching a task, you can reserve the resources your task uses (including persistent volumes) to guarantee that they are offered back to you upon task exit, instead of going to whichever framework is furthest below its fair share.
Current workaround: Use the slave's --resources flag to statically reserve resources for your framework upon slave startup.
As for your specific use case and questions:
1a) How would one organize it? You could do this with Marathon, perhaps creating a separate Marathon instance for your stateful services, so that you can create static reservations for the 'stateful' role, such that only the stateful Marathon will be guaranteed those resources.
1b) Constraint a server to a particular cluster node? You can do this easily in Marathon, constraining an application to a specific hostname, or any node with a specific attribute value (e.g. NFS_Access=true). See Marathon Constraints. If you only wanted to run your tasks on a specific set of nodes, you would only need to create the static reservations on those nodes. And if you need discoverability of those nodes, you should check out Mesos-DNS and/or Marathon's HAProxy integration.
1c) Use some distributed FS? The data replication provided by many distributed filesystems would guarantee that your data can survive the failure of any single node. Persisting to a DFS would also provide more flexibility in where you can schedule your tasks, although at the cost of the difference in latency between network and local disk. Mesos has built-in support for fetching binaries from HDFS uris, and many customers use HDFS for passing executor binaries, config files, and input data to the slaves where their tasks will run.
2) DRBD, MooseFS, GlusterFS, NFS, CephFS? I've heard of customers using CephFS, HDFS, and MapRFS with Mesos. NFS would seem an easy fit too. It really doesn't matter to Mesos what you use as long as your task knows how to access it from whatever node where it's placed.
Hope that helps!
I have an environment (Graphite) that looks like the following:
N worker servers
1 relay server that forwards work to these worker servers
1 web server that can query the relay server.
I would like to use Chef to setup and deploy this environment in EC2 without having to create each worker server individually, get their IPs and set them as attributes in the relay cookbook, create that relay, get the IP, set it as an attribute in the web server cookbook, etc.
Is there a way using chef in which I can make sure that the environment is properly deployed, configured and running without having to set the IPs manually? Particularly, I would like to be able to add a worker server and have the relay update its worker list, or swap the relay server for another one and have the web server update its reference accordingly.
Perhaps this is not what Chef is intended for and is more for per-server configuration and deployment, if that is the case, what would be a technology that facilitates this?
Things you will need are:
knife-ec2 - This is used to start/stop Amazon EC2 instances.
chef-server - To be able to use search in your recipes. It should be also accessible from your EC2 instances.
search - with this you will be able to find among the nodes provisioned by chef, exactly the one you need using different queries.
I have lately written an article How to Run Dynamic Cloud Tests with 800 Tomcats, Amazon EC2, Jenkins and LiveRebel. It involves loadbalancer installation and loadbalancer must know all IP adresses of the servers it balances. You can check out the recipe of balanced node, how it looks for loadbalancer:
search(:node, "roles:lr-loadbalancer").first
And check out the loadbalancer recipe, how it looks for all the balanced nodes and updates the apache config file:
lr_nodes = search(:node, "role:lr-node")
template ::File.join( node[:apache2][:home], 'conf.d', 'httpd-proxy-balancer.conf' ) do
mode 0644
variables(:lr_nodes => lr_nodes)
notifies :restart, 'service[apache2]'
end
Perhaps you are looking for this?
http://www.infochimps.com/platform/ironfan