Service discovery with Scala Akka inside Kubernetes

I am facing issues with service discovery using Akka. I recently migrated my application from Kubernetes version 1.18 to 1.21. Service discovery works fine on the older version, but it doesn't seem to work on 1.21. Below is the error log:
play.api.UnexpectedException: Unexpected exception[IllegalStateException: Service ${serviceName} was not found by service locator]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:358)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:264)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:430)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:422)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:454)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:94)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100)
The config file looks like this:
akka.management {
  cluster.bootstrap {
    contact-point-discovery {
      discovery-method = kubernetes-api
      service-name = ${serviceName}
      required-contact-point-nr = ${REQUIRED_CONTACT_POINT_NR}
    }
  }
}

First, make sure that your Role and RoleBinding are correct, as documented. This is, by far, the most common mistake: not giving the discovery mechanism the proper permissions.
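A quick way to check the permission side (the namespace and service account names here are placeholders for whatever your deployment actually uses) is to ask the API server whether the pod's service account may list pods:

kubectl auth can-i list pods -n <namespace> --as=system:serviceaccount:<namespace>:<service-account>

If this prints "no", the Role/RoleBinding (which, per the Akka Management docs, needs get/watch/list on pods) is the first thing to fix.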
Secondly, try manually specifying the pod-label-selector rather than specifying a service-name. This is what I've done recently, and it matches the defaults in the docs. So maybe you are in a situation where you don't have permission to query the headless service, but you can query the pods directly via the pod-label-selector.
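For reference, a minimal sketch of what that override can look like (the label value is illustrative and must match the labels on your pods; double-check the exact key path against the Akka Management docs for your version):

akka.discovery {
  kubernetes-api {
    # must match the labels actually set on your application pods
    pod-label-selector = "app=my-app"
  }
}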
Fundamentally, what it's doing is trivial. It's running an extremely simple query (one that you can also run manually; see the kubectl sketch after this list) against the Kubernetes API and getting a list of pod IPs back. (It then passes those into the Akka Cluster mechanism, which does much more complex things to establish and modify the cluster, but the query itself is super simple.) So fundamentally, the only things that can be wrong are:
The query is wrong. (Either the thing you're querying doesn't exist, or there is a typo in your query or implied query.)
The permissions on your query are wrong, such that your service account cannot execute the query.
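As a sanity check, you can reproduce roughly the same lookup yourself (namespace and label are placeholders):

kubectl get pods -n <namespace> -l app=<your-app-label> -o wide

If that returns nothing, the label selector (or implied service name) is wrong; if it is forbidden, the permissions are wrong.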

Related

How to find the auto-created service connection when deploying to AKS

During a pipeline run, under a deployment job, providing a deployment environment eliminates the need to provide a service connection manually. I'd guess it either creates a new SC at this point, or it created an SC at the time of environment creation and reuses the same one.
Either way, is there a way to find out which service connection is being used, from the logs of the pipeline run or from anywhere else?
In our setup, I see a lot of service connections for one environment, and a cleanup is necessary to get things in order.
I tried providing the SC manually along with the environment and it works as expected, so going forward I can use this method. But for the cleanup, I'd still like to know which one gets used when not specified! (None of the auto-created SCs show any execution history, but I know the deployment has happened multiple times.)
Since a Kubernetes resource in an environment references a Kubernetes service connection, you can use this API to get the serviceEndpointId of the Kubernetes resource, which is the ID of the referenced service connection.
GET https://dev.azure.com/{organization}/{project}/_apis/distributedtask/environments/{environmentId}/providers/kubernetes/{resourceId}?api-version=7.0
Using the value of serviceEndpointId from the response of the above API as the endpointId, we can then call this API to get the details of the referenced service connection.
GET https://dev.azure.com/{organization}/{project}/_apis/serviceendpoint/endpoints/{endpointId}?api-version=7.0
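For example, with a personal access token in an environment variable (the variable name and the placeholders in braces are mine, not part of the API):

curl -s -u ":$AZURE_DEVOPS_PAT" \
  "https://dev.azure.com/{organization}/{project}/_apis/distributedtask/environments/{environmentId}/providers/kubernetes/{resourceId}?api-version=7.0"

# take serviceEndpointId from the JSON response, then:
curl -s -u ":$AZURE_DEVOPS_PAT" \
  "https://dev.azure.com/{organization}/{project}/_apis/serviceendpoint/endpoints/{endpointId}?api-version=7.0"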

When to not use StatefulSets?

CONTEXT: I have been learning Kubernetes and trying to get some hands-on experience. I have been using AKS to abstract away the complexity of having to deal with the control plane (and because I have a free student Azure account). I am deploying a NodeJS app that connects to a MongoDB database. So far the deployment has been successful, but I am using MongoDB Atlas and connecting to it.
Based on the little I have learned about StatefulSets, the MongoDB Atlas service seems a lot easier and more convenient, but my question is: when would it be a better idea to deploy MongoDB itself as a StatefulSet (running in pods)? What's more cost-effective? More easily scalable?
I realize the questions might be a little bit vague, but I am just getting started with Kubernetes.
Disclaimer: this is not a production application, just something simple I am using to learn K8S.
The official docs use a StatefulSet, and that makes sense. Generally, database-type applications are deployed as StatefulSets, because nodes can end up out of sync with each other, and that would create data inconsistencies between nodes (MongoDB nodes, not Kubernetes nodes).
You can deploy MongoDB as a Deployment; I have seen it done. But most clients use a connection string to connect (a string of multiple node addresses), and since Kubernetes exposes StatefulSets with headless services, you should be okay.
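As a concrete, purely illustrative example: if your StatefulSet and its headless Service were both named mongo in the default namespace, each pod gets a stable DNS name, and a typical replica-set connection string would look like:

mongodb://mongo-0.mongo.default.svc.cluster.local:27017,mongo-1.mongo.default.svc.cluster.local:27017,mongo-2.mongo.default.svc.cluster.local:27017/mydb?replicaSet=rs0

All names here (mongo, mydb, rs0) are placeholders; the point is that the pod-name.service-name addresses survive pod restarts, which is exactly what a database client wants.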
For learning purposes, I advise you to deploy your MongoDB in a StatefulSet. Then you can learn how it works and what problems you could encounter with this Kubernetes object.
For a production application, I advise never deploying a database in a StatefulSet if you don't need to. In fact, StatefulSets come with a lot of problems that you might not want to have to manage.
Sometimes, company rules prohibit hosting their data on another company's external storage.
To know if you need to put your database in a StatefulSet, the questions I try to answer are:
Should my DB be hosted on premise (for privacy)?
Should my DB be scalable?
Should my DB be updated frequently?
You can find a list of pros/cons on the documentation.

Cassandra Kubernetes Statefulset NoHostAvailableException

I have an application deployed in kubernetes, it consists of cassandra, a go client, and a java client (and other things, but they are not relevant for this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service dns as a contact point for cluster creation.
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes goes down. I am simulating this by deleting all pods using kubectl delete, in succession, on all of the Cassandra nodes.
When I do this, the clients throw NoHostAvailableException.
In Java it's:
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
In Go it's:
"gocql: no hosts available in the pool"
I can query Cassandra using cqlsh, the nodes seem fine using nodetool status, and all of the new IPs are there.
The image I am using doesn't have netstat, so I have not yet confirmed it's listening on the expected port.
By executing bash on the two client pods, I can see the DNS makes sense using nslookup, but...
netstat does not show any established connections to Cassandra (they are present before I take the nodes down).
If I restart my clients everything works fine.
I have googled a lot (I mean a lot); most of what I have found relates to never having had a working connection in the first place, and the most relevant results seem very old (like 2014, 2016).
So a node going down is a very basic scenario, and I would expect everything to work: the Cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc.
If I take all of my Cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct nodes, but at least it works).
So, is there a point where this behaviour is expected? I.e. I have taken everything down, and none of the replacement nodes was up and running before the last node of the original cluster was taken down; is this behaviour expected?
To me it seems like it should be an easy issue to resolve. I'm not sure what's missing or incorrect, and I am surprised that both clients show the same symptoms, which makes me think something is not right with our StatefulSet and Service.
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
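One thing that may be worth checking on the driver side (this is an assumption on my part; it applies to the 4.x DataStax Java driver, whereas your stack trace is from the 3.x driver) is whether contact points are being resolved to pod IPs once and then cached. In the 4.x driver you can keep the headless-service hostname unresolved so it is looked up again on reconnection:

# application.conf for the DataStax Java driver 4.x
# (verify against the driver docs for your version)
datastax-java-driver {
  advanced.resolve-contact-points = false
}

With this set to false, the driver keeps the unresolved hostname and resolves it again when it reconnects, so it can find the replacement pods behind the same service name.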
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!

AWS Elasticsearch cluster becoming unresponsive

We have several AWS Elasticsearch domains which sometimes become unresponsive for no apparent reason. Both the ES endpoint and Kibana return bad gateway errors after a few minutes of trying to load the resources.
The node status message is the following (not that it's any help):
/_cluster/health: {"code":"ProxyRequestServiceException","message":"Unable to execute HTTP request: Read timed out (Service: null; Status Code: 0; Error Code: null; Request ID: null)"}
Error logs are activated for the cluster but do not show anything relevant for the time at which the cluster became inactive.
I would like to at least be able to restart the cluster but the status remains "processing" seemingly forever.
Unfortunately, if you are using the AWS Elasticsearch Service (as in not building it on your own EC2 instances), many... well... MOST... of the admin APIs and capabilities are restricted, so you cannot dig into it as much as you could if you had built it from the ground up.
I have found that AWS Support does a pretty good job in getting to the bottom of things when needed, so I would suggest you open a support ticket.
I wish this weren't the case. Using their service is nice and easy (you don't have to build and maintain the infra yourself), but you lose a LOT of capabilities from an admin or troubleshooting perspective. :(

Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
#yliaog this is an issue for us, as we are running kubernetes pods as
batch processes and tracking the state of the pods with a static
client. Once the client object is initialized, it does no refresh, and
therefore any job that takes longer than 60 minutes will fail. Looking
through python-base, it seems like we could make a wrapper class that
generates a new client (or refreshes the config) every n minutes, or
checks status prior to every call (as #mvle suggested). The best fix
would be in swagger-codegen, but a temporary solution would probably
be very useful for a lot of people.
- #flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
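A minimal sketch of that wrapper idea (my own illustration, assuming the client is built from a kubeconfig; this is not code from the linked issue):

import time
from kubernetes import client, config

class RefreshingCoreV1:
    """Rebuild the CoreV1Api client, and with it the auth credentials,
    whenever the cached client is older than max_age_seconds."""

    def __init__(self, max_age_seconds=1800):
        self.max_age_seconds = max_age_seconds
        self._api = None
        self._built_at = 0.0

    def api(self):
        # Reload the kubeconfig (and therefore the credentials) when the
        # cached client has grown stale, then build a fresh API client.
        if self._api is None or time.time() - self._built_at > self.max_age_seconds:
            config.load_kube_config()
            self._api = client.CoreV1Api()
            self._built_at = time.time()
        return self._api

# usage: RefreshingCoreV1().api().list_namespaced_pod("default")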
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the kubernetes python client.