How to terminate the JanusGraph container when an exception is thrown - Kubernetes

I'm using the JanusGraph Docker image - https://hub.docker.com/r/janusgraph/janusgraph
In my Kubernetes deployment I initialise the remote graph using a Groovy script mounted into docker-entrypoint-initdb.d.
This works as expected, but if the remote host is not ready the JanusGraph container throws an exception and stays in the Running state.
Because of this Kubernetes will not attempt to restart the container. Is there any way to configure the JanusGraph container to terminate when an exception occurs?

As @Gavin has mentioned, you can use probes to check whether containers are working. A liveness probe is used to know when a container has failed; if a container is unresponsive, the kubelet can restart it.
Readiness probes indicate when a container is ready to accept traffic. The readiness probe is used to control which pods are used as backends for a Service. A pod is considered ready when all of its containers are ready; if a pod is not ready, it is removed from the Service endpoints.
Kubernetes supports three mechanisms for implementing liveness and readiness probes:
1) making an HTTP request against a container
HTTP probes have additional fields that can be set on httpGet:
host: Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
scheme: Scheme to use for connecting to the host (HTTP or HTTPS). Defaults to HTTP.
path: Path to access on the HTTP server. Defaults to /.
httpHeaders: Custom headers to set in the request. HTTP allows repeated headers.
port: Name or number of the port to access on the container. Number must be in the range 1 to 65535.
Read more: http-probes.
livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
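For illustration, a hedged sketch that combines the optional httpGet fields listed above (the header, path and port values are placeholders):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    scheme: HTTP
    httpHeaders:
    - name: Custom-Header     # illustrative custom header
      value: Awesome
  initialDelaySeconds: 3
  periodSeconds: 3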
2) opening a TCP socket against a container
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
3) running a command inside a container
livenessProbe:
  exec:
    command:
    - sh
    - /tmp/status_check.sh
  initialDelaySeconds: 10
If the command returns a status code other than 0, the probe is considered failed.
You can also add further parameters to probes, such as initialDelaySeconds, which indicates the number of seconds after the container has started before liveness or readiness probes are initiated. See: configuring-probes.
In every case also add restartPolicy: Never to your pod definition; the default is Always.
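As an illustration, a bare Pod spec combining these probe parameters with restartPolicy: Never might look like this (the image and script path are reused from the examples above and are assumptions; note that a Deployment only accepts restartPolicy: Always, so this applies to a bare Pod or a Job):
apiVersion: v1
kind: Pod
metadata:
  name: status-check-demo
spec:
  restartPolicy: Never                    # default is Always
  containers:
  - name: app
    image: janusgraph/janusgraph:latest   # image from the question
    livenessProbe:
      exec:
        command:
        - sh
        - /tmp/status_check.sh            # hypothetical check script, as in the exec example above
      initialDelaySeconds: 10
      periodSeconds: 20
      timeoutSeconds: 3
      failureThreshold: 3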

A readinessProbe could be employed here with a command like janusgraph show-config or something similar that exits with a non-zero code on failure:
spec:
  containers:
  - name: liveness
    image: janusgraph/janusgraph:latest
    readinessProbe:
      exec:
        command:
        - janusgraph
        - show-config
If the readinessProbe fails, Kubernetes stops sending traffic to the pod. A livenessProbe could also be used here, so the container is restarted if the remote host ever becomes unavailable.
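A minimal sketch of such a livenessProbe, reusing the same command (the timing values are illustrative):
livenessProbe:
  exec:
    command:
    - janusgraph
    - show-config
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3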
Consider enabling JanusGraph server metrics, which could then be used with Prometheus for additional monitoring or even with the livenessProbe itself.

Related

Why am I losing my connection to my MongoDB after my GKE node gets preempted?

I am running a Mongo, Express, Node, React app in a GKE cluster that is set up with a preemptible VM (to save money). I am using mongoose to connect to my MongoDB, which is hosted on MongoDB Atlas. Everything works fine when the pod first starts. However, when my node gets preempted, I lose the connection to my MongoDB instance. I then have to go in and manually scale the deployment down to 0 replicas and scale it back up, and the connection to MongoDB is restored. Below is the error I am getting and the code for my Mongo connection. Is this just an intended effect of using a preemptible instance? Is there any way to deal with it, like automatically scaling the deployment after a preemption? I was running a GKE Autopilot cluster and had no problems, but that was a little expensive for my purposes. Thanks
mongoose
  .connect(process.env.MONGODB_URL, {
    useNewUrlParser: true,
    useUnifiedTopology: true,
    useFindAndModify: false,
  })
  .then(() => console.log('mongoDB connected...'));
(node: 24) UnhandledPromiseRejectionWarning: Error: querySrv ECONNREFUSED _mongodb._tcp.clusterx.xxxxx.azure.mongodb.net at QueryReqWrap.onresolve (dns.js:203)
The VM preemption can be reproduced in Compute Engine -> Instance groups -> Restart/Replace VMs, choosing the option Replace. After the VM has been restarted, the containers are recreated too, but unfortunately with the network issues mentioned above.
My solution was to add liveness and readiness probes to the Kubernetes Pods/Deployment via a /health URL which checks whether MongoDB is available and returns status code 500 if not. Details on how to define liveness and readiness probes in Kubernetes are here. Kubernetes will restart pods that are not alive, and the pods created afterwards won't have the network issues.
The YAML spec block in my project looks like this:
spec:
  containers:
  - env:
    - name: MONGO_URL
      value: "$MONGO_URL"
    - name: NODE_ENV
      value: "$NODE_ENV"
    image: gcr.io/$GCP_PROJECT/$APP_NAME:$APP_VERSION
    imagePullPolicy: IfNotPresent
    name: my-container
    # the readiness probe details
    readinessProbe:
      httpGet:                 # make an HTTP request
        port: 3200             # port to use
        path: /health          # endpoint to hit
        scheme: HTTP           # or HTTPS
      initialDelaySeconds: 5   # how long to wait before checking
      periodSeconds: 5         # how long to wait between checks
      successThreshold: 1      # how many successes to hit before accepting
      failureThreshold: 1      # how many failures to accept before failing
      timeoutSeconds: 3        # how long to wait for a response
    # the livenessProbe probe details
    livenessProbe:
      httpGet:                 # make an HTTP request
        port: 3200             # port to use
        path: /health          # endpoint to hit
        scheme: HTTP           # or HTTPS
      initialDelaySeconds: 15  # how long to wait before checking
      periodSeconds: 5         # how long to wait between checks
      successThreshold: 1      # how many successes to hit before accepting
      failureThreshold: 2      # how many failures to accept before failing
      timeoutSeconds: 3        # how long to wait for a response

Kubernetes external probes

Is it possible to define an external path, for example another web server, as the target of the web probes?
Or a TCP probe with a different IP?
livenessProbe:
  httpGet:
    path: external.de/test
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
I know that's not how you should use probes, but I need it for testing.
Does someone know how to define probes that are not applied to the pods directly?
You can use the following command in your liveness probe:
livenessProbe:
  exec:
    command:
    - curl
    - external.de/test:8080
  initialDelaySeconds: 10
  periodSeconds: 105
In this case, if the curl external.de/test:8080 command returns with an exit code of 0, the container is assumed healthy; any other exit code is deemed unhealthy.
Also keep in mind that once the probe fails, the pod running this probe will be restarted, not the one running the external.de/test:8080 web server.
More details on how to use a command within a liveness probe are described here.
If you want to achieve that, you cannot use the http probe.
You have to use the exec one, pointing to a simple bash script that executes cURL on your behalf; you can mount the script via a ConfigMap or a hostPath volume to perform your testing.
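A minimal sketch of that approach, assuming the probe should simply curl the external endpoint (the names are illustrative and the image must ship sh and curl):
apiVersion: v1
kind: ConfigMap
metadata:
  name: external-check
data:
  check.sh: |
    #!/bin/sh
    # exit non-zero (probe failure) if the external endpoint does not respond
    curl -sf external.de/test:8080
---
apiVersion: v1
kind: Pod
metadata:
  name: external-probe-demo
spec:
  containers:
  - name: app
    image: nginx:1.7.9            # replace with an image that contains sh and curl
    volumeMounts:
    - name: external-check
      mountPath: /scripts
    livenessProbe:
      exec:
        command:
        - sh
        - /scripts/check.sh
      initialDelaySeconds: 10
      periodSeconds: 10
  volumes:
  - name: external-check
    configMap:
      name: external-check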

exec probe in GKE

I'm trying to use exec probes for readiness and liveness in GKE. This is because it is part of Kubernetes' recommended way to do health checks on gRPC back ends. However when I put the exec probe config into my deployment yaml and apply it, it doesn't take effect in GCP. This is my container yaml:
- name: rev79-uac-sandbox
  image: gcr.io/rev79-232812/uac:latest
  imagePullPolicy: Always
  ports:
  - containerPort: 3011
  readinessProbe:
    exec:
      command: ["bin/grpc_health_probe", "-addr=:3011"]
    initialDelaySeconds: 5
  livenessProbe:
    exec:
      command: ["bin/grpc_health_probe", "-addr=:3011"]
    initialDelaySeconds: 10
But the health checks still fail, and when I look at the health check configuration in the GCP console I see a plain HTTP health check directed at '/'.
When I edit a health check in the GCP console there doesn't seem to be any way to choose an exec type. I also can't see any mention of liveness checks as opposed to readiness checks, even though these are separate Kubernetes concepts.
Does Google cloud support using exec for health checks?
If so, how do I do it?
If not, how can I health check a gRPC server?
TCP probes are useful when you are running gRPC services rather than using HTTP probes:
- containerPort: 3011
readinessProbe:
  tcpSocket:
    port: 3011
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 3011
  initialDelaySeconds: 15
  periodSeconds: 20
The kubelet will attempt to open a socket to your container on the specified port. If it can establish a connection, the container is considered healthy; if it can't, it is considered a failure.
See: define-a-tcp-liveness-probe.
Exec probes work in GKE just the same way they work everywhere. You can view the liveness probe result with kubectl describe pod, or you can simply log in to the pod, execute the command and check its return code.
The server has to implement the gRPC health checking protocol, as indicated here and in this article.
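As a side note, newer Kubernetes versions (1.24+) also offer a native grpc probe type, which assumes the server implements the standard gRPC health checking protocol; a sketch using the port from the question:
readinessProbe:
  grpc:
    port: 3011
  initialDelaySeconds: 5
livenessProbe:
  grpc:
    port: 3011
  initialDelaySeconds: 10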
Both answers from Vasily Angapov and Suresh Vishnoi should in theory work; however, in practice they don't (at least in my practice).
So my solution was to start another server in my backend container - an HTTP server that simply runs the health check whenever it gets a request, returning a 200 status if it passes and a 503 if it fails.
I also had to open a second port on my container for that server to listen on.
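The probe side of that setup might look like this (the second port 8081 and the /health path are assumptions):
ports:
- containerPort: 3011     # gRPC port
- containerPort: 8081     # hypothetical HTTP health server
readinessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 10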

Fake liveness/readiness probe in kubernetes

Is it possible to fake a container to always be ready/live in kubernetes so that kubernetes thinks that the container is live and doesn't try to kill/recreate the container? I am looking for a quick and hacky solution, preferably.
Liveness and readiness probes are not required by k8s controllers; you can simply remove them and your containers will always be live/ready.
If you want the hacky approach anyways, use the exec probe (instead of httpGet) with something dummy that always returns 0 as exit code. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        livenessProbe:
          exec:
            command:
            - touch
            - /tmp/healthy
        readinessProbe:
          exec:
            command:
            - touch
            - /tmp/healthy
I'd like to add some background context about why and how this can be useful for real-world applications; spelling out why the question is useful also leads to a better answer.
First off, why might you want to implement a fake startup/readiness/liveness probe?
Let's say you have a custom containerized application; you're in a rush, so you go live without any liveness or readiness probes.
Scenario 1:
You have a deployment with 1 replica, but you notice that whenever you update your app (push a new version via a rolling update), your monitoring platform occasionally reports 400, 500, and timeout errors during the rolling update. After the update you're back at 1 replica and the errors go away.
Scenario 2:
You have enough traffic to warrant autoscaling and multiple replicas. You consistently get 1-3% errors, and 97% success.
Why are you getting errors in both scenarios?
Let's say it takes 1 minute for your app to finish booting up and become ready to receive traffic. If you don't have readiness probes, newly spawned instances of your container will receive traffic before they're ready, so the newly spawned instances are probably causing the temporary 400, 500, and timeout errors.
How to fix:
You can fix the occasional errors in Scenarios 1 and 2 by adding a readiness probe with an initialDelaySeconds (or a startup probe), basically something that waits long enough for your container app to finish booting up.
Now the correct, best-practice thing to do is to write a /health endpoint that properly reflects the health of your app. But writing an accurate health check endpoint can take time. In many cases you can get the same end result (make the errors go away) without the effort of creating a /health endpoint, by faking it and just adding a wait period that lets your app finish booting up before traffic is sent to it. (Again, /health is best practice, but for the "ain't nobody got time for that" crowd, faking it can be a good enough stopgap solution.)
Below is a better version of a fake readiness probe, and here's why it's better:
exec-based probes don't work in 100% of cases: they assume a shell exists in the container and that the commands exist in the container. There are scenarios where hardened containers don't have things like a shell or the touch command.
httpGet, tcpSocket, and grpc probes are executed from the perspective of the node running the kubelet (the Kubernetes agent), so they don't depend on the software installed in the container and should work on hardened containers that are missing things like the touch command, or even on scratch containers. (In other words, this solution should work in 100% of cases vs 99% of the time.)
An alternative to a startup probe is to use initialDelaySeconds with a readiness probe, but that creates unnecessary traffic compared to a startup probe, which stops probing once it succeeds. (Again, this isn't the best solution in terms of accuracy/fastest possible startup time, but it's often a good enough solution that's very practical.)
Run my example in a cluster and you'll see it's not ready for 60 seconds, then becomes ready after 60 seconds.
Since this is a fake probe, it's pointless to use readiness/liveness probes; just go with a startup probe, as that will cut down on unnecessary traffic.
In the absence of a readiness probe, the startup probe has the effect of a readiness probe (blocking the pod from being ready until the probe passes, but only during initial startup):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: useful-hack
  labels:
    app: always-true-tcp-probe
spec:
  replicas: 1
  strategy:
    type: Recreate              #dev env fast feedback loop optimized value, don't use in prod
  selector:
    matchLabels:
      app: always-true-tcp-probe
  template:
    metadata:
      labels:
        app: always-true-tcp-probe
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        startupProbe:
          tcpSocket:
            host: 127.0.0.1     #since kubelet does the probes, this is the node's localhost, not the pod's localhost
            port: 10250         #worker node kubelet listening port
          successThreshold: 1
          failureThreshold: 2
          initialDelaySeconds: 60   #wait 60 sec before starting the probe
Additional Notes:
The above example keeps traffic within the LAN, which has several benefits:
It will work in internet-disconnected environments.
It won't incur egress network charges.
The example below will only work in internet-connected environments. It isn't too bad for a startup probe, but it would be a bad idea for a readiness/liveness probe since it could clog the NAT GW bandwidth; I'm only including it to point out something of interest:
startupProbe:
  httpGet:
    host: google.com   #defaults to pod IP
    path: /
    port: 80
    scheme: HTTP
  successThreshold: 1
  failureThreshold: 2
  initialDelaySeconds: 60
---
startupProbe:
  tcpSocket:
    host: 1.1.1.1      #CloudFlare
    port: 53           #DNS
  successThreshold: 1
  failureThreshold: 2
  initialDelaySeconds: 60
The interesting bit:
Remember I said "httpGet, tcpSocket, and grpc probes are executed from the perspective of the node running the kubelet (the Kubernetes agent)." The kubelet runs on the worker node's host OS, which is configured for upstream DNS; in other words, it doesn't have access to the in-cluster DNS entries that kube-dns is aware of. So you can't specify Kubernetes Service names in these probes.
Additionally, Kubernetes Service IPs won't work for the probes either, since they're VIPs (virtual IPs) that only* exist in iptables (*in most cases).

What is the http url to get readiness and liveness probes of service instances from kubernetes

I searched for 'readi', 'ready', 'live', etc. in the Kubernetes swagger. I only see
io.k8s.api.core.v1.PodReadinessGate
Thank you.
That's something you define yourself. For example, the following YAML file:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:        #this block performs liveness probes
      httpGet:
        path: /healthz
        port: 80
    readinessProbe:       #this block performs readiness probes
      httpGet:
        path: /
        port: 80
So, a pod with nginx. I can simply add the highlighted blocks to the YAML file and that's it; the kubelet will check them. Of course you have to have something serving there (/healthz, in this example), otherwise you will get a 404.
You can add more configuration to the probes, as the other answer suggests; there are some more options than those.
According to Configure Liveness and Readiness Probes, services can be configured to use:
a liveness command
a TCP liveness probe
a liveness HTTP request
So if your service uses HTTP requests for liveness and readiness, you will see a livenessProbe section in the pod definition (same for readinessProbe):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 3
See full example here
There is no way to check the state of Liveness and Readiness probes directly.
You can check the resulting state of the pod which reflects changes in the state of Liveness and Readiness probes, but with some delay caused by threshold values.
Using kubectl describe pod you can also see some events at the bottom, but you can only see them after they occur. You can’t have it as a reply to the request.
You can also look into the REST requests that run under the hood of kubectl commands. All you need to do is add a verbose flag to the kubectl command:
-v, --v=0: Set the level of log output to debug-level (0~4) or trace-level (5~10)
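For example, running kubectl get pod nginx -v=8 (the pod name is illustrative) prints the REST calls kubectl makes to the API server, including the GET request for the pod.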