Waking up Nodes - Kaleido

We've created an environment on Kaleido and an API to communicate with the private network.
If there isn't any activity for a certain duration, the nodes on Kaleido go to sleep.
When our API then attempts to connect to a node, it gets back an error reporting the node as unresponsive.
We have to manually log in to Kaleido to wake up the nodes.
My question is:
Is there a way to remotely wake up the nodes?
Is this simply a limitation of the free tier?
Thanks,
Chris

If your environment has been paused, you can use the PATCH API for the environment to resume all of the nodes within it.
curl -X PATCH -d '{"state":"live"}' -H "$HDR_AUTH" -H "$HDR_CT" "$APIURL/consortia/{consortia_id}/environments/{environment_id}"
This will put your environment into a resume_pending state while it wakes the nodes; once they are awake, the state will automatically be updated to live.
Currently, all environments will automatically pause after a set period of inactivity. If your use case needs a longer time before pausing, you can submit feedback via the Kaleido UI requesting an increase.
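For scripted resumes, the same PATCH call can be made from Python. This is a stdlib-only sketch; the base URL and the bearer-token authorization header are assumptions based on the curl example above, so check them against your own Kaleido credentials:

```python
# Sketch of resuming a paused Kaleido environment via the REST API.
# The base URL and Bearer auth scheme are assumptions from the curl example.
import json
import urllib.request

API_URL = "https://console.kaleido.io/api/v1"  # assumed base URL

def build_resume_request(consortia_id, environment_id, api_key):
    """Build the PATCH request that flips the environment state to 'live'."""
    url = f"{API_URL}/consortia/{consortia_id}/environments/{environment_id}"
    body = json.dumps({"state": "live"}).encode()
    req = urllib.request.Request(url, data=body, method="PATCH")
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    return req

def resume_environment(consortia_id, environment_id, api_key):
    """Send the PATCH; the environment enters resume_pending, then live."""
    req = build_resume_request(consortia_id, environment_id, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A caller could poll the environment with a GET until the returned state is live before resuming API traffic against the nodes.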

Related

Tag nodes in rundeck based on their connection status

I want to run a job (or use any other method) on a set of nodes in Rundeck to test that the connection to each node succeeds, and then tag them as node_name_failed or node_name_succeed. Is this possible, with or without plugins, so that it can be achieved in one click?
Currently I'm able to do this by externally parsing each node's execution status and modifying the resource model, but that requires navigating away from the UI.
In Rundeck Enterprise you can use the health check feature and then dispatch your jobs with a filter based on node status. A good way to do this in Community Edition is to call the jobs via the API or the RD CLI, wrapped in a bash script that first detects each node's status.
The Rundeck documentation has examples of calling a job using the API and using the RD CLI.
EDIT: You can also build your own health-check system based on this approach.
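As a sketch of that Community Edition approach: probe each node before dispatching, derive the tag, and call the job through the Rundeck API only for nodes that answered. The probe logic, tag format, API version, and token header below are assumptions, not verified against your Rundeck:

```python
# Sketch: pre-flight connectivity check, then dispatch via the Rundeck API.
# API version (18) and the X-Rundeck-Auth-Token header are assumptions.
import json
import socket
import urllib.request

def probe(host, port=22, timeout=3):
    """Return 'succeed' if a TCP connection to the node works, else 'failed'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "succeed"
    except OSError:
        return "failed"

def tag_nodes(nodes):
    """Map each node name to node_name_succeed / node_name_failed."""
    return {name: f"{name}_{probe(host)}" for name, host in nodes.items()}

def run_job(server, token, job_id, node_filter):
    """POST /api/18/job/<id>/run with a node filter (assumed endpoint)."""
    req = urllib.request.Request(
        f"{server}/api/18/job/{job_id}/run",
        data=json.dumps({"filter": node_filter}).encode(),
        method="POST",
    )
    req.add_header("X-Rundeck-Auth-Token", token)
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The probe results could also be written back into the resource model source, which is essentially the externally-scripted version of what Enterprise's health check does for you.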

Acumatica scheduler

This might be a strange question. Is there any way to make the scheduler stay active perpetually? I have a couple of instances, one on a server for testing and a development instance on my laptop. I set up some business events in both instances that accurately fire as designed. My question comes from the fact that the scheduler seems to stall if no one logs into the instance. Once I log in to the instance with any id, the scheduler restarts and runs for about 12 hours and then stalls again. I thought it was only the test instance on the server, but I took a couple of days off and my laptop instance also stalled. Is there a setting to overcome this? I know the assumption is that there will be users in the system in production, but what about over the weekend or holidays?
The scheduler runs in the IIS worker process (w3wp) of the assigned Application Pool. Normally the worker process is started when the first web request is received.
If you restart the test instance of the server or your laptop instance you may experience this delay until someone logs in.
However, you can set the worker process to start automatically whenever an Application Pool starts.
Check your IIS configuration, look for the Application Pool assigned to your Acumatica instance and edit its Advanced Settings.
There you can change StartMode to AlwaysRunning.
Your app pool might be getting recycled. Check the following posts, which may help:
How to know who kills my threads
https://serverfault.com/questions/333907/what-should-i-do-to-make-sure-that-iis-does-not-recycle-my-application

How to use the Python Kubernetes client in a way resilient to GKE Kubernetes Master disruptions?

We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, the "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that automatic scaling of the control plane / master VM occurs when the node count grows from the 0-5 range to 6+, and potentially at other times as more nodes are added.
It seems the control plane can go down at times like this, when many nodes have been brought up. Around the time this happens, our Python scripts that monitor pods via the control plane often crash, apparently unable to reach the Kube API / control plane endpoint, raising exceptions such as:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify what we're doing with the Python client: we read the status of the pod of interest via read_namespaced_pod in a loop every few minutes, catching exceptions like those above (we've also tried catching exceptions from the underlying urllib calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after a specified maximum number of retries, even when that number is high (e.g. retrying for more than 5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a node pool's size changes, this can trigger a resize of the master, depending on the new size. Here are the node pool sizes mapped to the master sizes. In the case where the node pool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process, the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
During this time, all API calls to the master will fail, including those from the Python API client and kubectl. However, after 1-5 minutes the master should be available again and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 nodes to 20 nodes, and for 1-5 minutes the master wasn't available.
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I had:
"Unable to connect to the server: dial tcp"
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate the kubernetes.client.CoreV1Api object, as it is just a client for an API endpoint.
According to your description, your master wasn't accessible even after 5 minutes, which signals a potential issue with your master or with the setup of your Python script. To troubleshoot this further on your side while the Python script runs, you can check the availability of the master by running any kubectl command.
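A generic retry wrapper along these lines can ride out the 1-5 minute resize window; the helper name and parameters here are illustrative, not from the question, and the Kubernetes client usage is shown in a comment so the sketch stays self-contained:

```python
# Stdlib-only retry-with-backoff wrapper for control-plane reads, sized so
# the total retry window comfortably exceeds the ~5 minute master outage.
import random
import time

def call_with_backoff(fn, *, retriable=(Exception,), max_elapsed=600,
                      base=2.0, cap=60.0, sleep=time.sleep):
    """Keep calling fn() until it succeeds or max_elapsed seconds have passed."""
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return fn()
        except retriable:
            attempt += 1
            if time.monotonic() - start > max_elapsed:
                raise  # window exhausted; surface the last error
            # full-jitter exponential backoff
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage against the Kubernetes Python client (sketch):
#   from kubernetes import client, config
#   from kubernetes.client.rest import ApiException
#   config.load_kube_config()
#   v1 = client.CoreV1Api()
#   pod = call_with_backoff(
#       lambda: v1.read_namespaced_pod("my-pod", "default"),
#       retriable=(ApiException, OSError), max_elapsed=600)
```

The key point is to bound retries by elapsed time rather than attempt count, so a high retry count with short sleeps cannot silently give up before the master is back.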

Gracefully draining sessions from Dropwizard application using Kubernetes

I have a Dropwizard application that holds short-lived sessions (phone calls) in memory. There are good reasons for this, and we are not changing this model in the near term.
We will be setting up Kubernetes soon and I'm wondering what the best approach would be for handling shutdowns / rolling updates. The process will need to look like this:
1. Remove the DW node from the load balancer so no new sessions can be started on this node.
2. Wait for all remaining sessions to complete. This should take a few minutes.
3. Terminate the process / container.
It looks like Kubernetes can handle this if I make step 2 a "preStop hook":
http://kubernetes.io/docs/user-guide/pods/#termination-of-pods
My question is: what will the preStop hook actually look like? Should I set up a DW "task" (http://www.dropwizard.io/0.9.2/docs/manual/core.html#tasks) that waits until all sessions are completed and curl it from Kubernetes? Or should I put a bash script in the Docker container with the DW app that polls some session-count service until none are left, and execute that?
Assume you don't use the preStop hook and a pod deletion request has been issued.
The API server processes the deletion request and modifies the pod object.
The endpoints controller observes the change and removes the pod from the list of endpoints.
On the node, a SIGTERM signal is sent to your container/process.
Your process should trap the signal and drain all existing requests. Note that this step should not take longer than the termination grace period (terminationGracePeriodSeconds) defined in your pod spec.
Alternatively, you can use the preStop hook, which blocks until all the requests are drained. Most likely you'll need a script to accomplish this.
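As a sketch, a preStop hook that waits for sessions to drain might look like this in the pod spec; the image name, admin port, and the session-count task are assumptions, not something from the question:

```yaml
# Sketch of the relevant pod spec fields (names and ports are assumptions).
spec:
  terminationGracePeriodSeconds: 600   # must exceed the longest expected drain
  containers:
  - name: dropwizard-app
    image: registry.example.com/dropwizard-app:latest   # assumed image
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          # assumed DW admin task reporting the remaining session count
          - while [ "$(curl -s -X POST localhost:8081/tasks/session-count)" != "0" ]; do sleep 5; done
```

Because the preStop hook runs before SIGTERM is delivered, the app itself needs no signal handling in this variant; the grace period just has to be long enough for the loop to reach zero.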

When marathon restarts process, possible to pass different command-line flag?

I notice that when I'm running a process under Marathon and restart it, the process automatically starts back up. The way the logic of the process works, if it is restarted, it enters a recovery mode where it tries to replay its state. Recovery mode is entered when a command-line flag is seen, such as "-r". I want to append this flag to the cmd command that is initially used during startup in Marathon. Is there an option somewhere in Marathon for this capability?
I solved my issue by using the event subscriber in Marathon. By using PUT with curl rather than POST, you can modify an existing deployment rather than recreating a brand new one with POST.
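A minimal sketch of that PUT, assuming the standard Marathon /v2/apps endpoint; the app id, base command, and "-r" flag are examples from the question, and the Marathon URL is a placeholder:

```python
# Sketch: switch an existing Marathon app into recovery mode by PUTting an
# updated cmd, which edits the app definition in place (rolling restart).
import json
import urllib.request

def build_update(marathon_url, app_id, new_cmd):
    """Build the PUT /v2/apps/<id> request carrying the new cmd."""
    req = urllib.request.Request(
        f"{marathon_url}/v2/apps/{app_id}",
        data=json.dumps({"cmd": new_cmd}).encode(),
        method="PUT",
    )
    req.add_header("Content-Type", "application/json")
    return req

def enter_recovery_mode(marathon_url, app_id, base_cmd):
    """Append the recovery flag and submit the update."""
    req = build_update(marathon_url, app_id, base_cmd + " -r")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Since PUT updates the app rather than creating a new one, Marathon restarts the tasks with the modified cmd instead of deploying a duplicate application.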