Pod is being terminated and created again due to scale-up and the job runs twice - kubernetes

I have an application that runs some code and, at the end, sends an email with a report of the data. When I deploy pods on GKE, certain pods get terminated and new pods are created because of autoscaling. The problem is that the termination happens after my code has finished, so the email is sent twice for the same data.
Here is the JSON file I send to the deploy API:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "name": "$name",
    "namespace": "$namespace"
  },
  "spec": {
    "template": {
      "metadata": {
        "name": "********"
      },
      "spec": {
        "priorityClassName": "high-priority",
        "containers": [
          {
            "name": "******",
            "image": "$dockerScancatalogueImageRepo",
            "imagePullPolicy": "IfNotPresent",
            "env": $env,
            "resources": {
              "requests": {
                "memory": "2000Mi",
                "cpu": "2000m"
              },
              "limits": {
                "memory": "2650Mi",
                "cpu": "2650m"
              }
            }
          }
        ],
        "imagePullSecrets": [
          {
            "name": "docker-secret"
          }
        ],
        "restartPolicy": "Never"
      }
    }
  }
}
And here is a screenshot of the pod events (screenshot omitted).
Any idea how to fix this?
Thank you in advance.

"Perhaps you are affected by this "Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice." from doc. What happens if you increase terminationgraceperiodseconds in your yaml file? – "
#danyL

My problem was that I had other jobs deploying pods on my nodes with higher priority, so Kubernetes was trying to terminate my running pods; by that point the job was already done and the email had already been sent. I fixed the problem by adjusting the resource requests and limits in all my JSON files. I don't know if it's the perfect solution, but for now it solved my problem.
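For reference, one way to tighten the resources is to make the requests match the limits in the container's resources block, which gives the pod the Guaranteed QoS class. This is only a sketch with illustrative values taken from the manifest above; the exact change the asker made isn't stated:
"resources": {
  "requests": {
    "memory": "2650Mi",
    "cpu": "2650m"
  },
  "limits": {
    "memory": "2650Mi",
    "cpu": "2650m"
  }
}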
Thank you all for your help.

How to programmatically modify a running k8s pod's status conditions?

I'm trying to modify the running state of my pod (managed by a deployment controller), both from the command line via kubectl patch and from the k8s Python client API. Neither of them seems to work.
From the command line, I tried both a strategic merge patch and a JSON merge patch, but neither works. For example, I'm trying to patch the pod conditions to set their status fields to False:
kubectl -n foo-ns patch pod foo-pod-18112 -p '{
  "status": {
    "conditions": [
      {
        "type": "PodScheduled",
        "status": "False"
      },
      {
        "type": "Ready",
        "status": "False"
      },
      {
        "type": "ContainersReady",
        "status": "False"
      },
      {
        "type": "Initialized",
        "status": "False"
      }
    ],
    "phase": "Running"
  }
}' --type merge
From the Python API:
# definition of various pod states
ready_true = { "type": "Ready", "status": "True" }
ready_false = { "type": "Ready", "status": "False" }
scheduled_true = { "type": "PodScheduled", "status": "True" }
cont_ready_true = { "type": "ContainersReady", "status": "True" }
cont_ready_false = { "type": "ContainersReady", "status": "False" }
initialized_true = { "type": "Initialized", "status": "True" }
initialized_false = { "type": "Initialized", "status": "False" }
patch = {"status": { "conditions": [ready_false, initialized_false, cont_ready_false, scheduled_true ], "phase" : "Running" }}
p_status = v1.patch_namespaced_pod_status(podname, "default", body=patch)
When I run the above snippet, I don't see any errors, and the response p_status shows all the pod conditions modified as applied in the patch, but I don't see any events from the API server related to this pod status change.
Maybe the deployment controller is rolling the changes back to a working config? I'm looking for a way to patch the pod conditions so I can test whether my custom controller (not related to the question) is able to see those new pod conditions.
You should not do this.
Clients write the desired state in spec:, and controllers write the status: part.
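If you want to see this in action, here is a rough sketch (assuming the official kubernetes Python client and the pod name/namespace from the question): patch the status subresource, wait a bit, and read the pod back. The kubelet owns conditions like Ready, ContainersReady and Initialized and should write them back on its next status sync.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Patch the status subresource (this call succeeds, as observed in the question).
patch = {"status": {"conditions": [{"type": "Ready", "status": "False"}]}}
v1.patch_namespaced_pod_status("foo-pod-18112", "foo-ns", body=patch)

# The kubelet recomputes its conditions from the actual container state,
# so re-reading shortly afterwards should show them restored.
time.sleep(15)
pod = v1.read_namespaced_pod("foo-pod-18112", "foo-ns")
for cond in pod.status.conditions:
    print(cond.type, cond.status)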

What thresholds should be set in Service Fabric Placement / Load balancing config for Cluster with large number of guest executable applications?

I am having trouble with Service Fabric trying to place too many services onto a single node too fast.
To give an example of cluster size: there are 2-4 worker node types, there are 3-6 worker nodes per node type, each node type may run 200 guest executable applications, and each application has at least 2 replicas. The nodes are more than capable of running the services once they are up; it is only during startup that CPU is too high.
The problem seems to be the thresholds or defaults for the placement and load balancing rules set in the cluster config. As examples of what I have tried: I have turned on InBuildThrottlingEnabled and set InBuildThrottlingGlobalMaxValue to 100, and I have set the Global Movement Throttle settings to various percentages of the total application count.
At this point there are two distinct scenarios I am trying to solve for. In both cases, the nodes go to 100% for long enough that Service Fabric declares the node as down.
1st: Starting an entire cluster from all nodes being off, without overwhelming the nodes.
2nd: A single node being overwhelmed by too many services starting after a host comes back online.
Here are my current parameters on the cluster:
"Name": "PlacementAndLoadBalancing",
"Parameters": [
{
"Name": "UseMoveCostReports",
"Value": "true"
},
{
"Name": "PLBRefreshGap",
"Value": "1"
},
{
"Name": "MinPlacementInterval",
"Value": "30.0"
},
{
"Name": "MinLoadBalancingInterval",
"Value": "30.0"
},
{
"Name": "MinConstraintCheckInterval",
"Value": "30.0"
},
{
"Name": "GlobalMovementThrottleThresholdForPlacement",
"Value": "25"
},
{
"Name": "GlobalMovementThrottleThresholdForBalancing",
"Value": "25"
},
{
"Name": "GlobalMovementThrottleThreshold",
"Value": "25"
},
{
"Name": "GlobalMovementThrottleCountingInterval",
"Value": "450"
},
{
"Name": "InBuildThrottlingEnabled",
"Value": "false"
},
{
"Name": "InBuildThrottlingGlobalMaxValue",
"Value": "100"
}
]
},
Based on the discussion in the answer below, I wanted to leave a graph image (omitted here): if a node goes down, the act of shuffling services onto the remaining nodes causes a second node to go down, as noted here. In the graph, the green node goes down first, then the purple one goes down due to too many resources being shuffled onto it.
From SF's perspective, 1 & 2 are the same problem. Also, as a note, SF doesn't evict a node just because CPU consumption is high. So "the nodes go to 100% for long enough that Service Fabric declares the node as down" needs some more explanation. The machines might be failing for other reasons, or I guess they could be so loaded that the kernel-level failure detectors can't ping other machines, but that isn't very common.
For config changes: I would remove all of these and go with the defaults:
{
  "Name": "PLBRefreshGap",
  "Value": "1"
},
{
  "Name": "MinPlacementInterval",
  "Value": "30.0"
},
{
  "Name": "MinLoadBalancingInterval",
  "Value": "30.0"
},
{
  "Name": "MinConstraintCheckInterval",
  "Value": "30.0"
},
For the inbuild throttle to work, this needs to flip to true:
{
  "Name": "InBuildThrottlingEnabled",
  "Value": "false"
},
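That is, the fragment above should read:
{
  "Name": "InBuildThrottlingEnabled",
  "Value": "true"
},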
Also, since these are likely constraint violations and placement (not proactive rebalancing), we need to explicitly instruct SF to throttle those operations as well. There is config for this in SF; although it is not documented or publicly supported at this time, you can see it in the settings. By default only balancing is throttled, but you should be able to turn on throttling for all phases and set appropriate limits via something like the below.
These first two settings are also within PlacementAndLoadBalancing, like the ones above.
{
  "Name": "ThrottlePlacementPhase",
  "Value": "true"
},
{
  "Name": "ThrottleConstraintCheckPhase",
  "Value": "true"
},
The next settings, which set the limits, are in their own sections; each is a map from node type name to the limit you want to apply for that node type.
{
  "name": "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
},
{
  "name": "MaximumInBuildReplicasPerNodePlacementThrottle",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
},
{
  "name": "MaximumInBuildReplicasPerNodeBalancingThrottle",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
},
{
  "name": "MaximumInBuildReplicasPerNode",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
}
I would make these changes and then try again. Additional information like what is actually causing the nodes to be down (confirmed via events and SF health info) would help identify the source of the problem. It would probably also be good to verify that starting 100 instances of the apps on the node actually works and whether that's an appropriate threshold.

OpenShift Job Trigger

We have a batch job that processes flat files and is triggered using a REST call, e.g.:
https://clustername.com/loader?filname=file1.dat
https://clustername.com/loader?filname=file2.dat
https://clustername.com/loader?filname=file3.dat
We want to configure an OpenShift Job to trigger this batch job.
https://docs.openshift.com/container-platform/3.11/dev_guide/jobs.html
As per the Kubernetes documentation, the job can be triggered using a queue:
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
Can the job also be triggered by a REST call?
As others have mentioned, you can instantiate a job by creating a new one via the API.
IIRC you'll make a POST call to /apis/batch/v1/namespaces/<your-namespace>/jobs
(The endpoint may be slightly different depending on your API versions.)
The payload for your REST call is the JSON-formatted manifest for the Job you want to run, e.g.:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "name": "example"
  },
  "spec": {
    "selector": {},
    "template": {
      "metadata": {
        "name": "example"
      },
      "spec": {
        "containers": [
          {
            "name": "example",
            "image": "hello-world"
          }
        ],
        "restartPolicy": "Never"
      }
    }
  }
}
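If you happen to be driving this from Python, the official kubernetes client wraps the same POST for you. This is just a sketch; the namespace is a placeholder, and you could equally issue the POST with curl or any HTTP client plus a bearer token:
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

# The body is the same JSON manifest shown above.
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example"},
    "spec": {
        "template": {
            "metadata": {"name": "example"},
            "spec": {
                "containers": [{"name": "example", "image": "hello-world"}],
                "restartPolicy": "Never",
            },
        }
    },
}

# Equivalent to: POST /apis/batch/v1/namespaces/<your-namespace>/jobs
batch.create_namespaced_job(namespace="<your-namespace>", body=job_manifest)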

What is the DNS Record logged by kubedns?

I'm using Google Container Engine and I'm noticing entries like the following in my logs:
{
  "insertId": "1qfzyonf2z1q0m",
  "internalId": {
    "projectNumber": "1009253435077"
  },
  "labels": {
    "compute.googleapis.com/resource_id": "710923338689591312",
    "compute.googleapis.com/resource_name": "fluentd-cloud-logging-gke-gas2016-4fe456307445d52d-worker-pool-",
    "compute.googleapis.com/resource_type": "instance",
    "container.googleapis.com/cluster_name": "gas2016-4fe456307445d52d",
    "container.googleapis.com/container_name": "kubedns",
    "container.googleapis.com/instance_id": "710923338689591312",
    "container.googleapis.com/namespace_name": "kube-system",
    "container.googleapis.com/pod_name": "kube-dns-v17-e4rr2",
    "container.googleapis.com/stream": "stderr"
  },
  "logName": "projects/cml-236417448818/logs/kubedns",
  "resource": {
    "labels": {
      "cluster_name": "gas2016-4fe456307445d52d",
      "container_name": "kubedns",
      "instance_id": "710923338689591312",
      "namespace_id": "kube-system",
      "pod_id": "kube-dns-v17-e4rr2",
      "zone": "us-central1-f"
    },
    "type": "container"
  },
  "severity": "ERROR",
  "textPayload": "I0718 17:05:20.552572 1 dns.go:660] DNS Record:&{worker-7.default.svc.cluster.local. 6000 10 10 false 30 0 }, hash:f97f8525\n",
  "timestamp": "2016-07-18T17:05:20.000Z"
}
Is this an actual error or is the severity incorrect? Where can I find the definition for the struct that is being printed?
The severity is incorrect. This is some tracing/debugging output that shouldn't have been left in the binary; it has been removed from the source since 1.3 was cut, so it will be gone in a future release.
See also: Google container engine cluster showing large number of dns errors in logs

Why does a GCE volume mount in a kubernetes pod cause a delay?

I have a Kubernetes pod to which I attach a GCE persistent disk using a persistent volume claim. (For the even worse issue without a volume claim, see: Mounting a gcePersistentDisk kubernetes volume is very slow)
When there is no volume attached, the pod starts in no time (2 seconds at most). But when the pod has a GCE persistent volume mount, the Running state is only reached after somewhere between 20 and 60 seconds. I tested with different disk sizes (10, 200, 500 GiB) and multiple pod creations, and the size does not seem to be correlated with the delay.
This delay happens not only at the initial start but also when rolling updates are performed with the replication controllers or when the code crashes at runtime.
Below are the Kubernetes specifications:
The replication controller
{
  "apiVersion": "v1",
  "kind": "ReplicationController",
  "metadata": {
    "name": "a1"
  },
  "spec": {
    "replicas": 1,
    "template": {
      "metadata": {
        "labels": {
          "app": "a1"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "a1-setup",
            "image": "nginx",
            "ports": [
              {
                "containerPort": 80
              },
              {
                "containerPort": 443
              }
            ]
          }
        ]
      }
    }
  }
}
The volume claim
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "myclaim"
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "10Gi"
      }
    }
  }
}
And the volume
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": {
    "name": "mydisk",
    "labels": {
      "name": "mydisk"
    }
  },
  "spec": {
    "capacity": {
      "storage": "10Gi"
    },
    "accessModes": [
      "ReadWriteOnce"
    ],
    "gcePersistentDisk": {
      "pdName": "a1-drive",
      "fsType": "ext4"
    }
  }
}
Also: GCE (along with AWS and OpenStack) must first attach a disk/volume to the node before it can be mounted and exposed to your pod. The time required for attachment depends on the cloud provider.
In the case of pods created by a ReplicationController, there is an additional detach operation that has to happen. The same disk cannot be attached to more than one node (at least not in read/write mode). Detaching and pod cleanup happen in a different thread than attaching. To be specific, Kubelet running on a node has to reconcile the pods it currently has (and the sum of their volumes) with the volumes currently present on the node. Orphaned volumes are unmounted and detached. If your pod was scheduled on a different node, it must wait until the original node detaches the volume.
The cluster eventually reaches the correct state, but it might take time for each component to get there. This is your wait time.
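If you want to see where the time goes for a particular pod, kubectl describe pod <pod-name> shows its event stream; the gap between the scheduling event and the container start events roughly corresponds to this attach/mount wait.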