What is the DNS Record logged by kubedns?

I'm using Google Container Engine and I'm noticing entries like the following in my logs:
{
"insertId": "1qfzyonf2z1q0m",
"internalId": {
"projectNumber": "1009253435077"
},
"labels": {
"compute.googleapis.com/resource_id": "710923338689591312",
"compute.googleapis.com/resource_name": "fluentd-cloud-logging-gke-gas2016-4fe456307445d52d-worker-pool-",
"compute.googleapis.com/resource_type": "instance",
"container.googleapis.com/cluster_name": "gas2016-4fe456307445d52d",
"container.googleapis.com/container_name": "kubedns",
"container.googleapis.com/instance_id": "710923338689591312",
"container.googleapis.com/namespace_name": "kube-system",
"container.googleapis.com/pod_name": "kube-dns-v17-e4rr2",
"container.googleapis.com/stream": "stderr"
},
"logName": "projects/cml-236417448818/logs/kubedns",
"resource": {
"labels": {
"cluster_name": "gas2016-4fe456307445d52d",
"container_name": "kubedns",
"instance_id": "710923338689591312",
"namespace_id": "kube-system",
"pod_id": "kube-dns-v17-e4rr2",
"zone": "us-central1-f"
},
"type": "container"
},
"severity": "ERROR",
"textPayload": "I0718 17:05:20.552572 1 dns.go:660] DNS Record:&{worker-7.default.svc.cluster.local. 6000 10 10 false 30 0 }, hash:f97f8525\n",
"timestamp": "2016-07-18T17:05:20.000Z"
}
Is this an actual error or is the severity incorrect? Where can I find the definition for the struct that is being printed?

The severity is incorrect. This is tracing/debugging output that shouldn't have been left in the binary; it has been removed from the source since 1.3 was cut, so it will be gone in a future release.
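If you want to confirm which kube-dns image your cluster is actually running before deciding to ignore these entries, something like the following should work (the replication controller name kube-dns-v17 is an assumption taken from the pod name in the log labels):

kubectl get rc kube-dns-v17 -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].image}'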
See also: Google container engine cluster showing large number of dns errors in logs

Related

Extract status of Kubernetes CR created via ansible-operator

I am new to JSONPath queries and am having trouble extracting status.conditions[ansibleResult].type.
I have a CRD defined and have created a CR against it, which is picked up by operator-sdk running Ansible in the background. I am updating the CRD so that it surfaces the relevant status once the CR is accepted and processed by operator-sdk.
The CR output in JSON looks like this:
{
"apiVersion": "vault.cpe.oraclecloud.com/v1alpha1",
"kind": "OciVaultKeys",
"metadata": {
"annotations": {
"kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"vault.cpe.oraclecloud.com/v1alpha1\",\"kind\":\"OciVaultKeys\",\"metadata\":{\"annotations\":{},\"name\":\"operator-key-broken\",\"namespace\":\"tms\"},\"spec\":{\"freeformTags\":[{\"key\":\"Type\",\"value\":\"Optional-Values-Added\"}],\"ociVaultKeyName\":\"operator-key-broken\",\"ociVaultKeyShapeAlgorithm\":\"RSA\",\"ociVaultKeyShapeLength\":32,\"ociVaultName\":\"ocivault-sample-12\"}}\n"
},
"creationTimestamp": "2022-03-18T07:43:03Z",
"finalizers": [
"vault.cpe.oraclecloud.com/finalizer"
],
"generation": 1,
"name": "operator-key-broken",
"namespace": "tms",
"resourceVersion": "717880023",
"selfLink": "/apis/vault.cpe.oraclecloud.com/v1alpha1/namespaces/tms/ocivaultkeys/operator-key-broken",
"uid": "0d634e72-f592-48e0-be9b-ebfa017b2dfe"
},
"spec": {
"freeformTags": [
{
"key": "Type",
"value": "Optional-Values-Added"
}
],
"ociVaultKeyName": "operator-key-broken",
"ociVaultKeyShapeAlgorithm": "RSA",
"ociVaultKeyShapeLength": 32,
"ociVaultName": "ocivault-sample-12"
},
"status": {
"conditions": [
{
"lastTransitionTime": "2022-03-18T07:43:27Z",
"message": "",
"reason": "",
"status": "False",
"type": "Successful"
},
{
"lastTransitionTime": "2022-03-18T08:26:08Z",
"message": "Running reconciliation",
"reason": "Running",
"status": "False",
"type": "Running"
},
{
"ansibleResult": {
"changed": 0,
"completion": "2022-03-18T08:26:24.217728",
"failures": 1,
"ok": 14,
"skipped": 1
},
"lastTransitionTime": "2022-03-18T08:26:25Z",
"message": "The task includes an option with an undefined variable. The error was: No first item, sequence was empty.\n\nThe error appears to be in '/home/opc/cpe-workstation/mr_folder/workspace-2/osvc-kubernetes-operators/oci-services/roles/ocivaultkeys/tasks/fetch_vault_details_oci.yml': line 12, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: DEBUG | Fetch Vault Details | Extract Vault OCID n service_endpoint in source region\n ^ here\n",
"reason": "Failed",
"status": "True",
"type": "Failure"
}
]
}
}
I wish to reliably extract status.conditions[].type (specifically from the element that contains ansibleResult) in the CRD.
An extract of the CRD definition is below:
- name: v1alpha1
  served: true
  storage: true
  additionalPrinterColumns:
  - description: 'Status of the OCI Vault Key'
    jsonPath: .status.conditions[-1].type
    name: STATUS
    type: string
    priority: 0
The CRD needs a jsonPath expression to extract this value.
Thanks
Please try the following:
kubectl get ocivaultkeys operator-key-broken -o jsonpath='{.status.conditions[?(@.ansibleResult)].type}'
Expected output: Failure
jsonpath help
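For reference, a couple of variations on the same idea (the -n tms namespace comes from the CR metadata above; the second filter just slices the same conditions array by type):

kubectl get ocivaultkeys operator-key-broken -n tms \
  -o jsonpath='{.status.conditions[?(@.ansibleResult)].type}'
# prints: Failure

kubectl get ocivaultkeys operator-key-broken -n tms \
  -o jsonpath='{.status.conditions[?(@.type=="Failure")].message}'
# prints the Ansible failure message

Note that I am not certain the filter syntax is accepted inside additionalPrinterColumns jsonPath; the commands above extract the value client-side with kubectl.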

Pod is being terminated and created again due to scale-up and it's running twice

I have an application that runs some code and, at the end, sends an email with a report of the data. When I deploy pods on GKE, certain pods get terminated and new pods are created due to autoscaling, but the problem is that the termination happens after my code has finished, so the email is sent twice for the same data.
Here is the JSON file sent to the deploy API:
{
"apiVersion": "batch/v1",
"kind": "Job",
"metadata": {
"name": "$name",
"namespace": "$namespace"
},
"spec": {
"template": {
"metadata": {
"name": "********"
},
"spec": {
"priorityClassName": "high-priority",
"containers": [
{
"name": "******",
"image": "$dockerScancatalogueImageRepo",
"imagePullPolicy": "IfNotPresent",
"env": $env,
"resources": {
"requests": {
"memory": "2000Mi",
"cpu": "2000m"
},
"limits":{
"memory":"2650Mi",
"cpu":"2650m"
}
}
}
],
"imagePullSecrets": [
{
"name": "docker-secret"
}
],
"restartPolicy": "Never"
}
}
}
}
I also have a screenshot of the pod events (not reproduced here).
Any idea how to fix this?
Thank you in advance.
"Perhaps you are affected by this "Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice." from doc. What happens if you increase terminationgraceperiodseconds in your yaml file? – "
#danyL
My problem was that I had other jobs that deploy pods on my nodes with higher priority, so Kubernetes was trying to preempt my running pods even though the job was already done and the email had already been sent. I fixed the problem by adjusting the resource requests and limits in all my JSON files. I don't know if it's the perfect solution, but for now it solved my problem.
Thank you all for your help.
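As a side note (not part of the original answer): Kubernetes Jobs are at-least-once, so the most robust fix is to make the email step idempotent. That said, a minimal sketch of the batch/v1 Job spec fields that limit retries and parallel runs, with the values chosen as assumptions:

"spec": {
  "parallelism": 1,
  "completions": 1,
  "backoffLimit": 0,
  "template": {
    "spec": {
      "restartPolicy": "Never"
    }
  }
}

Setting backoffLimit to 0 stops the Job controller from recreating a failed pod (the default is 6 retries), which reduces, but does not eliminate, the chance of the report being sent twice.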

Google Storage AuditLogs - finding who is trying to access

I have a Google Storage bucket with audit logs enabled. Every one or two days I get logs about PERMISSION_DENIED. The log specifies what kind of access the requester is asking for, but it does not give me enough information to answer the question: who is requesting?
This is the log message:
{
"insertId": "rr6wsd...",
"logName": "projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Fdata_access",
"protoPayload": {
"#type": "type.googleapis.com/google.cloud.audit.AuditLog",
"authenticationInfo": {},
"authorizationInfo": [
{
"permission": "storage.buckets.get",
"resource": "projects//buckets/BUCKET_NAME",
"resourceAttributes": {}
}
],
"methodName": "storage.buckets.get",
"requestMetadata": {
"callerSuppliedUserAgent": "Blob/1 (cr/340918833)",
"destinationAttributes": {},
"requestAttributes": {
"auth": {},
"reason": "8uSywAZKWkhOZWVkZWQg...",
"time": "2021-01-20T03:43:38.405230045Z"
}
},
"resourceLocation": {
"currentLocations": [
"us-central1"
]
},
"resourceName": "projects//buckets/BUCKET_NAME",
"serviceName": "storage.googleapis.com",
"status": {
"code": 7,
"message": "PERMISSION_DENIED"
}
},
"receiveTimestamp": "2021-01-20T03:43:38.488787956Z",
"resource": {
"labels": {
"bucket_name": "BUCKET_NAME",
"location": "us-central1",
"project_id": "PROJECT_ID"
},
"type": "gcs_bucket"
},
"severity": "ERROR",
"timestamp": "2021-01-20T03:43:38.399417759Z"
}
As you can see, the only piece of information that says anything about who is trying to access is
"callerSuppliedUserAgent": "Blob/1 (cr/340918833)",
But what does that mean? It means nothing to me.
How can I figure out who is trying to use this permission?
The callerSuppliedUserAgent can be anything the client application puts in its request headers. Ignore it, as this header can be faked; only legitimate applications put anything meaningful in it.
This is an unauthenticated request. There is no identity to record. Most likely a troll scanning the Internet looking for open buckets.
Notice that the auth key is empty. No authorization was provided in the request.
"requestAttributes": {
"auth": {},
"reason": "8uSywAZKWkhOZWVkZWQg...",
"time": "2021-01-20T03:43:38.405230045Z"
}
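If you want to confirm that every one of these entries is unauthenticated, the data-access audit log can be pulled in bulk with gcloud and the authenticationInfo field inspected; the filter values below are assumptions based on the entry shown above:

gcloud logging read \
  'logName:"cloudaudit.googleapis.com%2Fdata_access" AND resource.type="gcs_bucket" AND resource.labels.bucket_name="BUCKET_NAME" AND protoPayload.status.code=7' \
  --limit=20 --format=json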

What thresholds should be set in Service Fabric Placement / Load balancing config for Cluster with large number of guest executable applications?

I am having trouble with Service Fabric trying to place too many services onto a single node too fast.
To give an example of cluster size: there are 2-4 worker node types, there are 3-6 worker nodes per node type, each node type may run 200 guest executable applications, and each application will have at least 2 replicas. The nodes are more than capable of running the services once they are up; it is just at startup time that CPU is too high.
The problem seems to be the thresholds or defaults for placement and load balancing rules set in the cluster config. As examples of what I have tried: I have turned on InBuildThrottlingEnabled and set InBuildThrottlingGlobalMaxValue to 100, I have set the Global Movement Throttle settings to be various percentages of the total application count.
At this point there are two distinct scenarios I am trying to solve for. In both cases, the nodes go to 100% for an amount of time such that service fabric declares the node as down.
1st: Starting an entire cluster from all nodes being off without overwhelming nodes.
2nd: A single node being overwhelmed by too many services starting after a host comes back online
Here are my current parameters on the cluster:
"Name": "PlacementAndLoadBalancing",
"Parameters": [
{
"Name": "UseMoveCostReports",
"Value": "true"
},
{
"Name": "PLBRefreshGap",
"Value": "1"
},
{
"Name": "MinPlacementInterval",
"Value": "30.0"
},
{
"Name": "MinLoadBalancingInterval",
"Value": "30.0"
},
{
"Name": "MinConstraintCheckInterval",
"Value": "30.0"
},
{
"Name": "GlobalMovementThrottleThresholdForPlacement",
"Value": "25"
},
{
"Name": "GlobalMovementThrottleThresholdForBalancing",
"Value": "25"
},
{
"Name": "GlobalMovementThrottleThreshold",
"Value": "25"
},
{
"Name": "GlobalMovementThrottleCountingInterval",
"Value": "450"
},
{
"Name": "InBuildThrottlingEnabled",
"Value": "false"
},
{
"Name": "InBuildThrottlingGlobalMaxValue",
"Value": "100"
}
]
},
Based on the discussion in the answer below, I wanted to leave a graph image (not reproduced here): if a node goes down, the act of shuffling services onto the remaining nodes will cause a second node to go down, as noted there. The green node goes down, then the purple node goes down due to too many resources being shuffled onto it.
From SF's perspective, 1 & 2 are the same problem. Also as a note, SF doesn't evict a node just because CPU consumption is high. So: "The nodes go to 100% for an amount of time such that service fabric declares the node as down." needs some more explanation. The machines might be failing for other reasons, or I guess could be so loaded that the kernel level failure detectors can't ping other machines, but that isn't very common.
For config changes: I would remove all of these to go with the defaults
{
"Name": "PLBRefreshGap",
"Value": "1"
},
{
"Name": "MinPlacementInterval",
"Value": "30.0"
},
{
"Name": "MinLoadBalancingInterval",
"Value": "30.0"
},
{
"Name": "MinConstraintCheckInterval",
"Value": "30.0"
},
For the inbuild throttle to work, this needs to flip to true:
{
"Name": "InBuildThrottlingEnabled",
"Value": "false"
},
Also, since these are likely constraint violations and placement (not proactive rebalancing), we need to explicitly instruct SF to throttle those operations as well. There is config for this in SF; although it is not documented or publicly supported at this time, you can see it in the settings. By default only balancing is throttled, but you should be able to turn on throttling for all phases and set appropriate limits via something like the below.
These first two settings are also within PlacementAndLoadBalancing, like the ones above.
{
"Name": "ThrottlePlacementPhase",
"Value": "true"
},
{
"Name": "ThrottleConstraintCheckPhase",
"Value": "true"
},
The next settings, which set the limits, go in their own sections; each is a map from node type name to the limit you want to apply for that node type.
{
"name": "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
"parameters": [
{
"name": "YourNodeTypeNameHere",
"value": "100"
},
{
"name": "YourOtherNodeTypeNameHere",
"value": "100"
}
]
},
{
"name": "MaximumInBuildReplicasPerNodePlacementThrottle",
"parameters": [
{
"name": "YourNodeTypeNameHere",
"value": "100"
},
{
"name": "YourOtherNodeTypeNameHere",
"value": "100"
}
]
},
{
"name": "MaximumInBuildReplicasPerNodeBalancingThrottle",
"parameters": [
{
"name": "YourNodeTypeNameHere",
"value": "100"
},
{
"name": "YourOtherNodeTypeNameHere",
"value": "100"
}
]
},
{
"name": "MaximumInBuildReplicasPerNode",
"parameters": [
{
"name": "YourNodeTypeNameHere",
"value": "100"
},
{
"name": "YourOtherNodeTypeNameHere",
"value": "100"
}
]
}
I would make these changes and then try again. Additional information like what is actually causing the nodes to be down (confirmed via events and SF health info) would help identify the source of the problem. It would probably also be good to verify that starting 100 instances of the apps on the node actually works and whether that's an appropriate threshold.
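If it helps, the node and cluster health events the answer refers to can be pulled with the sfctl CLI (assuming sfctl is set up against the cluster; the node name below is a placeholder):

sfctl cluster health
sfctl node list
sfctl node health --node-name YourNodeNameHere

The health output should show whether the node was actually reported down or just flagged with warnings, which narrows down whether CPU saturation at startup is really the cause.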

On what metric weave net should be alerted on?

WeaveNet exposes the following Prometheus metrics.
Do the alerts below look correct to alert on? At what values of these metrics should we raise alerts to monitor weave-net health?
WeaveNoFastDP weave_flows[5m] > 0
WeaveIPAMUnreachable weave_ipam_unreachable_percentage > 0
WeaveIPAMPendingAllocates weave_ipam_pending_allocates > 0
WeavePendingClaims weave_ipam_pending_claims > 0
WeaveConnecTerm weave_connection_terminations_total > 300
I made Grafana dashboards on top of the Weave metrics.
Here are the dashboards:
WeaveNet https://grafana.com/grafana/dashboards/11789
WeaveNet(Cluster) https://grafana.com/grafana/dashboards/11804
Here are the useful metrics on which Weave Net should be monitored. The alerts below are in JSON format.
{
"groups": [
{
"name": "nodeagent",
"rules": [
{
"alert": "UnhealthyNodes",
"expr": "changes(central_nodeagent:node_route_unhealthy_count[3m]) > 0",
"for": "1m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "Unhealthy nodes in the cluster. Go to prometheus the below prometheus link for details.",
"description": "Actionable: Find why the node(s) are unhealthy and fix it."
}
}
]
},
{
"name": "weave-net",
"rules": [
{
"alert": "WeaveNetIPAMSPlitBrain",
"expr": "max(weave_ipam_unreachable_percentage) - min(weave_ipam_unreachable_percentage) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNetIPAM has a split brain. Go to the below prometheus link for details.",
"description": "Actionable: Every node should see same unreachability percentage. Please check and fix why it is not so."
}
},
{
"alert": "WeaveNetIPAMUnreachable",
"expr": "weave_ipam_unreachable_percentage[10m] > 25",
"for": "10m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNetIPAM unreachability percentage is above threshold. Go to the below prometheus link for details.",
"description": "Actionable: Find why the unreachability threshold have increased from threshold and fix it. WeaveNet is responsible to keep it under control. Weave rm peer deployment can help clean things."
}
},
{
"alert": "WeaveNetIPAMPendingAllocates",
"expr": "sum(weave_ipam_pending_allocates) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet IPAM has pending allocates. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for IPAM allocates to be in pending state and fix it."
}
},
{
"alert": "WeaveNetIPAMPendingClaims",
"expr": "sum(weave_ipam_pending_claims) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet IPAM has pending claims. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for IPAM claims to be in pending state and fix it."
}
},
{
"alert": "WeaveNetFastDPFlowsLow",
"expr": "sum(weave_flows) < 15000",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet total FastDP flows is below threshold. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for fast dp flows dropping below the threshold."
}
},
{
"alert": "WeaveNetFastDPFlowsOff",
"expr": "sum(weave_flows == bool 0) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "WeaveNet FastDP flows is not happening in some or all nodes. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for fast dp being off."
}
},
{
"alert": "WeaveNetHighConnectionTerminationRate",
"expr": "rate(weave_connection_terminations_total[5m]) > 0.1",
"for": "5m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are getting terminated. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason for high connection termination rate and fix it."
}
},
{
"alert": "WeaveNetConnectionsConnecting",
"expr": "sum(weave_connections{state='connecting'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in connecting state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
},
{
"alert": "WeaveNetConnectionsRetying",
"expr": "sum(weave_connections{state='retrying'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in retrying state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
},
{
"alert": "WeaveNetConnectionsPending",
"expr": "sum(weave_connections{state='pending'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in pending state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
},
{
"alert": "WeaveNetConnectionsFailed",
"expr": "sum(weave_connections{state='failed'}) > 0",
"for": "3m",
"labels": {
"severity": "critical"
},
"annotations": {
"summary": "A lot of connections are in failed state. Go to the below prometheus link for details.",
"description": "Actionable: Find the reason and fix it."
}
}
]
}
]
}
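One formatting note (my addition, not from the original answer): Prometheus itself loads rule files as YAML, so each JSON group above has to be translated before Prometheus can consume it directly. A minimal sketch of the split-brain rule in rule-file form, assuming a file name of weave-net-rules.yml:

groups:
- name: weave-net
  rules:
  - alert: WeaveNetIPAMSplitBrain
    expr: max(weave_ipam_unreachable_percentage) - min(weave_ipam_unreachable_percentage) > 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: WeaveNet IPAM has a split brain.
      description: "Actionable: every node should report the same unreachability percentage."

The file can be validated with promtool check rules weave-net-rules.yml before pointing Prometheus at it.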