Is it possible to trigger some alerts on the Prometheus dashboard by manually stopping respective services on the Kubernetes cluster in order to verify that I'm receiving alert for issues on Prometheus dashboard ?
I would recommend using tools such as chaos toolkit to do this declaratively and automatically instead of doing it manually. This is called chaos engineering more generally.
{
"title": "Do we remain available in face of pod going down?",
"description": "We expect Kubernetes to handle the situation gracefully when a pod goes down",
"tags": ["kubernetes"],
"steady-state-hypothesis": {
"title": "Verifying service remains healthy",
"probes": [
{
"name": "all-our-microservices-should-be-healthy",
"type": "probe",
"tolerance": true,
"provider": {
"type": "python",
"module": "chaosk8s.probes",
"func": "microservice_available_and_healthy",
"arguments": {
"name": "myapp"
}
}
}
]
},
"method": [
{
"type": "action",
"name": "terminate-db-pod",
"provider": {
"type": "python",
"module": "chaosk8s.pod.actions",
"func": "terminate_pods",
"arguments": {
"label_selector": "app=my-app",
"name_pattern": "my-app-[0-9]$",
"rand": true
}
},
"pauses": {
"after": 5
}
}
]
}
You can use Gremlin to achieve this goal too. First, install the Gremlin agent on your Kubernetes cluster using the helm chart: https://github.com/gremlin/helm/
Next, shutdown the specific services using the Kubernetes features within Gremlin. You can control the blast radius by selecting 1 pod/1 service or many pods/services. This is a tutorial that I wrote on this topic: https://www.gremlin.com/community/tutorials/how-to-install-and-use-gremlin-with-kubernetes/.
Validating monitoring and alerting is a great use case for Chaos Engineering. As you said, triggering alerts on the Prometheus dashboard by manually stopping respective services on the Kubernetes cluster. This will enable you to verify alerts for issues on your Prometheus dashboard. This tutorial explains how to use Gremlin webhooks with Grafana and Prometheus: https://www.gremlin.com/community/tutorials/visualize-chaos-experiments-in-grafana-with-gremlin-webhooks/
Related
I have grafana Dashboards, Pods drop down coming None within namespace, however we have pods running in namespace and pulling data prometheus.
Screenshot:
Query:
"datasource": "Prometheus",
"definition": "",
"description": null,
"error": null,
"hide": 0,
"includeAll": false,
"label": "Pod",
"multi": false,
"name": "pod",
"options": [],
"query": {
"query": "query_result(sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) by (pod_name))",
"refId": "Prometheus-pod-Variable-Query"
},
"refresh": 1,
"regex": "/pod_name=\\\"(.*?)(\\\")/",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
I am imported json code:
https://grafana.com/grafana/dashboards/6879
Edit your dashboard's JSON:
Rename "pod_name" to "pod" in the 2 places (and save)
Looks like this grafana dashboard was created with older kubernetes version,
and metrics internals since changed.
Probably will also need similar edits for "container_name" changing to "container" in these older dashboards
This might not be a full answer, but I cannot yet comment.
The linked dashboard imports and works for me fine. So I suspect one of these:
Prometheus scraping is not running (correctly). You could enter directly into the Prometheus app and check whether if the container_memory_working_set_bytes metric has any value at all, anywhere.
The kube_system system namespace might be restricted with respect to scraping and such. If another namespace works and only this one doesn't, then this is the case.
I am getting resource not found in resource group error while deploying arm template.Could someone help
please .Below is the sample template used:
{
"name": "[variables('AppName')]",
"type": "Microsoft.Web/sites",
"apiVersion": "2016-08-01",
"kind": "app",
"location": "xx",
"identity": {
"type": "SystemAssigned"
},
"properties": {
"httpsOnly": true,
"clientAffinityEnabled": false,
"serverFarmId": "xx"
},
"resources": [
{
"name": "appsettings",
"type": "config",
"apiVersion": "2016-08-01",
"properties": {
xx:xx
},
"dependsOn": [
"[resourceId('Microsoft.Web/sites', variables('AppName'))]",
"[resourceId('Microsoft.KeyVault/vaults/secrets', variables('keyVaultName'),'xx')]",
"[resourceId('Microsoft.KeyVault/vaults/secrets', variables('keyVaultName'),'xx')]",
"[resourceId('Microsoft.KeyVault/vaults/secrets', variables('keyVaultName'),'xx')]"
]
}
]
},
{
"type": "Microsoft.Web/sites/config",
"apiVersion": "2016-08-01",
"name": "[concat(variables('AppName'), '/web')]",
"location": "xx",
"dependsOn": [
"[resourceId('Microsoft.Web/sites', variables('AppName'))]"
],
}
Let me know is this the right way to do
its hard to tell without having the exact template and all the variables\parameters, but generally it means one of the following:
wrong name used for the resources that depend on the webapp somewhere
wrong location used for the resources that depend on the webapp somewhere
dependsOn isn't setup properly and it doesnt wait for the webapp and attempts to create a resource in parallel with the web app
Have you ever use the same ARM template to deploy succeed?
Also kindly check if you could directly use script to deploy locally successful without using Azure DevOps. This will help to narrow down the issue.
##[error]ResourceNotFound: The Resource 'Microsoft.Web/sites/xx' under resource group 'yy' was not found in deploying ARM template
This error indicate Resource Manager needs to retrieve the properties for a resource, but can't find the resource in your subscriptions.
You could give a try with below solution:
Solution 1 - check resource properties
Solution 2 - set dependencies
Solution 3 - get external resource
Solution 4 - get managed identity from resource
Solution 5 - check functions
More details please take a look at our official doc here-- Resolve resource not found errors
I am trying to deploy data factory using ARM template. It is easy to use the exported template to create a deployment pipeline.
However, as the data factory needs to access an on-premise database server, I need to have an integrated runtime. The problem is how can I include the run time in the arm template?
The template looks like this and we can see that it is trying to include the runtime:
{
"name": "[concat(parameters('factoryName'), '/OnPremisesSqlServer')]",
"type": "Microsoft.DataFactory/factories/linkedServices",
"apiVersion": "2018-06-01",
"properties":
{
"annotations": [],
"type": "SqlServer",
"typeProperties": {
"connectionString": "[parameters('OnPremisesSqlServer_connectionString')]"
},
"connectVia": {
"referenceName": "OnPremisesSqlServer",
"type": "IntegrationRuntimeReference"
}
},
"dependsOn": [
"[concat(variables('factoryId'), '/integrationRuntimes/OnPremisesSqlServer')]"
]
},
{
"name": "[concat(parameters('factoryName'), '/OnPremisesSqlServer')]",
"type": "Microsoft.DataFactory/factories/integrationRuntimes",
"apiVersion": "2018-06-01",
"properties": {
"type": "SelfHosted",
"typeProperties": {}
},
"dependsOn": []
}
Running this template gives me this error:
\"connectVia\": {\r\n \"referenceName\": \"OnPremisesSqlServer\",\r\n \"type\": \"IntegrationRuntimeReference\"\r\n }\r\n }\r\n} and error is: Failed to encrypted linked service credentials on self-hosted IR 'OnPremisesSqlServer', reason is: NotFound, error message is: No online instance..
The problem is that I will need to type in some key in the integrated runtime's UI, so it can be registered in azure. But I can only get that key from my data factory instance's UI. So above arm template deployment will always fail at least once. I am wondering if there is a way to create the run time independently?
The problem is that I will need to type in some key in the integrated
runtime's UI, so it can be registered in azure. But I can only get
that key from my data factory instance's UI. So above arm template
deployment will always fail at least once. I am wondering if there is
a way to create the run time independently?
It seems that you already know how to create Self-Hosted IR in the ADF ARM.
{
"name": "[concat(parameters('dataFactoryName'), '/integrationRuntime1')]",
"type": "Microsoft.DataFactory/factories/integrationRuntimes",
"apiVersion": "2018-06-01",
"properties": {
"additionalProperties": {},
"description": "jaygongIR1",
"type": "SelfHosted"
}
}
Result:
Only you concern is that Windows IR Tool need to be configured with AUTHENTICATION KEY to access ADF Self-Hosted IR node.So,it should be Unavailable status once it is created.This flow is make sense i think,authenticate key should be created first,then you can use it to configure On-Premise Tool.You can't implement all of things in one step because these behaviors are operated on both of azure and on-premise sides.
Based on the Self-Hosted IR Tool document ,the Register steps can't be implemented with Powershell code. So,all steps can't be processed in the flow are creating IR and getting Auth key,not for Registering in the tool.
As a QA in our company I am daily user of kubernetes, and we use kubernetes job to create performance tests pods. One advantage of job, according to the docs, is
to create one Job object in order to reliably run one Pod to completion
But in our tests this feature will create infinite pods if previous ones fail, which will occupy resources of our team's shared cluster, and deleting such pods will take a lot of time. see this image:
Currently the job manifest is like this:
{
"apiVersion": "batch/v1",
"kind": "Job",
"metadata": {
"name": "upgradeperf",
"namespace": "ntg6-grpc26-tts"
},
"spec": {
"template": {
"spec": {
"containers": [
{
"name": "upgradeperfjob",
"image":
"mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1",
"command": [
"python",
"/jmeterwork/jmeter.py",
"-gu",
"git#gitlab-pri-eastus2.dev.mycompany.net:mobility-ncs-tools/tts-cdqa-tool.git",
"-gb",
"upgradeperf",
"-t",
"JMeter/testcases/ttssvc/JMeterTestPlan_ttssvc_cmpsize.jmx",
"-JtestDataFile",
"JMeter/testcases/ttssvc/testData/avaml_opus.csv",
"-JthreadNum",
"3",
"-JthreadLoopCount",
"1500",
"-JresultsFile",
"results_upgradeperf_cavaml_opus_t3_l1500.csv",
"-Jhost",
"mtl-blade32-03.mycompany.com",
"-Jport",
"28416"
]
}
],
"restartPolicy": "Never",
"imagePullSecrets": [
{
"name": "docker-registry-secret"
}
]
}
}
}
}
In some cases, such as misconfiguring of ip/ports, 'reliably run one Pod to completion' is impossible and recreating pods is waste of time and resource.
So is it possible, and how to limit kubernetes job to create a maxium number(say 3) of pods if always fail?
Depending on your kubernetes version, you can resolve this problem with these methods:
set the option: restartPolicy: OnFailure, then the failed container will be restarted in the same Pod, so you will not get lots of failed Pods, instead you will get a Pod with lots of restart.
From Kubernetes 1.8 on, There is a parameter backoffLimit to control the restart policy of failed job. This parameter defines the retry times of the job before treating the job to be failed, default 6 times. For this parameter to work you must set the parameter restartPolicy: Never .
You probably didn't set restartPolicy: Never in your pod spec, add that and I would expect it matches your expected behaviors better.
Update: I found executing script on the octopus server is now available in version 3.3, I haven't update my octopus yet but I will take that would work as designed. I'm still wondering if there is a better way to do this without octo.exe?
The task I'm trying to accomplish is after each successful production deployment, automatically schedule a DR deployment to happen next 24 hours.
My desired approach is have octopus do it.
I added a new Octopus step at the end of the deployment only runs upon success of previous step. I attempted to use octo deploy-release --deployAt can be found here in the newly created step.
My challenge is, a script step requires me to pick a target role, which means it will be executed on a tentacle. Also, presence of Octo.exe is required.
I tried to create my own octopus step template, a deployment target role is still required in my customized step.
{
"Id": "ActionTemplates-2",
"Name": "Octopus - Schedule Deployment",
"Description": "Schedule a future octopus deployment",
"ActionType": "Octopus.Script",
"Version": 3,
"Properties": {
"Octopus.Action.Script.Syntax": "PowerShell",
"Octopus.Action.Script.ScriptBody": "--hide--"
},
"SensitiveProperties": {},
"Parameters": [
{
"Name": "OctoPath",
"Label": "Path for Octo.exe",
"HelpText": "Location for octo.exe",
"DefaultValue": null,
"DisplaySettings": {
"Octopus.ControlType": "SingleLineText"
}
},
{
"Name": "projName",
"Label": "Project Name",
"HelpText": "The name of the project should be deployed",
"DefaultValue": null,
"DisplaySettings": {
"Octopus.ControlType": "SingleLineText"
}
},
{
"Name": "days",
"Label": "Days",
"HelpText": "The days in future this deployment would happen",
"DefaultValue": null,
"DisplaySettings": {
"Octopus.ControlType": "SingleLineText"
}
},
{
"Name": "hours",
"Label": "Hours",
"HelpText": "The hours in future this deployment would happen",
"DefaultValue": null,
"DisplaySettings": {
"Octopus.ControlType": "SingleLineText"
}
},
{
"Name": "env",
"Label": "Environment to deploy",
"HelpText": "The environment next deployment should happen",
"DefaultValue": null,
"DisplaySettings": {
"Octopus.ControlType": "SingleLineText"
}
}
],
"$Meta": {
"ExportedAt": "2016-04-20T13:58:54.263Z",
"OctopusVersion": "3.2.0",
"Type": "ActionTemplate"
}
}
Is there a way to alter the template to get rid of the role selection and have octopus server directly execute it as it does for Azure script step?
Is there any another way we can have octopus server automatically schedule the deployment without external help? I guess this go back to first problem, I may still need octopus to run something on the server side.
Note: We kick off production deployment manually, thus I don't have another tool waiting for the response of the deployment. I think it is possible to have a process regularly call out the last deployment and do some analysis then schedule new deployment accordingly but this is not as clean as have octopus do it directly. Injecting octo.exe to a random production machine is not desired at all
You could create new WebAPI project in C#, pull in the Octopus.Deploy nuget package,
write code that accepts HTTP requests, and deals with the scheduling logic.
Host that project on the same server as Octopus server itself. Should be 20-30 minute job to set the website up in IIS.
In your deployment process, add step that creates http request, and done. You could go even one step further, and have the site/service listen for every successful deployment, and do decisions based on that, such that other projects don't have to add extra steps to octopus deployment process.
As you said, polling is also viable option.
Alternatively, if you're on Octopus deploy 3.0, they already expose REST API, I am not sure if it's powerful enough to allow you create scheduled deployment, but you could explore that: https://github.com/OctopusDeploy/OctopusDeploy-Api/wiki/Releases
I agree floating octo.exe in production servers is bad idea. It might get out of sync, and your production server shouldn't deal with this.