On what metrics should Weave Net be alerted? - kubernetes

Weave Net exposes the following Prometheus metrics.
Do the alerts below look correct? At what values of these metrics should we raise alerts to monitor weave-net health?
WeaveNoFastDP weave_flows[5m] > 0
WeaveIPAMUnreachable weave_ipam_unreachable_percentage > 0
WeaveIPAMPendingAllocates weave_ipam_pending_allocates > 0
WeavePendingClaims weave_ipam_pending_claims > 0
WeaveConnecTerm weave_connection_terminations_total > 300

I made Grafana dashboards on top of the Weave metrics. Here are the dashboards:
WeaveNet https://grafana.com/grafana/dashboards/11789
WeaveNet(Cluster) https://grafana.com/grafana/dashboards/11804
Here are the useful metrics on which Weave Net should be monitored. The alerts below are in JSON format.
{
  "groups": [
    {
      "name": "nodeagent",
      "rules": [
        {
          "alert": "UnhealthyNodes",
          "expr": "changes(central_nodeagent:node_route_unhealthy_count[3m]) > 0",
          "for": "1m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "Unhealthy nodes in the cluster. Go to the below Prometheus link for details.",
            "description": "Actionable: Find why the node(s) are unhealthy and fix it."
          }
        }
      ]
    },
    {
      "name": "weave-net",
      "rules": [
        {
          "alert": "WeaveNetIPAMSplitBrain",
          "expr": "max(weave_ipam_unreachable_percentage) - min(weave_ipam_unreachable_percentage) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "WeaveNet IPAM has a split brain. Go to the below Prometheus link for details.",
            "description": "Actionable: Every node should see the same unreachability percentage. Please check and fix why that is not the case."
          }
        },
        {
          "alert": "WeaveNetIPAMUnreachable",
          "expr": "weave_ipam_unreachable_percentage > 25",
          "for": "10m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "WeaveNet IPAM unreachability percentage is above the threshold. Go to the below Prometheus link for details.",
            "description": "Actionable: Find why the unreachability percentage has risen above the threshold and fix it. Weave Net is responsible for keeping it under control. A weave rmpeer deployment can help clean things up."
          }
        },
        {
          "alert": "WeaveNetIPAMPendingAllocates",
          "expr": "sum(weave_ipam_pending_allocates) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "WeaveNet IPAM has pending allocates. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason for IPAM allocates being stuck in the pending state and fix it."
          }
        },
        {
          "alert": "WeaveNetIPAMPendingClaims",
          "expr": "sum(weave_ipam_pending_claims) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "WeaveNet IPAM has pending claims. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason for IPAM claims being stuck in the pending state and fix it."
          }
        },
        {
          "alert": "WeaveNetFastDPFlowsLow",
          "expr": "sum(weave_flows) < 15000",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "WeaveNet total FastDP flows are below the threshold. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason for FastDP flows dropping below the threshold."
          }
        },
        {
          "alert": "WeaveNetFastDPFlowsOff",
          "expr": "sum(weave_flows == bool 0) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "WeaveNet FastDP flows are not happening on some or all nodes. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason for FastDP being off."
          }
        },
        {
          "alert": "WeaveNetHighConnectionTerminationRate",
          "expr": "rate(weave_connection_terminations_total[5m]) > 0.1",
          "for": "5m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "A lot of connections are getting terminated. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason for the high connection termination rate and fix it."
          }
        },
        {
          "alert": "WeaveNetConnectionsConnecting",
          "expr": "sum(weave_connections{state='connecting'}) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "Connections are stuck in the connecting state. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason and fix it."
          }
        },
        {
          "alert": "WeaveNetConnectionsRetrying",
          "expr": "sum(weave_connections{state='retrying'}) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "Connections are stuck in the retrying state. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason and fix it."
          }
        },
        {
          "alert": "WeaveNetConnectionsPending",
          "expr": "sum(weave_connections{state='pending'}) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "Connections are stuck in the pending state. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason and fix it."
          }
        },
        {
          "alert": "WeaveNetConnectionsFailed",
          "expr": "sum(weave_connections{state='failed'}) > 0",
          "for": "3m",
          "labels": {
            "severity": "critical"
          },
          "annotations": {
            "summary": "Connections are in the failed state. Go to the below Prometheus link for details.",
            "description": "Actionable: Find the reason and fix it."
          }
        }
      ]
    }
  ]
}
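As a sanity check, you can dry-run any of these expressions against the Prometheus HTTP API before loading the rules. Below is a minimal Go sketch (the Prometheus URL is a placeholder; substitute your own server): it runs one instant query and reports how many series currently match, i.e. whether the alert condition would fire right now.

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Assumed Prometheus address; replace with your own server.
    prom := "http://prometheus.example.com:9090"
    // Any alert expression from the rules above.
    expr := "sum(weave_connections{state='failed'}) > 0"

    resp, err := http.Get(prom + "/api/v1/query?query=" + url.QueryEscape(expr))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    // An instant query returns a result vector; if it is non-empty,
    // the alert condition currently holds on at least one series.
    var out struct {
        Status string `json:"status"`
        Data   struct {
            Result []json.RawMessage `json:"result"`
        } `json:"data"`
    }
    if err := json.Unmarshal(body, &out); err != nil {
        panic(err)
    }
    fmt.Printf("status=%s, matching series: %d\n", out.Status, len(out.Data.Result))
}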

Related

Error in running Azure Data Factory Pipeline. Linked Service Reference not found

I am facing the issue below while creating an Azure Machine Learning Batch Execution activity to execute a scoring ML experiment. I am new to this, so please let me know if any other relevant information is needed.
Created an AzureML Linked Service as below:
{
  "name": "PredictionAzureML",
  "properties": {
    "typeProperties": {
      "mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxxxx/jobs",
      "apiKey": "xxxxxxxx=="
    },
    "type": "AzureML"
  }
}
Created Pipeline as below:
{
  "name": "pipeline1",
  "properties": {
    "description": "use AzureML model",
    "activities": [
      {
        "name": "MLActivity",
        "description": "description",
        "type": "AzureMLBatchExecution",
        "policy": {
          "timeout": "02:00:00",
          "retry": 1,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "webServiceInput": "PredictionInputDataset",
          "webServiceOutputs": {
            "output1": "PredictionOutputDataset"
          }
        },
        "inputs": [
          {
            "name": "PredictionInputDataset"
          }
        ],
        "outputs": [
          {
            "name": "PredictionOutputDataset"
          }
        ],
        "linkedServiceName": "PredictionAzureML"
      }
    ]
  }
}
Getting the below error:
{
  "errorCode": "2109",
  "message": "'linkedservicereference' with reference name 'PredictionAzureML' can't be found.",
  "failureType": "UserError",
  "target": "MLActivity"
}
I got this working in Data Factory v2, so apologies if you are using v1.
Try putting the linkedServiceName as an object in the JSON outside of the typeProperties and use the following structure:
"linkedServiceName": {
"referenceName": "PredictionAzureML",
"type": "LinkedServiceReference"
}
Hope that helps!
Please use "Trigger" instead of "Debug" in the UX. You need to publish your pipeline first, before clicking the "Trigger" button.
Please follow this doc to update your payload. It should look like the following.
{
  "name": "AzureMLExecutionActivityTemplate",
  "description": "description",
  "type": "AzureMLBatchExecution",
  "linkedServiceName": {
    "referenceName": "AzureMLLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "webServiceInputs": {
      "<web service input name 1>": {
        "LinkedServiceName": {
          "referenceName": "AzureStorageLinkedService1",
          "type": "LinkedServiceReference"
        },
        "FilePath": "path1"
      },
      "<web service input name 2>": {
        "LinkedServiceName": {
          "referenceName": "AzureStorageLinkedService1",
          "type": "LinkedServiceReference"
        },
        "FilePath": "path2"
      }
    },
    "webServiceOutputs": {
      "<web service output name 1>": {
        "LinkedServiceName": {
          "referenceName": "AzureStorageLinkedService2",
          "type": "LinkedServiceReference"
        },
        "FilePath": "path3"
      },
      "<web service output name 2>": {
        "LinkedServiceName": {
          "referenceName": "AzureStorageLinkedService2",
          "type": "LinkedServiceReference"
        },
        "FilePath": "path4"
      }
    },
    "globalParameters": {
      "<Parameter 1 Name>": "<parameter value>",
      "<parameter 2 name>": "<parameter 2 value>"
    }
  }
}

Resource Not Found for Creating CronJob

I am running Kubernetes 1.6.2 and am hitting the /apis/batch/v2alpha1/namespaces/<namespace>/cronjobs endpoint, with a valid namespace and a request body of
{
  "body": {
    "apiVersion": "batch/v2alpha1",
    "kind": "CronJob",
    "metadata": {
      "name": "hello"
    },
    "spec": {
      "schedule": "*/1 * * * *",
      "jobTemplate": {
        "spec": {
          "template": {
            "spec": {
              "containers": [
                {
                  "name": "hello",
                  "image": "busybox",
                  "args": [
                    "/bin/sh",
                    "-c",
                    "date; echo Hello from the Kubernetes cluster"
                  ]
                }
              ],
              "restartPolicy": "OnFailure"
            }
          }
        }
      }
    }
  }
}
I receive a response of
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "the server could not find the requested resource",
  "reason": "NotFound",
  "details": {},
  "code": 404
}
According to the documentation, this endpoint should exist. I figure I probably have some setting set incorrectly, but I'm not sure which one and how to correct it. Any help is appreciated.
The v2alpha1 features are not enabled by default. Make sure you are starting your kube-apiserver with this switch to enable the CronJob resource: --runtime-config=batch/v2alpha1=true.
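To verify that the flag took effect, you can ask the API server whether the batch/v2alpha1 group/version is being served; a 404 on the discovery path reproduces the error above. Here is a minimal Go sketch (the API server URL and token are placeholders, not from the original question):

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
)

func main() {
    // Assumed API server address and bearer token; replace with your own.
    apiserver := "https://kubernetes.example.com:6443"
    token := "<your-bearer-token>"

    // TLS verification is skipped only to keep the sketch short.
    client := &http.Client{Transport: &http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    }}

    req, err := http.NewRequest("GET", apiserver+"/apis/batch/v2alpha1", nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Authorization", "Bearer "+token)

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // 200 means the group/version is served (the switch is set);
    // 404 reproduces the "could not find the requested resource" error.
    fmt.Println("HTTP status:", resp.Status)
}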

"Error getting chaincode package bytes" when deploying chaincode on hyperledger via REST

I'm trying to deploy chaincode on Hyperledger (Bluemix service) via a REST POST to
/chaincode
with the following QuerySpec:
{ "jsonrpc": "2.0", "method": "deploy", "params": { "type": 1,
"chaincodeID": { "path":
"https://github.com/romeokienzler/learn-chaincode/tree/master/finished"
}, "ctorMsg": { "function": "init", "args": [ "hi there" ] },
"secureContext": "user_type1_0" }, "id": 1 }
I've also tried these links:
https://github.com/romeokienzler/learn-chaincode/blob/master/finished/chaincode_finished?raw=true
https://raw.githubusercontent.com/romeokienzler/learn-chaincode/master/finished/chaincode_finished.go
I always get
{ "jsonrpc": "2.0", "error": {
"code": -32001,
"message": "Deployment failure",
"data": "Error when deploying chaincode: Error getting chaincode package bytes: Error getting code 'go get' failed with error: 'exit
status 1'\npackage
github.com/romeokienzler/learn-chaincode/tree/master/finished: cannot
find package
'github.com/romeokienzler/learn-chaincode/tree/master/finished' in any
of:\n\t/usr/local/go/src/github.com/romeokienzler/learn-chaincode/tree/master/finished
(from
$GOROOT)\n\t/go/usercode/552962906/src/github.com/romeokienzler/learn-chaincode/tree/master/finished
(from
$GOPATH)\n\t/go/src/github.com/romeokienzler/learn-chaincode/tree/master/finished\n"
}, "id": 1 }
Any idea?
Considering that you are playing with the Bluemix service, I assume you are following the "Implementing your first chaincode" tutorial.
If you look at your forked repository, you will see instructions to use branch v1.0 for the Bluemix Blockchain Service (link); the IBM Bluemix service is (still) using Fabric v0.5.
Once you have registered with one of the available enroll IDs, you should be able to deploy your chaincode using the DeploySpec below (note the path: "https://github.com/romeokienzler/learn-chaincode/tree/v1.0/finished"):
{
  "jsonrpc": "2.0",
  "method": "deploy",
  "params": {
    "type": 1,
    "chaincodeID": {
      "path": "https://github.com/romeokienzler/learn-chaincode/tree/v1.0/finished"
    },
    "ctorMsg": {
      "function": "init",
      "args": [
        "hi there"
      ]
    },
    "secureContext": "user_type1_0"
  },
  "id": 1
}
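If you are driving the REST interface programmatically, here is a minimal Go sketch of posting this DeploySpec to the /chaincode endpoint (the endpoint host and port are placeholders; substitute your Bluemix peer's REST address):

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Placeholder endpoint; substitute your peer's REST host and port.
    endpoint := "https://<your-peer-host>:<port>/chaincode"

    spec := []byte(`{
      "jsonrpc": "2.0",
      "method": "deploy",
      "params": {
        "type": 1,
        "chaincodeID": {
          "path": "https://github.com/romeokienzler/learn-chaincode/tree/v1.0/finished"
        },
        "ctorMsg": {"function": "init", "args": ["hi there"]},
        "secureContext": "user_type1_0"
      },
      "id": 1
    }`)

    resp, err := http.Post(endpoint, "application/json", bytes.NewReader(spec))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // the JSON-RPC result or error returned by the peer
}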
First of all, the deploy command should be changed as follows (the value of the path variable was changed):
{
  "jsonrpc": "2.0",
  "method": "deploy",
  "params": {
    "type": 1,
    "chaincodeID": {
      "path": "https://github.com/romeokienzler/learn-chaincode/finished"
    },
    "ctorMsg": {
      "function": "init",
      "args": ["hi there"]
    },
    "secureContext": "user_type1_0"
  },
  "id": 1
}
P.S. As #Mil4n correctly mentioned, IBM Bluemix still works with Fabric v0.5, so the chaincode romeokienzler/learn-chaincode/finished should be adapted to that version.
For example, shim.ChaincodeStubInterface is not available yet and should be replaced with *shim.ChaincodeStub.
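For illustration, here is a minimal Go sketch of what the v0.5-style signatures look like, assuming the old shim API; the SimpleChaincode type name is just a placeholder, not from the original post:

package main

// Fabric v0.5-style chaincode: note the concrete *shim.ChaincodeStub
// in the method signatures instead of the later shim.ChaincodeStubInterface.

import (
    "github.com/hyperledger/fabric/core/chaincode/shim"
)

type SimpleChaincode struct{}

func (t *SimpleChaincode) Init(stub *shim.ChaincodeStub, function string, args []string) ([]byte, error) {
    return nil, nil // set up initial state here
}

func (t *SimpleChaincode) Invoke(stub *shim.ChaincodeStub, function string, args []string) ([]byte, error) {
    return nil, nil // mutate state here
}

func (t *SimpleChaincode) Query(stub *shim.ChaincodeStub, function string, args []string) ([]byte, error) {
    return nil, nil // read state here
}

func main() {
    if err := shim.Start(new(SimpleChaincode)); err != nil {
        panic(err)
    }
}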

MariaDB backups: meaning of backup and restore states returned by Cloud Foundry

I did not find any documentation about the states of a MariaDB backup and restore (https://docs.developer.swisscom.com/devguide-sc/services/backups.html).
For example, when I make the API call GET /custom/service_instances/{service-instance-id}/backups, the following JSON is returned as a response. In the response there is a "status" attribute both in the entity of a backup and in the entity of a restore.
{
  "total_results": 2,
  "total_pages": 1,
  "prev_url": null,
  "next_url": null,
  "resources": [
    {
      "metadata": {
        "guid": "95b9108a-1903-4cea-b52e-bbb3b0414986",
        "url": "/custom/service_instances/3955ad28-3f47-4f08-8eee-748f6e162d46/backups/95b9108a-1903-4cea-b52e-bbb3b0414986",
        "created_at": "2016-10-03T20:52:04Z",
        "updated_at": "2016-10-03T20:52:34Z"
      },
      "entity": {
        "service_instance_id": "3955ad28-3f47-4f08-8eee-748f6e162d46",
        "status": "CREATE_SUCCEEDED",
        "restores": []
      }
    },
    {
      "metadata": {
        "guid": "4ffff7d4-55a8-4e57-9035-98ed11380991",
        "url": "/custom/service_instances/3955ad28-3f47-4f08-8eee-748f6e162d46/backups/4ffff7d4-55a8-4e57-9035-98ed11380991",
        "created_at": "2016-10-03T08:50:07Z",
        "updated_at": "2016-10-03T08:50:37Z"
      },
      "entity": {
        "service_instance_id": "3955ad28-3f47-4f08-8eee-748f6e162d46",
        "status": "CREATE_SUCCEEDED",
        "restores": [
          {
            "metadata": {
              "guid": "1a33a385-3703-423d-8052-be7a7a061878",
              "url": "/custom/service_instances/3955ad28-3f47-4f08-8eee-748f6e162d46/backups/4ffff7d4-55a8-4e57-9035-98ed11380991/restores/1a33a385-3703-423d-8052-be7a7a061878",
              "created_at": "2016-10-03T20:52:38Z",
              "updated_at": "2016-10-03T20:53:08Z"
            },
            "entity": {
              "backup_id": "4ffff7d4-55a8-4e57-9035-98ed11380991",
              "status": "SUCCEEDED"
            }
          }
        ]
      }
    }
  ]
}
So, the question is: WHAT are all the possible values of the "status" attribute in both cases, and WHEN do they occur?
Thanks in advance,
Just for documentation purposes, here are the available statuses I was able to find out:
Backup statuses:
CREATE_IN_PROGRESS
CREATE_SUCCEEDED
CREATE_FAILED
DELETE_IN_PROGRESS
DELETE_SUCCEEDED
DELETE_FAILED
Restore statuses:
IN_PROGRESS
SUCCEEDED
FAILED
Moreover, it is not possible to trigger a restore while another backup create or delete, or another restore operation, is in progress.
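Because of that restriction, a client has to wait for the in-progress states to clear before starting a restore. Here is a minimal Go sketch that polls the backups endpoint until the latest backup leaves an *_IN_PROGRESS state (the API base URL, instance GUID, token, and auth scheme are placeholders/assumptions):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// backupList mirrors just the fields of the response shown above
// that are needed for polling.
type backupList struct {
    Resources []struct {
        Entity struct {
            Status string `json:"status"`
        } `json:"entity"`
    } `json:"resources"`
}

func main() {
    // Placeholders; substitute your API base URL, instance GUID, and token.
    base := "https://api.example.com"
    instance := "3955ad28-3f47-4f08-8eee-748f6e162d46"
    token := "<cf-oauth-token>"

    for {
        req, err := http.NewRequest("GET", base+"/custom/service_instances/"+instance+"/backups", nil)
        if err != nil {
            panic(err)
        }
        req.Header.Set("Authorization", "bearer "+token) // assumed CF-style auth

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        var list backupList
        if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
            panic(err)
        }
        resp.Body.Close()

        if len(list.Resources) == 0 {
            fmt.Println("no backups yet")
            return
        }
        status := list.Resources[0].Entity.Status
        fmt.Println("latest backup status:", status)

        // Only the *_IN_PROGRESS statuses are transient; the others are terminal.
        if status != "CREATE_IN_PROGRESS" && status != "DELETE_IN_PROGRESS" {
            return
        }
        time.Sleep(10 * time.Second)
    }
}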

What is the DNS Record logged by kubedns?

I'm using Google Container Engine and I'm noticing entries like the following in my logs
{
  "insertId": "1qfzyonf2z1q0m",
  "internalId": {
    "projectNumber": "1009253435077"
  },
  "labels": {
    "compute.googleapis.com/resource_id": "710923338689591312",
    "compute.googleapis.com/resource_name": "fluentd-cloud-logging-gke-gas2016-4fe456307445d52d-worker-pool-",
    "compute.googleapis.com/resource_type": "instance",
    "container.googleapis.com/cluster_name": "gas2016-4fe456307445d52d",
    "container.googleapis.com/container_name": "kubedns",
    "container.googleapis.com/instance_id": "710923338689591312",
    "container.googleapis.com/namespace_name": "kube-system",
    "container.googleapis.com/pod_name": "kube-dns-v17-e4rr2",
    "container.googleapis.com/stream": "stderr"
  },
  "logName": "projects/cml-236417448818/logs/kubedns",
  "resource": {
    "labels": {
      "cluster_name": "gas2016-4fe456307445d52d",
      "container_name": "kubedns",
      "instance_id": "710923338689591312",
      "namespace_id": "kube-system",
      "pod_id": "kube-dns-v17-e4rr2",
      "zone": "us-central1-f"
    },
    "type": "container"
  },
  "severity": "ERROR",
  "textPayload": "I0718 17:05:20.552572 1 dns.go:660] DNS Record:&{worker-7.default.svc.cluster.local. 6000 10 10 false 30 0 }, hash:f97f8525\n",
  "timestamp": "2016-07-18T17:05:20.000Z"
}
Is this an actual error or is the severity incorrect? Where can I find the definition for the struct that is being printed?
The severity is incorrect. This is some tracing/debugging output that shouldn't have been left in the binary; it has been removed from the code since 1.3 was cut and will be gone in a future release.
See also: Google container engine cluster showing large number of dns errors in logs