Is it possible to track down very rare failed requests using linkerd? - kubernetes

Linkerd's docs explain how to track down failing requests using the tap command, but in some cases the success rate might be very high, with only a single failed request every hour or so. How is it possible to track down those requests that are considered "unsuccessful"? Perhaps a way to log them somewhere?

It sounds like you're looking for a way to configure Linkerd to trap requests that fail and dump the request data somewhere, which is not supported by Linkerd at the moment.
You do have a couple of options with the current functionality to derive some of the info that you're looking for. The Linkerd proxies record error rates as Prometheus metrics, which are consumed by Grafana to render the dashboards. When you observe one of these infrequent errors, you can use the time window functionality in Grafana to find the precise time that the error occurred, then refer to the service log to see if there are any corresponding error messages there. If the error is coming from the service itself, then you can add as much logging info about the request as you need in order to help solve the problem.
Another option, which I haven't tried myself, is to integrate linkerd tap into your monitoring system to collect the request info and save the data for the requests that fail. One caveat: be careful about leaving a tap command running, because it will continuously collect data from the tap control plane component, which adds load to that service.
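As a rough sketch of that idea, the filter below keeps only request/response pairs whose response status is 5xx. It assumes tap prints one event per line in a shape like `req id=0:1 proxy=in src=... dst=... :method=GET :path=/x` followed by `rsp id=0:1 ... :status=503 latency=...`; that line format and the `failed_requests` name are assumptions for illustration, so check them against the output of your Linkerd version. gRPC failures that are reported with an HTTP 200 status are not covered.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# key=value pairs separated by whitespace, e.g. ":status=503" or "id=0:1".
FIELD = re.compile(r"(\S+?)=(\S+)")


def parse_fields(text: str) -> Dict[str, str]:
    """Turn 'id=0:1 proxy=in :status=503' into a dict of strings."""
    return dict(FIELD.findall(text))


def failed_requests(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (request_line, response_line) for responses with status >= 500.

    Assumes the tap-like line format described in the lead-in.
    """
    pending: Dict[Tuple[str, str, str, str], str] = {}
    for raw in lines:
        line = raw.strip()
        kind, _, rest = line.partition(" ")
        fields = parse_fields(rest)
        if "id" not in fields:
            continue  # not an event line we understand
        key = (
            fields["id"],
            fields.get("proxy", ""),
            fields.get("src", ""),
            fields.get("dst", ""),
        )
        if kind == "req":
            pending[key] = line
        elif kind == "rsp":
            # Drop the remembered request either way so memory stays bounded.
            request = pending.pop(key, "")
            status = fields.get(":status", "")
            if status.isdigit() and int(status) >= 500:
                yield request, line
```

One way to use it would be to pipe the output of a narrowly scoped tap command into a small script that calls `failed_requests(sys.stdin)` and appends each pair to a file, keeping the load caveat above in mind.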
Perhaps a more straightforward approach would be to ensure that all the proxy logs and service logs are written to a long-term store like Splunk, an ELK stack (Elasticsearch, Logstash, and Kibana), or Loki. Then you can set up alerting (Prometheus Alertmanager, for example) to send a notification when a request fails, and match the time of the failure with the logs that have been collected.
You could also look into adding distributed tracing to your environment. Depending on the implementation that you use (Jaeger, Zipkin, etc.), I think the interface will allow you to inspect the details of the request for each trace.
One final thought: since Linkerd is an open source project, I'd suggest opening a feature request with specifics on the behavior that you'd like to see and work with the community to get it implemented. I know the roadmap includes plans to be able to see the request bodies using linkerd tap and this sounds like a good use case for having those bodies.


Does a kubernetes reconcile have to be quick?

I'm writing a controller for a k8s CRD.
The job the controller has to do will usually be quick, but could on occasion take a really long time - let's say as much as an hour.
Is that ok for a Reconcile? Or should I move that work out of the controller into a separate pod, and have the controller monitor the process of that pod?
I see no reason why the reconcile loop couldn't take as long as you need.
Technically speaking, a reconcile is just getting a copy of a resource (i.e. an HTTP GET, or an event if you're using the Watch API), followed by a change to the resource (e.g. updating the resource's Status fields, i.e. an HTTP PUT/POST).
The only caveat is making sure the resource version you have is still the latest one when trying to change it. Including resource versions in your request should solve this problem.
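To illustrate the resource-version caveat, here is a minimal in-memory sketch of optimistic concurrency: an update is rejected when the version it carries is stale, and the caller re-reads and retries. `FakeApiServer`, `Conflict`, and `update_with_retry` are hypothetical names for illustration, not part of any Kubernetes client, and the integer version is a simplification (real resource versions are opaque strings).

```python
import copy
from typing import Callable, Dict


class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeApiServer:
    """Tiny stand-in for an API server that enforces resource versions."""

    def __init__(self) -> None:
        self._objects: Dict[str, dict] = {}

    def create(self, name: str, obj: dict) -> dict:
        stored = copy.deepcopy(obj)
        stored["resourceVersion"] = 1
        self._objects[name] = stored
        return copy.deepcopy(stored)

    def get(self, name: str) -> dict:
        return copy.deepcopy(self._objects[name])

    def update(self, name: str, obj: dict) -> dict:
        current = self._objects[name]
        if obj.get("resourceVersion") != current["resourceVersion"]:
            raise Conflict(f"{name}: stale resourceVersion")
        stored = copy.deepcopy(obj)
        stored["resourceVersion"] = current["resourceVersion"] + 1
        self._objects[name] = stored
        return copy.deepcopy(stored)


def update_with_retry(
    api: FakeApiServer,
    name: str,
    mutate: Callable[[dict], None],
    attempts: int = 5,
) -> dict:
    """Re-read the object and re-apply the change until the write is accepted."""
    for _ in range(attempts):
        obj = api.get(name)  # always start from the latest copy
        mutate(obj)
        try:
            return api.update(name, obj)
        except Conflict:
            continue  # someone else wrote first; fetch again and retry
    raise Conflict(f"{name}: gave up after {attempts} attempts")
```

For a long-running reconcile this matters more than for a quick one, because the copy fetched at the start is likely to be out of date by the time the status is written, so fetching a fresh copy just before the write is the safer pattern.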
More info here:

how to automate bots to monitor for successful queues on orchestrator?

I have a project that deals with checking whether queues have been loaded successfully or unsuccessfully. At the moment I do this manually, which is tedious and can also be misleading: Orchestrator can state that new queues have been added, but when I access the actual job (process), nothing has been added.
I would like to know: is there a way to monitor queue success and failure rates on Orchestrator instead of monitoring them manually?
You can access pretty much any information via the Orchestrator API.
You can find the "Orchestrator HTTP Request" activity, which will allow you to access any relevant endpoint.
Note that the provisioned Robot in Orchestrator needs to have the right access permission, so please have a look at what roles are associated to the Robot user.
The API reference can be found here:
You will see it mentions swagger, which in turn will give you all the information you need to access the relevant APIs.
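Once the queue data has been fetched from the API, computing the rates is simple. The sketch below assumes each item is a dictionary with a `Status` field whose values include `"Successful"` and `"Failed"`; that item shape and the `queue_status_summary` name are assumptions for illustration, so check the field names against the Swagger description of your Orchestrator version.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def queue_status_summary(
    items: Iterable[Dict[str, str]],
    success: str = "Successful",  # assumed status value
    failure: str = "Failed",      # assumed status value
) -> Dict[str, object]:
    """Count items per status and compute the success rate of finished items."""
    counts = Counter(item.get("Status", "Unknown") for item in items)
    finished = counts[success] + counts[failure]
    # Items that are still new or in progress are excluded from the rate.
    rate: Optional[float] = counts[success] / finished if finished else None
    return {"counts": dict(counts), "success_rate": rate}
```

A scheduled process could call this on each poll and raise an alert when `success_rate` drops below a chosen threshold, or when the counts stay at zero even though a load was reported.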

Getting many welcome messages from the same user

I am getting many welcome messages from the same user, is it some kind of a monitoring system by Google?
How can I learn to ignore those requests?
Yes, Google periodically issues a health check against your Action, usually about every 5-10 minutes. Your Action should respond to it normally so Google knows if there is something wrong. If there is, you will receive email that your Action is unavailable because it is unhealthy. They will continue to monitor it and, when healthy again, will restore it.
You don't need to ignore those requests. However, you may wish to, either to save on resources or to avoid logging them all the time.
With a library such as multivocal, the health check is detected and answered automatically, so there is nothing you need to do. For other libraries, you will need to examine the raw input sent in the body of your webhook request.
If you are using the Action SDK, you should examine the inputs array to see if there is one with an argument named "is_health_check". If you are using Dialogflow, then you would need to look under

Unable to set up Azure alert on resource specific events

In the past, it was possible to set up an Azure alert on a single event for a resource, e.g. on a data factory's single RunFinished event where the status is Failed*.
This appears to have been superseded by "Activity Log Alerts"
However, these alerts only seem to work either on a metric threshold (e.g. number of failures in 5 minutes) or on events related to the general admin of the resource (e.g. has it been deployed), not on the operations of the resource.
A threshold doesn't make sense for data factory, as a data factory may only run once a day. If a failure happens and then doesn't happen again X minutes later, that doesn't mean it has been resolved.
The activity event alerts don't seem to include things like failures.
Am I missing something?
Is it because this is expected to be done in OMS Log Analytics now? Or perhaps even in Event Grid later?
*n.b. it is still possible to create these alert types via ARM templates, but you can't see them in the portal anymore.
The events you're describing are part of a resource type's diagnostic logs, which are not alertable in the same way that the Activity Log is. I suggest routing the data to Log Analytics and setting the alert there:

Timeline of kubernetes events

I would like to be able to see all of the various things that happened to a kube cluster on a timeline, including when nodes were found to be dead, when new nodes were added, when pods crashed and when they were restarted.
So far the best that we have found is kubectl get event but that seems to have a few limitations:
it doesn't go back in time that far (I'm not sure how far it goes back. A day?)
it combines similar events and orders the resulting list by the time of the latest event in each group. This makes it impossible to know what happened during some time range since events in that range may have been combined with later events outside the range.
One idea that I have is to write a pod that will use the API to watch the stream of events and log them to a file. This would let us control retention and it seems that events that occur while we are watching will not be combined, solving the second problem as well.
What are other people doing about this?
My understanding is that Kubernetes itself dedups events, documented here:
Once that happens, there is no way to get the individual events back.
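To make the information loss concrete, here is a small sketch that checks whether a deduplicated event could have occurred inside a time window. It assumes events parsed from JSON carry `firstTimestamp` and `lastTimestamp` as second-precision UTC strings like `2020-01-01T10:00:00Z`; the function names are hypothetical, and newer event objects may leave these fields empty.

```python
from datetime import datetime, timezone
from typing import Dict


def parse_time(value: str) -> datetime:
    """Parse a second-precision UTC timestamp such as 2020-01-01T10:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def may_have_occurred_in(event: Dict[str, object], start: datetime, end: datetime) -> bool:
    """True if the event's first-to-last span overlaps the window.

    For a combined event this is only an upper bound: the individual
    occurrences between the first and last timestamps are not recorded,
    so an overlap does not prove that anything happened inside the window.
    """
    first = parse_time(str(event["firstTimestamp"]))
    last = parse_time(str(event["lastTimestamp"]))
    return first <= end and last >= start
```

An event with a count of 5 spanning 10:00 to 14:00 overlaps an 11:00 to 12:00 window, yet all five occurrences may have happened outside it, which is exactly the limitation described in the question.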
See for complaints how that loses info. at least improved the message. See also KEP for recent discussion and proposal to improve usability in kubectl.
How long are events retained? Their "time-to-live" is apparently controlled by the kube-apiserver --event-ttl option, which defaults to 1 hour:
You can raise this. It might require more resources for etcd: from what I saw in some 2015 GitHub discussions, the event TTL used to be 2 days, and events were the main thing stressing etcd...
In a pinch, it might be possible to figure out what happened earlier from various logs, especially the kubelet logs.
Saving events
Running kubectl get event -o yaml --watch into a persistent file sounds like a simple thing to do. I think when you watch events as they arrive, you see them pre-dedup.
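A slightly more structured variant of the same idea is sketched below. It assumes the watch is run with `-o json` instead of `-o yaml` and that the output is a sequence of JSON objects printed back to back; each object is written as one line together with the time it was received. The function names are hypothetical, and the sketch does not handle malformed input, which would make it wait indefinitely.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator, TextIO


def iter_json_objects(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield complete JSON objects from a stream of text chunks or lines."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object is incomplete so far; wait for more input
            yield obj
            buffer = buffer[end:]


def write_events(chunks: Iterable[str], out: TextIO) -> int:
    """Write each event as one JSON line with a receive timestamp; return the count."""
    written = 0
    for event in iter_json_objects(chunks):
        record = {
            "received_at": datetime.now(timezone.utc).isoformat(),
            "event": event,
        }
        out.write(json.dumps(record) + "\n")
        out.flush()  # keep the file current if the watch is interrupted
        written += 1
    return written
```

Calling `write_events(sys.stdin, open("events.jsonl", "a"))` in a script fed by the watch command would give an append-only file whose retention you control; the file name here is only an example.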
Heapster can send events to some of the supported sinks:
Eventrouter can send events to various sinks:
Have you checked out the pod specific events tab in the Dashboard?
Some events from a cluster I have running in GKE:
kubernetes/heapster can persist events to gcl and influxdb, but for now there is no API to access the stored data.