Linking from Grafana to Jaeger using Data Links hits the wrong traces

We are working on incorporating Grafana, Jaeger, Prometheus and other tools into our production environments.
In this process, we have set up graphs for various services/operations, and we would like to be able to go directly from a spike in the graph to the corresponding trace, either in the Grafana Explorer or in Jaeger itself. Here, however, things do not quite match up.
We have managed to get data links to work, but only after first transforming the data fields (renaming them with "Organize fields"); otherwise, we could not get the link templates to work. The Trace ID field has been renamed to TraceId.
Then we add a Data Link to the graph, currently locally, with the following template:
http://localhost:16686/trace/${__data.fields['TraceId']}
To summarize, given a really simple graph:
Type: Time Series
Data Source: Jaeger
Query: Service=<Our Service>, Operation=All, Tags="category=REQUEST"
Limit: 2500
Transform:
Organize Fields: Trace ID -> TraceId
Data Link:
Jaeger: http://localhost:16686/trace/${__data.fields['TraceId']}
Now, this appears to work at first. However, our problem is that the IDs do not match up correctly. When we have, e.g., a 10-second spike and click it in the graph, open the context menu, and then click the link above, we land on a completely different trace.
I have tried zooming in, and I can see that when the data links appear in the tooltip, the duration matches the trace I want to see.
Then I finally filtered the query down to only 4 points (traces) in the graph, and I discovered that the links appear to be in reverse order compared to the graph.
So with 4 data points in the graph like so (Id = the ID that seems to be the right match in Jaeger, judging by the time and duration):
2022-08-04T08:36:57.402, 3.91 ms, ( Id=ea801f08b37f64cd374fbc28f52e38f6 )
2022-08-04T08:36:57.403, 2.90 ms, ( Id=eecf656be29f783d5c0ecae1113827d8 )
2022-08-04T08:36:57.409, 4.43 ms, ( Id=06da4ae77a124e6eb4eca3a63a65e115 )
2022-08-04T08:36:57.416, 3.06 ms, ( Id=3ddb2e5b37309fec68a163d35fc929e8 )
Yet the links fall like this instead:
2022-08-04T08:36:57.402, 3.91 ms, Link=http://localhost:16686/trace/3ddb2e5b37309fec68a163d35fc929e8
2022-08-04T08:36:57.403, 2.90 ms, Link=http://localhost:16686/trace/06da4ae77a124e6eb4eca3a63a65e115
2022-08-04T08:36:57.409, 4.43 ms, Link=http://localhost:16686/trace/eecf656be29f783d5c0ecae1113827d8
2022-08-04T08:36:57.416, 3.06 ms, Link=http://localhost:16686/trace/ea801f08b37f64cd374fbc28f52e38f6
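To make the pattern explicit, here is a small Python check using only the four rows listed above. It is just a sketch that compares the two listings; it does not talk to Grafana or Jaeger, and the helper name link_for is made up for illustration.

```python
# Rows in graph order, with the trace ID that actually matches in Jaeger
# (copied from the first listing above).
expected = [
    ("2022-08-04T08:36:57.402", "ea801f08b37f64cd374fbc28f52e38f6"),
    ("2022-08-04T08:36:57.403", "eecf656be29f783d5c0ecae1113827d8"),
    ("2022-08-04T08:36:57.409", "06da4ae77a124e6eb4eca3a63a65e115"),
    ("2022-08-04T08:36:57.416", "3ddb2e5b37309fec68a163d35fc929e8"),
]

# Trace IDs the data links point at, in the same row order
# (copied from the second listing above).
linked = [
    "3ddb2e5b37309fec68a163d35fc929e8",
    "06da4ae77a124e6eb4eca3a63a65e115",
    "eecf656be29f783d5c0ecae1113827d8",
    "ea801f08b37f64cd374fbc28f52e38f6",
]

BASE_URL = "http://localhost:16686/trace/"


def link_for(trace_id: str) -> str:
    """Build the same URL the data link template produces (hypothetical helper)."""
    return BASE_URL + trace_id


expected_ids = [trace_id for _, trace_id in expected]

# True when row i of the graph gets the ID of row (n - 1 - i).
is_reversed = linked == expected_ids[::-1]
print("links are in exact reverse order:", is_reversed)

for (timestamp, trace_id), linked_id in zip(expected, linked):
    print(timestamp, "expected", link_for(trace_id), "got", link_for(linked_id))
```

For these four rows the link on row i carries the trace ID of row n - 1 - i, which is what an index-based lookup against a reversed TraceId column would produce.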
This seems to be consistent, so does anyone have any clue as to what we are doing wrong?
Feel free to ask for specific details. I am not sure what to share.

Related

How to make sense of the micrometer metrics using SpringBoot 2, InfluxDB and Grafana?

I'm trying to configure a SpringBoot application to export metrics to InfluxDB to visualise them using a Grafana dashboard. I'm using this dashboard as an example which uses Prometheus as a backend.
For some metrics I have no problem figuring out how to create graphs for them but for some others I don't know how to create the graphs or even if it's possible at all. So I enumerate the things I'm not really sure about in the following points:
Is there any documentation where a value unit is described? The application I'm using as an example doesn't have any load on it so sometimes I don't know whether the value is a bit, a byte, a second, a millisecond, a count, etc.
Some measurements contain the tag 'metric_type = histogram' with fields 'count', 'sum', 'mean' and 'upper'. Again, here I don't know what the value units are, what upper means or how I'm supposed to plot them. Examples of this are 'http_server_requests' or 'jvm_gc_pause'.
From what I see in the Grafana dashboard example, it seems I should use these measurements of type histogram to create both a graph with counts and graphs with duration. For example I see I should be able to create a graph with the number of requests and another one with their duration. Or for the garbage collector, I should be able to provide a graph for the number of minor and major GCs and another for their duration.
As an example of measures I get inserted into InfluxDB:
time                 count  exception  mean      method  metric_type  outcome  status  sum       upper     uri
1625579637946000000  1      None       0.892144  GET     histogram    SUCCESS  200     0.892144  0.892144  /actuator/health
or
time                 action           cause               count  mean  metric_type  sum  upper
1625581132316000000  end of minor GC  Allocation Failure  1      2     histogram    2    2

I agree the documentation for micrometer is not great. I've had to dig through the code to find answers...
Regarding your questions about jvm_gc_pause, it is a Timer and the implementation is AbstractTimer which is a class that wraps a Histogram among other components. This particular metric is registered by the JvmGcMetrics class. The various measurements that are published to InfluxDB are determined by the InfluxMeterRegistry.writeTimer(Timer timer) method:
sum: timer.totalTime(getBaseTimeUnit()) // The total time of recorded events
count: timer.count() // The number of times stop has been called on the timer
mean: timer.mean(getBaseTimeUnit()) // totalTime()/count()
upper: timer.max(getBaseTimeUnit()) // The max time of a single event
The base time unit is milliseconds.
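As a rough illustration of how the four fields relate, here is a small Python sketch. It assumes the semantics listed above (sum is total time, count is the number of events, upper is the maximum, all times in milliseconds); the class name TimerFields is made up and is not part of Micrometer. The values come from the two rows in the question.

```python
from dataclasses import dataclass


@dataclass
class TimerFields:
    """Hypothetical container for the fields written for one timer sample."""

    sum_ms: float    # total time of recorded events
    count: int       # number of recorded events
    upper_ms: float  # longest single event

    @property
    def mean_ms(self) -> float:
        # mean is simply sum / count; guard against an empty interval
        return self.sum_ms / self.count if self.count else 0.0


# Values taken from the jvm_gc_pause and http_server_requests rows in the question.
gc_row = TimerFields(sum_ms=2, count=1, upper_ms=2)
http_row = TimerFields(sum_ms=0.892144, count=1, upper_ms=0.892144)

print("GC pause mean (ms):", gc_row.mean_ms)
print("HTTP request mean (ms):", http_row.mean_ms)
```

With a single event in the interval, sum, mean and upper all coincide, which is why the example rows show the same number three times.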
Similarly, http_server_requests appears to be a Timer as well.
I believe you are correct that the sensible thing is to chart these on two separate Grafana panels: one panel for GC pause time using sum (or mean or upper), and one panel for GC events using count.

How do we change the "precision:ms" setting in the Grafana Query Inspector?

I have an InfluxDB database with only x11 data points in it. These data are not displaying correctly (or at least as I would expect) in Grafana when the time between them is shorter than 1ms.
If I insert data points 1 ms apart, then everything works as expected and I see all x11 points at the correct times.
However, if I delete these points and upload new ones, but this time one point per 100 μs, then although the data displays correctly in InfluxDB, in Grafana I see only two points in my graph.
It seems like the data is being rounded/binned to the nearest millisecond, and that this is related to the "precision=ms" setting shown for the query in the Query Inspector, but I cannot find any way to change this setting. What is the correct way to fix this?
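The effect can be reproduced with plain arithmetic. The Python sketch below assumes that millisecond precision means timestamps are cut down to whole milliseconds (whether the real behaviour is truncation or rounding is an assumption here), and the base timestamp is an arbitrary made-up value.

```python
NS_PER_MS = 1_000_000
NS_PER_100_US = 100_000

# Hypothetical starting timestamp in nanoseconds, chosen to sit on a millisecond boundary.
base_ns = 1_600_000_000_000_000_000

# Eleven points spaced 100 microseconds apart, as in the second experiment.
timestamps_ns = [base_ns + i * NS_PER_100_US for i in range(11)]

# Reduce each timestamp to whole milliseconds (assumed effect of precision=ms).
truncated_ms = [t // NS_PER_MS for t in timestamps_ns]
distinct_ms = sorted(set(truncated_ms))

print("points written:", len(timestamps_ns))
print("distinct millisecond timestamps:", len(distinct_ms))
```

Under this assumption the first ten points share one millisecond and the eleventh falls on the next, which would match seeing only two points in the graph.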

You can't configure Grafana to use a different time precision for InfluxDB. It is hardcoded in the source code: https://github.com/grafana/grafana/blob/36fd746c5df1438f27aa33fc74b24be77debc7ff/public/app/plugins/datasource/influxdb/datasource.ts#L364 (It may need to be fixed in multiple places in the source, not only in this one.)
So the correct way to fix it is to code it, which is of course not in the scope of this question.

Can you calculate active users using time series

My atomist client exposes metrics on the commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period, e.g. 1h, 1d, 7d and 30d, in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
This is an issue because I don't clear the metrics, so it is always a count since inception.
I then tried this:
count(max_over_time(help_command{job="Application Name",Username=~".+"}[1w]) -
max_over_time(help_command{job="Application name",Username=~".+"}[1w] offset 1w) > 0)
which works, but only for one command; I have about 50 other commands that need to be added to that count.
I tried the:
"{__name__=~".+_command",job="app name"}[1w] offset 1w"
but this is obviously very expensive (timeout in browser) and has issues with integrating max_over_time which doesn't support it.
Any help is appreciated. Am I using the metric in the wrong way? Is there a better way to query? My only option at the moment is to repeat the count (in the working format above) for each command.
Thanks in advance.

To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using labels with arbitrarily large sets of values (as your usernames are). As your experience with the query timing out shows, that advice is not entirely wrong.
Second, Prometheus may not be the right tool for analytics (such as active users), partly because of the above and partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be one).
Third, you collect separate metrics per command (e.g. help_command, foo_command) instead of a single metric with the command name as a label (e.g. command_usage{command="help"}, command_usage{command="foo"}).
To get back to your question though, you don't need the max_over_time, you can simply write your query as:
count by(__name__)(
(
{__name__=~".+_command",job="Application Name"}
-
{__name__=~".+_command",job="Application name"} offset 1w
) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter has never restarted, and the counts will drop to zero when it does, then you'd need to use increase instead of subtraction, and you'd run into the exact same performance issues as with max_over_time:
count by(__name__)(
increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)
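The idea behind both queries (a user is active if any of their counters grew during the window) can be sketched outside PromQL. The Python below is only an illustration with made-up sample data; the function name active_users and the usernames are hypothetical, and it collects usernames across all commands rather than grouping by metric name.

```python
from typing import Dict, Set, Tuple

# Key: (command metric name, Username label); value: counter reading.
Snapshot = Dict[Tuple[str, str], float]


def active_users(now: Snapshot, week_ago: Snapshot) -> Set[str]:
    """Return usernames whose counter grew for at least one command."""
    users: Set[str] = set()
    for (command, username), value in now.items():
        # Assumption: a counter that did not exist a week ago is treated as 0.
        # The PromQL subtraction would instead drop series with no match.
        previous = week_ago.get((command, username), 0)
        if value - previous > 0:
            users.add(username)
    return users


# Made-up readings for illustration only.
week_ago = {
    ("help_command", "alice"): 10,
    ("help_command", "bob"): 4,
    ("foo_command", "alice"): 2,
}
now = {
    ("help_command", "alice"): 10,  # unchanged
    ("help_command", "bob"): 4,     # unchanged
    ("foo_command", "alice"): 5,    # grew, so alice is active
    ("foo_command", "carol"): 1,    # new series, so carol is active
}

print(sorted(active_users(now, week_ago)))
```

Like the subtraction query, this only holds while the counters never reset; a restart that drops them to zero would hide activity.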

Google Analytics API: tiny differences in results between GA API and Google Analytics UI

I'm querying the GA Reporting API v4 to get some metrics for AdWords Keywords.
As dimension I use:
ga:keyword
As metrics I use:
ga:adClicks,
ga:adCost,
ga:CPC,
ga:sessions,
ga:bounceRate,
ga:pageviewsPerSession,
ga:goalConversionRateAll,
ga:transactions,
ga:transactionRevenue
When I compare the results pulled from the API with the results I get in the Google Analytics UI, I find that certain metrics for some keywords show tiny differences.
I got the same result when I tried GA API v3.
What is the reason?
Why are some returned metrics for keywords fully identical to the results in the UI, while others are not?
I tried various date ranges (1 day, a week, a month), but in all cases I got tiny differences in some metrics for certain keywords.
Here is a screenshot showing an example of the differences:
Red marks values that differ; green marks values that are identical.

Problem: The reason for the discrepancy is that you are calling two different reports.
Report 1) UI Report.
As you have seen, this report is made up of two parts: the first is Clicks, Cost, and CPC, which come from the Google AdWords API, and the second is the other metrics (sessions, bounce rate, etc.), which come from Google Analytics.
Because you are going into AdWords > Keywords, you are actually setting a filter that selects only AdWords traffic.
Report 2) Custom Report.
This report pulls the keyword dimension without any filters. This means that the report will also include data for organic keywords and for any utm_term parameters that are set.
Because sessions from organic keywords have no AdWords data, the first three columns will be the same; however, the Google Analytics-specific columns will show variation in the metrics.
Solution:
To make your reports match, you need to add a filter to your API request, such as one on ga:adwordsCustomerID, or ga:source==google together with ga:medium==cpc.
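As a sketch of what the source/medium filter could look like in a Reporting API v4 request body, here is a small Python helper that only builds the JSON payload. The function name, the view ID and the date range are placeholders, and only two of the metrics from the question are included to keep it short.

```python
import json


def build_adwords_keyword_request(view_id: str, start_date: str, end_date: str) -> dict:
    """Build a reports.batchGet body restricted to google / cpc traffic (illustrative)."""
    return {
        "reportRequests": [
            {
                "viewId": view_id,
                "dateRanges": [{"startDate": start_date, "endDate": end_date}],
                "dimensions": [{"name": "ga:keyword"}],
                "metrics": [
                    {"expression": "ga:adClicks"},
                    {"expression": "ga:sessions"},
                ],
                # Both conditions must hold, so the clause operator is AND.
                "dimensionFilterClauses": [
                    {
                        "operator": "AND",
                        "filters": [
                            {
                                "dimensionName": "ga:source",
                                "operator": "EXACT",
                                "expressions": ["google"],
                            },
                            {
                                "dimensionName": "ga:medium",
                                "operator": "EXACT",
                                "expressions": ["cpc"],
                            },
                        ],
                    }
                ],
            }
        ]
    }


# Placeholder view ID and dates, for illustration only.
body = build_adwords_keyword_request("XXXX", "7daysAgo", "today")
print(json.dumps(body, indent=2))
```

The resulting dictionary is what would be sent as the request body; sending it requires an authorized client, which is left out here.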

Google Analytics Core Reporting API query for exits and entrances metrics - entrance values incorrectly exactly the same as exits

I'm using GA's Core Reporting API to create a report that shows the top exit pages alongside some behavioural metrics for each page. The dimension is ga:exitPagePath, and the metrics I want are:
ga:exits
ga:pageviews
ga:entrances
ga:avgTimeOnPage
ga:bounceRate
ga:exitRate
I'm sorting by -ga:exits. I'm not using any filters or segments.
The query appears to work fine and doesn't return an error. However, the entrances values it returns are incorrect and exactly match the exit values for each page. Other queries for ga:entrances without ga:exits give the correct entrance values.
I may have overlooked it but I can't find anywhere in the documentation indicating that these metrics can't be used together. I also tested creating a custom report within the GA interface with these two metrics and found the same result - no error or indication that I can't create a report with both metrics, but entrances incorrectly reported and exactly matching the exit values. I also get the same result in GA's Query Explorer.
Would love to work this out - it seems perfectly logical to me to want to view entrances alongside exits for exit pages :)

A better late than never response.
It makes sense, because all users that have visited your site (entrances) have left (exits).
The comparison becomes meaningful when you use these metrics with a page dimension (ga:pagePath, for example).
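A toy model may make this clearer. The Python sketch below uses made-up sessions and assumes that, when grouped by exit page, each session contributes one entrance and one exit to the row of the page it exited from; this is a simplification for illustration, not a description of how GA computes its reports.

```python
from collections import Counter

# Made-up sessions: each list is the ordered pages of one visit.
sessions = [
    ["/home", "/pricing", "/signup"],
    ["/blog", "/home"],
    ["/home"],
    ["/pricing", "/home"],
]


def by_exit_page(visits):
    """Group by exit page: every session adds one entrance and one exit to that row."""
    entrances, exits = Counter(), Counter()
    for pages in visits:
        exit_page = pages[-1]
        entrances[exit_page] += 1
        exits[exit_page] += 1
    return entrances, exits


def by_page(visits):
    """Group by page: entrances go to the first page, exits to the last page."""
    entrances, exits = Counter(), Counter()
    for pages in visits:
        entrances[pages[0]] += 1
        exits[pages[-1]] += 1
    return entrances, exits


exit_entrances, exit_exits = by_exit_page(sessions)
page_entrances, page_exits = by_page(sessions)

print("grouped by exit page, identical:", exit_entrances == exit_exits)
print("grouped by page, entrances:", dict(page_entrances))
print("grouped by page, exits:", dict(page_exits))
```

In this model the two columns can only differ once the rows are keyed by the page itself rather than by the exit page.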