Why does window_avg in Tableau have to work on an aggregated variable? - tableau-api

I am trying to calculate a rolling average over 30 days. However, in Tableau, I have to use window_avg(avg(variable), -30, 0). That means it is actually calculating an average of daily averages: it first calculates the average value per day, then averages those values over the past 30 days. I am wondering whether there is a function in Tableau that can calculate a rolling average directly, like pandas.rolling?

In this specific case, you can use the following
window_sum(sum(variable), -30, 0) / window_sum(sum(1), -30, 0)
A few concepts about table calcs to keep in mind
Table calcs operate on aggregate query results.
This gives you flexibility - you can partition the table of query results in many ways, access multiple values in the result set, order the query results to impact your calculations, and nest table calcs in different ways.
This approach can also give you efficiency if you can calculate what you need simply from the aggregate results that you've already fetched.
It also adds complexity. You have to be aware of how each calculation specifies the addressing and partitioning of the query results. You also have to think about how double aggregation will impact your results.
In most cases, applying back-to-back aggregation functions requires some careful thought about what the results will mean. As you've noted, averages of averages may not mean what people think they mean. Others may be quite reasonable, say, averages of daily sales totals.
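For instance, say day 1 has a single sale of 100 and day 2 has nine sales of 10 each: the average of the daily averages is (100 + 10) / 2 = 55, while the true average over all ten rows is 190 / 10 = 19.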
In some cases, double aggregation can be used without extra thought because the results are the same regardless. Sums of sums, mins of mins, and maxes of maxes yield the same result as calling SUM, MIN, or MAX on the underlying data rows. These functions are sometimes called additive aggregation functions, and they obey the associative rule you learned in grade school. Hence, the formula at the start of this answer.
You can also read about the Total() function.

Why are my values multiplying when I apply Month/Year to my values?

When I apply Month/Year to Cases or Deaths from my data, the values explode. For Cases it goes from approximately 48 million to over 1 billion, and for Deaths it goes from about 700 thousand to over 22 million. However, when I try the same thing with Initial Claims or the Stringency Index, my values remain correct. I'm trying to find the month over month percentage change by the way. And I'm using the Date column. I only select 2020 and 2021 in the filter for Year.
What I'm asking about is Sheet 21.
Link to workbook: https://public.tableau.com/app/profile/nilajah.rivers/viz/CoronaVirusProject_16323687296770/Sheet21
Your problem is that the data points are daily cumulative deaths. If you change the date aggregation to anything other than days, Tableau will default to summing those already-cumulative numbers for every day in the period. This will give the wrong, wildly inflated result.
If you want to show the correct total deaths or cases regardless of the time aggregation (months, days, weeks etc.) then you could use the New Case or New Death numbers plus a running sum table calculation. This will always give the correct total for the time period.
Table calculations will also allow automatic calculation of the period to period % change from the same data fields.
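For example, two hypothetical calculated fields along these lines (the field names are illustrative, not taken from the workbook):
RUNNING_SUM(SUM([New Deaths]))
gives a cumulative total that stays correct at any date level, and
(ZN(SUM([New Deaths])) - LOOKUP(ZN(SUM([New Deaths])), -1)) / ABS(LOOKUP(ZN(SUM([New Deaths])), -1))
gives the month-over-month % change when computed along MONTH(Date); Tableau also offers the latter as the built-in Percent Difference quick table calculation.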
This is a common problem when working with datasets that offer pre-calculated aggregations. Tableau doesn't need them, as it can dynamically calculate the aggregation of a field over any given time period, but it is easy to forget which field has pre-aggregated data and which has raw data. Pre-aggregated fields assume a particular time period and can't be used for different time periods without disentangling that assumption (which is unnecessary if you also have the raw data, in this case daily new deaths/cases).

How to calculate difference and aggregate energy as a counter - TimescaleDB

I've got time series data in TimescaleDB from smart meters; the Energy value is stored as a counter.
I have 2 questions:
1) How do I calculate the difference between each row of energy values so I can just see the increase minute by minute for each row?
2) I've got this in 1 minute intervals and I'd like to aggregate as 30m, 60m etc. What's the best way of achieving this?
Thanks in advance for any help you can give me.
There are a couple of challenges here. First, you have to make sure that the intervals between your counter readings are constant (think communication outages, ...). If not, you'll have to deal with the resulting energy peaks.
Second, your counter will probably look like a discrete sawtooth signal, restarting at zero once in a while.
Here's how we did it.
For 2), we use as many continuous aggregates on the readings as we require resolutions (15 min, 60 min, ...). Use locf where required.
For 1), we do the delta computation on the fly, meaning that we query the DB for the readings and then loop through the array to compute the deltas. This way we can easily handle the sawtooth resets and the peaks.
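If you prefer to push this into the database instead, a minimal sketch of the same per-row difference using a window function (table and column names are illustrative, not from the original schema):
SELECT time,
       meter_id,
       energy - lag(energy) OVER (PARTITION BY meter_id ORDER BY time) AS delta
FROM meter_readings
ORDER BY meter_id, time;
Counter resets still show up here as negative deltas, so they need the same special handling described above.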
I've just got an answer to my question, which is very similar to your Part 1, here
In short, the answer I got was to use a before_insert trigger and calculate the difference values on insertion, storing them in a new column. This avoids needing to re-calculate deltas on every query.
I extended the function suggested in that answer by also calculating the delta_time with
NEW.delta_time = EXTRACT (EPOCH FROM NEW.time - previous_time);
This returns the number of seconds which have passed, allowing you to calculate meter power reliably.
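The linked answer isn't reproduced here, so purely as a hedged sketch of what such a trigger might look like (table and column names are hypothetical; PostgreSQL 11+ trigger syntax):
CREATE OR REPLACE FUNCTION compute_deltas() RETURNS trigger AS $$
DECLARE
  previous_time  timestamptz;
  previous_value numeric;
BEGIN
  -- fetch the most recent earlier reading for this meter
  SELECT time, energy INTO previous_time, previous_value
  FROM meter_readings
  WHERE meter_id = NEW.meter_id AND time < NEW.time
  ORDER BY time DESC
  LIMIT 1;

  IF previous_time IS NOT NULL THEN
    NEW.delta      := NEW.energy - previous_value;                   -- energy increase since the last reading
    NEW.delta_time := EXTRACT(EPOCH FROM NEW.time - previous_time);  -- seconds elapsed since the last reading
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER meter_readings_before_insert
  BEFORE INSERT ON meter_readings
  FOR EACH ROW EXECUTE FUNCTION compute_deltas();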
For Part 2, consider a continuous aggregate with time buckets, as suggested above.
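A rough sketch of such a continuous aggregate (TimescaleDB 2.x syntax; names are illustrative, and it assumes a per-row delta column like the one produced by the trigger above):
CREATE MATERIALIZED VIEW energy_30m
WITH (timescaledb.continuous) AS
SELECT time_bucket('30 minutes', time) AS bucket,
       meter_id,
       sum(delta) AS energy_increase
FROM meter_readings
GROUP BY bucket, meter_id;
-- add a refresh policy with add_continuous_aggregate_policy(...) to keep it up to date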

How to do a distinct count of a metric using graphite datasource in grafana?

I have a metric that shows the state of a server. The values are integers and if the value is 0 (zero) then the server is stable, else it is unstable. And the graph we have is at a minute level. So, I want to show an aggregated value to know how many hours the server is unstable in the selected time range.
Let's say, if I select "Last 7 days" as the time duration... we should get X hours of server instability.
And one more thing: I have a line graph (time series graph) that shows the state of the server... but the thing is, when I select "Last 24 hours or 48 hours" I am getting the graph at a minute level... when I increase the duration to a quarter I am getting the graph for every 5 min or something like that... I understand it's aggregating the values... but does anybody know how Grafana is doing the aggregation?
I have tried "scaleToSeconds" function and "ConsolidateBy" functions and many more to first get the count of non zero value minutes, but no success.
Any help would be greatly appreciated.
Thanks in advance.
There are a few different ways to tackle this. There are two places where aggregation happens in this situation:
When you query for a time range longer than your raw retention interval and whisper returns aggregated data. The aggregation method used here is defined in your carbon aggregation configuration.
When Grafana sends a query to Graphite it passes maxDataPoints=<width of graph in pixels>, and Graphite will perform aggregation to return at most that many points (because you don't have enough pixels to render more points than that). The method used for this consolidation is controlled by the consolidateBy function.
It is possible for both of these to be used in the same query. For example, if you have a panel that queries 3 days' worth of data, and you store 2 days at 1-minute and 7 days at 5-minute intervals in Whisper, then you'd have 72 * 60 / 5 = 864 points from the 5-minute archive, but if your graph is only 500px wide then at runtime that would be consolidated down to 10-minute intervals and return 432 points.
So, if you want to always have access to the count then you can change your carbon configuration to use sum aggregation for those series (and remove the existing whisper files so new ones are created with the new aggregation config), and pass consolidateBy('sum') in your queries, and you'll always get the sum back for each interval.
That said, you can also address this at query time by multiplying the average back out to get a total (assuming that your whisper aggregation config is using average). The simplest way to do that will be to summarize the data with average into buckets that match the longest aggregation interval you'll be querying, then scale those values by that interval to calculate the total number of minutes. Finally, you'll want to use consolidateBy('sum') so that any runtime consolidation will work properly.
consolidateBy(scale(summarize(my.series, '10min', 'avg'), 60), 'sum')
With all of that said, you may want to consider reporting uptime in terms of percentages rather than raw minutes, in which case you can use the raw averages directly.
When you say the value is zero (0), the server is healthy - what other values are reported while the server is unhealthy/unstable? If you're only reporting zero (healthy) or one (unhealthy), for example, then you could use the sumSeries function to get a count across multiple servers.
Some more information is needed here about the types of values the server is reporting in order to give you a better answer.
Grafana does aggregate - or consolidate - data typically by using the average aggregation function. You can override this using the 'sum' aggregation in the consolidateBy function.
To get a running calculation over time, you would most likely have to use the summarize function (also with the sum aggregation) and define the time period, e.g. 1 hour, 1 day, 1 week, and so on. You could take this a step further by combining this with a time template variable so that as the period grows/shrinks, the summarize period will increase/decrease accordingly.
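For instance, assuming the non-zero state values are what the server reports while unstable, a hypothetical query along these lines (the metric name is illustrative) turns every unstable minute into a 1 and sums them per hour:
summarize(isNonNull(removeBelowValue(absolute(my.server.state), 1)), '1h', 'sum')
Dividing the result by 60 (e.g. with scale) then gives hours of instability. As noted above, this only counts correctly while the raw per-minute points are still available; once retention or maxDataPoints consolidation has averaged them away, you are back to needing sum-based consolidation.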

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use Spark SQL to select groups of players via their stats that meet certain criteria.
Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column, e.g. Batting Average or RBIs. From there I would like to break all the players into percentile groups based on their average performance compared to all players: the top 10%, bottom 10%, 40-50%, and so on.
I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max) all as strings though. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10-percentiles?
So far my thoughts are to find the values that bookend the percentile ranges and writing a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.
I was able to get the percentiles by using window functions, applying ntile() and cumeDist() over the window. ntile() creates groups based on an input number: if you want things grouped by 10%, just use ntile(10); if by 5%, then ntile(20). For a more fine-tuned result, cumeDist() applied over the window will output a new column with the cumulative distribution, and those values can be filtered from there through select(), where(), or a SQL query.
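A hedged sketch of that approach in Scala (column and file names are illustrative; it assumes the data is already in a DataFrame and uses the Spark 2.x+ API, where cumeDist() is spelled cume_dist()):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{avg, stddev, ntile, cume_dist, col}
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.appName("BattingStats").getOrCreate()
// hypothetical input: one row per player with a battingAverage column
val players = spark.read.option("header", "true").option("inferSchema", "true").csv("batting.csv")

// describe() returns strings; agg() keeps the numeric types, so these come back as Doubles
val Row(meanBA: Double, stddevBA: Double) =
  players.agg(avg("battingAverage"), stddev("battingAverage")).head

// a window over all players (no partitionBy, so everything moves to one partition - fine for a small table)
val byAvg  = Window.orderBy(col("battingAverage").desc)
val ranked = players
  .withColumn("decile", ntile(10).over(byAvg))     // 1 = top 10%, 10 = bottom 10%
  .withColumn("pct", cume_dist().over(byAvg))      // cumulative distribution for finer cuts

val top10Percent = ranked.where(col("decile") === 1)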

PIG - filtering groupby by the contents of the group

I am new to pig, and I am wondering if I can do any inter-group filtering easily with it.
I have some data grouped by userid and some timestamps. I want to take only the groups that have two consecutive timestamps that are less than 30 minutes apart. Is this easy to express in Pig?
Thanks a lot!
The cleanest way to do this would be to write a UDF. The function would take a bag of timestamps as input, order them, and compute the minimum difference between timestamps. You could then filter your data based on the output of this UDF.
It is possible to do this in pure Pig Latin, if you really want to, although it involves more temporary data and map-reduce cycles, which means it may not be worth it. This would involve FLATTENing the bag of timestamps twice to get its cross-product, creating an indicator variable for any pairs of timestamps separated by less than 30 minutes, and then summing this variable for each user. Any user with a sum greater than zero has the property you desire.
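A rough sketch of that pure Pig Latin route, with illustrative field names and timestamps assumed to be in seconds:
events  = LOAD 'events' AS (userid:chararray, ts:long);
grouped = GROUP events BY userid;
-- FLATTENing the bag of timestamps against itself gives the cross product of pairs
pairs   = FOREACH grouped GENERATE group AS userid,
              FLATTEN(events.ts) AS ts1, FLATTEN(events.ts) AS ts2;
flags   = FOREACH pairs GENERATE userid,
              ((ts1 < ts2 AND ts2 - ts1 < 1800) ? 1 : 0) AS close_pair;
byuser  = GROUP flags BY userid;
counts  = FOREACH byuser GENERATE group AS userid, SUM(flags.close_pair) AS n_close;
keepers = FILTER counts BY n_close > 0;
-- JOIN keepers back to the original relation on userid to recover the full groups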
Give it a go, and if you run into any specific issues, post another question outlining exactly where you're stuck.