I want to try avg() aggregation within a time window
SQL code:
select
    user_id,
    timestamp,
    avg(y) over (order by timestamp
                 range between '5 second' preceding and '5 second' following)
from A
but the system reports an error:
RANGE PRECEDING is only supported with UNBOUNDED
Is there any method to implement, say, a 10 second window for avg() window function?
The frame of the window function should span from n seconds before the current row's timestamp to m seconds after it.
RANGE PRECEDING is only supported with UNBOUNDED
Yep ... PostgreSQL's window functions don't yet implement RANGE frames with offsets.
I've had many situations where they would've been useful, but it's a lot of work to implement them and time is limited.
Is there any method to implement, say, a 10 second window for avg() window function?
You will need to use a left join over generate_series (and, if appropriate, aggregation) to turn the range into a regular sequence of rows: insert null rows where there's no data, and collapse multiple values that fall within the same second into a single value.
Then you do a (ROWS n PRECEDING ...) window over the left-joined and aggregated data to get the running average.
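A minimal sketch of that approach, assuming the table is A(user_id, timestamp, y) as in the question and that the query covers a single user; the 1-second grid and all other names are illustrative:

with bounds as (
    select date_trunc('second', min(timestamp)) as t0,
           date_trunc('second', max(timestamp)) as t1
    from A
),
per_second as (
    -- one row per second; avg() collapses multiple readings within the same second
    select s.sec, avg(a.y) as y_sec
    from bounds b
    cross join generate_series(b.t0, b.t1, interval '1 second') as s(sec)
    left join A a on date_trunc('second', a.timestamp) = s.sec
    group by s.sec
)
select sec,
       -- 5 seconds preceding + current row + 5 seconds following; avg() ignores the null gap rows
       avg(y_sec) over (order by sec rows between 5 preceding and 5 following) as moving_avg
from per_second
order by sec;

For multiple users you would filter on user_id in both places (or partition the window by user_id).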
How can I plot time-grouped increment data in a bar graph in Grafana, but with a sparse data source that needs interpolation BEFORE calculating the increment?
My data source is an InfluxDB with a sparse time series of accumulated values (think: gas meter readings). The data points are usually a few days apart.
My goal is to create a bar graph with value increase per day. For the missing values, linear interpolation will do just fine.
I've come up with
SELECT spread("value") FROM "gas" WHERE $timeFilter GROUP BY time(1d) fill(linear)
but this won't work as the fill(linear) command is executed AFTER the spread(value) command. If I use time periods much greater than my granularity of input data (e.g. time(14d)), it shows proper bars, but once I use smaller periods, the bars collapse to 0.
How can I apply the interpolation BEFORE the difference operation?
The situation you describe is caused by the fact that fill() only fills data when a GROUP BY time() period in your query contains nothing at all. If you get spread = 0, you probably have only one value in that period, so no fill() is applied.
What I suggest is using a subquery with a shorter GROUP BY time() period to prepare an interpolation of your original signal. This is an example:
SELECT spread("interpolated_value") FROM (
SELECT first("value") as "interpolated_value" from "gas"
WHERE $timeFilter
GROUP BY time(10s) fill(linear)
)
GROUP BY time(1d) fill(none)
The subquery prepares a value for each 10s period (I recommend setting this period as long as you can accept). If a 10s period contains values, it picks the first one; if there is no value in the period, it interpolates one.
The main query then uses this prepared, interpolated set of values to calculate the spread.
All of the above only describes how to get interpolated data for shorter periods. I strongly recommend thinking about whether this data is actually usable: a spread calculated from linearly interpolated data may have questionable reliability.
I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why the random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country, and the incoming rows usually arrive first all from country A, then from country B, etc. For this reason I am worried that the second option would sample too many rows from a single country rather than randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?
I beg to differ, OP. I prefer the second option.
With the first option, you are assigning a value between 0 and 1 to every row and then picking the first 10000 records, so Impala has to process all rows in the table and the operation will be slow on a 20-million-row table.
With the second option, Impala randomly picks rows from the underlying files based on the percentage you provide. Since this works at the file level, the returned row count may differ from the percentage you specified. This method is also what Impala uses to compute statistics. So performance-wise it is much better, but the correctness of the randomness can be a problem.
Final thought -
If you are worried about the randomness and correctness of your sample, go for option 1. But if you are not too worried about randomness and just want sample data with quick performance, then pick the second option. Since Impala uses it for COMPUTE STATS, I'd pick that one :)
EDIT: After looking at your requirement, I have a method to sample over a particular field or fields.
We will use a window function to assign a random row number within each country group, then pick 1% (or whatever percentage you want) from that data set.
This makes sure the data is evenly distributed between countries and that each country contributes the same percentage of rows to the result set.
select *
from (
    select
        row_number() over (partition by country order by random()) as rn,
        count(*) over (partition by country) as cntpartition,
        tab.*
    from dat.mytable tab
) rs
where rs.rn <= rs.cntpartition / 100   -- keep roughly 1% of each country's rows
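If you want a fixed number of rows per country instead of a percentage (an illustrative variant of the same trick, not part of the original answer), you can filter on the random row number directly:

select *
from (
    select
        row_number() over (partition by country order by random()) as rn,
        tab.*
    from dat.mytable tab
) rs
where rs.rn <= 100   -- e.g. 100 random rows per country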
HTH
I am trying to show the change in moving average by county on a map.
Currently, I have the calculated field for this:
IF ISNULL(LOOKUP(SUM([Covid Count]), -14)) THEN NULL ELSE
    WINDOW_AVG(SUM([Covid Count]), -7, 0) - WINDOW_AVG(SUM([Covid Count]), -14, -7)
END
This works in creating a line graph where I filter the dates to only include 15 consecutive dates. This results in one point with the correct change in average.
I would like this number to be plotted on a map, but it says there are only null values.
The formula is only one part of defining a table calculation (a class of calculations performed client side by Tableau on the aggregate query results returned from the data source).
Equally critical are the dimensions in play on the view, which determine the level of detail of the query, and the instructions you provide to tell Tableau how to slice up or lay out the query results before applying the table calc formula. This critical step is known as setting the "partitioning and addressing" for the table calc, sometimes also as setting the "compute using". Read about it in the online help for table calcs. You can experiment with it via the Edit Table Calc dialog by clicking on the corresponding pill.
In short, you probably have to add a dimension, such as your Date field, to some shelf - likely the Detail shelf - and then set the partitioning and addressing, probably to partition by county and address by your date field.
If you have more than a couple of weeks of data, then you’ll get multiple marks per county. You may need to decide how to handle that on your map.
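As an illustrative sketch (the field names come from the question, but the compute-using choices are assumptions about a typical map setup), the calculated field itself can stay as it is; the table calculation settings do the rest:

// Calculated field (unchanged from the question)
IF ISNULL(LOOKUP(SUM([Covid Count]), -14)) THEN NULL ELSE
    WINDOW_AVG(SUM([Covid Count]), -7, 0) - WINDOW_AVG(SUM([Covid Count]), -14, -7)
END

// On the map sheet (one possible setup, not the only valid one):
// 1. Put [County] (and [State]) on Detail to draw the map marks.
// 2. Put the [Date] field on Detail as a discrete exact date.
// 3. Edit Table Calculation -> Compute Using -> Specific Dimensions:
//    check [Date] (addressing), leave [County]/[State] unchecked (partitioning).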
I am trying to calculate a 30-day rolling average. However, in Tableau I have to use window_avg(avg(variable), -30, 0), which actually calculates an average of daily averages: it first calculates the average value per day, then averages those values over the past 30 days. I am wondering whether there is a function in Tableau that can calculate a rolling average directly, like pandas.rolling?
In this specific case, you can use the following
window_sum(sum(variable), -30, 0) / window_sum(sum(1), -30, 0)
A few concepts about table calcs to keep in mind
Table calcs operate on aggregate query results.
This gives you flexibility - you can partition the table of query results in many ways, access multiple values in the result set, order the query results to impact your calculations, nest table calcs in different ways.
This approach can also give you efficiency if you can calculate what you need simply from the aggregate results that you've already fetched.
It also gives you complexity. You have to be aware of how each calculation specifies the addressing and partitioning of the query results. You also have to think about how double aggregation will impact your results.
In most cases, applying back-to-back aggregation functions requires some careful thought about what the results will mean. As you've noted, averages of averages may not mean what people think they mean. Others may be quite reasonable, say averages of daily sales totals.
In some cases, double aggregation can be used without extra thought because the results are the same regardless. Sums of sums, mins of mins, and maxes of maxes yield the same result as calling sum, min, or max on the underlying data rows. These functions are called additive aggregation functions, and they obey the associative rule you learned in grade school. Hence, the formula at the start of this answer.
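A quick worked example with illustrative numbers (not from the question): suppose day 1 has a single value of 10 and day 2 has three values of 20. The average of the daily averages is (10 + 20) / 2 = 15, but the true average over all four rows is (10 + 3 * 20) / 4 = 17.5. The window_sum formula at the start of this answer returns 17.5, because it divides the windowed total sum by the windowed row count instead of averaging per-day averages.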
You can also read about the Total() function.
I have a query (with a subquery) that calculates an average of temperatures over the previous years, plus/minus one week around each day. It works, but it is not all that fast. The time series values below are just an example. The reason I'm using doy is that I want a sliding window around the same date in every year.
SELECT days,
       (SELECT avg(temperature)
        FROM temperatures
        WHERE site_id = ?
          AND extract(doy FROM timestamp)
              BETWEEN extract(doy FROM days) - 7 AND extract(doy FROM days) + 7
       ) AS temperature
FROM generate_series('2017-05-01'::date, '2017-08-31'::date, interval '1 day') days
So my question is: could this query somehow be improved? I was thinking about using some kind of window function, or possibly lag and lead. However, regular window functions only operate on a specific number of rows, whereas there can be any number of measurements within the two-week window.
I can live with what I have for now, but as the amount of data grows, so does the execution time of the query. The two latter extracts could be removed, but that brings no noticeable speed improvement and only makes the query less legible. Any help would be greatly appreciated.
The best index for your original query is
create index idx_temperatures_site_id_timestamp_doy
on temperatures(site_id, extract(doy from timestamp));
This can greatly improve your original query's performance.
While your original query is simple and readable, it has one flaw: it will calculate every day's average 14 times (on average). Instead, you could calculate these averages on a daily basis and then calculate the two-week window's weighted average (the weight for a day's average needs to be the count of the individual rows in your original table). Something like this:
with p as (
    select timestamp '2017-05-01' as min,
           timestamp '2017-08-31' as max
)
select t.*
from p
cross join (
    select days,
           -- weighted average over the +/- one week frame: total sum divided by total row count
           sum(sum(temperature)) over pn1week / sum(count(temperature)) over pn1week as temperature
    from p
    cross join generate_series(min - interval '1 week', max + interval '1 week', interval '1 day') days
    left join temperatures on site_id = ? and extract(doy from timestamp) = extract(doy from days)
    group by days
    window pn1week as (order by days rows between 7 preceding and 7 following)
) t
where days between min and max
order by days
However, there is not much gain here, as this is only twice as fast as your original query (with the optimal index).
http://rextester.com/JCAG41071
Notes: I used timestamp because I assumed your column's type is timestamp. But as it turned out, you use timestamptz (aka timestamp with time zone). With that type you cannot index the extract(doy from timestamp) expression, because the expression's output depends on the client's time zone setting.
For timestamptz, use an index that (at least) starts with site_id. Using the window version should improve performance anyway.
http://rextester.com/XTJSM42954
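A minimal sketch of those index options for the timestamptz case (the quoted column name and the idea of pinning the expression to UTC are assumptions, not from the original answer; the query has to use the very same pinned expression for the second index to be usable):

-- plain index starting with site_id, as suggested above
create index idx_temperatures_site_id on temperatures (site_id);

-- alternatively, pinning the doy calculation to one fixed time zone makes the
-- expression immutable and therefore indexable
create index idx_temperatures_site_id_doy_utc
    on temperatures (site_id, extract(doy from "timestamp" at time zone 'UTC'));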