TSQL Rolling Average of Time Groupings

This is a follow-up to TSQL Group by N Seconds. (I got what I asked for, but didn't ask for the right thing.)
How can I get a rolling average of 1-second groups of count(*)?
So I want to return per-second counts, but I also want to be able to smooth that out over certain intervals, say 10 seconds.
So one method might be to take the per-second average over every 10 seconds. Can that be done in TSQL?
Ideally, the time field would be returned in Unix Time.

SQL Server is not particularly good at rolling/cumulative queries.
You can use this:
WITH q (unix_ts, cnt) AS
(
    SELECT DATEDIFF(s, '1970-01-01', ts), COUNT(*)
    FROM record
    GROUP BY DATEDIFF(s, '1970-01-01', ts)
)
SELECT *
FROM q q1
CROSS APPLY
(
    SELECT AVG(cnt) AS smooth_cnt
    FROM q q2
    WHERE q2.unix_ts BETWEEN q1.unix_ts - 5 AND q1.unix_ts + 5
) q2
However, this may not be very efficient, since it will scan the same overlapping intervals over and over again.
For larger intervals, it may even be better to use a CURSOR-based solution that lets you keep intermediate results (though cursors are normally worse performance-wise than pure set-based solutions).
Oracle and PostgreSQL support this clause:
WITH q (unix_ts, cnt) AS
(
    SELECT TRUNC(ts, 'ss'), COUNT(*)
    FROM record
    GROUP BY TRUNC(ts, 'ss')
)
SELECT q.*,
       AVG(cnt) OVER (ORDER BY unix_ts
                      RANGE BETWEEN INTERVAL '5' SECOND PRECEDING
                                AND INTERVAL '5' SECOND FOLLOWING) AS smooth_cnt
FROM q
which keeps an internal window buffer and is very efficient.
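For reference, a PostgreSQL flavor of the same idea might look like this (a sketch assuming PostgreSQL 11 or later, which added RANGE frames with interval offsets; date_trunc stands in for Oracle's TRUNC):
-- Sketch for PostgreSQL 11+ (RANGE frames with interval offsets)
WITH q (sec, cnt) AS
(
    SELECT date_trunc('second', ts), COUNT(*)
    FROM record
    GROUP BY date_trunc('second', ts)
)
SELECT sec, cnt,
       AVG(cnt) OVER (ORDER BY sec
                      RANGE BETWEEN INTERVAL '5 seconds' PRECEDING
                                AND INTERVAL '5 seconds' FOLLOWING) AS smooth_cnt
FROM q;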
SQL Server, unfortunately, did not support moving window frames until SQL Server 2012, and even then it only supports ROWS-based frames, not RANGE frames with a value offset like the one above.
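On SQL Server 2012 or later, a ROWS frame can approximate the same smoothing. This is a sketch, not from the original answer, and it only matches the RANGE version when every second in the window actually has a row (gaps make the frame span more than 11 seconds of wall-clock time):
-- Sketch for SQL Server 2012+: ROWS counts rows, not seconds
WITH q (unix_ts, cnt) AS
(
    SELECT DATEDIFF(s, '1970-01-01', ts), COUNT(*)
    FROM record
    GROUP BY DATEDIFF(s, '1970-01-01', ts)
)
SELECT unix_ts, cnt,
       -- AVG on an int column returns an int; cast cnt if you need a fractional average
       AVG(cnt) OVER (ORDER BY unix_ts
                      ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING) AS smooth_cnt
FROM q;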

Related

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that we have many rows for each given calendar day, and there is no fixed number of rows for a given day.
Based on this dataset I need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old fashioned way and tried with a scalar subquery:
SELECT
    xx.*
    ,(
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Herein lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time. I need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on speed, but it should take a lot less time than your current query. Hopefully you have indexes on both of these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
SELECT calday, COUNT(DISTINCT userid) AS daily
FROM data_table
GROUP BY calday
) t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table for all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30 day on that. Keeps the join much smaller and returns quickly (just under 1 second for 45000 rows in the source table on my system).
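A supporting index is assumed by the answer above; a hypothetical definition (the index name and column order are my guesses, not part of the original):
-- Assumed index to support the 30-day range join on calday and the DISTINCT userid count
CREATE INDEX idx_data_table_calday_userid ON data_table (calday, userid);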

How to get a count of timestamps where the interval to the next row is bigger than xx seconds in PostgreSQL

I have a table with 3 columns (Postgres 9.6): serial, timestamp, clock_name.
Usually there is a 1-second difference between each row, but sometimes the interval is bigger.
I'm trying to get the number of occasions where the timestamp interval between 2 rows was bigger than 10 seconds (let's say I limit this to 1000 rows).
I would like to do this in one query (probably a select from a select), but I have no idea how to write such a query; my SQL knowledge is very basic.
Any help will be appreciated.
You can use window functions to retrieve the next record given the current record.
Using ORDER BY on the function to ensure things are in timestamp order, and PARTITION BY to keep the clocks separate, you can find for each row the row that follows it.
WITH links AS
(
SELECT
id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
FROM myTable
)
SELECT * FROM links
WHERE
EXTRACT(EPOCH FROM (next_ts - ts)) > 10
You can then just compare the time stamps.
Window functions: https://www.postgresql.org/docs/current/static/functions-window.html
Or, if you prefer, you can use a derived table instead of the WITH clause:
SELECT * FROM (
SELECT
id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
FROM myTable
) links
WHERE
EXTRACT(EPOCH FROM (next_ts - ts)) > 10
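Since the question asked for the number of such occasions rather than the rows themselves, a minimal sketch wrapping the same derived table in a count (the 1000-row limit from the question is left out here and would need to be applied as appropriate):
SELECT COUNT(*) AS gap_count
FROM (
    SELECT
        ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
    FROM myTable
) links
WHERE EXTRACT(EPOCH FROM (next_ts - ts)) > 10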

How to get a simple hash join query to perform as well as a complex sort merge query?

I have a system that logs information about running processes. Each running process contains a series of steps that may or may not run in parallel. The system logs information about a process and its steps to two separate tables:
CREATE TABLE pid (
pid integer,
start_time timestamp,
end_time timestamp,
elapsed bigint,
aborted integer,
label char(30)
);
CREATE TABLE pid_step (
pid integer,
step integer,
start_time timestamp,
end_time timestamp,
elapsed bigint,
mem bigint,
...
);
The pid_step table contains a bunch of resource usage stats about each step which I have simplified here as just the mem column which logs the # of bytes of memory allocated for that step. I want to sample memory allocation by process label, perhaps at 5 second intervals, so I can plot it. I need a result similar to the following:
tick label mem
----------------------- ------ -----------
2014-11-04 05:37:40.0 foo 328728576
2014-11-04 05:37:40.0 bar 248436
2014-11-04 05:37:40.0 baz 1056144
2014-11-04 05:37:45.0 foo 1158807552
2014-11-04 05:37:45.0 bar 632822
2014-11-04 05:37:45.0 baz 854398
Since the logs only give me starting and ending timestamps for each process and step instead of a sample of resource usage at 5 second intervals, I need to find the most efficient way to determine which process steps were running at each 5 second interval (tick) and then aggregate their allocated memory. I've made 3 separate attempts that all produce the same results with varying levels of performance. For brevity's sake I'll put each query and its explain plan in a gist (https://gist.github.com/anonymous/3b57f70015b0d234a2de) but I'll explain my approach for each:
This was my first attempt and it is definitely the most intuitive and easiest to maintain. It cross joins the distinct process labels with generate_series to generate the 5 second ticks for each label, then left joins on the pid and pid_step tables. The left join creates a "zero-fill" effect and ensures we do not drop out any ticks that have no associated data. Unfortunately this approach performs the worst (see benchmark link below) and I believe it is due to the use of a hash join where the between t2.start_time and t2.end_time predicate is handled as a join filter instead of a join condition.
This was my second attempt and it performs way better but is a lot less intuitive and maintainable. The "zero-fill" approach is the same as in query 1. But, before performing the left join of pid and pid_step, I pre-compute the ticks that have associated data based on the max process elapsed time and the process steps start and end times. This allows for a sort merge join where both the tick and label predicates can be expressed as a join condition and no join filters are used.
This was my final attempt and it performs the best with about the same intuitiveness and maintainability as query 2. The optimization here is that I use the max process step elapsed time which is guaranteed to be smaller than the max process elapsed time and therefore creates a smaller nested loop at the beginning of CTE t3.
Ideally, I'd like the SQL to be as simple and maintainable as query 1 but perform as well as query 3. Is there anything I can do in the way of indexes or a slight rewrite of query 1 that would improve the performance?
Benchmark Results: http://i.imgur.com/yZxdQlM.png
Here is a solution using the power of PostgreSQL ranges (SQLFiddle)
CREATE TABLE pid (
pid integer PRIMARY KEY,
label char(30)
);
CREATE TABLE pid_step (
pid integer,
step serial,
start_time timestamp,
end_time timestamp,
mem bigint,
PRIMARY KEY (pid, step)
);
The sampling method is a good idea but is, in my opinion, an optimisation. Here is my solution:
Let's say we want to plot one day of data. We split this day into a number of time slices, each lasting 5 seconds. For one process and for one time slice, we want to retrieve the average memory of all steps that ran during these 5 seconds.
So instead of sampling every 5 seconds (which can hide data spikes), we are displaying an aggregation of the relevant data for these 5 seconds. The aggregation can be any available PostgreSQL aggregate function.
The first step is to generate these time slices (as you already did without using the range datatype):
-- list of time ranges of 5 seconds interval
-- inclusive lower bound, exclusive upper bound
SELECT
tsrange(tick, tick + '5 seconds'::interval, '[)') as time_range
FROM generate_series(
'2001-02-16 21:28:30'::timestamp,
'2001-02-16 22:28:30'::timestamp,
'5 seconds'::interval
) AS tick
Notice that these slices do not overlap each other, as the lower bound is inclusive and the upper bound is exclusive.
Here is the tricky part: we don't want to change our table schema by removing start_time and end_time and creating a range column for this data. Fortunately, PostgreSQL allows indexes on expressions:
-- create index on the step's time range ('()' = exclusive lower and upper bounds)
CREATE INDEX pid_step_tstzrange_index ON pid_step
USING gist (tsrange(start_time, end_time, '()'));
With this index, we are now able to use the wide variety of PostgreSQL range operators at a fraction of the processing cost. The only caveat is that, in order to use this index, we must use the exact same expression in our query.
As you might already have guessed, the index will be used to join time slices and steps, as we need to join when the step's "virtual" range overlaps the time slice.
Here is the final query:
WITH
time_range AS (
    -- list of time ranges of 5 seconds interval
    -- inclusive lower bound, exclusive upper bound
    SELECT
        tsrange(tick, tick + '5 seconds'::interval, '[)') AS time_range
    FROM generate_series(
        '2001-02-16 21:28:30'::timestamp,
        '2001-02-16 22:28:30'::timestamp,
        '5 seconds'::interval
    ) AS tick
),
-- associate each pid_step with the matching time_range
-- aggregate the average memory usage for each pid for each time slice
avg_memory_by_pid_by_time_range AS (
    SELECT
        time_range,
        pid,
        avg(mem) AS avg_memory
    FROM
        time_range
        JOIN pid_step
            ON tsrange(pid_step.start_time, pid_step.end_time, '()') && time_range.time_range
    GROUP BY
        time_range,
        pid
)
-- embellish the result with some additional data from pid
SELECT
    lower(time_range) AS tick,
    pid.label AS label,
    trunc(avg_memory) AS mem
FROM
    avg_memory_by_pid_by_time_range
    JOIN pid ON avg_memory_by_pid_by_time_range.pid = pid.pid
ORDER BY
    lower(time_range),
    pid.label
;
I hope the performance holds up on your production data (there are a lot of details that factor into the query planning equation).

MS SQL Server 2008/2012 Get Min Difference between any two values

Given a table with a single money column how do I calculate the smallest difference between any two values in that table using TSQL? I'm looking for the performance optimized solution, which will work with millions of rows.
For SQL Server 2012 you could use
;WITH CTE
AS (SELECT YourColumn - Lag(YourColumn) OVER (ORDER BY YourColumn) AS Diff
FROM YourTable)
SELECT
Min(Diff) AS MinDiff
FROM CTE
This does it with one scan of the table (ideally you would have an index on YourColumn to avoid a sort and a narrow index on that single column would reduce IO).
I can't think of a nice way of getting it to short circuit and so do less than one scan of the table if it finds the minimum possible difference of zero. Adding MIN(CASE WHEN Diff = 0 THEN 1/0 END) to the SELECT list and trapping the divide by zero error as a signal that zero was found would probably work but I can't really recommend that approach...
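The narrow single-column index mentioned above might look like this (a sketch; the names are the answer's placeholders):
-- Lets the window function read YourColumn in order without a sort and with minimal IO
CREATE NONCLUSTERED INDEX IX_YourTable_YourColumn ON YourTable (YourColumn);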

Using DATEDIFF in T-SQL

I am using DATEDIFF in an SQL statement. I am selecting it, and I need to use it in WHERE clause as well. This statement does not work...
SELECT DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM MyTable
WHERE InitialSave <= 10
It gives the message: Invalid column name "InitialSave"
But this statement works fine...
SELECT DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM MyTable
WHERE DATEDIFF(ss, BegTime, EndTime) <= 10
The programmer in me says that this is inefficient (seems like I am calling the function twice).
So two questions. Why doesn't the first statement work? Is it inefficient to do it using the second statement?
Note: When I originally wrote this answer I said that an index on one of the columns could create a query that performs better than other answers (and mentioned Dan Fuller's). However, I was not thinking 100% correctly. The fact is, without a computed column or indexed (materialized) view, a full table scan is going to be required, because the two date columns being compared are from the same table!
I believe there is still value in the information below, namely 1) the possibility of improved performance in the right situation, as when the comparison is between columns from different tables, and 2) promoting the habit in SQL developers of following best practice and reshaping their thinking in the right direction.
Making Conditions Sargable
The best practice I'm referring to is one of moving one column to be alone on one side of the comparison operator, like so:
SELECT InitialSave = DateDiff(second, T.BegTime, T.EndTime)
FROM dbo.MyTable T
WHERE T.EndTime <= T.BegTime + '00:00:10'
As I said, this will not avoid a scan on a single table, however, in a situation like this it could make a huge difference:
SELECT InitialSave = DateDiff(second, T.BegTime, T.EndTime)
FROM
dbo.BeginTime B
INNER JOIN dbo.EndTime E
ON B.BeginTime <= E.EndTime
AND B.BeginTime + '00:00:10' > E.EndTime
EndTime is in both conditions now alone on one side of the comparison. Assuming that the BeginTime table has many fewer rows, and the EndTime table has an index on column EndTime, this will perform far, far better than anything using DateDiff(second, B.BeginTime, E.EndTime). It is now sargable, which means there is a valid "search argument"--so as the engine scans the BeginTime table, it can seek into the EndTime table. Careful selection of which column is by itself on one side of the operator is required--it can be worth experimenting by putting BeginTime by itself by doing some algebra to switch to AND B.BeginTime > E.EndTime - '00:00:10'
Precision of DateDiff
I should also point out that DateDiff does not return elapsed time, but instead counts the number of boundaries crossed. If a call to DateDiff using seconds returns 1, this could mean 3 ms of elapsed time, or it could mean 1997 ms! This is essentially a precision of +/- 1 time unit. For the better precision of +/- 1/2 time unit, you would want the following query comparing 0 to EndTime - BegTime:
SELECT DateDiff(second, 0, EndTime - BegTime) AS InitialSave
FROM MyTable
WHERE EndTime <= BegTime + '00:00:10'
This now has a maximum rounding error of only one second total, not two (in effect, a floor() operation). Note that you can only subtract the datetime data type--to subtract a date or a time value you would have to convert to datetime or use other methods to get the better precision (a whole lot of DateAdd, DateDiff and possibly other junk, or perhaps using a higher precision time unit and dividing).
This principle is especially important when counting larger units such as hours, days, or months. A DateDiff of 1 month could be 62 days apart (think July 1, 2013 - Aug 31 2013)!
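A small illustration of the boundary-counting behaviour (hypothetical values, using the 3 ms granularity of the datetime type):
-- Only 3 ms elapse, but one second boundary is crossed, so DATEDIFF returns 1
SELECT DATEDIFF(second, '2013-07-01 00:00:00.997', '2013-07-01 00:00:01.000');
-- 1997 ms elapse, yet still only one boundary is crossed, so DATEDIFF also returns 1
SELECT DATEDIFF(second, '2013-07-01 00:00:00.000', '2013-07-01 00:00:01.997');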
You can't reference column aliases defined in the SELECT clause from the WHERE clause, because logically the WHERE clause is evaluated before the SELECT clause.
You can do this, however:
SELECT InitialSave
FROM (
    SELECT DATEDIFF(ss, BegTime, EndTime) AS InitialSave
    FROM MyTable
) aTable
WHERE InitialSave <= 10
As a side note: this essentially still evaluates the DATEDIFF as part of the filtering, just wrapped in a derived table. Using functions on columns in WHERE clauses prevents indexes from being used efficiently and should be avoided if possible; however, if you've got to use DATEDIFF, then you've got to do it!
Beyond making it "work", you need to use an index.
Use a computed column with an index, or a view with an index; otherwise you will table scan. When you get enough rows, you will feel the PAIN of the slow scan!
Computed column & index:
ALTER TABLE MyTable ADD
ComputedDate AS DATEDIFF(ss,BegTime, EndTime)
GO
CREATE NONCLUSTERED INDEX IX_MyTable_ComputedDate ON MyTable
(
ComputedDate
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
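With the computed column and its index in place, a filter like this can use the index (a sketch; the optimizer can generally also match the original DATEDIFF expression to the computed column):
-- Filter on the indexed computed column instead of recomputing the expression per row
SELECT *
FROM MyTable
WHERE ComputedDate <= 10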
Create a view & index (note: an indexed view must be created WITH SCHEMABINDING, must reference the table with a two-part name, and its first index must be a unique clustered index; KeyValues below stands in for the table's key columns):
CREATE VIEW YourNewView
WITH SCHEMABINDING
AS
SELECT
    KeyValues
    ,DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM dbo.MyTable
GO
CREATE UNIQUE CLUSTERED INDEX IX_YourNewView
ON YourNewView (InitialSave, KeyValues)
GO
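Querying the indexed view might then look like this (a sketch; on editions other than Enterprise, the WITH (NOEXPAND) hint is needed for the optimizer to use the view's index):
SELECT KeyValues, InitialSave
FROM YourNewView WITH (NOEXPAND)
WHERE InitialSave <= 10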
You have to use the function instead of the column alias; it is the same with COUNT(*), etc. PITA.
As an alternative, you can use computed columns.