Vertica has a very nice class of operations: event-based window functions, which basically let you identify when an event occurs.
For example, CONDITIONAL_TRUE_EVENT increments a counter each time the given boolean expression evaluates to true.
We use this kind of approach heavily.
We are thinking about moving to RedShift, but we would need a similar function.
RedShift has some nice window functions, but I can't find this one.
Is there any way I can emulate this function using RedShift?
The CONDITIONAL_TRUE_EVENT() is rather easy to write with window functions. It's just a COUNT with a conditional (CASE):
SELECT ts, symbol, bid,
CONDITIONAL_TRUE_EVENT(bid > 10.6)
OVER (ORDER BY ts) AS oce
FROM Tickstore3
ORDER BY ts ;
becomes:
SELECT ts, symbol, bid,
COUNT(CASE WHEN bid > 10.6 THEN 1 END)
OVER (ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS oce
FROM Tickstore3
ORDER BY ts ;
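To sanity-check the COUNT(CASE ...) rewrite without a Redshift cluster, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) is only a stand-in here, and the sample rows are made up for illustration; the table and column names are taken from the query above.

```python
import sqlite3

# Hypothetical sample data; SQLite is used only as a local stand-in for Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tickstore3 (ts INTEGER, symbol TEXT, bid REAL)")
conn.executemany(
    "INSERT INTO Tickstore3 VALUES (?, ?, ?)",
    [(1, "XYZ", 10.0), (2, "XYZ", 10.7), (3, "XYZ", 10.5), (4, "XYZ", 11.0)],
)

# Running count of rows where the condition is true, in ts order.
rows = conn.execute(
    """
    SELECT ts, bid,
           COUNT(CASE WHEN bid > 10.6 THEN 1 END)
             OVER (ORDER BY ts
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS oce
    FROM Tickstore3
    ORDER BY ts
    """
).fetchall()

oce = [r[2] for r in rows]
print(oce)
```

The counter only moves on rows where bid > 10.6 holds (ts 2 and ts 4 in this sample) and carries its value forward on the other rows.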
CONDITIONAL_CHANGE_EVENT() is more complicated because it needs the previous row's value. It can be emulated using LAG() together with SUM() or COUNT() (or ROW_NUMBER()), but I think it requires a CTE or a derived table (or a self-join):
SELECT ts, symbol, bid,
CONDITIONAL_CHANGE_EVENT(bid)
OVER (ORDER BY ts) AS cce
FROM Tickstore3
ORDER BY ts ;
will become:
WITH emu AS
( SELECT ts, symbol, bid,
CASE WHEN bid <> LAG(bid) OVER (ORDER BY ts)
THEN 1
END AS change_bid
FROM Tickstore3
)
SELECT ts, symbol, bid,
COUNT(change_bid)
OVER (ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS cce
FROM emu
ORDER BY ts ;
I don't know how CONDITIONAL_CHANGE_EVENT() behaves with NULLs. If the column being checked for changes contains NULL values, and you want to detect a change from the last non-null value rather than just from the previous row, the rewrite will be even more complicated.
Edit: As far as I understand Redshift's documentation, an explicit window frame (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) is required for window aggregates when there is an ORDER BY. So you can (in fact, have to) use that, or whatever the default frame is in Vertica for these cases: it's either the above or RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
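The LAG() plus running COUNT() rewrite can be tried out the same way. This is a sketch under assumptions: SQLite (3.25+) stands in for Redshift, the bid values are invented, and there are no NULLs in the data, so it says nothing about how Vertica treats them.

```python
import sqlite3

# Hypothetical sample data with two changes in bid (at ts 3 and ts 5).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tickstore3 (ts INTEGER, symbol TEXT, bid REAL)")
conn.executemany(
    "INSERT INTO Tickstore3 VALUES (?, ?, ?)",
    [(1, "XYZ", 10.0), (2, "XYZ", 10.0), (3, "XYZ", 10.5),
     (4, "XYZ", 10.5), (5, "XYZ", 10.2)],
)

rows = conn.execute(
    """
    WITH emu AS (
        SELECT ts, symbol, bid,
               -- 1 when bid differs from the previous row, otherwise NULL.
               -- On the first row LAG() is NULL, so the comparison is NULL too.
               CASE WHEN bid <> LAG(bid) OVER (ORDER BY ts) THEN 1 END AS change_bid
        FROM Tickstore3
    )
    SELECT ts, bid,
           COUNT(change_bid)
             OVER (ORDER BY ts
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cce
    FROM emu
    ORDER BY ts
    """
).fetchall()

cce = [r[2] for r in rows]
print(cce)
```

Because COUNT() skips NULLs, the first row and every unchanged row leave the counter where it was, and each change in bid bumps it by one.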
When using the mode() aggregation function, which tiebreaker criterion does the method use?
select mode() within group (order by my_field) FROM my_table
I couldn't find any documentation about that.
What happens if several values in the column occur equally often?
select my_field, count(*) FROM my_table group by 1
status | count
-------+------
4096   | 24
4098   | 24
In the example above I am getting 4096, but I would like to confirm whether it actually returns the lowest value, or whether this is happening for another reason.
UPDATE:
I still don't know how to fix this so that it's not an arbitrary choice; for now I'm adding another ORDER BY:
select mode() within group (order by my_field) FROM my_table order by my_field
Per the docs, it is arbitrary:
mode () WITHIN GROUP ( ORDER BY anyelement ) → anyelement
Computes the mode, the most frequent value of the aggregated argument
(arbitrarily choosing the first one if there are multiple
equally-frequent values). The aggregated argument must be of a
sortable type.
https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
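If the tie has to be broken deterministically, one option (my suggestion, not something the docs prescribe) is to skip mode() and rank the values yourself: group, order by frequency, then by whatever tiebreaker you want. A minimal sketch, using SQLite from Python as a stand-in and invented rows; the same GROUP BY ... ORDER BY COUNT(*) DESC, my_field LIMIT 1 query is plain SQL that PostgreSQL also accepts.

```python
import sqlite3

# Hypothetical data: 4096 and 4098 are tied as the most frequent values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_field INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [(4098,), (4096,), (4098,), (4096,), (5000,)],
)

# Most frequent value; ties resolved by picking the lowest my_field.
mode_low = conn.execute(
    """
    SELECT my_field
    FROM my_table
    GROUP BY my_field
    ORDER BY COUNT(*) DESC, my_field ASC
    LIMIT 1
    """
).fetchone()[0]

# Same idea, but ties resolved by picking the highest my_field.
mode_high = conn.execute(
    """
    SELECT my_field
    FROM my_table
    GROUP BY my_field
    ORDER BY COUNT(*) DESC, my_field DESC
    LIMIT 1
    """
).fetchone()[0]

print(mode_low, mode_high)
```

The second sort key is what makes the choice explicit; an ORDER BY placed after the aggregate in the outer query does not affect which tied value is picked.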
I'm trying to do a distinct operation on OHLC data where I have multiple dates per symbol
I can do the operation itself just fine, it only returns date, and symbol
select distinct timestamp, symbol from mv_qs_facts group by mv_qs_facts.symbol, mv_qs_facts.timestamp;
but I'd like it to return all columns (additional: close, open, high, low, volume) as well.
My goal is to return the last distinct (timestamp, symbol)
an idea I had.
select distinct on (timestamp, symbol), close, open, high, low from mv_qs_facts group by mv_qs_facts.symbol, mv_qs_facts.timestamp;
I see it's not as easy as this statement.
I've read I might be able to solve it with a subquery, a temporary table, or a join (all which don't use distinct).
Use DISTINCT ON ():
SELECT DISTINCT ON (timestamp, symbol)
timestamp, symbol, close, open, high, low
FROM mv_qs_facts;
This will return close, open, high and low for an arbitrary member of each group.
If you want to control which member is used, add an ORDER BY clause whose leading expressions match the DISTINCT ON list; the first row of each group in that ordering is the one returned.
If the problem is memory consumption on the client, you should use cursors:
BEGIN;
DECLARE c CURSOR FOR SELECT ...;
FETCH 100 FROM c;
FETCH 100 FROM c;
...
COMMIT;
Here's your query.
select distinct t1.* from (
select row_number() over (partition by symbol order by timestamp desc) as rn, * from
mv_qs_facts) as t1
where t1.rn = 1
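To see what the row_number() filter keeps, here is a small sketch with invented rows. SQLite (3.25+) run from Python is only a stand-in for PostgreSQL, and the table is cut down to three columns. As written above, the query partitions by symbol only, so it keeps the single most recent row per symbol.

```python
import sqlite3

# Hypothetical OHLC-like rows, reduced to three columns for brevity.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE mv_qs_facts (symbol TEXT, "timestamp" TEXT, close REAL)')
conn.executemany(
    "INSERT INTO mv_qs_facts VALUES (?, ?, ?)",
    [("AAPL", "2020-01-01", 100.0),
     ("AAPL", "2020-01-02", 101.0),
     ("MSFT", "2020-01-01", 200.0)],
)

# rn = 1 marks the newest row within each symbol.
latest = conn.execute(
    """
    SELECT symbol, "timestamp", close
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY symbol
                                  ORDER BY "timestamp" DESC) AS rn
        FROM mv_qs_facts
    ) AS t1
    WHERE rn = 1
    ORDER BY symbol
    """
).fetchall()

print(latest)
```

If one row per (symbol, timestamp) pair is wanted instead, the partition would need to include timestamp and the ORDER BY some other column that decides which duplicate wins.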
I added id because I initially had timestamp and date as my composite key, but that turned out to be a bad choice with duplicate dates, and I needed something to sort by:
CREATE materialized view temp AS
SELECT DISTINCT ON (symbol, timestamp)
id, timestamp, symbol, close, open, high, low
FROM qs_facts order by symbol, timestamp, id desc;
I have a window function without ORDER BY in its OVER () clause. Is there a guarantee that the rows will be processed in the order specified by the ORDER BY expression in the SELECT itself?
For example:
SELECT tt.*
, row_number() OVER (PARTITION BY tt."group") AS npp --without ORDER BY
FROM
(
SELECT SUBSTRING(random() :: text, 3, 1) AS "group"
, random() :: text AS "data"
FROM generate_series(1, 100) t(ser)
ORDER BY "group", "data"
) tt
ORDER BY tt."group", npp;
In this example the subquery returns the data sorted in ascending order in each group. The window function handles the rows in the same order, and so the line numbers go in ascending order of the data. Can I rely on this?
Good question!
No, you cannot rely on that.
Window functions are processed before the query's ORDER BY clause, and without an ORDER BY in the window definition, the rows will be processed in the order in which they happen to come from the subselect.
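If you put an ORDER BY on the column you want the numbering to follow inside your OVER ():
row_number() OVER (PARTITION BY tt."group" ORDER BY tt."data")
you should get the order you want. (Ordering by tt."group" alone would not help, since every row in a partition has the same "group" value.)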
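A quick way to convince yourself is to load rows in a scrambled order and number them with an explicit window ORDER BY. This sketch uses SQLite (3.25+) from Python as a stand-in for PostgreSQL; the table t and its columns grp and data are hypothetical names chosen for the example.

```python
import sqlite3

# Rows are inserted deliberately out of order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, data TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", "z"), ("b", "y"), ("a", "x"), ("b", "w"), ("a", "m")],
)

# The numbering follows the window's ORDER BY, not the insertion order.
numbered = conn.execute(
    """
    SELECT grp, data,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY data) AS npp
    FROM t
    ORDER BY grp, npp
    """
).fetchall()

print(numbered)
```

With the ORDER BY inside OVER (), the numbers no longer depend on how the rows happen to arrive from the subquery or table scan.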
I have a table with an ID column called mmsi and another column of timestamp, with multiple timestamps per mmsi.
For each mmsi I want to calculate the standard deviation of the difference between consecutive timestamps.
I'm not very experienced with SQL but have tried to construct a function as follows:
SELECT
mmsi, stddev(time_diff)
FROM
(SELECT mmsi,
EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY mmsi ASC, timestamp ASC)))
FROM ais_messages.ais_static
ORDER BY mmsi ASC, timestamp ASC) AS time_diff
WHERE time_diff IS NOT NULL
GROUP BY mmsi;
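Your query is on the right track, but it has several problems. You gave your subquery, which looks almost right, the alias time_diff and then tried to aggregate that alias as if it were a column. The subquery returns multiple rows and columns, so this doesn't make sense; the computed expression inside the subquery needs its own column alias instead. LAG should also be partitioned by mmsi, otherwise the first row of each mmsi is compared with the last row of the previous one. Here is a corrected version: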
SELECT
t.mmsi,
STDDEV(t.time_diff) AS std
FROM
(
SELECT
mmsi,
EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
(PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
FROM ais_messages.ais_static
ORDER BY mmsi, timestamp
) t
WHERE t.time_diff IS NOT NULL
GROUP BY t.mmsi
This approach should be fine, but there is one edge case where it might not behave as expected. If a given mmsi group has only one record, it will not appear in the result set of standard deviations at all. This is because the LAG calculation returns NULL for that single record, and the row is then filtered out.
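The edge case can be seen on a tiny made-up data set. This is only a sketch: SQLite has no STDDEV aggregate, so the partitioned LAG() runs in SQLite (3.25+) and the sample standard deviation is computed in Python with statistics.stdev. The table name, the integer ts column (seconds) and the rows are all assumptions for illustration.

```python
import sqlite3
import statistics
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ais_static (mmsi INTEGER, ts INTEGER)")
conn.executemany(
    "INSERT INTO ais_static VALUES (?, ?)",
    [(1, 0), (1, 10), (1, 30), (1, 60),   # three gaps: 10, 20, 30
     (2, 5),                              # single record: no gap at all
     (3, 0), (3, 7)],                     # two records: exactly one gap
)

# Collect the non-NULL gaps between consecutive timestamps per mmsi.
diffs = defaultdict(list)
query = """
    SELECT mmsi,
           ts - LAG(ts) OVER (PARTITION BY mmsi ORDER BY ts) AS time_diff
    FROM ais_static
"""
for mmsi, time_diff in conn.execute(query):
    if time_diff is not None:
        diffs[mmsi].append(time_diff)

# A sample standard deviation needs at least two gaps; otherwise report None.
std = {m: (statistics.stdev(v) if len(v) > 1 else None) for m, v in diffs.items()}
print(std)
```

Here mmsi 2 is missing from the result entirely, as described above. An mmsi with exactly two records (mmsi 3) does appear, but with no usable value, since a sample standard deviation of a single difference is undefined; PostgreSQL's stddev returns NULL in that situation.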
Is there any option to get the average of the same values using the RANK() function in PostgreSQL? Here is the example of what I want to do:
This query will do the trick for you
SELECT
test_score,
row_number() OVER (ORDER BY test_score) AS rank,
rank() OVER (ORDER BY test_score)
+ (count(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS "rank (with tied)"
FROM scores
Remarks:
What you believe is the "rank" is really the row_number() (i.e. a consecutive series of positive integer with no gaps and no duplicates).
That rank "with tied" that you're looking for can be calculated from the real rank() (rank with gaps) + the number of other elements of the same rank divided by two. This is a faster shortcut to calculate the average row_number() given your specific requirements.
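As a check of the shortcut, the sketch below runs the same expression on a few invented scores. SQLite (3.25+) from Python is a stand-in for PostgreSQL; the table name scores and the column test_score follow the query above, and the values are assumptions.

```python
import sqlite3

# Hypothetical scores: one 50, two 60s, three 70s.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (test_score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?)",
    [(70,), (60,), (50,), (70,), (60,), (70,)],
)

# rank() gives the first position of each tie group; adding half of
# (group size - 1) gives the average position within the group.
tied = conn.execute(
    """
    SELECT test_score,
           RANK() OVER (ORDER BY test_score)
           + (COUNT(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS rank_with_tied
    FROM scores
    ORDER BY test_score
    """
).fetchall()

print(tied)
```

The two 60s occupy positions 2 and 3, so they both get 2.5; the three 70s occupy positions 4 to 6, so they all get 5.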
I'm pretty sure you want row_number(), not rank(). Rank will not give repeated values in the way you presented. To get the answer you're looking for:
with rwn as (
select
test_score
,row_number() over (order by test_score) rwn
from
score
)
select
test_score
,avg(rwn) average_rank
from
rwn
group by
test_score;
@Lukas and @jeremy already explained the difference between rank() and row_number() you seemed to be missing.
You can also compute the row number (rn), and the average over rn (avg_rn) per rank (= per group of same values) in the next step:
SELECT test_score, rn, avg(rn) OVER (PARTITION BY test_score) AS avg_rn
FROM (SELECT test_score, row_number() OVER (ORDER BY test_score) AS rn FROM tbl) sub;
You need a subquery because window functions cannot be nested on the same query level.
You need another window function (not an aggregate function, as has been suggested) to preserve all original rows.
The result is ordered by rn by default (for this simple query), but this is just an implementation detail. To guarantee an ordered result, add an explicit ORDER BY (for practically no cost):
...
ORDER BY rn;