I have a window function without ORDER BY in the OVER () clause. Is there a guarantee that the rows will be processed in the order specified by the ORDER BY expression in the SELECT itself?
For example:
SELECT tt.*
, row_number() OVER (PARTITION BY tt."group") AS npp --without ORDER BY
FROM
(
SELECT SUBSTRING(random() :: text, 3, 1) AS "group"
, random() :: text AS "data"
FROM generate_series(1, 100) t(ser)
ORDER BY "group", "data"
) tt
ORDER BY tt."group", npp;
In this example the subquery returns the data sorted in ascending order within each group. The window function handles the rows in the same order, and so the row numbers follow the ascending order of the data. Can I rely on this?
Good question!
No, you cannot rely on that.
Window functions are processed before the query's ORDER BY clause, and without an ORDER BY in the window definition, the rows will be processed in the order in which they happen to come from the subselect.
If you add an ORDER BY on the column you want to number by to your OVER () clause, for example
row_number() OVER (PARTITION BY tt."group" ORDER BY tt."data")
you should get the order you want.
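A minimal sketch of the point above, using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only to keep the example self-contained; it needs SQLite 3.25 or later for window functions). The table contents are made up. Rows are inserted in scrambled order, yet the ORDER BY inside OVER () fixes the numbering within each group:

```python
import sqlite3

# In-memory SQLite database; stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE tt ("group" TEXT, "data" TEXT)')

# Hypothetical sample rows, deliberately inserted out of order.
conn.executemany(
    "INSERT INTO tt VALUES (?, ?)",
    [("b", "2"), ("a", "9"), ("b", "1"), ("a", "3"), ("a", "5")],
)

# The ORDER BY inside OVER () decides the numbering within each partition,
# independent of the order in which rows reach the window function.
numbered = conn.execute(
    """
    SELECT "group", "data",
           row_number() OVER (PARTITION BY "group" ORDER BY "data") AS npp
    FROM tt
    ORDER BY "group", npp
    """
).fetchall()

for row in numbered:
    print(row)
```

The outer ORDER BY only controls the order in which the finished rows are returned; it plays no part in computing npp.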
For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 Germany, 1 France, 1 UK). Is there a simple way to do that?
Normally, a simple GROUP BY would suffice for this type of solution, however as you have specified that you want to include ALL of the columns in the result, then we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window. Please update that (or the question) with any other field you would like; the PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
SELECT *
, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
) ranked
WHERE _rn = 1
The PARTITION BY forces the ROW_NUMBER to be counted across all records with the same Country value, starting at 1, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The WHERE clause that filters on Country could have been placed in the outer query if you really wanted, but ROW_NUMBER() can only be specified in the SELECT or ORDER BY clauses of a query, so to use it as a filter criterion we are forced to wrap the results in a subquery.
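The wrap-and-filter pattern can be tried end to end with a small sketch. It runs against an in-memory SQLite database through Python's sqlite3 module (an assumption for portability; SQLite 3.25+ is required), and the Customers rows and the alias ranked are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Name TEXT, Country TEXT)")

# Hypothetical customers; Italy is there to show the IN filter at work.
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [
        ("Hanna", "Germany"), ("Bernd", "Germany"), ("Claire", "France"),
        ("Alice", "UK"), ("Zoe", "UK"), ("Marco", "Italy"),
    ],
)

# Number the rows per country in the inner query, then keep row 1 per country.
first_per_country = conn.execute(
    """
    SELECT Name, Country
    FROM (
        SELECT Name, Country,
               ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Name) AS _rn
        FROM Customers
        WHERE Country IN ('Germany', 'France', 'UK')
    ) AS ranked
    WHERE _rn = 1
    ORDER BY Country
    """
).fetchall()

print(first_per_country)
```

Because the window is ordered by Name, the customer kept for each country is the alphabetically first one; a different ORDER BY inside OVER () would pick a different row.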
In the picture below you can see example data. I would like to get the first occurrence of batch_start for each batch. As you can see (green highlight), batch 1522049 occurs in 2 chunks: one has 2 rows and the second has 1 row.
SELECT FIRST_VALUE(batch_start) OVER (PARTITION BY batch ORDER BY batch_start)
does not solve the problem, since it joins both chunks into one and result is '2013-01-29 10:27:23' for both of them.
Any idea how to distinguish these rows and get batch_start of each chunk of data?
This seems to me a simple gaps-and-islands problem: you just need to calculate a value that is the same for all consecutive rows with the same batch value, which will be
row_number() over (order by batch_start) - row_number() over (partition by batch order by batch_start)
From here, the solution depends on what you want to do with these "batch groups". For example, here is a variant that aggregates them to find the first batch_start of each:
select batch, min(batch_start)
from (select *, row_number() over (order by batch_start) -
row_number() over (partition by batch order by batch_start) batch_number
from batches) b
group by batch, batch_number
http://rextester.com/XLX80303
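To see why the difference of the two row numbers identifies each chunk, here is a small sketch run through Python's sqlite3 module (an assumption; the question does not name SQLite, and version 3.25+ is needed). Only the first timestamp comes from the question; the other rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batches (batch INTEGER, batch_start TEXT)")

# Batch 1522049 appears in two chunks, interrupted by batch 1522050.
conn.executemany(
    "INSERT INTO batches VALUES (?, ?)",
    [
        (1522049, "2013-01-29 10:27:23"),
        (1522049, "2013-01-29 10:27:50"),  # hypothetical
        (1522050, "2013-01-29 10:28:10"),  # hypothetical
        (1522049, "2013-01-29 10:29:00"),  # hypothetical
    ],
)

# Global row number minus per-batch row number stays constant inside a chunk
# and changes whenever another batch interrupts the sequence.
chunks = conn.execute(
    """
    SELECT batch, MIN(batch_start) AS first_start
    FROM (
        SELECT batch, batch_start,
               ROW_NUMBER() OVER (ORDER BY batch_start)
               - ROW_NUMBER() OVER (PARTITION BY batch ORDER BY batch_start)
                 AS batch_number
        FROM batches
    ) AS b
    GROUP BY batch, batch_number
    ORDER BY first_start
    """
).fetchall()

for row in chunks:
    print(row)
```

With this data the differences are 0, 0, 2 and 1, so the two chunks of batch 1522049 fall into separate groups and each reports its own first batch_start.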
Maybe: select batch, min(batch_start) firstOccurance, max(batch_start) lastOccurance from yourTable group by batch
or try (keeping your part of query):
SELECT FIRST_VALUE(a.batch_start) OVER (PARTITION BY a.batch ORDER BY a.batch_start) from yourTable a
join (select batch, min(batch_start) firstOccurance, max(batch_end) lastOccurance from yourTable group by batch) b on a.batch = b.batch
Vertica has a very nice class of operations, event-based window functions, which basically let you identify when an event occurs.
For example the conditional_true_event will increment a counter each time the given boolean expression resolves to true.
We use this kind of approach heavily.
We are thinking about moving to Redshift, but we would need a similar function.
Redshift has some nice window functions, but I can't find this one.
Is there any way I can emulate this function in Redshift?
The CONDITIONAL_TRUE_EVENT() is rather easy to write with window functions. It's just a COUNT with a conditional (CASE):
SELECT ts, symbol, bid,
CONDITIONAL_TRUE_EVENT(bid > 10.6)
OVER (ORDER BY ts) AS oce
FROM Tickstore3
ORDER BY ts ;
becomes:
SELECT ts, symbol, bid,
COUNT(CASE WHEN bid > 10.6 THEN 1 END)
OVER (ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS oce
FROM Tickstore3
ORDER BY ts ;
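The running conditional count can be checked on a few rows. This sketch uses Python's sqlite3 module rather than Redshift (an assumption, chosen so the example runs anywhere; SQLite 3.25+ is required), and the tick data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tickstore3 (ts INTEGER, symbol TEXT, bid REAL)")

# Hypothetical ticks; bid exceeds 10.6 at ts = 2 and ts = 4 only.
conn.executemany(
    "INSERT INTO Tickstore3 VALUES (?, ?, ?)",
    [(1, "XYZ", 10.5), (2, "XYZ", 10.7), (3, "XYZ", 10.6),
     (4, "XYZ", 10.8), (5, "XYZ", 10.4)],
)

# COUNT ignores NULLs, and CASE without ELSE yields NULL when the test fails,
# so the running count grows only on rows where the condition holds.
rows = conn.execute(
    """
    SELECT ts, bid,
           COUNT(CASE WHEN bid > 10.6 THEN 1 END)
               OVER (ORDER BY ts
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS oce
    FROM Tickstore3
    ORDER BY ts
    """
).fetchall()

oce = [r[2] for r in rows]
print(oce)
```

The counter steps up on exactly the rows where bid > 10.6 is true and carries its value forward on all other rows.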
The CONDITIONAL_CHANGE_EVENT() is more complicated because it needs to use the previous value. It can be emulated using LAG() and SUM() or COUNT() (or ROW_NUMBER()), but I think it will require a CTE or a derived table (or a self-join):
SELECT ts, symbol, bid,
CONDITIONAL_CHANGE_EVENT(bid)
OVER (ORDER BY ts) AS cce
FROM Tickstore3
ORDER BY ts ;
will become:
WITH emu AS
( SELECT ts, symbol, bid,
CASE WHEN bid <> LAG(bid) OVER (ORDER BY ts)
THEN 1
END AS change_bid
FROM Tickstore3
)
SELECT ts, symbol, bid,
COUNT(change_bid)
OVER (ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS cce
FROM emu
ORDER BY ts ;
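The LAG-based rewrite can be exercised the same way. The sketch below again uses Python's sqlite3 module as a stand-in engine (an assumption; SQLite 3.25+ is required) with invented tick data. It shows only what the emulation computes and makes no claim about matching Vertica's output in every edge case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tickstore3 (ts INTEGER, symbol TEXT, bid INTEGER)")

# Hypothetical ticks; bid changes at ts = 3 and ts = 5.
conn.executemany(
    "INSERT INTO Tickstore3 VALUES (?, ?, ?)",
    [(1, "XYZ", 10), (2, "XYZ", 10), (3, "XYZ", 11),
     (4, "XYZ", 11), (5, "XYZ", 10)],
)

# Step 1 (CTE): flag rows whose bid differs from the previous row's bid.
# Step 2: take a running count of those flags.
rows = conn.execute(
    """
    WITH emu AS (
        SELECT ts, symbol, bid,
               CASE WHEN bid <> LAG(bid) OVER (ORDER BY ts)
                    THEN 1
               END AS change_bid
        FROM Tickstore3
    )
    SELECT ts, bid,
           COUNT(change_bid)
               OVER (ORDER BY ts
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cce
    FROM emu
    ORDER BY ts
    """
).fetchall()

cce = [r[2] for r in rows]
print(cce)
```

On the first row LAG() returns NULL, the comparison is therefore not true, and no change is flagged, so the counter starts at 0.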
I don't know how this CONDITIONAL_CHANGE_EVENT() function behaves with nulls. If there are NULL values in the column being checked for changes, and you want to detect a change from the last non-null value rather than just the previous one, the rewrite will be even more complicated.
Edit: As far as I understand Redshift's documentation, an explicit window frame (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) is required for window aggregates when there is an ORDER BY, so you have to specify one. Use whatever the default frame is in Vertica for these cases: it is either the above or RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
I am having a small difficulty understanding the below simple DISTINCT ON query:
SELECT DISTINCT
ON (bcolor) bcolor,
fcolor
FROM
t1
ORDER BY
bcolor,
fcolor;
I have this table here:
What is the order of execution of the above query, and why am I getting the following result:
As I understand it, since ORDER BY is used, it will display both table columns in alphabetical order, and since ON is used, it will return the first matching duplicate. But I am still confused about how the resulting table is produced.
Can somebody take me through how exactly this query is executed?
This is an odd one since you would think that the SELECT would happen first, then the ORDER BY like any normal RDBMS, but the DISTINCT ON is special. It needs to know the order of the records in order to properly determine which records should be dropped.
So, in this case, it orders first by the bcolor, then by the fcolor. Then it determines distinct bcolors, and drops any but the first record for each distinct group.
In short, it does ORDER BY then applies the DISTINCT ON to drop the appropriate records. I think it would be most helpful to think of 'DISTINCT ON' as being special functionality that differs greatly from DISTINCT.
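The "sort first, then keep the first row per group" behaviour can be mimicked in plain Python. This is only an illustration of the logical steps described above, not how PostgreSQL implements DISTINCT ON, and the rows are hypothetical because the original table is not reproduced here:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (bcolor, fcolor) rows standing in for table t1.
rows = [
    ("red", "red"), ("red", "green"), ("red", "blue"),
    ("green", "red"), ("green", "blue"),
    ("blue", "green"), ("blue", "blue"),
]

# Step 1: ORDER BY bcolor, fcolor.
ordered = sorted(rows, key=itemgetter(0, 1))

# Step 2: DISTINCT ON (bcolor) keeps the first row of each bcolor group.
distinct_on = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]

print(distinct_on)
```

Changing the sort in step 1, for example to descending fcolor, changes which row survives in each group, which is why the ORDER BY matters for DISTINCT ON.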
Added after initial post:
This could be done using window functions and a subquery as well:
SELECT
bcolor,
fcolor
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY bcolor ORDER BY fcolor ASC) as rownumber,
bcolor,
fcolor
FROM t1
) t2
WHERE rownumber = 1
Is there any way to get the rowid of a record in Postgres?
In Oracle I can use something like
SELECT MAX(BILLS.ROWID) FROM BILLS
Yes, there is the ctid column, which is the equivalent of rowid. But it is useless for you: rowid and ctid are physical row/tuple identifiers, so they can change after a rebuild or vacuum.
See: Chapter 5. Data Definition > 5.4. System Columns
The PostgreSQL row_number() window function can be used for most purposes where you would number rows in Oracle. Whereas Oracle offers an intrinsic numbering of the result rows (strictly speaking that is the rownum pseudocolumn; rowid is a physical row address), in Postgres row_number() computes a numbering within a logical ordering of the returned data. Normally if you want to number the rows, it means you expect them in a particular order, so you would specify which column(s) to order the rows by when numbering them:
select client_name, row_number() over (order by date) from bills;
If you just want the rows numbered arbitrarily you can leave the over clause empty:
select client_name, row_number() over () from bills;
If you want to calculate an aggregate over the row number you'll have to use a subquery:
select max(rownum) from (
select row_number() over () as rownum from bills
) r;
If all you need is the last item from a table, and you have a column to sort sequentially, there's a simpler approach than using row_number(). Just reverse the sort order and select the first item:
select * from bills
order by date desc limit 1;
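Both approaches can be compared side by side in a short sketch. It uses Python's sqlite3 module in place of PostgreSQL (an assumption; SQLite 3.25+ is required), and the bills rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (client_name TEXT, date TEXT)")

# Hypothetical bills, inserted out of date order.
conn.executemany(
    "INSERT INTO bills VALUES (?, ?)",
    [("alice", "2024-01-03"), ("bob", "2024-01-01"), ("carol", "2024-01-02")],
)

# Number the rows by date; the numbering follows the logical ordering,
# not the physical insertion order.
numbered = conn.execute(
    """
    SELECT client_name, row_number() OVER (ORDER BY date) AS rn
    FROM bills
    ORDER BY rn
    """
).fetchall()

# The simpler route to the last item: reverse the sort and take one row.
latest = conn.execute(
    "SELECT client_name, date FROM bills ORDER BY date DESC LIMIT 1"
).fetchone()

print(numbered)
print(latest)
```

The row with the highest row number and the row returned by the reversed sort are the same bill, which is why the second form is usually enough when only the last item is needed.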
Use a Sequence. You can choose 4 or 8 byte values.
http://www.neilconway.org/docs/sequences/
Add a unique column to your table (named rowid, perhaps).
Prevent it from being changed by creating a BEFORE UPDATE trigger that raises an exception if someone tries to update it.
You can populate this column from a sequence, as @JohnMudd mentioned.
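A rough sketch of the idea, written against SQLite through Python's sqlite3 module so that it is self-contained. This is an assumption: in PostgreSQL you would use a sequence-backed column and a trigger function instead, and the syntax differs. The names rowid_col and bills_rowid_immutable are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- AUTOINCREMENT plays the role of the sequence in this SQLite sketch.
    CREATE TABLE bills (
        rowid_col   INTEGER PRIMARY KEY AUTOINCREMENT,
        client_name TEXT
    );

    -- Reject any UPDATE that tries to change the identifier.
    CREATE TRIGGER bills_rowid_immutable
    BEFORE UPDATE OF rowid_col ON bills
    WHEN NEW.rowid_col <> OLD.rowid_col
    BEGIN
        SELECT RAISE(ABORT, 'rowid_col is immutable');
    END;
    """
)
conn.executemany(
    "INSERT INTO bills (client_name) VALUES (?)", [("alice",), ("bob",)]
)

try:
    conn.execute("UPDATE bills SET rowid_col = 99 WHERE client_name = 'alice'")
    update_blocked = False
except sqlite3.DatabaseError:
    # The trigger's RAISE(ABORT, ...) surfaces as a database error.
    update_blocked = True

ids = [r[0] for r in conn.execute("SELECT rowid_col FROM bills ORDER BY rowid_col")]
print(update_blocked, ids)
```

Unlike ctid, an identifier generated this way is ordinary data, so it survives table rebuilds and vacuuming.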