Amazon Redshift : Fetch latest row query

saletable :
saleID | date | orderstatus | .....
I want to fetch only the latest details for each saleID. I can do it using a nested query like:
Select * from saletable t1 where date in ( select max(date) from saletable t2 where t1.saleID = t2.saleID )
Is it possible to do it with a simple query? If so, any hint?

You can use a common table expression to do this efficiently:
WITH ld AS (
SELECT saleID, max("date") AS latest FROM saletable GROUP BY saleID)
SELECT s.*
FROM saletable s
JOIN ld ON ld.saleID = s.saleID
WHERE s."date" = ld.latest;
As DogBoneBlues commented, this has an advantage over the original method: there are only two scans of the data (one aggregated and one filtered, both of which a columnar database like Redshift does very efficiently). With the original approach, the subquery may be re-evaluated for each row of the data, resulting in an O(n²) operation.
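A window function is another common way to express "latest row per group". The sketch below is not part of the answer above: it uses Python's built-in sqlite3 module and a made-up three-column saletable purely so the CTE query and a ROW_NUMBER() variant can be run side by side. It assumes SQLite 3.25 or later for window functions; Redshift also has ROW_NUMBER(), but its plans and performance may differ from this toy setup.

```python
import sqlite3

# Hypothetical sample data; the column list is an assumption for illustration.
ROWS = [
    (1, "2020-01-01", "placed"),
    (1, "2020-01-03", "shipped"),
    (2, "2020-01-02", "placed"),
    (2, "2020-01-05", "delivered"),
    (3, "2020-01-04", "placed"),
]

# The CTE approach from the answer, with explicit columns and a stable order.
CTE_QUERY = """
WITH ld AS (
    SELECT saleID, max("date") AS latest FROM saletable GROUP BY saleID)
SELECT s.saleID, s."date", s.orderstatus
FROM saletable s
JOIN ld ON ld.saleID = s.saleID
WHERE s."date" = ld.latest
ORDER BY s.saleID
"""

# Alternative (not from the answer): rank rows per saleID, keep rank 1.
WINDOW_QUERY = """
SELECT saleID, "date", orderstatus
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY saleID ORDER BY "date" DESC) AS rn
    FROM saletable s
) ranked
WHERE rn = 1
ORDER BY saleID
"""


def latest_rows(query):
    """Run one of the queries against a fresh in-memory copy of the sample data."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE saletable (saleID INTEGER, "date" TEXT, orderstatus TEXT)'
        )
        conn.executemany("INSERT INTO saletable VALUES (?, ?, ?)", ROWS)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(latest_rows(CTE_QUERY))
    print(latest_rows(WINDOW_QUERY))
```

The two forms differ when several rows of one saleID share the same latest date: the join on max("date") returns all of them, while ROW_NUMBER() keeps exactly one, chosen arbitrarily unless the ORDER BY has a tie-breaker.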

Related

Why is this nested INNER JOIN not working in POSTGRESQL?

I am using the Northwind database and working in pgAdmin, and my query currently looks like this:
SELECT
TO_CHAR (o.ShippedDate, 'yyyy.MM') AS Month
,o.OrderID
,Total
,SUM (Total) OVER (PARTITION BY TO_CHAR (o.ShippedDate, 'yyyy.MM') ORDER BY O.OrderID) AS Running_Total
FROM public.orders O
INNER JOIN (
SELECT OrderID, SUM(Quantity*UnitPrice) AS Total
FROM public.order_details
GROUP BY OrderID
ORDER BY OrderID
) OD ON O.OrderID = OD.OrderID
WHERE
TO_CHAR (o.ShippedDate, 'yyyy.MM') IS NOT NULL
It is not working; it says:
ERROR: column "o.shippeddate" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2: TO_CHAR (o.ShippedDate, 'yyyy.MM') AS Month
Can you help me figure out what the issue could be? Thanks!
I fixed the query, so it is now the correct one.

SQL Group By that works in SQLite does not work in Postgres

This statement works in SQLite, but not in Postgres:
SELECT A.*, B.*
FROM Readings A
LEFT JOIN Offsets B ON A.MeterNum = B.MeterNo AND A.DateTime > B.TimeDate
WHERE A.MeterNum = 1
GROUP BY A.DateTime
ORDER BY A.DateTime DESC
The Readings table contains electric submeter readings each with a date stamp. The Offsets table holds an adjustment that the user enters after a failed meter is replaced with a new one that starts again at zero. Without the Group By statement the query returns a line for each meter reading with each prior adjustment made before the reading date while I only want the last adjustment.
All the docs I've seen on Group By for Postgres indicate I should be including an aggregate function which I don't need and can't use (The Reading column contains the Modbus string returned from the meter).

Just pick the latest offset in a derived table. In Postgres this can be done quite efficiently using distinct on ():
SELECT A.*, B.*
FROM readings A
left join (
select distinct on (meterno) o.*
from offsets o
order by o.meterno, o.timedate desc
) B ON A.MeterNum = B.MeterNo AND A.DateTime > B.TimeDate
WHERE A.meternum = 1
ORDER BY A.DateTime DESC
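distinct on () returns only one row per meterno, and this is the "latest" row because of the order by ... , timedate desc.
Note that this derived table keeps only the single latest offset per meter, so readings taken before that offset get no match. Pushing the condition datetime > timedate into the derived table with a lateral join picks the latest offset before each reading, and the query might even be faster: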
SELECT A.*, B.*
FROM readings A
left join lateral (
select distinct on (meterno) o.*
from offsets o
where a.datetime > o.timedate
order by o.meterno, o.timedate desc
) B ON A.MeterNum = B.MeterNo
WHERE A.meternum = 1
ORDER BY A.DateTime DESC
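To make the intended result concrete (for each reading, the latest offset dated before that reading), here is a small runnable sketch. It is not the Postgres query above: SQLite has neither DISTINCT ON nor LATERAL, so it swaps in a correlated max() subquery in the join condition. The Reading and Adjustment columns and all sample values are assumptions made up for the illustration.

```python
import sqlite3

# Portable stand-in for the lateral join: match the offset whose TimeDate is
# the greatest one strictly before the reading's DateTime.
QUERY = """
SELECT a.DateTime, a.Reading, b.Adjustment
FROM Readings a
LEFT JOIN Offsets b
  ON b.MeterNo = a.MeterNum
 AND b.TimeDate = (SELECT max(o.TimeDate)
                   FROM Offsets o
                   WHERE o.MeterNo = a.MeterNum
                     AND o.TimeDate < a.DateTime)
WHERE a.MeterNum = 1
ORDER BY a.DateTime DESC
"""


def latest_prior_offsets():
    """Build hypothetical sample tables in memory and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Readings (MeterNum INTEGER, DateTime TEXT, Reading TEXT)"
        )
        conn.execute(
            "CREATE TABLE Offsets (MeterNo INTEGER, TimeDate TEXT, Adjustment INTEGER)"
        )
        conn.executemany(
            "INSERT INTO Readings VALUES (?, ?, ?)",
            [
                (1, "2021-01-15", "r1"),  # before any offset
                (1, "2021-03-01", "r2"),  # after the first offset only
                (1, "2021-05-01", "r3"),  # after both offsets
                (2, "2021-03-01", "other-meter"),
            ],
        )
        conn.executemany(
            "INSERT INTO Offsets VALUES (?, ?, ?)",
            [(1, "2021-02-01", 100), (1, "2021-04-01", 250)],
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_prior_offsets():
        print(row)
```

Each reading appears once: r3 picks up the 250 adjustment, r2 the 100 adjustment, and r1 has no prior offset, so its Adjustment is NULL (None in Python).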

Nested SQL Query Optimization

is there any better way to write this query to be optimized?
SELECT * FROM data d
WHERE d.id IN (SELECT max(d1.id)
FROM data d1
WHERE d1.name='A'
AND d1.version='2')
I am not so good with SQL.

With PostgreSQL v13 or later, you can do it like this:
SELECT * FROM data
WHERE name = 'A'
AND version = '2'
ORDER BY id DESC
FETCH FIRST 1 ROWS WITH TIES;
That will give you all rows where id is the maximum.
If id is unique, you can use FETCH FIRST 1 ROWS ONLY or LIMIT 1, which will also work with older PostgreSQL versions.

Apart from the other answers, which are equally interesting and correct: IN with a subquery often performs poorly. You can remove it by writing your query slightly differently:
SELECT * FROM data d
WHERE d.name = 'A' and d.version = '2' and
d.id = (SELECT max(d1.id) FROM data d1 WHERE d1.name='A' AND d1.version='2')
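As a quick sanity check that these rewrites are interchangeable when id is unique, the sketch below runs the original IN form, the = form, and an ORDER BY ... LIMIT 1 form against the same rows. It uses Python's built-in sqlite3 with invented sample data; the table layout (id, name, version) is an assumption, and the check says nothing about relative speed on a real PostgreSQL table, which is better judged with EXPLAIN.

```python
import sqlite3

QUERIES = {
    # Original form from the question.
    "in_subquery": """
        SELECT * FROM data d
        WHERE d.id IN (SELECT max(d1.id) FROM data d1
                       WHERE d1.name = 'A' AND d1.version = '2')""",
    # Rewrite with '=' and the filters repeated in the outer query.
    "equals_subquery": """
        SELECT * FROM data d
        WHERE d.name = 'A' AND d.version = '2'
          AND d.id = (SELECT max(d1.id) FROM data d1
                      WHERE d1.name = 'A' AND d1.version = '2')""",
    # Sort-and-limit form, valid because id is unique here.
    "order_limit": """
        SELECT * FROM data
        WHERE name = 'A' AND version = '2'
        ORDER BY id DESC
        LIMIT 1""",
}


def run_all():
    """Return {label: rows} for each query over hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, version TEXT)"
        )
        conn.executemany(
            "INSERT INTO data VALUES (?, ?, ?)",
            [(1, "A", "1"), (2, "A", "2"), (3, "B", "2"), (4, "A", "2"), (5, "B", "1")],
        )
        return {label: conn.execute(sql).fetchall() for label, sql in QUERIES.items()}
    finally:
        conn.close()


if __name__ == "__main__":
    for label, rows in run_all().items():
        print(label, rows)
```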

SQL left join on maximum date

I have two tables: contracts and contract_descriptions.
contract_descriptions has a column named contract_id that matches contract_id on the contracts table.
I am trying to join the latest record on contract_descriptions:
SELECT *
FROM contracts c
LEFT JOIN contract_descriptions d ON d.contract_id = c.contract_id
AND d.date_description =
(SELECT MAX(date_description)
FROM contract_descriptions t
WHERE t.contract_id = c.contract_id)
It works, but is it the most performant way to do it? Is there a way to avoid the second SELECT?

You could also alternatively use DISTINCT ON:
SELECT * FROM contracts c LEFT JOIN (
SELECT DISTINCT ON (cd.contract_id) cd.* FROM contract_descriptions cd
ORDER BY cd.contract_id, cd.date_description DESC
) d ON d.contract_id = c.contract_id
DISTINCT ON selects only one row per contract_id, while the sort clause cd.date_description DESC ensures that it is always the latest description.
Performance depends on many factors (for example, table size). In any case, you should compare both approaches with EXPLAIN.

Your query looks okay to me. One typical way to join only n rows by some order from the other table is a lateral join:
SELECT *
FROM contracts c
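LEFT JOIN LATERAL
(
SELECT *
FROM contract_descriptions cd
WHERE cd.contract_id = c.contract_id
ORDER BY cd.date_description DESC
FETCH FIRST 1 ROW ONLY
) cdlast ON true;
Using LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL keeps contracts that have no description, as the LEFT JOIN in your query does.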

SQL Server 2008 De-duping

Long story short, I took over a project and a table in the database is in serious need of de-duping. The table looks like this:
supply_req_id | int | [primary key]
supply_req_dt | datetime |
request_id | int | [foreign key]
supply_id | int | [foreign key]
is_disabled | bit |
Duplication exists among records having the same request_id and supply_id. I'd like to find a best-practice way to de-dupe this table.
[EDIT]
@Kirk_Broadhurst, thanks for the question. Since supply_req_id is not referenced anywhere else, I would answer by saying keep the first and delete any subsequent occurrences.
Happy Holidays

This numbers the rows within each (request_id, supply_id) group, the pair that defines a duplicate in the question, starting with 1 for the lowest supply_req_id. Any dupe has a value > 1:
;WITH cDupes AS
(
SELECT
supply_req_id,
ROW_NUMBER() OVER (PARTITION BY request_id, supply_id ORDER BY supply_req_id) AS RowNum
FROM
MyTable
)
DELETE
cDupes
WHERE
RowNum > 1
Then add a unique constraint or index so the duplicates cannot come back:
CREATE UNIQUE INDEX IXU_NoDupes ON MyTable (request_id, supply_id)

Seems like there should be a command for this, but maybe that's because I'm used to a different database server. Here's the relevant support doc:
How to remove duplicate rows from a table in SQL Server
http://support.microsoft.com/kb/139444

You need to clarify your rule for determining which record to keep in the case of a 'match' - the most recent, the earliest, the one that has is_disabled true, or false?
Once you've identified that rule, the rest is fairly simple:
1. Select the records you want to keep (the distinct records).
2. Join back to the original table to get the ids for those records.
3. Delete everything that is not in the joined dataset.
So let's say you want to keep the most recent records of any 'duplicate' pair. Your query would look like this:
DELETE FROM [table] WHERE supply_req_id NOT IN
(SELECT supply_req_id from [table] t
INNER JOIN
(SELECT MAX(supply_req_dt) dt, request_id, supply_id
FROM [table]
GROUP BY request_id, supply_id) d
ON t.supply_req_dt = d.dt
AND t.request_id = d.request_id
AND t.supply_id = d.supply_id)
The catch is that if the supply_req_dt is also duplicated, then you'll be keeping both of the duplicates. The fix is to do another group by and select the top id:
select MAX(supply_req_id), supply_req_dt, request_id, supply_id
from [table]
group by supply_req_dt, request_id, supply_id
as an interim step. But if you don't need to do that, don't bother with it.
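For trying a de-dupe rule on throwaway data before touching the real table, a sketch like the following may help. It is an illustration under assumptions, not SQL Server code: it uses Python's built-in sqlite3 (3.25 or later for ROW_NUMBER()), a hypothetical table name supply_req, and invented rows. The rule applied is the one stated in the question's edit: keep the lowest supply_req_id per (request_id, supply_id) and delete the rest.

```python
import sqlite3

# Hypothetical rows: (supply_req_id, supply_req_dt, request_id, supply_id, is_disabled)
SAMPLE_ROWS = [
    (1, "2010-12-01", 10, 100, 0),
    (2, "2010-12-02", 10, 100, 0),  # duplicate of id 1
    (3, "2010-12-03", 10, 101, 0),
    (4, "2010-12-04", 11, 100, 0),
    (5, "2010-12-05", 10, 100, 1),  # duplicate of id 1
    (6, "2010-12-06", 11, 100, 0),  # duplicate of id 4
]


def dedupe_keep_first(conn):
    """Delete all but the lowest supply_req_id per (request_id, supply_id)."""
    conn.execute("""
        DELETE FROM supply_req
        WHERE supply_req_id IN (
            SELECT supply_req_id FROM (
                SELECT supply_req_id,
                       ROW_NUMBER() OVER (PARTITION BY request_id, supply_id
                                          ORDER BY supply_req_id) AS rn
                FROM supply_req
            ) WHERE rn > 1
        )""")
    # Guard against new duplicates; this would fail if any were left behind.
    conn.execute(
        "CREATE UNIQUE INDEX ixu_no_dupes ON supply_req (request_id, supply_id)"
    )


def demo():
    """Load the sample rows, de-dupe, and return the surviving ids."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("""
            CREATE TABLE supply_req (
                supply_req_id INTEGER PRIMARY KEY,
                supply_req_dt TEXT,
                request_id INTEGER,
                supply_id INTEGER,
                is_disabled INTEGER
            )""")
        conn.executemany("INSERT INTO supply_req VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
        dedupe_keep_first(conn)
        cur = conn.execute("SELECT supply_req_id FROM supply_req ORDER BY supply_req_id")
        return [row[0] for row in cur]
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```

Changing the ORDER BY inside ROW_NUMBER() (for example to supply_req_dt DESC) switches the rule to "keep the most recent" without touching the rest of the statement.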
