PostgreSQL percentile_cont and aggregations

I have to compute three 99th percentiles with the PostgreSQL function percentile_cont.
Two of them are each computed over a single column; the third is over the difference between those two columns. However, I have to compute the last one after an inner join, and it is not clear to me how to do that.
SELECT
    TableA1.a,
    TableA1.b99,
    TableA2.b99,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY (TableA1.b - TableA2.b))::bigint
FROM
    (
        SELECT
            a,
            percentile_cont(0.99) WITHIN GROUP (ORDER BY (b))::bigint AS b99
        FROM TableA
        GROUP BY
            a
    ) TableA1
INNER JOIN
    (
        SELECT
            a,
            percentile_cont(0.99) WITHIN GROUP (ORDER BY (b))::bigint AS b99
        FROM TableA
        GROUP BY
            a
    ) TableA2
ON
    TableA1.a = TableA2.a
The fourth selected field (percentile_cont(0.99) WITHIN GROUP (ORDER BY (TableA1.b - TableA2.b))::bigint) is impossible because column b has already been aggregated in both subqueries, so its raw values are no longer available to the outer query.
Any suggestions on how to rebuild the query?
Thank you in advance,
Lorenzo
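
One possible way to restructure this, sketched in Python rather than SQL so the arithmetic is visible: compute the difference per raw row before any aggregation, then take the percentile of those differences per group. The sketch assumes the two values sit in the same raw row as columns b and c (the column name c and the sample rows are hypothetical; the question only shows b), and percentile_cont below is a hand-written stand-in that mimics the linear interpolation of the PostgreSQL function.

```python
import math
from collections import defaultdict


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbours."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Fractional 0-based position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def percentiles_by_group(rows, fraction=0.99):
    """rows: iterable of (a, b, c) tuples from the raw, unaggregated table."""
    b_values = defaultdict(list)
    c_values = defaultdict(list)
    diffs = defaultdict(list)
    for a, b, c in rows:
        b_values[a].append(b)
        c_values[a].append(c)
        diffs[a].append(b - c)  # difference taken per raw row, before aggregating
    return {
        a: (
            percentile_cont(b_values[a], fraction),
            percentile_cont(c_values[a], fraction),
            percentile_cont(diffs[a], fraction),
        )
        for a in b_values
    }


# Hypothetical sample rows: (a, b, c)
sample = [("x", 10, 1), ("x", 20, 5), ("x", 30, 25), ("y", 7, 2)]
result = percentiles_by_group(sample, fraction=0.5)
```

Under the same assumption, the SQL counterpart would be a single GROUP BY a over the raw rows, with all three percentile_cont calls in one SELECT and the third ordered by the difference expression, so no join of pre-aggregated subqueries is needed.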

Related

SQL left join on maximum date

I have two tables: contracts and contract_descriptions.
contract_descriptions has a column named contract_id that matches contract_id in the contracts table.
I am trying to join only the latest record from contract_descriptions:
SELECT *
FROM contracts c
LEFT JOIN contract_descriptions d ON d.contract_id = c.contract_id
AND d.date_description =
(SELECT MAX(date_description)
FROM contract_descriptions t
WHERE t.contract_id = c.contract_id)
It works, but is it the most performant way to do it? Is there a way to avoid the second SELECT?

Alternatively, you could use DISTINCT ON:
SELECT *
FROM contracts c
LEFT JOIN (
    SELECT DISTINCT ON (cd.contract_id) cd.*
    FROM contract_descriptions cd
    ORDER BY cd.contract_id, cd.date_description DESC
) d ON d.contract_id = c.contract_id
DISTINCT ON keeps only one row per contract_id, and the sort clause cd.date_description DESC ensures that it is always the latest description.
Performance depends on many factors (for example, table size). In any case, you should compare both approaches with EXPLAIN.
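
To make the "one row per contract_id, latest first" behaviour concrete, here is a small Python sketch of the same selection logic. It is only an illustration of the semantics, not of how PostgreSQL executes DISTINCT ON, and the sample rows and the function name latest_per_contract are made up.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (contract_id, date_description, text)
descriptions = [
    (1, date(2020, 1, 5), "draft"),
    (2, date(2019, 7, 1), "initial"),
    (1, date(2021, 3, 9), "signed"),
]


def latest_per_contract(rows):
    # Sort by date descending, then (stable) by contract_id ascending, which
    # mirrors ORDER BY contract_id, date_description DESC.
    ordered = sorted(rows, key=itemgetter(1), reverse=True)
    ordered.sort(key=itemgetter(0))
    # Keep the first row of each contract_id group, as DISTINCT ON would.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


latest = latest_per_contract(descriptions)
```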

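Your query looks okay to me. One typical way to join only n rows by some order from the other table is a lateral join. Note that CROSS JOIN LATERAL, unlike your LEFT JOIN, drops contracts that have no description at all; LEFT JOIN LATERAL (...) ON true keeps them: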
SELECT *
FROM contracts c
CROSS JOIN LATERAL
(
SELECT *
FROM contract_descriptions cd
WHERE cd.contract_id = c.contract_id
ORDER BY cd.date_description DESC
FETCH FIRST 1 ROW ONLY
) cdlast;

How to get the MAX(SUM of values) to find the category with the biggest total? PostgreSQL

I have two tables: Transactions and Tickets. Tickets holds the Ticket_Number, the name of the Category (Theater, Cinema, Concert) and the Price of the ticket. Transactions also has the Ticket_Number. I want to get the SUM of money for each Category and then select the Category with the most money.
I already managed to get the SUM for each category, but I am stuck here:
SELECT category, SUM (Tickets.Price) AS Price
FROM Tickets,Transactions
WHERE Tickets.ticket_num=Transactions.ticket_num
GROUP BY Category
ORDER BY Price DESC;
I know I can add LIMIT 1, but that is not correct because two or more categories can have the same total.

Use ROW_NUMBER to generate a sequence ordered by the summed price, then restrict the result to the aggregated row with the highest total price.
WITH cte AS (
SELECT category, SUM(t1.Price) AS Price,
ROW_NUMBER() OVER (ORDER BY SUM(t1.Price) DESC) rn
FROM Tickets t1
INNER JOIN Transactions t2
ON t1.ticket_num = t2.ticket_num
GROUP BY Category
)
SELECT category, Price
FROM cte
WHERE rn = 1
ORDER BY Price DESC;
Note that if you want to capture all categories tied for the highest price, should a tie occur, then replace ROW_NUMBER in the above CTE with RANK, keeping everything else the same.
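
The difference between ROW_NUMBER and RANK on a tie can be checked with a small script. This is a sketch that runs the query against an in-memory SQLite database through Python's sqlite3 module instead of PostgreSQL (it assumes an SQLite build with window functions, 3.25 or later); the sample data is invented, and the aggregation is moved into its own CTE to keep the window step simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (ticket_num INTEGER, category TEXT, price INTEGER);
    CREATE TABLE transactions (ticket_num INTEGER);
    -- Theater and Cinema both total 50, Concert totals 40.
    INSERT INTO tickets VALUES
        (1, 'Theater', 50), (2, 'Cinema', 30), (3, 'Cinema', 20), (4, 'Concert', 40);
    INSERT INTO transactions VALUES (1), (2), (3), (4);
""")

QUERY = """
    WITH totals AS (
        SELECT t1.category, SUM(t1.price) AS price
        FROM tickets t1
        INNER JOIN transactions t2 ON t1.ticket_num = t2.ticket_num
        GROUP BY t1.category
    ),
    ranked AS (
        SELECT category, price, {fn}() OVER (ORDER BY price DESC) AS rn
        FROM totals
    )
    SELECT category, price FROM ranked WHERE rn = 1 ORDER BY category
"""

# ROW_NUMBER picks one of the tied categories; RANK keeps all of them.
row_number_top = conn.execute(QUERY.format(fn="ROW_NUMBER")).fetchall()
rank_top = conn.execute(QUERY.format(fn="RANK")).fetchall()
```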

What you are looking for is the window function DENSE_RANK(), which handles ties properly.
RANK() will also work for your case, but if you would like to extend it to get the top N places with ties (where N > 1), DENSE_RANK() is the way to go, because RANK() leaves gaps in the numbering after ties.
SELECT Category, Price
FROM (
SELECT
Category,
SUM(ti.Price) AS Price,
DENSE_RANK() OVER (ORDER BY SUM(ti.Price) DESC) AS rnk
FROM Tickets ti
INNER JOIN Transactions tr ON
ti.ticket_num = tr.ticket_num
GROUP BY Category
) t
WHERE rnk = 1
I've also replaced the old-style, not recommended comma-separated list of tables in the FROM clause with a proper INNER JOIN clause and assigned aliases to the tables.
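
To illustrate why the two functions differ for a top-N query, here is a sketch using Python's sqlite3 module as a stand-in for PostgreSQL (window functions need SQLite 3.25 or later). The totals table and its values are invented and stand for the already aggregated sums per category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE totals (category TEXT, price INTEGER);
    INSERT INTO totals VALUES
        ('Theater', 300), ('Cinema', 300), ('Concert', 200), ('Opera', 100);
""")

rows = conn.execute("""
    SELECT category,
           RANK() OVER (ORDER BY price DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY price DESC) AS dense_rnk
    FROM totals
    ORDER BY price DESC, category
""").fetchall()

# "Top 2 places": RANK skips place 2 after the tie, DENSE_RANK does not.
top2_rank = sorted(category for category, rnk, _ in rows if rnk <= 2)
top2_dense = sorted(category for category, _, dense in rows if dense <= 2)
```

With RANK the tied categories take places 1 and 1 and the next category gets place 3, so a filter on rnk <= 2 misses the second-highest total; DENSE_RANK numbers the places 1, 1, 2, 3.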

You can use rank() to rank the sums of the prices, highest first.
SELECT category,
price
FROM (SELECT category,
sum(tickets.price) price,
rank() OVER (ORDER BY sum(tickets.price) DESC) r
FROM tickets
INNER JOIN transactions
ON transactions.ticket_num = tickets.ticket_num
GROUP BY category) x
WHERE r = 1;
I also took the liberty to rewrite your join from the ancient comma style to a modern, clearer version.

Unable to get Percentile_Cont() to work in Postgresql

I am trying to calculate a percentile using the percentile_cont() function in PostgreSQL with common table expressions. The goal is to find the top 1% of accounts with regard to their balances (called amount here). My logic is to find the 99th percentile, which identifies the accounts whose balances are greater than those of 99% of their peers (and thus finds the 1 percenters).
Here is my query:
--ranking subquery works fine
with ranking as(
select a.lname,sum(c.amount) as networth from customer a
inner join
account b on a.customerid=b.customerid
inner join
transaction c on b.accountid=c.accountid
group by a.lname order by sum(c.amount)
)
select lname, networth, percentile_cont(0.99) within group
order by networth over (partition by lname) from ranking ;
I keep getting the following error:
ERROR: syntax error at or near "order"
LINE 2: ...ame, networth, percentile_cont(0.99) within group order by n..
I am thinking that perhaps I forgot a closing brace etc. but I can't seem to figure out where. I know it could be something with the order keyword but I am not sure what to do. Can you please help me to fix this error?

This tripped me up, too.
It turns out percentile_cont is not supported in postgres 9.3, only in 9.4+.
https://www.postgresql.org/docs/9.4/static/release-9-4.html
So you have to use something like this:
with ordered_purchases as (
select
price,
row_number() over (order by price) as row_id,
(select count(1) from purchases) as ct
from purchases
)
select avg(price) as median
from ordered_purchases
where row_id between ct/2.0 and ct/2.0 + 1
That query is courtesy of https://www.periscopedata.com/blog/medians-in-sql (section: "Median on Postgres").
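
The row_number() median query above can be tried out without a PostgreSQL server. The following sketch runs it unchanged against an in-memory SQLite database through Python's sqlite3 module (assuming SQLite 3.25 or later for window functions); the helper name median_of and the sample prices are made up.

```python
import sqlite3

MEDIAN_SQL = """
    WITH ordered_purchases AS (
        SELECT price,
               ROW_NUMBER() OVER (ORDER BY price) AS row_id,
               (SELECT COUNT(1) FROM purchases) AS ct
        FROM purchases
    )
    SELECT AVG(price) AS median
    FROM ordered_purchases
    WHERE row_id BETWEEN ct / 2.0 AND ct / 2.0 + 1
"""


def median_of(prices):
    """Load the prices into a throwaway purchases table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (price REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?)", [(p,) for p in prices])
    return conn.execute(MEDIAN_SQL).fetchone()[0]


# Odd count: only the middle row matches. Even count: the two middle rows are averaged.
odd_median = median_of([5, 1, 3])
even_median = median_of([4, 1, 3, 2])
```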

You are missing the brackets in the within group (order by x) part.
Try this:
with ranking
as (
select a.lname,
sum(c.amount) as networth
from customer a
inner join account b on a.customerid = b.customerid
inner join transaction c on b.accountid = c.accountid
group by a.lname
order by networth
)
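-- PostgreSQL does not allow OVER on ordered-set aggregates such as percentile_cont,
-- so the 99th percentile is computed once in a scalar subquery.
select lname,
    networth,
    (
        select percentile_cont(0.99) within group (
                order by networth
                )
        from ranking
    ) as p99
from ranking;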

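I want to point out that you don't need a subquery for this in databases that accept percentile_cont as a window function, such as Oracle or SQL Server (PostgreSQL itself rejects the OVER clause on ordered-set aggregates):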
select c.lname, sum(t.amount) as networth,
percentile_cont(0.99) within group (order by sum(t.amount)) over (partition by lname)
from customer c inner join
account a
on c.customerid = a.customerid inner join
transaction t
on a.accountid = t.accountid
group by c.lname
order by networth;
Also, when using table aliases (which should be always), table abbreviations are much easier to follow than arbitrary letters.

multiple extract() with WHERE clause possible?

So far I have come up with the below:
WHERE (extract(month FROM orderdate)) =
(SELECT min(extract(month from orderdate))
FROM orders)
However, that will consequently return zero to many rows, and in my case, many, because many orders exist within that same earliest (minimum) month, i.e. 4th February, 9th February, 15th Feb, ...
I know that a WHERE clause can contain multiple columns, so why wouldn't the below work?
WHERE (extract(day FROM orderdate)), (extract(month FROM orderdate)) =
(SELECT min(extract(day from orderdate)), min(extract(month FROM orderdate))
FROM orders)
I simply get: SQL Error: ORA-00920: invalid relational operator
Any help would be great, thank you!
Sample data:
02-Feb-2012
14-Feb-2012
22-Dec-2012
09-Feb-2013
18-Jul-2013
01-Jan-2014
Output:
02-Feb-2012
14-Feb-2012
Desired output:
02-Feb-2012

I recreated your table and found out you just messed up the brackets a bit. The following works for me:
where
(extract(day from OrderDate),extract(month from OrderDate))
=
(select
min(extract(day from OrderDate)),
min(extract(month from OrderDate))
from orders
)

Use something like this:
with cte1 as (
select
extract(month from OrderDate) date_month,
extract(day from OrderDate) date_day,
OrderNo
from tablename
), cte2 as (
select min(date_month) min_date_month, min(date_day) min_date_day
from cte1
)
select cte1.*
from cte1
where (date_month, date_day) = (select min_date_month, min_date_day from cte2)
A common table expression enables you to restructure your data and then use this data to do your select. The first CTE block (cte1) selects the month and the day for each of your table rows. cte2 then selects min(month) and min(day). The last select then combines both CTEs to select all rows from cte1 that have the desired month and day.
There is probably a shorter solution, but I like common table expressions because they are almost always easier to understand than the "optimal, shortest" query.

If that is really what you want, as bizarre as it seems, then as a different approach you could forget the extracts and the subquery against the table to get the minimums, and use an analytic approach instead:
select orderdate
from (
select o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
from orders o
)
where rn = 1;
ORDERDATE
---------
01-JAN-14
The row_number() effectively adds a pseudo-column to every row in your original table, based on the month and day in the order date. The rn values are unique, so there will be one row marked as 1, which will be from the earliest day in the earliest month. If you have multiple orders with the same day/month, say 01-Jan-2013 and 01-Jan-2014, then you'll still only get exactly one with rn = 1, but which is picked is indeterminate. You'd need to add further order by conditions to make it deterministic, but I have no idea what you might want.
That is done in the inner query; the outer query then filters so that only the record marked with rn = 1 is returned, so you get exactly one row back from the overall query.
This also avoids the situation where the earliest day number is not in the earliest month number - say if you only had 02-Jan-2014 and 01-Feb-2014; comparing the day and month separately would look for 01-Jan-2014, which doesn't exist.
SQL Fiddle (with Thomas Tschernich's answer thrown in too, giving the same result for this data).
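
The pitfall of taking the minimum day and the minimum month independently can be reproduced in a few lines. This is a Python sketch with two invented dates; the last line mimics ordering by to_char(orderdate, 'MMDD') by sorting on the (month, day) pair.

```python
from datetime import date

# Hypothetical data: the smallest day (1) and the smallest month (1)
# come from different rows.
order_dates = [date(2014, 1, 2), date(2014, 2, 1)]

# Independent minimums, as in comparing min(day) and min(month) separately.
min_day = min(d.day for d in order_dates)
min_month = min(d.month for d in order_dates)
matches_independent = [
    d for d in order_dates if (d.month, d.day) == (min_month, min_day)
]

# Ordering by month and day together always yields an existing row.
earliest_in_year = min(order_dates, key=lambda d: (d.month, d.day))
```

Here matches_independent is empty because no order falls on 1 January, while earliest_in_year is the 2 January order.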
To join the result against your invoice table, you don't need to join to the orders table again - especially not with a cross join, which is skewing your results. You can do the join (at least) two ways:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
) o, invoices i
WHERE i.invno = o.invno
AND rn = 1;
Or:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT orderno, orderdate, invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
)
WHERE rn = 1
) o, invoices i
WHERE i.invno = o.invno;
The first looks like it does more work but the execution plans are the same.
SQL Fiddle with your pastebin-supplied query that gets two rows back, and these two that get one.

Firebird 2.5 Removing Rows with Duplicate Fields

I am trying to remove duplicate rows which, for some reason, were imported into a specific table.
There is no primary key in this table.
There are 27797 unique records.
Select distinct txdate, plunumber from itemaudit
gives me the correct records, but of course only displays txdate and plunumber.
If it were possible to select all the fields but apply DISTINCT only to txdate, plunumber, I could export the values, delete the duplicated ones and re-import.
Or, if it's possible, delete the duplicated values directly from the table.
If I select the distinct of all fields, the count is incorrect.

To get all information on the duplicates, you simply need to query all information for the duplicate rows using a JOIN:
SELECT b.*
FROM (SELECT COUNT(*) as cnt, txdate, plunumber
FROM itemaudit
GROUP BY txdate, plunumber
HAVING COUNT(*) > 1) a
INNER JOIN itemaudit b ON a.txdate = b.txdate AND a.plunumber = b.plunumber

DELETE FROM itemaudit t1
WHERE EXISTS (
SELECT 1 FROM itemaudit t2
WHERE t1.txdate = t2.txdate and t1.plunumber = t2.plunumber
AND t1.RDB$DB_KEY < t2.RDB$DB_KEY
);
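
The effect of this DELETE can be checked on a toy table. The sketch below uses Python's sqlite3 module rather than Firebird, so SQLite's implicit rowid stands in for RDB$DB_KEY (an assumption made purely for illustration), and the qty column and sample rows are invented. As in the statement above, a row is deleted whenever another row with the same txdate and plunumber has a higher physical key, so exactly one row per pair survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE itemaudit (txdate TEXT, plunumber INTEGER, qty INTEGER);
    INSERT INTO itemaudit VALUES
        ('2020-01-01', 100, 1),
        ('2020-01-01', 100, 1),
        ('2020-01-01', 200, 5),
        ('2020-01-02', 100, 2),
        ('2020-01-02', 100, 2);
""")

# rowid plays the role of Firebird's RDB$DB_KEY here.
conn.execute("""
    DELETE FROM itemaudit
    WHERE EXISTS (
        SELECT 1 FROM itemaudit t2
        WHERE itemaudit.txdate = t2.txdate
          AND itemaudit.plunumber = t2.plunumber
          AND itemaudit.rowid < t2.rowid
    )
""")

remaining = conn.execute(
    "SELECT txdate, plunumber FROM itemaudit ORDER BY txdate, plunumber"
).fetchall()
```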