Query to select by number of associated objects - postgresql

I have two tables that look like the following:
Orders
------
id
tracking_number
ShippingLogs
------
tracking_number
created_at
stage
I would like to select the IDs of Orders that have ONLY ONE ShippingLog associated with them, and the stage of that ShippingLog must be error. If an order has two ShippingLog entries, I don't want it. If it has one ShippingLog but its stage is shipped, I don't want it.
This is what I have, and it doesn't work, and I know why (it finds the log with the error, but has no way of knowing if there are others). I just don't really know how to get it the way I need it.
SELECT DISTINCT
orders.id, shipping_logs.created_at, COUNT(shipping_logs.*)
FROM
orders
JOIN
shipping_logs ON orders.tracking_number = shipping_logs.tracking_number
WHERE
shipping_logs.created_at BETWEEN '2021-01-01 23:40:00'::timestamp AND '2021-01-26 23:40:00'::timestamp AND shipping_logs.stage = 'error'
GROUP BY
orders.id, shipping_logs.created_at
HAVING
COUNT(shipping_logs.*) = 1
ORDER BY
orders.id, shipping_logs.created_at DESC;

If you want to retain every column from the join of the two tables given your requirements, then I would suggest using COUNT here as an analytic function:
WITH cte AS (
SELECT o.id, sl.created_at,
COUNT(*) OVER (PARTITION BY o.id) num_logs,
COUNT(*) FILTER (WHERE sl.stage <> 'error')
OVER (PARTITION BY o.id) non_error_cnt
FROM orders o
INNER JOIN shipping_logs sl ON sl.tracking_number = o.tracking_number
WHERE sl.created_at BETWEEN '2021-01-01 23:40:00'::timestamp AND
'2021-01-26 23:40:00'::timestamp
)
SELECT id AS order_id, created_at
FROM cte
WHERE num_logs = 1 AND non_error_cnt = 0
ORDER BY id, created_at DESC;
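If only the order IDs are needed, the same rule (exactly one log, and that log is an error) can also be expressed with GROUP BY and HAVING. The following is a minimal sketch, not the answer's method: it uses Python's sqlite3 module as a stand-in for PostgreSQL, the sample rows are made up, and the date filter is left out for brevity.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, tracking_number TEXT);
    CREATE TABLE shipping_logs (tracking_number TEXT, created_at TEXT, stage TEXT);
    -- Hypothetical sample data: order 1 has one error log, order 2 has two logs,
    -- order 3 has one log that is not an error.
    INSERT INTO orders VALUES (1, 'T1'), (2, 'T2'), (3, 'T3');
    INSERT INTO shipping_logs VALUES
        ('T1', '2021-01-05 10:00:00', 'error'),
        ('T2', '2021-01-06 10:00:00', 'error'),
        ('T2', '2021-01-07 10:00:00', 'shipped'),
        ('T3', '2021-01-08 10:00:00', 'shipped');
""")

# Keep an order only if it has exactly one log and zero non-error logs.
single_error_orders = conn.execute("""
    SELECT o.id
    FROM orders o
    JOIN shipping_logs sl ON sl.tracking_number = o.tracking_number
    GROUP BY o.id
    HAVING COUNT(*) = 1
       AND SUM(CASE WHEN sl.stage <> 'error' THEN 1 ELSE 0 END) = 0
    ORDER BY o.id
""").fetchall()

print(single_error_orders)
```

The counting happens after the join and before any row is discarded, which is what the original query was missing: filtering on stage = 'error' in WHERE hides the other logs from COUNT.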

Related

How to select corresponding record alongside aggregate function with having clause

Let's say I have an orders table with customer_id, order_total, and order_date columns. I'd like to build a report that shows all customers who haven't placed an order in the last 30 days, with a column for the total amount their last order was.
This gets all of the customers who should be on the report:
select customer, max(order_date), (select order_total from orders o2 where o2.customer = orders.customer order by order_date desc limit 1)
from orders
group by 1
having max(order_date) < NOW() - '30 days'::interval
Is there a better way to do this that doesn't require a subquery but instead uses a window function or other more efficient method in order to access the total amount from the most recent order? The techniques from How to select id with max date group by category in PostgreSQL? are related, but the extra having restriction seems to stop me from using something like DISTINCT ON.
Solution with row_number window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)
SELECT
customer, order_date, order_total
FROM (
SELECT
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total,
row_number() OVER w as row_count
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
) s
WHERE row_count = 1 AND order_date < CURRENT_DATE - 30
Solution with DISTINCT ON (https://www.postgresql.org/docs/9.5/static/sql-select.html#SQL-DISTINCT):
SELECT
customer, order_date, order_total
FROM (
SELECT DISTINCT ON (customer)
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
ORDER BY customer, order_date DESC
) s
WHERE order_date < CURRENT_DATE - 30
Explanation:
In both solutions I am working with the first_value window function. The window is partitioned by customer. The rows within each customer's partition are ordered descending by date, which puts the latest row first (last_value does not always work as expected, because the default frame ends at the current row and its peers). So it is possible to get the last order_date and the last order_total of this order.
The difference between both solutions is the filtering. I showed both versions because sometimes one of them is significantly faster.
The window function style creates a row count within each partition, so every first row can be filtered later. This is done by adding a row_number window function. The benefit of this solution shows when you want to keep the first two or three rows per group: you simply change the filter from WHERE row_count = 1 to WHERE row_count <= 2.
But if you want only a single row per group, you just need to ensure that the expected row is ordered first in its group. Then the DISTINCT ON clause can drop all following rows. DISTINCT ON (customer) gives the first (ordered) row per customer group.
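The row_number filtering described above can be tried out with a small sketch. It assumes SQLite (via Python's sqlite3, version 3.25 or later for window functions) as a stand-in for PostgreSQL; the sample rows and the helper name latest_orders are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_date TEXT, order_total REAL);
    -- Hypothetical sample data.
    INSERT INTO orders VALUES
        ('alice', '2021-01-01', 10.0),
        ('alice', '2021-02-01', 20.0),
        ('alice', '2021-03-01', 30.0),
        ('bob',   '2021-01-15', 5.0);
""")

def latest_orders(per_customer):
    """Return the newest `per_customer` orders of every customer."""
    return conn.execute("""
        SELECT customer, order_date, order_total
        FROM (
            SELECT *, row_number() OVER (
                PARTITION BY customer ORDER BY order_date DESC
            ) AS row_count
            FROM orders
        ) s
        WHERE row_count <= ?
        ORDER BY customer, order_date DESC
    """, (per_customer,)).fetchall()

latest_one = latest_orders(1)   # one row per customer, like DISTINCT ON (customer)
latest_two = latest_orders(2)   # up to two rows per customer

print(latest_one)
print(latest_two)
```

With a limit of 1 the result matches what DISTINCT ON (customer) would return in PostgreSQL; only the row_number variant generalizes to more than one row per group.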
Try joining the table to a subquery on itself that finds each customer's latest order date:
select o1.customer, o1.order_date, o1.order_total
from orders o1
join (
select customer, max(order_date) as last_order_date
from orders
group by customer
) o2 on o2.customer = o1.customer and o2.last_order_date = o1.order_date
where o2.last_order_date < NOW() - '30 days'::interval
Correlated subqueries in the select list can be slow, because the database may execute the subquery once for each row.
If you use Postgres, you can also try a CTE:
https://www.postgresql.org/docs/9.6/static/queries-with.html
WITH t AS (
select distinct on (customer) customer, order_date, order_total
from orders
order by customer, order_date desc
)
select customer, order_date, order_total
from t
where order_date < NOW() - '30 days'::interval

Unable to get Percentile_Cont() to work in Postgresql

I am trying to calculate a percentile using the percentile_cont() function in PostgreSQL with common table expressions. The goal is to find the top 1% of accounts with regard to their balances (called amount here). My logic is to find the 99th percentile, which will return those whose account balances are greater than 99% of their peers (and thus find the 1 percenters).
Here is my query
--ranking subquery works fine
with ranking as(
select a.lname,sum(c.amount) as networth from customer a
inner join
account b on a.customerid=b.customerid
inner join
transaction c on b.accountid=c.accountid
group by a.lname order by sum(c.amount)
)
select lname, networth, percentile_cont(0.99) within group
order by networth over (partition by lname) from ranking ;
I keep getting the following error.
ERROR: syntax error at or near "order"
LINE 2: ...ame, networth, percentile_cont(0.99) within group order by n..
I am thinking that perhaps I forgot a closing brace etc. but I can't seem to figure out where. I know it could be something with the order keyword but I am not sure what to do. Can you please help me to fix this error?
This tripped me up, too.
It turns out percentile_cont is not supported in postgres 9.3, only in 9.4+.
https://www.postgresql.org/docs/9.4/static/release-9-4.html
So you have to use something like this:
with ordered_purchases as (
select
price,
row_number() over (order by price) as row_id,
(select count(1) from purchases) as ct
from purchases
)
select avg(price) as median
from ordered_purchases
where row_id between ct/2.0 and ct/2.0 + 1
That query is courtesy of https://www.periscopedata.com/blog/medians-in-sql (section: "Median on Postgres")
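To see why the row_id BETWEEN ct/2.0 AND ct/2.0 + 1 filter yields the median, here is a small sketch that runs the same query on made-up prices and compares it with Python's statistics.median. It assumes SQLite (via sqlite3, with window function support) in place of PostgreSQL; the helper name sql_median is hypothetical.

```python
import sqlite3
import statistics

def sql_median(prices):
    """Run the row_number-based median query on a throwaway purchases table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (price REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?)", [(p,) for p in prices])
    (median,) = conn.execute("""
        WITH ordered_purchases AS (
            SELECT price,
                   row_number() OVER (ORDER BY price) AS row_id,
                   (SELECT count(1) FROM purchases) AS ct
            FROM purchases
        )
        SELECT avg(price)
        FROM ordered_purchases
        WHERE row_id BETWEEN ct / 2.0 AND ct / 2.0 + 1
    """).fetchone()
    conn.close()
    return median

even_case = sql_median([1, 2, 3, 4])   # rows 2 and 3 pass the filter, averaged
odd_case = sql_median([5, 1, 3])       # only row 2 passes the filter

print(even_case, statistics.median([1, 2, 3, 4]))
print(odd_case, statistics.median([5, 1, 3]))
```

For an even count the filter keeps the two middle rows, and for an odd count it keeps the single middle row, so avg(price) is the median in both cases.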
You are missing the parentheses in the within group (order by x) part. PostgreSQL also does not allow OVER on an ordered-set aggregate such as percentile_cont, so compute the cutoff in a subquery and filter on it.
Try this:
with ranking
as (
select a.lname,
sum(c.amount) as networth
from customer a
inner join account b on a.customerid = b.customerid
inner join transaction c on b.accountid = c.accountid
group by a.lname
)
select lname,
networth
from ranking
where networth >= (
select percentile_cont(0.99) within group (order by networth)
from ranking
);
I want to point out that you don't need a subquery for this:
select c.lname, sum(t.amount) as networth,
percentile_cont(0.99) within group (order by sum(t.amount)) over (partition by lname)
from customer c inner join
account a
on c.customerid = a.customerid inner join
transaction t
on a.accountid = t.accountid
group by c.lname
order by networth;
Also, when using table aliases (which should be always), table abbreviations are much easier to follow than arbitrary letters.
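For readers unsure what value percentile_cont(0.99) actually produces, the following sketch reimplements the continuous percentile in plain Python: sort the values, locate the fractional position, and interpolate linearly between the two neighbouring values. The function and the sample net worths are hypothetical illustrations, not PostgreSQL's source code.

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbours."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Hypothetical net worth per last name.
networth = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}

cutoff = percentile_cont(networth.values(), 0.99)
top_one_percent = sorted(name for name, worth in networth.items() if worth >= cutoff)

print(cutoff)
print(top_one_percent)
```

The cutoff is a single number for the whole data set, which is why it is compared against every row rather than computed per lname.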

How to get fields and added in group by in PostgreSQL 8.4?

I am selecting the column used in the group by together with a count, and the query looks something like this:
SELECT s.country, count(*) AS posts_ct
FROM store s
JOIN store_post_map sp ON sp.store_id = s.id
GROUP BY 1;
However, I want to select some more fields, like store name or store address, from the store table for the row where the count is max, but I don't want to include them in the group by clause.
For instance, to get the stores with the highest post-count per country:
SELECT DISTINCT ON (s.country)
s.country, s.id, s.name, sp.post_ct
FROM store s
JOIN (
SELECT store_id, count(*) AS post_ct
FROM store_post_map
GROUP BY store_id
) sp ON sp.store_id = s.id
ORDER BY s.country, sp.post_ct DESC
Add any number of columns from store to the SELECT list.
Details about this query style in this related answer:
Select first row in each GROUP BY group?
Reply to comment
This produces the count per country and picks (one of) the store(s) with the highest post-count:
SELECT DISTINCT ON (s.country)
s.country, s.id, s.name
,sum(post_ct) OVER (PARTITION BY s.country) AS post_ct_for_country
FROM store s
JOIN (
SELECT store_id, count(*) AS post_ct
FROM store_post_map
GROUP BY store_id
) sp ON sp.store_id = s.id
ORDER BY s.country, sp.post_ct DESC;
This works because the window function sum() is applied before DISTINCT ON by definition.
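The evaluation order (window functions first, row elimination afterwards) can be checked with a small sketch. SQLite has no DISTINCT ON, so this version, run through Python's sqlite3, emulates it with row_number() and an outer filter; the sample stores and posts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (id INTEGER PRIMARY KEY, country TEXT, name TEXT);
    CREATE TABLE store_post_map (store_id INTEGER, post_id INTEGER);
    -- Hypothetical data: Berlin has 3 posts, Munich 1, Paris 2.
    INSERT INTO store VALUES (1, 'DE', 'Berlin'), (2, 'DE', 'Munich'), (3, 'FR', 'Paris');
    INSERT INTO store_post_map VALUES (1, 1), (1, 2), (1, 3), (2, 4), (3, 5), (3, 6);
""")

rows = conn.execute("""
    SELECT country, id, name, post_ct_for_country
    FROM (
        SELECT s.country, s.id, s.name,
               -- Summed over all stores of the country, before any row is dropped.
               sum(sp.post_ct) OVER (PARTITION BY s.country) AS post_ct_for_country,
               -- Stand-in for DISTINCT ON: rank stores within each country.
               row_number() OVER (
                   PARTITION BY s.country ORDER BY sp.post_ct DESC
               ) AS rn
        FROM store s
        JOIN (
            SELECT store_id, count(*) AS post_ct
            FROM store_post_map
            GROUP BY store_id
        ) sp ON sp.store_id = s.id
    ) ranked
    WHERE rn = 1
    ORDER BY country
""").fetchall()

print(rows)
```

Germany reports 4 posts even though only the Berlin row survives, because the sum was computed while Munich's row was still present.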

TSQL Compare 2 select's result and return result with most recent date

Wonder if someone could give me a quick hand. I have 2 select queries (as shown below) and I want to compare the results of both and only return the result that has the most recent date.
So say I have the following 2 results from the queries:-
--------- ---------- ----------------------- --------------- ------ --
COMPANY A EMPLOYEE A 2007-10-16 17:10:21.000 E-mail 6D29D6D5 SYSTEM 1
COMPANY A EMPLOYEE A 2007-10-15 17:10:21.000 E-mail 6D29D6D5 SYSTEM 1
I only want to return the result with the latest date (so the first one). I thought about putting the results into a temporary table and then querying that but just wondering if there's a simpler, more efficient way?
SELECT * FROM (
SELECT fc.accountidname, fc.owneridname, fap.actualend, fap.activitytypecodename, fap.createdby, fap.createdbyname,
ROW_NUMBER() OVER (PARTITION BY fc.accountidname ORDER BY fap.actualend DESC) AS RN
FROM FilteredContact fc
INNER JOIN FilteredActivityPointer fap ON fc.parentcustomerid = fap.regardingobjectid
WHERE fc.statecodename = 'Active'
AND fap.ownerid = '0F995BDC'
AND fap.createdon < getdate()
) tmp WHERE RN = 1
SELECT * FROM (
SELECT fa.name, fa.owneridname, fa.new_technicalaccountmanageridname, fa.new_customerid, fa.new_riskstatusname,
fa.new_numberofopencases, fa.new_numberofurgentopencases, fap.actualend, fap.activitytypecodename, fap.createdby, fap.createdbyname,
ROW_NUMBER() OVER (PARTITION BY fa.name ORDER BY fap.actualend DESC) AS RN
FROM FilteredAccount fa
INNER JOIN FilteredActivityPointer fap ON fa.accountid = fap.regardingobjectid
WHERE fa.statecodename = 'Active'
AND fap.ownerid = '0F995BDC'
AND fap.createdon < getdate()
) tmp2 WHERE RN = 1
If the two result sets have the same structure (matching column count and column types), then you could just union the results of the two queries, order by the date descending, and select the top 1.
select top 1 * from
(
-- your first query
union all
-- your second query.
) T
order by YourDateColumn1 desc
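A runnable sketch of the union-then-pick-newest idea follows. It assumes SQLite (via Python's sqlite3), where LIMIT 1 plays the role of T-SQL's TOP 1; the two table names and their rows are hypothetical stand-ins for the results of the two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables holding the result of each original query.
    CREATE TABLE contact_activity (company TEXT, actualend TEXT);
    CREATE TABLE account_activity (company TEXT, actualend TEXT);
    INSERT INTO contact_activity VALUES ('COMPANY A', '2007-10-16 17:10:21');
    INSERT INTO account_activity VALUES ('COMPANY A', '2007-10-15 17:10:21');
""")

# Stack both result sets, sort newest first, keep a single row.
latest = conn.execute("""
    SELECT company, actualend
    FROM (
        SELECT company, actualend FROM contact_activity
        UNION ALL
        SELECT company, actualend FROM account_activity
    ) t
    ORDER BY actualend DESC
    LIMIT 1
""").fetchone()

print(latest)
```

UNION ALL is enough here because duplicates do not matter when only the single newest row is kept, and it avoids the de-duplication work that plain UNION would do.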
You should GROUP BY and use MAX(createdon)

Tsql, returning rows with identical column values

Given an example table 'Users', which has an int column named 'UserID' (and some arbitrary number of other columns), what is the best way to select all rows for which UserID appears more than once?
So far I've come up with
select * from Users where UserID in
(select UserID from Users group by UserID having COUNT(UserID) > 1)
This seems like quite an inefficient way to do this though. Is there a better way?
In SQL Server 2005+ you could use this approach:
;WITH UsersNumbered AS (
SELECT
UserID,
rownum = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY UserID)
FROM Users
)
SELECT u.*
FROM Users u
INNER JOIN UsersNumbered n ON u.UserID = n.UserID AND n.rownum = 2
Provided there exists a non-clustered index on UserID, this yields a slightly worse execution plan than your approach. To make it better (actually, same as yours), you'll need to use... a subquery, however counter-intuitive it may seem:
;WITH UsersNumbered AS (
SELECT
UserID,
rownum = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY UserID)
FROM Users
)
SELECT u.*
FROM Users u
WHERE EXISTS (
SELECT *
FROM UsersNumbered n
WHERE u.UserID = n.UserID AND n.rownum = 2
);
In case of a clustered index on UserID all three solutions give the same plan.
This does the same thing and will likely be faster, but compare the execution plans to be sure. Of course, there should be an index on the UserID column.
select u.*
from Users u
join (select UserID,count(UserID) as CUserID from Users group by UserID) u1 on u1.UserID = u.UserID
where CUserID > 1
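To confirm that the variants in this thread return the same rows, here is a sketch that runs three of them side by side on made-up data. It assumes SQLite (via Python's sqlite3, with window function support) instead of SQL Server, and the third query, using COUNT(*) OVER, is an additional illustration rather than one of the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserID INTEGER, Name TEXT);
    -- Hypothetical data: UserID 1 is unique, 2 and 3 are duplicated.
    INSERT INTO Users VALUES
        (1, 'ann'), (2, 'bob'), (2, 'bobby'), (3, 'cy'), (3, 'cyd'), (3, 'cyrus');
""")

# Variant from the question: IN with a grouped subquery.
with_in = conn.execute("""
    SELECT UserID, Name FROM Users
    WHERE UserID IN (
        SELECT UserID FROM Users GROUP BY UserID HAVING COUNT(UserID) > 1
    )
    ORDER BY UserID, Name
""").fetchall()

# Variant from the last answer: join against per-UserID counts.
with_join = conn.execute("""
    SELECT u.UserID, u.Name
    FROM Users u
    JOIN (
        SELECT UserID, COUNT(UserID) AS CUserID FROM Users GROUP BY UserID
    ) u1 ON u1.UserID = u.UserID
    WHERE u1.CUserID > 1
    ORDER BY u.UserID, u.Name
""").fetchall()

# Extra illustration: count per UserID as a window function.
with_window = conn.execute("""
    SELECT UserID, Name
    FROM (
        SELECT UserID, Name, COUNT(*) OVER (PARTITION BY UserID) AS cnt
        FROM Users
    ) counted
    WHERE cnt > 1
    ORDER BY UserID, Name
""").fetchall()

print(with_in)
```

Which form is fastest depends on the engine and the available indexes, so the sketch only checks that the results agree, not how they perform.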