select last of an item for each user in postgres - postgresql

I want to get the last entry for each user, but the customer_id is a hash ('ASAG#...'), and ordering by customer_id destroys the query. Is there an alternative?
Select Distinct On (l.customer_id)
       l.customer_id
     , l.created_at
     , l.text
From likes l
Order By l.customer_id, l.created_at Desc

Your current query already appears to be working; see here:
Demo
I don't know why your current query is not generating the results you would expect. It should return one distinct record for every customer, corresponding to the most recent one, given your ORDER BY clause.
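As a quick sanity check, here is a minimal, self-contained reproduction; the sample rows and hash-style IDs below are invented for illustration:
-- Hypothetical sample data with hash-style customer_id values.
CREATE TABLE likes (customer_id text, created_at timestamptz, text text);
INSERT INTO likes VALUES
    ('ASAG#111', '2021-01-01 10:00', 'older like'),
    ('ASAG#111', '2021-01-02 10:00', 'newest like'),
    ('ASAG#222', '2021-01-01 09:00', 'only like');

SELECT DISTINCT ON (l.customer_id)
       l.customer_id, l.created_at, l.text
FROM likes l
ORDER BY l.customer_id, l.created_at DESC;
-- Expected: one row per customer ('newest like' and 'only like'). The hash
-- values sort as plain text, which is harmless here, since DISTINCT ON only
-- needs a consistent sort order, not a meaningful one.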
In any case, if it does not do what you want, an alternative would be to use ROW_NUMBER() with a partition by customer. The inner query assigns a row number to each record within each customer's partition, with the value 1 going to the most recent record. The outer query then retains only that latest record.
SELECT
    t.customer_id,
    t.created_at,
    t.text
FROM
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) rn
    FROM likes
) t
WHERE t.rn = 1
To speed up the inner query, which uses ROW_NUMBER(), you can try adding a composite index on the customer_id and created_at columns:
CREATE INDEX yourIdx ON likes (customer_id, created_at);

Related

Converting counts inside query result tables to percentages of total

I have a table and want to calculate the percentage of the total by store_id that each (category_id, store_id) subtotal represents. My code is below:
WITH example_table (name, store_id) AS
(
    select name, store_id
    from category
    join film_category using (category_id)
    join film using (film_id)
    join inventory using (film_id)
    join rental using (inventory_id)
)
SELECT name, store_id, cast(count(*) as numeric)/(SELECT count(*) FROM example_table)
FROM example_table
GROUP BY name, store_id
ORDER BY name, store_id
This code actually works, as in it doesn't throw an error, but these aren't the results I'm looking for. Here, each of the subtotals is divided by the total across both stores and all 16 names. Instead, I want the subtotals divided by their respective store totals, or by their respective name totals.
I'm wondering how to perform calculations on those subtotals in general.
Thanks in advance,
I believe you need to explore the possibilities of using aggregate functions combined with an OVER (PARTITION BY ...) clause, e.g.
SELECT DISTINCT
       name, store_id, store_id_count, name_count
FROM (
    select name, store_id
         , count(*) over(partition by store_id) as store_id_count
         , count(*) over(partition by name) as name_count
    from category
    join film_category using (category_id)
    join film using (film_id)
    join inventory using (film_id)
    join rental using (inventory_id)
) AS example_table
When you use an aggregate function with the OVER clause, you get the wanted counts on each row of the result, and it seems that in this case you need this. Note that SELECT DISTINCT has been used simply to reduce the final number of rows returned; you might still need to use a GROUP BY, but I am not sure if you do.
Once you have the needed values within the derived table (aliased as example_table), it should be a simple matter of some arithmetic in the overall select clause, as sketched below.
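For instance, here is a minimal sketch of that arithmetic; the extra pair_count window count and the percentage column aliases are my own additions, not part of the original answer:
SELECT DISTINCT
       name, store_id
     , 100.0 * pair_count / store_id_count AS pct_of_store
     , 100.0 * pair_count / name_count AS pct_of_name
FROM (
    select name, store_id
         -- subtotal for each (name, store_id) pair
         , count(*) over(partition by name, store_id) as pair_count
         , count(*) over(partition by store_id) as store_id_count
         , count(*) over(partition by name) as name_count
    from category
    join film_category using (category_id)
    join film using (film_id)
    join inventory using (film_id)
    join rental using (inventory_id)
) AS example_table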

Find time difference between two most recent orders

I am trying to estimate the time of a new order from repeat customers by finding the time difference between the most recent order and the second most recent order, and then adding that difference to the most recent order.
I have been trying LIMIT and OFFSET, but this returns a blanket date for every row. I am thinking I need to do a lateral join, but I am not sure how to implement it correctly. When I try to do it, I receive no output.
select public.orders.customer_id,
       max(public.orders.created_at) as last_order_date,
       (select created_at from public.orders group by created_at order by created_at desc limit 1 offset 1) as second_last
from public.orders
inner join
    (select customer_id, count(*)
     from public.orders
     where status = 'fulfilled'
     group by public.orders.customer_id
     having count(customer_id) > 1) repeat_customers
    on public.orders.customer_id = repeat_customers.customer_id
group by public.orders.customer_id;
I wanted the second_last field to be populated by the second most recent date for each customer_id, but the output is the second most recent date for the entire table, resulting in the same date for every entry.
For your second_last column you're not limiting it per customer, so it will indeed find one date for the entire table, just like the results you've seen. See the WHERE clause in the example below, which should solve this:
(SELECT created_at
 FROM public.orders po
 WHERE po.customer_id = orders.customer_id  -- correlate with the outer query's customer
 ORDER BY created_at DESC                   -- descending, so OFFSET 1 is the second most recent
 LIMIT 1 OFFSET 1) AS second_last
I've also aliased the table because I wasn't sure if it would complain about ambiguity since the same table is mentioned in the main select.
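Since the question mentions LATERAL, here is a hedged sketch of that approach as well; the aliases (o, prev) and the "add the gap to the last order" arithmetic are my assumptions based on the question's description:
select o.customer_id,
       o.last_order_date,
       prev.created_at as second_last,
       -- estimated next order: last order plus the gap between the last two
       o.last_order_date + (o.last_order_date - prev.created_at) as estimated_next_order
from (select customer_id, max(created_at) as last_order_date
      from public.orders
      where status = 'fulfilled'
      group by customer_id
      having count(*) > 1) o
cross join lateral
     (select created_at
      from public.orders po
      where po.customer_id = o.customer_id
      order by created_at desc
      limit 1 offset 1) prev;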

ROW_NUMBER() OVER PARTITION optimization

I have the following query:
SELECT *
FROM
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Price ASC) as RowNum
    from Offers
) r
where RowNum = 1
The Offers table contains about 10 million records, but there are only ~4000 distinct codes. So I need to get the row with the lowest price for each code, and there will be only ~4000 rows in the result.
I have an index on the (Code, Price) columns with all other columns in an INCLUDE clause.
The query runs for 2 minutes, and if I look at the execution plan, I see an index scan with 10M actual rows. So I guess it scans the whole index to get the needed values.
Why does MSSQL do a whole index scan? Is it because the subquery needs all the data? How can I avoid this scan? Is there a SQL hint to process only the first row in each partition?
Is there another way to optimize such a query?
After trying multiple different solutions, I found that the fastest query uses a CROSS APPLY:
SELECT C.*
FROM (SELECT DISTINCT Code FROM Offers) A
CROSS APPLY (SELECT TOP 1 *
             FROM Offers B
             WHERE A.Code = B.Code
             ORDER BY Price) C
It takes ~1 second to run.
Try creating an index on ( Code, Price ) without including the other columns and then (assuming that there is a unique Id column):
select L.*
from Offers as L
inner join
     ( select Id,
              Row_Number() over ( partition by Code order by Price ) as RN
       from Offers ) as R
     on R.Id = L.Id and R.RN = 1
An index scan on a smaller index ought to help.
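For reference, a minimal sketch of the narrower index described above; the index name is made up:
create index IX_Offers_Code_Price on Offers ( Code, Price );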
A second guess would be to get the Id of the row with the lowest Price for each Code explicitly: get the distinct Code values, get the Id of the TOP 1 (to avoid problems with duplicate prices) Min( Price ) row for each Code, then join with Offers to get the complete rows. Again, the more compact index should help; see the sketch below.
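A hedged sketch of that second approach; the Codes and R aliases are mine, and a unique Id column on Offers is assumed:
select O.*
from ( select distinct Code from Offers ) as Codes
cross apply ( select top 1 Id
              from Offers as B
              where B.Code = Codes.Code
              order by B.Price ) as R  -- Id of the cheapest row per Code
inner join Offers as O
    on O.Id = R.Id;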
Not sure if you'll get any significant performance gains, but you may want to try the WITH TIES clause
Example
Select Top 1 with Ties *
From Offers
Order By Row_Number() over (Partition By Code Order By Price)

SQL SELECT in another table with most recent date

I have a list of Matter data in Table1 that I need to query, as well as get the most recent Invoice Number in Table2 that is tied to the original Matter. I'm having extreme difficulty joining these tables together and getting only one result for each Matter, since I only want the most recent Invoice #.
Any and all help would be greatly appreciated.
The following assigns numbers to each invoice row in order of date and selects only the most recent. Note that this assumes InvoiceDate is stored as a date, datetime, or something else that will sort chronologically, and that in the event of two invoices for the same date, returning either will be fine. If you need to return both invoices in the event of ties, replace row_number with rank.
Select *
from Table1 a
inner join
    (Select *
          , row_number() over (partition by MatterID order by InvoiceDate desc) as RN
     from Table2) b
    on a.MatterID = b.MatterID and b.RN = 1

Postgresql vlookup

Let's say I have a table "uservalue" with the following columns:
integer user_id
integer group_id
integer value
I can get the maximum value for each group easily:
select max(value) from uservalue group by group_id;
What I would like is for it to return the user_id in each group that had the highest value. The max function in MATLAB will also return the index of the maximum; is there some way to make PostgreSQL do the same thing?
The proper way to do this is with a subquery.
select
    u.user_id,
    u.value
from uservalue u
join
    (select group_id, max(value) as max_value
     from uservalue
     group by group_id) mv
    on u.value = mv.max_value and mv.group_id = u.group_id
However, I sometimes prefer a simpler hack.
select group_id,
       max(value * 100000 + user_id) % 100000 as user_id,  -- unpack the user_id again
       max(value) as max_value
from uservalue
group by group_id;
Make sure the number (100000) is higher than any user_id you expect to have. For example, with value = 7 and user_id = 42, the packed number is 7 * 100000 + 42 = 700042, and 700042 % 100000 recovers 42. This ensures only one user_id is selected when several users share the same maximum value (the one with the highest user_id), whereas the subquery approach returns them all.
Seems you should be able to do this with a windowing query, something like:
SELECT DISTINCT
       group_id,
       first_value(user_id) OVER w AS user,
       first_value(value) OVER w AS val
FROM uservalue
WINDOW w AS (PARTITION BY group_id ORDER BY value DESC)
This query will also work if you have multiple users with the same value (though unless you add a second column to the ORDER BY, you will not know which one you will get back; you will only get one row per group).
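If a deterministic tie-break matters, one option (my assumption, not part of the original answer) is to extend the window's ORDER BY with a second column, e.g.:
WINDOW w AS (PARTITION BY group_id ORDER BY value DESC, user_id)
This returns the lowest user_id among users tied for the highest value.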
Here are several ways to do this.
It's pretty much a FAQ.