Fetch minimum value for each NTILE bucket in Hive - postgresql

I am trying to partition the data into percentiles (100 equal buckets) using NTILE window function for each merchant_id ordered by score column. The output of the query will contain merchant_id, score, and percentile for every record in the source table. (Sample code below)
CREATE TABLE merchant_score_ntiles
AS
SELECT merchant_id, score, NTILE(100) OVER (PARTITION BY merchant_id ORDER BY score DESC) as percentile
FROM merch_table
This will return sample output as follows:
merchant_id,score,percentile
1001,900,1
1001,800,1
1001,760,1
1002,900,2
1002,800,2
1002,750,2
Is there a way we can return only the minimum score for each merchant_id based on percentile column such as below?
merchant_id,score,percentile
1001,760,1
1002,750,2

You can try using the ROW_NUMBER window function in a subquery before applying the NTILE window function:
SELECT merchant_id,
score,
NTILE(100) OVER (PARTITION BY merchant_id ORDER BY score DESC) as percentile
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY merchant_id ORDER BY score) rn
FROM merch_table
) t1
WHERE rn = 1
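If the goal is the lowest score inside each (merchant_id, percentile) bucket, another option is to compute NTILE first and aggregate afterwards. The following is only a sketch that reuses the table and column names from the question; the alias min_score is made up for illustration. Note that a filter such as rn = 1 is applied before NTILE is evaluated at the same query level, so NTILE would then see a single row per merchant.

```sql
-- Sketch: assign buckets in the inner query, then take the minimum per bucket.
SELECT merchant_id,
       percentile,
       MIN(score) AS min_score   -- min_score is an illustrative alias
FROM (
    SELECT merchant_id,
           score,
           NTILE(100) OVER (PARTITION BY merchant_id ORDER BY score DESC) AS percentile
    FROM merch_table
) t
GROUP BY merchant_id, percentile;
```

This returns one row per merchant and bucket, so a merchant with fewer than 100 rows produces fewer than 100 output rows.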

Related

Max function in Postgres does not give the max value

I am writing a simple SQL query to get the latest record from every customer and to get the max of device_count if there are multiple records for a customer with same timestamp. However, the max function doesn't seem to take the max value though. Any help would be appreciated.
My SQL query -
select sub.customerid, max(sub.device_count) from(
SELECT customerid, device_count,
RANK() OVER
(
PARTITION by customerid
ORDER BY date_time desc
) AS rownum
FROM tableA) sub
WHERE rownum = 1
group by 1
Sample data:
customerid,device_count,date_time
A,3573,2021-07-26 02:15:09-05:00
A,4,2021-07-26 02:15:13-05:00
A,16988,2021-07-26 02:15:13-05:00
A,20696,2021-07-26 02:15:13-05:00
A,24655,2021-07-26 02:15:13-05:00
Desired Output should be to get the row with max device_count which is 24655 but I get 16988 as the output.
Try the following:
Sort your table using ORDER BY customerid, device_count.
Then apply the LAST_VALUE(device_count) window function over the customerid partition. Since the rows are sorted ascending, the last device_count value is the max.
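A rough sketch of the LAST_VALUE idea, with two assumptions that go beyond the answer above: date_time is added to the ordering so that only the latest timestamp counts, and the frame is spelled out explicitly. Without the explicit frame, a window with ORDER BY ends at the current row by default, so LAST_VALUE would not look at the whole partition.

```sql
-- Sketch: the last row per customer, ordered by timestamp and then count,
-- holds the largest device_count at the latest date_time.
SELECT DISTINCT
       customerid,
       LAST_VALUE(device_count) OVER (
           PARTITION BY customerid
           ORDER BY date_time, device_count
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS device_count
FROM tableA;
```

DISTINCT collapses the repeated value to one row per customer, because the window function emits the same result on every row of the partition.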
You need to put device_count into the window function's order by and take out the aggregation:
select sub.customerid, device_count from(
SELECT customerid, device_count,
RANK() OVER
(
PARTITION by customerid
ORDER BY date_time desc, device_count desc
) AS rownum
FROM tableA) sub where rownum=1;
But if the top row for a customerid has ties (in both the date_time and device_count fields), RANK() will return all of the tied rows. So it is better to replace RANK() with ROW_NUMBER().
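For reference, the ROW_NUMBER() variant might look like this. It is a sketch using the names from the question; rn is just an illustrative alias.

```sql
-- Sketch: ROW_NUMBER() numbers tied rows arbitrarily, so exactly one row
-- per customer survives the rn = 1 filter.
SELECT customerid, device_count
FROM (
    SELECT customerid,
           device_count,
           ROW_NUMBER() OVER (
               PARTITION BY customerid
               ORDER BY date_time DESC, device_count DESC
           ) AS rn
    FROM tableA
) sub
WHERE rn = 1;
```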

SQL multiple calculation in table creation

I'm trying to create a table with multiple calculations.
I have a base table from which I would like to collect data and insert it into the new table. The first few columns come from the base table: some are copied as-is, the others are calculated from it.
These work fine, but the last two columns do not. Their calculation would be based on calculated fields of the new table.
Can this be solved in one step? Should I use an update? As far as I know, ranking does not work with that.
INSERT INTO [RAW_NBA_TeamSimpleRating]
(
[Team]
,[Game_total]
,[ORtg_avg]
,[DRtg_avg]
,[ORtg_rank]
,[ORtg_cluster]
)
SELECT
[Team]
,[Game]
,AVG ([ORtg]) OVER (PARTITION BY Team ORDER BY RowNumber rows between 81 preceding and current row) as ORtg_avg
,AVG ([DRtg]) OVER (PARTITION BY Team ORDER BY RowNumber rows between 81 preceding and current row) as DRtg_avg
,RANK () OVER (PARTITION BY [RAW_NBA_TeamSimpleRating].[Game_total] ORDER BY [RAW_NBA_TeamSimpleRating].[ORtg_avg] Desc)
,CASE
WHEN RANK () OVER (PARTITION BY [RAW_NBA_TeamSimpleRating].[Game_total] ORDER BY [RAW_NBA_TeamSimpleRating].[ORtg_avg] DESC) > 10 THEN 'Bottom'
WHEN RANK () OVER (PARTITION BY [RAW_NBA_TeamSimpleRating].[Game_total] ORDER BY [RAW_NBA_TeamSimpleRating].[ORtg_avg] DESC) <= 10 THEN 'TOP'
END
FROM [WRK_NBA_TeamTable]
If you wrap your query you can use the values from the inner select. A window function cannot refer to an alias defined in the same select, and the target table is not part of the from clause, so the rank needs its own level (Game is the source column that feeds Game_total):
select Team, Game, ORtg_avg, DRtg_avg, [Rank],
case
when [Rank] > 10 then 'Bottom'
when [Rank] <= 10 then 'TOP'
end as ORtg_cluster
from (
select Team, Game, ORtg_avg, DRtg_avg
,Rank () over (partition by Game order by ORtg_avg desc) as [Rank]
from (
select Team, Game
,Avg (ORtg) over (partition by Team order by RowNumber rows between 81 preceding and current row) as ORtg_avg
,Avg (DRtg) over (partition by Team order by RowNumber rows between 81 preceding and current row) as DRtg_avg
from WRK_NBA_TeamTable
)a
)s

How to select corresponding record alongside aggregate function with having clause

Let's say I have an orders table with customer_id, order_total, and order_date columns. I'd like to build a report that shows all customers who haven't placed an order in the last 30 days, with a column for the total amount their last order was.
This gets all of the customers who should be on the report:
select customer, max(order_date), (select order_total from orders o2 where o2.customer = orders.customer order by order_date desc limit 1)
from orders
group by 1
having max(order_date) < NOW() - '30 days'::interval
Is there a better way to do this that doesn't require a subquery but instead uses a window function or other more efficient method in order to access the total amount from the most recent order? The techniques from How to select id with max date group by category in PostgreSQL? are related, but the extra having restriction seems to stop me from using something like DISTINCT ON.
Solution with row_number window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)
SELECT
customer, order_date, order_total
FROM (
SELECT
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total,
row_number() OVER w as row_count
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
) s
WHERE row_count = 1 AND order_date < CURRENT_DATE - 30
Solution with DISTINCT ON (https://www.postgresql.org/docs/9.5/static/sql-select.html#SQL-DISTINCT):
SELECT
customer, order_date, order_total
FROM (
SELECT DISTINCT ON (customer)
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
ORDER BY customer, order_date DESC
) s
WHERE order_date < CURRENT_DATE - 30
Explanation:
In both solutions I am working with the first_value window function. The window's partition is defined by customer. The rows within each customer's partition are ordered by date descending, which puts the latest row first. (last_value does not always behave as expected here, because with an ORDER BY the default frame ends at the current row.) So it is possible to get the last order_date and the order_total of that order.
The difference between both solutions is the filtering. I showed both versions because sometimes one of them is significantly faster.
The window function style creates a row count within each partition, so that every first row can be filtered later. This is done by adding a row_number window function. The benefit of this solution shows when you want the first two or three rows per group: you simply change the filter from WHERE row_count = 1 to WHERE row_count <= 2.
But if you want only a single row per group, you just need to ensure that the expected row is ordered first within its group. Then the DISTINCT ON clause removes all following rows: DISTINCT ON (customer) gives the first (ordered) row per customer group.
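As an illustration of the row_number style for more than one row per group, here is a sketch that keeps the two most recent orders per customer. It assumes the orders table from the question and drops the 30-day restriction to keep the example short.

```sql
-- Sketch: top-2 rows per customer by order_date.
SELECT customer, order_date, order_total
FROM (
    SELECT customer,
           order_date,
           order_total,
           row_number() OVER (PARTITION BY customer ORDER BY order_date DESC) AS row_count
    FROM orders
) s
WHERE row_count <= 2;
```

DISTINCT ON cannot express this case, since it always keeps exactly one row per group.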
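Try joining the table to an aggregate of itself:
select o1.customer, o1.order_date, o1.order_total
from orders o1
join (
select customer, max(order_date) as last_order
from orders
group by customer
) o2 on o2.customer = o1.customer and o2.last_order = o1.order_date
where o2.last_order < NOW() - '30 days'::interval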
A subquery in the select list is a bad idea, because the DB may execute it once for each row.
If you use Postgres you can also try a CTE:
https://www.postgresql.org/docs/9.6/static/queries-with.html
WITH t AS (
-- latest order per customer
select distinct on (customer) customer, order_date, order_total
from orders
order by customer, order_date desc
)
select customer, order_date, order_total
from t
where order_date < NOW() - '30 days'::interval

postgres - get top category purchased by customer

I have a denormalized table with the columns:
buyer_id
order_id
item_id
item_price
item_category
I would like to return something that returns 1 row per buyer_id
buyer_id, sum(item_price), item_category
-- but ONLY for the category with the highest rank of sales along that specific buyer_id.
I can't get row_number() or partition to work because I need to order by the sum of item_price relative to item_category relative to buyer. Am I overlooking anything obvious?
You need a few layers of fudging here:
SELECT buyer_id, item_sum, item_category
FROM (
SELECT buyer_id,
rank() OVER (PARTITION BY buyer_id ORDER BY item_sum DESC) AS rnk,
item_sum, item_category
FROM (
SELECT buyer_id, sum(item_price) AS item_sum, item_category
FROM my_table
GROUP BY 1, 3) AS sub2) AS sub
WHERE rnk = 1;
In sub2 you calculate the sum of 'item_price' for each 'item_category' for each 'buyer_id'. In sub you rank these with a window function by 'buyer_id', ordering by 'item_sum' in descending order (so the highest 'item_sum' comes first). In the main query you select those rows where rnk = 1.
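In PostgreSQL, window functions are evaluated after GROUP BY, so the window's ORDER BY may use the aggregate directly. That allows a slightly flatter variant, sketched below with the same table and column names. row_number() is used here instead of rank() as an assumption that exactly one row per buyer is wanted even when two categories tie.

```sql
-- Sketch: group once, and rank the grouped rows in the same query level.
SELECT buyer_id, item_sum, item_category
FROM (
    SELECT buyer_id,
           item_category,
           sum(item_price) AS item_sum,
           row_number() OVER (
               PARTITION BY buyer_id
               ORDER BY sum(item_price) DESC
           ) AS rn
    FROM my_table
    GROUP BY buyer_id, item_category
) AS sub
WHERE rn = 1;
```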

TSQL - Easiest way to set increasing value on the Position column ..order by Price

Imagine a table with [ID, Name, Position, Price]
Currently the table has ALL the records with Position=0 and each record has a different price value.
What is the quickest way to update that Position and set its value unique and based on the Price, where the record with the lowest price is Position=1, the second lowest is Position=2 ...and so on?
Thanks in advance.
You could use row_number to number rows. A subquery is required to refer to the row number by alias:
update yt
set Position = rn
from (
select row_number() over (order by Price asc) as rn
, *
from YourTable
) yt
Working example at SE Data.
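The same update can also be written with a common table expression, which T-SQL allows as an update target. This is a sketch only: the table name YourTable follows the answer above, and the tie-break on ID is an assumption added so that equal prices get a stable order.

```sql
-- Sketch: number the rows by ascending price, then write the number back.
WITH numbered AS (
    SELECT Position,
           ROW_NUMBER() OVER (ORDER BY Price ASC, ID ASC) AS rn
    FROM YourTable
)
UPDATE numbered
SET Position = rn;
```

The lowest price gets Position = 1, the second lowest Position = 2, and so on.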