Filter duplicates on row_number results

I'm trying to write a PostgreSQL query that returns the 10 longest-running jobs for each month (excluding the current month). The query below is what I have so far, but it returns duplicate job names. How can I filter these out?
SELECT job, month, duration
FROM (
SELECT
month,
job,
duration,
ROW_NUMBER() OVER (PARTITION BY month ORDER BY duration DESC) AS RN
FROM
run_history
WHERE
owner = 'john'
) x
WHERE RN <= 10
AND month < TO_CHAR(CURRENT_DATE, 'yyyymm')

It sounds like there can be multiple rows per (owner, month, job), and you want to work with the maximum duration per month for each job.
If so, aggregate with max(duration) first, then apply row_number() on top of the aggregated result:
SELECT job, month, max_duration
FROM (
SELECT month, job, max(duration) AS max_duration
, row_number() OVER (PARTITION BY month ORDER BY max(duration) DESC NULLS LAST) AS rn
FROM run_history
WHERE owner = 'john'
AND month < to_char(CURRENT_DATE, 'yyyymm')
GROUP BY month, job
) sub
WHERE rn <= 10
ORDER BY month DESC, rn;
Aside: consider integer or date instead of text for the column month: cleaner and more efficient.
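To illustrate the aside, here is a minimal sketch that assumes the month were stored as a date holding the first day of each month. The column name month_start and the sample rows are hypothetical, not part of the original table. With a date column, the current month can be excluded by a plain date comparison rather than a string comparison:

```sql
-- Hypothetical sample data; month_start is an assumed date column (first day of the month).
WITH run_history(owner, job, month_start, duration) AS (
   VALUES ('john', 'job_a', '2021-01-01'::date, 120)
        , ('john', 'job_a', '2021-01-01', 300)   -- same job twice in one month
        , ('john', 'job_b', '2021-01-01', 200)
        , ('john', 'job_a', '2021-02-01', 150)
)
SELECT job, month_start, max_duration
FROM  (
   SELECT month_start, job, max(duration) AS max_duration
        , row_number() OVER (PARTITION BY month_start
                             ORDER BY max(duration) DESC NULLS LAST) AS rn
   FROM   run_history
   WHERE  owner = 'john'
   AND    month_start < date_trunc('month', CURRENT_DATE)  -- excludes the current month
   GROUP  BY month_start, job                               -- one row per job and month
   ) sub
WHERE  rn <= 10
ORDER  BY month_start DESC, rn;
```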

Related

Postgresql want to run a query for each day in an interval

I have a query that I need to run for every day in an interval, for example once per day for the last 2 years. The table has no day column, so I think I need to do it in a loop:
select *
from (
select distinct on (osu.order_id) osu.order_id, osu.order_state, osu.created_at
from stock_management.order_state_updates osu
where osu.created_at < '2021-01-26 22:00:00'
order by osu.order_id desc, osu.created_at desc) temp
where temp.order_state = 'Filter1';
Here the date '2021-01-26 22:00:00' would step through each day of the interval. Thank you.
https://docs.google.com/spreadsheets/d/1B2xx-c3wWZYaEN76LxjYhHnlrPRUx4TG8vsAkZ1X_Vs/edit?usp=sharing

You can generate a calendar and join it to your query. I'm not sure this will retrieve the right data, because I don't have sample data or the expected result.
with d as (select * from generate_series ('20210101','20220427',interval '1 day') as date)
select * from (
select distinct on (osu.order_id) osu.order_id, osu.order_state, osu.created_at::date
from stock_management.order_state_updates osu right join d on osu.created_at::date = d.date
order by osu.order_id desc, osu.created_at desc) temp
where temp.order_state = 'Filter1';
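If the goal is a snapshot per day (the latest state of every order as of each cutoff), one possible sketch is to put the generated day into DISTINCT ON as well. The date range and the 22:00 cutoff below are assumptions taken from the example timestamp in the question, and the join can be expensive on a large table:

```sql
-- Sketch: for every day in an assumed range, take the latest state of each order
-- before that day's cutoff, then keep only orders whose latest state is 'Filter1'.
SELECT day, order_id, order_state, created_at
FROM  (
   SELECT DISTINCT ON (d.day, osu.order_id)
          d.day, osu.order_id, osu.order_state, osu.created_at
   FROM   generate_series(timestamp '2021-01-01 22:00'
                        , timestamp '2021-01-31 22:00'
                        , interval '1 day') AS d(day)
   JOIN   stock_management.order_state_updates osu
          ON osu.created_at < d.day          -- all updates before that day's cutoff
   ORDER  BY d.day, osu.order_id, osu.created_at DESC
   ) temp
WHERE  order_state = 'Filter1'
ORDER  BY day, order_id;
```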

PostgreSQL - SQL function to loop through all months of the year and pull 10 random records from each

I am trying to pull 10 random records from each month of this year using the query below, but I get the error "ERROR: relation "c1" does not exist".
I'm not sure where I'm going wrong. I think I may be using MySQL syntax instead, but how do I resolve this?
My desired output looks like this:
Month      Another header
2021-01    random email 1
2021-01    random email 2
That is, a total of ten random emails from January, then ten more for each month of this year (up to November, since December has yet to happen).
With CTE AS
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
DISTINCT(TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM')) AS month
,CASE
WHEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
)
SELECT *
FROM CTE C1
LEFT JOIN
(SELECT RN
,month
,email
FROM CTE C2
WHERE C2.month = C1.month
ORDER BY RANDOM() LIMIT 10) C3
ON C1.RN = C3.RN
ORDER BY month ASC

You can't reference an outer table inside a derived table with a regular join. You need to use LEFT JOIN LATERAL to make that work.
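A minimal sketch of the lateral form, using a hypothetical inline table submits(month, email) in place of the real data:

```sql
-- Hypothetical sample rows standing in for the extracted (month, email) pairs.
WITH submits(month, email) AS (
   VALUES ('2021-01', 'a@example.com')
        , ('2021-01', 'b@example.com')
        , ('2021-02', 'c@example.com')
)
SELECT m.month, s.email
FROM  (SELECT DISTINCT month FROM submits) m
LEFT   JOIN LATERAL (
   SELECT s2.email
   FROM   submits s2
   WHERE  s2.month = m.month   -- reference to the outer row, allowed because of LATERAL
   ORDER  BY random()
   LIMIT  10                   -- up to ten random emails per month
   ) s ON true
ORDER  BY m.month;
```

The subquery runs once per row of m, so each month gets its own random sample.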

I did end up finding a more elegant solution to my problem, based on a source from GitHub:
SELECT
month
,email
FROM
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM') AS month
,CASE
WHEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
) q
WHERE
RN <=10
ORDER BY month ASC

Is there a SQL code for cumulative count of SaaS customer over months?

I have a table with:
ID (client id), date_start (start of the SaaS subscription), date_end (either a date value or NULL).
I need a cumulative count of active clients month by month.
Any idea how to write that in Postgres and achieve this result?
I started from this, but I don't know how to proceed:
select
date_trunc('month', c.date_start)::date,
count(*)
from customer c
group by 1

Please check the following solution:
select
subscribed_date,
subscribed_customers,
unsubscribed_customers,
coalesce(subscribed_customers, 0) - coalesce(unsubscribed_customers, 0) cumulative
from (
select distinct
date_trunc('month', c.date_start)::date subscribed_date,
sum(1) over (order by date_trunc('month', c.date_start)) subscribed_customers
from customer c
order by subscribed_date
) subscribed
left join (
select distinct
date_trunc('month', c.date_end)::date unsubscribed_date,
sum(1) over (order by date_trunc('month', c.date_end)) unsubscribed_customers
from customer c
where date_end is not null
order by unsubscribed_date
) unsubscribed on subscribed.subscribed_date = unsubscribed.unsubscribed_date;

You have a table of customers with a start date and sometimes an end date. As you want to group by date, but there are two dates in the table, you need to split these first.
Then, you may have months where only customers came and others where only customers left. So, you'll want a full outer join of the two sets.
For a cumulative sum (also called a running total), use SUM OVER.
with came as
(
select date_trunc('month', date_start) as month, count(*) as cnt
from customer
group by date_trunc('month', date_start)
)
, went as
(
select date_trunc('month', date_end) as month, count(*) as cnt
from customer
where date_end is not null
group by date_trunc('month', date_end)
)
select
month,
came.cnt as cust_new,
went.cnt as cust_gone,
sum(coalesce(came.cnt, 0) - coalesce(went.cnt, 0)) over (order by month) as cust_active
from came full outer join went using (month)
order by month;
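The approach can be tried on a few hypothetical rows without creating a table. This sketch wraps the counts in COALESCE, because a full outer join leaves NULL on the side that has no rows for a given month, and NULL would otherwise propagate into the difference:

```sql
-- Hypothetical sample data: two customers start in January, one in February,
-- and customer 1 leaves in March.
WITH customer(id, date_start, date_end) AS (
   VALUES (1, date '2021-01-10', date '2021-03-05')
        , (2, date '2021-01-20', NULL)
        , (3, date '2021-02-15', NULL)
)
, came AS (
   SELECT date_trunc('month', date_start) AS month, count(*) AS cnt
   FROM   customer
   GROUP  BY 1
)
, went AS (
   SELECT date_trunc('month', date_end) AS month, count(*) AS cnt
   FROM   customer
   WHERE  date_end IS NOT NULL
   GROUP  BY 1
)
SELECT month::date
     , coalesce(came.cnt, 0) AS cust_new
     , coalesce(went.cnt, 0) AS cust_gone
     , sum(coalesce(came.cnt, 0) - coalesce(went.cnt, 0))
          OVER (ORDER BY month) AS cust_active   -- running total of net change
FROM   came
FULL   OUTER JOIN went USING (month)
ORDER  BY month;
```

With these sample rows, cust_active should come out as 2 for January, 3 for February, and 2 for March.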

Count distinct loop in sql

I am trying to pull unique active users before a date.
So specifically, I have a date range (let's say August - November) where I want to know the cumulative unique active users on or before a day within a month.
So, the pseudocode would look something like this:
SELECT COUNT(DISTINCT USERS) FROM USER_DB
WHERE
Month = [loop through months 8-11]
AND
DAY <= [day in loop of 1:31]
The output I desire is something like this

step-by-step demo: db<>fiddle
SELECT
mydate,
SUM( -- 3
COUNT(DISTINCT username) -- 1, 2
) OVER (ORDER BY mydate) -- 3
FROM t
GROUP BY mydate -- 2
1. GROUP BY your date and count the users.
2. Because you don't want to count ALL user accesses, but only one access per user and day, you need to add the DISTINCT.
3. This is a window function. It cumulatively sums up all the counts computed in the previous steps.
If you want to count unique users over ALL days (counting each user only on their first access), you can filter the users with a DISTINCT ON clause first:
demo: db<>fiddle
SELECT DISTINCT ON (username)
*
FROM t
ORDER BY username, mydate
This yields:
SELECT
mydate,
SUM(
COUNT(*)
) OVER (ORDER BY mydate)
FROM (
SELECT DISTINCT ON (username)
*
FROM t
ORDER BY username, mydate
) s
GROUP BY mydate
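To see the second variant on concrete data, here is a small sketch with hypothetical rows in place of the table t:

```sql
-- Hypothetical sample data: alice appears on both days, bob only on the first,
-- carol only on the second.
WITH t(username, mydate) AS (
   VALUES ('alice', date '2021-08-01')
        , ('alice', date '2021-08-01')
        , ('bob',   date '2021-08-01')
        , ('alice', date '2021-08-02')
        , ('carol', date '2021-08-02')
)
SELECT mydate
     , count(*) AS new_users                                 -- users first seen on this day
     , sum(count(*)) OVER (ORDER BY mydate) AS cumulative_users
FROM  (
   SELECT DISTINCT ON (username) username, mydate            -- first access per user
   FROM   t
   ORDER  BY username, mydate
   ) s
GROUP  BY mydate
ORDER  BY mydate;
```

For these rows the cumulative count should be 2 on 2021-08-01 and 3 on 2021-08-02, since alice is counted only on her first day.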

How to select corresponding record alongside aggregate function with having clause

Let's say I have an orders table with customer_id, order_total, and order_date columns. I'd like to build a report that shows all customers who haven't placed an order in the last 30 days, with a column for the total amount their last order was.
This gets all of the customers who should be on the report:
select customer, max(order_date), (select order_total from orders o2 where o2.customer = orders.customer order by order_date desc limit 1)
from orders
group by 1
having max(order_date) < NOW() - '30 days'::interval
Is there a better way to do this that doesn't require a subquery but instead uses a window function or other more efficient method in order to access the total amount from the most recent order? The techniques from How to select id with max date group by category in PostgreSQL? are related, but the extra having restriction seems to stop me from using something like DISTINCT ON.

demo:db<>fiddle
Solution with row_number window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)
SELECT
customer, order_date, order_total
FROM (
SELECT
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total,
row_number() OVER w as row_count
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
) s
WHERE row_count = 1 AND order_date < CURRENT_DATE - 30
Solution with DISTINCT ON (https://www.postgresql.org/docs/9.5/static/sql-select.html#SQL-DISTINCT):
SELECT
customer, order_date, order_total
FROM (
SELECT DISTINCT ON (customer)
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
ORDER BY customer, order_date DESC
) s
WHERE order_date < CURRENT_DATE - 30
Explanation:
In both solutions I am working with the first_value window function. The window is partitioned by customer. The rows within each customer's partition are ordered by date descending, which puts the latest row first (last_value does not always work as expected, because the default window frame ends at the current row). So it is possible to get the last order_date and the last order_total of this order.
The difference between the two solutions is the filtering. I showed both versions because sometimes one of them is significantly faster.
The window function style creates a row count within each partition, so every first row can be filtered later. This is done by adding a row_number window function. The benefit of this solution shows when you want the first two or three rows per group. You simply have to change the filter from WHERE row_count = 1 to WHERE row_count <= 2.
But if you want only one single row per group, you just need to ensure that the expected row is ordered to be the first row in its group. Then the DISTINCT ON clause can drop all following rows. DISTINCT ON (customer) gives the first (ordered) row per customer group.
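As a sketch of the "first two rows per group" case mentioned above, with a hypothetical inline orders table:

```sql
-- Hypothetical sample data: customer 'a' has three orders, customer 'b' has one.
WITH orders(customer, order_date, order_total) AS (
   VALUES ('a', date '2021-01-01', 10)
        , ('a', date '2021-02-01', 20)
        , ('a', date '2021-03-01', 30)
        , ('b', date '2021-01-15', 5)
)
SELECT customer, order_date, order_total
FROM  (
   SELECT *
        , row_number() OVER (PARTITION BY customer
                             ORDER BY order_date DESC) AS row_count
   FROM   orders
   ) s
WHERE  row_count <= 2            -- the two most recent orders per customer
ORDER  BY customer, order_date DESC;
```

This should return the March and February orders for customer 'a' and the single order for customer 'b'. DISTINCT ON cannot express this case directly, because it keeps exactly one row per group.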

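Try joining the table to itself:
select o1.customer, o1.order_date, o1.order_total
from orders o1
join (select customer, max(order_date) as last_order from orders group by customer) o2
on o2.customer = o1.customer and o2.last_order = o1.order_date
where o2.last_order < NOW() - '30 days'::interval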
A correlated subquery in the SELECT list can be a bad idea, because the database may execute it once for each row.
If you use Postgres, you can also try a CTE:
https://www.postgresql.org/docs/9.6/static/queries-with.html
WITH t as (
select customer, max(order_date) as last_order
from orders
group by customer
having max(order_date) < NOW() - '30 days'::interval
) select o1.customer, o1.order_date, o1.order_total
from orders o1
join t on t.customer = o1.customer and t.last_order = o1.order_date
