Select rows from joined tables with more than n occurrences - PostgreSQL

My problem is similar to MySQL: Select rows with more than one occurrence, but I am using PostgreSQL. I have a query like:
select d.user_id, d.recorded_at, d.glucose_value, d.unit
from diary as d
join (
select u.id
from health_user as u
join (
select distinct user_id
from care_connect
where clinic_id = 217
and role = 'user'
and status = 'active'
) as c
on u.id = c.user_id
where u.is_tester is false
) as cu
on d.user_id = cu.id
where d.created_at >= d.recorded_at
and d.recorded_at < current_date and d.recorded_at >= current_date - interval '30 days'
and d.glucose_value > 0
and (d.state = 'wakeup' or (d.state = 'before_meal' and d.meal_type = 'breakfast'))
The result looks like:
+---------+---------------------+---------------+--------+
| user_id | recorded_at | glucose_value | unit |
+---------+---------------------+---------------+--------+
| 12041 | 2018-06-26 01:10:12 | 100 | mg/dL |
| 12041 | 2018-06-30 02:10:11 | 90 | mg/dL |
| 12214 | 2018-06-25 12:40:13 | 10 | mmol/L |
| 12214 | 2018-06-26 12:41:13 | 12 | mmol/L |
| 12214 | 2018-06-29 00:21:14 | 11 | mmol/L |
| 12214 | 2018-06-29 12:59:32 | 10 | mmol/L |
+---------+---------------------+---------------+--------+
As you can see, that is already a long query with many conditions. Now I want to keep only the records from users who have at least four records (rows) in the result, so I tried:
select d.user_id, d.recorded_at, d.glucose_value, d.unit, count(d.*)
from diary as d
join (
select u.id
from health_user as u
join (
select distinct user_id
from care_connect
where clinic_id = 217
and role = 'user'
and status = 'active'
) as c
on u.id = c.user_id
where u.is_tester is false
) as cu
on d.user_id = cu.id
where d.created_at >= d.recorded_at
and d.recorded_at < current_date and d.recorded_at >= current_date - interval '30 days'
and d.glucose_value > 0
and (d.state = 'wakeup' or (d.state = 'before_meal' and d.meal_type = 'breakfast'))
group by d.user_id
having count(d.*) >= 4
My expected output is:
+---------+---------------------+---------------+--------+
| user_id | recorded_at | glucose_value | unit |
+---------+---------------------+---------------+--------+
| 12214 | 2018-06-25 12:40:13 | 10 | mmol/L |
| 12214 | 2018-06-26 12:41:13 | 12 | mmol/L |
| 12214 | 2018-06-29 00:21:14 | 11 | mmol/L |
| 12214 | 2018-06-29 12:59:32 | 10 | mmol/L |
+---------+---------------------+---------------+--------+
However, it throws an error saying that d.recorded_at must also appear in the group by clause, but that's not what I want. Besides, grouping by raw timestamps is not meaningful.
I know I could probably join another subquery, generated by the same query but selecting only d.user_id, count(d.*) on the first line, but the whole query would look crazy.
Would somebody please show me how to achieve this in a better way? Sorry I haven't included the table structures here, but I can edit and clarify things if needed.

Try this
Select user_id, recorded_at, glucose_value, unit
From (
select d.user_id, d.recorded_at, d.glucose_value, d.unit, count(1) over (partition by d.user_id) rcnt
from diary as d
join (
select u.id
from health_user as u
join (
select distinct user_id
from care_connect
where clinic_id = 217
and role = 'user'
and status = 'active'
) as c
on u.id = c.user_id
where u.is_tester is false
) as cu
on d.user_id = cu.id
where d.created_at >= d.recorded_at
and d.recorded_at < current_date and d.recorded_at >= current_date - interval '30 days'
and d.glucose_value > 0
and (d.state = 'wakeup' or (d.state = 'before_meal' and d.meal_type = 'breakfast'))
) x
Where rcnt >= 4
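The count(1) over (partition by d.user_id) window attaches the per-user row count to every row without collapsing them, so the outer Where rcnt >= 4 keeps whole rows only for users with at least four matching records. A minimal, self-contained illustration of the pattern on toy data (the demo table below is made up, not the real schema):
-- Toy data standing in for the filtered diary rows (hypothetical, for illustration only)
create temp table demo (user_id int, recorded_at timestamp);
insert into demo values
    (12041, '2018-06-26 01:10:12'), (12041, '2018-06-30 02:10:11'),
    (12214, '2018-06-25 12:40:13'), (12214, '2018-06-26 12:41:13'),
    (12214, '2018-06-29 00:21:14'), (12214, '2018-06-29 12:59:32');

select user_id, recorded_at
from (
    select user_id, recorded_at,
           count(*) over (partition by user_id) as rcnt  -- per-user row count repeated on every row
    from demo
) x
where rcnt >= 4;  -- only user 12214 (4 rows) survives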

Try this, using a with clause and an exists clause. Replace your_query with your actual query:
with original_query as ( your_query )
select * from original_query q1
where
exists( select q2.user_id from original_query q2 where q1.user_id = q2.user_id
group by q2.user_id
having count(q2.user_id) >= 4 )
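For completeness, with the query from the question pasted in as the CTE body, the whole thing would look roughly like this (same tables and filters as above):
with original_query as (
    select d.user_id, d.recorded_at, d.glucose_value, d.unit
    from diary as d
    join (
        select u.id
        from health_user as u
        join (
            select distinct user_id
            from care_connect
            where clinic_id = 217
              and role = 'user'
              and status = 'active'
        ) as c on u.id = c.user_id
        where u.is_tester is false
    ) as cu on d.user_id = cu.id
    where d.created_at >= d.recorded_at
      and d.recorded_at < current_date
      and d.recorded_at >= current_date - interval '30 days'
      and d.glucose_value > 0
      and (d.state = 'wakeup' or (d.state = 'before_meal' and d.meal_type = 'breakfast'))
)
select q1.*
from original_query q1
where exists (
    select 1
    from original_query q2
    where q2.user_id = q1.user_id
    group by q2.user_id
    having count(*) >= 4
);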

Related

Distinct Count Dates by timeframe

I am trying to find the daily count of frequent visitors from a very large data-set. Frequent visitors in this case are visitor IDs used on 2 distinct days in a rolling 3 day period.
My data set looks like the below:
ID | Date | Location | State | Brand |
1 | 2020-01-02 | A | CA | XYZ |
1 | 2020-01-03 | A | CA | BCA |
1 | 2020-01-04 | A | CA | XYZ |
1 | 2020-01-06 | A | CA | YQR |
1 | 2020-01-06 | A | WA | XYZ |
2 | 2020-01-02 | A | CA | XYZ |
2 | 2020-01-05 | A | CA | XYZ |
This is the result I am going for. The count in the Visits column is the count of distinct days from the Date column within the current day and the 2 days before it, for each ID. So for ID 1 on 2020-01-05, there were visits on the 3rd and 4th, so the count is 2.
Date | ID | Visits | Frequent Prior 3 Days
2020-01-01 |Null| Null | Null
2020-01-02 | 1 | 1 | No
2020-01-02 | 2 | 1 | No
2020-01-03 | 1 | 2 | Yes
2020-01-03 | 2 | 1 | No
2020-01-04 | 1 | 3 | Yes
2020-01-04 | 2 | 1 | No
2020-01-05 | 1 | 2 | Yes
2020-01-05 | 2 | 1 | No
2020-01-06 | 1 | 2 | Yes
2020-01-06 | 2 | 1 | No
2020-01-07 | 1 | 1 | No
2020-01-07 | 2 | 1 | No
2020-01-08 | 1 | 1 | No
2020-01-09 | 1 | null | Null
I originally tried to use the following line to get the result for the visits column, but ended up with 3 in every successive row from whichever date it first reached 3 for that ID.
count(ID) over (Partition by ID order by Date ASC rows between 3 preceding and current row) as visits
I've scoured the forum, but every somewhat similar question seems to involve counting the values rather than the dates, and I haven't been able to figure out how to tweak them to get what I need. Any help is much appreciated.
You can aggregate the dataset by user and date, then use window functions with a range frame to look back over the three preceding days.
You did not say which database you are running, and not all databases support window range frames or share the same syntax for interval literals. In standard SQL, you would write:
select
id,
date,
count(*) cnt_visits,
case
when sum(count(*)) over(
partition by id
order by date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from mytable
group by id, date
On the other hand, if you want a record for every user and every day (even when there is no visit), then it is a bit different. You can generate the dataset first, then bring in the table with a left join:
select
i.id,
d.date,
count(t.id) cnt_visits,
case
when sum(count(t.id)) over(
partition by i.id
order by d.date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from (select distinct id from mytable) i
cross join (select distinct date from mytable) d
left join mytable t
on t.date = d.date
and t.id = i.id
group by i.id, d.date
I would be inclined to approach this by expanding out the days and visitors using a cross join and then just window functions. Assuming you have all dates in the data:
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits,
(case when count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes' else 'No'
end) as is_frequent_visitor
from (select distinct id from t) i cross join
(select distinct date from t) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;
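A runnable, PostgreSQL-flavored sketch of that last approach against the sample rows from the question (the table name visits is made up for the demo; as the answer notes, any calendar dates missing from the data would need to be generated separately):
create temp table visits (id int, date date);
insert into visits values
    (1, '2020-01-02'), (1, '2020-01-03'), (1, '2020-01-04'),
    (1, '2020-01-06'), (1, '2020-01-06'),
    (2, '2020-01-02'), (2, '2020-01-05');

select i.id, d.date,
       count(t.id) over w as cnt_visits,
       case when count(t.id) over w >= 2 then 'Yes' else 'No' end as is_frequent_visitor
from (select distinct id from visits) i
cross join (select distinct date from visits) d
left join (select distinct id, date from visits) t
    on t.date = d.date and t.id = i.id
window w as (partition by i.id order by d.date rows between 2 preceding and current row)
order by d.date, i.id;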

A twist on a classic: Getting the most recent entries for an id then getting the highest number out of that group

I have a Redshift table that I'm accessing via Tableau:
+-----------+--------------+--------+----------+
| date | active_count | job_id | stage_id |
+-----------+--------------+--------+----------+
| 7/27/2020 | 2 | 10001 | 8 |
+-----------+--------------+--------+----------+
| 7/27/2020 | 140 | 10001 | 14 |
+-----------+--------------+--------+----------+
| 7/27/2020 | 20 | 10001 | 21 |
+-----------+--------------+--------+----------+
| 7/27/2020 | 1 | 10001 | 37 |
+-----------+--------------+--------+----------+
| 7/27/2020 | 0 | 10001 | 39 |
+-----------+--------------+--------+----------+
I want to run a query that selects the most recent rows per job_id and then returns the highest stage_id and corresponding job_id with an active_count > 0.
So I want my result to be:
+---------+--------+
| stage | job_id |
+---------+--------+
| 37 | 10001 |
+---------+--------+
I know that I can use this sort of thing to get the most recent entry for the job_id:
select
max(t1.stage_id),
t1.job_id
from
table1 as t1
left join table1 as t2
on t1.job_id = t2.job_id and
(t1.date < t2.date or
(t1.date = t2.date and
t1.job_id < t2.job_id))
where
t1.active_count > 0
group by
t1.job_id
but I'm not sure if this is efficient/actually working as intended. Is this the best way to go about this?
Unless your real criteria are more complex than your example, you can just use MAX() of each value you want.
SELECT t1.job_id
, MAX( t1.stage_id ) AS max_stage_id
, MAX( t1.active_count ) AS max_active_count
FROM table1 AS t1
WHERE t1.active_count > 0
GROUP BY t1.job_id
;
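If the "most recent rows per job_id" part does matter, one possible alternative (a sketch, not tested against your data) is to keep only each job's latest snapshot date before aggregating:
select job_id, max(stage_id) as stage
from (
    select job_id, stage_id, active_count, date,
           max(date) over (partition by job_id) as latest_date  -- most recent snapshot per job
    from table1
) s
where date = latest_date
  and active_count > 0
group by job_id;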
I ran this again and landed on the following code which works as intended:
select
max(t1.stage_id),
t1.job_id
from
stage_snapshots as t1
left join stage_snapshots as t2
on t1.job_id = t2.job_id and
(t1.date < t2.date or
(t1.date = t2.date and
t1.job_id < t2.job_id))
where
t1.active_count > 0 and
t2.date is null
group by
t1.job_id
This does what it's supposed to, but it turns out stage_id is not ordered in the way I thought it was, so I'll be going back to the drawing board.

Postgres join and distinct query

I have two tables:
user
id | name
-------------
1 | User1 |
2 | User2 |
3 | User3 |
4 | User4 |
A user can change their name at any moment.
And another table:
order
id |user_name | user_id | price | order_date
---------------------------------------------
1 | OldUser3| 3 | 5 | 2017-07-12 08:01:00.000000
2 | NewUser3| 3 | 6 | 2017-07-12 09:01:00.000000
3 | User1 | 1 | 8 | 2017-07-12 10:01:00.000000
4 | NewUser | | 10 | 2017-07-12 11:01:00.000000
5 | NewUser | | 100 | 2017-07-12 12:01:00.000000
user_name is copied from the user table at the moment the order is made, so if the user changes their name several times there can be different values.
user_id can be null if it is not a registered user.
I need a result table like this:
order
no |user_name | user_id | total_pr| count | last_order
---------------------------------------------
1 | NewUser3| 3 | 11 | 2 |2017-07-12 09:01:00.000000
2 | User1 | 1 | 8 | 1 |2017-07-12 10:01:00.000000
3 | NewUser | | 10 | 1 |2017-07-12 11:01:00.000000
4 | NewUser | | 100 | 1 |2017-07-12 12:01:00.000000
The user_name value must be taken from the order with the biggest order_date, and I need to be able to sort by any column.
And if user_id is null, users with the same name should be treated as different users.
I tried this:
SELECT order.user_id, order.user_name, SUM(price), COUNT(order.user_id), MAX(order_date)
FROM order, user
WHERE
order.order_date >= '2017-07-01 08:01:00.000000'
AND order.order_date <= '2017-07-15 08:01:00.000000'
GROUP BY user_id, user_name ORDER BY count ASC
but that's not all I need.
Try this
with users_cte (user_name,user_id,total_pr,count,last_order) as (
--Fetching data for members who are in users table
Select user_name,user_id,total_pr,count,last_order from (
SELECT o.user_name, o.user_id, row_number() over (partition by o.user_id order by order_date desc) rno
, SUM(price) over (partition by o.user_id) as total_pr, COUNT(o.user_id) over(partition by o.user_id) as count , MAX(order_date) over (partition by o.user_id) as last_order
FROM orders o
left join users u
on o.user_id = u.id
WHERE
u.id is not null
and o.order_date >= '2017-07-01 08:01:00.000000'
AND o.order_date <= '2017-07-15 08:01:00.000000'
) A Where A.rno = 1
union all
--Fetching data for new members
SELECT o.user_name,null as user_id
, SUM(price) as total_pr, COUNT(o.user_name), MAX(order_date)
FROM orders o
left join users u
on o.user_id = u.id
WHERE
u.id is null
and o.order_date >= '2017-07-01 08:01:00.000000'
AND o.order_date <= '2017-07-15 08:01:00.000000'
GROUP BY o.user_name
)
Select row_number() over(order by last_order) as no,* from users_cte
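In the first branch, the window functions compute each registered user's total price, order count and latest order date on every row, and row_number() ordered by order_date desc picks the user_name from the most recent order (rno = 1). The second branch handles orders with a null user_id with a plain group by user_name. union all combines the two sets, and the outer row_number() numbers the final rows by last_order.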
try:
SELECT order.user_id, order.user_name, SUM(price), COUNT(order.user_id), MAX(order_date)
FROM order
LEFT OUTER JOIN user on order.user_id = user.id
WHERE
order.order_date >= '2017-07-01 08:01:00.000000'
AND order.order_date <= '2017-07-15 08:01:00.000000'
GROUP BY user_id, user_name ORDER BY count ASC

Aggregate data per week

I'd like to aggregate data weekly according to a date and a value.
I have a table like this:
create table test (t_val integer, t_date date);
insert into test values(1,'2017-02-09'),(2,'2017-02-10'),(4,'2017-02-16');
This is the query:
WITH date_range AS (
SELECT MIN(t_date) as start_date,
MAX(t_date) as end_date
FROM test
)
SELECT
date_part('year', f.date) as date_year,
date_part('week', f.date) as date_week,
f.val
FROM generate_series( (SELECT start_date FROM date_range), (SELECT end_date FROM date_range), '7 day') d(date)
LEFT JOIN
(
SELECT t_val as val, t_date as date
FROM test
WHERE t_date >= (SELECT start_date FROM date_range)
AND t_date <= (SELECT end_date FROM date_range)
GROUP BY t_val, t_date
) f
ON f.date BETWEEN d.date AND (d.date + interval '7 day')
GROUP BY date_part('year', f.date),date_part('week', f.date), f.val;
I expect a result like this:
| Year | Week | Val |
| 2017 | 6 | 3 |
| 2017 | 7 | 4 |
But the query returns:
| Year | Week | Val |
| 2017 | 6 | 1 |
| 2017 | 6 | 2 |
| 2017 | 7 | 4 |
What is missing?
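Judging from the expected output (3 = 1 + 2 for week 6), the values apparently need to be summed per week rather than kept as separate group-by values. A minimal sketch of that idea, assuming plain date_part weeks are acceptable instead of 7-day buckets anchored at the first date:
select date_part('year', t_date) as date_year,
       date_part('week', t_date) as date_week,
       sum(t_val) as val  -- sum the values within each (year, week) bucket
from test
group by date_part('year', t_date), date_part('week', t_date)
order by date_year, date_week;
With the sample rows above this returns (2017, 6, 3) and (2017, 7, 4).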

Updating multiple rows with a certain value from the same table

So, I have the following table:
time | name | ID |
12:00:00| access | 1 |
12:05:00| select | null |
12:10:00| update | null |
12:15:00| insert | null |
12:20:00| out | null |
12:30:00| access | 2 |
12:35:00| select | null |
The table is bigger (approx. 1-1.5 million rows), and there will be IDs equal to 2, 3, 4, etc., with other rows in between.
The following should be the result:
time | name | ID |
12:00:00| access | 1 |
12:05:00| select | 1 |
12:10:00| update | 1 |
12:15:00| insert | 1 |
12:20:00| out | 1 |
12:30:00| access | 2 |
12:35:00| select | 2 |
What is the simplest method to update the rows without filling up the log? For example, one ID at a time.
You can do it with a subquery:
UPDATE YourTable t
SET t.ID = (SELECT TOP 1 s.ID
FROM YourTable s
WHERE s.time < t.time AND s.name = 'access'
ORDER BY s.time DESC)
WHERE t.name <> 'access'
An index on (ID, time, name) will help.
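For what it's worth, the same gap-fill can be sanity-checked read-only with window functions before running any update (a sketch, not from the original answer; it assumes only the 'access' rows carry a non-null ID, as in the sample data):
select time, name,
       max(ID) over (partition by grp) as ID     -- each group takes the ID of its 'access' row
from (
    select time, name, ID,
           count(ID) over (order by time) as grp -- running count of non-null IDs starts a new group at every 'access' row
    from YourTable
) s
order by time;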
You can do it using a CTE as below:
;WITH myCTE
AS ( SELECT time
, name
, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY time ) AS [rank]
, ID
FROM YourTable
)
UPDATE myCTE
SET myCTE.ID = myCTE.rank
SELECT *
FROM YourTable ORDER BY ID