Select first value over threshold or max in group - postgresql

I have data that looks something like this:
time | value | replicate
-----|-------|----------
1    | 0.1   | 1
2    | 0.812 | 1
3    | 0.9   | 1
1    | 0.2   | 2
2    | 0.3   | 2
3    | 0.4   | 2
For each replicate group, I want to find the duration at which the values first cross a threshold value (let's say it's 0.8 in this case). However, if a group doesn't have any value greater than the threshold, I just want to return the duration of the max value.
Wanted output would be:
duration | replicate
---------|----------
1        | 1
2        | 2
Is this even possible to do in a single query in Postgres?

Something like:
select time, max(value)
from ...
where value > 0.8
group by replicate, time
union
select time, max(value)
from ...
where value < 0.8
group by replicate, time
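
For what it's worth, this looks doable in a single query with DISTINCT ON. A minimal sketch, assuming a hypothetical table named readings(time, value, replicate) and a threshold of 0.8:

-- Sketch: per replicate, prefer rows over the threshold (earliest time first);
-- if none exist, fall back to the row with the maximum value.
SELECT DISTINCT ON (replicate)
       replicate, time, value
FROM readings
ORDER BY replicate,
         (value > 0.8) DESC,                       -- threshold crossings sort first
         CASE WHEN value > 0.8 THEN time END ASC,  -- earliest crossing wins
         value DESC;                               -- otherwise take the max value

With the sample data this picks the time-2 row for replicate 1 (first value over 0.8) and the time-3 row for replicate 2 (its max value, 0.4); adjust the select list if duration is derived differently from time.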

Related

How to count the number of values in a column in a dataframe based on the values in the other dataframe

I have two dataframes. The first one is a raw dataframe, so its item_value column has all the item values. The other dataframe has columns named min, avg, and max, which hold the min, avg, and max values specified for the items in the first dataframe. I want to count the number of item values in the first dataframe based on the specified aggregate values in the second dataframe.
The first dataframe looks like this:
item_name | item_value
----------|-----------
A         | 1.4
A         | 2.1
B         | 3.0
A         | 2.8
B         | 4.5
B         | 1.1
The second dataframe looks like this:
item_name | min | avg | max
----------|-----|-----|----
A         | 1.1 | 2   | 2.7
B         | 2.1 | 3   | 4.0
I want to count the number of item values that are greater than the defined min, avg, and max values in the other dataframe. So the result I want is:
item_name | min | avg | max
----------|-----|-----|----
A         | 3   | 2   | 1
B         | 2   | 1   | 1
Any help would be much appreciated.
If you don't mind an SQL implementation, you can try:
# register both dataframes as temp views so they can be queried with SQL
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
sql = """
select df2.item_name,
       sum(case when df1.item_value > df2.min then 1 else 0 end) as min,
       sum(case when df1.item_value > df2.avg then 1 else 0 end) as avg,
       sum(case when df1.item_value > df2.max then 1 else 0 end) as max
from df2
join df1 on df2.item_name = df1.item_name
group by df2.item_name
"""
df = spark.sql(sql)
df.show()
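
For quick experimentation outside Spark, the same aggregation can be sketched in plain PostgreSQL with the question's sample rows inlined (the CTE names here simply mirror the temp views above):

with df1(item_name, item_value) as (
    values ('A', 1.4), ('A', 2.1), ('B', 3.0), ('A', 2.8), ('B', 4.5), ('B', 1.1)
),
df2(item_name, min, avg, max) as (
    values ('A', 1.1, 2.0, 2.7), ('B', 2.1, 3.0, 4.0)
)
select df2.item_name,
       sum(case when df1.item_value > df2.min then 1 else 0 end) as min,
       sum(case when df1.item_value > df2.avg then 1 else 0 end) as avg,
       sum(case when df1.item_value > df2.max then 1 else 0 end) as max
from df2
join df1 on df2.item_name = df1.item_name
group by df2.item_name;

This should return (A, 3, 2, 1) and (B, 2, 1, 1), matching the wanted result.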

indicator for increase over time

I'm trying to create an indicator for value increase over time within a group. In particular, I'm trying to flag a grp if its value ever increases by 50% over time.
I have raw data that looks like:
id  grp  value_dt  value
------------------------
1   1    11/20/20  1.4
1   1    11/21/20  0.8
1   1    11/24/20  2.8
1   1    11/25/20  2.5
1   2    11/29/20  1.5
1   2    12/1/20   1.6
2   1    11/21/20  0.8
2   2    11/26/20  0.9
2   3    12/1/20   0.9
2   3    12/3/20   2.8
You can see that for id = 1 and grp = 1, the value fluctuates, increasing and decreasing over time; but because it increased from 0.8 on 11/21/20 to 2.8 on 11/24/20 (greater than a 50% increase), I want to flag the whole grp 1. I want my output to look like:
id  grp  val_ind
----------------
1   1    1
1   2    0
2   1    0
2   2    0
2   3    1
I can only think of using min and max (something like below), which doesn't take the 'over time' factor into account...
select id,
grp,
min(value) as min_grp,
max(value) as max_grp,
(max_grp - min_grp) as val_diff,
case when val_diff >= min_grp * 1.5 then 1 else 0 end as val_ind
If anyone can offer their advice, I will greatly appreciate it!
I think you want to flag a group if at any point in time there is an increase of 50%. If so, here is how you can do it; you need a CTE and window functions:
WITH cte AS (
    SELECT *,
           -- compare each value with the next one in date order;
           -- a 50% increase means the next value is >= 1.5x the current
           CASE WHEN LEAD(value) OVER (PARTITION BY id, grp ORDER BY value_dt)
                     >= value * 1.5
                THEN 1 ELSE 0
           END AS val_ind
    FROM ttt
)
SELECT id, grp, MAX(val_ind) AS val_ind
FROM cte
GROUP BY id, grp
id | grp | val_ind
---|-----|--------
 1 |   1 |       1
 1 |   2 |       0
 2 |   1 |       0
 2 |   2 |       0
 2 |   3 |       1
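One caveat: LEAD only compares consecutive rows, so a 50% rise that accumulates across several rows (e.g. 1.0 to 1.2 to 1.6) would be missed. A sketch of a variant that instead compares each value to the running minimum of everything earlier in the group (same table ttt):

WITH cte AS (
    SELECT *,
           -- lowest value seen so far in this group, excluding the current row
           MIN(value) OVER (PARTITION BY id, grp
                            ORDER BY value_dt
                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prior_min
    FROM ttt
)
SELECT id, grp,
       MAX(CASE WHEN value >= prior_min * 1.5 THEN 1 ELSE 0 END) AS val_ind
FROM cte
GROUP BY id, grp

For the sample data this produces the same output as above (prior_min is NULL on each group's first row, so that row alone can never be flagged).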

calculating median to output similar results to over partition

I have a large table; here is a snippet of what it looks like:
name  class  brand  rating
12    d      1      3.8
22    d      1      3.9
33    a      2      1.1
12    d      1      2.3
12    a      3      3.4
44    b      1      9.8
22    c      2      3.0
I calculated the average of the rating by doing the below:
select avg(rating) over(partition by name,class,brand) as avg_rating from df
I'm aware that Postgres doesn't have a median function, but I would like to calculate the median for that column and have the output in a structure similar to that of my window function for the average. In the case of an even number of rows, I would like the average of the two middle values.
To get the median, you should use percentile_cont:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY rating) FROM df;
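Note that percentile_cont is an ordered-set aggregate and can't be used as a window function in Postgres, so to mirror the avg(...) over (partition by ...) output shape you can compute the per-group medians and join them back to the rows. A sketch against the same table df:

SELECT df.*, m.median_rating
FROM df
JOIN (
    -- percentile_cont(0.5) interpolates, so an even number of rows
    -- yields the average of the two middle values
    SELECT name, class, brand,
           percentile_cont(0.5) WITHIN GROUP (ORDER BY rating) AS median_rating
    FROM df
    GROUP BY name, class, brand
) m USING (name, class, brand);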

Remove duplicates based on only 1 column

My data is in the following format:
rep_id  user_id  other non-duplicated data
1       1        ...
1       2        ...
2       3        ...
3       4        ...
3       5        ...
I am trying to add a deduped_rep column of 0/1 flags such that, for each rep_id, only the first row across the associated users has a 1 and the rest have 0.
Expected result:
rep_id  user_id  deduped_rep
1       1        1
1       2        0
2       3        1
3       4        1
3       5        0
For reference, in Excel, I would use the following formula:
IF(SUMPRODUCT(($A$2:$A2=A2)*($A$2:$A2=A2))>1,0,1)
I know there is the FIXED() LoD calculation (http://kb.tableau.com/articles/howto/removing-duplicate-data-with-lod-calculations), but I only see use cases of it deduplicating based on another column, whereas my values are distinct.
Define a field first_reg_date_per_rep_id as
{ fixed rep_id : min(registration_date) }
Then define a field is_first_reg_date? as
registration_date = first_reg_date_per_rep_id
You can use that last Boolean field to distinguish the first record for each rep_id from later ones
Try this query, which flags only the first row in each rep_id partition (ordering by user_id makes the choice of first row deterministic):
select
    rep_id,
    user_id,
    case when row_number() over (partition by rep_id order by user_id) = 1
         then 1
         else 0
    end as deduped_rep
from my_table

Column of counts for time intervals

I want to get a table with a column that tracks how many times an id appears in a given week: if the id appears once it gets a 1, if it appears twice a 2, but if it appears more than two times a 0.
id date
a 2015-11-10
a 2015-11-25
a 2015-11-09
b 2015-11-10
b 2015-11-09
a 2015-11-05
b 2015-11-23
b 2015-11-28
b 2015-12-04
a 2015-11-10
b 2015-12-04
a 2015-12-07
a 2015-12-09
c 2015-11-30
a 2015-12-06
c 2015-10-31
c 2015-11-04
b 2015-12-01
a 2015-10-30
a 2015-12-14
the one week intervals are given as follows
1 - 2015-10-30 to 2015-11-05
2 - 2015-11-06 to 2015-11-12
3 - 2015-11-13 to 2015-11-19
4 - 2015-11-20 to 2015-11-26
5 - 2015-11-27 to 2015-12-03
6 - 2015-12-04 to 2015-12-10
7 - 2015-12-11 to 2015-12-17
The table should look like this.
id interval count
a 1 2
b 1 0
c 1 2
a 2 0
b 2 2
c 2 0
a 3 0
b 3 0
c 3 0
a 4 1
b 4 1
c 4 0
a 5 0
b 5 2
c 5 1
a 6 0
b 6 2
c 6 0
a 7 1
b 7 0
c 7 0
The interval column doesn't have to be there, I simply added it for clarity.
I am new to SQL and am unsure how to break the dates into intervals. The only thing I have is grouping by date and counting:
select id, date, count(*) as frequency
from data_1
group by id, date
having count(*) <= 2;
Looking at just the data you provided, this does the trick:
SELECT v.id,
i.interval,
coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM (VALUES('a'), ('b'), ('c')) v(id)
CROSS JOIN generate_series(1, 7) i(interval)
LEFT JOIN (
SELECT id, ((date - '2015-10-30')/7 + 1)::int AS interval, count(*) AS cnt
FROM my_table
GROUP BY 1, 2) sub USING (id, interval)
ORDER BY 2, 1;
A few words of explanation:
You have three id values which are here recreated with a VALUES clause. If you have many more or don't know beforehand which id's to enumerate, you can always replace the VALUES clause with a sub-query.
You provide a specific date range over 7 weeks. Since you might have weeks where a certain id is not present you need to generate a series of the interval values and CROSS JOIN that to the id values above. This yields the 21 rows you are looking for.
Then you calculate the occurrences of ids in intervals. You can subtract a date from another date which will give you the number of days in between. So subtract the date of the row from the earliest date, divide that by 7 to get the interval period, add 1 to make the interval 1-based and convert to integer. You can then convert counts of > 2 to 0 and NULL to 0 with a combination of CASE and coalesce().
The query outputs the interval too, otherwise you will have no clue what the data refers to. Optionally, you can turn this into a column which shows the date range of the interval.
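As a quick sanity check of the interval arithmetic above (dates taken from the question): 2015-11-10 is 11 days after 2015-10-30, 11/7 = 1 in integer division, and +1 gives interval 2, i.e. the week 2015-11-06 to 2015-11-12:

SELECT ((date '2015-11-10' - date '2015-10-30')/7 + 1)::int AS interval;  -- returns 2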
More flexible solution
If you have more ids and a larger date range, you can use the below version which first determines the distinct ids and the date range. Note that the interval is now 0-based to make calculations easier. Not that it matters much because instead of the interval number, the corresponding date range is displayed.
WITH mi AS (
SELECT min(date) AS min, ((max(date) - min(date))/7)::int AS intv FROM my_table)
SELECT v.id,
to_char((mi.min + i.intv * 7)::timestamp, 'YYYY-mm-dd') || ' - ' ||
to_char((mi.min + i.intv * 7 + 6)::timestamp, 'YYYY-mm-dd') AS period,
coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM mi,
(SELECT DISTINCT id FROM my_table) v
CROSS JOIN LATERAL generate_series(0, mi.intv) i(intv)
LEFT JOIN LATERAL (
SELECT id, ((date - mi.min)/7)::int AS intv, count(*) AS cnt
FROM my_table
GROUP BY 1, 2) sub USING (id, intv)
ORDER BY 2, 1;
Assuming you have a table of all users, this will do the trick.
select
    users.id,
    interval_table.id,
    case
        when count(log_table.user_id) > 2 then 0
        else count(log_table.user_id)
    end
from users
cross join interval_table
left outer join log_table
    on users.id = log_table.user_id
    and log_table.event_date >= interval_table.start_interval
    and log_table.event_date < interval_table.stop_interval
group by users.id, interval_table.id
order by interval_table.id, users.id
Check it out: http://sqlfiddle.com/#!15/1a822/21
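
The interval_table this answer assumes isn't shown; a sketch of how it could be generated on the fly in Postgres for the question's date range (column names chosen to match the join above):

select row_number() over (order by d) as id,
       d::date                        as start_interval,
       (d + interval '7 days')::date  as stop_interval  -- exclusive upper bound
from generate_series(timestamp '2015-10-30',
                     timestamp '2015-12-11',
                     interval '7 days') as s(d);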