PostgreSQL: averaging non-null values

This seems simple, but... I have data like this:
pid  score1  score2  score3
1    1       3       2
2    3       null    1
3    4       null    null
I want to average the three scores for each row, counting only the non-null values. Something like sum(score1+score2+score3)/3, except that the denominator needs to be the number of non-null values in the given row: 3 for pid 1, 2 for pid 2, and 1 for pid 3.
Is there something simple I'm missing here?

with t(pid, score1, score2, score3) as (
    values (1,1,3,2), (2,3,null,1), (3,4,null,null)
)
select
    (sum(score1) + sum(score2) + sum(score3))::numeric /
    (count(score1) + count(score2) + count(score3))
        as avg,
    avg(coalesce(score1, 0) + coalesce(score2, 0) + coalesce(score3, 0))
        as avg2
from t;
        avg         |        avg2
--------------------+--------------------
 2.3333333333333333 | 4.6666666666666667
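The query above returns a single average over the whole table, while the question asks for one average per row. As a rough sketch of the per-row logic (in Python rather than SQL, with None standing in for NULL and the sample rows copied from the question), the denominator is simply the number of non-null scores in that row. The names rows, row_average and per_row are made up for this illustration.

```python
# Rows mirror the question's sample data: (pid, score1, score2, score3).
# None stands in for SQL NULL.
rows = [
    (1, 1, 3, 2),
    (2, 3, None, 1),
    (3, 4, None, None),
]


def row_average(scores):
    """Average only the non-null scores; return None if every score is null."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present) if present else None


# One average per pid, each with its own denominator.
per_row = {pid: row_average(scores) for pid, *scores in rows}
print(per_row)
```

In SQL the same idea would mean dividing the coalesced sum by a count of the non-null columns for each row, instead of aggregating over all rows.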

Related

Get count of values in different subgroups

I need to delete the rows in a dataset where speed equals zero for more than N consecutive rows (let's assume N is 2).
The structure of the table demo looks like this:
id  car  speed  time
1   foo  0      1
2   foo  0      2
3   foo  0      3
4   foo  1      4
5   foo  1      5
6   foo  0      6
7   bar  0      1
8   bar  0      2
9   bar  5      3
10  bar  5      4
11  bar  5      5
12  bar  5      6
Then I hope to generate a table like the one below using a window function:
id  car  speed  time  lasting
1   foo  0      1     3
2   foo  0      2     3
3   foo  0      3     3
4   foo  1      4     2
5   foo  1      5     2
6   foo  0      6     1
7   bar  0      1     2
8   bar  0      2     2
9   bar  5      3     4
10  bar  5      4     4
11  bar  5      5     4
12  bar  5      6     4
Then I can easily exclude those rows with WHERE NOT (speed = 0 AND lasting > 2).
Here is the code I tried. It didn't return the values I expected, and I guess the nested FROM (SELECT ... FROM (SELECT ... is not the best way to solve the problem:
SELECT g3.*, count(id) OVER (PARTITION BY car, cumsum ORDER BY id) as num
FROM (SELECT g2.*, sum(grp2) OVER (PARTITION BY car ORDER BY id) AS cumsum
      FROM (SELECT g1.*, (CASE ne0 WHEN 0 THEN 0 ELSE 1 END) AS grp2
            FROM (SELECT g.*, speed - lag(speed, 1, 0) OVER (PARTITION BY car) AS ne0
                  FROM (SELECT *, row_number() OVER (PARTITION BY car) AS grp FROM demo) g ) g1 ) g2 ) g3
ORDER BY id;
You can use the LAG() window function to compare each row's speed with the previous row's, and the SUM() window function to number the groups of consecutive equal values.
Then, with the COUNT() window function, you can count the rows in each group and filter out the rows with speed 0 that belong to groups of more than 2 rows:
SELECT id, car, speed, time
FROM (
    SELECT *, COUNT(*) OVER (PARTITION BY car, grp) counter
    FROM (
        SELECT *, SUM(flag::int) OVER (PARTITION BY car ORDER BY time) grp
        FROM (
            SELECT *, speed <> LAG(speed, 1, speed - 1) OVER (PARTITION BY car ORDER BY time) flag
            FROM demo
        ) t
    ) t
) t
WHERE speed <> 0 OR counter <= 2
ORDER BY id;
See the demo.
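To make the grouping step concrete, here is a small Python sketch of the same idea: split each car's rows, ordered by time, into runs of equal speed, then drop the zero-speed runs that are longer than N. It is only an illustration of the logic, not a replacement for the SQL. The rows are copied from the question, and the function name drop_long_zero_runs and the parameter max_run are made up for this example.

```python
from itertools import groupby

# (id, car, speed, time) rows copied from the question's demo table.
demo = [
    (1, "foo", 0, 1), (2, "foo", 0, 2), (3, "foo", 0, 3),
    (4, "foo", 1, 4), (5, "foo", 1, 5), (6, "foo", 0, 6),
    (7, "bar", 0, 1), (8, "bar", 0, 2), (9, "bar", 5, 3),
    (10, "bar", 5, 4), (11, "bar", 5, 5), (12, "bar", 5, 6),
]


def drop_long_zero_runs(rows, max_run=2):
    """Drop zero-speed rows that belong to a run longer than max_run."""
    kept = []
    ordered = sorted(rows, key=lambda r: (r[1], r[3]))  # by car, then time
    # groupby opens a new group whenever (car, speed) changes,
    # so each group is one run of consecutive equal speeds for one car.
    for (car, speed), run in groupby(ordered, key=lambda r: (r[1], r[2])):
        run = list(run)
        if speed != 0 or len(run) <= max_run:
            kept.extend(run)
    return sorted(kept)  # back to id order


kept_ids = [r[0] for r in drop_long_zero_runs(demo)]
print(kept_ids)
```

Rows 1 to 3 form a zero-speed run of length 3 and are removed; rows 7 and 8 form a run of length 2 and are kept.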

indicator for increase over time

I'm trying to create an indicator for a value increase over time within a group. In particular, I want to flag a grp if its value ever increases by 50% over time.
I have raw data that looks like this:
id grp value_dt value
--------------------------------
1 1 11/20/20 1.4
1 1 11/21/20 0.8
1 1 11/24/20 2.8
1 1 11/25/20 2.5
1 2 11/29/20 1.5
1 2 12/1/20 1.6
2 1 11/21/20 0.8
2 2 11/26/20 0.9
2 3 12/1/20 0.9
2 3 12/3/20 2.8
You can see that for id = 1 and grp = 1, the value fluctuates over time, but because it increased from 0.8 to 2.8 between 11/21/20 and 11/24/20 (an increase of more than 50%), I want to flag the whole grp 1. I want my output to look like:
id grp val_ind
-----------------------
1 1 1
1 2 0
2 1 0
2 2 0
2 3 1
I can only think of using min and max (something like below), which doesn't take the "over time" aspect into account:
select id,
grp,
min(value) as min_grp,
max(value) as max_grp,
(max_grp - min_grp) as val_diff,
case when val_diff >= min_grp * 1.5 then 1 else 0 end as val_ind
If anyone can offer their advice, I will greatly appreciate it!
I think you want to flag a group if, at any point in time, the value increases by 50% from one row to the next. If so, here is how you can do it with a CTE and window functions:
WITH cte AS (
    SELECT *
         , CASE WHEN COALESCE(LEAD(value) OVER (PARTITION BY id, grp ORDER BY value_dt), 0) >= value * 1.5 THEN 1 ELSE 0 END val_ind
    FROM ttt
)
SELECT id, grp, MAX(val_ind) val_ind
FROM cte
GROUP BY id, grp
id | grp | val_ind
-: | --: | ------:
1 | 1 | 1
1 | 2 | 0
2 | 1 | 0
2 | 2 | 0
2 | 3 | 1
db<>fiddle here
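The comparison of each row with the next one can be sketched in plain Python. This is only an illustration under two assumptions: the rows are already in value_dt order within each (id, grp), and "a 50% increase" means the next value is at least 1.5 times the current one. The names raw, flag_groups and factor are made up for this example.

```python
from collections import defaultdict

# (id, grp, value) rows from the question, already in value_dt order
# within each (id, grp) pair.
raw = [
    (1, 1, 1.4), (1, 1, 0.8), (1, 1, 2.8), (1, 1, 2.5),
    (1, 2, 1.5), (1, 2, 1.6),
    (2, 1, 0.8), (2, 2, 0.9),
    (2, 3, 0.9), (2, 3, 2.8),
]


def flag_groups(rows, factor=1.5):
    """Return {(id, grp): 1 or 0}, 1 if any value is followed by one >= factor times it."""
    series = defaultdict(list)
    for id_, grp, value in rows:
        series[(id_, grp)].append(value)
    return {
        # zip(vals, vals[1:]) pairs each value with its successor, like LEAD().
        key: int(any(nxt >= cur * factor for cur, nxt in zip(vals, vals[1:])))
        for key, vals in series.items()
    }


flags = flag_groups(raw)
print(flags)
```

Only consecutive rows are compared here. If a slow rise spread over several rows should also count, each value would need to be compared with the running minimum before it instead.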

Average after fixed interval and Group By in SQL

Is it possible to average over fixed intervals, grouped by one column, in MSSQL?
Suppose I have a table A as follows:
NAME Interval Data1 Data2
1 0.01 1 4
1 0.05 4 2
1 0.09 7 6
1 0.11 1 2
1 0.15 7 6
1 0.18 3 1
1 0.19 2 5
2 0.209 9 0
I want the output grouped by Name, with an average computed for every interval of 10.
So, for example:
Name - 1
Interval Start - 0
Interval End - 10
Data 1 Avg - 4 [(1 + 4 + 7) / 3]
Data 2 Avg - 4 [(4 + 2 + 6) / 3]
AND
Name - 1
Interval Start - 10
Interval End - 20
Data 1 Avg - 3.25 [(1 + 7 + 3 + 2) / 4]
Data 2 Avg - 3.50 [(2 + 6 + 1 + 5) / 4]
So I want the output as below. The interval per "Name" column is different.
Name Interval-Start Interval-End DataAvg1 DataAvg2
1 0 10 4 4
1 10 20 3.25 3.50
2 0 10 0 0
2 10 20 0 0
2 20 30 9 0
I used the query below, but can't figure out the logic per interval.
SELECT Name, Interval, AVG(Data1) AS Data1Avg, AVG(Data2) AS Data2Avg
FROM TableA
GROUP BY Name;
Can someone please help me with it.
Using a cursor and temp tables:
--drop table dbo.#result
--drop table dbo.#steps
CREATE TABLE dbo.#result
(
    [Name] varchar(50),
    [Interval-Start] float,
    [Interval-End] float,
    [DataAvg1] float,
    [DataAvg2] float
)
CREATE TABLE dbo.#steps
(
    [IntervalStart] float,
    [IntervalEnd] float
)
declare @min int, @max int, @step float
DECLARE @Name varchar(50), @IntervalStart float, @IntervalEnd float;
set @min = 0
set @max = 1
set @step = 0.1
-- build the list of intervals between @min and @max
insert into #steps
select @min + Number * @step IntervalStart, @min + Number * @step + @step IntervalEnd
from master..spt_values
where type = 'P' and number between 0 and (@max - @min) / @step
-- one cursor row per (Name, interval) combination
DECLARE _cursor CURSOR FOR
SELECT [Name], [IntervalStart], [IntervalEnd] FROM
    (select [Name] from [TableA] Group by [Name]) t
    INNER JOIN #steps on 1=1
OPEN _cursor;
FETCH NEXT FROM _cursor
INTO @Name, @IntervalStart, @IntervalEnd;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- half-open range, so a value on a boundary is counted in one interval only
    insert into dbo.#result
    select @Name, @IntervalStart, @IntervalEnd, AVG(CAST(Data1 as FLOAT)), AVG(CAST(Data2 as FLOAT))
    FROM [TableA]
    where [NAME] = @Name and Interval >= @IntervalStart and Interval < @IntervalEnd
    FETCH NEXT FROM _cursor
    INTO @Name, @IntervalStart, @IntervalEnd;
END
CLOSE _cursor;
DEALLOCATE _cursor;
select * from dbo.#result
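The bucketing itself does not need a cursor: each row can be assigned to an interval by integer division, and the averages computed per (Name, interval). The Python sketch below illustrates that logic under one assumption taken from the question's example, namely that Interval values are scaled by 100 so that 0.01 to 0.09 fall in the 0 to 10 bucket. The names table_a, bucket_averages and width are made up for this example.

```python
import math
from collections import defaultdict

# (Name, Interval, Data1, Data2) rows copied from the question's table A.
table_a = [
    (1, 0.01, 1, 4), (1, 0.05, 4, 2), (1, 0.09, 7, 6),
    (1, 0.11, 1, 2), (1, 0.15, 7, 6), (1, 0.18, 3, 1),
    (1, 0.19, 2, 5), (2, 0.209, 9, 0),
]


def bucket_averages(rows, width=10):
    """Average Data1 and Data2 per (Name, interval bucket of the given width)."""
    sums = defaultdict(lambda: [0, 0, 0])  # Data1 total, Data2 total, row count
    for name, interval, data1, data2 in rows:
        # Assumption: Interval is scaled by 100 before bucketing (0.09 -> 9).
        start = math.floor(interval * 100 / width) * width
        acc = sums[(name, start)]
        acc[0] += data1
        acc[1] += data2
        acc[2] += 1
    return {
        (name, start, start + width): (d1 / n, d2 / n)
        for (name, start), (d1, d2, n) in sorted(sums.items())
    }


averages = bucket_averages(table_a)
for key, value in averages.items():
    print(key, value)
```

Unlike the desired output in the question, this sketch only returns buckets that contain at least one row. The empty buckets for Name 2 would have to be added separately, for example by joining against a generated list of intervals as the #steps table does.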

Recursive Cumulative Sum up to a certain value Postgres

I have my data that looks like this:
user_id touchpoint_number days_difference
1 1 5
1 2 20
1 3 25
1 4 10
2 1 2
2 2 30
2 3 4
I would like to add a column with a cumulative sum of days_difference, partitioned by user_id, that resets and starts counting from 0 whenever the value reaches 30. I have been trying, but I couldn't figure out how to do it in PostgreSQL, because it has to be recursive.
The outcome I would like to have would be something like:
user_id touchpoint_number days_difference cum_sum_upto30
1 1 5 5
1 2 20 25
1 3 25 0 --- new count all over again
1 4 10 10
2 1 2 2
2 2 30 0 --- new count all over again
2 3 4 4
Do you have any cool ideas how this could be done?
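This should do what you want (t stands for your table; PostgreSQL needs the RECURSIVE keyword for a CTE that refers to itself):
with recursive cte as (
    select t.user_id, t.touchpoint_number, t.days_difference, t.days_difference as cum_sum_upto30
    from t
    where t.touchpoint_number = 1
    union all
    select t.user_id, t.touchpoint_number, t.days_difference,
           (case when t.days_difference + cte.cum_sum_upto30 > 30 then 0 else t.days_difference + cte.cum_sum_upto30 end)
    from t join
         cte
         on t.touchpoint_number = cte.touchpoint_number + 1 and t.user_id = cte.user_id
)
select *
from cte
order by user_id, touchpoint_number;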
Here is a rextester.
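The reason this needs recursion in SQL is that each row's result depends on the previous row's result, not just on the previous row's input. A short Python loop shows the same step-by-step logic. It is a sketch only: the rows are copied from the question, the reset rule (back to 0 when the running total exceeds 30) follows the query above, and the names touches, capped_running_sum and cap are made up for this example.

```python
from itertools import groupby

# (user_id, touchpoint_number, days_difference) rows from the question.
touches = [
    (1, 1, 5), (1, 2, 20), (1, 3, 25), (1, 4, 10),
    (2, 1, 2), (2, 2, 30), (2, 3, 4),
]


def capped_running_sum(rows, cap=30):
    """Running sum of days_difference per user, reset to 0 when it exceeds cap."""
    out = []
    ordered = sorted(rows)  # by user_id, then touchpoint_number
    for user_id, group in groupby(ordered, key=lambda r: r[0]):
        total = 0
        for _, touchpoint, days in group:
            total += days
            if total > cap:
                total = 0  # start counting all over again
            out.append((user_id, touchpoint, days, total))
    return out


result = capped_running_sum(touches)
for row in result:
    print(row)
```

The last element of each tuple corresponds to the cum_sum_upto30 column in the desired output.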

Redshift - Get a value from one column A for each ID in the grouping ID column B based on max value in another column C

I have a SQL problem (on Redshift): for each id in column id, I need to find the value of column index on the row with the highest final_score, and put that value in a new column fav_index. score2 is the score1 of the next index; for example, for id = abc1, index = 0 and score1 = 10, score2 is the value of score1 where index = 1. final_score is score2 minus score1.
It's easier to see in the table score below, which is the result of a SQL query shown further down.
id index score1 score2 final_score
abc1 0 10 20 10
abc1 1 20 45 25
abc1 2 45 (null) (null)
abc2 0 5 10 5
abc2 1 10 (null) (null)
abc3 0 50 30 -20
abc3 1 30 (null) (null)
So, the resulting table containing column fav_index should look like this:
id index score1 score2 final_score fav_index
abc1 0 10 20 10 0
abc1 1 20 45 25 1
abc1 2 45 (null) (null) 0
abc2 0 5 10 5 0
abc2 1 10 (null) (null) 0
abc3 0 50 30 -20 0
abc3 1 30 (null) (null) 0
Below is the script to generate table score from table story:
select
    m.id,
    m.index,
    max(m.max) as score1,
    fmt.score2,
    round(fmt.score2 - max(m.max), 1) as final_score
from
    (select
        sv.id,
        case when sv.story_number % 2 = 0 then cast(sv.story_number / 2 - 1 as int) else cast(floor(sv.story_number/2) as int) end as index,
        max(sv.score1)
    from
        story as sv
    group by
        sv.id,
        index,
        sv.score1
    order by
        sv.id,
        index
    ) as m
left join
    (select
        sv.id,
        case when sv.story_number % 2 = 0 then cast(sv.story_number / 2 - 1 as int) else cast(floor(sv.story_number/2) as int) end as index,
        max(score1) as score2
    from
        story as sv
    group by
        id,
        index
    ) as fmt
on
    m.id = fmt.id
    and
    m.index = fmt.index - 1
group by
    m.id,
    m.index,
    fmt.score2
Table story is as below:
id story_number score1
abc1 1 10
abc1 2 10
abc1 3 20
abc1 4 20
abc1 5 45
abc1 6 45
The only solution I can think of is to do something like,
select id, max(final_score) from score group by id
and then join it back to the long script above (which was used to generate table score). I really want to avoid writing such a long script to get just 1 extra column of information that I need.
Is there a better way to do this?
Thank you!
Update: answer in mysql is also accepted. thanks!
After spending more hours on this and asking around, I finally figured out a solution by referring to the PostgreSQL window function documentation: https://www.postgresql.org/docs/9.1/static/tutorial-window.html
I basically added two select statements at the top and one where clause at the very bottom. The where clause takes care of the rows where final_score is null, because otherwise the rank() function would rank them as 1.
My code then becomes:
select
    id, index, final_score, rank, case when rank = 1 then index else null end as fav_index
from
    (select
        id, index, final_score, rank() over (partition by id order by final_score desc)
    from
        (select
            m.id,
            m.index,
            max(m.max) as score1,
            fmt.score2,
            round(fmt.score2 - max(m.max), 1) as final_score
        from
            (select
                sv.id,
                case when sv.story_number % 2 = 0 then cast(sv.story_number / 2 - 1 as int) else cast(floor(sv.story_number/2) as int) end as index,
                max(sv.score1)
            from
                story as sv
            group by
                sv.id,
                index,
                sv.score1
            order by
                sv.id,
                index
            ) as m
        left join
            (select
                sv.id,
                case when sv.story_number % 2 = 0 then cast(sv.story_number / 2 - 1 as int) else cast(floor(sv.story_number/2) as int) end as index,
                max(score1) as score2
            from
                story as sv
            group by
                id,
                index
            ) as fmt
        on
            m.id = fmt.id
            and
            m.index = fmt.index - 1
        group by
            m.id,
            m.index,
            fmt.score2)
    where
        final_score is not null)
And the result is as follows:
id index final_score rank fav_index
abc1 0 10 2 (null)
abc1 1 25 1 1
abc2 0 5 1 0
abc3 0 -20 1 0
The result is slightly different from what I stated in the question, but the fav_index for each id is identified, and that is what I really needed. Hope this helps someone. Cheers
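For readers who only need the per-id lookup, the rank-and-filter step boils down to "keep the index of the row with the highest non-null final_score for each id". The Python sketch below illustrates just that step, using the rows of the score table from the question. The names score and favourite_index are made up for this example.

```python
# (id, index, final_score) rows from the question's score table.
# None stands in for (null).
score = [
    ("abc1", 0, 10), ("abc1", 1, 25), ("abc1", 2, None),
    ("abc2", 0, 5), ("abc2", 1, None),
    ("abc3", 0, -20), ("abc3", 1, None),
]


def favourite_index(rows):
    """Return {id: index of the row with the highest non-null final_score}."""
    best = {}
    for id_, index, final_score in rows:
        if final_score is None:
            continue  # same role as the "final_score is not null" filter
        if id_ not in best or final_score > best[id_][1]:
            best[id_] = (index, final_score)
    return {id_: index for id_, (index, _) in best.items()}


fav = favourite_index(score)
print(fav)
```

One difference from rank(): when two rows of the same id tie for the highest final_score, rank() gives both rank 1, while this sketch keeps only the first one it sees.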