Problem with using two different sum()'s on same column - tsql

I'm trying to get two different counts on the same column. The first count works with the constraints given, but the second does not count correctly. I have two tables, DailyFieldRecord and AB953. DailyFieldRecord contains DailyFieldRecordID and ActivityCodeID. AB953 contains DailyFieldRecordID, ItemID, and GroupID. Count1 should return the count for the DailyFieldRecordIDs that have ActivityCodeID = 387 and GroupID = 260 and that DON'T have an ItemID in (1302,1303,1305,1306). Count2 should return the count for the DailyFieldRecordIDs that have ActivityCodeID = 387 and GroupID = 260 and that DO have an ItemID in (1302,1303,1305,1306). In both cases I only want to count the GroupID = 260 rows of each DailyFieldRecordID that meets the constraints above.
DailyFieldRecord:
DailyFieldRecordID  ActivityCodeID
657                 387
888                 420
672                 387
AB953:
DailyFieldRecordID  ItemID  GroupID
657                 1305    210
657                 1333    260
657                 1335    260
657                 1302    210
657                 1334    260
657                 1111    111
888                 1302    210
888                 1336    260
672                 1327    260
672                 1334    260
672                 1335    260
672                 1322    260
672                 1222    420
Expected Output:
Count1: Count2:
4 3
Count1 is supposed to count:
672 1327 260
672 1334 260
672 1335 260
672 1322 260
Count2 is supposed to count:
657 1333 260
657 1335 260
657 1334 260
Current Count:
Count1: Count2:
4 6
SELECT sum(CASE WHEN ex=0 THEN 1 ELSE 0 END) AS COUNT1,sum(EX) AS COUNT2
FROM AB953 ab
JOIN DailyFieldRecord dfr
ON dfr.DailyFieldRecordID = ab.DailyFieldRecordID
JOIN ( SELECT AB1.DailyFieldRecordID,sum(CASE WHEN AB1.ItemID IN
(1302,1303,1305,1306) THEN 1 ELSE 0 END) AS EX
FROM AB953 AB1
GROUP BY AB1.DailyFieldRecordID) T
ON dfr.DailyFieldRecordID = T.DailyFieldRecordID
WHERE dfr.ActivityCodeID = 387
AND ab.GroupID = 260

First you need to identify all of the DailyFieldRecordIDs that have any of the specified ItemIDs, which is what the subquery here does. Then you can determine whether a record in the outer query belongs to Count1 or Count2 based on whether or not its DailyFieldRecordID exists in the result set of the subquery.
select sum(case when i.DailyFieldRecordID is null then 1 else 0 end) as Count1
, sum(case when i.DailyFieldRecordID is null then 0 else 1 end) as Count2
from AB953 as ab
inner join DailyFieldRecord as dfr on ab.DailyFieldRecordID = dfr.DailyFieldRecordID
left join (
select distinct a.DailyFieldRecordID
from AB953 as a
where a.ItemID in (1302, 1303, 1305, 1306)
) as i on ab.DailyFieldRecordID = i.DailyFieldRecordID
where dfr.ActivityCodeID = 387
and ab.GroupID = 260
Final Output:
+--------+--------+
| Count1 | Count2 |
+--------+--------+
| 4 | 3 |
+--------+--------+
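To check the query against the sample rows from the question, one option is to load them into an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for SQL Server here, and the table definitions (plain INTEGER columns, no keys) are assumptions, since the question does not show the real schema.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; column types are guessed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DailyFieldRecord (DailyFieldRecordID INTEGER, ActivityCodeID INTEGER);
    CREATE TABLE AB953 (DailyFieldRecordID INTEGER, ItemID INTEGER, GroupID INTEGER);
    INSERT INTO DailyFieldRecord VALUES (657, 387), (888, 420), (672, 387);
    INSERT INTO AB953 VALUES
        (657, 1305, 210), (657, 1333, 260), (657, 1335, 260), (657, 1302, 210),
        (657, 1334, 260), (657, 1111, 111), (888, 1302, 210), (888, 1336, 260),
        (672, 1327, 260), (672, 1334, 260), (672, 1335, 260), (672, 1322, 260),
        (672, 1222, 420);
""")

# Same shape as the answer's query: the subquery lists every record that has
# at least one of the flagged ItemIDs; the LEFT JOIN marks outer rows with it.
QUERY = """
    select sum(case when i.DailyFieldRecordID is null then 1 else 0 end) as Count1
         , sum(case when i.DailyFieldRecordID is null then 0 else 1 end) as Count2
    from AB953 as ab
    inner join DailyFieldRecord as dfr on ab.DailyFieldRecordID = dfr.DailyFieldRecordID
    left join (
        select distinct a.DailyFieldRecordID
        from AB953 as a
        where a.ItemID in (1302, 1303, 1305, 1306)
    ) as i on ab.DailyFieldRecordID = i.DailyFieldRecordID
    where dfr.ActivityCodeID = 387
      and ab.GroupID = 260
"""

count1, count2 = conn.execute(QUERY).fetchone()
print(count1, count2)
```

The original query over-counted because its derived table returned the number of flagged items per record (2 for record 657), and that number was then summed once per GroupID = 260 row.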


Address and smoothen noise in sensor data

I have sensor data as shown below. In the Data column there are 6 rows containing the value 45, sitting between preceding and following rows that contain the value 50. The requirement is to clean this data by imputing 50 (the previous value) in the New_data column. Moreover, the noise records (shown as 45 in the table) may vary both in how many rows they span and in their level.
Case 1 (sample data) :-
Sl.no  Timestamp         Data  New_data
1      1/1/2021 0:00:00  50    50
2      1/1/2021 0:15:00  50    50
3      1/1/2021 0:30:00  50    50
4      1/1/2021 0:45:00  50    50
5      1/1/2021 1:00:00  50    50
6      1/1/2021 1:15:00  50    50
7      1/1/2021 1:30:00  50    50
8      1/1/2021 1:45:00  50    50
9      1/1/2021 2:00:00  50    50
10     1/1/2021 2:15:00  50    50
11     1/1/2021 2:30:00  45    50
12     1/1/2021 2:45:00  45    50
13     1/1/2021 3:00:00  45    50
14     1/1/2021 3:15:00  45    50
15     1/1/2021 3:30:00  45    50
16     1/1/2021 3:45:00  45    50
17     1/1/2021 4:00:00  50    50
18     1/1/2021 4:15:00  50    50
19     1/1/2021 4:30:00  50    50
20     1/1/2021 4:45:00  50    50
21     1/1/2021 5:00:00  50    50
22     1/1/2021 5:15:00  50    50
23     1/1/2021 5:30:00  50    50
I think I need to group the data ordered by timestamp ascending (as below) and then apply a condition that checks group by group across a large sample: if group 1 has the same value as group 3, replace the group 2 values with the group 1 value.
Sl.no  Timestamp         Data  New_data  group
1      1/1/2021 0:00:00  50    50        1
2      1/1/2021 0:15:00  50    50        1
3      1/1/2021 0:30:00  50    50        1
4      1/1/2021 0:45:00  50    50        1
5      1/1/2021 1:00:00  50    50        1
6      1/1/2021 1:15:00  50    50        1
7      1/1/2021 1:30:00  50    50        1
8      1/1/2021 1:45:00  50    50        1
9      1/1/2021 2:00:00  50    50        1
10     1/1/2021 2:15:00  50    50        1
11     1/1/2021 2:30:00  45    50        2
12     1/1/2021 2:45:00  45    50        2
13     1/1/2021 3:00:00  45    50        2
14     1/1/2021 3:15:00  45    50        2
15     1/1/2021 3:30:00  45    50        2
16     1/1/2021 3:45:00  45    50        2
17     1/1/2021 4:00:00  50    50        3
18     1/1/2021 4:15:00  50    50        3
19     1/1/2021 4:30:00  50    50        3
20     1/1/2021 4:45:00  50    50        3
21     1/1/2021 5:00:00  50    50        3
22     1/1/2021 5:15:00  50    50        3
23     1/1/2021 5:30:00  50    50        3
There is also an exception to handle: if the next group shows a similar pattern, it should not be changed but kept as it is.
Example: if group 1 and group 3 are the same, impute group 2 with the group 1 value.
But if group 2 and group 4 are the same, do not change group 3; keep its data unchanged in New_data.
Case 2:-
Sl.no  Timestamp         Data  New_data  group
1      1/1/2021 0:00:00  50    50        1
2      1/1/2021 0:15:00  50    50        1
3      1/1/2021 0:30:00  50    50        1
4      1/1/2021 0:45:00  50    50        1
5      1/1/2021 1:00:00  50    50        1
6      1/1/2021 1:15:00  50    50        1
7      1/1/2021 1:30:00  50    50        1
8      1/1/2021 1:45:00  50    50        1
9      1/1/2021 2:00:00  50    50        1
10     1/1/2021 2:15:00  50    50        1
11     1/1/2021 2:30:00  45    50        2
12     1/1/2021 2:45:00  45    50        2
13     1/1/2021 3:00:00  45    50        2
14     1/1/2021 3:15:00  45    50        2
15     1/1/2021 3:30:00  45    50        2
16     1/1/2021 3:45:00  45    50        2
17     1/1/2021 4:00:00  50    50        3
18     1/1/2021 4:15:00  50    50        3
19     1/1/2021 4:30:00  50    50        3
20     1/1/2021 4:45:00  50    50        3
21     1/1/2021 5:00:00  50    50        3
22     1/1/2021 5:15:00  50    50        3
23     1/1/2021 5:30:00  50    50        3
24     1/1/2021 5:45:00  45    45        4
25     1/1/2021 6:00:00  45    45        4
26     1/1/2021 6:15:00  45    45        4
27     1/1/2021 6:30:00  45    45        4
28     1/1/2021 6:45:00  45    45        4
29     1/1/2021 7:00:00  45    45        4
30     1/1/2021 7:15:00  45    45        4
31     1/1/2021 7:30:00  45    45        4
I am looking for help coding this in PostgreSQL. Please feel free to suggest alternative approaches to the problem above.
The query below should meet the need.
The first query (list) identifies the rows where the data changes.
The second query (range_list) groups the rows between two successive changes of data and builds the corresponding timestamp range.
The third query (rec_list) is a recursive query that calculates new_data iteratively, in timestamp order.
The last query displays the expected result.
WITH RECURSIVE list As
(
SELECT no
, timestamp
, lag(data) OVER w AS previous
, data
, lead(data) OVER w AS next
, data IS DISTINCT FROM lag(data) OVER w AS first
, data IS DISTINCT FROM lead(data) OVER w AS last
FROM sensors
WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
, previous
, data
, lead(next) OVER w AS next
, first
FROM list
WHERE first OR last
WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
SELECT range
, previous
, data
, next
, data
, array[range]
FROM range_list
WHERE previous IS NULL
UNION ALL
SELECT c.range
, p.data
, c.data
, c.next
, CASE
WHEN p.new_data IS NOT DISTINCT FROM c.next
THEN p.data
ELSE c.data
END
, p.arr || c.range
FROM rec_list AS p
INNER JOIN range_list AS c
ON lower(c.range) = upper(p.range) + interval '15 minutes'
WHERE NOT array[c.range] <@ p.arr
AND first
)
SELECT s.*, r.new_data
FROM sensors AS s
INNER JOIN rec_list AS r
ON r.range @> s.timestamp
ORDER BY timestamp
see the test result in dbfiddle
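For readers who want to see the run-based rule outside SQL, here is a minimal Python sketch of the same idea. It is not the answer's query: the function name smooth and the plain-list input are hypothetical, and it assumes the readings are already sorted by timestamp. It mirrors the CASE expression of the recursive step: a run is replaced by the previous run's value when the previous run's new value equals the value of the run that follows.

```python
from itertools import groupby


def smooth(values):
    """Impute a run of readings when the runs on both sides of it agree.

    Assumes `values` is already ordered by timestamp (hypothetical input shape).
    """
    # Collapse consecutive equal readings into (value, length) runs.
    runs = [(value, len(list(group))) for value, group in groupby(values)]

    result = []
    prev_data = prev_new = None
    for index, (data, length) in enumerate(runs):
        next_data = runs[index + 1][0] if index + 1 < len(runs) else None
        # Replace this run only if it has a neighbour on both sides and the
        # previous run's (possibly imputed) value matches the next run's value.
        if index > 0 and next_data is not None and prev_new == next_data:
            new = prev_data
        else:
            new = data
        result.extend([new] * length)
        prev_data, prev_new = data, new
    return result


case1 = [50] * 10 + [45] * 6 + [50] * 7
case2 = case1 + [45] * 8
print(smooth(case1))
print(smooth(case2))
```

Because the last run has no following run, it is never replaced, which is what keeps group 4 at 45 in Case 2.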

Add unique rows for each group when similar group repeats after certain rows

Hi, can anyone help me get a unique group number?
I need to assign a unique number to each group, even when the same group repeats after some other groups.
I have following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
select *,
dense_rank()over(partition by id order by id,product)
from table
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6
Try the following
SELECT
id,version,product,startdate,enddate,
1+SUM(v)OVER(PARTITION BY id ORDER BY version) n
FROM
(
SELECT
*,
IIF(LAG(product)OVER(PARTITION BY id ORDER BY version)<>product,1,0) v
FROM TestTable
) q
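The query works by flagging each row whose product differs from the previous row's product and then taking a running sum of those flags. A minimal Python sketch of that logic, assuming a single id whose rows are already ordered by version (the function name group_numbers is hypothetical):

```python
from itertools import accumulate


def group_numbers(products):
    """Number consecutive runs of equal products, starting at 1.

    Assumes `products` holds one id's rows, already sorted by version.
    """
    if not products:
        return []
    # Flag is 1 where the product changes from the previous row; the first row
    # has no predecessor, so its flag is 0 (like LAG returning NULL in the query).
    flags = [0] + [1 if cur != prev else 0 for prev, cur in zip(products, products[1:])]
    # A running sum of the flags, plus 1, gives the group number.
    return [1 + total for total in accumulate(flags)]


products = [2443, 131, 131, 3009, 668, 668, 668, 131, 131, 550]
print(group_numbers(products))
```

Unlike dense_rank() ordered by product, this gives the second run of 131 its own number, because only adjacent rows are compared.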

Combine 2 data frames with different columns in spark

I have 2 dataframes:
df1 :
Id purchase_count purchase_sim
12 100 1500
13 1020 1300
14 1010 1100
20 1090 1400
21 1300 1600
df2:
Id click_count click_sim
12 1030 2500
13 1020 1300
24 1010 1100
30 1090 1400
31 1300 1600
I need to get the combined data frame with results as :
Id click_count click_sim purchase_count purchase_sim
12 1030 2500 100 1500
13 1020 1300 1020 1300
14 null null 1010 1100
24 1010 1100 null null
30 1090 1400 null null
31 1300 1600 null null
20 null null 1090 1400
21 null null 1300 1600
I can't use union because of different column names. Can some one suggest me a better way to do this ?
All you need is a full outer join on the Id column.
df1.join(df2, Seq("Id"), "full_outer")
// Since the Id column has the same name in both dataframes, a comparison like
// df1("Id") === df2("Id") would give you duplicate Id columns.
Please refer to the documentation below for future reference.
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
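To make the semantics of a full outer join concrete without a Spark session, here is a small plain-Python sketch over the sample rows. It is an illustration only, not Spark code: the helper full_outer_join is hypothetical and assumes each Id appears at most once per input.

```python
def full_outer_join(left, right, key, left_cols, right_cols):
    """Join two lists of dicts on `key`, keeping unmatched rows from both sides.

    Assumes `key` is unique within each input (hypothetical helper, not Spark).
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    joined = []
    for value in sorted(left_by_key.keys() | right_by_key.keys()):
        left_row = left_by_key.get(value, {})
        right_row = right_by_key.get(value, {})
        row = {key: value}
        # Columns from a missing side come back as None, like null in Spark.
        for col in right_cols:
            row[col] = right_row.get(col)
        for col in left_cols:
            row[col] = left_row.get(col)
        joined.append(row)
    return joined


df1 = [
    {"Id": 12, "purchase_count": 100, "purchase_sim": 1500},
    {"Id": 13, "purchase_count": 1020, "purchase_sim": 1300},
    {"Id": 14, "purchase_count": 1010, "purchase_sim": 1100},
    {"Id": 20, "purchase_count": 1090, "purchase_sim": 1400},
    {"Id": 21, "purchase_count": 1300, "purchase_sim": 1600},
]
df2 = [
    {"Id": 12, "click_count": 1030, "click_sim": 2500},
    {"Id": 13, "click_count": 1020, "click_sim": 1300},
    {"Id": 24, "click_count": 1010, "click_sim": 1100},
    {"Id": 30, "click_count": 1090, "click_sim": 1400},
    {"Id": 31, "click_count": 1300, "click_sim": 1600},
]

combined = full_outer_join(
    df1, df2, "Id",
    left_cols=["purchase_count", "purchase_sim"],
    right_cols=["click_count", "click_sim"],
)
for row in combined:
    print(row)
```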

Select only those records which are twice in postgres

select distinct(msg_id),sub_id from programs where sub_id IN
(
select sub_id from programs group by sub_id having count(sub_id) = 2 limit 5
)
sub_id means subscriberID.
The inner query returns the subscriber IDs that appear exactly twice in the programs table, and the main query returns the distinct msg_id values for those subscriber IDs.
This is the result it generates:
msg_id sub_id
------|--------|
112 | 313
111 | 222
113 | 313
115 | 112
116 | 112
117 | 101
118 | 115
119 | 115
110 | 222
I want it to be:
msg_id sub_id
------|--------|
112 | 313
111 | 222
113 | 313
115 | 112
116 | 112
118 | 115
119 | 115
110 | 222
117 | 101 (this row should not be in the output because its sub_id appears only once)
I want only those records whose sub_id appears twice.
I'm not sure, but are you just missing the second field in your in-list?
select distinct msg_id, sub_id, <presumably other fields>
from programs
where (sub_id, msg_id) IN
(
select sub_id, msg_id
from programs
group by sub_id, msg_id
having count(sub_id) = 2
)
If so, you can also do this with a windowing function:
with cte as (
select
msg_id, sub_id, <presumably other fields>,
count (*) over (partition by msg_id, sub_id) as cnt
from programs
)
select distinct
msg_id, sub_id, <presumably other fields>
from cte
where cnt = 2
try this
SELECT msg_id, MAX(sub_id)
FROM programs
GROUP BY msg_id
HAVING COUNT(sub_id) = 2 -- COUNT(sub_id) > 1 if you want all those that repeat more than once
ORDER BY msg_id
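As a cross-check of the intended result, here is a minimal Python sketch under one reading of the question: keep every row whose sub_id occurs exactly twice. The rows are the ones listed in the question's output (the full programs table is not shown), and the function name is hypothetical.

```python
from collections import Counter

# (msg_id, sub_id) pairs copied from the result listed in the question.
rows = [
    (112, 313), (111, 222), (113, 313), (115, 112), (116, 112),
    (117, 101), (118, 115), (119, 115), (110, 222),
]


def rows_with_sub_id_twice(rows):
    """Keep the rows whose sub_id appears exactly twice among `rows`."""
    counts = Counter(sub_id for _, sub_id in rows)
    return [(msg_id, sub_id) for msg_id, sub_id in rows if counts[sub_id] == 2]


kept = rows_with_sub_id_twice(rows)
print(kept)
```

Under this reading the count is taken per sub_id only, so grouping by both sub_id and msg_id would not give the same answer when each msg_id is unique.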

kdb getting float from integer division

I have a table
id, turnover, qty
and I want to query
select sum turnover, sum qty, (sum turnover) div (sum qty) by id from Table
However, the resulting value from the division seems to be an int and shows 0 (as the unit price is a lot smaller than 1). I tried to cast the results to a float, but that doesn't help:
select sum turnover, sum qty, `float$(`float$(sum turnover) div `float$(sum qty)) by id from Table
How can I get a float in return?
Also, as a side question: how can I name the column (the equivalent of SQL's select sum(x) as my_column_name ...)?
That's the expected output from div, which performs integer division. You should use % to divide numbers; it always returns a float.
q)200 div 8.5
22
q)200%8.5
23.52941
q)
Reference here;
Div: http://code.kx.com/q/ref/arith-integer/#div
%: http://code.kx.com/q/ref/arith-float/#divide
Edit:
Apologies, I forgot to address the rest of your question. In your example, you are calculating sum turnover and sum qty twice. You will want to avoid that if you're dealing with a lot of records.
How about this:
q)show trade:([] id:(`$"A",'string[til 10]);turnover:10?til 10; qty:10?100+til 200)
id turnover qty
---------------
A0 4 152
A1 4 238
A2 2 298
A3 2 268
A4 7 246
A5 2 252
A6 0 279
A7 5 286
A8 7 245
A9 5 191
q)update toverq:sumT%sumQ from select sumT:sum turnover,sumQ:sum qty by id from trade
id| sumT sumQ toverq
--| ---------------------
A0| 4 152 0.02631579
A1| 4 238 0.01680672
A2| 2 298 0.006711409
A3| 2 268 0.007462687
A4| 7 246 0.02845528
A5| 2 252 0.007936508
A6| 0 279 0
A7| 5 286 0.01748252
A8| 7 245 0.02857143
A9| 5 191 0.02617801
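The same two points (aggregate once, then use true division rather than integer division) can be illustrated in Python. This is only an analogy to the q code above, with a hypothetical helper name and three rows copied from the sample table; it assumes no id has a total qty of zero.

```python
from collections import defaultdict

# (id, turnover, qty) rows copied from the sample trade table above.
trades = [("A0", 4, 152), ("A1", 4, 238), ("A6", 0, 279)]


def turnover_per_qty(trades):
    """Sum turnover and qty per id once, then derive the ratio from the sums.

    Assumes every id has a non-zero total qty.
    """
    sums = defaultdict(lambda: [0, 0])
    for trade_id, turnover, qty in trades:
        sums[trade_id][0] += turnover
        sums[trade_id][1] += qty
    # "/" is true division and yields a float; "//" would floor to 0 here.
    return {
        trade_id: (sum_t, sum_q, sum_t / sum_q)
        for trade_id, (sum_t, sum_q) in sums.items()
    }


result = turnover_per_qty(trades)
print(result["A0"])
print(4 // 152)
```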