Add unique rows for each group when similar group repeats after certain rows - tsql

Hi, can anyone help me get a unique group number?
I need to assign a unique number to each group of consecutive rows, even when the same group (product) repeats after some other groups.
I have the following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
select *,
dense_rank()over(partition by id order by id,product)
from table
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6

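Try the following:
SELECT
    id, version, product, startdate, enddate,
    -- running total of the change flags gives the group number (starting at 1)
    1 + SUM(v) OVER (PARTITION BY id ORDER BY version) AS n
FROM
(
    SELECT
        *,
        -- 1 when the product differs from the previous row's product, else 0
        -- (the first row has no previous row, so LAG is NULL and the flag is 0)
        IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) AS v
    FROM TestTable
) q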
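The idea behind the query (flag each change of product, then take a running total of the flags) can be sketched outside SQL as well. This is a minimal Python sketch, not the answerer's code; the function name `number_islands` and the tuple layout `(id, version, product)` are assumptions made for illustration.

```python
def number_islands(rows):
    """rows: iterable of (id, version, product) tuples.

    Returns (id, version, product, n) tuples, where n increases by one
    every time the product changes within the same id.
    """
    result = []
    last_id = last_product = None
    n = 0
    # Same ordering as PARTITION BY id ORDER BY version
    for id_, version, product in sorted(rows, key=lambda r: (r[0], r[1])):
        if id_ != last_id:
            n = 1            # new partition: numbering restarts
        elif product != last_product:
            n += 1           # product changed: start a new group
        last_id, last_product = id_, product
        result.append((id_, version, product, n))
    return result


products = [2443, 131, 131, 3009, 668, 668, 668, 131, 131, 550]
rows = [(123, version, product) for version, product in enumerate(products)]
numbered = number_islands(rows)
print([n for *_, n in numbered])
```

With the sample data from the question, the group numbers come out as 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, which matches the expected count column.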

Get distinct values in Pyspark and if duplicate value then should be placed in another column

Input Table:
prod | acct | acctno | newcinsfx
John | A01 | 1 | 89
John | A01 | 2 | 90
John | A01 | 2 | 92
Mary | A02 | 1 | 92
Mary | A02 | 3 | 81
Desired output table:
prod | acct | newcinsfx1 | newcinsfx2
John | A01 | 89 |
John | A01 | 90 | 92
Mary | A02 | 92 |
Mary | A02 | 81 |
I tried to do it with the distinct function:
df.select('prod',"acctno").distinct()
df.show()

Address and smoothen noise in sensor data

I have sensor data as shown below, where the Data column contains 6 rows with the value 45 between preceding and following rows with the value 50. The requirement is to clean this data and impute those rows with 50 (the previous value) in the New_data column. Moreover, the noise records (shown as 45 in the table) may vary both in number and in where they occur.
Case 1 (sample data):
Sl.no | Timestamp | Data | New_data
1 | 1/1/2021 0:00:00 | 50 | 50
2 | 1/1/2021 0:15:00 | 50 | 50
3 | 1/1/2021 0:30:00 | 50 | 50
4 | 1/1/2021 0:45:00 | 50 | 50
5 | 1/1/2021 1:00:00 | 50 | 50
6 | 1/1/2021 1:15:00 | 50 | 50
7 | 1/1/2021 1:30:00 | 50 | 50
8 | 1/1/2021 1:45:00 | 50 | 50
9 | 1/1/2021 2:00:00 | 50 | 50
10 | 1/1/2021 2:15:00 | 50 | 50
11 | 1/1/2021 2:30:00 | 45 | 50
12 | 1/1/2021 2:45:00 | 45 | 50
13 | 1/1/2021 3:00:00 | 45 | 50
14 | 1/1/2021 3:15:00 | 45 | 50
15 | 1/1/2021 3:30:00 | 45 | 50
16 | 1/1/2021 3:45:00 | 45 | 50
17 | 1/1/2021 4:00:00 | 50 | 50
18 | 1/1/2021 4:15:00 | 50 | 50
19 | 1/1/2021 4:30:00 | 50 | 50
20 | 1/1/2021 4:45:00 | 50 | 50
21 | 1/1/2021 5:00:00 | 50 | 50
22 | 1/1/2021 5:15:00 | 50 | 50
23 | 1/1/2021 5:30:00 | 50 | 50
I am thinking of grouping the data ordered by timestamp ascending (as below) and then applying a condition that checks group by group across a large data set: if group 1 has the same value as group 3, replace group 2 with the group 1 value.
Sl.no | Timestamp | Data | New_data | group
1 | 1/1/2021 0:00:00 | 50 | 50 | 1
2 | 1/1/2021 0:15:00 | 50 | 50 | 1
3 | 1/1/2021 0:30:00 | 50 | 50 | 1
4 | 1/1/2021 0:45:00 | 50 | 50 | 1
5 | 1/1/2021 1:00:00 | 50 | 50 | 1
6 | 1/1/2021 1:15:00 | 50 | 50 | 1
7 | 1/1/2021 1:30:00 | 50 | 50 | 1
8 | 1/1/2021 1:45:00 | 50 | 50 | 1
9 | 1/1/2021 2:00:00 | 50 | 50 | 1
10 | 1/1/2021 2:15:00 | 50 | 50 | 1
11 | 1/1/2021 2:30:00 | 45 | 50 | 2
12 | 1/1/2021 2:45:00 | 45 | 50 | 2
13 | 1/1/2021 3:00:00 | 45 | 50 | 2
14 | 1/1/2021 3:15:00 | 45 | 50 | 2
15 | 1/1/2021 3:30:00 | 45 | 50 | 2
16 | 1/1/2021 3:45:00 | 45 | 50 | 2
17 | 1/1/2021 4:00:00 | 50 | 50 | 3
18 | 1/1/2021 4:15:00 | 50 | 50 | 3
19 | 1/1/2021 4:30:00 | 50 | 50 | 3
20 | 1/1/2021 4:45:00 | 50 | 50 | 3
21 | 1/1/2021 5:00:00 | 50 | 50 | 3
22 | 1/1/2021 5:15:00 | 50 | 50 | 3
23 | 1/1/2021 5:30:00 | 50 | 50 | 3
Moreover, an exception is needed: if the next group follows a similar pattern, it should not be changed but retained as it is.
Example: if group 1 and group 3 are the same, impute group 2 with the group 1 value.
But if group 2 and group 4 are the same, do not change group 3; retain the same data in New_data.
Case 2:
Sl.no | Timestamp | Data | New_data | group
1 | 1/1/2021 0:00:00 | 50 | 50 | 1
2 | 1/1/2021 0:15:00 | 50 | 50 | 1
3 | 1/1/2021 0:30:00 | 50 | 50 | 1
4 | 1/1/2021 0:45:00 | 50 | 50 | 1
5 | 1/1/2021 1:00:00 | 50 | 50 | 1
6 | 1/1/2021 1:15:00 | 50 | 50 | 1
7 | 1/1/2021 1:30:00 | 50 | 50 | 1
8 | 1/1/2021 1:45:00 | 50 | 50 | 1
9 | 1/1/2021 2:00:00 | 50 | 50 | 1
10 | 1/1/2021 2:15:00 | 50 | 50 | 1
11 | 1/1/2021 2:30:00 | 45 | 50 | 2
12 | 1/1/2021 2:45:00 | 45 | 50 | 2
13 | 1/1/2021 3:00:00 | 45 | 50 | 2
14 | 1/1/2021 3:15:00 | 45 | 50 | 2
15 | 1/1/2021 3:30:00 | 45 | 50 | 2
16 | 1/1/2021 3:45:00 | 45 | 50 | 2
17 | 1/1/2021 4:00:00 | 50 | 50 | 3
18 | 1/1/2021 4:15:00 | 50 | 50 | 3
19 | 1/1/2021 4:30:00 | 50 | 50 | 3
20 | 1/1/2021 4:45:00 | 50 | 50 | 3
21 | 1/1/2021 5:00:00 | 50 | 50 | 3
22 | 1/1/2021 5:15:00 | 50 | 50 | 3
23 | 1/1/2021 5:30:00 | 50 | 50 | 3
24 | 1/1/2021 5:45:00 | 45 | 45 | 4
25 | 1/1/2021 6:00:00 | 45 | 45 | 4
26 | 1/1/2021 6:15:00 | 45 | 45 | 4
27 | 1/1/2021 6:30:00 | 45 | 45 | 4
28 | 1/1/2021 6:45:00 | 45 | 45 | 4
29 | 1/1/2021 7:00:00 | 45 | 45 | 4
30 | 1/1/2021 7:15:00 | 45 | 45 | 4
31 | 1/1/2021 7:30:00 | 45 | 45 | 4
I am looking for help coding this in PostgreSQL. Please feel free to suggest any alternative approaches to the problem above.
The query below should answer the need.
The first query (list) identifies the rows that correspond to a change of data.
The second query (range_list) groups the rows between two successive changes of data and sets up the corresponding timestamp range.
The third query (rec_list) is a recursive query that calculates new_data iteratively, in timestamp order.
The final query displays the expected result.
WITH RECURSIVE list AS
(
    SELECT no
         , timestamp
         , lag(data) OVER w AS previous
         , data
         , lead(data) OVER w AS next
         , data IS DISTINCT FROM lag(data) OVER w AS first   -- first row of a run
         , data IS DISTINCT FROM lead(data) OVER w AS last   -- last row of a run
    FROM sensors
    WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
    SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
         , previous
         , data
         , lead(next) OVER w AS next
         , first
    FROM list
    WHERE first OR last
    WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
    SELECT range
         , previous
         , data
         , next
         , data
         , array[range]
    FROM range_list
    WHERE previous IS NULL
    UNION ALL
    SELECT c.range
         , p.data
         , c.data
         , c.next
         , CASE
               WHEN p.new_data IS NOT DISTINCT FROM c.next
               THEN p.data
               ELSE c.data
           END
         , p.arr || c.range
    FROM rec_list AS p
    INNER JOIN range_list AS c
        ON lower(c.range) = upper(p.range) + interval '15 minutes'
    WHERE NOT array[c.range] <@ p.arr   -- skip ranges already visited
      AND first
)
SELECT s.*, r.new_data
FROM sensors AS s
INNER JOIN rec_list AS r
    ON r.range @> s.timestamp
ORDER BY timestamp
See the test result in dbfiddle.
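As an alternative way to see the logic, the rule from the question (replace a run of readings when the run before it, after cleaning, has the same value as the run after it) can be sketched in plain Python. This is only an illustration under that reading of the requirement, not a translation of the SQL above; the function name `smooth` is hypothetical.

```python
from itertools import groupby


def smooth(values):
    """Return a cleaned copy of values, one run of equal readings at a time.

    A run is overwritten with the previous run's cleaned value when that
    value equals the value of the following run; otherwise it is kept.
    """
    # Collapse the series into runs: [(value, run_length), ...]
    runs = [(value, len(list(group))) for value, group in groupby(values)]
    cleaned = []
    prev_new = None
    for i, (value, length) in enumerate(runs):
        next_value = runs[i + 1][0] if i + 1 < len(runs) else None
        if prev_new is not None and next_value is not None and prev_new == next_value:
            new_value = prev_new   # noise sandwiched between equal neighbours
        else:
            new_value = value      # first run, last run, or a genuine change
        cleaned.extend([new_value] * length)
        prev_new = new_value
    return cleaned


case1 = [50] * 10 + [45] * 6 + [50] * 7   # Data column of Case 1
case2 = case1 + [45] * 8                  # Data column of Case 2
print(smooth(case1))
print(smooth(case2))
```

For Case 1 every reading becomes 50. For Case 2 the first block of 45s becomes 50 while the trailing block of 45s is kept, which agrees with the New_data column shown in the question.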

Select only those records which are twice in postgres

select distinct(msg_id),sub_id from programs where sub_id IN
(
select sub_id from programs group by sub_id having count(sub_id) = 2 limit 5
)
sub_id means subscriber ID.
The inner query returns the subscriber IDs that appear exactly twice in the programs table, and the main query returns those subscriber IDs with their distinct msg_id values.
This result is generated:
msg_id sub_id
------|--------|
112 | 313
111 | 222
113 | 313
115 | 112
116 | 112
117 | 101
118 | 115
119 | 115
110 | 222
I want it to be:
msg_id sub_id
------|--------|
112 | 313
111 | 222
113 | 313
115 | 112
116 | 112
118 | 115
119 | 115
110 | 222
117 | 101 (this row should not be in the output because it occurs only once)
I want only those records that occur twice.
I'm not sure, but are you just missing the second field in your in-list?
select distinct msg_id, sub_id, <presumably other fields>
from programs
where (sub_id, msg_id) IN
(
select sub_id, msg_id
from programs
group by sub_id, msg_id
having count(sub_id) = 2
)
If so, you can also do this with a windowing function:
with cte as (
select
msg_id, sub_id, <presumably other fields>,
count (*) over (partition by msg_id, sub_id) as cnt
from programs
)
select distinct
msg_id, sub_id, <presumably other fields>
from cte
where cnt = 2
Try this:
SELECT msg_id, MAX(sub_id)
FROM programs
GROUP BY msg_id
HAVING COUNT(sub_id) = 2 -- COUNT(sub_id) > 1 if you want all those that repeat more than once
ORDER BY msg_id
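For comparison, the filtering the question describes (keep only rows whose sub_id occurs exactly twice) can be sketched in Python. This assumes the count is taken per sub_id alone, which is what the desired output suggests; the function name `keep_exactly_twice` is made up for the example.

```python
from collections import Counter

# (msg_id, sub_id) pairs taken from the result shown in the question
rows = [
    (112, 313), (111, 222), (113, 313), (115, 112), (116, 112),
    (117, 101), (118, 115), (119, 115), (110, 222),
]


def keep_exactly_twice(rows):
    """Keep only the rows whose sub_id appears exactly two times."""
    counts = Counter(sub_id for _, sub_id in rows)
    return [(msg_id, sub_id) for msg_id, sub_id in rows if counts[sub_id] == 2]


filtered = keep_exactly_twice(rows)
print(filtered)
```

Only the row (117, 101) is dropped, because sub_id 101 occurs once.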

Tableau Pivot Rows into Columns

I have a table structure like this:
Department Employee Class Period Qty1 Qty2 Qty3
----------------------------------------------------
Dept1 John 1 1st 1 2 3
Dept1 John 1 2nd 11 22 33
Dept1 Mary 1 1st 2 3 4
Dept1 Mary 1 2nd 22 33 44
Dept2 Joe 1 1st 3 4 5
Dept2 Joe 1 2nd 33 44 55
Dept2 Paul 1 1st 4 5 6
Dept2 Paul 1 2nd 44 55 66
In a view I'd like to display the format as such:
Class / Period
1
Department Employee 1st 2nd
----------------------------------------------
Dept1 John 1 2 3 11 22 33
Dept1 Mary 2 3 4 22 33 44
Dept2 Joe 3 4 5 33 44 55
Dept2 Paul 4 5 6 44 55 66
I can't seem to find a way to do this. I have Class, Period as Columns and Department, Employee as Rows then drag Qty1, Qty2, Qty3 to the Text Mark but the format becomes:
Class / Period
1
Department Employee 1st 2nd
----------------------------------------------
Dept1 John 1 11
2 22
3 33
Dept1 Mary 2 22
3 33
4 44
Dept2 Joe 3 33
4 44
5 55
Dept2 Paul 4 44
5 55
6 66
How do I turn those rows under each employee to sub-columns under Period?
I think this is what you're trying to achieve.
A lot of times when you see a repeating column in a database table, Qty1, Qty2, Qty3, it is a sign that you really want multiple rows each with a single Qty (and repeating the other information) -- At least when you are building reports. That way you can have rows with any number of instances of Qty, and you can also easily aggregate all the Qty together when needed.
There are situations where you may want to stick with a repeating field design. But if you do want to reshape the data, you can do that in Tableau's data connection window by selecting the columns you want to pull out into a single field and selecting the pivot command.
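To make the reshaping concrete, here is a small Python sketch of what turning the repeating Qty columns into rows does to the data. It only illustrates the wide-to-long idea and is not how Tableau implements its pivot; the output field names `Measure` and `Qty` are assumptions chosen for the example.

```python
def unpivot(rows, id_cols, value_cols):
    """Turn each wide row into one long row per value column."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            record = {key: row[key] for key in id_cols}  # repeat the identifying fields
            record["Measure"] = col                      # which Qty column the value came from
            record["Qty"] = row[col]                     # the value itself
            long_rows.append(record)
    return long_rows


wide = [
    {"Department": "Dept1", "Employee": "John", "Class": 1, "Period": "1st",
     "Qty1": 1, "Qty2": 2, "Qty3": 3},
    {"Department": "Dept1", "Employee": "John", "Class": 1, "Period": "2nd",
     "Qty1": 11, "Qty2": 22, "Qty3": 33},
]
long_rows = unpivot(wide, ["Department", "Employee", "Class", "Period"],
                    ["Qty1", "Qty2", "Qty3"])
for record in long_rows:
    print(record)
```

Each of the two wide rows becomes three long rows, so a single Qty field can be aggregated or broken out by Measure as needed.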

Difference between SAS merge and full outer join [duplicate]

This question already has answers here:
How to replicate a SAS merge
(2 answers)
Closed 7 years ago.
Table t1:
person | visit | code_num1 | code_desc1
1 1 100 OTD
1 2 101 SED
2 3 102 CHM
3 4 103 OTD
3 4 103 OTD
4 5 101 SED
Table t2:
person | visit | code_num2 | code_desc2
1 1 104 DME
1 6 104 DME
3 4 103 OTD
3 4 103 OTD
3 7 103 OTD
4 5 104 DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED
1 6 104 DME
2 3 102 CHM
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 103 OTD
4 5 101 SED 104 DME
I want to replicate this in a hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED null null
1 6 null null 104 DME
2 3 102 CHM null null
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 null null 103 OTD
4 5 101 SED 104 DME
Which is almost the same as SAS, but we have 2 extra rows for (person=3, visit=4). I assume this is because hive is matching each row in one table with two rows in the other, producing the 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?
If you merge two data sets and they have variables with the same names (besides the BY variables), then variables from the second data set will overwrite any variables having the same name in the first data set. So your SAS code creates an overlaid dataset. A full outer join does not do this.
It seems to me if you first dedupe the right side table then do a full outer join you should get the equivalent table in hive. I don't see a need for the case when statements either as Joe pointed out. Just do a join on the key values:
create table t3 as
select coalesce(a.person, b.person) as person
, coalesce(a.visit, b.visit) as visit
, a.code_num1
, a.code_desc1
, b.code_num2
, b.code_desc2
from
(select * from t1) a
full outer join
(select person, visit, code_num2, code_desc2
from t2
group by person, visit, code_num2, code_desc2) b
on a.person=b.person and a.visit=b.visit
;
I can't test this code currently so be sure to test it. Good luck.
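The effect of deduplicating the right-hand table can be checked with a small in-memory sketch. The Python below is an assumption-laden illustration, not Hive or SAS: `full_outer_join` is a hypothetical helper that joins on the first two fields (person, visit), and it does not emulate SAS's row-by-row match-merge in general. It only shows the row counts for the sample data in the question.

```python
from collections import defaultdict


def full_outer_join(left, right):
    """Full outer join of tuples on their first two fields (person, visit)."""
    right_by_key = defaultdict(list)
    for r in right:
        right_by_key[r[:2]].append(r)
    out, matched = [], set()
    for l in left:
        key = l[:2]
        if key in right_by_key:
            matched.add(key)
            # every left row pairs with every right row sharing the key
            out.extend(key + l[2:] + r[2:] for r in right_by_key[key])
        else:
            out.append(key + l[2:] + (None, None))
    for key, rs in right_by_key.items():
        if key not in matched:
            out.extend(key + (None, None) + r[2:] for r in rs)
    return out


t1 = [(1, 1, 100, "OTD"), (1, 2, 101, "SED"), (2, 3, 102, "CHM"),
      (3, 4, 103, "OTD"), (3, 4, 103, "OTD"), (4, 5, 101, "SED")]
t2 = [(1, 1, 104, "DME"), (1, 6, 104, "DME"), (3, 4, 103, "OTD"),
      (3, 4, 103, "OTD"), (3, 7, 103, "OTD"), (4, 5, 104, "DME")]

plain = full_outer_join(t1, t2)
deduped = full_outer_join(t1, list(dict.fromkeys(t2)))  # drop duplicate t2 rows first
print(len(plain), len(deduped))
```

On this data the plain join yields 10 rows (4 of them for person=3, visit=4), while the join against the deduplicated t2 yields 8 rows, the same count as the SAS output in the question.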