SQL Server: FAILING extra records - tsql

I have a tableA (ID int, Match varchar, code char, status char):
ID Match code Status
101 123 A
102 123 B
103 123 C
104 234 A
105 234 B
106 234 C
107 234 B
108 456 A
109 456 B
110 456 C
I want to populate status with 'FAIL' when:
For the same Match, there exists a code other than A, B, or C,
or the same code exists multiple times.
In other words, code can only be A, B, or C, and each code should exist only once per Match; otherwise it fails. So the expected result would be:
ID Match code Status
101 123 A NULL
102 123 B NULL
103 123 C NULL
104 234 A NULL
105 234 B NULL
106 234 C NULL
107 234 B FAIL
108 456 A NULL
109 456 B NULL
110 456 C NULL
Thanks

No guarantees on efficiency here...
update tableA
set status = 'FAIL'
where code not in ('A', 'B', 'C')  -- code outside the allowed set
   or ID not in (
        -- keep the first row per (Match, code); later duplicates fail
        select min(ID)
        from tableA
        group by Match, code
      );

Related

pyspark - converting DF Structure

I am new to Python and Spark programming.
I have data in Format-1 given below, with values captured for different fields based on timestamp and trigger.
I need to convert this data into Format-2, i.e., based on timestamp and key, group all the fields given in Format-1 and create records as per Format-2. In Format-1, there are fields that do not have any key value (timestamp and trigger); those values should be populated for all the records in Format-2.
Can you please suggest the best approach to perform this in pyspark?
Format-1:
Event time (key-1) trig (key-2) data field_Name
------------------------------------------------------
2021-05-01T13:57:29Z 30Sec 10 A
2021-05-01T13:57:59Z 30Sec 11 A
2021-05-01T13:58:29Z 30Sec 12 A
2021-05-01T13:58:59Z 30Sec 13 A
2021-05-01T13:59:29Z 30Sec 14 A
2021-05-01T13:59:59Z 30Sec 15 A
2021-05-01T14:00:29Z 30Sec 16 A
2021-05-01T14:00:48Z OFF 17 A
2021-05-01T13:57:29Z 30Sec 110 B
2021-05-01T13:57:59Z 30Sec 111 B
2021-05-01T13:58:29Z 30Sec 112 B
2021-05-01T13:58:59Z 30Sec 113 B
2021-05-01T13:59:29Z 30Sec 114 B
2021-05-01T13:59:59Z 30Sec 115 B
2021-05-01T14:00:29Z 30Sec 116 B
2021-05-01T14:00:48Z OFF 117 B
2021-05-01T14:00:48Z OFF 21 C
2021-05-01T14:00:48Z OFF 31 D
Null Null 41 E
Null Null 51 F
Format-2:
Event Time Trig A B C D E F
--------------------------------------------------------------
2021-05-01T13:57:29Z 30Sec 10 110 Null Null 41 51
2021-05-01T13:57:59Z 30Sec 11 111 Null Null 41 51
2021-05-01T13:58:29Z 30Sec 12 112 Null Null 41 51
2021-05-01T13:58:59Z 30Sec 13 113 Null Null 41 51
2021-05-01T13:59:29Z 30Sec 14 114 Null Null 41 51
2021-05-01T13:59:59Z 30Sec 15 115 Null Null 41 51
2021-05-01T14:00:29Z 30Sec 16 116 Null Null 41 51
2021-05-01T14:00:48Z OFF 17 117 21 31 41 51
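One way to get Format-2 is to pivot the keyed rows on field_Name and then attach the key-less values as constant columns. Here is a minimal PySpark sketch; the column names event_time, trig, data and field_name are assumptions (rename them to match the real schema), and it assumes the missing keys are real nulls:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema; adjust the column names to the real Format-1 data.
df = spark.createDataFrame(
    [
        ("2021-05-01T13:57:29Z", "30Sec", 10, "A"),
        ("2021-05-01T13:57:29Z", "30Sec", 110, "B"),
        ("2021-05-01T14:00:48Z", "OFF", 17, "A"),
        ("2021-05-01T14:00:48Z", "OFF", 117, "B"),
        ("2021-05-01T14:00:48Z", "OFF", 21, "C"),
        ("2021-05-01T14:00:48Z", "OFF", 31, "D"),
        (None, None, 41, "E"),
        (None, None, 51, "F"),
    ],
    ["event_time", "trig", "data", "field_name"],
)

# Pivot the keyed rows: one output column per field_name.
pivoted = (
    df.where(F.col("event_time").isNotNull())
      .groupBy("event_time", "trig")
      .pivot("field_name")
      .agg(F.first("data"))
)

# Key-less fields apply to every record: collect them (there are
# only a few) and attach each one as a constant column.
for row in df.where(F.col("event_time").isNull()).collect():
    pivoted = pivoted.withColumn(row["field_name"], F.lit(row["data"]))

pivoted.orderBy("event_time").show(truncate=False)
If the missing keys arrive as the literal string 'Null' rather than real nulls, change the isNull/isNotNull filters accordingly.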

Add unique rows for each group when similar group repeats after certain rows

Hi, can anyone help me get a unique group number?
I need a unique number for each group of consecutive rows, even when the same product repeats again after other groups.
I have the following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
select *,
    dense_rank() over (partition by id order by id, product)
from table
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6
Try the following gaps-and-islands approach: flag each row where the product differs from the previous row, then a running sum of those flags numbers each group:
SELECT
    id, version, product, startdate, enddate,
    1 + SUM(v) OVER (PARTITION BY id ORDER BY version) AS n
FROM
(
    SELECT
        *,
        IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) AS v
    FROM TestTable
) q

Update Spark dataframe to populate data from another dataframe

I have 2 dataframes. I want to take the distinct values of one column and link them with all the rows of the other dataframe. For example:
Dataframe 1 : df1 contains
scenarioId
---------------
101
102
103
Dataframe 2 : df2 contains columns
trades
-------------------------------------
isin price
ax11 111
re32 909
erre 445
Expected output
trades
----------------
isin price scenarioid
ax11 111 101
re32 909 101
erre 445 101
ax11 111 102
re32 909 102
erre 445 102
ax11 111 103
re32 909 103
erre 445 103
Note that I don't have the possibility to join the 2 dataframes on a common column. Please suggest.
What you need is a cross join, i.e. the Cartesian product:
val result = df1.crossJoin(df2)
although I do not recommend it, as the amount of data grows very fast. You'll get all possible pairs, i.e. the elements of the Cartesian product (the number of rows will be the number of rows in df1 times the number of rows in df2).
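For completeness, the same API exists in PySpark. A minimal sketch, with the two dataframes reconstructed from the question's sample data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reconstruction of the question's sample dataframes.
df1 = spark.createDataFrame([(101,), (102,), (103,)], ["scenarioId"])
df2 = spark.createDataFrame(
    [("ax11", 111), ("re32", 909), ("erre", 445)],
    ["isin", "price"],
)

# Every trade paired with every scenario: 3 x 3 = 9 rows.
df2.crossJoin(df1).show()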

Select only those records which appear twice in postgres

select distinct (msg_id), sub_id
from programs
where sub_id in (
    select sub_id
    from programs
    group by sub_id
    having count(sub_id) = 2
    limit 5
)
sub_id means subscriberID.
The inner query returns the subscriberIDs that appear exactly 2 times in the programs table, and the outer query returns the distinct msg_ids for those subscriberIDs.
This result is generated:
msg_id sub_id
------|--------|
112 | 313
111 | 222
113 | 313
115 | 112
116 | 112
117 | 101
118 | 115
119 | 115
110 | 222
I want it to be:
msg_id sub_id
------|--------|
112 | 313
111 | 222
113 | 313
115 | 112
116 | 112
118 | 115
119 | 115
110 | 222
117 | 101 (this row should not be in the output because it appears only once)
I want only those records which appear twice.
I'm not sure, but are you just missing the second field in your in-list?
select distinct msg_id, sub_id, <presumably other fields>
from programs
where (sub_id, msg_id) IN
(
select sub_id, msg_id
from programs
group by sub_id, msg_id
having count(sub_id) = 2
)
If so, you can also do this with a windowing function:
with cte as (
select
msg_id, sub_id, <presumably other fields>,
count (*) over (partition by msg_id, sub_id) as cnt
from programs
)
select distinct
msg_id, sub_id, <presumably other fields>
from cte
where cnt = 2
Try this:
SELECT msg_id, sub_id
FROM programs
WHERE sub_id IN (
    SELECT sub_id
    FROM programs
    GROUP BY sub_id
    HAVING COUNT(*) = 2 -- COUNT(*) > 1 if you want all those that repeat more than once
)
ORDER BY msg_id

Difference between SAS merge and full outer join [duplicate]

Table t1:
person | visit | code_num1 | code_desc1
1 1 100 OTD
1 2 101 SED
2 3 102 CHM
3 4 103 OTD
3 4 103 OTD
4 5 101 SED
Table t2:
person | visit | code_num2 | code_desc2
1 1 104 DME
1 6 104 DME
3 4 103 OTD
3 4 103 OTD
3 7 103 OTD
4 5 104 DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED
1 6 104 DME
2 3 102 CHM
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 103 OTD
4 5 101 SED 104 DME
I want to replicate this in a Hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED null null
1 6 null null 104 DME
2 3 102 CHM null null
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 null null 103 OTD
4 5 101 SED 104 DME
Which is almost the same as SAS, but we have 2 extra rows for (person=3, visit=4). I assume this is because Hive is matching each row in one table with two rows in the other, producing the 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?
If you merge two data sets that have variables with the same names (besides the BY variables), then variables from the second data set will overwrite any variables with the same name in the first data set. So your SAS code creates an overlaid dataset. A full outer join does not do this.
It seems to me that if you first dedupe the right-side table and then do a full outer join, you should get the equivalent table in Hive. I don't see a need for the CASE WHEN statements either, as Joe pointed out. Just do a join on the key values:
create table t3 as
select coalesce(a.person, b.person) as person
, coalesce(a.visit, b.visit) as visit
, a.code_num1
, a.code_desc1
, b.code_num2
, b.code_desc2
from
(select * from t1) a
full outer join
(select person, visit, code_num2, code_desc2
 from t2
 group by person, visit, code_num2, code_desc2) b
on a.person=b.person and a.visit=b.visit
;
I can't test this code currently so be sure to test it. Good luck.