Difference between SAS merge and full outer join [duplicate]

This question already has answers here:
How to replicate a SAS merge
(2 answers)
Closed 7 years ago.
Table t1:
person | visit | code_num1 | code_desc1
1      | 1     | 100       | OTD
1      | 2     | 101       | SED
2      | 3     | 102       | CHM
3      | 4     | 103       | OTD
3      | 4     | 103       | OTD
4      | 5     | 101       | SED
Table t2:
person | visit | code_num2 | code_desc2
1      | 1     | 104       | DME
1      | 6     | 104       | DME
3      | 4     | 103       | OTD
3      | 4     | 103       | OTD
3      | 7     | 103       | OTD
4      | 5     | 104       | DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 | code_num2 | code_desc2
1      | 1     | 100       | OTD        | 104       | DME
1      | 2     | 101       | SED        |           |
1      | 6     |           |            | 104       | DME
2      | 3     | 102       | CHM        |           |
3      | 4     | 103       | OTD        | 103       | OTD
3      | 4     | 103       | OTD        | 103       | OTD
3      | 7     |           |            | 103       | OTD
4      | 5     | 101       | SED        | 104       | DME
I want to replicate this in a Hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 | code_num2 | code_desc2
1      | 1     | 100       | OTD        | 104       | DME
1      | 2     | 101       | SED        | null      | null
1      | 6     | null      | null       | 104       | DME
2      | 3     | 102       | CHM        | null      | null
3      | 4     | 103       | OTD        | 103       | OTD
3      | 4     | 103       | OTD        | 103       | OTD
3      | 4     | 103       | OTD        | 103       | OTD
3      | 4     | 103       | OTD        | 103       | OTD
3      | 7     | null      | null       | 103       | OTD
4      | 5     | 101       | SED        | 104       | DME
This is almost the same as the SAS output, but there are 2 extra rows for (person=3, visit=4). I assume this is because Hive matches each duplicated row in one table with both matching rows in the other, producing 4 rows in t3, whereas SAS pairs them one-to-one. Any suggestions on how I could get my query to match the output of the SAS merge?

If you merge two data sets that have variables with the same names (besides the BY variables), then variables from the second data set will overwrite any same-named variables in the first data set, so your SAS code creates an overlaid dataset. A full outer join does not do this.
It seems to me that if you first dedupe the right-side table and then do a full outer join, you should get the equivalent table in Hive. I don't see a need for the CASE WHEN statements either, as Joe pointed out; just join on the key values and COALESCE them:
create table t3 as
select coalesce(a.person, b.person) as person
     , coalesce(a.visit, b.visit) as visit
     , a.code_num1
     , a.code_desc1
     , b.code_num2
     , b.code_desc2
from t1 a
full outer join
     (select person, visit, code_num2, code_desc2
      from t2
      group by person, visit, code_num2, code_desc2) b
on a.person = b.person and a.visit = b.visit
;
I can't test this code currently so be sure to test it. Good luck.
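One more thought: if the duplicated rows within a BY group were ever not identical, deduplicating would change the result. A closer emulation of the SAS row-by-row pairing (my own untested sketch, not from the original question; it assumes Hive 0.11+ for row_number()) is to number the rows within each (person, visit) group on both sides and join on that number as well:
create table t3 as
select coalesce(a.person, b.person) as person
     , coalesce(a.visit, b.visit) as visit
     , a.code_num1
     , a.code_desc1
     , b.code_num2
     , b.code_desc2
from
     (select t1.*,
             -- position of the row within its (person, visit) group
             row_number() over (partition by person, visit order by code_num1) as rn
      from t1) a
full outer join
     (select t2.*,
             row_number() over (partition by person, visit order by code_num2) as rn
      from t2) b
on a.person = b.person and a.visit = b.visit and a.rn = b.rn
;
Note that when one BY group is longer than the other, SAS carries the last observation of the shorter dataset forward, whereas this join leaves those columns NULL.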

Related

Unpivot data in PostgreSQL

I have a table in PostgreSQL with the below values,
empid  hyderabad  bangalore  mumbai  chennai
1      20         30         40      50
2      10         20         30      40
And my output should be like below
empid  city       nos
1      hyderabad  20
1      bangalore  30
1      mumbai     40
1      chennai    50
2      hyderabad  10
2      bangalore  20
2      mumbai     30
2      chennai    40
How can I do this unpivot in PostgreSQL?
You can use a lateral join:
select t.empid, x.city, x.nos
from the_table t
cross join lateral (
    values
        ('hyderabad', t.hyderabad),
        ('bangalore', t.bangalore),
        ('mumbai', t.mumbai),
        ('chennai', t.chennai)
) as x(city, nos)
order by t.empid, x.city;
Or this one, which is simpler to read and plain standard SQL:
WITH
input(empid, hyderabad, bangalore, mumbai, chennai) AS (
            SELECT 1, 20, 30, 40, 50
  UNION ALL SELECT 2, 10, 20, 30, 40
)
,
i(i) AS (
            SELECT 1
  UNION ALL SELECT 2
  UNION ALL SELECT 3
  UNION ALL SELECT 4
)
SELECT
  empid
, CASE i
    WHEN 1 THEN 'hyderabad'
    WHEN 2 THEN 'bangalore'
    WHEN 3 THEN 'mumbai'
    WHEN 4 THEN 'chennai'
    ELSE 'unknown'
  END AS city
, CASE i
    WHEN 1 THEN hyderabad
    WHEN 2 THEN bangalore
    WHEN 3 THEN mumbai
    WHEN 4 THEN chennai
    ELSE NULL::INT
  END AS nos
FROM input CROSS JOIN i
ORDER BY empid, i;
-- out  empid |   city    | nos
-- out -------+-----------+-----
-- out      1 | hyderabad |  20
-- out      1 | bangalore |  30
-- out      1 | mumbai    |  40
-- out      1 | chennai   |  50
-- out      2 | hyderabad |  10
-- out      2 | bangalore |  20
-- out      2 | mumbai    |  30
-- out      2 | chennai   |  40

Add unique rows for each group when similar group repeats after certain rows

Hi, can anyone help me get a unique group number?
I need to assign a unique number to each group of consecutive rows with the same product, even when the same product repeats again after other groups.
I have the following data:
id   version  product  startdate   enddate
123  0        2443     2010/09/01  2011/01/02
123  1        131      2011/01/03  2011/03/09
123  2        131      2011/08/10  2012/09/10
123  3        3009     2012/09/11  2014/03/31
123  4        668      2014/04/01  2014/04/30
123  5        668      2014/05/01  2016/01/01
123  6        668      2016/01/02  2017/09/08
123  7        131      2017/09/09  2017/10/10
123  8        131      2018/10/11  2019/01/01
123  9        550      2019/01/02  2099/01/01
select *,
dense_rank()over(partition by id order by id,product)
from table
Expected results:
id   version  product  startdate   enddate     count
123  0        2443     2010/09/01  2011/01/02  1
123  1        131      2011/01/03  2011/03/09  2
123  2        131      2011/08/10  2012/09/10  2
123  3        3009     2012/09/11  2014/03/31  3
123  4        668      2014/04/01  2014/04/30  4
123  5        668      2014/05/01  2016/01/01  4
123  6        668      2016/01/02  2017/09/08  4
123  7        131      2017/09/09  2017/10/10  5
123  8        131      2018/10/11  2019/01/01  5
123  9        550      2019/01/02  2099/01/01  6
Try the following. LAG flags each row where product differs from the previous row (within each id, ordered by version), and the running SUM of those flags numbers each run of equal products; DENSE_RANK, by contrast, gives both separate runs of product 131 the same number, which is why your query does not give the expected result.
SELECT
    id, version, product, startdate, enddate,
    -- running total of the change flags gives one number per run of equal products
    1 + SUM(v) OVER (PARTITION BY id ORDER BY version) AS n
FROM
(
    SELECT
        *,
        -- flag is 1 whenever product differs from the previous row within the same id
        IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) AS v
    FROM TestTable
) q
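IIF is SQL Server syntax; if your database does not support it, the same change flag can be written with a standard CASE expression (an untested sketch of the same idea):
SELECT
    id, version, product, startdate, enddate,
    1 + SUM(v) OVER (PARTITION BY id ORDER BY version) AS n
FROM
(
    SELECT
        *,
        CASE WHEN LAG(product) OVER (PARTITION BY id ORDER BY version) <> product
             THEN 1 ELSE 0 END AS v
    FROM TestTable
) q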

Spark - Grouping 2 Dataframe Rows in only 1 row [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have the following dataframe
id  col1  col2  col3  col4
1   1     10    100   A
1   1     20    101   B
1   1     30    102   C
2   1     10    80    D
2   1     20    90    E
2   1     30    100   F
2   1     40    104   G
So, I want to return a new dataframe in which I have, in only one row, the values for the same (col1, col2), and also create a new column with some operation over both col3 columns, for example:
id(1)  col1(1)  col2(1)  col3(1)  col4(1)  id(2)  col1(2)  col2(2)  col3(2)  col4(2)  new_column
1      1        10       100      A        2      1        10       80       D        (100-80)*100
1      1        20       101      B        2      1        20       90       E        (101-90)*100
1      1        30       102      C        2      1        30       100      F        (102-100)*100
-      -        -        -        -        2      1        40       104      G        -
I tried ordering and grouping by (col1, col2), but the grouping returns a RelationalGroupedDataset on which I cannot do anything apart from aggregation functions. So I would appreciate any help. I'm using Scala 2.11. Thanks!
What about joining the df with itself?
something like:
df.as("left")
.join(df.as("right"), Seq("col1", "col2"), "outer")
.where($"left.id" =!= $"right.id")
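To also get the new_column from the question, one possible extension of that self-join (an untested sketch; paired is just an illustrative name, and it assumes the aliased columns remain reachable as left.col3 / right.col3 after the join):
import spark.implicits._   // for the $"..." column syntax; assumes a SparkSession named spark

val paired = df.as("left")
  .join(df.as("right"), Seq("col1", "col2"), "outer")
  .where($"left.id" =!= $"right.id")
  // the operation over both col3 columns from the example: (col3(1) - col3(2)) * 100
  .withColumn("new_column", ($"left.col3" - $"right.col3") * 100)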

Spark, Scala - How to get Top 3 value from each group of two column in dataframe [duplicate]

This question already has answers here:
Retrieve top n in each group of a DataFrame in pyspark
(6 answers)
get TopN of all groups after group by using Spark DataFrame
(1 answer)
Closed 5 years ago.
I have one DataFrame which contains these values :
Dept_id | name | salary
1       | A    | 10
2       | B    | 100
1       | D    | 100
2       | C    | 105
1       | N    | 103
2       | F    | 102
1       | K    | 90
2       | E    | 110
I want the result in this form :
Dept_id | name | salary
1       | N    | 103
1       | D    | 100
1       | K    | 90
2       | E    | 110
2       | C    | 105
2       | F    | 102
Thanks In Advance :).
The solution is similar to Retrieve top n in each group of a DataFrame in pyspark, which is in pyspark.
If you do the same in Scala, it should be as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

df.withColumn("rank", rank().over(Window.partitionBy("Dept_id").orderBy($"salary".desc)))
  .filter($"rank" <= 3)
  .drop("rank")
I hope the answer is helpful
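One caveat: rank() can return more than three rows per department when salaries tie. If you need exactly three rows regardless of ties, row_number() is the usual substitute (same Window import, plus org.apache.spark.sql.functions.row_number):
df.withColumn("rn", row_number().over(Window.partitionBy("Dept_id").orderBy($"salary".desc)))
  .filter($"rn" <= 3)
  .drop("rn")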

Tableau Pivot Rows into Columns

I have a table structure like this:
Department  Employee  Class  Period  Qty1  Qty2  Qty3
------------------------------------------------------
Dept1       John      1      1st     1     2     3
Dept1       John      1      2nd     11    22    33
Dept1       Mary      1      1st     2     3     4
Dept1       Mary      1      2nd     22    33    44
Dept2       Joe       1      1st     3     4     5
Dept2       Joe       1      2nd     33    44    55
Dept2       Paul      1      1st     4     5     6
Dept2       Paul      1      2nd     44    55    66
In a view I'd like to display the format as such:
                       Class / Period
                             1
Department  Employee   1st        2nd
------------------------------------------
Dept1       John       1  2  3    11 22 33
Dept1       Mary       2  3  4    22 33 44
Dept2       Joe        3  4  5    33 44 55
Dept2       Paul       4  5  6    44 55 66
I can't seem to find a way to do this. I have Class and Period as Columns and Department and Employee as Rows, then drag Qty1, Qty2, Qty3 to the Text mark, but the format becomes:
                       Class / Period
                             1
Department  Employee   1st   2nd
------------------------------------------
Dept1       John       1     11
                       2     22
                       3     33
Dept1       Mary       2     22
                       3     33
                       4     44
Dept2       Joe        3     33
                       4     44
                       5     55
Dept2       Paul       4     44
                       5     55
                       6     66
How do I turn those rows under each employee to sub-columns under Period?
I think this is what you're trying to achieve.
A lot of times, when you see a repeating column in a database table (Qty1, Qty2, Qty3), it is a sign that you really want multiple rows, each with a single Qty (and repeating the other information), at least when you are building reports. That way you can have rows with any number of instances of Qty, and you can also easily aggregate all the Qty together when needed.
There are situations where you may want to stick with a repeating field design. But if you do want to reshape the data, you can do that in Tableau's data connection window by selecting the columns you want to pull out into a single field and selecting the pivot command.