Use a correlated subquery in PySpark SQL - pyspark

Tab1 Columns [F,S,E]
F1 S1 R
F1 S2 R2
F1 S3 R1
F2 S1 R2
F2 S4 R4
F1 S4 R
Tab2 Columns [F,S]
F1 S1
F1 S3
F2 S1
F2 S4
Take rows from Tab1 only if the F->S relation is present in Tab2.
RESULT Columns [F,S,E]
F1 S1 R
F1 S3 R
F2 S4 R4
I have the query now, but I am not able to get results with PySpark. I am able to run it on a MySQL database.
I tried to use a correlated subquery in Spark 2.4.3, but it returns 0 rows.
Tab1.createOrReplaceTempView("Tab1")
Tab2.createOrReplaceTempView("Tab2")
joined_df = spark.sql(
"""SELECT F, S, E FROM Tab1
WHERE EXISTS (SELECT * FROM Tab2 WHERE Tab1.F=Tab2.F AND Tab1.S=Tab2.S)"""
)
joined_df.show(10)
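As a sanity check on the query itself, here is a minimal sketch that runs the same statement on the sample rows above. It assumes Python's built-in sqlite3 module as a stand-in database (Spark is not involved), and the ORDER BY is added only to make the output order deterministic. If the same statement returns 0 rows in Spark, one thing worth checking is whether the F and S values in the two DataFrames are exactly equal (for example, no stray whitespace).

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tab1 (F TEXT, S TEXT, E TEXT);
    CREATE TABLE Tab2 (F TEXT, S TEXT);
    INSERT INTO Tab1 VALUES
        ('F1', 'S1', 'R'), ('F1', 'S2', 'R2'), ('F1', 'S3', 'R1'),
        ('F2', 'S1', 'R2'), ('F2', 'S4', 'R4'), ('F1', 'S4', 'R');
    INSERT INTO Tab2 VALUES
        ('F1', 'S1'), ('F1', 'S3'), ('F2', 'S1'), ('F2', 'S4');
""")

# Same correlated EXISTS subquery as in the question.
rows = conn.execute("""
    SELECT F, S, E FROM Tab1
    WHERE EXISTS (SELECT * FROM Tab2 WHERE Tab1.F = Tab2.F AND Tab1.S = Tab2.S)
    ORDER BY F, S
""").fetchall()

# Each printed row is a Tab1 row whose (F, S) pair also appears in Tab2.
for row in rows:
    print(row)
```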

Related

Can we write a hive query in Spark - UDF

Can we run a Hive query inside a Spark UDF?
For example, I have two tables, A and B, where b1 contains the column names of A and b2 contains the values of that column in A.
I want to query the tables so that the values in A's columns are replaced with values from B, matched on the column names and their corresponding values.
To achieve that I wrote a Spark UDF, convert, as below:
def convert(colname: String, colvalue:String)={
sqlContext.sql("SELECT b3 from B where b1 = colname and b2 = colvalue").toString;
}
I registered it as:
sqlContext.udf.register("conv",convert(_:String,_:String));
My main query is:
val result = sqlContext.sql("select a1 , conv('a2',a2), conv('a3',a3)");
result.take(2);
It throws java.lang.NullPointerException.
Can someone please tell me whether this is supported in Spark/Hive?
Any other approach is also welcome.
Thanks!
No, a UDF does not permit running a query inside it.
You can only pass data in as arguments and transform it to get the final result back at row/column/table level.
Here is a solution to your question. You can do it in Hive itself:
WITH a_plus_col
AS (SELECT a1
,'a2' AS col_name
,a2 AS col_value
FROM A
UNION ALL
SELECT a1
,'a3' AS col_name
,a3 AS col_value
FROM A)
SELECT a_plus_col.a1 AS r1
,MAX(CASE WHEN a_plus_col.col_name = 'a2' THEN B.b3 END) AS r2
,MAX(CASE WHEN a_plus_col.col_name = 'a3' THEN B.b3 END) AS r3
FROM a_plus_col
INNER JOIN B ON ( a_plus_col.col_name = b1 AND a_plus_col.col_value = b2)
GROUP BY a_plus_col.a1;
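The unpivot, join, and re-pivot steps in the query above can be tried out on a small data set. This is only a sketch: it assumes Python's built-in sqlite3 module instead of Hive, the rows in A and B are hypothetical, and an ORDER BY is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a1 TEXT, a2 TEXT, a3 TEXT);
    CREATE TABLE B (b1 TEXT, b2 TEXT, b3 TEXT);
    -- Hypothetical sample rows: B maps (column name, value) to a replacement.
    INSERT INTO A VALUES ('id1', 'x', 'p'), ('id2', 'y', 'q');
    INSERT INTO B VALUES
        ('a2', 'x', 'X-label'), ('a2', 'y', 'Y-label'),
        ('a3', 'p', 'P-label'), ('a3', 'q', 'Q-label');
""")

rows = conn.execute("""
    WITH a_plus_col AS (
        -- Unpivot: one row per (a1, column name, column value).
        SELECT a1, 'a2' AS col_name, a2 AS col_value FROM A
        UNION ALL
        SELECT a1, 'a3' AS col_name, a3 AS col_value FROM A
    )
    SELECT a_plus_col.a1 AS r1,
           -- Re-pivot: pick the replacement found for each original column.
           MAX(CASE WHEN a_plus_col.col_name = 'a2' THEN B.b3 END) AS r2,
           MAX(CASE WHEN a_plus_col.col_name = 'a3' THEN B.b3 END) AS r3
    FROM a_plus_col
    INNER JOIN B ON (a_plus_col.col_name = b1 AND a_plus_col.col_value = b2)
    GROUP BY a_plus_col.a1
    ORDER BY r1
""").fetchall()

for row in rows:
    print(row)
```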

In EF I'm looking for duplicates and doing a self-referencing query, how do I write this query?

Here is the SQL statement:
SELECT f1.*
FROM [File] f1
where 1 < (select count(*) from [File] f2 where f1.FileName = f2.FileName)
order by f1.FileName
This is a fairly simple query to do in SQL, but I'm not sure how to do it in EF. The closest I've come to the answer is this (gives me the PK and count), but I want the full file record back:
from f1 in File
join f2 in File on f1.FileName equals f2.FileName
group f1 by f1.FileId into c
where c.Count() > 1
select new { FileId = c.Key, number = c.Count() }
You can use group join:
from f1 in File
join f2 in File on f1.FileName equals f2.FileName into g
where g.Count() > 1
select f1
I generally use EF with lambda syntax but couldn't you just project the object?
from f1 in File
join f2 in File on f1.FileName equals f2.FileName
group f1 by f1.FileId into c
where c.Count() > 1
select c
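For comparison, the same "count per FileName, keep the full record when the count exceeds 1" logic can be written outside EF. The sketch below is plain Python, not LINQ; the File class, its field names, and the sample records are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class File:
    # Hypothetical stand-in for the File entity.
    file_id: int
    file_name: str


files = [
    File(1, "a.txt"),
    File(2, "b.txt"),
    File(3, "a.txt"),
    File(4, "c.txt"),
    File(5, "b.txt"),
]

# Count how many records share each file name (the inner count(*) subquery).
name_counts = Counter(f.file_name for f in files)

# Keep the full record whenever its name occurs more than once,
# ordered by file name like the original SQL.
duplicates = sorted(
    (f for f in files if name_counts[f.file_name] > 1),
    key=lambda f: f.file_name,
)

for f in duplicates:
    print(f)
```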

Common records for 2 fields in a table?

I have a table with two fields, say A and B. Suppose A has the values a1 and a2.
The corresponding records for a1 in B are 1,2,3,x,y,z.
The corresponding records for a2 in B are 1,2,3,4,d,e,f.
I need a query written for DB2 that fetches the records in B that are common to every value in A (a1 and a2).
So here the output would be :
A B
a1 1
a1 2
a1 3
a2 1
a2 2
a2 3
Can someone please help on this?
Try something like:
SELECT A, B
FROM Table t1
WHERE (SELECT COUNT(*) FROM Table t2 WHERE t2.B = t1.B)
= (SELECT COUNT(DISTINCT t3.A) FROM Table t3)
ORDER BY A, B
This might not be 100% accurate as I can't test it out in DB2 so you might have to tweak the query a little bit to make it work.
with t(num) as (select count(distinct A) from table)
select t1.A, t1.B
from table t1, table t2, t
where t1.B = t2.B
group by t1.A, t1.B, num
having count(*) = num
Basically, the idea is to join the same table with column B and filter out just the ones that match exactly the same number of times as the number of elements in column A, which indicates that it is a common record out of all the A values.
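The self-join idea above can be checked against the sample data from the question. This is a sketch only: it assumes Python's built-in sqlite3 module rather than DB2, uses the hypothetical table name tbl (because table is a reserved word), adds an ORDER BY for deterministic output, and relies on each (A, B) pair appearing at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (A TEXT, B TEXT)")

# Sample data from the question: B values recorded for a1 and for a2.
data = {
    "a1": ["1", "2", "3", "x", "y", "z"],
    "a2": ["1", "2", "3", "4", "d", "e", "f"],
}
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(a, b) for a, values in data.items() for b in values],
)

rows = conn.execute("""
    WITH t(num) AS (SELECT COUNT(DISTINCT A) FROM tbl)
    SELECT t1.A, t1.B
    FROM tbl t1, tbl t2, t
    WHERE t1.B = t2.B
    GROUP BY t1.A, t1.B, num
    -- A B value is common when it matches once per distinct A value.
    HAVING COUNT(*) = num
    ORDER BY t1.A, t1.B
""").fetchall()

for row in rows:
    print(row)
```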

delete duplicate rows matching with other table

I have two tables
Table A    Table B
-------    -------
a b c      a b c
a b c      a b c
a b c      a b c
e f g      a b c
h i j      e f g
k l m      k l m
           k l m
           x y z
           s t u
           a b c
           a b c
Now I want to remove rows from Table B that match Table A on columns 1, 2 and 3, so that the count of each duplicate row in Table B is less than or equal to its count in Table A.
So the output should be:
Table A    Table B
-------    -------
a b c      a b c
a b c      a b c
a b c      a b c
e f g      e f g
h i j      k l m
k l m      x y z
           s t u
I have tried using inner join and intersect but failed to get the desired result.
Try:
DELETE FROM tableB
WHERE ctid IN (
SELECT BB.ctid
FROM (
SELECT a, b, c, count(*) cnt
FROM tablea
GROUP BY a, b, c
) AA
JOIN (
SELECT ctid,
a, b, c,
row_number() over (partition by a,b,c) cnt
FROM tableb
) BB
ON AA.a = BB.a
AND AA.b = BB.b
AND AA.c = BB.c
AND AA.cnt < BB.cnt
)
demo: http://sqlfiddle.com/#!12/73e99/1
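The rule that the row_number() query applies (number each duplicate in Table B, then drop any copy whose number exceeds that row's count in Table A) can be sketched outside the database. The snippet below is plain Python using the tables from the question; the function name cap_duplicates is hypothetical, and it keeps the first copies in list order, whereas the SQL version does not define which copies survive.

```python
from collections import Counter


def cap_duplicates(table_a, table_b):
    """Return table_b with rows that also occur in table_a capped at table_a's count."""
    allowed = Counter(table_a)   # count(*) per (a, b, c) in Table A
    seen = Counter()             # running row_number per (a, b, c) in Table B
    kept = []
    for row in table_b:
        seen[row] += 1
        # Drop the row only if it matches Table A and exceeds A's count.
        if row in allowed and seen[row] > allowed[row]:
            continue
        kept.append(row)
    return kept


table_a = [("a", "b", "c")] * 3 + [("e", "f", "g"), ("h", "i", "j"), ("k", "l", "m")]
table_b = (
    [("a", "b", "c")] * 4
    + [("e", "f", "g"), ("k", "l", "m"), ("k", "l", "m"), ("x", "y", "z"), ("s", "t", "u")]
    + [("a", "b", "c")] * 2
)

kept = cap_duplicates(table_a, table_b)
for row in kept:
    print(row)
```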
I think that if the table isn't big, the simplest way is to delete all rows from TableB that exist in TableA and then insert TableA into TableB. Other approaches, IMHO, require at least a primary key in TableB.
DELETE FROM TableB
WHERE EXISTS(SELECT * FROM TableA
WHERE TableB.C1=TableA.C1
AND TableB.C2=TableA.C2
AND TableB.C3=TableA.C3) ;
INSERT INTO TableB SELECT * FROM TableA;

JDBC data comparison

I want to compare the content of 15 columns in two rows.
I am using DB2 9 with JDBC.
Can I use a SQL query to get a result like "match" or "not match"?
And how can I find out which columns differ?
You can use the EXCEPT operator to do this.
In the example below, I'm using common table expressions to fetch the single rows (assuming, in this case, that id is the primary key).
with r1 as (select c1, c2, ..., c15 from t where id = 1),
r2 as (select c1, c2, ..., c15 from t where id = 2)
select * from r1
except
select * from r2
If this returns 0 rows, then the rows are identical. If it returns a row, then the two rows differ.
If you really want the result to be 'MATCH' or 'NOT MATCH':
with r1 as (select c1, c2, ..., c15 from t where id = 1),
r2 as (select c1, c2, ..., c15 from t where id = 2),
rs as (select * from r1 except select * from r2)
select
case when count(*) = 0 then 'MATCH'
else 'NOT MATCH'
end as comparison
from
rs;
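The EXCEPT comparison above can be tried on a small table. This sketch assumes Python's built-in sqlite3 module rather than DB2 over JDBC, and it uses a hypothetical three-column table in place of the 15 columns. The helper differing_columns is an addition that is not part of the answer above: it compares the two fetched rows on the client side to list the column names that differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: rows 1 and 3 are identical, row 2 differs in c2.
    CREATE TABLE t (id INTEGER PRIMARY KEY, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO t VALUES (1, 'a', 'b', 'c'), (2, 'a', 'B', 'c'), (3, 'a', 'b', 'c');
""")

COMPARE_SQL = """
    WITH r1 AS (SELECT c1, c2, c3 FROM t WHERE id = ?),
         r2 AS (SELECT c1, c2, c3 FROM t WHERE id = ?),
         rs AS (SELECT * FROM r1 EXCEPT SELECT * FROM r2)
    SELECT CASE WHEN COUNT(*) = 0 THEN 'MATCH' ELSE 'NOT MATCH' END AS comparison
    FROM rs
"""


def compare_rows(id_a, id_b):
    """Run the EXCEPT-based comparison and return 'MATCH' or 'NOT MATCH'."""
    return conn.execute(COMPARE_SQL, (id_a, id_b)).fetchone()[0]


def fetch_row(row_id):
    """Fetch one row as a dict mapping column name to value."""
    cur = conn.execute("SELECT c1, c2, c3 FROM t WHERE id = ?", (row_id,))
    names = [d[0] for d in cur.description]
    return dict(zip(names, cur.fetchone()))


def differing_columns(id_a, id_b):
    """List the column names whose values differ between the two rows."""
    a, b = fetch_row(id_a), fetch_row(id_b)
    return [name for name in a if a[name] != b[name]]


print(compare_rows(1, 2), differing_columns(1, 2))
print(compare_rows(1, 3), differing_columns(1, 3))
```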