pyspark join issues - dataframe not defined - pyspark

I am trying to join two dataframes. This is what I did:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
df3=spark.sql (select a.*, b.column1
from df1 a
left join df2 b
on a.col1=b.col1 and a.col2=b.col2 and a.col3=b.col3)
But when I do print(df3), I get this error:
NameError: name 'df3' is not defined
Why am I getting this error? Can't I join two temp views in Spark?
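A likely cause, assuming the code was run exactly as pasted: spark.sql() takes the query as a plain Python string, so the unquoted SQL is a syntax error, the assignment to df3 never executes, and the later print(df3) raises NameError. A minimal sketch of the quoted form (df1/df2 are the temp view names from the question; the spark.sql call itself is left commented out since it needs a live SparkSession):

```python
# Sketch: the SQL must be a Python string. If it is pasted unquoted,
# Python fails before spark.sql() is ever called, so df3 is never bound.
query = """
SELECT a.*, b.column1
FROM df1 a
LEFT JOIN df2 b
  ON a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 = b.col3
"""
# df3 = spark.sql(query)  # 'spark' is the active SparkSession
# df3.show()
```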

Related

converting sql to dataframe api

How can the SQL below be converted to Spark? I attempted the conversion below but saw this error:
Error evaluating method : '$eq$eq$eq': Method threw
'java.lang.RuntimeException' exception.
I am also not sure how to represent "where sp1.cart_id = sp.cart_id" in a Spark query.
select distinct
o.order_id
, 'PENDING'
from shopping sp
inner join order o
on o.cart_id = sp.cart_id
where o.order_date = (select max(sp1.order_date)
from shopping sp1
where sp1.cart_id = sp.cart_id)
SHOPPING_DF
.select(
"ORDER_ID",
"PENDING")
.join(ORDER_DF, Seq("CART_ID"), "inner")
.filter(col("ORDER_DATE") === SHOPPING_DF.groupBy("CART_ID").agg(max("ORDER_DATE")))
If the query is rewritten as a simple join against a subquery on shopping that uses the window function MAX to determine the latest order date for each cart_id, it can be expressed in SQL as
SELECT DISTINCT
o.order_id,
'PENDING'
FROM
order o
INNER JOIN (
SELECT
cart_id,
MAX(order_date) OVER (
PARTITION BY cart_id
) as order_date
FROM
shopping
) sp ON sp.cart_id = o.cart_id AND
sp.order_date = o.order_date
This can be run directly on your Spark session to achieve the same results.
Converted to the DataFrame API, it could be written as
ORDER_DF.alias("o")
.join(
SHOPPING_DF.selectExpr(
"cart_id",
"MAX(order_date) OVER (PARTITION BY cart_id) as order_date"
).alias("sp"),
Seq("cart_id","order_date"),
"inner"
)
.selectExpr(
"o.order_id",
"'PENDING' as PENDING"
).distinct()
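As a sanity check, the rewritten SQL runs unchanged on any engine with window functions. Here it is exercised on SQLite (3.25+) with tiny made-up sample rows; "order" is a reserved word, so it is quoted:

```python
import sqlite3

# Minimal check of the window-function rewrite on in-memory SQLite.
# Table/column names follow the answer; the rows are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE shopping (cart_id INTEGER, order_date TEXT);
    CREATE TABLE "order" (order_id INTEGER, cart_id INTEGER, order_date TEXT);
    INSERT INTO shopping VALUES (1, '2020-01-01'), (1, '2020-01-05');
    INSERT INTO "order" VALUES (10, 1, '2020-01-01'), (11, 1, '2020-01-05');
""")
rows = con.execute("""
    SELECT DISTINCT o.order_id, 'PENDING'
    FROM "order" o
    INNER JOIN (
        SELECT cart_id,
               MAX(order_date) OVER (PARTITION BY cart_id) AS order_date
        FROM shopping
    ) sp ON sp.cart_id = o.cart_id AND sp.order_date = o.order_date
""").fetchall()
# Only the order on the latest shopping date per cart survives.
```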
Let me know if this works for you.

Inner Join On (Hive) gives different result than PySpark Inner Join

I observed this phenomenon today. When I execute the following command in the Hive CLI, I obtain a different result than when doing the same thing with PySpark:
Hive :
Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX=t2.fieldX AND t1.fieldY=t2.fieldY);
Result: 17,488
SparkSQL :
hc.sql("Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX==t2.fieldX AND t1.fieldY==t2.fieldY)")
Result: 5555
I obtain the same result using this code :
table1.alias("t1").join(
other=table2.alias("t2"),
on=[col("t1.fieldX") == col("t2.fieldX"), col("t1.fieldY") == col("t2.fieldY")],
how='inner'
).select("t1.fieldX").distinct().count()
Result: 5555
I don't understand why I obtain different results!

How to access fields after cross join of the same table in PySpark

I ran a cross join on a table and it worked fine. Now the problem is that I don't know how to address the two identically named id fields in the resulting dataframe.
df = spark.sql("select p1.id, p2.id from profile p1 CROSS JOIN profile p2 WHERE p1.id < p2.id")
When I printed out the first row, I got something like this:
Row(id=21398968, id=76109821)
Running "print(res_2[0]['id'])" yields only the first one as a scalar value (not a list)
You can change your query to be:
df = spark.sql("select p1.id AS p1_id, p2.id AS p2_id from profile p1 CROSS JOIN profile p2 WHERE p1.id < p2.id")
By using AS you should be able to avoid the name conflict.
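The same aliasing pattern can be illustrated on any SQL engine; here is a small in-memory SQLite sketch with made-up ids, showing that the result columns come back under the aliases rather than as two conflicting id fields:

```python
import sqlite3

# Hypothetical profile table; the aliases p1_id/p2_id follow the answer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE profile (id INTEGER)")
con.executemany("INSERT INTO profile VALUES (?)", [(1,), (2,), (3,)])
cur = con.execute("""
    SELECT p1.id AS p1_id, p2.id AS p2_id
    FROM profile p1 CROSS JOIN profile p2
    WHERE p1.id < p2.id
""")
names = [d[0] for d in cur.description]  # column names are the aliases
pairs = cur.fetchall()                   # each unordered pair appears once
```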

SQL: Joining tables using wildcards

When I join 2 tables
select t1.col1, t1.col2, t2.col3
from t1
left join t2
on t1.col2=t2.col3
Then I don't get any duplicate rows. However, when I try joining tables using wildcards:
select t1.col1, t1.col2, t2.col3
from t1
left join t2
on t1.col2 like '%'||t2.col3
Then I'll get duplicate values. I saw this post that I think is getting me somewhere, but I couldn't really understand the solution:
Joining 2 Tables using a wildcard
It says I can use exists to get rid of duplicate values. I also don't really understand his query. Here is the query from the other post:
select *
from tableA a
where exists (select 1 from tableB b where a.id like '%' + b.id + '%');
What does the select 1 from tableB do?
I'm using PostgreSQL
This is what I've tried; even though I don't understand it, it still gives me duplicates:
select t1.col1,t1.col2,t2.col3
from t1
left join t2
on t1.col2 like '%'||t2.col3
where exists (select 1 from t2 where t1.col2 like '%'||t2.col3)
Use the DISTINCT clause:
SELECT DISTINCT t1.col1, t1.col2, t2.col3
FROM t1
LEFT JOIN t2 ON t1.col2 LIKE '%'||t2.col3;
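On the "select 1" question: EXISTS only checks whether the subquery returns any row at all, so the selected expression is irrelevant; 1 is just a cheap constant. That is also why EXISTS avoids duplicates: each outer row is emitted at most once regardless of how many matches exist, whereas a LEFT JOIN emits one row per match (the trade-off being that EXISTS cannot return t2.col3). A SQLite sketch with made-up rows, using the same '%'||col3 pattern as the question:

```python
import sqlite3

# Hypothetical data: 'xyz' matches both 'z' and 'yz' under LIKE '%'||col3.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (col1 TEXT, col2 TEXT);
    CREATE TABLE t2 (col3 TEXT);
    INSERT INTO t1 VALUES ('a', 'xyz');
    INSERT INTO t2 VALUES ('z'), ('yz');
""")
joined = con.execute("""
    SELECT t1.col1, t1.col2, t2.col3
    FROM t1 LEFT JOIN t2 ON t1.col2 LIKE '%'||t2.col3
""").fetchall()   # one output row per matching t2 row
semi = con.execute("""
    SELECT t1.col1, t1.col2
    FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.col2 LIKE '%'||t2.col3)
""").fetchall()   # one output row per t1 row, however many matches
```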

Get the difference between two values

I'm totally stuck on comparing two tables and getting the difference between them.
So here we go:
I have table a with the following columns:
Name|Value|Date
and a second table b with the same columns.
What I want to do now is get the difference between the values, like:
Table a
Name|Value|Date
Test|3|2013-20-06
Table b
Name|Value|Date
Test|9|2013-20-06
What I want to get is the difference between the 3 and the 9, so I would receive 6.
Any idea how I can get that from a query in my PostgreSQL DB?
Join the tables and select the difference:
select a.name, b.value - a.value, a.date
from a inner join b on a.name = b.name and a.date = b.date
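The join can be sanity-checked on in-memory SQLite with the sample rows from the question (an ISO date is used here for the sample data):

```python
import sqlite3

# Sample tables a and b from the question, one matching row each.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (name TEXT, value INTEGER, date TEXT);
    CREATE TABLE b (name TEXT, value INTEGER, date TEXT);
    INSERT INTO a VALUES ('Test', 3, '2013-06-20');
    INSERT INTO b VALUES ('Test', 9, '2013-06-20');
""")
rows = con.execute("""
    SELECT a.name, b.value - a.value AS diff, a.date
    FROM a INNER JOIN b ON a.name = b.name AND a.date = b.date
""").fetchall()
# 9 - 3 gives the expected difference of 6.
```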