pyspark join issues - dataframe not defined - pyspark

I am trying to join two dataframes. This is what I did:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
df3=spark.sql (select a.*, b.column1
from df1 a
left join df2 b
on a.col1=b.col1 and a.col2=b.col2 and a.col3=b.col3)
But when I do print(df3), I get this error:
NameError: name 'df3' is not defined
Why am I getting this error? Can't I join two temp views in Spark?
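A likely cause, assuming the code was run exactly as pasted: spark.sql() takes the query as a plain Python string, so the unquoted SQL is a syntax error, the assignment to df3 never executes, and the later print(df3) raises NameError. A minimal sketch of the quoted form (df1/df2 are the temp view names from the question; the spark.sql call itself is left commented out since it needs a live SparkSession):

```python
# Sketch: the SQL must be a Python string. If it is pasted unquoted,
# Python fails before spark.sql() is ever called, so df3 is never bound.
query = """
SELECT a.*, b.column1
FROM df1 a
LEFT JOIN df2 b
  ON a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 = b.col3
"""
# df3 = spark.sql(query)  # 'spark' is the active SparkSession
# df3.show()
```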

Related

converting sql to dataframe api

How can the SQL below be converted to Spark? I attempted the conversion below but saw this error:
Error evaluating method : '$eq$eq$eq': Method threw
'java.lang.RuntimeException' exception.
I am also not sure how to represent "where sp1.cart_id = sp.cart_id" in a Spark query.
select distinct
o.order_id
, 'PENDING'
from shopping sp
inner join order o
on o.cart_id = sp.cart_id
where o.order_date = (select max(sp1.order_date)
from shopping sp1
where sp1.cart_id = sp.cart_id)
SHOPPING_DF
.select(
"ORDER_ID",
"PENDING")
.join(ORDER_DF, Seq("CART_ID"), "inner")
.filter(col("ORDER_DATE") === SHOPPING_DF.groupBy("CART_ID").agg(max("ORDER_DATE")))
If the query is rewritten as a simple join against a subquery on shopping that uses the window function MAX to determine the latest order date for each cart_id, it can be expressed in SQL as
SELECT DISTINCT
o.order_id,
'PENDING'
FROM
order o
INNER JOIN (
SELECT
cart_id,
MAX(order_date) OVER (
PARTITION BY cart_id
) as order_date
FROM
shopping
) sp ON sp.cart_id = o.cart_id AND
sp.order_date = o.order_date
This can be run directly on your Spark session to achieve the same results.
Converted to the DataFrame API, it could be written as
ORDER_DF.alias("o")
.join(
SHOPPING_DF.selectExpr(
"cart_id",
"MAX(order_date) OVER (PARTITION BY cart_id) as order_date"
).alias("sp"),
Seq("cart_id","order_date"),
"inner"
)
.selectExpr(
"o.order_id",
"'PENDING' as PENDING"
).distinct()
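As a sanity check, the rewritten SQL runs unchanged on any engine with window functions. Here it is exercised on SQLite (3.25+) with tiny made-up sample rows; "order" is a reserved word, so it is quoted:

```python
import sqlite3

# Minimal check of the window-function rewrite on in-memory SQLite.
# Table/column names follow the answer; the rows are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE shopping (cart_id INTEGER, order_date TEXT);
    CREATE TABLE "order" (order_id INTEGER, cart_id INTEGER, order_date TEXT);
    INSERT INTO shopping VALUES (1, '2020-01-01'), (1, '2020-01-05');
    INSERT INTO "order" VALUES (10, 1, '2020-01-01'), (11, 1, '2020-01-05');
""")
rows = con.execute("""
    SELECT DISTINCT o.order_id, 'PENDING'
    FROM "order" o
    INNER JOIN (
        SELECT cart_id,
               MAX(order_date) OVER (PARTITION BY cart_id) AS order_date
        FROM shopping
    ) sp ON sp.cart_id = o.cart_id AND sp.order_date = o.order_date
""").fetchall()
# Only the order on the latest shopping date per cart survives.
```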
Let me know if this works for you.

Inner Join On (Hive) gives different result than PySpark Inner Join

I observed this phenomenon today. When I execute the following command in the Hive CLI, I obtain a different result than when doing the same thing with PySpark:
Hive :
Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX=t2.fieldX AND t1.fieldY=t2.fieldY);
Result: 17,488
SparkSQL :
hc.sql("Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX==t2.fieldX AND t1.fieldY==t2.fieldY)")
Result: 5555
I obtain the same result using this code :
table1.alias("t1").join(
other=table2.alias("t2"),
on=[col("t1.fieldX") == col("t2.fieldX"), col("t1.fieldY") == col("t2.fieldY")],
how='inner'
).select("t1.fieldX").distinct().count()
Result: 5555
I don't understand why I obtain different results!

How to access fields after cross join of the same table in PySpark

I ran a cross join on a table and it worked fine. Now the problem is that I don't know how to address the two identically named id fields in the resulting dataframe.
df = spark.sql("select p1.id, p2.id from profile p1 CROSS JOIN profile p2 WHERE p1.id < p2.id")
When I printed out the first row, I got something like this:
Row(id=21398968, id=76109821)
Running "print(res_2[0]['id'])" yields only the first one as a scalar value (not a list)
You can change your query to be:
df = spark.sql("select p1.id AS p1_id, p2.id AS p2_id from profile p1 CROSS JOIN profile p2 WHERE p1.id < p2.id")
By using AS you should be able to avoid the name conflict.
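The same aliasing pattern can be illustrated on any SQL engine; here is a small in-memory SQLite sketch with made-up ids, showing that the result columns come back under the aliases rather than as two conflicting id fields:

```python
import sqlite3

# Hypothetical profile table; the aliases p1_id/p2_id follow the answer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE profile (id INTEGER)")
con.executemany("INSERT INTO profile VALUES (?)", [(1,), (2,), (3,)])
cur = con.execute("""
    SELECT p1.id AS p1_id, p2.id AS p2_id
    FROM profile p1 CROSS JOIN profile p2
    WHERE p1.id < p2.id
""")
names = [d[0] for d in cur.description]  # column names are the aliases
pairs = cur.fetchall()                   # each unordered pair appears once
```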

SQL: Joining tables using wildcards

When I join 2 tables
select t1.col1, t1.col2, t2.col3
from t1
left join t2
on t1.col2=t2.col3
Then I don't get any duplicate rows. However, when I try joining tables using wildcards:
select t1.col1, t1.col2, t2.col3
from t1
left join t2
on t1.col2 like '%'||t2.col3
Then I'll get duplicate values. I saw this post that I think is getting me somewhere, but I couldn't really understand the solution:
Joining 2 Tables using a wildcard
It says I can use exists to get rid of duplicate values. I also don't really understand his query. Here is the query from the other post:
select *
from tableA a
where exists (select 1 from tableB b where a.id like '%' + b.id + '%');
What does the select 1 from tableB do?
I'm using PostgreSQL
This is what I've tried; even though I don't understand it, it still gives me duplicates:
select t1.col1,t1.col2,t2.col3
from t1
left join t2
on t1.col2 like '%'||t2.col3
where exists (select 1 from t2 where t1.col2 like '%'||t2.col3)
Use the DISTINCT clause:
SELECT DISTINCT t1.col1, t1.col2, t2.col3
FROM t1
LEFT JOIN t2 ON t1.col2 LIKE '%'||t2.col3;
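On the "select 1" question: EXISTS only checks whether the subquery returns any row at all, so the selected expression is irrelevant; 1 is just a cheap constant. That is also why EXISTS avoids duplicates: each outer row is emitted at most once regardless of how many matches exist, whereas a LEFT JOIN emits one row per match (the trade-off being that EXISTS cannot return t2.col3). A SQLite sketch with made-up rows, using the same '%'||col3 pattern as the question:

```python
import sqlite3

# Hypothetical data: 'xyz' matches both 'z' and 'yz' under LIKE '%'||col3.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (col1 TEXT, col2 TEXT);
    CREATE TABLE t2 (col3 TEXT);
    INSERT INTO t1 VALUES ('a', 'xyz');
    INSERT INTO t2 VALUES ('z'), ('yz');
""")
joined = con.execute("""
    SELECT t1.col1, t1.col2, t2.col3
    FROM t1 LEFT JOIN t2 ON t1.col2 LIKE '%'||t2.col3
""").fetchall()   # one output row per matching t2 row
semi = con.execute("""
    SELECT t1.col1, t1.col2
    FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.col2 LIKE '%'||t2.col3)
""").fetchall()   # one output row per t1 row, however many matches
```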

Get the difference between two values

I'm totally stuck on comparing two tables and getting the difference between them.
So here we go:
I have table a with the following columns:
Name|Value|Date
and a second table b with the same columns.
What I want to do now is get the difference between the values, like:
Table a
Name|Value|Date
Test|3|2013-20-06
Table b
Name|Value|Date
Test|9|2013-20-06
What I want to get is the difference between the 3 and the 9, so I would receive 6.
Any idea how I can get that from a query in my PostgreSQL DB?
Join the tables and select the difference:
select a.name, b.value - a.value, a.date
from a inner join b on a.name = b.name and a.date = b.date
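The join can be sanity-checked on in-memory SQLite with the sample rows from the question (an ISO date is used here for the sample data):

```python
import sqlite3

# Sample tables a and b from the question, one matching row each.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (name TEXT, value INTEGER, date TEXT);
    CREATE TABLE b (name TEXT, value INTEGER, date TEXT);
    INSERT INTO a VALUES ('Test', 3, '2013-06-20');
    INSERT INTO b VALUES ('Test', 9, '2013-06-20');
""")
rows = con.execute("""
    SELECT a.name, b.value - a.value AS diff, a.date
    FROM a INNER JOIN b ON a.name = b.name AND a.date = b.date
""").fetchall()
# 9 - 3 gives the expected difference of 6.
```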