Inner join in spark dataframes not getting proper columns as expected - scala

I am doing inner join for the in spark dataframes similar coversion of sql query
SELECT DISTINCT a.aid,a.DId,a.BM,a.BY,b.TO FROM GetRaw a
INNER JOIN DF_SD b WHERE a.aid = b.aid AND a.DId= b.DId AND a.BM= b.BM AND a.BY = b.BY"
I am converting as
val Pr = DF_SD.select("aid","DId","BM","BY","TO").distinct()
.join(GetRaw,GetRaw.("aid") <=> DF_SD("aid")
&& GetRaw.("DId") <=> DF_SD("DId")
&& DF_SD,GetRaw.("BM") <=> DF_SD("BM")
&& DF_SD,GetRaw.("BY") <=> DF_SD("BY"))
My Output Table contains columns
"aid","DId","BM","BY","TO","aid","DId","BM","BY"
Can any one correct where I am doing wrong

Just use SELECT of distincts after join:
val Pr = DF_SD.join(GetRaw,Seq("aid","DId","BM","BY"))
.select("aid","DId","BM","BY","TO").distinct

you can mention column names in sequence, which is correct way of handling this problem..
pls see https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
val Pr = DF_SD.join(GetRaw,Seq("aid","DId","BM","BY"))
.dropDuplicates() //optionally, if you want to drop duplicate rows from the dataframe then
Pr.show();

Related

Effectively counting records after join in spark

This is what I am doing. I need to get number of records present in one dataset and not the other and then again join with a third dataset to get some other columns.
val tooCompare = dw
.select(
"loc",
"id",
"country",
"region"
).dropDuplicates()
val previous = dw
.select(
"loc",
"id",
"country",
"region"
).dropDuplicates()
val delta = tooCompare.exceptAll(previous).cache()
val records = delta
.join(
dw,//another dataset
delta
.col("loc").equalTo(dw.col("loc"))
.and(delta.col("id").equalTo(dw.col("id")))
.and(delta.col("country").equalTo(dw.col("country")))
.and(delta.col("region").equalTo(dw.col("region")))
)
.drop(delta.col("loc"))
.drop(delta.col("id"))
.drop(delta.col("country"))
.drop(delta.col("region"))
.cache()
}
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
Is there a more efficient way to do this?
I am new to Spark. I am pretty sure I am missing something here
I would suggest using SQL to make this more readable.
First, create Temp Views of the dataframes in question. Don't know exactly what data frames you have, so something like
dfToCompare.createOrReplaceTempView("toCompare")
previousDf.createOrReplaceTempView("previous")
anotherDataSet.createOrReplaceTempView("another")
Then you can proceed to do all your opertions in one SQL statement
val records = spark.sql("""select loc, id, country,region
from toCompare c
inner join another a
on a.loc = c.loc
and a.id = p.id
and a.country = c.country
and a.region = c.region
where not exists (select null
from previous p
where p.loc = c.loc
and p.id = p.id
and p.country = c.country
and p.region = c.region""")
Then you can proceed as before...
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
I think there's potentially some errors in the code you've pasted as tooCompare and previous are the same, + the third dataset join references deAnon but dw on the table....
For this example answer, assume your current table is called "current", previous is called "previous" and third table is "extra". Then:
val delta = current.join(
previous,
Seq("loc","id","country","region"),
"leftanti"
).select("loc","id","country","region").distinct
val recordsToSend = delta
.join(
extra,
Seq("loc", "id", "country", "region")
)
val count = recordsToSend.select("loc").distinct().count()
This may be more efficient, but I'd appreciate you commenting as to whether it actually was!
Just as an aside: note that I'm using the Seq[String] as a join argument (this requires the column names to be identical on both tables, and won't produce two copies of the columns). However, your original join logic can be written a bit more succinctly, as follows (using my naming conventions):
val recordsToSend = delta
.join(
extra,
delta("loc") === extra("loc")
&& delta("id") === extra("id")
&& delta("country") === extra("country")
&& delta("region") === extra("region")
)
.drop(delta("loc"))
.drop(delta("id"))
.drop(delta("country"))
.drop(delta("region"))
Even better would be to write a drop function that lets you provide more than one column, but I'm going really off topic now ;-)

Scala Left Join returns result of Full Join

I tried to do Join two dataframes in spark shell.
One of the dataframe is having 15000 records and another is having 14000 rows.
I tried Left outer join and inner join of these dataframes, but result is having count of 29000 rows.
How is that happening?
The code which i tried is given below.
val joineddf = finaldf.as("df1").join(cosmos.as("df2"), $"df1.BatchKey" === $"df2.BatchKey", "left_outer").select(($"df1.*"),col("df2.BatchKey").as("B2"))
val joineddf = finaldf.as("df1").join(cosmos.as("df2"), $"df1.BatchKey" === $"df2.BatchKey", "inner").select(($"df1.*"),col("df2.BatchKey").as("B2"))
Both above methods are resulted in a dataframe where count is sum of both dataframes.
Even I tried the below method, but still getting same result.
finaldf.createOrReplaceTempView("df1")
cosmos.createOrReplaceTempView("df2")
val test = spark.sql("""SELECT df1.*, df2.* FROM df1 LEFT OUTER JOIN df2 ON trim(df1.BatchKey) == trim(df2.BatchKey)""")
If i try to add more condition for join then the no of count is again increasing.
How to get exact result for a left outer join?
here in the case max count should be 15000
Antony,
Can you try performing the join below :
val joineddf = finaldf.join(cosmos.select("BatchKey"), Seq("BatchKey"), "left_outer")
Here I'm not using any alias.

convert SQL join query to pyspark syntax

I'm working to convert a known working SQL query to work in pyspark, given two dataframes, using methods such as: .join, .where, filter, etc.
Here are examples of SQL queries that work (only selecting r.id where I will normally select more columns):
# "invalid" records, where there is a matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is not null;
# "valid" records, where there is no matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is not null;
I'm 80/20 close, but having trouble wrapping my head around the the last few steps, and/or how to do this most efficiently.
I've got a Dataframe r_df with column id that I'd like to join with Dataframe rv_df on column record_id. As output, I'd like only distinct r.id, and only columns from r_df, none from rv_df. Finally, I'd like two different calls where there is a match (what will be "invalid" records for me), and where there is not a match (what I consider "valid" records).
I have pyspark queries that get close, but not terribly clear on how to ensure that r_df.id is distinct, and select only columns from r_df, none from rv_df.
Any help would be much appreciated!
Just had to walk away for a couple hours. Found a solution that works for my use case.
First, selecting only distinct record_id from rv_df:
rv_df = rv_df.select('record_id').distinct()
Then use that for intersection and disjoints:
# Intersection:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftsemi').select(r_df['*'])
# Disjoint:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftanti').select(r_df['*'])

How to select all columns of a dataframe in join - Spark-scala

I am doing join of 2 data frames and select all columns of left frame for example:
val join_df = first_df.join(second_df, first_df("id") === second_df("id") , "left_outer")
in above I want to do select first_df.* .How can I select all columns of one frame in join ?
With alias:
first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
We can also do it with leftsemi join. leftsemi join will select the data from left side dataframe from a joined dataframe.
Here we join two dataframes df1 and df2 based on column col1.
df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi")
Suppose you:
Want to use the DataFrame syntax.
Want to select all columns from df1 but only a couple from df2.
This is cumbersome to list out explicitly due to the number of columns in df1.
Then, you might do the following:
val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
Just to add one possibility, whithout using alias, I was able to do that in pyspark with
first_df.join(second_df, "id", "left_outer").select( first_df["*"] )
Not sure if applies here, but hope it helps

Join two RDD in spark

I have two rdd one rdd have just one column other have two columns to join the two RDD on key's I have add dummy value which is 0 , is there any other efficient way of doing this using join ?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me convert this question in SQL. Say for example I have table1 (moveid) and table2 (movieid,moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table table1 on table1.movieid=table2.moveid
group by ....
here in SQL table1 has only one column where as table2 has two columns still the join works, same way in Spark can join on keys from both the RDD's.
Join operation is defined only on PairwiseRDDs which are quite different from a relation / table in SQL. Each element of PairwiseRDD is a Tuple2 where the first element is the key and the second is value. Both can contain complex objects as long as key provides a meaningful hashCode
If you want to think about this in a SQL-ish you can consider key as everything that goes to ON clause and value contains selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance and you can express one using another there is one fundamental difference. When you look at the SQL table and you ignore constraints all columns belong in the same class of objects, while key and value in the PairwiseRDD have a clear meaning.
Going back to your problem to use join you need both key and value. Arguably much cleaner than using 0 as a placeholder would be to use null singleton but there is really no way around it.
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
lines.map(x => x.split("\t")).map(_.head).collect.toSet)
movienames.filter{case (id, _) => moviesidBD.value contains id}
but if you really want SQL-ish joins then you should simply use SparkSQL.
val movieIdsDf = lines
.map(x => x.split("\t"))
.map(a => Tuple1(a.head))
.toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
On RDD Join operation is only defined for PairwiseRDDs, So need to change the value to pairedRDD. Below is a sample
val rdd1=sc.textFile("/data-001/part/")
val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
val rdd2=sc.textFile("/data-001/partsupp/")
val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)