Efficiently counting records after a join in Spark - Scala

This is what I am doing. I need to get the number of records present in one dataset and not the other, and then join with a third dataset to get some other columns.
val tooCompare = dw
  .select(
    "loc",
    "id",
    "country",
    "region"
  ).dropDuplicates()

val previous = dw
  .select(
    "loc",
    "id",
    "country",
    "region"
  ).dropDuplicates()

val delta = tooCompare.exceptAll(previous).cache()

val records = delta
  .join(
    dw, // another dataset
    delta
      .col("loc").equalTo(dw.col("loc"))
      .and(delta.col("id").equalTo(dw.col("id")))
      .and(delta.col("country").equalTo(dw.col("country")))
      .and(delta.col("region").equalTo(dw.col("region")))
  )
  .drop(delta.col("loc"))
  .drop(delta.col("id"))
  .drop(delta.col("country"))
  .drop(delta.col("region"))
  .cache()

val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
Is there a more efficient way to do this?
I am new to Spark, and I am pretty sure I am missing something here.

I would suggest using SQL to make this more readable.
First, create temp views of the dataframes in question. I don't know exactly what dataframes you have, so something like:
dfToCompare.createOrReplaceTempView("toCompare")
previousDf.createOrReplaceTempView("previous")
anotherDataSet.createOrReplaceTempView("another")
Then you can proceed to do all your operations in one SQL statement:
val records = spark.sql("""
  select c.loc, c.id, c.country, c.region
  from toCompare c
  inner join another a
    on a.loc = c.loc
    and a.id = c.id
    and a.country = c.country
    and a.region = c.region
  where not exists (select null
                    from previous p
                    where p.loc = c.loc
                    and p.id = c.id
                    and p.country = c.country
                    and p.region = c.region)""")
Then you can proceed as before...
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()

I think there are potentially some errors in the code you've pasted, as tooCompare and previous are the same, and the third dataset join references deAnon but uses dw as the table....
For this example answer, assume your current table is called "current", previous is called "previous" and the third table is "extra". Then:
val delta = current.join(
  previous,
  Seq("loc", "id", "country", "region"),
  "leftanti"
).select("loc", "id", "country", "region").distinct
val recordsToSend = delta
  .join(
    extra,
    Seq("loc", "id", "country", "region")
  )
val count = recordsToSend.select("loc").distinct().count()
This may be more efficient, but I'd appreciate you commenting as to whether it actually was!
Just as an aside: note that I'm using a Seq[String] as the join argument (this requires the column names to be identical on both tables, and won't produce two copies of the columns). However, your original join logic can be written a bit more succinctly, as follows (using my naming conventions):
val recordsToSend = delta
  .join(
    extra,
    delta("loc") === extra("loc")
      && delta("id") === extra("id")
      && delta("country") === extra("country")
      && delta("region") === extra("region")
  )
  .drop(delta("loc"))
  .drop(delta("id"))
  .drop(delta("country"))
  .drop(delta("region"))
Even better would be to write a drop function that lets you provide more than one column, but I'm going really off topic now ;-)
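For what it's worth, here's a rough sketch of what such a helper could look like (my own naming, untested against your data): fold over the shared column names and drop delta's copy of each after the join.
import org.apache.spark.sql.DataFrame

// Hypothetical helper: drop `side`'s copy of each named column from `df`
def dropFrom(df: DataFrame, side: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((acc, c) => acc.drop(side(c)))

val recordsToSend = dropFrom(
  delta.join(
    extra,
    delta("loc") === extra("loc")
      && delta("id") === extra("id")
      && delta("country") === extra("country")
      && delta("region") === extra("region")
  ),
  delta,
  Seq("loc", "id", "country", "region")
)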

Related

Check if join stream was successful using Apache Spark - Scala

I am new to Apache Spark, using Scala. I am able to join a table to the stream using the following command:
val Updated_DF = Inbound_DF
  .join(colToAdd, colToAdd("key") <=> Inbound_DF("key"), "left")
  .withColumnRenamed("Data_DF", "site")
  .drop("Id", "key")
Now I want to check whether colToAdd("key") and Inbound_DF("key") matched and the join was successful or not. For example, colToAdd:
Id   key  Data_DF
S31  S3   {"name":"nick","region":"IN"}
S21  S2   {"name":"john","region":"CA"}
S11  S1   {"name":"ashley","region":"CA"}
S51  S5   {"name":"bella","region":"UK"}
S41  S4   {"name":"kumar","region":"In"}
S6   S6   {"name":"ben","region":"US"}
P11  P1   {"name":"MKD","region":"UAE"}
P21  P2   {"name":"ahmad","region":"UAE"}
A message from the incoming stream looks like:
cusId  key  item  price
1897   S2   book  54
After the join, the updated message should look like:
cusId  key  item  price  site
1897   S2   book  54     {"name":"john","region":"CA"}
But if I get a stream message with key = S9, the join will not happen and then I want to log a message:
------- join failed, key not found ---------
As far as I know, this can be achieved using the filter method, but I am not sure how to implement it. Please explain how this can be done, or whether there is a better way to do the same.
There are multiple ways of doing it. I am just providing you with an idea of how this could be done, and you can adjust it to your use case.
First, the way you are doing the left join is incorrect; you need to swap the dataframes. The stream dataframe should be the left dataframe.
import spark.implicits._

// Source data
val df = Seq(
  ("S31", "S3", """{"name":"nick","region":"IN"}"""),
  ("S21", "S2", """{"name":"john","region":"CA"}"""),
  ("S11", "S1", """{"name":"ashley","region":"CA"}""")
).toDF("Id", "Key", "Data_DF")

val df1 = Seq(
  (1897, "S2", "book", 54),
  (1920, "S9", "movie", 200)
).toDF("custId", "Key", "item", "price")

// Initial join and the count of the records
val df2 = df1.join(df, Seq("Key"), "left").drop("Id").withColumnRenamed("Data_DF", "site")
val initialJoinCount = df2.count()

// Filter and count of the records
val filteredDF = df2.filter($"site".isNotNull)
val filteredDFCount = filteredDF.count()

// Compare both counts and print a message/log
if (filteredDFCount == initialJoinCount) {
  println("Join Happened")
} else {
  println("Value not found in stream.")
}
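As a variation on the same idea (a sketch only, assuming Spark 2.4+ for Dataset.isEmpty), you could filter out the rows whose lookup failed and log their keys directly instead of comparing counts:
// Rows where the lookup produced no match
val unmatchedDF = df2.filter($"site".isNull)

if (unmatchedDF.isEmpty) {
  println("Join Happened")
} else {
  // Log one line per missing key
  unmatchedDF.select("Key").collect().foreach { row =>
    println(s"------- join failed, key not found: ${row.getString(0)} ---------")
  }
}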

Best way to update a dataframe in Spark Scala

Consider two DataFrames, data_df and update_df. These two dataframes have the same schema (key, update_time, bunch of columns).
I know two (main) ways to "update" data_df with update_df:
full outer join
I join the two dataframes (on key) and then pick the appropriate columns (according to the value of update_time).
max over partition
Union both dataframes, compute the max update_time by key and then keep only the rows that equal this maximum.
Here are the questions:
Is there any other way?
Which one is the best way, and why?
I've already done the comparison with some open data.
Here is the join code:
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")

var res_df = join_df
  .where($"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
  .select(col("data.*"))
  .union(
    join_df
      .where($"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
      .select(col("maj.*"))
  )
And here is the window code:
import org.apache.spark.sql.expressions._

val byKey = Window.partitionBy($"key") // orderBy is implicit here

res_df = data_df.union(maj_df)
  .withColumn("max_version", max("update_time").over(byKey))
  .where($"update_time" === $"max_version")
I can paste the DAGs and plans here if needed, but they are pretty large.
My first guess is that the join solution might be the best way, but it only works if the update dataframe has only one version per key.
PS: I'm aware of the Apache Delta solution, but sadly I'm not able to use it.
Below is one way of doing it that joins only on the keys, in an effort to minimize the amount of memory used by the filters and join commands.
/// Two records, one with a change, one with no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name")
  .unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))

/// Two records, one changed, one new
val updateDF = spark.sql("select 'aa' as Key, 'Aoe' as Name")
  .unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))

/// Make new DFs of each, just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")

/// Find the keys that exist in both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")

/// Turn the known keys into an Array
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x => x.mkString).collect

/// Keep the rows from original whose keys are not found in the update
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray: _*)))

/// Build the output from the unchanged records, updated records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)
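For reference, printing finalDF for the sample rows above should give something like this (unchanged record kept, changed record taken from updateDF, new record appended):
finalDF.orderBy("Key").show()
// +---+----+
// |Key|Name|
// +---+----+
// | aa| Aoe|
// | bb| Moe|
// | cc| Doe|
// +---+----+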

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames, and for each common column in those tables I use the concatenation of a primary key, say order_id, with the other columns like order_date, order_name, order_event.
The Scala code I am using:
val primaryKey = "order_id"
for (i <- commonColumnsList) {
  val column_name = i
  val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  // Get those records which are common in both old/new tables
  matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
  // Get those records which aren't common in both old/new tables
  nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
  // Total null/non-null counts in both old and new tables
  nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
  nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
  // Put the result for a given column in a Seq variable, later converted to a DataFrame
  tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString,
    (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
    (nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final step: create a DataFrame using the Seq and a schema
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code works fine for a medium-sized dataset, but as the number of columns and records in my new and old tables grows, the execution time keeps increasing. Any sort of advice is appreciated.
Thank you in advance.
You can do the following:
1. Outer join the old and new dataframes on the primary key:
val joinedDf = df_old.join(df_new, Seq(primaryKey), "outer")
2. Cache it if you possibly can. This will save you a lot of time.
3. Now you can iterate over the columns and compare them using Spark functions (.isNull for not matched, === for matched, etc.):
for (c <- df_new.columns) {
  val matchCount = joinedDf.filter(df_new(c).isNotNull && df_old(c).isNotNull).count()
  val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't, it might be a good idea to save the joined df to disk in order to avoid a shuffle each time.
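To make that loop concrete, here is a hedged sketch, reusing oldDataFrame/newDataFrame from the question and assuming the shared primary key column is order_id; aliasing both sides avoids ambiguous column references after the join, since both tables share the same column names.
import org.apache.spark.sql.functions.col

// Outer join once on the primary key and cache the result
val joinedDf = oldDataFrame.alias("old")
  .join(newDataFrame.alias("new"), Seq("order_id"), "outer")
  .cache()

// For every other common column, count matching and non-matching values
val stats = newDataFrame.columns.filter(_ != "order_id").map { c =>
  val matched = joinedDf
    .filter(col(s"old.$c").isNotNull && col(s"new.$c").isNotNull && col(s"old.$c") === col(s"new.$c"))
    .count()
  val notMatched = joinedDf
    .filter(col(s"old.$c").isNull || col(s"new.$c").isNull || col(s"old.$c") =!= col(s"new.$c"))
    .count()
  (c, matched, notMatched)
}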

Join on several conditions without duplicated columns

Joining two Datasets on a common key with Spark leads to that key column being duplicated in the final Dataset:
val result = ds1.join(ds2, ds1("key") === ds2("key"))
// result now has two "key" columns
This is avoidable by using a Seq instead of the comparison, similar to the USING keyword in SQL:
val result = ds1.join(ds2, Seq("key"))
// result now has only one "key" column
However, this doesn't work when joining with a common key + another condition, like:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
// result has two "key" columns
val result = ds1.join(ds2, Seq("key") && ds1("foo") < ds2("foo"))
// compile error: value && is not a member of Seq[String]
Currently one way of getting out of this is to drop the duplicated column afterwards, but this is quite cumbersome:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
.drop(ds1("key"))
Is there a more natural, cleaner way to achieve the same goal?
You can separate the equi-join component and the filter:
ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
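A quick illustration with made-up data (assuming a SparkSession named spark with implicits imported): the result keeps a single "key" column, and the original column references still resolve in the filter.
import spark.implicits._

val ds1 = Seq(("a", 1), ("b", 5)).toDF("key", "foo")
val ds2 = Seq(("a", 3), ("b", 2)).toDF("key", "foo")

val result = ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
result.show() // only key "a" survives (1 < 3), and "key" appears once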

Join two RDDs in Spark

I have two RDDs; one RDD has just one column and the other has two columns. To join the two RDDs on keys I have added a dummy value, which is 0. Is there any other, more efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("ml-100k/u.item")

val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), 0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0), x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me convert this question to SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we would write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL table1 has only one column whereas table2 has two columns, and the join still works; in the same way, can I join on keys from both RDDs in Spark?
The join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key to be everything that goes into the ON clause and the value to contain the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while key and value in the PairwiseRDD have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around it.
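For illustration, here is a minimal sketch of what that null placeholder could look like, reusing the variable names from your snippet:
// Same pairing as before, but with null instead of the dummy 0
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), null))
val joined = movienames.join(moviesid) // RDD[(String, (String, Null))]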
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
  lines.map(x => x.split("\t")).map(_.head).collect.toSet)

movienames.filter { case (id, _) => moviesidBD.value contains id }
but if you really want SQL-ish joins then you should simply use Spark SQL:
import spark.implicits._

val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a.head))
  .toDF("id")

val movienamesDf = movienames.toDF("id", "name")

// Add an optional join type qualifier if needed
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
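And if you then want the per-movie count from your SQL sketch, a hypothetical continuation could be:
movienamesDf
  .join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("id"), movienamesDf("name"))
  .count()
  .show()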
On RDDs, the join operation is only defined for PairwiseRDDs, so you need to convert the values into paired RDDs. Below is a sample:
val rdd1 = sc.textFile("/data-001/part/")
val rdd_1 = rdd1.map(x => x.split('|')).map(x => (x(0), x(1)))

val rdd2 = sc.textFile("/data-001/partsupp/")
val rdd_2 = rdd2.map(x => x.split('|')).map(x => (x(0), x(1)))

rdd_1.join(rdd_2).take(2).foreach(println)