Check if a stream join was successful using Apache Spark - Scala

I am new to Apache Spark, using Scala. I am able to join a table to the stream using the following command:
val Updated_DF = Inbound_DF.join(colToAdd, colToAdd("key") <=> Inbound_DF("key"), "left")
  .withColumnRenamed("Data_DF", "site")
  .drop("Id", "key")
Now I want to check whether colToAdd("key") and Inbound_DF("key") matched and the join was successful. For example, colToAdd:
Id key Data_DF
S31 S3 {"name":"nick","region":"IN"}
S21 S2 {"name":"john","region":"CA"}
S11 S1 {"name":"ashley","region":"CA"}
S51 S5 {"name":"bella","region":"UK"}
S41 S4 {"name":"kumar","region":"In"}
S6 S6 {"name":"ben","region":"US"}
P11 P1 {"name":"MKD","region":"UAE"}
P21 P2 {"name":"ahmad","region":"UAE"}
A message from the incoming stream looks like:
cusId key item price
1897 S2 book 54
After join, the updated message should look like:
cusId key item price site
1897 S2 book 54 {"name":"john","region":"CA"}
But if I get a stream message with key = S9, the join will not happen and then I want to log a message:
------- join failed, key not found ---------
As far as I know, this can be achieved using the filter method, but I am not sure how to implement it. Please tell me how this can be done, or whether there is a better way to do the same.

There are multiple ways of doing it. I am just providing you with an idea of how this could be done, and you can adjust it according to your use case.
First, the way you are doing the left join is incorrect; you need to swap the DataFrames. The stream DataFrame should be the left DataFrame.
//Source data
val df = Seq(
  ("S31", "S3", """{"name":"nick","region":"IN"}"""),
  ("S21", "S2", """{"name":"john","region":"CA"}"""),
  ("S11", "S1", """{"name":"ashley","region":"CA"}""")
).toDF("Id", "Key", "Data_DF")
val df1 = Seq((1897, "S2", "book", 54), (1920, "S9", "movie", 200)).toDF("custId", "Key", "item", "price")
//Initial join and the count of the records
val df2 = df1.join(df, Seq("Key"), "left").drop("Id").withColumnRenamed("Data_DF", "site")
val initialJoinCount = df2.count()
//Filter out the unmatched records and count again
val filteredDF = df2.filter($"site".isNotNull)
val filteredDFCount = filteredDF.count()
//Compare both counts and print/log a message
if (filteredDFCount == initialJoinCount) {
  println("Join Happened")
} else {
  println("Value not found in stream.")
}
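If you also want the exact log message from the question, and to see which keys failed, a small variant of the same idea is to filter the unmatched rows directly; a minimal sketch building on df2 above:
//Rows whose key was not found in colToAdd end up with a null "site" column
val failedDF = df2.filter($"site".isNull)
if (failedDF.count() > 0) {
  println("------- join failed, key not found ---------")
  failedDF.select("Key").show(false) //which keys were missing
} else {
  println("Join Happened")
}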

Related

Effectively counting records after join in spark

This is what I am doing. I need to get the number of records present in one dataset and not the other, and then join again with a third dataset to get some other columns.
val tooCompare = dw
  .select("loc", "id", "country", "region")
  .dropDuplicates()
val previous = dw
  .select("loc", "id", "country", "region")
  .dropDuplicates()
val delta = tooCompare.exceptAll(previous).cache()
val records = delta
  .join(
    dw, //another dataset
    delta.col("loc").equalTo(dw.col("loc"))
      .and(delta.col("id").equalTo(dw.col("id")))
      .and(delta.col("country").equalTo(dw.col("country")))
      .and(delta.col("region").equalTo(dw.col("region")))
  )
  .drop(delta.col("loc"))
  .drop(delta.col("id"))
  .drop(delta.col("country"))
  .drop(delta.col("region"))
  .cache()
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
Is there a more efficient way to do this?
I am new to Spark. I am pretty sure I am missing something here.
I would suggest using SQL to make this more readable.
First, create temp views of the DataFrames in question. I don't know exactly what DataFrames you have, so something like:
dfToCompare.createOrReplaceTempView("toCompare")
previousDf.createOrReplaceTempView("previous")
anotherDataSet.createOrReplaceTempView("another")
Then you can proceed to do all your operations in one SQL statement:
val records = spark.sql("""select c.loc, c.id, c.country, c.region
from toCompare c
inner join another a
  on a.loc = c.loc
  and a.id = c.id
  and a.country = c.country
  and a.region = c.region
where not exists (select null
  from previous p
  where p.loc = c.loc
  and p.id = c.id
  and p.country = c.country
  and p.region = c.region)""")
Then you can proceed as before...
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
I think there are potentially some errors in the code you've pasted, as tooCompare and previous are the same, and the third dataset join references deAnon but uses dw as the table...
For this example answer, assume your current table is called "current", the previous one is called "previous", and the third table is "extra". Then:
val delta = current.join(
previous,
Seq("loc","id","country","region"),
"leftanti"
).select("loc","id","country","region").distinct
val recordsToSend = delta
.join(
extra,
Seq("loc", "id", "country", "region")
)
val count = recordsToSend.select("loc").distinct().count()
This may be more efficient, but I'd appreciate you commenting as to whether it actually was!
Just as an aside: note that I'm using the Seq[String] as a join argument (this requires the column names to be identical on both tables, and won't produce two copies of the columns). However, your original join logic can be written a bit more succinctly, as follows (using my naming conventions):
val recordsToSend = delta
.join(
extra,
delta("loc") === extra("loc")
&& delta("id") === extra("id")
&& delta("country") === extra("country")
&& delta("region") === extra("region")
)
.drop(delta("loc"))
.drop(delta("id"))
.drop(delta("country"))
.drop(delta("region"))
Even better would be to write a drop function that lets you provide more than one column, but I'm going really off topic now ;-)
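For what it's worth, such a helper is a small fold; here is a sketch (the helper name dropAll is made up, not a Spark API, and joinCondition stands for the explicit join condition shown above):
import org.apache.spark.sql.{Column, DataFrame}

//Drop several Column references from a DataFrame in one call
def dropAll(df: DataFrame, cols: Column*): DataFrame =
  cols.foldLeft(df)((acc, c) => acc.drop(c))

//e.g. dropAll(delta.join(extra, joinCondition), delta("loc"), delta("id"), delta("country"), delta("region"))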

Best way to update a dataframe in Spark scala

Consider two DataFrames, data_df and update_df. These two DataFrames have the same schema (key, update_time, a bunch of columns).
I know two (main) ways to "update" data_df with update_df:
Full outer join
I join the two DataFrames (on key) and then pick the appropriate columns (according to the value of update_time).
Max over partition
Union both DataFrames, compute the max update_time by key, and then filter only the rows that equal this maximum.
Here are the questions:
Is there any other way?
Which one is the best way, and why?
I've already done the comparison with some open data.
Here is the join code:
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")
var res_df = join_df
  .where($"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
  .select(col("data.*"))
  .union(
    join_df
      .where($"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
      .select(col("maj.*")))
And here is the window code:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.max

val byKey = Window.partitionBy($"key") // no orderBy needed, max is computed over the whole partition
res_df = data_df.union(maj_df)
  .withColumn("max_version", max("update_time").over(byKey))
  .where($"update_time" === $"max_version")
I can paste the DAGs and plans here if needed, but they are pretty large.
My first guess is that the join solution might be the best way, but it only works if the update DataFrame has only one version per key.
PS: I'm aware of the Apache Delta solution, but sadly I'm not able to use it.
Below is one way of doing it that joins only on the keys, in an effort to minimize the amount of memory used by the filter and join commands.
///Two records, one with a change, one no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
///Two records, one change, one new
val updateDF = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
///Make new DFs of each just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
///Find the keys that are similar between both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
///Turn the known keys into an Array
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x=>x.mkString).collect
///Filter the rows from original that are not found in the new file
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray:_*)))
///Update the output with unchanged records, update records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)
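If the set of matching keys gets large, collecting it to the driver can become a bottleneck; here is a variant sketch that keeps everything distributed by using a left_anti join on the same originalDF and updateDF:
///Keep only the original rows whose Key does not appear in the update set
val unchangedDF = originalDF.join(updateDF.select("Key"), Seq("Key"), "left_anti")
///Append all update rows (changed and new) to build the final output
val finalUpdatedDF = unchangedDF.unionAll(updateDF)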

How to get the row top 1 in Spark Structured Streaming?

I have an issue with Spark Structured Streaming (Spark 2.2.1). I am developing a real-time pipeline where I first get data from Kafka, then join the result with another table, and then send the DataFrame to an ALS model (Spark ML), which returns a streaming DataFrame with one additional column, prediction. The problem is that when I tried to get the row with the highest score, I couldn't find a way to do it.
I tried:
applying SQL functions like limit, take, and sort
the dense_rank() function
searching on Stack Overflow
I read the Unsupported Operations section, but there doesn't seem to be much there.
Additionally, I would send the row with the highest score to a Kafka queue.
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", mySchema).as("data"))
//.select("data.*")
.selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")
result.printSchema()
val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)
//fullPredictions is a streaming DataFrame with an extra column "prediction"; here I need the code to get the top row
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
query.awaitTermination()
Update
Maybe I was not clear, so I'm attaching an image of my problem. I also wrote simpler code to complement it, detailing what I need: https://gist.github.com/.../9193c8a983c9007e8a1b6ec280d8df25
I will appreciate any help :)
TL;DR Use stream-stream inner joins (Spark 2.3.0) or use a memory sink (or a Hive table) for temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside (it just gives you a streaming Dataset with predictions), finding the maximum value in a column of a streaming Dataset is the real problem here.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (which was released a few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
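A minimal sketch of that inner join, assuming the streaming Dataset is called streaming and has the columns idCustomer and score used in the snippet above (note that Structured Streaming places restrictions on joins that follow streaming aggregations, so this may need adapting to your output mode):
import org.apache.spark.sql.functions.max

val maxScores = streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
val topRows = streaming.as("s").join(
  maxScores.as("m"),
  $"s.idCustomer" === $"m.idCustomer" && $"s.score" === $"m.maxscore")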
Try this:
Implement a function that extracts the max value of the column and then filters your DataFrame by that max:
def getDataFrameMaxRow(df: DataFrame, col: String): DataFrame = {
  val gson = new Gson()
  // get the maximum value of the column
  val list_prediction = df.select(col).toJSON.rdd
    .collect()
    .toList
    .map { x => gson.fromJson[JsonObject](x, classOf[JsonObject]) }
    .map { x => x.get(col).getAsString.toInt }
  val max = getMaxFromList(list_prediction)
  // filter the dataframe by the maximum value
  df.filter(df(col) === max.toString())
}

def getMaxFromList(xs: List[Int]): Int = xs match {
  case List(x: Int) => x
  case x :: y :: rest => getMaxFromList((if (x > y) x else y) :: rest)
}
And in the body of your code add:
import com.google.gson.JsonObject
import com.google.gson.Gson
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!

Spark: efficient way to search another dataframe

I have one DataFrame (df) with IP addresses and their corresponding long values (ip_int), and now I want to search another DataFrame (ip2Country), which contains geolocation information, to find their corresponding country names. How should I do this in Scala? My current code did not work out: memory limit exceeded.
val ip_ints = df.select("ip_int").distinct.collect().flatMap(_.toSeq)
val df_list = ListBuffer[DataFrame]()
for (v <- ip_ints) {
  val ip_int = v.toString.toLong
  df_list += ip2Country.filter(($"network_start_integer" <= ip_int) && ($"network_last_integer" >= ip_int))
    .select("country_name").withColumn("ip_int", lit(ip_int))
}
var df1 = df_list.reduce(_ union _)
df = df.join(df1, Seq("ip_int"), "left")
Basically, I try to iterate through every ip_int value, search for it in ip2Country, and merge the results back with df.
Any help is much appreciated!
A simple join should do the trick for you
df.join(df1, df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int"), "left")
.select("ip", "ip_int", "country_name")
If you want to remove the rows with a null country_name, then you can add a filter too:
df.join(df1, df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int"), "left")
.select("ip", "ip_int", "country_name")
.filter($"country_name".isNotNull)
I hope the answer is helpful
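Since ip2Country is usually a small lookup table compared to df, adding a broadcast hint may also help with the memory pressure of this range join; a sketch, assuming ip2Country fits in executor memory:
import org.apache.spark.sql.functions.broadcast

df.join(broadcast(ip2Country),
    ip2Country("network_start_integer") <= df("ip_int") && ip2Country("network_last_integer") >= df("ip_int"),
    "left")
  .select("ip", "ip_int", "country_name")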
You want to do a non-equi join, which you can implement by cross joining and then filtering, though it is resource heavy to do so. Assuming you are using Spark 2.1:
df.createOrReplaceTempView("ip_int")
ip2Country.select("network_start_integer", "network_last_integer", "country_name").createOrReplaceTempView("ip_int_lookup")
// val spark: SparkSession
val result: DataFrame = spark.sql("select a.*, b.country_name from ip_int a, ip_int_lookup b where b.network_start_integer <= a.ip_int and b.network_last_integer >= a.ip_int")
If you want to include null ip_int, you will need to right join df to result.
I feel puzzled here:
df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int")
Can we use
df1("network_start_integer") === df("ip_int")
here instead?

Flink: Table API copy operators in execution plan

I use the Flink 1.2.0 Table API for processing some streaming data. The following is my code:
val dataTable = myDataStream
// table A
val tableA = dataTable
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group1, group2")
.select("w.start as time, group1, group2, data1.sum as data1, data2.sum as data2")
tableEnv.registerTable("tableA", tableA)
// table A sink
tableA.writeToSink(sinkTableA)
//...
// I should get some other outputs from TableA's output
//...
val registeredTableA = tableEnv.ingest("tableA")
// table result1
val result1 = registeredTableA
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group1")
.select("w.start as time, group1, data1.sum as data1")
// result1 sink
result1.writeToSink(sinkResult1)
// table result2
val result2 = registeredTableA
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group2")
.select("w.start as time, group2, data2.sum as data1")
// result2 sink
result2.writeToSink(sinkResult2)
I expected to get this tree in the Flink execution plan, the same as I have for Flink streaming in my other Flink jobs:
DataStream_Operators -> TableA_Operators -> TableA_Sink
|-> Result1_Operators -> Result1_Sink
|-> Result2_Operators -> Result2_Sink
But instead I get this, with 3 copies of the same operators for TableA:
DataStream_Operators -> TableA_Operators -> TableA_Sink
|-> Copy_of_TableA_Operators -> Result1_Operators -> Result1_Sink
|-> Copy_of_TableA_Operators -> Result2_Operators -> Result2_Sink
As a result, this job performs badly with large input data.
How can I fix this and get an optimal execution plan?
I understand that the Flink Table API and SQL are experimental features and maybe this will be fixed in future versions.
In its current state, the Table API translates the whole query whenever you convert a Table into a DataSet or DataStream, or write it to a TableSink. In your program, you call writeToSink three times, which means that the complete query is translated each time.
But what is the complete query? It is all the Table API operators that have been applied to a Table. When you register a Table in the TableEnvironment, it is basically registered as a view, i.e., only its definition (all the operators that define the Table) is registered. Therefore, these operators are translated again when you call writeToSink the second and third time.
You can solve this issue if you translate tableA into a DataStream and register the DataStream in the TableEnvironment instead of registering it as a Table. This would look as follows:
val tableA = ...
val streamA = tableA.toDataStream[X] // X should be a case class for rows of tableA
tableEnv.registerDataStream("tableA", streamA)
tableEnv.ingest("tableA").writeToSink(sinkTableA) // emit tableA by ingesting the registered DataStream
I know this is not very convenient, but at the moment it is the only way to avoid repeated translation of a Table.