Spark: efficient way to search another dataframe - scala

I have one dataframe (df) with ip addresses and their corresponding long value (ip_int) and now I want to search in an another dataframe (ip2Country) which contains geolocation information to find their corresponding country name. How should I do it in Scala. My code currently didnt work out: Memory limit exceed.
val ip_ints=df.select("ip_int").distinct.collect().flatMap(_.toSeq)
val df_list = ListBuffer[DataFrame]()
for(v <- ip_ints){
var ip_int=v.toString.toLong
df_list +=ip2Country.filter(($"network_start_integer"<=ip_int)&&($"network_last_integer">=ip_int)).select("country_name").withColumn("ip_int", lit(ip_int))
}
var df1 = df_list.reduce(_ union _)
df=df.join(df1,Seq("ip_int"),"left")
Basically I try to iterate through every ip_int value and search them in ip2Country and merge them back with df.
Any help is much appreciated!

A simple join should do the trick for you
df.join(df1, df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int"), "left")
.select("ip", "ip_int", "country_name")
If you want to remove the null country_name then you can add filter too
df.join(df1, df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int"), "left")
.select("ip", "ip_int", "country_name")
.filter($"country_name".isNotNull)
I hope the answer is helpful

You want to do a non-equi join, which you can implement by cross joining and then filtering, though it is resource heavy to do so. Assuming you are using Spark 2.1:
df.createOrReplaceTempView("ip_int")
df.select("network_start_integer", "network_start_integer", "country_name").createOrReplaceTempView("ip_int_lookup")
// val spark: SparkSession
val result: DataFrame = spark.sql("select a.*, b.country_name from ip_int a, ip_int_lookup b where b.network_start_integer <= a.ip_int and b.network_last_integer >= a.ip_int)
If you want to include null ip_int, you will need to right join df to result.

I feel puzzled here.
df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int")
Can we use the
df1("network_start_integer")===df("ip_int")
here please?

Related

Best way to update a dataframe in Spark scala

Consider two Dataframe data_df and update_df. These two dataframes have the same schema (key, update_time, bunch of columns).
I know two (main) way to "update" data_df with update_df
full outer join
I join the two dataframes (on key) and then pick the appropriate columns (according to the value of update_timestamp)
max over partition
Union both dataframes, compute the max update_timestamp by key and then filter only rows that equal this maximum.
Here are the questions :
Is there any other way ?
Which one is the best way and why ?
I've already done the comparison with some Open Data
Here is the join code
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")
var res_df = join_df.where( $"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
.select(col("data.*"))
.union(
join_df.where( $"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
.select(col("maj.*")))
And here is window code
import org.apache.spark.sql.expressions._
val byKey = Window.partitionBy($"key") // orderBy is implicit here
res_df = data_df.union(maj_df)
.withColumn("max_version", max("update_time").over(byKey))
.where($"update_time" === $"max_version")
I can paste you DAGs and Plans here if needed, but they are pretty large
My first guess is that the join solution might be the best way but it only works if the update dataframe got only one version per key.
PS : I'm aware of Apache Delta solution but sadly i'm not able too use it.
Below is one way of doing it to only join on the keys, in an effort to minimize the amount of memory to be used on filters and on join commands.
///Two records, one with a change, one no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
///Two records, one change, one new
val updateDF = = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
///Make new DFs of each just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
///Find the keys that are similar between both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
///Turn the known keys into an Array
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x=>x.mkString).collect
///Filter the rows from original that are not found in the new file
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray:_*)))
///Update the output with unchanged records, update records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)

Dataframe column substring based on the value during join

I have a dataframe with column having values like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx"
I need to compare this column with another column in a different dataframe based on the column value.
If column value have "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")
If the column value have "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how can we achieve this in Spark dataframe?
You can try adding a new column based on the conditions and join on the new column. Something like this.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
expr("substring(column1, 6, length(column1)-1)")).otherwise(col("column1")) )
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
DF2.withColumn("column2",
when($"column1" rlike "COR//.*",
$"column1".substr(lit(4), length($"column1")).
otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.

How to get the row top 1 in Spark Structured Streaming?

I have an issue with Spark Streaming (Spark 2.2.1). I am developing a real time pipeline where first I get data from Kafka, second join the result with another table, then send the Dataframe to a ALS model (Spark ML) and it return a streaming Dataframe with one additional column predit. The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
I tried:
Apply SQL functions like Limit, Take, sort
dense_rank() function
search in StackOverflow
I read Unsupported Operations but doesn't seem to be much there.
Additional with the highest score I would send to a Kafka queue
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", mySchema).as("data"))
//.select("data.*")
.selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")
result.printSchema()
val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)
//fullPredictions is a streaming dataframe with a extra column "prediction", here i need the code to get the first row
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
query.awaitTermination()
Update
Maybe I was not clear, so I'm attaching an image with my problem. Also I wrote a more simple code to complement it: https://gist.github.com/.../9193c8a983c9007e8a1b6ec280d8df25
detailing what i need. Please I will appreciate any help :)
TL;DR Use stream-stream inner joins (Spark 2.3.0) or use memory sink (or a Hive table) for a temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside as it gives you a streaming Dataset with predictions so focusing on finding a maximum value in a column in a streaming Dataset is the real case here.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (that has been released few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
Try this:
Implement a function that extract the max value of the column and then filter your dataframe with the max
def getDataFrameMaxRow(df:DataFrame , col:String):DataFrame = {
// get the maximum value
val list_prediction = df.select(col).toJSON.rdd
.collect()
.toList
.map { x => gson.fromJson[JsonObject](x, classOf[JsonObject])}
.map { x => x.get(col).getAsString.toInt}
val max = getMaxFromList(list_prediction)
// filter dataframe by the maximum value
val df_filtered = df.filter(df(col) === max.toString())
return df_filtered
}
def getMaxFromList(xs: List[Int]): Int = xs match {
case List(x: Int) => x
case x :: y :: rest => getMaxFromList( (if (x > y) x else y) :: rest )
}
And in the body of your code add:
import com.google.gson.JsonObject
import com.google.gson.Gson
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns, the first one is a cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one records per customer in df2 to df1 with all the features as 0. But as the two dataframe have two different schema, I cannot do a union. What is the best way to do that?
I use a full outer join but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
val features = df1.columns.toSet - "cust_id" // Remove "cust_id" column
val newDF = features.foldLeft(df2)(
(df, colName) => df.withColumn(colName, lit(0))
)
df1.unionAll(newDF)

How to select all columns of a dataframe in join - Spark-scala

I am doing join of 2 data frames and select all columns of left frame for example:
val join_df = first_df.join(second_df, first_df("id") === second_df("id") , "left_outer")
in above I want to do select first_df.* .How can I select all columns of one frame in join ?
With alias:
first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
We can also do it with leftsemi join. leftsemi join will select the data from left side dataframe from a joined dataframe.
Here we join two dataframes df1 and df2 based on column col1.
df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi")
Suppose you:
Want to use the DataFrame syntax.
Want to select all columns from df1 but only a couple from df2.
This is cumbersome to list out explicitly due to the number of columns in df1.
Then, you might do the following:
val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
Just to add one possibility, whithout using alias, I was able to do that in pyspark with
first_df.join(second_df, "id", "left_outer").select( first_df["*"] )
Not sure if applies here, but hope it helps