How to insert record into a dataframe in spark - scala

I have a dataframe (df1) with 50 columns: the first one is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1 with all the features set to 0. But as the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I tried a full outer join, but it generates two cust_id columns and I need only one. I should somehow merge these two cust_id columns, but I don't know how.

You can do a full outer join on the column name, which keeps a single cust_id column in the result:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
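import org.apache.spark.sql.functions.lit
// All columns except "cust_id", kept in df1's order (unionAll matches columns by position)
val features = df1.columns.filterNot(_ == "cust_id")
val newDF = features.foldLeft(df2)(
(df, colName) => df.withColumn(colName, lit(0))
)
df1.unionAll(newDF)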
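As a rough alternative sketch, the full outer join can be combined with na.fill so that the missing features become 0 directly. The toy data and the names f1, f2, featureCols and filled are assumptions made up for illustration, and a local SparkSession is assumed. Unlike the union above, this produces one row per cust_id, so a customer already present in df1 keeps its existing features instead of getting an extra all-zero row.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("fill-sketch").getOrCreate()
import spark.implicits._

// Toy stand-ins (assumed data): df1 has cust_id plus features, df2 has cust_id only
val df1 = Seq((1, 0.5, 3.0), (2, 1.5, 4.0)).toDF("cust_id", "f1", "f2")
val df2 = Seq(2, 3).toDF("cust_id")

// Every column except the key is treated as a feature
val featureCols = df1.columns.filterNot(_ == "cust_id").toSeq

// Joining on Seq("cust_id") yields a single cust_id column;
// na.fill then replaces the nulls in the feature columns with 0
val filled = df1.join(df2, Seq("cust_id"), "full_outer").na.fill(0, featureCols)

filled.show()
```

Restricting na.fill to the feature columns avoids touching any other column that might legitimately contain nulls.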

Related

Best way to update a dataframe in Spark scala

Consider two dataframes, data_df and update_df. They have the same schema (key, update_time, and a bunch of other columns).
I know two main ways to "update" data_df with update_df:
1. Full outer join: I join the two dataframes on key and then pick the appropriate columns according to the value of update_time.
2. Max over partition: I union both dataframes, compute the max update_time by key, and then keep only the rows that equal this maximum.
Here are my questions:
Is there any other way?
Which one is the best way, and why?
I've already done the comparison with some open data.
Here is the join code:
// maj_df is the update dataframe (update_df above)
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")
// >= so that a key with the same update_time on both sides is kept (from data) instead of dropped
var res_df = join_df.where( $"data.update_time" >= $"maj.update_time" || $"maj.update_time".isNull)
.select(col("data.*"))
.union(
join_df.where( $"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
.select(col("maj.*")))
And here is the window code:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.max
val byKey = Window.partitionBy($"key") // no orderBy: the window frame then covers the whole partition
res_df = data_df.union(maj_df)
.withColumn("max_version", max("update_time").over(byKey))
.where($"update_time" === $"max_version")
I can paste the DAGs and plans here if needed, but they are pretty large.
My first guess is that the join solution might be the best way, but it only works if the update dataframe has only one version per key.
PS: I'm aware of the Apache Delta solution, but sadly I'm not able to use it.
Below is one way of doing it that joins only on the keys, in an effort to minimize the amount of memory used by the filter and join operations.
///Two records, one with a change, one no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
///Two records, one change, one new
val updateDF = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
///Make new DFs of each just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
///Find the keys that are present in both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
///Turn the known keys into an Array (this collects them to the driver)
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x=>x.mkString).collect
///Filter the rows from original that are not found in the new file
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray:_*)))
///Update the output with unchanged records, update records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)
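If collecting the keys to the driver is a concern, a left anti join might express the same idea without leaving the cluster. This is only a sketch, assuming Spark 2.0 or later (where the "left_anti" join type and union are available) and a local SparkSession; unchangedDF and mergedDF are names introduced here for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Same toy records as above: one changed, one unchanged, one new
val originalDF = Seq(("aa", "Joe"), ("cc", "Doe")).toDF("Key", "Name")
val updateDF = Seq(("aa", "Aoe"), ("bb", "Moe")).toDF("Key", "Name")

// Rows of originalDF whose Key has no match in updateDF (the unchanged records)
val unchangedDF = originalDF.join(updateDF, Seq("Key"), "left_anti")

// Unchanged records plus every updated or new record
val mergedDF = unchangedDF.union(updateDF)

mergedDF.show()
```

Like the version above, this assumes updateDF holds at most one row per Key.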

Dataframe column substring based on the value during join

I have a dataframe with a column containing values like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx".
I need to compare this column with a column in a different dataframe, and how I compare depends on the column value.
If the column value looks like "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")).
If the column value looks like "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition to the join. Could anyone please let me know how to achieve this with Spark dataframes?
You can try adding a new column based on the conditions and join on the new column. Something like this.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
expr("substring(column1, 6, length(column1)-1)")).otherwise(col("column1")) )
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1           |joinCol      |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx     |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
DF2.withColumn("column2",
when($"column1" rlike "COR//.*",
$"column1".substr(lit(4), length($"column1"))).
otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that substr takes either two Ints or two Columns, so to combine a constant with a Column argument you need to wrap the constant in lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.
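Another option, swapping the when/substr logic for a regular expression, is to strip the prefix with regexp_replace. The following is a minimal sketch with made-up sample values and a local SparkSession; it assumes the prefix is always exactly "COR//" at the start of the value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{regexp_replace, trim, upper}

val spark = SparkSession.builder().master("local[*]").appName("prefix-sketch").getOrCreate()
import spark.implicits._

// Made-up sample values for illustration
val DF2 = Seq("COR//abcde-12-3456", "fghij-78-9012").toDF("column1")
val DF3 = Seq("ABCDE-12-3456", "FGHIJ-78-9012", "ZZZZZ-00-0000").toDF("column1")

// Remove a leading "COR//" if present; other values pass through unchanged
val DF4 = DF2.withColumn("joinCol", regexp_replace($"column1", "^COR//", ""))

// Join on the cleaned column, ignoring case and surrounding whitespace
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))

DF1.show(false)
```

Because the pattern is anchored with ^, a "COR//" that appears later in the value is left alone.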

Spark: efficient way to search another dataframe

I have one dataframe (df) with IP addresses and their corresponding long value (ip_int), and now I want to search another dataframe (ip2Country), which contains geolocation information, to find the corresponding country name. How should I do it in Scala? My current code does not work: it exceeds the memory limit.
val ip_ints=df.select("ip_int").distinct.collect().flatMap(_.toSeq)
val df_list = ListBuffer[DataFrame]()
for(v <- ip_ints){
var ip_int=v.toString.toLong
df_list +=ip2Country.filter(($"network_start_integer"<=ip_int)&&($"network_last_integer">=ip_int)).select("country_name").withColumn("ip_int", lit(ip_int))
}
var df1 = df_list.reduce(_ union _)
df=df.join(df1,Seq("ip_int"),"left")
Basically I try to iterate through every ip_int value and search them in ip2Country and merge them back with df.
Any help is much appreciated!
A simple join against the lookup dataframe (ip2Country in the question) should do the trick for you:
df.join(ip2Country, ip2Country("network_start_integer")<=df("ip_int") && ip2Country("network_last_integer")>=df("ip_int"), "left")
.select("ip", "ip_int", "country_name")
If you want to drop the rows with a null country_name, you can add a filter too:
df.join(ip2Country, ip2Country("network_start_integer")<=df("ip_int") && ip2Country("network_last_integer")>=df("ip_int"), "left")
.select("ip", "ip_int", "country_name")
.filter($"country_name".isNotNull)
I hope the answer is helpful
You want to do a non-equi join, which you can implement by cross joining and then filtering, though it is resource heavy to do so. Assuming you are using Spark 2.1:
df.createOrReplaceTempView("ip_int")
ip2Country.select("network_start_integer", "network_last_integer", "country_name").createOrReplaceTempView("ip_int_lookup")
// val spark: SparkSession
val result: DataFrame = spark.sql("select a.*, b.country_name from ip_int a, ip_int_lookup b where b.network_start_integer <= a.ip_int and b.network_last_integer >= a.ip_int")
If you want to include rows with a null ip_int, you will need to right join result with df, so that every row of df is kept.
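Since the lookup table stores ranges rather than single values, the join condition has to be a pair of inequalities. If the lookup table is small enough to fit in memory on each executor, a broadcast hint may help Spark avoid shuffling it. The sketch below uses tiny made-up integers instead of real IP values and assumes a local SparkSession; whether the hint actually helps depends on the size of ip2Country.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("range-join-sketch").getOrCreate()
import spark.implicits._

// Toy data (assumed): small integers stand in for real ip_int values
val df = Seq(("ip-a", 15L), ("ip-b", 150L), ("ip-c", 999L)).toDF("ip", "ip_int")
val ip2Country = Seq((10L, 20L, "A"), (100L, 200L, "B"))
  .toDF("network_start_integer", "network_last_integer", "country_name")

// Range (non-equi) join: keep every row of df, attach the country whose range contains ip_int
val result = df.join(
    broadcast(ip2Country),
    ip2Country("network_start_integer") <= df("ip_int") &&
      ip2Country("network_last_integer") >= df("ip_int"),
    "left")
  .select("ip", "ip_int", "country_name")

result.show()
```

Here ip-c falls outside both ranges, so its country_name stays null.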
I feel puzzled here.
df1("network_start_integer")<=df("ip_int") && df1("network_last_integer")>=df("ip_int")
Can we use the
df1("network_start_integer")===df("ip_int")
here please?

How to select all columns of a dataframe in join - Spark-scala

I am joining two dataframes and want to select all columns of the left one, for example:
val join_df = first_df.join(second_df, first_df("id") === second_df("id") , "left_outer")
In the above, I want to select first_df.*. How can I select all columns of one dataframe in a join?
With alias:
first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
We can also do it with a leftsemi join. A leftsemi join returns only the columns of the left dataframe, and only for the rows that have a match in the right dataframe (unlike left_outer, unmatched left rows are dropped).
Here we join two dataframes df1 and df2 based on column col1.
df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi")
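To make the difference from a left outer join concrete, here is a small sketch on made-up data, assuming a local SparkSession. The column names left_val and right_val are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

// Toy data: the row with col1 = 2 has no match on the right
val df1 = Seq((1, "a"), (2, "b")).toDF("col1", "left_val")
val df2 = Seq((1, "x")).toDF("col1", "right_val")

val cond = df1("col1") === df2("col1")

// left_outer keeps both left rows and appends df2's columns
val outer = df1.join(df2, cond, "left_outer")

// leftsemi keeps only the matching left row, with df1's columns only
val semi = df1.join(df2, cond, "leftsemi")

outer.show()
semi.show()
```

So a leftsemi join only fits the question if dropping the unmatched left rows is acceptable.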
Suppose you:
Want to use the DataFrame syntax.
Want to select all columns from df1 but only a couple from df2.
This is cumbersome to list out explicitly due to the number of columns in df1.
Then, you might do the following:
val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
Just to add one possibility: without using an alias, I was able to do that in PySpark with
first_df.join(second_df, "id", "left_outer").select( first_df["*"] )
Not sure if it applies here, but I hope it helps.
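A Scala counterpart of that PySpark snippet appears to be first_df("*"), since Dataset.col accepts "*" as a column name. A minimal sketch on made-up data, assuming a local SparkSession (leftOnly is a name introduced here):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("star-sketch").getOrCreate()
import spark.implicits._

// Toy data for illustration
val first_df = Seq((1, "a"), (2, "b")).toDF("id", "left_val")
val second_df = Seq((1, "x")).toDF("id", "right_val")

// first_df("*") expands to all columns of first_df, without needing an alias
val leftOnly = first_df
  .join(second_df, first_df("id") === second_df("id"), "left_outer")
  .select(first_df("*"))

leftOnly.show()
```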

Join two RDD in spark

I have two RDDs: one has just one column and the other has two. To join the two RDDs on their keys I have added a dummy value, 0. Is there a more efficient way of doing this with join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me restate this question in SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL, table1 has only one column whereas table2 has two, and the join still works. In the same way, can Spark join on the keys of both RDDs?
The join operation is defined only on pair RDDs (RDD[(K, V)], through PairRDDFunctions), which are quite different from a relation / table in SQL. Each element of a pair RDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause, while the value contains the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and value in a pair RDD have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around it.
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
lines.map(x => x.split("\t")).map(_(1)).collect.toSet) // field 1 is the movie id, as in the question
movienames.filter{case (id, _) => moviesidBD.value contains id}
but if you really want SQL-ish joins then you should simply use SparkSQL.
val movieIdsDf = lines
.map(x => x.split("\t"))
.map(a => Tuple1(a(1))) // field 1 is the movie id, as in the question
.toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
On RDDs, the join operation is only defined for pair RDDs, so you need to convert each RDD into an RDD of key-value pairs. Below is a sample:
val rdd1=sc.textFile("/data-001/part/")
val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
val rdd2=sc.textFile("/data-001/partsupp/")
val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)
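Applied to the single-column case from the question, one possible sketch is to pair each id with the Unit value () as a placeholder and drop it again after the join. The in-memory data and the names movieIds, movieNames, idPairs and joined are assumptions made for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pair-rdd-sketch").getOrCreate()
val sc = spark.sparkContext

// Toy data: an RDD of ids only, and an RDD of (id, name) pairs
val movieIds = sc.parallelize(Seq("1", "2", "2"))
val movieNames = sc.parallelize(Seq(("1", "Toy Story"), ("2", "GoldenEye"), ("3", "Four Rooms")))

// join needs key-value pairs on both sides, so pair each distinct id with ()
val idPairs = movieIds.distinct().map(id => (id, ()))

// Join on the key, then drop the placeholder from the joined value
val joined = movieNames.join(idPairs).mapValues { case (name, _) => name }

joined.collect().foreach(println)
```

Calling distinct() before the join avoids producing one output row per duplicate id, which removes the need for a distinct() afterwards.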