Join on several conditions without duplicated columns - scala

Joining on identity with Spark leads to the common key column being duplicated in the final Dataset:
val result = ds1.join(ds2, ds1("key") === ds2("key"))
// result now has two "key" columns
This is avoidable by passing a Seq of column names instead of the comparison, similar to the USING keyword in SQL:
val result = ds1.join(ds2, Seq("key"))
// result now has only one "key" column
However, this doesn't work when joining on a common key plus another condition, for example:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
// result has two "key" columns
val result = ds1.join(ds2, Seq("key") && ds1("foo") < ds2("foo"))
// compile error: value && is not a member of Seq[String]
Currently one way of getting out of this is to drop the duplicated column afterwards, but this is quite cumbersome:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
.drop(ds1("key"))
Is there a more natural, cleaner way to achieve the same goal?

You can separate the equi-join component from the filter:
ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
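A minimal sketch of that pattern on toy data, assuming a throwaway local SparkSession. The session settings and the sample rows are made up for illustration; only the join line comes from the answer above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("join-demo").getOrCreate()
import spark.implicits._

// Hypothetical toy data; both sides share the columns "key" and "foo".
val ds1 = Seq((1, 10), (2, 50)).toDF("key", "foo")
val ds2 = Seq((1, 20), (2, 30)).toDF("key", "foo")

// Equi-join on "key" via Seq (keeps a single "key" column),
// then apply the non-equi condition as a filter.
val result = ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))

// Number of columns named "key" in the result.
val keyColumns = result.columns.count(_ == "key")
```

For an inner join, filtering after the join gives the same rows as putting the condition in the join itself. For outer joins the two are not equivalent, so this rewrite should only be applied to inner joins.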

Related

Effectively counting records after join in spark

This is what I am doing. I need to get the number of records present in one dataset and not the other, and then join with a third dataset to get some other columns.
val tooCompare = dw
.select(
"loc",
"id",
"country",
"region"
).dropDuplicates()
val previous = dw
.select(
"loc",
"id",
"country",
"region"
).dropDuplicates()
val delta = tooCompare.exceptAll(previous).cache()
val records = delta
.join(
dw,//another dataset
delta
.col("loc").equalTo(dw.col("loc"))
.and(delta.col("id").equalTo(dw.col("id")))
.and(delta.col("country").equalTo(dw.col("country")))
.and(delta.col("region").equalTo(dw.col("region")))
)
.drop(delta.col("loc"))
.drop(delta.col("id"))
.drop(delta.col("country"))
.drop(delta.col("region"))
.cache()
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
Is there a more efficient way to do this?
I am new to Spark. I am pretty sure I am missing something here
I would suggest using SQL to make this more readable.
First, create temp views of the DataFrames in question. I don't know exactly which DataFrames you have, so something like:
dfToCompare.createOrReplaceTempView("toCompare")
previousDf.createOrReplaceTempView("previous")
anotherDataSet.createOrReplaceTempView("another")
Then you can do all your operations in one SQL statement:
val records = spark.sql("""select c.loc, c.id, c.country, c.region
  from toCompare c
  inner join another a
    on a.loc = c.loc
    and a.id = c.id
    and a.country = c.country
    and a.region = c.region
  where not exists (select null
    from previous p
    where p.loc = c.loc
    and p.id = c.id
    and p.country = c.country
    and p.region = c.region)""")
Then you can proceed as before...
val recordsToSend = records.cache()
val count = recordsToSend.select("loc").distinct().count()
I think there are potentially some errors in the code you've pasted, as tooCompare and previous are the same, and the third dataset join references deAnon but dw on the table....
For this example answer, assume your current table is called "current", previous is called "previous" and third table is "extra". Then:
val delta = current.join(
previous,
Seq("loc","id","country","region"),
"leftanti"
).select("loc","id","country","region").distinct
val recordsToSend = delta
.join(
extra,
Seq("loc", "id", "country", "region")
)
val count = recordsToSend.select("loc").distinct().count()
This may be more efficient, but I'd appreciate you commenting as to whether it actually was!
Just as an aside: note that I'm using the Seq[String] as a join argument (this requires the column names to be identical on both tables, and won't produce two copies of the columns). However, your original join logic can be written a bit more succinctly, as follows (using my naming conventions):
val recordsToSend = delta
.join(
extra,
delta("loc") === extra("loc")
&& delta("id") === extra("id")
&& delta("country") === extra("country")
&& delta("region") === extra("region")
)
.drop(delta("loc"))
.drop(delta("id"))
.drop(delta("country"))
.drop(delta("region"))
Even better would be to write a drop function that lets you provide more than one column, but I'm going really off topic now ;-)
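Such a helper could be sketched as a fold over the columns to drop. The name `dropAll` is hypothetical and not part of Spark; the demo data and local session below are assumptions made only so the sketch runs on its own.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

// Hypothetical helper: drops each given Column in turn.
def dropAll(df: DataFrame, cols: Column*): DataFrame =
  cols.foldLeft(df)((acc, c) => acc.drop(c))

// Assumption: a local session and toy data, only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("drop-demo").getOrCreate()
import spark.implicits._

val delta = Seq(("a", 1)).toDF("loc", "id")
val extra = Seq(("a", 1, "x")).toDF("loc", "id", "payload")

val joined = delta.join(extra, delta("loc") === extra("loc") && delta("id") === extra("id"))

// Remove delta's copies of the join columns in one call.
val cleaned = dropAll(joined, delta("loc"), delta("id"))
```

Passing `delta("loc")` rather than the plain name matters here, because after the join two columns are called "loc" and only the Column reference says which one to remove.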

Best way to update a dataframe in Spark scala

Consider two DataFrames, data_df and update_df. They have the same schema (key, update_time, and a bunch of other columns).
I know two (main) ways to "update" data_df with update_df:
1. Full outer join: I join the two DataFrames (on key) and then pick the appropriate columns (according to the value of update_time).
2. Max over partition: union both DataFrames, compute the max update_time by key, and then keep only the rows that equal this maximum.
Here are the questions:
Is there any other way?
Which one is the best way, and why?
I've already done the comparison with some Open Data
Here is the join code
// update_df is the DataFrame of updates described above
var join_df = data_df.alias("data").join(update_df.alias("maj"), Seq("key"), "outer")
var res_df = join_df.where($"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
  .select(col("data.*"))
  .union(
    join_df.where($"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
      .select(col("maj.*")))
And here is window code
import org.apache.spark.sql.expressions._
val byKey = Window.partitionBy($"key") // no orderBy needed: max is taken over the whole partition
res_df = data_df.union(update_df)
  .withColumn("max_version", max("update_time").over(byKey))
  .where($"update_time" === $"max_version")
I can paste the DAGs and plans here if needed, but they are pretty large.
My first guess is that the join solution might be the best way, but it only works if the update DataFrame has only one version per key.
PS: I'm aware of the Delta Lake solution, but sadly I'm not able to use it.
Below is one way of doing it that joins only on the keys, in an effort to minimize the amount of memory used by the filter and join operations.
// Two records: one that will change, one that will not
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
// Two records: one change, one new
val updateDF = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
// Make new DFs of each containing just the Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
// Find the keys that are common to both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
// Collect the common keys into an Array on the driver
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x => x.mkString).collect
// Keep the rows from original whose key does not appear in the update
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray: _*)))
// Build the output from unchanged records plus updated and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)
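As a possible variation on the steps above, a left anti join can select the unchanged rows without collecting keys to the driver. This is a sketch that swaps in a different technique from the one in the answer, and it assumes the update DataFrame holds at most one row per key. The local session is an assumption for illustration; the sample rows mirror the ones above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("update-demo").getOrCreate()
import spark.implicits._

val originalDF = Seq(("aa", "Joe"), ("cc", "Doe")).toDF("Key", "Name")
val updateDF = Seq(("aa", "Aoe"), ("bb", "Moe")).toDF("Key", "Name")

// Rows of originalDF whose Key has no match in updateDF (left anti join).
val unchanged = originalDF.join(updateDF, Seq("Key"), "leftanti")

// Unchanged rows plus all updated and new rows.
val finalDF = unchanged.union(updateDF)
```

Whether this is faster than the collect-and-isin version depends on the data sizes, so it is worth comparing the plans on real data.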

generate dynamic join condition spark/scala

I have an array of tuples and I want to generate a join condition (OR) from it.
e.g.
input --> [("leftId", "rightId"), ("leftAltId", "rightAltId")]
output --> leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
method signature:
def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String,String)]): Unit = {
}
I tried using a reduce operation on the array, but the output of my reduce is a Column and not a String, hence it can't be fed back as input. I could use recursion, but I was hoping there's a simpler way to start from an empty column variable and build the query. Thoughts?
You can do something like this:
val cond = fieldsToJoin.map(x => col(x._1) === col(x._2)).reduce(_ || _)
leftDF.join(rightDF, cond)
Basically, you first turn the array into an array of conditions (col turns each string into a Column and === does the comparison), and then the reduce puts an "or" between them. The result is a Column you can use as the join condition.
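The same idea can be sketched as a function with the signature from the question, except that it returns the joined DataFrame instead of Unit. This variant resolves each name against its own DataFrame (`leftDF(...)`, `rightDF(...)`) rather than using `col`, which is an assumption about what is wanted when both sides could share a column name. The function name `joinOnAny`, the local session, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

// Builds leftDF(l) === rightDF(r) for each pair and ORs the conditions together.
def joinOnAny(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String, String)]): DataFrame = {
  // reduce throws on an empty collection, so fail early with a clearer message.
  require(fieldsToJoin.nonEmpty, "need at least one pair of columns")
  val cond: Column = fieldsToJoin
    .map { case (l, r) => leftDF(l) === rightDF(r) }
    .reduce(_ || _)
  leftDF.join(rightDF, cond)
}

// Assumption: a local session and toy data, only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("dynamic-join-demo").getOrCreate()
import spark.implicits._

val left = Seq((1, 100), (2, 200)).toDF("leftId", "leftAltId")
val right = Seq((1, 999), (7, 200)).toDF("rightId", "rightAltId")

// The first left row matches on the id pair, the second on the alternate id pair.
val joined = joinOnAny(left, right, Array(("leftId", "rightId"), ("leftAltId", "rightAltId")))
```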

Accessing column in a dataframe using Spark

I am working on Spark 1.6.1 using Scala and facing an unusual issue. When I create a new column from an existing column created during the same execution, I get an "org.apache.spark.sql.AnalysisException".
WORKING:
val resultDataFrame = dataFrame.withColumn("FirstColumn",lit(2021)).withColumn("SecondColumn",when($"FirstColumn" - 2021 === 0, 1).otherwise(10))
resultDataFrame.printSchema()
NOT WORKING:
val resultDataFrame = dataFrame.withColumn("FirstColumn",lit(2021)).withColumn("SecondColumn",when($"FirstColumn" - max($"FirstColumn") === 0, 1).otherwise(10))
resultDataFrame.printSchema()
Here I am creating SecondColumn using the FirstColumn created during the same execution. The question is why this does not work when using the avg/max functions. Please let me know how I can resolve this problem.
If you want to use aggregate functions together with "normal" columns, the functions should come after a groupBy or be used with a Window definition clause. Outside these cases they make no sense. Examples:
val result = df.groupBy($"col1").agg(max($"col2").as("max")) // This works
In the above case, the resulting DataFrame will have both "col1" and "max" as columns.
val minMax = df.select(min("col2"), max("col2"))
This works because there are only aggregate functions in the query. However, the following will not work:
val result = df.filter($"col1" === max($"col2"))
because it tries to mix a non-aggregated column with an aggregated column.
If you want to compare a column with an aggregated value, you can try a join:
val maxDf = df.select(max("col2").as("maxValue"))
val joined = df.join(maxDf)
val result = joined.filter($"col1" === $"maxValue").drop("maxValue")
Or even use the simple value:
val maxValue = df.select(max("col2")).first.get(0)
val result = df.filter($"col1" === maxValue)
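For the withColumn case from the question, the Window route mentioned earlier might look like the sketch below: `max(...).over(...)` yields a per-row value, so it can be combined with ordinary columns. This assumes Spark 2.0 or later (not the 1.6.1 from the question) and uses a made-up input DataFrame and a local session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("window-demo").getOrCreate()
import spark.implicits._

// Hypothetical input with differing values so the comparison is not trivial.
val dataFrame = Seq(2019, 2021, 2021).toDF("FirstColumn")

// An empty partitionBy puts every row in one window, so max covers the whole DataFrame.
// All rows are then processed in a single partition, which is fine for a sketch only.
val overAll = Window.partitionBy()

// Rows equal to the overall maximum get 1, all others get 10.
val resultDataFrame = dataFrame.withColumn(
  "SecondColumn",
  when($"FirstColumn" - max($"FirstColumn").over(overAll) === 0, 1).otherwise(10)
)
```

On large data, computing the maximum once and comparing against the plain value, as in the last example above, avoids moving all rows into one partition.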

How to update rows based on condition in spark-sql

I am working on spark-sql for data preparation.
The problem I am facing comes after getting the result of the SQL query: how should I update rows based on an if-then-else condition?
What I am doing:
val table_join = sqlContext.sql(""" SELECT a.*,b.col as someCol
from table1 a LEFT JOIN table2 b
on a.ID=b.ID """)
table_join.registerTempTable("Table_join")
Now I have the final joined table as a DataFrame. How should I update rows?
//Final filtering operation
val final_filtered_table = table_join.map{ case record=>
if(record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") record.getAs[String]("col2")="UNKNOWN"
else if (record.getAs[String]("col1") == "N") record("col1")=""
else record
}
In the above map, the if syntax works properly, but the moment I apply the update to modify a value, it gives me an error.
But why does the code below work?
if(record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") "UNKNOWN"
But the moment I change "UNKNOWN" to record.getAs[String]("col2")="UNKNOWN", it gives me an error at .getAs.
Another approach I tried is this:
val final_filtered_sql = table_join.map{row =>
if(row.getString(6) == "Y" && row.getString(33) == "") row.getString(6) == "UNKNOWN"
else if(row.getString(6) == "N") row.getString(6) == ""
else row
}
This works, but is it the right approach? I should not refer to the columns by their numbers but by their names. What approach should I follow to get the names of the columns and then update them?
Please help me with this. What syntax should I use to update rows based on a condition in a DataFrame in spark-sql?
record.getAs[String]("col2")="UNKNOWN" won't work because record.getAs[String](NAME) returns a String, which doesn't have a = method, and assigning a new value to a string doesn't make sense.
DataFrame records don't have any setter methods because DataFrames are based on RDDs, which are immutable collections, meaning you cannot change their state, which is what you're trying to do here.
One way would be to create a new DataFrame using selectExpr on table_join and put that if/else logic there using SQL.
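A minimal sketch of that suggestion, using CASE WHEN expressions inside selectExpr. The toy rows, the three-column layout, and the local SparkSession (which implies Spark 2.0 or later) are assumptions for illustration; the real table_join has more columns, each of which would need to be listed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and toy data standing in for table_join.
val spark = SparkSession.builder().master("local[1]").appName("update-rows-demo").getOrCreate()
import spark.implicits._

val tableJoin = Seq((1, "Y", ""), (2, "N", "x"), (3, "Y", "z")).toDF("ID", "col1", "col2")

// Each expression reads the ORIGINAL col1/col2 and produces the new value under the same name.
// Columns not listed in selectExpr are dropped from the result.
val updated = tableJoin.selectExpr(
  "ID",
  "CASE WHEN col1 = 'N' THEN '' ELSE col1 END AS col1",
  "CASE WHEN col1 = 'Y' AND col2 = '' THEN 'UNKNOWN' ELSE col2 END AS col2"
)

// Collect to the driver for inspection (fine for toy data only).
val rows = updated.orderBy("ID").collect().map(r => (r.getInt(0), r.getString(1), r.getString(2))).toSeq
```

The result is a new DataFrame; tableJoin itself is left unchanged, which is consistent with the immutability point above.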