Scala DataFrame join and get right table columns only

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df1 = Seq(
(1,"A")
).toDF("id","value")
df1.show()
val df2 = Seq(
(1,"C")
).toDF("id","value")
df2.show()
val joinKey = "id"
df1.join(df2.as("dfy"),joinKey.split(",").toSeq).show()
Output:
+---+-----+
| id|value|
+---+-----+
| 1| A|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 1| C|
+---+-----+
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A| C|
+---+-----+-----+
I want to get only the columns from the right table, including the join key 'id'. But since Spark keeps only one copy of the join key when joining on column names, the key is not available from the right table's alias if I do as below.
df1.as("dfx").join(df2.as("dfy"),joinKey.split(",").toSeq).select($"dfy.*").show()
+-----+
|value|
+-----+
| C|
+-----+
This works, but I don't want to get all rows from the right table, as there can be a lot of them.
df1.as("dfx").join(df2.as("dfy"),joinKey.split(",").toSeq,"right").select($"dfy.*").show()
+---+-----+
| id|value|
+---+-----+
| 1| C|
+---+-----+
What is the best way to do this?
Thanks.

Since what you really want is an inner join, you should just flip the query and use a left_semi join.
df2.join(df1, joinKey.split(","), "leftsemi").show()
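For the sample data above, this should return only df2's columns, restricted to the keys that also exist in df1:
+---+-----+
| id|value|
+---+-----+
| 1| C|
+---+-----+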
Since you mention that df2 can be big, this should even bring you a performance benefit.

Related

How to convert the below T-SQL Query ISNULL(NAME,'N/A') to Spark-SQL Equivalent

I'm not able to convert the T-SQL query part ISNULL(NAME,'N/A') below to its Spark-SQL equivalent:
SELECT
ID,
ISNULL(NAME,'N/A') AS NAME,
COMPANY
FROM TEST
You can do it in two ways, like so:
df = spark.createDataFrame([(1, None), (2, None)], "id: int, value: string")
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| null|
| 2| null|
+---+-----+
df.na.fill("N/A", subset=["value"]).show()
+---+-----+
| id|value|
+---+-----+
| 1| N/A|
| 2| N/A|
+---+-----+
Or with when/otherwise (the .otherwise(col("value")) keeps any non-null values intact; without it they would be replaced with null):
from pyspark.sql.functions import col, when
df.withColumn("value", when(col("value").isNull(), "N/A").otherwise(col("value"))).show()
+---+-----+
| id|value|
+---+-----+
| 1| N/A|
| 2| N/A|
+---+-----+
Either option gives you the same result.
The function isnull() merely returns a boolean stating whether the input was null or not. Alternatively, try using an expression within a CASE statement or [coalesce](https://docs.databricks.com/sql/language-manual/functions/coalesce.html):
CASE WHEN NAME IS NULL THEN 'N/A' ELSE NAME END AS NAME
OR
SELECT COALESCE(NAME,'N/A') AS NAME
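Applied to the original query, the Spark-SQL equivalent would look roughly like this (same TEST table assumed):
SELECT
ID,
COALESCE(NAME,'N/A') AS NAME,
COMPANY
FROM TEST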
Hope it helps...

Join two dataframes and replace the original column values using Spark Scala

I have two DFs
df1:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.0|20210101|
| 2| 2.0|20210101|
| 3| 3.0|20210101|
+---+-----+--------+
df2:
+---+-----+
|key|price|
+---+-----+
| 1| 1.1|
| 2| 2.2|
| 3| 3.3|
+---+-----+
I'd like to replace price column values from df1 with price values from df2 where df1.key == df2.key
Expected output:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.1|20210101|
| 2| 2.2|20210101|
| 3| 3.3|20210101|
+---+-----+--------+
I've found some solutions in Python, but I couldn't come up with a working solution in Scala.
Simply join + drop df1 column price:
val df = df1.join(df2, Seq("key")).drop(df1("price"))
df.show
//+---+-----+--------+
//|key|price| date|
//+---+-----+--------+
//| 1| 1.1|20210101|
//| 2| 2.2|20210101|
//| 3| 3.3|20210101|
//+---+-----+--------+
Or if you have more entries in df1 and you want to keep their price when there is no match in df2 then use left join + coalesce expression:
val df = df1.join(df2, Seq("key"), "left").select(
col("key"),
col("date"),
coalesce(df2("price"), df1("price")).as("price")
)
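To illustrate the difference, here is a sketch with a hypothetical extra row (key 4) added to df1 that has no match in df2; coalesce keeps its original price while the matched keys take df2's prices:
val df1b = df1.union(Seq((4, 4.0, "20210101")).toDF("key", "price", "date"))
df1b.join(df2, Seq("key"), "left").select(
  col("key"),
  col("date"),
  coalesce(df2("price"), df1b("price")).as("price")
).show()
// keys 1-3 get 1.1, 2.2, 3.3 from df2; key 4 keeps its original 4.0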

Exploding pipe separated data in spark

I have a Spark DataFrame (input_dataframe); the data in this DataFrame looks like below:
id value
1 a
2 x|y|z
3 t|u
I want to have output_dataframe, having pipe separated fields exploded and it should look like below:
id value
1 a
2 x
2 y
2 z
3 t
3 u
Please help me achieve the desired output using PySpark. Any help will be appreciated.
We can first split and then explode the value column using the built-in functions, as below:
>>> l=[(1,'a'),(2,'x|y|z'),(3,'t|u')]
>>> df = spark.createDataFrame(l,['id','val'])
>>> df.show()
+---+-----+
| id| val|
+---+-----+
| 1| a|
| 2|x|y|z|
| 3| t|u|
+---+-----+
>>> from pyspark.sql import functions as F
>>> df.select('id',F.explode(F.split(df.val,'[|]')).alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| x|
| 2| y|
| 2| z|
| 3| t|
| 3| u|
+---+-----+
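Since most of this page is Scala, the same approach in the Scala API would look roughly like this (a sketch, assuming the same df with columns id and val):
import org.apache.spark.sql.functions.{col, explode, split}
// split() takes a regex pattern, so the pipe has to be escaped
df.select(col("id"), explode(split(col("val"), "\\|")).alias("value")).show()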

Spark dataframe select rows with at least one null or blank in any column of that row

From one DataFrame I want to create a new DataFrame containing only the rows where at least one value in any of the columns is null or blank, in Spark 1.5 / Scala.
I am trying to write a generalized function to create this new DataFrame, where I pass the DataFrame and the list of columns and it produces these records.
Thanks
Sample Data:
val df = Seq((null, Some(2)), (Some("a"), Some(4)), (Some(""), Some(5)), (Some("b"), null)).toDF("A", "B")
df.show
+----+----+
| A| B|
+----+----+
|null| 2|
| a| 4|
| | 5|
| b|null|
+----+----+
You can construct the condition as follows, assuming blank means an empty string here:
import org.apache.spark.sql.functions.col
val cond = df.columns.map(x => col(x).isNull || col(x) === "").reduce(_ || _)
df.filter(cond).show
+----+----+
| A| B|
+----+----+
|null| 2|
| | 5|
| b|null|
+----+----+
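Wrapping the same condition in the generalized function the question asks for (a sketch; the function name is my own):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep only rows where at least one of the given columns is null or an empty string
def rowsWithNullOrBlank(df: DataFrame, columns: Seq[String]): DataFrame = {
  val cond = columns.map(c => col(c).isNull || col(c) === "").reduce(_ || _)
  df.filter(cond)
}

rowsWithNullOrBlank(df, df.columns).show()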

Aggregate rows of Spark DataFrame to String after groupby

I'm quite new to both Spark and Scala and could really use a hint to solve my problem. I have two DataFrames, A (columns id and name) and B (columns id and text), and would like to join them, group by id, and combine all rows of text into a single String:
A
+--------+--------+
| id| name|
+--------+--------+
| 0| A|
| 1| B|
+--------+--------+
B
+--------+--------+
| id| text|
+--------+--------+
| 0| one|
| 0| two|
| 1| three|
| 1| four|
+--------+--------+
desired result:
+--------+--------+----------+
| id| name| texts|
+--------+--------+----------+
| 0| A| one two|
| 1| B|three four|
+--------+--------+----------+
So far I'm trying the following:
var C = A.join(B, "id")
var D = C.groupBy("id", "name").agg(collect_list("text") as "texts")
This works quite well, except that my texts column is an Array of Strings instead of a single String. I would appreciate some help very much.
I am just adding a minor function (concat_ws) to yours to give the right solution, which is
A.join(B, Seq("id"), "left").orderBy("id").groupBy("id", "name").agg(concat_ws(" ", collect_list("text")) as "texts")
It's quite simple:
val bCollected = b.groupBy('id).agg(collect_list('text).as("texts"))
val ab = a.join(bCollected, a("id") === bCollected("id"), "left")
The first DataFrame is an intermediate result: the b DataFrame with the texts collected for every id. Then you join it with a. bCollected should be smaller than b itself, so it will probably get better shuffle times.
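If you also want texts as a single String with this second approach, the same concat_ws trick from the first answer applies (a sketch; note that collect_list gives no ordering guarantee without an explicit sort):
import org.apache.spark.sql.functions.{collect_list, concat_ws}

val bTexts = b.groupBy('id).agg(concat_ws(" ", collect_list('text)).as("texts"))
val abTexts = a.join(bTexts, a("id") === bTexts("id"), "left")
  .select(a("id"), a("name"), bTexts("texts"))
abTexts.show()
// texts is now a single String per id, e.g. "one two" (the order inside may vary)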