Scala DataFrame join and get right table columns only

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df1 = Seq(
(1,"A")
).toDF("id","value")
df1.show()
val df2 = Seq(
(1,"C")
).toDF("id","value")
df2.show()
val joinKey = "id"
df1.join(df2.as("dfy"),joinKey.split(",").toSeq).show()
Output:
+---+-----+
| id|value|
+---+-----+
| 1| A|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 1| C|
+---+-----+
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A| C|
+---+-----+-----+
I want to get only the columns from the right table, including the join key 'id'. But since Spark keeps only one copy of the join key when joining on column names, the key is not available from the right table's alias if I do as below.
df1.as("dfx").join(df2.as("dfy"),joinKey.split(",").toSeq).select($"dfy.*").show()
+-----+
|value|
+-----+
| C|
+-----+
This works, but I don't want to get all rows from the right table, as there can be a lot of them.
df1.as("dfx").join(df2.as("dfy"),joinKey.split(",").toSeq,"right").select($"dfy.*").show()
+---+-----+
| id|value|
+---+-----+
| 1| C|
+---+-----+
What is the best way to do this?
Thanks.

Since what you really want is an inner join, you should just flip the query and use a left_semi join.
df2.join(df1, joinKey.split(","), "leftsemi").show()
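For the sample data above, this should return only df2's columns, restricted to the keys that also exist in df1:
+---+-----+
| id|value|
+---+-----+
| 1| C|
+---+-----+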
Since you mention that df2 can be big, this should even bring you a performance benefit.

Related

How to convert the below T-SQL Query ISNULL(NAME,'N/A') to Spark-SQL Equivalent

I'm not able to convert the T-SQL query part ISNULL(NAME,'N/A') below to its Spark-SQL equivalent:
SELECT
ID,
ISNULL(NAME,'N/A') AS NAME,
COMPANY
FROM TEST
You can do it in two ways, like so:
df = spark.createDataFrame([(1, None), (2, None)], "id: int, value: string")
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| null|
| 2| null|
+---+-----+
df.na.fill("N/A", subset=["value"]).show()
+---+-----+
| id|value|
+---+-----+
| 1| N/A|
| 2| N/A|
+---+-----+
Or with when/otherwise (the .otherwise(col("value")) keeps any non-null values intact; without it they would be replaced with null):
from pyspark.sql.functions import col, when
df.withColumn("value", when(col("value").isNull(), "N/A").otherwise(col("value"))).show()
+---+-----+
| id|value|
+---+-----+
| 1| N/A|
| 2| N/A|
+---+-----+
Either option gives you the same result.
The function isnull() merely returns a boolean stating whether the input was null or not. Alternatively, try using an expression within a CASE statement or [coalesce](https://docs.databricks.com/sql/language-manual/functions/coalesce.html):
CASE WHEN NAME IS NULL THEN 'N/A' ELSE NAME END AS NAME
OR
SELECT COALESCE(NAME,'N/A') AS NAME
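Applied to the original query, the Spark-SQL equivalent would look roughly like this (same TEST table assumed):
SELECT
ID,
COALESCE(NAME,'N/A') AS NAME,
COMPANY
FROM TEST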
Hope it helps...

Join two dataframes and replace the original column values using Spark Scala

I have two DFs
df1:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.0|20210101|
| 2| 2.0|20210101|
| 3| 3.0|20210101|
+---+-----+--------+
df2:
+---+-----+
|key|price|
+---+-----+
| 1| 1.1|
| 2| 2.2|
| 3| 3.3|
+---+-----+
I'd like to replace price column values from df1 with price values from df2 where df1.key == df2.key
Expected output:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.1|20210101|
| 2| 2.2|20210101|
| 3| 3.3|20210101|
+---+-----+--------+
I've found some solutions in Python, but I couldn't come up with a working solution in Scala.
Simply join + drop df1 column price:
val df = df1.join(df2, Seq("key")).drop(df1("price"))
df.show
//+---+-----+--------+
//|key|price| date|
//+---+-----+--------+
//| 1| 1.1|20210101|
//| 2| 2.2|20210101|
//| 3| 3.3|20210101|
//+---+-----+--------+
Or if you have more entries in df1 and you want to keep their price when there is no match in df2 then use left join + coalesce expression:
val df = df1.join(df2, Seq("key"), "left").select(
col("key"),
col("date"),
coalesce(df2("price"), df1("price")).as("price")
)
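To illustrate the difference, here is a sketch with a hypothetical extra row (key 4) added to df1 that has no match in df2; coalesce keeps its original price while the matched keys take df2's prices:
val df1b = df1.union(Seq((4, 4.0, "20210101")).toDF("key", "price", "date"))
df1b.join(df2, Seq("key"), "left").select(
  col("key"),
  col("date"),
  coalesce(df2("price"), df1b("price")).as("price")
).show()
// keys 1-3 get 1.1, 2.2, 3.3 from df2; key 4 keeps its original 4.0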

Exploding pipe separated data in spark

I have a Spark DataFrame (input_dataframe); the data in this DataFrame looks like below:
id value
1 a
2 x|y|z
3 t|u
I want to have output_dataframe, having pipe separated fields exploded and it should look like below:
id value
1 a
2 x
2 y
2 z
3 t
3 u
Please help me achieve the desired output using PySpark. Any help will be appreciated.
We can first split and then explode the value column using the built-in functions, as below:
>>> l=[(1,'a'),(2,'x|y|z'),(3,'t|u')]
>>> df = spark.createDataFrame(l,['id','val'])
>>> df.show()
+---+-----+
| id| val|
+---+-----+
| 1| a|
| 2|x|y|z|
| 3| t|u|
+---+-----+
>>> from pyspark.sql import functions as F
>>> df.select('id',F.explode(F.split(df.val,'[|]')).alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| x|
| 2| y|
| 2| z|
| 3| t|
| 3| u|
+---+-----+
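Since most of this page is Scala, the same approach in the Scala API would look roughly like this (a sketch, assuming the same df with columns id and val):
import org.apache.spark.sql.functions.{col, explode, split}
// split() takes a regex pattern, so the pipe has to be escaped
df.select(col("id"), explode(split(col("val"), "\\|")).alias("value")).show()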

Spark dataframe select rows with at least one null or blank in any column of that row

From one DataFrame I want to create a new DataFrame containing only the rows where at least one value in any of the columns is null or blank, in Spark 1.5 / Scala.
I am trying to write a generalized function to create this new DataFrame, where I pass the DataFrame and the list of columns and it produces these records.
Thanks
Sample Data:
val df = Seq((null, Some(2)), (Some("a"), Some(4)), (Some(""), Some(5)), (Some("b"), null)).toDF("A", "B")
df.show
+----+----+
| A| B|
+----+----+
|null| 2|
| a| 4|
| | 5|
| b|null|
+----+----+
You can construct the condition as follows, assuming blank means an empty string here:
import org.apache.spark.sql.functions.col
val cond = df.columns.map(x => col(x).isNull || col(x) === "").reduce(_ || _)
df.filter(cond).show
+----+----+
| A| B|
+----+----+
|null| 2|
| | 5|
| b|null|
+----+----+
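Wrapping the same condition in the generalized function the question asks for (a sketch; the function name is my own):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep only rows where at least one of the given columns is null or an empty string
def rowsWithNullOrBlank(df: DataFrame, columns: Seq[String]): DataFrame = {
  val cond = columns.map(c => col(c).isNull || col(c) === "").reduce(_ || _)
  df.filter(cond)
}

rowsWithNullOrBlank(df, df.columns).show()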

Aggregate rows of Spark DataFrame to String after groupby

I'm quite new to both Spark and Scala and could really use a hint to solve my problem. I have two DataFrames, A (columns id and name) and B (columns id and text), and would like to join them, group by id, and combine all rows of text into a single String:
A
+--------+--------+
| id| name|
+--------+--------+
| 0| A|
| 1| B|
+--------+--------+
B
+--------+--------+
| id| text|
+--------+--------+
| 0| one|
| 0| two|
| 1| three|
| 1| four|
+--------+--------+
desired result:
+--------+--------+----------+
| id| name| texts|
+--------+--------+----------+
| 0| A| one two|
| 1| B|three four|
+--------+--------+----------+
So far I'm trying the following:
var C = A.join(B, "id")
var D = C.groupBy("id", "name").agg(collect_list("text") as "texts")
This works quite well, except that my texts column is an Array of Strings instead of a single String. I would appreciate some help very much.
I am just adding a minor function (concat_ws) to yours to give the right solution, which is
A.join(B, Seq("id"), "left").orderBy("id").groupBy("id", "name").agg(concat_ws(" ", collect_list("text")) as "texts")
It's quite simple:
val bCollected = b.groupBy('id).agg(collect_list('text).as("texts"))
val ab = a.join(bCollected, a("id") === bCollected("id"), "left")
The first DataFrame is an intermediate result: the b DataFrame with the texts collected for every id. Then you join it with a. bCollected should be smaller than b itself, so it will probably get better shuffle times.
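If you also want texts as a single String with this second approach, the same concat_ws trick from the first answer applies (a sketch; note that collect_list gives no ordering guarantee without an explicit sort):
import org.apache.spark.sql.functions.{collect_list, concat_ws}

val bTexts = b.groupBy('id).agg(concat_ws(" ", collect_list('text)).as("texts"))
val abTexts = a.join(bTexts, a("id") === bTexts("id"), "left")
  .select(a("id"), a("name"), bTexts("texts"))
abTexts.show()
// texts is now a single String per id, e.g. "one two" (the order inside may vary)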