spark add a col to dataframe with condtions on another df - scala

I have the following problem: I want to add a column RealCity to dataframe A, when City value is 'noClue', I what to select from df B, using the Key, to get the City.
Table A:
+---------+--------+
| Key | City|
+---------+--------+
|a | PDX |
+---------+--------+
|b | noClue |
Table B:
+---------+--------+
| Key | Name |
+---------+--------+
|c | SYD |
+---------+--------+
|b | AKL |
I want to use .withColumnand when but I can't select value another table (table B) by doing it this way. What's a good way of doing this? Many Thanks!

Given that you have two dataframes
A:
+---+------+
|key|City |
+---+------+
|a |PDX |
|b |noClue|
+---+------+
B:
+---+----+
|key|Name|
+---+----+
|a |SYD |
|b |AKL |
+---+----+
You can simply join them with common Key and use withColumn and when function as
val finalDF = A.join(B, Seq("Key"), "left").withColumn("RealCity", when($"City" === "noClue", $"Name").otherwise($"City")).drop("Name")
you should have final output as
+---+------+--------+
|key|City |RealCity|
+---+------+--------+
|a |PDX |PDX |
|b |noClue|AKL |
+---+------+--------+

Related

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has value like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+----------------------+--------------------------------+
|UserId |firstname | lastname| middlename|
+----------------------+--------------------------------+
|1 |abc | null | null |
|2 |null | xyz | null |
|3 |pqr | zzz | null |
|4 |aaa | bbb | xxx |
+----------------------+--------------------------------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the max length of col
but as you may have guessed I get something like this as the output:
+----------------------+--------------------------------+
|UserId |col0 | col1 | col2 |
+----------------------+--------------------------------+
|1 |abc | null | null |
|2 |xyz | null | null |
|3 |pqr | zzz | null |
|4 |aaa | xxx | bbb |
+----------------------+--------------------------------+
NOTE: The above is just an example. There could be more additions to the columns like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on for around 40-50 columnnames and values. They are dynamic and I don't know most of them in advance
I am looking for a a way to achieve the result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
(1, "firstname=joe;age=33"),
(2, "lastname=smith;country=usa"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")
df.
select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
groupBy("user_id").pivot("key").agg(first("value").as("value")).
orderBy("user_id"). // only for ordered output
show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
Since your data is split by ; then your key value pairs are split by = you may consider using str_to_map the following:
creating a temporary view of your data eg
df.createOrReplaceTempView("my_table")
Running the following on your spark session
result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT
UserId,
str_to_map(col,';','=') full_name
FROM
my_table
)
SELECT
UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM
split_data
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
Existence of duplicate keys in column key_values
Only duplicate key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
(1, "firstname=joe;age=33;moviegenre=comedy"),
(2, "lastname=smith;country=usa;moviegenre=drama"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")
val mainCols = df.columns diff Seq("key_values")
val dfNew = df.
withColumn("kv_arr", split($"key_values", ";")).
withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))
val dupeKeys = Seq("moviegenre") // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)
dfNew.select(
mainCols.map(col) ++
dupeKeys.map(col) ++
nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
).
orderBy("user_id"). // only for ordered output
show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
/*
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using the column of another DataFrame DF2, the col is country_id. I Want to reduce all the rows of the first DataFrame to only the countries that there are on the second DF. An example:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |ARG |5 |
|2015-12-14 |GER |1 |
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
|2015-12-14 |USA |1 |
+--------------+------------+
|USE | country_id |
+--------------+------------+
| F |RUS |
| F |CHN |
Expected:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
How could I do this? I am new with Spark so I have thought on use maybe intersect? or would be more efficient other method?
Thanks in advance!
You can use left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use inner join :
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")

Spark DF create Seq column in witcolumn

I have a df:
col1
col2
1
abcdefghi
2
qwertyuio
and I want to repeat each row, dividing the col2 in 3 substrings of lenght 3:
col1
col2
1
abcdefghi
1
abc
1
def
1
ghi
2
qwertyuio
2
qwe
2
rty
2
uio
I was trying to create a new column of Seq containng Seq((col("col1"), substring(col("col2"),0,3))...) :
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column, and reduce it, getting one final Seq. Then pass it to DF and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead:
val df1 = df.union(
df.select(
$"col1",
explode(array(
substring(col("col2"),0,3),
substring(col("col2"),3,3),
substring(col("col2"),6,3)
)).as("col2")
)
)
df1.show
+----+---------+
|col1| col2|
+----+---------+
| 1|abcdefghi|
| 2|qwertyuio|
| 1| abc|
| 1| cde|
| 1| fgh|
| 2| qwe|
| 2| ert|
| 2| yui|
+----+---------+
You can use udf also,
val df = spark.sparkContext.parallelize(Seq((1L,"abcdefghi"), (2L,"qwertyuio"))).toDF("col1","col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// udf
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+

How to get number of lines resulted by join in Spark

Consider these two Dataframes:
+---+
|id |
+---+
|1 |
|2 |
|3 |
+---+
+---+-----+
|idz|word |
+---+-----+
|1 |bat |
|1 |mouse|
|2 |horse|
+---+-----+
I am doing a Left join on ID=IDZ:
val r = df1.join(df2, (df1("id") === df2("idz")), "left_outer").
withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= ("null"), col("word")).otherwise(null)).drop("word")
r.show(false)
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |mouse |
|1 |1 |bat |
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
But what if I only want to keep the lines whose ID only have one equal IDZ? If not, I would Like to have null in ID_EMPLOYE_VENDEUR. Desired output is:
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |null | --Because the Join resulted two different lines
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
I should precise that I am working on a large DF. The solution should be not very expensive in time.
Thank you
As per you mentioned data your data is too large, so groupBy is not good option to group data and join Windows over function as below :
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("idz")
val newDF = df1.withColumn("count", count("idz").over(windowSpec)).dropDuplicates("idz").withColumn("word", when(col("count") >=2 , lit(null)).otherwise(col("word"))).drop("count")
val r = df1.join(newDF, (df1("id") === newDF("idz")), "left_outer").withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= ("null"), col("word")).otherwise(null)).drop("word")
r show
+---+----+------------------+
| id| idz|ID_EMPLOYE_VENDEUR|
+---+----+------------------+
| 1| 1| null|
| 3|null| null|
| 2| 2| horse|
+---+----+------------------+
You can retrieve easily the information that more than two df2's idz matched a single df1's id with a groupBy and a join.
r.join(
r.groupBy("id").count().as("g"),
$"g.id" === r("id")
)
.withColumn(
"ID_EMPLOYE_VENDEUR",
expr("if(count != 1, null, ID_EMPLOYE_VENDEUR)")
)
.drop($"g.id").drop("count")
.distinct()
.show()
Note: Both the groupBy and the join do not trigger any additional exchange step (shuffle around network) because the dataframe r is already partitioned on id (because it is the result of a join on id).

Splittling list of JSON key/value pairs into columns of a row in a Dataset

I have a column with a list of key/value objects:
+----+--------------------------------------------------------------------------------------------+
|ID | Settings |
+----+--------------------------------------------------------------------------------------------+
|1 | [{"key":"key1","value":"val1"}, {"key":"key2","value":"val2"}, {"key":"key3","value":"val3"}] |
+----+--------------------------------------------------------------------------------------------+
Is it possible to split this list of objects into its own row?
As such:
+----+------+-------+-------+
|ID | key1 | key2 | key3 |
+----+------+-------+-------+
|1 | val1 | val2 | val3 |
+----+------+-------+-------+
I've tried exploding, and placing into a Struct:
case class Setting(key: String, value: String)
val newDF = df.withColumn("setting", explode($"settings"))
.select($"id", from_json($"setting" Encoders.product[Setting].schema) as 'settings)
which gives me:
+------+------------------------------+
|ID |settings |
+------+------------------------------+
|1 |[key1,val1] |
|1 |[key2,val2] |
|1 |[key3,val3] |
+------+------------------------------+
And from here I can use the specifies rows by such settings.key
But its not quite what I need. I need to access multiple keys in the one row of data
You are almost near, If you already got this
+------+------------------------------+
|ID |settings |
+------+------------------------------+
|1 |[key1,val1] |
|1 |[key2,val2] |
|1 |[key3,val3] |
+------+------------------------------+
Now you can use pivot to reshape the data as
newDF.groupBy($"ID")
.pivot("settings.key")
.agg(first("settings.value"))
Group by ID and use pivot, Use agg to get the first value but you can use any other function here.
Output:
+---+----+----+----+
|ID |key1|key2|key3|
+---+----+----+----+
|1 |val1|val2|val3|
+---+----+----+----+
Hope this helps!