How to convert a map to individual columns in Spark Scala?

I have a Spark dataframe with values like below, and I am struggling to find a way to convert the map column in the input dataframe into separate columns like Id, Fld1, Fld2. Appreciate any help or a pointer to the documentation that covers this?
val df2 = Seq(
  ("1", Map("Fld1" -> "USA", "Fld2" -> "UK")),
  ("2", Map("Fld1" -> "Germany", "Fld2" -> "Portugal"))
).toDF("id", "map")
df2.show(false)
Input:
+---+-----------------------------------+
|id |map |
+---+-----------------------------------+
|1 |[Fld1 -> USA, Fld2 -> UK] |
|2 |[Fld1 -> Germany, Fld2 -> Portugal]|
+---+-----------------------------------+
Expected Output:
+---+-------+--------+
| id| Fld1 | Fld2 |
+---+-------+--------+
| 1 | USA | UK |
| 2 |Germany|Portugal|
+---+-------+--------+

Here's the performant solution:
df2
  .withColumn("Fld1", $"map".getItem("Fld1"))
  .withColumn("Fld2", $"map".getItem("Fld2"))
  .drop("map")
  .show()
+---+-------+--------+
| id| Fld1| Fld2|
+---+-------+--------+
| 1| USA| UK|
| 2|Germany|Portugal|
+---+-------+--------+
The other answer suggests using pivot which can be really slow.
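Even when the field names are not known up front, pivot isn't required: the same getItem approach can be made dynamic by collecting the distinct map keys first. A sketch, assuming the key set is small enough to collect to the driver:
import org.apache.spark.sql.functions._

// Collect the distinct keys appearing in the map column (assumes a small key set).
val keys = df2
  .select(explode(map_keys(col("map"))).as("key"))
  .distinct()
  .collect()
  .map(_.getString(0))

// Build one column per key via getItem, keeping id.
val cols = col("id") +: keys.map(k => col("map").getItem(k).as(k))
df2.select(cols: _*).show(false)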

You could explode the map using selectExpr and then apply pivot, as shown below:
df2
  .selectExpr("id", "explode(map)")
  .groupBy(col("id")).pivot(col("key")).agg(first(col("value")))
  .show(false)
// result
+---+-------+--------+
|id |Fld1 |Fld2 |
+---+-------+--------+
|1 |USA |UK |
|2 |Germany|Portugal|
+---+-------+--------+

Related

merge rows in a dataframe by id trying to avoid null values in columns (Spark scala)

I am developing in Spark Scala, and I would like to merge some rows in a dataframe...
My dataframe is the following:
+-------------------------+-------------------+---------------+------------------------------+
|name |col1 |col2 |col3 |
+-------------------------+-------------------+---------------+------------------------------+
| a | null| null| 0.000000|
| a | 0.000000| null| null|
| b | null| null| 0.000000|
| b | 300.000000| null| null|
+-------------------------+-------------------+---------------+------------------------------+
And I want to turn it into the following dataframe:
+-------------------------+-------------------+---------------+------------------------------+
|name |col1 |col2 |col3 |
+-------------------------+-------------------+---------------+------------------------------+
| a | 0.000000| null| 0.000000|
| b | 300.000000| null| 0.000000|
+-------------------------+-------------------+---------------+------------------------------+
Taking into account:
- Some columns can have all values null.
- There can be a lot of columns in the dataframe.
As far as I know, I have to use groupBy with agg(), but I am unable to get the correct expression:
df.groupBy("name").agg()
If "merge" means sum, column list can be received from dataframe schema and included into "agg":
val df = Seq(
  ("a", Option.empty[Double], Option.empty[Double], Some(0.000000)),
  ("a", Some(0.000000), Option.empty[Double], Option.empty[Double]),
  ("b", Option.empty[Double], Option.empty[Double], Some(0.000000)),
  ("b", Some(300.000000), Option.empty[Double], Option.empty[Double])
).toDF("name", "col1", "col2", "col3")

val columnsToMerge = df
  .columns
  .filterNot(_ == "name")
  .map(c => sum(c).alias(c))

df.groupBy("name")
  .agg(columnsToMerge.head, columnsToMerge.tail: _*)
Result:
+----+-----+----+----+
|name|col1 |col2|col3|
+----+-----+----+----+
|a |0.0 |null|0.0 |
|b |300.0|null|0.0 |
+----+-----+----+----+
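If the rows should be coalesced rather than summed (the values here are not really meant to be added), the same pattern works with first(..., ignoreNulls = true); a sketch under that assumption:
import org.apache.spark.sql.functions._

// Take the first non-null value per column within each group.
val columnsToCoalesce = df.columns
  .filterNot(_ == "name")
  .map(c => first(c, ignoreNulls = true).alias(c))

df.groupBy("name")
  .agg(columnsToCoalesce.head, columnsToCoalesce.tail: _*)
  .show(false)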
If you are working in pandas, you can use groupby('name') as you suggest, and then ffill() + bfill():
df = df.groupby('name').ffill().bfill().drop_duplicates(keep='first')
If you want to keep the name column you can use pandas update():
df.update(df.groupby('name').ffill().bfill())
df.drop_duplicates(keep='first', inplace=True)
Result df:
name  col1  col2  col3
a     0     NaN   0
b     300   NaN   0

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has values like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+------+---------+--------+----------+
|UserId|firstname|lastname|middlename|
+------+---------+--------+----------+
|1     |abc      |null    |null      |
|2     |null     |xyz     |null      |
|3     |pqr      |zzz     |null      |
|4     |aaa      |bbb     |xxx       |
+------+---------+--------+----------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the max length of col
but as you may have guessed I get something like this as the output:
+------+----+----+----+
|UserId|col0|col1|col2|
+------+----+----+----+
|1     |abc |null|null|
|2     |xyz |null|null|
|3     |pqr |zzz |null|
|4     |aaa |xxx |bbb |
+------+----+----+----+
NOTE: The above is just an example. There could be more additions to the columns, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve this result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
  (1, "firstname=joe;age=33"),
  (2, "lastname=smith;country=usa"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")

df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  groupBy("user_id").pivot("key").agg(first("value").as("value")).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
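If the cost of pivot is a concern with many distinct keys, one option (a sketch; it still collects the key list once to the driver) is to pass the distinct keys to pivot explicitly, which skips the extra job Spark otherwise runs to discover the pivot values:
import org.apache.spark.sql.functions._

val kv = df.select(col("user_id"), explode(expr("str_to_map(key_values, ';', '=')")))

// Determine the key columns once; assumes the key set is reasonably small.
val keys = kv.select("key").distinct().collect().map(_.getString(0)).toSeq

kv.groupBy("user_id")
  .pivot("key", keys)              // explicit values avoid the discovery pass
  .agg(first("value"))
  .orderBy("user_id")
  .show(false)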
Since your data is delimited by ; and each key/value pair is split by =, you may consider using str_to_map as follows:
Create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
then run the following on your Spark session:
val result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT
UserId,
str_to_map(col,';','=') full_name
FROM
my_table
)
SELECT
UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM
split_data
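If the key names are not known ahead of time, the SELECT list above can also be generated from the data. A rough sketch (the helper names here are illustrative, not part of the original answer):
// Discover the distinct keys present in the data.
val keys = sparkSession.sql(
  """SELECT DISTINCT k
    |FROM (SELECT explode(map_keys(str_to_map(col, ';', '='))) AS k FROM my_table) t""".stripMargin
).collect().map(_.getString(0))

// Build the projection, e.g. "full_name['firstname'] AS `firstname`, ..."
val projection = keys.map(k => s"full_name['$k'] AS `$k`").mkString(", ")

val result_df = sparkSession.sql(
  s"""WITH split_data AS (
     |  SELECT UserId, str_to_map(col, ';', '=') AS full_name FROM my_table
     |)
     |SELECT UserId, $projection FROM split_data""".stripMargin)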
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
- Existence of duplicate keys in column key_values
- Only duplicate-key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
  (1, "firstname=joe;age=33;moviegenre=comedy"),
  (2, "lastname=smith;country=usa;moviegenre=drama"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")

val mainCols = df.columns diff Seq("key_values")

val dfNew = df.
  withColumn("kv_arr", split($"key_values", ";")).
  withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
  groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))

val dupeKeys = Seq("moviegenre")  // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)

dfNew.select(
    mainCols.map(col) ++
    dupeKeys.map(col) ++
    nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
  ).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
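If supplying the duplicate-key list by hand is undesirable, a hedged alternative is to aggregate every pivoted column the same way and join duplicates into a delimited string, so no user input is needed (at the cost of array values becoming strings):
import org.apache.spark.sql.functions._

df.withColumn("kv", explode(expr("transform(split(key_values, ';'), kv -> split(kv, '='))")))
  .groupBy("user_id")
  .pivot(col("kv")(0))
  .agg(concat_ws(",", collect_list(col("kv")(1))))  // duplicates become e.g. "drama,comedy"; missing keys become ""
  .orderBy("user_id")
  .show(false)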

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to keep only the rows of the first DataFrame whose countries appear in the second DF. An example:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |ARG |5 |
|2015-12-14 |GER |1 |
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
|2015-12-14 |USA |1 |
+--------------+------------+-------+

+--------------+------------+
|USE | country_id |
+--------------+------------+
| F |RUS |
| F |CHN |
+--------------+------------+
Expected:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I thought of maybe using intersect? Or would another method be more efficient?
Thanks in advance!
You can use left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
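If DF2 is small (a short country list), a broadcast hint keeps the semi join shuffle-free; the inverse filter is a left_anti join. A sketch:
import org.apache.spark.sql.functions.broadcast

// Rows of DF1 whose country_id appears in DF2 (no columns from DF2 are added).
val kept = DF1.join(broadcast(DF2), Seq("country_id"), "left_semi")

// The complement: rows of DF1 whose country_id does NOT appear in DF2.
val dropped = DF1.join(broadcast(DF2), Seq("country_id"), "left_anti")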

Spark DF: create Seq column in withColumn

I have a df:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|2   |qwertyuio|
+----+---------+
and I want to repeat each row, splitting col2 into 3 substrings of length 3:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|1   |abc      |
|1   |def      |
|1   |ghi      |
|2   |qwertyuio|
|2   |qwe      |
|2   |rty      |
|2   |uio      |
+----+---------+
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3))...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column, and reduce it, getting one final Seq. Then pass it to DF and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead:
val df1 = df.union(
  df.select(
    $"col1",
    explode(array(
      substring(col("col2"), 1, 3),  // substring positions are 1-based
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1| col2|
+----+---------+
| 1|abcdefghi|
| 2|qwertyuio|
| 1| abc|
| 1| def|
| 1| ghi|
| 2| qwe|
| 2| rty|
| 2| uio|
+----+---------+
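If col2 can be longer than 9 characters, a more generic variant (a sketch; it needs Spark 2.4+ for sequence/transform and assumes the length is a multiple of 3) builds the chunk offsets instead of hardcoding three substrings:
import org.apache.spark.sql.functions._

val dfChunks = df.union(
  df.select(
    col("col1"),
    explode(expr(
      // one 3-character chunk per offset 0, 1, 2, ... (substring is 1-based)
      "transform(sequence(0, (length(col2) div 3) - 1), i -> substring(col2, i * 3 + 1, 3))"
    )).as("col2")
  )
)
dfChunks.show(false)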
You can also use a UDF; the lookbehind regex (?<=\\G...) splits the string into chunks of 3 characters:
val df = spark.sparkContext.parallelize(Seq((1L,"abcdefghi"), (2L,"qwertyuio"))).toDF("col1","col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// udf
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+

How to unpivot the table based on the multiple columns

I am using Scala and Spark to unpivot a table that looks like the one below:
+---+----------+--------+-------+------+-----+
| ID| Date | Type1 | Type2 | 0:30 | 1:00|
+---+----------+--------+-------+------+-----+
| G| 12/3/2018| Import|Voltage| 3.5 | 6.8 |
| H| 13/3/2018| Import|Voltage| 7.5 | 9.8 |
| H| 13/3/2018| Export| Watt| 4.5 | 8.9 |
| H| 13/3/2018| Export|Voltage| 5.6 | 9.1 |
+---+----------+--------+-------+------+-----+
I want to transpose it as follow:
| ID|Date | Time|Import-Voltage |Export-Voltage|Import-Watt|Export-Watt|
| G|12/3/2018|0:30 |3.5 |0 |0 |0 |
| G|12/3/2018|1:00 |6.8 |0 |0 |0 |
| H|13/3/2018|0:30 |7.5 |5.6 |0 |4.5 |
| H|13/3/2018|1:00 |9.8 |9.1 |0 |8.9 |
And the Time and Date columns should also be merged, like
12/3/2018 0:30
Not a straightforward task, but one approach would be to:
- group each time and its corresponding value into a "map" of time-value pairs
- flatten it out into a column of time-value pairs
- perform a groupBy-pivot-agg transformation, using time as part of the groupBy key and the types as the pivot column, to aggregate the time-corresponding values
Sample code below:
import org.apache.spark.sql.functions._
val df = Seq(
  ("G", "12/3/2018", "Import", "Voltage", 3.5, 6.8),
  ("H", "13/3/2018", "Import", "Voltage", 7.5, 9.8),
  ("H", "13/3/2018", "Export", "Watt", 4.5, 8.9),
  ("H", "13/3/2018", "Export", "Voltage", 5.6, 9.1)
).toDF("ID", "Date", "Type1", "Type2", "0:30", "1:00")

df.
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  withColumn("TimeVal", explode($"TimeValMap")).
  withColumn("Time", $"TimeVal._1").
  withColumn("Types", concat_ws("-", array($"Type1", $"Type2"))).
  groupBy("ID", "Date", "Time").pivot("Types").agg(first($"TimeVal._2")).
  orderBy("ID", "Date", "Time").
  na.fill(0.0).
  show
// +---+---------+----+--------------+-----------+--------------+
// | ID| Date|Time|Export-Voltage|Export-Watt|Import-Voltage|
// +---+---------+----+--------------+-----------+--------------+
// | G|12/3/2018|0:30| 0.0| 0.0| 3.5|
// | G|12/3/2018|1:00| 0.0| 0.0| 6.8|
// | H|13/3/2018|0:30| 5.6| 4.5| 7.5|
// | H|13/3/2018|1:00| 9.1| 8.9| 9.8|
// +---+---------+----+--------------+-----------+--------------+
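An alternative to the array/struct/explode step (a sketch, not from the original answer) is the SQL stack() generator, which unpivots the time columns directly; it also merges Date and Time into one column as requested, while the groupBy/pivot part stays essentially the same:
import org.apache.spark.sql.functions._

// Unpivot the two time columns into (Time, Value) rows.
val unpivoted = df.selectExpr(
  "ID", "Date", "Type1", "Type2",
  "stack(2, '0:30', `0:30`, '1:00', `1:00`) as (Time, Value)"
)

unpivoted
  .withColumn("DateTime", concat_ws(" ", $"Date", $"Time"))   // e.g. "12/3/2018 0:30"
  .withColumn("Types", concat_ws("-", $"Type1", $"Type2"))
  .groupBy("ID", "DateTime").pivot("Types").agg(first($"Value"))
  .na.fill(0.0)
  .orderBy("ID", "DateTime")
  .show(false)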