How to convert a map to individual columns in Spark Scala?

I have a Spark dataframe with values like below, and I am struggling to find a way to convert the map column in the input dataframe into separate columns like Id, Fld1, Fld2. Appreciate any help or a pointer to the documentation that covers this?
val df2 = Seq(
  ("1", Map("Fld1" -> "USA", "Fld2" -> "UK")),
  ("2", Map("Fld1" -> "Germany", "Fld2" -> "Portugal"))
).toDF("id", "map")
df2.show(false)
Input:
+---+-----------------------------------+
|id |map |
+---+-----------------------------------+
|1 |[Fld1 -> USA, Fld2 -> UK] |
|2 |[Fld1 -> Germany, Fld2 -> Portugal]|
+---+-----------------------------------+
Expected Output:
+---+-------+--------+
| id| Fld1 | Fld2 |
+---+-------+--------+
| 1 | USA | UK |
| 2 |Germany|Portugal|
+---+-------+--------+

Here's the performant solution:
df2
  .withColumn("Fld1", $"map".getItem("Fld1"))
  .withColumn("Fld2", $"map".getItem("Fld2"))
  .drop("map")
  .show()
+---+-------+--------+
| id| Fld1| Fld2|
+---+-------+--------+
| 1| USA| UK|
| 2|Germany|Portugal|
+---+-------+--------+
The other answer suggests using pivot which can be really slow.
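Even when the field names are not known up front, pivot isn't required: the same getItem approach can be made dynamic by collecting the distinct map keys first. A sketch, assuming the key set is small enough to collect to the driver:
import org.apache.spark.sql.functions._

// Collect the distinct keys appearing in the map column (assumes a small key set).
val keys = df2
  .select(explode(map_keys(col("map"))).as("key"))
  .distinct()
  .collect()
  .map(_.getString(0))

// Build one column per key via getItem, keeping id.
val cols = col("id") +: keys.map(k => col("map").getItem(k).as(k))
df2.select(cols: _*).show(false)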

You could explode the map using selectExpr and then apply pivot, as shown below:
df2
  .selectExpr("id", "explode(map)")
  .groupBy(col("id")).pivot(col("key")).agg(first(col("value")))
  .show(false)
// result
+---+-------+--------+
|id |Fld1 |Fld2 |
+---+-------+--------+
|1 |USA |UK |
|2 |Germany|Portugal|
+---+-------+--------+

Related

merge rows in a dataframe by id trying to avoid null values in columns (Spark scala)

I am developing in Spark Scala, and I would like to merge some rows in a dataframe...
My dataframe is the following:
+-------------------------+-------------------+---------------+------------------------------+
|name |col1 |col2 |col3 |
+-------------------------+-------------------+---------------+------------------------------+
| a | null| null| 0.000000|
| a | 0.000000| null| null|
| b | null| null| 0.000000|
| b | 300.000000| null| null|
+-------------------------+-------------------+---------------+------------------------------+
And I want to turn it into the following dataframe:
+-------------------------+-------------------+---------------+------------------------------+
|name |col1 |col2 |col3 |
+-------------------------+-------------------+---------------+------------------------------+
| a | 0.000000| null| 0.000000|
| b | 300.000000| null| 0.000000|
+-------------------------+-------------------+---------------+------------------------------+
Taking into account:
- Some columns can have all values null.
- There can be a lot of columns in the dataframe.
As far as I know, I have to use groupBy with agg(), but I am unable to get the correct expression:
df.groupBy("name").agg()
If "merge" means sum, column list can be received from dataframe schema and included into "agg":
val df = Seq(
  ("a", Option.empty[Double], Option.empty[Double], Some(0.000000)),
  ("a", Some(0.000000), Option.empty[Double], Option.empty[Double]),
  ("b", Option.empty[Double], Option.empty[Double], Some(0.000000)),
  ("b", Some(300.000000), Option.empty[Double], Option.empty[Double])
).toDF("name", "col1", "col2", "col3")

val columnsToMerge = df
  .columns
  .filterNot(_ == "name")
  .map(c => sum(c).alias(c))

df.groupBy("name")
  .agg(columnsToMerge.head, columnsToMerge.tail: _*)
Result:
+----+-----+----+----+
|name|col1 |col2|col3|
+----+-----+----+----+
|a |0.0 |null|0.0 |
|b |300.0|null|0.0 |
+----+-----+----+----+
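If the rows should be coalesced rather than summed (the values here are not really meant to be added), the same pattern works with first(..., ignoreNulls = true); a sketch under that assumption:
import org.apache.spark.sql.functions._

// Take the first non-null value per column within each group.
val columnsToCoalesce = df.columns
  .filterNot(_ == "name")
  .map(c => first(c, ignoreNulls = true).alias(c))

df.groupBy("name")
  .agg(columnsToCoalesce.head, columnsToCoalesce.tail: _*)
  .show(false)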
If you are working in pandas, you can use groupby('name') as you suggest, and then ffill() + bfill():
df = df.groupby('name').ffill().bfill().drop_duplicates(keep='first')
If you want to keep the name column you can use pandas update():
df.update(df.groupby('name').ffill().bfill())
df.drop_duplicates(keep='first', inplace=True)
Result df:
name  col1  col2  col3
a     0     NaN   0
b     300   NaN   0

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has values like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+------+---------+--------+----------+
|UserId|firstname|lastname|middlename|
+------+---------+--------+----------+
|1     |abc      |null    |null      |
|2     |null     |xyz     |null      |
|3     |pqr      |zzz     |null      |
|4     |aaa      |bbb     |xxx       |
+------+---------+--------+----------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the max length of col
but as you may have guessed I get something like this as the output:
+------+----+----+----+
|UserId|col0|col1|col2|
+------+----+----+----+
|1     |abc |null|null|
|2     |xyz |null|null|
|3     |pqr |zzz |null|
|4     |aaa |xxx |bbb |
+------+----+----+----+
NOTE: The above is just an example. There could be more additions to the columns, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve this result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
  (1, "firstname=joe;age=33"),
  (2, "lastname=smith;country=usa"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")

df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  groupBy("user_id").pivot("key").agg(first("value").as("value")).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
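If the cost of pivot is a concern with many distinct keys, one option (a sketch; it still collects the key list once to the driver) is to pass the distinct keys to pivot explicitly, which skips the extra job Spark otherwise runs to discover the pivot values:
import org.apache.spark.sql.functions._

val kv = df.select(col("user_id"), explode(expr("str_to_map(key_values, ';', '=')")))

// Determine the key columns once; assumes the key set is reasonably small.
val keys = kv.select("key").distinct().collect().map(_.getString(0)).toSeq

kv.groupBy("user_id")
  .pivot("key", keys)              // explicit values avoid the discovery pass
  .agg(first("value"))
  .orderBy("user_id")
  .show(false)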
Since your data is delimited by ; and each key/value pair is split by =, you may consider using str_to_map as follows:
Create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
then run the following on your Spark session:
val result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT
UserId,
str_to_map(col,';','=') full_name
FROM
my_table
)
SELECT
UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM
split_data
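If the key names are not known ahead of time, the SELECT list above can also be generated from the data. A rough sketch (the helper names here are illustrative, not part of the original answer):
// Discover the distinct keys present in the data.
val keys = sparkSession.sql(
  """SELECT DISTINCT k
    |FROM (SELECT explode(map_keys(str_to_map(col, ';', '='))) AS k FROM my_table) t""".stripMargin
).collect().map(_.getString(0))

// Build the projection, e.g. "full_name['firstname'] AS `firstname`, ..."
val projection = keys.map(k => s"full_name['$k'] AS `$k`").mkString(", ")

val result_df = sparkSession.sql(
  s"""WITH split_data AS (
     |  SELECT UserId, str_to_map(col, ';', '=') AS full_name FROM my_table
     |)
     |SELECT UserId, $projection FROM split_data""".stripMargin)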
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
- Existence of duplicate keys in column key_values
- Only duplicate-key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
  (1, "firstname=joe;age=33;moviegenre=comedy"),
  (2, "lastname=smith;country=usa;moviegenre=drama"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")

val mainCols = df.columns diff Seq("key_values")

val dfNew = df.
  withColumn("kv_arr", split($"key_values", ";")).
  withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
  groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))

val dupeKeys = Seq("moviegenre")  // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)

dfNew.select(
    mainCols.map(col) ++
    dupeKeys.map(col) ++
    nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
  ).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
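If supplying the duplicate-key list by hand is undesirable, a hedged alternative is to aggregate every pivoted column the same way and join duplicates into a delimited string, so no user input is needed (at the cost of array values becoming strings):
import org.apache.spark.sql.functions._

df.withColumn("kv", explode(expr("transform(split(key_values, ';'), kv -> split(kv, '='))")))
  .groupBy("user_id")
  .pivot(col("kv")(0))
  .agg(concat_ws(",", collect_list(col("kv")(1))))  // duplicates become e.g. "drama,comedy"; missing keys become ""
  .orderBy("user_id")
  .show(false)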

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to keep only the rows of the first DataFrame whose countries appear in the second DF. An example:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |ARG |5 |
|2015-12-14 |GER |1 |
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
|2015-12-14 |USA |1 |
+--------------+------------+-------+

+--------------+------------+
|USE | country_id |
+--------------+------------+
| F |RUS |
| F |CHN |
+--------------+------------+
Expected:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I thought of maybe using intersect? Or would another method be more efficient?
Thanks in advance!
You can use left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
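If DF2 is small (a short country list), a broadcast hint keeps the semi join shuffle-free; the inverse filter is a left_anti join. A sketch:
import org.apache.spark.sql.functions.broadcast

// Rows of DF1 whose country_id appears in DF2 (no columns from DF2 are added).
val kept = DF1.join(broadcast(DF2), Seq("country_id"), "left_semi")

// The complement: rows of DF1 whose country_id does NOT appear in DF2.
val dropped = DF1.join(broadcast(DF2), Seq("country_id"), "left_anti")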

Spark DF: create Seq column in withColumn

I have a df:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|2   |qwertyuio|
+----+---------+
and I want to repeat each row, splitting col2 into 3 substrings of length 3:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|1   |abc      |
|1   |def      |
|1   |ghi      |
|2   |qwertyuio|
|2   |qwe      |
|2   |rty      |
|2   |uio      |
+----+---------+
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3))...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column, and reduce it, getting one final Seq. Then pass it to DF and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead:
val df1 = df.union(
  df.select(
    $"col1",
    explode(array(
      substring(col("col2"), 1, 3),  // substring positions are 1-based
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1| col2|
+----+---------+
| 1|abcdefghi|
| 2|qwertyuio|
| 1| abc|
| 1| def|
| 1| ghi|
| 2| qwe|
| 2| rty|
| 2| uio|
+----+---------+
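If col2 can be longer than 9 characters, a more generic variant (a sketch; it needs Spark 2.4+ for sequence/transform and assumes the length is a multiple of 3) builds the chunk offsets instead of hardcoding three substrings:
import org.apache.spark.sql.functions._

val dfChunks = df.union(
  df.select(
    col("col1"),
    explode(expr(
      // one 3-character chunk per offset 0, 1, 2, ... (substring is 1-based)
      "transform(sequence(0, (length(col2) div 3) - 1), i -> substring(col2, i * 3 + 1, 3))"
    )).as("col2")
  )
)
dfChunks.show(false)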
You can also use a UDF; the lookbehind regex (?<=\\G...) splits the string into chunks of 3 characters:
val df = spark.sparkContext.parallelize(Seq((1L,"abcdefghi"), (2L,"qwertyuio"))).toDF("col1","col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// udf
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+

How to unpivot the table based on the multiple columns

I am using Scala and Spark to unpivot a table that looks like the one below:
+---+----------+--------+-------+------+-----+
| ID| Date | Type1 | Type2 | 0:30 | 1:00|
+---+----------+--------+-------+------+-----+
| G| 12/3/2018| Import|Voltage| 3.5 | 6.8 |
| H| 13/3/2018| Import|Voltage| 7.5 | 9.8 |
| H| 13/3/2018| Export| Watt| 4.5 | 8.9 |
| H| 13/3/2018| Export|Voltage| 5.6 | 9.1 |
+---+----------+--------+-------+------+-----+
I want to transpose it as follow:
| ID|Date | Time|Import-Voltage |Export-Voltage|Import-Watt|Export-Watt|
| G|12/3/2018|0:30 |3.5 |0 |0 |0 |
| G|12/3/2018|1:00 |6.8 |0 |0 |0 |
| H|13/3/2018|0:30 |7.5 |5.6 |0 |4.5 |
| H|13/3/2018|1:00 |9.8 |9.1 |0 |8.9 |
And the Time and Date columns should also be merged, like
12/3/2018 0:30
Not a straightforward task, but one approach would be to:
- group each time and its corresponding value into a "map" of time-value pairs
- flatten it out into a column of time-value pairs
- perform a groupBy-pivot-agg transformation, using time as part of the groupBy key and the types as the pivot column, to aggregate the time-corresponding values
Sample code below:
import org.apache.spark.sql.functions._
val df = Seq(
  ("G", "12/3/2018", "Import", "Voltage", 3.5, 6.8),
  ("H", "13/3/2018", "Import", "Voltage", 7.5, 9.8),
  ("H", "13/3/2018", "Export", "Watt", 4.5, 8.9),
  ("H", "13/3/2018", "Export", "Voltage", 5.6, 9.1)
).toDF("ID", "Date", "Type1", "Type2", "0:30", "1:00")

df.
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  withColumn("TimeVal", explode($"TimeValMap")).
  withColumn("Time", $"TimeVal._1").
  withColumn("Types", concat_ws("-", array($"Type1", $"Type2"))).
  groupBy("ID", "Date", "Time").pivot("Types").agg(first($"TimeVal._2")).
  orderBy("ID", "Date", "Time").
  na.fill(0.0).
  show
// +---+---------+----+--------------+-----------+--------------+
// | ID| Date|Time|Export-Voltage|Export-Watt|Import-Voltage|
// +---+---------+----+--------------+-----------+--------------+
// | G|12/3/2018|0:30| 0.0| 0.0| 3.5|
// | G|12/3/2018|1:00| 0.0| 0.0| 6.8|
// | H|13/3/2018|0:30| 5.6| 4.5| 7.5|
// | H|13/3/2018|1:00| 9.1| 8.9| 9.8|
// +---+---------+----+--------------+-----------+--------------+
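An alternative to the array/struct/explode step (a sketch, not from the original answer) is the SQL stack() generator, which unpivots the time columns directly; it also merges Date and Time into one column as requested, while the groupBy/pivot part stays essentially the same:
import org.apache.spark.sql.functions._

// Unpivot the two time columns into (Time, Value) rows.
val unpivoted = df.selectExpr(
  "ID", "Date", "Type1", "Type2",
  "stack(2, '0:30', `0:30`, '1:00', `1:00`) as (Time, Value)"
)

unpivoted
  .withColumn("DateTime", concat_ws(" ", $"Date", $"Time"))   // e.g. "12/3/2018 0:30"
  .withColumn("Types", concat_ws("-", $"Type1", $"Type2"))
  .groupBy("ID", "DateTime").pivot("Types").agg(first($"Value"))
  .na.fill(0.0)
  .orderBy("ID", "DateTime")
  .show(false)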