I am using Scala and Spark to unpivot a table which looks like the one below:
+---+----------+--------+-------+------+-----+
| ID| Date | Type1 | Type2 | 0:30 | 1:00|
+---+----------+--------+-------+------+-----+
| G| 12/3/2018| Import|Voltage| 3.5 | 6.8 |
| H| 13/3/2018| Import|Voltage| 7.5 | 9.8 |
| H| 13/3/2018| Export| Watt| 4.5 | 8.9 |
| H| 13/3/2018| Export|Voltage| 5.6 | 9.1 |
+---+----------+--------+-------+------+-----+
I want to transpose it as follows:
| ID|Date     |Time|Import-Voltage|Export-Voltage|Import-Watt|Export-Watt|
|  G|12/3/2018|0:30|3.5           |0             |0          |0          |
|  G|12/3/2018|1:00|6.8           |0             |0          |0          |
|  H|13/3/2018|0:30|7.5           |5.6           |0          |4.5        |
|  H|13/3/2018|1:00|9.8           |9.1           |0          |8.9        |
And the Time and Date columns should also be merged, like
12/3/2018 0:30
Not a straightforward task, but one approach would be to:
group each time column and its corresponding value into an array of time-value pairs
flatten (explode) that array into a column of individual time-value pairs
perform a groupBy-pivot-agg transformation, using the time as part of the groupBy key and the concatenated types as the pivot column, aggregating the time's corresponding value
Sample code below:
import org.apache.spark.sql.functions._
val df = Seq(
("G", "12/3/2018", "Import", "Voltage", 3.5, 6.8),
("H", "13/3/2018", "Import", "Voltage", 7.5, 9.8),
("H", "13/3/2018", "Export", "Watt", 4.5, 8.9),
("H", "13/3/2018", "Export", "Voltage", 5.6, 9.1)
).toDF("ID", "Date", "Type1", "Type2", "0:30", "1:00")
df.
withColumn("TimeValMap", array(
struct(lit("0:30").as("_1"), $"0:30".as("_2")),
struct(lit("1:00").as("_1"), $"1:00".as("_2"))
)).
withColumn("TimeVal", explode($"TimeValMap")).
withColumn("Time", $"TimeVal._1").
withColumn("Types", concat_ws("-", array($"Type1", $"Type2"))).
groupBy("ID", "Date", "Time").pivot("Types").agg(first($"TimeVal._2")).
orderBy("ID", "Date", "Time").
na.fill(0.0).
show
// +---+---------+----+--------------+-----------+--------------+
// | ID| Date|Time|Export-Voltage|Export-Watt|Import-Voltage|
// +---+---------+----+--------------+-----------+--------------+
// | G|12/3/2018|0:30| 0.0| 0.0| 3.5|
// | G|12/3/2018|1:00| 0.0| 0.0| 6.8|
// | H|13/3/2018|0:30| 5.6| 4.5| 7.5|
// | H|13/3/2018|1:00| 9.1| 8.9| 9.8|
// +---+---------+----+--------------+-----------+--------------+
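The output above keeps Date and Time as separate columns; if you also want them merged into a single column as asked in the question, a minimal follow-up sketch would be (assuming the pivoted/filled result before .show is assigned to a val named pivoted, a name introduced here for illustration):
val merged = pivoted.
  withColumn("DateTime", concat_ws(" ", $"Date", $"Time")).
  drop("Date", "Time")

merged.show
// DateTime values would look like "12/3/2018 0:30"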
I have a column which has value like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+------+---------+--------+----------+
|UserId|firstname|lastname|middlename|
+------+---------+--------+----------+
|1     |abc      |null    |null      |
|2     |null     |xyz     |null      |
|3     |pqr      |zzz     |null      |
|4     |aaa      |bbb     |xxx       |
+------+---------+--------+----------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the maximum number of key=value pairs in col
but as you may have guessed I get something like this as the output:
+------+----+----+----+
|UserId|col0|col1|col2|
+------+----+----+----+
|1     |abc |null|null|
|2     |xyz |null|null|
|3     |pqr |zzz |null|
|4     |aaa |xxx |bbb |
+------+----+----+----+
NOTE: The above is just an example. More key/value pairs could be added, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve this result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
(1, "firstname=joe;age=33"),
(2, "lastname=smith;country=usa"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")
df.
select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
groupBy("user_id").pivot("key").agg(first("value").as("value")).
orderBy("user_id"). // only for ordered output
show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
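A side note, since you mention 40-50 dynamic keys: if you like, you can compute the distinct keys once and pass them to pivot, which spares Spark an extra pass over the data to discover the pivot values. A rough sketch on top of the code above (kvDf and keys are names introduced here):
val kvDf = df.select($"user_id", explode(expr("str_to_map(key_values, ';', '=')")))
val keys = kvDf.select("key").distinct.as[String].collect.toSeq

kvDf.groupBy("user_id").
  pivot("key", keys).
  agg(first("value").as("value")).
  orderBy("user_id").
  show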
Since your data is delimited by ; and each key/value pair within it is delimited by =, you may consider using str_to_map as follows:
create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
then run the following SQL on your Spark session
val result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT
UserId,
str_to_map(col,';','=') full_name
FROM
my_table
)
SELECT
UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM
split_data
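Putting the pieces together in Scala, this would look roughly as follows (assuming sparkSession is the name of your SparkSession variable, as in the snippet above):
val result_df = sparkSession.sql("""
  WITH split_data AS (
    SELECT
      UserId,
      str_to_map(col, ';', '=') AS full_name
    FROM my_table
  )
  SELECT
    UserId,
    full_name['firstname']  AS firstname,
    full_name['lastname']   AS lastname,
    full_name['middlename'] AS middlename
  FROM split_data
""")

result_df.show(false)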
This solution is proposed in accordance with the expanded requirement described in the comments on the other answer:
column key_values may contain duplicate keys
only the columns for those duplicate keys should be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
(1, "firstname=joe;age=33;moviegenre=comedy"),
(2, "lastname=smith;country=usa;moviegenre=drama"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")
val mainCols = df.columns diff Seq("key_values")
val dfNew = df.
withColumn("kv_arr", split($"key_values", ";")).
withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))
val dupeKeys = Seq("moviegenre") // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)
dfNew.select(
mainCols.map(col) ++
dupeKeys.map(col) ++
nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
).
orderBy("user_id"). // only for ordered output
show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
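As a quick check of that limitation, you can try str_to_map on a string with a repeated key (assuming a spark-shell session with org.apache.spark.sql.functions._ and the implicits in scope); depending on your Spark version and the spark.sql.mapKeyDedupPolicy setting this either raises a duplicate map key error or resolves to a single value, but it never yields both values:
Seq("moviegenre=drama;moviegenre=comedy").toDF("kv").
  select(expr("str_to_map(kv, ';', '=')").as("m")).
  show(false)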
I created a dataframe in spark with the following schema:
root
|-- user_id: string (nullable = true)
 |-- rate: decimal(32,16) (nullable = true)
 |-- date: timestamp (nullable = true)
|-- type: string (nullable = true)
The data in my DataFrame looks like this:
+-------+----+----------+----+
|user_id|rate|date      |type|
+-------+----+----------+----+
|XO_121 |10  |2020-04-20|A   |
|XO_121 |10  |2020-04-21|A   |
|XO_121 |30  |2020-04-22|A   |
|XO_121 |0   |2020-04-23|A   |
|XO_121 |0   |2020-04-24|A   |
|XO_121 |0   |2020-04-25|A   |
|XO_121 |0   |2020-04-26|A   |
|XO_121 |5   |2020-04-27|A   |
|XO_121 |0   |2020-04-28|A   |
|XO_121 |0   |2020-04-29|A   |
|XO_121 |1   |2020-04-30|A   |
+-------+----+----------+----+
I want to save space, so I want to skip rows where rate is zero, keeping only the first occurrence in each run of zeros. Other duplicate rates are allowed (as you can see with 10), and the date order needs to be preserved. So after applying the filter my data should look like this:
+-------+----+----------+----+
|user_id|rate|date      |type|
+-------+----+----------+----+
|XO_121 |10  |2020-04-20|A   |
|XO_121 |10  |2020-04-21|A   |
|XO_121 |30  |2020-04-22|A   |
|XO_121 |0   |2020-04-23|A   |
|XO_121 |5   |2020-04-27|A   |
|XO_121 |0   |2020-04-28|A   |
|XO_121 |1   |2020-04-30|A   |
+-------+----+----------+----+
I'm new to Spark and just want to find a way to do this filter. I tried a rank-based approach, but it didn't work. Can anybody provide a solution to this problem?
Data Preparation :
val df = Seq( ("XO_121","10","2020-04-20"),("XO_121","10","2020-04-21"),("XO_121","30","2020-04-22"),("XO_121","0","2020-04-23"),("XO_121","0","2020-04-24"),("XO_121","0","2020-04-25"),("XO_121","0","2020-04-26"),("XO_121","5","2020-04-27"),("XO_121","0","2020-04-28"),("XO_121","0","2020-04-29"),("XO_121","1","2020-04-30"))
.toDF("user_id","rate","date")
Get the previous value of rate with lag, and for each record filter out the rows where both "rate" === "0" and "previous_rate" === "0":
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val winSpec = Window.partitionBy("user_id").orderBy("date")
val finalDf = df.withColumn("previous_rate", lag("rate", 1).over(winSpec))
  .filter(!($"rate" === "0" && $"previous_rate" === "0"))
  .drop("previous_rate")
Output :
scala> finalDf.show
+-------+----+----------+
|user_id|rate| date|
+-------+----+----------+
| XO_121| 10|2020-04-20|
| XO_121| 10|2020-04-21|
| XO_121| 30|2020-04-22|
| XO_121| 0|2020-04-23|
| XO_121| 5|2020-04-27|
| XO_121| 0|2020-04-28|
| XO_121| 1|2020-04-30|
+-------+----+----------+
Now you can apply orderBy($"date") or orderBy($"user_id", $"date"), whichever is applicable for you.
You can use row_number() instead of rank, as below (this snippet is PySpark):
from pyspark.sql import functions as F, Window as W

_w = W.partitionBy("col2").orderBy("col1")
df = df.withColumn("rnk", F.row_number().over(_w))
df = df.filter(F.col("rnk") == 1)
df.show()
+------+----+---+
| col1|col2|rnk|
+------+----+---+
|XO_121| 0| 1|
|XO_121| 10| 1|
|XO_121| 30| 1|
|XO_121| 20| 1|
|XO_121| 40| 1|
+------+----+---+
Also, you can use first() in case you know that 0 is the only value that repeats:
df = df.groupBy("col1","col2").agg(F.first("col2").alias("col2")).orderBy("col2")
df.show()
+------+----+----+
| col1|col2|col2|
+------+----+----+
|XO_121| 0| 0|
|XO_121| 10| 10|
|XO_121| 20| 20|
|XO_121| 30| 30|
|XO_121| 50| 50|
+------+----+----+
I have a DataFrame that has a list of countries and the corresponding data. However, the country codes are either ISO3 or ISO2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can contain either USA or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 & iso_3 columns.
Join your actual DataFrame with this DataFrame and apply your logic on that data.
Check the code below for a sample.
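For reference, such a lookup DataFrame can be loaded from a CSV of ISO-3166 codes or, for a quick test, built from a small Seq like the one below (illustrative rows only; the full table shown afterwards obviously has many more entries):
val countryDF = Seq(
  ("United States of America", "US", "USA"),
  ("India",                    "IN", "IND"),
  ("Indonesia",                "ID", "IDN"),
  ("Iraq",                     "IQ", "IRQ")
).toDF("country_name", "iso_2", "iso_3")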
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
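One caveat worth noting (an addition, not part of the original answer): because the join is a left join, any code that is missing from the lookup table ends up with a null country_name. If you would rather fall back to the raw code in that case, you could wrap country_name in coalesce, e.g.:
df.join(countryDF, df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3"), "left")
  .select(df("country"), coalesce(countryDF("country_name"), df("country")).as("country_name"))
  .groupBy($"country_name")
  .agg(collect_list($"country").as("country_code"), count("*").as("country_count"))
  .show(false)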
This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
That solution works, but the expected output should count some rows in more than one category, conditionally.
So the output should look like:
+-------+--------+-----+
|city   |media   |count|
+-------+--------+-----+
|Boston |facebook|1    |
|Boston |share1  |2    |
|Boston |share2  |2    |
|Boston |twitter |1    |
|Toronto|twitter |1    |
|Toronto|like    |1    |
|Warsaw |facebook|2    |
|Warsaw |share1  |1    |
|Warsaw |share2  |1    |
|Warsaw |like    |1    |
+-------+--------+-----+
Here, if the action is share, I need that row counted in both share1 and share2. When I compute this programmatically, I use a case statement: when the action is share, then share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark, or SQL?
A simple filter and a few unions should give you your desired output:
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
media.union(action.filter($"media" =!= "share"))
.union(share.withColumn("media", lit("share1")))
.union(share.withColumn("media", lit("share2")))
.show(false)
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+
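If you would rather avoid the unions, an alternative sketch (a variant suggested here, not part of the original answer, assuming org.apache.spark.sql.functions._ is in scope) expands each share row into share1 and share2 up front and then performs a single groupBy; on the sample data it produces the same counts:
val expanded = sales.withColumn(
  "media",
  explode(
    when($"action" === "share", array($"media", lit("share1"), lit("share2"))).
      otherwise(array($"media", $"action"))
  )
)

expanded.groupBy("city", "media").count().show(false)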
I have two DataFrames.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is the actual dataframe on which I need to apply the groupBy operation
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the key and value columns.
The process has to take each key from dataframe1, look up its corresponding value (a comma-separated list of column names), and perform the groupBy operation on dataframe2:
Dframe = df.groupBy($"key").sum("amount").show()
Expected output: generate three dataframes, based on the number of keys in dataframe1.
d1 = df.groupBy($"key1").sum("amount").show()
which effectively has to be: df.groupBy($"column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2 = df.groupBy($"key2").sum("amount").show()
which effectively is: df.groupBy($"column2").sum("amount").show()
DataFrame:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code:
d3 = df.groupBy($"key3").sum("amount").show()
which effectively is: df.groupBy($"column1", $"column3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 320 |
+---+-----+-----+
| A | PQR | 10 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In the future, if I add more keys, it has to produce the corresponding dataframes. Can someone help me?
Given the key/value dataframe as below (though, as explained further down, I suggest you not form a dataframe from that source data in the first place):
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
You can convert the first dataframe to a local array of key/value pairs (this is why I suggested not forming a dataframe from that source data at all: it has to be collected back to the driver anyway):
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
and then loop over the collected pairs as
import org.apache.spark.sql.functions._
for(kv <- maps){
df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
//you can store the results in separate dataframes or write them to files or database
}
You should have the following outputs:
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
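And if you need to keep the grouped results around instead of just showing them (as the comment inside the loop suggests), one possible sketch reusing maps and df2 from above (results is a name introduced here) is:
val results: Map[String, org.apache.spark.sql.DataFrame] =
  maps.map { case (key, value) =>
    key.toString -> df2.groupBy(value.toString.split(",").map(col): _*).
      agg(sum($"Amount").as("sum_amount"))
  }.toMap

results("key3").show(false)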