Reduce a JSON string column into a key/value column - Scala

I have a DataFrame with the following structure:
| a  | b   | c                                                  |
|----|-----|----------------------------------------------------|
| 01 | ABC | {"key1":"valueA","key2":"valueC"}                  |
| 02 | ABC | {"key1":"valueA","key2":"valueC"}                  |
| 11 | DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
| 12 | DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
I would like to turn it into something like:
| a  | b   | key  | value  |
|----|-----|------|--------|
| 01 | ABC | key1 | valueA |
| 01 | ABC | key2 | valueC |
| 02 | ABC | key1 | valueA |
| 02 | ABC | key2 | valueC |
| 11 | DEF | key1 | valueB |
| 11 | DEF | key2 | valueD |
| 11 | DEF | key3 | valueE |
| 12 | DEF | key1 | valueB |
| 12 | DEF | key2 | valueD |
| 12 | DEF | key3 | valueE |
in an efficient way, as the dataset can be quite large.

Try using the from_json function, then explode the resulting map column.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF; already in scope in spark-shell

val df = Seq(("01", "ABC", """{"key1":"valueA","key2":"valueC"}""")).toDF("a", "b", "c")
val schema = MapType(StringType, StringType)  // parse the JSON string as a string-to-string map

df.withColumn("d", from_json(col("c"), schema))
  .selectExpr("a", "b", "explode(d)")  // exploding a map yields key/value columns
  .show(10, false)
//+---+---+----+------+
//|a |b |key |value |
//+---+---+----+------+
//|01 |ABC|key1|valueA|
//|01 |ABC|key2|valueC|
//+---+---+----+------+
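For the full sample in the question, the same approach gives the desired shape; a minimal sketch (assuming a spark-shell session, so spark.implicits._ is available for toDF):

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("01", "ABC", """{"key1":"valueA","key2":"valueC"}"""),
  ("02", "ABC", """{"key1":"valueA","key2":"valueC"}"""),
  ("11", "DEF", """{"key1":"valueB","key2":"valueD","key3":"valueE"}"""),
  ("12", "DEF", """{"key1":"valueB","key2":"valueD","key3":"valueE"}""")
).toDF("a", "b", "c")

df.withColumn("d", from_json(col("c"), MapType(StringType, StringType)))
  .select($"a", $"b", explode($"d"))  // one row per map entry, with key and value columns
  .show(false)

Everything here stays in native SQL expressions (no UDFs), so it should scale reasonably well to a large dataset.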

Related

SparkSQL - Add new column to DataFrame based on an aggregation

Having the following DataFrame:
+--------+----------+------------+
|user_id | level    | new_columns|
+--------+----------+------------+
|4       | B        | null       |
|6       | B        | null       |
|5       | A        | col1       |
|3       | B        | col2       |
|5       | A        | col2       |
|2       | A        | null       |
|1       | A        | col3       |
+--------+----------+------------+
I need to convert each non-null value of the new_columns column into a new column, which should be done based on an aggregation on the user_id column. The desired output would be:
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4       | null | null | null |
|6       | null | null | null |
|5       | A    | A    | null |
|3       | null | B    | null |
|2       | null | null | null |
|1       | null | null | A    |
+--------+------+------+------+
As you can see, the values of the new columns come from the level column in the base DF. I know how to use the withColumn method to add new columns to a DF, but here the critical part is how to add new columns to the aggregated DF (for the case of user_id = 5).
Any hint based on the DataFrame API would be appreciated.
You can do a pivot:
val df2 = df.groupBy("event_id")
.pivot("new_columns")
.agg(first("level"))
.drop("null")
df2.show
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4       | null | null | null |
|6       | null | null | null |
|5       | A    | A    | null |
|3       | null | B    | null |
|2       | null | null | null |
|1       | null | null | A    |
+--------+------+------+------+
You can collect the non-null values of new_columns first before doing the pivot:
// .as[String] needs import spark.implicits._ (available by default in spark-shell)
val nonNull = df.select("new_columns").filter("new_columns is not null").distinct().as[String].collect()
val df1 = df.groupBy("user_id")
  .pivot("new_columns", nonNull)
  .agg(first("level"))
df1.show()
//+-------+----+----+----+
//|user_id|col3|col1|col2|
//+-------+----+----+----+
//| 1| A|null|null|
//| 6|null|null|null|
//| 3|null|null| B|
//| 5|null| A| A|
//| 4|null|null|null|
//| 2|null|null|null|
//+-------+----+----+----+
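If you want the pivoted columns in a fixed order matching the desired output, you can pass the values explicitly; a sketch, assuming the column names col1, col2, col3 are known up front:

val df1 = df.groupBy("user_id")
  .pivot("new_columns", Seq("col1", "col2", "col3"))  // explicit pivot values fix the column order
  .agg(first("level"))
  .orderBy("user_id")
df1.show()

Passing the pivot values explicitly also saves the extra job Spark otherwise runs just to compute the distinct values of new_columns.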

How to rank dataframe depending on a group of rows in a column?

I have this dataframe:
+-----+----------+---------+
|num |Timestamp |frequency|
+-----+----------+---------+
|20.0 |1632899456|4 |
|20.0 |1632901256|4 |
|20.0 |1632901796|4 |
|20.0 |1632899155|4 |
|10.0 |1632901743|2 |
|10.0 |1632899933|2 |
|91.0 |1632899756|1 |
|32.0 |1632900776|1 |
|41.0 |1632900176|1 |
+-----+----------+---------+
I want to add a column containing the rank of each frequency. The new dataframe would look like this:
+-----+----------+---------+------------+
|num |Timestamp |frequency|rank |
+-----+----------+---------+------------+
|20.0 |1632899456|4 |1 |
|20.0 |1632901256|4 |1 |
|20.0 |1632901796|4 |1 |
|20.0 |1632899155|4 |1 |
|10.0 |1632901743|2 |2 |
|10.0 |1632899933|2 |2 |
|91.0 |1632899756|1 |3 |
|32.0 |1632900776|1 |3 |
|41.0 |1632900176|1 |3 |
+-----+----------+---------+------------+
I am using Spark version 2.4.3 with SQLContext, and the Scala language.
You can use dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency"))))

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However, the countries are given as either iso3 or iso2 codes.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can have USA as the country code or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2, and iso_3 columns. Join your actual DataFrame with this lookup DataFrame and apply your aggregation logic on the joined data. Check the code below for a sample.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
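If you do not already have such a lookup table, here is a minimal sketch of building it inline. The rows below are only an illustrative subset (in practice you would load the full ISO 3166 list, e.g. from a reference CSV), and broadcast is just a hint that the small lookup should be shipped to every executor instead of shuffling the large side:

import org.apache.spark.sql.functions._
import spark.implicits._

// illustrative subset of the ISO 3166 lookup
val countryDF = Seq(
  ("United States of America", "US", "USA"),
  ("India",                    "IN", "IND"),
  ("Indonesia",                "ID", "IDN"),
  ("Iraq",                     "IQ", "IRQ")
).toDF("country_name", "iso_2", "iso_3")

df.join(broadcast(countryDF),
        df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3"),
        "left")
  .select(df("country"), countryDF("country_name"))
  .groupBy($"country_name")
  .agg(collect_list($"country").as("country_code"), count("*").as("country_count"))
  .show(false)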

Creating a new column based on a window and a condition in Spark

INITIAL DATA FRAME:
+------------------------------+----------+-------+
| Timestamp | Property | Value |
+------------------------------+----------+-------+
| 2019-09-01T01:36:57.000+0000 | X | N |
| 2019-09-01T01:37:39.000+0000 | A | 3 |
| 2019-09-01T01:42:55.000+0000 | X | Y |
| 2019-09-01T01:53:44.000+0000 | A | 17 |
| 2019-09-01T01:55:34.000+0000 | A | 9 |
| 2019-09-01T01:57:32.000+0000 | X | N |
| 2019-09-01T02:59:40.000+0000 | A | 2 |
| 2019-09-01T02:00:03.000+0000 | A | 16 |
| 2019-09-01T02:01:40.000+0000 | X | Y |
| 2019-09-01T02:04:03.000+0000 | A | 21 |
+------------------------------+----------+-------+
FINAL DATA FRAME:
+------------------------------+----------+-------+---+
| Timestamp | Property | Value | X |
+------------------------------+----------+-------+---+
| 2019-09-01T01:37:39.000+0000 | A | 3 | N |
| 2019-09-01T01:53:44.000+0000 | A | 17 | Y |
| 2019-09-01T01:55:34.000+0000 | A | 9 | Y |
| 2019-09-01T02:00:03.000+0000 | A | 16 | N |
| 2019-09-01T02:04:03.000+0000 | A | 21 | Y |
| 2019-09-01T02:59:40.000+0000 | A | 2 | Y |
+------------------------------+----------+-------+---+
Basically, I have a Timestamp, a Property, and a Value field. The Property could be either A or X and it has a value. I would like to have a new DataFrame with a fourth column named X based on the values of the X property.
I start going through the rows from the earliest to the latest.
When I encounter a row with the X property, I store its value and insert it into the X column.
IF I encounter an A-property row: I insert the stored value from the previous step into the X column.
ELSE (meaning I encounter an X-property row): I update the stored value (since it is more recent) and insert the new stored value into the X column.
I keep doing so until I have gone through the whole dataframe.
I remove the rows with the X property to get the final dataframe shown above.
I am sure there is some sort of way to do so efficiently with the Window function.
Create a temp column holding the X value where Property is X and null where it is A. Then use a window to get the last non-null Temp value up to each row. Filter for Property "A" at the end.
scala> val df = Seq(
| ("2019-09-01T01:36:57.000+0000", "X", "N"),
| ("2019-09-01T01:37:39.000+0000", "A", "3"),
| ("2019-09-01T01:42:55.000+0000", "X", "Y"),
| ("2019-09-01T01:53:44.000+0000", "A", "17"),
| ("2019-09-01T01:55:34.000+0000", "A", "9"),
| ("2019-09-01T01:57:32.000+0000", "X", "N"),
| ("2019-09-01T02:59:40.000+0000", "A", "2"),
| ("2019-09-01T02:00:03.000+0000", "A", "16"),
| ("2019-09-01T02:01:40.000+0000", "X", "Y"),
| ("2019-09-01T02:04:03.000+0000", "A", "21")
| ).toDF("Timestamp", "Property", "Value").withColumn("Temp", when($"Property" === "X", $"Value").otherwise(null))
df: org.apache.spark.sql.DataFrame = [Timestamp: string, Property: string ... 2 more fields]
scala> df.show(false)
+----------------------------+--------+-----+----+
|Timestamp |Property|Value|Temp|
+----------------------------+--------+-----+----+
|2019-09-01T01:36:57.000+0000|X |N |N |
|2019-09-01T01:37:39.000+0000|A |3 |null|
|2019-09-01T01:42:55.000+0000|X |Y |Y |
|2019-09-01T01:53:44.000+0000|A |17 |null|
|2019-09-01T01:55:34.000+0000|A |9 |null|
|2019-09-01T01:57:32.000+0000|X |N |N |
|2019-09-01T02:59:40.000+0000|A |2 |null|
|2019-09-01T02:00:03.000+0000|A |16 |null|
|2019-09-01T02:01:40.000+0000|X |Y |Y |
|2019-09-01T02:04:03.000+0000|A |21 |null|
+----------------------------+--------+-----+----+
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val overColumns = Window.orderBy("Timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1b759662
scala> df.withColumn("X", last($"Temp",true).over(overColumns)).show(false)
+----------------------------+--------+-----+----+---+
|Timestamp |Property|Value|Temp|X |
+----------------------------+--------+-----+----+---+
|2019-09-01T01:36:57.000+0000|X |N |N |N |
|2019-09-01T01:37:39.000+0000|A |3 |null|N |
|2019-09-01T01:42:55.000+0000|X |Y |Y |Y |
|2019-09-01T01:53:44.000+0000|A |17 |null|Y |
|2019-09-01T01:55:34.000+0000|A |9 |null|Y |
|2019-09-01T01:57:32.000+0000|X |N |N |N |
|2019-09-01T02:00:03.000+0000|A |16 |null|N |
|2019-09-01T02:01:40.000+0000|X |Y |Y |Y |
|2019-09-01T02:04:03.000+0000|A |21 |null|Y |
|2019-09-01T02:59:40.000+0000|A |2 |null|Y |
+----------------------------+--------+-----+----+---+
scala> df.withColumn("X", last($"Temp",true).over(overColumns)).filter($"Property" === "A").show(false)
+----------------------------+--------+-----+----+---+
|Timestamp |Property|Value|Temp|X |
+----------------------------+--------+-----+----+---+
|2019-09-01T01:37:39.000+0000|A |3 |null|N |
|2019-09-01T01:53:44.000+0000|A |17 |null|Y |
|2019-09-01T01:55:34.000+0000|A |9 |null|Y |
|2019-09-01T02:00:03.000+0000|A |16 |null|N |
|2019-09-01T02:04:03.000+0000|A |21 |null|Y |
|2019-09-01T02:59:40.000+0000|A |2 |null|Y |
+----------------------------+--------+-----+----+---+
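To match the final frame shown in the question exactly, drop the helper column at the end; a small follow-up sketch on the same df and overColumns as above:

df.withColumn("X", last($"Temp", ignoreNulls = true).over(overColumns))
  .filter($"Property" === "A")
  .drop("Temp")          // remove the helper column
  .orderBy("Timestamp")  // restore chronological order for display
  .show(false)

One caveat: because overColumns has no partitionBy, Spark moves all rows into a single partition to do the ordering. That is fine for this example, but for very large inputs you would want to partition the window by some natural key if one exists.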

GroupBy based on conditions in Spark dataframe

I have two dataframes.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is the actual dataframe where I need to apply the groupBy operation:
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the key/value columns.
For each key in dataframe1, the respective value (one or more column names) has to be taken and used in a groupBy operation on dataframe2:
Dframe = df.groupBy($"key").sum("amount").show()
Expected output: generate three dataframes, one per key in dataframe1.
d1 = df.groupBy($"key1").sum("amount").show()
which has to become: df.groupBy($"column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2 = df.groupBy($"key2").sum("amount").show()
which has to become: df.groupBy($"column2").sum("amount").show()
dataframe:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code:
d3 = df.groupBy($"key3").sum("amount").show()
which has to become: df.groupBy($"column1", $"column3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 210 |
+---+-----+-----+
| A | PQR | 100 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In the future, if I add more keys, it should produce the corresponding dataframes as well. Can someone help me?
Given the key/value dataframe as (which I would suggest you not build as a DataFrame from the source data in the first place; the reason is given below)
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
I would suggest you convert the first dataframe to a plain collection of key/value pairs (this is also why building a DataFrame from that small source data is not really necessary) as
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
and then loop over the pairs as
import org.apache.spark.sql.functions._

for (kv <- maps) {
  df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
  // you can store the results in separate dataframes or write them to files or database
}
You should get the following outputs:
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
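If the results need to be kept (as the comment in the loop suggests) rather than only shown, here is a minimal sketch that stores one aggregated DataFrame per key; the variable names keyToCols and results are only illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// key1 -> "Column1", key2 -> "Column2", key3 -> "Column1,Column3"
val keyToCols: Map[String, String] =
  df1.collect().map(row => row.getString(0) -> row.getString(1)).toMap

// one aggregated DataFrame per key; keys added to df1 later are picked up automatically
val results: Map[String, DataFrame] = keyToCols.map { case (key, cols) =>
  key -> df2.groupBy(cols.split(",").map(col): _*).agg(sum(col("Amount")).as("Amount"))
}

results("key3").show(false)

This keeps the loop logic but gives you a handle on each result, which also covers the "if I add more keys" part: the map is driven entirely by the contents of dataframe1.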