Convert Scala Dataframe to HashMap - scala

I have a Scala DataFrame like this:
+------+---------+-----+--------+
| Code | Name    | Age | ID     |
+------+---------+-----+--------+
| ABC  | Alan    | 22  | 111111 |
| ABC  | Bob     | 25  | 222222 |
| DEF  | Charlie | 29  | 333333 |
| GHI  | David   | 11  | 555555 |
+------+---------+-----+--------+
I want to have an output HashMap like this:
{
  'ABC': [{'Name': 'Alan', 'Age': 22, 'ID': 111111}, {'Name': 'Bob', 'Age': 25, 'ID': 222222}],
  'DEF': [{'Name': 'Charlie', 'Age': 29, 'ID': 333333}],
  'GHI': [{'Name': 'David', 'Age': 11, 'ID': 555555}]
}
How can I efficiently do this?

Assuming your DataFrame is named ds, this should work:
import org.apache.spark.sql.functions.{collect_list, struct, to_json}
import spark.implicits._

ds.select('Code, to_json(struct('Name, 'Age, 'ID)) as "json")
  .groupBy('Code).agg(collect_list('json))
  .as[(String, Array[String])]
  .collect.toMap
This will give you a Map[String, Array[String]]. If what you wanted was to turn the whole DataFrame into a single JSON, I wouldn't recommend that, but it would be doable as well.
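If you would rather skip the JSON round-trip and keep typed values, here is a minimal sketch under the assumption that ds has exactly the columns from the table above; the Person case class is a name introduced here only for illustration:

import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

// Hypothetical record type for the non-key columns; adjust the field types to your schema
case class Person(Name: String, Age: Int, ID: Long)

// collect() pulls everything to the driver, so this is only suitable for small results
val byCode: Map[String, Seq[Person]] =
  ds.groupBy('Code)
    .agg(collect_list(struct('Name, 'Age, 'ID)) as "people")
    .as[(String, Seq[Person])]
    .collect()
    .toMap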

Related

How to get column with list of values from another column in Pyspark

Can someone help me with an idea of how to create a PySpark DataFrame with all Recepients of each person?
For example:
Input DataFrame:
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob |
|Alice | John |
|Alice | Mike |
|Bob | Tom |
|Bob | George |
|George| Alice |
|George| Bob |
+------+---------+
Result:
+------+------------------+
|Sender|Recepients |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob |[Tom, George] |
|George|[Alice, Bob] |
+------+------------------+
I tried df.groupBy("Sender").sum("Recepients") to get a string and split it, but got the error Aggregation function can only be applied on a numeric column.
All you need to do is group by the Sender column and collect the Recepient values.
Below is the full solution:
# create a data frame
df = spark.createDataFrame(
    [
        ("Alice", "Bob"),
        ("Alice", "John"),
        ("Alice", "Mike"),
        ("Bob", "Tom"),
        ("Bob", "George"),
        ("George", "Alice"),
        ("George", "Bob"),
    ],
    ("Sender", "Recepient"))
df.show()
# results below
+------+---------+
|Sender|Recepient|
+------+---------+
|Alice | Bob |
|Alice | John |
|Alice | Mike |
|Bob | Tom |
|Bob | George |
|George| Alice |
|George| Bob |
+------+---------+
# Import functions
import pyspark.sql.functions as f
# perform a groupBy and use collect_list
df1 = df.groupby("Sender").agg(f.collect_list('Recepient').alias('Recepients'))
df1.show()
# results
+------+------------------+
|Sender|Recepients |
+------+------------------+
|Alice |[Bob, John, Mike] |
|Bob |[Tom, George] |
|George|[Alice, Bob] |
+------+------------------+
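Since the other answers on this page are in Scala, here is a rough sketch of the same groupBy/collect_list approach in Scala, assuming a SparkSession named spark; it is an equivalent for reference, not part of the original answer:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = Seq(
  ("Alice", "Bob"), ("Alice", "John"), ("Alice", "Mike"),
  ("Bob", "Tom"), ("Bob", "George"),
  ("George", "Alice"), ("George", "Bob")
).toDF("Sender", "Recepient")

// same idea: group by Sender and collect the Recepient values into an array
val df1 = df.groupBy("Sender").agg(collect_list($"Recepient").alias("Recepients"))
df1.show(false)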

Converting distinct values of a Spark dataframe column into a list

I have a dataset that looks something like this:
+-------+-----+----------+--------------+
| Name | Age | Pet Name | Phone Number |
+-------+-----+----------+--------------+
| Brett | 14 | Rover | 123 456 7889 |
| Amy | 15 | Ginger | 123 456 8888 |
| Amy | 15 | Polly | 123 456 8888 |
| Josh | 14 | Fido | 312 456 9999 |
+-------+-----+----------+--------------+
And I need to present it in the following format using Spark:
+-------+-----+---------------+--------------+
| Name | Age | Pet Name | Phone Number |
+-------+-----+---------------+--------------+
| Brett | 14 | Rover | 123 456 7889 |
| Amy | 15 | Ginger, Polly | 123 456 8888 |
| Josh | 14 | Fido | 312 456 9999 |
+-------+-----+---------------+--------------+
Can someone please help me with the best way to go about this?
You can also group by Name and Age and collect Pet Name as a list, as below:
df.groupBy("Name", "Age")
.agg(collect_list($"Pet Name").as("PetName"), first("Phone Number").as("PhoneNumber"))
Or you could also do
data.groupBy("Name", "Age", "Phone Number")
.agg(collect_list($"Pet Name").as("PetName"))
Output (of the first version):
+-----+---+---------------+------------+
|Name |Age|PetName |PhoneNumber |
+-----+---+---------------+------------+
|Amy |15 |[Ginger, Polly]|123 456 8888|
|Brett|14 |[Rover] |123 456 7889|
|Josh |14 |[Fido] |312 456 9999|
+-----+---+---------------+------------+
If you need a string instead of an array, you can use concat_ws:
data.groupBy("Name", "Age", "Phone Number")
.agg(concat_ws(",",collect_list($"Pet Name")).as("PetName"))
Output:
+-----+---+------------+------------+
|Name |Age|Phone Number|PetName |
+-----+---+------------+------------+
|Brett|14 |123 456 7889|Rover |
|Amy |15 |123 456 8888|Ginger,Polly|
|Josh |14 |312 456 9999|Fido |
+-----+---+------------+------------+
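One caveat not covered in the answer above: the element order produced by collect_list is not guaranteed after a shuffle. A minimal sketch, using the same data as above, of forcing a deterministic order with sort_array:

import org.apache.spark.sql.functions.{collect_list, concat_ws, sort_array}

// sort the collected pet names alphabetically before concatenating
data.groupBy("Name", "Age", "Phone Number")
  .agg(concat_ws(",", sort_array(collect_list($"Pet Name"))).as("PetName"))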

Select specific rows from Spark dataframe per grouping

I have a DataFrame like this:
+-----------------+------------------+-----------+--------+---+
| conversation_id | message_body | timestamp | sender | |
+-----------------+------------------+-----------+--------+---+
| A | hi | 9:00 | John | |
| A | how are you? | 10:00 | John | |
| A | can we meet? | 10:05 | John | * |
| A | not bad | 10:30 | Steven | * |
| A | great | 10:40 | John | |
| A | yeah, let's meet | 10:35 | Steven | |
| B | Hi | 12:00 | Anna | * |
| B | Hello | 12:05 | Ken | * |
+-----------------+------------------+-----------+--------+---+
For each conversation I would like to have the last message in the first block of the 1st sender and the first message of the 2nd sender. I marked them with an asterisk.
One idea that I had is to assign 0s for the first user and 1s for the second user.
Ideally I would like to have something like that:
+-----------------+---------+------------+--------------+---------+------------+----------+
| conversation_id | sender1 | timestamp1 | message1 | sender2 | timestamp2 | message2 |
+-----------------+---------+------------+--------------+---------+------------+----------+
| A | John | 10:05 | can we meet? | Steven | 10:30 | not bad |
| B | Anna | 12:00 | Hi | Ken | 12:05 | Hello |
+-----------------+---------+------------+--------------+---------+------------+----------+
How could I do that in Spark?
Interesting issues arose. Notes on the example data used below: the 10:35 timestamp was amended to 10:45, and the times use a leading-zero format (e.g. 09:00 instead of 9:00) so that string ordering works correctly. You will need to use your own data types accordingly; this simply demonstrates the approach. It was done in a Databricks notebook.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df =
  Seq(("A", "hi", "09:00", "John"), ("A", "how are you?", "10:00", "John"),
    ("A", "can we meet?", "10:05", "John"), ("A", "not bad", "10:30", "Steven"),
    ("A", "great", "10:40", "John"), ("A", "yeah, let's meet", "10:45", "Steven"),
    ("B", "Hi", "12:00", "Anna"), ("B", "Hello", "12:05", "Ken")
  ).toDF("conversation_id", "message_body", "timestampX", "sender")
// Get the rank; 1 is whoever initiated the conversation, the other values can be used to infer relationships
// Note no #Transient required here with Window
val df2 = df.withColumn("rankC", row_number().over(Window.partitionBy($"conversation_id").orderBy($"timestampX".asc)))
// A minimum rank value <> 1 marks the first message of the second sender.
// The row ranked immediately before it is the last message of the first sender's block.
val df3 = df2.select('conversation_id as "df3_conversation_id", 'sender as "df3_sender", 'rankC as "df3_rank")
// To avoid pipelined renaming issues that occur
val df3a = df3.groupBy("df3_conversation_id", "df3_sender").agg(min("df3_rank") as "rankC2").filter("rankC2 != 1")
// JOIN the values with some care. Some odd errors arose in Spark through pipelining; the pipelined row_number/ranking columns need to be dropped.
val df4 = df3a.join(df2, (df3a("df3_conversation_id") === df2("conversation_id")) && (df3a("rankC2") === df2("rankC") + 1)).drop("rankC").drop("rankC2")
val df4a = df3a.join(df2, (df3a("df3_conversation_id") === df2("conversation_id")) && (df3a("rankC2") === df2("rankC"))).drop("rankC").drop("rankC2")
// To get the other missing data; it could all have been combined, but it is done in steps for simplicity. Just a simple final JOIN and you have the answer.
val df5 = df4.join(df4a, (df4("df3_conversation_id") === df4a("df3_conversation_id")))
df5.show(false)
returns:
The output will not format completely here; run it in the REPL to see the column titles.
|B |Ken |B |Hi |12:00 |Anna |B |Ken |B |Hello |12:05 |Ken |
|A |Steven |A |can we meet?|10:05 |John |A |Steven |A |not bad |10:30 |Steven|
You can further manipulate the data; the heavy lifting is done now. The Catalyst optimizer had some issues compiling the more direct pipeline, which is why I worked around it in this fashion.
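As an alternative, here is a hedged sketch (not the answer's method, and conversations with only one sender simply drop out) that avoids the chain of self-joins: compute the second sender's first rank with a window and pivot the two rows of interest with conditional aggregation, reusing the df defined above and assuming the answer's imports (including spark.implicits._) are still in scope:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byTime = Window.partitionBy($"conversation_id").orderBy($"timestampX")
val byConv = Window.partitionBy($"conversation_id")

val flagged = df
  .withColumn("rankC", row_number().over(byTime))
  .withColumn("firstSender", first($"sender").over(byTime))
  // rank of the second sender's first message within each conversation
  .withColumn("rank2", min(when($"sender" =!= $"firstSender", $"rankC")).over(byConv))

val result = flagged
  .filter(($"rankC" === $"rank2" - 1) || ($"rankC" === $"rank2"))
  .groupBy($"conversation_id")
  .agg(
    first(when($"rankC" === $"rank2" - 1, $"sender"), ignoreNulls = true) as "sender1",
    first(when($"rankC" === $"rank2" - 1, $"timestampX"), ignoreNulls = true) as "timestamp1",
    first(when($"rankC" === $"rank2" - 1, $"message_body"), ignoreNulls = true) as "message1",
    first(when($"rankC" === $"rank2", $"sender"), ignoreNulls = true) as "sender2",
    first(when($"rankC" === $"rank2", $"timestampX"), ignoreNulls = true) as "timestamp2",
    first(when($"rankC" === $"rank2", $"message_body"), ignoreNulls = true) as "message2")

result.show(false)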

Spark SQL Map only one column of DataFrame

Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a scala function on one or more columns to produce another column
2) use a scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.
You can use withColumn to add a new column or to replace an existing column, as follows:
import org.apache.spark.sql.functions.lower
import spark.implicits._

val df = Seq(
  (1, "Mary", "ABCD"),
  (2, "Joey", "DOGE"),
  (3, "Lane", "POOP"),
  (4, "Jack", "MEGA"),
  (5, "Lynn", "ARGH")
).toDF("id", "name", "data")

val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
  .withColumn("data", lower($"data"))
If you want separate DataFrames, then:
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn replaces the old column if the same column name is provided, and creates a new column if a new column name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+
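If the transformation you need is not available as a built-in function, a minimal sketch of wrapping an arbitrary Scala function as a UDF is shown below; the function and value names are mine, for illustration only, and built-ins are generally preferable because Catalyst cannot optimize UDFs:

import org.apache.spark.sql.functions.udf

// Hypothetical Scala functions to apply to a column
def startsWithA(s: String): Boolean = s != null && s.startsWith("A")
def lowerCase(s: String): String = if (s == null) null else s.toLowerCase

val startsWithAUdf = udf(startsWithA _)
val lowerCaseUdf = udf(lowerCase _)

val resultDF = df
  .withColumn("startsWithA", startsWithAUdf($"data"))  // add a new column
  .withColumn("data", lowerCaseUdf($"data"))           // replace an existing column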

Apache Spark concatenate multiple rows into list in single row [duplicate]

This question already has answers here: Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function (10 answers).
Closed 4 years ago.
I need to create a table (Hive table / Spark DataFrame) from a source table that stores the data of each user across multiple rows, collecting those rows into a list in a single row per user.
User table:
Schema: userid: string | transactiondate: string | charges: string | events: array<struct<name:string,value:string>>
----|------------|-------| ---------------------------------------
123 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"this"}]
123 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"last"}]
123 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"recent"}]
123 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"0"}]
456 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"this"}]
456 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"last"}]
456 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"recent"}]
456 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"0"}]
Output table should be
userid:String | concatenatedlist :List[Row]
-------|-----------------
123 | [[2017-09-01,20.00,[{"name":"chargeperiod","value":"this"}]],[2017-09-01,30.00,[{"name":"chargeperiod","value":"last"}]],[2017-09-01,20.00,[{"name":"chargeperiod","value":"recent"}]], [2017-09-01,30.00, [{"name":"chargeperiod","value":"0"}]]]
456 | [[2017-09-01,20.00,[{"name":"chargeperiod","value":"this"}]],[2017-09-01,30.00,[{"name":"chargeperiod","value":"last"}]],[2017-09-01,20.00,[{"name":"chargeperiod","value":"recent"}]], [2017-09-01,30.00, [{"name":"chargeperiod","value":"0"}]]]
Spark version: 1.6.2
Seq(("1", "2017-02-01", "20.00", "abc"),
("1", "2017-02-01", "30.00", "abc2"),
("2", "2017-02-01", "20.00", "abc"),
("2", "2017-02-01", "30.00", "abc"))
.toDF("id", "date", "amt", "array")
df.withColumn("new", concat_ws(",", $"date", $"amt", $"array"))
.select("id", "new")
.groupBy("id")
.agg(concat_ws(",", collect_list("new")))
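Note that the question asks for a list of rows rather than a comma-joined string. Here is a hedged sketch of that variant, where userTable is a stand-in name for the source DataFrame with the schema shown in the question; it may also require a newer Spark version than 1.6, since collect_list over struct columns was not supported there:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// collect the non-key columns as structs into one list per userid
val concatenated = userTable
  .groupBy(col("userid"))
  .agg(collect_list(struct(col("transactiondate"), col("charges"), col("events"))) as "concatenatedlist")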