How to get a unique list of strings from a column? - scala

I want to append a list of strings to a DataFrame, but I am getting an error while defining its DataType.
+---+----+-----+------+-------+-----------------+
|_id|h   |inc  |op    |ts     |webhooks         |
+---+----+-----+------+-------+-----------------+
|fa1|fa11|fa111|fa1111|fa11111|["Book1","Book2"]|
|fb1|fb11|fb111|fb1111|fb11111|["Book2"]        |
+---+----+-----+------+-------+-----------------+
How do I get a unique list of strings from the webhooks column, which would be:
List1 = List("Book1","Book2")

If explode is not what you are looking for, then try the below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = List(("a","b",Seq("book1","book2")),("c","d",Seq("book1","book3"))).toDF("col1","col2","webhooks")
df.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- webhooks: array (nullable = true)
| |-- element: string (containsNull = true)
val ww = Window.partitionBy()
ww: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@71bcd49c
df.withColumn("webhooksList",array_distinct(flatten(collect_set($"webhooks").over(ww)))).show(false)
20/06/08 14:24:12 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+----+--------------+---------------------+
|col1|col2|webhooks      |webhooksList         |
+----+----+--------------+---------------------+
|c   |d   |[book1, book3]|[book1, book3, book2]|
|a   |b   |[book1, book2]|[book1, book3, book2]|
+----+----+--------------+---------------------+
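If the goal is simply a deduplicated Scala List on the driver rather than an extra column, here is a rough, untested sketch (assuming Spark 2.x, an array<string> webhooks column, and spark.implicits._ in scope):
import org.apache.spark.sql.functions.explode

// Explode the array column, deduplicate, and collect the values back to the driver
val uniqueWebhooks: List[String] =
  df.select(explode($"webhooks").as("webhook"))
    .distinct()
    .as[String]
    .collect()
    .toList
// For the sample df above this gives List(book1, book2, book3)
Keep in mind that collect() brings everything to the driver, so this only makes sense when the number of distinct values is small.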

Related

Flattening a map<string,string> column in Spark Scala

Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map-type column and select all the columns whose keys end with _id.
I'm using the below code.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES"))
exploded.show()
I am getting the below sample output.
+----------+-----------+-----+
|date      |key        |value|
+----------+-----------+-----+
|2021-05-18|SYST_id    |85   |
|2021-05-18|RECVR_id   |1    |
|2021-05-18|Account_Id |12345|
|2021-05-18|Vb_id      |845  |
|2021-05-18|SYS_INFO_id|640  |
|2021-05-18|mem_id     |456  |
+----------+-----------+-----+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|date      |SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|85     |1       |12345     |845  |640        |456   |
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as the aggregation function.
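For illustration, an untested sketch of that collect_list variant, reusing the exploded DataFrame and the date/key/value column names from above; each pivoted cell then holds an array of every value seen for that date/key pair:
exploded.groupBy("date").pivot("key").agg(collect_list("value")).show()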
Edit:
To add srcId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()

Spark collect_list changes data type from array to string

I have the following aggregation:
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
Here I am doing some aggregation and collecting the result with collect_list. Earlier we were using Spark 1 and it was giving me the below data types:
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
Now we have to migrate to Spark 2, but we are getting the below schema:
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
On getting the first() record, below is the difference:
Spark 1.6
[2020-09-26, Ayush, 103.67] => datatype string
Spark 2
WrappedArray(2020-09-26, Ayush, 103.67)
How can I keep the same data type?
Edit - Tried Using Concat
One way I got the exact Spark 1.6 schema is by using concat like this:
val df_date_agg = df
.groupBy($"msisdn",$"event_date",$"network")
.agg(sum($"data_mou").alias("data_mou_dly"),sum($"voice_mou").alias("voice_mou_dly"))
.groupBy($"msisdn")
.agg(collect_list(concat(lit("["),lit($"event_date"),lit(","),lit($"network"),lit(","),lit($"data_mou_dly"),lit("]"))))
Will it affect my code's performance? Is there a better way to do this?
Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1") cast "string").alias("final_data1"),
collect_list(array($"b",$"c",$"data2") cast "string").alias("final_data2"))
It might simply be what your old version of Spark was doing.
The solution you propose would probably work as well, by the way, but wrapping your column references with lit is not necessary (lit($"event_date")); $"event_date" is enough.
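For reference, an untested sketch of that concat expression with the lit wrappers removed from the column references (only the bracket and comma literals still need lit); the alias name is just illustrative:
val df_date_agg = df
.groupBy($"msisdn",$"event_date",$"network")
.agg(sum($"data_mou").alias("data_mou_dly"),sum($"voice_mou").alias("voice_mou_dly"))
.groupBy($"msisdn")
.agg(collect_list(concat(lit("["), $"event_date", lit(","), $"network", lit(","), $"data_mou_dly", lit("]"))).alias("final_data1"))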
Flattening the final_data1 and final_data2 columns would fix this problem.
val data = Seq((1,"A","B"), (1,"C","D"), (2,"E","F"), (2,"G","H"), (2,"I","J"))
val df = spark.createDataFrame(data).toDF("col1", "col2", "col3")

val old_df = df.groupBy(col("col1")).agg(
  collect_list(
    array(
      col("col2"),
      col("col3")
    )
  ).as("final")
)

val new_df = old_df.select(col("col1"), flatten(col("final")).as("final_new"))

println("Input Dataframe")
df.show(false)
println("Old schema format")
old_df.show(false)
old_df.printSchema()
println("New schema format")
new_df.show(false)
new_df.printSchema()
Output:
Input Dataframe
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |A |B |
|1 |C |D |
|2 |E |F |
|2 |G |H |
|2 |I |J |
+----+----+----+
Old schema format
+----+------------------------+
|col1|final |
+----+------------------------+
|1 |[[A, B], [C, D]] |
|2 |[[E, F], [G, H], [I, J]]|
+----+------------------------+
root
|-- col1: integer (nullable = false)
|-- final: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
New schema format
+----+------------------+
|col1|final_new |
+----+------------------+
|1 |[A, B, C, D] |
|2 |[E, F, G, H, I, J]|
+----+------------------+
root
|-- col1: integer (nullable = false)
|-- final_new: array (nullable = true)
| |-- element: string (containsNull = true)
In your specific case:
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
.select($"a", flatten($"final_data1").as("final_data1"), flatten($"final_data2").as("final_data2"))

Convert all the columns of a Spark dataframe into JSON format and then include the JSON-formatted data as a column in another/parent dataframe

I converted a dataframe (say, the child dataframe) into JSON using df.toJSON.
After the JSON conversion the schema looks like this:
root
|-- value: string (nullable = true)
I used the following suggestion to get child dataframe into the intermediate parent schema/dataframe:
scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Now I still need to build the parentDF schema further to make it look like:
root
|-- id
|-- version
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Q1) How can I build the id column using the id from value (i.e. value.id needs to be represented as id)?
Q2) I need to bring version from a different dataframe (say versionDF) where version is a constant (in all rows). Do I fetch one row from this versionDF to read the value of the version column and then populate it as a literal in the parentDF?
Please help with any code snippets on this.
Instead of toJSON, use to_json in the select statement and select the required columns along with the to_json function.
Check the below code.
val version = // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))
scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- version: double (nullable = false)
Updated
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data |version|
+---+------------------------------------------+-------+
|1 |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1 |
|2 |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1 |
|3 |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1 |
+---+------------------------------------------+-------+
Update-1
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data |version|
+---+--------------------------------+-------+
|1 |{"id":1,"col1":"a1","col2":"b1"}|1 |
|2 |{"id":2,"col1":"a2","col2":"b2"}|1 |
|3 |{"id":3,"col1":"a3","col2":"b3"}|1 |
+---+--------------------------------+-------+
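One way to fill in the version placeholder above, as an untested sketch (assuming versionDF holds a single constant version column of type double, as in the question):
// Take the constant version value from the first row of versionDF
val version: Double = versionDF.select($"version").head().getDouble(0)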
Try this:
scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]
scala> versionDF.show
+-------+
|version|
+-------+
| 1.0|
+-------+
scala> val version = versionDF.first.get(0)
version: Any = 1.0
scala>
scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]
scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2|
| 3| a3| b3|
+---+----+----+
scala>
scala> val parentDF = childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]
scala> parentDF.printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- id: integer (nullable = true)
|-- version: double (nullable = false)
scala> parentDF.show(false)
+----------------------------------+---+-------+
|data |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1 |1.0 |
|[{"id":2,"col1":"a2","col2":"b2"}]|2 |1.0 |
|[{"id":3,"col1":"a3","col2":"b3"}]|3 |1.0 |
+----------------------------------+---+-------+
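If the full child schema is not handy for from_json, an alternative is to pull just the id back out of the JSON string with get_json_object. A sketch (reusing the version value from above; the parentDF2 name is just illustrative):
import org.apache.spark.sql.functions.{get_json_object, lit, struct}

// Extract only the "id" field from the JSON string instead of parsing the whole row
val parentDF2 = childDF.toJSON
  .select(struct($"value").as("data"))
  .withColumn("id", get_json_object($"data.value", "$.id").cast("int"))
  .withColumn("version", lit(version))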

Removing duplicate array structs by last item in array struct in Spark Dataframe

So my table looks something like this:
customer_1|place|customer_2|item          |count
------------------------------------------------
a         |NY   |b         |(2010,304,310)|34
a         |NY   |b         |(2024,201,310)|21
a         |NY   |b         |(2010,304,312)|76
c         |NY   |x         |(2010,304,310)|11
a         |NY   |b         |(453,131,235) |10
I've tried the following, but it does not eliminate the duplicates, as the former array is still there (as it should be, I need it for the end result):
val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item"))-1), col("item"), col("count")))
.groupBy(col("customer_1"), col("place"), col("customer_2"))
.agg(max("vs").alias("vs"))
.select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))
I would like to group by the customer_1, place and customer_2 columns and return only the array structs whose last item (-1) is unique, keeping the one with the highest count. Any ideas?
Expected output:
customer_1|place|customer_2|item          |count
------------------------------------------------
a         |NY   |b         |(2010,304,312)|76
a         |NY   |b         |(2010,304,310)|34
a         |NY   |b         |(453,131,235) |10
c         |NY   |x         |(2010,304,310)|11
Given that the schema of the dataframe is as follows:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- count: string (nullable = true)
You can apply the concat function to create a temp column for checking duplicate rows, as done below:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item"(size($"item")-1)))
.dropDuplicates("temp")
.drop("temp")
You should get the following output:
+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item |count|
+----------+-----+----------+----------------+-----+
|a |NY |b |[2010, 304, 312]|76 |
|c |NY |x |[2010, 304, 310]|11 |
|a |NY |b |[453, 131, 235] |10 |
|a |NY |b |[2010, 304, 310]|34 |
+----------+-----+----------+----------------+-----+
Struct
Given that the schema of the dataframe is as follows:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: integer (nullable = false)
| |-- _3: integer (nullable = false)
|-- count: string (nullable = true)
We can still do the same as above, with a slight change to get the third item from the struct:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item._3"))
.dropDuplicates("temp")
.drop("temp")
Hope the answer is helpful
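One caveat: dropDuplicates keeps an arbitrary row per temp value, whereas the expected output keeps the row with the highest count for each last item. If that ordering matters, here is an untested window-based sketch for the array schema above (count is cast to int because the schema stores it as a string):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows per (customer_1, place, customer_2, last item of the array) by count, keep the top one
val w = Window
  .partitionBy($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1))
  .orderBy($"count".cast("int").desc)

df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")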

"Unresolved attributes found when constructing LocalRelation", can somebody explain?

I have a dataframe as
+---+---+---+---+
|A |B |C |D |
+---+---+---+---+
|a |b |b |c |
+---+---+---+---+
I combine the columns into two struct columns by doing the following:
import org.apache.spark.sql.functions._
val df = myDF.withColumn("colA", struct($"A", $"B"))
.withColumn("colB", struct($"C".as("A"), $"D".as("B")))
the dataframe and schema are
+-----+-----+
|colA |colB |
+-----+-----+
|[a,b]|[b,c]|
+-----+-----+
root
|-- colA: struct (nullable = false)
| |-- A: string (nullable = true)
| |-- B: string (nullable = true)
|-- colB: struct (nullable = false)
| |-- A: string (nullable = true)
| |-- B: string (nullable = true)
I want to combine both struct columns into one column, so I do
df.select(array(struct($"colA.A", $"colA.B"),struct($"colB.A", $"colB.B")).as("Result"))
which gives correct dataframe and schema as
+--------------+
|Result |
+--------------+
|[[a,b], [b,c]]|
+--------------+
root
|-- Result: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- A: string (nullable = true)
| | |-- B: string (nullable = true)
I could get the same result by doing
df.select(array(struct($"A", $"B"),struct($"C".as("A"), $"D".as("B"))).as("Result"))
Now, if we look at the whole process, we have
$"colA" == struct($"A", $"B") == struct($"colA.A", $"colA.B")
and
$"colB" == struct($"C".as("A"), $"D".as("B")) == struct($"colB.A", $"colB.B")
BUT
when I do
df.select(array($"colA", $"colB").as("Result"))
I get the following error
requirement failed: Unresolved attributes found when constructing LocalRelation.
java.lang.IllegalArgumentException: requirement failed: Unresolved attributes found when constructing LocalRelation.
at scala.Predef$.require(Predef.scala:219)
at org.apache.spark.sql.catalyst.plans.logical.LocalRelation.<init>(LocalRelation.scala:50)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$33.applyOrElse(Optimizer.scala:1402)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$33.applyOrElse(Optimizer.scala:1398)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
.......
........
What is the meaning of the error, and how should I correct it?
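For what it's worth, the stack trace fails inside the ConvertToLocalRelation optimizer rule. A commonly suggested workaround (an untested sketch, assuming Spark 2.4+ where the spark.sql.optimizer.excludedRules config exists; it sidesteps the rule rather than explaining the root cause) is to exclude that rule and re-run the same select:
// Disable the optimizer rule named in the stack trace (Spark 2.4+)
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")

df.select(array($"colA", $"colB").as("Result")).show()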