Strange behavior when casting array of structs in spark scala - scala

I'm encountering a strange behavior using spark 2.1.1 and scala 2.11.8:
import spark.implicits._
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
df.show(false)
df.printSchema()
+---+-------+
|id |data |
+---+-------+
|1 |[[a,b]]|
|2 |[[c,d]]|
+---+-------+
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Now I want to rename my struct fields as suggested in https://stackoverflow.com/a/39781382/1138523
df
.select($"id",$"data".cast("array<struct<k:string,v:string>>"))
.show()
Which results in the correct schema, but the content of the dataframe is now:
+---+-------+
| id| data|
+---+-------+
| 1|[[c,d]]|
| 2|[[c,d]]|
+---+-------+
Both lines show now the same array. What am I doing wrong?
EDIT: In spark 2.1.2 (and also spark 2.3.0) I get the expected output. I also get the expected output if I cache the dataframe:
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
.cache

Related

Spark collect_list change data_type from array to string

I am having a following aggregation
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
Here I am doing some aggregation and collecting the result with collect_list. Earlier we were using spark 1 and it was giving me below data types.
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
Now we have to migrate to spark 2 but we are getting below schema.
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
On getting first() record below is the difference
spark 1.6
[2020-09-26, Ayush, 103.67] => datatype string
spark 2
WrappedArray(2020-09-26, Ayush, 103.67)
How can I keep the same data type?
Edit - Tried Using Concat
One way I got exact schema like Spark 1.6 is by using concat like this
val df_date_agg = df
.groupBy($"msisdn",$"event_date",$"network")
.agg(sum($"data_mou").alias("data_mou_dly"),sum($"voice_mou").alias("voice_mou_dly"))
.groupBy($"msisdn")
.agg(collect_list(concat(lit("["),lit($"event_date"),lit(","),lit($"network"),lit(","),lit($"data_mou_dly"),lit("]")))
Will it affect my code performance?? Is there a better way to do this?
Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1") cast "string").alias("final_data1"),
collect_list(array($"b",$"c",$"data2") cast "string").alias("final_data2"))
It might simply be what your old version of spark was doing.
The solution you propose would probably work as well by the way but wrapping your column references with lit is not necessary (lit($"event_date")). $"event_date" is enough.
Fllttening final1 and final2 columns would fix this problem.
val data = Seq((1,"A", "B"), (1, "C", "D"), (2,"E", "F"), (2,"G", "H"), (2,"I", "J"))
val df = spark.createDataFrame(
data
).toDF("col1", "col2", "col3")
val old_df = df.groupBy(col("col1")).agg(
collect_list(
array(
col("col2"),
col("col3")
)
).as("final")
)
val new_df = old_df.select(col("col1"), flatten(col("final")).as("final_new"))
println("Input Dataframe")
df.show(false)
println("Old schema format")
old_df.show(false)
old_df.printSchema()
println("New schema format")
new_df.show(false)
new_df.printSchema()
Output:
Input Dataframe
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |A |B |
|1 |C |D |
|2 |E |F |
|2 |G |H |
|2 |I |J |
+----+----+----+
Old schema format
+----+------------------------+
|col1|final |
+----+------------------------+
|1 |[[A, B], [C, D]] |
|2 |[[E, F], [G, H], [I, J]]|
+----+------------------------+
root
|-- col1: integer (nullable = false)
|-- final: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
New schema format
+----+------------------+
|col1|final_new |
+----+------------------+
|1 |[A, B, C, D] |
|2 |[E, F, G, H, I, J]|
+----+------------------+
root
|-- col1: integer (nullable = false)
|-- final_new: array (nullable = true)
| |-- element: string (containsNull = true)
In you specefic case
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
.select(flatten(col("final_data1").as("final_data1"), flatten(col("final_data2).as("final_data2))

Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

Converted dataframe(say child dataframe) into json using df.toJSON
After json conversion the schema looks like this :
root
|-- value: string (nullable = true)
I used the following suggestion to get child dataframe into the intermediate parent schema/dataframe:
scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Now I still need to build the parentDF schema further to make it look like:
root
|-- id
|-- version
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Q1) How can I build the id column using the id from value(i.e value.id needs to be represented as id)
Q2) I need to bring version from a different dataframe(say versionDF) where version is a constant(in all columns). Do I fetch one row from this versionDF to read value of version column and then populate it as literal in the parentDF ?
please help with any code snippets on this.
Instead of toJSON use to_json in select statement & select required columns along with to_json function.
Check below code.
val version = // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))
scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- version: double (nullable = false)
Updated
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data |version|
+---+------------------------------------------+-------+
|1 |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1 |
|2 |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1 |
|3 |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1 |
+---+------------------------------------------+-------+
Update-1
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data |version|
+---+--------------------------------+-------+
|1 |{"id":1,"col1":"a1","col2":"b1"}|1 |
|2 |{"id":2,"col1":"a2","col2":"b2"}|1 |
|3 |{"id":3,"col1":"a3","col2":"b3"}|1 |
+---+--------------------------------+-------+
Try this:
scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]
scala> versionDF.show
+-------+
|version|
+-------+
| 1.0|
+-------+
scala> val version = versionDF.first.get(0)
version: Any = 1.0
scala>
scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]
scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2|
| 3| a3| b3|
+---+----+----+
scala>
scala> val parentDF = childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]
scala> parentDF.printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- id: integer (nullable = true)
|-- version: double (nullable = false)
scala> parentDF.show(false)
+----------------------------------+---+-------+
|data |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1 |1.0 |
|[{"id":2,"col1":"a2","col2":"b2"}]|2 |1.0 |
|[{"id":3,"col1":"a3","col2":"b3"}]|3 |1.0 |
+----------------------------------+---+-------+

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use UDF's and return ListBuffer as a column from UDF, i am getting error.
I have created Df by executing below code:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
it show like below:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
as per my requirement i have to use i have to split the above columns based some inputs so i have tried UDF like below :
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
val
i try to cal by executing below code:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
i getting error "this was a bad call" .i want use ListBuffer/list to store and return to calling place.
my expected output will be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative with my own fictional data to which you can tailor and no UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = Seq(
(1, "111##cat##666", "222##fritz##777"),
(2, "AAA##cat##555", "BBB##felix##888"),
(3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")
val df2 = df.withColumn( "c_split", split(col("c1"), ("(##)|(##)|(##)|(##)") ))
.union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)") )) )
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
Gives answer but no ListBuffer - really necessary?, as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)

"Unresolved attributes found when constructing LocalRelation", can somebody explain?

I have a dataframe as
+---+---+---+---+
|A |B |C |D |
+---+---+---+---+
|a |b |b |c |
+---+---+---+---+
I convert two columns as structs by doing the following
import org.apache.spark.sql.functions._
val df = myDF.withColumn("colA", struct($"A", $"B"))
.withColumn("colB", struct($"C".as("A"), $"D".as("B")))
the dataframe and schema are
+-----+-----+
|colA |colB |
+-----+-----+
|[a,b]|[b,c]|
+-----+-----+
root
|-- colA: struct (nullable = false)
| |-- A: string (nullable = true)
| |-- B: string (nullable = true)
|-- colB: struct (nullable = false)
| |-- A: string (nullable = true)
| |-- B: string (nullable = true)
I want to combine both struct columns into one column, so I do
df.select(array(struct($"colA.A", $"colA.B"),struct($"colB.A", $"colB.B")).as("Result"))
which gives correct dataframe and schema as
+--------------+
|Result |
+--------------+
|[[a,b], [b,c]]|
+--------------+
root
|-- Result: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- A: string (nullable = true)
| | |-- B: string (nullable = true)
I could get the same result by doing
df.select(array(struct($"A", $"B"),struct($"C".as("A"), $"D".as("B"))).as("Result"))
Now, if we look at the whole process, we have
$"colA" == struct($"A", $"B") == struct($"colA.A", $"colA.B")
and
$"colB" == struct($"C".as("A"), $"D".as("B")) == struct($"colB.A", $"colB.B")
BUT
when I do
df.select(array($"colA", $"colB").as("Result"))
I get the following error
requirement failed: Unresolved attributes found when constructing LocalRelation.
java.lang.IllegalArgumentException: requirement failed: Unresolved attributes found when constructing LocalRelation.
at scala.Predef$.require(Predef.scala:219)
at org.apache.spark.sql.catalyst.plans.logical.LocalRelation.(LocalRelation.scala:50)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$33.applyOrElse(Optimizer.scala:1402)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$33.applyOrElse(Optimizer.scala:1398)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
.......
........
What is the meaning of the error and how should I correct?

How to transform Spark Dataframe columns to a single column of a string array

I want to know how can I "merge" multiple dataframe columns into one as a string array?
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
| 1|Jack| 125| Text|
| 2|Mary| 152| Text2|
+---+----+------+-------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- Name: string (nullable = true)
|-- Number: string (nullable = true)
|-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id| List|
+---+-----------------+
| 1| [Jack,125,Text]|
| 2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- List: Array (nullable = true)
| |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List |
// +---+------------------+
// |1 |[Jack, 125, Text] |
// |2 |[Mary, 152, Text2]|
// +---+------------------+
Can also be used with withColumn :
import org.apache.spark.sql.functions as F
df.withColumn("Id", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))