How to handle duplicate column(s) in table definition - scala

Given a dataframe with the following schema. The problem is that the dataframe is dynamic and so are its fields, so you cannot pre-assume a given schema.
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
|-- a: string (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: long (nullable = true)
|-- a: long (nullable = true)
The following error is shown:
Found duplicate column(s) in table definition
How should we rename the columns to remove the ambiguity?

Here is how you can rename the columns:
import spark.implicits._
val df = Seq(
("a", 1, "a"),
("a", 1, "a"),
("a", 1, "a")
).toDF("a", "x", "a")
val columns = List("a", "b", "c")
val newDF = df.toDF(columns: _*)
newDF.show(false)
newDF.printSchema()
New Output:
+---+---+---+
|a |b |c |
+---+---+---+
|a |1 |a |
|a |1 |a |
|a |1 |a |
+---+---+---+
New Schema:
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
|-- c: string (nullable = true)

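Since the schema is dynamic, you may not be able to hard-code the new names as above. A minimal sketch of one way to generate unique names programmatically, assuming you simply want to suffix repeated names with an index (the helper uniquifyColumns is hypothetical, not part of Spark):
import org.apache.spark.sql.DataFrame

// Hypothetical helper: rename duplicate columns by appending _1, _2, ... to repeats.
def uniquifyColumns(df: DataFrame): DataFrame = {
  val counts = scala.collection.mutable.Map.empty[String, Int]
  val newNames = df.columns.map { name =>
    val n = counts.getOrElse(name, 0)
    counts(name) = n + 1
    if (n == 0) name else s"${name}_$n"
  }
  df.toDF(newNames: _*)
}

// Usage: uniquifyColumns(df).printSchema()   // columns a, x, a become a, x, a_1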
Related

Spark collect_list change data_type from array to string

I have the following aggregation:
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
Here I am doing some aggregation and collecting the result with collect_list. Earlier we were on Spark 1 and it gave me the data types below.
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
Now we have to migrate to Spark 2, but we are getting the schema below.
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
On fetching the first() record, below is the difference:
Spark 1.6
[2020-09-26, Ayush, 103.67] => datatype string
Spark 2
WrappedArray(2020-09-26, Ayush, 103.67)
How can I keep the same data type?
Edit - Tried Using Concat
One way I got the exact schema of Spark 1.6 is by using concat like this:
val df_date_agg = df
.groupBy($"msisdn",$"event_date",$"network")
.agg(sum($"data_mou").alias("data_mou_dly"),sum($"voice_mou").alias("voice_mou_dly"))
.groupBy($"msisdn")
.agg(collect_list(concat(lit("["),lit($"event_date"),lit(","),lit($"network"),lit(","),lit($"data_mou_dly"),lit("]"))))
Will it affect my code performance?? Is there a better way to do this?
Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1") cast "string").alias("final_data1"),
collect_list(array($"b",$"c",$"data2") cast "string").alias("final_data2"))
That might simply be what your old version of Spark was doing.
The solution you propose would probably work as well, by the way, but wrapping your column references with lit is not necessary: lit($"event_date") can simply be $"event_date".
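For reference, here is a tiny self-contained sketch (toy data, placeholder column names b, c, data1) showing that casting an array column to string leaves you with a plain string per row, which is likely what the old version was giving you:
import org.apache.spark.sql.functions._
import spark.implicits._

// Toy row mimicking one collected element from the aggregation above.
val toy = Seq(("2020-09-26", "Ayush", 103.67)).toDF("b", "c", "data1")

val asString = toy.select(array($"b", $"c", $"data1").cast("string").as("element"))
asString.printSchema()   // element: string
asString.show(false)     // e.g. [2020-09-26, Ayush, 103.67] (exact formatting can vary by Spark version)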
Flattening the final_data1 and final_data2 columns would fix this problem.
import org.apache.spark.sql.functions._

val data = Seq((1, "A", "B"), (1, "C", "D"), (2, "E", "F"), (2, "G", "H"), (2, "I", "J"))
val df = spark.createDataFrame(data).toDF("col1", "col2", "col3")

val old_df = df.groupBy(col("col1")).agg(
  collect_list(
    array(
      col("col2"),
      col("col3")
    )
  ).as("final")
)

val new_df = old_df.select(col("col1"), flatten(col("final")).as("final_new"))
println("Input Dataframe")
df.show(false)
println("Old schema format")
old_df.show(false)
old_df.printSchema()
println("New schema format")
new_df.show(false)
new_df.printSchema()
Output:
Input Dataframe
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |A |B |
|1 |C |D |
|2 |E |F |
|2 |G |H |
|2 |I |J |
+----+----+----+
Old schema format
+----+------------------------+
|col1|final |
+----+------------------------+
|1 |[[A, B], [C, D]] |
|2 |[[E, F], [G, H], [I, J]]|
+----+------------------------+
root
|-- col1: integer (nullable = false)
|-- final: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
New schema format
+----+------------------+
|col1|final_new |
+----+------------------+
|1 |[A, B, C, D] |
|2 |[E, F, G, H, I, J]|
+----+------------------+
root
|-- col1: integer (nullable = false)
|-- final_new: array (nullable = true)
| |-- element: string (containsNull = true)
In your specific case:
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
.select($"a", flatten(col("final_data1")).as("final_data1"), flatten(col("final_data2")).as("final_data2"))

Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

I converted a dataframe (say the child dataframe) into JSON using df.toJSON.
After the JSON conversion the schema looks like this:
root
|-- value: string (nullable = true)
I used the following suggestion to get the child dataframe into the intermediate parent schema/dataframe:
scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Now I still need to build the parentDF schema further to make it look like:
root
|-- id
|-- version
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Q1) How can I build the id column using the id from value (i.e. value.id needs to be represented as id)?
Q2) I need to bring version from a different dataframe (say versionDF) where version is a constant (in all rows). Do I fetch one row from this versionDF to read the value of the version column and then populate it as a literal in the parentDF?
Please help with any code snippets on this.
Instead of toJSON, use to_json in the select statement and select the required columns along with the to_json function.
Check the code below.
val version = // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))
scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- version: double (nullable = false)
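One way to fill in the version placeholder above (a sketch; it assumes versionDF carries the constant in a numeric column named version, as in the second answer below):
// Read the constant version value once from versionDF.
val version: Double = versionDF.select($"version").head.getDouble(0)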
Updated
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data |version|
+---+------------------------------------------+-------+
|1 |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1 |
|2 |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1 |
|3 |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1 |
+---+------------------------------------------+-------+
Update-1
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data |version|
+---+--------------------------------+-------+
|1 |{"id":1,"col1":"a1","col2":"b1"}|1 |
|2 |{"id":2,"col1":"a2","col2":"b2"}|1 |
|3 |{"id":3,"col1":"a3","col2":"b3"}|1 |
+---+--------------------------------+-------+
Try this:
scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]
scala> versionDF.show
+-------+
|version|
+-------+
| 1.0|
+-------+
scala> val version = versionDF.first.get(0)
version: Any = 1.0
scala>
scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]
scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2|
| 3| a3| b3|
+---+----+----+
scala>
scala> val parentDF = childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]
scala> parentDF.printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- id: integer (nullable = true)
|-- version: double (nullable = false)
scala> parentDF.show(false)
+----------------------------------+---+-------+
|data |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1 |1.0 |
|[{"id":2,"col1":"a2","col2":"b2"}]|2 |1.0 |
|[{"id":3,"col1":"a3","col2":"b3"}]|3 |1.0 |
+----------------------------------+---+-------+

Merge two columns of array of structs based on a key

I have a dataframe with the schema below:
input dataframe
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge the 2020 and 2019 columns into one column of array of structs, based on the matching key value.
Desired schema:
expected output dataframe
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x_this_year: double (nullable = true)
| | |-- y_this_year: double (nullable = true)
| | |-- x_last_year: double (nullable = true)
| | |-- y_last_year: double (nullable = true)
| | |-- z_this_year: double (nullable = true)
I would like to merge on the matching key in the structs. Also note that if a key is present in only one of the 2019 or 2020 data, then nulls need to be used to substitute the values of the other year in the merged column.
scala> val df = Seq(
| ("ABC",
| Seq(
| ("a", 2, 4, 6),
| ("b", 3, 6, 9),
| ("c", 1, 2, 3)
| ),
| Seq(
| ("a", 4, 8),
| ("d", 3, 4)
| ))
| ).toDF("A", "B_2020", "B_2019").select(
| $"A",
| $"B_2020" cast "array<struct<key:string,x:double,y:double,z:double>>",
| $"B_2019" cast "array<struct<key:string,x:double,y:double>>"
| )
df: org.apache.spark.sql.DataFrame = [A: string, B_2020: array<struct<key:string,x:double,y:double,z:double>> ... 1 more field]
scala> df.printSchema
root
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
scala> df.show(false)
+---+------------------------------------------------------------+------------------------------+
|A |B_2020 |B_2019 |
+---+------------------------------------------------------------+------------------------------+
|ABC|[[a, 2.0, 4.0, 6.0], [b, 3.0, 6.0, 9.0], [c, 1.0, 2.0, 3.0]]|[[a, 4.0, 8.0], [d, 3.0, 4.0]]|
+---+------------------------------------------------------------+------------------------------+
scala> val df2020 = df.select($"A", explode($"B_2020") as "this_year").select($"A",
| $"this_year.key" as "key", $"this_year.x" as "x_this_year",
| $"this_year.y" as "y_this_year", $"this_year.z" as "z_this_year")
df2020: org.apache.spark.sql.DataFrame = [A: string, key: string ... 3 more fields]
scala> val df2019 = df.select($"A", explode($"B_2019") as "last_year").select($"A",
| $"last_year.key" as "key", $"last_year.x" as "x_last_year",
| $"last_year.y" as "y_last_year")
df2019: org.apache.spark.sql.DataFrame = [A: string, key: string ... 2 more fields]
scala> df2020.show(false)
+---+---+-----------+-----------+-----------+
|A |key|x_this_year|y_this_year|z_this_year|
+---+---+-----------+-----------+-----------+
|ABC|a |2.0 |4.0 |6.0 |
|ABC|b |3.0 |6.0 |9.0 |
|ABC|c |1.0 |2.0 |3.0 |
+---+---+-----------+-----------+-----------+
scala> df2019.show(false)
+---+---+-----------+-----------+
|A |key|x_last_year|y_last_year|
+---+---+-----------+-----------+
|ABC|a |4.0 |8.0 |
|ABC|d |3.0 |4.0 |
+---+---+-----------+-----------+
scala> val outputDF = df2020.join(df2019, Seq("A", "key"), "outer").select(
| $"A" as "market_name",
| struct($"key", $"x_this_year", $"y_this_year", $"x_last_year",
| $"y_last_year", $"z_this_year") as "cancellation_policy_booking")
outputDF: org.apache.spark.sql.DataFrame = [market_name: string, cancellation_policy_booking: struct<key: string, x_this_year: double ... 4 more fields>]
scala> outputDF.printSchema
root
|-- market_name: string (nullable = true)
|-- cancellation_policy_booking: struct (nullable = false)
| |-- key: string (nullable = true)
| |-- x_this_year: double (nullable = true)
| |-- y_this_year: double (nullable = true)
| |-- x_last_year: double (nullable = true)
| |-- y_last_year: double (nullable = true)
| |-- z_this_year: double (nullable = true)
scala> outputDF.show(false)
+-----------+----------------------------+
|market_name|cancellation_policy_booking |
+-----------+----------------------------+
|ABC |[b, 3.0, 6.0,,, 9.0] |
|ABC |[a, 2.0, 4.0, 4.0, 8.0, 6.0]|
|ABC |[d,,, 3.0, 4.0,] |
|ABC |[c, 1.0, 2.0,,, 3.0] |
+-----------+----------------------------+
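The output above has one row per key. If you also want a single array-of-structs column per value of A, as in the desired schema, a sketch along the same lines is to group back and collect the structs (column names reuse the ones from the answer above):
// Regroup so that each market_name ends up with one array of merged structs.
val mergedDF = outputDF
  .groupBy($"market_name")
  .agg(collect_list($"cancellation_policy_booking") as "B")

mergedDF.printSchema()
// B: array of struct<key, x_this_year, y_this_year, x_last_year, y_last_year, z_this_year>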

"Unresolved attributes found when constructing LocalRelation", can somebody explain?

I have a dataframe as
+---+---+---+---+
|A |B |C |D |
+---+---+---+---+
|a |b |b |c |
+---+---+---+---+
I convert the columns into two structs by doing the following:
import org.apache.spark.sql.functions._
val df = myDF.withColumn("colA", struct($"A", $"B"))
.withColumn("colB", struct($"C".as("A"), $"D".as("B")))
the dataframe and schema are
+-----+-----+
|colA |colB |
+-----+-----+
|[a,b]|[b,c]|
+-----+-----+
root
|-- colA: struct (nullable = false)
| |-- A: string (nullable = true)
| |-- B: string (nullable = true)
|-- colB: struct (nullable = false)
| |-- A: string (nullable = true)
| |-- B: string (nullable = true)
I want to combine both struct columns into one column, so I do
df.select(array(struct($"colA.A", $"colA.B"),struct($"colB.A", $"colB.B")).as("Result"))
which gives the correct dataframe and schema:
+--------------+
|Result |
+--------------+
|[[a,b], [b,c]]|
+--------------+
root
|-- Result: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- A: string (nullable = true)
| | |-- B: string (nullable = true)
I could get the same result by doing
df.select(array(struct($"A", $"B"),struct($"C".as("A"), $"D".as("B"))).as("Result"))
Now, if we look at the whole process, we have
$"colA" == struct($"A", $"B") == struct($"colA.A", $"colA.B")
and
$"colB" == struct($"C".as("A"), $"D".as("B")) == struct($"colB.A", $"colB.B")
BUT
when I do
df.select(array($"colA", $"colB").as("Result"))
I get the following error
requirement failed: Unresolved attributes found when constructing LocalRelation.
java.lang.IllegalArgumentException: requirement failed: Unresolved attributes found when constructing LocalRelation.
at scala.Predef$.require(Predef.scala:219)
at org.apache.spark.sql.catalyst.plans.logical.LocalRelation.<init>(LocalRelation.scala:50)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$33.applyOrElse(Optimizer.scala:1402)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$33.applyOrElse(Optimizer.scala:1398)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
.......
........
What is the meaning of this error and how should I correct it?
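No answer was recorded here. As a hedged sketch, two workarounds that are sometimes used for this kind of optimizer failure are to rebuild the structs from the base columns inside the select (the variant already shown to work above), or, assuming Spark 2.4+ where spark.sql.optimizer.excludedRules exists, to exclude the ConvertToLocalRelation rule named in the stack trace:
// Workaround 1 (already shown to work above): rebuild the structs inline.
val result = df.select(array(struct($"A", $"B"), struct($"C".as("A"), $"D".as("B"))).as("Result"))

// Workaround 2 (assumes Spark 2.4+): skip the optimizer rule from the stack trace, then retry.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
df.select(array($"colA", $"colB").as("Result")).show(false)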

How to transform Spark Dataframe columns to a single column of a string array

I want to know how I can "merge" multiple dataframe columns into one column that is a string array.
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
| 1|Jack| 125| Text|
| 2|Mary| 152| Text2|
+---+----+------+-------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- Name: string (nullable = true)
|-- Number: string (nullable = true)
|-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id| List|
+---+-----------------+
| 1| [Jack,125,Text]|
| 2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- List: Array (nullable = true)
| |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List |
// +---+------------------+
// |1 |[Jack, 125, Text] |
// |2 |[Mary, 152, Text2]|
// +---+------------------+
It can also be used with withColumn:
import org.apache.spark.sql.{functions => F}
df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
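If the set of columns to merge is not known up front, a small sketch (assuming you want every column except Id) is to build the array from df.columns:
import org.apache.spark.sql.functions._

// Collect every column except "Id" into a single array column.
val listCols = df.columns.filterNot(_ == "Id").map(col)
val dynamicResult = df.select($"Id", array(listCols: _*).as("List"))

dynamicResult.printSchema()  // List: array<string> here, since Name, Number and Comment are all strings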