Convert Spark DataFrame to HashMap of HashMaps - Scala

I have a dataframe that looks like this:
column1_ID column2 column3 column4
A_123 12 A 1
A_123 12 B 2
A_123 23 A 1
B_456 56 DB 4
B_456 56 BD 5
B_456 60 BD 3
I would like to convert the above dataframe/RDD into the output below, keyed by column1_ID: HashMap(Long, HashMap(String, Long))
'A_123': {12 : {'A': 1, 'B': 2}, 23: {'A': 1} },
'B_456': {56 : {'DB': 4, 'BD': 5}, 60: {'BD': 3} }
I tried reduceByKey and groupByKey but couldn't get the output into this shape.

This can be done by building a nested structure from the last three columns and then applying a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List(
  ("A_123", 12, "A", 1),
  ("A_123", 12, "B", 2),
  ("A_123", 23, "A", 1),
  ("B_456", 56, "DB", 4),
  ("B_456", 56, "BD", 5),
  ("B_456", 60, "BD", 3))
val df = data.toDF("column1_ID", "column2", "column3", "column4")

// pack column3/column4 into a struct
val twoLastCompacted = df.withColumn("lastTwo", struct($"column3", $"column4"))
twoLastCompacted.show(false)

// collect those structs per (column1_ID, column2)
val groupedByTwoFirst = twoLastCompacted
  .groupBy("column1_ID", "column2")
  .agg(collect_list("lastTwo").alias("lastTwoArray"))
groupedByTwoFirst.show(false)

// pack column2 together with its array
val threeLastCompacted = groupedByTwoFirst.withColumn("lastThree", struct($"column2", $"lastTwoArray"))
threeLastCompacted.show(false)

// collect everything per column1_ID
val groupedByFirst = threeLastCompacted
  .groupBy("column1_ID")
  .agg(collect_list("lastThree").alias("lastThreeArray"))
groupedByFirst.printSchema()
groupedByFirst.show(false)

// UDF that turns the nested array of structs into Map[Int, Map[String, Int]]
val structToMap = (value: Seq[Row]) =>
  value.map(v => v.getInt(0) ->
    v.getSeq(1).asInstanceOf[Seq[Row]].map(r => r.getString(0) -> r.getInt(1)).toMap)
    .toMap
val structToMapUDF = udf(structToMap)
groupedByFirst.select($"column1_ID", structToMapUDF($"lastThreeArray")).show(false)
Output:
+----------+-------+-------+-------+-------+
|column1_ID|column2|column3|column4|lastTwo|
+----------+-------+-------+-------+-------+
|A_123 |12 |A |1 |[A,1] |
|A_123 |12 |B |2 |[B,2] |
|A_123 |23 |A |1 |[A,1] |
|B_456 |56 |DB |4 |[DB,4] |
|B_456 |56 |BD |5 |[BD,5] |
|B_456 |60 |BD |3 |[BD,3] |
+----------+-------+-------+-------+-------+
+----------+-------+----------------+
|column1_ID|column2|lastTwoArray |
+----------+-------+----------------+
|B_456 |60 |[[BD,3]] |
|A_123 |12 |[[A,1], [B,2]] |
|B_456 |56 |[[DB,4], [BD,5]]|
|A_123 |23 |[[A,1]] |
+----------+-------+----------------+
+----------+-------+----------------+---------------------------------+
|column1_ID|column2|lastTwoArray |lastThree |
+----------+-------+----------------+---------------------------------+
|B_456 |60 |[[BD,3]] |[60,WrappedArray([BD,3])] |
|A_123 |12 |[[A,1], [B,2]] |[12,WrappedArray([A,1], [B,2])] |
|B_456 |56 |[[DB,4], [BD,5]]|[56,WrappedArray([DB,4], [BD,5])]|
|A_123 |23 |[[A,1]] |[23,WrappedArray([A,1])] |
+----------+-------+----------------+---------------------------------+
root
|-- column1_ID: string (nullable = true)
|-- lastThreeArray: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column2: integer (nullable = false)
| | |-- lastTwoArray: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- column3: string (nullable = true)
| | | | |-- column4: integer (nullable = false)
+----------+--------------------------------------------------------------+
|column1_ID|lastThreeArray |
+----------+--------------------------------------------------------------+
|B_456 |[[60,WrappedArray([BD,3])], [56,WrappedArray([DB,4], [BD,5])]]|
|A_123 |[[12,WrappedArray([A,1], [B,2])], [23,WrappedArray([A,1])]] |
+----------+--------------------------------------------------------------+
+----------+----------------------------------------------------+
|column1_ID|UDF(lastThreeArray) |
+----------+----------------------------------------------------+
|B_456 |Map(60 -> Map(BD -> 3), 56 -> Map(DB -> 4, BD -> 5))|
|A_123 |Map(12 -> Map(A -> 1, B -> 2), 23 -> Map(A -> 1)) |
+----------+----------------------------------------------------+
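If you are on Spark 2.4 or later, a UDF isn't strictly needed: map_from_entries can turn each collected array of (key, value) structs into a map directly. A minimal sketch, assuming the same df as above:
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

// inner map: column2 -> Map(column3 -> column4)
val innerMaps = df
  .groupBy($"column1_ID", $"column2")
  .agg(map_from_entries(collect_list(struct($"column3", $"column4"))).as("innerMap"))

// outer map: column1_ID -> Map(column2 -> innerMap)
val nested = innerMaps
  .groupBy($"column1_ID")
  .agg(map_from_entries(collect_list(struct($"column2", $"innerMap"))).as("nestedMap"))

nested.show(false)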

You can convert the DataFrame to an RDD and apply the operations below:
scala> import scala.collection.immutable.HashMap   // HashMap used below (mutable.HashMap would also work)
import scala.collection.immutable.HashMap

scala> case class Data(col1: String, col2: Int, col3: String, col4: Int)
defined class Data
scala> var x: Seq[Data] = List(Data("A_123",12,"A",1), Data("A_123",12,"B",2), Data("A_123",23,"A",1), Data("B_456",56,"DB",4), Data("B_456",56,"BD",5), Data("B_456",60,"BD",3))
x: Seq[Data] = List(Data(A_123,12,A,1), Data(A_123,12,B,2), Data(A_123,23,A,1), Data(B_456,56,DB,4), Data(B_456,56,BD,5), Data(B_456,60,BD,3))
scala> sc.parallelize(x).groupBy(_.col1).map{a => (a._1, HashMap(a._2.groupBy(_.col2).map{b => (b._1, HashMap(b._2.groupBy(_.col3).map{c => (c._1, c._2.map(_.col4).head)}.toArray: _*))}.toArray: _*))}.toDF()
res26: org.apache.spark.sql.DataFrame = [_1: string, _2: map<int,map<string,int>>]
Here the RDD is initialized with sc.parallelize(x), using the same data structure as in your case.
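If you need the result on the driver as an actual Scala Map of Maps rather than a DataFrame, you could finish with collectAsMap instead. A sketch, assuming the grouped data fits in driver memory:
import scala.collection.immutable.HashMap

val nestedMap = sc.parallelize(x)
  .groupBy(_.col1)
  .mapValues(rows =>
    HashMap(rows.groupBy(_.col2).map { case (c2, rs) =>
      c2 -> HashMap(rs.map(r => r.col3 -> r.col4).toSeq: _*)
    }.toSeq: _*))
  .collectAsMap()   // Map[String, HashMap[Int, HashMap[String, Int]]]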

Related

Add new column of Map datatype to Spark DataFrame in Scala

I'm able to create a new DataFrame with one column having a Map datatype.
val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()),
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1 |Visa |1 |[] |
|2 |MC |2 |[] |
+---+---------+---------------+-----------------+
Now I want to create a new column of the same type as card_type_details. I'm trying to use the Spark withColumn method to add this new column.
inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)
+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details |tmp |
+---------+---------+---------------+---------------------+-----+
|1 |Visa |1 |[] |null |
|2 |MC |2 |[] |null |
+---------+---------+---------------+---------------------+-----+
When I checked the schemas of the two columns they look the same, but the values come out different.
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
|-- id: integer (nullable = false)
|-- card_type: string (nullable = true)
|-- number_of_cards: integer (nullable = false)
|-- card_type_details: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- tmp: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I'm not sure whether I'm adding the new column correctly. The issue appears when I apply the .isEmpty method to the tmp column inside a UDF: I get a NullPointerException.
scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
| var output_map = Map[String, Int]()
| if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
| else {output_map = card_type_details }
| output_map
| })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction
scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value |
+---+---------+---------------+-----------------+--------+
|1 |Visa |1 |[] |[0 -> 1]|
|2 |MC |2 |[] |[0 -> 1]|
+---+---------+---------------+-----------------+--------+
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)
Caused by: java.lang.NullPointerException
at $anonfun$checkValue$1.apply(<console>:28)
at $anonfun$checkValue$1.apply(<console>:26)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
How can I add a new column that has the same values as the card_type_details column?
To add a tmp column with the same values as card_type_details, you just do:
inputDF2.withColumn("tmp", col("card_type_details"))
If you aim to add a column with an empty map and avoid the NullPointerException, the solution is:
inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))
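If you want to keep the null-typed column but still avoid the NullPointerException in the UDF, another option (a sketch, reusing the checkValue UDF and inputDF2 from the question) is to guard the input with coalesce so null maps fall back to an empty map:
import org.apache.spark.sql.functions.{coalesce, col, lit, typedLit}

inputDF2
  .withColumn("tmp", lit(null) cast "map<string,int>")
  // rows where tmp is null are fed an empty map, so .isEmpty inside the UDF is safe
  .withColumn("value", checkValue(coalesce(col("tmp"), typedLit(Map.empty[String, Int]))))
  .show(false)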

Scala modify value of nested column

I need to conditionally modify the value of a nested field in a DataFrame (or create a new field with the nested values).
I would like to do it without using a UDF, and I really want to avoid RDD/map, since the production tables can have many hundreds of millions of records and map doesn't strike me as efficient/fast at that scale.
Below is the test case:
case class teste(var testID: Int = 0, var testDesc: String = "", var testValue: String = "")
val DFMain = Seq( ("A",teste(1, "AAA", "10")),("B",teste(2, "BBB", "20")),("C",teste(3, "CCC", "30"))).toDF("F1","F2")
val DFNewData = Seq( ("A",teste(1, "AAA", "40")),("B",teste(2, "BBB", "50")),("C",teste(3, "CCC", "60"))).toDF("F1","F2")
val DFJoined = DFMain.join(DFNewData,DFMain("F2.testID")===DFNewData("F2.testID"),"left").
select(DFMain("F1"), DFMain("F2"), DFNewData("F2.testValue").as("NewValue")).
withColumn("F2.testValue",$"NewValue")
DFJoined.show()
This adds a new top-level column, but I need F2.testValue inside the struct to take the value of NewValue when NewValue is at least 50.
Original Data:
+---+------------+
| F1| F2|
+---+------------+
| A|[1, AAA, 10]|
| B|[2, BBB, 20]|
| C|[3, CCC, 30]|
+---+------------+
Desired Result:
+---+------------+
| F1| F2|
+---+------------+
| A|[1, AAA, 10]|
| B|[2, BBB, 50]|
| C|[3, CCC, 60]|
+---+------------+
Could you please try this.
case class teste(var testID: Int = 0, var testDesc: String = "", var testValue: String = "")
val DFMain = Seq( ("A",teste(1, "AAA", "10")),("B",teste(2, "BBB", "20")),("C",teste(3, "CCC", "30"))).toDF("F1","F2")
DFMain.show(false)
+---+------------+
|F1 |F2 |
+---+------------+
|A |[1, AAA, 10]|
|B |[2, BBB, 20]|
|C |[3, CCC, 30]|
+---+------------+
val DFNewData = Seq( ("A",teste(1, "AAA", "40")),("B",teste(2, "BBB", "50")),("C",teste(3, "CCC", "60"))).toDF("F1","F2")
val DFJoined = DFMain.join(DFNewData,DFMain("F2.testID")===DFNewData("F2.testID"),"left").
select(DFMain("F1"), DFMain("F2"), DFNewData("F2.testValue").as("NewValue"))
.withColumn("F2_testValue",$"NewValue")
DFJoined.show
+---+------------+--------+------------+
| F1| F2|NewValue|F2_testValue|
+---+------------+--------+------------+
| A|[1, AAA, 10]| 40| 40|
| B|[2, BBB, 20]| 50| 50|
| C|[3, CCC, 30]| 60| 60|
+---+------------+--------+------------+
DFJoined.printSchema
root
|-- F1: string (nullable = true)
|-- F2: struct (nullable = true)
| |-- testID: integer (nullable = false)
| |-- testDesc: string (nullable = true)
| |-- testValue: string (nullable = true)
|-- NewValue: string (nullable = true)
|-- F2_testValue: string (nullable = true)
DFJoined.withColumn("f2_new", expr(" case when F2_testValue>=50 then concat_ws('|',F2.testID,F2.testDesc,F2_testValue) else concat_ws('|',F2.testID,F2.testDesc,F2.testValue) end "))
.withColumn("f2_new3",struct(split($"f2_new","[|]")(0),split($"f2_new","[|]")(1),split($"f2_new","[|]")(2) ) )
.show(false)
+---+------------+--------+------------+--------+------------+
|F1 |F2 |NewValue|F2_testValue|f2_new |f2_new3 |
+---+------------+--------+------------+--------+------------+
|A |[1, AAA, 10]|40 |40 |1|AAA|10|[1, AAA, 10]|
|B |[2, BBB, 20]|50 |50 |2|BBB|50|[2, BBB, 50]|
|C |[3, CCC, 30]|60 |60 |3|CCC|60|[3, CCC, 60]|
+---+------------+--------+------------+--------+------------+
f2_new3 is the desired output.
The reason for the workaround is that the version below does not work: the two CASE branches build structs with different field names (F2_testValue vs. testValue), so Spark treats them as different types.
DFJoined.withColumn("f2_new", expr(" case when F2_testValue>=50 then struct(F2.testID,F2.testDesc,F2_testValue) else struct(F2.testID,F2.testDesc,F2.testValue) end ")).show()
In addition to stack0114106's answer, I also found this solution to the problem; they are more or less alike:
val DFFinal = DFJoined.selectExpr("""
named_struct(
'F1', F1,
'F2', named_struct(
'testID', F2.testID,
'testDesc', F2.testDesc,
'testValue', case when NewValue>=50 then NewValue else F2.testValue end
)
) as named_struct
""").select($"named_struct.F1", $"named_struct.F2")

Spark select item in array by max score

Given the following DataFrame containing an id and a Seq of Stuff (with an id and score), how do I select the "best" Stuff in the array by score?
I'd like NOT to use UDFs and possibly work with Spark DataFrame functions only.
case class Stuff(id: Int, score: Double)
val df = spark.createDataFrame(Seq(
(1, Seq(Stuff(11, 0.4), Stuff(12, 0.5))),
(2, Seq(Stuff(22, 0.9), Stuff(23, 0.8)))
)).toDF("id", "data")
df.show(false)
+---+----------------------+
|id |data |
+---+----------------------+
|1 |[[11, 0.4], [12, 0.5]]|
|2 |[[22, 0.9], [23, 0.8]]|
+---+----------------------+
df.printSchema
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = false)
| | |-- score: double (nullable = false)
I tried going down the route of window functions but the code gets a bit too convoluted. Expected output:
+---+---------+
|id |topStuff |
+---+---------+
|1 |[12, 0.5]|
|2 |[22, 0.9]|
+---+---------+
You can use Spark 2.4 higher-order functions:
df
.selectExpr("id","(filter(data, x -> x.score == array_max(data.score)))[0] as topstuff")
.show()
gives
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+
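Another Spark 2.4 option, if you want a single pass over the array, is to fold it with the aggregate higher-order function, keeping whichever struct has the higher score (a sketch, not from the original answer):
df
  .selectExpr(
    "id",
    // start from the first element and keep the struct with the larger score
    "aggregate(data, data[0], (acc, x) -> IF(x.score > acc.score, x, acc)) as topstuff")
  .show()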
As an alternative, use window functions (this requires a shuffle!):
import org.apache.spark.sql.expressions.Window

df
  .select($"id", explode($"data").as("topstuff"))
  .withColumn("selector", max($"topstuff.score").over(Window.partitionBy($"id")))
  .where($"topstuff.score" === $"selector")
  .drop($"selector")
  .show()
also gives:
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+

Minimum value of a struct-type column in a DataFrame

I have the JSON file below, where col3 contains struct-type data with dynamic nested columns (dynamic names) and their values. I am able to extract the rows, but I am not able to find the minimum value of col3.
The input data is:
"result": { "data" :
[ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}]
The expected output is:
col1 col2 col3 col4 col5(min value of col3)
value1 value2 [3,5] value4 3
value11 value22 [6,9,2] value44 2
value12 value23 [6] value43 6
I am able to read the file and explode the data into records, but I am not able to find the min value of col3.
val bestseller_df1 = bestseller_json.withColumn("extractedresult", explode(col("result.data")))
Can you please help me with code to find the min value of col3 in Spark/Scala?
My JSON file is:
{"success":true, "result": { "data": [ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}],"total":3}}
Here is how you would do it:
scala> val df = spark.read.json("/tmp/stack/pathi.json")
df: org.apache.spark.sql.DataFrame = [result: struct<data: array<struct<col1:string,col2:string,col3:struct<abc:bigint,aeio:bigint,aero:bigint,ddc:bigint,def:bigint,dyno:bigint>,col4:string>>, total: bigint>, success: boolean]
scala> df.printSchema
root
|-- result: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- col1: string (nullable = true)
| | | |-- col2: string (nullable = true)
| | | |-- col3: struct (nullable = true)
| | | | |-- abc: long (nullable = true)
| | | | |-- aeio: long (nullable = true)
| | | | |-- aero: long (nullable = true)
| | | | |-- ddc: long (nullable = true)
| | | | |-- def: long (nullable = true)
| | | | |-- dyno: long (nullable = true)
| | | |-- col4: string (nullable = true)
| |-- total: long (nullable = true)
|-- success: boolean (nullable = true)
scala> df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|result |success|
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|[[[value1, value2, [, 5,,,, 3], value4], [value11, value22, [6,, 2,, 9,], value44], [value12, value23, [,,, 6,,], value43]], 3]|true |
+-------------------------------------------------------------------------------------------------------------------------------+-------+
scala> df.select(explode($"result.data")).show(false)
+-----------------------------------------+
|col |
+-----------------------------------------+
|[value1, value2, [, 5,,,, 3], value4] |
|[value11, value22, [6,, 2,, 9,], value44]|
|[value12, value23, [,,, 6,,], value43] |
+-----------------------------------------+
By looking at the schema, we now know the list of possible columns inside "col3", so we can compute the minimum of all those values by hard-coding them as below:
scala> df.select(explode($"result.data")).select(least($"col.col3.abc",$"col.col3.aeio",$"col.col3.aero",$"col.col3.ddc",$"col.col3.def",$"col.col3.dyno")).show(false)
+--------------------------------------------------------------------------------------------+
|least(col.col3.abc, col.col3.aeio, col.col3.aero, col.col3.ddc, col.col3.def, col.col3.dyno)|
+--------------------------------------------------------------------------------------------+
|3 |
|2 |
|6 |
+--------------------------------------------------------------------------------------------+
Dynamic handling:
I'll assume that up to col.col3 the structure remains the same, so we proceed by creating another dataframe:
scala> val df2 = df.withColumn("res_data",explode($"result.data")).select(col("success"),col("res_data"),$"res_data.col3.*")
df2: org.apache.spark.sql.DataFrame = [success: boolean, res_data: struct<col1: string, col2: string ... 2 more fields> ... 6 more fields]
scala> df2.show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|
+-------+-----------------------------------------+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|
+-------+-----------------------------------------+----+----+----+----+----+----+
Other than "success" and "res_data", the rest of the columns are the dynamic ones from "col3"
scala> val p = df2.columns
p: Array[String] = Array(success, res_data, abc, aeio, aero, ddc, def, dyno)
Filter out those two and map the rest of them to Spark Columns:
scala> val m = p.filter(_!="success").filter(_!="res_data").map(col(_))
m: Array[org.apache.spark.sql.Column] = Array(abc, aeio, aero, ddc, def, dyno)
Now pass m:_* as the argument to the least function and you get the results below:
scala> df2.withColumn("minv",least(m:_*)).show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|minv|
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|2 |
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|6 |
+-------+-----------------------------------------+----+----+----+----+----+----+----+
Hope this helps.
dbutils.fs.put("/tmp/test.json", """
{"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}]}
""", true)
val df_json = spark.read.json("/tmp/test.json")
val tf = df_json.withColumn("col3", explode(array($"col3.*"))).toDF
val tmp_group = tf.groupBy("col1").agg( min(tf.col("col3")).alias("col3"))
val top_rows = tf.join(tmp_group, Seq("col3","col1"), "inner")
top_rows.select("col1", "col2", "col3","col4").show()
Wrote 282 bytes.
+-------+-------+----+-------+
| col1| col2|col3| col4|
+-------+-------+----+-------+
| value1| value2| 3| value4|
|value11|value22| 2|value44|
|value12|value23| 6|value43|
+-------+-------+----+-------+
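A further option, not in the answers above (a sketch, assuming Spark 2.4+ and the wrapped JSON from the question): read col3 with an explicit MapType schema so its dynamic keys become map entries, then col5 is just array_min over the map values:
import org.apache.spark.sql.functions.{array_min, col, explode, map_values}
import org.apache.spark.sql.types._

// schema for the wrapped file; col3 is declared as a map so the key names stay dynamic
val jsonSchema = new StructType()
  .add("success", BooleanType)
  .add("result", new StructType()
    .add("data", ArrayType(new StructType()
      .add("col1", StringType)
      .add("col2", StringType)
      .add("col3", MapType(StringType, LongType))
      .add("col4", StringType)))
    .add("total", LongType))

val rows = spark.read.schema(jsonSchema).json("/tmp/stack/pathi.json")
  .select(explode(col("result.data")).as("r"))
  .select(col("r.col1"), col("r.col2"),
          map_values(col("r.col3")).as("col3"),
          col("r.col4"),
          array_min(map_values(col("r.col3"))).as("col5"))

rows.show(false)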

How to build map rows from comma-delimited strings?

var clearedLine = ""
var dict = collection.mutable.Map[String, String]()
val rdd = BufferedSource.map(line => ({
  if (!line.endsWith(", ")) {
    clearedLine = line + ", "
  } else {
    clearedLine = line.trim
  }
  clearedLine.split(",")(0).trim -> clearedLine.split(",")(1).trim
}
  //,clearedLine.split(",")(1).trim->clearedLine.split(",")(0).trim
)
  //dict += clearedLine.split(",")(0).trim.replace(" TO ","->")
)
for ((k, v) <- rdd) printf("key: %s, value: %s\n", k, v)
OUTPUT:
key: EQU EB.AR.DESCRIPT TO 1, value: EB.AR.ASSET.CLASS TO 2
key: EB.AR.CURRENCY TO 3, value: EB.AR.ORIGINAL.VALUE TO 4
I want to split by ' TO ' and then produce a single dict of key -> value; please help. Desired output:
key: 1, value: EQU EB.AR.DESCRIPT
key: 2 value: EB.AR.ASSET.CLASS
key: 3, value: EB.AR.CURRENCY
key: 4, value: EB.AR.ORIGINAL.VALUE
Assuming your input consists of lines like the ones below:
EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2
EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4
try this Scala DataFrame solution:
scala> val df = Seq(("EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2"),("EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]
scala> df.show(false)
+----------------------------------------------+
|a |
+----------------------------------------------+
|EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2|
|EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4|
+----------------------------------------------+
scala> val df2 = df.select(split($"a",",").getItem(0).as("a1"),split($"a",",").getItem(1).as("a2"))
df2: org.apache.spark.sql.DataFrame = [a1: string, a2: string]
scala> df2.show(false)
+-----------------------+--------------------------+
|a1 |a2 |
+-----------------------+--------------------------+
|EQU EB.AR.DESCRIPT TO 1|EB.AR.ASSET.CLASS TO 2 |
|EB.AR.CURRENCY TO 3 | EB.AR.ORIGINAL.VALUE TO 4|
+-----------------------+--------------------------+
scala> val df3 = df2.flatMap( r => { (0 until r.size).map( i=> r.getString(i) ) })
df3: org.apache.spark.sql.Dataset[String] = [value: string]
scala> df3.show(false)
+--------------------------+
|value |
+--------------------------+
|EQU EB.AR.DESCRIPT TO 1 |
|EB.AR.ASSET.CLASS TO 2 |
|EB.AR.CURRENCY TO 3 |
| EB.AR.ORIGINAL.VALUE TO 4|
+--------------------------+
scala> df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).show(false)
+---+---------------------+
|key|value |
+---+---------------------+
|1 |EQU EB.AR.DESCRIPT |
|2 |EB.AR.ASSET.CLASS |
|3 |EB.AR.CURRENCY |
|4 | EB.AR.ORIGINAL.VALUE|
+---+---------------------+
If you want them as "map" column, then
scala> val df4 = df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).select(map($"key",$"value").as("kv"))
df4: org.apache.spark.sql.DataFrame = [kv: map<string,string>]
scala> df4.show(false)
+----------------------------+
|kv |
+----------------------------+
|[1 -> EQU EB.AR.DESCRIPT] |
|[2 -> EB.AR.ASSET.CLASS] |
|[3 -> EB.AR.CURRENCY] |
|[4 -> EB.AR.ORIGINAL.VALUE]|
+----------------------------+
scala> df4.printSchema
root
|-- kv: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
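If all that's needed is a plain driver-side Map (no Spark), a minimal sketch that splits each line on ',' and each piece on ' TO ', mapping the number to the field name (assuming lines like the two shown above):
val lines = Seq(
  "EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2",
  "EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")

val dict: Map[String, String] = lines
  .flatMap(_.split(","))          // one entry per "name TO number" piece
  .map(_.trim)
  .map { piece =>
    val Array(name, key) = piece.split(" TO ")
    key.trim -> name.trim          // number becomes the key, field name the value
  }
  .toMap

dict.foreach { case (k, v) => printf("key: %s, value: %s\n", k, v) }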