Convert Spark DataFrame to HashMap of HashMaps - Scala

I have a dataframe that looks like this:
column1_ID column2 column3 column4
A_123 12 A 1
A_123 12 B 2
A_123 23 A 1
B_456 56 DB 4
B_456 56 BD 5
B_456 60 BD 3
I would like to convert the above dataframe/RDD into the output below, keyed by column1_ID: HashMap(Long, HashMap(String, Long))
'A_123': {12 : {'A': 1, 'B': 2}, 23: {'A': 1} },
'B_456': {56 : {'DB': 4, 'BD': 5}, 60: {'BD': 3} }
I tried reduceByKey and groupByKey but couldn't get the output into this shape.

This can be done by building a nested structure from the last three columns and then applying a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List(
  ("A_123", 12, "A", 1),
  ("A_123", 12, "B", 2),
  ("A_123", 23, "A", 1),
  ("B_456", 56, "DB", 4),
  ("B_456", 56, "BD", 5),
  ("B_456", 60, "BD", 3))
val df = data.toDF("column1_ID", "column2", "column3", "column4")

// pack column3/column4 into a struct
val twoLastCompacted = df.withColumn("lastTwo", struct($"column3", $"column4"))
twoLastCompacted.show(false)

// collect those structs per (column1_ID, column2)
val groupedByTwoFirst = twoLastCompacted
  .groupBy("column1_ID", "column2")
  .agg(collect_list("lastTwo").alias("lastTwoArray"))
groupedByTwoFirst.show(false)

// pack column2 together with its array
val threeLastCompacted = groupedByTwoFirst.withColumn("lastThree", struct($"column2", $"lastTwoArray"))
threeLastCompacted.show(false)

// collect everything per column1_ID
val groupedByFirst = threeLastCompacted
  .groupBy("column1_ID")
  .agg(collect_list("lastThree").alias("lastThreeArray"))
groupedByFirst.printSchema()
groupedByFirst.show(false)

// UDF that turns the nested array of structs into Map[Int, Map[String, Int]]
val structToMap = (value: Seq[Row]) =>
  value.map(v => v.getInt(0) ->
    v.getSeq(1).asInstanceOf[Seq[Row]].map(r => r.getString(0) -> r.getInt(1)).toMap)
    .toMap
val structToMapUDF = udf(structToMap)
groupedByFirst.select($"column1_ID", structToMapUDF($"lastThreeArray")).show(false)
Output:
+----------+-------+-------+-------+-------+
|column1_ID|column2|column3|column4|lastTwo|
+----------+-------+-------+-------+-------+
|A_123 |12 |A |1 |[A,1] |
|A_123 |12 |B |2 |[B,2] |
|A_123 |23 |A |1 |[A,1] |
|B_456 |56 |DB |4 |[DB,4] |
|B_456 |56 |BD |5 |[BD,5] |
|B_456 |60 |BD |3 |[BD,3] |
+----------+-------+-------+-------+-------+
+----------+-------+----------------+
|column1_ID|column2|lastTwoArray |
+----------+-------+----------------+
|B_456 |60 |[[BD,3]] |
|A_123 |12 |[[A,1], [B,2]] |
|B_456 |56 |[[DB,4], [BD,5]]|
|A_123 |23 |[[A,1]] |
+----------+-------+----------------+
+----------+-------+----------------+---------------------------------+
|column1_ID|column2|lastTwoArray |lastThree |
+----------+-------+----------------+---------------------------------+
|B_456 |60 |[[BD,3]] |[60,WrappedArray([BD,3])] |
|A_123 |12 |[[A,1], [B,2]] |[12,WrappedArray([A,1], [B,2])] |
|B_456 |56 |[[DB,4], [BD,5]]|[56,WrappedArray([DB,4], [BD,5])]|
|A_123 |23 |[[A,1]] |[23,WrappedArray([A,1])] |
+----------+-------+----------------+---------------------------------+
root
|-- column1_ID: string (nullable = true)
|-- lastThreeArray: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column2: integer (nullable = false)
| | |-- lastTwoArray: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- column3: string (nullable = true)
| | | | |-- column4: integer (nullable = false)
+----------+--------------------------------------------------------------+
|column1_ID|lastThreeArray |
+----------+--------------------------------------------------------------+
|B_456 |[[60,WrappedArray([BD,3])], [56,WrappedArray([DB,4], [BD,5])]]|
|A_123 |[[12,WrappedArray([A,1], [B,2])], [23,WrappedArray([A,1])]] |
+----------+--------------------------------------------------------------+
+----------+----------------------------------------------------+
|column1_ID|UDF(lastThreeArray) |
+----------+----------------------------------------------------+
|B_456 |Map(60 -> Map(BD -> 3), 56 -> Map(DB -> 4, BD -> 5))|
|A_123 |Map(12 -> Map(A -> 1, B -> 2), 23 -> Map(A -> 1)) |
+----------+----------------------------------------------------+
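If you are on Spark 2.4 or later, a UDF isn't strictly needed: map_from_entries can turn each collected array of (key, value) structs into a map directly. A minimal sketch, assuming the same df as above:
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

// inner map: column2 -> Map(column3 -> column4)
val innerMaps = df
  .groupBy($"column1_ID", $"column2")
  .agg(map_from_entries(collect_list(struct($"column3", $"column4"))).as("innerMap"))

// outer map: column1_ID -> Map(column2 -> innerMap)
val nested = innerMaps
  .groupBy($"column1_ID")
  .agg(map_from_entries(collect_list(struct($"column2", $"innerMap"))).as("nestedMap"))

nested.show(false)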

You can convert the DataFrame to an RDD and apply the operations below:
scala> import scala.collection.immutable.HashMap   // HashMap used below (mutable.HashMap would also work)
import scala.collection.immutable.HashMap

scala> case class Data(col1: String, col2: Int, col3: String, col4: Int)
defined class Data
scala> var x: Seq[Data] = List(Data("A_123",12,"A",1), Data("A_123",12,"B",2), Data("A_123",23,"A",1), Data("B_456",56,"DB",4), Data("B_456",56,"BD",5), Data("B_456",60,"BD",3))
x: Seq[Data] = List(Data(A_123,12,A,1), Data(A_123,12,B,2), Data(A_123,23,A,1), Data(B_456,56,DB,4), Data(B_456,56,BD,5), Data(B_456,60,BD,3))
scala> sc.parallelize(x).groupBy(_.col1).map{a => (a._1, HashMap(a._2.groupBy(_.col2).map{b => (b._1, HashMap(b._2.groupBy(_.col3).map{c => (c._1, c._2.map(_.col4).head)}.toArray: _*))}.toArray: _*))}.toDF()
res26: org.apache.spark.sql.DataFrame = [_1: string, _2: map<int,map<string,int>>]
Here the RDD is initialized with sc.parallelize(x), using the same data structure as in your case.
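If you need the result on the driver as an actual Scala Map of Maps rather than a DataFrame, you could finish with collectAsMap instead. A sketch, assuming the grouped data fits in driver memory:
import scala.collection.immutable.HashMap

val nestedMap = sc.parallelize(x)
  .groupBy(_.col1)
  .mapValues(rows =>
    HashMap(rows.groupBy(_.col2).map { case (c2, rs) =>
      c2 -> HashMap(rs.map(r => r.col3 -> r.col4).toSeq: _*)
    }.toSeq: _*))
  .collectAsMap()   // Map[String, HashMap[Int, HashMap[String, Int]]]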

Related

Add new column of Map datatype to Spark DataFrame in Scala

I'm able to create a new DataFrame with one column having a Map datatype.
val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()),
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1 |Visa |1 |[] |
|2 |MC |2 |[] |
+---+---------+---------------+-----------------+
Now I want to create a new column of the same type as card_type_details. I'm trying to use the Spark withColumn method to add this new column.
inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)
+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details |tmp |
+---------+---------+---------------+---------------------+-----+
|1 |Visa |1 |[] |null |
|2 |MC |2 |[] |null |
+---------+---------+---------------+---------------------+-----+
When I checked the schemas of the two columns they look the same, but the values come out different.
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
|-- id: integer (nullable = false)
|-- card_type: string (nullable = true)
|-- number_of_cards: integer (nullable = false)
|-- card_type_details: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- tmp: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I'm not sure whether I'm adding the new column correctly. The issue appears when I apply the .isEmpty method to the tmp column inside a UDF: I get a NullPointerException.
scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
| var output_map = Map[String, Int]()
| if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
| else {output_map = card_type_details }
| output_map
| })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction
scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value |
+---+---------+---------------+-----------------+--------+
|1 |Visa |1 |[] |[0 -> 1]|
|2 |MC |2 |[] |[0 -> 1]|
+---+---------+---------------+-----------------+--------+
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)
Caused by: java.lang.NullPointerException
at $anonfun$checkValue$1.apply(<console>:28)
at $anonfun$checkValue$1.apply(<console>:26)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
How can I add a new column that has the same values as the card_type_details column?
To add a tmp column with the same values as card_type_details, you just do:
inputDF2.withColumn("tmp", col("card_type_details"))
If you aim to add a column with an empty map and avoid the NullPointerException, the solution is:
inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))
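If you want to keep the null-typed column but still avoid the NullPointerException in the UDF, another option (a sketch, reusing the checkValue UDF and inputDF2 from the question) is to guard the input with coalesce so null maps fall back to an empty map:
import org.apache.spark.sql.functions.{coalesce, col, lit, typedLit}

inputDF2
  .withColumn("tmp", lit(null) cast "map<string,int>")
  // rows where tmp is null are fed an empty map, so .isEmpty inside the UDF is safe
  .withColumn("value", checkValue(coalesce(col("tmp"), typedLit(Map.empty[String, Int]))))
  .show(false)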

Scala modify value of nested column

I need to conditionally modify the value of a nested field in a DataFrame (or create a new field with the nested values).
I would like to do it without using a UDF, and I really want to avoid RDD/map, since the production tables can have many hundreds of millions of records and map doesn't strike me as efficient/fast at that scale.
Below is the test case:
case class teste(var testID: Int = 0, var testDesc: String = "", var testValue: String = "")
val DFMain = Seq( ("A",teste(1, "AAA", "10")),("B",teste(2, "BBB", "20")),("C",teste(3, "CCC", "30"))).toDF("F1","F2")
val DFNewData = Seq( ("A",teste(1, "AAA", "40")),("B",teste(2, "BBB", "50")),("C",teste(3, "CCC", "60"))).toDF("F1","F2")
val DFJoined = DFMain.join(DFNewData,DFMain("F2.testID")===DFNewData("F2.testID"),"left").
select(DFMain("F1"), DFMain("F2"), DFNewData("F2.testValue").as("NewValue")).
withColumn("F2.testValue",$"NewValue")
DFJoined.show()
This adds a new top-level column, but I need F2.testValue inside the struct to take the value of NewValue when NewValue is at least 50.
Original Data:
+---+------------+
| F1| F2|
+---+------------+
| A|[1, AAA, 10]|
| B|[2, BBB, 20]|
| C|[3, CCC, 30]|
+---+------------+
Desired Result:
+---+------------+
| F1| F2|
+---+------------+
| A|[1, AAA, 10]|
| B|[2, BBB, 50]|
| C|[3, CCC, 60]|
+---+------------+
Could you please try this.
case class teste(var testID: Int = 0, var testDesc: String = "", var testValue: String = "")
val DFMain = Seq( ("A",teste(1, "AAA", "10")),("B",teste(2, "BBB", "20")),("C",teste(3, "CCC", "30"))).toDF("F1","F2")
DFMain.show(false)
+---+------------+
|F1 |F2 |
+---+------------+
|A |[1, AAA, 10]|
|B |[2, BBB, 20]|
|C |[3, CCC, 30]|
+---+------------+
val DFNewData = Seq( ("A",teste(1, "AAA", "40")),("B",teste(2, "BBB", "50")),("C",teste(3, "CCC", "60"))).toDF("F1","F2")
val DFJoined = DFMain.join(DFNewData,DFMain("F2.testID")===DFNewData("F2.testID"),"left").
select(DFMain("F1"), DFMain("F2"), DFNewData("F2.testValue").as("NewValue"))
.withColumn("F2_testValue",$"NewValue")
DFJoined.show
+---+------------+--------+------------+
| F1| F2|NewValue|F2_testValue|
+---+------------+--------+------------+
| A|[1, AAA, 10]| 40| 40|
| B|[2, BBB, 20]| 50| 50|
| C|[3, CCC, 30]| 60| 60|
+---+------------+--------+------------+
DFJoined.printSchema
root
|-- F1: string (nullable = true)
|-- F2: struct (nullable = true)
| |-- testID: integer (nullable = false)
| |-- testDesc: string (nullable = true)
| |-- testValue: string (nullable = true)
|-- NewValue: string (nullable = true)
|-- F2_testValue: string (nullable = true)
DFJoined.withColumn("f2_new", expr(" case when F2_testValue>=50 then concat_ws('|',F2.testID,F2.testDesc,F2_testValue) else concat_ws('|',F2.testID,F2.testDesc,F2.testValue) end "))
.withColumn("f2_new3",struct(split($"f2_new","[|]")(0),split($"f2_new","[|]")(1),split($"f2_new","[|]")(2) ) )
.show(false)
+---+------------+--------+------------+--------+------------+
|F1 |F2 |NewValue|F2_testValue|f2_new |f2_new3 |
+---+------------+--------+------------+--------+------------+
|A |[1, AAA, 10]|40 |40 |1|AAA|10|[1, AAA, 10]|
|B |[2, BBB, 20]|50 |50 |2|BBB|50|[2, BBB, 50]|
|C |[3, CCC, 30]|60 |60 |3|CCC|60|[3, CCC, 60]|
+---+------------+--------+------------+--------+------------+
f2_new3 is the desired output.
The reason for the workaround is that the version below does not work: the two CASE branches build structs with different field names (F2_testValue vs. testValue), so Spark treats them as different types.
DFJoined.withColumn("f2_new", expr(" case when F2_testValue>=50 then struct(F2.testID,F2.testDesc,F2_testValue) else struct(F2.testID,F2.testDesc,F2.testValue) end ")).show()
In addition to stack0114106's answer, I also found this solution to the problem; they are more or less alike:
val DFFinal = DFJoined.selectExpr("""
named_struct(
'F1', F1,
'F2', named_struct(
'testID', F2.testID,
'testDesc', F2.testDesc,
'testValue', case when NewValue>=50 then NewValue else F2.testValue end
)
) as named_struct
""").select($"named_struct.F1", $"named_struct.F2")

Spark select item in array by max score

Given the following DataFrame containing an id and a Seq of Stuff (with an id and score), how do I select the "best" Stuff in the array by score?
I'd like NOT to use UDFs and possibly work with Spark DataFrame functions only.
case class Stuff(id: Int, score: Double)
val df = spark.createDataFrame(Seq(
(1, Seq(Stuff(11, 0.4), Stuff(12, 0.5))),
(2, Seq(Stuff(22, 0.9), Stuff(23, 0.8)))
)).toDF("id", "data")
df.show(false)
+---+----------------------+
|id |data |
+---+----------------------+
|1 |[[11, 0.4], [12, 0.5]]|
|2 |[[22, 0.9], [23, 0.8]]|
+---+----------------------+
df.printSchema
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = false)
| | |-- score: double (nullable = false)
I tried going down the route of window functions but the code gets a bit too convoluted. Expected output:
+---+---------+
|id |topStuff |
+---+---------+
|1 |[12, 0.5]|
|2 |[22, 0.9]|
+---+---------+
You can use Spark 2.4 higher-order functions:
df
.selectExpr("id","(filter(data, x -> x.score == array_max(data.score)))[0] as topstuff")
.show()
gives
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+
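Another Spark 2.4 option, if you want a single pass over the array, is to fold it with the aggregate higher-order function, keeping whichever struct has the higher score (a sketch, not from the original answer):
df
  .selectExpr(
    "id",
    // start from the first element and keep the struct with the larger score
    "aggregate(data, data[0], (acc, x) -> IF(x.score > acc.score, x, acc)) as topstuff")
  .show()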
As an alternative, use window functions (this requires a shuffle!):
import org.apache.spark.sql.expressions.Window

df
  .select($"id", explode($"data").as("topstuff"))
  .withColumn("selector", max($"topstuff.score").over(Window.partitionBy($"id")))
  .where($"topstuff.score" === $"selector")
  .drop($"selector")
  .show()
also gives:
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+

Minimum value of a struct-type column in a DataFrame

I have the JSON file below, where col3 contains struct-type data with dynamic nested columns (dynamic names) and their values. I am able to extract the rows, but I am not able to find the minimum value of col3.
The input data is:
"result": { "data" :
[ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}]
The expected output is:
col1 col2 col3 col4 col5(min value of col3)
value1 value2 [3,5] value4 3
value11 value22 [6,9,2] value44 2
value12 value23 [6] value43 6
I am able to read the file and explode the data into records, but I am not able to find the min value of col3.
val bestseller_df1 = bestseller_json.withColumn("extractedresult", explode(col("result.data")))
Can you please help me with code to find the min value of col3 in Spark/Scala?
My JSON file is:
{"success":true, "result": { "data": [ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}],"total":3}}
Here is how you would do it:
scala> val df = spark.read.json("/tmp/stack/pathi.json")
df: org.apache.spark.sql.DataFrame = [result: struct<data: array<struct<col1:string,col2:string,col3:struct<abc:bigint,aeio:bigint,aero:bigint,ddc:bigint,def:bigint,dyno:bigint>,col4:string>>, total: bigint>, success: boolean]
scala> df.printSchema
root
|-- result: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- col1: string (nullable = true)
| | | |-- col2: string (nullable = true)
| | | |-- col3: struct (nullable = true)
| | | | |-- abc: long (nullable = true)
| | | | |-- aeio: long (nullable = true)
| | | | |-- aero: long (nullable = true)
| | | | |-- ddc: long (nullable = true)
| | | | |-- def: long (nullable = true)
| | | | |-- dyno: long (nullable = true)
| | | |-- col4: string (nullable = true)
| |-- total: long (nullable = true)
|-- success: boolean (nullable = true)
scala> df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|result |success|
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|[[[value1, value2, [, 5,,,, 3], value4], [value11, value22, [6,, 2,, 9,], value44], [value12, value23, [,,, 6,,], value43]], 3]|true |
+-------------------------------------------------------------------------------------------------------------------------------+-------+
scala> df.select(explode($"result.data")).show(false)
+-----------------------------------------+
|col |
+-----------------------------------------+
|[value1, value2, [, 5,,,, 3], value4] |
|[value11, value22, [6,, 2,, 9,], value44]|
|[value12, value23, [,,, 6,,], value43] |
+-----------------------------------------+
By looking at the schema, we now know the list of possible columns inside "col3", so we can compute the minimum of all those values by hard-coding them as below:
scala> df.select(explode($"result.data")).select(least($"col.col3.abc",$"col.col3.aeio",$"col.col3.aero",$"col.col3.ddc",$"col.col3.def",$"col.col3.dyno")).show(false)
+--------------------------------------------------------------------------------------------+
|least(col.col3.abc, col.col3.aeio, col.col3.aero, col.col3.ddc, col.col3.def, col.col3.dyno)|
+--------------------------------------------------------------------------------------------+
|3 |
|2 |
|6 |
+--------------------------------------------------------------------------------------------+
Dynamic handling:
I'll assume that up to col.col3 the structure remains the same, so we proceed by creating another dataframe:
scala> val df2 = df.withColumn("res_data",explode($"result.data")).select(col("success"),col("res_data"),$"res_data.col3.*")
df2: org.apache.spark.sql.DataFrame = [success: boolean, res_data: struct<col1: string, col2: string ... 2 more fields> ... 6 more fields]
scala> df2.show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|
+-------+-----------------------------------------+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|
+-------+-----------------------------------------+----+----+----+----+----+----+
Other than "success" and "res_data", the rest of the columns are the dynamic ones from "col3"
scala> val p = df2.columns
p: Array[String] = Array(success, res_data, abc, aeio, aero, ddc, def, dyno)
Filter out those two and map the rest of them to Spark Columns:
scala> val m = p.filter(_!="success").filter(_!="res_data").map(col(_))
m: Array[org.apache.spark.sql.Column] = Array(abc, aeio, aero, ddc, def, dyno)
Now pass m:_* as the argument to the least function and you get the results below:
scala> df2.withColumn("minv",least(m:_*)).show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|minv|
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|2 |
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|6 |
+-------+-----------------------------------------+----+----+----+----+----+----+----+
Hope this helps.
dbutils.fs.put("/tmp/test.json", """
{"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}]}
""", true)
val df_json = spark.read.json("/tmp/test.json")
val tf = df_json.withColumn("col3", explode(array($"col3.*"))).toDF
val tmp_group = tf.groupBy("col1").agg( min(tf.col("col3")).alias("col3"))
val top_rows = tf.join(tmp_group, Seq("col3","col1"), "inner")
top_rows.select("col1", "col2", "col3","col4").show()
Wrote 282 bytes.
+-------+-------+----+-------+
| col1| col2|col3| col4|
+-------+-------+----+-------+
| value1| value2| 3| value4|
|value11|value22| 2|value44|
|value12|value23| 6|value43|
+-------+-------+----+-------+
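A further option, not in the answers above (a sketch, assuming Spark 2.4+ and the wrapped JSON from the question): read col3 with an explicit MapType schema so its dynamic keys become map entries, then col5 is just array_min over the map values:
import org.apache.spark.sql.functions.{array_min, col, explode, map_values}
import org.apache.spark.sql.types._

// schema for the wrapped file; col3 is declared as a map so the key names stay dynamic
val jsonSchema = new StructType()
  .add("success", BooleanType)
  .add("result", new StructType()
    .add("data", ArrayType(new StructType()
      .add("col1", StringType)
      .add("col2", StringType)
      .add("col3", MapType(StringType, LongType))
      .add("col4", StringType)))
    .add("total", LongType))

val rows = spark.read.schema(jsonSchema).json("/tmp/stack/pathi.json")
  .select(explode(col("result.data")).as("r"))
  .select(col("r.col1"), col("r.col2"),
          map_values(col("r.col3")).as("col3"),
          col("r.col4"),
          array_min(map_values(col("r.col3"))).as("col5"))

rows.show(false)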

How to build map rows from comma-delimited strings?

var clearedLine = ""
var dict = collection.mutable.Map[String, String]()
val rdd = BufferedSource.map(line => ({
  if (!line.endsWith(", ")) {
    clearedLine = line + ", "
  } else {
    clearedLine = line.trim
  }
  clearedLine.split(",")(0).trim -> clearedLine.split(",")(1).trim
}
  //,clearedLine.split(",")(1).trim->clearedLine.split(",")(0).trim
)
  //dict += clearedLine.split(",")(0).trim.replace(" TO ","->")
)
for ((k, v) <- rdd) printf("key: %s, value: %s\n", k, v)
OUTPUT:
key: EQU EB.AR.DESCRIPT TO 1, value: EB.AR.ASSET.CLASS TO 2
key: EB.AR.CURRENCY TO 3, value: EB.AR.ORIGINAL.VALUE TO 4
I want to split by ' TO ' and then produce a single dict of key -> value; please help. Desired output:
key: 1, value: EQU EB.AR.DESCRIPT
key: 2 value: EB.AR.ASSET.CLASS
key: 3, value: EB.AR.CURRENCY
key: 4, value: EB.AR.ORIGINAL.VALUE
Assuming your input consists of lines like the ones below:
EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2
EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4
try this Scala DataFrame solution:
scala> val df = Seq(("EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2"),("EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]
scala> df.show(false)
+----------------------------------------------+
|a |
+----------------------------------------------+
|EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2|
|EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4|
+----------------------------------------------+
scala> val df2 = df.select(split($"a",",").getItem(0).as("a1"),split($"a",",").getItem(1).as("a2"))
df2: org.apache.spark.sql.DataFrame = [a1: string, a2: string]
scala> df2.show(false)
+-----------------------+--------------------------+
|a1 |a2 |
+-----------------------+--------------------------+
|EQU EB.AR.DESCRIPT TO 1|EB.AR.ASSET.CLASS TO 2 |
|EB.AR.CURRENCY TO 3 | EB.AR.ORIGINAL.VALUE TO 4|
+-----------------------+--------------------------+
scala> val df3 = df2.flatMap( r => { (0 until r.size).map( i=> r.getString(i) ) })
df3: org.apache.spark.sql.Dataset[String] = [value: string]
scala> df3.show(false)
+--------------------------+
|value |
+--------------------------+
|EQU EB.AR.DESCRIPT TO 1 |
|EB.AR.ASSET.CLASS TO 2 |
|EB.AR.CURRENCY TO 3 |
| EB.AR.ORIGINAL.VALUE TO 4|
+--------------------------+
scala> df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).show(false)
+---+---------------------+
|key|value |
+---+---------------------+
|1 |EQU EB.AR.DESCRIPT |
|2 |EB.AR.ASSET.CLASS |
|3 |EB.AR.CURRENCY |
|4 | EB.AR.ORIGINAL.VALUE|
+---+---------------------+
If you want them as "map" column, then
scala> val df4 = df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).select(map($"key",$"value").as("kv"))
df4: org.apache.spark.sql.DataFrame = [kv: map<string,string>]
scala> df4.show(false)
+----------------------------+
|kv |
+----------------------------+
|[1 -> EQU EB.AR.DESCRIPT] |
|[2 -> EB.AR.ASSET.CLASS] |
|[3 -> EB.AR.CURRENCY] |
|[4 -> EB.AR.ORIGINAL.VALUE]|
+----------------------------+
scala> df4.printSchema
root
|-- kv: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
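If all that's needed is a plain driver-side Map (no Spark), a minimal sketch that splits each line on ',' and each piece on ' TO ', mapping the number to the field name (assuming lines like the two shown above):
val lines = Seq(
  "EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2",
  "EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")

val dict: Map[String, String] = lines
  .flatMap(_.split(","))          // one entry per "name TO number" piece
  .map(_.trim)
  .map { piece =>
    val Array(name, key) = piece.split(" TO ")
    key.trim -> name.trim          // number becomes the key, field name the value
  }
  .toMap

dict.foreach { case (k, v) => printf("key: %s, value: %s\n", k, v) }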