Scala Dataframe/SQL: Dynamic Column Selection for Reporting

Context:
We have a model with 5 flows.
Flow-1: input data from different sources; these are the inputs to Flow-2.
Flow-2, Flow-3 and Flow-4: ML models, each dumping a few fields to S3.
Flow-5: reporting layer built on the outputs of Flow-2, Flow-3 and Flow-4.
The overall data size is very small.
Problem:
Flow-5, the reporting layer, is based on a few SQL queries whose input data comes from Flow-2, Flow-3 and Flow-4.
Flow-2, Flow-3 and Flow-4 share one common joining field; the remaining fields differ.
We can write SQL joining the Flow-2/3/4 data stored in three different tables, with a few calculations/aggregations. However, the number of output fields from Flow-2/3/4 may vary in each run.
Problem-1: every time the S3 file (Flow-2/3/4) structure changes, the COPY often fails because the target table schema differs from the S3 file structure (to fix it, we have to manually add/delete fields in the target table to align it with the S3 data).
Problem-2: for any addition/deletion in the S3 files, we have to change the reports by adding/removing columns.
Approach:
SQL way: standardize the S3 dump, the target table and the report SQL, i.e. fix the set of possible columns in each flow's (2, 3, 4) output and in the target table, so that any field that is not available is simply loaded as NULL/blank during the S3 dump and COPYed as blank. Standardize the target table structure to match the S3 template, and standardize the reporting SQL as well.
Scala/Spark: currently exploring this option. To perform a PoC, I created two S3 dumps, created two DataFrames in Scala, and tried both DataFrame joins and Spark SQL joins. I am still not sure whether there is any way to pick up new columns dynamically, i.e. to make the Spark code generic (see the schema-alignment sketch below).
By creating DataFrames pointing directly at S3, we can solve the problem of COPYing data with dynamic fields into the target table.
However, the reporting SQL issue still persists (or at least I don't yet know how it can be handled).
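One way to combine the two ideas in Spark is to align whatever columns a run actually produced with a fixed template, adding the missing ones as NULLs before the dump/COPY. The sketch below is illustrative only: the template column list, the numeric type and the output path are assumptions, not part of the original design.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

// Hypothetical template: the full set of columns the target table expects.
val templateCols = Seq("country", "1st_buy", "2nd_buy", "3rd_buy")

// Add any template column the run did not produce as a NULL column,
// then select in template order so the output always matches the target table.
def alignToTemplate(df: DataFrame, template: Seq[String]): DataFrame = {
  val padded = template.foldLeft(df) { (acc, c) =>
    if (acc.columns.contains(c)) acc
    else acc.withColumn(c, lit(null).cast(DoubleType)) // type assumed numeric; adjust per column
  }
  padded.select(template.map(col): _*)
}

// Usage with a DataFrame read straight from the flow's S3 output:
// val standardized = alignToTemplate(df1, templateCols)
// standardized.write.mode("overwrite").parquet("s3://bucket/flow2_standardized/") // path is illustrative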
Question:
Is there any way to handle this issue (dynamic column selection in SQL) in Scala/Spark SQL?
Note: the S3 file always contains only the required fields from each run. The issue is not reading the dynamic fields from S3 (the DataFrame takes care of that automatically); the issue is how to make the SQL (Spark SQL / DataFrame API) code handle them.
Example/Scenario-1:
Run-1
df1 = country, 1st_buy (the dataframe points directly at S3, which has only the required attributes)
df2 = country, percent (the dataframe points directly at S3, which has only the required attributes)
--Sample SQL code (need to convert to SparkSQL/DataFrame API)
SELECT df1.1st_buy,
df1.1st_buy/df2.percent -- DERIVED FIELD
FROM df1, df2
WHERE df1.country = df2.country
Run-2 (one additional column was added to df1)
df1 = country, 1st_buy, 2nd_buy (the dataframe points directly at S3, which has only the required attributes)
df2 = country, percent (the dataframe points directly at S3, which has only the required attributes)
****how can we handle the new field 2nd_buy dynamically?****
--Sample SQL code (need to convert to SparkSQL/DataFrame API)
SELECT df1.1st_buy,
df1.2nd_buy,
df1.1st_buy/df2.percent, -- DERIVED FIELD
df1.2nd_buy/df2.percent -- DERIVED FIELD
FROM df1, df2
WHERE df1.country = df2.country
Example/Scenario-2:
Run-1
df1 = country, 1st_buy (the dataframe points directly at S3, which has only the required attributes)
--Sample SQL
SELECT country, sum(df1.1st_buy)
FROM df1
GROUP BY country
--Dataframe API/SparkSQL
df1.groupBy("country").sum("1st_buy").show()
Run-2 (one additional column was added to df1)
df1 = country, 1st_buy, 2nd_buy (the dataframe points directly at S3, which has only the required attributes)
****how can we handle the new field 2nd_buy dynamically?****
--Sample SQL
SELECT country, sum(df1.1st_buy), sum(df1.2nd_buy)
FROM df1
GROUP BY country
--Dataframe API/SparkSQL
df1.groupBy("country").sum("1st_buy", "2nd_buy").show()

A few options:
1. Build the SQL string dynamically and run it (a runnable sketch of this follows the full example below):
val sqlScript = "select col1, col2, .... from ... " // the string can be assembled at runtime
val df = spark.sql(sqlScript)
2. Use a schema, e.g.
val schema = StructType(Seq(
  StructField("id", LongType, true),
  ....
))
// and then use schema.fieldNames, or build
val cols: List[Column] = ...
// and use it in df.select(cols: _*)
3. Get the schema (the list of fields) from a JSON file:
package spark

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}
import scala.io.Source

object DFFieldsWithJson extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class TestData(
    id: Int,
    firstName: String,
    lastName: String,
    descr: String
  )

  val dataTestDF = Seq(
    TestData(1, "First Name 1", "Last Name 1", "Description 1"),
    TestData(2, "First Name 2", "Last Name 2", "Description 2"),
    TestData(3, "First Name 3", "Last Name 3", "Description 3")
  ).toDF()

  dataTestDF.show(false)
  // +---+------------+-----------+-------------+
  // |id |firstName   |lastName   |descr        |
  // +---+------------+-----------+-------------+
  // |1  |First Name 1|Last Name 1|Description 1|
  // |2  |First Name 2|Last Name 2|Description 2|
  // |3  |First Name 3|Last Name 3|Description 3|
  // +---+------------+-----------+-------------+

  val schemaJson =
    """{ "type" : "struct",
      |"fields" : [
      |{
      | "name" : "id",
      | "type" : "integer",
      | "nullable" : true,
      | "metadata" : { }
      | },
      | {
      | "name" : "firstName",
      | "type" : "string",
      | "nullable" : true,
      | "metadata" : {}
      | },
      | {
      | "name" : "lastName",
      | "type" : "string",
      | "nullable" : true,
      | "metadata" : {}
      | }
      | ]}""".stripMargin

  val schemaSource = schemaJson.mkString
  val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
  println(schemaFromJson)
  // StructType(StructField(id,IntegerType,true), StructField(firstName,StringType,true), StructField(lastName,StringType,true))

  val cols: List[String] = schemaFromJson.fieldNames.toList
  val col: List[Column] = cols.map(dataTestDF(_))
  val df = dataTestDF.select(col: _*)

  df.printSchema()
  // root
  // |-- id: integer (nullable = false)
  // |-- firstName: string (nullable = true)
  // |-- lastName: string (nullable = true)

  df.show(false)
  // +---+------------+-----------+
  // |id |firstName   |lastName   |
  // +---+------------+-----------+
  // |1  |First Name 1|Last Name 1|
  // |2  |First Name 2|Last Name 2|
  // |3  |First Name 3|Last Name 3|
  // +---+------------+-----------+
}
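To make option 1 concrete, here is a small sketch of generating the SQL text from whatever columns the current run produced. The view name, the key column "country" and the sum() aggregation are assumptions chosen to match the question's Scenario-2; adapt as needed.
// df1 is the DataFrame read directly from the flow's S3 output for this run.
df1.createOrReplaceTempView("flow_data")

val valueCols = df1.columns.filterNot(_ == "country")
val selectList = valueCols.map(c => s"sum(`$c`) as sum_$c").mkString(", ")

val sqlScript = s"select country, $selectList from flow_data group by country"
val report = spark.sql(sqlScript)
report.show()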

Examples:
package spark

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, column, sum}

object DynamicColumnSelection extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class c1(
    country: String,
    st_buy: Double,
    nd_buy: Double
  )

  case class c2(
    country: String,
    percent: Double
  )

  val df1 = Seq(
    c1("UA", 2, 4),
    c1("PL", 3, 6),
    c1("GB", 4, 8)
  ).toDF()

  df1.show(false)
  // +-------+------+------+
  // |country|st_buy|nd_buy|
  // +-------+------+------+
  // |UA     |2.0   |4.0   |
  // |PL     |3.0   |6.0   |
  // |GB     |4.0   |8.0   |
  // +-------+------+------+

  val df2 = Seq(
    c2("UA", 2.21),
    c2("PL", 3.26)
  ).toDF()

  df2.show(false)
  // +-------+-------+
  // |country|percent|
  // +-------+-------+
  // |UA     |2.21   |
  // |PL     |3.26   |
  // +-------+-------+

  // Inner join
  val df = df1.join(df2, df1.col("country") === df2.col("country"), "inner")
    .select(
      df1.col("country"),
      df1.col("st_buy"),
      df1.col("nd_buy"),
      df2.col("percent")
    )

  df.show(false)
  // +-------+------+------+-------+
  // |country|st_buy|nd_buy|percent|
  // +-------+------+------+-------+
  // |UA     |2.0   |4.0   |2.21   |
  // |PL     |3.0   |6.0   |3.26   |
  // +-------+------+------+-------+

  val res1DF = df.withColumn("st_buy_percent", 'st_buy / 'percent)
    .withColumn("nd_buy_percent", 'nd_buy / 'percent)

  res1DF.show(false)
  // +-------+------+------+-------+------------------+------------------+
  // |country|st_buy|nd_buy|percent|st_buy_percent    |nd_buy_percent    |
  // +-------+------+------+-------+------------------+------------------+
  // |UA     |2.0   |4.0   |2.21   |0.9049773755656109|1.8099547511312217|
  // |PL     |3.0   |6.0   |3.26   |0.9202453987730062|1.8404907975460123|
  // +-------+------+------+-------+------------------+------------------+

  // GroupBy + sum
  val data = Seq(
    c1("UA", 2, 4),
    c1("PL", 3, 6),
    c1("UA", 5, 10),
    c1("PL", 6, 12),
    c1("GB", 4, 8)
  ).toDF()

  val resGroupByDF = data
    .groupBy("country")
    .agg(
      sum("st_buy").alias("sum_st_buy"),
      sum("nd_buy").alias("sum_nd_buy")
    )

  resGroupByDF.show(false)
  // +-------+----------+----------+
  // |country|sum_st_buy|sum_nd_buy|
  // +-------+----------+----------+
  // |UA     |7.0       |14.0      |
  // |PL     |9.0       |18.0      |
  // |GB     |4.0       |8.0       |
  // +-------+----------+----------+

  val resGroupByDF1 = data.groupBy($"country").sum()
  resGroupByDF1.show(false)
  // +-------+-----------+-----------+
  // |country|sum(st_buy)|sum(nd_buy)|
  // +-------+-----------+-----------+
  // |UA     |7.0        |14.0       |
  // |PL     |9.0        |18.0       |
  // |GB     |4.0        |8.0        |
  // +-------+-----------+-----------+

  val exprs = data.columns.map(sum(_))
  val resGroupByDF2 = data.groupBy($"country").agg(exprs.head, exprs.tail: _*)
  resGroupByDF2.show(false)
  // +-------+------------+-----------+-----------+
  // |country|sum(country)|sum(st_buy)|sum(nd_buy)|
  // +-------+------------+-----------+-----------+
  // |UA     |null        |7.0        |14.0       |
  // |PL     |null        |9.0        |18.0       |
  // |GB     |null        |4.0        |8.0        |
  // +-------+------------+-----------+-----------+

  val exprs3 = List("st_buy", "nd_buy").map(sum(_))
  val resGroupByDF3 = data.groupBy($"country").agg(exprs3.head, exprs3.tail: _*)
  resGroupByDF3.show(false)
  // +-------+-----------+-----------+
  // |country|sum(st_buy)|sum(nd_buy)|
  // +-------+-----------+-----------+
  // |UA     |7.0        |14.0       |
  // |PL     |9.0        |18.0       |
  // |GB     |4.0        |8.0        |
  // +-------+-----------+-----------+

  val exprs4 = data.columns.toList.filter(_ != "country").map(sum(_))
  val resGroupByDF4 = data.groupBy($"country").agg(exprs4.head, exprs4.tail: _*)
  resGroupByDF4.show(false)
  // +-------+-----------+-----------+
  // |country|sum(st_buy)|sum(nd_buy)|
  // +-------+-----------+-----------+
  // |UA     |7.0        |14.0       |
  // |PL     |9.0        |18.0       |
  // |GB     |4.0        |8.0        |
  // +-------+-----------+-----------+

  // Select
  val cols = data.columns.toSeq
  val selectDF1 = data.select(cols.head, cols.tail: _*)
  selectDF1.show(false)
  // +-------+------+------+
  // |country|st_buy|nd_buy|
  // +-------+------+------+
  // |UA     |2.0   |4.0   |
  // |PL     |3.0   |6.0   |
  // |UA     |5.0   |10.0  |
  // |PL     |6.0   |12.0  |
  // |GB     |4.0   |8.0   |
  // +-------+------+------+
}
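In the example above, the derived st_buy_percent / nd_buy_percent columns are still hard-coded in withColumn. To cover the case from the question, where df1 gains new columns between runs, the derived fields can be generated from df1.columns as well. A sketch, assuming the df1/df2 defined above, with "country" as the only join key and "percent" as the single divisor column coming from df2:
import org.apache.spark.sql.functions.col

// Build "<col>_percent" derived fields from whatever columns df1 has in this run.
val buyCols = df1.columns.filterNot(_ == "country")
val derived = buyCols.map(c => (col(c) / col("percent")).alias(s"${c}_percent"))
val selectCols = Seq(col("country")) ++ buyCols.map(col) ++ derived

val dynamicReport = df1
  .join(df2, Seq("country"), "inner")
  .select(selectCols: _*)

dynamicReport.show(false)
// A column added to df1 in a later run automatically gets a matching "_percent" field
// without any change to this code.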

Related

Is it possible to apply when.otherwise functions within agg after a groupBy?

I've recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. As far as I can see, I'm getting the following error.
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
From the following function.
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
And applying it the following way.
val initialDF = Seq(
  ("a", "b", 1),
  ("a", "b", null),
  ("a", null, 0)
).toDF("field1", "field2", "field3")

initialDF
  .groupBy("field1", "field2")
  .agg(
    defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
  )
Am I trying to do black magic here, or is it something I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
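For illustration only, a value-level version would look like the sketch below (the name defaultValueUDF is hypothetical); returning an Option lets Spark map None to null:
import org.apache.spark.sql.functions.{count, lit, udf}

// Hypothetical UDF operating on plain values instead of Columns.
val defaultValueUDF = udf((n: Long, value: Long) =>
  if (n == value) None else Some(n)
)

// initialDF
//   .groupBy("field1", "field2")
//   .agg(defaultValueUDF(count("field3"), lit(0L)).as("counter"))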
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
  Record("a", "b", 1),
  Record("a", "b", null),
  Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}

df
  .groupBy("field1", "field2")
  .agg(defaultFunction(count("field3"), lit(0)).as("counter"))
  .show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+

Need to flatten a dataframe on the basis of one column in Scala

I have a dataframe with the below schema
root
|-- name: string (nullable = true)
|-- roll: string (nullable = true)
|-- subjectID: string (nullable = true)
The values in the dataframe are as below
+-------------------+---------+--------------------+
| name| roll| SubjectID|
+-------------------+---------+--------------------+
| sam|ta1i3dfk4| xy|av|mm|
| royc|rfhqdbnb3| a|
| alcaly|ta1i3dfk4| xx|zz|
+-------------------+---------+--------------------+
I need to derive the dataframe by flattening subjectID as below.
Please note: SubjectID is a string as well.
+-------------------+---------+--------------------+
| name| roll| SubjectID|
+-------------------+---------+--------------------+
| sam|ta1i3dfk4| xy|
| sam|ta1i3dfk4| av|
| sam|ta1i3dfk4| mm|
| royc|rfhqdbnb3| a|
| alcaly|ta1i3dfk4| xx|
| alcaly|ta1i3dfk4| zz|
+-------------------+---------+--------------------+
Any suggestions?
You can use the explode function to flatten it.
Example:
import org.apache.spark.sql.functions.{col, explode, split}

val inputDF = Seq(
  ("sam", "ta1i3dfk4", "xy|av|mm"),
  ("royc", "rfhqdbnb3", "a"),
  ("alcaly", "rfhqdbnb3", "xx|zz")
).toDF("name", "roll", "subjectIDs")

// split and explode `subjectIDs`
val resultDF = inputDF
  .withColumn("subjectIDs", split(col("subjectIDs"), "\\|"))
  .withColumn("subjectIDs", explode($"subjectIDs"))

resultDF.show()
+------+---------+----------+
| name| roll|subjectIDs|
+------+---------+----------+
| sam|ta1i3dfk4| xy|
| sam|ta1i3dfk4| av|
| sam|ta1i3dfk4| mm|
| royc|rfhqdbnb3| a|
|alcaly|rfhqdbnb3| xx|
|alcaly|rfhqdbnb3| zz|
+------+---------+----------+
You can use flatMap on a Dataset. Full executable code:
package main

import org.apache.spark.sql.{Dataset, SparkSession}

object Main extends App {

  case class Roll(name: Option[String], roll: Option[String], subjectID: Option[String])

  val mySpark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Spark SQL basic example")
    .getOrCreate()

  import mySpark.implicits._

  val inputDF: Dataset[Roll] = Seq(
    ("sam", "ta1i3dfk4", "xy|av|mm"),
    ("royc", "rfhqdbnb3", "a"),
    ("alcaly", "rfhqdbnb3", "xx|zz")
  ).toDF("name", "roll", "subjectID").as[Roll]

  val out: Dataset[Roll] = inputDF.flatMap {
    case Roll(n, r, Some(ids)) if ids.nonEmpty =>
      ids.split("\\|").map(id => Roll(n, r, Some(id)))
    case x => Some(x)
  }

  out.show()
}
Notes:
you can use split('|') instead of split("\\|")
you can change the default handling if the id must be non-empty

Spark functions.coalesce not working on mongodb collections but works on CSVs

The coalesce logic works fine on CSV files:
e1.csv
id,code,type
1,,A
2,,
3,123,I
e2.csv
id,code,type
1,456,A
2,789,A1
3,,C
Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("C:\\Users\\System2\\Videos\\folder\\e1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("C:\\Users\\System2\\Videos\\folder\\e2.csv");
Column[] coalescedColumns =
    // df1.columns().stream()
    Stream.of(df1.columns())
        .map(name -> functions.coalesce(df1.col(name), df2.col(name)).as(name))
        .toArray(Column[]::new);
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id == b.id").select(coalescedColumns);
I am not able to apply this logic on the Mongo collections.
What I have tried on the Mongo collections:
String host= "localhost";
String port = "27017";
String DB = "Provider_Golden";
SparkConf conf = new SparkConf().setAppName("cust data").setMaster(mst);
SparkSession spark = SparkSession
.builder()
.config(conf)
.config("spark.mongodb.input.uri", "mongodb://"+host+":"+port+"/"+DB+".T")
.config("spark.mongodb.output.uri","mongodb://"+host+":"+port+"/"+DB+".T")
// .config("spark.exeuctor.extraJavaOptions","-XX:+UseG1GC")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Create a custom ReadConfig for e1
Map<String, String> readE1 = new HashMap<String, String>();
readE1.put("collection", "e1");
readE1.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig_E1 = ReadConfig.create(spark).withOptions(readE1);
Dataset<Row> e1 = MongoSpark.load(jsc,readConfig_E1).toDF();
// Create a custom ReadConfig for e2
Map<String, String> readE2 = new HashMap<String, String>();
readE2.put("collection", "e2");
readE2.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig_E2 = ReadConfig.create(spark).withOptions(readE2);
Dataset<Row> e2 = MongoSpark.load(jsc,readConfig_E2 ).toDF();
Column[] coalescedColumns =
    Stream.of(e1.columns())
        .map(name -> functions.coalesce(e1.col(name), e2.col(name)).as(name))
        .toArray(Column[]::new);
Dataset<Row> goldenCopy = e1.as("a").join(e2.as("b")).where("a.id == b.id").select(coalescedColumns);
Expected result (the same logic works on the CSV files):
+---+----+----+
| id|code|type|
+---+----+----+
| 1| 456| A|
| 2| 789| A1|
| 3| 123| I|
+---+----+----+
Actual result
+--------------------+----+---+----+
| _id|code| id|type|
+--------------------+----+---+----+
|[5cab0a4bb12dfd5f...| | 1| A|
|[5cab0a4bb12dfd5f...| 123| 3| I|
|[5cab0a4bb12dfd5f...| | 2| |
+--------------------+----+---+----+
When you mongoimport your CSVs, you may get empty strings instead of nulls. coalesce works on null values, not on the empty string '', so I believe you have to replace "" with null.
Dataset<Row> e1 = MongoSpark.load(jsc,readConfig_E1).toDF();
String[] columns = e1.columns();
Dataset<Row> e1_NULLs= replaceNull(e1, columns );
e1_NULLs.show();
The replaceNull function:
private static Dataset<Row> replaceNull(Dataset<Row> df, String[] cl) {
    for (String g : cl) {
        // skip the Mongo _id column
        if (!g.equals("_id")) {
            df = df.withColumn(g,
                functions.when(df.col(g).equalTo(""), null).otherwise(df.col(g))
            );
        }
    }
    return df.drop("_id");
}
Run replaceNull on both datasets, then apply the coalesce logic to the newly created datasets that contain nulls.

Scala Spark - empty map on DataFrame column for map(String, Int)

I am joining two DataFrames that have columns of type Map[String, Int].
I want the merged DF to have an empty map [] rather than null in the Map-type columns.
val df = dfmerged
  .select(
    col("id"),
    coalesce(col("map_1"), lit(null).cast(MapType(StringType, IntegerType))).alias("map_1"),
    coalesce(col("map_2"), lit(Map.empty[String, Int])).alias("map_2")
  )
For the map_1 column a null will be inserted, but I'd like to have an empty map.
map_2 is giving me an error:
java.lang.RuntimeException: Unsupported literal type class
scala.collection.immutable.Map$EmptyMap$ Map()
I've also tried with a udf function like:
case class myStructMap(x: Map[String, Int])
val emptyMap = udf(() => myStructMap(Map.empty[String, Int]))
That also did not work.
When I try something like:
.select( coalesce(col("myMapCol"), lit(map())).alias("brand_viewed_count")...
or
.select(coalesce(col("myMapCol"), lit(map().cast(MapType(LongType, LongType)))).alias("brand_viewed_count")...
I get the error:
cannot resolve 'map()' due to data type mismatch: cannot cast
MapType(NullType,NullType,false) to MapType(LongType,IntType,true);
In Spark 2.2 or later:
import org.apache.spark.sql.functions.typedLit
val df = Seq((1L, null), (2L, Map("foobar" -> 42))).toDF("id", "map")
df.withColumn("map", coalesce($"map", typedLit(Map[String, Int]()))).show
// +---+-----------------+
// | id| map|
// +---+-----------------+
// | 1| Map()|
// | 2|Map(foobar -> 42)|
// +---+-----------------+
Before Spark 2.2:
df.withColumn("map", coalesce($"map", map().cast("map<string,int>"))).show
// +---+-----------------+
// | id| map|
// +---+-----------------+
// | 1| Map()|
// | 2|Map(foobar -> 42)|
// +---+-----------------+

How to properly iterate/print a parquet using scala/spark?

How do I println the individual elements of a parquet file containing a nested array of objects in Spark/Scala?
{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]}
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}
Specifically, I want to be able to iterate over the object and print out the id, name, and age... then each item in the path, and then move on to printing the next record, and so forth. Assuming I have read in the parquet file and have the dataframe, I want to do something like the following (pseudocode):
val records = dataframe.map { row =>
  val id = row.getString("id")
  val name = row.getString("name")
  val age = row.getString("age")
  println(s"${id} ${name} ${age}")
  row.getArray("path").map { item =>
    val x = item.getValue("x")
    val y = item.getValue("y")
    println(s"${x} ${y}")
  }
}
Not sure if the above is the right way to go about it, but it should give you an idea of what I am trying to do.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()

import spark.implicits._

val data1 = spark.read.json("/home/sakoirala/IdeaProjects/SparkSolutions/src/main/resources/explode.json")

val result = data1.withColumn("path", explode($"path"))

result.withColumn("x", result("path.x"))
  .withColumn("y", result("path.y"))
  .show()
Output:
+---+----+-------+-----------+----+----+
|age| id| name| path| x| y|
+---+----+-------+-----------+----+----+
| 25|1201| satish| [1.0,1.0]| 1.0| 1.0|
| 25|1201| satish| [2.0,2.0]| 2.0| 2.0|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
+---+----+-------+-----------+----+----+
You can do this entirely using the Dataframe API; no need to use map.
Here is how you can easily flatten your schema by projecting the fields you want to use:
val records = dataframe.select("id", "age", "path.x", "path.y")
You can then print your data using show:
records.show()
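If the goal is literally to println every flattened record, and the data is small enough to collect to the driver (an assumption), either answer's output can be printed row by row. A sketch based on the exploded result DataFrame from the first answer:
result
  .select($"id", $"name", $"age", $"path.x".as("x"), $"path.y".as("y"))
  .collect()
  .foreach { row =>
    // id/name/age are strings in the sample data; x/y are doubles
    println(s"${row.getAs[String]("id")} ${row.getAs[String]("name")} ${row.getAs[String]("age")} " +
      s"${row.getAs[Double]("x")} ${row.getAs[Double]("y")}")
  }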