Spark provide list of all columns in DataFrame groupBy [duplicate] - scala

I need to group the DataFrame by all columns except "tag".
Right now I can do it in the following way:
unionDf.groupBy("name", "email", "phone", "country").agg(collect_set("tag").alias("tags"))
Is it possible to get all columns (except "tag") and pass them to the groupBy method without hardcoding them as I do now ("name", "email", "phone", "country")?
I tried unionDf.groupBy(unionDf.columns), but it doesn't work.

Here's one approach:
import org.apache.spark.sql.functions._
val df = Seq(
("a", "b#c.com", "123", "US", "ab1"),
("a", "b#c.com", "123", "US", "ab2"),
("d", "e#f.com", "456", "US", "de1")
).toDF("name", "email", "phone", "country", "tag")
val groupCols = df.columns.diff(Seq("tag"))
df.groupBy(groupCols.map(col): _*).agg(collect_set("tag").alias("tags")).show
// +----+-------+-----+-------+----------+
// |name| email|phone|country| tags|
// +----+-------+-----+-------+----------+
// | d|e#f.com| 456| US| [de1]|
// | a|b#c.com| 123| US|[ab2, ab1]|
// +----+-------+-----+-------+----------+
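For context on why unionDf.groupBy(unionDf.columns) does not compile: columns is an Array[String], while groupBy is overloaded for (cols: Column*) and (col1: String, cols: String*), so the array has to be expanded into varargs. A minimal sketch of the String-based variant follows; the object name GroupByAllExcept and the helper collectExcluded are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_set

object GroupByAllExcept {
  // Hypothetical helper: group by every column except `excluded` and
  // collect the distinct values of `excluded` into a column "<excluded>s".
  def collectExcluded(df: DataFrame, excluded: String): DataFrame = {
    val keys = df.columns.filter(_ != excluded).toSeq
    require(keys.nonEmpty, "need at least one grouping column")
    // groupBy(col1: String, cols: String*) takes varargs, so the sequence
    // is split into head and tail and the tail is expanded with `: _*`.
    df.groupBy(keys.head, keys.tail: _*)
      .agg(collect_set(excluded).alias(excluded + "s"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-by-all-except")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq(
      ("a", "b#c.com", "123", "US", "ab1"),
      ("a", "b#c.com", "123", "US", "ab2"),
      ("d", "e#f.com", "456", "US", "de1")
    ).toDF("name", "email", "phone", "country", "tag")

    collectExcluded(df, "tag").show()
    spark.stop()
  }
}
```

Both variants should produce the same grouping; the Column-based one in the answer avoids the head/tail split.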

Related

Spark SQL - Check for a value in multiple columns

I have a status dataset like below:
I want to select all the rows from this dataset which have "FAILURE" in any of these 5 status columns.
So, I want the result to contain only IDs 1,2,4 as they have FAILURE in one of the Status columns.
I guess in SQL we can do something like below:
SELECT * FROM status WHERE "FAILURE" IN (Status1, Status2, Status3, Status4, Status5);
In Spark, I know I can do a filter by comparing each Status column with "FAILURE":
status.filter(s => {s.Status1.equals(FAILURE) || s.Status2.equals(FAILURE) ... and so on..})
But I would like to know if there is a smarter way of doing this in Spark SQL.
Thanks in advance!
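In case there are many columns to be examined, consider a recursive function that short-circuits upon the first match, as shown below. Note that, as written, checkFor returns true only when none of the columns equals elem, so the example keeps the rows that contain no "F"; swap the two boolean literals to keep the rows that do contain a match: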
val df = Seq(
(1, "T", "F", "T", "F"),
(2, "T", "T", "T", "T"),
(3, "T", "T", "F", "T")
).toDF("id", "c1", "c2", "c3", "c4")
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def checkFor(elem: Column, cols: List[Column]): Column = cols match {
case Nil =>
lit(true)
case h :: tail =>
when(h === elem, lit(false)).otherwise(checkFor(elem, tail))
}
val cols = df.columns.filter(_.startsWith("c")).map(col).toList
df.where(checkFor(lit("F"), cols)).show
// +---+---+---+---+---+
// | id| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 2| T| T| T| T|
// +---+---+---+---+---+
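For the original question (keeping the rows where any of the status columns equals a given value), a non-recursive alternative is to build one equality test per column and OR them together. The sketch below assumes this is acceptable for the use case; the object name AnyColumnEquals and the helper anyEquals are hypothetical, and whether Spark stops evaluating at the first match is left to the engine rather than guaranteed by this code.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object AnyColumnEquals {
  // Hypothetical helper: true when at least one of `columns` equals `value`.
  // Starting the fold from lit(false) keeps it safe for an empty column list.
  def anyEquals(columns: Seq[String], value: Any): Column =
    columns.map(c => col(c) === value).foldLeft(lit(false))(_ || _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("any-column-equals")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the recursive example above.
    val df = Seq(
      (1, "T", "F", "T", "F"),
      (2, "T", "T", "T", "T"),
      (3, "T", "T", "F", "T")
    ).toDF("id", "c1", "c2", "c3", "c4")

    val statusCols = df.columns.filter(_.startsWith("c")).toSeq
    // Keeps the rows with ids 1 and 3, the ones containing at least one "F".
    df.where(anyEquals(statusCols, "F")).show()
    spark.stop()
  }
}
```

For the status dataset in the question, the same helper would be called with the five Status column names and "FAILURE".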
Here is a similar example that adds a count column; you can adapt it and filter on the new column. It checks for zeroes in every column except the first:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
("r1", 0.0, 0.0, 0.0, 0.0),
("r2", 6.4, 4.9, 6.3, 7.1),
("r3", 4.2, 0.0, 7.2, 8.4),
("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
val count_some_val = df.columns.tail.map(x => when(col(x) === 0.0, 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("some_val_count", count_some_val)
df2.filter(col("some_val_count") > 0).show(false)
As far as I know, it is not easy to stop at the first match with column expressions, but I remember someone smarter than me showing me this approach, which uses exists over the row values and which I think does stop at the first match. It is a different approach, but one that I like:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
("r1", 0.0, 0.0, 0.0, 0.0),
("r2", 6.0, 4.9, 6.3, 7.1),
("r3", 4.2, 0.0, 7.2, 8.4),
("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
df.map{r => (r.getString(0),r.toSeq.tail.exists(c =>
c.asInstanceOf[Double]==0))}
.toDF("ID","ones")
.show()
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("Prop1", "SUCCESS", "SUCCESS", "SUCCESS", "FAILURE" ,"SUCCESS"),
| ("Prop2", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop3", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS" ),
| ("Prop4", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop5", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS","SUCCESS")
| ).toDF("Name", "Status1", "Status2", "Status3", "Status4","Status5")
df: org.apache.spark.sql.DataFrame = [Name: string, Status1: string ... 4 more fields]
scala> df.show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop2|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop3|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop5|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
+-----+-------+-------+-------+-------+-------+
scala> df.where($"Name".isin("Prop1","Prop4") and $"Status1".isin("SUCCESS","FAILURE")).show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
+-----+-------+-------+-------+-------+-------+

Scala Dataframe/SQL: Dynamic Column Selection for Reporting

Context:
We have a model which has 5 flows.
Flow-1 : Input Data from different sources , these are input to Flow-2
Flow-2, Flow-3 and Flow-4 : are ML models each dumping few fields to s3
Flow-5 : Reporting Layer with data from output of Flow-2,Flow-3,Flow-4
Overall data size is very small
Problem:
Flow-5 is the reporting layer, based on a few SQL queries whose input data comes from Flow-2, Flow-3 and Flow-4.
Flow-2, Flow-3 and Flow-4 have one common joining field; the remaining fields differ.
We can create a SQL query joining the Flow-2, 3 and 4 data stored in three different tables, with a few calculations/aggregations. However, the number of output fields from Flow-2, 3 and 4 may vary in each run.
Problem-1: the s3 file (Flow-2/3/4) structure changes every time, which often causes issues during COPY because the target table schema differs from the s3 file structure (to fix it, fields have to be manually added to or deleted from the target table to align it with the s3 data).
Problem-2: for any addition/deletion in the s3 files, the reports have to be changed by adding/removing columns.
Approach:
SQL way: standardize the s3 dump, the target table and the report SQL, i.e. standardize the number of possible columns in each flow (2, 3, 4) output and in the target table, so that any fields that are not available are loaded as NULL/blank during the s3 dump and COPY. Standardize the target table structure to align with the s3 template. Also standardize the reporting SQL.
Scala/Spark: currently exploring this option. For a PoC, I created two s3 dumps and two dataframes in Scala, and tried dataframe joins as well as a Spark SQL join. I'm still not sure whether there is any way to pick up new columns dynamically, i.e. to make the Spark code generic.
By creating dataframes that point directly to s3, we can solve the problem of copying data (with dynamic fields) to the target table.
However, the reporting SQL issue still persists (or at least I don't know how it can be handled).
Question:
Is there any way to handle this issue (dynamic column selection in SQL) in Scala/Spark SQL?
Note: the s3 file always contains only the required fields from each run. The issue is not reading the dynamic fields from s3 (the dataframe takes care of that automatically); the issue is how to make the SQL (Spark SQL/API) code handle them.
Example/Scenario-1:
Run-1
df1= country,1st_buy (here dataframe directly points to s3, which has only required attributes)
df2= country,percent (here dataframe directly points to s3, which has only required attributes)
--Sample SQL code (need to convert to sparkSQL/dataframe API)
SELECT df1.1st_buy,
df1.1st_buy/df2.percent -- DERIVED FIELD
FROM df1,df2
WHERE df1.country=df2.country
Run-2 (here one additional column was added to df1)
df1= country,1st_buy,2nd_buy (here dataframe directly points to s3, which has only required attributes)
df2= country,percent (here dataframe directly points to s3, which has only required attributes)
****how can we handle this part of new field 2nd_buy dynamically****
--Sample SQL code (need to convert to sparkSQL/dataframe API)
SELECT df1.1st_buy,
df1.2nd_buy,
df1.1st_buy/df2.percent, -- DERIVED FIELD
df1.2nd_buy/df2.percent -- DERIVED FIELD
FROM df1,df2
WHERE df1.country=df2.country
Example/Scenario-2:
Run-1
df1= country,1st_buy (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT country,sum(df1.1st_buy)
FROM df1
GROUP BY country
--Dataframe API/SparkSQL
df1.groupBy("country").sum("1st_buy").show()
Run-2 (here one additional column was added to df1)
df1= country,1st_buy,2nd_buy (here dataframe directly points to s3, which has only required attributes)
****how can we handle this part of new field 2nd_buy dynamically****
--Sample SQL
SELECT country,sum(df1.1st_buy),sum(2nd_buy)
FROM df1
GROUP BY country
--Dataframe API/SparkSQL
df1.groupBy("country").sum("1st_buy","2nd_buy").show()
{
1. Build the SQL string dynamically:
val sqlScript = "select col1, col2, .... from ... "
// this string can be assembled dynamically
val df = spark.sql(sqlScript)
2. Try using a schema:
val schema = StructType(Seq(
StructField("id",LongType,true),
....
))
// and then use schema.fieldNames... or
val cols: List[Column] = ...
// in df.select(cols:_*)
3. Get the schema (list of fields) from a JSON file:
package spark
import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}
import scala.io.Source
object DFFieldsWithJson extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
case class TestData (
id: Int,
firstName: String,
lastName: String,
descr: String
)
val dataTestDF = Seq(
TestData(1, "First Name 1", "Last Name 1", "Description 1"),
TestData(2, "First Name 2", "Last Name 2", "Description 2"),
TestData(3, "First Name 3", "Last Name 3", "Description 3")
).toDF()
dataTestDF.show(false)
// +---+------------+-----------+-------------+
// |id |firstName |lastName |descr |
// +---+------------+-----------+-------------+
// |1 |First Name 1|Last Name 1|Description 1|
// |2 |First Name 2|Last Name 2|Description 2|
// |3 |First Name 3|Last Name 3|Description 3|
// +---+------------+-----------+-------------+
val schemaJson =
"""{ "type" : "struct",
|"fields" : [
|{
| "name" : "id",
| "type" : "integer",
| "nullable" : true,
| "metadata" : { }
| },
| {
| "name" : "firstName",
| "type" : "string",
| "nullable" : true,
| "metadata" : {}
| },
| {
| "name" : "lastName",
| "type" : "string",
| "nullable" : true,
| "metadata" : {}
| }
| ]}""".stripMargin
val schemaSource = schemaJson.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
println(schemaFromJson)
// StructType(StructField(id,IntegerType,true), StructField(firstName,StringType,true), StructField(lastName,StringType,true))
val cols: List[String] = schemaFromJson.fieldNames.toList
val col: List[Column] = cols.map(dataTestDF(_))
val df = dataTestDF.select(col: _*)
df.printSchema()
// root
// |-- id: integer (nullable = false)
// |-- firstName: string (nullable = true)
// |-- lastName: string (nullable = true)
df.show(false)
// +---+------------+-----------+
// |id |firstName |lastName |
// +---+------------+-----------+
// |1 |First Name 1|Last Name 1|
// |2 |First Name 2|Last Name 2|
// |3 |First Name 3|Last Name 3|
// +---+------------+-----------+
}
}
{ Examples:
package spark
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, column, sum}
object DynamicColumnSelection extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
case class c1(
country: String,
st_buy: Double,
nd_buy: Double
)
case class c2(
country: String,
percent: Double
)
val df1 = Seq(
c1("UA", 2, 4),
c1("PL", 3, 6),
c1("GB", 4, 8)
)
.toDF()
df1.show(false)
// +-------+------+------+
// |country|st_buy|nd_buy|
// +-------+------+------+
// |UA |2.0 |4.0 |
// |PL |3.0 |6.0 |
// |GB |4.0 |8.0 |
// +-------+------+------+
val df2 = Seq(
c2("UA", 2.21),
c2("PL", 3.26)
)
.toDF()
df2.show(false)
// +-------+-------+
// |country|percent|
// +-------+-------+
// |UA |2.21 |
// |PL |3.26 |
// +-------+-------+
// Inner Join
val df = df1.join(df2, df1.col("country") === df2.col("country"), "inner")
.select(
df1.col("country"),
df1.col("st_buy"),
df1.col("nd_buy"),
df2.col("percent")
)
df.show(false)
// +-------+------+------+-------+
// |country|st_buy|nd_buy|percent|
// +-------+------+------+-------+
// |UA |2.0 |4.0 |2.21 |
// |PL |3.0 |6.0 |3.26 |
// +-------+------+------+-------+
val res1DF = df.withColumn("st_buy_percent", 'st_buy/'percent)
.withColumn("nd_buy_percent", 'nd_buy/'percent)
res1DF.show(false)
// +-------+------+------+-------+------------------+------------------+
// |country|st_buy|nd_buy|percent|st_buy_percent |nd_buy_percent |
// +-------+------+------+-------+------------------+------------------+
// |UA |2.0 |4.0 |2.21 |0.9049773755656109|1.8099547511312217|
// |PL |3.0 |6.0 |3.26 |0.9202453987730062|1.8404907975460123|
// +-------+------+------+-------+------------------+------------------+
// GroupBy + sum
val data = Seq(
c1("UA", 2, 4),
c1("PL", 3, 6),
c1("UA", 5, 10),
c1("PL", 6, 12),
c1("GB", 4, 8)
)
.toDF()
val resGroupByDF = data
.groupBy("country")
.agg(sum("st_buy").alias("sum_st_buy")
,sum("nd_buy").alias("sum_nd_buy"))
resGroupByDF.show(false)
// +-------+----------+----------+
// |country|sum_st_buy|sum_nd_buy|
// +-------+----------+----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+----------+----------+
val resGroupByDF1 = data.groupBy($"country").sum()
resGroupByDF1.show(false)
// +-------+-----------+-----------+
// |country|sum(st_buy)|sum(nd_buy)|
// +-------+-----------+-----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+-----------+-----------+
val exprs = data.columns.map(sum(_))
val resGroupByDF2 = data.groupBy($"country").agg(exprs.head, exprs.tail: _*)
resGroupByDF2.show(false)
// +-------+------------+-----------+-----------+
// |country|sum(country)|sum(st_buy)|sum(nd_buy)|
// +-------+------------+-----------+-----------+
// |UA |null |7.0 |14.0 |
// |PL |null |9.0 |18.0 |
// |GB |null |4.0 |8.0 |
// +-------+------------+-----------+-----------+
val exprs3 = List("st_buy", "nd_buy").map(sum(_))
val resGroupByDF3 = data.groupBy($"country").agg(exprs3.head, exprs3.tail: _*)
resGroupByDF3.show(false)
// +-------+-----------+-----------+
// |country|sum(st_buy)|sum(nd_buy)|
// +-------+-----------+-----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+-----------+-----------+
val exprs4 = data.columns.toList.filter(_ != "country").map(sum(_))
val resGroupByDF4 = data.groupBy($"country").agg(exprs4.head, exprs4.tail: _*)
resGroupByDF4.show(false)
// +-------+-----------+-----------+
// |country|sum(st_buy)|sum(nd_buy)|
// +-------+-----------+-----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+-----------+-----------+
// Select
val cols = data.columns.toSeq
val selectDF1 = data.select(cols.head, cols.tail:_*)
selectDF1.show(false)
// +-------+------+------+
// |country|st_buy|nd_buy|
// +-------+------+------+
// |UA |2.0 |4.0 |
// |PL |3.0 |6.0 |
// |UA |5.0 |10.0 |
// |PL |6.0 |12.0 |
// |GB |4.0 |8.0 |
// +-------+------+------+
}
}
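The join example above still names st_buy and nd_buy explicitly when computing the derived fields. A possible way to make that part generic is to derive one ratio column per non-key column of the first dataframe, as in the sketch below. The object name DynamicDerivedColumns and the helper withRatios are hypothetical, and the sketch assumes the two dataframes share no column names other than the join key.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DynamicDerivedColumns {
  // Hypothetical helper: joins `left` and `right` on `key`, then adds one
  // "<column>_percent" column for every non-key column of `left`, whatever
  // columns happen to be present in this run.
  // Assumes `left` and `right` share no column names other than `key`.
  def withRatios(left: DataFrame, right: DataFrame, key: String, divisor: String): DataFrame = {
    val valueCols = left.columns.filter(_ != key)
    val joined = left.join(right, Seq(key), "inner")
    val ratios = valueCols.map(c => (col(c) / col(divisor)).alias(s"${c}_percent"))
    joined.select((joined.columns.map(col) ++ ratios).toSeq: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dynamic-derived-columns")
      .getOrCreate()
    import spark.implicits._

    // Run-2 shape: df1 carries two value columns; a third would be picked up
    // by withRatios without any code change.
    val df1 = Seq(("UA", 2.0, 4.0), ("PL", 3.0, 6.0), ("GB", 4.0, 8.0))
      .toDF("country", "st_buy", "nd_buy")
    val df2 = Seq(("UA", 2.21), ("PL", 3.26)).toDF("country", "percent")

    withRatios(df1, df2, "country", "percent").show(false)
    spark.stop()
  }
}
```

The same filter-then-map idea is what exprs4 uses for the sums in the groupBy example.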

Is it possible to apply when.otherwise functions within agg after a groupBy?

I have recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. However, I'm getting the following error:
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
It comes from the following function:
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
It is applied in the following way:
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
.groupBy("field1", "field2")
.agg(
defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
)
Am I trying to do black magic here, or is there something I'm missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
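To illustrate the point about types, here is a minimal sketch of a UDF that does register, because it operates on plain Scala values (Long in, Option[Long] out) instead of Column. The names NullIfUdf, nullIfEqualValue and nullIfEqual are hypothetical, the extra ("c", "d") group is made up for illustration, and the sketch assumes the aggregate being wrapped is a count, which Spark returns as a long.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{count, lit, udf}

object NullIfUdf {
  // Plain Scala function over values; None is turned into SQL NULL by Spark.
  def nullIfEqualValue(n: Long, v: Long): Option[Long] =
    if (n == v) None else Some(n)

  // Registers fine: Long and Option[Long] both map to a Spark DataType.
  val nullIfEqual: UserDefinedFunction = udf(nullIfEqualValue _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-if-udf")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data; the ("c", "d") group has no non-null field3 values,
    // so its count is 0 and the UDF maps it to null.
    val df = Seq(
      ("a", "b", Option(1)),
      ("a", "b", Option.empty[Int]),
      ("c", "d", Option.empty[Int])
    ).toDF("field1", "field2", "field3")

    df.groupBy("field1", "field2")
      .agg(nullIfEqual(count("field3"), lit(0L)).as("counter"))
      .show()
    spark.stop()
  }
}
```

A UDF is not actually needed for this case, since when/otherwise already works on Columns.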
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
df
.groupBy("field1", "field2")
.agg(defaultFunction(count("field3"), lit(0)).as("counter"))
.show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+

How to union two spark Dataframe with a field of type struct that can differ?

I'm quite new with Apache Spark and still struggling with it sometimes. I'm trying to import a quite complex json file and flatten it before saving it in a parquet file.
My json file is a tree of stores.
{
"id": "store02",
"name": "store name",
"domain": "domain",
"currency": "EUR",
"address1": "Somewhere",
"country": "GER",
"city": "Berlin",
"zipCode": "12345",
"timeZone": "CET",
"accounts" : [
{
"field1": "",
"filed2": "",
"field3": "",
"optionnalArray1": [
{
"field1": "",
"field2": ""
}
],
"optionnalArray2": ["aa", "bb"]
}
],
"stores": [ .... ]
}
Each store can have a field which is an array of accounts. An account has three mandatory fields and two optional ones, so I have a dataframe with a field that can have three different types.
Importing the file into a dataframe is no big deal, but during the flattening process I may want to do a union on two dataframes whose accounts have different schemas, and of course I get the following error: "Union can only be performed on tables with the compatible column types"
Is there a way to do this easily? How can Spark import such a JSON file without problems?
@Ramesh
Let's say I have two dataframes. The first one is a dataframe of stores without accounts. The second one is a dataframe of stores with accounts. An account is a struct defined like that :
val acquirerStruct = StructType(
StructField("merchantId", StringType, nullable = true) ::
StructField("name", StringType, nullable = true) ::
Nil)
val accountStruct = StructType(
StructField("acquirers", ArrayType(acquirerStruct), nullable = true) ::
StructField("applicationCode", StringType, nullable = true) ::
StructField("channelType", StringType, nullable = true) ::
StructField("id", StringType, nullable = true) ::
StructField("terminals", ArrayType(StringType), nullable = true) ::
Nil)
When I want to union the two dataframes, I first add an account column to my first dataframe:
df1.withColumn("account", array(lit(null).cast(accountStruct))).union(df2)
If, in df2, all rows have an account with the same structure as accountStruct, it works fine. But that is not always true. An account may have no terminals or acquirers. That's perfectly valid in JSON. In that case I get the error mentioned before.
"Union can only be performed on tables with the compatible column types"
In Spark 3.1 and later you can use (PySpark):
df = df1.unionByName(df2, allowMissingColumns=True)
In Spark 2 you can use:
from pyspark.sql import functions as F
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
But I want to add that, to make your life easy, you should just put both JSON files in the same directory and read them in together; Spark will do the work for you.
val peopleDF = spark.read.json(path)
This will union the data and fill in nulls for you at the same time. In my opinion, it is the easiest way to do this.
I had the same issue in PySpark. I solved it by providing a schema when reading the incompatible dataframe:
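Since the question is written in Scala while the snippets above are PySpark, here is a minimal Scala sketch of the same unionByName call. It assumes Spark 3.1 or later, where the allowMissingColumns parameter is available, and it only demonstrates missing top-level columns; the object name, the helper unionAllowingMissing and the sample columns are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameSketch {
  // Hypothetical helper: union by column name, filling columns that are
  // missing on either side with nulls (requires Spark 3.1+).
  def unionAllowingMissing(a: DataFrame, b: DataFrame): DataFrame =
    a.unionByName(b, allowMissingColumns = true)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-by-name")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: one dataframe lacks the accountId column.
    val storesWithoutAccounts = Seq(("store01", "Berlin")).toDF("id", "city")
    val storesWithAccounts = Seq(("store02", "Paris", "acc-1")).toDF("id", "city", "accountId")

    unionAllowingMissing(storesWithoutAccounts, storesWithAccounts).show()
    spark.stop()
  }
}
```

Columns are matched by name rather than by position, so the two inputs do not need the same column order.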
import copy
...
schema_to_read = copy.deepcopy(df1.schema)
df2 = sql_context.read.format("json").schema(schema_to_read).load(path)

How to properly iterate/print a parquet using scala/spark?

How do I println the individual elements of a parquet containing nested array of objects in spark/scala?
{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]}
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}
Specifically, I want to be able to iterate over the object and print out the id, name, and age, then each item in the path, then move on to printing the next record, and so forth. Assuming I have read in the parquet file and have the dataframe, I want to do something like the following (pseudocode):
val records = dataframe.map {
row => {
val id = row.getString("id")
val name = row.getString("name")
val age = row.getString("age")
println("${id} ${name} ${age}")
row.getArray("path").map {
item => {
val x = item.getValue("x")
val y = item.getValue("y")
println("${x} ${y}")
}
}
}
}
Not sure if the above is the right way to go about it, but it should give you an idea of what I am trying to do.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
val spark = SparkSession
.builder()
.master("local")
.appName("ParquetAppendMode")
.getOrCreate()
import spark.implicits._
val data1 = spark.read.json("/home/sakoirala/IdeaProjects/SparkSolutions/src/main/resources/explode.json")
val result = data1.withColumn("path", explode($"path"))
result.withColumn("x", result("path.x"))
.withColumn("y", result("path.y")).show()
Output:
+---+----+-------+-----------+----+----+
|age| id| name| path| x| y|
+---+----+-------+-----------+----+----+
| 25|1201| satish| [1.0,1.0]| 1.0| 1.0|
| 25|1201| satish| [2.0,2.0]| 2.0| 2.0|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
+---+----+-------+-----------+----+----+
You can do this entirely using the DataFrame API; there is no need to use map.
Here is how you can easily flatten your schema by projecting the fields you want to use:
val records = dataframe.select("id", "age", "path.x", "path.y")
You can then print your data using show:
records.show()
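If the goal really is to println each record followed by its path items, as in the question's pseudocode, one option is to collect the rows to the driver and walk the nested array there; a println inside map would run on the executors instead. The sketch below assumes the data is small enough to collect. The case classes Point and Person, the object name and the helper describe are hypothetical stand-ins for the parquet schema, with x and y taken to be doubles.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object PrintNestedRecords {
  // Hypothetical stand-ins for the parquet schema in the question.
  case class Point(x: Double, y: Double)
  case class Person(id: String, name: String, age: String, path: Seq[Point])

  // One header line with id, name and age, then one line per path item.
  def describe(row: Row): Seq[String] = {
    val header = Seq("id", "name", "age").map(f => row.getAs[String](f)).mkString(" ")
    val points = row.getSeq[Row](row.fieldIndex("path"))
      .map(p => s"${p.getAs[Double]("x")} ${p.getAs[Double]("y")}")
    header +: points
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("print-nested-records")
      .getOrCreate()
    import spark.implicits._

    val dataframe = Seq(
      Person("1201", "satish", "25", Seq(Point(1, 1), Point(2, 2))),
      Person("1202", "krishna", "28", Seq(Point(1.23, 2.12), Point(1.23, 2.12)))
    ).toDF()

    // collect() brings every row to the driver; only do this for small data.
    dataframe.collect().foreach(row => describe(row).foreach(println))
    spark.stop()
  }
}
```

For large datasets, the explode-based answer above keeps the work distributed.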