How to properly iterate/print a parquet using Scala/Spark?

How do I println the individual elements of a parquet containing a nested array of objects in Spark/Scala?
{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]}
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}
Specifically, I want to iterate over each record and print out the id, name, and age, then each item in the path, then move on to the next record, and so forth. Assuming I have read the parquet file into a dataframe, I want to do something like the following (pseudocode):
val records = dataframe.map { row =>
  val id = row.getString("id")
  val name = row.getString("name")
  val age = row.getString("age")
  println(s"${id} ${name} ${age}")
  row.getArray("path").map { item =>
    val x = item.getValue("x")
    val y = item.getValue("y")
    println(s"${x} ${y}")
  }
}
Not sure if the above is the right way to go about it, but it should give you an idea of what I am trying to do.

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions.explode

val data1 = spark.read.json("/home/sakoirala/IdeaProjects/SparkSolutions/src/main/resources/explode.json")

val result = data1.withColumn("path", explode($"path"))
result.withColumn("x", result("path.x"))
  .withColumn("y", result("path.y")).show()
Output:
+---+----+-------+-----------+----+----+
|age| id| name| path| x| y|
+---+----+-------+-----------+----+----+
| 25|1201| satish| [1.0,1.0]| 1.0| 1.0|
| 25|1201| satish| [2.0,2.0]| 2.0| 2.0|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
+---+----+-------+-----------+----+----+

You can do this entirely using the DataFrame API; there is no need to use map.
Here is how you can easily flatten your schema by projecting the fields you want to use:
val records = dataframe.select("id", "age", "path.x", "path.y")
You can then print your data using show:
records.show()
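If you actually need the per-row println behaviour sketched in the question, one option is to explode the path array and iterate over the collected rows. This is only a minimal sketch: it assumes dataframe is the DataFrame read from the parquet file with the schema shown above, and that the data is small enough to collect to the driver.
import org.apache.spark.sql.functions.{col, explode}
// Flatten the nested path array, then print each flattened row.
// Collecting to the driver is only sensible for small data sets.
val flattened = dataframe
  .withColumn("path", explode(col("path")))
  .select(col("id"), col("name"), col("age"), col("path.x"), col("path.y"))
flattened.collect().foreach { row =>
  println(s"${row.getAs[String]("id")} ${row.getAs[String]("name")} ${row.getAs[String]("age")} " +
    s"${row.getAs[Any]("x")} ${row.getAs[Any]("y")}")
}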

Related

How do I move a spark dataframe's columns to a nested column in the same dataframe?

My approach to moving a Spark dataframe's columns into a nested column within the same dataframe is something like this:
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
import spark.sqlContext.implicits._
val data = Seq(("Adam", "111", "50000"),
("Abe", "222", "60000"),
("Sam", "333", "40000"))
var df = data.toDF("Name", "EmpId__c", "Salary__c")
df.show(false)
val cstColsSeq = df.columns.filter(c => c.endsWith("__c")).map(f => { col(f) }).toSeq
var cstMapCol: Column = org.apache.spark.sql.functions.struct(cstColsSeq)
df = df.withColumn("cstMap", cstMapCol)
The issue is that I can't pass a Seq[Column] to the org.apache.spark.sql.functions.struct(...) call; it only accepts a Column* parameter.
A follow-up attempt was to do something like this:
for (i <- cstColsList) {
  cstMapCol = org.apache.spark.sql.functions.struct(i)
  df = df.withColumn("cstMap", cstMapCol)
}
However, this overwrites cstMap on every iteration.
Any thoughts on how I can supply cstColsSeq to struct? I am also open to other solutions that take a different approach to adding nested columns to an existing, populated dataframe.
Thanks!
You can expand the Seq using the : _* notation:
var cstMapCol: Column = org.apache.spark.sql.functions.struct(cstColsSeq: _*)
which will give the output
df.withColumn("cstMap", cstMapCol).show
+----+--------+---------+------------+
|Name|EmpId__c|Salary__c| cstMap|
+----+--------+---------+------------+
|Adam| 111| 50000|[111, 50000]|
| Abe| 222| 60000|[222, 60000]|
| Sam| 333| 40000|[333, 40000]|
+----+--------+---------+------------+
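If the intent is to move (rather than copy) the columns into the nested struct, a small follow-up sketch (assuming df is the flat dataframe from the question and the same __c naming convention) is to drop the flat originals after building the struct:
import org.apache.spark.sql.functions.{col, struct}
// Nest every "__c" column under cstMap, then drop the flat originals.
val cstCols = df.columns.filter(_.endsWith("__c"))
val nestedDf = df
  .withColumn("cstMap", struct(cstCols.map(col): _*))
  .drop(cstCols: _*)
nestedDf.show(false)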

Scala Spark dataframe: modify column with UDF return value

I have a Spark dataframe with a timestamp field that I want to convert to the long datatype. I used a UDF and the standalone code works fine, but when I plug it into a generic routine where any timestamp column needs to be converted, I can't get it working. The issue is how to assign the return value from the UDF back to the dataframe column.
Below is the code snippet
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate();
import org.apache.spark.sql.functions._
val sqlContext = spark.sqlContext
val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
)))
val convertTimeStamp = udf { (manTs :java.sql.Timestamp) =>
manTs.getTime
}
df2.withColumn("manufacture_ts",getTime(df2("manufacture_ts"))).show
+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
Now I want to invoke this generically from a dataframe, so that it is called on all columns that need to be converted to long (i.e. the timestamp columns):
object Test4 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
  import spark.implicits._
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
  import scala.collection.JavaConversions._

  val long: Long = "1508299200000".toLong
  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))
  val schema = List(StructField("rowkey", StringType, true)
    , StructField("order_receipt_dt", LongType, true)
    , StructField("maturity_dt", StringType, true))
  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

  val modifedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }
  modifedDf2.show
}
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
  import org.apache.spark.sql.functions._
  fieldType.toLowerCase match {
    case "timestamp" => convertTimeStamp(dataFrame(name))
    case _ => dataFrame.col(name)
  }
}
Maybe your UDF crashes if the timestamp is null. You can:
use unix_timestamp instead of a UDF, or make your UDF null-safe
only apply it to the fields that actually need to be converted.
Given the data:
import java.sql.Timestamp
import java.time.LocalDateTime
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
.foldLeft(df)((df,field) => df.withColumn(field,unix_timestamp(col(field))))
newDF.show()
which gives:
+---+----------+----------+
| id| ts1| ts2|
+---+----------+----------+
| 1|1589109282|1589109282|
+---+----------+----------+
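If you prefer to keep a UDF (the "make your UDF null-safe" option mentioned above), a minimal sketch against the same df is to return an Option, so that null timestamps become null in the result instead of throwing a NullPointerException inside the UDF. The name tsToLong is just illustrative:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.TimestampType
// Null-safe UDF: Option(ts) is None for null input, which Spark renders as null.
val tsToLong = udf { (ts: java.sql.Timestamp) => Option(ts).map(_.getTime) }
val safeDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((acc, field) => acc.withColumn(field, tsToLong(col(field))))
safeDF.show()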

Scala Dataframe/SQL: Dynamic Column Selection for Reporting

Context:
We have a model with 5 flows.
Flow-1: input data from different sources; these are the input to Flow-2.
Flow-2, Flow-3 and Flow-4: ML models, each dumping a few fields to S3.
Flow-5: reporting layer built on the output of Flow-2, Flow-3 and Flow-4.
The overall data size is very small.
Problem:
Flow-5 is the reporting layer, based on a few SQL queries whose input data comes from Flow-2, Flow-3 and Flow-4.
Flow-2, Flow-3 and Flow-4 have one common joining field; the remaining fields are different.
We can write a SQL query joining the Flow-2/3/4 data stored in three different tables, with a few calculations/aggregations. However, the number of output fields from Flow-2/3/4 may vary on each run.
Problem-1: every time the S3 file structure (Flow-2/3/4) changes, the COPY often fails because the target table schema differs from the S3 file structure (to fix it, we need to manually add/delete fields in the target table to align it with the S3 data).
Problem-2: for any addition/deletion of fields in the S3 files, we need to change the reports by adding/removing columns.
Approach:
SQL way: standardize the S3 dump, the target table and the report SQL, i.e. standardize the number of possible columns in each flow's (2/3/4) output and in the target table, so that any fields that are not available are loaded as NULL/blank during the S3 dump and COPYed as blank. Standardize the target table structure to align with the S3 template, and standardize the reporting SQL as well.
Scala/Spark: currently exploring this option. To perform a PoC, I created two S3 dumps, created two dataframes in Scala, and tried dataframe joins as well as a Spark SQL join. I am still not sure whether there is any way to dynamically pick up new columns, i.e. to make the Spark code generic.
By creating dataframes that point directly to S3, we can solve the problem of COPYing data with dynamic fields into the target table.
However, the reporting SQL issue still persists (or at least I don't yet know how it can be handled).
Question:
Is there any way to handle this issue (dynamic column selection in SQL) in Scala/Spark SQL?
Note: the S3 file always contains only the required fields from each run. The issue is not reading the dynamic fields from S3 (the dataframe takes care of that automatically); the issue is how to make the SQL (Spark SQL / DataFrame API) code handle it.
Example/Scenario-1:
Run-1
df1= country,1st_buy (here dataframe directly points to s3, which has only required attributes)
df2= country,percent (here dataframe directly points to s3, which has only required attributes)
--Sample SQL code (need to convert to sparkSQL/dataframe API)
SELECT df1.1st_buy,
       df1.1st_buy/df2.percent -- DERIVED FIELD
FROM df1, df2
WHERE df1.country = df2.country
Run-2 (here one additional column was added to df1)
df1= country,1st_buy,2nd_buy (here dataframe directly points to s3, which has only required attributes)
df2= country,percent (here dataframe directly points to s3, which has only required attributes)
****how can we handle this part of new field 2nd_buy dynamically****
--Sample SQL code (need to convert to sparkSQL/dataframe API)
SELECT df1.1st_buy,
       df1.2nd_buy,
       df1.1st_buy/df2.percent, -- DERIVED FIELD
       df1.2nd_buy/df2.percent  -- DERIVED FIELD
FROM df1, df2
WHERE df1.country = df2.country
Example/Scenario-2:
Run-1
df1= country,1st_buy (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT country,sum(df1.1st_buy)
FROM df1
GROUP BY country
--Dataframe API/SparkSQL
df1.groupBy("country").sum("1st_buy").show()
Run-2 (here one additional column was added to df1)
df1= country,1st_buy,2nd_buy (here dataframe directly points to s3, which has only required attributes)
****how can we handle this part of new field 2nd_buy dynamically****
--Sample SQL
SELECT country, sum(df1.1st_buy), sum(df1.2nd_buy)
FROM df1
GROUP BY country
--Dataframe API/SparkSQL
df1.groupBy("country").sum("1st_buy","2nd_buy").show()
A few options:
1. Build the SQL string dynamically:
val sqlScript = "select col1, col2, .... from ... "
// the string can be built dynamically
val df = spark.sql(sqlScript)
2. Use a schema, e.g. schema = StructType(Seq(
  StructField("id", LongType, true),
  ....
))
// and then use schema.fieldNames..., or
val cols: List[Column] = ...
// in df.select(cols: _*)
3. Get the schema (the list of fields) from a JSON file:
package spark
import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}
import scala.io.Source
object DFFieldsWithJson extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
case class TestData (
id: Int,
firstName: String,
lastName: String,
descr: String
)
val dataTestDF = Seq(
TestData(1, "First Name 1", "Last Name 1", "Description 1"),
TestData(2, "First Name 2", "Last Name 2", "Description 2"),
TestData(3, "First Name 3", "Last Name 3", "Description 3")
).toDF()
dataTestDF.show(false)
// +---+------------+-----------+-------------+
// |id |firstName |lastName |descr |
// +---+------------+-----------+-------------+
// |1 |First Name 1|Last Name 1|Description 1|
// |2 |First Name 2|Last Name 2|Description 2|
// |3 |First Name 3|Last Name 3|Description 3|
// +---+------------+-----------+-------------+
val schemaJson =
"""{ "type" : "struct",
|"fields" : [
|{
| "name" : "id",
| "type" : "integer",
| "nullable" : true,
| "metadata" : { }
| },
| {
| "name" : "firstName",
| "type" : "string",
| "nullable" : true,
| "metadata" : {}
| },
| {
| "name" : "lastName",
| "type" : "string",
| "nullable" : true,
| "metadata" : {}
| }
| ]}""".stripMargin
val schemaSource = schemaJson.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
println(schemaFromJson)
// StructType(StructField(id,IntegerType,true), StructField(firstName,StringType,true), StructField(lastName,StringType,true))
val cols: List[String] = schemaFromJson.fieldNames.toList
val col: List[Column] = cols.map(dataTestDF(_))
val df = dataTestDF.select(col: _*)
df.printSchema()
// root
// |-- id: integer (nullable = false)
// |-- firstName: string (nullable = true)
// |-- lastName: string (nullable = true)
df.show(false)
// +---+------------+-----------+
// |id |firstName |lastName |
// +---+------------+-----------+
// |1 |First Name 1|Last Name 1|
// |2 |First Name 2|Last Name 2|
// |3 |First Name 3|Last Name 3|
// +---+------------+-----------+
}
Examples:
package spark
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, column, sum}
object DynamicColumnSelection extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
case class c1(
country: String,
st_buy: Double,
nd_buy: Double
)
case class c2(
country: String,
percent: Double
)
val df1 = Seq(
c1("UA", 2, 4),
c1("PL", 3, 6),
c1("GB", 4, 8)
)
.toDF()
df1.show(false)
// +-------+------+------+
// |country|st_buy|nd_buy|
// +-------+------+------+
// |UA |2.0 |4.0 |
// |PL |3.0 |6.0 |
// |GB |4.0 |8.0 |
// +-------+------+------+
val df2 = Seq(
c2("UA", 2.21),
c2("PL", 3.26)
)
.toDF()
df2.show(false)
// +-------+-------+
// |country|percent|
// +-------+-------+
// |UA |2.21 |
// |PL |3.26 |
// +-------+-------+
// Inner Join
val df = df1.join(df2, df1.col("country") === df2.col("country"), "inner")
.select(
df1.col("country"),
df1.col("st_buy"),
df1.col("nd_buy"),
df2.col("percent")
)
df.show(false)
// +-------+------+------+-------+
// |country|st_buy|nd_buy|percent|
// +-------+------+------+-------+
// |UA |2.0 |4.0 |2.21 |
// |PL |3.0 |6.0 |3.26 |
// +-------+------+------+-------+
val res1DF = df.withColumn("st_buy_percent", 'st_buy/'percent)
.withColumn("nd_buy_percent", 'nd_buy/'percent)
res1DF.show(false)
// +-------+------+------+-------+------------------+------------------+
// |country|st_buy|nd_buy|percent|st_buy_percent |nd_buy_percent |
// +-------+------+------+-------+------------------+------------------+
// |UA |2.0 |4.0 |2.21 |0.9049773755656109|1.8099547511312217|
// |PL |3.0 |6.0 |3.26 |0.9202453987730062|1.8404907975460123|
// +-------+------+------+-------+------------------+------------------+
// GroupBy + sum
val data = Seq(
c1("UA", 2, 4),
c1("PL", 3, 6),
c1("UA", 5, 10),
c1("PL", 6, 12),
c1("GB", 4, 8)
)
.toDF()
val resGroupByDF = data
.groupBy("country")
.agg(sum("st_buy").alias("sum_st_buy")
,sum("nd_buy").alias("sum_nd_buy"))
resGroupByDF.show(false)
// +-------+----------+----------+
// |country|sum_st_buy|sum_nd_buy|
// +-------+----------+----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+----------+----------+
val resGroupByDF1 = data.groupBy($"country").sum()
resGroupByDF1.show(false)
// +-------+-----------+-----------+
// |country|sum(st_buy)|sum(nd_buy)|
// +-------+-----------+-----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+-----------+-----------+
val exprs = data.columns.map(sum(_))
val resGroupByDF2 = data.groupBy($"country").agg(exprs.head, exprs.tail: _*)
resGroupByDF2.show(false)
// +-------+------------+-----------+-----------+
// |country|sum(country)|sum(st_buy)|sum(nd_buy)|
// +-------+------------+-----------+-----------+
// |UA |null |7.0 |14.0 |
// |PL |null |9.0 |18.0 |
// |GB |null |4.0 |8.0 |
// +-------+------------+-----------+-----------+
val exprs3 = List("st_buy", "nd_buy").map(sum(_))
val resGroupByDF3 = data.groupBy($"country").agg(exprs3.head, exprs3.tail: _*)
resGroupByDF3.show(false)
// +-------+-----------+-----------+
// |country|sum(st_buy)|sum(nd_buy)|
// +-------+-----------+-----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+-----------+-----------+
val exprs4 = data.columns.toList.filter(_ != "country").map(sum(_))
val resGroupByDF4 = data.groupBy($"country").agg(exprs4.head, exprs4.tail: _*)
resGroupByDF4.show(false)
// +-------+-----------+-----------+
// |country|sum(st_buy)|sum(nd_buy)|
// +-------+-----------+-----------+
// |UA |7.0 |14.0 |
// |PL |9.0 |18.0 |
// |GB |4.0 |8.0 |
// +-------+-----------+-----------+
// Select
val cols = data.columns.toSeq
val selectDF1 = data.select(cols.head, cols.tail:_*)
selectDF1.show(false)
// +-------+------+------+
// |country|st_buy|nd_buy|
// +-------+------+------+
// |UA |2.0 |4.0 |
// |PL |3.0 |6.0 |
// |UA |5.0 |10.0 |
// |PL |6.0 |12.0 |
// |GB |4.0 |8.0 |
// +-------+------+------+
}
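To tie this back to Scenario-1 of the question, a minimal sketch (assuming df1 and df2 are shaped as in the example above, with country as the join key, some number of buy columns in df1, and percent only in df2) derives a "<column>_percent" field for whatever buy columns exist in a given run; buyCols and reportDF are just illustrative names:
import org.apache.spark.sql.functions.col
// Discover the dynamic columns at runtime and derive one ratio column per buy column.
val buyCols = df1.columns.filter(_ != "country")
val joined = df1.join(df2, Seq("country"), "inner")
val reportDF = buyCols.foldLeft(joined) { (acc, c) =>
  acc.withColumn(s"${c}_percent", col(c) / col("percent"))
}
reportDF.show(false)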

Pass Spark SQL function name as parameter in Scala

I am trying to pass a Spark SQL function name to my defined function in Scala.
I am trying to get the same functionality as:
myDf.agg(max($"myColumn"))
my attempt doesn't work:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(myParameter($"myColumn"))
}
Obviously this doesn't work, since I'm providing a String, but I have been unable to find a way to make it work.
Is it even possible?
Edit:
I have to provide the SQL function name (it could be any other aggregate function) as a parameter when calling my function, e.g.
myFunc(anyDf, max) or myFunc(anyDf, "max")
agg also takes a Map[String, String], which allows you to do what you want:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(Map("myColumn" -> myParameter))
}
example:
val df = Seq(1.0,2.0,3.0).toDF("myColumn")
myFunc(df,"avg")
.show()
gives:
+-------------+
|avg(myColumn)|
+-------------+
| 2.0|
+-------------+
Try this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, max}
import spark.implicits._
val df = Seq((1, 2, 12), (2, 1, 21), (1, 5, 10), (5, 3, 9), (2, 5, 4)).toDF("a", "b", "c")
def myFunc(df: DataFrame, f: Column): DataFrame = {
  df.agg(f)
}
myFunc(df, max(col("a"))).show
+------+
|max(a)|
+------+
| 5|
+------+
Hope it helps!
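If the function name really has to arrive as a String (as in myFunc(anyDf, "max")) and you want typed Columns rather than the Map[String, String] form, one hedged sketch is an explicit lookup table from names to aggregate functions; aggByName and myFuncByName are illustrative names, and only the listed functions are supported:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, max, min}
// Map supported function names to the corresponding aggregate function.
val aggByName: Map[String, Column => Column] = Map(
  "max" -> ((c: Column) => max(c)),
  "min" -> ((c: Column) => min(c)),
  "avg" -> ((c: Column) => avg(c))
)
def myFuncByName(myDf: DataFrame, myParameter: String): DataFrame =
  myDf.agg(aggByName(myParameter)(col("myColumn")))
// Usage (assumes spark.implicits._ is in scope for toDF):
myFuncByName(Seq(1.0, 2.0, 3.0).toDF("myColumn"), "max").show()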

Scala Spark: How to union all dataframes in a loop

Is there a way to get a dataframe that is the union of the dataframes built in a loop?
This is a sample code:
var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
)
for (x <- fruits) {
  var df = Seq(("aaa", "bbb", x)).toDF("aCol", "bCol", "name")
}
I would like to obtain something like this:
aCol | bCol | fruitsName
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon
Thanks again
You could create a sequence of DataFrames and then use reduce:
val results = fruits.
map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
reduce(_.union(_))
results.show()
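One caveat with the reduce approach above: reduce throws on an empty list, so if fruits could ever be empty you can hedge with reduceOption (a small sketch, using the same fruits list):
// reduceOption returns None for an empty list instead of throwing.
val maybeResults = fruits
  .map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol", "bCol", "name"))
  .reduceOption(_ union _)
maybeResults.foreach(_.show())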
Steffen Schmitz's answer is the most concise one I believe.
Below is a more detailed answer if you are looking for more customization (of field types, etc):
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
//initialize DF
val schema = StructType(
StructField("aCol", StringType, true) ::
StructField("bCol", StringType, true) ::
StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
//list to iterate through
var fruits = List(
"apple"
,"orange"
,"melon"
)
for (x <- fruits) {
//union returns a new dataset
initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}
//initialDF.show()
references:
How to create an empty DataFrame with a specified schema?
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
If you have multiple different dataframes, you can use the code below, which is efficient:
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
In a for loop:
val fruits = List("apple", "orange", "melon")
( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")
you can first create a sequence and then use toDF to create Dataframe.
scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()
scala> for ( x <- fruits){
| dseq = dseq :+ ("aaa","bbb",x)
| }
scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
scala> df.show
+----+----+------+
|aCol|bCol| name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+
Well... I think your question is a bit misguided.
As per my limited understanding of what you are trying to do, you should be doing the following:
val fruits = List(
"apple",
"orange",
"melon"
)
val df = fruits
.map(x => ("aaa", "bbb", x))
.toDF("aCol", "bCol", "name")
And this should be sufficient.