I have a nested JSON RDD stream coming in from a Kafka topic.
The data looks like this:
{
"time":"sometext1","host":"somehost1","event":
{"category":"sometext2","computerName":"somecomputer1"}
}
I turned this into a DataFrame, and the schema looks like this:
root
|-- event: struct (nullable = true)
| |-- category: string (nullable = true)
| |-- computerName: string (nullable = true)
|-- time: string (nullable = true)
|-- host: string (nullable = true)
I'm trying to save it to a Hive table on HDFS with a schema like this:
category:string
computerName:string
time:string
host:string
This is my first time working with Spark and Scala. I would appreciate it if someone could help me.
Thanks.
// Creating RDD
val vals = sc.parallelize(
"""{"time":"sometext1","host":"somehost1","event": {"category":"sometext2","computerName":"somecomputer1"}}""" ::
Nil)
// Creating schema
import org.apache.spark.sql.types._
val schema = (new StructType)
.add("time", StringType)
.add("host", StringType)
.add("event", (new StructType)
.add("category", StringType)
.add("computerName", StringType))
import sqlContext.implicits._
val jsonDF = sqlContext.read.schema(schema).json(vals)
jsonDF.printSchema
root
|-- time: string (nullable = true)
|-- host: string (nullable = true)
|-- event: struct (nullable = true)
| |-- category: string (nullable = true)
| |-- computerName: string (nullable = true)
// selecting columns
val df = jsonDF.select($"event.*",$"time",
$"host")
df.printSchema
root
|-- category: string (nullable = true)
|-- computerName: string (nullable = true)
|-- time: string (nullable = true)
|-- host: string (nullable = true)
df.show
+---------+-------------+---------+---------+
| category| computerName| time| host|
+---------+-------------+---------+---------+
|sometext2|somecomputer1|sometext1|somehost1|
+---------+-------------+---------+---------+
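To cover the last part of the question, writing the flattened DataFrame out as a Hive table on HDFS, a minimal sketch could look like the following; it assumes a Hive-enabled context (HiveContext, or a Hive-enabled SparkSession), and the database/table name is a placeholder, not from the original post.
// Assumes Hive support is enabled; "mydb.events_flat" is a placeholder table name.
df.write
  .mode("append")               // or "overwrite", depending on how the stream is persisted
  .saveAsTable("mydb.events_flat")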
Related
I have a nested DataFrame "inputFlowRecordsAgg" which has the following schema:
root
|-- FlowI.key: string (nullable = true)
|-- FlowS.minFlowTime: long (nullable = true)
|-- FlowS.maxFlowTime: long (nullable = true)
|-- FlowS.flowStartedCount: long (nullable = true)
|-- FlowI.DestPort: integer (nullable = true)
|-- FlowI.SrcIP: struct (nullable = true)
| |-- bytes: binary (nullable = true)
|-- FlowI.DestIP: struct (nullable = true)
| |-- bytes: binary (nullable = true)
|-- FlowI.L4Protocol: byte (nullable = true)
|-- FlowI.Direction: byte (nullable = true)
|-- FlowI.Status: byte (nullable = true)
|-- FlowI.Mac: string (nullable = true)
I wanted to convert it into a nested Dataset of the following case classes:
case class InputFlowV1(val FlowI: FlowI,
val FlowS: FlowS)
case class FlowI(val Mac: String,
val SrcIP: IPAddress,
val DestIP: IPAddress,
val DestPort: Int,
val L4Protocol: Byte,
val Direction: Byte,
val Status: Byte,
var key: String = "")
case class FlowS(var minFlowTime: Long,
var maxFlowTime: Long,
var flowStartedCount: Long)
but when I try converting it using
inputFlowRecordsAgg.as[InputFlowV1]
org.apache.spark.sql.AnalysisException: cannot resolve '`FlowI`' given input columns: [FlowI.DestIP,FlowI.Direction, FlowI.key, FlowS.maxFlowTime, FlowI.SrcIP, FlowS.flowStartedCount, FlowI.L4Protocol, FlowI.Mac, FlowI.DestPort, FlowS.minFlowTime, FlowI.Status];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
One comment asked me for the full code, so here it is:
def getReducedFlowR(inputFlowRecords: Dataset[InputFlowV1],
@transient spark: SparkSession): Dataset[InputFlowV1] = {
val inputFlowRecordsAgg = inputFlowRecords.groupBy(column("FlowI.key") as "FlowI.key")
.agg(min("FlowS.minFlowTime") as "FlowS.minFlowTime" , max("FlowS.maxFlowTime") as "FlowS.maxFlowTime",
sum("FlowS.flowStartedCount") as "FlowS.flowStartedCount"
, first("FlowI.Mac") as "FlowI.Mac"
, first("FlowI.SrcIP") as "FlowI.SrcIP" , first("FlowI.DestIP") as "FlowI.DestIP"
,first("FlowI.DestPort") as "FlowI.DestPort"
, first("FlowI.L4Protocol") as "FlowI.L4Protocol"
, first("FlowI.Direction") as "FlowI.Direction" , first("FlowI.Status") as "FlowI.Status")
inputFlowRecordsAgg.printSchema()
return inputFlowRecordsAgg.as[InputFlowV1]
}
The reason is that your case class schema does not match the actual data schema. Check the case class schema below and make your data schema match it; then the conversion will work.
Your case class schema is:
scala> df.printSchema
root
|-- FlowI: struct (nullable = true)
| |-- Mac: string (nullable = true)
| |-- SrcIP: string (nullable = true)
| |-- DestIP: string (nullable = true)
| |-- DestPort: integer (nullable = false)
| |-- L4Protocol: byte (nullable = false)
| |-- Direction: byte (nullable = false)
| |-- Status: byte (nullable = false)
| |-- key: string (nullable = true)
|-- FlowS: struct (nullable = true)
| |-- minFlowTime: long (nullable = false)
| |-- maxFlowTime: long (nullable = false)
| |-- flowStartedCount: long (nullable = false)
Change your code as shown below and it should work now:
val inputFlowRecordsAgg = inputFlowRecords.groupBy(column("FlowI.key") as "key")
  .agg(min("FlowS.minFlowTime") as "minFlowTime", max("FlowS.maxFlowTime") as "maxFlowTime",
    sum("FlowS.flowStartedCount") as "flowStartedCount",
    first("FlowI.Mac") as "Mac",
    first("FlowI.SrcIP") as "SrcIP", first("FlowI.DestIP") as "DestIP",
    first("FlowI.DestPort") as "DestPort",
    first("FlowI.L4Protocol") as "L4Protocol",
    first("FlowI.Direction") as "Direction", first("FlowI.Status") as "Status")
  .select(struct($"key", $"Mac", $"SrcIP", $"DestIP", $"DestPort", $"L4Protocol", $"Direction", $"Status").as("FlowI"),
    struct($"flowStartedCount", $"minFlowTime", $"maxFlowTime").as("FlowS")) // adjust the struct column lists to match your own columns
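With the two top-level structs in place, the conversion the question attempted should resolve (assuming, as in the original job, that encoders for the nested types such as IPAddress are available):
val typedFlows: Dataset[InputFlowV1] = inputFlowRecordsAgg.as[InputFlowV1]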
How do I reorder fields in a nested DataFrame in Scala?
For example, below are the current and expected schemas.
currently->
root
|-- domain: struct (nullable = false)
| |-- assigned: string (nullable = true)
| |-- core: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- action: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- dqid: string (nullable = true)
expected->
root
|-- domain: struct (nullable = false)
| |-- core: string (nullable = true)
| |-- assigned: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- dqid: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- action: string (nullable = true)
You need to define the schema before you read the DataFrame.
val schema = StructType(Array(
  StructField("domain", StructType(Array(
    StructField("core", StringType, true),
    StructField("assigned", StringType, true),
    StructField("createdBy", LongType, true))), true),
  StructField("Event", StructType(Array(
    StructField("dqid", StringType, true),
    StructField("eventid", StringType, true),
    StructField("action", StringType, true))), true)))
Now, you can apply this schema while reading your file.
val df = spark.read.schema(schema).json("path/to/json")
This should work with any nested data.
Hope this helps!
The most efficient approach might be to simply select the nested elements and wrap them in a couple of structs, as shown below:
case class Domain(assigned: String, core: String, createdBy: Long)
case class Event(action: String, eventid: String, dqid: String)
import org.apache.spark.sql.functions.struct
import spark.implicits._   // for $ and toDF

val df = Seq(
(Domain("a", "b", 1L), Event("c", "d", "e")),
(Domain("f", "g", 2L), Event("h", "i", "j"))
).toDF("domain", "event")
val df2 = df.select(
struct($"domain.core", $"domain.assigned", $"domain.createdBy").as("domain"),
struct($"event.dqid", $"event.action", $"event.eventid").as("event")
)
df2.printSchema
// root
// |-- domain: struct (nullable = false)
// | |-- core: string (nullable = true)
// | |-- assigned: string (nullable = true)
// | |-- createdBy: long (nullable = true)
// |-- event: struct (nullable = false)
// | |-- dqid: string (nullable = true)
// | |-- action: string (nullable = true)
// | |-- eventid: string (nullable = true)
An alternative would be to apply a row-wise map:
import org.apache.spark.sql.Row
val df2 = df.map{ case Row(Row(as: String, co: String, cr: Long), Row(ac: String, ev: String, dq: String)) =>
((co, as, cr), (dq, ac, ev))
}.toDF("domain", "event")
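Note that the tuple-based variant names the nested fields _1, _2 and _3 instead of the original names, so when the inner field names matter, the struct-based select above is the simpler option.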
My question is whether there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions to the original DataFrame and got a new DataFrame back (the one returned by Dataset.mapPartitions).
I use Dataset.mapPartitions rather than Dataset.map for better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame is not updated automatically; the output of Dataset.printSchema on the new DataFrame still shows the original schema:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is whether there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See following example.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import spark.implicits._   // for toDF

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random
val encoderForNewDF = RowEncoder(StructType(Array(
StructField("name", StringType),
StructField("num", IntegerType),
StructField("city", StringType)
)))
val newDF = originalDF.mapPartitions { partition =>
partition.map{ row =>
val name = row.getAs[String]("name")
val city = row.getAs[String]("city")
val num = r.nextInt
Row.fromSeq(Array[Any](name, num, city))
}
} (encoderForNewDF)
newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html
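If the new row shape can also be expressed as a case class, a typed variant avoids building the StructType and RowEncoder by hand; the case class below is illustrative, not from the original post:
import spark.implicits._

case class PersonWithNum(name: String, num: Int, city: String)

// Returning a case class from mapPartitions picks up the implicit product encoder,
// so the resulting DataFrame already carries the new schema.
val newTypedDF = originalDF.mapPartitions { partition =>
  partition.map(row => PersonWithNum(row.getAs[String]("name"), r.nextInt, row.getAs[String]("city")))
}.toDF()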
I have two DataFrames on which I perform a join, and sometimes I get the error below:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`IsAnnualReported_1` IS NOT NULL) THEN `IsAnnualReported_1` ELSE CAST(`IsAnnualReported` AS BOOLEAN) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
To overcome this, I have to manually cast to matching data types, as below, for all of the mismatching columns:
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
This is how I perform the join on the two DataFrames:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("\\.")(4))
val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/MAIN")
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
val df1resultFinal=df1result.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear=df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))
val df2 = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/INCR")
val df2With_ = df2.toDF(df2.columns.map(_.replace(".", "_")): _*)
val df2column_to_keep = df2With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df2result = df2With_.select(df2column_to_keep.head, df2column_to_keep.tail: _*)
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("FinancialPeriod_organizationId", "FinancialPeriod_periodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
df1resultFinalWithYear.printSchema()
latestForEachKey.printSchema()
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("FinancialPeriod_organizationId", "FinancialPeriod_periodId"), "outer")
.select($"FinancialPeriod_organizationId", $"FinancialPeriod_periodId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
when($"PartitionYear_1".isNotNull, $"PartitionYear_1").otherwise($"PartitionYear".cast(DataTypes.StringType)).as("PartitionYear"),
when($"FinancialPeriod_periodEndDate_1".isNotNull, $"FinancialPeriod_periodEndDate_1").otherwise($"FinancialPeriod_periodEndDate").as("FinancialPeriod_periodEndDate"),
when($"FinancialPeriod_periodStartDate_1".isNotNull, $"FinancialPeriod_periodStartDate_1").otherwise($"FinancialPeriod_periodStartDate").as("FinancialPeriod_periodStartDate"),
when($"FinancialPeriod_periodDuration_1".isNotNull, $"FinancialPeriod_periodDuration_1").otherwise($"FinancialPeriod_periodDuration").as("FinancialPeriod_periodDuration"),
when($"FinancialPeriod_nonStandardPeriod_1".isNotNull, $"FinancialPeriod_nonStandardPeriod_1").otherwise($"FinancialPeriod_nonStandardPeriod").as("FinancialPeriod_nonStandardPeriod"),
when($"FinancialPeriod_periodType_1".isNotNull, $"FinancialPeriod_periodType_1").otherwise($"FinancialPeriod_periodType").as("FinancialPeriod_periodType"),
when($"PeriodFiscalYear_1".isNotNull, $"PeriodFiscalYear_1").otherwise($"PeriodFiscalYear").as("PeriodFiscalYear"),
when($"PeriodFiscalEndMonth_1".isNotNull, $"PeriodFiscalEndMonth_1").otherwise($"PeriodFiscalEndMonth").as("PeriodFiscalEndMonth"),
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
when($"IsTransitional_1".isNotNull, $"IsTransitional_1").otherwise($"IsTransitional".cast(DataTypes.StringType)).as("IsTransitional"),
when($"CumulativeType_1".isNotNull, $"CumulativeType_1").otherwise($"CumulativeType").as("CumulativeType"),
when($"CalendarizedPeriodEndDate_1".isNotNull, $"CalendarizedPeriodEndDate_1").otherwise($"CalendarizedPeriodEndDate").as("CalendarizedPeriodEndDate"),
when($"EarliestAnnouncementDateTime_1".isNotNull, $"EarliestAnnouncementDateTime_1").otherwise($"EarliestAnnouncementDateTime").as("EarliestAnnouncementDateTime"),
when($"EADUTCOffset_1".isNotNull, $"EADUTCOffset_1").otherwise($"EADUTCOffset").as("EADUTCOffset"),
when($"PeriodPermId_1".isNotNull, $"PeriodPermId_1").otherwise($"PeriodPermId").as("PeriodPermId"),
when($"PeriodPermId_objectTypeId_1".isNotNull, $"PeriodPermId_objectTypeId_1").otherwise($"PeriodPermId_objectTypeId").as("PeriodPermId_objectTypeId"),
when($"PeriodPermId_objectType_1".isNotNull, $"PeriodPermId_objectType_1").otherwise($"PeriodPermId_objectType").as("PeriodPermId_objectType"),
when($"CumulativeTypeId_1".isNotNull, $"CumulativeTypeId_1").otherwise($"CumulativeTypeId").as("CumulativeTypeId"),
when($"PeriodTypeId_1".isNotNull, $"PeriodTypeId_1").otherwise($"PeriodTypeId").as("PeriodTypeId"),
when($"PeriodFiscalEndMonthId_1".isNotNull, $"PeriodFiscalEndMonthId_1").otherwise($"PeriodFiscalEndMonthId").as("PeriodFiscalEndMonthId"),
when($"PeriodLengthUnitId_1".isNotNull, $"PeriodLengthUnitId_1").otherwise($"PeriodLengthUnitId").as("PeriodLengthUnitId"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Now, what I need is: how can I create the second DataFrame with the schema of the first DataFrame, so that I never get errors like the data type mismatch above?
Here are the schemas of the first and second DataFrames:
root
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod: string (nullable = true)
|-- FinancialPeriod_periodType: string (nullable = true)
|-- PeriodFiscalYear: integer (nullable = true)
|-- PeriodFiscalEndMonth: integer (nullable = true)
|-- IsAnnualReported: boolean (nullable = true)
|-- IsTransitional: boolean (nullable = true)
|-- CumulativeType: string (nullable = true)
|-- CalendarizedPeriodEndDate: string (nullable = true)
|-- EarliestAnnouncementDateTime: timestamp (nullable = true)
|-- EADUTCOffset: string (nullable = true)
|-- PeriodPermId: string (nullable = true)
|-- PeriodPermId_objectTypeId: string (nullable = true)
|-- PeriodPermId_objectType: string (nullable = true)
|-- CumulativeTypeId: integer (nullable = true)
|-- PeriodTypeId: integer (nullable = true)
|-- PeriodFiscalEndMonthId: integer (nullable = true)
|-- PeriodLengthUnitId: integer (nullable = true)
|-- FFAction: string (nullable = true)
|-- DataPartition: string (nullable = true)
|-- PartitionYear: string (nullable = true)
root
|-- DataPartition_1: string (nullable = true)
|-- PartitionYear_1: integer (nullable = true)
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration_1: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod_1: string (nullable = true)
|-- FinancialPeriod_periodType_1: string (nullable = true)
|-- PeriodFiscalYear_1: string (nullable = true)
|-- PeriodFiscalEndMonth_1: string (nullable = true)
|-- IsAnnualReported_1: string (nullable = true)
|-- IsTransitional_1: string (nullable = true)
|-- CumulativeType_1: string (nullable = true)
|-- CalendarizedPeriodEndDate_1: string (nullable = true)
|-- EarliestAnnouncementDateTime_1: string (nullable = true)
|-- EADUTCOffset_1: string (nullable = true)
|-- PeriodPermId_1: string (nullable = true)
|-- PeriodPermId_objectTypeId_1: string (nullable = true)
|-- PeriodPermId_objectType_1: string (nullable = true)
|-- CumulativeTypeId_1: string (nullable = true)
|-- PeriodTypeId_1: string (nullable = true)
|-- PeriodFiscalEndMonthId_1: string (nullable = true)
|-- PeriodLengthUnitId_1: string (nullable = true)
|-- FFAction_1: string (nullable = true)
You already have a good solution.
Here I am going to show you how you can avoid writing out the cast for each column manually.
Let's say you have two DataFrames (as you already do):
df1
root
|-- col1: integer (nullable = false)
|-- col2: string (nullable = true)
df2
root
|-- cl2: integer (nullable = false)
|-- cl1: integer (nullable = false)
Suppose you want to change the data types of df2 to those of df1 and, as you said, you know the mapping between the columns of the two DataFrames. You have to create a Map of the column relationships:
val columnMaps = Map("col1" -> "cl1", "col2"->"cl2")
Once you have the map above, you can work out the data type each column of df2 should be cast to, as below:
val schema1 = df1.schema
val toBeChangedDataTypes = schema1.map(x => if (columnMaps.keySet.contains(x.name)) (columnMaps(x.name), x.dataType) else (x.name, x.dataType)).toList
Then you can change the data types of the columns of df2 to match df1 by calling a recursive function:
val finalDF = castingFunction(toBeChangedDataTypes, df2)
where castingFunction is a recursive function defined as
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

def castingFunction(typeList: List[(String, DataType)], df: DataFrame): DataFrame = typeList match {
case x :: y => castingFunction(y, df.withColumn(x._1, col(x._1).cast(x._2)))
case Nil => df
}
You will see that finalDF has the following schema:
root
|-- cl2: string (nullable = false)
|-- cl1: integer (nullable = false)
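Equivalently, the recursion can be replaced with a foldLeft over the same list, which is a common idiom for applying a chain of column transformations (a sketch using the same names as above):
val finalDF2 = toBeChangedDataTypes.foldLeft(df2) {
  case (accDF, (name, dataType)) => accDF.withColumn(name, col(name).cast(dataType))
}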
You can do the same for your DataFrames.
I hope the answer is helpful.
I'm attempting to merge 2 DataFrames, one with old data and one with new data, using the union function. This used to work until I tried to dynamically add a new field to the old DataFrame because my schema is evolving.
This means that my old data will be missing a field and the new data will have it. In order for the union to work, I'm adding the field using the evolveSchema function below.
This resulted in the output/exception I pasted below the code, including my debug prints.
The column ordering and making fields nullable are attempts to fix this issue by making the DataFrames as identical as possible, but it persists. The schema prints show that they are both seemingly identical after these manipulations.
Any help to further debug this would be appreciated.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.{DataFrame, SQLContext}
object Merger {
def apply(sqlContext: SQLContext, oldDataSet: Option[DataFrame], newEnrichments: Option[DataFrame]): Option[DataFrame] = {
(oldDataSet, newEnrichments) match {
case (None, None) => None
case (None, _) => newEnrichments
case (Some(existing), None) => Some(existing)
case (Some(existing), Some(news)) => Some {
val evolvedOldDataSet = evolveSchema(existing)
println("EVOLVED OLD SCHEMA FIELD NAMES:" + evolvedOldDataSet.schema.fieldNames.mkString(","))
println("NEW SCHEMA FIELD NAMES:" + news.schema.fieldNames.mkString(","))
println("EVOLVED OLD SCHEMA FIELD TYPES:" + evolvedOldDataSet.schema.fields.map(_.dataType).mkString(","))
println("NEW SCHEMA FIELD TYPES:" + news.schema.fields.map(_.dataType).mkString(","))
println("OLD SCHEMA")
existing.printSchema();
println("PRINT EVOLVED OLD SCHEMA")
evolvedOldDataSet.printSchema()
println("PRINT NEW SCHEMA")
news.printSchema()
val nullableEvolvedOldDataSet = setNullableTrue(evolvedOldDataSet)
val nullableNews = setNullableTrue(news)
println("NULLABLE EVOLVED OLD")
nullableEvolvedOldDataSet.printSchema()
println("NULLABLE NEW")
nullableNews.printSchema()
val unionData =nullableEvolvedOldDataSet.union(nullableNews)
val result = unionData.sort(
unionData("timestamp").desc
).dropDuplicates(
Seq("id")
)
result.cache()
}
}
}
def GENRE_FIELD : String = "station_genre"
// Handle missing fields in old data
def evolveSchema(oldDataSet: DataFrame): DataFrame = {
if (!oldDataSet.schema.fieldNames.contains(GENRE_FIELD)) {
val columnAdded = oldDataSet.withColumn(GENRE_FIELD, lit("N/A"))
// Columns should be in the same order for union
val columnNamesInOrder = Seq("id", "station_id", "station_name", "station_timezone", "station_genre", "publisher_id", "publisher_name", "group_id", "group_name", "timestamp")
val reorderedColumns = columnAdded.select(columnNamesInOrder.head, columnNamesInOrder.tail: _*)
reorderedColumns
}
else
oldDataSet
}
def setNullableTrue(df: DataFrame) : DataFrame = {
// get schema
val schema = df.schema
// create new schema with all fields nullable
val newSchema = StructType(schema.map {
case StructField(columnName, dataType, _, metaData) => StructField( columnName, dataType, nullable = true, metaData)
})
// apply new schema
df.sqlContext.createDataFrame( df.rdd, newSchema )
}
}
EVOLVED OLD SCHEMA FIELD NAMES:
id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp
NEW SCHEMA FIELD NAMES:
id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp
EVOLVED OLD SCHEMA FIELD TYPES:
StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType
NEW SCHEMA FIELD TYPES:
StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType
OLD SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
PRINT EVOLVED OLD SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = false)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
PRINT NEW SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
NULLABLE EVOLVED OLD
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
NULLABLE NEW
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
2017-01-18 15:59:32 ERROR org.apache.spark.internal.Logging$class Executor:91 - Exception in task 1.0 in stage 2.0 (TID 4)
scala.MatchError: false (of class java.lang.Boolean)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
at ...
com.companystuff.meta.uploader.Merger$.apply(Merger.scala:49)
...
Caused by: scala.MatchError: false (of class java.lang.Boolean)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
...
It's because of the ordering of the columns in the actual data, even though the schema looks the same.
So simply select all the required columns in the same order on both sides, then do the union.
Something like this:
val columns:Seq[String]= ....
val df = oldDf.select(columns.head, columns.tail: _*).union(newDf.select(columns.head, columns.tail: _*))
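One way to build the columns sequence is to take the field names from one of the two frames (an assumption; any fixed ordering used by both selects works), for example:
val columns: Seq[String] = oldDf.schema.fieldNames.toSeq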
Hope it helps you