Spark failing to deserialize a record when creating Dataset - scala

I'm reading a large number of CSVs from S3 (everything under a key prefix) and creating a strongly-typed Dataset.
val events: DataFrame = cdcFs.getStream()
events
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
where TradeRecord is a case class that can normally be deserialized into via SparkSession implicits. However, for a certain batch, one record fails to deserialize. Here's the error (stack trace omitted):
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "deal")
- root class: "com.company.trades.TradeRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
deal is a field of TradeRecord that should never be null in the source data (the S3 objects), so it's not an Option.
Unfortunately the error message doesn't give me any clue as to what the CSV data looks like, or even which CSV file it's coming from. The batch consists of hundreds of files, so I need a way to narrow this down to at most a few files to investigate the issue.

As suggested by user10465355 you can load the data:
val events: DataFrame = ???
Filter:
val mismatched = events.where($"deal".isNull)
Add the file name:
import org.apache.spark.sql.functions.input_file_name
val tagged = mismatched.withColumn("_file_name", input_file_name)
Optionally add the chunk and the offset within the chunk:
import org.apache.spark.sql.functions.{spark_partition_id, monotonically_increasing_id, shiftLeft, shiftRight}
tagged
  .withColumn("chunk", spark_partition_id())
  .withColumn(
    "offset",
    monotonically_increasing_id() - shiftLeft(shiftRight(monotonically_increasing_id(), 33), 33))
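From there, a quick way to narrow the batch down to a handful of inputs is to list the distinct offending files. A minimal sketch, assuming events above is loaded as a static DataFrame and spark is the active SparkSession:
import spark.implicits._
// Distinct source files that contain at least one null `deal`
tagged
  .select($"_file_name")
  .distinct()
  .show(false)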

Here's the solution I came up with (I'm using Spark Structured Streaming):
import org.apache.spark.sql.functions.{input_file_name, lit}

val stream = spark.readStream
  .format("csv")
  .schema(schema) // a StructType defined elsewhere
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corruptRecord")
  .load(path)

// If debugging, check for any corrupted CSVs
if (log.isDebugEnabled) { // org.apache.spark.internal.Logging trait
  import spark.implicits._
  stream
    .filter($"corruptRecord".isNotNull)
    .withColumn("input_file", input_file_name)
    .select($"input_file", $"corruptRecord")
    .writeStream
    .format("console")
    .option("truncate", false)
    .start()
}

val events = stream
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
Basically if Spark log level is set to Debug or lower, the DataFrame is checked for corrupted records and any such records are printed out together with their file names. Eventually the program tries to cast this DataFrame to a strongly-typed Dataset[TradeRecord] and fails.
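One detail worth spelling out, since schema is only referenced above: per Spark's CSV documentation, with a user-supplied schema the corrupt record column must be declared in that schema as a nullable string field, otherwise corrupt records are dropped during parsing and never reach the debug query. A sketch of what that assumption looks like:
import org.apache.spark.sql.types.{LongType, StringType, StructType}
val schema = new StructType()
  .add("deal", LongType) // ... plus the other TradeRecord columns
  .add("corruptRecord", StringType, nullable = true) // must match columnNameOfCorruptRecord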

Related

How to remove all records in a RDD including null?

I loaded an RDD from a CSV file. However, this file includes invalid data, so when I tried to output the content of this RDD with first(), I got the following exception:
Caused by: java.lang.NumberFormatException: empty String
I hope to find a solution that removes all records from the RDD where any field contains an empty string. In addition, this RDD includes many fields, so it is difficult to handle every field one by one. I remember that DataFrame has such a function, na.drop(). I need this kind of function to work for an RDD.
The code I used is like this:
// using a case class
case class Flight(dest_id: Long, dest: String, crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double)
// defining the parsing function
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0).toLong, line(1), line(2).toDouble, line(3).toDouble, line(4).toDouble, line(5).toDouble)
}
//loading data
val textRDD = sc.textFile("/root/data/data.csv")
val flightsRDD = textRDD.map(parseFlight)
Update
When I use the RDD converted from a DataFrame, I find that every line of the RDD is a Row object. How do I extract some fields of one Row to build an Edge object?
If the header in the csv file matches the variable names in the case class, then it's easier to read the data as a dataframe and then use na.drop().
import spark.implicits._
val flightsDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/root/data/data.csv")
  .na.drop()
  .as[Flight]
If you want an RDD, it is always possible to convert it afterwards with flightsDf.rdd.
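Regarding the update about Row objects: since flightsDf above is a typed Dataset[Flight], its .rdd already yields Flight instances, so fields can be read directly rather than through Row. A rough sketch of building GraphX edges this way (the choice of dest_id for both vertex ids and depdelaymins as the edge attribute is purely illustrative):
import org.apache.spark.graphx.Edge
val edges = flightsDf.rdd.map { f =>
  Edge(f.dest_id, f.dest_id, f.depdelaymins)
}
If you stay with the untyped DataFrame instead, each element of .rdd is a Row and individual fields are extracted by name, e.g. row.getAs[Long]("dest_id").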

Spark SQL's Scala API - TimestampType - No Encoder found for org.apache.spark.sql.types.TimestampType

I am using Spark 2.1 with Scala 2.11 on a Databricks notebook
What exactly is TimestampType?
We know from Spark SQL's documentation that the official timestamp type is TimestampType, which apparently is an alias for java.sql.Timestamp:
TimestampType can be found in Spark SQL's Scala API
We have a difference when using a schema and the Dataset API
When parsing {"time":1469501297,"action":"Open"} from the Databricks' Scala Structured Streaming example
Using a JSON schema --> OK (I do prefer using the elegant Dataset API):
val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)
val staticInputDF =
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
Using the Dataset API --> KO: No Encoder found for TimestampType
Creating the Event class
import org.apache.spark.sql.types._
case class Event(action: String, time: TimestampType)
--> defined class Event
This errors when reading the events from DBFS on Databricks.
Note: we don't get the error when using java.sql.Timestamp as a type for "time"
val path = "/databricks-datasets/structured-streaming/events/"
val events = spark.read.json(path).as[Event]
Error message
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.TimestampType
- field (class: "org.apache.spark.sql.types.TimestampType", name: "time")
- root class:
Combining the schema read method .schema(jsonSchema) with the as[Type] method, where the type contains java.sql.Timestamp, solves the issue. The idea came to me after reading the Structured Streaming documentation section Creating streaming DataFrames and streaming Datasets:
These examples generate streaming DataFrames that are untyped, meaning that the schema of the DataFrame is not checked at compile time, only checked at runtime when the query is submitted. Some operations like map, flatMap, etc. need the type to be known at compile time. To do those, you can convert these untyped streaming DataFrames to typed streaming Datasets using the same methods as static DataFrame.
val path = "/databricks-datasets/structured-streaming/events/"
val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)
case class Event(action: String, time: java.sql.Timestamp)
val staticInputDS =
spark
.read
.schema(jsonSchema)
.json(path)
.as[Event]
staticInputDF.printSchema
Will output :
root
|-- time: timestamp (nullable = true)
|-- action: string (nullable = true)
TimestampType is not an alias for java.sql.Timestamp, but rather a representation of a timestamp type for Spark internal usage. In general you don't want to use TimestampType in your code. The idea is that java.sql.Timestamp is supported by Spark SQL natively, so you can define your event class as follows:
case class Event(action: String, time: java.sql.Timestamp)
Internally, Spark will then use TimestampType to model the type of a value at runtime, when compiling and optimizing your query, but this is not something you're interested in most of the time.
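A quick sanity check (a sketch, not from the original post, assuming spark is the active SparkSession and Event is the case class defined just above): with java.sql.Timestamp the encoder is derived automatically and the column still shows up as a timestamp in the schema.
import spark.implicits._
val ds = Seq(Event("Open", new java.sql.Timestamp(1469501297L * 1000))).toDS()
ds.printSchema()
// root
//  |-- action: string (nullable = true)
//  |-- time: timestamp (nullable = true)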

How to read records in JSON format from Kafka using Structured Streaming?

I am trying to use the Structured Streaming approach, i.e. Spark streaming based on the DataFrame/Dataset API, to load a stream of data from Kafka.
I use:
Spark 2.10
Kafka 0.10
spark-sql-kafka-0-10
The Spark Kafka DataSource has the following underlying schema:
|key|value|topic|partition|offset|timestamp|timestampType|
My data comes in JSON format and is stored in the value column. I am looking for a way to extract the underlying schema from the value column and update the received DataFrame to the columns stored in value. I tried the approach below but it does not work:
val columns = Array("column1", "column2") // column names

val rawKafkaDF = sparkSession.sqlContext.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", topic)
  .load()

val columnsToSelect = columns.map(x => new Column("value." + x))
val kafkaDF = rawKafkaDF.select(columnsToSelect: _*)

// some analytics using the streaming dataframe kafkaDF

val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()
Here I am getting the exception org.apache.spark.sql.AnalysisException: Can't extract value from value#337; because at the time the stream is created, the values inside are not known...
Do you have any suggestions?
From the Spark perspective value is just a byte sequence. It has no knowledge of the serialization format or content. To be able to extract the fields you have to parse it first.
If the data is serialized as a JSON string you have two options. You can cast value to StringType and use from_json, providing a schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json

val schema: StructType = StructType(Seq(
  StructField("column1", ???),
  StructField("column2", ???)
))

rawKafkaDF.select(from_json($"value".cast(StringType), schema))
or cast to StringType, extract fields by path using get_json_object:
import org.apache.spark.sql.functions.get_json_object

val columns: Seq[String] = ???
val exprs = columns.map(c => get_json_object($"value".cast("string"), s"$$.$c"))
rawKafkaDF.select(exprs: _*)
and cast later to the desired types.
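For instance, a minimal end-to-end sketch of the second option, where the column names column1/column2 and the target types are assumptions carried over from the question (get_json_object returns strings, hence the trailing casts):
import org.apache.spark.sql.functions.get_json_object
val typedDF = rawKafkaDF.select(
  get_json_object($"value".cast("string"), "$.column1").cast("string").alias("column1"),
  get_json_object($"value".cast("string"), "$.column2").cast("long").alias("column2")
)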

How to load a csv directly into a Spark Dataset?

I have a csv file [1] which I want to load directly into a Dataset. The problem is that I always get errors like
org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate
The type path of the target object is:
- field (class: "scala.Float", name: "probability")
- root class: "TFPredictionFormat"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Moreover, and specifically for the phrases field (check the case class [2]), I get
org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);
If I define all the fields in my case class [2] as type String then everything works fine but this is not what I want. Is there a simple way to do it [3]?
References
[1] An example row
B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781
[2] My code snippet is as follows
import spark.implicits._

val INPUT_TF = "<SOME_URI>/my_file.csv"

final case class TFFormat(
  doc_id: String,
  brand: String,
  phrases: Seq[String],
  prediction: String,
  probability: Float
)

val ds = sqlContext.read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(INPUT_TF)
  .as[TFFormat]

ds.take(1).map(println)
[3] I have found ways to do it by first defining columns at the DataFrame level and then converting things to a Dataset (like here or here or here) but I am almost sure this is not the way things are supposed to be done. I am also pretty sure that Encoders are probably the answer, but I don't have a clue how.
TL;DR With CSV input, transforming with standard DataFrame operations is the way to go. If you want to avoid that, you should use an input format that is more expressive (Parquet or even JSON).
In general, data to be converted to a statically typed Dataset must already be of the correct type. The most efficient way to do that is to provide the schema argument for the CSV reader:
val schema: StructType = ???

val ds = spark.read
  .option("header", "true")
  .schema(schema)
  .csv(path)
  .as[T]
where schema could be inferred by reflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
Unfortunately it won't work with your data and class, because the CSV reader doesn't support ArrayType (it would work for atomic types like FloatType), so you have to go the hard way. A naive solution could be expressed as below:
import org.apache.spark.sql.functions._

val df: DataFrame = ??? // raw data

df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases",
    split(regexp_replace($"phrases", "[\\['\\]]", ""), ","))
  .as[TFFormat]
but you may need something more sophisticated depending on the content of phrases.
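For example, if the bracketed, single-quoted layout in the sample row [1] is consistent, a small UDF that cleans each element explicitly is one such option; this is only a sketch under that assumption:
import org.apache.spark.sql.functions.udf
// Strip the surrounding brackets, split on commas, then trim whitespace and quotes
val parsePhrases = udf { s: String =>
  s.stripPrefix("[").stripSuffix("]")
    .split(",")
    .map(_.trim.stripPrefix("'").stripSuffix("'"))
    .toSeq
}
df
  .withColumn("probability", $"probability".cast("float"))
  .withColumn("phrases", parsePhrases($"phrases"))
  .as[TFFormat]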

Spark: Write each record in RDD to individual files in HDFS directory

I have a requirement where I want to write each individual record in an RDD to an individual file in HDFS.
I did it for the normal filesystem but obviously, it doesn't work for HDFS.
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreach { msg =>
      val value = msg._2
      println(value)
      val fname = java.util.UUID.randomUUID.toString
      val path = dir + fname
      write(path, value)
    }
  }
}
where write is a function which writes to the filesystem.
Is there a way to do this within Spark so that for each record I can natively write to HDFS, without using any other tool like Kafka Connect or Flume?
EDIT: More Explanation
For eg:
If my DstreamRDD has the following records,
abcd
efgh
ijkl
mnop
I need a different file for each record, so one file for "abcd", another for "efgh", and so on.
I tried creating an RDD within the streamRDD but I learnt it's not allowed, as RDDs are not serializable.
You can forcefully repartition the RDD so that the number of partitions equals the number of records, and then save:
val rddCount = rdd.count()
rdd.repartition(rddCount.toInt).saveAsTextFile("your/hdfs/loc")
You can do this in a couple of ways.
From the RDD you can get the SparkContext; once you have it, you can use the parallelize method and pass the String wrapped in a Seq of strings.
For example:
val sc = rdd.sparkContext
sc.parallelize(Seq("some string")).saveAsTextFile(path)
Also, you can use sqlContext to convert the string to a DataFrame and then write it to a file.
For example:
import sqlContext.implicits._
Seq("some string").toDF("value").write.text(path) // write the single-column DataFrame as a text file