How to get count of invalid data during parse - scala

We are using Spark to parse a big CSV file, which may contain invalid data.
We want to save the valid data into the data store, and also return how many valid records we imported and how many were invalid.
I am wondering how we can do this in Spark. What's the standard approach when reading data?
My current approach uses an Accumulator, but it's not accurate due to how Accumulators work in Spark: updates made inside transformations can be applied more than once if a task is re-executed.
// we define case class CSVInputData: all fields are defined as string
val csvInput = spark.read.option("header", "true").csv(csvFile).as[CSVInputData]
val newDS = csvInput
  .flatMap { row =>
    Try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      data
    } match {
      case Success(data) => Seq(data)
      case _ =>
        errorAcc.add(1)
        Seq()
    }
  }
I tried to use Either, but it failed with the exception:
java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with scala.util.Either[xx.CSVInputData,xx.DomainData] found
Update
I think Either doesn't work with the Spark 2.0 Dataset API:
spark.read.option("header", "true").csv("any.csv").map { row =>
  try {
    Right("")
  } catch { case e: Throwable => Left("") }
}
If we switch to sc (the RDD API), it works:
sc.parallelize('a' to 'z').map { row =>
  try {
    Right("")
  } catch { case e: Throwable => Left("") }
}.collect()
In the current latest Scala (http://www.scala-lang.org/api/2.11.x/index.html#scala.util.Either), Either doesn't implement the Serializable trait:
sealed abstract class Either[+A, +B] extends AnyRef
In the upcoming 2.12 (http://www.scala-lang.org/api/2.12.x/scala/util/Either.html), it does:
sealed abstract class Either[+A, +B] extends Product with Serializable
Update 2: workaround
More info at Spark ETL: Using Either to handle invalid data
As the Spark Dataset doesn't work with Either, the workaround is to call ds.rdd, then use try/Left/Right to capture both valid and invalid data.
spark.read.option("header", "true").csv("/Users/yyuan/jyuan/1.csv").rdd.map { row =>
  try {
    Right("")
  } catch { case e: Throwable => Left("") }
}.collect()

Have you considered using an Either?
val counts = csvInput
  .map { row =>
    try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      Right(data)
    } catch {
      case e: Throwable => Left(row)
    }
  }
val failedCount = counts.filter(_.isLeft).count()
val successCount = counts.filter(_.isRight).count()
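A follow-up sketch (my addition, not part of the original answer): as the question's update notes, the Dataset encoder cannot handle Either on Spark 2.0 / Scala 2.11, so build counts via csvInput.rdd.map { ... }. Once it is an RDD, cache it so the two counts and the save do not re-parse the CSV, and keep the Right side for persisting:

counts.cache()                                               // parse the CSV only once
val validData = counts.collect { case Right(data) => data }  // RDD[DomainData] to save to the store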

Did you try Spark DDQ? It has most of the data quality rules that you will need. You can even extend and customize it.
Link: https://github.com/FRosner/drunken-data-quality

Related

Comparing the json data types at runtime using Jackson and Scala

I have incoming JSON data that looks like below:
{"id":"1000","premium":29999,"eventTime":"2021-12-22 00:00:00"}
Now, I have created a class that will accept this record and check whether the data types of the incoming record match the data types defined in the case class. However, when I call the method, it always hits the Failure part of the match.
case class Premium(id: String, premium: Long, eventTime: String)
class Splitter extends ProcessFunction[String, Premium] {
  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      println("inside")
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x)   => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
    fromJson(i) match {
      case Right(data) =>
        collector.collect(data)
        println("Good Records: " + data)
      case Left(json) =>
        context.output(outputTag, json)
        println("Bad Records: " + json)
    }
  }
}
Based on the sample record above, it should pass the Success value but no matter what I pass, it always enters the Failure part. What else is missing?
I am using Scala 2.11.12 and I tried examples from this link and this link but no luck.
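One thing worth checking (my own observation, not something confirmed in this thread): when fromJson(i) is called without an explicit type parameter, Scala has nothing to pin T to and infers Nothing, so Jackson receives a Manifest[Nothing] and can never produce a Premium, which would send every record to the Failure branch. A minimal sketch of the change, reusing the Splitter above:

override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
  fromJson[Premium](i) match {    // pass the target type explicitly
    case Right(data) =>
      collector.collect(data)
      println("Good Records: " + data)
    case Left(json) =>
      context.output(outputTag, json)
      println("Bad Records: " + json)
  }
}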

TypeInformation in Flink

I have a pipeline in place where data is sent from Flink to a Kafka topic in JSON format, and I am also able to read it back from the Kafka topic and extract the JSON attributes. Now, similar to Scala reflection, where I can compare data types at runtime, I was trying to do the same thing in Flink using TypeInformation: define some expected format, validate whatever data is read from the topic against it, and pass or fail records accordingly.
I have data like below:
{"policyName":"String", "premium":2400, "eventTime":"2021-12-22 00:00:00" }
For my problem, I came across a couple of examples in Flink's book that show how to create a TypeInformation variable, but nothing was mentioned on how to use it, so I tried my own way:
val objectMapper = new ObjectMapper()
val tupleType: TypeInformation[(String, Int, String)] =
  Types.TUPLE[(String, Int, String)]
println(tupleType.getTypeClass)

src.map(v => v)
  .map { x =>
    val policyName: String = objectMapper.readTree(x).get("policyName").toString()
    val premium: Int = objectMapper.readTree(x).get("premium").toString().toInt
    val eventTime: String = objectMapper.readTree(x).get("eventTime").toString()
    if ((policyName, premium, eventTime) == tupleType.getTypeClass) {
      println("Good Record: " + (policyName, premium, eventTime))
    } else {
      println("Bad Record: " + (policyName, premium, eventTime))
    }
  }
Now if I pass the input as below to the flink kafka producer:
{"policyName":"whatever you feel like","premium":"4000","eventTime":"2021-12-20 00:00:00"}
It should give me the expected output as a "Bad record" and the tuple since the datatype of premium is String and not Long/Int.
If I pass the input as below:
{"policyName":"whatever you feel like","premium":4000,"eventTime":"2021-12-20 00:00:00"}
It should give me the output as a "Good Record" and the tuple.
But according to my code, it is always giving me the else part.
If I create a datastream variable and store the results of the above map and then compare like below then it gives me the correct result:
if (tupleType == datas.getType()) { // where 'datas' is a datastream
  print("Good Records")
} else {
  println("Bad Records")
}
But I want to send the good/bad records to different streams, or maybe insert them directly into a Cassandra table. That is why I am checking the records one by one. Is my way correct? What would be the best practice considering what I am trying to achieve?
Based on Dominik's inputs, I tried creating my own CustomDeserializer class:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import java.nio.charset.StandardCharsets
class sample extends DeserializationSchema[String] {
  override def deserialize(message: Array[Byte]): Tuple3[Int, String, String] = {
    val data = new String(message, StandardCharsets.UTF_8)
    val objectMapper = new ObjectMapper()
    val id: Int = objectMapper.readTree(data).get("id").toString().toInt
    val category: String = objectMapper.readTree(data).get("Category").toString()
    val eventTime: String = objectMapper.readTree(data).get("eventTime").toString()
    return (id, category, eventTime)
  }

  override def isEndOfStream(t: String): Boolean = ???

  override def getProducedType: TypeInformation[String] = TypeInformation.of(classOf[String])
}
I want to implement something like below:
src.map(v => v)
  .map { x =>
    if (new sample().deserialize(x) == true) {
      println("Good Record: " + (id, category, eventTime))
    } else {
      println("Bad Record: " + (id, category, eventTime))
    }
  }
But the input is in Array[Byte] form, so how can I implement it? Where exactly am I going wrong? What needs to be modified? This is my first ever attempt at Flink Scala custom classes.
I don't really think that using TypeInformation to do what You want is the best idea. You can simply use something like a ProcessFunction that accepts a JSON String and then uses the ObjectMapper to deserialize the JSON to a class with the expected structure. You can output the correctly deserialized objects from the ProcessFunction, and the Strings that failed deserialization can be passed as a side output since they will be Your Bad Records.
This could look like below; note that this uses the Jackson Scala module to perform deserialization to a case class. You can find more info here.
case class Premium(policyName: String, premium: Long, eventTime: String)
class Splitter extends ProcessFunction[String, Premium] {
  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x)   => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
    fromJson(i) match {
      case Right(data) => collector.collect(data)
      case Left(json)  => context.output(outputTag, json)
    }
  }
}
Then You can use the outputTag to get the side output data from the stream, i.e. the incorrect records.
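A minimal usage sketch (my addition; it assumes src: DataStream[String] from the Kafka source, the Splitter above, and the flink-streaming-scala implicits in scope; the tag only needs to carry the same id, "failed", used inside Splitter):

import org.apache.flink.streaming.api.scala._

val failedTag = new OutputTag[String]("failed")

val good: DataStream[Premium] = src.process(new Splitter)      // correctly parsed records
val bad: DataStream[String]   = good.getSideOutput(failedTag)  // raw JSON of the bad records

good can then be written to Cassandra and bad to a dead-letter topic or table.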

Unable to JSON Marshal B defined as List[A] using AKKA-Http and Spray-Json

As stated in the title, I'm not able to marshal a List[A] into the proper JSON (an array of objects).
I'm using AKKA-Http and Spray-Json.
I defined two case classes:
final case class Item(id: String, pid: String, title: String)
final case class Items(items: List[Item])
And on a GET http://localhost:8080/item call, the request is received by:
class ItemEndpoint extends Directives with ItemJsonSupport {
  val route: Route = {
    path("item") {
      get {
        parameters("page".?, "size".?) { (page, size) =>
          (page, size) match {
            case (_, _) =>
              onSuccess(Server.requestHandler ? GetItemsRequest) {
                case response: Items =>
                  complete(response)
                case _ =>
                  complete(StatusCodes.InternalServerError)
              }
          }
        }
      }
    }
  }
}
GetItemsRequest is sent to the request handler. It is defined as
case class GetItemsRequest
And the RequestHandler as
class RequestHandler extends Actor with ActorLogging {
  var items: Items = _

  def receive: Receive = {
    case GetItemsRequest =>
      items = itemFactory.getItems
      sender() ! items
  }
}
Where getItems performs a query on Cassandra via Spark
def getItems(): Items = {
  val query = sc.sql("SELECT * FROM mydb")
  Items(query.map(row => Item(row.getAs[String]("item"),
    row.getAs[String]("pid"), row.getAs[String]("title")
  )).collect().toList)
}
which returns an Items containing a List[Item]. All the objects are printed correctly (some of them have null fields).
Using ItemJsonFormatter
trait ItemJsonSupport extends SprayJsonSupport with DefaultJsonProtocol {
  implicit val itemFormat: RootJsonFormat[Item] = jsonFormat3(Item)
  implicit val itemsFormat: RootJsonFormat[Items] = jsonFormat1(Items)
}
Leads to the following error:
[simple-rest-system-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(simple-rest-system)] Error during processing of request: 'requirement failed'. Completing with 500 Internal Server Error response. To change default exception handling behavior, provide a custom ExceptionHandler.
java.lang.IllegalArgumentException: requirement failed
I tried catching the exception and working on it, but I haven't obtained more insight into the problem.
I followed AKKA DOCS on marshalling.
When doing the same thing but getting only one item, it works just fine: I obtain a JSON containing all the Item's parameters, well formatted.
{
"id": "78289232389",
"pid": "B007ILCQ8I",
"title": ""
}
Even looking at other related Q&As, I was not able to find an answer, so:
What's causing this? How can I fix it?
All the objects are printed correctly (some of them have null fields).
The exception could be thrown because getItems is returning Item objects that have one or more null field values.
One way to handle this is to replace nulls with empty strings:
def rowToItem(row: Row): Item = {
  Item(Option(row.getAs[String]("item")).getOrElse(""),
    Option(row.getAs[String]("pid")).getOrElse(""),
    Option(row.getAs[String]("title")).getOrElse(""))
}

def getItems(): Items = {
  val query = sc.sql("SELECT * FROM mydb")
  Items(query.map(row => rowToItem(row)).collect().toList)
}
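As an alternative sketch (my assumption, not part of the accepted fix): declare the nullable columns as Option[String] in the case class, so spray-json never sees a raw null and simply omits missing values when rendering:

import org.apache.spark.sql.Row
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport
import spray.json._

final case class Item(id: Option[String], pid: Option[String], title: Option[String])
final case class Items(items: List[Item])

trait ItemJsonSupport extends SprayJsonSupport with DefaultJsonProtocol {
  implicit val itemFormat: RootJsonFormat[Item] = jsonFormat3(Item)
  implicit val itemsFormat: RootJsonFormat[Items] = jsonFormat1(Items)
}

def rowToItem(row: Row): Item =
  Item(Option(row.getAs[String]("item")),
    Option(row.getAs[String]("pid")),
    Option(row.getAs[String]("title")))

This trades the empty-string placeholders for fields that are simply absent from the JSON, so pick whichever contract your clients expect.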

Multiple constructors with the same number of parameters exception while transforming data in spark using scala

Below is the code
def findUniqueGroupInMetadata(sc: SparkContext): Unit = {
  val merchantGroup = sc.cassandraTable("local_pb", "merchant_metadata").select("group_name")
  try {
    val filterByWithGroup = merchantGroup.filter { row =>
      row.getStringOption("group_name") match {
        case Some(s: String) if (s != null) => true
        case None => false
      }
    }.map(row => row.getStringOption("group_name").get.capitalize)
    // filterByWithGroup.take(15).foreach(data => println("merchantGroup => " + data))
    filterByWithGroup.saveToCassandra("local_pb", "merchant_group", SomeColumns("group_name"))
  } catch {
    case e: Exception => println(e.printStackTrace())
  }
}
Exception =>
java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
at com.datastax.spark.connector.util.Reflect$.methodSymbol(Reflect.scala:16)
at com.datastax.spark.connector.util.ReflectionUtil$.constructorParams(ReflectionUtil.scala:63)
at com.datastax.spark.connector.mapper.DefaultColumnMapper.<init>(DefaultColumnMapper.scala:45)
at com.datastax.spark.connector.mapper.LowPriorityColumnMapper$class.defaultColumnMapper(ColumnMapper.scala:47)
at com.datastax.spark.connector.mapper.ColumnMapper$.defaultColumnMapper(ColumnMapper.scala:51)
I found the answer after looking into some blogs.
When I converted the RDD[String] to RDD[Tuple1[String]], everything went smoothly. So basically, in order to save data to Cassandra, the data needs to be of type RDD[TupleX[String]], where X can be 1, 2, 3, ..., or the data can be an RDD[SomeCaseClass].
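For illustration, a short sketch of that fix applied to the code above (same keyspace, table and column names as in the question; the essential change is wrapping each value in Tuple1 before saving):

val groupNames = merchantGroup
  .filter(row => row.getStringOption("group_name").isDefined)
  .map(row => Tuple1(row.getStringOption("group_name").get.capitalize))

// RDD[Tuple1[String]] maps cleanly onto the single group_name column
groupNames.saveToCassandra("local_pb", "merchant_group", SomeColumns("group_name"))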

How do you write a json4s CustomSerializer that handles collections

I have a class that I am trying to deserialize using the json4s CustomSerializer functionality. I need to do this due to the inability of json4s to deserialize mutable collections.
This is the basic structure of the class I want to deserialize (don't worry about why the class is structured like this):
case class FeatureValue(timestamp: Double)

object FeatureValue {
  implicit def ordering[F <: FeatureValue] = new Ordering[F] {
    override def compare(a: F, b: F): Int = {
      a.timestamp.compareTo(b.timestamp)
    }
  }
}

class Point {
  val features = new HashMap[String, SortedSet[FeatureValue]]

  def add(name: String, value: FeatureValue): Unit = {
    val oldValue: SortedSet[FeatureValue] = features.getOrElseUpdate(name, SortedSet[FeatureValue]())
    oldValue += value
  }
}
Json4s serializes this just fine. A serialized instance might look like the following:
{"features":
{
"CODE0":[{"timestamp":4.8828914447482E8}],
"CODE1":[{"timestamp":4.8828914541333E8}],
"CODE2":[{"timestamp":4.8828915127325E8},{"timestamp":4.8828910097466E8}]
}
}
I've tried writing a custom deserializer, but I don't know how to deal with the list tails. In a normal matcher you can just call your own function recursively, but in this case the function is anonymous and being called through the json4s API. I cannot find any examples that deal with this and I can't figure it out.
Currently I can match only a single hash key, and a single FeatureValue instance in its value. Here is the CustomSerializer as it stands:
import org.json4s.{FieldSerializer, DefaultFormats, Extraction, CustomSerializer}
import org.json4s.JsonAST._
class PointSerializer extends CustomSerializer[Point](format => (
  {
    case JObject(JField("features", JObject(Nil)) :: Nil) => new Point
    case JObject(List(("features", JObject(List(
      (feature: String, JArray(List(JObject(List(("timestamp", JDouble(ts)))))))))
    ))) =>
      val point = new Point
      point.add(feature, FeatureValue(ts))
      point
  },
  {
    // don't need to customize this, it works fine
    case x: Point => Extraction.decompose(x)(DefaultFormats + FieldSerializer[Point]())
  }
))
If I try to change to the ::-separated list format, so far I have only gotten compiler errors. Even if I didn't get compiler errors, I am not sure what I would do with them.
You can get the list of json features in your pattern match and then map over this list to get the Features and their codes.
class PointSerializer extends CustomSerializer[Point](format => (
  {
    case JObject(List(("features", JObject(featuresJson)))) =>
      val features = featuresJson.flatMap {
        case (code: String, JArray(timestamps)) =>
          timestamps.map { case JObject(List(("timestamp", JDouble(ts)))) =>
            code -> FeatureValue(ts)
          }
      }
      val point = new Point
      features.foreach((point.add _).tupled)
      point
  }, {
    case x: Point => Extraction.decompose(x)(DefaultFormats + FieldSerializer[Point]())
  }
))
This deserializes your JSON as follows:
import org.json4s.native.Serialization.{read, write}
implicit val formats = Serialization.formats(NoTypeHints) + new PointSerializer
val json = """
{"features":
{
"CODE0":[{"timestamp":4.8828914447482E8}],
"CODE1":[{"timestamp":4.8828914541333E8}],
"CODE2":[{"timestamp":4.8828915127325E8},{"timestamp":4.8828910097466E8}]
}
}
"""
val point0 = read[Point]("""{"features": {}}""")
val point1 = read[Point](json)
point0.features // Map()
point1.features
// Map(
// CODE0 -> TreeSet(FeatureValue(4.8828914447482E8)),
// CODE2 -> TreeSet(FeatureValue(4.8828910097466E8), FeatureValue(4.8828915127325E8)),
// CODE1 -> TreeSet(FeatureValue(4.8828914541333E8))
// )