TypeInformation in Flink - scala

I have a pipeline in place where data is sent from Flink to a Kafka topic in JSON format. I am also able to read it back from the Kafka topic and extract the JSON attributes. Now, similar to Scala reflection classes where I can compare data types at runtime, I was trying to do the same thing in Flink using TypeInformation: define some predefined format, and whatever data is read from the topic should go through this validation and be passed or failed accordingly.
I have data like below:
{"policyName":"String", "premium":2400, "eventTime":"2021-12-22 00:00:00" }
For my problem, I came across a couple of examples in Flink's book where it is mentioned how to create a TypeInformation variable, but there was nothing on how to use it, so I tried my own way:
val objectMapper = new ObjectMapper()

val tupleType: TypeInformation[(String, Int, String)] =
  Types.TUPLE[(String, Int, String)]
println(tupleType.getTypeClass)

src.map(v => v)
  .map { x =>
    val policyName: String = objectMapper.readTree(x).get("policyName").toString()
    val premium: Int = objectMapper.readTree(x).get("premium").toString().toInt
    val eventTime: String = objectMapper.readTree(x).get("eventTime").toString()
    if ((policyName, premium, eventTime) == tupleType.getTypeClass) {
      println("Good Record: " + (policyName, premium, eventTime))
    } else {
      println("Bad Record: " + (policyName, premium, eventTime))
    }
  }
Now if I pass the input below to the Flink Kafka producer:
{"policyName":"whatever you feel like","premium":"4000","eventTime":"2021-12-20 00:00:00"}
It should give me the expected output of "Bad Record" and the tuple, since the datatype of premium is String and not Long/Int.
If I pass the input below:
{"policyName":"whatever you feel like","premium":4000,"eventTime":"2021-12-20 00:00:00"}
It should give me the output "Good Record" and the tuple.
But according to my code, it is always giving me the else part.
If I create a DataStream variable, store the results of the above map, and then compare like below, it gives me the correct result:
if (tupleType == datas.getType()) { // where 'datas' is a DataStream
  print("Good Records")
} else {
  println("Bad Records")
}
But I want to send the good/bad records to different streams, or maybe insert them directly into a Cassandra table. That is why I am checking the records one by one inside the map. Is my approach correct? What would be the best practice for what I am trying to achieve?
Based on Dominik's inputs, I tried creating my own CustomDeserializer class:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import java.nio.charset.StandardCharsets
class sample extends DeserializationSchema[String] {

  override def deserialize(message: Array[Byte]): Tuple3[Int, String, String] = {
    val data = new String(message, StandardCharsets.UTF_8)
    val objectMapper = new ObjectMapper()
    val id: Int = objectMapper.readTree(data).get("id").toString().toInt
    val category: String = objectMapper.readTree(data).get("Category").toString()
    val eventTime: String = objectMapper.readTree(data).get("eventTime").toString()
    (id, category, eventTime)
  }

  override def isEndOfStream(t: String): Boolean = ???

  override def getProducedType: TypeInformation[String] = TypeInformation.of(classOf[String])
}
I want to try to implement something like below:
src.map(v => v)
  .map { x =>
    if (new sample().deserialize(x) == true) {
      println("Good Record: " + (id, category, eventTime))
    } else {
      println("Bad Record: " + (id, category, eventTime))
    }
  }
But the input is in Array[Byte] form, so how can I implement it? Where exactly am I going wrong? What needs to be modified? This is my first ever attempt at custom classes in Flink Scala.

I don't really think that using TypeInformation to do what you want is the best idea. You can simply use something like a ProcessFunction that accepts a JSON String and then uses the ObjectMapper to deserialize the JSON into a class with the expected structure. You can output the correctly deserialized objects from the ProcessFunction, and the Strings that failed deserialization can be passed as a side output, since they will be your bad records.
This could look like below; note that this uses Jackson Scala to perform the deserialization to a case class. You can find more info here.
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import scala.util.{Failure, Success, Try}

case class Premium(policyName: String, premium: Long, eventTime: String)

class Splitter extends ProcessFunction[String, Premium] {

  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x)   => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String,
                              context: ProcessFunction[String, Premium]#Context,
                              collector: Collector[Premium]): Unit = {
    fromJson(i) match {
      case Right(data) => collector.collect(data)
      case Left(json)  => context.output(outputTag, json)
    }
  }
}
Then you can use the outputTag to get the side output data from the stream, i.e. the incorrect records, as sketched below.
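A rough wiring sketch, assuming src is the DataStream[String] read from Kafka; the stream names here are illustrative, not part of the original answer:
// Hypothetical wiring; Flink matches OutputTags by id and type, so an equal tag works here.
val failedTag = new OutputTag[String]("failed")

val goodRecords: DataStream[Premium] = src.process(new Splitter)
val badRecords: DataStream[String] = goodRecords.getSideOutput(failedTag)

// goodRecords can then go to Cassandra (or any other sink), badRecords e.g. to a dead-letter topic.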

Related

Fields of type Record are inserted as null in BigQuery

I have a Scala job where I need to insert a nested JSON file into BigQuery. The solution for that is to create a BQ table with the field type Record for the nested fields.
I wrote a case class that looks like this:
case class AvailabilityRecord(
  nestedField: NestedRecord,
  timezone: String
) {
  def toMap(): java.util.Map[String, Any] = {
    val map = new java.util.HashMap[String, Any]
    map.put("nestedField", nestedField)
    map.put("timezone", timezone)
    map
  }
}

case class NestedRecord(
  from: String,
  to: String
)
I'm using the Java dependency "com.google.cloud" % "google-cloud-bigquery" % "2.11.0" in my program.
When I try to insert the JSON value that I parsed into the case class into BQ, the value of the timezone field of type String is inserted; however, the nested field of type Record is inserted as null.
For insertion, I'm using the following code:
def insertData(records: Seq[AvailabilityRecord], gcpService: GcpServiceImpl): Task[Unit] = Task.defer {
  val recordsToInsert = records.map(record => InsertBigQueryRecord("XY", record.toMap()))
  gcpService.insertIntoBq(recordsToInsert, TableId.of("dataset", "table"))
}

override def insertIntoBq(records: Iterable[InsertBigQueryRecord],
                          tableId: TableId): Task[Unit] = Task {
  val builder = InsertAllRequest.newBuilder(tableId)
  records.foreach(record => builder.addRow(record.key, record.record))
  bqContext.insertAll(builder.build)
}
What might be the reason that fields of Record type are inserted as null?
The issue was that I needed to map the nested case class too, because the Java API does not know how to handle the case class object.
For that, this helped me to solve the issue:
case class NestedRecord(
  from: String,
  to: String
) {
  def toMap(): java.util.Map[String, String] = {
    val map = new java.util.HashMap[String, String]
    map.put("from", from)
    map.put("to", to)
    map
  }
}
And in the parent case class, the edit would take place in the toMap method:
map.put("nestedField", nestedField.toMap)
map.put("timezone", timezone)

Comparing the json data types at runtime using Jackson and Scala

I have incoming JSON data that looks like below:
{"id":"1000","premium":29999,"eventTime":"2021-12-22 00:00:00"}
Now, I have created a class that will accept this record and check whether the data type of the incoming record matches the data types defined in the case class. However, when I call the method, it always hits the Failure part of the match case.
case class Premium(id: String, premium: Long, eventTime: String)

class Splitter extends ProcessFunction[String, Premium] {

  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      println("inside")
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x)   => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String,
                              context: ProcessFunction[String, Premium]#Context,
                              collector: Collector[Premium]): Unit = {
    fromJson(i) match {
      case Right(data) =>
        collector.collect(data)
        println("Good Records: " + data)
      case Left(json) =>
        context.output(outputTag, json)
        println("Bad Records: " + json)
    }
  }
}
Based on the sample record above, it should hit the Success case, but no matter what I pass, it always enters the Failure part. What else is missing?
I am using Scala 2.11.12 and I tried examples from this link and this link but no luck.

Map key not found error despite using option classes

I'm new to the concept of using the Option type, but I've tried to use it in multiple places in this class to avoid these errors.
The following class is used to store data.
class InTags(val tag35: Option[String], val tag11: Option[String], val tag_109: Option[String], val tag_58: Option[String])
The following code takes a string and converts it into an Int -> String map by splitting on an equals sign.
val message = FIXMessage("8=FIX.4.29=25435=D49=REDACTED56=REDACTED115=REDACTED::::::::::CENTRAL34=296952=20151112-17:11:1111=Order7203109=CENTRAL1=TestAccount63=021=155=CSCO48=CSCO.O22=5207=OQ54=160=20151112-17:11:1338=5000040=244=2815=USD59=047=A13201=CSCO.O13202=510=127")
val tag58 = message.fields(Some(58)).getOrElse("???")
val in_messages= new InTags(message.fields(Some(35)), message.fields(Some(11)), message.fields(Some(109)), Some(tag58))
println(in_messages.tag_109.getOrElse("???"))
where the FIXMessage object is defined as follows:
class FIXMessage(flds: Map[Option[Int], Option[String]]) {
  val fields = flds

  def this(fixString: String) = this(FIXMessage.parseFixString(Some(fixString)))

  override def toString: String = {
    fields.toString
  }
}

object FIXMessage {
  def apply(flds: Map[Option[Int], Option[String]]) = {
    new FIXMessage(flds)
  }

  def apply(flds: String) = {
    new FIXMessage(flds)
  }

  def parseFixString(fixString: Option[String]): Map[Option[Int], Option[String]] = {
    val str = fixString.getOrElse("str=???")
    val parts = str.split(1.toChar)
    (for {
      part <- parts
      p = part.split('=')
    } yield Some(p(0).toInt) -> Some(p(1))).toMap
  }
}
The error I'm getting is ERROR key not found: Some(58), but doesn't the Option class handle this? It basically means that the string passed into the FIXMessage object doesn't contain a substring of the format 58=something (which is true). What is the best way to proceed?
You are using the apply method on Map, which returns the value or throws a NoSuchElementException if the key is not present.
Instead, you could use getOrElse like:
message.fields.getOrElse(Some(58), Some("str"))
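A small sketch of the difference, using a simplified fields map (the values are illustrative):
val fields: Map[Option[Int], Option[String]] = Map(Some(35) -> Some("D"))

fields(Some(58))                        // apply: throws NoSuchElementException: key not found: Some(58)
fields.getOrElse(Some(58), Some("???")) // returns the default Some("???") instead of throwing
fields.get(Some(58))                    // returns None (an Option[Option[String]]), also safe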

How do you write a json4s CustomSerializer that handles collections

I have a class that I am trying to deserialize using the json4s CustomSerializer functionality. I need to do this due to the inability of json4s to deserialize mutable collections.
This is the basic structure of the class I want to deserialize (don't worry about why the class is structured like this):
import scala.collection.mutable.{HashMap, SortedSet}

case class FeatureValue(timestamp: Double)

object FeatureValue {
  implicit def ordering[F <: FeatureValue] = new Ordering[F] {
    override def compare(a: F, b: F): Int = {
      a.timestamp.compareTo(b.timestamp)
    }
  }
}

class Point {
  val features = new HashMap[String, SortedSet[FeatureValue]]

  def add(name: String, value: FeatureValue): Unit = {
    val oldValue: SortedSet[FeatureValue] = features.getOrElseUpdate(name, SortedSet[FeatureValue]())
    oldValue += value
  }
}
Json4s serializes this just fine. A serialized instance might look like the following:
{"features":
{
"CODE0":[{"timestamp":4.8828914447482E8}],
"CODE1":[{"timestamp":4.8828914541333E8}],
"CODE2":[{"timestamp":4.8828915127325E8},{"timestamp":4.8828910097466E8}]
}
}
I've tried writing a custom deserializer, but I don't know how to deal with the list tails. In a normal matcher you can just call your own function recursively, but in this case the function is anonymous and being called through the json4s API. I cannot find any examples that deal with this and I can't figure it out.
Currently I can match only a single hash key, and a single FeatureValue instance in its value. Here is the CustomSerializer as it stands:
import org.json4s.{CustomSerializer, DefaultFormats, Extraction, FieldSerializer}
import org.json4s.JsonAST._

class PointSerializer extends CustomSerializer[Point](format => (
  {
    case JObject(JField("features", JObject(Nil)) :: Nil) => new Point
    case JObject(List(("features", JObject(List(
        (feature: String, JArray(List(JObject(List(("timestamp", JDouble(ts)))))))))
      ))) =>
      val point = new Point
      point.add(feature, FeatureValue(ts))
      point
  },
  {
    // don't need to customize this, it works fine
    case x: Point => Extraction.decompose(x)(DefaultFormats + FieldSerializer[Point]())
  }
))
If I try to change to the ::-separated list format, I have so far only gotten compiler errors. Even if I didn't get compiler errors, I am not sure what I would do with them.
You can get the list of json features in your pattern match and then map over this list to get the Features and their codes.
class PointSerializer extends CustomSerializer[Point](format => (
  {
    case JObject(List(("features", JObject(featuresJson)))) =>
      val features = featuresJson.flatMap {
        case (code: String, JArray(timestamps)) =>
          timestamps.map { case JObject(List(("timestamp", JDouble(ts)))) =>
            code -> FeatureValue(ts)
          }
      }
      val point = new Point
      features.foreach((point.add _).tupled)
      point
  }, {
    case x: Point => Extraction.decompose(x)(DefaultFormats + FieldSerializer[Point]())
  }
))
This deserializes your JSON as follows:
import org.json4s.NoTypeHints
import org.json4s.native.Serialization
import org.json4s.native.Serialization.{read, write}

implicit val formats = Serialization.formats(NoTypeHints) + new PointSerializer

val json = """
  {"features":
    {
      "CODE0":[{"timestamp":4.8828914447482E8}],
      "CODE1":[{"timestamp":4.8828914541333E8}],
      "CODE2":[{"timestamp":4.8828915127325E8},{"timestamp":4.8828910097466E8}]
    }
  }
"""

val point0 = read[Point]("""{"features": {}}""")
val point1 = read[Point](json)

point0.features // Map()
point1.features
// Map(
//   CODE0 -> TreeSet(FeatureValue(4.8828914447482E8)),
//   CODE2 -> TreeSet(FeatureValue(4.8828910097466E8), FeatureValue(4.8828915127325E8)),
//   CODE1 -> TreeSet(FeatureValue(4.8828914541333E8))
// )

Scala Reflection to update a case class val

I'm using Scala and Slick here, and I have a BaseRepository which is responsible for the basic CRUD of my classes.
As a design decision, we have updatedTime and createdTime columns handled by the application, not by triggers in the database. Both of these fields are Joda DateTime instances.
Those fields are defined in two traits called HasUpdatedAt and HasCreatedAt for the tables:
trait HasCreatedAt {
  val createdAt: Option[DateTime]
}

case class User(name: String, createdAt: Option[DateTime] = None) extends HasCreatedAt
I would like to know how can I use reflection to call the user copy method, to update the createdAt value during the database insertion method.
Edit after #vptron and #kevin-wright comments
I have a repo like this
trait BaseRepo[ID, R] {
  def insert(r: R)(implicit session: Session): ID
}
I want to implement the insert just once, and there I want createdAt to be updated; that's why I'm not calling the copy method directly, otherwise I would need to implement it everywhere I use the createdAt column.
This question was answered here to help others with this kind of problem.
I end up using this code to execute the copy method of my case classes using scala reflection.
import reflect._
import scala.reflect.runtime.universe._
import scala.reflect.runtime._

class Empty

val mirror = universe.runtimeMirror(getClass.getClassLoader)

// paramName is the parameter whose value I want to replace
// paramValue is the new parameter value
def updateParam[R: ClassTag](r: R, paramName: String, paramValue: Any): R = {
  val instanceMirror = mirror.reflect(r)
  val decl = instanceMirror.symbol.asType.toType
  val members = decl.members.map(method => transformMethod(method, paramName, paramValue, instanceMirror)).filter {
    case _: Empty => false
    case _        => true
  }.toArray.reverse

  val copyMethod = decl.declaration(newTermName("copy")).asMethod
  val copyMethodInstance = instanceMirror.reflectMethod(copyMethod)
  copyMethodInstance(members: _*).asInstanceOf[R]
}

def transformMethod(method: Symbol, paramName: String, paramValue: Any, instanceMirror: InstanceMirror) = {
  val term = method.asTerm
  if (term.isAccessor) {
    if (term.name.toString == paramName) {
      paramValue
    } else instanceMirror.reflectField(term).get
  } else new Empty
}
With this I can execute the copy method of my case classes, replacing a determined field value.
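A minimal usage sketch, assuming the updateParam method above is in scope, the User case class from the question, and Joda-Time on the classpath:
import org.joda.time.DateTime

val user = User("john", None)
// replace the createdAt field by name; the original instance is left untouched
val updated = updateParam(user, "createdAt", Some(DateTime.now()))
// updated == User("john", Some(<now>))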
As comments have said, don't change a val using reflection. Would you do that with a Java final variable? It makes your code do really unexpected things. If you need to change the value of a val, don't use a val, use a var.
trait HasCreatedAt {
  var createdAt: Option[DateTime] = None
}

case class User(name: String) extends HasCreatedAt
Although having a var in a case class may bring some unexpected behavior, e.g. copy would not work as expected (see the sketch below). This may lead to preferring not to use a case class for this.
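For example (a sketch, assuming Joda-Time on the classpath): copy only carries constructor parameters, so the inherited var is silently reset.
import org.joda.time.DateTime

val u = User("a") // User from the snippet above, extending HasCreatedAt
u.createdAt = Some(DateTime.now())

val copied = u.copy(name = "b")
copied.createdAt // None: the var declared in the trait is reinitialized, not copied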
Another approach would be to make the insert method return an updated copy of the case class, e.g.:
trait HasCreatedAt {
  val createdAt: Option[DateTime]
  def withCreatedAt(dt: DateTime): this.type
}

case class User(name: String, createdAt: Option[DateTime] = None) extends HasCreatedAt {
  def withCreatedAt(dt: DateTime) = this.copy(createdAt = Some(dt))
}

trait BaseRepo[ID, R <: HasCreatedAt] {
  def insert(r: R)(implicit session: Session): (ID, R) = {
    val id = ??? // insert into db
    (id, r.withCreatedAt(??? /* now */))
  }
}
EDIT:
Since I didn't answer your original question and you may know what you are doing I am adding a way to do this.
import scala.reflect.runtime.universe._

val user = User("aaa", None)
val m = runtimeMirror(getClass.getClassLoader)
val im = m.reflect(user)
val decl = im.symbol.asType.toType.declaration("createdAt": TermName).asTerm
val fm = im.reflectField(decl)
fm.set(??? /* now */)
But again, please don't do this. Read this Stack Overflow answer to get some insight into what it can cause (vals map to final fields).