Fields of type Record are inserted as null in BigQuery - scala

I have a Scala job where I need to insert a nested JSON file into BigQuery. The solution for that is to create a BQ table where the nested fields have the type Record.
I wrote a case class that looks like this:
case class AvailabilityRecord(
  nestedField: NestedRecord,
  timezone: String
) {
  def toMap(): java.util.Map[String, Any] = {
    val map = new java.util.HashMap[String, Any]
    map.put("nestedField", nestedField)
    map.put("timezone", timezone)
    map
  }
}

case class NestedRecord(
  from: String,
  to: String
)
I'm using the Java dependency "com.google.cloud" % "google-cloud-bigquery" % "2.11.0" in my program.
When I try to insert the JSON value that I parsed into the case class into BQ, the value of the timezone field of type String is inserted; however, the nested field of type Record is inserted as null.
For insertion, I'm using the following code:
def insertData(records: Seq[AvailabilityRecord], gcpService: GcpServiceImpl): Task[Unit] = Task.defer {
  val recordsToInsert = records.map(record => InsertBigQueryRecord("XY", record.toMap()))
  gcpService.insertIntoBq(recordsToInsert, TableId.of("dataset", "table"))
}

override def insertIntoBq(records: Iterable[InsertBigQueryRecord],
                          tableId: TableId): Task[Unit] = Task {
  val builder = InsertAllRequest.newBuilder(tableId)
  records.foreach(record => builder.addRow(record.key, record.record))
  bqContext.insertAll(builder.build)
}
What might be the reason that fields of Record type are inserted as null?

The issue was that I needed to convert the nested case class to a map as well, because the Java API does not know how to serialize a Scala case class object.
This is what solved the issue for me:
case class NestedRecord(
  from: String,
  to: String
) {
  def toMap(): java.util.Map[String, String] = {
    val map = new java.util.HashMap[String, String]
    map.put("from", from)
    map.put("to", to)
    map
  }
}
And in the parent case class, the edit would take place in the toMap method:
map.put("nestedField", nestedField.toMap)
map.put("timezone", timezone)

Related

TypeInformation in Flink

I have a pipeline in place where data is sent from Flink to a Kafka topic in JSON format, and I am also able to read it back from the Kafka topic and access the JSON attributes. Now, similar to how Scala reflection lets me compare data types at runtime, I was trying to do the same thing in Flink using TypeInformation: define an expected format, and validate every record read from the topic against it, passing or failing it accordingly.
I have data like below:
{"policyName":"String", "premium":2400, "eventTime":"2021-12-22 00:00:00" }
For my problem, I came across a couple of examples in Flink's book that show how to create a TypeInformation variable, but nothing on how to use it, so I tried it my way:
val objectMapper = new ObjectMapper()
val tupleType: TypeInformation[(String, Int, String)] =
  Types.TUPLE[(String, Int, String)]
println(tupleType.getTypeClass)

src.map(v => v)
  .map { x =>
    val policyName: String = objectMapper.readTree(x).get("policyName").toString()
    val premium: Int = objectMapper.readTree(x).get("premium").toString().toInt
    val eventTime: String = objectMapper.readTree(x).get("eventTime").toString()
    if ((policyName, premium, eventTime) == tupleType.getTypeClass) {
      println("Good Record: " + (policyName, premium, eventTime))
    } else {
      println("Bad Record: " + (policyName, premium, eventTime))
    }
  }
Now if I pass the input below to the Flink Kafka producer:
{"policyName":"whatever you feel like","premium":"4000","eventTime":"2021-12-20 00:00:00"}
It should give me the expected output of "Bad Record" along with the tuple, since the datatype of premium is String and not Long/Int.
If I pass the input as below:
{"policyName":"whatever you feel like","premium":4000,"eventTime":"2021-12-20 00:00:00"}
It should give me the output "Good Record" along with the tuple.
But according to my code, it always takes the else branch.
If I create a DataStream variable, store the results of the above map in it, and then compare like below, it gives me the correct result:
if (tupleType == datas.getType()) { // where 'datas' is a DataStream
  print("Good Records")
} else {
  println("Bad Records")
}
But I want to send the good/bad records to different streams, or maybe insert them directly into a Cassandra table. That is why I am checking the records one by one. Is my way correct? What would be the best practice considering what I am trying to achieve?
Based on Dominik's inputs, I tried creating my own CustomDeserializer class:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import java.nio.charset.StandardCharsets

class sample extends DeserializationSchema[String] {
  override def deserialize(message: Array[Byte]): Tuple3[Int, String, String] = {
    val data = new String(message, StandardCharsets.UTF_8)
    val objectMapper = new ObjectMapper()
    val id: Int = objectMapper.readTree(data).get("id").toString().toInt
    val category: String = objectMapper.readTree(data).get("Category").toString()
    val eventTime: String = objectMapper.readTree(data).get("eventTime").toString()
    return (id, category, eventTime)
  }

  override def isEndOfStream(t: String): Boolean = ???

  override def getProducedType: TypeInformation[String] = TypeInformation.of(classOf[String])
}
I want to implement something like below:
src.map(v => v)
  .map { x =>
    if (new sample().deserialize(x) == true) {
      println("Good Record: " + (id, category, eventTime))
    } else {
      println("Bad Record: " + (id, category, eventTime))
    }
  }
But the input is in Array[Byte] form. So how can I implement it? Where exactly am I going wrong? What needs to be modified? This is my first ever attempt at Flink Scala custom classes.
I don't really think that using TypeInformation to do what you want is the best idea. You can simply use something like a ProcessFunction that accepts a JSON String and then use the ObjectMapper to deserialize the JSON into a class with the expected structure. You can output the correctly deserialized objects from the ProcessFunction, and the Strings that failed deserialization can be passed as a side output, since they will be your Bad Records.
This could look like below; note that this uses Jackson Scala to perform deserialization to a case class. You can find more info here
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper} // package of ScalaObjectMapper depends on the jackson-module-scala version
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

import scala.util.{Failure, Success, Try}

case class Premium(policyName: String, premium: Long, eventTime: String)

class Splitter extends ProcessFunction[String, Premium] {
  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x) => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
    fromJson[Premium](i) match {
      case Right(data) => collector.collect(data)
      case Left(json) => context.output(outputTag, json)
    }
  }
}
Then you can use the outputTag to get the side output data from the stream, i.e. the incorrect records.
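For completeness, a minimal usage sketch under assumptions not in the original answer: src is the DataStream[String] read from Kafka, the Flink Scala API import (org.apache.flink.streaming.api.scala._) is in scope, and the sinks are placeholders. The main output carries the parsed Premium records; the side output carries the raw JSON strings that failed to parse.
val splitter = new Splitter()
val goodRecords: DataStream[Premium] = src.process(splitter)
val badRecords: DataStream[String] = goodRecords.getSideOutput(splitter.outputTag)

goodRecords.print() // or .addSink(...) towards Cassandra
badRecords.print()  // or route to a dead-letter Kafka topic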

Multiple slick `column`s for the same DB column break projection

I'm new to Slick, thus I'm not sure whether the problem is caused by incorrect usage of implicits or whether Slick doesn't allow what I'm trying to do.
In short, I use the slick-pg extension for JSONB support in Postgres. I also use spray-json to deserialize JSONB fields into case classes.
In order to automagically convert columns into objects, I wrote the generic implicit JsonColumnType that you can see below. It allows any case class for which I have defined a JSON formatter to be converted to a jsonb field.
On the other hand, I want to have a JsValue-typed alias for the same column so that I can use the JSONB operators.
import com.github.tminglei.slickpg._
import com.github.tminglei.slickpg.json.PgJsonExtensions
import org.bson.types.ObjectId
import slick.ast.BaseTypedType
import slick.jdbc.JdbcType
import spray.json.{JsValue, RootJsonWriter, RootJsonReader}

import scala.reflect.ClassTag

trait MyPostgresDriver extends ExPostgresDriver with PgArraySupport with PgDate2Support with PgRangeSupport with PgHStoreSupport with PgSprayJsonSupport with PgJsonExtensions with PgSearchSupport with PgNetSupport with PgLTreeSupport {
  override def pgjson = "jsonb" // jsonb support is in postgres 9.4.0 onward; for 9.3.x use "json"
  override val api = MyAPI

  private val plainAPI = new API with SprayJsonPlainImplicits

  object MyAPI extends API with DateTimeImplicits with JsonImplicits with NetImplicits with LTreeImplicits with RangeImplicits with HStoreImplicits with SearchImplicits with SearchAssistants { //with ArrayImplicits
    implicit val ObjectIdColumnType = MappedColumnType.base[ObjectId, Array[Byte]](
      { obj => obj.toByteArray }, { arr => new ObjectId(arr) }
    )

    implicit def JsonColumnType[T: ClassTag](implicit reader: RootJsonReader[T], writer: RootJsonWriter[T]) = {
      val columnType: JdbcType[T] with BaseTypedType[T] = MappedColumnType.base[T, JsValue]({ obj => writer.write(obj) }, { json => reader.read(json) })
      columnType
    }
  }
}

object MyPostgresDriver extends MyPostgresDriver
Here is how my table is defined (minimized version)
case class Article(id: ObjectId, ids: Ids)
case class Ids(doi: Option[String], pmid: Option[Long])

class ArticleRow(tag: Tag) extends Table[Article](tag, "articles") {
  def id = column[ObjectId]("id", O.PrimaryKey)
  def idsJson = column[JsValue]("ext_ids")
  def ids = column[Ids]("ext_ids")

  private val fromTuple: ((ObjectId, Ids)) => Article = {
    case (id, ids) => Article(id, ids)
  }
  private val toTuple = (v: Article) => Option((v.id, v.ids))

  def * = ProvenShape.proveShapeOf((id, ids) <> (fromTuple, toTuple))(MappedProjection.mappedProjectionShape)
}

private val articles = TableQuery[ArticleRow]
Finally, I have a function that looks up articles by the value of a JSON field:
def getArticleByDoi(doi: String): Future[Article] = {
  val query = (for (a <- articles if (a.idsJson +>> "doi").asColumnOf[String] === doi) yield a).take(1).result
  slickDb.run(query).map { items =>
    items.headOption.getOrElse(throw new RuntimeException(s"Article with doi $doi is not found"))
  }
}
Sadly, I get the following exception at runtime:
java.lang.ClassCastException: spray.json.JsObject cannot be cast to server.models.db.Ids
The problem is in SpecializedJdbcResultConverter.base, where ti.getValue is called with the wrong ti. It should be slick.driver.JdbcTypesComponent$MappedJdbcType, but instead it's com.github.tminglei.slickpg.utils.PgCommonJdbcTypes$GenericJdbcType. As a result, the wrong type is passed into my tuple converter.
What makes Slick choose a different type for the column, even though there is an explicit definition of the projection in the table row class?
A sample project that demonstrates the issue is here.

Map key not found error despite using option classes

I'm new to the concept of using the Option type, but I've tried to use it in multiple places in this class to avoid these errors.
The following class is used to store data.
class InTags(val tag35: Option[String], val tag11: Option[String], val tag_109: Option[String], val tag_58: Option[String])
The following code takes a string and converts it into an Int -> String map by splitting on an equals sign.
val message = FIXMessage("8=FIX.4.29=25435=D49=REDACTED56=REDACTED115=REDACTED::::::::::CENTRAL34=296952=20151112-17:11:1111=Order7203109=CENTRAL1=TestAccount63=021=155=CSCO48=CSCO.O22=5207=OQ54=160=20151112-17:11:1338=5000040=244=2815=USD59=047=A13201=CSCO.O13202=510=127")
val tag58 = message.fields(Some(58)).getOrElse("???")
val in_messages= new InTags(message.fields(Some(35)), message.fields(Some(11)), message.fields(Some(109)), Some(tag58))
println(in_messages.tag_109.getOrElse("???"))
where the FIXMessage object is defined as follows:
class FIXMessage(flds: Map[Option[Int], Option[String]]) {
  val fields = flds

  def this(fixString: String) = this(FIXMessage.parseFixString(Some(fixString)))

  override def toString: String = {
    fields.toString
  }
}

object FIXMessage {
  def apply(flds: Map[Option[Int], Option[String]]) = {
    new FIXMessage(flds)
  }

  def apply(flds: String) = {
    new FIXMessage(flds)
  }

  def parseFixString(fixString: Option[String]): Map[Option[Int], Option[String]] = {
    val str = fixString.getOrElse("str=???")
    val parts = str.split(1.toChar)
    (for {
      part <- parts
      p = part.split('=')
    } yield Some(p(0).toInt) -> Some(p(1))).toMap
  }
}
The error I'm getting is ERROR key not found: Some(58), but doesn't the Option class handle this? The error basically means that the string passed into the FIXMessage object doesn't contain a substring of the format 58=something (which is true). What is the best way to proceed?
You are using the apply method of Map, which returns the value or throws NoSuchElementException if the key is not present.
Instead you could use getOrElse, like:
message.fields.getOrElse(Some(58), Some("str"))
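To make the difference concrete, here is a small sketch assuming tag 58 is absent from the parsed map (the sample entries are made up):
val fields: Map[Option[Int], Option[String]] = Map(Some(35) -> Some("D"), Some(11) -> Some("Order7203"))

// fields(Some(58))                                        // Map.apply: throws "key not found: Some(58)"
val viaGetOrElse = fields.getOrElse(Some(58), Some("???")) // Some("???") - no exception
val viaGet = fields.get(Some(58))                          // None - the lookup itself is wrapped in an Option
val tag58 = fields.get(Some(58)).flatten.getOrElse("???")  // "???" - collapses both Option layers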

Scala Reflection to update a case class val

I'm using Scala and Slick here, and I have a base repository which is responsible for the basic CRUD of my classes.
As a design decision, we have updatedTime and createdTime columns handled entirely by the application, not by triggers in the database. Both of these fields are Joda DateTime instances.
Those fields are defined in two traits called HasUpdatedAt and HasCreatedAt, for the tables:
trait HasCreatedAt {
  val createdAt: Option[DateTime]
}

case class User(name: String, createdAt: Option[DateTime] = None) extends HasCreatedAt
I would like to know how I can use reflection to call the User copy method, to update the createdAt value during the database insertion method.
Edit after #vptron and #kevin-wright comments
I have a repo like this
trait BaseRepo[ID, R] {
  def insert(r: R)(implicit session: Session): ID
}
I want to implement the insert just once, and there I want createdAt to be updated; that's why I'm not calling the copy method directly, otherwise I would need to implement it everywhere I use the createdAt column.
This question was answered here to help others with this kind of problem.
I ended up using this code to execute the copy method of my case classes using Scala reflection.
import reflect._
import scala.reflect.runtime.universe._
import scala.reflect.runtime._

class Empty

val mirror = universe.runtimeMirror(getClass.getClassLoader)

// paramName is the parameter whose value I want to replace
// paramValue is the new parameter value
def updateParam[R: ClassTag](r: R, paramName: String, paramValue: Any): R = {
  val instanceMirror = mirror.reflect(r)
  val decl = instanceMirror.symbol.asType.toType
  val members = decl.members.map(method => transformMethod(method, paramName, paramValue, instanceMirror)).filter {
    case _: Empty => false
    case _ => true
  }.toArray.reverse
  val copyMethod = decl.declaration(newTermName("copy")).asMethod
  val copyMethodInstance = instanceMirror.reflectMethod(copyMethod)
  copyMethodInstance(members: _*).asInstanceOf[R]
}

def transformMethod(method: Symbol, paramName: String, paramValue: Any, instanceMirror: InstanceMirror) = {
  val term = method.asTerm
  if (term.isAccessor) {
    if (term.name.toString == paramName) {
      paramValue
    } else instanceMirror.reflectField(term).get
  } else new Empty
}
With this I can execute the copy method of my case classes, replacing the value of a given field.
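For illustration, a hedged usage sketch of updateParam with the User case class from the question (org.joda.time.DateTime and the field name come from the question; the timestamp itself is a placeholder):
import org.joda.time.DateTime

val user = User("john") // createdAt defaults to None
val stamped = updateParam(user, "createdAt", Some(DateTime.now()))
// stamped is a new User with createdAt set; the original instance is left untouched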
As the comments have said, don't change a val using reflection. Would you do that with a Java final variable? It makes your code do really unexpected things. If you need to change the value of a val, don't use a val, use a var.
trait HasCreatedAt {
  var createdAt: Option[DateTime] = None
}

case class User(name: String) extends HasCreatedAt
Although having a var in a case class may bring some unexpected behavior, e.g. copy would not work as expected. This may lead to preferring not to use a case class for this.
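A quick sketch of that pitfall, assuming the var-based definitions above: because createdAt is no longer a constructor parameter, copy silently drops it.
// DateTime is org.joda.time.DateTime, as in the question
val original = User("john")
original.createdAt = Some(DateTime.now())

val duplicate = original.copy()
// duplicate.createdAt is None again, even though original.createdAt was set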
Another approach would be to make the insert method return an updated copy of the case class, e.g.:
trait HasCreatedAt[Self] { // Self type parameter so withCreatedAt can return the concrete subtype
  val createdAt: Option[DateTime]
  def withCreatedAt(dt: DateTime): Self
}

case class User(name: String, createdAt: Option[DateTime] = None) extends HasCreatedAt[User] {
  def withCreatedAt(dt: DateTime) = this.copy(createdAt = Some(dt))
}

trait BaseRepo[ID, R <: HasCreatedAt[R]] {
  def insert(r: R)(implicit session: Session): (ID, R) = {
    val id = ??? // insert into db
    (id, r.withCreatedAt(??? /* now */))
  }
}
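A brief, hedged sketch of what a concrete repository could look like under those assumptions; the object name, the constant id, and the skipped Slick call are placeholders, not part of the original answer.
// Hypothetical concrete repo; the real Slick insert is elided and faked with a constant id
// (DateTime is org.joda.time.DateTime, Session comes from Slick)
object userRepo extends BaseRepo[Long, User] {
  override def insert(r: User)(implicit session: Session): (Long, User) = {
    val stamped = r.withCreatedAt(DateTime.now()) // stamp once, in one place
    val id = 42L                                  // placeholder for the generated key
    (id, stamped)
  }
}
// val (id, persisted) = userRepo.insert(User("john")) // persisted.createdAt is Some(now)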
EDIT:
Since I didn't answer your original question, and you may know what you are doing, I am adding a way to do this.
import scala.reflect.runtime.universe._

val user = User("aaa", None)
val m = runtimeMirror(getClass.getClassLoader)
val im = m.reflect(user)
val decl = im.symbol.asType.toType.declaration("createdAt": TermName).asTerm
val fm = im.reflectField(decl)
fm.set(??? /* now */)
But again, please don't do this. Read this Stack Overflow answer to get some insight into what it can cause (vals map to final fields).

MongoDB ObjectID as JSON using lift-json

I'm using the Bowler framework for some REST APIs (it internally uses the lift-json module for the heavy lifting) and have the following case class:
case class Item(_id : ObjectId, name : String, value : String)
When I return this case object back to the client, I need to include the value of the _id field. However, the _id field comes back empty in the JSON output instead of its actual value.
{"_id":{},"name":"Id Test","value":"id test"}
Any pointers on how this can be fixed will be greatly appreciated.
Update: I tried using a custom serializer for it, but for some reason it doesn't get called!
class ObjectIdSerializer extends Serializer[ObjectId] {
  private val Class = classOf[ObjectId]

  def deserialize(implicit format: Formats) = {
    case (TypeInfo(Class, _), json) => json match {
      case JObject(JField("_id", JString(s)) :: Nil) => new ObjectId(s)
      case x => throw new MappingException("Can't convert " + x + " to ObjectId")
    }
  }

  def serialize(implicit format: Formats) = {
    case x: ObjectId => { println("\t ########Custom Serializer was called!"); JObject(JField("_id", JString(x.toString)) :: Nil) }
  }
}
implicit val formats = DefaultFormats + new ObjectIdSerializer
This is fixed. I needed to define my own RenderStrategy class in order to override the formats declaration. This post has more details on it: http://blog.recursivity.com/post/5433171352/how-bowler-does-rendering-maps-requests-to-objects
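To sanity-check the serializer outside of Bowler, here is a minimal lift-json sketch; the ObjectId class is assumed to be org.bson.types.ObjectId and the Item values are placeholders (Bowler's RenderStrategy wiring itself is not shown).
import net.liftweb.json._
import net.liftweb.json.Serialization.write
import org.bson.types.ObjectId

implicit val formats: Formats = DefaultFormats + new ObjectIdSerializer

val item = Item(new ObjectId(), "Id Test", "id test")
println(write(item))
// with the serializer picked up, the _id value carries the ObjectId's hex string
// (wrapped in the JObject built by ObjectIdSerializer.serialize) instead of rendering as {}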