How to deserialize Kafka Avro messages using POJOs with Spark Structured Streaming - scala

I'm using Spark Structured Streaming with the Kafka integration to read and deserialize Avro messages. The goal is to read these messages using a generated POJO as the schema. The code I'm using is the following:
val kafkaConsumerDf: DataFrame = sparkSession
  .readStream
  .format("kafka")
  .option("subscribe", inputTopic)
  .option("group.id", queryName)
  .option("startingOffsets", "earliest")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .load()

kafkaConsumerDf
  .writeStream
  .queryName(queryName)
  .option("checkpointLocation", checkpointPath)
  .foreachBatch((batchDF: DataFrame, batchId: Long) => {
    val deserializedDf: DataFrame = batchDF.select(
      from_avro(col("value"), schemaRegistryConfig) as "value"
    ).select("value.*")
  })
  .start()
The schema of the data is as follows:
{
  "fields": [
    { "name": "idA", "type": "string" },
    { "name": "idB", "type": "string" },
    { "name": "idC", "type": "string" },
    { "name": "name", "type": ["string", "null"] }
  ],
  "name": "Avro",
  "namespace": "com.test.avro",
  "type": "record"
}
As stated above, I want to use a POJO (the generated Avro class) as the schema to read the data consumed from Kafka. The POJO that represents the data structure is:
/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch

final case class Avro(var idA: String, var idB: String, var idC: String, var name: Option[String]) extends org.apache.avro.specific.SpecificRecordBase {
  def this() = this("", "", "", None)
  def get(field$: Int): AnyRef = {
    (field$: @switch) match {
      case 0 => {
        idA
      }.asInstanceOf[AnyRef]
      case 1 => {
        idB
      }.asInstanceOf[AnyRef]
      case 2 => {
        idC
      }.asInstanceOf[AnyRef]
      case 3 => {
        name match {
          case Some(x) => x
          case None => null
        }
      }.asInstanceOf[AnyRef]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
  }
  def put(field$: Int, value: Any): Unit = {
    (field$: @switch) match {
      case 0 => this.idA = {
        value.toString
      }.asInstanceOf[String]
      case 1 => this.idB = {
        value.toString
      }.asInstanceOf[String]
      case 2 => this.idC = {
        value.toString
      }.asInstanceOf[String]
      case 3 => this.name = {
        value match {
          case null => None
          case _ => Some(value.toString)
        }
      }.asInstanceOf[Option[String]]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
    ()
  }
  def getSchema: org.apache.avro.Schema = Avro.SCHEMA$
}

object Avro {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"Avro\",\"namespace\":\"com.test.avro\",\"fields\":[{\"name\":\"idA\",\"type\":\"string\"},{\"name\":\"idB\",\"type\":\"string\"},{\"name\":\"idC\",\"type\":\"string\"},{\"name\":\"name\",\"type\":[\"string\",\"null\"]}]}")
}
So, instead of using the from_avro() function and the schema registry, is it possible to use this POJO to deserialize the data? For example:
val deserializedDf = batchDF.
...
...
.as[Avro]
Do you have any ideas?
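To make the intent concrete, this is roughly what I am hoping for inside foreachBatch (a sketch, assuming the struct produced by from_avro exposes exactly the fields idA, idB, idC and name, that from_avro/schemaRegistryConfig are the same as in the snippet above, and that a case class like this one can get its Encoder from spark.implicits; the nullable name column would map onto Option[String]):
import org.apache.spark.sql.Dataset
import sparkSession.implicits._  // provides an Encoder for the Avro case class

val typedDs: Dataset[Avro] = batchDF
  .select(from_avro(col("value"), schemaRegistryConfig) as "value")
  .select("value.*")  // flatten the struct into columns idA, idB, idC, name
  .as[Avro]           // columns are matched to the case class fields by name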

Related


How to get the return value from For loop and pass it to .body(StringBody(session => in Gatling using Scala
I have created a method with a for loop to generate a String Array in Gatling with Scala:
def main(args: Array[String]): Unit = {
  var AddTest: Array[String] = Array[String]()
  for (i <- 0 to 3) {
    val TestBulk: String =
      s"""{ "name": "Perftest ${Random.alphanumeric.take(6).mkString}",
         "testID": "00000000-0000-0000-0000-000000006017",
         "typeId": "00000000-0000-0000-0000-000000011001",
         "statusId": "00000000-0000-0000-0000-000000005058"};"""
    AddTest = TestBulk.split(",")
    // val TestBulk2: Nothing = AddTest.mkString(",").replace(';', ',')
    // println(TestBulk)
  }
}
Now I want to pass the return value to .body(StringBody(session => ...)):
.exec(http("PerfTest Bulk Json")
.post("/PerfTest/bulk")
.body(StringBody(session =>
s"""[(Value from the for loop).mkString(",").replace(';', ',')
]""".stripMargin)).asJson
Please help me with the possibilities.
You don't need a for loop (or var, or split) for this. You also do not have ; anywhere, so the last replace is pointless.
val ids = """
"testId": "foo",
"typeId": "bar",
"statusId": "baz"
"""
val data = (1 to 3)
.map { _ => Random.alphanumeric.take(6).mkString }
.map { r => s""""name": "Perftest $r"""" }
.map { s => s"{ $s, $ids }" }
.mkString("[", ",", "]")
exec("foo").post("/bar").body(_ => StringBody(data)).asJson
(I added [ and ] around your generated string to make it look like valid json).
Alternatively, you probably have some library that converts maps and lists to JSON out of the box (I don't know Gatling, but there must be something). A slightly cleaner way to do this would be something like:
val ids = Map(
"testId" -> "foo",
"typeId" -> "bar",
"statusId" -> "baz"
)
val data = (1 to 3)
.map { _ => Random.alphanumeric.take(6).mkString }
.map { r => ids + ("name" -> s"Perftest $r") }
exec("foo").post("/bar").body(_ => StringBody(toJson(data))).asJson
This worked for me, thanks to @dima. I built this with your suggested method:
import scala.util.Random
import math.Ordered.orderingToOrdered
import math.Ordering.Implicits.infixOrderingOps
import play.api.libs.json._
import play.api.libs.json.Writes
import play.api.libs.json.Json.JsValueWrapper
val data1 = (1 to 2)
  .map { r =>
    Json.toJson(Map(
      "name" -> Json.toJson(s"Perftest${Random.alphanumeric.take(6).mkString}"),
      "domainId" -> Json.toJson("343RDFDGF4RGGFG"),
      "typeId" -> Json.toJson("343RDFDGF4RGGFG"),
      "statusId" -> Json.toJson("343RDFDGF4RGGFG"),
      "excludedFromAutoHyperlinking" -> Json.toJson(true)
    ))
  }

println(Json.toJson(data1))

json parsing using circe in scala

I'm trying to make use of circe for JSON parsing in Scala. Can you please help me parse 'pocs' from the data into the case class as well? Here is the code:
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder
import io.circe.parser
val json: String =
"""
{
  "segmements": [
    {
      "tableName": "X",
      "segmentName": "XX",
      "pocs": ["aa@aa.com", "bb@bb.com"]
    },
    {
      "tableName": "Y",
      "segmentName": "YY",
      "pocs": ["aa@aa.com", "bb@bb.com"]
    }
  ]
}
"""
final case class TableInfo(tableName: String, segmentName: String)
object TableInfo {
implicit final val TableInfoDecoder: Decoder[TableInfo] = deriveDecoder
}
val result = for {
  data <- parser.parse(json)
  obj <- data.asObject.toRight(left = new Exception("Data was not an object"))
  segmements <- obj("segmements").toRight(left = new Exception("Json didn't had the segments key"))
  r <- segmements.as[List[TableInfo]]
} yield r
println(result)
scastie link: https://scastie.scala-lang.org/BalmungSan/eVEvBulOQwGzg5hIJroAoQ/3
Just add a parameter typed as a collection of String:
final case class TableInfo(tableName: String, segmentName: String, pocs: Seq[String])
scastie
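For completeness, here is a sketch of the whole decode once the field is added; deriveDecoder picks up pocs automatically, and the cursor form below is equivalent to the question's for-comprehension:
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder
import io.circe.parser

final case class TableInfo(tableName: String, segmentName: String, pocs: Seq[String])
implicit val tableInfoDecoder: Decoder[TableInfo] = deriveDecoder

// Descend to "segmements" and decode the array in one step;
// both error types widen to io.circe.Error.
val result: Either[io.circe.Error, List[TableInfo]] =
  parser.parse(json).flatMap(_.hcursor.downField("segmements").as[List[TableInfo]])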

How to convert nested object from rdd row to some custom object

I'm trying to learn some Scala/Spark and practice with a basic Spark integration example. My problem is this: I have a MongoDB running locally, I'm pulling some data from it and making an RDD out of it. The data in the db has a structure like this:
{
  "_id": 0,
  "name": "aimee Zank",
  "scores": [
    { "score": 1.463179736705023, "type": "exam" },
    { "score": 11.78273309957772, "type": "quiz" },
    { "score": 35.8740349954354, "type": "homework" }
  ]
}
Here is some code:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("simple-app")
val sparkSession = SparkSession.builder()
  .appName("example-spark-scala-read-and-write-from-mongo")
  .config(conf)
  .config("spark.mongodb.output.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
  .config("spark.mongodb.input.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
  .getOrCreate()

// Reading Mongodb collection into a dataframe
val df = MongoSpark.load(sparkSession)
val dataRdd: RDD[Row] = df.rdd
dataRdd.foreach(row => println(row.getValuesMap[Any](row.schema.fieldNames)))
The code above provides me this:
Map(_id -> 0, name -> aimee Zank, scores -> WrappedArray([1.463179736705023,exam], [11.78273309957772,quiz], [35.8740349954354,homework]))
Map(_id -> 1, name -> Aurelia Menendez, scores -> WrappedArray([60.06045071030959,exam], [52.79790691903873,quiz], [71.76133439165544,homework]))
At the end I have a problem converting this data to:
case class Student(id: Long, name: String, scores: Scores)
case class Scores(@JsonProperty("scores") scores: List[Score])
case class Score(
  @JsonProperty("score") score: Double,
  @JsonProperty("type") scoreType: String
)
To conclude: the problem is that I cannot convert the data from the RDD to the Student object. The most problematic part for me is the nested 'scores' object.
Please help me understand how this should be done.
Played a bit more with it and ended up with the following solution:
object MainClass {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("simple-app")
    val sparkSession = SparkSession.builder()
      .appName("example-spark-scala-read-and-write-from-mongo")
      .config(conf)
      .config("spark.mongodb.output.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
      .config("spark.mongodb.input.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
      .getOrCreate()

    val objectMapper = new ObjectMapper()
    objectMapper.registerModule(DefaultScalaModule)

    // Reading Mongodb collection into a dataframe
    val df = MongoSpark.load(sparkSession)
    val dataRdd: RDD[Row] = df.rdd

    val students: List[Student] =
      dataRdd
        .collect()
        .map(row => Student(row.getInt(0), row.getString(1), createScoresObject(row))).toList

    println()
  }

  def createScoresObject(row: Row): Scores = {
    Scores(getAllScoresFromWrappedArray(row).map(x => Score(x.getDouble(0), x.getString(1))).toList)
  }

  def getAllScoresFromWrappedArray(row: Row): mutable.WrappedArray[GenericRowWithSchema] = {
    getScoresWrappedArray(row).map(x => x.asInstanceOf[GenericRowWithSchema])
  }

  def getScoresWrappedArray(row: Row): mutable.WrappedArray[AnyVal] = {
    row.getAs[mutable.WrappedArray[AnyVal]](2)
  }
}
case class Student(id: Long, name: String, scores: Scores)
case class Scores(scores: List[Score])
case class Score (score: Double, scoreType: String)
But I would be glad to know if there is a more elegant solution.
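One direction that might be more elegant, sketched under the assumption that the connector's inferred schema lines up with the case classes (not verified against this exact setup): skip the RDD entirely and let Spark's encoders map each row, including the nested scores array, onto a typed Dataset. Field names have to match the document, so the nested field is called `type` here rather than scoreType, and _id is renamed to id:
import org.apache.spark.sql.Dataset

// Field names must match the MongoDB document for the by-name mapping to work.
case class ScoreRow(score: Double, `type`: String)
case class StudentRow(id: Long, name: String, scores: Seq[ScoreRow])

import sparkSession.implicits._

val students: Dataset[StudentRow] =
  MongoSpark.load(sparkSession)        // same DataFrame as df above
    .withColumnRenamed("_id", "id")
    .as[StudentRow]

students.show(false)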

Parsing scala Json into dataframe

Sample JSON:
"alternateId": [
{
"type": "POPID",
"value": "1-7842-0759-001"
},
{
"type": "CAMID",
"value": "CAMID 0000-0002-7EC1-02FF-O-0000-0000-2"
},
{
"type": "ProgrammeUuid",
"value": "1ddb01e2-6146-4e10-bba9-dde40d0ad886"
}
]
I want to update an existing dataframe with two columns, POPID and CAMID. These two values need to be parsed from the JSON structure.
I don't know how to parse this structure. Can you help me with what I need to change in the fetchField method? In the JSON above, POPID is placed first and CAMID second, but in real JSONs each can appear at any of the three positions inside alternateId.
val fetchCAMID_udf = udf(fetchCAMID _)
val fetchPOPID_udf = udf(fetchPOPID _)
var updatedDf = //Data frame initialize
updatedDf = updatedDf.withColumn("CAMID", fetchCAMID_udf(col("alternate_id")))
updatedDf = updatedDf.withColumn("POPID", fetchPOPID_udf(col("alternate_id")))
updatedDf .show(10,false)
def fetchCAMID(jsonStr: String): String = {
  var CAMID: String = fetchField(jsonStr, "CAMID")
  CAMID
}

def fetchPOPID(jsonStr: String): String = {
  fetchField(jsonStr, "POPID")
}

def fetchField(jsonStr: String, fieldName: String): String = {
  try {
    implicit val formats = DefaultFormats
    val extractedField = jsonStr match {
      case "(unknown)" => jsonStr
      case _ => {
        val json = JsonMethods.parse(jsonStr)
        val resultExtracted = (json \\ fieldName)
        val result = resultExtracted match {
          case _: JString => resultExtracted.extract[String]
          case _: JInt => resultExtracted.extract[Int].toString
          case _: JObject => "(unknown)"
        }
        result
      }
    }
    extractedField
  } catch {
    case e: Exception => {
      log.error(s"Fetch field failed. Field name: $fieldName . Json: $jsonStr")
      "(unknown)"
    }
  }
}
Change your fetchField function to the following:
def fetchField(jsonStr: String, fieldName: String): String = {
  try {
    val typeAndValue = (JsonMethods.parse("{" + jsonStr + "}") \ "alternateId" \ "type" \\ classOf[JString])
      .zip(JsonMethods.parse("{" + jsonStr + "}") \ "alternateId" \ "value" \\ classOf[JString])
    typeAndValue.filter(_._1 == fieldName).map(_._2).toList(0)
  } catch {
    case e: Exception => {
      "(unknown)"
    }
  }
}
and you get the CAMID and POPID populated
You can also read the JSON using Spark and extract the values using regular Spark operations:
val df=spark.read.option("multiLine",true).json("test.json")
df.select($"alternateId".getItem(0).as("pop"),$"alternateId".getItem(1).as("cam")).select($"pop.value".as("POPID"),$"cam.value".as("CAMID")).show()
+---------------+--------------------+
| POPID| CAMID|
+---------------+--------------------+
|1-7842-0759-001|CAMID 0000-0002-7...|
+---------------+--------------------+
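Since the question notes that the positions inside alternateId can vary, relying on getItem(0)/getItem(1) is fragile. Here is a sketch of an order-independent variant (it assumes Spark 2.4+ for the filter higher-order function) that picks each value by its type tag:
import org.apache.spark.sql.functions._

val df = spark.read.option("multiLine", true).json("test.json")

// Select each id by its "type" tag rather than by its position in the array.
val result = df.select(
  expr("filter(alternateId, x -> x.type = 'POPID')[0].value").as("POPID"),
  expr("filter(alternateId, x -> x.type = 'CAMID')[0].value").as("CAMID")
)
result.show(false)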

BulkLoading to Phoenix using Spark

I was trying to code some utilities to bulk load data through HFiles from Spark RDDs.
I was following the pattern of CSVBulkLoadTool from Phoenix. I managed to generate some HFiles and load them into HBase, but I can't see the rows using sqlline (using the hbase shell, however, they are visible). I would be more than grateful for any suggestions.
BulkPhoenixLoader.scala:
class BulkPhoenixLoader[A <: ImmutableBytesWritable : ClassTag, T <: KeyValue : ClassTag](rdd: RDD[(A, T)]) {
def createConf(tableName: String, inConf: Option[Configuration] = None): Configuration = {
val conf = inConf.map(HBaseConfiguration.create).getOrElse(HBaseConfiguration.create())
val job: Job = Job.getInstance(conf, "Phoenix bulk load")
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
// initialize credentials to possibily run in a secure env
TableMapReduceUtil.initCredentials(job)
val htable: HTable = new HTable(conf, tableName)
// Auto configure partitioner and reducer according to the Main Data table
HFileOutputFormat2.configureIncrementalLoad(job, htable)
conf
}
def bulkSave(tableName: String, outputPath: String, conf: Option[Configuration]) = {
val configuration: Configuration = createConf(tableName, conf)
rdd.saveAsNewAPIHadoopFile(
outputPath,
classOf[ImmutableBytesWritable],
classOf[Put],
classOf[HFileOutputFormat2],
configuration)
}
}
ExtendedProductRDDFunctions.scala:
class ExtendedProductRDDFunctions[A <: scala.Product](data: org.apache.spark.rdd.RDD[A]) extends
ProductRDDFunctions[A](data) with Serializable {
def toHFile(tableName: String,
columns: Seq[String],
conf: Configuration = new Configuration,
zkUrl: Option[String] = None): RDD[(ImmutableBytesWritable, KeyValue)] = {
val config = ConfigurationUtil.getOutputConfiguration(tableName, columns, zkUrl, Some(conf))
val tableBytes = Bytes.toBytes(tableName)
val encodedColumns = ConfigurationUtil.encodeColumns(config)
val jdbcUrl = zkUrl.map(getJdbcUrl).getOrElse(getJdbcUrl(config))
val conn = DriverManager.getConnection(jdbcUrl)
val query = QueryUtil.constructUpsertStatement(tableName,
columns.toList.asJava,
null)
data.flatMap(x => mapRow(x, jdbcUrl, encodedColumns, tableBytes, query))
}
def mapRow(product: Product,
jdbcUrl: String,
encodedColumns: String,
tableBytes: Array[Byte],
query: String): List[(ImmutableBytesWritable, KeyValue)] = {
val conn = DriverManager.getConnection(jdbcUrl)
val preparedStatement = conn.prepareStatement(query)
val columnsInfo = ConfigurationUtil.decodeColumns(encodedColumns)
columnsInfo.zip(product.productIterator.toList).zipWithIndex.foreach(setInStatement(preparedStatement))
preparedStatement.execute()
val uncommittedDataIterator = PhoenixRuntime.getUncommittedDataIterator(conn, true)
val hRows = uncommittedDataIterator.asScala.filter(kvPair =>
Bytes.compareTo(tableBytes, kvPair.getFirst) == 0
).flatMap(kvPair => kvPair.getSecond.asScala.map(
kv => {
val byteArray = kv.getRowArray.slice(kv.getRowOffset, kv.getRowOffset + kv.getRowLength - 1) :+ 1.toByte
(new ImmutableBytesWritable(byteArray, 0, kv.getRowLength), kv)
}))
conn.rollback()
conn.close()
hRows.toList
}
def setInStatement(statement: PreparedStatement): (((ColumnInfo, Any), Int)) => Unit = {
case ((c, v), i) =>
if (v != null) {
// Both Java and Joda dates used to work in 4.2.3, but now they must be java.sql.Date
val (finalObj, finalType) = v match {
case dt: DateTime => (new Date(dt.getMillis), PDate.INSTANCE.getSqlType)
case d: util.Date => (new Date(d.getTime), PDate.INSTANCE.getSqlType)
case _ => (v, c.getSqlType)
}
statement.setObject(i + 1, finalObj, finalType)
} else {
statement.setNull(i + 1, c.getSqlType)
}
}
private def getIndexTables(conn: Connection, qualifiedTableName: String): List[(String, String)] = {
val table: PTable = PhoenixRuntime.getTable(conn, qualifiedTableName)
val tables = table.getIndexes.asScala.map(x => x.getIndexType match {
case IndexType.LOCAL => (x.getTableName.getString, MetaDataUtil.getLocalIndexTableName(qualifiedTableName))
case _ => (x.getTableName.getString, x.getTableName.getString)
}).toList
tables
}
}
I load the generated HFiles with the HBase utility tool as follows:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles path/to/hfile tableName
You could just convert your CSV file to an RDD of Product and use the .saveToPhoenix method. This is generally how I load CSV data into Phoenix.
Please see: https://phoenix.apache.org/phoenix_spark.html
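For reference, a minimal sketch of that saveToPhoenix route (table name, columns, ZooKeeper URL and sc, the SparkContext, are placeholders; it assumes the phoenix-spark module is on the classpath and the target table already exists):
import org.apache.phoenix.spark._   // adds saveToPhoenix to RDDs of Product

// Each tuple is matched positionally against the listed columns.
val rows = sc.parallelize(Seq(
  (1L, "foo", 10),
  (2L, "bar", 20)
))

rows.saveToPhoenix(
  "OUTPUT_TABLE",
  Seq("ID", "COL1", "COL2"),
  zkUrl = Some("phoenix-server:2181")
)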