Date is converted to a default value after saving to Kafka

I save data in Avro (GenericRecord) format to Kafka and then read it back. The record consists of two String fields and one Date field. When writing to Kafka I have to convert the Date to a time value (epoch milliseconds). However, when I read the record back, the date comes out as some default value. How can this be fixed?
The date before writing to Kafka: 2019-03-19
Datetime saved to Kafka: 1552953600000
Datetime retrieved from Kafka: 1824561152
The date converted from datetime: 1969-12-10
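The Date-to-time conversion itself is not shown in the question; a typical round trip (assuming java.sql.Date and a UTC timezone) would look like this:
import java.sql.Date

val date = Date.valueOf("2019-03-19")   // the original date
val millis: Long = date.getTime         // 1552953600000 when midnight is taken in UTC
val restored = new Date(millis)         // 2019-03-19 again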
Here is the code for writing into Kafka:
// Build the Avro record from the schema file
val avroRecord = new GenericData.Record(parseAvroSchemaFromFile("/avro-offset-schema.json"))
avroRecord.put("stringValue", tableNameValue)
avroRecord.put("stringValue2", codeValue)
avroRecord.put("date", state) // state holds the date already converted to a time value

// Send the record to the Kafka topic
val producer = new KafkaProducer[String, GenericRecord](kafkaParams)
val data = new ProducerRecord[String, GenericRecord](kafkaTopic, avroRecord)
producer.send(data)
The code for reading the data from Kafka:
val dataRDD = KafkaUtils.createRDD(
  sparkSession.sparkContext,
  sparkAppConfig.kafkaParams.asJava,
  offsetRanges,
  LocationStrategies.PreferConsistent
)
val genericRecordsValues = dataRDD.map(record => record.value().asInstanceOf[GenericRecord])
val genericRecordsFields = genericRecordsValues.map(record =>
  (
    record.get("TableName").toString,
    record.get("Code").toString,
    new Date(record.get(dayColumnName).asInstanceOf[Long])
  )
)
genericRecordsFields.first()

Related

How to add Header to Avro Kafka Message

We are using the Avro DatumReader and DatumWriter to build Kafka messages in Scala.
Code:
def AvroKafkaMessage(schemaPath: String, dataPath: String): Array[Byte] = {
  // Parse the schema and read the first datum from the Avro data file
  val schema = Source.fromFile(schemaPath).mkString
  val schemaObj = new Schema.Parser().parse(schema)
  val reader = new GenericDatumReader[GenericRecord](schemaObj)
  val dataFile = new File(dataPath)
  val dataFileReader = new DataFileReader[GenericRecord](dataFile, reader)
  val datum = dataFileReader.next()

  // Re-serialize the record to a raw Avro byte array
  val writer = new SpecificDatumWriter[GenericRecord](schemaObj)
  val out = new ByteArrayOutputStream()
  val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
  writer.write(datum, encoder)
  encoder.flush()
  out.close()
  out.toByteArray()
}
Since there will be multiple events per Kafka topic, we need to add headers to the Avro messages for unit testing.
How can we add headers to the Avro payload and produce the Kafka messages?
Spark dataframes need their own column for Kafka headers, and it must have the specific format Array[(String, Array[Byte])]. The Avro part doesn't particularly matter; your function returns a byte array, so add that to a row/column of the dataframe you wish to write to Kafka.
If you have an existing Avro file you want to produce to Kafka, use Spark's existing from_avro function.
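A minimal sketch of what that can look like (assuming Spark 3.0+, which added header support to the Kafka sink; the paths, topic, broker, and header contents below are placeholders):
import org.apache.spark.sql.functions.{array, lit, struct}

// The byte array from AvroKafkaMessage becomes the Kafka "value" column, and a
// "headers" column of type array<struct<key:string,value:binary>> carries the headers.
val payload: Array[Byte] = AvroKafkaMessage("/path/to/schema.avsc", "/path/to/data.avro")
val df = spark.createDataFrame(Seq(Tuple1(payload))).toDF("value")

df.withColumn("headers",
    array(struct(lit("eventType").as("key"), lit("unit-test".getBytes("UTF-8")).as("value"))))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("topic", "my-topic")
  .save()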

Spark Dataframe write to kafka topic in avro format?

I have a Dataframe in Spark that looks like
eventDF
Sno|UserID|TypeExp
1|JAS123|MOVIE
2|ASP123|GAMES
3|JAS123|CLOTHING
4|DPS123|MOVIE
5|DPS123|CLOTHING
6|ASP123|MEDICAL
7|JAS123|OTH
8|POQ133|MEDICAL
.......
10000|DPS123|OTH
I need to write it to a Kafka topic in Avro format.
Currently I am able to write to Kafka as JSON using the following code:
val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column): _*)).alias("value"))

kafkaUserDF.selectExpr("CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save()
Now I want to write this to the Kafka topic in Avro format.
Spark >= 2.4:
You can use the to_avro function from the spark-avro library:
import org.apache.spark.sql.avro._

eventDF.select(
  to_avro(struct(eventDF.columns.map(column): _*)).alias("value")
)
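The resulting binary value column can then be written with the same Kafka sink used for the JSON version, just without the string cast (broker and topic taken from the question):
eventDF
  .select(to_avro(struct(eventDF.columns.map(column): _*)).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save()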
Spark < 2.4
You have to do it the same way, just manually: create a function which writes a serialized Avro record to a ByteArrayOutputStream and returns the result. A naive implementation (supporting only flat objects) could look similar to the following (adapted from Kafka Avro Scala Example by Sushil Kumar Singh):
import java.io.ByteArrayOutputStream

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.io.{BinaryEncoder, EncoderFactory}
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.spark.sql.Row

def encode(schema: org.apache.avro.Schema)(row: Row): Array[Byte] = {
  // Copy each Row field into a GenericRecord under the same field name
  val gr: GenericRecord = new GenericData.Record(schema)
  row.schema.fieldNames.foreach(name => gr.put(name, row.getAs(name)))

  // Serialize the record to a raw Avro byte array
  val writer = new SpecificDatumWriter[GenericRecord](schema)
  val out = new ByteArrayOutputStream()
  val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
  writer.write(gr, encoder)
  encoder.flush()
  out.close()
  out.toByteArray()
}
Convert it to udf:
import org.apache.spark.sql.functions.udf

val schema: org.apache.avro.Schema = ??? // the Avro schema describing the record
val encodeUDF = udf(encode(schema) _)
Use it as a drop-in replacement for to_json:
eventDF.select(
  encodeUDF(struct(eventDF.columns.map(column): _*)).alias("value")
)
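For completeness, a sketch of how the schema placeholder might be filled in and the result written out; the schema file path is hypothetical and its record definition must mirror eventDF's column names and types:
import scala.io.Source
import org.apache.avro.Schema

val schemaJson = Source.fromFile("/path/to/event-schema.avsc").mkString // hypothetical path
val schema: org.apache.avro.Schema = new Schema.Parser().parse(schemaJson)
val encodeUDF = udf(encode(schema) _)

eventDF
  .select(encodeUDF(struct(eventDF.columns.map(column): _*)).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save()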

How to publish JSON data to Kafka? Here the JSON data is in a dataframe

I have data in a CSV file that I converted to JSON as a dataframe, like:
val jsondata = sampleDF.toJSON
I am trying to publish the data to Kafka as follows:
val message = new ProducerRecord[String, String]("topicsample", null, jsondata)
I am getting a string manipulation error. How can I handle this?
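Note that sampleDF.toJSON returns a Dataset[String] (one JSON string per row, in a single column named value), not one String, which is likely why passing it to ProducerRecord fails. A sketch of one alternative, using the same built-in Kafka sink shown in the to_json answer above (the broker is a placeholder; requires the spark-sql-kafka package):
// Each row of the Dataset[String] becomes one Kafka message on the topic
sampleDF.toJSON
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("topic", "topicsample")
  .save()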

HBase inserts are very slow when Kafka Avro records are converted to JSON

I am using Kafka 10 and receiving records in it from DB2 CDC. Kafka 10 uses the Confluent Schema Registry to store the DB2 table schema and sends the records as Avro Array[Byte]. I want to store these records into HBase (let's say raw HBase) and then run some transformations over those new records (like dropping columns, aggregation etc.) using Hive, and store the transformed records into HBase again (let's say conformed HBase). I tried 2 approaches and both are giving me some kind of issue. The records are wide, with ~500 columns (although only 10% of the columns are required), and each record is ~10 KB in size.
1) I tried deserializing the records into Array[Byte] and then using the streamBulkPut method to insert them into HBase.
Deserializer code:
def toRecord(buffer: Array[Byte]): Array[Byte] = {
  var schemaRegistry: SchemaRegistryClient = null
  schemaRegistry = new CachedSchemaRegistryClient(url, 10)

  val bb = ByteBuffer.wrap(buffer)
  bb.get()                 // consume MAGIC_BYTE
  val schemaId = bb.getInt // consume schemaId
  val schema = schemaRegistry.getByID(schemaId) // consult the Schema Registry

  val reader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
  val writer = new GenericDatumWriter[GenericRecord](schema)
  val baos = new ByteArrayOutputStream
  val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos)
  writer.write(reader.read(null, decoder), jsonEncoder) // reader.read(null, decoder) returns a GenericRecord
  jsonEncoder.flush
  baos.toByteArray
}
HBase bulkPut code:
val messages = KafkaUtils.createDirectStream[Object, Array[Byte], KafkaAvroDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
val hconf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(ssc.sparkContext, hconf)
val tableName = "your_table"
var rowKeyArray: Array[String] = null

hbaseContext.streamBulkPut(messages, TableName.valueOf(tableName), putFunction)

def putFunction(avroRecord: Tuple2[Object, Array[Byte]]): Put = {
  implicit val formats = DefaultFormats
  val recordKey = getKeyString(parse(avroRecord._1.toString.mkString).extract[Map[String, String]].values.mkString)
  var put = new Put(Bytes.toBytes(recordKey))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), AvroDeserializer.toRecord(avroRecord._2))
  put
}

def getKeyString(keystr: String): String = {
  (Math.abs(keystr map (_.hashCode) reduceLeft (31 * _ + _)) % 10 + 48).toChar + "_" + keystr.trim
}
This method works, but the inserts are painfully slow; I am getting a throughput of ~5k records per minute. The plan was that once the records are in raw HBase I would use Hive to read and explode the JSON to run the transformations.
2) Instead of re-serializing the records while storing them into raw HBase, I thought of doing it while loading from raw to conformed HBase (I can tolerate the slowness there, as the data will already be with me, i.e. out of Kafka). So I tried storing the Avro records as-is into HBase, and it ran very fast: I was able to insert 1.5 million records in 2 minutes. Below is the code:
hbaseContext.streamBulkPut(messages, TableName.valueOf(tableName), putFunction)

def putFunction(avroRecord: Tuple2[Object, Array[Byte]]): Put = {
  implicit val formats = DefaultFormats
  val recordKey = parse(avroRecord._1.toString.mkString).extract[Map[String, String]]
  var put = new Put(Bytes.toBytes(getKeyString(recordKey.values.mkString)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), avroRecord._2)
  put
}
The problem with this approach is that Hive is not able to read the Avro records from HBase, so I cannot filter the records or run any logic on them.
I would appreciate any kind of help or resource I can follow to improve the performance. Either approach would work for me if its corresponding issue is solved. Thanks.

How to read records in JSON format from Kafka using Structured Streaming?

I am trying to use the Structured Streaming approach, i.e. Spark Streaming based on the DataFrame/Dataset API, to load a stream of data from Kafka.
I use:
Spark 2.10
Kafka 0.10
spark-sql-kafka-0-10
The Spark Kafka DataSource has a defined underlying schema:
|key|value|topic|partition|offset|timestamp|timestampType|
My data comes in JSON format and is stored in the value column. I am looking for a way to extract the underlying schema from the value column and update the received dataframe to the columns stored in value. I tried the approach below, but it does not work:
val columns = Array("column1", "column2") // column names

val rawKafkaDF = sparkSession.sqlContext.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", topic)
  .load()

val columnsToSelect = columns.map(x => new Column("value." + x))
val kafkaDF = rawKafkaDF.select(columnsToSelect: _*)

// some analytics using stream dataframe kafkaDF

val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()
Here I am getting the exception org.apache.spark.sql.AnalysisException: Can't extract value from value#337;, because at the time the stream is created the values inside are not known...
Do you have any suggestions?
From the Spark perspective value is just a sequence of bytes. It has no knowledge of the serialization format or content. To be able to extract the fields you have to parse it first.
If the data is serialized as a JSON string you have two options. You can cast value to StringType and use from_json, providing a schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json

val schema: StructType = StructType(Seq(
  StructField("column1", ???),
  StructField("column2", ???)
))

rawKafkaDF.select(from_json($"value".cast(StringType), schema))
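The parsed value is a struct column, so it can be aliased and flattened afterwards; for example (the parsed alias is just illustrative):
rawKafkaDF
  .select(from_json($"value".cast(StringType), schema).alias("parsed"))
  .select($"parsed.*") // column1 and column2 become top-level columns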
or cast to StringType, extract fields by path using get_json_object:
import org.apache.spark.sql.functions.get_json_object

val columns: Seq[String] = ???
val exprs = columns.map(c => get_json_object($"value".cast("string"), s"$$.$c"))

rawKafkaDF.select(exprs: _*)
and cast later to the desired types.
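For instance, a sketch of the get_json_object variant with explicit casts (the column names and target types here are assumptions):
import org.apache.spark.sql.functions.get_json_object
import org.apache.spark.sql.types.{IntegerType, StringType}

val typedDF = rawKafkaDF.select(
  get_json_object($"value".cast(StringType), "$.column1").cast(IntegerType).alias("column1"),
  get_json_object($"value".cast(StringType), "$.column2").alias("column2")
)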