Hbase insert are very slow when kafka avro records are converted to Json - scala

I am using Kafka 10 and receiving records in it from DB2 CDC. Kafka 10 uses Confluent Schema Registry to store the DB2 table schema and sends the records as Avro Array[Byte]. I want to store these records into Hbase (lets say Raw Hbase) and then run some transformation over those new records(like dropping columns, aggregation etc) using Hive and store the transformed records again into Hbase (lets say conformed Hbase). I tried 2 approaches and both are giving me some kind of issues. The records are big in length with ~500 columns(although only 10% of columns are req.) and each record is of size ~10kb.
1) I tried deserializing the records into Array[Byte] and then use the streamBulkPut method to insert it into Hbase.
Deserializer code:
def toRecord(buffer: Array[Byte]): Array[Byte] = {
var schemaRegistry: SchemaRegistryClient = null
schemaRegistry= new CachedSchemaRegistryClient(url, 10)
val bb = ByteBuffer.wrap(buffer)
bb.get() // consume MAGIC_BYTE
val schemaId = bb.getInt // consume schemaId //println(schemaId.toString)
val schema = schemaRegistry.getByID(schemaId) // consult the Schema Registry //println(schema)
val reader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
val writer = new GenericDatumWriter[GenericRecord](schema)
val baos = new ByteArrayOutputStream
val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos)
writer.write( reader.read(null, decoder), jsonEncoder) //reader.read(null, decoder): returns Generic record
jsonEncoder.flush
baos.toByteArray
}
HBase bulkPut code:
val messages = KafkaUtils.createDirectStream[Object,Array[Byte],KafkaAvroDecoder,DefaultDecoder](ssc, kafkaParams, topicSet)
val hconf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(ssc.sparkContext, hconf)
val tableName = "your_table"
var rowKeyArray: Array[String] = null
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = getKeyString(parse(avroRecord._1.toString.mkString).extract[Map[String,String]].values.mkString)
var put = new Put(Bytes.toBytes(recordKey))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), AvroDeserializer.toRecord(avroRecord._2))
put
}
def getKeyString(keystr:String):String = {
(Math.abs(keystr map (_.hashCode) reduceLeft( 31 * _ + _) ) % 10 + 48).toChar + "_" + keystr.trim
}
Now this method works but the inserts are painfully slow. I am getting a throughput of ~5k records per minute. The plan was once the records are in Raw Hbase I will use Hive to read and explode the json to run the transformation.
2) Instead of re-serializing the records while storing into Raw Hbase I thought of doing it while loading from Raw->Conformed Hbase (I can manage the slowness here as the data will be already with me i.e. out of kafka). So I tried storing Avro records as it is into Hbase and it ran very fast, I was able to insert 1.5 Million records in 2 mins. Below is code:
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = parse(avroRecord._1.toString.mkString).extract[Map[String,String]]
var put = new Put(Bytes.toBytes(getKeyString(recordKey.values.mkString)))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), avroRecord._2)
put
}
The problem with this approach is Hive is not able to read Avro records from Hbase and I cannot filter the records/run any logic on it.
I would appreciate any kind of help or resource that I can follow to improve the performance. Any approach would work for me if its corresponding issue is solved. Thanks

Related

How to add Header to Avro Kafka Message

We are using Avro Datum Reader and Datum Writer to build Kafka messages in Scala.
Code :
def AvroKafkaMessage(schemaPath : String, dataPath: String): Array[Byte] =
{
val schema = Source.fromFile(schemaPath).mkString
val schemaObj = new Schema.Parser().parse(schema)
val reader= new GenericDatumReader[GenericRecord](schemaObj)
val dataFile = new File(dataPath)
val dataFileReader = new DataFileReader[GenericRecord](dataFile, reader)
val datum = dataFileReader.next()
val writer = new SpecificDatumWriter[GenericRecord](schemaObj)
val out = new ByteArrayOutputStream()
val encoder : BinaryEncoder= EncoderFactory.get().binaryEncoder(out, null)
writer.write(datum,encoder)
encoder.flush()
out.close()
out.toByteArray()
}
Since there would we multiple events per kafka topic, we need to add header to avro messages for unit testing.
How to add headers in avro file and produce kakfa messages ?
Spark dataframes need their own column for Kafka headers. They must exist in a specific format of Array[(String, Array[Byte])]. Avro doesn't particularly matter;your shown function returns a byte array, so add that to a row/column of the dataframe you wish to write to Kafka.
If you have an existing Avro file you want to produce to Kafka, use Spark's existing from_avro function

Deserializing bytes to GenericRecord / Row

In the aim of using DataSourceV2 for reading some stored Parquet binary files that I have already their Spark schema, I am struggling to find a way to deserialize Parquet stream into GenericRecord / Row(s).
For Avro I found that we can do that using something like:
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.spark.sql.avro.AvroDeserializer
...
val datumReader = new GenericDatumReader[GenericRecord]()
val reader = DataFileReader.openReader(avroStream, datumReader)
val iterator = reader.iterator()
val record = iterator.next()
val row = AvroDeserializer(reader.getSchema, schema).deserialize(record)
where avroStream is a stream of bytes.
Is there some utility classes that can help?
Thanks for your help!

Update/Upsert Data in Redshift Table using Spark (or Spark Streaming)

I am Reading data from S3 using Spark Streaming, and I want to update stream data into Amazon Redshift. Data with same primary key exist then that row should be updated and new rows should be inserted. Can someone please suggest the right approach to do it considering the performance?
val ssc = new StreamingContext(sc, Duration(30000))
val lines = ssc.textFileStream("s3://<path-to-data>/YYYY/MM/DD/HH")
lines.foreachRDD(
x => {
val normalizedRDD = processRDD(x)
val df = spark.createDataset(normalizedRDD)
//TODO: How to Update/Upsert data in Redshift?
}
)

Issue in inserting data to Hive Table using Spark and Scala

I am new to Spark. Here is something I wanna do.
I have created two data streams; first one reads data from text file and register it as a temptable using hivecontext. The other one continuously gets RDDs from Kafka and for each RDD, it it creates data streams and register the contents as temptable. Finally I join these two temp tables on a key to get final result set. I want to insert that result set in a hive table. But I am out of ideas. Tried to follow some exmples but that only create a table with one column in hive and that too not readable. Could you please show me how to insert results in a particular database and table of hive. Please note that I can see the results of join using show function so the real challenge lies with insertion in hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10)
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) //获取数据
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
streamingContext.start()
streamingContext.awaitTermination()
}
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
Our attempts to use partitions weren't followed through with, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes, where the table does not exist yet, then any task tries to create the table (because it didn't exist when it first attempted to write to the table) will Exception out.
Be aware, that writing to Hive in streaming applications is usually bad design, as you will often write many small files, which is very inefficient to read and store. So if you write more often than every hour or so to Hive, you should make sure you include logic for compaction, or add an intermediate storage layer more suited to transactional data.

Spark Streaming Job time increasing gradually

I have a streaming job ,
1. Load the data from hdfs file and register it as a temp table.
2. Then join the temp table with tables present in the hive database.
3. Then send the record to Kafka.
initially it was taking 12s to do complete first cycle , then it increases to the 50s after 10 hrs. I don't understand the issue here. Also i noticed that shuffle write is also increasing in every node after 10 hrs it is 200GB+
Sample code is ,
val rowRDD = hContext.read.format("com.databricks.spark.csv").option("header", "false").
option("delimiter", delimiter).load(path).map(col => dosomething)
//Add filter to rdd to convert the data in a time range.
val filteredRDD = rowRDD.filter { col =>{dosomething}}
//Create the data from changed data RDD and schema to new DataFrame
val tblDF = hContext.createDataFrame(filteredRDD, tblSchema).where("crud_status IN ('U','D','I')")
//Register all the records into the temporary tables.
tblDF.registerTempTable("name_changed")
val userDF = hContext.sql("SELECT id,name,account FROM name_changed JOIN account ON(name_changed.id=account.id) JOIN question
on (account.question=question.question)")
userDF.foreachPartition { records =>
val producer = getKafkaProducer(kafkaBootstrap)
records.foreach { rowData =>
producer.send(new ProducerRecord[String, Array[Byte]](topicName, rowData) )
}
}
producer.close()
}