I am Reading data from S3 using Spark Streaming, and I want to update stream data into Amazon Redshift. Data with same primary key exist then that row should be updated and new rows should be inserted. Can someone please suggest the right approach to do it considering the performance?
val ssc = new StreamingContext(sc, Duration(30000))
val lines = ssc.textFileStream("s3://<path-to-data>/YYYY/MM/DD/HH")
lines.foreachRDD(
x => {
val normalizedRDD = processRDD(x)
val df = spark.createDataset(normalizedRDD)
//TODO: How to Update/Upsert data in Redshift?
}
)
Related
If you use a spark session to create an Iceberg table with Spark scala in batch mode, and after that you do a writestream process with a merge into operation it's not possible to see new changes with spark session used in batch process.
You need to create a new one to see the actual state of the table. Why this happen?
write batch is done like this
df.writeTo(tableName).createOrReplace()
write streaming is done using these methods
override def write(df: DataFrame): StreamingQuery = {
implicit val spark: SparkSession = df.sparkSession
df.writeStream
.format("iceberg")
.trigger(Trigger.Once())
.option("fanout-enabled", "true")
.option("checkpointLocation", checkpointLocation)
.foreachBatch(mergeDFIntoTable _)
.outputMode("update")
.start()
}
private def mergeDFIntoTable(df: DataFrame, batchId: Long): Unit = {
df.createOrReplaceTempView("source_table")
val mergeSQL: String =
s"""
|MERGE INTO $TableIdentifier t
|USING source_table s
|ON s.identifier = t.identifier
|WHEN MATCHED THEN UPDATE SET *
|WHEN NOT MATCHED THEN INSERT *
|""".stripMargin
logger.info(s"BATCH ID: $batchId | SQL: $mergeSQL")
// The streaming query uses a cloned Spark Session from the original session that created it.
df.sparkSession.sql(mergeSQL)
df.unpersist()
}
If I use first spark session created and I read the table after batch process and again after streaming, show is the same (without adding or updating new rows)
spark.table(tableName).show()
However if after streaming process I create a new Spark Session the show process after streaming reflects the changes
val newSparkSession = spark.newSession()
newSparkSession.table(tableName).show()
I'm quite new to Structured Streaming and would like to understand a bit more in detail the main metrics of Spark.
I have a Structured Streaming process in Databricks that reads events from one Eventhub, read values from those events, creates a new df and writes this new df into a second Eventhub.
The event that comes from the first Eventhub, is an eventgrid event from which I read a url (when a blob is added to a storage account) and inside a foreachBatch, I create a new DF and write it to the second Eventhub.
The code has the following structure:
val streamingInputDF =
spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
.select(($"body").cast("string"))
def get_func( batchDF:DataFrame, batchID:Long ) : Unit = {
batchDF.persist()
for (row <- batchDF.rdd.collect) { //necessary to read the file with spark.read....
val file_url = "/mnt/" + path
// create df from readed url
val df = spark
.read
.option("rowTag", "Transaction")
.xml(file_url)
if (!(df.rdd.isEmpty)){
// some filtering
val eh_df = df.select(col(...).as(...),
val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")
// write to Eventhub
eh_jsoned.select("body")
.write
.format("eventhubs")
.options(eventHubsConfWrite.toMap)
.save()
}
}
batchDF.unpersist()
}
val query_test= streamingSelectDF
.writeStream
.queryName("query_test")
.foreachBatch(get_func _)
.start()
I have tried adding the maxEventsPerTrigger(100) parameter but this increases a lot the time from when the data arrives to the Storage Account until it is consumed in Databricks.
The value for maxEventsPerTrigger is set randomly in order to test behaviour.
Having seen the metrics, what sense does it make that the batch time is increasing so much and the processing rate and input rate are similar?
What approach should I consider to improve the process?
I'm running it from a Databricks 7.5 Notebook, Spark 3.0.1 and Scala 2.12.
Thank you all very much in advance.
NOTE:
XML files have the same size
First Eventhub has 20 partitions
Rate data input to first Eventhub is 2 events/sec
I am using Kafka 10 and receiving records in it from DB2 CDC. Kafka 10 uses Confluent Schema Registry to store the DB2 table schema and sends the records as Avro Array[Byte]. I want to store these records into Hbase (lets say Raw Hbase) and then run some transformation over those new records(like dropping columns, aggregation etc) using Hive and store the transformed records again into Hbase (lets say conformed Hbase). I tried 2 approaches and both are giving me some kind of issues. The records are big in length with ~500 columns(although only 10% of columns are req.) and each record is of size ~10kb.
1) I tried deserializing the records into Array[Byte] and then use the streamBulkPut method to insert it into Hbase.
Deserializer code:
def toRecord(buffer: Array[Byte]): Array[Byte] = {
var schemaRegistry: SchemaRegistryClient = null
schemaRegistry= new CachedSchemaRegistryClient(url, 10)
val bb = ByteBuffer.wrap(buffer)
bb.get() // consume MAGIC_BYTE
val schemaId = bb.getInt // consume schemaId //println(schemaId.toString)
val schema = schemaRegistry.getByID(schemaId) // consult the Schema Registry //println(schema)
val reader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
val writer = new GenericDatumWriter[GenericRecord](schema)
val baos = new ByteArrayOutputStream
val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos)
writer.write( reader.read(null, decoder), jsonEncoder) //reader.read(null, decoder): returns Generic record
jsonEncoder.flush
baos.toByteArray
}
HBase bulkPut code:
val messages = KafkaUtils.createDirectStream[Object,Array[Byte],KafkaAvroDecoder,DefaultDecoder](ssc, kafkaParams, topicSet)
val hconf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(ssc.sparkContext, hconf)
val tableName = "your_table"
var rowKeyArray: Array[String] = null
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = getKeyString(parse(avroRecord._1.toString.mkString).extract[Map[String,String]].values.mkString)
var put = new Put(Bytes.toBytes(recordKey))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), AvroDeserializer.toRecord(avroRecord._2))
put
}
def getKeyString(keystr:String):String = {
(Math.abs(keystr map (_.hashCode) reduceLeft( 31 * _ + _) ) % 10 + 48).toChar + "_" + keystr.trim
}
Now this method works but the inserts are painfully slow. I am getting a throughput of ~5k records per minute. The plan was once the records are in Raw Hbase I will use Hive to read and explode the json to run the transformation.
2) Instead of re-serializing the records while storing into Raw Hbase I thought of doing it while loading from Raw->Conformed Hbase (I can manage the slowness here as the data will be already with me i.e. out of kafka). So I tried storing Avro records as it is into Hbase and it ran very fast, I was able to insert 1.5 Million records in 2 mins. Below is code:
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = parse(avroRecord._1.toString.mkString).extract[Map[String,String]]
var put = new Put(Bytes.toBytes(getKeyString(recordKey.values.mkString)))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), avroRecord._2)
put
}
The problem with this approach is Hive is not able to read Avro records from Hbase and I cannot filter the records/run any logic on it.
I would appreciate any kind of help or resource that I can follow to improve the performance. Any approach would work for me if its corresponding issue is solved. Thanks
I am new to Spark. Here is something I wanna do.
I have created two data streams; first one reads data from text file and register it as a temptable using hivecontext. The other one continuously gets RDDs from Kafka and for each RDD, it it creates data streams and register the contents as temptable. Finally I join these two temp tables on a key to get final result set. I want to insert that result set in a hive table. But I am out of ideas. Tried to follow some exmples but that only create a table with one column in hive and that too not readable. Could you please show me how to insert results in a particular database and table of hive. Please note that I can see the results of join using show function so the real challenge lies with insertion in hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10)
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) //获取数据
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
streamingContext.start()
streamingContext.awaitTermination()
}
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
Our attempts to use partitions weren't followed through with, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes, where the table does not exist yet, then any task tries to create the table (because it didn't exist when it first attempted to write to the table) will Exception out.
Be aware, that writing to Hive in streaming applications is usually bad design, as you will often write many small files, which is very inefficient to read and store. So if you write more often than every hour or so to Hive, you should make sure you include logic for compaction, or add an intermediate storage layer more suited to transactional data.
I have a streaming job ,
1. Load the data from hdfs file and register it as a temp table.
2. Then join the temp table with tables present in the hive database.
3. Then send the record to Kafka.
initially it was taking 12s to do complete first cycle , then it increases to the 50s after 10 hrs. I don't understand the issue here. Also i noticed that shuffle write is also increasing in every node after 10 hrs it is 200GB+
Sample code is ,
val rowRDD = hContext.read.format("com.databricks.spark.csv").option("header", "false").
option("delimiter", delimiter).load(path).map(col => dosomething)
//Add filter to rdd to convert the data in a time range.
val filteredRDD = rowRDD.filter { col =>{dosomething}}
//Create the data from changed data RDD and schema to new DataFrame
val tblDF = hContext.createDataFrame(filteredRDD, tblSchema).where("crud_status IN ('U','D','I')")
//Register all the records into the temporary tables.
tblDF.registerTempTable("name_changed")
val userDF = hContext.sql("SELECT id,name,account FROM name_changed JOIN account ON(name_changed.id=account.id) JOIN question
on (account.question=question.question)")
userDF.foreachPartition { records =>
val producer = getKafkaProducer(kafkaBootstrap)
records.foreach { rowData =>
producer.send(new ProducerRecord[String, Array[Byte]](topicName, rowData) )
}
}
producer.close()
}