write into kafka topic using spark and scala - scala

I am reading data from Kafka topic and write back the data received into another Kafka topic.
Below is my code ,
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{Kafka Producer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter
//loading data from kafka
val data = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "*******:9092")
.option("subscribe", "PARAMTABLE")
.option("startingOffsets", "latest")
.load()
//Extracting value from Json
val schema = new StructType().add("PARAM_INSTANCE_ID",IntegerType).add("ENTITY_ID",IntegerType).add("PARAM_NAME",StringType).add("VALUE",StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")
//Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
.foreach(writer)
.outputMode("update")
.start().awaitTermination()
I am getting the below error,
<Console>:47:error :not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to spark, Someone suggest how to resolve this or verify the above code whether it is correct. Thanks in advance .

In spark structured streaming, You can write to Kafka topic after reading from another topic using existing DataStreamWriter for Kafka or you can create your own sink by extending ForeachWriter class.
Without using custom sink:
You can use below code to write a dataframe to kafka. Assuming df as the dataframe generated by reading from kafka topic.
Here dataframe should have atleast one column with name as value. If you have multiple columns you should merge them into one column and name it as value. If key column is not specified then key will be marked as null in destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
Using custom sink:
If you want to implement your own Kafka sink you need create a class by extending ForeachWriter. You need override some methods and pass the object of this class to foreach() method.
// By using Anonymous class to extend ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
// If you are writing Dataset[String] then new ForeachWriter[String]
def open(partitionId: Long, version: Long): Boolean = {
// open connection
}
def process(record: String) = {
// write rows to connection
}
def close(errorOrNull: Throwable): Unit = {
// close the connection
}
}).start()
You can check this databricks notebook for the implemented code (Scroll down and check the code under Kafka Sink heading). I think you are referring to this page only. To solve the issue you need to make sure that KafkaSink class is available to your spark code. You can bring both spark code file and class file in same package. If you are running on spark-shell paste the KafkaSink class before pasting spark code.
Read structured streaming kafka integration guide to explore more.

Related

Spark Stuctured Streaming: StreamingQuery.awaitTermination() does not exit after writing all data to Kafka

I'm writing a small Scala program that should get some small dataframe, write it to Scala and terminate.
I do it with Spark Structured Streaming. The data is written to Kafka (console consumer shows that everything is okay), but StreamingQuery, that successfully writes to Kafka, is freezing on awaitTermination() method.
How can I cope with it?
Here's my Scala core to reproduce the problem:
package part4integrations
import org.apache.spark.sql.SparkSession
import common._
object IntegratingKafkaDemo {
val spark = SparkSession.builder()
.appName("Integrating Kafka")
.master("local[2]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
def writeToKafka() = {
val carsDF = spark.readStream
.schema(carsSchema)
.json("src/main/resources/data/cars")
val carsKafkaDF = carsDF.selectExpr("upper(Name) as key", "Name as value")
// kafka: writing but NOT exiting, WTF?!
carsKafkaDF.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "rockthejvm")
.option("checkpointLocation", "checkpoints_demo")
.start().awaitTermination()
}
def main(args: Array[String]): Unit = {
writeToKafka()
}
}
And here's the json cars/cars.json
{"Name":"chevrolet chevelle malibu", "Miles_per_Gallon":18, "Cylinders":8, "Displacement":307, "Horsepower":130, "Weight_in_lbs":3504, "Acceleration":12, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"buick skylark 320", "Miles_per_Gallon":15, "Cylinders":8, "Displacement":350, "Horsepower":165, "Weight_in_lbs":3693, "Acceleration":11.5, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"plymouth satellite", "Miles_per_Gallon":18, "Cylinders":8, "Displacement":318, "Horsepower":150, "Weight_in_lbs":3436, "Acceleration":11, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"amc rebel sst", "Miles_per_Gallon":16, "Cylinders":8, "Displacement":304, "Horsepower":150, "Weight_in_lbs":3433, "Acceleration":12, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"ford torino", "Miles_per_Gallon":17, "Cylinders":8, "Displacement":302, "Horsepower":140, "Weight_in_lbs":3449, "Acceleration":10.5, "Year":"1970-01-01", "Origin":"USA"}
Stream is by definition infinite, that's why it didn't finish. You have two possible solutions:
use spark.read and then df.write.format("kafka") as mentioned by #OneCricketeer, but it could be more complex if you need to handle only new files, as you will need to track somewhere which files were already processed, and which not
use explicit trigger with Trigger.Once or Trigger.AvailableNow (in Spark 3.3) to process only new data since previous run, and finish, instead of waiting indifinitely. See documentation for more examples. In your case it could be something like this:
carsKafkaDF.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "rockthejvm")
.option("checkpointLocation", "checkpoints_demo")
.trigger(Trigger.Once())
.start().awaitTermination()
It's not "freezing", it is waiting for a new batch files in that directory.
If you want to start the job, and have it end, you want to batch with spark.read.json and spark.write.format("kafka") rather than use the read/write Stream methods

Spark Structured Streaming dynamic lookup with Redis

i am new to spark.
We are currently building a pipeline :
Read the events from Kafka topic
Enrich this data with the help of Redis-Lookup
Write events to the new Kafka topic
So, my problem is when i want to use spark-redis library it performs very well, but data stays static in my streaming job.
Although data is refreshed at Redis, it does not reflect to my dataframe.
Spark reads data at first then never updates it.
Also i am reading from REDIS data at first,total data about 1mio key-val string.
What kind of approaches/methods i can do, i want to use Redis as in-memory dynamic lookup.
And lookup table is changing almost 1 hour.
Thanks.
used libraries:
spark-redis-2.4.1.jar
commons-pool2-2.0.jar
jedis-3.2.0.jar
Here is the code part:
import com.intertech.hortonworks.spark.registry.functions._
val config = Map[String, Object]("schema.registry.url" -> "http://aa.bbb.ccc.yyy:xxxx/api/v1")
implicit val srConfig:SchemaRegistryConfig = SchemaRegistryConfig(config)
var rawEventSchema = sparkSchema("my_raw_json_events")
val my_raw_events_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
.option("subscribe", "my-raw-event")
.option("failOnDataLoss","false")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger",1000)
.load()
.select(from_json($"value".cast("string"),rawEventSchema, Map.empty[String, String])
.alias("C"))
import com.redislabs.provider.redis._
val sc = spark.sparkContext
val stringRdd = sc.fromRedisKV("PARAMETERS:*")
val lookup_map = stringRdd.collect().toMap
val lookup = udf((key: String) => lookup_map.getOrElse(key,"") )
val curated_df = my_raw_events_df
.select(
...
$"C.SystemEntryDate".alias("RecordCreateDate")
,$"C.Profile".alias("ProfileCode")
,**lookup(expr("'PARAMETERS:PROFILE||'||NVL(C.Profile,'')")).alias("ProfileName")**
,$"C.IdentityType"
,lookup(expr("'PARAMETERS:IdentityType||'||NVL(C.IdentityType,'')")).alias("IdentityTypeName")
...
).as("C")
import org.apache.spark.sql.streaming.Trigger
val query = curated_df
.select(to_sr(struct($"*"), "curated_event_sch").alias("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
.option("topic", "curated-event")
.option("checkpointLocation","/user/spark/checkPointLocation/xyz")
.trigger(Trigger.ProcessingTime("30 seconds"))
.start()
query.awaitTermination()
One option is to not use spark-redis, but rather lookup in Redis directly. This can be achieved with df.mapPartitions function. You can find some examples for Spark DStreams here https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. The idea for Structural Streaming is similar. Be careful to handle the Redis connection properly.
Another solution is to do a stream-static join (spark docs):
Instead of collecting the redis rdd to the driver, use the redis dataframe (spark-redis docs) as a static dataframe to be joined with your stream, so it will be like:
val redisStaticDf = spark.read. ...
val streamingDf = spark.readStream. ...
streamingDf.join(redisStaticDf, ...)
Since spark micro-batch execution engine evaluates the query-execution on each trigger, the redis dataframe will fetch the data on each trigger, providing you an up-to-date data (if you will cache the dataframe it won't)

Reading kafka topic using spark dataframe

I want to create dataframe on top of kafka topic and after that i want to register that dataframe as temp table to perform minus operation on data. I have written below code. But while querying registered table I'm getting error
"org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
org.apache.spark.sql.types.DataType
org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types._
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "SERVER ******").option("subscribe", "TOPIC_NAME").option("startingOffsets", "earliest").load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema =StructType(Array(StructField("OEM",StringType,true),StructField("IMEI",StringType,true),StructField("CUSTOMER_ID",StringType,true),StructField("REQUEST_SOURCE",StringType,true),StructField("REQUESTER",StringType,true),StructField("REQUEST_TIMESTAMP",StringType,true),StructField("REASON_CODE",StringType,true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
ERROR:---------- "org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
Also i have used start() method and I'm getting below error.
20/08/11 00:59:30 ERROR streaming.MicroBatchExecution: Query final_df1 [id = 1a3e2ea4-2ec1-42f8-a5eb-8a12ce0fb3f5, runId = 7059f3d2-21ec-43c4-b55a-8c735272bf0f] terminated with error
java.lang.AbstractMethodError
NOTE: My main objective behind writing this script is i want to write minus query on this data and want to compare it with one of the register table i have on cluster. So , to summarise If I'm sending 1000 records in kafka topic from oracle database, I'm creating dataframe on top of oracle table , registering it as temp table and same I'm doing with kafka topic. Than i want to run minus query between source(oracle) and target(kafka topic). to perform 100% data validation between source and target. (Registering kafka topic as temporary table is possible?)
Use memory sink instead of registerTempTable. Check below code.
org.apache.spark.sql.types.DataType
org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "SERVER ******")
.option("subscribe", "TOPIC_NAME")
.option("startingOffsets", "earliest")
.load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema =StructType(Array(StructField("OEM",StringType,true),StructField("IMEI",StringType,true),StructField("CUSTOMER_ID",StringType,true),StructField("REQUEST_SOURCE",StringType,true),StructField("REQUESTER",StringType,true),StructField("REQUEST_TIMESTAMP",StringType,true),StructField("REASON_CODE",StringType,true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF
.writeStream
.outputMode("append")
.format("memory")
.queryName("final_df1").start()
spark.sql("select * from final_df1").show(10,false)
Streaming DataFrame doesn't support the show() method. When you call start() method, it will start a background thread to stream the input data to the sink, and since you are using ConsoleSink, it will output the data to the console. You don't need to call show().
remove the below lines,
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
and add the below or equivalent lines instead,
val query1 = personDF.writeStream.queryName("final_df1").format("memory").outputMode("append").start()
query1.awaitTermination()

Spark structured streaming - UNION two or more streaming sources

I'm using spark 2.3.2 and running into an issue doing a union on 2 or more streaming sources from Kafka. Each of these are streaming sources from Kafka that I've already transformed and stored in Dataframes.
I'd ideally want to store the results of this UNIONed dataframe in parquet format in HDFS or potentially even back into Kafka. The ultimate goal is to store these merged events with as low a latency as possible.
val finalDF = flatDF1
.union(flatDF2)
.union(flatDF3)
val query = finalDF.writeStream
.format("parquet")
.outputMode("append")
.option("path", hdfsLocation)
.option("checkpointLocation", checkpointLocation)
.option("failOnDataLoss", false)
.start()
query.awaitTermination()
when doing a writeStream to console instead of parquet I'm getting the expected results, but the example above causes an assertion failure.
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:42)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:124)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
here is the class and assertion that is failing:
case class OffsetSeq(offsets: Seq[Option[Offset]], metadata: Option[OffsetSeqMetadata] = None) {
assert(sources.size == offsets.size)
Is this because the checkpoint is only storing the offsets for one of the dataframes? Looking through the Spark Structured Streaming documentation it looked like it was possible to do joins/union of streaming sources in Spark 2.2 or >
First, please define how your case class OffsetSeq is related to the code with the unions of the dataframes.
Next, checkpointing is a real issue when performing this union and then writing to Kafka with writestream. Separating into multiple writestreams - each with it's own checkpointing - confuses batch id's because of the union operating. Using the same writestream with union of dataframes fails with checkpointing since the checkpoint appears to seek all the models that generated the dataframes before the union and cannot distinguish what row/record came from what dataframe/model.
For writing to Kafka, from structured sql streaming unioned dataframes - best to use writestream with foreach and ForEachWriter including the Kafka Producer in the process method. No checkpointing is needed; the application the just uses temp checkpoint files which are set to be deleted when appropriate - set "forceDeleteTempCheckpointLocation" to true - in the session builder.
Anyway, I have just set up scala code to union an arbitrary number of streaming dataframes and then write to Kafka Producer. Appears to work well once all Kafka Producer code is placed in the ForEachWriter process method so that it can be serialized by Spark.
val output = dataFrameModelArray.reduce(_ union _)
val stream: StreamingQuery = output
.writeStream.foreach(new ForeachWriter[Row] {
def open(partitionId: Long, version: Long): Boolean = {
true
}
def process(row: Row): Unit = {
val producer: KafkaProducer[String, String] = new KafkaProducer[String, String](props)
val record = new ProducerRecord[String, String](producerTopic, row.getString(0), row.getString(1))
producer.send(record)
}
def close(errorOrNull: Throwable): Unit = {
}
}
).start()
Can add more logic in process method if needed.
Note prior to union, all dataframes to be unioned have been converted into key, value string columns. Value is a json string of the message data to be sent over the Kafka Producer. This is also very important to get write before the union is attempted.
svcModel.transform(query)
.select($"key", $"uuid", $"currentTime", $"label", $"rawPrediction", $"prediction")
.selectExpr("key", "to_json(struct(*)) AS value")
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
where svcModel is a dataframe in the dataFrameModelArray.

Spark - Is not stopping Spark Stream that consumes a Kafka topic

I'm trying to write a test for spark streaming example that consumes data from kafka. I'm using EmbeddedKafka for this.
implicit val config = EmbeddedKafkaConfig(kafkaPort = 12345)
EmbeddedKafka.start()
EmbeddedKafka.createCustomTopic(topic)
println(s"Kafka Running ${EmbeddedKafka.isRunning}")
val spark = SparkSession.builder.appName("StructuredStreaming").master("local[2]").getOrCreate
import spark.implicits._
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:12345")
.option("subscribe", topic)
.load()
// pushing data to kafka
vfes.foreach(e => {
val json = ...
EmbeddedKafka.publishStringMessageToKafka(topic, json)
})
val query = df.selectExpr("CAST(value AS STRING)")
.as[String]
.writeStream.format("console")
query.start().awaitTermination()
spark.stop()
EmbeddedKafka.stop()
When I run this, it keeps running and doesn't stop or print anything to the console.
I cannot figure out why is that.
I also tried terminating kafka by calling EmbeddedKafka.stop() before calling stop on the stream.
Try setting timeout with
query.start().awaitTermination( 3000)
wherein the 3000 is in milli seconds