Kafka + Spark ERROR MicroBatchExecution: Query - scala

I'm trying to run the program from this IBM Developer code pattern: https://github.com/IBM/kafka-streaming-click-analysis?cm_sp=Developer-_-determine-trending-topics-with-clickstream-analysis-_-Get-the-Code. For now, I am only doing the local deployment.
Since the pattern is a little old, my versions of Spark and Kafka aren't exactly what it calls for. The versions I am using are:
Spark: 2.4.6
Kafka: 0.10.2.1
At the last step, I get the following error:
ERROR MicroBatchExecution: Query [id = f4dfe12f-1c99-427e-9f75-91a77f6e51a7,
runId = c9744709-2484-4ea1-9bab-28e7d0f6b511] terminated with error
org.apache.spark.sql.catalyst.errors.package$TreeNodeException
Along with the execution tree
The steps I am following are as follows:
1. Start Zookeeper
2. Start Kafka
3. cd kafka_2.10-0.10.2.1
4. tail -200 data/2017_01_en_clickstream.tsv | bin/kafka-console-producer.sh --broker-list ip:port --topic clicks --producer.config=config/producer.properties
I have downloaded the dataset and stored it in a directory called data inside the kafka_2.10-0.10.2.1 directory.
cd $SPARK_DIR
bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6
Since SPARK_DIR wasn't set during the Spark installation, I navigate to the directory containing Spark to run this command.
scala> import scala.util.Try
scala> case class Click(prev: String, curr: String, link: String, n: Long)
scala> def parseVal(x: Array[Byte]): Option[Click] = {
         val split: Array[String] = new Predef.String(x).split("\\t")
         if (split.length == 4) {
           Try(Click(split(0), split(1), split(2), split(3).toLong)).toOption
         } else
           None
       }
scala> val records = spark.readStream.format("kafka")
         .option("subscribe", "clicks")
         .option("failOnDataLoss", "false")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .load()
scala> val messages = records.select("value").as[Array[Byte]]
         .flatMap(x => parseVal(x))
         .groupBy("curr")
         .agg(Map("n" -> "sum"))
         .sort($"sum(n)".desc)

val query = messages.writeStream
  .outputMode("complete")
  .option("truncate", "false")
  .format("console")
  .start()
The last statement, val query = ..., is the one giving the error mentioned above. Any help would be greatly appreciated. Thanks in advance!

This error usually means a required library for the Kafka integration is missing or incompatible, so you may need to install the missing dependency or switch to a compatible version. In particular, the spark-sql-kafka-0-10 artifact passed to --packages has to match both your Spark version and the Scala version your Spark build was compiled against (for example _2.11 vs _2.12).
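If it helps narrow things down, a quick sanity check from inside spark-shell (the values in the comments are only what a Spark 2.4.6 / Scala 2.11 build would be expected to print, so treat them as an example):

scala> spark.version                        // e.g. 2.4.6
scala> scala.util.Properties.versionString  // e.g. "version 2.11.12"

The _2.11 suffix in org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6 has to match that Scala version; if your Spark build is on Scala 2.12, the _2.12 artifact of the same package would be the one to try instead.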

Related

Reading names of the indexes from kafka topic doesn't work

I have a problem writing a Spark job. The problem goes as follows: I need to process all records from an Elasticsearch index. Spark seemed like a good match for this, and I wrote the following code:
val dataset: Dataset<Row> = session.read()
.format("org.elasticsearch.spark.sql")
.option("es.read.field.include", "orgUUID,serializedEventKey,involvedContactURNs,crmAssociationSmartURNs")
.option("es.read.field.as.array.include", "involvedContactURNs,crmAssociationSmartURNs")
.load(index)
dataset.foreach(transform)
This code works without problems and does everything as expected. The problem is that index (the name of the Elasticsearch index) is not known a priori; I have to read the index names from a Kafka topic. So I have added the following loop:
while (true) {
    val records = kafkaClient.poll(Duration.ofMillis(1000))
    if (!records.isEmpty) {
        records.forEach { record ->
            val index = record.value()
            // Here comes the code from above to process the index
        }
    }
}
This somehow doesn't work: the same records get read multiple times from the same Kafka topic. I understand that Spark spawns multiple executors behind the scenes, but all the Kafka clients share the same group ID, and according to the Kafka documentation only one of them should be able to read a given partition; that is the first mystery I would like explained.
That is not the end of my adventure, though. I decided to use Spark Streaming to read from Kafka and went with the following:
val df = session.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test_consumers")
.option("startingOffsets", "earliest")
.load()
and after that
val temp = df.selectExpr("CAST(key AS STRING)","CAST(value AS STRING)")
val temp1 = temp.map(retrieve, Encoders.STRING())
val retrieve: MapFunction<Row, String> = MapFunction { row ->
val index = row.getAs("value")
// Here initial block of code to process elasticsearch index
index
}
This failed with
Caused by: org.apache.spark.SparkException: Writing job aborted.
At the line dataset.foreach(transform) in the initial code block.

Spark Structured Streaming: StreamingQuery.awaitTermination() does not exit after writing all data to Kafka

I'm writing a small Scala program that should get some small dataframe, write it to Kafka and terminate.
I do it with Spark Structured Streaming. The data is written to Kafka (the console consumer shows that everything is okay), but the StreamingQuery that successfully writes to Kafka freezes on the awaitTermination() method.
How can I cope with it?
Here's my Scala code to reproduce the problem:
package part4integrations

import org.apache.spark.sql.SparkSession
import common._

object IntegratingKafkaDemo {

  val spark = SparkSession.builder()
    .appName("Integrating Kafka")
    .master("local[2]")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  def writeToKafka() = {
    val carsDF = spark.readStream
      .schema(carsSchema)
      .json("src/main/resources/data/cars")

    val carsKafkaDF = carsDF.selectExpr("upper(Name) as key", "Name as value")

    // kafka: writing but NOT exiting, WTF?!
    carsKafkaDF.writeStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "rockthejvm")
      .option("checkpointLocation", "checkpoints_demo")
      .start().awaitTermination()
  }

  def main(args: Array[String]): Unit = {
    writeToKafka()
  }
}
And here's the JSON file cars/cars.json:
{"Name":"chevrolet chevelle malibu", "Miles_per_Gallon":18, "Cylinders":8, "Displacement":307, "Horsepower":130, "Weight_in_lbs":3504, "Acceleration":12, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"buick skylark 320", "Miles_per_Gallon":15, "Cylinders":8, "Displacement":350, "Horsepower":165, "Weight_in_lbs":3693, "Acceleration":11.5, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"plymouth satellite", "Miles_per_Gallon":18, "Cylinders":8, "Displacement":318, "Horsepower":150, "Weight_in_lbs":3436, "Acceleration":11, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"amc rebel sst", "Miles_per_Gallon":16, "Cylinders":8, "Displacement":304, "Horsepower":150, "Weight_in_lbs":3433, "Acceleration":12, "Year":"1970-01-01", "Origin":"USA"}
{"Name":"ford torino", "Miles_per_Gallon":17, "Cylinders":8, "Displacement":302, "Horsepower":140, "Weight_in_lbs":3449, "Acceleration":10.5, "Year":"1970-01-01", "Origin":"USA"}
A stream is by definition infinite; that's why it didn't finish. You have two possible solutions:
1. Use spark.read and then df.write.format("kafka"), as mentioned by @OneCricketeer, but it could be more complex if you need to handle only new files, as you will need to track somewhere which files have already been processed and which have not.
2. Use an explicit trigger with Trigger.Once or Trigger.AvailableNow (in Spark 3.3) to process only the new data since the previous run and then finish, instead of waiting indefinitely. See the documentation for more examples. In your case it could be something like this:
import org.apache.spark.sql.streaming.Trigger

carsKafkaDF.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "rockthejvm")
  .option("checkpointLocation", "checkpoints_demo")
  .trigger(Trigger.Once())
  .start().awaitTermination()
It's not "freezing", it is waiting for a new batch files in that directory.
If you want to start the job, and have it end, you want to batch with spark.read.json and spark.write.format("kafka") rather than use the read/write Stream methods
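For completeness, here is a minimal batch sketch of that alternative, reusing the paths, schema and topic from the question (carsSchema is assumed to come from the question's common._ package):

import org.apache.spark.sql.SparkSession
import common._ // assumed: provides carsSchema, as in the question

object IntegratingKafkaBatchDemo {

  val spark = SparkSession.builder()
    .appName("Integrating Kafka (batch)")
    .master("local[2]")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    // Batch read: picks up whatever files are in the directory right now
    val carsDF = spark.read
      .schema(carsSchema)
      .json("src/main/resources/data/cars")

    // Batch write to Kafka: the job ends once the data has been written
    carsDF.selectExpr("upper(Name) as key", "Name as value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "rockthejvm")
      .save()

    spark.stop()
  }
}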

Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD

Below is the sample code snippet used to fetch data from HBase. This worked fine with Spark 3.1.2; however, after upgrading to Spark 3.2.1 it no longer works, i.e. the returned RDD doesn't contain any values. It also doesn't throw any exception.
def getInfo(sc: SparkContext, startDate: String, cachingValue: Int, sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[String] = {
  val scan = new Scan
  scan.addFamily(Bytes.toBytes("family"))
  scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams)
  val output: RDD[String] = rdd.map { row =>
    Bytes.toString(row._2.getRow)
  }
  output
}

def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String,
                                  scan: Scan, cachingValue: Int, sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
  scan.setCaching(cachingValue)
  val scanString = Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
  val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
  val hbaseConfig = hbaseContext.getConfiguration()
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
  hbaseConfig.set(TableInputFormat.SCAN, scanString)
  sc.newAPIHadoopRDD(
    hbaseConfig,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result]
  ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
}
Also, if we fetch using a Scan directly, without NewAPIHadoopRDD, it works.
Software versions:
Spark: 3.2.1 prebuilt with user provided Apache Hadoop
Scala: 2.12.10
HBase: 2.4.9
Hadoop: 2.10.1
I found out the solution to this one.
See this upgrade guide from Spark 3.1.x to Spark 3.2.x:
https://spark.apache.org/docs/latest/core-migration-guide.html
Since Spark 3.2, spark.hadoopRDD.ignoreEmptySplits is set to true by default which means Spark will not create empty partitions for empty input splits. To restore the behavior before Spark 3.2, you can set spark.hadoopRDD.ignoreEmptySplits to false.
It can be set like this on spark-submit:
./spark-submit \
--class org.apache.hadoop.hbase.spark.example.hbasecontext.HBaseDistributedScanExample \
--master spark://localhost:7077 \
--conf "spark.hadoopRDD.ignoreEmptySplits=false" \
--jars ... \
/tmp/hbase-spark-1.0.1-SNAPSHOT.jar YourHBaseTable
Alternatively, you can set this globally in $SPARK_HOME/conf/spark-defaults.conf so it applies to every Spark application:
spark.hadoopRDD.ignoreEmptySplits false
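If you would rather keep the setting in code, here is a minimal sketch (assuming you construct the SparkContext yourself rather than receiving it as the sc parameter; the app name is just a placeholder) that sets the flag on the SparkConf before the context is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hbase-scan-example")                   // hypothetical app name
  .set("spark.hadoopRDD.ignoreEmptySplits", "false")  // restore the pre-3.2 behaviour

val sc = new SparkContext(conf)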

write into kafka topic using spark and scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.
Below is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter
//loading data from kafka
val data = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "*******:9092")
.option("subscribe", "PARAMTABLE")
.option("startingOffsets", "latest")
.load()
//Extracting value from Json
val schema = new StructType().add("PARAM_INSTANCE_ID",IntegerType).add("ENTITY_ID",IntegerType).add("PARAM_NAME",StringType).add("VALUE",StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")
//Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
.foreach(writer)
.outputMode("update")
.start().awaitTermination()
I am getting the below error:
<console>:47: error: not found: type KafkaSink
       val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Can someone suggest how to resolve this, or verify whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic using the existing DataStreamWriter for Kafka, or you can create your own sink by extending the ForeachWriter class.
Without using custom sink:
You can use the code below to write a dataframe to Kafka, assuming df is the dataframe generated by reading from a Kafka topic.
The dataframe should have at least one column named value. If you have multiple columns, you should merge them into one column and name it value (see the sketch after the code block below). If the key column is not specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
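As a side note on merging several columns into a single value column (mentioned above), one common approach, sketched here with hypothetical column names colA and colB, is to serialize them to JSON:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack everything except the key into a single JSON string column named "value"
val out = df.select(
  col("key"),
  to_json(struct(col("colA"), col("colB"))).as("value")
)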
Using custom sink:
If you want to implement your own Kafka sink, you need to create a class extending ForeachWriter, override some of its methods, and pass an object of this class to the foreach() method.
// Using an anonymous class to extend ForeachWriter
// (needs import org.apache.spark.sql.{ForeachWriter, Row})
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String]
  // and process(record: String) instead
  def open(partitionId: Long, version: Long): Boolean = {
    // open the connection; return true to process this partition
    true
  }
  def process(record: Row): Unit = {
    // write the row to the connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check the Databricks notebook for the implemented code (scroll down to the code under the Kafka Sink heading); I think that is the page you are referring to. To solve the issue you need to make sure the KafkaSink class is available to your Spark code. You can put both the Spark code file and the class file in the same package, and if you are running in spark-shell, paste the KafkaSink class before pasting the Spark code; a rough sketch of such a class is shown at the end of this answer.
Read the Structured Streaming Kafka integration guide to explore more.
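For illustration only, here is a rough sketch of a KafkaSink along those lines; it is not the exact class from the notebook, and it assumes string keys/values and that each row can be flattened into a single string:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {

  private val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, epochId: Long): Boolean = {
    // Create the producer here because KafkaProducer is not serializable
    producer = new KafkaProducer[String, String](kafkaProperties)
    true
  }

  def process(record: Row): Unit = {
    // Adjust to your schema; here the whole row is flattened into a comma-separated string
    producer.send(new ProducerRecord[String, String](topic, record.mkString(",")))
  }

  def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}

With a class like this pasted into spark-shell first (or packaged alongside your job), new KafkaSink(topic, brokers) in the question's code should resolve.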

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application and am just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092

  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))

  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs leads me to believe this is not the intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to fix this properly?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.