Read Kafka messages in a Spark batch job - Scala

What is the best option for reading, each day, the latest messages from a Kafka topic in a Spark batch job (running on EMR)?
I don't want to use Spark Streaming, because I don't have a cluster running 24/7.
I saw the spark-streaming-kafka option:
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.11
But I see that its last release was in 2016.
Is it still the best option?
Thanks!
---------------------- Edit ----------------------
Thanks for the response. I tried this dependency:
group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.12', version: '2.4.4'
running it on EMR with scalaVersion = '2.12.11' and sparkVersion = '2.4.4',
with the following code:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-utl")
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest")
  .option("kafka.partition.assignment.strategy", "range") // added it due to error on missing default value for this param
  .load()

df.show()
In each batch I want to read all the messages available in Kafka. The program failed with the following error:
21/08/18 16:29:50 WARN ConsumerConfig: The configuration auto.offset.reset = earliest was supplied but isn't a known config.
Exception in thread "Kafka Offset Reader" java.lang.NoSuchMethodError: org.apache.kafka.clients.consumer.KafkaConsumer.subscribe(Ljava/util/Collection;)V
at org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:63)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.consumer(KafkaOffsetReader.scala:86)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.$anonfun$fetchTopicPartitions$1(KafkaOffsetReader.scala:119)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.spark.sql.kafka010.KafkaOffsetReader$$anon$1$$anon$2.run(KafkaOffsetReader.scala:59)
What did I do wrong? Thanks.

You're looking at the old spark-kafka package.
Try this one instead: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
Alternatively, use spark-sql-kafka-0-10, which provides the DataFrame-based reader.
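For a batch job, spark-sql-kafka-0-10 is the connector that gives you spark.read.format("kafka"). The NoSuchMethodError in the edit usually indicates an older kafka-clients jar on the classpath than the one spark-sql-kafka-0-10 expects (the subscribe(Collection) overload only exists from kafka-clients 0.10 onwards). A minimal sketch, assuming Spark 2.4.4 / Scala 2.12 and that the connector is supplied via --packages so that a compatible kafka-clients is resolved transitively:

// Sketch only: batch read of everything currently in the topic, assuming the
// connector is provided via
//   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.4 ...
// With matching jars you normally don't need to set kafka.partition.assignment.strategy.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-utl") // broker list taken from the question
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest") // the default for batch reads; shown for clarity
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

df.show()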

Related

Spark Structured Streaming terminates immediately with spark-submit

I am trying to set up an ingestion pipeline using Spark Structured Streaming to read from Kafka and write to a Delta Lake table. I currently have a basic POC that I am trying to get running, with no transformations yet. When working in the spark-shell, everything seems to run fine:
spark-shell --master spark://HOST:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0
Starting and writing the stream:
val source = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "http://HOST:9092").option("subscribe", "spark-kafka-test").option("startingOffsets", "earliest").load().writeStream.format("delta").option("checkpointLocation", "/tmp/delta/checkpoint").start("/tmp/delta/delta-test")
However, once I pack this into a Scala application and spark-submit the class, with the required packages in an sbt assembly jar, to the standalone Spark instance, the stream seems to stop immediately and does not process any messages in the topic. I simply get the following logs:
INFO SparkContext: Invoking stop() from shutdown hook
...
INFO SparkContext: Successfully stopped SparkContext
INFO MicroBatchExecution: Resuming at batch 0 with committed offsets {} and available offsets {KafkaV2[Subscribe[spark-kafka-test]]: {"spark-kafka-test":{"0":6}}}
INFO MicroBatchExecution: Stream started from {}
Process finished with exit code 0
Here is my Scala class:
import org.apache.spark.sql.SparkSession

object Consumer extends App {

  val spark = SparkSession
    .builder()
    .appName("Spark Kafka Consumer")
    .master("spark://HOST:7077")
    //.master("local")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.cores", "2")
    .config("spark.cores.max", "2")
    .getOrCreate()

  val source = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "http://HOST:9092")
    .option("subscribe", "spark-kafka-test")
    .option("startingOffsets", "earliest")
    .load()
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/delta/checkpoint")
    .start("/tmp/delta/delta-test")
}
Here is my spark-submit command:
spark-submit --master spark://HOST:7077 --deploy-mode client --class Consumer --name Kafka-Delta-Consumer --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,io.delta:delta-core_2.12:1.1.0 <PATH-TO-JAR>/assembly.jar
Does anybody have an idea why the stream is closed and the program terminates? I am assuming memory is not a problem, as the whole Kafka topic is only a few bytes.
EDIT:
From some further investigation, I found the following behavior: in my Confluent hub interface, I can see that starting the stream via the spark-shell registers a consumer, and active consumption is visible in monitoring.
In contrast, the spark-submit job is seemingly not able to register the consumer. In the driver logs, I found the following error:
WARN org.apache.spark.sql.kafka010.KafkaOffsetReaderConsumer - Error in attempt 1 getting Kafka offsets:
java.lang.NullPointerException
at org.apache.spark.kafka010.KafkaConfigUpdater.setAuthenticationConfigIfNeeded(KafkaConfigUpdater.scala:60)
In my case, I am working with one master and one worker on the same machine. There shouldn't be any networking differences between spark-shell and spark-submit executions, am I right?
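One thing worth checking, independent of the NullPointerException above (this is an observation, not a confirmed diagnosis of this setup): start() only launches the query and returns a StreamingQuery handle, so if the application's main thread reaches the end of the program, the driver shuts down, which is consistent with the "Invoking stop() from shutdown hook" line in the logs. Spark's own documentation also recommends defining a main() method rather than extending scala.App. A trimmed-down sketch based on the class above (the session configs from the original are omitted here):

// Sketch only: same pipeline as above, but with an explicit main() and a call
// to awaitTermination() so the driver stays alive while the stream runs.
import org.apache.spark.sql.SparkSession

object Consumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark Kafka Consumer")
      .getOrCreate()

    val query = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "http://HOST:9092")
      .option("subscribe", "spark-kafka-test")
      .option("startingOffsets", "earliest")
      .load()
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/tmp/delta/checkpoint")
      .start("/tmp/delta/delta-test")

    query.awaitTermination() // block until the query stops or fails
  }
}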

Spark - Error: Failed to load class - spark-submit

I created an sbt project with IntelliJ and built the artifact into a jar file.
I put the jar file on the server and submitted it, but I got this error:
spark-submit --master spark://master:7077 --class streaming_process spark-jar/spark-streaming.jar
Error: Failed to load class streaming_process.
21/01/23 04:41:32 INFO ShutdownHookManager: Shutdown hook called
21/01/23 04:41:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-982e8fe3-9421-45bd-aced-e46c4d756054
My code
// Code Block 2 Starts Here
val spark = SparkSession.builder
  .master("spark://master:7077")
  .appName("Stream Processing Application")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
// Code Block 2 Ends Here

// Code Block 3 Starts Here
// Stream meetup.com RSVP Message Data from Kafka
val meetup_rsvp_df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers)
  .option("subscribe", kafka_topic_name)
  .option("startingOffsets", "latest")
  .load()
You can see my project image:
The JVM can't find the jar which contains the streaming_process class. Please use the --jars spark-jar/spark-streaming.jar option.
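Another common cause of this particular error (an assumption here, since the project layout isn't shown): --class must be the fully qualified class name, so if streaming_process is declared inside a package, the package prefix has to be included. A sketch with a hypothetical package name, plus a quick way to check what actually ended up in the jar:

# Hypothetical package name; adjust to whatever the source file declares.
spark-submit --master spark://master:7077 --class com.example.streaming_process spark-jar/spark-streaming.jar

# Verify the compiled class is really inside the built jar.
jar tf spark-jar/spark-streaming.jar | grep streaming_process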

Problem reading AVRO messages from a Kafka Topic using Structured Spark Streaming (Spark Version 2.3.1.3.0.1.0-187/ Scala version 2.11.8)

I am invoking spark-shell like this:
spark-shell --jars kafka-clients-0.10.2.1.jar,spark-sql-kafka-0-10_2.11-2.3.0.cloudera1.jar,spark-streaming-kafka-0-10_2.11-2.3.0.jar,spark-avro_2.11-2.4.0.jar,avro-1.9.1.jar
After that, I read from a Kafka topic using readStream():
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka-1.x:9093,kafka-2.x:9093,kafka-0.x:9093")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.ssl.protocol", "TLSv1.2")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"XXXXXXXXXX\";")
  .option("subscribe", "test-topic")
  .option("startingOffsets", "latest")
  .load()
Then I read the AVRO schema file:
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("/root/avro_schema.json")))
Then I build the DataFrame that matches the AVRO schema:
val DataLineageDF = df.select(from_avro(col("value"),jsonFormatSchema).as("DataLineage")).select("DataLineage.*")
This throws the following error:
java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
I could fix this problem by replacing the jar spark-avro_2.11-2.4.0.jar with spark-avro_2.11-2.4.0-palantir.31.jar.
Issue:
DataLineageDF.writeStream.format("console").outputMode("append").trigger(Trigger.ProcessingTime("10 seconds")).start
fails with this error:
Exception in thread "stream execution thread for [id = ad836d19-0f29-499a-adea-57c6d9c630b2, runId = 489b1123-a2b2-48ea-9d24-e6744e0959b0]" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.boxedType(Lorg/apache/spark/sql/types/DataType;)Ljava/lang/String;
which seems to be related to incompatible jars. If anyone has any idea what's going wrong, please comment.
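Not a confirmed fix, but both NoSuchMethodErrors are the classic symptom of mixing a spark-avro build that targets one Spark release with a different Spark runtime (here, a 2.4-line spark-avro on a 2.3.1 cluster); from_avro ships with org.apache.spark:spark-avro only from Spark 2.4 onwards. One option, sketched under the assumption that the cluster can run Spark 2.4.x with Scala 2.11, is to let --packages resolve a mutually consistent set of jars instead of listing them by hand:

spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4,org.apache.spark:spark-avro_2.11:2.4.4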

Spark structured streaming write errors

I'm running into some odd errors when I consume and sink Kafka messages. I'm running Spark 2.3.0, and I know this was working in an earlier version.
val event = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", <server list>)
  .option("subscribe", <topic>)
  .load()

val filesink_query = outputdf.writeStream
  .partitionBy(<some column>)
  .format("parquet")
  .option("path", <some path in EMRFS>)
  .option("checkpointLocation", "/tmp/ingestcheckpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused; is this an error in the newest version of Spark?
The issue seemed to be related to using s3n rather than s3a, and to having the checkpoints only on HDFS instead of S3. This is highly annoying since I would like to avoid hard-coding DNS names or IPs in my code.
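A minimal sketch of the setup implied by that finding (bucket and paths here are hypothetical): address both the sink output and the checkpoint location through the s3a connector rather than s3n, so the output path (which holds the _spark_metadata directory from the error) and the checkpoint are accessed through the same connector.

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// Sketch only: hypothetical bucket/paths; outputdf is the DataFrame from above.
val filesink_query = outputdf.writeStream
  .partitionBy("some_column")
  .format("parquet")
  .option("path", "s3a://my-bucket/ingest/output")
  .option("checkpointLocation", "s3a://my-bucket/ingest/checkpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start()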

Offset Management For Apache Kafka With Apache Spark Batch

I'm writing a Spark (v2.2) batch job which reads from a Kafka topic. The Spark jobs are scheduled with cron.
I can't use Spark Structured Streaming because non-time-based windows are not supported.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", s"kafka_topic")
  .load()
I need to keep track of the offsets for the Kafka topic, so the next batch job knows where to start reading. How can I do that?
I guess you are using KafkaUtils to create the stream; you can pass the offsets as a parameter:
val inputDStream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
  Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets))
Hoping this helps!
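With the batch DataFrame API shown in the question, another option is to pass explicit per-partition offsets through the startingOffsets and endingOffsets options (the JSON format comes from the Spark Kafka integration guide) and persist the end position yourself between runs. A sketch, where the offsets JSON and its storage location are hypothetical:

// Sketch only: the previous run would have written this JSON somewhere durable
// (e.g. S3/HDFS); in the JSON form, -2 can be used to mean "earliest" for a partition.
val startingOffsets = """{"kafka_topic":{"0":23,"1":45}}"""

val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "kafka_topic")
  .option("startingOffsets", startingOffsets)
  .option("endingOffsets", "latest")
  .load()

// After processing, take the max "offset" column per partition, add 1, and save
// that JSON so it becomes the next run's startingOffsets.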