Control micro batch of Structured Spark Streaming - scala

I'm reading data from a Kafka topic and writing it to Azure ADLS (HDFS-like) in partitioned mode.
My code is below:
import org.apache.spark.sql.Encoders
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("failOnDataLoss", false)
.load()
.selectExpr(/*"CAST(key AS STRING)",*/ "CAST(value AS STRING)").as(Encoders.STRING)
df.writeStream
.partitionBy("year", "month", "day", "hour", "minute")
.format("parquet")
.option("path", outputDirectory)
.option("checkpointLocation", checkpointDirectory)
.outputMode("append")
.start()
.awaitTermination()
I receive about 2000 records/sec. My problem is that Spark writes the data only every 45 seconds, and I want it to be written immediately.
Does anyone know how to control the size of a micro-batch?

Continuous processing mode has been available since Spark 2.3. According to the official documentation, only three sinks are supported in this mode, only the Kafka sink is ready for production, and "the end-to-end low-latency processing can be best observed with Kafka as the source and sink".
import org.apache.spark.sql.streaming.Trigger
df
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/0")
.option("topic", "output0")
.trigger(Trigger.Continuous("0 seconds"))
.start()
So it seems that, at the moment, you can't use HDFS as a sink in Continuous mode. In your case, you could try Akka Streams with the Alpakka connector.
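If Continuous mode is not an option for a file sink, the batch size can still be bounded in the default micro-batch mode. The sketch below is only an illustration under assumptions: it uses the Kafka source option `maxOffsetsPerTrigger` to cap how many offsets each batch reads, plus an explicit `Trigger.ProcessingTime`. The broker address, topic, paths, and the values `10000` and `5 seconds` are placeholders, not tuned recommendations. Note that without an explicit trigger Spark starts the next micro-batch as soon as the previous one finishes, so a 45-second cadence likely reflects how long each batch takes to write rather than a configured interval.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object BoundedMicroBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bounded-micro-batch").getOrCreate()

    // Placeholder values (assumptions, not taken from the question).
    val bootstrapServers = "localhost:9092"
    val topic = "input-topic"
    val outputDirectory = "/tmp/out"
    val checkpointDirectory = "/tmp/checkpoint"

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      // Upper bound on the total number of offsets processed per micro-batch.
      .option("maxOffsetsPerTrigger", 10000L)
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    df.writeStream
      .format("parquet")
      .option("path", outputDirectory)
      .option("checkpointLocation", checkpointDirectory)
      // Try to start a new micro-batch every 5 seconds; a slow batch delays the next one.
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```

Smaller batches reduce the amount of data written per commit, but each batch still pays the fixed cost of committing files to the sink, so whether latency actually improves depends on the storage layer.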

Related

Spark Streaming Job starts reading from beginning instead of where it stopped consuming

I am trying to use Dataproc on Google Cloud Platform for my Spark Streaming jobs.
I use Kafka as my source and try to write it to MongoDB. It's working fine, but after the job fails it starts reading the messages from my Kafka topic from the beginning instead of from where it stopped.
Here is my config for reading from Kafka:
clickstreamTestDf = (
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", confluentBootstrapServers)
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
.option("kafka.ssl.endpoint.identification.algorithm", "https")
.option("kafka.sasl.mechanism", "PLAIN")
.option("subscribe", "customer_experience")
.option("failOnDataLoss", "false")
.option("startingOffsets", "earliest")
.load()
)
And here is my write stream code:
finished_df.writeStream \
.format("mongodb")\
.option("spark.mongodb.connection.uri", connectionString) \
.option("spark.mongodb.database", "Company-Environment") \
.option("spark.mongodb.collection", "customer_experience") \
.option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp") \
.option("forceDeleteTempCheckpointLocation", "true") \
.outputMode("append") \
.start() \
.awaitTermination()
Do I need to set startingOffsets to latest? I tried but it still didn't read from where it stopped.
Can I use checkpointLocation like this? Is it okay to use a directory in google storage?
I want to run the streaming job, stop it, delete the Dataproc cluster and then create a new one the next day and continue reading from where it left off. Is that possible and how?
Really need some help here!

Authenticate spark structured streaming against kafka with delegation token, in scala

I'm trying to stream messages out of kafka with spark structured streaming in scala as per spark documentation like this:
val sparkConfig = new SparkConf()
.setAppName("Some.app.name")
.setMaster("local")
val spark = SparkSession
.builder
.config(sparkConfig)
.getOrCreate()
val dataframe = spark
.readStream
.format("kafka")
.option("subscribe", kafkaTopic)
.option("kafka.bootstrap.servers", kafkaEndpoint)
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.option("kafka.sasl.username", "$ConnectionString")
.option("kafka.sasl.password", kafkaConnectionString)
.option("kafka.sasl.mechanism", "PLAIN")
.option("spark.kafka.clusters.cluster.sasl.token.mechanism", "SASL_PLAINTEXT")
.option("includeHeaders", "true")
.load()
val outputAllToConsoleQuery = dataframe
.writeStream
.format("console")
.start()
outputAllToConsoleQuery.awaitTermination()
Which of course fails with Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
As per spark documentation here "..the application can be configured via Spark parameters and may not need JAAS login configuration".
I have also read kafka documentation.
I think I can get the idea, but I haven't found a way to actually code it, nor have I found any example.
Could someone provide the code in scala that configures spark structured streaming to authenticate against kafka and use delegation token, without JAAS configuration file?

Consuming data from Azure event hub using structured streaming kafka

I am using the Structured Streaming Kafka integration to stream data from an event hub and print it to the console, as in the example below. Nothing appears on the console, even though I can display the data using the org.apache.spark.eventhubs Structured Streaming API.
import org.apache.spark.sql.kafka010._
val spark = SparkSession.builder()
.master("local[*]")
.appName("kafkaeventhubconsumer")
.getOrCreate()
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<EVENT_HUB_FQDN>:9093")
.option("subscribe", "<EVENT_HUB_NAME>")
.option("security.protocol", "SASL_SSL")
.option("sasl.mechanism" , "PLAIN")
.option("sasl.jaas.config", """org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<CONNECTION_STRING>";""")
.load()
df.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination()
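One thing worth checking in the snippet above: Spark's Kafka source only forwards options prefixed with `kafka.` to the underlying Kafka consumer, so `security.protocol`, `sasl.mechanism`, and `sasl.jaas.config` as written would not reach the client. A minimal sketch with the prefix added, keeping the question's placeholders as they are (the object name and the cast of `value` to a string are additions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object EventHubKafkaConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kafkaeventhubconsumer")
      .getOrCreate()

    // "$ConnectionString" is the literal user name; a plain triple-quoted
    // string is not interpolated, so the dollar sign is kept as-is.
    val jaasConfig =
      """org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<CONNECTION_STRING>";"""

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<EVENT_HUB_FQDN>:9093")
      .option("subscribe", "<EVENT_HUB_NAME>")
      // Kafka client settings need the "kafka." prefix to be passed through.
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaasConfig)
      .load()

    // Kafka values arrive as binary; cast them so the console output is readable.
    df.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .start()
      .awaitTermination()
  }
}
```

If the console still stays empty, it may also be worth checking the starting position: a new streaming query reads from the latest offsets by default, so only events sent after the query starts would show up.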

Spark Structured Streaming write errors

I'm running into some odd errors when I consume and sink Kafka messages. I'm running 2.3.0, and I know this was working previously in another version.
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val event = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()
val filesink_query = outputdf.writeStream
.partitionBy(<some column>)
.format("parquet")
.option("path", <some path in EMRFS>)
.option("checkpointLocation", "/tmp/ingestcheckpoint")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused; is this a bug in the newest version of Spark?
The issue seemed to be related to using s3n instead of s3a and having checkpoints only on HDFS, not S3. This is highly annoying since I would like to avoid hard-coding DNS names or IPs in my code.

How to use Spark Structured Streaming with Kafka Direct Stream?

I came across Spark Structured Streaming; it has an example of continuously consuming from an S3 bucket and writing processed results to a MySQL DB.
// Read data continuously from an S3 location
val inputDF = spark.readStream.json("s3://logs")
// Do operations using the standard DataFrame API and write to MySQL
inputDF.groupBy($"action", window($"time", "1 hour")).count()
.writeStream.format("jdbc")
.start("jdbc:mysql//...")
How can this be used with Spark Kafka Streaming?
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Is there a way to combine these two examples without using stream.foreachRDD(rdd => {})?
Not yet. Spark 2.0.0 doesn't have Kafka sink support for Structured Streaming. This is a feature that should come out in Spark 2.1.0 according to Tathagata Das, one of the creators of Spark Streaming.
Here is the relevant JIRA issue.
Edit: (29/11/2018)
Yes, it's possible with Spark version 2.2 onwards.
stream
.writeStream // use `write` for batch, like DataFrame
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("topic", "target-topic1")
.start()
Check this SO post (read and write on Kafka topic with Spark streaming) for more.
Edit: (06/12/2016)
Kafka 0.10 integration for Structured Streaming is now experimentally supported in Spark 2.0.2:
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
ds1
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
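To tie the Kafka source back to the windowed count and MySQL write from the question, one possible approach on Spark 2.4 or later is `foreachBatch`, which hands each micro-batch to ordinary batch code such as the JDBC writer. The following is a rough sketch, not a tested pipeline: the JDBC URL, credentials, table name `action_counts`, and checkpoint path are hypothetical, and it assumes each Kafka message value is the action name and that the Kafka record timestamp can stand in for the event time.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object KafkaToJdbc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-jdbc").getOrCreate()

    // Hypothetical connection details (assumptions for illustration only).
    val jdbcUrl = "jdbc:mysql://localhost:3306/logs"
    val props = new java.util.Properties()
    props.setProperty("user", "spark")
    props.setProperty("password", "secret")

    val counts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")
      .load()
      // Assumption: the message value is the action; the Kafka timestamp is the event time.
      .selectExpr("CAST(value AS STRING) AS action", "timestamp AS time")
      .groupBy(col("action"), window(col("time"), "1 hour"))
      .count()
      // Flatten the window struct so the row maps onto plain JDBC columns.
      .select(col("action"), col("window.start").as("window_start"), col("count"))

    // A typed function value avoids overload ambiguity between the Scala and Java foreachBatch variants.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write.mode(SaveMode.Append).jdbc(jdbcUrl, "action_counts", props)

    counts.writeStream
      .outputMode("update")
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/kafka-to-jdbc-checkpoint")
      .start()
      .awaitTermination()
  }
}
```

In update mode each batch contains only the rows whose counts changed, so appending them produces several rows per window over time. Keeping one row per action and window would need an upsert on the database side instead of a plain append.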
I had a similar issue reading from a Kafka source and writing to a Cassandra sink. I created a simple project, kafka2spark2cassandra, and am sharing it here in case it is helpful to anyone.