Spark Streaming Job starts reading from beginning instead of where it stopped consuming - mongodb

I am trying to use Dataproc on Google Cloud Platform for my Spark Streaming jobs.
I use Kafka as my source and write the data to MongoDB. It's working fine, but after the job fails it starts reading the messages from my Kafka topic from the beginning instead of from where it stopped.
Here is my config for reading from Kafka:
clickstreamTestDf = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("subscribe", "customer_experience")
    .option("failOnDataLoss", "false")
    .option("startingOffsets", "earliest")
    .load()
)
And here is my write stream code:
finished_df.writeStream \
    .format("mongodb") \
    .option("spark.mongodb.connection.uri", connectionString) \
    .option("spark.mongodb.database", "Company-Environment") \
    .option("spark.mongodb.collection", "customer_experience") \
    .option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp") \
    .option("forceDeleteTempCheckpointLocation", "true") \
    .outputMode("append") \
    .start() \
    .awaitTermination()
Do I need to set startingOffsets to latest? I tried that, but the job still didn't resume from where it stopped.
Can I use checkpointLocation like this? Is it okay to use a directory in Google Cloud Storage?
I want to run the streaming job, stop it, delete the Dataproc cluster, and then create a new one the next day and continue reading from where it left off. Is that possible, and how?
I really need some help here!
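For reference: Structured Streaming stores the Kafka offsets it has already consumed in the checkpoint directory, and startingOffsets is only consulted on the very first run of a query, before any checkpoint exists. Below is a minimal sketch of the resume pattern, assuming the GCS checkpoint path is kept between cluster recreations and the query itself is unchanged; finished_df and connectionString are the names from the question.
# First run: no checkpoint exists yet, so "startingOffsets" decides where to begin.
# Later runs: Spark ignores "startingOffsets" and resumes from the offsets stored
# under gs://firstsparktest_1/checkpointCustExp, as long as that same path is reused.
query = (
    finished_df.writeStream
    .format("mongodb")
    .option("spark.mongodb.connection.uri", connectionString)
    .option("spark.mongodb.database", "Company-Environment")
    .option("spark.mongodb.collection", "customer_experience")
    .option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp")  # durable path that outlives the cluster
    .outputMode("append")
    .start()
)
query.awaitTermination()
As long as that checkpoint directory in GCS is not deleted, recreating the Dataproc cluster the next day and re-submitting the same query should continue from the stored offsets.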

Related

pyspark: how to perform structured streaming using KafkaUtils

I am doing structured streaming using SparkSession.readStream and writing it to a Hive table, but it does not seem to allow time-based micro-batches, i.e. I need a batch of 5 seconds. All the messages should form a 5-second batch, and the batch data should get written to the Hive table.
Right now it reads the messages as and when they are posted to the Kafka topic, and each message becomes one record in the table.
Working Code
from pyspark.sql.functions import col, json_tuple

def hive_write_batch_data(data, batchId):
    data.write.format("parquet").mode("append").saveAsTable("test.my_table")

kafka_config = {
    "checkpointLocation": "/user/aiman/temp/checkpoint",
    "kafka.bootstrap.servers": "kafka.bootstrap.server.com:9093",
    "subscribe": "TEST_TOPIC",
    "startingOffsets": offsetValue,
    "kafka.security.protocol": "SSL",
    "kafka.ssl.keystore.location": "kafka.keystore.uat.jks",
    "kafka.ssl.keystore.password": "abcd123",
    "kafka.ssl.key.password": "abcd123",
    "kafka.ssl.truststore.type": "JKS",
    "kafka.ssl.truststore.location": "kafka.truststore.uat.jks",
    "kafka.ssl.truststore.password": "abdc123",
    "kafka.ssl.enabled.protocols": "TLSv1",
    "kafka.ssl.endpoint.identification.algorithm": ""
}

df = spark.readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .load()

data = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "offset", "timestamp", "partition")

data_new = data.select(col("offset"), col("partition"), col("key"), json_tuple(col("value"), "product_code", "rec_time")) \
    .toDF("offset", "partition", "key", "product_code", "rec_time")

data_new.writeStream \
    .foreachBatch(hive_write_batch_data) \
    .start() \
    .awaitTermination()
Problem Statement
Since each message is treated as one record in the Hive table, a single parquet file is created for each record, which can trigger Hive's small-file issue.
I need to create a time-based batch so that multiple records get inserted into the Hive table in one batch. The only time-based option I found was KafkaUtils with ssc = StreamingContext(sc, 5), but it does not create DataFrames.
How should I use KafkaUtils to read batches into DataFrames?
Adding a trigger worked. Added a trigger in the stream writer:
# The .trigger(...) line is the addition.
data_new.writeStream \
    .trigger(processingTime="5 seconds") \
    .foreachBatch(hive_write_batch_data) \
    .start() \
    .awaitTermination()
Found the article here
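As a side note on the small-file concern (not part of the original answer): because foreachBatch hands each micro-batch to the function as a regular DataFrame, the batch can also be coalesced before writing, so every 5-second batch produces a small, fixed number of parquet files. A rough sketch, reusing the table name from the question:
def hive_write_batch_data(data, batchId):
    # Coalesce the micro-batch into one partition so each 5-second batch
    # writes a single parquet file instead of one file per record.
    data.coalesce(1).write.format("parquet").mode("append").saveAsTable("test.my_table")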

Authenticate spark structured streaming against kafka with delegation token, in scala

I'm trying to stream messages out of Kafka with Spark Structured Streaming in Scala, as per the Spark documentation, like this:
val sparkConfig = new SparkConf()
  .setAppName("Some.app.name")
  .setMaster("local")

val spark = SparkSession
  .builder
  .config(sparkConfig)
  .getOrCreate()

val dataframe = spark
  .readStream
  .format("kafka")
  .option("subscribe", kafkaTopic)
  .option("kafka.bootstrap.servers", kafkaEndpoint)
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .option("kafka.sasl.username", "$ConnectionString")
  .option("kafka.sasl.password", kafkaConnectionString)
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("spark.kafka.clusters.cluster.sasl.token.mechanism", "SASL_PLAINTEXT")
  .option("includeHeaders", "true")
  .load()

val outputAllToConsoleQuery = dataframe
  .writeStream
  .format("console")
  .start()

outputAllToConsoleQuery.awaitTermination()
Which, of course, fails with: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set.
As per the Spark documentation here, "... the application can be configured via Spark parameters and may not need JAAS login configuration".
I have also read the Kafka documentation.
I think I get the idea, but I haven't found a way to actually code it, nor have I found any examples.
Could someone provide the Scala code that configures Spark Structured Streaming to authenticate against Kafka using a delegation token, without a JAAS configuration file?

How to use foreach in PySpark to write to a kafka topic?

I am trying to insert a refined log, created for each row through foreach, into a Kafka topic, as follows:
import json
import pandas as pd
from pyspark.sql import SparkSession

def refine(df):
    log = df.value
    event_logs = json.dumps(get_event_logs(log))  # a function to refine the row/log
    pdf = pd.DataFrame({"value": event_logs}, index=[0])
    spark = SparkSession.builder.appName("myAPP").getOrCreate()
    df = spark.createDataFrame(pdf)
    query = df.selectExpr("CAST(value AS STRING)") \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "intest") \
        .save()
And I am calling it using the following code.
query = streaming_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .foreach(refine) \
    .start()

query.awaitTermination()
But the refine function is somehow unable to see the Kafka package I passed while submitting the code. I believe the executors have no access to the Kafka package sent via the following command:
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...
because when I submit my code, I get the following error message:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
So my question is: how do I sink data into Kafka inside foreach? And as a side question, is it a good idea to create another session inside foreach? I had to re-create the session inside refine because the existing session of the main driver could not be used inside foreach due to serializability issues.
P.S.: If I try to sink it to the console (...format("console")) inside foreach, it works without any error.
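One commonly suggested alternative (not from the original post) is to avoid per-row foreach and per-row sessions altogether and use foreachBatch instead: each micro-batch arrives as a regular DataFrame, so it can be written with the batch Kafka sink and no extra SparkSession is needed inside the function. A rough sketch, reusing streaming_df and the topic name from the question; the actual refinement logic is left as a placeholder:
def write_batch_to_kafka(batch_df, batch_id):
    # batch_df is an ordinary DataFrame; apply the refinement here, then
    # write the whole micro-batch with the batch Kafka sink.
    batch_df.selectExpr("CAST(value AS STRING)") \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "intest") \
        .save()

streaming_df.writeStream \
    .foreachBatch(write_batch_to_kafka) \
    .start() \
    .awaitTermination()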

Control micro batch of Structured Spark Streaming

I'm reading data from a Kafka topic and writing it to Azure ADLS (HDFS-like) in partitioned mode.
My code is like below:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("failOnDataLoss", false)
  .load()
  .selectExpr(/*"CAST(key AS STRING)",*/ "CAST(value AS STRING)").as(Encoders.STRING)

df.writeStream
  .partitionBy("year", "month", "day", "hour", "minute")
  .format("parquet")
  .option("path", outputDirectory)
  .option("checkpointLocation", checkpointDirectory)
  .outputMode("append")
  .start()
  .awaitTermination()
I have about 2000 records/sec, and my problem is that Spark inserts the data every 45 seconds, whereas I want the data to be inserted immediately.
Does anyone know how to control the size of the micro-batch?
Since Spark 2.3, the Continuous Processing mode is available. In the official doc. you can read that only three sinks are supported for this mode and only the Kafka sink is ready for production, and "the end-to-end low-latency processing can be best observed with Kafka as the source and sink"
df
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/0")
  .option("topic", "output0")
  .trigger(Trigger.Continuous("0 seconds"))
  .start()
So it seems that, at the moment, you can't use HDFS as a sink with the Continuous mode. In your case, maybe you can test Akka Streams and the Alpakka connector.

spark structured streaming write errors

I'm running into some odd errors when I consume and sink Kafka messages. I'm running Spark 2.3.0, and I know this was working in an earlier version.
val event = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", <server list>)
  .option("subscribe", <topic>)
  .load()

val filesink_query = outputdf.writeStream
  .partitionBy(<some column>)
  .format("parquet")
  .option("path", <some path in EMRFS>)
  .option("checkpointLocation", "/tmp/ingestcheckpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused; is this an error in the newest version of Spark?
The issue seemed to be related to using s3n instead of s3a, and to having checkpoints only on HDFS rather than S3. This is highly annoying, since I would like to avoid hard-coding DNS names or IPs in my code.