pyspark: how to perform structured streaming using KafkaUtils

I am doing structured streaming using SparkSession.readStream and writing it to a Hive table, but it does not seem to let me use time-based micro-batches, i.e. I need a batch of 5 seconds. All messages arriving within 5 seconds should form one batch, and that batch should be written to the Hive table.
Right now it reads the messages as and when they are posted to the Kafka topic, and each message becomes one record in the table.
Working Code
from pyspark.sql.functions import col, json_tuple

def hive_write_batch_data(data, batchId):
    data.write.format("parquet").mode("append").saveAsTable("test.my_table")

kafka_config = {
    "checkpointLocation": "/user/aiman/temp/checkpoint",
    "kafka.bootstrap.servers": "kafka.bootstrap.server.com:9093",
    "subscribe": "TEST_TOPIC",
    "startingOffsets": offsetValue,
    "kafka.security.protocol": "SSL",
    "kafka.ssl.keystore.location": "kafka.keystore.uat.jks",
    "kafka.ssl.keystore.password": "abcd123",
    "kafka.ssl.key.password": "abcd123",
    "kafka.ssl.truststore.type": "JKS",
    "kafka.ssl.truststore.location": "kafka.truststore.uat.jks",
    "kafka.ssl.truststore.password": "abdc123",
    "kafka.ssl.enabled.protocols": "TLSv1",
    "kafka.ssl.endpoint.identification.algorithm": ""
}

df = spark.readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .load()

data = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "offset", "timestamp", "partition")

data_new = data.select(col("offset"), col("partition"), col("key"), json_tuple(col("value"), "product_code", "rec_time")) \
    .toDF("offset", "partition", "key", "product_code", "rec_time")

data_new.writeStream \
    .foreachBatch(hive_write_batch_data) \
    .start() \
    .awaitTermination()
Problem Statement
Since each message is treated as one record in the Hive table, a separate parquet file is created for every record, which can trigger Hive's small-file problem.
I need time-based batching so that multiple records are inserted into the Hive table in a single batch. The only time-based support I found was in KafkaUtils, via ssc = StreamingContext(sc, 5), but that works on DStreams and does not create DataFrames.
How should I use KafkaUtils to read time-based batches into DataFrames?

Adding a trigger worked. I added a trigger in the stream writer:
data_new.writeStream \
    .trigger(processingTime="5 seconds") \
    .foreachBatch(hive_write_batch_data) \
    .start() \
    .awaitTermination()
Found the article here
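As a side note, the number of files per batch can also be bounded inside the batch function itself; a minimal, hypothetical sketch (coalescing is an extra idea on top of the trigger, not part of the answer above; the table name and checkpoint path are the ones from the question):
# Hypothetical sketch: each 5-second micro-batch is coalesced so it lands as a
# small, fixed number of parquet files instead of one file per record.
def hive_write_batch_data_coalesced(data, batchId):
    data.coalesce(1).write.format("parquet").mode("append").saveAsTable("test.my_table")

data_new.writeStream \
    .trigger(processingTime="5 seconds") \
    .foreachBatch(hive_write_batch_data_coalesced) \
    .option("checkpointLocation", "/user/aiman/temp/checkpoint") \
    .start() \
    .awaitTermination()
Whether coalesce(1) is acceptable depends on batch volume; with heavier traffic a larger partition count avoids a single-writer bottleneck.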

Related

Spark Streaming Job starts reading from beginning instead of where it stopped consuming

I am trying to use Dataproc on Google Cloud Platform for my Spark Streaming jobs.
I use Kafka as my source and try to write to MongoDB. It works fine, but after the job fails it starts reading messages from my Kafka topic from the beginning instead of from where it stopped.
Here is my config for reading from Kafka:
clickstreamTestDf = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("subscribe", "customer_experience")
    .option("failOnDataLoss", "false")
    .option("startingOffsets", "earliest")
    .load()
)
And here is my write stream code:
finished_df.writeStream \
    .format("mongodb") \
    .option("spark.mongodb.connection.uri", connectionString) \
    .option("spark.mongodb.database", "Company-Environment") \
    .option("spark.mongodb.collection", "customer_experience") \
    .option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp") \
    .option("forceDeleteTempCheckpointLocation", "true") \
    .outputMode("append") \
    .start() \
    .awaitTermination()
Do I need to set startingOffsets to latest? I tried, but it still didn't read from where it stopped.
Can I use checkpointLocation like this? Is it okay to use a directory in Google Cloud Storage?
I want to run the streaming job, stop it, delete the Dataproc cluster, then create a new one the next day and continue reading from where it left off. Is that possible, and how?
Really need some help here!
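A hedged note, since no accepted answer is quoted here: in Structured Streaming the consumed offsets are stored in the writeStream checkpoint, so a restarted query resumes from that checkpoint, and startingOffsets only applies when no checkpoint exists yet. A sketch of the part that matters (everything else as in the question):
# Sketch: the checkpoint directory on the writer is what lets a new cluster
# resume where the old one stopped, so it must stay unchanged across runs.
# finished_df and connectionString are assumed from the question.
finished_df.writeStream \
    .format("mongodb") \
    .option("spark.mongodb.connection.uri", connectionString) \
    .option("spark.mongodb.database", "Company-Environment") \
    .option("spark.mongodb.collection", "customer_experience") \
    .option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp") \
    .outputMode("append") \
    .start() \
    .awaitTermination()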

Databricks: Azure Queue Storage structured streaming key not found error

I am trying to write an ETL pipeline for AQS (Azure Queue Storage) streaming data. Here is my code:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

CONN_STR = dbutils.secrets.get(scope="kvscope", key="AZURE-STORAGE-CONN-STR")

schema = StructType([
    StructField("id", IntegerType()),
    StructField("parkingId", IntegerType()),
    StructField("capacity", IntegerType()),
    StructField("freePlaces", IntegerType()),
    StructField("insertTime", TimestampType())
])

stream = spark.readStream \
    .format("abs-aqs") \
    .option("fileFormat", "json") \
    .option("queueName", "freeparkingplaces") \
    .option("connectionString", CONN_STR) \
    .schema(schema) \
    .load()

display(stream)
When I run this I am getting java.util.NoSuchElementException: key not found: eventType
Here is what my queue looks like:
Can you spot and explain what the problem is?
The abs-aqs connector isn't for consuming data from AQS; it's for getting notifications about new files in blob storage from events reported to AQS. That's why you're specifying the file format option and schema: these parameters are applied to the files, not to the messages in AQS.
As far as I know (I could be wrong), there is no Spark connector for AQS, and it's usually recommended to use Event Hubs or Kafka as the messaging solution.
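For illustration only (the broker address and topic name are placeholders, not anything from the question): if the parking events were published to Kafka instead, the stream could be read and parsed against the same schema roughly like this:
from pyspark.sql.functions import col, from_json

# Hypothetical sketch: consume the parking events from a Kafka topic rather
# than from AQS; "schema" is the StructType defined in the question.
parking_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "free-parking-places") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("event")) \
    .select("event.*")

display(parking_stream)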

How to use foreach in PySpark to write to a kafka topic?

I am trying to refine the log of each row through foreach and store it into a Kafka topic, as follows:
import json
import pandas as pd
from pyspark.sql import SparkSession

def refine(df):
    log = df.value
    event_logs = json.dumps(get_event_logs(log))  # a function to refine the row/log
    pdf = pd.DataFrame({"value": event_logs}, index=[0])
    spark = SparkSession.builder.appName("myAPP").getOrCreate()
    df = spark.createDataFrame(pdf)
    query = df.selectExpr("CAST(value AS STRING)") \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "intest") \
        .save()
And I am calling it using the following code.
query = streaming_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .foreach(refine) \
    .start()

query.awaitTermination()
But the refine function is somehow unable to see the Kafka package I passed while submitting the code. I believe the executors have no access to the Kafka package sent via the following command:
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...
because when I submit my code, I get the following error message:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
So, my question is: how do I sink data into Kafka inside foreach? And as a side question, I want to know whether it's a good idea to create another session inside foreach; I had to re-declare the session inside foreach because the existing session of the main driver couldn't be used inside foreach due to serialization issues.
P.S.: If I try to sink it to the console (...format("console")) inside foreach, it works without any error.
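A hedged sketch of one common workaround (not an answer from this thread): instead of foreach, which runs per row on executors where no usable SparkSession exists, foreachBatch hands each micro-batch to the driver as a DataFrame, so the Kafka batch sink can be used directly. refine_batch is a made-up name; get_event_logs and streaming_df come from the question, and the broker and topic are the same placeholders used there.
import json
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Hypothetical sketch: refine each row with a UDF, then write every micro-batch
# back to Kafka through the batch writer.
refine_udf = udf(lambda v: json.dumps(get_event_logs(v)), StringType())

def refine_batch(batch_df, batch_id):
    batch_df.select(refine_udf(col("value").cast("string")).alias("value")) \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "intest") \
        .save()

streaming_df.writeStream \
    .foreachBatch(refine_batch) \
    .start() \
    .awaitTermination()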

Control micro batch of Structured Spark Streaming

I'm reading data from a Kafka topic and writing it to Azure ADLS (HDFS-like) in partitioned mode.
My code looks like this:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("failOnDataLoss", false)
  .load()
  .selectExpr(/*"CAST(key AS STRING)",*/ "CAST(value AS STRING)").as(Encoders.STRING)
df.writeStream
  .partitionBy("year", "month", "day", "hour", "minute")
  .format("parquet")
  .option("path", outputDirectory)
  .option("checkpointLocation", checkpointDirectory)
  .outputMode("append")
  .start()
  .awaitTermination()
I receive about 2,000 records/sec, and my problem is that Spark inserts the data every 45 seconds; I want the data to be inserted immediately.
Does anyone know how to control the size of the micro-batch?
Continuous processing mode has been available since Spark 2.3. In the official docs you can read that only three sinks are supported for this mode, and only the Kafka sink is ready for production: "the end-to-end low-latency processing can be best observed with Kafka as the source and sink".
df
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/0")
  .option("topic", "output0")
  .trigger(Trigger.Continuous("0 seconds"))
  .start()
So it seems that, at the moment, you can't use HDFS as a sink in continuous mode. In your case, maybe you can test Akka Streams and the Alpakka connector.
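As a hedged aside (not part of the original answer, and shown in PySpark rather than the Scala used above): if continuous mode is not an option because of the file sink, the micro-batch interval itself can be shortened with an explicit processing-time trigger. The directories below are placeholders, and the partition columns are assumed to exist in the data.
# Hypothetical PySpark sketch: keep the parquet/ADLS sink but request much
# shorter micro-batches via a processing-time trigger.
df.writeStream \
    .partitionBy("year", "month", "day", "hour", "minute") \
    .format("parquet") \
    .option("path", outputDirectory) \
    .option("checkpointLocation", checkpointDirectory) \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start() \
    .awaitTermination()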

spark structured streaming write errors

I'm running into some odd errors when I consume and sink Kafka messages. I'm running Spark 2.3.0, and I know this was working previously in some other version.
val event = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", <server list>)
  .option("subscribe", <topic>)
  .load()

val filesink_query = outputdf.writeStream
  .partitionBy(<some column>)
  .format("parquet")
  .option("path", <some path in EMRFS>)
  .option("checkpointLocation", "/tmp/ingestcheckpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused; is this an error in the newest version of Spark?
The issue seemed to be related to using s3n instead of s3a, and to only having checkpoints on HDFS, not S3. This is highly annoying since I would like to avoid hard-coding DNS names or IPs in my code.
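A hedged illustration of the fix described above (bucket, path, and column names are placeholders): write through the s3a:// filesystem and keep the checkpoint on HDFS.
# Hypothetical PySpark sketch of the described fix: parquet output goes through
# s3a:// while the streaming checkpoint stays on HDFS.
outputdf.writeStream \
    .partitionBy("some_column") \
    .format("parquet") \
    .option("path", "s3a://my-bucket/ingest/") \
    .option("checkpointLocation", "hdfs:///tmp/ingestcheckpoint") \
    .trigger(processingTime="10 seconds") \
    .outputMode("append") \
    .start() \
    .awaitTermination()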