Spark Structured Streaming - Kinesis as Data Source - PySpark

I am trying to consume Kinesis data stream records using PySpark Structured Streaming, running the code as an AWS Glue batch job. My goal is to use checkpointing and save both the checkpoints and the data to S3. I am able to consume the data, but each trigger returns only a few records even though the Kinesis data stream contains a lot of them. I am using TRIM_HORIZON (an alias for earliest) as the starting position and trigger the writeStream once, so that the job processes a single batch and shuts down the cluster. When I run the job again, it picks up the latest offset from the checkpoint and continues from there.
kinesis = spark.readStream.format('kinesis') \
    .option('streamName', kinesis_stream_name) \
    .option('endpointUrl', 'blaablaa') \
    .option('region', region) \
    .option('startingPosition', 'TRIM_HORIZON') \
    .option('maxOffsetsPerTrigger', 100000) \
    .load()
# do some transformation here
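# A hedged sketch of what the transformation step might look like. It assumes
# the connector exposes the record payload in a binary `data` column (common
# for the Kinesis structured-streaming connectors) and that a `schema` for the
# payload is defined elsewhere; both are assumptions, not part of the question.
from pyspark.sql.functions import col, from_json

stream_data = kinesis \
    .selectExpr("CAST(data AS STRING) AS payload") \
    .select(from_json(col("payload"), schema).alias("record")) \
    .select("record.*")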
TargetKinesisData = stream_data.writeStream \
    .format("parquet") \
    .outputMode('append') \
    .option("path", s3_target) \
    .option("checkpointLocation", checkpoint_location) \
    .trigger(once=True) \
    .start() \
    .awaitTermination()

Related

StructuredStreaming - read from Strimzi Kafka on GKE, writing data into Mongo every 10 minutes

I have data in a Kafka topic (published every 10 minutes) and I'm planning to read it with Apache Spark Structured Streaming (in batch mode) and push it into MongoDB.
Please note: this will be scheduled using Composer/Airflow on GCP, which will create a Dataproc cluster, run the Spark code, and then delete the cluster.
Here is my current code:
# read from Kafka, extract JSON - and write to MongoDB
df_reader = spark.readStream.format('kafka') \
    .option("kafka.bootstrap.servers", kafkaBrokers) \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.truststore.location", ssl_truststore_location) \
    .option("kafka.ssl.truststore.password", ssl_truststore_password) \
    .option("kafka.ssl.keystore.location", ssl_keystore_location) \
    .option("kafka.ssl.keystore.password", ssl_keystore_password) \
    .option("subscribe", topic) \
    .option("kafka.group.id", consumerGroupId) \
    .option("failOnDataLoss", "false") \
    .option("startingOffsets", "earliest") \
    .load()
df = df_reader.selectExpr("CAST(value AS STRING)")
df_data = df.select(from_json(col('value'), schema).alias('data')) \
    .select("data.*") \
    .filter(col('customer') == database)

# write to Mongo
df_data.write \
    .format("mongo") \
    .option("uri", mongoConnUri) \
    .option("database", database) \
    .option("collection", collection) \
    .mode("append") \
    .save()
Since this is run as a batch query every 10 minutes, how do I ensure that duplicate records are not read and pushed into MongoDB?
When I use readStream, does it read all the data in the Kafka topic, or only from the point where it last read?
How does spark.read differ from spark.readStream in this case?
Please note: the mongo data source does not support streaming queries; otherwise I could have used a checkpoint to enable this.
Please advise on the best way to achieve this.
Thanks in advance!
If you want to schedule the job to run every X minutes, you should use spark.read.format("kafka"); otherwise it will start a long-running Spark Structured Streaming job, not a batch job.
Spark will track offsets either in Kafka or in a checkpointLocation, which you'll want to configure.
Also, structured streaming writes do work with Mongo.
As commented, Kafka Connect might be more useful than scheduling anything, though, and you can use GKE or perhaps Cloud Run to start Kafka Connect worker containers, or create a cluster in GCE. This will run continuously, and you won't have to wait 10+ minutes (depending on Kafka consumer lag).
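For reference, a minimal sketch of the checkpointed run-once alternative hinted at above. It reuses the variables from the question, adds a hypothetical checkpoint_path, and uses foreachBatch so the plain batch mongo writer can be reused even where the connector has no native streaming sink:
from pyspark.sql.functions import col, from_json

def write_to_mongo(batch_df, batch_id):
    # each micro-batch is written with the ordinary batch writer
    batch_df.write \
        .format("mongo") \
        .option("uri", mongoConnUri) \
        .option("database", database) \
        .option("collection", collection) \
        .mode("append") \
        .save()

parsed = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", kafkaBrokers) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .option("failOnDataLoss", "false") \
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

# trigger(once=True) processes only what is new since the last checkpoint and
# then stops, so a 10-minute Airflow schedule will not re-read old offsets or
# push duplicates into MongoDB.
parsed.writeStream \
    .foreachBatch(write_to_mongo) \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .start() \
    .awaitTermination()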

Azure Databricks only gets Event Hub data sent while it's running

I'm trying to read Azure Event Hub data using Databricks.
I have a producer running in Node.js, as well as a consumer for testing (on a different consumer group), and all seems to be running fine.
I am using the following PySpark code in Databricks to get the data:
# Initialize event hub config dictionary with connectionString
connectionString = "Endpoint=sb://XXX.servicebus.windows.net/;SharedAccessKeyName=test;SharedAccessKey=XXX;EntityPath=XXX"
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# Add consumer group to the ehConf dictionary
ehConf['eventhubs.consumerGroup'] = "databricks"
# Read events from the Event Hub
messages = spark.readStream.format("eventhubs").options(**ehConf).load()
# Visualize the Dataframe in realtime
display(messages)
The issue is that it only reads data from the stream if the data is sent while the notebook is running. If I produce data and then run the notebook, it does not appear.
What am I missing? I want to use this to collect data from the stream every hour or so and save it.
Config:
Databricks Runtime: 7.3LTS (Spark 3.0.1, Scala 2.12)
Azure eventhub library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.21
You have two problems:
display by default uses a temporary checkpoint, so when you run it the next time it doesn't know where to continue and starts over again and again. If you continue to use display, pass checkpointLocation="some_path" to the display call (see the docs).
By default, the Event Hubs connector reads only new data (that's why it consumes data only while running). If you want to consume data from the beginning (on the initial run only), you need to add the option eventhubs.startingPosition, which encodes the starting position (see the docs). To start reading from the beginning of the event hub, assign the following to this option:
import json

startingEventPosition = {
    "offset": -1,
    "seqNo": -1,           # not in use
    "enqueuedTime": None,  # not in use
    "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)

pyspark - how to run and schedule streaming jobs in Dataproc hosted on GCP

I am trying to write a PySpark job that streams data from a Delta table and merges it into a final Delta target continuously, with an interval of 10-15 minutes between cycles.
I have written a simple PySpark script and submit the job with the command "spark-submit gs://<pyspark_script>.py". However, the script runs once and exits instead of picking up the next cycle.
Code sample:
(SourceDF.writeStream
    .format("delta")
    .outputMode("append")                    # I have also tried "update"
    .foreachBatch(mergeToDelta)
    .option("checkpointLocation", "gs://<path_for_the_checkpoint_location>")
    .trigger(processingTime="10 minutes")    # I have also tried continuous="10 minutes"
    .start())
How do I submit Spark jobs on Dataproc in Google Cloud so that they run as continuous streaming jobs?
Both the source and the target of the streaming job are Delta tables.
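One detail worth a sketch: start() returns a StreamingQuery, and unless the driver waits on it the script reaches its end and exits after the first cycle. The names and paths below are taken from the question above:
query = SourceDF.writeStream \
    .format("delta") \
    .outputMode("append") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs://<path_for_the_checkpoint_location>") \
    .trigger(processingTime="10 minutes") \
    .start()

# Block the driver so the streaming query keeps firing every 10 minutes
# instead of the script exiting right after submission.
query.awaitTermination()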

Can Spark jobs be scheduled through Airflow

I am new to Spark and need to clarify some doubts I have.
Can I schedule Spark jobs through Airflow?
My Airflow (Spark) jobs process raw CSV files in an S3 bucket, transform them into parquet format, store the result in an S3 bucket, and finally load it into Presto/Hive once processing is complete. End users connect to Presto and query the data to build visualisations.
Can this processed data be stored in Hive only, or in Presto only, so that users can connect to Presto or Hive accordingly and query the database?
Well, you can always use the SparkSubmitOperator
to schedule and submit your Spark jobs, or you can use the BashOperator
and run the spark-submit command from there to schedule and submit Spark jobs.
As for your second question: after Spark has created the parquet files, you can use the same Spark instance to write them to Hive or Presto.
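A minimal sketch of the SparkSubmitOperator approach; the import path assumes the apache-spark provider package, and the DAG id, schedule, and application path are illustrative, not taken from the question:
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="csv_to_parquet_to_hive",    # illustrative
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",        # illustrative
    catchup=False,
) as dag:
    # submits the Spark application through the configured Spark connection
    transform = SparkSubmitOperator(
        task_id="transform_raw_csv",
        application="s3://my-bucket/jobs/transform_csv.py",  # illustrative path
        conn_id="spark_default",
    )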

How does Apache Beam manage Kinesis checkpointing?

I have a streaming pipeline developed in Apache Beam (using the Spark Runner) which reads from a Kinesis stream.
I am looking for options in Apache Beam to manage Kinesis checkpointing (i.e. periodically store the current position in the Kinesis stream) so that the system can recover from failures and continue processing where the stream left off.
Is there a provision in Apache Beam to support Kinesis checkpointing, similar to Spark Streaming (reference: https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html)?
Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism provided by Beam's UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read (actually, pulled from a record queue that is fed separately by fetching records from the Kinesis shard), the shard checkpoint is updated using the record's SequenceNumber and then, depending on the runner's implementation of UnboundedSource and checkpoint processing, it will be saved.
AFAIK, the Beam Spark Runner uses Spark's state mechanism for this purpose.