Can I "branch" stream into many and write them in parallel in pyspark? - pyspark

I am receiving a Kafka stream in pyspark. Currently I am grouping it by one set of fields and writing updates to a database:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic)
    ...

df = df \
    .groupBy("myfield1") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
    )

query = df \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
    .start()

query.awaitTermination()
Can I take the same chain in the middle and create another grouping like
df2 = df \
    .groupBy("myfield2") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
    )
and write its output to a different place, in parallel?
Where should I call writeStream and awaitTermination?

Yes, you can branch a Kafka input stream into as many streaming queries as you like.
You need to consider the following:
query.awaitTermination() is a blocking call, which means that whatever code you write after it will not be executed until that query terminates.
Each "branched" streaming query runs in parallel, and it is important that you define a separate checkpoint location in each of their writeStream calls.
Overall, your code needs to have the following structure:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic) \
    .[...]

# note that I changed the variable name to "df1"
df1 = df \
    .groupBy("myfield1") \
    .[...]

df2 = df \
    .groupBy("myfield2") \
    .[...]

query1 = df1 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc1") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()

query2 = df2 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc2") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()

spark.streams.awaitAnyTermination()
Just an additional remark: in the code you are showing, you are overwriting df, so deriving df2 from it might not give you the results you intended.
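For reference, write_data_frame is not shown in the question. Purely as an illustration, here is a minimal sketch of what such a foreachBatch handler could look like, assuming a JDBC sink (every connection option below is a placeholder, not taken from the question):

def write_data_frame(table_name, batch_df, epoch_id):
    # batch_df is a plain (non-streaming) DataFrame inside foreachBatch,
    # so the regular batch writer API is available.
    batch_df.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://db-host:5432/mydb") \
        .option("dbtable", table_name) \
        .option("user", "spark") \
        .option("password", "secret") \
        .mode("append") \
        .save()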

Related

Spark Shuffle Read and Shuffle Write Increasing in Structured Streaming

I have been running Spark Structured Streaming with Kafka for the last 23 hours. I could see Shuffle Read and Shuffle Write increasing drastically, and finally the driver stopped due to "out of memory".
Data is pushed to Kafka at 3 JSON messages per second, and the Spark stream uses processingTime='30 seconds'.
spark = SparkSession \
    .builder \
    .master("spark://spark-master:7077") \
    .appName("demo") \
    .config("spark.executor.cores", 1) \
    .config("spark.cores.max", "4") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "hdfs://172.30.7.36:9000/user/hive/warehouse") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.executor.memory", '1g') \
    .config("spark.scheduler.mode", "FAIR") \
    .config("spark.driver.memory", '2g') \
    .config("spark.sql.caseSensitive", "true") \
    .config("spark.sql.shuffle.partitions", 8) \
    .enableHiveSupport() \
    .getOrCreate()

CustDf \
    .writeStream \
    .queryName("customerdatatest") \
    .format("delta") \
    .outputMode("append") \
    .trigger(processingTime='30 seconds') \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "/checkpoint/bronze_customer/") \
    .toTable("bronze.customer")
I am expecting this streaming job to run continuously for at least 1 month.
Spark is transforming the JSON (flattening it) and inserting it into the delta table.
Please help me with this. Did I miss any configuration?

Trying to write and read delta table in the same pyspark structured streaming job. Can't see data

Is it possible for a PySpark job to write to a delta table and also read from the same table in the same code? Here is what I'm trying to do.
Problem statement: I'm having trouble printing the data on the console to see what is flowing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *

spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "demo.topic") \
    .option("startingOffsets", "earliest") \
    .load() \
    .withColumn("ingested_timestamp", unix_timestamp()) \
    .withColumn("value_str", col("value").cast(StringType())) \
    .select("ingested_timestamp", "value_str")

# code to write in the delta table called events
stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")

# code to read the same delta table
read_df = spark.read.format("delta").table("events")
read_df.show(5)

stream.awaitTermination()
The code runs without an error using the following command.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 kafka_and_create_delta_table.py
I'm trying to visualize, in the Delta table, the data that I'm flushing to Kafka, to make sure the data is flowing fine and the underlying components work well too.
I can see an empty table even after sending traffic to my topic.
Found no committed offset for the partition demo.topic-0
+------------------+---------+
|ingested_timestamp|value_str|
+------------------+---------+
+------------------+---------+
Any kind of assistance would be helpful.
Also, I tried running the write logic in one job and kept the read logic in another.
Read Job:
spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

read_df = spark.read.table("events")
read_df.show(5)

read_df.awaitTermination()
Then this read job was complaining,
pyspark.sql.utils.AnalysisException: Table or view not found: events; 'UnresolvedRelation [events], [], false

Pivot a streaming dataframe pyspark

I have a streaming dataframe from kafka and I need to pivot two columns. This is the code I'm currently using:
streaming_df = streaming_df.groupBy('Id', 'Date') \
    .pivot('Var') \
    .agg(first('Val'))

query = streaming_df.limit(5) \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("stream") \
    .start()

time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
I receive the following error:
pyspark.sql.utils.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
pyspark version: 3.1.1
Any ideas on how to implement a pivot with a streaming dataframe?
The pivot transformation is not supported by Spark when applied to streaming data.
What you can do is use foreachBatch with a user-defined function, like this:
def apply_pivot(stream_df, batch_id):
    # Here your pivot transformation; inside foreachBatch the micro-batch is a
    # regular DataFrame, so the batch writer can be used to append it to a table.
    stream_df \
        .groupBy('Id', 'Date') \
        .pivot('Var') \
        .agg(first('Val')) \
        .write \
        .mode('append') \
        .saveAsTable('stream')

query = streaming_df.limit(5) \
    .writeStream \
    .foreachBatch(apply_pivot) \
    .start()

time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
Let me know if it helped you!

pyspark - microbatch streaming delta table as a source to perform merge against another delta table - foreachBatch is not getting invoked

I have created a delta table and now I'm trying to merge data into that table using foreachBatch(). I've followed this example. I am running this code on Dataproc image 1.5x in Google Cloud.
Spark version 2.4.7
Delta version 0.6.0
My code looks as follows:
from pyspark.sql import SparkSession
from delta.tables import *

spark = SparkSession.builder \
    .appName("streaming_merge") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Function to upsert `microBatchOutputDF` into Delta table using MERGE
def mergeToDelta(microBatchOutputDF, batchId):
    (deltaTable.alias("accnt").merge(
        microBatchOutputDF.alias("updates"),
        "accnt.acct_nbr = updates.acct_nbr")
        .whenMatchedDelete(condition="updates.cdc_ind='D'")
        .whenMatchedUpdateAll(condition="updates.cdc_ind='U'")
        .whenNotMatchedInsertAll(condition="updates.cdc_ind!='D'")
        .execute()
    )

deltaTable = DeltaTable.forPath(spark, "gs:<<path_for_the_target_delta_table>>")

# Define the source extract
SourceDF = (
    spark.readStream
        .format("delta")
        .load("gs://<<path_for_the_source_delta_location>>")
)

# Start the query to continuously upsert into target tables in update mode
SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start()
This code runs without any problems, but no data is written to the delta table, and I suspect foreachBatch is not being invoked. Does anyone know what I'm doing wrong?
After adding awaitTermination(), the streaming query started working: it picked up the latest data from the source and performed the merge on the target delta table.
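In other words, a minimal sketch of the corrected tail of the query, with awaitTermination() appended (paths elided exactly as in the question):

SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start() \
    .awaitTermination()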

pyspark 2.4.x structured streaming foreachBatch not running

I am working with Spark 2.4.0 and Python 3.6. I am developing a Python program with PySpark Structured Streaming actions. The program runs two readStreams reading from two sockets and then makes a union of these two streaming dataframes. I tried Spark 2.4.0 and 2.4.3 but nothing changed.
Then I perform a single writeStream in order to write just one output streaming dataframe. THAT WORKS WELL.
However, since I also need to write a non-streaming dataset for each micro-batch, I coded a foreachBatch call inside the writeStream. THAT DOESN'T WORK.
I put spark.scheduler.mode=FAIR in spark-defaults.conf. I am running through spark-submit, but even though I also tried with python3 directly, it doesn't work at all. It looks as if it never executes the splitStream function referenced in the foreachBatch. I tried adding some prints in the splitStream function, without any effect.
I made many attempts, but nothing changed. I submitted via spark-submit and via python. I am working on a Spark standalone cluster.
inDF_1 = spark \
    .readStream \
    .format('socket') \
    .option('host', host_1) \
    .option('port', port_1) \
    .option("maxFilesPerTrigger", 1) \
    .load()

inDF_2 = spark \
    .readStream \
    .format('socket') \
    .option('host', host_2) \
    .option('port', port_2) \
    .option("maxFilesPerTrigger", 1) \
    .load() \
    .coalesce(1)

inDF = inDF_1.union(inDF_2)

#--------------------------------------------------#
# write streaming raw dataset R-01 plateMeasures   #
#--------------------------------------------------#
def splitStream(df, epoch_id):
    df \
        .write \
        .format('text') \
        .outputMode('append') \
        .start(path = outDir0)
    listDF = df.collect()
    print(listDF)
    pass

stageDir = dLocation.getLocationDir('R-00')
outDir0 = dLocation.getLocationDir(outList[0])
chkDir = dLocation.getLocationDir('CK-00')
query0 = programName + '_q0'

q0 = inDF_1 \
    .writeStream \
    .foreachBatch(splitStream) \
    .format('text') \
    .outputMode('append') \
    .queryName(query0) \
    .start(path = stageDir,
           checkpointLocation = chkDir)
I am using foreachBatch because I need to write to several sinks for each input micro-batch.
Thanks a lot to everyone who tries to help me.
I have tried this on my local machine and it works for Spark > 2.4.
df.writeStream
  .foreachBatch((microBatchDF, microBatchId) => {
    microBatchDF
      .withColumnRenamed("value", "body")
      .write
      .format("console")
      .option("checkpointLocation", "checkPoint")
      .save()
  })
  .start()
  .awaitTermination()
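The snippet above uses Scala syntax. Since the question is about pyspark, here is a rough Python sketch of the same pattern; it simply prints each micro-batch with show() instead of the console data source, and inDF_1 stands in for whichever streaming dataframe you want to write:

def print_batch(micro_batch_df, micro_batch_id):
    # micro_batch_df is a normal DataFrame, so any batch action or writer works here
    micro_batch_df \
        .withColumnRenamed("value", "body") \
        .show(truncate=False)

q = inDF_1 \
    .writeStream \
    .foreachBatch(print_batch) \
    .option("checkpointLocation", "checkPoint") \
    .start()
q.awaitTermination()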