Spark Shuffle Read and Shuffle Write increasing in Structured Streaming - spark-structured-streaming

I have been running Spark Structured Streaming with Kafka for the last 23 hours. I could see Shuffle Read and Shuffle Write increasing drastically until, finally, the driver stopped due to an "out of memory" error.
Data is pushed to Kafka at 3 JSON messages per second, and Spark Streaming runs with processingTime='30 seconds'.
spark = SparkSession \
    .builder \
    .master("spark://spark-master:7077") \
    .appName("demo") \
    .config("spark.executor.cores", 1) \
    .config("spark.cores.max", "4") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "hdfs://172.30.7.36:9000/user/hive/warehouse") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.executor.memory", '1g') \
    .config("spark.scheduler.mode", "FAIR") \
    .config("spark.driver.memory", '2g') \
    .config("spark.sql.caseSensitive", "true") \
    .config("spark.sql.shuffle.partitions", 8) \
    .enableHiveSupport() \
    .getOrCreate()
CustDf \
    .writeStream \
    .queryName("customerdatatest") \
    .format("delta") \
    .outputMode("append") \
    .trigger(processingTime='30 seconds') \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "/checkpoint/bronze_customer/") \
    .toTable("bronze.customer")
I am expecting this streaming job to run continuously for at least a month.
Spark is transforming the JSON (flattening it) and inserting it into the Delta table.
Please help me with this. Did I miss any configuration?
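For reference, here is a minimal sketch of the kind of Kafka-to-flattened-DataFrame pipeline described above; the topic name, bootstrap servers, and payload schema are assumptions, since the construction of CustDf is not shown:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Assumed payload schema; the real customer JSON will differ
cust_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("address", StructType([
        StructField("city", StringType()),
        StructField("zip", StringType()),
    ])),
])

raw_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "customer") \
    .load()

# Parse the Kafka value and flatten the nested struct into top-level columns
CustDf = raw_df \
    .select(from_json(col("value").cast("string"), cust_schema).alias("data")) \
    .select("data.id", "data.name", "data.address.city", "data.address.zip")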

Related

Trying to write and read delta table in the same pyspark structured streaming job. Can't see data

Is it possible for a PySpark job to write to a Delta table and also read from the same table in the same code? Here is what I'm trying to do.
Problem statement: I'm having trouble printing the data to the console to see what is flowing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *

spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "demo.topic") \
    .option("startingOffsets", "earliest") \
    .load() \
    .withColumn("ingested_timestamp", unix_timestamp()) \
    .withColumn("value_str", col("value").cast(StringType())) \
    .select("ingested_timestamp", "value_str")
# code to write in the delta table called events
stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")
# code to read the same delta table
read_df = spark.read.format("delta").table("events")
read_df.show(5)
stream.awaitTermination()
The code runs without an error using the following command.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 kafka_and_create_delta_table.py
I'm trying to visualize the data that I'm flushing to Kafka, once it lands in the Delta table, to make sure the data is flowing fine and the underlying components work well too.
I can see an empty table even after sending traffic to my topic.
Found no committed offset for the partition demo.topic-0
+------------------+---------+
|ingested_timestamp|value_str|
+------------------+---------+
+------------------+---------+
Any kind of assistance would be helpful.
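One thing worth checking here (a sketch, not a confirmed fix for this setup): toTable() starts the streaming query asynchronously, so the batch read on the very next line can run before the first micro-batch has been committed. Blocking until the sink has caught up before reading would look like this:
stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")

# Block until everything currently available in the source has been processed,
# then do the batch read
stream.processAllAvailable()
spark.read.format("delta").table("events").show(5)

stream.awaitTermination()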
Also, I tried running the write logic in one job and kept the read logic in another job.
Read Job:
spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
read_df = spark.read.table("events")
read_df.show(5)
read_df.awaitTermination()
Then this read job complained:
pyspark.sql.utils.AnalysisException: Table or view not found: events; 'UnresolvedRelation [events], [], false
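If the two jobs do not share a metastore, the events table registered by the write job will not be visible to the read job by name. A workaround sketch is to read the Delta table by path instead; the path below is an assumption, point it at wherever the write job actually created the table:
read_df = spark.read.format("delta").load("./spark-warehouse/events")
read_df.show(5)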

Pivot a streaming dataframe pyspark

I have a streaming dataframe from kafka and I need to pivot two columns. This is the code I'm currently using:
streaming_df = streaming_df.groupBy('Id','Date') \
    .pivot('Var') \
    .agg(first('Val'))
query = streaming_df.limit(5) \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("stream") \
    .start()
time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
I receive the following error:
pyspark.sql.utils.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
pyspark version: 3.1.1
Any ideas how to implement pivot with a streaming dataframe?
The pivot transformation is not supported by Spark when applied to streaming data.
What you can do is use foreachBatch with a user-defined function, like this:
def apply_pivot(stream_df, batch_id):
    # Each micro-batch arrives here as a plain (static) DataFrame, so pivot works.
    # Append the pivoted result to a table that spark.sql can query afterwards.
    stream_df \
        .groupBy('Id', 'Date') \
        .pivot('Var') \
        .agg(first('Val')) \
        .write \
        .mode('append') \
        .saveAsTable('stream')

query = streaming_df \
    .writeStream \
    .foreachBatch(apply_pivot) \
    .start()

time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
Let me know if it helped you!

Can I "branch" stream into many and write them in parallel in pyspark?

I am receiving a Kafka stream in pyspark. Currently I am grouping it by one set of fields and writing updates to a database:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic)
...
df = df \
    .groupBy("myfield1") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
query = df \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
    .start()
query.awaitTermination()
Can I take the same chain in the middle and create another grouping like
df2 = df \
    .groupBy("myfield2") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
and write its output to a different place, in parallel?
Where should I call writeStream and awaitTermination?
Yes, you can branch a Kafka input stream into as many streaming queries as you like.
You need to consider the following:
query.awaitTermination is a blocking method, which means whatever code you write after it will not be executed until that query terminates.
Each "branched" streaming query will run in parallel, and it is important that you define a checkpoint location in each of their writeStream calls.
Overall, your code needs to have the following structure:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic) \
    .[...]
# note that I changed the variable name to "df1"
df1 = df \
    .groupBy("myfield1") \
    .[...]
df2 = df \
    .groupBy("myfield2") \
    .[...]
query1 = df1 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc1") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()
query2 = df2 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc2") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()
spark.streams.awaitAnyTermination()
Just an additional remark: in the code you are showing, you are overwriting df, so the derivation of df2 might not give you the results you intended.

pyspark - micro-batch streaming Delta table as a source to perform merge against another Delta table - foreachBatch is not getting invoked

I have created a Delta table and now I'm trying to merge data into that table using foreachBatch(). I've followed this example. I am running this code on Dataproc image 1.5x in Google Cloud.
Spark version 2.4.7
Delta version 0.6.0
My code looks as follows:
from pyspark.sql import SparkSession
from delta.tables import *

spark = SparkSession.builder \
    .appName("streaming_merge") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Function to upsert `microBatchOutputDF` into the Delta table using MERGE
def mergeToDelta(microBatchOutputDF, batchId):
    (deltaTable.alias("accnt").merge(
        microBatchOutputDF.alias("updates"),
        "accnt.acct_nbr = updates.acct_nbr")
        .whenMatchedDelete(condition = "updates.cdc_ind='D'")
        .whenMatchedUpdateAll(condition = "updates.cdc_ind='U'")
        .whenNotMatchedInsertAll(condition = "updates.cdc_ind!='D'")
        .execute()
    )

deltaTable = DeltaTable.forPath(spark, "gs:<<path_for_the_target_delta_table>>")

# Define the source extract
SourceDF = (
    spark.readStream
        .format("delta")
        .load("gs://<<path_for_the_source_delta_location>>")
)

# Start the query to continuously upsert into the target table in update mode
SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpoint_location>>") \
    .trigger(once=True) \
    .start()
This code runs without any problems, but there is no data written to the Delta table; I suspect foreachBatch is not getting invoked. Does anyone know what I'm doing wrong?
After adding awaitTermination, the streaming query started working: it picked up the latest data from the source and performed the merge on the Delta target table.
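For clarity, a minimal sketch of where that call goes, assuming the query handle returned by start() is captured in a variable (the rest of the chain is as in the question):
query = SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpoint_location>>") \
    .trigger(once=True) \
    .start()

# Block the driver until the trigger-once run finishes, so the process does not
# exit before the micro-batch (and the merge inside foreachBatch) has completed.
query.awaitTermination()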

Apache Arrow OutOfMemoryException when PySpark reads Hive table to pandas

I searched for this kind of error, and I couldn't find any information on how to solve it. This is what I get when I execute the below two scripts:
org.apache.arrow.memory.OutOfMemoryException: Failure while allocating memory.
write.py
import pandas as pd
from pyspark.sql import SparkSession
from os.path import abspath

warehouse_location = abspath('spark-warehouse')
booksPD = pd.read_csv('books.csv')

spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .config("spark.driver.maxResultSize", "16g") \
    .config("spark.python.worker.memory", "16g") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

spark.createDataFrame(booksPD).write.saveAsTable("books")
spark.catalog.clearCache()
read.py
from pyspark.sql import SparkSession
from os.path import abspath

warehouse_location = abspath('spark-warehouse')

spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .config("spark.driver.maxResultSize", "16g") \
    .config("spark.python.worker.memory", "16g") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

books = spark.sql("SELECT * FROM books").toPandas()
Most probably, one has to increase the memory limits. Appending the configurations below, to increase the driver and executor memory, solved the problem in my case.
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
Since the program is configured to run in local mode (.master("local[*]")), the driver will get some of the load too and will need enough memory.
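Applied to read.py, the builder would then look roughly like this (the same settings as above, just shown in context):
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .config("spark.driver.maxResultSize", "16g") \
    .config("spark.python.worker.memory", "16g") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()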