Apache Arrow OutOfMemoryException when PySpark reads Hive table to pandas - pyspark

I searched for this error but couldn't find any information on how to solve it. This is what I get when I run the two scripts below:
org.apache.arrow.memory.OutOfMemoryException: Failure while allocating memory.
write.py
import pandas as pd
from pyspark.sql import SparkSession
from os.path import abspath
warehouse_location = abspath('spark-warehouse')
booksPD = pd.read_csv('books.csv')
spark = SparkSession.builder \
.appName("MyApp") \
.master("local[*]") \
.config("spark.sql.execution.arrow.enabled", "true") \
.config("spark.driver.maxResultSize", "16g") \
.config("spark.python.worker.memory", "16g") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark.createDataFrame(booksPD).write.saveAsTable("books")
spark.catalog.clearCache()
read.py
from pyspark.sql import SparkSession
from os.path import abspath
warehouse_location = abspath('spark-warehouse')
spark = SparkSession.builder \
.appName("MyApp") \
.master("local[*]") \
.config("spark.sql.execution.arrow.enabled", "true") \
.config("spark.driver.maxResultSize", "16g") \
.config("spark.python.worker.memory", "16g") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
books = spark.sql("SELECT * FROM books").toPandas()

The memory limits most likely need to be raised. Adding the following settings to the SparkSession builder, which increase driver and executor memory, solved the problem in my case:
.config("spark.driver.memory", "16g") \
.config("spark.executor.memory", "16g") \
Because the program runs in local mode (.master("local[*]")), the driver and the executors share a single JVM, so the driver carries the load too and needs enough memory.
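As a rough sketch of where those two settings go, the snippet below layers them over the configuration from read.py and builds the session from a plain dict. The helper names (MEMORY_SETTINGS, build_conf, make_session) are hypothetical and only for illustration, and the 16g values are simply the ones from the answer, not a recommendation. pyspark is imported lazily, so the dict handling can be checked without Spark installed.

```python
from os.path import abspath

# Memory settings from the answer above; tune the values to your machine.
MEMORY_SETTINGS = {
    "spark.driver.memory": "16g",
    "spark.executor.memory": "16g",
}


def build_conf(base, extra=MEMORY_SETTINGS):
    """Return a new dict with the extra settings layered over the base config."""
    merged = dict(base)
    merged.update(extra)
    return merged


def make_session(conf, app_name="MyApp", master="local[*]"):
    """Create a Hive-enabled SparkSession from a plain dict of settings."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Configuration used in read.py, before the fix.
base_conf = {
    "spark.sql.execution.arrow.enabled": "true",
    "spark.driver.maxResultSize": "16g",
    "spark.python.worker.memory": "16g",
    "spark.sql.warehouse.dir": abspath("spark-warehouse"),
}
conf = build_conf(base_conf)

# With pyspark installed and the books table written by write.py:
# spark = make_session(conf)
# books = spark.sql("SELECT * FROM books").toPandas()
```

Note that getOrCreate() returns an existing session if one is already running in the process, in which case spark.driver.memory does not change the heap of the JVM that is already up. Set it before the first session is created.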

Related

Spark Shuffle Read and Shuffle Write Increasing in Structured Streaming

I have been running Spark Structured Streaming with Kafka for the last 23 hours. Shuffle Read and Shuffle Write have increased drastically, and the driver finally stopped with an "out of memory" error.
Data is pushed to Kafka at 3 JSON messages per second, and the Spark streaming trigger is processingTime='30 seconds'.
spark = SparkSession \
.builder \
.master("spark://spark-master:7077") \
.appName("demo") \
.config("spark.executor.cores", 1) \
.config("spark.cores.max", "4") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.config("spark.sql.warehouse.dir", "hdfs://172.30.7.36:9000/user/hive/warehouse") \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.executor.memory", '1g') \
.config("spark.scheduler.mode", "FAIR") \
.config("spark.driver.memory", '2g') \
.config("spark.sql.caseSensitive", "true") \
.config("spark.sql.shuffle.partitions", 8) \
.enableHiveSupport() \
.getOrCreate()
CustDf \
.writeStream \
.queryName("customerdatatest") \
.format("delta") \
.outputMode("append") \
.trigger(processingTime='30 seconds') \
.option("mergeSchema", "true") \
.option("checkpointLocation", "/checkpoint/bronze_customer/") \
.toTable("bronze.customer")
I expect this stream to run continuously for at least 1 month.
Spark transforms the JSON (flattening it) and inserts it into the Delta table.
Please help me with this. Have I missed any configuration?

Trying to write and read delta table in the same pyspark structured streaming job. Can't see data

Is it possible for a PySpark job to write in a delta table and also read from the same in the same code? Here is what I'm trying to do.
Problem statement: I'm having trouble printing the data on the console to see what is flowing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *
spark = SparkSession \
.builder \
.appName("test") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "demo.topic") \
.option("startingOffsets", "earliest") \
.load() \
.withColumn("ingested_timestamp", unix_timestamp()) \
.withColumn("value_str", col("value").cast(StringType())) \
.select("ingested_timestamp", "value_str")
# code to write in the delta table called events
stream = kafka_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
.toTable("events")
# code to read the same delta table
read_df = spark.read.format("delta").table("events");
read_df.show(5)
stream.awaitTermination()
The code runs without an error using the following command.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 kafka_and_create_delta_table.py
I'm trying to view the data that I'm sending through Kafka into the Delta table, to make sure the data is flowing and the underlying components work.
However, I see an empty table even after sending traffic to my topic.
Found no committed offset for the partition demo.topic-0
+------------------+---------+
|ingested_timestamp|value_str|
+------------------+---------+
+------------------+---------+
Any kind of assistance would be helpful.
Also, I tried running write logic in one job and kept the read job in another.
Read Job:
spark = SparkSession \
.builder \
.appName("test") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
read_df = spark.read.table("events");
read_df.show(5)
read_df.awaitTermination()
The read job then failed with:
pyspark.sql.utils.AnalysisException: Table or view not found: events; 'UnresolvedRelation [events], [], false

Spark Job taking long time to append data to S3

I'm running a Spark job on EMR to convert a large zipped file (15 GB) to Parquet, but writing to S3 takes too long.
I'm using r5 instances for the master (1 instance) and core nodes (3 instances).
Here is my code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
def main():
    spark = SparkSession \
        .builder \
        .appName("csv-to-parquer-convertor") \
        .config("spark.sql.catalogimplementation", "hive") \
        .config("hive.metastore.connect.retries", 3) \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport().getOrCreate()
    tgt_filename = 'SOME_Prefix'
    src_path = 'SOURCE_S3_PATH'
    tgt_path = 'TARGET_ BUCKET' + tgt_filename
    df = spark.read.csv(src_path, header=True)
    partitioned_df = df.repartition(50)
    partitioned_df.write.mode('append').parquet(path=tgt_path)
    spark.stop()
if __name__ == "__main__":
    main()
Any suggestions would be much appreciated.

Pyspark error occurred while calling o50.load. : com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

I need help. I'm struggling with a PySpark JDBC error in my Jupyter notebook. I'm connecting from my local machine to a database URL hosted elsewhere. My code is below. So far I have:
1. installed the SQL connector
2. installed MySQL Workbench to test the connection
3. installed Jupyter and set it up
I'm honestly not sure which libraries or installs I'm missing. Do I need to download a local MySQL server?
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SQLContext
sqlContext = sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
#connection to oc_customer_reward (find store credits customers gain), change db and port
df_storecredit = sqlContext.read \
.format("jdbc") \
.option("url", "jdbc:mysql://link") \
.option("driver", "com.mysql.jdbc.Driver") \
.option("dbtable", "(SELECT customer_id, order_id, date_added as 'timestamp', points from oc_customer_reward) as orders") \
.option("user", "") \
.option("password", "") \
.load()
df_storecredit.printSchema()
#add project_id column
from pyspark.sql.functions import lit
df_projectid = df_storecredit.withColumn('project_id', lit(''))
df_projectid.printSchema()
#Save the dataframe to the table.
db_properties = {"user": "","password": "","driver": "org.postgresql.Driver"}
df_projectid.write.jdbc(url='jdbc:postgresql://link',table='dwh_storecredits',mode='append',properties=db_properties)

PySpark - micro-batch streaming with a Delta table as source to merge into another Delta table - foreachBatch is not getting invoked

I have created a Delta table and am now trying to merge data into it using foreachBatch(). I've followed this example. I am running this code on the Dataproc 1.5.x image in Google Cloud.
Spark version 2.4.7
Delta version 0.6.0
My code looks as follows:
from pyspark.sql import SparkSession
from delta.tables import *
spark = SparkSession.builder \
.appName("streaming_merge") \
.master("local[*]") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# Function to upsert `microBatchOutputDF` into Delta table using MERGE
def mergeToDelta(microBatchOutputDF, batchId):
    (deltaTable.alias("accnt").merge(
        microBatchOutputDF.alias("updates"),
        "accnt.acct_nbr = updates.acct_nbr")
    .whenMatchedDelete(condition = "updates.cdc_ind='D'")
    .whenMatchedUpdateAll(condition = "updates.cdc_ind='U'")
    .whenNotMatchedInsertAll(condition = "updates.cdc_ind!='D'")
    .execute()
    )
deltaTable = DeltaTable.forPath(spark, "gs:<<path_for_the_target_delta_table>>")
# Define the source extract
SourceDF = (
    spark.readStream
    .format("delta")
    .load("gs://<<path_for_the_source_delta_location>>")
)
# Start the query to continuously upsert into target tables in update mode
SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation","gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start()
This code runs without any errors, but no data is written to the Delta table. I suspect foreachBatch is not being invoked. Does anyone know what I'm doing wrong?
After adding awaitTermination() to the query returned by start(), the stream started working: it picked up the latest data from the source and performed the merge on the target Delta table. Without it, start() returns immediately and the script can exit before the batch is processed.
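A minimal sketch of that fix, assuming the same one-shot trigger and foreachBatch callback as in the question. The function name run_merge_once and its parameters are hypothetical. It expects a streaming DataFrame such as SourceDF and a callback such as mergeToDelta, and leaves the checkpoint path to the caller.

```python
def run_merge_once(source_df, merge_fn, checkpoint_location):
    """Start a one-shot foreachBatch query and block until it has finished."""
    query = (
        source_df.writeStream
        .outputMode("update")
        .foreachBatch(merge_fn)
        .option("checkpointLocation", checkpoint_location)
        .trigger(once=True)
        .start()  # non-blocking: returns a StreamingQuery handle right away
    )
    # Block the driver until the single micro-batch has been processed;
    # without this the script may exit before merge_fn ever runs.
    query.awaitTermination()
    return query


# Usage with the names from the question (requires a running Spark session):
# run_merge_once(SourceDF, mergeToDelta, "<checkpoint location>")
```

With trigger(once=True) the query stops by itself after one batch, so awaitTermination() returns once the merge is done rather than blocking forever.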