pyspark - kafka stream - out of memory

I'm trying to test Kafka streaming against a broker version 0.10 with the code below. It's just a simple script that prints the contents of a topic, nothing fancy yet. But for some reason the memory is not enough (10 GB of RAM in a VM). The code:
# coding: utf-8
"""
kafka-test-003.py: test with broker 0.10 (new Spark Structured Streaming API)
How to run this script?
spark-submit --jars jars/spark-sql-kafka-0-10_2.11-2.3.0.jar,jars/kafka-clients-0.11.0.0.jar kafka-test-003.py
"""
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# starting spark session
spark = SparkSession.builder.appName("Kakfa-test").getOrCreate()
spark.sparkContext.setLogLevel('WARN')
# getting streaming context (note: this DStream StreamingContext is never used by the Structured Streaming query below)
sc = spark.sparkContext
ssc = StreamingContext(sc, 2)  # batch interval: 2 seconds
broker = "kafka.some.address:9092"
topic = "my.topic"
### Streaming
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("startingOffsets", "earliest") \
    .option("subscribe", topic) \
    .load() \
    .select(col('key').cast("string"), col('value').cast("string"))
query = df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
### End Streaming
query.awaitTermination()
Running spark-submit:
spark-submit --master local[*] --driver-memory 5G --executor-memory 5G --jars jars/kafka-clients-0.11.0.0.jar,jars/spark-sql-kafka-0-10_2.11-2.3.0.jar kafka-test-003.py
Unfortunately, the result is:
java.lang.OutOfMemoryError: Java heap space
I assumed Kafka would deliver small portions of data at a time, precisely to avoid this kind of problem, right? So, what am I doing wrong?

Spark memory management is a complex process; the optimal settings depend not only on your data and the type of operations, but also on the system behaviour.
Can you retry with the following spark-submit command:
spark-submit --master local[*] --driver-memory 4G --executor-memory 2G --executor-cores 5 --num-executors 8 --jars jars/kafka-clients-0.11.0.0.jar,jars/spark-sql-kafka-0-10_2.11-2.3.0.jar kafka-test-003.py
You can tune the memory parameters above for your workload; how the related flags behave is discussed in: Using spark-submit, what is the behavior of the --total-executor-cores option?
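Beyond the submit flags, one thing to keep in mind (an observation, not part of the original answer): with startingOffsets set to earliest, the very first micro-batch covers the entire topic backlog, which can be a lot to process at once. A minimal sketch of the same query with the Kafka source's maxOffsetsPerTrigger option used to cap how much each batch reads; the 10000 limit is an arbitrary example value:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Kafka-test").getOrCreate()
spark.sparkContext.setLogLevel('WARN')

df = (spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka.some.address:9092")
      .option("subscribe", "my.topic")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 10000)   # at most 10k offsets per micro-batch
      .load()
      .select(col('key').cast("string"), col('value').cast("string")))

query = (df
         .writeStream
         .outputMode("append")
         .format("console")
         .option("numRows", 20)   # console sink prints only a sample of each batch
         .start())
query.awaitTermination()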

Related

PySpark not able to read from BigQuery table

I am running the code below to pull a BigQuery table using PySpark. The Spark session initiates without any issue, but I am not able to connect to the table in the public dataset. Here is the error I get from running the script.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
Error screenshot: https://i.stack.imgur.com/actAv.png
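A side note on the configuration (not part of the original question): spark.jars.packages expects Maven coordinates, while a direct jar location such as the gs:// path above is normally passed through spark.jars. A minimal sketch of both variants against the same public table; the Maven coordinates in the second variant are an assumption based on the connector's published artifacts:
from pyspark.sql import SparkSession

# Variant 1: point spark.jars at the connector jar on GCS
spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

# Variant 2 (alternative): let Spark resolve the connector from Maven
# spark = SparkSession.builder \
#     .appName('Optimize BigQuery Storage') \
#     .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1') \
#     .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
df.show(5)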

Spark Streaming Job starts reading from beginning instead of where it stopped consuming

I am trying to use Dataproc on Google Cloud Platform for my Spark Streaming jobs.
I use Kafka as my source and try to write to MongoDB. It's working fine, but after the job fails it starts to read the messages from my Kafka topic from the beginning instead of from where it stopped.
Here is my config for reading from Kafka:
clickstreamTestDf = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("subscribe", "customer_experience")
    .option("failOnDataLoss", "false")
    .option("startingOffsets", "earliest")
    .load()
)
And here is my write stream code:
finished_df.writeStream \
    .format("mongodb") \
    .option("spark.mongodb.connection.uri", connectionString) \
    .option("spark.mongodb.database", "Company-Environment") \
    .option("spark.mongodb.collection", "customer_experience") \
    .option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp") \
    .option("forceDeleteTempCheckpointLocation", "true") \
    .outputMode("append") \
    .start() \
    .awaitTermination()
Do I need to set startingOffsets to latest? I tried, but it still didn't read from where it stopped.
Can I use checkpointLocation like this? Is it okay to use a directory in Google Cloud Storage?
I want to run the streaming job, stop it, delete the Dataproc cluster, and then create a new one the next day and continue reading from where it left off. Is that possible, and how?
Really need some help here!
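For context (this describes how Structured Streaming behaves, it is not part of the original question): startingOffsets is only honoured when a query starts with no existing checkpoint; once a checkpoint exists, the query resumes from the offsets stored there. So the usual pattern is to keep checkpointLocation on durable storage (a GCS bucket works) and reuse exactly the same path on every run, even from a brand-new Dataproc cluster. A minimal sketch reusing the variables from the question, with a hypothetical bucket path and the SASL options omitted for brevity:
# Read side: startingOffsets only matters for the very first run
clickstreamTestDf = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("subscribe", "customer_experience")
    .option("startingOffsets", "earliest")   # ignored once a checkpoint exists
    .option("failOnDataLoss", "false")
    .load()
)

# Write side: reuse the SAME checkpoint path on every run / every new cluster
(finished_df.writeStream
    .format("mongodb")
    .option("spark.mongodb.connection.uri", connectionString)
    .option("spark.mongodb.database", "Company-Environment")
    .option("spark.mongodb.collection", "customer_experience")
    .option("checkpointLocation", "gs://some-durable-bucket/checkpoints/customer_experience")  # hypothetical bucket
    .outputMode("append")
    .start()
    .awaitTermination())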

mongodb pyspark connector set up

I have followed the link here to install; the build is successful but I cannot find the connector.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
The connector was downloaded and built from here:
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I am using Ubuntu 20.04.
Change to
df = spark.read.format("mongodb").load()
Then, you have to tell pyspark where to find the Mongo libraries, e.g.:
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
I'm running pyspark in local mode, with MongoDB version 4 and Spark version 3.2.1.
I downloaded all the needed jars into one folder (path_to_jars) and added it to the Spark config:
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession

url = 'mongodb://id:port/Database.collection'

spark = (SparkSession
    .builder
    .master('local[*]')
    .config('spark.driver.extraClassPath', 'path_to_jars/*')
    .config("spark.mongodb.read.connection.uri", url)
    .config("spark.mongodb.write.connection.uri", url)
    .getOrCreate()
)

df = spark.read.format("mongodb").load()

Pyspark, MongoDB and Missing BSON Reference

I am trying to write a basic pyspark script to connect to MongoDB. I am using Spark 3.1.2 and the MongoDB driver 3.2.2.
My code is:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
When I execute in Pyspark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to resolve the missing BSON reference?
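One common thing to check (a suggestion, not a confirmed fix for this thread): a NoClassDefFoundError for org/bson/conversions/Bson means the BSON jar never made it onto the classpath. Declaring the same connector coordinates via spark.jars.packages on the SparkSession builder usually pulls in the MongoDB Java driver and its BSON artifact transitively. A minimal sketch reusing the URIs from the question:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("SparkSQL")
    # resolves mongo-spark-connector plus its mongodb-driver/bson dependencies
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll")
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll")
    .getOrCreate())

df = spark.read.format("mongo").load()
df.show(5)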

How to run a memory intensive spark streaming job using the smallest memory usage possible

I am running a job that joins two Kafka topics into one topic using a value. The environment I am using only allows me to assign less than 10 GB of RAM per job; the data I am trying to join is around 500k records per topic.
I am pretty new to Spark, so I'd like to know if there is a way to minimize the memory consumption.
The code:
val df_person: DataFrame = PERSONINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s"))
  .select("s.*")
  .withColumn("comsume_date", lit(LocalDateTime.now.format(DateTimeFormatter.ofPattern("HH:mm:ss.SS"))))
  .as("dfperson")
val df_candidate: DataFrame = CANDIDATEINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s"))
  .select("s.*")
  .withColumn("comsume_date", lit(LocalDateTime.now.format(DateTimeFormatter.ofPattern("HH:mm:ss.SS"))))
  .as("dfcandidate")
Join topics:
val joined_df: DataFrame = df_candidate
  .join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"), "inner")
  .withColumn("join_date", lit(LocalDateTime.now.format(DateTimeFormatter.ofPattern("HH:mm:ss.SS"))))
Re-structure the data:
val string2json: DataFrame = joined_df.select(
  $"dfcandidate.ID".as("key"),
  to_json(struct(
    $"dfcandidate.ID".as("candidateID"),
    $"FULLNAME",
    $"PERSONALID",
    $"join_date",
    $"dfcandidate.PERSONID".as("personID"),
    $"dfcandidate.comsume_date".as("candidate_comsume_time"),
    $"dfperson.comsume_date".as("person_comsume_time")
  )).cast("String").as("value"))
Write them to a topic:
string2json.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxx:9092")
  .option("topic", "mergedinfo")
  .option("checkpointLocation", "/tmp/producer/checkpoints")
  .option("failOnDataLoss", false)
  .start()
  .awaitTermination()
The run command:
spark-submit --class taqasi_spark.App --master yarn ./spark_poc-test_memory.jar --executor-memory 10g --driver-memory 10g --executor-memory 10g --deploy-mode cluster
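One detail worth flagging (an observation, not part of the original post): spark-submit treats everything placed after the application jar as arguments to the application's main class, so in the command above the --executor-memory, --driver-memory and --deploy-mode flags most likely never reach spark-submit at all. A reordered version of the same command, with all options before the jar:
spark-submit --class taqasi_spark.App --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g ./spark_poc-test_memory.jar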