I am trying to create a Spark connection in a spark_conn() function and reuse that connection in my other functions. For example, in the code below, read_data() uses the session returned by spark_conn(). Is my approach correct?
from pyspark.sql import SparkSession
def spark_conn():
    spark = SparkSession \
        .builder \
        .appName("sparkConnection") \
        .getOrCreate()
    return spark
def read_data(spark, SNOWFLAKE_SOURCE_NAME, snowflake_options, loadtime, sftable):
    df = spark.read \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**snowflake_options) \
        .option("query","SELECT * FROM sftable WHERE SNAPSHOT==loadtime") \
        .load()
if __name__=="__main__":
    conn = spark_conn()
    rd = read_data(conn, SNOWFLAKE_SOURCE_NAME, snowflake_options, loadtime, sftable)
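Passing the session returned by spark_conn() into read_data() is a reasonable pattern, and because getOrCreate() returns the already running session if there is one, calling it more than once is harmless. Two details in read_data() look unintended, though: the query is a plain string, so the words sftable and loadtime are sent literally instead of being replaced by their values, and the function never returns df. Below is a minimal sketch of building the query, assuming SNAPSHOT is compared against a string value. The helper name build_snapshot_query and the identifier check are illustrative and not part of the original code.

```python
import re

def build_snapshot_query(sftable, loadtime):
    """Return the SELECT statement with the table name and load time filled in."""
    # Allow only plain (optionally dotted) identifiers as the table name,
    # because it is interpolated directly into the SQL text.
    pattern = r"[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*)*"
    if not re.fullmatch(pattern, sftable):
        raise ValueError(f"unexpected table name: {sftable!r}")
    # Double any single quotes so the value stays inside the SQL string literal.
    safe_loadtime = str(loadtime).replace("'", "''")
    # Standard SQL uses a single '=' for equality.
    return f"SELECT * FROM {sftable} WHERE SNAPSHOT = '{safe_loadtime}'"

print(build_snapshot_query("MY_DB.MY_TABLE", "2020-01-01"))
```

In read_data(), this string would be passed as .option("query", build_snapshot_query(sftable, loadtime)), and the function would end with return df so that rd is not None.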
I want to convert data from a DataFrame to an RDD and save it to MongoDB. Here is my code:
import pymongo
import pymongo_spark
from pyspark import SparkConf, SparkContext
from pyspark import BasicProfiler
from pyspark.sql import SparkSession
class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)
conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder \
.master("local[*]") \
.appName("Word Count") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# Important: activate pymongo_spark.
pymongo_spark.activate()
on_time_dataframe = spark.read.parquet(r'\data\on_time_performance.parquet')
on_time_dataframe.show()
# Note we have to convert the row to a dict to avoid https://jira.mongodb.org/browse/HADOOP-276
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
The following error occurs:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable
I have installed the mongo-hadoop file, but it seems the BSONWritable class is missing. I'm not good at Java, so I would appreciate some help.
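A ClassNotFoundException for com.mongodb.hadoop.io.BSONWritable usually means the JVM cannot see the jar that contains that class. Installing the Python package alone is therefore not enough: the mongo-hadoop jars also have to be handed to Spark, for example with --jars or spark.jars. A jar is a zip archive, so one way to check whether a given jar actually holds the class is sketched below. The function name jar_contains_class and the jar path in the usage note are hypothetical.

```python
import zipfile

def jar_contains_class(jar_path, class_name):
    """Return True if the jar at jar_path contains the compiled class class_name."""
    # A class a.b.C is stored inside the jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, jar_contains_class("path/to/mongo-hadoop-core.jar", "com.mongodb.hadoop.io.BSONWritable") can confirm whether a downloaded jar is the one that needs to be on the classpath; the path here is a placeholder.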
My requirement is to call a Spark Scala function from an existing PySpark program.
What is the best way to pass the SparkSession created in the PySpark program to the Scala function?
I pass my Scala jar to PySpark as follows:
spark-submit --jars ScalaExample-0.1.jar pyspark_call_scala_example.py iris.data
Scala code:
def getDf(spark: SparkSession, query:String, df: DataFrame, log: Logger): DataFrame = {
  import spark.implicits._
  val df = spark.sql(query)
  df
}
PySpark code:
if __name__ == '__main__':
    query = sys.argv[1]
    spark = SparkSession \
        .builder \
        .appName("PySpark using Scala example") \
        .getOrCreate()
    log4jLogger = sc._jvm.org.apache.log4j
    log = log4jLogger.LogManager.getLogger(__name__)
    query_df = DataFrame(sc._jvm.com.crowdstrike.dsci.sparkjobs.PythonHelper.getDf(???, query, ???), sqlContext)
Question
How do I pass the SparkSession and the logger to getDf?
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/
To pass SparkSession from Python to Scala, use spark._jsparkSession.
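Putting that together, a minimal sketch of the call might look like the following. It assumes getDf is exposed on a Scala object at the package path used in the question, and that the unused df parameter is dropped so the signature becomes getDf(spark, query, log); both are assumptions. The helper names call_get_df and wrap_java_df are hypothetical.

```python
def call_get_df(spark, query, logger_name):
    """Call the Scala getDf helper through the Py4J gateway; returns a Java DataFrame."""
    jvm = spark._jvm  # Py4J view of the driver JVM
    # Create the logger on the JVM side so Scala receives a real log4j Logger.
    log = jvm.org.apache.log4j.LogManager.getLogger(logger_name)
    helper = jvm.com.crowdstrike.dsci.sparkjobs.PythonHelper
    # spark._jsparkSession is the JVM SparkSession behind the Python session.
    return helper.getDf(spark._jsparkSession, query, log)


def wrap_java_df(jdf, spark):
    """Wrap a Java DataFrame so it can be used from Python."""
    # Imported lazily so call_get_df can be read and exercised without pyspark.
    from pyspark.sql import DataFrame
    # Assumption: recent PySpark accepts the SparkSession here; older releases
    # expect a SQLContext as the second argument, as in the question.
    return DataFrame(jdf, spark)
```

Usage would then be query_df = wrap_java_df(call_get_df(spark, query, __name__), spark).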
I'm trying to connect spark (pyspark) to mongodb as follows:
conf = SparkConf()
conf.set('spark.mongodb.input.uri', default_mongo_uri)
conf.set('spark.mongodb.output.uri', default_mongo_uri)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession \
.builder \
.appName("my-app") \
.config("spark.mongodb.input.uri", default_mongo_uri) \
.config("spark.mongodb.output.uri", default_mongo_uri) \
.getOrCreate()
But when I do the following:
users = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("uri", '{uri}.{col}'.format(uri=mongo_uri, col='users')).load()
I get this error:
java.lang.ClassNotFoundException: Failed to find data source:
com.mongodb.spark.sql.DefaultSource
I did the same thing from pyspark shell and I was able to retrieve data. This is the command I ran:
pyspark --conf "spark.mongodb.input.uri=mongodb_uri" --conf "spark.mongodb.output.uri=mongodburi" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.2
There, we have the option to specify the package to use. But what about standalone apps and scripts? How can I configure mongo-spark-connector there?
Any ideas?
Here is how I did it in a Jupyter notebook:
1. Download the jars from Maven Central or any other repository and put them in a directory called "jars":
mongo-spark-connector_2.11-2.4.0
mongo-java-driver-3.9.0
2. Create session and write/read any data
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
working_directory = 'jars/*'
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection") \
.config('spark.driver.extraClassPath', working_directory) \
.getOrCreate()
people = my_spark.createDataFrame([("JULIA", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 22)], ["name", "age"])
people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.select('*').where(col("name") == "JULIA").show()
As a result you will see this:
If you are using both SparkContext and SparkSession, you have to specify the connector jar package in SparkConf. Check the following code:
from pyspark import SparkContext,SparkConf
conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
If you are using only SparkSession, use the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.2') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
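The value passed to spark.jars.packages is a Maven coordinate of the form group:artifact:version, and for the 2.x connector the artifact name carries the Scala version the jar was built for (the _2.11 suffix above). That suffix generally needs to match the Scala version of the Spark build, otherwise class-loading errors are likely. A small sketch of composing the coordinate, with connector_coordinate as a hypothetical helper name:

```python
def connector_coordinate(scala_version, connector_version):
    """Build the Maven coordinate for a 2.x mongo-spark-connector release."""
    # Format: <group>:<artifact>_<scala binary version>:<connector version>
    return f"org.mongodb.spark:mongo-spark-connector_{scala_version}:{connector_version}"

print(connector_coordinate("2.11", "2.3.2"))
# prints: org.mongodb.spark:mongo-spark-connector_2.11:2.3.2
```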
If you're using the newest version of mongo-spark-connector (v10.0.1 at the time of writing), you need to use a SparkConf object, as stated in the MongoDB documentation (https://www.mongodb.com/docs/spark-connector/current/configuration/).
Besides, you don't need to download anything manually; Spark will do it for you.
Below is the solution I came up with, for:
mongo-spark-connector: 10.0.1
mongo server : 5.0.8
spark : 3.2.0
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean
def init_spark():
    password = os.environ["MONGODB_PASSWORD"]
    user = os.environ["MONGODB_USER"]
    host = os.environ["MONGODB_HOST"]
    db_auth = os.environ["MONGODB_DB_AUTH"]
    # '@' separates the credentials from the host in a MongoDB URI.
    mongo_conn = f"mongodb://{user}:{password}@{host}:27017/{db_auth}"
    conf = SparkConf()
    # Download mongo-spark-connector and its dependencies.
    # This will download all the necessary jars and put them in your $HOME/.ivy2/jars, no need to manually download them:
    conf.set("spark.jars.packages",
             "org.mongodb.spark:mongo-spark-connector:10.0.1")
    # Set up read connection:
    conf.set("spark.mongodb.read.connection.uri", mongo_conn)
    conf.set("spark.mongodb.read.database", "<my-read-database>")
    conf.set("spark.mongodb.read.collection", "<my-read-collection>")
    # Set up write connection:
    conf.set("spark.mongodb.write.connection.uri", mongo_conn)
    conf.set("spark.mongodb.write.database", "<my-write-database>")
    conf.set("spark.mongodb.write.collection", "<my-write-collection>")
    # If you need to update instead of inserting:
    conf.set("spark.mongodb.write.operationType", "update")
    # Create the context with this configuration; getOrCreate() below reuses it.
    SparkContext(conf=conf)
    return SparkSession \
        .builder \
        .appName('<my-app-name>') \
        .getOrCreate()
spark = init_spark()
df = spark.read.format("mongodb").load()
df_grouped = df.groupBy("<some-column>").agg(mean("<some-other-column>"))
df_grouped.write.format("mongodb").mode("append").save()
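A MongoDB connection string places @ between the credentials and the host, and a user name or password containing reserved characters such as @, : or / needs to be percent-encoded before it goes into the URI. Here is a minimal sketch of building the mongo_conn value with that in mind; build_mongo_uri is a hypothetical helper, and the default port 27017 is taken from the code above.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host, db_auth, port=27017):
    """Return a mongodb:// URI with percent-encoded credentials."""
    # Reserved characters in the credentials would otherwise be
    # mistaken for URI delimiters.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db_auth}"
```

Inside init_spark(), mongo_conn = build_mongo_uri(user, password, host, db_auth) would replace the f-string.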
I was also facing the same error, "java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource", while trying to connect to MongoDB from Spark (2.3).
I had to download the mongo-spark-connector_2.11 JAR file(s) and copy them into the jars directory of the Spark installation.
That resolved my issue, and I was able to run my Spark code via spark-submit.
Hope it helps.
Here is how I resolved this error, using the solution from this question.
1. Download the jar files below:
mongo-spark-connector_2.11-2.4.1
mongo-java-driver-3.9.0
2. Copy both jar files into the 'jars' directory of the Spark installation.
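The copy step can also be scripted. This is a minimal sketch, assuming the jars have already been downloaded and that the Spark installation directory (usually the value of SPARK_HOME) contains a jars folder. install_jars is a hypothetical helper name.

```python
import shutil
from pathlib import Path

def install_jars(jar_paths, spark_home):
    """Copy each jar into <spark_home>/jars and return the destination paths."""
    target_dir = Path(spark_home) / "jars"
    if not target_dir.is_dir():
        raise FileNotFoundError(f"no jars directory under {spark_home}")
    copied = []
    for jar in jar_paths:
        # copy2 keeps the original file name and metadata.
        copied.append(Path(shutil.copy2(jar, target_dir)))
    return copied
```

Jars in that directory are read when the JVM starts, so a running notebook kernel or Spark session generally has to be restarted before they are picked up.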
PySpark code in a Jupyter notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mongo").\
config("spark.mongodb.input.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
config("spark.mongodb.output.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
getOrCreate()
df=spark.read.format('com.mongodb.spark.sql.DefaultSource')\
.option( "uri", "mongodb://127.0.0.1:27017/$database.$table_name") \
.load()
df.printSchema()
# Create a temp view of df so it can be queried with SQL
df.createOrReplaceTempView("df")
# Read the table loaded from MongoDB
query1 = spark.sql("SELECT * FROM df")
query1.show(10)
You are not using sc to create the SparkSession. Maybe this code can help you:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf()
conf.set('spark.mongodb.input.uri', mongodb_input_uri)
conf.set('spark.mongodb.input.collection', 'collection_name')
conf.set('spark.mongodb.output.uri', mongodb_output_uri)
sc = SparkContext(conf=conf)
spark = SparkSession(sc) # Using the context (conf) to create the session
I want to create a decision tree model using spark-submit.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark import SparkConf, SparkContext
from numpy import array
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/newumc.classification_data") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/newumc.classification_data") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
dt = df.rdd.map(createLabeledPoints)
model_dt = DecisionTree.trainClassifier(
    dt,
    numClasses=467,
    categoricalFeaturesInfo={i: 2 for i in range(39)},  # features 0 to 38, two categories each
    impurity='gini',
    maxDepth=30,
    maxBins=32,
)
where createLabeledPoints is a function that returns a LabeledPoint.
I have no issue when I run this code in the pyspark shell,
but when I run it with spark-submit, it gives me this error:
pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects
I think the problem is either that I create another SparkSession inside spark-submit, or that a PySpark DataFrame cannot be pickled.
Can anyone please help me?
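Creating the session with getOrCreate() in a spark-submit script is normal, so that alone is unlikely to be the cause. This error more commonly appears when the function passed to rdd.map() (here createLabeledPoints) refers, directly or through a global, to something that cannot be serialized, such as the SparkSession, the SparkContext, a DataFrame, or a database client. PySpark has to pickle the function together with everything it references in order to ship it to the workers. Since createLabeledPoints is not shown, this is only a guess. The sketch below reproduces the underlying failure with the standard library alone; FakeSession is a hypothetical stand-in, not a Spark class.

```python
import pickle
import threading

class FakeSession:
    """Stand-in for an object that owns a thread lock, as driver-side Spark objects do."""
    def __init__(self):
        self._lock = threading.Lock()

def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False

# Plain data, like the values a row-mapping function should depend on, pickles fine.
print(can_pickle({"label": 1.0, "features": [0.0, 1.0]}))
# An object holding a lock does not, and neither does a function that captures one.
print(can_pickle(FakeSession()))
```

If this is the cause, the usual remedy is to make createLabeledPoints depend only on its row argument and on plain values, and to keep any use of spark, sc, or df outside the function.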