How do I create a single SparkSession in one file and reuse it in other files - pyspark

I have two .py files:
com/demo/DemoMain.py
com/demo/Sample.py
In both of the above files I am recreating the SparkSession object. In PySpark, how do I create a SparkSession in one file and reuse it in the other .py files? In Scala this is easy: you create the session in one object and import that object everywhere.
DemoMain.py
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import SparkSession
from pyspark.sql import Row
def main():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()
    sc = spark.sparkContext
    data = ["surender,34", "ajay,21"]
    lines = sc.parallelize(data)
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
    df = spark.createDataFrame(people)
    df.show()

if __name__ == '__main__':
    main()
sample.py
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
rdd = spark.sparkContext.parallelize(["surender","raja"])
rdd.collect()
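
One common way to avoid recreating the session (a sketch, not the only option): put the builder call in a small helper module, say a hypothetical com/demo/spark_session.py, and import it from both files (this assumes com/ and com/demo/ are importable packages with __init__.py files). Because SparkSession.builder.getOrCreate() returns the already-running session when one exists, every importer ends up with the same object.

# com/demo/spark_session.py -- hypothetical helper module (sketch only)
from pyspark.sql import SparkSession

def get_spark():
    # getOrCreate() returns the existing active session if there is one,
    # so repeated calls from different modules share a single SparkSession.
    return SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()

# com/demo/Sample.py -- reusing the shared session
from com.demo.spark_session import get_spark

spark = get_spark()
rdd = spark.sparkContext.parallelize(["surender", "raja"])
print(rdd.collect())

DemoMain.py would call get_spark() the same way instead of invoking the builder itself; whichever module runs first actually creates the session, and the rest simply reuse it.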

Related

create spark connection as part of python function

I am trying to create a Spark connection inside a spark_conn() function and use this connection throughout the other functions. For example, in the code below I use the Spark connection created by spark_conn() inside read_data(). Is my approach correct?
from pyspark.sql import SparkSession
def spark_conn():
    spark = SparkSession \
        .builder \
        .appName("sparkConnection") \
        .getOrCreate()
    return spark

def read_data(spark, SNOWFLAKE_SOURCE_NAME, snowflake_options, loadtime, sftable):
    df = spark.read \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**snowflake_options) \
        .option("query", "SELECT * FROM sftable WHERE SNAPSHOT==loadtime") \
        .load()

if __name__ == "__main__":
    conn = spark_conn()
    rd = read_data(conn, SNOWFLAKE_SOURCE_NAME, snowflake_options, loadtime, sftable)
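
The pattern itself is sound: because getOrCreate() hands back the active session when one already exists, calling spark_conn() more than once will not spin up extra sessions. A quick check of that behaviour (a sketch, assuming the spark_conn() above):

s1 = spark_conn()
s2 = spark_conn()
assert s1 is s2   # getOrCreate() returned the same SparkSession both times

Two caveats with read_data() as written: it never returns df, and sftable/loadtime appear inside the query string as literal text rather than being substituted in (an f-string or .format() call would be needed for that).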

Fail to saveToMongoDB: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable

I want to convert data from Dataframe to RDD, and save it to MongoDB, here is my code:
import pymongo
import pymongo_spark
from pyspark import SparkConf, SparkContext
from pyspark import BasicProfiler
from pyspark.sql import SparkSession
class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)

conf = SparkConf().set("spark.python.profile", "true")

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Important: activate pymongo_spark.
pymongo_spark.activate()

on_time_dataframe = spark.read.parquet(r'\data\on_time_performance.parquet')
on_time_dataframe.show()

# Note we have to convert the row to a dict to avoid https://jira.mongodb.org/browse/HADOOP-276
as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
The following error occurs:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable
I have installed the mongo-hadoop package; it seems the BSONWritable class is missing. I'm not good at Java, so I'd appreciate some help.
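
The ClassNotFoundException usually means the mongo-hadoop connector jars are not on Spark's classpath; com.mongodb.hadoop.io.BSONWritable lives in mongo-hadoop-core. One hedged sketch of a fix is to pull the connector in via spark.jars.packages before the session is created (the coordinates and versions below are assumptions; match them to your mongo-hadoop install):

from pyspark.sql import SparkSession

# Assumed Maven coordinates -- adjust artifact names/versions to your setup.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Word Count") \
    .config("spark.jars.packages",
            "org.mongodb.mongo-hadoop:mongo-hadoop-core:2.0.2,"
            "org.mongodb:mongo-java-driver:3.4.2") \
    .getOrCreate()

The same jars can instead be passed to spark-submit with --packages or --jars; either way they must be on the classpath before pymongo_spark tries to call saveToMongoDB.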

Passing sparkSession Between Scala Spark and PySpark

My requirement is to call a "Spark Scala" function from an existing PySpark program.
What is the best way to pass sparkSession created in PySpark program to Scala function.
I pass my Scala jar to PySpark as follows:
spark-submit --jars ScalaExample-0.1.jar pyspark_call_scala_example.py iris.data
Scala code
def getDf(spark: SparkSession, query: String, df: DataFrame, log: Logger): DataFrame = {
  import spark.implicits._
  val df = spark.sql(query)
  df
}
PySpark code
if __name__ == '__main__':
    query = sys.argv[1]
    spark = SparkSession \
        .builder \
        .appName("PySpark using Scala example") \
        .getOrCreate()
    log4jLogger = sc._jvm.org.apache.log4j
    log = log4jLogger.LogManager.getLogger(__name__)
    query_df = DataFrame(sc._jvm.com.crowdstrike.dsci.sparkjobs.PythonHelper.getDf(???, query, ???), sqlContext)
Question
How do I pass the SparkSession and logger to getDf?
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/
To pass SparkSession from Python to Scala, use spark._jsparkSession.
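
A sketch of how the pieces fit together, building on the line above: spark._jsparkSession exposes the JVM SparkSession behind the Python one, df._jdf does the same for a DataFrame, and a java log4j Logger can be fetched through the JVM gateway. Wrapping the returned JVM DataFrame with spark._wrapped works on Spark 2.x (internals may differ across versions); input_df and query are placeholders for the caller's own objects.

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession \
    .builder \
    .appName("PySpark using Scala example") \
    .getOrCreate()
jvm = spark.sparkContext._jvm

# A JVM log4j Logger matching the Scala signature's last parameter.
log = jvm.org.apache.log4j.LogManager.getLogger("PythonHelper")

jdf = jvm.com.crowdstrike.dsci.sparkjobs.PythonHelper.getDf(
    spark._jsparkSession,   # the JVM SparkSession behind the Python session
    query,                  # a plain Python string
    input_df._jdf,          # the JVM DataFrame behind a Python DataFrame
    log)

query_df = DataFrame(jdf, spark._wrapped)   # wrap the result back for Python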

cannot pickle pyspark dataframe

I want to create a decision tree model using spark submit.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark import SparkConf, SparkContext
from numpy import array
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
dt = df.rdd.map(createLabeledPoints)
model_dt = DecisionTree.trainClassifier(
    dt,
    numClasses=467,
    categoricalFeaturesInfo={i: 2 for i in range(39)},  # features 0-38, each with 2 categories
    impurity='gini',
    maxDepth=30,
    maxBins=32)
where createLabeledPoints is a function that returns a LabeledPoint.
The code runs fine when I execute it interactively with pyspark, but when I run it through spark-submit it gives me this error:
pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects
I think the problem is that I create another SparkSession inside spark-submit, or that the PySpark DataFrame cannot be pickled.
Can anyone please help me?
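
A common cause of "can't pickle thread.lock objects" is that the function passed to rdd.map() closes over the SparkSession or the DataFrame itself, which PySpark then tries to pickle and ship to the executors. A hedged sketch of a closure-safe createLabeledPoints (the column handling is an assumption, not the asker's actual code):

from pyspark.mllib.regression import LabeledPoint

def createLabeledPoints(row):
    # Only touch the Row that is passed in; do not reference spark or df here,
    # otherwise the whole session/DataFrame gets pulled into the closure.
    d = row.asDict()
    label = d.pop("label")          # assumes a 'label' column exists
    return LabeledPoint(label, [float(v) for v in d.values()])

dt = df.rdd.map(createLabeledPoints)   # df and spark stay on the driver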

pyspark error: AttributeError: 'SparkSession' object has no attribute 'serializer'

I am using Spark version 2.0.1.
def f(l):
    print(l.b_appid)

sqlC = SQLContext(spark)
mrdd = sqlC.read.parquet("hdfs://localhost:54310/yogi/device/processed//data.parquet")
mrdd.forearch(f)  # <== this gives the error
In Spark 2.x, in order to use the Spark Session (aka spark) you need to create it first.
You can create a SparkSession like this:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
Once you have the SparkSession object (spark) you can use it like this:
mydf = spark.read.parquet("hdfs://localhost:54310/yogi/device/processed//data.parquet")
mydf.foreach(f)
More info can be found in the SparkSession section of the Spark docs:
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)
The entry point to programming Spark with the Dataset and DataFrame
API. A SparkSession can be used to create DataFrame, register DataFrame
as tables, execute SQL over tables, cache tables, and read parquet
files. To create a SparkSession, use the following builder pattern:
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
More info about the builder can be found in the docs under class Builder - Builder for SparkSession.