PySpark Mongodb / java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame - mongodb

I'm trying to connect pyspark to MongoDB with this (running on Databricks) :
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS
from pyspark.sql import SQLContext
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
but I get this error
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I am using Spark 2.0 and Mongo-spark-connector 2.11 and defined spark.mongodb.input.uri and spark.mongodb.output.uri

You are using spark.read.format before you defined spark
As you can see in the Spark 2.1.0 documents
A SparkSession can be used create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

I managed to make it work because I was using mongo-spark-connector_2.10-1.0.0 instead of mongo-spark-connector_2.10-2.0.0

Related

Pyspark not able to read from bigquery table

I am running the below code to pull a bigquery table using Pyspark. The spark session has been initiated without any issue but I am not able to connect to the table in public dataset. Here is the error that I get from running the script.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('Optimize BigQuery Storage') \
.config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
.getOrCreate()
df = spark.read \
.format("bigquery") \
.load("bigquery-public-data.samples.shakespeare")
https://i.stack.imgur.com/actAv.png

mongodb pyspark connector set up

i have followed the link here to install, build is succesful but I cannot find the connector.
from pyspark.sql import SparkSession
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
.config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
the connector was downloaded and built here
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I Am using ubuntu 20.04
Change to
df = spark.read.format("mongodb").load()
Then, you have to tell pyspark where to find the mongo libs, e.g.
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
I'm running pyspark in local mode.
Mongodb version 4
Spark version 3.2.1
I download all needed jars in one folder(path_to_jars) and add it to spark config
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession
url = 'mongodb://id:port/Database.collection'
spark = (SparkSession
.builder
.master('local[*]')
.config('spark.driver.extraClassPath','path_to_jars/*')
.config("spark.mongodb.read.connection.uri",url)
.config("spark.mongodb.write.connection.uri", url)
.getOrCreate()
)
df = spark.read.format("mongodb").load()

Pyspark, MongoDB and Missing BSON Reference

I am trying to write a basic pyspark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDb driver 3.2.2.
My code is:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
spark = SparkSession \
.builder \
.appName("SparkSQL") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
.getOrCreate()
df = spark.read.format("mongo").load()
When I execute in Pyspark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to install the missing Bson reference?

Hive support is required to CREATE Hive TABLE (AS SELECT)

I am planning to save the spark dataframe into hive tables so i can query them and extract latitude and longitude from them since Spark dataframe aren't iterable.
With pyspark in jupyter i wrote this code to make a spark session:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
#readmultiple csv with pyspark
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.sql.catalogImplementation=hive").enableHiveSupport() \
.getOrCreate()
df = spark.read.csv("Desktop/train/train.csv",header=True);
Pickup_locations=df.select("pickup_datetime","Pickup_latitude",
"Pickup_longitude")
print(Pickup_locations.count())
then i run the hiveql :
df.createOrReplaceTempView("mytempTable")
spark.sql("create table hive_table as select * from mytempTable");
And i get this error:
Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `hive_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
+- Project [id#311, vendor_id#312, pickup_datetime#313, dropoff_datetime#314, passenger_count#315, pickup_longitude#316, pickup_latitude#317, dropoff_longitude#318, dropoff_latitude#319, store_and_fwd_flag#320, trip_duration#321]​
​
I was in this situation before. You need to pass a config parameter to spark-submit command so it considers hive as the catalog implementation for your spark sql.
Here is how spark submit looks like:
spark-submit --deploy-mode cluster --master yarn --conf spark.sql.catalogImplementation=hive --class harri_sparkStreaming.com_spark_streaming.App ./target/com-spark-streaming-2.3.0-jar-with-dependencies.jar
The trick is in: --conf spark.sql.catalogImplementation=hive
Hope this helps

How spark-shell or Zepellin notebook set HiveContext to SparkSession?

Does anyone know why I can access to an existing hive table from spark-shell or zepelling notebook doing this
val df = spark.sql("select * from hive_table")
But when I submit a spark jar with a spark object created this way,
val spark = SparkSession
.builder()
.appName("Yet another spark app")
.config("spark.sql.shuffle.partitions", 18)
.config("spark.executor.memory", "2g")
.config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
I got this
Table or view not found
What I really want is to learn, understand, what the shell and the notebooks are doing for us in order to provide hive context to the SparkSession.
When working with Hive, one must instantiate SparkSession with Hive support
You need to call enableHiveSupport() on the session builder