MongoDB PySpark connector set up

I have followed the link here to install; the build is successful, but I cannot find the connector.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
The connector was downloaded and built from here:
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I am using Ubuntu 20.04.

Change the read to
df = my_spark.read.format("mongodb").load()
Then, you have to tell PySpark where to find the Mongo libs, e.g.
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
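Equivalently, the jars can be wired in when building the session instead of on the spark-submit command line. A minimal sketch, reusing the jar paths from the command above (adjust them to wherever your jars actually live):

from pyspark.sql import SparkSession
import os

# Paths mirror the spark-submit example above; adjust to your layout.
home = os.path.expanduser("~")
jars = ",".join(
    f"{home}/java/lib/{j}"
    for j in [
        "mongo-spark-connector-10.0.0.jar",
        "mongodb-driver-sync-4.3.2.jar",
        "mongodb-driver-core-4.3.2.jar",
        "bson-4.3.2.jar",
    ]
)

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.jars", jars) \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .getOrCreate()

df = my_spark.read.format("mongodb").load()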

I'm running PySpark in local mode.
MongoDB version: 4
Spark version: 3.2.1
I downloaded all the needed jars into one folder (path_to_jars) and added it to the Spark config:
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession

url = 'mongodb://id:port/Database.collection'

spark = (SparkSession
         .builder
         .master('local[*]')
         .config('spark.driver.extraClassPath', 'path_to_jars/*')
         .config("spark.mongodb.read.connection.uri", url)
         .config("spark.mongodb.write.connection.uri", url)
         .getOrCreate()
         )

df = spark.read.format("mongodb").load()
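For completeness, writing back through the same 10.x connector follows the same pattern. A minimal sketch, assuming df is the DataFrame read above; the database/collection options are only needed if they are not already encoded in the URI:

# Write the DataFrame back to MongoDB via the 10.x connector.
df.write.format("mongodb") \
    .mode("append") \
    .option("database", "Database") \
    .option("collection", "collection") \
    .save()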

Related

PySpark not able to read from BigQuery table

I am running the code below to pull a BigQuery table using PySpark. The Spark session initializes without any issue, but I am not able to connect to the table in the public dataset. The error I get from running the script is in the screenshot after the code.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
Error screenshot: https://i.stack.imgur.com/actAv.png
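One thing worth checking in the snippet above: spark.jars.packages expects Maven coordinates, resolved through Ivy, while a direct jar location such as a gs:// path belongs under spark.jars. A minimal sketch of that change, assuming the same jar and an environment (such as Dataproc) that can read gs:// paths:

from pyspark.sql import SparkSession

# spark.jars takes jar paths/URLs; spark.jars.packages takes Maven coordinates.
spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()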

How to connect Snowflake with PySpark?

I am trying to connect to Snowflake with PySpark on my local machine.
My code is as follows:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf()
conf.set('spark.jars',
         '/path/to/driver/snowflake-jdbc-3.12.17.jar,'
         '/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar')

spark = SparkSession.builder \
    .master("local") \
    .appName("snowflake-test") \
    .config(conf=conf) \
    .getOrCreate()
sfOptions = {
    "sfURL": "https://someurl.com",
    "sfAccount": "account",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "database",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "warehouse"
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select * from DimDate") \
    .load()

df.show()
When I run this I get the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o46.load.
How to fix this?
With the Snowflake Spark JAR version "spark-snowflake_2.12:2.10.0-spark_3.2",
Snowflake JDBC 3.13.14 needs to be used. I see that you are using the 3.12.17 JDBC version.
Can you add JDBC version 3.13.14 and then test? As pointed out by FKyani, this is a compatibility issue between the Snowflake Spark jar and the JDBC jar.
Please confirm the correct JDBC version is imported.
Recommended Client Versions: https://docs.snowflake.com/en/release-notes/requirements.html#recommended-client-versions
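A minimal sketch of the corrected configuration, assuming the recommended JDBC jar has been downloaded alongside the existing ones (only the JDBC version changes from the question's code):

from pyspark import SparkConf

conf = SparkConf()
# Swap snowflake-jdbc-3.12.17.jar for the 3.13.14 version recommended above.
conf.set('spark.jars',
         '/path/to/driver/snowflake-jdbc-3.13.14.jar,'
         '/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar')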
This looks similar to the error mentioned in this article:
https://community.snowflake.com/s/article/Error-py4j-protocol-Py4JJavaError-An-error-occurred-while-calling-o74-load-java-lang-NoSUchMethodError-scala-Product-init-Lscala-Product-V
If you are using Scala 2.12, you need to downgrade it to 2.11. Please note that, in this case, you will also have to use the matching version of the Spark connector for Snowflake.
Spark connectors for Snowflake can be found here. We recommend using the latest connector version for your Spark version and Scala 2.11.

Pyspark, MongoDB and Missing BSON Reference

I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDB driver 3.2.2.
My code is:
from pyspark.sql import SparkSession

# Create a single SparkSession with the Mongo URIs configured
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
When I execute it in PySpark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to install the missing BSON reference?
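When the connector is pulled in with --packages, Ivy normally resolves org.mongodb:bson transitively, so this error usually means that resolution is being bypassed (for example, a stale jar already on the classpath, or jars passed individually with --jars). One workaround sketch is to request the BSON artifact explicitly; note that the org.mongodb:bson:4.0.5 coordinate here is an assumption chosen to match connector 3.0.1, not something confirmed by the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQL") \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,"
            "org.mongodb:bson:4.0.5") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .getOrCreate()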

DefaultSource not found when connecting to MongoDB with PySpark in PyCharm

The following error is returned when I am connecting to MongoDB with PySpark in PyCharm:
"java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html"
Python: 3.6.0
Spark: 2.2.0
Mongo-Spark connector: mongo-spark-connector_2.11-2.2.0.jar
The code is as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark SQL basic example") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/local.users") \
    .getOrCreate()
spark.conf.set("spark.jars", "/ExternalJar/mongo-spark-connector_2.11-2.2.0.jar")
df_users = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/local.users") \
    .load()
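Note that calling spark.conf.set("spark.jars", ...) after getOrCreate() has no effect: jar paths are read when the session's JVM starts, so they must be configured before the session is created. A minimal sketch of the reordering, using the same jar from the question:

from pyspark.sql import SparkSession

# Set spark.jars before the session exists, not after.
spark = SparkSession.builder.appName("Python Spark SQL basic example") \
    .config("spark.jars", "/ExternalJar/mongo-spark-connector_2.11-2.2.0.jar") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/local.users") \
    .getOrCreate()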

PySpark Mongodb / java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame

I'm trying to connect PySpark to MongoDB with this (running on Databricks):
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS
from pyspark.sql import SQLContext
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
but I get this error
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I am using Spark 2.0 and Mongo-Spark connector 2.11, and I have defined spark.mongodb.input.uri and spark.mongodb.output.uri.
You are using spark.read.format before you defined spark.
As you can see in the Spark 2.1.0 documentation:
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
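With spark defined up front like this, the original read can then run, assuming the connector jar is actually on the classpath:

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()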
I managed to make it work: the problem was that I was using mongo-spark-connector_2.10-1.0.0 instead of mongo-spark-connector_2.10-2.0.0.