Get collection names with the MongoDB Spark connector

Is there any way to get the database collection names natively with the Spark connector?
Right now I'm using pymongo to do it, but I wonder if it is possible to do it with the Spark connector.
My current method, just for info:
from pymongo import MongoClient
from pyspark.sql import SparkSession

db = MongoClient().database
db_names = db.collection_names(False)  # list collection names, excluding system collections
for name in db_names:
    spark = SparkSession \
        .builder \
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database." + name) \
        .config("spark.mongodb.output.uri", "mongodb://localhost:27017/database." + name) \
        .getOrCreate()
    ...

With Python, the MongoDB Spark Connector only exposes the Spark API, so there is no native way to list collections.
Also, note that the SparkSession is a singleton, so when switching collections the configuration should be set on the DataFrameReader using the option() method, as in the sketch below.
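A minimal sketch of that approach, keeping pymongo for the collection listing and reusing a single SparkSession (the database name and URIs are the placeholders from the question):
from pymongo import MongoClient
from pyspark.sql import SparkSession

# One SparkSession for the whole job; point it at the database only.
spark = SparkSession \
    .builder \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database") \
    .getOrCreate()

# The connector cannot list collections, so pymongo still does that part.
db = MongoClient().database
for name in db.collection_names(False):
    df = spark.read \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .option("collection", name) \
        .load()
    df.printSchema()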


How to connect to MongoDB Atlas from a Databricks cluster using pyspark
This is my simple code in a notebook:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>#mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
.getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, in the Advanced Options under the Spark tab, add the two config parameters below:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note: the connection string should look like this (use your own user, password and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook; it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb+srv://user:password#cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority")\
.option("database", "my_database")\
.option("collection", "my_collection")\
.load()
You can use the following to write to MongoDB:
events_df.write\
.format('com.mongodb.spark.sql.DefaultSource')\
.mode("append")\
.option( "uri", "mongodb+srv://user:password#cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
.save()
Hope this works for you. Please let others know if it did.

Engine used when creating a Hive table with joins using Spark SQL

I am not sure from the documentation whether creating a Hive table using HiveContext from Spark will use the Spark engine or a standard Hive MapReduce job to perform the task.
val sc = new SparkContext()
val hc = new HiveContext(sc)
hc.sql("""
CREATE TABLE db.new_table
STORED AS PARQUET
AS SELECT
field1,
field2,
field3
FROM db.src1
JOIN db.src2
ON (x = y)
"""
)
Spark 1.6
Spark SQL supports Apache Hive using HiveContext. It uses the Spark SQL execution engine to work with data stored in Hive.
Spark 2.x and above
val spark = SparkSession
  .builder()
  .appName("SparkSessionExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
When doing this, Spark uses the Spark APIs and not MapReduce. HiveContext no longer needs to be referenced explicitly, as it is deprecated, even in spark-submit / program mode.
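For completeness, a PySpark sketch of the same idea (the Scala form is analogous, with table and column names taken from the question): the CREATE TABLE AS SELECT is simply submitted through the SparkSession and planned by the Spark SQL engine, not Hive MapReduce.
from pyspark.sql import SparkSession

# Spark 2.x: Hive support is enabled on the session builder; no HiveContext needed.
spark = SparkSession.builder \
    .appName("SparkSessionExample") \
    .enableHiveSupport() \
    .getOrCreate()

# This CTAS is planned and executed by the Spark SQL engine.
spark.sql("""
    CREATE TABLE db.new_table
    STORED AS PARQUET
    AS SELECT field1, field2, field3
    FROM db.src1
    JOIN db.src2 ON (x = y)
""")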

Unable to create a dataframe using the SQLContext object in Spark 2.2

I am using Spark 2.2 on Microsoft Windows 7. I want to load a CSV file into a variable to perform SQL-related actions later on, but I am unable to do so. I referred to the accepted answer from this link, but to no avail. I followed the steps below to create the SparkContext and SQLContext objects:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc=SparkContext.getOrCreate() // Creating spark context object
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Creating SQL object for query related tasks
The objects are created successfully, but when I execute the code below it throws an error that I can't post here.
val df = sqlContext.read.format("csv").option("header", "true").load("D://ResourceData.csv")
And when I try something like df.show(2), it says that df was not found. I tried the Databricks solution for loading CSV from the attached link; it downloads the packages but doesn't load the CSV file. So how can I rectify my problem? Thanks in advance :)
I solved my problem of loading a local file into a dataframe using version 1.6 on the Cloudera VM with the help of the code below:
1) sudo spark-shell --jars /usr/lib/spark/lib/spark-csv_2.10-1.5.0.jar,/usr/lib/spark/lib/commons-csv-1.5.jar,/usr/lib/spark/lib/univocity-parsers-1.5.1.jar
2) val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("treatEmptyValuesAsNulls", "true" ).option("parserLib", "univocity").load("file:///home/cloudera/Desktop/ResourceData.csv")
NOTE: sc and sqlContext variables are automatically created
But there are many improvements in the latest version, i.e. 2.2.1, which I am unable to use because metastore_db doesn't get created on Windows 7. I'll post a new question about that.
With reference to your comment that you are able to access the SparkSession variable, follow the steps below to process your CSV file using Spark SQL.
Spark SQL is a Spark module for structured data processing.
There are mainly two abstractions, Dataset and DataFrame:
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
You have a csv file and you can simply create a dataframe by doing one of the following:
From your spark-shell using the SparkSession variable spark:
val df = spark.read
.format("csv")
.option("header", "true")
.load("sample.csv")
After reading the file into a dataframe, you can register it as a temporary view.
df.createOrReplaceTempView("foo")
SQL statements can be run using the sql method provided by Spark:
val fooDF = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")
You can also query that file directly with SQL:
val df = spark.sql("SELECT * FROM csv.'file:///path to the file/'")
Make sure that you run Spark in local mode when you load data from a local path, or else you will get an error. The error occurs when the HADOOP_CONF_DIR environment variable is already set, in which case Spark expects "hdfs://..." paths rather than "file://..." paths.
Set your spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse).
.config("spark.sql.warehouse.dir", "file:///C:/path/to/my/")
This is the default location of the Hive warehouse directory (using Derby) with managed databases and tables. Once you set the warehouse directory, Spark will be able to locate your files, and you can load the CSV.
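For illustration, a PySpark sketch of where that config line fits on the session builder (the warehouse and CSV paths below are placeholders; the Scala spark-shell form is analogous):
from pyspark.sql import SparkSession

# Local-mode session with an explicit warehouse dir, as described above.
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "file:///C:/path/to/my/") \
    .getOrCreate()

# Spark 2.x has built-in CSV support, so no extra package is needed.
df = spark.read.option("header", "true").csv("file:///C:/path/to/my/ResourceData.csv")
df.show(2)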
Reference: Spark SQL Programming Guide
Spark version 2.2.0 has built-in support for csv.
In your spark-shell, run the following code:
val df= spark.read
.option("header","true")
.csv("D:/abc.csv")
df: org.apache.spark.sql.DataFrame = [Team_Id: string, Team_Name: string ... 1 more field]

How do spark-shell or Zeppelin notebooks set the HiveContext on the SparkSession?

Does anyone know why I can access an existing Hive table from spark-shell or a Zeppelin notebook by doing this
val df = spark.sql("select * from hive_table")
But when I submit a Spark jar with a spark object created this way,
val spark = SparkSession
.builder()
.appName("Yet another spark app")
.config("spark.sql.shuffle.partitions", 18)
.config("spark.executor.memory", "2g")
.config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
I got this
Table or view not found
What I really want is to learn and understand what the shell and the notebooks do for us in order to provide the Hive context to the SparkSession.
When working with Hive, one must instantiate the SparkSession with Hive support.
You need to call enableHiveSupport() on the session builder, for example:
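A minimal sketch in PySpark, reusing the app name and config from the question (the Scala builder is the same chain):
from pyspark.sql import SparkSession

# spark-shell and Zeppelin create their session with Hive support enabled
# (when Spark is built with Hive), which is why existing Hive tables resolve there.
spark = SparkSession \
    .builder \
    .appName("Yet another spark app") \
    .config("spark.sql.shuffle.partitions", 18) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("select * from hive_table")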

PySpark Mongodb / java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame

I'm trying to connect pyspark to MongoDB with this (running on Databricks):
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS
from pyspark.sql import SQLContext
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
but I get this error
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I am using Spark 2.0 and mongo-spark-connector_2.11, and I have defined spark.mongodb.input.uri and spark.mongodb.output.uri.
You are using spark.read.format before you have defined spark.
As you can see in the Spark 2.1.0 documentation:
A SparkSession can be used to create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
I managed to make it work: the problem was that I was using mongo-spark-connector_2.10-1.0.0 instead of mongo-spark-connector_2.10-2.0.0.
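For reference, a minimal sketch that defines spark before reading and assumes the matching connector is attached at launch (the URI, database and collection names are placeholders):
from pyspark.sql import SparkSession

# Launch with the matching connector version, e.g.:
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0
spark = SparkSession.builder \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database.collection") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/database.collection") \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()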