Spark MongoDB Connector Scala - Missing database name

I'm stuck on a weird issue. I'm trying to connect Spark to MongoDB locally using the MongoDB Spark connector.
Apart from setting up Spark, I'm using the following code:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map("uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
// Load the movie rating data from Mongo DB
val movieRatings = MongoSpark.load(sc, readConfig).toDF()
movieRatings.show(100)
However, I get the following error at runtime, on the line where I set up readConfig:
java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property.
I don't get why it complains that the URI is not set when I clearly have a uri property in the Map. I might be missing something.

You can do it from the SparkSession, as mentioned here:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate()
Create the DataFrame using the config:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"))
val df = MongoSpark.load(spark)
Write df to MongoDB:
MongoSpark.save(
df.write
.option("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.mode("overwrite"))
In your code, the spark.mongodb.input. / spark.mongodb.output. prefixes are missing from the config keys:
val readConfig = ReadConfig(Map(
"spark.mongodb.input.uri" -> "mongodb://localhost:27017/movie_db.movie_ratings",
"spark.mongodb.input.readPreference.name" -> "secondaryPreferred"),
Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map(
"spark.mongodb.output.uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))

For Java, you can either set the configs while creating the SparkSession, or create the session first and then set them as runtime configs.
1.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.config("spark.mongodb.input.uri","mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate();
OR
2.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.getOrCreate();
Then,
String mongoUrl = "mongodb://localhost:27017/movie_db.movie_ratings";
sparkSession.sparkContext().conf().set("spark.mongodb.input.uri", mongoUrl);
sparkSession.sparkContext().conf().set("spark.mongodb.output.uri", mongoUrl);
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", sourceCollection);
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Row> df = MongoSpark.loadAndInferSchema(sparkSession,readConfig);

Related

Adding Mongo config to active spark session

I am trying to add the configurations to an active Spark session. Below is my code:
val spark = SparkSession.getActiveSession.get
spark.conf.set("spark.mongodb.input.uri",
"mongodb://hello_admin:hello123#localhost:27017/testdb.products?authSource=admin")
spark.conf.set("spark.mongodb.input.partitioner" ,"MongoPaginateBySizePartitioner")
import com.mongodb.spark._
val customRdd = MongoSpark.load(sc)
println(customRdd.count())
println(customRdd.first.toJson)
println(customRdd.collect().foreach(println))
But I am getting an error:
java.lang.IllegalArgumentException: Missing database name. Set via the
'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
Whereas when I write the code:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://hello_admin:hello123#localhost:27017/testdb.products?authSource=admin")
// .config("spark.mongodb.output.uri", "mongodb://hello_admin:hello123#localhost:27017/testdb.products?authSource=admin")
.config("spark.mongodb.input.partitioner" ,"MongoPaginateBySizePartitioner")
.getOrCreate()
val sc = spark.sparkContext
val customRdd = MongoSpark.load(sc)
println(customRdd.count())
println(customRdd.first.toJson)
println(customRdd.collect().foreach(println))
My code executes fine.
Kindly let me know what changes I need in the first snippet.
You can define the SparkSession like this with a SparkConf (I don't know if this helps you):
def sparkSession(conf: SparkConf): SparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val sparkConf = new SparkConf()
sparkConf.set("prop","value")
val ss = sparkSession(sparkConf)
Or you can try to use SparkEnv (I'm using SparkEnv for a lot of things to change props):
SparkEnv.get.conf.set("prop", "value")
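Alternatively, a minimal sketch (my own suggestion, not part of the original answer): since a ReadConfig can carry the full connection string itself, you can pass the URI explicitly instead of relying on the active session's runtime conf. The credentials and host below are the placeholders from the question.
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val spark = SparkSession.getActiveSession.get
// The uri already contains the database and collection, so no session-level property is needed.
val readConfig = ReadConfig(Map(
"uri" -> "mongodb://hello_admin:hello123@localhost:27017/testdb.products?authSource=admin",
"partitioner" -> "MongoPaginateBySizePartitioner"))
val customRdd = MongoSpark.load(spark.sparkContext, readConfig)
println(customRdd.count())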

Spark Scala Write dataframe to MongoDB

I am attempting to write my transformed data frame to MongoDB, using this as a guide:
https://docs.mongodb.com/spark-connector/master/scala/streaming/
So far, reading a data frame from MongoDB works perfectly, as shown below.
val mongoURI = "mongodb://000.000.000.000:27017"
val Conf = makeMongoURI(mongoURI,"blog","articles")
val readConfigintegra: ReadConfig = ReadConfig(Map("uri" -> Conf))
val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
// Uses the ReadConfig
val df3 = sparkSess.sqlContext.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.articles")))
However, writing this data frame to MongoDB seems to prove more difficult.
//reads data from mongo and does some transformations
val data = read_mongo()
data.show(20,false)
data.write.mode("append").mongo()
For the last line, I receive the following error.
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.output.uri' or 'spark.mongodb.output.database' property
This is confusing to me, as I set this within my SparkSession in the code blocks above.
val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
Can you spot anything I'm doing wrong?
My answer pretty much parallels how I read it, but uses a WriteConfig instead.
data.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.vectors")))
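For reference, a self-contained sketch of the same write using the MongoSpark.save entry point (the tiny DataFrame is hypothetical, and the host/collection are the placeholders from the question):
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoWriteSketch")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
import sparkSess.implicits._

// Hypothetical stand-in for the transformed data frame from the question.
val data = Seq(("a", 1), ("b", 2)).toDF("id", "value")
// Equivalent to data.saveToMongoDB(...) above; the explicit WriteConfig keeps the target independent of the session config.
MongoSpark.save(data, WriteConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.vectors")))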

Connecting Spark to Multiple Mongo Collections

I have the following MongoDB collections: employee and details.
Now I have a requirement where I have to get documents from both collections into Spark to analyze the data.
I tried the code below, but it does not seem to work.
SparkConf conf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","MongoSparkExample")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.employee")
.set("spark.executor.memory", "6g");
SparkSession session = SparkSession.builder().appName("Member Log")
.config(conf).getOrCreate();
SparkConf dailyconf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","Mongo Two Example")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.details");
SparkSession mongosession = SparkSession.builder().appName("Daily Log")
.config(dailyconf).getOrCreate();
Any pointers would be highly appreciated.
I fixed this issue by adding the code below:
JavaSparkContext newcontext = new JavaSparkContext(session.sparkContext());
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", "details");
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(newcontext).withOptions(readOverrides);
MongoSpark.load(newcontext,readConfig);
First of all, as eliasah said, you should only create one SparkSession.
Second, take a look at the official MongoDB Spark Connector. It provides integration between MongoDB and Apache Spark and gives you the possibility to load collections into DataFrames.
Please refer to the official documentation:
MongoDB Connector for Spark
Read from MongoDB (Scala)
Read from MongoDB (Java)
EDIT
The documentation says the following:
Call loadFromMongoDB() with a ReadConfig object to specify a different MongoDB server address, database and collection.
In your case:
sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost/Emp.details")))
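As a minimal Scala sketch (assuming the Emp database from the question), both collections can be read through a single SparkSession by giving each load its own ReadConfig:
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val spark = SparkSession.builder()
.master("local[*]")
.appName("DBConnection")
.config("spark.mongodb.input.uri", "mongodb://localhost/Emp.employee")
.getOrCreate()

// Each ReadConfig carries its own database.collection, so one session can read both.
val employees = MongoSpark.load(spark, ReadConfig(Map("uri" -> "mongodb://localhost/Emp.employee")))
val details = MongoSpark.load(spark, ReadConfig(Map("uri" -> "mongodb://localhost/Emp.details")))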
You can use the latest Spark SQL features by passing params as per your requirements:
val sparksession = SparkSession
.builder()
.master("local[*]")
.appName("TEST")
.config("spark.mongodb.input.uri", "mongodb://localhost:portNo/dbInputName.CollInputName")
.config("spark.mongodb.output.uri", "mongodb://localhost:portNo/dbOutName.CollOutName")
.getOrCreate()
val readConfigVal = ReadConfig(Map("uri" -> uriName,"database" -> dbName, "collection" -> collName, "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sparksession)))
val mongoLoadedDF = MongoSpark.load(sparksession, readConfigVal)
println("mongoLoadedDF:")
mongoLoadedDF.show()
You can read and write multiple collections using readOverrides / writeOverrides.
SparkSession spark = SparkSession
.builder()
.appName("Mongo connect")
.config("spark.mongodb.input.uri", "mongodb://user:password#ip_addr:27017/database_name.employee")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Read employee table 1
JavaMongoRDD<Document> employeeRdd = MongoSpark.load(jsc);
Map<String, String> readOverrides = new HashMap<String, String>();
// readOverrides.put("database", "database_name");
readOverrides.put("collection", "details");
ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);
// Read another table 2 (details table )
JavaMongoRDD<Document> detailsRdd = MongoSpark.load(jsc, readConfig);
System.out.println(employeeRdd.first().toJson());
System.out.println(detailsRdd.first().toJson());
jsc.close();

Reading from Multiple MongoDBs to form Dataset

I want to make two Datasets from two different Mongo databases. I am currently using the official MongoSpark connector. The SparkSession is started in the following way:
SparkConf sparkConf = new SparkConf().setMaster("yarn").setAppName("test")
.set("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
.set("spark.mongodb.input.uri", "mongodb://192.168.77.62/db1.coll1")
.set("spark.sql.crossJoin.enabled", "true");
SparkSession sparkSession = SparkSession.builder().appName("test1").config(sparkConf).getOrCreate();
If I want to change the spark.mongodb.input.uri, how will I do that? I have already tried changing the runtimeConfig of sparkSession and also using ReadConfig with readOverrides but those did not work.
Method 1:
sparkSession.conf().set("spark.mongodb.input.uri", "mongodb://192.168.77.63/db1.coll2");
Method 2:
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("uri", "192.168.77.63/db1.coll2");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Position> ds = MongoSpark.load(sparkSession, readConfig, Position.class);
Edit 1: As suggested by Karol, I tried the following method:
SparkConf sparkConf = new SparkConf().setMaster("yarn").setAppName("test");
SparkSession sparkSession = SparkSession.builder().appName("test1").config(sparkConf).getOrCreate();
Map<String, String> readOverrides1 = new HashMap<String, String>();
readOverrides1.put("uri", "mongodb://192.168.77.62:27017");
readOverrides1.put("database", "db1");
readOverrides1.put("collection", "coll1");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides1);
This fails at runtime with:
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
Edit 2:
public static void main(String[] args) {
SparkSession sparkSession = SparkSession.builder().appName("test")
.config("spark.worker.cleanup.enabled", "true").config("spark.scheduler.mode", "FAIR").getOrCreate();
String mongoURI2 = "mongodb://192.168.77.63:27017/db1.coll1";
Map<String, String> readOverrides1 = new HashMap<String, String>();
readOverrides1.put("uri", mongoURI2);
ReadConfig readConfig1 = ReadConfig.create(sparkSession).withOptions(readOverrides1);
MongoSpark.load(sparkSession,readConfig1,Position.class).show();
}
This still gives the same exception as in the previous edit.
build.sbt:
libraryDependencies += "org.mongodb.spark" % "mongo-spark-connector_2.11" % "2.0.0"
package com.example.app
import com.mongodb.spark.config.{ReadConfig, WriteConfig}
import com.mongodb.spark.sql._
object App {
def main(args: Array[String]): Unit = {
val MongoUri1 = args(0).toString
val MongoUri2 = args(1).toString
val SparkMasterUri= args(2).toString
def makeMongoURI(uri:String,database:String,collection:String) = (s"${uri}/${database}.${collection}")
val mongoURI1 = s"mongodb://${MongoUri1}:27017"
val mongoURI2 = s"mongodb://${MongoUri2}:27017"
val CONFdb1 = makeMongoURI(s"${mongoURI1}", "MyCollection1", "df")
val CONFdb2 = makeMongoURI(s"${mongoURI2}", "MyCollection2", "df")
val WRITEdb1: WriteConfig = WriteConfig(scala.collection.immutable.Map("uri"->CONFdb1))
val READdb1: ReadConfig = ReadConfig(Map("uri" -> CONFdb1))
val WRITEdb2: WriteConfig = WriteConfig(scala.collection.immutable.Map("uri"->CONFdb2))
val READdb2: ReadConfig = ReadConfig(Map("uri" -> CONFdb2))
val spark = SparkSession
.builder
.appName("AppMongo")
.config("spark.worker.cleanup.enabled", "true")
.config("spark.scheduler.mode", "FAIR")
.getOrCreate()
val df1 = spark.read.mongo(READdb1)
val df2 = spark.read.mongo(READdb2)
df1.write.mode("overwrite").mongo(WRITEdb1)
df2.write.mode("overwrite").mongo(WRITEdb2)
}
}
You can now pass uri1 and uri2 into /usr/local/spark/bin/spark-submit pathToMyjar.app.jar MongoUri1 MongoUri2 sparkMasterUri as args, and then create a config for each URI:
spark.read.mongo(READdb)
It's not useful to set the URI in the ReadConfig. The Spark-Mongo connector uses this information when the ReadConfig.create() method is called, so try to set it on the SparkContext before using it.
Just like below:
SparkContext context = spark.sparkContext();
context.conf().set("spark.mongodb.input.uri", "mongodb://host:port/database.collection");
JavaSparkContext jsc = new JavaSparkContext(context);
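In Scala, a sketch of the same idea (host, database, and collection names are placeholders): set the property on the SparkConf before the session is created so that ReadConfig.create() can find it, and override only the collection per read.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Placeholder master/URI values; the input URI is set before the context exists.
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("test")
.set("spark.mongodb.input.uri", "mongodb://host:27017/database.collection")
val spark = SparkSession.builder().config(conf).getOrCreate()

// Per-read override of just the collection; the database still comes from the URI above.
val readConfig = ReadConfig.create(spark).withOptions(Map("collection" -> "otherCollection"))
val df = MongoSpark.load(spark, readConfig)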

How to load data from Cassandra table

I am working with Spark 2.0.1 and Cassandra 3.9. I want to read data from a table in Cassandra via CassandraSQLContext. However, Spark 2.0 changed and now uses SparkSession. I am trying to use SparkSession; the following is my code.
Could you please review it and give your advice?
def main(args: Array[String], date_filter: String): Unit = {
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder
.master("local")
.appName("my-spark-app")
.config(conf)
.getOrCreate()
import sparkSession.implicits._
import org.apache.spark.sql._
val rdd = sparkSession
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "users", "keyspace" -> "monita"))
.load()
println("count: " +rdd.count())
}
Your code looks OK. You don't need to create the SparkContext separately; you can set the Cassandra connection properties in the config like below.
val sparkSession = SparkSession
.builder
.master("local")
.appName("my-spark-app")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.getOrCreate()
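A short usage sketch with this session (keyspace and table names taken from the question):
// Read the Cassandra table through the Spark Cassandra connector's data source.
val users = sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "users", "keyspace" -> "monita"))
.load()
println("count: " + users.count())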