Spark Scala: Write DataFrame to MongoDB

I am attempting to write my transformed data frame to MongoDB, using this guide:
https://docs.mongodb.com/spark-connector/master/scala/streaming/
So far, reading a data frame from MongoDB works perfectly, as shown below.
val mongoURI = "mongodb://000.000.000.000:27017"
val Conf = makeMongoURI(mongoURI,"blog","articles")
val readConfigintegra: ReadConfig = ReadConfig(Map("uri" -> Conf))
val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
// Uses the ReadConfig
val df3 = sparkSess.sqlContext.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.articles")))
However, writing this data frame to MongoDB seems to prove more difficult.
//reads data from mongo and does some transformations
val data = read_mongo()
data.show(20,false)
data.write.mode("append").mongo()
For the last line, I receive the following error.
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.output.uri' or 'spark.mongodb.output.database' property
This is confusing to me, as I set this within my SparkSession in the code blocks above.
val sparkSess = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.output.uri", "mongodb://000.000.000.000:27017/blog.vectors")
.getOrCreate()
Can you spot anything I'm doing wrong?

My answer pretty much parallels how I read the data, but uses a WriteConfig instead.
data.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.vectors")))
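For completeness, a minimal sketch of the imports and explicit WriteConfig this relies on, assuming the 2.x mongo-spark-connector (package names may differ in other versions):
import com.mongodb.spark._                       // MongoSpark helpers and the saveToMongoDB implicit
import com.mongodb.spark.config.WriteConfig

val writeConfig = WriteConfig(Map("uri" -> "mongodb://000.000.000.000:27017/blog.vectors"))
data.saveToMongoDB(writeConfig)                  // the explicit config sidesteps the missing-database error
// MongoSpark.save(data, writeConfig)            // equivalent non-implicit form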

Related

Is it possible to set Job Description for all actions on some RDD?

I'm running SparkSession as follows:
val query: String = //...
val session = SparkSession.builder
.enableHiveSupport()
.getOrCreate()
val iterator = session.sql(query).rdd.toLocalIterator
I read this answer and understood that I could use sc.setLocalProperty("callSite.short", ...), but this is not applicable in my case.
The thing is, I don't know which thread will use the iterator I got here. Is it possible to set a job description, to be displayed in the Spark Web UI, for all actions on the RDD I got with session.sql(query).rdd?
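For reference, a minimal sketch of what that referenced answer describes, assuming the properties are set on the thread that ends up consuming the iterator (the description strings below are made up):
val sc = session.sparkContext
// Both properties are thread-local, so they only affect jobs triggered from this thread.
sc.setLocalProperty("callSite.short", "my-query")             // overrides the call site shown in the Spark UI
sc.setJobGroup("my-query-group", "Running query: " + query)   // alternative: group id plus a description

val iterator = session.sql(query).rdd.toLocalIterator
iterator.foreach(println)                                     // the per-partition jobs are triggered here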

Spark Mongodb Connector Scala - Missing database name

I'm stuck with a weird issue. I'm trying to connect Spark to MongoDB locally using the MongoDB Spark connector.
Apart from setting up Spark, I'm using the following code:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map("uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
// Load the movie rating data from Mongo DB
val movieRatings = MongoSpark.load(sc, readConfig).toDF()
movieRatings.show(100)
However, I get an error on the line where I set up readConfig:
java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property.
I don't get why it's complaining about a missing database when I clearly have a uri property in the Map.
I might be missing something.
You can do it from the SparkSession, as mentioned here:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate()
Create a DataFrame using the config:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"))
val df = MongoSpark.load(spark)
Write df to MongoDB:
MongoSpark.save(
df.write
.option("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.mode("overwrite"))
In your code, the prefixes are missing in the config:
val readConfig = ReadConfig(Map(
"spark.mongodb.input.uri" -> "mongodb://localhost:27017/movie_db.movie_ratings",
"spark.mongodb.input.readPreference.name" -> "secondaryPreferred"),
Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map(
"spark.mongodb.output.uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
For Java, you can either set the configs while creating the Spark session, or first create the session and then set them as runtime configs.
1.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.config("spark.mongodb.input.uri","mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate();
OR
2.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.getOrCreate();
Then,
String mongoUrl = "mongodb://localhost:27017/movie_db.movie_ratings";
sparkSession.sparkContext().conf().set("spark.mongodb.input.uri", mongoUrl);
sparkSession.sparkContext().conf().set("spark.mongodb.output.uri", mongoUrl);
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", sourceCollection);
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Row> df = MongoSpark.loadAndInferSchema(sparkSession,readConfig);

Using Scala and SparkSql and importing CSV file with header [duplicate]

This question already has answers here:
Spark - load CSV file as DataFrame?
(14 answers)
Closed 5 years ago.
I'm very new to Spark and Scala (like two hours new), and I'm trying to play with a CSV data file, but I cannot do it as I'm not sure how to deal with the header row. I have searched the internet for a way to load it or skip it, but I don't really know how to do that.
I'm pasting the code that I'm using; please help me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object TaxiCaseOne {
case class NycTaxiData(Vendor_Id:String, PickUpdate:String, Droptime:String, PassengerCount:Int, Distance:Double, PickupLong:String, PickupLat:String, RateCode:Int, Flag:String, DropLong:String, DropLat:String, PaymentMode:String, Fare:Double, SurCharge:Double, Tax:Double, TripAmount:Double, Tolls:Double, TotalAmount:Double)
def mapper(line:String): NycTaxiData = {
val fields = line.split(',')
val data:NycTaxiData = NycTaxiData(fields(0), fields(1), fields(2), fields(3).toInt, fields(4).toDouble, fields(5), fields(6), fields(7).toInt, fields(8), fields(9),fields(10),fields(11),fields(12).toDouble,fields(13).toDouble,fields(14).toDouble,fields(15).toDouble,fields(16).toDouble,fields(17).toDouble)
return data
}

def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
// Use new SparkSession interface in Spark 2.0
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val lines = spark.sparkContext.textFile("../nyc.csv")
val data = lines.map(mapper)
// Infer the schema, and register the DataSet as a table.
import spark.implicits._
val schemaData = data.toDS
schemaData.printSchema()
schemaData.createOrReplaceTempView("data")
// SQL can be run over DataFrames that have been registered as a table
val vendor = spark.sql("SELECT * FROM data WHERE Vendor_Id == 'CMT'")
val results = vendor.collect()
results.foreach(println)
spark.stop()
}
}
If you have a CSV file, you should use the spark-csv reader rather than textFile:
val spark = SparkSession.builder().appName("test val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val df = spark.read
.format("csv")
.option("header", "true") //This identifies first line as header
.csv("../nyc.csv")
You need the spark-core and spark-sql dependencies to work with this.
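A rough build.sbt sketch (the version numbers are placeholders; match them to your Spark installation):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",   // illustrative version only
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)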
Hope this helps!

Can we use multiple SparkSessions to access two different Hive servers?

I have a scenario where I need to compare two tables, source and destination, from two separate remote Hive servers. Can we use two SparkSessions, something like what I tried below?
val spark = SparkSession.builder().master("local")
.appName("spark remote")
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.160:9083")
.enableHiveSupport()
.getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val sparkdestination = SparkSession.builder()
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.42:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.42:9083")
.enableHiveSupport()
.getOrCreate()
I tried with SparkSession.clearActiveSession() and SparkSession.clearDefaultSession() but it isn't working, throwing the error below:
Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Is there any other way to access two Hive tables using multiple SparkSessions or SparkContexts?
Thanks
I use this approach and it works perfectly fine with Spark 2.1:
val sc = SparkSession.builder()
.config("hive.metastore.uris", "thrift://dbsyz1111:10000")
.enableHiveSupport()
.getOrCreate()
// Create dataframe 1 by reading the data from a Hive table via metastore 1
val dataframe_1 = sc.sql("select * from <SourcetbaleofMetaStore_1>")
// Resetting the existing Spark Contexts
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
//Initialize Spark session2 with Hive Metastore 2
val spc2 = SparkSession.builder()
.config("hive.metastore.uris", "thrift://dbsyz2222:10004")
.enableHiveSupport()
.getOrCreate()
// Copy dataframe 1 from session 1 into a new dataframe in session 2, by passing its RDD and schema
val dataframe_2 = spc2.createDataFrame(dataframe_1.rdd, dataframe_1.schema)
dataframe_2.write.mode("Append").saveAsTable(<targettableNameofMetastore_2>)
Look at the SparkSession getOrCreate method, which states that it:
gets an existing [[SparkSession]] or, if there is no existing one,
creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local
SparkSession, and if yes, return that one. It then checks whether
there is a valid global default SparkSession, and if yes, return
that one. If no valid global default SparkSession exists, the method
creates a new SparkSession and assigns the newly created
SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing
SparkSession.
That's the reason it's returning the first session and its configurations.
Please go through the docs to find out alternative ways to create a session.
I'm working on a pre-2.x Spark version, so I'm not sure exactly how to create a new session without the configurations colliding.
But here is a useful test case, SparkSessionBuilderSuite.scala, that shows how to do it; DIY.
An example method from that test case:
test("use session from active thread session and propagate config options") {
val defaultSession = SparkSession.builder().getOrCreate()
val activeSession = defaultSession.newSession()
SparkSession.setActiveSession(activeSession)
val session = SparkSession.builder().config("spark-config2", "a").getOrCreate()
assert(activeSession != defaultSession)
assert(session == activeSession)
assert(session.conf.get("spark-config2") == "a")
assert(session.sessionState.conf == SQLConf.get)
assert(SQLConf.get.getConfString("spark-config2") == "a")
SparkSession.clearActiveSession()
assert(SparkSession.builder().getOrCreate() == defaultSession)
SparkSession.clearDefaultSession()
}

Connecting Spark to Multiple Mongo Collections

I have the following MongoDB collections: employee and details.
Now I have a requirement where I have to get documents from both collections into spark to analyze data.
I tried the code below, but it doesn't seem to work:
SparkConf conf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","MongoSparkExample")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.employee")
.set("spark.executor.memory", "6g");
SparkSession session = SparkSession.builder().appName("Member Log")
.config(conf).getOrCreate();
SparkConf dailyconf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","Mongo Two Example")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.details");
SparkSession mongosession = SparkSession.builder().appName("Daily Log")
.config(dailyconf).getOrCreate();
Any pointers would be highly appreciated.
I fixed this issue by adding the code below:
JavaSparkContext newcontext = new JavaSparkContext(session.sparkContext());
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", "details");
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(newcontext).withOptions(readOverrides);
MongoSpark.load(newcontext,readConfig);
First of all, like eliasah said, you should only create one Spark Session.
Second, take a look at the official MongoDB Spark Connector. It provides integration between MongoDB and Apache Spark and gives you the possibility to load collections into DataFrames.
Please refer to the official documentation:
MongoDB Connector for Spark
Read from MongoDB (scala)
Read from MongoDB (Java)
EDIT
The documentation says the following:
Call loadFromMongoDB() with a ReadConfig object to specify a different MongoDB server address, database and collection.
In your case:
sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost/Emp.details")))
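If you need both collections in the same job, one possible pattern is a single session plus one ReadConfig per collection. A sketch (spark below stands for your single SparkSession, whose spark.mongodb.input.uri points at the Emp database; the join key employeeId is purely hypothetical):
import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig

// Each ReadConfig only overrides the collection; the rest comes from the session's input settings.
val employeeDF = MongoSpark.load(spark, ReadConfig(Map("collection" -> "employee"), Some(ReadConfig(spark))))
val detailsDF  = MongoSpark.load(spark, ReadConfig(Map("collection" -> "details"),  Some(ReadConfig(spark))))

employeeDF.join(detailsDF, Seq("employeeId")).show()   // hypothetical join key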
You can use the latest Spark SQL features, passing parameters as per your requirement:
val sparksession = SparkSession
.builder()
.master("local[*]")
.appName("TEST")
.config( "spark.mongodb.input.uri", mongodb://localhost:portNo/dbInputName.CollInputName")
.config "spark.mongodb.output.uri", "mongodb://localhost:portNo/dbOutName.CollOutName")
.getOrCreate()
val readConfigVal = ReadConfig(Map("uri" -> uriName,"database" -> dbName, "collection" -> collName, "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sparksession)))
val mongoLoadedDF = MongoSpark.load(sparksession, readConfigVal)
println("mongoLoadedDF:")
mongoLoadedDF.show()
You can read and write multiple tables using readOverrides / writeOverrides.
SparkSession spark = SparkSession
.builder()
.appName("Mongo connect")
.config("spark.mongodb.input.uri", "mongodb://user:password#ip_addr:27017/database_name.employee")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Read employee table 1
JavaMongoRDD<Document> employeeRdd = MongoSpark.load(jsc);
Map<String, String> readOverrides = new HashMap<String, String>();
// readOverrides.put("database", "database_name");
readOverrides.put("collection", "details");
ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);
// Read another table 2 (details table )
JavaMongoRDD<Document> detailsRdd = MongoSpark.load(jsc, readConfig);
System.out.println(employeeRdd.first().toJson());
System.out.println(detailsRdd.first().toJson());
jsc.close();
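The writeOverrides side mentioned above works the same way; here is a hedged Scala sketch (the target collection, host, and DataFrame names are illustrative only):
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig

// Point the WriteConfig at a database.collection explicitly; the credentials and host
// mirror the input URI above and are placeholders.
val writeConfig = WriteConfig(Map("uri" -> "mongodb://user:password@ip_addr:27017/database_name.details_out"))
MongoSpark.save(resultDF, writeConfig)           // resultDF stands in for whatever DataFrame you produced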