Reading from Multiple MongoDBs to form Datasets

I want to make two Datasets from two different Mongo databases. I am currently using the official MongoDB Spark Connector. The SparkSession is started in the following way:
SparkConf sparkConf = new SparkConf().setMaster("yarn").setAppName("test")
        .set("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
        .set("spark.mongodb.input.uri", "mongodb://192.168.77.62/db1.coll1")
        .set("spark.sql.crossJoin.enabled", "true");
SparkSession sparkSession = SparkSession.builder().appName("test1").config(sparkConf).getOrCreate();
If I want to change spark.mongodb.input.uri, how can I do that? I have already tried changing the runtime config of the SparkSession and also using a ReadConfig with readOverrides, but neither worked.
Method 1:
sparkSession.conf().set("spark.mongodb.input.uri", "mongodb://192.168.77.63/db1.coll2");
Method 2:
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("uri", "192.168.77.63/db1.coll2");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Position> ds = MongoSpark.load(sparkSession, readConfig, Position.class);
Edit 1: As suggested by Karol, I tried the following method:
SparkConf sparkConf = new SparkConf().setMaster("yarn").setAppName("test");
SparkSession sparkSession = SparkSession.builder().appName("test1").config(sparkConf).getOrCreate();
Map<String, String> readOverrides1 = new HashMap<String, String>();
readOverrides1.put("uri", "mongodb://192.168.77.62:27017");
readOverrides1.put("database", "db1");
readOverrides1.put("collection", "coll1");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides1);
This fails at runtime with:
Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
Edit 2:
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder().appName("test")
            .config("spark.worker.cleanup.enabled", "true").config("spark.scheduler.mode", "FAIR").getOrCreate();
    String mongoURI2 = "mongodb://192.168.77.63:27017/db1.coll1";
    Map<String, String> readOverrides1 = new HashMap<String, String>();
    readOverrides1.put("uri", mongoURI2);
    ReadConfig readConfig1 = ReadConfig.create(sparkSession).withOptions(readOverrides1);
    MongoSpark.load(sparkSession, readConfig1, Position.class).show();
}
This still throws the same exception as in the previous edit.

build.sbt:
libraryDependencies += "org.mongodb.spark" % "mongo-spark-connector_2.11" % "2.0.0"
package com.example.app
import com.mongodb.spark.config.{ReadConfig, WriteConfig}
import com.mongodb.spark.sql._
import org.apache.spark.sql.SparkSession

object App {
  def main(args: Array[String]): Unit = {
    val MongoUri1 = args(0).toString
    val MongoUri2 = args(1).toString
    val SparkMasterUri = args(2).toString

    def makeMongoURI(uri: String, database: String, collection: String) = s"${uri}/${database}.${collection}"

    val mongoURI1 = s"mongodb://${MongoUri1}:27017"
    val mongoURI2 = s"mongodb://${MongoUri2}:27017"
    val CONFdb1 = makeMongoURI(s"${mongoURI1}", "MyCollection1", "df")
    val CONFdb2 = makeMongoURI(s"${mongoURI2}", "MyCollection2", "df")

    val WRITEdb1: WriteConfig = WriteConfig(scala.collection.immutable.Map("uri" -> CONFdb1))
    val READdb1: ReadConfig = ReadConfig(Map("uri" -> CONFdb1))
    val WRITEdb2: WriteConfig = WriteConfig(scala.collection.immutable.Map("uri" -> CONFdb2))
    val READdb2: ReadConfig = ReadConfig(Map("uri" -> CONFdb2))

    val spark = SparkSession
      .builder
      .appName("AppMongo")
      .config("spark.worker.cleanup.enabled", "true")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()

    val df1 = spark.read.mongo(READdb1)
    val df2 = spark.read.mongo(READdb2)
    df1.write.mode("overwrite").mongo(WRITEdb1)
    df2.write.mode("overwrite").mongo(WRITEdb2)
  }
}
You can now pass uri1 and uri2 into /usr/local/spark/bin/spark-submit pathToMyjar.app.jar MongoUri1 MongoUri2 sparkMasterUri as args, then create a config for each URI and read it with spark.read.mongo(READdb).

It's not useful to set the uri in the ReadConfig. The Spark-Mongo connector uses this information when the ReadConfig.create() method is called, so try to set it in the SparkContext before using it.
Just like below:
SparkContext context = spark.sparkContext();
context.conf().set("spark.mongodb.input.uri","mongodb://host:ip/database.collection");
JavaSparkContext jsc = new JavaSparkContext(context);

Related

How to use properties in spark scala maven project

I want to include a properties file explicitly and use it in my Spark code, instead of hardcoding all the credentials directly in the Spark code.
I am trying the following approach but am not able to make it work; AppContext cannot be resolved.
Please guide me on how to achieve this.
Spark_env.properties (under src/main/resources in a Maven project for Spark with Scala)
CASSANDRA_HOST1=127.0.0.133
CASSANDRA_PORT1=9042
CASSANDRA_USER1=usr1
CASSANDRA_PASS1=pas2
DataMigration.cassandra.keyspace1=demo2
DataMigration.cassandra.table1= data1
CASSANDRA_HOST2=
CASSANDRA_PORT2=9042
CASSANDRA_USER2=usr2
CASSANDRA_PASS2=pas2
D.cassandra.keyspace2=kesp2
D.cassandra.table2= data2
DataMigration.DifferencedRecords.output.path1=C:/spark_windows_proj/File1.csv
DataMigration.DifferencedRecords.output.path2=C:/spark_windows_proj/File1.parquet
----------------------------------------------------------------------------------
DM.scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.mapreduce.v2.app.AppContext
object Data_Migration {
  def main(args: Array[String]) {
    val host1: String = AppContext.getProperties().getProperty("CASSANDRA_HOST1")
    val port1 = AppContext.getProperties().getProperty("CASSANDRA_PORT1").toInt
    val keySpace1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace1")
    val DataMigrationTableName1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table1")
    val username1: String = AppContext.getProperties().getProperty("CASSANDRA_USER1")
    val pass1: String = AppContext.getProperties().getProperty("CASSANDRA_PASS1")
    val host2: String = AppContext.getProperties().getProperty("CASSANDRA_HOST2")
    val port2 = AppContext.getProperties().getProperty("CASSANDRA_PORT2").toInt
    val keySpace2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace2")
    val DataMigrationTableName2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table2")
    val username2: String = AppContext.getProperties().getProperty("CASSANDRA_USER2")
    val pass2: String = AppContext.getProperties().getProperty("CASSANDRA_PASS2")
    val Result_csv: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path1")
    val Result_parquet: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path2")

    val sc = AppContext.getSparkContext()
    val spark = SparkSession
      .builder().master("local")
      .appName("ABC")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val df_read1 = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("spark.cassandra.connection.host", host1)
      .option("spark.cassandra.connection.port", port1)
      .option("spark.cassandra.auth.username", username1)
      .option("spark.cassandra.auth.password", pass1)
      .option("keyspace", keySpace1)
      .option("table", DataMigrationTableName1)
      .load()
I would rather pass the properties explicitly by passing the --properties-file option to spark-submit when submitting the job.
The AppContext won't necessarily work for all submission types, while passing a config file should work everywhere.
Edit: For local usage without spark-submit, you can simply use the standard Properties class, load it from the resources, and access the properties. You only need to put the property file into src/main/resources instead of src/test/resources, which is included in the classpath only for tests. The code is something like:
val props = new Properties
props.load(getClass.getClassLoader.getResourceAsStream("file.props"))
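For instance, a minimal self-contained sketch reusing the key names from the question's Spark_env.properties (the file is assumed to sit under src/main/resources so that it is on the classpath):
import java.util.Properties

object PropsExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties
    // Spark_env.properties is loaded from the classpath (src/main/resources)
    props.load(getClass.getClassLoader.getResourceAsStream("Spark_env.properties"))

    val host1 = props.getProperty("CASSANDRA_HOST1")
    val port1 = props.getProperty("CASSANDRA_PORT1").toInt
    val keySpace1 = props.getProperty("DataMigration.cassandra.keyspace1")
    println(s"Cassandra 1 -> $host1:$port1 / keyspace $keySpace1")
  }
}
These lookups can then replace the AppContext.getProperties() calls in the question's code.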

H2O fails on H2OContext.getOrCreate

I'm trying to write a sample program in Scala/Spark/H2O. The program compiles, but throws an exception in H2OContext.getOrCreate:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.h2o._

object App1 extends App {
  val conf = new SparkConf()
  conf.setAppName("AppTest")
  conf.setMaster("local[1]")
  conf.set("spark.executor.memory", "1g")

  val sc = new SparkContext(conf)
  val spark = SparkSession.builder
    .master("local")
    .appName("ApplicationController")
    .getOrCreate()
  import spark.implicits._

  val h2oContext = H2OContext.getOrCreate(spark) // <--- error here
  import h2oContext.implicits._

  val rawData = sc.textFile("c:\\spark\\data.csv")
  val data = rawData.map(line => line.split(',').map(_.toDouble))
  val response: RDD[Int] = data.map(row => row(0).toInt)
  val str = "count: " + response.count()
  val h2oResponse: H2OFrame = response.toDF

  sc.stop
  spark.stop
}
This is the exception log:
Exception in thread "main" java.lang.RuntimeException: When using the Sparkling Water as Spark
package via --packages option, the 'no.priv.garshol.duke:duke:1.2' dependency has to be
specified explicitly due to a bug in Spark dependency resolution.
    at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:117)
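For what it's worth, the exception itself spells out the remedy: when the application is launched with --packages, the duke artifact has to be listed explicitly alongside Sparkling Water. An illustrative invocation (the Sparkling Water coordinates and the jar path are placeholders, not values from the question):
spark-submit \
  --packages <sparkling-water-coordinates>,no.priv.garshol.duke:duke:1.2 \
  --class App1 path/to/app.jar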

Spark Mongodb Connector Scala - Missing database name

I'm stuck with a weird issue. I'm trying to connect Spark to MongoDB locally using the MongoDB Spark connector.
Apart from setting up Spark, I'm using the following code:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map("uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
// Load the movie rating data from Mongo DB
val movieRatings = MongoSpark.load(sc, readConfig).toDF()
movieRatings.show(100)
However, I get the following error:
java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property.
It occurs on the line where I set up readConfig. I don't get why it complains that the uri is not set when I clearly have a uri property in the Map.
I might be missing something.
You can do it from the SparkSession as mentioned here:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate()
Create a DataFrame using the config:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"))
val df = MongoSpark.load(spark, readConfig)
Write df to MongoDB:
MongoSpark.save(
df.write
.option("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.mode("overwrite"))
In your code, the configuration prefixes are missing:
val readConfig = ReadConfig(Map(
"spark.mongodb.input.uri" -> "mongodb://localhost:27017/movie_db.movie_ratings",
"spark.mongodb.input.readPreference.name" -> "secondaryPreferred"),
Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map(
"spark.mongodb.output.uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
For Java, you can either set the configs while creating the SparkSession, or first create the session and then set them as runtime configs.
1.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.config("spark.mongodb.input.uri","mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate()
OR
2.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.getOrCreate()
Then,
String mongoUrl = "mongodb://localhost:27017/movie_db.movie_ratings";
sparkSession.sparkContext().conf().set("spark.mongodb.input.uri", mongoUrl);
sparkSession.sparkContext().conf().set("spark.mongodb.output.uri", mongoUrl);
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", sourceCollection);
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Row> df = MongoSpark.loadAndInferSchema(sparkSession,readConfig);

Connecting Spark to Multiple Mongo Collections

I have the following MongoDB collections: employee and details.
Now I have a requirement where I have to get documents from both collections into Spark to analyze the data.
I tried the code below but it does not seem to work:
SparkConf conf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","MongoSparkExample")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.employee")
.set("spark.executor.memory", "6g");
SparkSession session = SparkSession.builder().appName("Member Log")
.config(conf).getOrCreate();
SparkConf dailyconf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","Mongo Two Example")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.details");
SparkSession mongosession = SparkSession.builder().appName("Daily Log")
.config(dailyconf).getOrCreate();
Any pointers would be highly appreciated.
I fixed this issue by adding the code below:
JavaSparkContext newcontext = new JavaSparkContext(session.sparkContext());
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", "details");
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(newcontext).withOptions(readOverrides);
MongoSpark.load(newcontext,readConfig);
First of all, like eliasah said, you should only create one SparkSession.
Second, take a look at the official MongoDB Spark Connector. It provides integration between MongoDB and Apache Spark, and it gives you the possibility to load collections into DataFrames.
Please refer to the official documentation:
MongoDB Connector for Spark
Read from MongoDB (Scala)
Read from MongoDB (Java)
EDIT
The documentation says the following:
Call loadFromMongoDB() with a ReadConfig object to specify a different MongoDB server address, database and collection.
In your case:
sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost/Emp.details")))
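Putting the two together, a minimal sketch (sc being the SparkContext, and assuming, as in the question, that spark.mongodb.input.uri already points at Emp.employee) that loads both collections into DataFrames:
import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig

// employee comes from the default spark.mongodb.input.uri
val employeeDF = MongoSpark.load(sc).toDF()
// details is read by overriding the uri through a ReadConfig
val detailsDF = sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost/Emp.details"))).toDF()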
You can use the latest Spark SQL features, passing the parameters as required:
val sparksession = SparkSession
  .builder()
  .master("local[*]")
  .appName("TEST")
  .config("spark.mongodb.input.uri", "mongodb://localhost:portNo/dbInputName.CollInputName")
  .config("spark.mongodb.output.uri", "mongodb://localhost:portNo/dbOutName.CollOutName")
  .getOrCreate()
val readConfigVal = ReadConfig(Map("uri" -> uriName, "database" -> dbName, "collection" -> collName, "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sparksession)))
var mongoLoadedDF = MongoSpark.load(sparksession, readConfigVal)
println("mongoLoadedDF:")
mongoLoadedDF.show()
You can read and write multiple tables using readOverrides / writeOverrides.
SparkSession spark = SparkSession
.builder()
.appName("Mongo connect")
.config("spark.mongodb.input.uri", "mongodb://user:password#ip_addr:27017/database_name.employee")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Read employee table 1
JavaMongoRDD<Document> employeeRdd = MongoSpark.load(jsc);
Map<String, String> readOverrides = new HashMap<String, String>();
// readOverrides.put("database", "database_name");
readOverrides.put("collection", "details");
ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);
// Read another table 2 (details table )
JavaMongoRDD<Document> detailsRdd = MongoSpark.load(jsc, readConfig);
System.out.println(employeeRdd.first().toJson());
System.out.println(detailsRdd.first().toJson());
jsc.close();

Multiclass Classification Evaluator field does not exist error - Apache Spark

I am new to Spark and trying a basic classifier in Scala.
I'm trying to get the accuracy, but when using MulticlassClassificationEvaluator it gives the error below:
Caused by: java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:76)
at com.classifier.classifier_app.App$.<init>(App.scala:90)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
The code is as below:
val conf = new SparkConf().setMaster("local[*]").setAppName("Classifier")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("Email Classifier")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
val spamInput = "TRAIN_00000_0.eml" //files to train model
val normalInput = "TRAIN_00002_1.eml"
val spamData = spark.read.textFile(spamInput)
val normalData = spark.read.textFile(normalInput)
case class Feature(index: Int, value: String)
val indexer = new StringIndexer()
.setInputCol("value")
.setOutputCol("label")
val regexTokenizer = new RegexTokenizer()
.setInputCol("value")
.setOutputCol("cleared")
.setPattern("\\w+").setGaps(false)
val remover = new StopWordsRemover()
.setInputCol("cleared")
.setOutputCol("filtered")
val hashingTF = new HashingTF()
.setInputCol("filtered").setOutputCol("features")
.setNumFeatures(100)
val nb = new NaiveBayes()
val indexedSpam = spamData.map(x=>Feature(0, x))
val indexedNormal = normalData.map(x=>Feature(1, x))
val trainingData = indexedSpam.union(indexedNormal)
val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF, nb))
val model = pipeline.fit(trainingData)
model.write.overwrite().save("myNaiveBayesModel")
val spamTest = spark.read.textFile("TEST_00009_0.eml")
val normalTest = spark.read.textFile("TEST_00000_1.eml")
val sameModel = PipelineModel.load("myNaiveBayesModel")
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
Console.println("Spam Test")
val predictionSpam = sameModel.transform(spamTest).select("prediction")
predictionSpam.foreach(println(_))
val accuracy = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracy)
Console.println("Normal Test")
val predictionNorm = sameModel.transform(normalTest).select("prediction")
predictionNorm.foreach(println(_))
val accuracyNorm = evaluator.evaluate(predictionNorm)
println("Accuracy Normal: " + accuracyNorm)
The error occurs when initializing the MulticlassClassificationEvaluator. How should the column names be specified? Any help is appreciated.
The error is in this line:
val predictionSpam = sameModel.transform(spamTest).select("prediction")
Your DataFrame contains only the prediction column and no label column, so the evaluator configured with setLabelCol("label") cannot find it.
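A minimal sketch of the adjustment, assuming the intent is to keep the evaluator configured as above: also select the label column that the pipeline's StringIndexer produces (whether that indexed label is meaningful for the held-out files is a separate question).
// keep the indexer's label column alongside the prediction,
// so the evaluator configured with setLabelCol("label") can resolve it
val predictionSpam = sameModel.transform(spamTest).select("label", "prediction")
val accuracySpam = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracySpam)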