How to add a MongoDB-specific query using the mongo-spark connector?

I am using "mongo-spark" (https://github.com/mongodb/mongo-spark) to read MongoDB from a Spark 2.0 application.
Here is a code example:
val readConfig: ReadConfig = ReadConfig(Map(
  "spark.mongodb.input.uri" -> "mongodb://mongodb01.blabla.com/xqwer",
  "collection" -> "some_collection"),
  None)
sparkSession.read.format("com.mongodb.spark.sql").options(readConfig.asOptions).load()
Does anyone know how to add a MongoDB query (e.g. find({ uid: 'ZesSZY3Ch0k8nQtQUIfH' }))?

You can use filter() on the DataFrame:
val df = sparkSession.read.format("com.mongodb.spark.sql")
.options(readConfig.asOptions).load()
df.filter($"uid".equalTo(lit("ZesSZY3Ch0k8nQtQUIfH"))).show()

Related

Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD

Below is the sample code snippet used to fetch data from HBase. This worked fine with Spark 3.1.2; however, after upgrading to Spark 3.2.1 it no longer works, i.e. the returned RDD doesn't contain any values. It also doesn't throw any exception.
def getInfo(sc: SparkContext, startDate: String, cachingValue: Int, sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[String] = {
  val scan = new Scan
  scan.addFamily(Bytes.toBytes("family"))
  scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams)
  val output: RDD[String] = rdd.map { row =>
    Bytes.toString(row._2.getRow)
  }
  output
}
def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String,
                                  scan: Scan, cachingValue: Int, sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
  scan.setCaching(cachingValue)
  val scanString = Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
  val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
  val hbaseConfig = hbaseContext.getConfiguration()
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
  hbaseConfig.set(TableInputFormat.SCAN, scanString)
  sc.newAPIHadoopRDD(
    hbaseConfig,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result]
  ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
}
Also, if we fetch using Scan directly, without NewAPIHadoopRDD, it works.
Software versions:
Spark: 3.2.1 prebuilt with user provided Apache Hadoop
Scala: 2.12.10
HBase: 2.4.9
Hadoop: 2.10.1
I found the solution to this one.
See this upgrade guide from Spark 3.1.x to Spark 3.2.x:
https://spark.apache.org/docs/latest/core-migration-guide.html
Since Spark 3.2, spark.hadoopRDD.ignoreEmptySplits is set to true by default which means Spark will not create empty partitions for empty input splits. To restore the behavior before Spark 3.2, you can set spark.hadoopRDD.ignoreEmptySplits to false.
It can be set like this on spark-submit:
./spark-submit \
--class org.apache.hadoop.hbase.spark.example.hbasecontext.HBaseDistributedScanExample \
--master spark://localhost:7077 \
--conf "spark.hadoopRDD.ignoreEmptySplits=false" \
--jars ... \
/tmp/hbase-spark-1.0.1-SNAPSHOT.jar YourHBaseTable
Alternatively, you can set this globally in $SPARK_HOME/conf/spark-defaults.conf so that it applies to every Spark application:
spark.hadoopRDD.ignoreEmptySplits false
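The same setting can also be applied programmatically when the session is built, if you prefer not to touch spark-submit or spark-defaults.conf (a minimal sketch; the app name is illustrative):
import org.apache.spark.sql.SparkSession

// Sketch: restore the pre-3.2 behaviour for empty input splits
val spark = SparkSession.builder()
  .appName("HBaseScanExample")
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")
  .getOrCreate()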

Read Data from MongoDB through Apache Spark with a query

I am able to read the data stored in MongoDB via Apache Spark using the conventional methods described in its documentation. I have a MongoDB query that I would like to be used while loading the collection. The query is simple, but I can't seem to find the correct way to specify it in the config() function on the SparkSession object.
Following is my SparkSession builder
val conf = new SparkConf()
val confMap: Map[String, String] = Map(
  "spark.mongodb.input.uri" -> "mongodb://xxx:xxx@mongodb1:27017,mongodb2:27017,mongodb3:27017/?ssl=true&replicaSet=MongoShard-0&authSource=xxx&retryWrites=true&authMechanism=SCRAM-SHA-1",
  "spark.mongodb.input.database" -> "A",
  "spark.mongodb.input.collection" -> "people",
  "spark.mongodb.output.database" -> "B",
  "spark.mongodb.output.collection" -> "result",
  "spark.mongodb.input.readPreference.name" -> "primaryPreferred"
)
conf.setAll(confMap)
val spark: SparkSession =
  SparkSession.builder().master("local[1]").config(conf).getOrCreate()
Is there a way to specify the MongoDB query in the SparkConf object so that the SparkSession reads only specific fields from the collection?
Use the .withPipeline API.
Example Code:
val readConfig = ReadConfig(Map("uri" -> MONGO_DEV_URI, "collection" -> MONGO_COLLECTION_NAME, "readPreference.name" -> "secondaryPreferred"))
MongoSpark
.load(spark.sparkContext, readConfig)
.withPipeline(Seq(Document.parse(query)))
As per the comments, the pipeline can also be passed as a reader option:
sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
.option("pipeline", "[{ $match: { name: { $exists: true } } }]")
.option("uri","mongodb://127.0.0.1/mydb.mycoll")
.load()
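Since the goal is to read only specific fields, the pipeline can combine a $match with a $project. A hedged sketch, reusing database A and collection people from the question (the field names name and age are purely illustrative):
val pipeline = """[
  { "$match":   { "name": { "$exists": true } } },
  { "$project": { "name": 1, "age": 1, "_id": 0 } }
]"""

sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("pipeline", pipeline)
  .option("uri", "mongodb://127.0.0.1/A.people")
  .load()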

Spark scala use spark-mongo connector to upsert

Is there any way to upsert a Mongo collection with the spark-mongo connector based on a certain field in the dataframe?
To replace documents based on a unique key constraint, use the replaceDocument and shardKey options. The default shardKey is {_id: 1}.
https://docs.mongodb.com/spark-connector/master/configuration/
df.write.format('com.mongodb.spark.sql') \
.option('collection', 'target_collection') \
.option('replaceDocument', 'true') \
.option('shardKey', '{"date": 1, "name": 1, "resource": 1}') \
.mode('append') \
.save()
With replaceDocument=false, your documents are merged (updated) based on the shardKey instead of being replaced.
https://github.com/mongodb/mongo-spark/blob/c9e1bc58cb509021d7b7d03367776b84da6db609/src/main/scala/com/mongodb/spark/MongoSpark.scala#L120-L141
As of the MongoDB Connector for Spark version 1.1+ (currently version 2.2), when you execute save() as below (the .mongo() helper requires import com.mongodb.spark.sql._):
df.write.mongo()
df.write.format("com.mongodb.spark.sql").save()
If the dataframe contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and new documents without an existing _id value in the collection will be inserted.
See also MongoDB Spark SQL for more information and snippets.
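If the unique key is a field other than _id, one hedged workaround (assuming that field is unique per document) is to copy it into _id before writing, so that the connector's _id-based upsert applies; the uid column here is illustrative:
import org.apache.spark.sql.functions.col

// Sketch: reuse a unique business key as the document _id so the write becomes an upsert on it
val withId = df.withColumn("_id", col("uid"))
withId.write
  .format("com.mongodb.spark.sql")
  .option("collection", "target_collection")
  .mode("append")
  .save()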
Try the replaceDocument option:
df.select("_id").withColumn("aaa", lit("ha"))
.write
.option("collection", collectionName)
.option("replaceDocument", "false")
.mode(SaveMode.Append)
.format("mongo")
.save()
I don't know why this option cannot be found anywhere in the Mongo documentation.
With some digging into mongo-spark's source, here is a simple hack that adds upsert-on-certain-fields support to the MongoSpark.save method:
// add an additional keys parameter
def save[D](dataset: Dataset[D], writeConfig: WriteConfig, keys: List[String]): Unit = {
  val mongoConnector = MongoConnector(writeConfig.asOptions)
  val dataSet = dataset.toDF()
  val mapper = rowToDocumentMapper(dataSet.schema)
  val documentRdd: RDD[BsonDocument] = dataSet.rdd.map(row => mapper(row))
  val fieldNames = dataset.schema.fieldNames.toList
  // use the supplied keys if given, otherwise fall back to the shardKey (default {_id: 1})
  val queryKeyList =
    if (keys.nonEmpty) keys
    else BsonDocument.parse(writeConfig.shardKey.getOrElse("{_id: 1}")).keySet().asScala.toList
  // the rest remains the same as in MongoSpark.save
  // ...
}
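A hypothetical call site for this patched method could then pass the fields to match on (the field names date and name and the URI are illustrative):
// Sketch: upsert on the "date" and "name" fields using the patched save
val writeConfig = WriteConfig(Map(
  "uri" -> "mongodb://127.0.0.1/mydb.target_collection",
  "replaceDocument" -> "false"))
save(df, writeConfig, List("date", "name"))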

Connecting Spark to Multiple Mongo Collections

I have the following MongoDB collections: employee and details.
Now I have a requirement where I have to load documents from both collections into Spark to analyze the data.
I tried the code below, but it does not seem to work:
SparkConf conf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","MongoSparkExample")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.employee")
.set("spark.executor.memory", "6g");
SparkSession session = SparkSession.builder().appName("Member Log")
.config(conf).getOrCreate();
SparkConf dailyconf = new SparkConf().setAppName("DBConnection").setMaster("local[*]")
.set("spark.app.id","Mongo Two Example")
.set("spark.mongodb.input.uri","mongodb://localhost/Emp.details");
SparkSession mongosession = SparkSession.builder().appName("Daily Log")
.config(dailyconf).getOrCreate();
Any pointers would be highly appreciated.
I fixed this issue by adding the code below:
JavaSparkContext newcontext = new JavaSparkContext(session.sparkContext());
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", "details");
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(newcontext).withOptions(readOverrides);
MongoSpark.load(newcontext,readConfig);
First of all, as eliasah said, you should only create one Spark Session.
Second, take a look at the official MongoDB Spark Connector. It provides integration between MongoDB and Apache Spark and gives you the possibility to load collections into DataFrames.
Please refer to the official documentation:
MongoDB Connector for Spark
Read from MongoDB (Scala)
Read from MongoDB (Java)
EDIT
The documentation says the following:
Call loadFromMongoDB() with a ReadConfig object to specify a different MongoDB server address, database and collection.
In your case:
sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost/Emp.details")))
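With a single session (called spark here) whose spark.mongodb.input.uri points at Emp.employee as in the question, a Scala sketch of reading both collections through ReadConfig overrides could look like this:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// One SparkSession, two ReadConfigs: the default reads employee, the override reads details
val employeeDF = MongoSpark.load(spark)
val detailsConfig = ReadConfig(Map("collection" -> "details"), Some(ReadConfig(spark)))
val detailsDF = MongoSpark.load(spark, detailsConfig)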
You can use the latest Spark SQL features by passing parameters as required:
val sparksession = SparkSession
  .builder()
  .master("local[*]")
  .appName("TEST")
  .config("spark.mongodb.input.uri", "mongodb://localhost:portNo/dbInputName.CollInputName")
  .config("spark.mongodb.output.uri", "mongodb://localhost:portNo/dbOutName.CollOutName")
  .getOrCreate()
val readConfigVal = ReadConfig(Map("uri" -> uriName, "database" -> dbName, "collection" -> collName, "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sparksession)))
val mongoLoadedDF = MongoSpark.load(sparksession, readConfigVal)
mongoLoadedDF.show()
You can read and write multiple tables using readOverrides / writeOverrides.
SparkSession spark = SparkSession
.builder()
.appName("Mongo connect")
.config("spark.mongodb.input.uri", "mongodb://user:password#ip_addr:27017/database_name.employee")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Read employee table 1
JavaMongoRDD<Document> employeeRdd = MongoSpark.load(jsc);
Map<String, String> readOverrides = new HashMap<String, String>();
// readOverrides.put("database", "database_name");
readOverrides.put("collection", "details");
ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);
// Read another table 2 (details table )
JavaMongoRDD<Document> detailsRdd = MongoSpark.load(jsc, readConfig);
System.out.println(employeeRdd.first().toJson());
System.out.println(detailsRdd.first().toJson());
jsc.close();

Connect to SQLite in Apache Spark

I want to run a custom function on all tables in a SQLite database. The function is more or less the same, but depends on the schema of the individual table. Also, the tables and their schemata are only known at runtime (the program is called with an argument that specifies the path of the database).
This is what I have so far:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// somehow bind sqlContext to DB
val allTables = sqlContext.tableNames
for (t <- allTables) {
  val df = sqlContext.table(t)
  val schema = df.columns
  sqlContext.sql("SELECT * FROM " + t + "...").map(x => myFunc(x, schema))
}
The only hint I have found so far requires knowing the table in advance, which is not the case in my scenario:
val tableData =
sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db", "dbtable" -> t))
.load()
I am using the xerial sqlite-jdbc driver. So how can I connect solely to a database, not to a table?
Edit: Using Beryllium's answer as a starting point, I updated my code to this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val metaData = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> "(SELECT * FROM sqlite_master) AS t")).load()
val myTableNames = metaData.select("tbl_name").distinct()
for (t <- myTableNames) {
  println(t.toString)
  val tableData = sqlContext.table(t.toString)
  for (record <- tableData.select("*")) {
    println(record)
  }
}
At least I can read the table names at runtime, which is a huge step forward for me. But I can't read the tables. I tried both
val tableData = sqlContext.table(t.toString)
and
val tableData = sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
"dbtable" -> t.toString)).load()
in the loop, but in both cases I get a NullPointerException. Although I can print the table names it seems I cannot connect to them.
Last but not least I always get an SQLITE_ERROR: Connection is closed error. It looks to be the same issue described in this question: SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database
There are two options you can try.
Use JDBC directly:
Open a separate, plain JDBC connection in your Spark job.
Get the table names from the JDBC metadata.
Feed these into your for comprehension (a sketch follows after the next example).
Use a SQL query for the "dbtable" argument:
You can specify a query as the value for the dbtable argument. Syntactically this query must "look" like a table, so it must be wrapped in a subquery.
val df = sqlContext.read.format("jdbc").options(
Map(
"url" -> "jdbc:postgresql:xxx",
"user" -> "x",
"password" -> "x",
"dbtable" -> "(select * from pg_tables) as t")).load()
This example works with PostgreSQL; you have to adapt it for SQLite.
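For the first option (plain JDBC), a minimal sketch could look like this, assuming the xerial sqlite-jdbc driver is on the classpath; the table names are read via DatabaseMetaData and then fed back into the Spark JDBC data source:
import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

// Read the table names through plain JDBC metadata
val conn = DriverManager.getConnection("jdbc:sqlite:/path/to/file.db")
val rs = conn.getMetaData.getTables(null, null, "%", Array("TABLE"))
val tableNames = ListBuffer[String]()
while (rs.next()) tableNames += rs.getString("TABLE_NAME")
conn.close()

// Then load each table through the Spark JDBC data source
for (t <- tableNames) {
  val df = sqlContext.read.format("jdbc")
    .options(Map("url" -> "jdbc:sqlite:/path/to/file.db", "dbtable" -> t))
    .load()
  df.show()
}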
Update
It seems that the JDBC driver only supports iterating over one result set.
Anyway, when you materialize the list of table names using collect(), the following snippet should work:
val myTableNames = metaData.select("tbl_name").map(_.getString(0)).collect()
for (t <- myTableNames) {
  println(t.toString)
  val tableData = sqlContext.read.format("jdbc")
    .options(
      Map(
        "url" -> "jdbc:sqlite:/x.db",
        "dbtable" -> t)).load()
  tableData.show()
}