How to read data from MongoDB into Spark with a specific query

I am getting data from MongoDB using this query:
db.objects.find({ _key: { $in: ["user:130"] } }, { _id: 0, uid: 1, username: 1 }).pretty();
Now I need to get the same data in Spark.
val readConf = ReadConfig(Map("uri" -> host, "database" -> "nodebb", "collection" -> "objects"))
val data = spark.read.mongo(readConf)
This returns the complete collection from MongoDB.
How can I apply that query as well?
Thanks

If you just want to filter some records, you can use .filter on your DataFrame.
If you want to run SQL queries on the data loaded from Mongo, you can create a temp view from your DataFrame and then query it with spark.sql:
df.createOrReplaceTempView("temp")
val someFruit = spark.sql("SELECT type, qty FROM temp WHERE type LIKE '%e%'")
someFruit.show()
More details are in the MongoDB Spark Connector documentation.
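To push the original shell query down to MongoDB rather than filtering after the load, one option is an aggregation pipeline string like the one passed to .option("pipeline", ...) in the timestamp thread further down this page. The sketch below only builds that string for the question's query. The names MongoPipeline, jsonString and pipelineFor are made up for illustration, and whether your connector version accepts a pipeline option is an assumption to check against its documentation.

```scala
object MongoPipeline {
  // Quote a value as a JSON string literal (escapes backslash and double quote only).
  def jsonString(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Build the pipeline equivalent of
  //   find({ _key: { $in: [...] } }, { _id: 0, uid: 1, username: 1 })
  // Inside an s"..." string, "$$" produces a literal "$".
  def pipelineFor(keys: Seq[String]): String = {
    val in = keys.map(jsonString).mkString("[", ", ", "]")
    s"""[{ "$$match": { "_key": { "$$in": $in } } }, { "$$project": { "_id": 0, "uid": 1, "username": 1 } }]"""
  }
}
```

With a DataFrame reader, the result would be passed as .option("pipeline", MongoPipeline.pipelineFor(Seq("user:130"))) before .load().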

Related

Spring Data MongoDB aggregation: adding a where clause after group and lookup in the same query

I have a MongoDB query where I use the group operation and the lookup operation. I am able to join the 2 collections, but I am not sure how I can add a WHERE clause to the Spring Data code below:
val lookupOperation: LookupOperation = LookupOperation.newLookup()
.from("attachments")
.localField("id")
.foreignField("data.attachmentId").`as`("attachmentJoin")
val aggregation: Aggregation = newAggregation(
Aggregation.match(Criteria("_id").`is`(sId)),
Aggregation.unwind("data"),
Aggregation.match(Criteria("data.name").nin(null, "")),
Aggregation.match(Criteria("data.attachmentName").nin(null, "")),
Aggregation.match(Criteria("data.attachmentId").nin(null, "")),
Aggregation.group(fields()
.and("data.attachmentName")
.and("data.name")
.and("data.attachmentId"))
.push("data").`as`("data"),
// =================================================
lookupOperation
)
Before the marked line (that is, before the lookup), I have the following aggregation result:
{
"id": {GROUPPED_IDs},
"data": {DATA_OBJECT_ARRAY}
}
After lookup, the aggregation result becomes:
{
"id": {GROUPPED_IDs},
"data": {DATA_OBJECT_ARRAY},
"attachmentJoin": {ATTACHMENT_DATA}
}
attachmentJoin contains multiple ATTACHMENT objects. I would like to apply a where clause like the one below, so that only the correct attachment remains:
WHERE attachmentJoin.id = data.attachmentId
How can I apply this where condition so that it affects only the attachmentJoin data?
Thanks!

Is it possible to use Mongo Shell style queries with Spark's Cosmos DB Connector?

I'm using the Cosmos DB Connector for Spark. Is it possible to use Mongo Shell "JSON-style" queries with the Cosmos DB connector instead of SQL queries?
I tried using the MongoDB Connector instead to achieve the same functionality but have run into some annoying bugs with memory limits using the Mongo Connector. So I've abandoned that approach.
This is the way I'd prefer to query:
val results = db.cars.find(
{
"car.actor.account.name": "Bill"
}
)
This is the way the cosmos connector allows:
val readConfig: Config = Config(Map(
"Endpoint" -> config.getString("endpoint"),
"Masterkey" -> config.getString("masterkey"),
"Database" -> config.getString("database"),
"Collection" -> "cars",
"preferredRegions" -> "South Central US",
"schema_samplesize" -> "100",
"query_custom" -> "SELECT * FROM root WHERE root['$v']['car']['$v']['actor']['$v']['account']['$v']['name']['$v'] = 'Bill'"
))
val results = spark.sqlContext.read.cosmosDB(readConfig)
Obviously the SQL-oriented approach doesn't lend itself well to the deeply nested data structures I'm getting from Cosmos DB. It's quite a bit more verbose, too; requiring each nested dictionary to be referenced with "['$v']" for reasons I'm unclear on. I'd much prefer to be able to use the Mongo-style syntax.
The Cosmos DB Connector for Spark you mention is for the Cosmos DB SQL API, so you can only query it with SQL:
// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config
// Read Configuration
val readConfig = Config(Map(
"Endpoint" -> "https://doctorwho.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE",
"Database" -> "DepartureDelays",
"Collection" -> "flights_pcoll",
"query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'" // Optional
))
// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()
But if your Cosmos DB account uses the MongoDB API, you can use the MongoDB Spark connector instead.
Please refer to this guide: https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
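If you stay with the SQL API connector, the repeated ['$v'] segments can at least be generated from a dotted path rather than typed by hand. This is a minimal sketch that assumes every level is wrapped in '$v' exactly as in the query_custom string from the question; CosmosPath and wrapped are hypothetical helper names, not part of the connector.

```scala
object CosmosPath {
  // Turn "car.actor.account.name" into
  // root['$v']['car']['$v']['actor']['$v']['account']['$v']['name']['$v'].
  // Keys are inserted verbatim, so a key containing a single quote is not handled.
  def wrapped(dotPath: String, root: String = "root"): String =
    dotPath.split('.').foldLeft(root + "['$v']") { (acc, key) =>
      acc + s"['$key']['$$v']" // "$$" is a literal "$" inside s"..."
    }

  // Build a query_custom value that compares the wrapped path to a string literal.
  def equalsQuery(dotPath: String, value: String): String =
    s"SELECT * FROM root WHERE ${wrapped(dotPath)} = '$value'"
}
```

CosmosPath.equalsQuery("car.actor.account.name", "Bill") reproduces the query_custom string shown in the question.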

Using MongoDB Spark Connector to filter based on timestamp

I am using the MongoDB Spark connector to fetch data from MongoDB. However, I can't work out how to query Mongo from Spark using an aggregation pipeline (rdd.withPipeline). Below is my code, where I want to fetch records based on a timestamp and store them in a DataFrame:
val appData=MongoSpark.load(spark.sparkContext,readConfig)
val df=appData.withPipeline(Seq(Document.parse("{ $match: { createdAt : { $gt : 2017-01-01 00:00:00 } } }"))).toDF()
Is this the correct way to query MongoDB from Spark on a timestamp value?
As the comment mentioned, you could utilise the Extended JSON format for the date filter.
val appDataRDD = MongoSpark.load(sc)
val filteredRDD = appDataRDD.withPipeline(Seq(Document.parse("{$match:{timestamp:{$gt:{$date:'2017-01-01T00:00:00.000'}}}}")))
filteredRDD.foreach(println)
See also "MongoDB Spark Connector: Filters and Aggregation" for an alternative way to filter.
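The Extended JSON date filter can also be generated from a java.time.Instant instead of being typed by hand. This is a minimal sketch, assuming the connector's JSON parser accepts an ISO-8601 string with a trailing Z inside $date; DateMatch and matchAfter are illustrative names, not part of the connector.

```scala
import java.time.Instant
import java.time.format.DateTimeFormatter

object DateMatch {
  // Build a $match stage selecting documents whose `field` is later than `instant`.
  // Inside an s"..." string, "$$" produces a literal "$".
  def matchAfter(field: String, instant: Instant): String = {
    val iso = DateTimeFormatter.ISO_INSTANT.format(instant) // UTC, ends in "Z"
    s"""{ "$$match": { "$field": { "$$gt": { "$$date": "$iso" } } } }"""
  }
}
```

The resulting string can be handed to Document.parse in the same way as the literal above, for example appDataRDD.withPipeline(Seq(Document.parse(DateMatch.matchAfter("createdAt", Instant.parse("2017-01-01T00:00:00Z"))))).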
Try this:
val pipeline = "{'$match': {'CreationDate':{$gt: {$date:'2020-08-26T00:00:00.000Z'}}}}"
val sourceDF = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://administrator:password@10.XXXXX:27017/?authSource=admin")
  .option("database", "_poc")
  .option("collection", "activity")
  .option("pipeline", pipeline)
  .load()
Try this (but it has a limitation: the Mongo date / ISODate can only take a timestamp in the "T...Z" format, as in the example):
option("pipeline", s"""[{ $$match: { "updatedAt" : { $$gte : new ISODate("2022-11-29T15:26:21.556Z") } } }]""").mongo[DeltaComments]

Confusion regarding embedded BSONDocument in ReactiveMongo

I have stored the following data in MongoDB:
db.users.insert({id: 1,user: {firstname:"John",lastname:"Cena",email:["jc@wwe.com","jc1@wwe.com"],password:"YouCantSeeMe",address:{street:"34 some street", country:"USA"}}})
I queried as follows expecting that the first query will not work but the second will. To my surprise, it was the other way round.
This query worked
val query1 = BSONDocument("user.firstname"->user.firstName)
This didn't
val query2 = BSONDocument("user"-> BSONDocument("firstname"->user.firstName))
I observed that query1 creates the following structure (by running MongoDB in verbose mode, mongod -v):
{ user.firstname: "John" }
But query2 creates the following structure:
{ user: { firstname: "John" } }
Aren't these two the same (firstname is inside user)?
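They are not the same.
The first query works because dot notation compares a single field inside the embedded document.
The second query fails because it compares the embedded document as a whole: it matches only if user is exactly { firstname: "John" }, with no other fields. Here user also contains lastname, email, password and address, so nothing matches.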
https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
https://docs.mongodb.com/manual/core/document/#dot-notation
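The difference can be illustrated without a database. The following is only a toy model of the two matching rules, not ReactiveMongo or MongoDB code; Doc, matchesDot and matchesWhole are hypothetical names, and a document is modelled as an ordered list of (field, value) pairs.

```scala
// Toy document: an ordered list of (field, value) pairs; values may be nested Docs.
final case class Doc(fields: (String, Any)*) {
  def get(name: String): Option[Any] =
    fields.collectFirst { case (k, v) if k == name => v }
}

object EmbeddedMatch {
  // Dot notation: walk the path and compare only the addressed field.
  def matchesDot(doc: Doc, path: String, expected: Any): Boolean =
    path.split('.').foldLeft(Option[Any](doc)) {
      case (Some(d: Doc), key) => d.get(key)
      case _                   => None
    }.contains(expected)

  // Embedded-document equality: the whole sub-document must equal the query value.
  def matchesWhole(doc: Doc, field: String, expected: Doc): Boolean =
    doc.get(field).contains(expected)

  val stored = Doc("id" -> 1, "user" -> Doc("firstname" -> "John", "lastname" -> "Cena"))

  val dotResult   = matchesDot(stored, "user.firstname", "John")            // true
  val wholeResult = matchesWhole(stored, "user", Doc("firstname" -> "John")) // false
}
```

In this model the whole-document comparison succeeds only when every field of user is supplied, in the same order, which mirrors why query2 returns nothing.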

How to remove _id from MongoDB results?

I am inserting a JSON file into MongoDB (with Scala and the Play framework) and later fetching it again for download in my view page for another requirement, but this time it comes back with an extra "_id" field in the JSON.
I need only my original JSON, without any "_id" field. I have read in the MongoDB tutorial that every document in a collection is stored with an _id by default.
Please let me know how I can get my original JSON back from MongoDB without the _id, if that is possible.
This is the JSON as stored in the database (I don't need the "_id" field):
{
"testjson": [{
"key01": "value1",
"key02": "value02",
"key03": "value03"
}],
"_id": 1
}
If you look at the ReactiveMongo dev guide and its API, you can see that it supports projection in a similar way to the MongoDB shell.
So you can do:
collection.find(selector = BSONDocument(), projection = BSONDocument("_id" -> 0))
Or, as you are using JSON serialization:
collection.find(selector = Json.obj(), projection = Json.obj("_id" -> 0))
You can use this query in the shell:
db.testtable.find({},{"_id" : false})
Here we are telling mongoDB not to return _id from the collection.
You can also use 0 instead of false, like this:
db.testtable.find({},{"_id" : 0})
For Scala, you need to translate this into your driver's syntax.