Upsert many records using ReactiveMongo and Scala - scala

I am writing a DAO Actor for MongoDB that uses ReactiveMongo. I want to implement some very simple CRUD operations, among which the ability to upsert many records in one shot. Since I have a reactive application (built on Akka), it's important for me to have idempotent actions, so I need the operation to be an upsert, not an insert.
So far I have the following (ugly) code to do so:
case class UpsertResult[T](nUpd: Int, nIns: Int, failed: List[T])
def upsertMany[T](l: List[T], collection: BSONCollection)
                 (implicit ec: ExecutionContext, w: BSONDocumentWriter[T]): Future[UpsertResult[T]] = {
  Future.sequence(l.map(o => collection.save(o).map(r => (o, r))))
    .transform({ results =>
      val failed: List[T] = results.filter(!_._2.ok).unzip._1
      val nUpd = results.count(_._2.updatedExisting)
      UpsertResult(nUpd, results.size - nUpd - failed.size, failed)
    }, t => t)
}
Is there an out-of-the-box way of upserting many records at once using the reactivemongo API alone?
I am a MongoDB beginner so this might sound trivial to many. Any help is appreciated!

Mongo has no support for upserting multiple documents in one query. The update operation, for example, can only ever insert up to one new document. So this is not a flaw in the ReactiveMongo driver; there simply is no DB command that achieves the result you expect. Iterating over the documents you want to upsert is the right way to do it, as sketched below.
The MongoDB manual on upserts contains further information:
http://docs.mongodb.org/manual/core/update/#update-operations-with-the-upsert-flag
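For illustration, a minimal sketch of that per-document loop, written against the same imports and implicits as the snippet in the question. This is an assumption-laden sketch, not an out-of-the-box bulk API: it assumes each value serialises to a document that already carries its own _id, and that your ReactiveMongo version's BSONCollection.update accepts an upsert flag and returns a result with an ok field.
def upsertOneByOne[T](docs: List[T], collection: BSONCollection)
                     (implicit ec: ExecutionContext, w: BSONDocumentWriter[T]): Future[Int] =
  Future.traverse(docs) { o =>
    val doc = w.write(o)                                                  // serialise the value
    val selector = BSONDocument("_id" -> doc.getAs[BSONObjectID]("_id"))  // match on the document's own id
    collection.update(selector, doc, upsert = true)                       // insert if absent, replace otherwise
      .map(r => if (r.ok) 1 else 0)
  }.map(_.sum)                                                            // number of successful writes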

According to the docs, BSONCollection.save inserts the document, or updates it if it already exists in the collection: see here. Now, I'm not sure exactly how it makes the decision about whether the document already exists or not: presumably it's based on what MongoDB tells it, so the primary key/_id or a unique index.
In short: I think you're doing it the right way (including your result counts from LastError).
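To make that "presumably" concrete, here is a rough guess at what save amounts to; this is not the driver's source, just the behaviour implied by the docs, with the decision point being the presence of an _id in the serialised document:
def saveRoughly(doc: BSONDocument, collection: BSONCollection)
               (implicit ec: ExecutionContext) =
  doc.getAs[BSONObjectID]("_id") match {
    case Some(id) => collection.update(BSONDocument("_id" -> id), doc, upsert = true)        // keyed upsert
    case None     => collection.insert(doc ++ BSONDocument("_id" -> BSONObjectID.generate))  // fresh insert
  }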

Related

How to implement pagination for fastapi with mongo db(Motor)

I have a simple REST API, a book store, created with FastAPI and MongoDB as the backend (I have used Motor as the library instead of PyMongo). I have a GET endpoint to get all the books in the database, which also supports query strings (for example: a user can search for books by a single author, or by genre type, etc.).
Below is the corresponding code for this endpoint:
routers.py
@router.get("/books", response_model=List[models.AllBooksResponse])
async def get_the_list_of_all_books(
    authors: Optional[str] = None,
    genres: Optional[str] = None,
    published_year: Optional[str] = None,
) -> List[Dict[str, Any]]:
    if authors is None and genres is None and published_year is None:
        all_books = [book for book in await mongo.BACKEND.get_all_books()]
    else:
        all_books = [
            book
            for book in await mongo.BACKEND.get_all_books(
                authors=authors.strip('"').split(",") if authors is not None else None,
                genres=genres.strip('"').split(",") if genres is not None else None,
                published_year=datetime.strptime(published_year, "%Y")
                if published_year is not None
                else None,
            )
        ]
    return all_books
The corresponding model :
class AllBooksResponse(BaseModel):
    name: str
    author: str
    link: Optional[str] = None

    def __init__(self, name, author, **data):
        super().__init__(
            name=name, author=author, link=f"{base_uri()}book/{data['book_id']}"
        )
And the backend function for getting the data:
class MongoBackend:
    def __init__(self, uri: str) -> None:
        self._client = motor.motor_asyncio.AsyncIOMotorClient(uri)

    async def get_all_books(
        self,
        authors: Optional[List[str]] = None,
        genres: Optional[List[str]] = None,
        published_year: Optional[datetime] = None,
    ) -> List[Dict[str, Any]]:
        find_condition = {}
        if authors is not None:
            find_condition["author"] = {"$in": authors}
        if genres is not None:
            find_condition["genres"] = {"$in": genres}
        if published_year is not None:
            find_condition["published_year"] = published_year
        cursor = self._client[DB][BOOKS_COLLECTION].find(find_condition, {"_id": 0})
        return [doc async for doc in cursor]
Now I want to implement pagination for this endpoint. I have a few questions:
Is it better to do pagination at the database level or the application level?
Do we have some out-of-the-box libraries which can help me do that in FastAPI? I checked the documentation for https://pypi.org/project/fastapi-pagination/, but it seems to be targeted more towards SQL databases.
I also checked out this link: https://www.codementor.io/@arpitbhayani/fast-and-efficient-pagination-in-mongodb-9095flbqr which talks about different ways of doing this in MongoDB, but I think only the first option (using limit and skip) would work for me, because I also want it to work with the other filter parameters (for example author and genre), and there is no way I can know the ObjectIds unless I make a first query to get the data before paginating.
But the issue is that everywhere I look, using limit and skip is discouraged.
Can someone please let me know what the best practices are here and whether any of them apply to my requirement and use case?
Many thanks in advance.
There is no right or wrong answer to such a question. A lot depends on the technology stack you use and on the context you have, considering also the future directions of both the software you wrote and the software you use (Mongo).
Answering your questions:
It depends on the load you have to manage and the dev stack you use. Usually it is done at the database level, since retrieving the first 110 records and then dropping the first 100 at application level is quite wasteful and resource-consuming (the database will do it for you).
To me it seems pretty simple to do via FastAPI: just add the parameters limit: int = 10 and skip: int = 0 to your get function and use them in the filtering function of your database. FastAPI will check the data types for you, while you could check that limit is not negative or above, say, 100.
The article says there is no silver bullet and that the skip function of Mongo does not perform well, so the author believes the second option is better, purely for performance. If you have billions and billions of documents (e.g. Amazon), well, it may make sense to use something different, though by the time your website has grown that much, I guess you'll have the money to pay an entire team of experts to sort things out and possibly develop your own database.
TL;DR
Concluding, the limit and skip approach is the most common one. It is usually done at the database level, in order to reduce the application's workload and the bandwidth.
Mongo is not very efficient at skipping and limiting results, but if your database has, say, a million documents, I don't think you'll even notice. You could even use a relational database for such a workload. You can always benchmark the options you have and choose the most appropriate one.
I don't know much about Mongo, but I know that, generally, indexes can help with limiting and skipping records (documents in this case), though I'm not sure whether that's the case for Mongo as well.
You can use this package to paginate:
https://pypi.org/project/fastapi-paginate
How to use it:
https://github.com/nazmulnnb/fastapi-paginate/blob/main/examples/pagination_motor.py

Updating Data in MongoDB from Apache Spark Streaming

I am using the Scala API of Apache Spark Streaming to read from a Kafka server, with a window size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. The values are supposed to be reduced by key and window (reduceByKeyAndWindow) and saved to Mongo.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double, y: Double) => (x + y),
  Seconds(60), Seconds(60))

reduced.foreachRDD({ rdd =>
  import spark.implicits._
  val aggregatedPower = rdd.map({ x => MyJsonObj(x._2, x._1) }).toDF()
  aggregatedPower.write.mode("append").mongo()
})
This works so far; however, it is possible that some messages arrive with a delay of a minute, which leads to two JSON objects with the same timestamp in the database.
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp in the current window is already in the database and if so update this value?
Is there a smart way to query whether the timestamp in the current window is already in the database and if so update this value?
It seems that what you're looking for is a MongoDB operation called upsert, where an update operation inserts a new document if the criteria have no match, and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark DataFrame contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and new documents whose _id value does not yet exist in the collection will be inserted.
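To illustrate that point, a minimal sketch of the write side under two assumptions that are mine, not the answer's: the Kafka key is the timestamp string, and reusing it as the document _id is acceptable for your collection. Note that with this alone the second write replaces the earlier aggregate for the same timestamp instead of adding to it, which is why the $match-and-modify step below is still needed if you want to combine the values.
reduced.foreachRDD { rdd =>
  import spark.implicits._
  val aggregatedPower = rdd
    .map { case (timestamp, power) => (timestamp, power, timestamp) } // reuse the window timestamp as the key
    .toDF("timestamp", "power", "_id")                                // the _id column triggers upsert semantics
  aggregatedPower.write.mode("append").mongo()
}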
Now you could try to create an RDD using MongoDB Spark Aggregation, specifying a $match filter to query where timestamp matches the current window:
val aggregatedRdd = rdd.withPipeline(Seq(
  Document.parse(
    "{ $match: { timestamp : '2018-03-14 15:00' } }"
  )))
Modify the value of the power field, and then write with mode("append").
You may also find the blog Data Streaming MongoDB/Kafka useful, if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic with the MongoDB Java Driver.

How does mongoose populate work under the hood

Could somebody tell me how this works under the hood?
I have a collection
a {
  b: String
  c: Date
  d: ObjectId --> j
}

j {
  k: String
  l: String
  m: String
}
when I carry out a:
a.find({ b: 'thing' }).populate('d').exec(etc..)
is this, in the background, actually carrying out two queries against MongoDB in order to return all the 'j' items?
I have no issue getting populate to work; what concerns me are the performance implications.
Thanks
Mongoose uses two queries to fulfill the request.
The a collection is queried to get the docs that match the main query, and then the j collection is queried to populate the d field in the docs.
You can see the queries Mongoose is using by enabling debug output:
mongoose.set('debug', true);
Basically, the model 'a' contains an attribute 'd' which references (points to) the model 'j'.
So whenever we use
a.find({ b: 'thing' }).populate('d').exec(etc..)
Then, through populate, we can individually access properties of 'j', like:
d.k
d.l
d.m
populate() helps us access properties of other models.
Adding to @JohnnyHK's answer on the performance implications you are worried about: I believe these queries have to execute sequentially no matter what, whether you use the populate() method Mongoose provides or one you implement yourself on the server side; both will have the same time complexity.
This is because, in order to populate, we need the results from the first query; once we have them, the referenced ids are used to query the documents in the other collection.
So I believe it's a waste to implement this on the server side yourself rather than use the method Mongoose provides. The performance will remain the same.

Mapping MongoDB documents to case class with types but without embedded documents

Subset looks like an interesting, thin MongoDB wrapper.
In one of the examples given, there are Tweets and Users; however, User is a subdocument of Tweet. In classical SQL, this would be normalized into two separate tables with a foreign key from Tweet to User. In MongoDB, this wouldn't necessitate a DBRef; storing the user's ObjectId would be sufficient.
Both in Subset and Salat this would result in these case classes:
case class Tweet(_id: ObjectId, content: String, userId: ObjectId)
case class User(_id: ObjectId, name: String)
So there's no guarantee that the ObjectId in Tweet actually resolves to a User (making it less typesafe). I also have to write the same query for each class that references User (or move it to some trait).
So what I'd like to achieve is to have case class Tweet(_id: ObjectId, content: String, userId: User), in code, and the ObjectId in the database. Is this possible, and if so, how? What are good alternatives?
Yes, it's possible. Actually it's even simpler than having a "user" sub-document in a "tweet". When "user" is a reference, it is just a scalar value; MongoDB and "Subset" have no mechanisms to query subdocument fields.
I've prepared a simple REPLable snippet of code for you (it assumes you have two collections -- "tweets" and "users").
Preparations...
import org.bson.types.ObjectId
import com.mongodb._
import com.osinka.subset._
import Document.DocumentId
val db = new Mongo("localhost") getDB "test"
val tweets = db getCollection "tweets"
val users = db getCollection "users"
Our User case class
case class User(_id: ObjectId, name: String)
A number of fields for tweets and user
val content = "content".fieldOf[String]
val user = "user".fieldOf[User]
val name = "name".fieldOf[String]
Here, more complicated things start to happen. What we need is a ValueReader that is capable of getting an ObjectId based on the field name, but then goes to another collection and reads an object from there.
This can be written as a single piece of code that does everything at once (you may see such a variant in the answer history), but it would be more idiomatic to express it as a combination of readers. Suppose we have a ValueReader[User] that reads from a DBObject:
val userFromDBObject = ValueReader({
  case DocumentId(id) ~ name(name) => User(id, name)
})
What's left is a generic ValueReader[T] that expects an ObjectId and retrieves an object from a specific collection using the supplied underlying reader:
class RefReader[T](val collection: DBCollection, val underlying: ValueReader[T]) extends ValueReader[T] {
  override def unpack(o: Any): Option[T] =
    o match {
      case id: ObjectId =>
        Option(collection findOne id) flatMap { underlying.unpack _ }
      case _ =>
        None
    }
}
Then, we may say our type class for reading Users from references is merely
implicit val userReader = new RefReader[User](users, userFromDBObject)
(I am grateful to you for this question, since this use case is quite rare and I had no real motivation to develop a generic solution. I think I finally need to include this kind of helper in "Subset". I would appreciate your feedback on this approach.)
And this is how you would use it:
import collection.JavaConverters._

tweets.find.iterator.asScala foreach {
  case Document.DocumentId(id) ~ content(content) ~ user(u) =>
    println("%s - %s by %s".format(id, content, u))
}
Alexander Azarov's answer probably works fine, but I would personally not do it this way.
What you have is a Tweet that only has an ObjectId reference to the user.
And you want to load the user when loading the tweet, because for your domain it is probably easier to manipulate. In any case, unless you use subdocuments (not always a good choice), you have to query the DB again to retrieve the user data, and this is what is done in Alexander Azarov's answer.
You would rather write a transformation function that transforms a Tweet into a TweetWithUser, or something like that:
def transform(tweet: Tweet) = TweetWithUser( tweet.id, tweet.content, findUserWithId(tweet.userId) )
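For clarity, a rough sketch of the two pieces that one-liner assumes; TweetWithUser and findUserWithId are illustrative names from the line above, not anything provided by Subset or Salat, and the lookup reuses the users collection and User case class from the previous answer:
case class TweetWithUser(_id: ObjectId, content: String, user: Option[User])

def findUserWithId(id: ObjectId): Option[User] =
  Option(users findOne id) map { o =>
    User(o.get("_id").asInstanceOf[ObjectId], o.get("name").asInstanceOf[String]) // plain casts, no reader machinery
  }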
I don't really see why you would expect a framework to resolve something that you could do yourself very easily in a single line of code.
And remember that in your application there are cases where you don't even need the whole User object, so it is expensive to query the database twice when it's not always needed. You should only use the case class with the full User data when you really need that user data, and not always load it just because it seems more convenient.
Or, if you want to manipulate User objects anyway, you could have a User proxy on which you can access the id attribute directly, while any other access triggers a DB query. In Java/SQL, Hibernate does this with lazy loading of relationships, but I'm not sure it's a good idea to use that with MongoDB, and it breaks immutability.

How to retrieve all objects in a Mongodb collection including the ids?

I'm using Casbah and Salat to create my own MongoDB DAO and am implementing a getAll method like this:
val dao: SalatDAO[T, ObjectId]
def getAll(): List[T] = dao.find(ref = MongoDBObject()).toList
What I want to know is:
Is there a better way to retrieve all objects?
When I iterate through the objects, I can't find the object's _id. Is it excluded? How do I include it in the list?
1°/ The ModelCompanion trait provides a def findAll(): SalatMongoCursor[ObjectType] = dao.find(MongoDBObject.empty) method. You will have to issue a dedicated request for every collection your database has.
If you iterate over the returned objects, it may be better to iterate directly with the SalatMongoCursor[T] returned by dao.find rather than doing two iterations (one with toList from the Iterator trait, then another over your List[T]).
2°/ Salat maps the Mongo _id key to your class's id field: if you define a class with an id: ObjectId field, that field is mapped to the _id key.
You can change this behaviour using the @Key annotation, as pointed out in the Salat documentation; a small example follows.
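A minimal sketch of that annotation in use; the Book case class and its fields are made up for illustration, and only the @Key import and annotation come from Salat:
import com.novus.salat.annotations._
import org.bson.types.ObjectId

case class Book(@Key("_id") id: ObjectId, title: String) // the "id" field is persisted as "_id"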
I implemented something like:
MyDAO.ids(MongoDBObject("_id" -> MongoDBObject("$exists" -> true)))
This fetches all the ids, but given the wide range of what you might be doing, it's probably not the best solution for every situation. Right now, I'm building a small system with 5 records of data, and using this to help understand how MongoDB works.
If this were a production database with 1,000,000 entries, then this (or any getAll query) would be a bad idea. Instead, consider writing a targeted query that goes after the results you actually need.