I have a simple REST API for a book store, built with FastAPI and MongoDB as the backend (I'm using Motor as the driver instead of PyMongo). I have a GET endpoint that returns all the books in the database and also supports query strings (for example, a user can search for books by a single author, by genre, etc.).
Below is the corresponding code for this endpoint:
routers.py
#router.get("/books", response_model=List[models.AllBooksResponse])
async def get_the_list_of_all_books(
authors: Optional[str] = None,
genres: Optional[str] = None,
published_year: Optional[str] = None,
) -> List[Dict[str, Any]]:
if authors is None and genres is None and published_year is None:
all_books = [book for book in await mongo.BACKEND.get_all_books()]
else:
all_books = [
book
for book in await mongo.BACKEND.get_all_books(
authors=authors.strip('"').split(",") if authors is not None else None,
genres=genres.strip('"').split(",") if genres is not None else None,
published_year=datetime.strptime(published_year, "%Y")
if published_year is not None
else None,
)
]
return all_books
The corresponding model:
class AllBooksResponse(BaseModel):
    name: str
    author: str
    link: Optional[str] = None

    def __init__(self, name, author, **data):
        super().__init__(
            name=name, author=author, link=f"{base_uri()}book/{data['book_id']}"
        )
And the backend function for getting the data:
class MongoBackend:
    def __init__(self, uri: str) -> None:
        self._client = motor.motor_asyncio.AsyncIOMotorClient(uri)

    async def get_all_books(
        self,
        authors: Optional[List[str]] = None,
        genres: Optional[List[str]] = None,
        published_year: Optional[datetime] = None,
    ) -> List[Dict[str, Any]]:
        find_condition = {}
        if authors is not None:
            find_condition["author"] = {"$in": authors}
        if genres is not None:
            find_condition["genres"] = {"$in": genres}
        if published_year is not None:
            find_condition["published_year"] = published_year
        cursor = self._client[DB][BOOKS_COLLECTION].find(find_condition, {"_id": 0})
        return [doc async for doc in cursor]
Now I want to implement pagination for this endpoint. I have a few questions:
Is it better to do pagination at the database level or at the application level?
Are there any out-of-the-box libraries that can help me do this in FastAPI? I checked the documentation for https://pypi.org/project/fastapi-pagination/, but it seems to be targeted more at SQL databases.
I also checked out this link: https://www.codementor.io/@arpitbhayani/fast-and-efficient-pagination-in-mongodb-9095flbqr, which talks about different ways of doing this in MongoDB, but I think only the first option (using limit and skip) would work for me, because I also want it to work with the other filter parameters (for example author and genre), and there is no way to know the ObjectIds unless I make a first query to get the data and only then paginate.
The issue is that everywhere I look, using limit and skip is discouraged.
Can someone let me know what the best practices are here, and whether any of them apply to my requirement and use case?
Many thanks in advance.
There is no right or wrong answer to such a question. A lot depends on the technology stack you use and on the context you have, considering also the future direction of both the software you wrote and the software you depend on (Mongo).
Answering your questions:
It depends on the load you have to manage and the dev stack you use. It is usually done at the database level, since retrieving the first 110 documents and then dropping the first 100 is quite wasteful and resource-consuming (let the database do it for you).
To me it seems pretty simple to do with FastAPI: just add the parameters limit: int = 10 and skip: int = 0 to your GET function and pass them on to the filtering function of your database. FastAPI will check the data types for you, while you can check that limit is not negative or above, say, 100.
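As an illustration, here is a minimal sketch of how that could be wired into the endpoint and backend from your question (the default page size of 10 and the upper bound of 100 are assumptions, not requirements):

from fastapi import Query

@router.get("/books", response_model=List[models.AllBooksResponse])
async def get_the_list_of_all_books(
    authors: Optional[str] = None,
    genres: Optional[str] = None,
    published_year: Optional[str] = None,
    limit: int = Query(10, ge=1, le=100),  # page size, bounded so a client cannot ask for everything
    skip: int = Query(0, ge=0),            # number of documents to skip
) -> List[Dict[str, Any]]:
    # The separate "no filters" branch is no longer needed: the backend
    # already ignores filters that are None.
    return await mongo.BACKEND.get_all_books(
        authors=authors.strip('"').split(",") if authors is not None else None,
        genres=genres.strip('"').split(",") if genres is not None else None,
        published_year=datetime.strptime(published_year, "%Y")
        if published_year is not None
        else None,
        limit=limit,
        skip=skip,
    )

And in MongoBackend.get_all_books, forward the two parameters to the cursor:

    async def get_all_books(
        self,
        authors: Optional[List[str]] = None,
        genres: Optional[List[str]] = None,
        published_year: Optional[datetime] = None,
        limit: int = 10,
        skip: int = 0,
    ) -> List[Dict[str, Any]]:
        find_condition = {}
        if authors is not None:
            find_condition["author"] = {"$in": authors}
        if genres is not None:
            find_condition["genres"] = {"$in": genres}
        if published_year is not None:
            find_condition["published_year"] = published_year
        cursor = (
            self._client[DB][BOOKS_COLLECTION]
            .find(find_condition, {"_id": 0})
            .skip(skip)
            .limit(limit)
        )
        return [doc async for doc in cursor]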
The article says there is no silver bullet and that Mongo's skip does not perform well, which is why the author believes the second option is better, purely for performance. If you have billions and billions of documents (e.g. Amazon), it may be worth using something different, though by the time your website has grown that much, I guess you'll have the money to pay an entire team of experts to sort things out and possibly develop your own database.
TL;DR
In conclusion, the limit and skip approach is the most common one. It is usually done at the database level, in order to reduce the application's workload and the bandwidth used.
Mongo is not very efficient at skipping and limiting results, but if your database has, say, a million documents, I don't think you'll even notice. You could even use a relational database for such a workload. You can always benchmark the options you have and choose the most appropriate one.
I don't know much about Mongo, but I know that, in general, indexes can help with limiting and skipping records (documents in this case); I'm not sure whether that holds for Mongo as well.
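For completeness, the article's "second option" (range or keyset pagination) avoids skip by filtering on the last _id the client has already seen. Below is a rough sketch of what such a method could look like on your MongoBackend; the method name and parameters are made up for illustration, and, as you noted in the question, this only works when you can sort on the field you paginate by:

    # Hypothetical extra method on MongoBackend; ObjectId comes from bson (installed with Motor).
    async def get_books_after(
        self,
        last_id: Optional[ObjectId] = None,  # _id of the last document of the previous page
        limit: int = 10,
    ) -> List[Dict[str, Any]]:
        # Keyset pagination: continue after the last seen _id instead of skipping,
        # so Mongo can use the (always present) _id index.
        find_condition = {}
        if last_id is not None:
            find_condition["_id"] = {"$gt": last_id}
        cursor = (
            self._client[DB][BOOKS_COLLECTION]
            .find(find_condition)
            .sort("_id", 1)
            .limit(limit)
        )
        return [doc async for doc in cursor]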
You can use this package to paginate:
https://pypi.org/project/fastapi-paginate
How to use it:
https://github.com/nazmulnnb/fastapi-paginate/blob/main/examples/pagination_motor.py
I want to query data that is two levels down; however, would I still be able to retrieve data from its original node?
To explain better, my Firebase Database looks like:
posts
  -192u3jdj0j9sj0
    -message: haha this is funny (CAN I STILL GET THIS DATA)
    -genre: comedy (CAN I STILL GET THIS DATA)
    -author
      -user: "jasonj"
    -comment
      -ajiwj2319j0jsf9d0jf
        -comment: "lol"
        -user: "David" (QUERY HERE****)
      -jfaiwjfoj1ijifjojif
        -comment: "so funny"
        -user: "Toddy"
I essentially want to query all of the comments David has posted. However, with how querying works, can I still grab the original data (message & genre) from "level 1"? Or would I have to restructure my data, possibly duplicating the level 1 data under each comment?
(End goal: something like Yahoo Answers, where a user can see the questions he posted, as well as the questions on which he posted comments.)
The code below works, but I'm not sure how to pull up the level 1 data, or whether it's even possible:
ref = Database.database().reference().child("posts").child(myPost).child("comment")
var queryRef:DatabaseQuery
queryRef = ref.queryOrdered(byChild: "user").queryEqual(toValue: "David")
queryRef.observeSingleEvent(of: .value, with: { (snapshot) in
if snapshot.childrenCount > 0 {
Your current data structure makes it easy to find the comments for a specific post. It does not, however, make it easy to find the comments from a specific author. The reason is that Firebase Database queries treat your content as a flat list of nodes, and the value you want to filter on must be at a fixed path under each node.
To allow finding the comments from a specific author, you'll want to add an additional node where you keep that information. For example:
"authorComments": {
"David": {
"-192u3jdj0j9sj0_-ajiwj2319j0jsf9d0jf": true
},
"Toddy": {
"-192u3jdj0j9sj0_-jfaiwjfoj1ijifjojif": true
}
}
This structure is often known as a reverse index, and it allows you to easily find the comment paths (I used a _ as the separator of path segments above) for a specific user.
This sort of data duplication is quite common when using NoSQL databases, as you often have to modify/expand your data structure to allow the use-cases that your app needs.
Also see my answers here:
Firebase Query Double Nested
Firebase query if child of child contains a value
Given a Meteor application that has multiple collections that need to be displayed together in a paged, Facebook-style timeline view, I'm trying to decide on the best way to handle the publication of this data.
The requirements are as follows:
Documents from different collections may be intermingled in the timeline view.
The items should be sorted by a common field (the date, for example)
There should be a paged-display limit with a "Load More..." button
To solve this problem I can see two possible approaches...
Approach 1 - Overpublish
Currently I have different collections for each type of data. This poses a problem for efficiently publishing the information that I need. For example, if the current display limit is 100, then I need to publish 100 elements of each collection in order to be sure of displaying the latest 100 elements on the screen.
An example may make this clearer. Assume that the timeline display shows results from collections A, B, C and D. Potentially only one of those collections may have any data, so to be sure that I have enough data to display 100 items I'll need to fetch 100 items from each collection. In that case, however, I could be fetching and sending 400 items instead!
That's really not good at all.
Then, on the client side, I need to handle merging these collections so that I show the documents in order, which probably isn't a trivial task.
Approach 2 - Combine all the collections
The second approach that occurs to me it to have one enormous server side collection of generic objects. That is, instead of having collections A, B, C, and D, I'd instead have a master collection M with a type field that describes the type of data held by the document.
This would allow me to trivially retrieve the latest documents without over-publishing.
However I'm not yet sure what the full repercussions of this approach would be, especially with packages such as aldeed:autoform and aldeed:simple-schema.
My questions are:
Does anyone here have any experience with these two approaches? If so, what other issues should I be aware of?
Can anyone here suggest an alternative approach?
I'd use the second approach, but do not put everything in there...
What I mean is that for your timeline you need events, so you'd create an events collection that stores the basic information for each event (date, owner_id, etc.). You'd also add the type of the event and the id of the matching document in the other collection. That way you keep your events just small enough to publish everything that is needed, and then grab more details only when there is a need.
You could then either just publish your events, or publish the cursors of the other collections at the same time, using the _id's so as not to over-publish. That events collection will also become very handy for matching documents, for example if the user wants to see what in his timeline is related to user X or city Y...
I hope it helps you out.
I finally came up with a completely different approach.
I've created a server publication that returns the list of items ids and types to be displayed. The client can then fetch these from the relevant collections.
This allows me to maintain separate collections for each type, thus avoiding issues related to trying to maintain a Master collection type. Our data-model integrity is preserved.
At the same time I don't have to over-publish the data to the client. The workload on the server to calculate the ID list is minimal, and in my opinion this approach outweighs the disadvantages of the other two by quite a long way.
The basic publication looks like this (in Coffeescript):
Meteor.publish 'timeline', (options, limit) ->
  check options, Object
  check limit, Match.Optional Number

  sub = this
  limit = Math.min limit ? 10, 200

  # We use peerlibrary:reactive-mongo to enable Meteor reactivity on the server
  @ids = {}

  tracker = Tracker.autorun =>
    # Run a find operation on the collections that can be displayed in the timeline,
    # and add the ids to an array
    collections = ['A', 'B']
    items = []
    for collectionName in collections
      collection = Mongo.Collection.get collectionName
      collection.find({}, { fields: { updatedOn: 1 }, limit: limit, sort: { updatedOn: -1 }}).forEach (item) ->
        item.collection = collectionName
        items.push item

    # Sort the array (newest first, on updatedOn -- the only field we fetched) and crop it to the required length
    items = items.sort (a, b) -> new Date(b.updatedOn) - new Date(a.updatedOn)
    items = items[0...limit]

    newIds = {}

    # Add/remove the ids from the 'timeline' collection
    for doc in items
      id = doc._id
      newIds[id] = true
      # Add this id to the publication if we didn't have it before
      if not @ids[id]?
        @ids[id] = moment doc.updatedOn
        sub.added 'timeline', id, { collection: doc.collection, docId: id, updatedOn: doc.updatedOn }
      # If the update time has changed then it needs republishing
      else if not moment(doc.updatedOn).isSame @ids[id]
        @ids[id] = doc.updatedOn
        sub.changed 'timeline', id, { collection: doc.collection, docId: id, updatedOn: doc.updatedOn }

    # Check for items that are no longer in the result
    for id of @ids
      if not newIds[id]?
        sub.removed 'timeline', id
        delete @ids[id]

  sub.onStop ->
    tracker.stop()

  sub.ready()
Note that I'm using peerlibrary:reactive-publish for the server-side autorun.
The queries fetch just the latest ids from each collection; these are then placed into a single array, sorted by date, and the array is cropped to the current limit.
The resulting ids are then added to the timeline collection, which provides for a reactive solution on the client.
On the client it's simply a matter of subscribing to this collection, and then setting up the individual item subscriptions themselves. Something like this:
Template.timelinePage.onCreated ->
  @autorun =>
    @limit = parseInt(Router.current().params['limit']) || 10
    sub = @subscribe 'timeline', {}, @limit
    if sub.ready()
      items = Timeline.find().fetch()
      As = _.pluck _.where(items, { collection: 'a' }), 'docId'
      @aSub = @subscribe 'a', { _id: { $in: As }}
      Bs = _.pluck _.where(items, { collection: 'b' }), 'docId'
      @bSub = @subscribe 'b', { _id: { $in: Bs }}
Finally, the template can iterate over the timeline subscription and display the appropriate item based on its type.
Could somebody tell me how populate() actually works behind the scenes? I have a collection:
a {
  b: String
  c: Date
  d: ObjectId --> j
}

j {
  k: String
  l: String
  m: String
}
When I carry out a query like:
a.find({ b: 'thing' }).populate('d').exec(etc..)
is this actually carrying out two queries against MongoDB in the background in order to return all the 'j' items?
I have no issues getting populate to work; what concerns me is the performance implications of the task.
Thanks
Mongoose uses two queries to fulfill the request.
The a collection is queried to get the docs that match the main query, and then the j collection is queried to populate the d field in the docs.
You can see the queries Mongoose is using by enabling debug output:
mongoose.set('debug', true);
Basically, the model 'a' contains an attribute 'd' which references (points to) the model 'j'.
So whenever we use
a.find({ b: 'thing' }).populate('d').exec(etc..)
then through populate we can individually access the properties of 'j', such as:
d.k
d.l
d.m
populate() helps us access the properties of other models.
Adding to @JohnnyHK's answer on the performance implications you were worried about: I believe that, no matter what, these queries have to execute sequentially, whether we use the Mongoose-provided populate() method or one you implement yourself on the server side; both will have the same time complexity.
This is because, in order to populate, we need the results from the first query; only after getting them can the referenced ids be used to query the documents in the other collection.
So I believe it's a waste to implement this yourself on the server side rather than use the Mongoose-provided method; the performance will remain the same.
My question is quite simple: I'd like to perform a GROUP BY-like statement with MongoDB using the OPAlang high-level database API, but I don't think that is possible(?)
If I do want to perform a MongoDB $group operation, do I necessarily need to use the low-level API (stdlib.apis.mongo)?
Finally, can I use both low-level and high-level APIs to communicate with my MongoDB ?
Thanks.
I am afraid that, taking into account the latest published Opa compiler code, no aggregation is supported :( See the thread in the Opa forum. Also note Quentin's comment about using both the low- and high-level APIs:
"You can use this [low level] library and built-in [hight level] library together, [...]"
See the auto-increment implementation advice from the MLstate guys in this thread. Note the high-level DB field /next_id definition, initialized with a low-level read and increment.
I just had a different idea.
All MongoDB commands (e.g. the group command you are using) are accessible through the virtual collection named $cmd. You just ask the server to find the document {command_name: command_parameter, additional: "options", are: ["listed", "here"]}. You should be able to use every fancy feature of your MongoDB server that is not yet supported by the Opa API with a single find query. This includes the aggregation framework introduced in version 2.2 and the full-text search still in beta as of version 2.4.
For example, I want to use the new text command to search the full-text index of collection coll_name for the query string query. I am currently using this code (where onsuccess is the function that parses the answer and gets the ids of the documents found):
{ search: query, project: {_id:0, id:1}, }
|> Bson.opa2doc
|> MongoCommands.simple_str_command_opts(ll_db, db_name, "text", coll_name, opts)
|> MongoCommon.outcome_map(_, onsuccess, onfailure)
And if you take a look at the source code of the API, simple_str_command_opts is implemented as a findOne() against Mongo.
But instead I could use the high level DB support:
/test/`$cmd`[{text: coll_name, search: query, project: {_id: 0, id: 1}}]
What you have to do is declare the high-level DB collection with a type that includes:
all the fields that you use to make the query,
all the fields that you can get in a possible answer.
For the text command:
type commands = {
  // command
  string text,
  // query
  string search,
  {
    int _id,
    int id,
  } project,
  // result of executing command "text"
  string queryDebugString,
  string language,
  list({
    float score,
    {int id} obj,
  }) results,
  {
    int nscanned,
    int nscannedObjects,
    int n,
    int nfound,
    int timeMicros,
  } stats,
  int ok,
  // in case of failure (`ok: 0`)
  string errmsg,
}
Unfortunately, it does not work :( During application start-up, the Opa run-time DB support tries to create a unique index for the primary key of the set (for the following example, {text, search, project}):
database test {
  article /article[{id}]
  commands /`$cmd`[{text, search, project}]
}
Using a primary key is necessary, since you have to use findOne(), not find(). But creating an index on the virtual collection $cmd is not allowed, so DB initialization fails.
If you find a way to stop Opa from creating the index, you will be able to use all the fancy features of Mongo with nothing more than the high-level API ;)
I'm looking for functionality similar to Postgres' DISTINCT ON.
I have a collection of documents {user_id, current_status, date}, where the status is just text and date is a Date. I'm still in the early stages of wrapping my head around Mongo and getting a feel for the best way to do things.
Would map-reduce be the best solution here (map emits everything, and reduce keeps a record of the latest one), or is there a built-in solution without pulling out map-reduce?
There is a distinct command; however, I'm not sure that's what you need. distinct is kind of a "query" command, and with lots of users you're probably going to want to roll the data up rather than compute it in real time.
Map-Reduce is probably one way to go here.
Map phase: your key would simply be the user ID. Your value would be something like {current_status: 'blah', date: 1234}.
Reduce phase: given an array of values, you grab the most recent one and return only it.
To make this work optimally you'll probably want to look at a new feature from 1.8.0, the "re-reduce" feature, which will allow you to process only new data instead of re-processing the whole status collection.
The other way to do this is to build a "most-recent" collection and tie the status insert to that collection. So when you insert a new status for the user, you update their "most-recent".
Depending on the importance of this feature, you could possibly do both things.
Current solution that seems to be working well.
map = function () {emit(this.user.id, this.created_at);}
//We call new date just in case somethings not being stored as a date and instead just a string, cause my date gathering/inserting function is kind of stupid atm
reduce = function(key, values) { return new Date(Math.max.apply(Math, values.map(function(x){return new Date(x)})))}
res = db.statuses.mapReduce(map,reduce);
Another way to achieve the same result would be to use the group command, which is a kind of map-reduce shortcut that lets you aggregate on a specific key or set of keys.
In your case it would read like this:
db.coll.group({
  key: { user_id: true },
  reduce: function(obj, prev) {
    // keep the most recent document for each user
    if (prev.date === null || new Date(obj.date) > new Date(prev.date)) {
      prev.status = obj.status;
      prev.date = obj.date;
    }
  },
  initial: { status: "", date: null }
})
However, unless you have a rather small, fixed number of users, I strongly believe that a better solution would be, as previously suggested, to keep a separate collection containing only the latest status message for each user.