MapReduce on child objects not embedded - mongodb

I am having a problem with creating a mapreduce algorithm that will get me the stats i need. I have a user object that can create a post and a post can have many likes by other users.
User
--Post
----Likes
The Post is not embedded in the user because we access posts separately and not just in a user context. The stat I need is the number of likes an author has gotten and i need to get this through the likes of the posts of a user. The problem is that because the posts are not embedded, I cannot access them in my map function. Here are the map and reduce functions I currently have
def reputation_map
<<-MAP
function() {
var posts = db.posts.find({user_id:this._id});
emit(this._id, {posts:posts});
}
MAP
end
def reputation_reduce
<<-REDUCE
function(key, values) {
var count = 0;
while(values.hasNext()){
values.next();
count+=1;
}
return {posts:count};
}
REDUCE
end
This should only return the posts for each user so I have not even gotten to the likes level yet but instead of a count, this only returns a dbquery for posts. What is the correct way of doing this?

Map Reduce is really designed to operate on a single collection at a time.
Technically, it is possible to query a separate collection from inside a Map function as you have done, but take caution as this is not recommended nor supported. you may run into issues, especially if the collection is sharded.
A similar question was asked a while back: How to call to mongodb inside my map/reduce functions? Is it a good practice?
If you are aggregating results from multiple collections, you may find that the safest and most straight-forward way to do it is in the application.
Alternatively, if likes per author is a value that will be searched for with some frequency, it may be preferable to include it as a value in each document, and spend a little more overhead on each update to increment this value, rather than periodically performing a potentially resource-heavy calculation of all the votes per author.
Hopefully this will give you some food for thought for retrieving the values that you need to.
If you would like some assistance writing a Map Reduce operation for a single collection, the Community is here to help. Please include a sample input document, and a description of the desired output.
For more information on Map Reduce, the documentation may be found here:
http://www.mongodb.org/display/DOCS/MapReduce
Additionally, there are some good Map Reduce examples in the MongoDB Cookbook:
http://cookbook.mongodb.org/
The "Extras" section of the cookbook article "Finding Max And Min Values with Versioned Documents" http://cookbook.mongodb.org/patterns/finding_max_and_min/ contains a good step-by-step walkthrough of a Map Reduce operation, explaining how the functions are executed.

Related

Firestore pagination by offset

I would like to create two queries, with pagination option. On the first one I would like to get the first ten records and the second one I would like to get the other all records:
.startAt(0)
.limit(10)
.startAt(9)
.limit(null)
Can anyone confirm that above code is correct for both condition?
Firestore does not support index or offset based pagination. Your query will not work with these values.
Please read the documentation on pagination carefully. Pagination requires that you provide a document reference (or field values in that document) that defines the next page to query. This means that your pagination will typically start at the beginning of the query results, then progress through them using the last document you see in the prior page.
From CollectionReference:
offset(offset) → {Query}
Specifies the offset of the returned results.
As Doug mentioned, Firestore does not support Index/offset - BUT you can get similar effects using combinations of what it does support.
Firestore has it's own internal sort order (usually the document.id), but any query can be sorted .orderBy(), and the first document will be relative to that sorting - only an orderBy() query has a real concept of a "0" position.
Firestore also allows you to limit the number of documents returned .limit(n)
.endAt(), .endBefore(), .startAt(), .startBefore() all need either an object of the same fields as the orderBy, or a DocumentSnapshot - NOT an index
what I would do is create a Query:
const MyOrderedQuery = FirebaseInstance.collection().orderBy()
Then first execute
MyOrderedQuery.limit(n).get()
or
MyOrderedQuery.limit(n).get().onSnapshot()
which will return one way or the other a QuerySnapshot, which will contain an array of the DocumentSnapshots. Let's save that array
let ArrayOfDocumentSnapshots = QuerySnapshot.docs;
Warning Will Robinson! javascript settings is usually by reference,
and even with spread operator pretty shallow - make sure your code actually
copies the full deep structure or that the reference is kept around!
Then to get the "rest" of the documents as you ask above, I would do:
MyOrderedQuery.startAfter(ArrayOfDocumentSnapshots[n-1]).get()
or
MyOrderedQuery.startAfter(ArrayOfDocumentSnapshots[n-1]).onSnapshot()
which will start AFTER the last returned document snapshot of the FIRST query. Note the re-use of the MyOrderedQuery
You can get something like a "pagination" by saving the ordered Query as above, then repeatedly use the returned Snapshot and the original query
MyOrderedQuery.startAfter(ArrayOfDocumentSnapshots[n-1]).limit(n).get() // page forward
MyOrderedQuery.endBefore(ArrayOfDocumentSnapshots[0]).limit(n).get() // page back
This does make your state management more complex - you have to hold onto the ordered Query, and the last returned QuerySnapshot - but hey, now you're paginating.
BIG NOTE
This is not terribly efficient - setting up a listener is fairly "expensive" for Firestore, so you don't want to do it often. Depending on your document size(s), you may want to "listen" to larger sections of your collections, and handle more of the paging locally (Redux or whatever) - Firestore Documentation indicates you want your listeners around at least 30 seconds for efficiency. For some applications, even pages of 10 can be efficient; for others you may need 500 or more stored locally and paged in smaller chucks.

Nosql database design - MongoDB

I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way of designing the database structure (documents). First I though to put everything to its own schema like above. 3 Documents. But since this is a nosql db, I should actually try to eliminate the needs for a join. And now I am really thinking of putting everything to a single document, which also sounds crazy.
These are my actually queries from ui:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any join, but also save i.e. the title of topics multiple times.. in every comment..
I just could not decide.
I appreciate any help.
You can do the second design you suggested but it all comes down to how you want to use the data. I assume you’re going to be using it for a website.
If you want the comments to be clickable, in such that clicking on the topic name will redirect to the topic’s page or clicking the username will redirect to the user’s page where you can see all his comments, i suggest you keep them as IDs. Since you can later use .populate(“field1 field2”) and you can select the fields you would like to get from that ID.
Alternatively you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections with one field each, are optional, to maintain uniqueness.
Note the group-by queries will be aggregation queries, for example, the main query will be like this:
db.comments.aggregate( [
{ $match: { date: ISODate("2019-11-15") } },
{ $group: { _id: "$topic", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
] )
This will give you all the topics names, today and with highest counted topics first.
You could also take a bit different approach. Storing information redundant is not a bad thing in all cases.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this as two extra fields in your Topic entity. One describing the last date a comment was added and the second to count the amount of comments added that day. By doing so you do not need to join but can write a query that only looks at the Topic collection.
You could also store these statistics independently of the other data and update it when required. Think of this as having a document that describes your database its current state (at least those parts relevant to you).
This might give you a time penalty on storing information but it improves reading times.
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Far as I understand this one you only need the topic title. Meaning you can query the database once and retrieve all titles. If the collection grows so big this becomes slow you could trigger a refresh of the retrieval query that only returns a subset (a user is not likely to go through 100 possible topics).
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what it is you want to do then you are most likely best off storing all data in one document. However I would ask you: what is the problem making more than one query? I doubt you will be showing all comments at once when there are thousands (as you say). Instead of storing each in a separate document or throwing all in one document, you could also bucket them and retrieve only the 20 most recent ones (if you would create buckets of size 20). Read more about the bucket pattern here and update the ones shown when required.
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I"ll make an argument from a 'domain driven design' point of view.
Given that all your data exists within the same bounded context (business domain). Then it is acceptable to encapsulate it all within the same document!

MongoDB, sort an array that is the value of a field of a document, with a slice on that array

I have a Mongo collection for profile comments, which is structured like this:
{
"_id": "",
"comments": []
}
The id refers to the ID of the user from the profiles collection. For improved server power, I only want to show 10 comments per page. To do this, I use Mongo's $slice operator. Here is my code for the query.
mongoCols.profileComments.findOne({"_id":doc._id},{comments:{$slice:[(comPage-1)*10,10]}},function(err,doc) {
Here's the problem. I want to show the comments in reverse order, meaning that the newest comments are on the first page, and latest comments are on the last. I thought of a few solutions to this but they aren't very efficient.
1) I could retrieve the entire array, then use JavaScript (I'm using nodejs for this) to sort that array, then only take the 10 elements that I want. This seems inefficient because I'm asking Mongo to retrieve what is potentially a ton of elements from an array, when I only need 10.
2) I could make each comment a separate document, with a field saying what user the comment is for. I could then only find documents where the comment was sent to the requested user, and use the skip and limit options to only retrieve the 10 documents I want. My problem with this is that Mongo will have to go through almost every single comment every time you request for a user's comments. This seems inefficient, but it is my best solution so far.
I would prefer to keep the structure I currently have, but if I need to change for it to work, then I will comply.

MongoDB - is this efficient?

I have a collection users in mongodb. I am using the node/mongodb API.
In my node.js if I go:
var users = db.collection("users");
and then
users.findOne({ ... })
Is this idiomatic and efficient?
I want to avoid loading all users into application memory.
In Mongo, collection.find() returns a cursor, which is an abstract representation of the results that match your query. It doesn't return your results over the wire until you iterate over the cursor by calling cursor.next(). Before iterating over it, you can, for instance, call cursor.limit(x) to limit the number of results that it is allowed to return.
collection.findOne() is effectively just a shortcut version of collection.find().limit(1).next(). So the cursor is never allowed to return more than the one result.
As already explained, the collection object itself is a facade allowing access to the collection, but which doesn't hold any actual documents in memory.
findOne() is very efficient, and it is indeed idiomatic, though IME it's more used in dev/debugging than in real application code. It's very useful in the CLI, but how often does a real application need to just grab any one document from a collection? The only case I can think of is when a given query can only ever return one document (i.e. an effective primary key).
Further reading:
Cursors,
Read Operations
Yes, that should only load one user into memory.
The collection object is exactly that, it doesn't return all users when you create a new collection object, only when you either use a findOne or you iterate a cursor from the return of a find.

MongoEngine: How to collect embedded documents from referencing document

I have a class Post which has a list of embedded document called "comments"
Here all i want to do is to retrieve latest comments for all the posts user posted.
How can i achieve that? My current code, i just loop though the 'Post' class for that user and manually collect "comment".
But I also want this to be sorted by recently added, so have sort function to loop over manually collected comments and re-sort.
This seems like very inefficient, so asking for advise. Thanks!
Firstly if you $push onto the list with an update then you will keep the comments in order.
You can use the $slice operator to return the last x comments eg:
Post.objects(id=xxx).fields(slice__comments=-5)
However, the schema may not be efficient especially if you keep growing the number of comments, or comments can be unpublished. In that case you may want to split comments out into their own Document Model and link the comments to the Post by id. This would be two round trips to the database but offers more flexibility - eg. you could filter on date and published.