How to read paged records in Mongo when the record creator is deleted

Maybe the title is not very clear, but I'll try to explain.
There are two collections in mongo:
groups
users
Groups are created by users.
The UI sends /groups/1/10 to read the first 10 groups. We don't want to return groups whose creators (users) are deleted.
Example:
UI makes call: /groups/1/10
Let us say only 8 records are available because 2 users have been deleted from the system, and hence their groups are not available.
What should we do?
Should the UI make another request like /groups/1/2?
Should we read, say, 20 groups, take the first 10 valid ones, and return them? This may not work well for the second or third pages.

There is not enough information here to give a precise answer; in particular, we need to know more about the schema you are using. We'll give some general guidance that might point things in the right direction. We are also assuming that your endpoints are structured as /groups/<pageNumber>/<pageSize>.
Broadly speaking, if the client calls /groups/1/10 and there are (at least) 10 valid matching results, then the system should return 10 results.
It's not clear what you mean when you say:
only 8 records are available because 2 users are deleted from the system, hence their groups are not available ... Should UI make another request like: /groups/1/2 ?
The first part of that statement implies that there are only 8 valid results, but the second part implies that there are at least 2 more valid results that can be retrieved. If there are 10 valid results, then they should all be returned.
How you accomplish this depends on how invalid groups and/or deleted users are represented in your system. If, for example, the documents in your groups collection have some sort of valid field that becomes false when the creating user is deleted, then you should apply a filter to remove those results, such as:
db.groups.find({ valid: true }).limit(10)
If instead each group document references the user who created it, then you may need to do something a bit more complex, along the lines of an aggregation that does a $lookup on the users collection followed by a $match to remove the groups whose creators have been deleted.
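For illustration, a minimal sketch of that aggregation, assuming the group documents store the creator's _id in a hypothetical creatorId field:

db.groups.aggregate([
    // Pull in the creating user, if one still exists.
    { $lookup: { from: "users", localField: "creatorId", foreignField: "_id", as: "creator" } },
    // Keep only groups whose creator lookup found a user.
    { $match: { creator: { $ne: [] } } },
    // Page 1 of size 10; skip (page - 1) * pageSize for later pages.
    { $skip: 0 },
    { $limit: 10 }
])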
While there are many approaches to this problem, the only one that I would consider incorrect would be to force the client to perform the group validity check and/or force the client to make multiple requests.

Related

Nosql database design - MongoDB

I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way of designing the database structure (documents). At first I thought to give everything its own schema, as above: three documents. But since this is a NoSQL db, I should actually try to eliminate the need for joins. And now I am really thinking of putting everything into a single document, which also sounds crazy.
These are my actually queries from ui:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any join, but I will also store redundant data, e.g. the title of a topic repeated in every comment.
I just could not decide.
I appreciate any help.
You can do the second design you suggested but it all comes down to how you want to use the data. I assume you’re going to be using it for a website.
If you want the comments to be clickable, such that clicking on the topic name redirects to the topic's page or clicking the username redirects to the user's page where you can see all of their comments, I suggest you keep them as IDs. You can later use .populate("field1 field2") and select the fields you would like to get from that ID.
Alternatively you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
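A minimal Mongoose-style sketch of the reference approach (the model names, field names, and someTopicId are assumptions for illustration):

const mongoose = require('mongoose');

const commentSchema = new mongoose.Schema({
    text: String,
    createdDate: { type: Date, default: Date.now },
    author: { type: mongoose.Schema.Types.ObjectId, ref: 'Author' },
    topic: { type: mongoose.Schema.Types.ObjectId, ref: 'Topic' }
});
const Comment = mongoose.model('Comment', commentSchema);

// When rendering a topic page (someTopicId: the _id of a Topic document),
// resolve the references and select only the fields the page needs.
Comment.find({ topic: someTopicId })
    .populate('author', 'username')
    .populate('topic', 'title')
    .exec();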
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections, with one field each, are optional, and serve to maintain uniqueness.
Note the group-by queries will be aggregation queries, for example, the main query will be like this:
db.comments.aggregate([
    { $match: { date: ISODate("2019-11-15") } },
    { $group: { _id: "$topic", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])
This will give you all the topic names for today, with the highest-count topics first.
You could also take a somewhat different approach: storing information redundantly is not a bad thing in all cases.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this as two extra fields on your Topic entity: one holding the last date a comment was added, and one counting the comments added that day. That way you do not need a join and can write a query that only looks at the Topic collection.
You could also store these statistics independently of the other data and update it when required. Think of this as having a document that describes your database its current state (at least those parts relevant to you).
This adds a small cost when storing information, but it improves read times.
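A rough sketch of that idea (lastCommentDate and commentsToday are hypothetical field names; topicId is the _id of the topic being commented on), run whenever a comment is stored:

var startOfToday = new Date();
startOfToday.setHours(0, 0, 0, 0);

// Another comment today: bump the counter.
db.topics.updateOne(
    { _id: topicId, lastCommentDate: { $gte: startOfToday } },
    { $inc: { commentsToday: 1 }, $set: { lastCommentDate: new Date() } }
);
// First comment of the day: reset the counter.
// (A topic with no lastCommentDate yet would need a third case.)
db.topics.updateOne(
    { _id: topicId, lastCommentDate: { $lt: startOfToday } },
    { $set: { lastCommentDate: new Date(), commentsToday: 1 } }
);

The homepage query then becomes a simple find on topics sorted by commentsToday, with no join.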
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
As far as I understand this one, you only need the topic title, meaning you can query the database once and retrieve all titles. If the collection grows so big that this becomes slow, you could have the retrieval query return only a subset (a user is not likely to go through 100 possible topics).
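For example, a capped, case-insensitive title lookup might look like this (collection name assumed):

// Titles containing "X"; return only the title field, at most 10 suggestions.
db.topics.find(
    { title: { $regex: "X", $options: "i" } },
    { title: 1 }
).limit(10)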
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what you want to do, then you are most likely best off storing all the data in one document. However, I would ask: what is the problem with making more than one query? I doubt you will be showing all comments at once when there are thousands. Instead of storing each comment in a separate document, or throwing them all into one document, you could also bucket them and retrieve only the 20 most recent ones (if you create buckets of size 20). Read more about the bucket pattern here, and update the ones shown when required.
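A rough sketch of the bucket idea (the commentBuckets collection and its field names are assumptions); each document holds up to 20 comments for one topic:

// Append to the topic's current bucket, or start a new one when all are full.
db.commentBuckets.updateOne(
    { topicId: topicId, count: { $lt: 20 } },
    {
        $push: { comments: { author: authorId, text: text, date: new Date() } },
        $inc: { count: 1 }
    },
    { upsert: true }
);

// Main page of a topic: fetch only the most recent bucket.
db.commentBuckets.find({ topicId: topicId }).sort({ _id: -1 }).limit(1)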
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I"ll make an argument from a 'domain driven design' point of view.
Given that all your data exists within the same bounded context (business domain), it is acceptable to encapsulate it all within the same document!

How to optimize collection subscription in Meteor?

I'm working on a filtered live search module with Meteor.js.
Usecase & problem:
A user wants to do a search through all the users to find friends, but I cannot afford to send each user the complete users collection. The user filters the search using checkboxes, and I'd like to subscribe to only the matched users. What is the best way to do it?
I guess it would be better to create the query client-side, then send it to the method to get back the desired set of users. But I wonder: when the filtering criteria change, does the new subscription erase the old one? If a first search returns [usr1, usr3, usr5], and a later search returns [usr2, usr4], the best outcome would be to keep the first set and simply add the new one to the client-side subscribed collection.
In addition, if I then do a third search which should return [usr1, usr3, usr2, usr4], the autorun subscription would not need to send me anything, as I would already have the whole result set in my collection.
The goal is to spare processing and data transfer from the server.
I have some ideas, but I haven't coded enough of them yet to share them in an easily comprehensible way.
How would you advise me to proceed to save as much processing time and data transfer as possible?
Thank you all.
David
It depends on your application, but you'll probably send a non-empty string to a publisher which uses that string to search the users collection for matching names. For example:
Meteor.publish('usersByName', function(search) {
    check(search, String);

    // make sure the user is logged in and that search is sufficiently long
    if (!(this.userId && search.length > 2))
        return [];

    // search by case insensitive regular expression
    var selector = {username: new RegExp(search, 'i')};

    // only publish the necessary fields
    var options = {fields: {username: 1}};

    return Meteor.users.find(selector, options);
});
Also see common mistakes for why we limit the fields.
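On the client, the matching subscription can be driven reactively; a small sketch, assuming the search term is kept in a Session variable named 'search':

// Re-subscribe whenever the search term changes; Meteor diffs the
// published document sets for us (see the performance notes below).
Tracker.autorun(function() {
    Meteor.subscribe('usersByName', Session.get('search') || '');
});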
Performance
Meteor is clever enough to keep track of the current document set that each client has for each publisher. When the publisher reruns, it knows to only send the difference between the sets. So the situation you described above is already taken care of for you.
If you were subscribed for users: 1,2,3
Then you restarted the subscription for users 2,3,4
The server would send a removed message for 1 and an added message for 4.
Note this will not happen if you stopped the subscription prior to rerunning it.
To my knowledge, there isn't a way to avoid removed messages when modifying the parameters for a single subscription. I can think of two possible (but tricky) alternatives:
Accumulate the intersection of all prior search queries and use that when subscribing. For example, if a user searched for {height: 5} and then searched for {eyes: 'blue'} you could subscribe with {height: 5, eyes: 'blue'}. This may be hard to implement on the client, but it should accomplish what you want with the minimum network traffic.
Accumulate active subscriptions. Rather than modifying the existing subscription each time the user modifies the search, start a new subscription for the new set of documents, and push the subscription handle to an array. When the template is destroyed, you'll need to iterate through all of the handles and call stop() on them. This should work, but it will consume more resources (both network and server memory + CPU).
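A rough sketch of that second alternative (the template name and event selector are assumptions):

Template.search.onCreated(function() {
    this.searchHandles = [];
});

Template.search.events({
    'submit .search-form': function(event, template) {
        event.preventDefault();
        var query = template.find('input').value;
        // A plain Meteor.subscribe outside an autorun, so the earlier
        // subscriptions stay alive instead of being replaced.
        template.searchHandles.push(Meteor.subscribe('usersByName', query));
    }
});

Template.search.onDestroyed(function() {
    // Stop every accumulated subscription when the template goes away.
    this.searchHandles.forEach(function(handle) { handle.stop(); });
});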
Before attempting either of these solutions, I'd recommend benchmarking the worst case scenario without using them. My main concern is that without fairly tight controls, you could end up publishing the entire users collection after successive searches.
If you want to go easy on your server, you'll want to send as little data to the client as possible. That means every document you send to the client that is NOT a friend is waste. So let's eliminate all that waste.
Collect your filters (e.g. filters = {sex: 'Male', state: 'Oregon'}). Then call a method to search based on your filters (e.g. Users.find(filters)). Additionally, you can run your own proprietary ranking algorithm to determine the likelihood that a person is a friend, perhaps based on distance from IP address (or phone GPS history), mutual friends, etc. This will pay dividends in efficiency in a bit. Index things like GPS coordinates or other highly selective attributes, and maybe try out compound indexes. But remember: more indexes mean slower writes.
Now you've got a cursor with all possible friends, ranked from most likely to least likely.
Next, change your subscription to match those friends, but put a limit: 20 on it. Also, only send over the fields you need. That way, if a user wants to skip this step, you have only wasted 20 partial docs over the wire. Then, have an infinite scroll or 'load more' button the user can click. When they load more, it's an additive subscription, so it's not resending duplicate info. Discover Meteor describes this pattern in great detail, so I won't.
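A sketch of such a limited, additive subscription (the publication name, filter shape, and currentFilters() helper are assumptions; ReactiveVar needs the reactive-var package):

// Server: publish a trimmed slice of possible friends.
Meteor.publish('possibleFriends', function(filters, limit) {
    check(filters, Object);
    check(limit, Number);
    if (!this.userId)
        return [];
    return Meteor.users.find(filters, {
        fields: {username: 1},        // only what the UI needs
        limit: Math.min(limit, 100)   // hard cap to protect the server
    });
});

// Client: raising the limit re-runs the subscription, and the server
// only sends the newly added documents.
var limit = new ReactiveVar(20);
Tracker.autorun(function() {
    Meteor.subscribe('possibleFriends', currentFilters(), limit.get());
});
// 'Load more' button handler: limit.set(limit.get() + 20);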
After a few clicks/scrolls, the user won't find any more friends (because you were smart & sorted them) so they will stop trying & move on to the next step. If you returned 200 possible friends & they stop trying after 60, you just saved 140 docs from going through the pipeline. There's your efficiency.

REST Get items not in collection

Introduction
Let's say I have a REST service which returns items:
GET /items
GET /items/7
POST /items
etc.
We also have groups of items:
GET /groups
GET /groups/16
POST /groups
etc.
And we can then get the items in a specific group:
/items?groupid=16
This is all pretty straightforward.
Question
Now that we have a way to get items in a specific group, should we also supply a way to get items that are NOT in a specific group? Why? Because if a client wants to add items to a group, it has to know which items aren't added yet.
I see two options:
We supply some way to query the data
Don't do anything, let the client handle it.
Ad 1.
We can supply a way to query/search the data like this:
/items?groupid=!16
or
/items?q=groupid<>16
I have the feeling this leads to a never-ending stream of feature requests for search queries.
Ad 2.
The client can first get all items. Next the client can get all items in group 16. Doing a diff on these two collections gives the items not in group 16.
This way the client has to do a little more coding, work with collections, keep them in memory, etc. On the other hand, it doesn't need to learn a specific query syntax.
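A small sketch of that diff, assuming both endpoints return JSON arrays of items with an id field (run inside an async function):

// Fetch both collections, then keep the items whose id is not in the group.
const allItems = await fetch('/items').then(function(r) { return r.json(); });
const inGroup = await fetch('/items?groupid=16').then(function(r) { return r.json(); });
const inGroupIds = new Set(inGroup.map(function(item) { return item.id; }));
const notInGroup = allItems.filter(function(item) { return !inGroupIds.has(item.id); });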
Are there any best practices on this topic?
First, I'd use a different URL to get the items in a group:
GET /groups/16/items
This would return the collection resource of all items that are in group 16. To add an item to a group,
POST /groups/16/items
could be used.
Adding an item to a group will have the same result, regardless of the item already being in the group or not. In both cases the client only cares about the result: he wants the item to be in the group. If it already was, fine, if not, it is now.
So I don't see any use case for getting items that are not in a group.
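For illustration, a minimal Express-style sketch of those two routes (the data-access helpers are hypothetical):

const express = require('express');
const app = express();
app.use(express.json());

// The collection of items currently in the group.
app.get('/groups/:id/items', async function(req, res) {
    const items = await getItemsInGroup(req.params.id); // hypothetical helper
    res.json(items);
});

// Adding an item: afterwards the item is in the group, whether or not
// it already was.
app.post('/groups/:id/items', async function(req, res) {
    await addItemToGroup(req.params.id, req.body.itemId); // hypothetical helper
    res.status(204).end();
});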
Perhaps allowing the client to POST the new item to the desired group could be the way to go.
During the validation you would need anyway, you could either accept the item if it does not already exist or reject it because it is already there.
Building on the URLs of @Tichodroma, you could perhaps have:
GET /groups/16/non-items

REST Design: What HTTP verb should be used to retrieve a dynamic resource?

I have a scenario in which a REST API manages a resource we will call Group. A Group contains members, and the group resource is dynamic: whenever you retrieve it, you get the latest data. A query must run server-side to update the number of members in a group; in other words, the result of the request is to modify the data, since the results of running the query are stored.
Given a group_id it should return a minimal amount of information like:
{
    group_id: "5t7yu8i9io0op",
    group_name: "That's my name",
    size: 34
}
So a GET to this resource causes the resource to change, since a subsequent GET could return a new value for 'size'. This tells me it is not idempotent and so you should use POST to retrieve this resource. Am I correct in this conclusion?
If I am correct, do you think it is advisable to also provide a GET method that only returns the currently stored data for the group (so the size, and even the name, could be out of date)? I suppose in this case I should return a last-modified date as one of the fields, so that the user knows how up to date the resource is and can then elect to use the POST method... but then I am left wondering why anyone would do that, so why not ONLY provide the POST method and forget about GET?
Confused I am!
Thanks in advance.
[EDIT]
@Satish posted a link to the HTTP spec in his/her answer. Section 9.1.1 ends with this sentence:
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.
So in my scenario, the requester does not really care about the side effect that the value for 'size' is recomputed as a direct result of making the request. They want the group information, and it just so happens that to provide accurate, up-to-date group data, the size query must be run to update that value. While the fact that the request causes data to change suggests a POST, the user did not request that side effect, so a GET would be acceptable and more intuitive, would it not? And it would still be RESTful according to this sentence.
[2nd EDIT]
@Satish asks a very important question in the comments, so for others who read this I'll explain this problem further:
Normally you would not run the group query to update its size from a REST request. As members are added to or removed from a group, you would update the computed size of that group and store it, and then a simple GET request would always return the correct size. However, our situation is more complicated: a group is only stored as a query definition in ElasticSearch (kind of like a view in an RDBMS). Members do not get added to or removed from groups; they get added to a much larger set of data (a collection in MongoDB). There are hundreds, potentially thousands, of different 'group definitions', so it is not practical to recompute the size of every group when the collection changes. We cannot know, when an item is added to or removed from the collection, which groups might change size; you only know who is in a group, and what its size is, by running the group definition. I hope that clears things up. :)
You should use GET. Even if the dynamic resource is changing, you did not ask for that change through your request, and you are not accountable for it. Ref: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
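To make the 'unrequested side effect' concrete, a rough Express-style sketch of such a GET handler (the three helpers are hypothetical):

const express = require('express');
const app = express();

app.get('/groups/:id', async function(req, res) {
    const def = await loadGroupDefinition(req.params.id); // hypothetical
    const size = await runGroupSizeQuery(def);            // hypothetical count query
    await storeGroupSize(req.params.id, size);            // store the fresh size
    // The client asked only to read the group; recomputing and storing
    // the size is a side effect it is not accountable for.
    res.json({ group_id: req.params.id, group_name: def.name, size: size });
});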
In your case, when you do a GET you retrieve some information about the Group; you don't modify the group structure. OK, the group can be changed by an external entity, so your next GET may bring you different data. Am I right? Who modifies the structure of the group, and when?
So you should use GET, because the resource is modified from somewhere else and not by your call, which performs a read operation.
EDIT
After you edited the question, I just want to add that I agree about the side effects.
What matters is whether you explicitly sent data or a change command to the server, or whether you just read something and do not have to pay attention to what the server side does to give you the response. More intuitively:
GET - Requests data from a specified resource
POST - Submits data to be processed to a specified resource
Your case is a combination of GET and POST, so you should use POST.
Refer: http://adarshdchaurasia.wordpress.com/2013/09/26/http-get-vs-post/
You should not use GET, because if you use the GET method then search engines may crawl and cache the responses, which could cause unintentional data updates on your server side, and that is something you do not want. The GET method is meant to return content without updating anything on the server; POST is meant to update things on the server and return the result of that operation.

Multiple tags / folders in Google Reader

I want to be able to grab data from multiple tags/folders in a user's Google Reader.
I know how to do one http://www.google.com/reader/atom/user/-/label/SOMELABEL but how would you do two or three or ten?
It doesn't look like you can get multiple tags/folders in one request. If it's feasible, you should iterate over the different tags/folders and aggregate them in your application.
[edit]
Since it looks like you have a large list of tags/folders you need to query, an alternative is to get the full list of entries, then sort out the ones the user wants. It looks like each entry has a category element that will tell you what tag is associated with it. This might be feasible in your case.
(Source: http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI)
(Source: http://www.google.com/reader/atom/user/-/state/com.google/starred)
I think you cannot get aggregated data the way you hope to. If you think about it, even Google lets you browse folders or tags one at a time, and does not aggregate a subset of them.
You can choose to have a list of all the items (for each one of their available statuses) or a list of a particular tag/folder.
You could do it in two requests. First, perform a GET request to http://www.google.com/reader/stream/items/ids. It supports several parameters, such as:
s (required; the stream id to fetch; may be given more than once),
n (required; the number of items to fetch),
r for ranking (optional),
and others (see the /ids section).
Then perform a POST request to http://www.google.com/reader/api/0/stream/items/contents (a POST because there could be a lot of ids, so a GET URL could get cut off). The required parameter is i, which holds a feed item identifier and may be given more than once.
This should return data from several feeds (it did for me).
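A rough sketch of the two-step flow (ignoring authentication; the output=json parameter and the itemRefs response shape are assumptions based on the docs linked above; run inside an async function on Node 18+ with global fetch):

// Step 1: GET the item ids for two labels in one request (s repeated).
const idsUrl = 'http://www.google.com/reader/stream/items/ids'
    + '?s=user/-/label/LABEL1&s=user/-/label/LABEL2&n=20&output=json';
const idsResponse = await fetch(idsUrl);
const ids = (await idsResponse.json()).itemRefs.map(function(ref) {
    return ref.id;
});

// Step 2: POST the ids (parameter i, repeated) to fetch the contents.
const body = new URLSearchParams();
ids.forEach(function(id) { body.append('i', id); });
const contentsResponse = await fetch(
    'http://www.google.com/reader/api/0/stream/items/contents',
    { method: 'POST', body: body }
);
console.log(await contentsResponse.json());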