Query for set complement in CouchDB - nosql

I'm not sure that there is a good way to do with with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a (unique) username field. There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number post documents with a given username, include none. It's trivial to create a view which counts the number of posts per username. The view can even include zero-counts by emitting zero post-counts for the user documents in the view map function. What I want to do though is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results a relatively small proportion of the total. Is there a way to do this sever-side and retrieve back just the interesting results?

I would write a map function to iterate through the documents and emit the users (or just usersnames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)

Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan.) However, it can be done in constant memory, and linear time, by simultaneously iterating through two query results at the same time.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See my details in the previous question. The basic idea is you iterate through both (sorted) lists simultaneously and when you say "hey, this document id is listed in the full set but it's missing in the sub-set, that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)

Related

How to get the last page of any Firestore query?

How do I get the last page of any Firestore query without needing to send the cursor value for the last document in the query?
Obviously the first/last value is not always known unless I manually keep track of those values. I have queries that order by many different fields so having to store first/last cursor values for each of those seems like a lot of unnecessary work.
Getting the entire query result without limit is obviously expensive and impractical.
There currently isn't a way to get the last list. The common way to do this is to:
Reverse the order of the query.
Request the first page of results.
Reverse the results client-side again to get them in the right order.
Note though that there is talk (and even work being done) to add a limitToLast() operation to Firestore's query mechanism, which would allow you to get precisely the result you're looking for. It just isn't available yet. The biggest differences would be that you can skip steps 1 and 3 from the workaround.

MongoDB database design - contest application

I'm building a contest application. Which have 4 collections so far:
contest
questions
matches
users
I want to store every user score for every match he's assigned into. But I really can't find a proper way to achieve this.
All what I've came up with, Is to replace matches in users with an array in which each element contains a reference to matches collection and score field. But I think this is not very efficient.
EDIT
I was thinking about another solution. A separate collection called scores that contains three fields user, match and score.
Here's my schema structure:
Contests:
Questions:
Matches:
Users:
Note Any recommended adjustments on the current design is welcomed too.
Since mongodb is not designed to support collections relationships you migth end up with some duplicated work, I would suggest you to find a way of storing as much data as you can in a single document.
Your scores would go in each match document, probably the users array would have this structure {'users':[{user_id:'xxx',score:xxx}{user_id:'xxx',score:xxx}]}
The other solution, would be what you say, to have in each user doccument, a matches array with a structure like this: {'matches':[{match_id:'xxx',score:xxx}{match_id:'xxx',score:xxx}]}
You can have both also, this migth be more efficient depending the kind of queries you will need to do. You can also have a field in the subdocuments that stores the user/match name/title
Note: As you can see, you have two solutions, or you optimize for doccument size(so you can store more) or you optimize for performance (so you can read faster/with less resources)
Hope this be of any help.

Any way to make Mongo filtering dynamic records faster?

Scenario
I have Mongo collection Items that have dynamic item objects in it. Currently I have over 3 million records. I'm using C# with MongoSharp but I don't think it has anything to do with my problem.
Here is an example Item (it has a lot more fields than just 3):
{
_id : "1234567890",
Code : 888596937,
RefNumber : "GHTZKL",
...
}
AFAIK there is no point in using TextSearch since it's not really words, just some codes so it won't give me anything beneficial. I also cannot index them all since then I will have to index every single field.
Problem
Right now when I filter data it takes about 1-3 seconds (on ssd). Is there any way I can make it filter my items faster or it is as fast as I can get?
You don't mention what field you want to search on, but it sounds like you want to search on any arbitrary attribute. This is a common design and borders on an antipattern for MongoDB. The only way to avoid the collection scan you're getting now is to index the fields you want to search on, but indexing every field when you don't know what the fields will be ahead of time isn't possible. The solution to this to name only the common fields (and index on them), then group the other fields into name/value pairs in an array in the the document. You can then index that array to get your fast searches.
A word of caution on NVP arrays: if you array gets very large (hundreds of attributes), your index size will blow up spectacularly. It's best to try to keep the array size fairly small.
For more information on this design pattern, see Asya's great writeup.

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB themselves suggest against using this as it requires the curser to iterate through all the records up until the one you're skipping to, which gets more and more expensive the higher in the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id, otherwise you can't guarantee order for follow up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between mongodb and client in case the client doesn't need user information at all, just offers. Am i wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more сomplex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transfering of the results of the first query is not
necessary neither for me nor for the mongodb itself. I just want to
understand could it be done now, will it be possible to do so in the
future or it cannot be implemented for some technical reasons
As far as I understand it there is no immediate hurry to make this possible. Also the way it is coded atm will make this quite a big change to the way cursors work and are defined. A change big enough to possibly cause implementation breaks for other people. It is really a case of whether to set safe for inserts and updates for all future drivers. It is recognised that safe should be default but this will break implementation for other people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all however since most networks are prepped with high traffic in mind and the traffic is cheap there hasn't been a demand to make it able to do chained queries server side in the cursor.
However subselects (which this basically is, it is selecting a set of rows based upon a sub selection of previous rows) have been on mongodb-user a couple of times and there might even be a JIRA for it somewhere, if not might be useful to make one.
As for doing it right now: there is no way.