I'm working on creating an immutable, append-only event log in MongoDB. For this I need a generated sequence number, which I can base on the count of documents, since nothing will ever be removed from the event log. However, I'm trying to avoid doing two operations against MongoDB and would rather have it happen in one "transaction" within the database itself.
If I were to do this from the Mongo shell, it would be something like below:
db['event-log'].insertOne({SequenceNumber: db['event-log'].count() +1 })
Is this doable in any way with the regular API?
Prior to v4, there was the possibility of doing eval - which would have made this much easier.
Update
The reason for my need of a sequence number is to be able to guarantee the order in which they were inserted when reading them back. Default behavior of Mongo is to retrieve them in the $natural order and one can explicitly define that on .find() as well (read more here). Although documentation is clear on not relying on it, it seems that as long as there are no modifications / removal of documents already there, it should be fine from what I can gather.
I also realized that I might get around this another way: I'm going to introduce an actor framework, and I could make my committer a stateful actor that holds the sequence number if I need it.
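The stateful-committer idea can be sketched roughly as follows. This is only an illustration under assumptions: SequencedCommitter and its insert_one callable are hypothetical names, a lock stands in for the actor's mailbox, and it only guarantees ordering while a single process owns the counter.

```python
import threading


class SequencedCommitter:
    """Hypothetical in-process committer that owns the sequence counter."""

    def __init__(self, insert_one, last_sequence=0):
        # insert_one is any callable that persists one document, for example
        # a wrapper around a MongoDB insert (an assumption, not shown here).
        self._insert_one = insert_one
        self._sequence = last_sequence
        self._lock = threading.Lock()

    def commit(self, event):
        # The lock stands in for the actor's mailbox: one commit at a time.
        with self._lock:
            next_sequence = self._sequence + 1
            document = dict(event, SequenceNumber=next_sequence)
            self._insert_one(document)
            # Advance only after the write succeeded, so a failed insert
            # does not leave a gap in the sequence.
            self._sequence = next_sequence
            return next_sequence


# A plain list stands in for the event-log collection.
log = []
committer = SequencedCommitter(log.append)
committer.commit({"Type": "UserCreated"})
committer.commit({"Type": "UserRenamed"})
```

On restart, last_sequence would have to be restored from the log itself, for example from the highest SequenceNumber already stored.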
I have a collection of ~500M documents.
Every time I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever the document is returned from a query. After a few months of running the system in production, I discover that the counter is greater than 0 (zero) for only 5% of the documents. Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
~500M documents: that is quite a solid figure, good job if that's true. Here is how I see the solution to the problem:
If you are willing to rewrite/refactor and rebuild the app's DB, you could use the versioning pattern.
What does it look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you run find only on the relevant docs collection (which stores the 5% of your docs that are actually useful) and, if nothing is found, fall back to Irrelevant.find(). This pattern allows you to keep old/historical data and manage it via a TTL index or a capped collection.
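The lookup order described above can be sketched like this. Plain dictionaries stand in for the two collections, find_with_fallback is a hypothetical helper, and promoting a document after a fallback hit is my own assumption rather than part of the pattern as stated.

```python
def find_with_fallback(key, relevant, irrelevant):
    """Look in the small 'relevant' store first, then in the large one."""
    doc = relevant.get(key)
    if doc is not None:
        return doc
    doc = irrelevant.get(key)
    if doc is not None:
        # Assumption: promote the document so the next lookup for it
        # is served from the small store.
        relevant[key] = irrelevant.pop(key)
    return doc


relevant = {"a": {"name": "hot document"}}
irrelevant = {"b": {"name": "cold document"}}
first = find_with_fallback("a", relevant, irrelevant)
second = find_with_fallback("b", relevant, irrelevant)
```

The point of the split is that the common case only ever touches the small collection and its much smaller indexes.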
You could also add some Redis magic to it (which uses precisely the same logic).
This article can also be helpful (as many others, like this SO question)
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
Yes, it will deal with your problem. To take a look, download MongoDB Compass, create this boolean field in your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on a single field, measure the performance by querying only that field.
The result should look like this:
If your index is used (and actually speeds things up), Compass will show it.
To measure the performance of the queries (with and without indexing), use Explain tab.
Actually, all of this can be done without Compass, via .explain() and .createIndex() in the shell. But Compass visualizes the process better, so it's worth using, especially since it has become completely free for everyone.
Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between MongoDB and the client in case the client doesn't need user information at all, just offers. Am I wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
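A minimal sketch of that two-step approach: the first query returns only _id values, and the second query's filter is built from them. offers_filter_for is a hypothetical helper; the PyMongo calls in the comments show how it might be wired up and are not executed here.

```python
def offers_filter_for(users):
    """Build the filter document for the second query from the first result."""
    user_ids = [user["_id"] for user in users]
    return {"user": {"$in": user_ids}}


# With a driver such as PyMongo this would be used roughly as:
#   users = db.user.find({"state": 1}, {"_id": 1})   # project only _id
#   offers = db.offer.find(offers_filter_for(users))
active_users = [{"_id": 1}, {"_id": 7}]
query = offers_filter_for(active_users)
```

The ids still travel to the client and back, but nothing else from the user documents does.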
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more complex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transferring of the results of the first query is not
necessary neither for me nor for the mongodb itself. I just want to
understand could it be done now, will it be possible to do so in the
future or it cannot be implemented for some technical reasons
As far as I understand it, there is no immediate hurry to make this possible. Also, given the way it is coded at the moment, this would be quite a big change to the way cursors work and are defined, big enough to possibly break implementations for other people. It is much like the question of whether to make safe the default for inserts and updates in all future drivers: it is recognised that safe should be the default, but changing it will break implementations for people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all. However, since most networks are built with high traffic in mind and traffic is cheap, there hasn't been demand for chained queries executed server side in the cursor.
However, subselects (which this basically is: selecting a set of rows based upon a sub-selection of previous rows) have come up on mongodb-user a couple of times, and there might even be a JIRA for it somewhere; if not, it might be useful to create one.
As for doing it right now: there is no way.
Given a store which is a collection of JSON documents in the (approximate) form of:
{
  PeriodStart: 18/04/2011 17:10:49
  PeriodEnd: 18/04/2011 17:15:54
  Count: 12902
  Max: 23041
  Min: 0
  Mean: 102.86
  StdDev: 560.97
},
{
  PeriodStart: 18/04/2011 17:15:49
  PeriodEnd: 18/04/2011 17:20:54
  Count: 10000
  Max: 23041
  Min: 0
  Mean: 102.86
  StdDev: 560.97
}... etc
If I want to query the collection for given date range (say all documents from last 24 hours), which would give me the easiest querying operations to do this?
To further elaborate on requirements:
It's for an application monitoring service, so strict CAP/ACID isn't necessarily required
Performance isn't a primary consideration either. Reads/writes would be in the tens per second at most, which could be handled by an RDBMS anyway
Ability to handle changing document schemas would be desirable
Ease of querying ability of lists/sets is important (ad-hoc queries an advantage)
I may not have your query requirements down exactly, as you didn't specify. However, if you need to find any documents that start or end in a particular range, then you can apply most of what is written below. If that isn't quite what you're after, I can be more helpful with a bit more direction. :)
If you use CouchDB, you can create your indexes by splitting up the parts of your date into an array. ([year, month, day, hour, minute, second, ...])
Your map function would probably look similar to:
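function (doc) {
  // Assumes doc.PeriodStart is stored in a format that Date can parse (e.g. ISO 8601).
  var date = new Date(doc.PeriodStart);
  // getMonth() is zero-based, so add 1 to match keys such as [2011,4,17,12,30].
  emit([date.getFullYear(), date.getMonth() + 1, date.getDate(), date.getHours(), date.getMinutes()], null);
}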
To perform any sort of range query, you'd need to convert your start and end times into this same array structure. From there, your view query would have params called startkey and endkey. They would receive the array parameters for start and end respectively.
So, to find the documents that started in the past 24 hours, you would send a querystring like this in addition to the full URI for the view itself:
// start: Apr 17, 2011 12:30pm ("24 hours ago")
// end: Apr 18, 2011 12:30pm ("today")
startkey=[2011,4,17,12,30]&endkey=[2011,4,18,12,30]
Or if you want everything from this current year:
startkey=[2011]&endkey=[2011,{}]
Note the {}. When used as an endkey: [2011,{}] is identical to [2012] when the view is collated. (either format will work)
The extra components of the array will simply be ignored, but the further specificity you add to your arrays, the more specific your range can be. Adding reduce functions can be really powerful here, if you add in the group_level parameter, but that's beyond the scope of your question.
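For illustration, here is a small sketch of how a client might build those startkey and endkey parameters for "the last 24 hours". It assumes the view emits one-based months, as in the query strings above; view_key and last_24_hours_params are hypothetical helper names.

```python
import json
from datetime import datetime, timedelta
from urllib.parse import urlencode


def view_key(moment):
    """Array key in the same shape the map function emits."""
    return [moment.year, moment.month, moment.day, moment.hour, moment.minute]


def last_24_hours_params(now):
    """URL-encoded startkey/endkey covering the 24 hours before 'now'."""
    start = now - timedelta(hours=24)
    # CouchDB expects startkey and endkey to be JSON-encoded values.
    return urlencode({
        "startkey": json.dumps(view_key(start), separators=(",", ":")),
        "endkey": json.dumps(view_key(now), separators=(",", ":")),
    })


params = last_24_hours_params(datetime(2011, 4, 18, 12, 30))
```

The resulting string is the percent-encoded form of the query string shown earlier and can be appended to the view's URI after a "?".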
[Update edited to match edit to original question]
Short answer, (almost) any of them will work.
BigTable databases are a great platform for monitoring services (log analysis, etc). I prefer Cassandra (Super Column Families, secondary indexes, atomic increment coming soon), but HBase will work for you too. Structure the date value so that its lexicographic ordering is the same as the date ordering. Fixed-length strings following the format "YYYYMMDDHHmmss" work nicely for this. If you use this string as your key, range queries will be very simple to perform.
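To make the key format concrete, here is a small sketch in plain Python. A sorted list stands in for the ordered row keys of a column family; row_key and range_scan are hypothetical names, not part of any Cassandra or HBase client.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

KEY_FORMAT = "%Y%m%d%H%M%S"  # "YYYYMMDDHHmmss", fixed length


def row_key(moment):
    """Fixed-length key whose string order matches chronological order."""
    return moment.strftime(KEY_FORMAT)


def range_scan(sorted_keys, start, end):
    """Return keys with start <= key <= end from an already sorted list."""
    low = bisect_left(sorted_keys, row_key(start))
    high = bisect_right(sorted_keys, row_key(end))
    return sorted_keys[low:high]


keys = sorted(row_key(d) for d in [
    datetime(2011, 4, 17, 9, 0, 0),
    datetime(2011, 4, 18, 17, 10, 49),
    datetime(2011, 4, 18, 17, 15, 49),
])
hits = range_scan(keys, datetime(2011, 4, 18), datetime(2011, 4, 19))
```

Because every key has the same length and the most significant part comes first, a plain string comparison is enough to find the start and end of the range.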
Handling changing schema is a breeze - just add more columns to the column family. They don't need to be defined ahead of time.
I probably wouldn't use graph databases for this problem, as it'll probably summarize to traversing a linked list. However, I don't have a ton of experience with graph databases, so take this advice with a grain of salt.
[Update: some of this is moot since the question was edited, but I'm keeping it for posterity]
Is this all you're doing with this database? The big problem with selecting a NoSQL database isn't finding one that supports one query requirement well. The problem is finding one that supports all of your query requirements well. Also, what are your operational requirements? Can you accept a single point of failure? What kind of setup/maintenance overhead are you willing to tolerate? Can you sacrifice low latency for high-throughput batch operations, or is realtime your gig?
Hope this helps!
It seems to me that the easiest way to implement what you want is performing a range query in a search engine like ElasticSearch.
I, for one, certainly would not want to write all the map/reduce code for CouchDB (because I did in the past). Also, based on my experience (YMMV), range queries will outperform CouchDB's views and use far fewer resources for large datasets.
Not to mention you can compute interesting statistics with "date histogram" facets in ElasticSearch.
ElasticSearch is schema-free, JSON based, so you should be able to evaluate it for your case in a very short time.
I've decided to go with Mongo for the time being.
I found that setup/deployment was relatively easy, and the C# wrapper was adequate for what we're trying to do (and in the cases where it's not, we can resort to JavaScript queries easily).
What you want is whichever one gives you access to some kind of spatial index. Most of these work off of B-Trees and/or hashes, neither of which is particularly good for spatial indexing.
Now, if your definition of "last 24 hours" is simply "starts or ends within the last 24 hours" then a B-Tree may be fine (you do two queries, one on PeriodStart and then one on PeriodEnd, both being within range of the time window).
But if the PeriodStart to PeriodEnd is longer than the time window, then neither of these will be as much help to you.
Either way, that's what you're looking for.
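The difference between the two cases can be shown with a small sketch. The overlap test below is the standard interval-intersection condition, offered as an illustration rather than as what any particular database does internally; the function names are hypothetical.

```python
from datetime import datetime


def starts_or_ends_in(period_start, period_end, window_start, window_end):
    """The 'two B-Tree queries' case: an endpoint lies inside the window."""
    return (window_start <= period_start <= window_end
            or window_start <= period_end <= window_end)


def overlaps(period_start, period_end, window_start, window_end):
    """True when [period_start, period_end] intersects the window at all."""
    return period_start <= window_end and period_end >= window_start


window = (datetime(2011, 4, 17, 12, 30), datetime(2011, 4, 18, 12, 30))
# A period longer than the window: it starts before and ends after it.
long_period = (datetime(2011, 4, 16), datetime(2011, 4, 19))
missed = starts_or_ends_in(*long_period, *window)
found = overlaps(*long_period, *window)
```

The long period is missed by the endpoint checks but found by the overlap test, which is the situation where a simple index on one field stops being enough.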
This question explains how to query a date range in CouchDB. You would need your data to be in a lexicographically sortable state, in all the examples I've seen.
Since this is tagged Redis and nobody has answered that aspect I'm going to put forth a solution for it.
Step one, store your documents under a given Redis key, as a hash or perhaps as a JSON string.
Step two, add the Redis key (let's call it a DocID) to a sorted set, using the timestamp converted to a UNIX timestamp as the score. For example, where r is a Redis connection instance in the Python redis client library:
mydocs:Doc12 => [JSON string of the doc]
In Python:
r.set('mydocs:Doc12', JSONStringOfDocument)
timeindex:documents, DocID, Timestamp:
In Python:
r.zadd('timeindex:documents', {'Doc12': timestamp})  # redis-py 3.0+ takes a {member: score} mapping
In effect you are building an index of documents based on UNIX timestamps.
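The conversion to a UNIX timestamp could look like the sketch below. It assumes the timestamps in the documents are in UTC and use the day/month/year layout from the question; to_unix and window_scores are hypothetical helpers and no Redis connection is involved.

```python
import calendar
from datetime import datetime

STAMP_FORMAT = "%d/%m/%Y %H:%M:%S"  # matches "18/04/2011 17:10:49"


def to_unix(stamp):
    """Parse a document timestamp into seconds since the epoch (UTC assumed)."""
    parsed = datetime.strptime(stamp, STAMP_FORMAT)
    return calendar.timegm(parsed.timetuple())


def window_scores(now, hours=24):
    """Minimum and maximum scores for a 'last N hours' range query."""
    return now - hours * 3600, now


score = to_unix("18/04/2011 17:10:49")
min_score, max_score = window_scores(score)
```

The score is what would be stored alongside the DocID in the sorted set, and the two bounds are what a range query over that set would be given.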
To get documents from a range of time, you use zrangebyscore (or zrevrangebyscore if you want the order reversed) with the window's start and end timestamps to get the list of document IDs in that window; plain zrange selects by rank rather than by score. Then you can retrieve the documents from the db as normal. Sorted sets are pretty fast in Redis. Further advantages are that you can do set operations such as "documents in this window but not this window", and indeed even store the results in Redis automatically for later use.
One example of how this would be useful is that in your example documents you have a start and end time. If you made an index of each as above, you could get the intersection of the set of documents that start in a given range and the set of documents that end in a given range, and store the resulting set in a new key for later re-use. This would be done via zinterstore.
Hopefully, that helps someone using Redis for this.
MongoDB is very good for queries; I think it's useful because it has a lot of functions. I use MongoDB for GPS distance, text search, and the pipeline model (aggregation included).
I'm rebuilding Lovers on Facebook with Sinatra & Redis. I like Redis because it doesn't have the long (12-byte) BSON ObjectIds and I am storing sets of Facebook user_ids for each user. The sets are requests_sent, requests_received, & relationships, and they all contain Facebook user ids.
I'm thinking of switching to MongoDB because I want to use its geospatial indexing. If I do, I'd want to use the FB user ids as the _id field because I want the sets to be small and I want the JSON responses to be small. But, is the BSON ObjectId better (more efficient for MongoDB) to use than just an integer (fb user_id)?
There are no major efficiency differences as far as I know except in certain cases like ordering by date (since the ObjectId's have the datetime in them, etc.)
For example, you'd lose the ability to simply order by the _id to get documents in creation order, and you'd also lose the benefits for sharding and distribution. Aside from that, while I'd still personally use the ObjectIds anyhow, as long as the int is unique (of course) you should be just fine.
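To illustrate the "datetime in them" point: the first 4 of an ObjectId's 12 bytes are a big-endian count of seconds since the UNIX epoch, so the creation time can be read straight from the hex string. A minimal sketch without any driver; objectid_timestamp is a hypothetical helper and the id is just a sample value.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Seconds since the epoch stored in the first 4 bytes of an ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    return int(oid_hex[:8], 16)


created = datetime.fromtimestamp(
    objectid_timestamp("507f1f77bcf86cd799439011"), tz=timezone.utc
)
```

This is why sorting by an ObjectId _id roughly sorts by creation time, something an integer user id does not give you.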
Since the _id always "comes back" in a query I suppose you'd save a little time and data transfer (a bitty bit.)
You can even make your _id a compound value (an embedded document, though not an array) if you wanted, and it'll all index nicely; see this answer (not that I'd necessarily recommend that most of the time).
Also see: Optimizing Object IDs