This question has been asked a few times, but none of the 'pros' for using ObjectId seem to apply to my case, and the questions are all quite old. I'm wondering if there's anything I'm missing.
Specifically, I'm deciding between these two options:
Store _id as an ObjectId in the DB.
Store _id as the string representation of an ObjectId in the DB.
My understanding of the reasons to favour storing an ObjectId object over its string representation:
They're 12 bytes instead of 24. This is only 12 MB per million records, a tiny fraction of the document size, and a few cents of extra infrastructure, so doesn't warrant adding complexity to my application code.
They're 'faster'. I keep hearing this, but no one says how much faster, which makes me wonder if there's any difference at all and everyone just repeats what they heard from someone else. I found these benchmarks, where read times for single items are identical. There are differences when writing a million records, but I won't be doing that. Benchmarks like this also don't account for the extra time taken to convert an object (coming over the network) from string IDs to ObjectId objects for every write operation. I suspect having to add a loop on the app server to parse incoming data and convert strings with new ObjectId(item._id) for the million records would level the results.
Using ObjectId is the 'normal' way to do it. I'm somewhat swayed by this argument, I don't like doing things the weird way. But as it results in extra code that's prone to bugs, it's not a good enough reason on its own.
The reason for wanting to do option 2 is that I'm converting to/from string/ObjectId in many places in my app as data flows between userland and the DB. Not just for _id fields, but for otherItemId and listOfRelatedThingIds. They all get implicitly converted to strings on the way out (over the network) and need to manually be converted back on the way in. It's not the worst thing in the world, but if there's no good reason to be doing it, then I'll switch to strings in the DB.
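To illustrate the kind of boundary conversion I mean, here is a rough sketch (written in Python with pymongo's bson just for brevity; the field names other than _id are examples from my app):
from bson import ObjectId

def to_db(item):
    # Incoming data: every ID arrives as a string and has to be converted back.
    out = dict(item)
    if "_id" in out:
        out["_id"] = ObjectId(out["_id"])
    if "otherItemId" in out:
        out["otherItemId"] = ObjectId(out["otherItemId"])
    if "listOfRelatedThingIds" in out:
        out["listOfRelatedThingIds"] = [ObjectId(s) for s in out["listOfRelatedThingIds"]]
    return out

def from_db(doc):
    # Outgoing data: ObjectIds get turned back into strings for the network.
    out = dict(doc)
    for key in ("_id", "otherItemId"):
        if key in out:
            out[key] = str(out[key])
    if "listOfRelatedThingIds" in out:
        out["listOfRelatedThingIds"] = [str(o) for o in out["listOfRelatedThingIds"]]
    return out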
The ObjectID type has a built-in timestamp for sorting. Four bytes of the 12 bytes you reference are dedicated to representing the time since the unix epoch in seconds. When you create ObjectIDs they are sortable with each other automatically based on the order in which they were created.
You can use the .getTimestamp() function on your ObjectId to return the timestamp.
var id = new ObjectId();
console.log(id.getTimestamp())
You may think you're saving some infrastructure costs on the storage of your documents, but there are also savings in overhead when it comes to accessing, sorting, and indexing.
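The same idea in Python, if that's your stack (pymongo ships the bson package; the collection name in the comment is hypothetical):
from bson import ObjectId

oid = ObjectId()
print(oid.generation_time)  # creation time, UTC, second precision

# Because the first 4 bytes encode the creation time, sorting on _id
# roughly sorts documents by insertion order without any extra field:
# newest_first = collection.find().sort("_id", -1)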
I don't have experience with Redis so far, but I'm exploring the possibility of using MongoDB as the database and Redis as a cache.
The question I'm dealing with is whether Redis is capable of handling MongoDb ObjectId's in the scope of cursor-based pagination as described, for example here: https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines.html.
In this example we have a maxId, taken from the previous request's results, which will be used as the upper bound on IDs when fetching the next page.
In MongoDB I've found that it is not a problem to use greater-than / less-than operators on ObjectIds, but I don't know if I will be able to handle this in Redis, as ObjectIds will most probably be stored as string values.
This question is important for me as it will help me to decide whether to use MongoDb ObjectId's, or to use auto-increments as PK id. I would prefer to use ObjectId's though.
Note: I'm writing my backend with Java, so fancy npm modules are not what I'm looking for.
The solution I came up with:
Use timestamp as cursor
Store the timestamp as the score in Redis. Even though duplicate scores are theoretically possible here, the chance that this causes a conflict in terms of pagination is negligible for my application.
For example: I have a duplicate score at the 10th result. The next request will include that timestamp in its range, which means that both the 10th and 11th results would be returned on the next request.
In case Redis returns results, ok
In case Redis does not return results, the timestamp cursor can be used to query ObjectIds in MongoDB as well. Even though ObjectId doesn't support milliseconds, this is not a real problem: finding all ObjectIds <= the cursor timestamp with limit / offset should work fine. The comparison only needs second precision, so millisecond variations won't cause trouble.
In case Redis only returns a partial result, MongoDb can be queried based on the ObjectId of the last available post that was found in Redis.
This solution isn't ideal since the client will need to compare the newly received results against the last processed ones to avoid rendering duplicates, but this isn't a real problem as this is not an open API and it is only used internally. After looking for quite some time, there doesn't appear to be a one-size-fits-all solution to this kind of problem. A rough sketch of the flow is below.
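Sketched in Python (redis-py + pymongo) rather than Java, just to keep it short; the key names, collection names and page size are hypothetical:
from datetime import datetime, timezone

import redis
from bson import ObjectId
from pymongo import MongoClient, DESCENDING

r = redis.Redis()
posts = MongoClient().mydb.posts
PAGE_SIZE = 20

def next_page(max_timestamp):
    # max_timestamp: UNIX timestamp cursor taken from the previous request.
    # 1. Try the Redis cache: newest-first members with score <= cursor.
    doc_ids = r.zrevrangebyscore("timeindex:posts", max_timestamp, "-inf",
                                 start=0, num=PAGE_SIZE)
    if len(doc_ids) == PAGE_SIZE:
        return doc_ids  # full page of post IDs served from Redis

    # 2. Otherwise fall back to MongoDB with an ObjectId built from the
    #    cursor time. ObjectId only has second precision, so the client
    #    de-duplicates against the previous page, as noted above. (A partial
    #    Redis result could instead seed this query with the last ObjectId
    #    that Redis returned.)
    cursor_oid = ObjectId.from_datetime(
        datetime.fromtimestamp(max_timestamp, tz=timezone.utc))
    docs = posts.find({"_id": {"$lte": cursor_oid}}, {"_id": 1}) \
                .sort("_id", DESCENDING).limit(PAGE_SIZE)
    return [d["_id"] for d in docs]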
Currently working on trying to model a DB schema in MongoDB. The bit I'm getting stuck on is where an employee must indicate their times that they are available to work.
I.e.
Monday:
9AM-12PM, 2:00PM-6:00PM
Tuesday:
8AM-10AM, 12:00PM-2:00PM, 4:00PM-6:00PM
etc.
I could just have an embedded field in my schema with a list of times, but I'm not sure if that's the best solution to this.
Opinions?
There is no universal rule when it comes to schema design. I would store a list of numerical ranges, where a range's unit is seconds from the beginning of workweek. This way it would be possible to use mongo to search directly for available personnel in a single query. Date manipulation should not be a problem on a modern platform.
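A minimal sketch of that idea (pymongo; the collection and field names are just examples):
from pymongo import MongoClient

employees = MongoClient().mydb.employees

# Availability stored as second offsets from Monday 00:00.
employees.insert_one({
    "name": "Alice",
    "availability": [
        {"start": 9 * 3600, "end": 12 * 3600},                          # Mon 09:00-12:00
        {"start": 14 * 3600, "end": 18 * 3600},                         # Mon 14:00-18:00
        {"start": 24 * 3600 + 8 * 3600, "end": 24 * 3600 + 10 * 3600},  # Tue 08:00-10:00
    ],
})

# Single query: who is available on Monday at 10:30?
t = 10 * 3600 + 30 * 60
available = employees.find({
    "availability": {"$elemMatch": {"start": {"$lte": t}, "end": {"$gte": t}}}
})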
I'm using the Stats component in Solr to get faceted statistics, which works very well, and now I'm interested in doing the same for my date fields. But it seems it doesn't work to use facet.date fields with the stats module. Is there a way of getting this to work?
My fallback plan is to add my facets as specific fields (date, year-quarter, year-month, etc), but this will require heavy re-indexing.
Your best bet might be to store a copy of all date fields in unix time. This way you can have an integer field and be able to run stats on them.
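Something along these lines (sketched in Python with requests; the field names, core name and facet field are hypothetical, and the exact update syntax depends on your Solr version):
from datetime import datetime, timezone
import requests

# At index time, store an integer copy of the date field (unix seconds).
created = datetime(2011, 4, 18, 17, 10, 49, tzinfo=timezone.utc)
doc = {
    "id": "doc1",
    "created_dt": created.isoformat(),
    "created_ts_l": int(created.timestamp()),
}
requests.post("http://localhost:8983/solr/mycore/update?commit=true", json=[doc])

# At query time, run stats on the integer copy, faceted by another field.
params = {
    "q": "*:*",
    "rows": 0,
    "stats": "true",
    "stats.field": "created_ts_l",
    "stats.facet": "category",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)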
It seems that stats on date fields is in the works though.
https://issues.apache.org/jira/browse/SOLR-1023
Given a store which is a collection of JSON documents in the (approximate) form of:
{
  PeriodStart: 18/04/2011 17:10:49,
  PeriodEnd: 18/04/2011 17:15:54,
  Count: 12902,
  Max: 23041,
  Min: 0,
  Mean: 102.86,
  StdDev: 560.97
},
{
  PeriodStart: 18/04/2011 17:15:49,
  PeriodEnd: 18/04/2011 17:20:54,
  Count: 10000,
  Max: 23041,
  Min: 0,
  Mean: 102.86,
  StdDev: 560.97
}
... etc
If I want to query the collection for a given date range (say, all documents from the last 24 hours), which would give me the easiest querying operations to do this?
To further elaborate on requirements:
It's for an application monitoring service, so strict CAP/ACID isn't necessarily required
Performance isn't a primary consideration either. Reads/writes would be at most tens per second, which could be handled by an RDBMS anyway
Ability to handle changing document schemas would be desirable
Ease of querying ability of lists/sets is important (ad-hoc queries an advantage)
I may not have your query requirements down exactly, as you didn't specify. However, if you need to find any documents that start or end in a particular range, then you can apply most of what is written below. If that isn't quite what you're after, I can be more helpful with a bit more direction. :)
If you use CouchDB, you can create your indexes by splitting up the parts of your date into an array. ([year, month, day, hour, minute, second, ...])
Your map function would probably look similar to:
function (doc) {
var date = new Date(doc.PeriodStart);
emit([ date.getFullYear(), date.getMonth() + 1, date.getDate(), date.getHours(), date.getMinutes() ], null); // getMonth() is zero-based, so add 1 for calendar months
}
To perform any sort of range query, you'd need to convert your start and end times into this same array structure. From there, your view query takes parameters called startkey and endkey, which receive the array boundaries for the start and end respectively.
So, to find the documents that started in the past 24 hours, you would send a querystring like this in addition to the full URI for the view itself:
// start: Apr 17, 2011 12:30pm ("24 hours ago")
// end: Apr 18, 2011 12:30pm ("today")
startkey=[2011,4,17,12,30]&endkey=[2011,4,18,12,30]
Or if you want everything from this current year:
startkey=[2011]&endkey=[2011,{}]
Note the {}. When used as an endkey: [2011,{}] is identical to [2012] when the view is collated. (either format will work)
The extra components of the array will simply be ignored, but the further specificity you add to your arrays, the more specific your range can be. Adding reduce functions can be really powerful here, if you add in the group_level parameter, but that's beyond the scope of your question.
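For completeness, here is roughly how such a view query might be issued from application code (Python with requests, purely as an illustration; the database, design document and view names are hypothetical):
import json
import requests

view_url = "http://localhost:5984/metrics/_design/periods/_view/by_start"

params = {
    "startkey": json.dumps([2011, 4, 17, 12, 30]),  # 24 hours ago
    "endkey": json.dumps([2011, 4, 18, 12, 30]),    # now
    "include_docs": "true",                         # also return the documents
}
rows = requests.get(view_url, params=params).json()["rows"]
docs = [row["doc"] for row in rows]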
[Update edited to match edit to original question]
Short answer, (almost) any of them will work.
BigTable databases are a great platform for monitoring services (log analysis, etc). I prefer Cassandra (Super Column Families, secondary indexes, atomic increment coming soon), but HBase will work for you too. Structure the date value so that its lexicographic ordering is the same as the date ordering. Fixed-length strings following the format "YYYYMMDDHHmmss" work nicely for this. If you use this string as your key, range queries will be very simple to perform.
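A small sketch of the key scheme (plain Python; the exact scan API depends on the store, so the range-scan mentioned in the comment is only indicative):
from datetime import datetime, timedelta

def row_key(dt):
    return dt.strftime("%Y%m%d%H%M%S")  # fixed-length "YYYYMMDDHHmmss"

now = datetime(2011, 4, 18, 12, 30, 0)
start_key = row_key(now - timedelta(hours=24))  # "20110417123000"
end_key = row_key(now)                          # "20110418123000"

# Lexicographic order equals chronological order, so a plain row-key range
# scan from start_key to end_key returns everything from the last 24 hours.
assert start_key < end_key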
Handling changing schema is a breeze - just add more columns to the column family. They don't need to be defined ahead of time.
I probably wouldn't use graph databases for this problem, as it would likely amount to traversing a linked list. However, I don't have a ton of experience with graph databases, so take this advice with a grain of salt.
[Update: some of this is moot since the question was edited, but I'm keeping it for posterity]
Is this all you're doing with this database? The big problem with selecting a NoSQL database isn't finding one that supports one query requirement well. The problem is finding one that supports all of your query requirements well. Also, what are your operational requirements? Can you accept a single point of failure? What kind of setup/maintenance overhead are you willing to tolerate? Can you sacrifice low latency for high-throughput batch operations, or is realtime your gig?
Hope this helps!
It seems to me that the easiest way to implement what you want is performing a range query in a search engine like ElasticSearch.
I, for one, certainly would not want to write all the map/reduce code for CouchDB again (I have done it in the past). Also, based on my experience (YMMV), range queries will outperform CouchDB's views and use far fewer resources for large datasets.
Not to mention you can compute interesting statistics with "date histogram" facets in ElasticSearch.
ElasticSearch is schema-free, JSON based, so you should be able to evaluate it for your case in a very short time.
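A quick sketch of what that evaluation might look like (Python with requests; the index and field names are hypothetical, and the aggregation syntax shown is the modern equivalent of the old date-histogram facet):
import requests

query = {
    "size": 0,
    "query": {
        "range": {"PeriodStart": {"gte": "now-24h", "lte": "now"}}
    },
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "PeriodStart", "fixed_interval": "1h"},
            "aggs": {"mean_stats": {"stats": {"field": "Mean"}}},
        }
    },
}
resp = requests.post("http://localhost:9200/metrics/_search", json=query)
buckets = resp.json()["aggregations"]["per_hour"]["buckets"]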
I've decided to go with Mongo for the time being.
I found that setup/deployment was relatively easy, and the C# wrapper was adequate for what we're trying to do (and in the cases where it's not, we can resort to JavaScript queries easily).
What you want is whichever one gives you access to some kind of spatial index. Most of these work off of B-Trees and/or hashes, neither of which is particularly good for spatial indexing.
Now, if your definition of "last 24 hours" is simply "starts or ends within the last 24 hours", then a B-Tree may be fine (you do two queries, one on PeriodStart and then one on PeriodEnd, both constrained to the time window).
But if the span from PeriodStart to PeriodEnd is longer than the time window, then neither of these will be as much help to you.
Either way, that's what you're looking for.
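As an illustration only (the advice above is database-agnostic), the two B-Tree-backed queries could look like this in MongoDB with pymongo; the collection name is hypothetical and the field names come from the example documents:
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

coll = MongoClient().mydb.periods
coll.create_index([("PeriodStart", ASCENDING)])
coll.create_index([("PeriodEnd", ASCENDING)])

window_end = datetime.utcnow()
window_start = window_end - timedelta(hours=24)

# "Starts within the last 24 hours" and "ends within the last 24 hours":
started = coll.find({"PeriodStart": {"$gte": window_start, "$lte": window_end}})
ended = coll.find({"PeriodEnd": {"$gte": window_start, "$lte": window_end}})

# Periods that span the entire window are the case a plain B-Tree index
# doesn't handle as neatly, as noted above.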
This question explains how to query a date range in CouchDB. You would need your data to be in a lexicographically sortable state, in all the examples I've seen.
Since this is tagged Redis and nobody has answered that aspect I'm going to put forth a solution for it.
Step one, store your documents under a given redis key, as a hash or perhaps as a JSON string.
Step two, add the redis key (let's call it a DocID) to a sorted set, with the timestamp converted to a UNIX timestamp as the score. For example, where r is a Redis connection instance in the Python redis client library:
mydocs:Doc12 => [JSON string of the doc]
In Python:
r.set('mydocs:Doc12', JSONStringOfDocument)
timeindex:documents, DocID, Timestamp:
In Python:
r.zadd('timeindex:documents', {'Doc12': timestamp})  # redis-py 3+ takes a member -> score mapping
In effect you are building an index of documents based on UNIX timestamps.
To get documents from a range of time, you use zrangebyscore (or zrevrangebyscore if you want the order reversed) to get the list of Document IDs in that window. Then you can retrieve the documents from the db as normal. Sorted sets are pretty fast in Redis. Further advantages are that you can do set operations such as "documents in this window but not this window", and indeed even store the results in Redis automatically for later use.
One example of how this would be useful is that in your example documents you have a start and end time. If you made an index of each as above, you could get the intersection of the set of documents that start in a given range and the set of documents that end in a given range, and store the resulting set in a new key for later re-use. This would be done via zinterstore
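A short sketch of those two steps (redis-py; the key names for the start/end indexes and the window bounds are hypothetical):
import redis

r = redis.Redis()

window_start = 1303042200  # UNIX timestamps bounding the window
window_end = 1303128600

# Document IDs whose indexed timestamp falls inside the window:
doc_ids = r.zrangebyscore("timeindex:documents", window_start, window_end)
docs = [r.get("mydocs:%s" % doc_id.decode()) for doc_id in doc_ids]

# Documents that both start AND end inside the window, assuming a second
# index keyed on the end time is maintained the same way:
starts = set(r.zrangebyscore("timeindex:starts", window_start, window_end))
ends = set(r.zrangebyscore("timeindex:ends", window_start, window_end))
in_window_ids = starts & ends  # ZINTERSTORE can store such intersections server-side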
Hopefully, that helps someone using Redis for this.
MongoDB is very strong for queries; I think it's useful because it has a lot of query features. I use MongoDB for GPS distance queries, text search, and the aggregation pipeline.
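A tiny illustration of those features (pymongo; the collection and field names are hypothetical):
from pymongo import MongoClient, TEXT, GEOSPHERE

coll = MongoClient().mydb.places
coll.create_index([("description", TEXT)])
coll.create_index([("location", GEOSPHERE)])

# Text search:
hits = coll.find({"$text": {"$search": "coffee"}})

# GPS distance: documents within ~1 km of a point (lon, lat):
near = coll.find({"location": {
    "$near": {
        "$geometry": {"type": "Point", "coordinates": [4.35, 50.85]},
        "$maxDistance": 1000,
    }
}})

# Aggregation pipeline:
per_city = coll.aggregate([
    {"$group": {"_id": "$city", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
])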