Given a store which is a collection of JSON documents in the (approximate) form of:
{
PeriodStart: 18/04/2011 17:10:49
PeriodEnd: 18/04/2011 17:15:54
Count: 12902
Max: 23041 Min: 0
Mean: 102.86 StdDev: 560.97
},
{
PeriodStart: 18/04/2011 17:15:49
PeriodEnd: 18/04/2011 17:20:54
Count: 10000
Max: 23041 Min: 0
Mean: 102.86 StdDev: 560.97
}... etc
If I want to query the collection for a given date range (say, all documents from the last 24 hours), which database would give me the easiest querying operations to do this?
To further elaborate on requirements:
It's for an application monitoring service, so strict CAP/ACID guarantees aren't necessarily required
Performance isn't a primary consideration either. Reads/writes would be at most tens per second, which an RDBMS could handle anyway
Ability to handle changing document schemas would be desirable
Ease of querying lists/sets is important (ad-hoc queries are an advantage)
I may not have your query requirements down exactly, as you didn't specify. However, if you need to find any documents that start or end in a particular range, then you can apply most of what is written below. If that isn't quite what you're after, I can be more helpful with a bit more direction. :)
If you use CouchDB, you can create your indexes by splitting up the parts of your date into an array. ([year, month, day, hour, minute, second, ...])
Your map function would probably look similar to:
function (doc) {
  var date = new Date(doc.PeriodStart);
  // getMonth() is zero-based, so add 1 to keep the key human-readable
  emit([date.getFullYear(), date.getMonth() + 1, date.getDate(), date.getHours(), date.getMinutes()], null);
}
To perform any sort of range query, you'd need to convert your start and end times into this same array structure. From there, your view query takes two parameters, startkey and endkey, which receive the array values for the start and the end of the range respectively.
So, to find the documents that started in the past 24 hours, you would send a querystring like this in addition to the full URI for the view itself:
// start: Apr 17, 2011 12:30pm ("24 hours ago")
// end: Apr 18, 2011 12:30pm ("today")
startkey=[2011,4,17,12,30]&endkey=[2011,4,18,12,30]
Or if you want everything from this current year:
startkey=[2011]&endkey=[2011,{}]
Note the {}. When used as an endkey: [2011,{}] is identical to [2012] when the view is collated. (either format will work)
The extra components of the array will simply be ignored, but the more specificity you add to your arrays, the narrower your range can be. Adding reduce functions can be really powerful here, especially combined with the group_level parameter, but that's beyond the scope of your question.
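To illustrate the query side from code, here is a rough Python sketch of issuing that 24-hour range query over HTTP; the database name, design document, and view name are made up for this example, and note that startkey/endkey must be JSON-encoded:
import json
import requests  # plain HTTP client; any CouchDB client library would work too

# Hypothetical database "metrics" with the map function above saved as the
# view "by_period_start" in _design/stats.
view_url = "http://localhost:5984/metrics/_design/stats/_view/by_period_start"

params = {
    "startkey": json.dumps([2011, 4, 17, 12, 30]),  # 24 hours ago
    "endkey": json.dumps([2011, 4, 18, 12, 30]),    # now
}
rows = requests.get(view_url, params=params).json()["rows"]
doc_ids = [row["id"] for row in rows]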
[Update edited to match edit to original question]
Short answer, (almost) any of them will work.
BigTable databases are a great platform for monitoring services (log analysis, etc). I prefer Cassandra (Super Column Families, secondary indexes, atomic increment coming soon), but HBase will work for you too. Structure the date value so that its lexicographic ordering is the same as the date ordering. Fixed-length strings following the format "YYYYMMDDHHmmss" work nicely for this. If you use this string as your key, range queries will be very simple to perform.
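As a concrete illustration of that key format (plain Python, not tied to any particular Cassandra/HBase client):
from datetime import datetime, timedelta

def period_key(dt):
    # Fixed-width "YYYYMMDDHHmmss" key whose lexicographic order matches chronological order.
    return dt.strftime("%Y%m%d%H%M%S")

now = datetime.utcnow()
start_key = period_key(now - timedelta(hours=24))  # e.g. "20110417123049"
end_key = period_key(now)                          # e.g. "20110418123049"
# A slice/range query from start_key to end_key then covers exactly the last 24 hours.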
Handling changing schema is a breeze - just add more columns to the column family. They don't need to be defined ahead of time.
I probably wouldn't use graph databases for this problem, as it will probably reduce to traversing a linked list. However, I don't have a ton of experience with graph databases, so take this advice with a grain of salt.
[Update: some of this is moot since the question was edited, but I'm keeping it for posterity]
Is this all you're doing with this database? The big problem with selecting a NoSQL database isn't finding one that supports one query requirement well. The problem is finding one that supports all of your query requirements well. Also, what are your operational requirements? Can you accept a single point of failure? What kind of setup/maintenance overhead are you willing to tolerate? Can you sacrifice low latency for high-throughput batch operations, or is realtime your gig?
Hope this helps!
It seems to me that the easiest way to implement what you want is to perform a range query in a search engine like ElasticSearch.
I, for one, certainly would not want to write all the map/reduce code for CouchDB (because I have done so in the past). Also, based on my experience (YMMV), range queries will outperform CouchDB's views and use far fewer resources for large datasets.
Not to mention you can compute interesting statistics with "date histogram" facets in ElasticSearch.
ElasticSearch is schema-free, JSON based, so you should be able to evaluate it for your case in a very short time.
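For a rough sense of what that looks like, here is a minimal sketch using the official Python client; the index name, node address, and client/server versions are assumptions, and on older ElasticSearch versions the histogram parameter is "interval" rather than "calendar_interval":
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# All documents whose PeriodStart falls within the last 24 hours,
# plus an hourly date histogram over the same field.
resp = es.search(index="metrics", body={
    "query": {"range": {"PeriodStart": {"gte": "now-24h", "lte": "now"}}},
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "PeriodStart", "calendar_interval": "hour"}
        }
    },
})
for bucket in resp["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])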
I've decided to go with Mongo for the time being.
I found that setup/deployment was relatively easy, and the C# wrapper was adequate for what we're trying to do (and in the cases where it's not, we can resort to JavaScript queries easily).
What you want is whichever one gives you access to some kind of spatial index. Most of these work off of B-Trees and/or hashes, neither of which is particularly good for spatial indexing.
Now, if your definition of "last 24 hours" is simply "starts or ends within the last 24 hours", then a B-Tree may be fine (you do two queries, one on PeriodStart and one on PeriodEnd, both within the range of the time window).
But if the PeriodStart to PeriodEnd is longer than the time window, then neither of these will be as much help to you.
Either way, that's what you're looking for.
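To make the two-query version concrete, a quick sketch against MongoDB (the collection name and driver are just for illustration):
from datetime import datetime, timedelta
from pymongo import MongoClient

stats = MongoClient()["monitoring"]["periods"]  # placeholder names
window_start = datetime.utcnow() - timedelta(hours=24)

# "Starts or ends within the last 24 hours" as two indexable range queries.
started_recently = stats.find({"PeriodStart": {"$gte": window_start}})
ended_recently = stats.find({"PeriodEnd": {"$gte": window_start}})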
This question explains how to query a date range in CouchDB. You would need your data to be in a lexicographically sortable state, in all the examples I've seen.
Since this is tagged Redis and nobody has answered that aspect I'm going to put forth a solution for it.
Step one, store your documents under a given redis key, as a hash or perhaps as a JSON string.
Step two, add the Redis key (let's call it a DocID) to a sorted set, with the timestamp converted to a UNIX timestamp as the score. For example, where r is a Redis connection instance in the Python redis client library:
mydocs:Doc12 => [JSON string of the doc]
In Python:
r.set('mydocs:Doc12', JSONStringOfDocument)
timeindex:documents => { DocID scored by its UNIX timestamp }
In Python:
r.zadd('timeindex:documents', {'Doc12': timestamp})  # redis-py 3.x+ takes a {member: score} mapping
In effect you are building an index of documents based on UNIX timestamps.
To get the documents from a range of time, you use zrangebyscore (or zrevrangebyscore if you want the order reversed), passing the window's start and end timestamps as the score range, to get the list of Document IDs in that window. Then you can retrieve the documents from the db as normal. Sorted sets are pretty fast in Redis. Further advantages are that you can do set operations such as "documents in this window but not that window", and even store the results in Redis automatically for later use.
One example of how this would be useful is that in your example documents you have a start and end time. If you made an index of each as above, you could get the intersection of the set of documents that start in a given range and the set of documents that end in a given range, and store the resulting set in a new key for later re-use. This would be done via zinterstore
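As a quick sketch of that lookup with the Python redis client (the key names timeindex:start and timeindex:end are hypothetical, following the indexing scheme above):
import time
import redis

r = redis.Redis()  # assumes a local Redis instance

now = time.time()
day_ago = now - 24 * 3600

# Doc IDs whose start timestamp (the sorted-set score) falls in the last 24 hours.
started = r.zrangebyscore('timeindex:start', day_ago, now)
# Doc IDs whose end timestamp falls in the same window.
ended = r.zrangebyscore('timeindex:end', day_ago, now)

# "Started AND ended in the window", intersected client-side here; for a
# server-side version you would first store each range under its own key
# and then combine them with zinterstore.
in_window = set(started) & set(ended)
docs = [r.get('mydocs:%s' % doc_id.decode()) for doc_id in in_window]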
Hopefully, that helps someone using Redis for this.
MongoDB works well for queries; I think it's useful because it has a lot of query features. I use MongoDB for GPS distance queries, text search, and the aggregation pipeline.
I'm playing with a small web application to store tasks and time spent on a daily basis, and report on it in weekly and monthly reports.
For my back-end I wanted to work with MongoDb but can't immediately find what the best way would be to store the time spent on a task.
I have no interest in storing the actual timeframe (start and stop date), just the time spent in hours:minutes.
Since I will be making weekly and monthly reports, I need to be able to calculate the sum on a weekly, monthly basis.
What would be the best way to store the time spent?
I'm leaning towards just storing the amount of minutes in a 32-bit integer and convert it to hours:minutes on the client. (Or better yet, convert it already in a Mongodb Aggregation Query, if that's possible?)
My Task document would look like this then
{
"task_id": {
"$oid": "5311a0a4e4b0386017fce592"
},
"timeEntry_date": "ISODate('2014-03-02')",
"timeSpent_minutes": 30
}
Is this the way to go in MongoDb?
Sorry if I'm stating the obvious here, just starting to learn about MongoDB...
As always with MongoDB: "it depends on your usage". In MongoDB you should construct your "schema" so that it describes your application's needs rather than your data structure.
If your application is write-intensive you should have a really fast and simple write path, e.g. you can store the milliseconds spent and then do the calculations on read.
If you want your application to be faster on reads you can have something like
time_spent: {
hours: 1,
minutes: 23,
seconds: 45
}
in your document
PS I think that
"task_id": {
"$oid": "5311a0a4e4b0386017fce592"
}
would be better as
task_id: "5311a0a4e4b0386017fce592"
Given the requirements as stated, you'll be best served by storing the data as a numerical value. There are far more query and aggregation operators that can easily be used with a numeric value than with the few other options available.
While you could store the period as a string, it would need to be consistently zero padded to sort correctly. With a string, you could only effectively do exact matches and range queries (find all that are "0030" or between "0015" and "0045" for example).
While you could also store the start and end times and try to calculate the period, it adds unnecessary complication to queries, and in fact may make some types of queries only possible using the aggregation framework. You might find some of the ideas here interesting depending on your needs as well.
The absolute performance difference between any option where the time length is directly stored as a value rather than computed is not likely to be measurable, as they would consume nearly the same amount of space. If you're concerned about absolute disk space usage, I'd suggest you consider using shorter field names, since every document contains the full field names. Some drivers for Mongo provide a handy way of aliasing field names (keeping the name you use in code friendly while using a shortened version in the stored document).
I'd go with storing an integer numeric value in whatever unit best suits the maximum length of time you'll encounter. You can add an index if it's frequently queried, and easily convert it on the client into something more human-friendly.
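To illustrate the weekly-sum part of the question, a rough pymongo sketch; the connection, database, and collection names are placeholders, and it assumes timeEntry_date is stored as a real BSON date rather than a string:
from pymongo import MongoClient

coll = MongoClient()["timetracker"]["time_entries"]  # placeholder names

# Sum minutes per calendar week; swap $week for $month to get monthly totals.
weekly = coll.aggregate([
    {"$group": {
        "_id": {"year": {"$year": "$timeEntry_date"},
                "week": {"$week": "$timeEntry_date"}},
        "total_minutes": {"$sum": "$timeSpent_minutes"},
    }},
    # Optional: convert to hours/minutes on the server, as the question asks.
    {"$project": {
        "total_minutes": 1,
        "hours": {"$floor": {"$divide": ["$total_minutes", 60]}},
        "minutes": {"$mod": ["$total_minutes", 60]},
    }},
    {"$sort": {"_id": 1}},
])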
We collect and store instrumentation data from a large number of hosts.
Our storage is MongoDB - several shards with replicas. Everything is stored in a single large collection.
Each document we insert is a time based observation with some attributes (measurements). The time stamp is the most important attribute because all queries are based on time at least. Documents are never updated, so it's a pure write-in-look-up model. Right now it works reasonably well with several billions of docs.
Now,
We want to grow a bit and hold up to 12 months of data, which may amount to a scary trillion+ observations (documents).
I was wondering whether dumping everything into a single monstrous collection is the best choice, or whether there is a more intelligent way to go about it.
By more intelligent I mean - use less hardware while still providing fast inserts and (importantly) fast queries.
So I thought about splitting the large collection into smaller pieces, hoping to reduce index memory and improve insertion and query speed.
I looked into shards, but sharding by the time stamp sounds like a bad idea because all writes will go into one node canceling the benefits of sharding.
The insert rates are pretty high, so we need sharding to work properly here.
I also thought about creating a new collection every month and then pick up a relevant collection for a user query.
Collections older than 12 month will be either dropped or archived.
There is also an option to create entirely new database every month and do similar rotation.
Other options? Or perhaps one large collection is THE option to grow real big?
Please share your experience and considerations in similar apps.
It really depends on the use-case for your queries.
If it's something that could be aggregated, I would say do this through a scheduled map/reduce function and store the smaller, aggregated data in separate collection(s).
If everything should be in the same collection and all data should be queried at the same time to generate the desired results, then you need to go with sharding. Then, depending on the data size of your queries, you could go with an in-memory map/reduce or even do it at the application layer.
As you yourself pointed out, sharding based on time is a very bad idea: it makes all the writes go to one shard, so choose your shard key carefully. The MongoDB docs have a very good explanation of this.
If you can elaborate more on your specific query needs, it would be easier to suggest something.
Hope it helps.
I think a collection on a monthly basis will help you get some boost, but I was wondering why you can't use the hour field of your timestamp for sharding. You could add a field that holds the HOUR part of the timestamp; when you shard on it, the writes will be distributed nicely because the hour values repeat every day. I have not tested it, but I thought it might help you.
I would suggest going ahead with a single collection. As suggested by @Devesh, an hour-based shard key should be fine; just take care to include the new 'hour' key when querying to get better performance.
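For what it's worth, a minimal pymongo sketch of the hour-based shard key suggested above; the mongos address, database, and collection names are placeholders:
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")  # a mongos router
client.admin.command("enableSharding", "metrics")
client.admin.command(
    "shardCollection",
    "metrics.observations",
    key={"hour": 1, "_id": 1},  # 'hour' spreads writes across shards; _id keeps chunks splittable
)

# Each inserted document then carries the hour extracted from its timestamp, e.g.:
# client["metrics"]["observations"].insert_one({"ts": ts, "hour": ts.hour, ...})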
I'm building an application that stores lots of data per user (possibly in gigabytes).
Something like a request log, so lets say you have the following fields for every record:
customer_id
date
hostname
environment
pid
ip
user_agent
account_id
user_id
module
action
id
response code
response time (range)
and possibly some more.
The good thing is that the usage will be mostly write-only, but when there are reads I'd like to be able to answer them quickly, in near real time.
Another prediction about the usage pattern is that most of the time people will be looking at the most recent data and only infrequently query the past, aggregate, etc. So my guess is that the working set will be much smaller than the whole database, i.e. recent data for most users plus ranges of history for the users doing analytics right now. For the latter case I suppose it's OK for the first query to be slower until it gets the range into memory.
But the problem is that I'm not quite sure how to effectively index the data. The start of the index is clear: it's customer_id and date. But the rest can be used in any combination, and I can't predict the most common ones, at least not with any degree of certainty.
We are currently prototyping this with Mongo. Is there a way to do this in Mongo effectively (storage/CPU/cost)? The only thing that comes to mind is to try to predict a couple of frequent queries and index them, and just massively shard the data, ensuring that each customer's data is spread evenly over the shards to allow a fast scan over just the 'customer, date' index for the rest of the queries.
P.S. I'm also open to suggestions about db alternatives.
With this limited number of fields, you could potentially just have an index on each of them, or perhaps each in combination with customer_id. MongoDB is clever enough to pick the fastest index for each case. If you can fit your whole data set in memory (a few GB is not a lot of data!), then this all really doesn't matter.
You're saying you have a GB per user, but that still means you can have an index on the fields as there are only about a dozen. And with that much data, you want sharding anyway at some point soon.
cheers,
Derick
I think your requirements don't really mix well together: you can't have lots of data and instantaneous ad-hoc queries.
If you use a lot of indexes, then your writes will be slow, and you'll need much more RAM.
May I suggest this:
Keep your index on customer id and date to serve recent data to users and relax your requirements to either real-timeliness or accuracy of aggregate queries.
If you sacrifice accuracy, you will be firing map-reduce jobs every once in a while to precompute queries. Users then may see slightly stale data (or may not, it's historical immutable data, after all).
If you sacrifice speed, then you'll run map-reduce each time (right now it's the only sane way of calculating aggregates in a mongodb cluster).
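As a rough sketch of the precompute route, using the aggregation pipeline with $out in place of classic map-reduce (which newer MongoDB versions have deprecated); the collection and field names are placeholders based on the question's field list:
from pymongo import MongoClient

logs = MongoClient()["applogs"]["requests"]  # placeholder names

# Roll up per-customer daily stats into a small summary collection that
# dashboards read instead of scanning the raw log collection.
logs.aggregate([
    {"$group": {
        "_id": {
            "customer_id": "$customer_id",
            "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$date"}},
        },
        "requests": {"$sum": 1},
        "avg_response_time": {"$avg": "$response_time"},  # field name assumed
    }},
    {"$out": "daily_summaries"},  # replaces the summary collection on each run
])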
Hope this helps :)
I'm using Mongo to store, day by day, all the "ticks" of a set of about 40 equities. These ticks contain the trade info (a document containing price and volume) and book info (a more complex document containing sell/buy proposals). The order of magnitude is about 5K trades + 20K books per equity per day. Documents are indexed by symbol (the equity name), date of insert, and time of day. After a week of collection, one of my queries does not scale anymore: looking for distinct dates takes too long. So I decided to have a special document just to say that there is a "collection" for a certain day. Is this a correct approach? Furthermore, is it correct to store things as separate little documents, or would it be better to collect the ticks as an array on the equity document?
Thanks all !
BTW this question is a consequence of this one: Using mongodb for store intraday equity data
Addition:
Even if I explicitly run (at the console):
db.books.ensureIndex({dateTag:1})
db.books.distinct("dateTag")
it replies slowly. So maybe a better question is: does an index affect distinct performance?
Addition
After upgrading to 1.8.2, the behavior is the same.
does an index affect distinct performance?
It does indeed, however there's no "explain plan" so this can only be confirmed via the docs / code.
Documents are indexed by symbol (the equity name), date of insert, and time of day
I'm not 100% clear on how many indexes you have or what type of memory footprint you have here. Just having an index does not necessarily mean that it's going to be really fast. If that index is not in memory, then you end up going to disk and slowing down your query.
If you're seeing slow performance on this query despite the index I would check two things:
Disk activity (during the query)
Data size relative to memory
However, it may just be easier to keep a list of "days stored". That distinct query is probably going to get worse, even with an index. So it's never going to be as fast as a document simply listing the days.
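A minimal sketch of that "days stored" idea with pymongo (the database, collection, and field names are assumptions based on the question):
from pymongo import MongoClient

db = MongoClient()["ticks"]  # placeholder database name

# On every tick insert, also record its day in one small "days stored" document.
def record_tick(doc):
    db.books.insert_one(doc)
    db.meta.update_one(
        {"_id": "days_stored"},
        {"$addToSet": {"days": doc["dateTag"]}},
        upsert=True,
    )

# Reading the distinct days is then a single tiny document fetch:
days = (db.meta.find_one({"_id": "days_stored"}) or {}).get("days", [])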
I don't think that your "collection for a certain day" approach would work out, because you would run into MongoDB's limit of 24,000 namespaces per database. Storing the ticks in an array property of a document could make it harder to execute certain types of query (it really depends on what types of reports you need to run on the ticks).
Are you sure that you have indexes in place for the properties you use in your problematic query? As last resort you could try sharding but I doubt that that is necessary at this point.
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Distinct
clearly states that distinct() can use indexes starting with MongoDB 1.7.3.
I have a mongodb for measurements which has a document per measurements. Each doc looks like:
{
timestamp : 123
value : 123
meta1 : something
meta2 : something
}
I get measurements from a number of sources every second, so the DB gets quite large quickly. I'm interested in keeping the recent information at the frequency it was read in, but I would like to periodically average out older data to save space and make the DB a bit quicker.
1. What's the best approach in Mongo?
2. Is there a better DB for this, considering that the schema is different for different measurements and a fixed format wouldn't work very well? RRD is also not an option, as I need dynamic query abilities.
1. What's the best approach in Mongo?
Use capped collections for use cases such as logging. Another approach is to create a 'background process' that will move old data out of the collection.
2. Is there a better DB for this, considering that the schema is different for different measurements and a fixed format wouldn't work very well? RRD is also not an option, as I need dynamic query abilities.
MongoDB is a good fit here.
Update:
Another approach is to store each data item twice: first in a capped collection (and use this collection for querying), and then in another collection (or even another log DB) just for logging your events.
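A small pymongo sketch of the capped-collection half of this (the database name, collection name, and size limit are placeholders):
from pymongo import MongoClient

db = MongoClient()["measurements"]  # placeholder database name

# Recent, full-resolution data lives in a capped collection: once the size
# limit is hit, the oldest documents are evicted automatically.
if "recent" not in db.list_collection_names():
    db.create_collection("recent", capped=True, size=512 * 1024 * 1024)

db.recent.insert_one({"timestamp": 123, "value": 123, "meta1": "something"})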
Thanks for the input.
I think I'm going to try out using buckets for different timeframes. So I'll create three stores corresponding to, say, 1 sec, 1 min, and 15 min, and then manage the aggregation through a manual job running every so often, which will compact/average out the values, delete stuff that's not needed, etc.
I'm not sure about the best approach but a simple one would be to have a cron job that would remove all the documents older than a given timestamp (your_time = now - some_time).
db.docs.remove({ timestamp : {'$lte' : your_time}})
Given that you need a schemaless database that allows you to perform dynamic queries, MongoDB seems to be a good fit.