I am looking to store some time series data in MongoDB. The system outputs data every minute, so I have made an array called hours. This array is 24 elements long (representing the hours of the day), and each hour contains an array of size 60 (one slot per minute of that hour).
The example below shows 5 hours, with each hour having only 3, 3, 5, 3, and 2 minutes of data respectively.
At the end of each day I will update the document with hourly averages etc.
I did it like this because, when the document maps to a POJO, it is very easy to find the correct hour and minute just by using the array indexes.
A good link on this topic: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb
The use case is simply to allow charting at various granularities, everything from minutes up to days.
{
"_id": ObjectId("55eee516b932c564bc8dd645"),
"name": "James",
"hours": [
[
1.12,
2.47,
3.25
],
[
4.12,
5.24,
6.21
],
[
7.25,
8.69,
9.54,
NumberInt(5),
6.36
],
[
10.55,
11.45,
NumberInt(12)
],
[
13.14,
14.23
]
]
}
Does using the array approach, rather than the nested-document approach from the linked example, cause any issues when updating like below, or when averaging etc.?
db.metrics.update(
{
timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
type: "memory_used"
},
{$set: {"values.59.59": 2000000 } }
)
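For what it's worth, a minimal sketch of what updating and averaging with this layout could look like in the shell (assuming the sample document above; metrics is just a placeholder collection name):
// Overwrite minute 1 of hour 2 by indexing straight into the nested arrays
db.metrics.updateOne(
{ name: "James" },
{ $set: { "hours.2.1": 8.70 } }
)
// Average the readings of hour 2 using the aggregation framework
db.metrics.aggregate([
{ $match: { name: "James" } },
{ $project: { hour2Avg: { $avg: { $arrayElemAt: ["$hours", 2] } } } }
])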
I am trying to get some data visualization for an application I am making and I am currently having an issue.
The current query I am using to get the documents grouped by month is the following:
# Generating our pipeline
pipeline = [
{"$match": query_match
},
{"$group": {
'_id': {
'$dateTrunc': {
'date': "$date", 'unit': "month"
}
},
"total": {
"$sum": 1
}
}
},
{'$sort': {
'_id': 1
}
}
]
This, however, returns the total number of documents for each month.
I want to take this a step further and calculate the average number of documents per day, but ONLY for the days I actually have documents for.
As an example, the above query currently returns something like this:
Index _id total_documents
0 2022-07-01 10425
1 2022-08-01 27981
2 2022-09-01 24872
3 2022-10-01 1633
What I want is, for 2022-07 for example, where I have documents submitted on 20 of the month's 31 days, to return 10425 / 20 instead of 10425 / 31, which would technically be the daily average for that month but counts days with no data.
Is there a way to do this in a single aggregation or would I have to use an additional query to determine how many days I have documents for first?
Thanks
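One possible single-aggregation approach (a sketch, not tested against your data; it reuses your query_match and $date field) is to group by day first, so only days that actually contain documents produce a bucket, and then roll those daily buckets up into months:
# Hypothetical two-stage grouping: day buckets first, then months
pipeline = [
{"$match": query_match},
{"$group": {
'_id': {'$dateTrunc': {'date': "$date", 'unit': "day"}},
'daily_total': {"$sum": 1}
}},
{"$group": {
'_id': {'$dateTrunc': {'date': "$_id", 'unit': "month"}},
'total': {"$sum": "$daily_total"},
'days_with_data': {"$sum": 1},
'daily_average': {"$avg": "$daily_total"}
}},
{'$sort': {'_id': 1}}
]
Here days_with_data counts only the days that had at least one document, and daily_average is total / days_with_data, which is the figure described above.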
I am creating a way to generate reports of the amount of time equipment was down for during a given time frame. I will potentially have hundreds to thousands of documents to work with. Every document has a start date and an end date, both stored as BSON dates, and they will generally be within minutes of each other. For simplicity's sake I am also zeroing out the seconds.
The aggregation I actually need to do is to calculate the number of minutes between each pair of dates, but there may be other documents with overlapping date ranges. Any overlapping time should not be counted if it has already been counted. There are various other aggregations I'll need to do, but this is the only one I'm unsure about, if it's even possible at all.
{
"StartTime": "2020-07-07T18:10:00.000Z",
"StopTime": "2020-07-07T18:13:00.000Z",
"TotalMinutesDown": 3,
"CreatedAt": "2020-07-07T18:13:57.675Z"
}
{
"StartTime": "2020-07-07T18:12:00.000Z",
"StopTime": "2020-07-07T18:14:00.000Z",
"TotalMinutesDown": 2,
"CreatedAt": "2020-07-07T18:13:57.675Z"
}
The two documents above are examples of what I'm working with. Every document stores the total number of minutes between its two dates (this field serves another, unrelated purpose). If I were to aggregate these to get the total minutes down, the output should be 4, since I don't want to count the overlapping minutes twice.
Finding overlaps between time ranges sounds a bit abstract to me. Let's convert it into a concept that databases are commonly used for: discrete values.
If we convert the times to discrete values, we can find the duplicates, i.e. the "overlapping" values, and eliminate them.
I'll illustrate the steps using your sample data. Since you have already zeroed out the seconds, for simplicity's sake, we can start from there.
Since we care about minute increments we are going to convert times to "minutes" elapsed since the Unix epoch.
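For example, for the first document's StartTime (straightforward arithmetic, shown for clarity):
// 2020-07-07T18:10:00Z → 1594145400000 ms since the epoch
// 1594145400000 / 60000 = 26569090 minutes since the epoch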
{
"StartMinutes": 26569090,
"StopMinutes": 26569092,
}
{
"StartMinutes": 26569092,
"StopMinutes": 26569092
}
We convert them to discrete values
{
"minutes": [26569090, 26569091, 26569092]
}
{
"minutes": [26569092, 26569093]
}
Then we can do a set union on all the arrays
{
"allMinutes": [26569090, 26569091, 26569092, 26569093]
}
The size of this union array, 4, is the total number of distinct downtime minutes we are after. Here is how we can get to that result using aggregation; I have simplified the queries and grouped some operations together:
db.collection.aggregate([{
$project: {
minutes: {
$range: [ // note: $range excludes its upper bound, so the stop minute itself is not counted
{
$divide: [{ $toLong: "$StartTime" }, 60000] // convert to minutes timestamp
},
{
$divide: [{ $toLong: "$StopTime" }, 60000]
}
]
},
}
},
{
$group: { // combine to one document
_id: null,
_temp: { $push: "$minutes" }
}
},
{
$project: {
totalMinutes: {
$size: { // get the size of the union set
$reduce: {
input: "$_temp",
initialValue: [],
in: {
$setUnion: ["$$value", "$$this"] // combine the values using set union
}
}
}
}
}
}])
I plan to create a database for price history.
Each day of the year, the history database should store prices defined 90 days in advance.
That means: 90 days x 365 days/year = 32,850 database items.
Is there any way to design the schema to improve query performance?
My first idea was to store the values hierarchically, like this:
{
"Address": "xxxxx",
"City": "xxxxx",
"Country": "Deutschland",
"Currency": "EUR",
"Item_Name": "xxxxxx",
"Location": [
log, lat
],
"Postal_code": "xxxx",
"Price_History": [
2014 : [
"January" : {
"CW_1" : { 1: [ price1 .. price90 ], 2: [ price1 .. price90 ], },
"CW_2" : {},
"CW_3" : {},
} ,
"February" : {},
"March" : {},
]
]
}
Thank you in advance!
It all depends on which queries you are planning to run against this data. It seems to me that if you are interested in keeping a history of actions, then your queries will almost always contain a date parameter.
The Price_History array might be better modelled as sub-documents. Each of these documents would have a varied (but limited) range of values: the year and the month. It might be a good idea to add an index on those attributes. This way, whenever you query by a certain date range, the indexes will help MongoDB find the relevant data relatively quickly.
Another option would be to make each price entry a document in itself. The item connected to the price could be a sub-document, perhaps not containing all of the item data, but enough to perform the calculations and fetch the rest of the relevant data once your result set is small enough. For this usage, I would recommend creating a single attribute that combines the date parts to be indexed, plus an index on the item._id attribute. You can still keep the individual date components if you need to query them individually. Something like this:
{
"ind_attr": "2014_January_CW1",
"date": {
"year": 2014,
"month": January",
},
"CW": 1,
"price": [ price1... price90 ],
"item": {
"name": ...,
"_id": ...,
// minimal data about the actual item
}
}
With this document structure, you could easily add an index on the ind_attr attribute. The document.item._id attribute can be used to retrieve more detailed data on the actual item if needed.
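As a sketch (prices is just a placeholder collection name), the indexes described above could then be created like so:
// Single combined date attribute, as suggested above
db.prices.createIndex({ "ind_attr": 1 })
// To fetch the rest of the item data quickly when needed
db.prices.createIndex({ "item._id": 1 })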
I am new to Mongodb and I have a SQL background.
So my app records the number of clicks and impressions for banners and I have decided to store all this into a single document per banner which looks like this:
{
"_id":ObjectId('534b45b9b6d966a8010002323'),
"active": true,
"banner_end": ISODate("2015-06-05T23:59:59.0Z"),
"banner_name": "Cool banner",
"banner_position": "bottom",
"banner_url": "http:\/\/www.google.com",
"banner_image":"http:\/\/www.google.com/pic.jpg",
"click_details": [
{
"date": ISODate("2014-04-14T02:29:22.961Z"),
"ip": "::1"
}
],
"clicks": NumberInt(1),
"impression_details": [
{
"date": ISODate("2014-04-14T02:28:41.353Z"),
"ip": "::1"
},
{
"date": ISODate("2014-04-14T02:28:53.52Z"),
"ip": "::1"
}
],
"impressions": NumberInt(2)
}
Obviously, as time goes by, the click_details and impression_details arrays will grow (especially the impressions). I was wondering if I am doing this correctly, or should I store click_details and impression_details in a separate collection?
I will need click_details and impression_details later to plot graphs.
Many thanks
There is nothing wrong with this approach; moreover, a document in MongoDB has a size limit of 16 MB, which will hold many records for you.
Can you also share the number of users you get on your site and the expected number of impressions/clicks per banner?
P.S. You can save a lot of space by shortening your field names in the JSON, e.g. banner_name could be written as bn_nm, and so on.
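For illustration only (these short names are made up, not any convention):
// Same data as above, with shortened field names
{
"bn_nm": "Cool banner", // banner_name
"clk_dt": [ { "d": ISODate("2014-04-14T02:29:22.961Z"), "ip": "::1" } ], // click_details
"clks": NumberInt(1) // clicks
}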
I have a MongoDB datastore set up with location data stored like this:
{
"_id" : ObjectId("51d3e161ce87bb000792dc8d"),
"datetime_recorded" : ISODate("2013-07-03T05:35:13Z"),
"loc" : {
"coordinates" : [
0.297716,
18.050614
],
"type" : "Point"
},
"vid" : "11111-22222-33333-44444"
}
I'd like to be able to perform a query similar to the date range example, but instead on a time-of-day range, i.e. retrieve all points recorded between 12 PM and 4 PM (this can be expressed as 1200 and 1600 in 24-hour time as well).
e.g.
With points:
"datetime_recorded" : ISODate("2013-05-01T12:35:13Z"),
"datetime_recorded" : ISODate("2013-06-20T05:35:13Z"),
"datetime_recorded" : ISODate("2013-01-17T07:35:13Z"),
"datetime_recorded" : ISODate("2013-04-03T15:35:13Z"),
a query
db.points.find({'datetime_recorded': {
$gte: Date(1200 hours),
$lt: Date(1600 hours)}
});
would yield only the first and last point.
Is this possible? Or would I have to do it for every day?
Well, the best way to solve this is to store the minutes separately as well. But you can get around this with the aggregation framework, although that is not going to be very fast:
db.so.aggregate( [
{ $project: {
loc: 1,
vid: 1,
datetime_recorded: 1,
minutes: { $add: [
{ $multiply: [ { $hour: '$datetime_recorded' }, 60 ] },
{ $minute: '$datetime_recorded' }
] }
} },
{ $match: { 'minutes' : { $gte : 12 * 60, $lt : 16 * 60 } } }
] );
In the first step, $project, we compute the minutes since midnight as hour * 60 + minute, which we then filter on in the second step, $match.
Adding an answer since I disagree with the other answers: even though there are great things you can do with the aggregation framework, it really is not an optimal way to perform this type of query.
If your identified application usage pattern is that you rely on querying for "hours" or other times of the day without wanting to look at the "date" part, then you are far better off storing that as a numeric value in the document. Something like "milliseconds from start of day" would be granular enough for as many purposes as a BSON Date, but of course gives better performance without the need to compute for every document.
Set Up
This does require some set-up in that you need to add the new fields to your existing documents and make sure you add these on all new documents within your code. A simple conversion process might be:
MongoDB 4.2 and upwards
This can actually be done in a single request due to aggregation operations being allowed in "update" statements now.
db.collection.updateMany(
{},
[{ "$set": {
"timeOfDay": {
"$mod": [
{ "$toLong": "$datetime_recorded" },
1000 * 60 * 60 * 24
]
}
}}]
)
Older MongoDB
var batch = [];
db.collection.find({ "timeOfDay": { "$exists": false } }).forEach(doc => {
batch.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": {
"$set": {
"timeOfDay": doc.datetime_recorded.valueOf() % (60 * 60 * 24 * 1000)
}
}
}
});
// write once only per reasonable batch size
if ( batch.length >= 1000 ) {
db.collection.bulkWrite(batch);
batch = [];
}
})
if ( batch.length > 0 ) {
db.collection.bulkWrite(batch);
batch = [];
}
If you can afford to write to a new collection, then looping and rewriting would not be required:
db.collection.aggregate([
{ "$addFields": {
"timeOfDay": {
"$mod": [
{ "$subtract": [ "$datetime_recorded", Date(0) ] },
1000 * 60 * 60 * 24
]
}
}},
{ "$out": "newcollection" }
])
Or with MongoDB 4.0 and upwards:
db.collection.aggregate([
{ "$addFields": {
"timeOfDay": {
"$mod": [
{ "$toLong": "$datetime_recorded" },
1000 * 60 * 60 * 24
]
}
}},
{ "$out": "newcollection" }
])
All using the same basic conversion of:
1000 milliseconds in a second
60 seconds in a minute
60 minutes in an hour
24 hours a day
Taking the modulo of the numeric milliseconds since the epoch (which is the value a BSON Date actually stores internally) is a simple way to extract the current milliseconds into the day.
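As a quick worked example, using one of the sample points from the question:
// 2013-05-01T12:35:13Z → 1367411713000 ms since the epoch
// 1367411713000 % (1000 * 60 * 60 * 24) = 45313000 ms into the day
// 45313000 ms = 12 hours, 35 minutes, 13 seconds, i.e. 12:35:13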
Query
Querying is then really simple, and as per the question example:
db.collection.find({
"timeOfDay": {
"$gte": 12 * 60 * 60 * 1000, "$lt": 16 * 60 * 60 * 1000
}
})
Of course using the same time scale conversion from hours into milliseconds to match the stored format. But just like before you can make this whatever scale you actually need.
Most importantly, as real document properties which don't rely on computation at run-time, you can place an index on this:
db.collection.createIndex({ "timeOfDay": 1 })
So not only does this remove the run-time overhead of calculating the value, but with an index you can also avoid collection scans, as outlined in the MongoDB documentation on indexing.
For optimal performance you never want to calculate such things at query time: at any real-world scale it simply takes an order of magnitude longer to process every document in the collection just to work out which ones you want than to reference an index and fetch only those documents.
The aggregation framework may just be able to help you rewrite the documents here, but it really should not be used as a production system method of returning such data. Store the times separately.