I'm using Morphia to connect to MongoDB. I'm collecting daily mileage for cars. Right now, the daily mileages for all cars are stored in one collection with the following attributes:
plateNumber, date, mileage
We want to store daily mileages going all the way back to 1990. We're already maintaining around 4,500+ cars (roughly 1.3 million records a year). We're trialling it with one year's worth of data, and performance is already lagging badly. I was thinking of splitting the storage into multiple collections based on plate number, so each plate number would have its own collection named after it. I need some ideas. Is there any other way to solve this?
Adding details:
How we'll use the data: we want to query the mileages of multiple cars (sometimes per department, per geographic area, per make/model, etc.) over any given date range.
So, let's say we want to monitor mileages in a suburb: we'd take the mileages of all plate numbers operating in that suburb from 01 Jan 2014 to 23 Jun 2014 and perform calculations on that data.
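To give an idea, the kind of query we run looks roughly like this (the collection name dailyMileage and the values are just illustrative):

// plate numbers operating in the suburb, resolved beforehand from our fleet data
var plates = ["ABC123", "DEF456"];

db.dailyMileage.find({
    plateNumber: { $in: plates },
    date: { $gte: ISODate("2014-01-01"), $lte: ISODate("2014-06-23") }
})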
thanks.
Depending on your configuration you can try sharding, or you may attempt to partition your db -- though the latter approach is a hybrid, meaning that you would mimic the partitioning offered by SQL database systems (Oracle, SQL Server, etc.).
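For example, if a sharded cluster is an option, something along these lines would spread the data by car while keeping each car's dates together (the database and collection names are assumptions based on the attributes in the question):

sh.enableSharding("fleet")
sh.shardCollection("fleet.dailyMileage", { plateNumber: 1, date: 1 })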
Also note that if you insert (basically append) a lot of entries into a single collection, it will gradually become slower, since MongoDB has to maintain the unique primary key index (_id) plus any other indexes you have defined on the collection.
If you can provide more information on how you intend to use the collected data, at what time intervals, and whether these operations are online or offline, I'll update my answer.
Related
I'm using MongoDB and need some analytics queries to produce reports on my data. I know that MongoDB is not a good choice for OLAP applications, but if I want to use MongoDB, one solution could be pre-computing the required data. For instance, we can create a new collection for any specific OLAP query and update that collection when related events happen in the system. Consider this scenario:
In my app I'm storing sales information for some vendors in a sales collection. Each document in sales consists of a sale value, a vendor ID and a date. I want a report that, within a specified time period, finds the vendors that have sold the most. To avoid heavy aggregations I've created a middle collection that stores the total amount of sales for each vendor on each day. When I want to prepare the report, I just find the documents in the middle collection whose dates fall within the specified period, group the results by vendor ID and sort them. I think this solution has lower aggregation cost because the middle collection holds fewer documents than the original collection, and it would be of O(n) time complexity.
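As a sketch of what this could look like in the mongo shell (collection and field names are assumptions based on the description above):

// keep the middle collection up to date when a sale event arrives
db.dailySales.update(
    { vendorId: sale.vendorId, day: sale.day },  // sale.day = the sale date truncated to the day
    { $inc: { total: sale.value } },
    { upsert: true }
)

// the report: total sales per vendor within a period, highest first
db.dailySales.aggregate([
    { $match: { day: { $gte: fromDate, $lte: toDate } } },
    { $group: { _id: "$vendorId", total: { $sum: "$total" } } },
    { $sort: { total: -1 } }
])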
I want to know whether there is any mechanism in MongoDB that makes it possible to avoid this aggregation too and make the query simpler.
I am new to MongoDB and I am having difficulty implementing a solution in it.
Consider a case where I have two collections, a Client and a Sales collection, with the following designs:
Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion
Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of the products sold, each in the form {product_code, description, units, unit_price, amount}
total sales
Now there is a requirement to develop another collection for analytical queries, where the following questions can be answered:
What is the distribution of sales by gender, region and emp_status?
What are the most purchased products for clients in a particular region?
I considered implementing a very denormalized collection, i.e. a flat and wide combination of the properties of the Sales and Client collections, so that I can use map-reduce to answer these questions.
In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how Map-Reduce or Aggregation can help here.
Questions:
How do I implement Map-Reduce to map across 2 collections?
Is it possible to chain MapReduce operations?
Regards.
MongoDB does not do JOINs - period!
MapReduce always runs on a single collection. You can not have a single MapReduce job which selects from more than one collection. The same applies to aggregation.
When you want to do some data mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client data embedded. You will have to write a little program or script which iterates over all clients and:
finds all Sales documents for the client
merges the relevant fields from the Client into each document
inserts the resulting document into the new collection
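A rough sketch of such a script in the mongo shell (the lowercase collection names and the target collection sales_denormalized are assumptions):

db.clients.find().forEach(function(client) {
    // if client_id really is a DBRef, match on "client_id.$id" instead
    db.sales.find({ client_id: client._id }).forEach(function(sale) {
        // copy the client fields needed for reporting into the sale document
        sale.client = {
            gender: client.gender,
            region: client.region,
            emp_status: client.emp_status,
            occupation: client.occupation
        };
        db.sales_denormalized.insert(sale);
    });
});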
When your Client document is small and doesn't change often, you might consider always embedding it into each Sale. This means that you will have redundant data, which looks very evil from the viewpoint of a seasoned RDB veteran. But remember that MongoDB is not a relational database, so you should not apply all RDBMS dogmas uncritically. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively inexpensive and painless, which isn't the case with MongoDB. Besides, sometimes you actually want that redundancy to preserve history. When you want to know the historical development of sales by region, you want to know the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you can solve this with separate Address documents which have date ranges, but that would make things even more complicated.
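With that approach, a Sales document might look something like this (the embedded client part is the addition; the other field names follow the outline above):

{
    _id: ObjectId(),
    trans_date: ISODate("2014-06-23T10:15:00Z"),
    products: [
        { product_code: "P-001", description: "Widget", units: 2, unit_price: 9.99, amount: 19.98 }
    ],
    total_sales: 19.98,
    // snapshot of the client as they were at the time of the sale
    client: { gender: "F", region: "North", emp_status: "employed" }
}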
Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents which grow over time, so when your clients tend to return often, this might result in sub-par write-performance.
We collect and store instrumentation data from a large number of hosts.
Our storage is MongoDB - several shards with replicas. Everything is stored in a single large collection.
Each document we insert is a time-based observation with some attributes (measurements). The timestamp is the most important attribute, because all queries are based on time at the very least. Documents are never updated, so it's a pure write-once, look-up model. Right now it works reasonably well with several billion docs.
Now,
We want to grow a bit and hold up to 12 months of data, which may amount to a scary trillion+ observations (documents).
I was wondering whether dumping everything into a single monstrous collection is the best choice, or whether there is a more intelligent way to go about it.
By more intelligent I mean: use less hardware while still providing fast inserts and (importantly) fast queries.
So I thought about splitting the large collection into smaller pieces hoping to gain memory on indexes, insertion and query speed.
I looked into sharding, but sharding by the timestamp sounds like a bad idea because all writes will go to one node, canceling the benefits of sharding.
The insert rates are pretty high, so we need sharding to work properly here.
I also thought about creating a new collection every month and then picking the relevant collection for a user query.
Collections older than 12 months would be either dropped or archived.
There is also the option of creating an entirely new database every month and doing a similar rotation.
Other options? Or perhaps one large collection is THE option to grow real big?
Please share your experience and considerations in similar apps.
It really depends on the use-case for your queries.
If it's something that can be aggregated, I would say do it through a scheduled map/reduce job and store the smaller, aggregated data in separate collection(s).
If everything should be in the same collection and all of the data should be queried at the same time to generate the desired results, then you need to go with sharding. Then, depending on the data size of your queries, you could go with an in-memory map/reduce or even do it at the application layer.
As you pointed out yourself, sharding based on the timestamp is a very bad idea: it sends all writes to one shard. So choose your shard key carefully; the MongoDB docs have a very good explanation of this.
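For example, a compound shard key that leads with a high-cardinality field (such as the host) and only then the timestamp keeps writes spread across shards while still supporting time-range queries per host. A rough sketch, with database, collection and field names as assumptions:

sh.enableSharding("metrics")
sh.shardCollection("metrics.observations", { host: 1, ts: 1 })

// alternatively, if range locality on the shard key isn't needed,
// a hashed shard key spreads writes evenly by design:
// sh.shardCollection("metrics.observations", { _id: "hashed" })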
If you can elaborate more on your specific query needs, it would be easier to suggest something.
Hope it helps.
I think monthly collections will give you some boost, but I was wondering why you can't use the hour part of your timestamp for sharding. You can add a field which holds the HOUR part of the timestamp, and when you shard against it the data will be spread nicely, since the hour repeats every day. I have not tested it, but I thought it might help you.
I would suggest going ahead with a single collection. As suggested by @Devesh, an hour-based shard key should be fine, but you need to include the new 'hour' key in your queries to get better performance.
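A minimal sketch of that hour-key idea (collection name and fields are assumptions, and as noted above it is untested):

// store the hour of day alongside the timestamp at insert time
db.observations.insert({ ts: ts, value: v, hour: ts.getHours() })

// shard on it so writes rotate across shards as the hour changes
sh.shardCollection("mydb.observations", { hour: 1, ts: 1 })

// include the hour in queries so they can be routed to the relevant shards
db.observations.find({ hour: 14, ts: { $gte: from, $lt: to } })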
I'm building a very large counter system. To be clear, the system is counting the number of times a domain occurs in a stream of data (that's about 50 - 100 million elements in size).
The system will individually process each element and make a database request to increment a counter for that domain and the date it is processed on. Here's the structure:
stats_table (or collection)
-----------
id
domain (string)
date (date, YYYY-MM-DD)
count (integer)
My initial inkling was to use MongoDB because of its atomic counter feature. However, as I thought about it more, I figured Postgres updates already occur atomically (at least that's what this question leads me to believe).
My question is this: is there any benefit of using one database over the other here? Assuming that I'll be processing around 5 million domains a day, what are the key things I need to be considering here?
All single operations in Postgres are automatically wrapped in transactions, and all operations on a single document in MongoDB are atomic. Atomicity isn't really a reason to prefer one database over the other in this case.
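For what it's worth, on the MongoDB side the whole increment can be a single atomic upsert (collection name assumed to be stats, fields as in the structure above):

db.stats.update(
    { domain: "example.com", date: ISODate("2014-06-23") },
    { $inc: { count: 1 } },
    { upsert: true }
)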
While the individual counts may get quite high, if you're only storing aggregate counts and not each instance of a count, the total number of records should not be too significant. Even if you're tracking millions of domains, either Mongo or Postgres will work equally well.
MongoDB is a good solution for logging events, but I find Postgres to be preferable if you want to do a lot of interesting, relational analysis on the analytics data you're collecting. To do so efficiently in Mongo often requires a high degree of denormalization, so I'd think more about how you plan to use the data in the future.
I am considering MongoDB to hold the metadata of images recorded from 100 cameras, with records kept for 30 days per camera. If one camera produces 100,000 images a day, then I am going to store at most (100 x 30 x 100,000) image documents in MongoDB. My web application will query this data as:
Select a Camera > Select a Date > Select an Hour > Fetch all images in that hour.
I plan to design the schema with one of the following three options, and need your opinions/suggestions on the best way forward:
1) Hour-wise collections: create 72,000 MongoDB collections, i.e. 1 collection per hour for each camera (100 cameras x 30 days x 24 hours), using --nssize 500 to exceed the 24,000-collection limit. I am not sure whether MongoDB will even allow me to create that many collections, and what the expected performance gains and losses are when reading from and writing to them. On the other hand, reading the images for a given hour looks tremendously easy with this schema, because I can fetch the data with a single query against one collection.
2) Day-wise collections: create 3,000 MongoDB collections, i.e. 1 collection per day for each camera (100 cameras x 30 days). This is allowable and seems a reasonable number of collections, but my concern is reading the images of a particular hour from a particular day's collection.
3) Camera-wise collections: create 100 MongoDB collections, i.e. 1 collection for each camera. Snapshots would then be saved with a unique 'id' in a format like 20141122061055000, which is a rephrasing of the full timestamp (2014-11-22 06:10:55.000).
Ideally I would go with (1), (2) or (3), but any other option is welcome.
Please also comment on my choice of MongoDB for this case.
Regards.
This continues from: Pros and Cons of using MongoDB instead of MS SQL Server.
I am unsure why you are trying to take the advice of using many collections.
Using many collections in this way in MongoDB is considered a bad idea (and you would most likely have to increase the ns size for this on top of your index overhead); you should instead scale a single collection of common documents way out horizontally. It seems the other answerers agree.
I would use a single collection with a document structure maybe like this (quick, off the top of my head):
{
    _id: {},
    camera_id: ObjectId(),
    image: {},
    hour: ts_of_hour,
    day: ts_of_day
}
That way you have all the data you need to select images by whatever criterion you want.
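A couple of example queries against that single collection (values are placeholders):

// all images from one camera in a given hour
db.images.find({ camera_id: cameraId, hour: ts_of_hour })

// all images from one camera across a range of days
db.images.find({ camera_id: cameraId, day: { $gte: fromDay, $lte: toDay } })

// a compound index to back these queries
db.images.createIndex({ camera_id: 1, hour: 1 })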
NB: Consider as well that MongoDB's lock is database-level, not collection-level. You won't gain anything useful from many collections here; you'll only make your querying harder and more complex, and maybe make your data harder to maintain.
Edit
To answer some of your concerns:
NB: I have not designed your app and this is a late answer (late at night too) so basically this is me fleshing out basic concepts that immediately come to mind.
1 collection for each camera, i.e. 100 collections almost.
Again, I don't really see the point. If you were to do this for optimisation reasons you would do it as one camera per DB, but that is officially overkill. Honestly, 30m records is nothing; I will resolve that concern right now. Whether you are talking about SQL or MongoDB, a 30m-record collection is normally considered small, minute even, in terms of the database's potential (with MS SQL saying it can store petabytes per table).
Select all images between FromDate and ToDate
You can use the answer above to accomplish that using a BSON date field on your document.
Select Top(COUNT) images between FromDate and ToDate
You can just count().
top() is not implemented in all DB systems, so this is MS SQL-specific; however, in this particular query it does nothing useful anyway, since that query will always return one row.
You can aggregate this particular data into another collection. That is fine; in that other collection you would have a set of day documents:
{
    count: 3,
    day: (date|ts)
}
And then you can just sum up over the days, since count() can get slow on a large working set. The aim of this collection is to summarise your data so that the working set for your queries stays manageable.
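Summing those day documents for a FromDate/ToDate range is then a small aggregation (collection name assumed):

db.image_counts_per_day.aggregate([
    { $match: { day: { $gte: fromDate, $lte: toDate } } },
    { $group: { _id: null, total: { $sum: "$count" } } }
])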
So other collections are fine to use as a "cache" of aggregation results which would otherwise be slow to compute, or of course to hold other entities within your app (like a relational DB would).
Basically, just as in SQL, documents with a common schema get grouped into collections. So really, I would design your app as you would in SQL, with only one table: images (and maybe camera as well).
All others except for 5 have been covered loosely here so:
Select previous/next images from/to an Image with an ID
You can use the _id here like so:
db.images.find({_id: {$gt: last_id}}).sort({_id: 1}).limit(1)
And that should work pretty well.
As for the comment you posted here as well:
Do you mean that in MongoDB, querying a collection with 30 documents is not different from querying a collection with 30,00,000 documents ?
Now that depends on how much you know about database design in general and how to scale database architecture. This is something that doesn't just apply to MongoDB but also to SQL. If set up right, SQL can easily query 30m records as if they were 30.
What it all comes down to is sharding. Whether it would be fast comes down to the indexes across those shards that the queries run against, and their working set size (how much data is needed in RAM, and is it in RAM?). By the looks of it, a shard index over image_id (ObjectId) and date might give you what you want. However, this will need more testing, and since I believe you are a little new to scaling databases you should really do some searching on this subject via Google or something.
NB again: 30m documents might not need sharding, so this could be just a case of making good indexes.
Hopefully this helps and I haven't gone round in circles here,
I don't see your problem with the collections. Photos all share a single schema, so they should be in a single collection.
Each photo gets a timestamp; the rest is done by querying. You can query documents per hour without a problem:
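// note: the date object's fields are assumed here, and JavaScript Date months are 0-based (0 = January)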
var begin_hour = new Date(date.year, date.month, date.day, hour);
var end_hour = new Date(date.year, date.month, date.day, hour + 1);
db.photos.find({taken: {$gte: begin_hour, $lt: end_hour}})
This selects the photos by the selected hour.
If that doesn't satisfy you, there's also MapReduce.