I'm using MongoDB in my project for statistics and data analysis. My goal is to design the data for the best performance and scalability.
Let's assume I have several shops, each with a list of unique products, and I need to query some data about the products and calculate some basic statistics (only for a certain shop).
Which way is better from a performance point of view: to have a Shop document with a list of products inside, and query only within that document?
Or is it better to have a separate collection containing all products of all shops and build queries against that collection?
Perhaps the real question is: can MongoDB query within the body of a single document as efficiently as it can across many documents?
UPD 1:
For now let's assume that the products themselves are quite small (Id, Price, Name, Count) and their number is limited. (So I know for sure that there won't be more than 1000 products per shop.)
UPD 2:
Also let's assume that I don't want to read that database for display purposes, just for statistics. (How much was sold, which products are most interesting, which groups, and so on.)
As with all these questions, one of the main deciding factors is data size and growth.
Will your data per shop exceed the 16 MB document limit? Judging by how many items a shop can have and how much data can be attributed to just a single item, I would say yes, very quickly.
What I mean is imagine how many fields you have for a product:
Product id
Description
Price
Options
Currency
blurb
SKU
Barcode (or whatever)
Some of these fields will be quite big. For example, the description of the product could be massive.
However, if on the off chance this is a very simple application, and you are looking at products that can be fully contained in a single data row and shops which will never have more than 5-8,000 items, then you could do better with subdocuments of the sort:
{
    _id: ObjectId(),
    shop_name: 'toys r us',
    items: [
        { p_id: ObjectId(), price: '1000000', currency: 'GBP', description: 'fkf' }
    ]
}
Subdocuments do not come without their price, though. Imagine a document that has only one subdocument today, 100 in 10 days, and 1,000 in 20.
The fragmentation caused by constantly growing documents could be quite significant. For one, this lowers your performance; on top of that, fixing fragmentation is not a nice job, and working around it later in the application logic is even harder.
To understand more about how MongoDB actually works inside you can view this presentation: http://www.10gen.com/presentations/storage-engine-internals
As for querying on a subdocument, it does require a little extra work on MongoDB's end, but it is still quite cheap (cheaper than multiple round trips) provided you set it up right.
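For example, with the subdocument layout above, a query on an embedded field just uses dot notation, and a multikey index on that field keeps it cheap (the ObjectId below is only a placeholder):

// Index the embedded product id (this becomes a multikey index)
db.shops.ensureIndex({ 'items.p_id': 1 })
// Find the shop document(s) containing a given product
var productId = ObjectId();  // placeholder - use a real product id here
db.shops.find({ 'items.p_id': productId })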
Personally, based on the information I have given above, I would go for two collections, but I don't know the true extent of your scenario...
Edit
UPD 1: For now let's assume that the products themselves are quite small (Id, Price, Name, Count) and their number is limited. (So I know for sure that there won't be more than 1000 products per shop.)
Okay, so your subdocuments are small, probably no more than a few hundred bytes each. In this case you might be able to use subdocuments here with powers-of-2 size allocation to remedy some of that fragmentation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes
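As a sketch, enabling that allocation strategy on an existing collection (assuming it is called shops) is a single command:

// Switch the shops collection to powers-of-2 record allocation
db.runCommand({ collMod: "shops", usePowerOf2Sizes: true })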
This could make for a performant setup. Going from 1 to 1000 subdocuments can still cause fragmentation, but those freed fragments should be filled by smaller, newly created shop documents as they come into existence.
UPD 2: Also let's assume that I don't want to read that database for display purposes, just for statistics. (How much was sold, which products are most interesting, which groups, and so on.)
So, using subdocuments, you could easily get the total sold per shop like this:
db.shops.aggregate([
// Match shop id 1
{$match: {_id: 1}},
// unwind the products for that shop
{$unwind: '$products'},
// Group back up by shop id and total amount sold
{$group: {_id: '$_id', total_sold: {$sum: '$products.sold'}}}
])
This uses the new aggregation framework (available since version 2.1): http://docs.mongodb.org/manual/applications/aggregation/
So subdocuments can be just as easy to query on as two separate collections.
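For comparison, if you went with a separate products collection instead, the same statistic is still a single aggregation, assuming each product document carries a shop_id and a sold counter (both names are assumptions here):

db.products.aggregate([
    // Only the products belonging to shop 1
    {$match: {shop_id: 1}},
    // Total amount sold for that shop
    {$group: {_id: '$shop_id', total_sold: {$sum: '$sold'}}}
])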
Related
Quick question on whether to index or not. There are frequent queries to a collection that look for a specific 'user_id' within an array in a document. See below:
{
    _id: "bQddff44SF9SC99xRu",
    participants: [
        {
            type: "client",
            user_id: "mi7x5Yphuiiyevf5",
            screen_name: "Bob",
            active: false
        },
        {
            type: "agent",
            user_id: "rgcy6hXT6hJSr8czX",
            screen_name: "Harry",
            active: false
        }
    ]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data, and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents, I think it was the right move. I'm looking at more possible indexes and optimisations now.
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured your data.
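If you do decide the query is hot enough to warrant it, the index is a one-liner, and explain() will tell you whether it is actually used (the collection name here is an assumption):

// Multikey index on the embedded user_id field
db.conversations.ensureIndex({ 'participants.user_id': 1 })
// Verify the index is used for the frequent query
db.conversations.find({ 'participants.user_id': 'mi7x5Yphuiiyevf5' }).explain()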
I am new to MongoDB and I am having difficulties implementing a solution in it.
Consider a case where I have two collections, a client and a sales collection, with the following designs:
Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion
Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of the products sold, each in the form {product_code, description, units, unit price, amount}
total sales
Now there is a requirement to develop another collection for analytical queries where the following questions can be answered
What is the distribution of sales by gender, region and emp_status?
What are the most frequently purchased products for clients in a particular region?
I considered implementing a very denormalized collection to create a flat and wide collection of the properties of the sales and client collections, so that I can use map-reduce to answer the questions.
In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how to make Map-Reduce or Aggregation help out.
Questions:
How do I implement Map-Reduce to map across 2 collections?
Is it possible to chain MapReduce operations?
Regards.
MongoDB does not do JOINs - period!
MapReduce always runs on a single collection. You cannot have a single MapReduce job that selects from more than one collection. The same applies to aggregation.
When you want to do some data mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client object embedded. You will have to write a little program or script (a rough sketch follows the list below) which iterates over all clients and
finds all Sales documents for the client
merges the relevant fields from Client into each document
inserts the resulting document into the new collection
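Here is a minimal mongo shell sketch of that script, assuming the collections are called client and sales, that client_id is stored as a plain _id reference rather than a DBRef, and that the target collection is named sales_denormalized (all of these names are assumptions):

// Embed a snapshot of each client into their sales and store the result
db.client.find().forEach(function (c) {
    db.sales.find({ client_id: c._id }).forEach(function (s) {
        // keep only the client fields needed for the reports
        s.client = {
            gender: c.gender,
            region: c.region,
            emp_status: c.emp_status
        };
        db.sales_denormalized.insert(s);
    });
});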
When your Client document is small and doesn't change often, you might consider always embedding it into each Sale. This means that you will have redundant data, which looks very evil from the viewpoint of a seasoned RDB veteran. But remember that MongoDB is not a relational database, so you should not apply all RDBMS dogmas unreflected. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively inexpensive and painless, which isn't the case with MongoDB. Besides, sometimes you might even want redundancy to preserve historical information. When you want to know the historical development of sales by region, you want to know the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you could solve this with separate Address documents which have date ranges, but that would make it even more complicated.
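As an illustration of that embedding, each Sales document would then carry a snapshot of the client at purchase time, roughly like this (all values below are made up):

{
    _id: ObjectId(),
    client: { gender: "f", region: "North", emp_status: "employed" },
    trans_date: ISODate("2013-01-15T00:00:00Z"),
    products: [
        { product_code: "P100", description: "example item", units: 2, unit_price: 5, amount: 10 }
    ],
    total_sales: 10
}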
Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents which grow over time, so if your clients tend to return often, this might result in sub-par write performance.
I have a "feeds" collection, and each feed has comments. When someone comments on a feed, they are added to "subscribers", which is a multikey-indexed array field in Mongo.
feeds: {
_id: ...,
text: "Text",
comments: [{by: "A", text: "asd"},{by: "B", text: "sdf"}],
subscribers: ["A","B"]
}
Then when I need to get all feeds with new comments for user A, I ask for feeds with {subscribers: "A"}.
Usually there are 2-5 comments, but sometimes (on hot feeds) there might be >100 comments and >100 subscribers.
I know that it's not recommended to have multikey-indexed fields with too many keys. So how many is too many?
I ask because I need to decide whether to use multikeys or whether it's better to send comments directly to each user. In that case I would have to copy the feed for each subscriber, and the collection would grow VERY fast, which I don't think is good either: 1000 users, each followed by 10 users, each making 10 actions a day = 1,000,000 records every 10 days!
Although you may experience problems with really large documents, particularly if MongoDB must scan the entire document to fulfill a query (as you would expect), arrays with large numbers of values are not in and of themselves problematic in MongoDB, even when they are covered by multikey indexes.
There is one caveat: the index will not store keys (in the case of a multikey index, a key is an individual item in the array) longer than 1024 bytes. As long as the items in the array are shorter than this limit, you should be fine.
Having said this, you do want to avoid data models where an array or other part of the document will grow unbounded and forever. While MongoDB adds a little bit of padding on disk for every document, if the document grows substantially after creation, the database must move it to another location on disk. However you decide to model your data, ensure that your documents do not tend to grow much after creation.
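For the schema in the question, the multikey index behind the {subscribers: "A"} query is simply this (and explain() will confirm it is used):

// Each element of the subscribers array becomes an index key;
// the short user ids are far below the 1024-byte key limit
db.feeds.ensureIndex({ subscribers: 1 })
db.feeds.find({ subscribers: "A" }).explain()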
Reference:
Index Size Limit
Multikey Indexes
I am fairly new to MongoDB. I am creating a web app which allows inserting and searching for multiple products such as laptops, hard drives, webcams... My question is: should I place all of them in the same collection, such as "computer", or should I place each product type in its own collection, like "laptop", "hard drive", "webcam", so that searching for and inserting a product will be faster?
Thanks a lot
Generally speaking, you should use one collection per "type" of thing you're storing. It sounds like all the examples you've given above fall under a product "type" and should be in the same collection. Documents in the same collection need not all have the same fields, though for products you will probably have several fields in common across all documents: name, price, manufacturer, etc. Each document "sub-type" might also share fields of its own; for example, hard drives might all have RPM, storage capacity, form factor, and interface (SATA 2/3, IDE, etc.).
The rationale for this advice is that MongoDB queries are performed on a single collection at a time. If you want to show search results that cover the different categories of products you have, then this is simple with one collection, but more difficult with several (and less performant).
As far as query performance is concerned, be sure to create indexes on the fields that you are searching on. If you allow search by product name or manufacturer, you would have an index on name, and another index on manufacturer. Your exact indexes will vary depending on the fields you have in your documents.
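For example (the collection and field names below are assumed from the discussion above):

// One index per searched field
db.products.ensureIndex({ name: 1 })
db.products.ensureIndex({ manufacturer: 1 })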
Insert speed, on the other hand, will be faster the fewer indexes you have (since each index potentially needs to be updated each time you save or update a document), so it is important not to create more indexes than you'll actually need.
For more on these topics, see the MongoDB docs on schema design and indexes, as well as the presentations on 10gen.com from 10gen speakers and MongoDB users.
I suggest starting with one collection. It is much simpler to search through one collection rather than a collection per product. And in the future, if your queries against the collection become slow, you can start thinking about how to speed them up.
MongoDB has fairly fast indexes and it was designed to be scalable, so once you need to scale your database, replica sets and auto-sharding are in place.
I have statistical data in a MongoDB collection, saved for each record per day.
For example, my collection looks roughly like:
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance is concerned, is this strategy superior to storing one document per record_id containing an array of statistical entries just like the document above:
{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}
On the pro side, I would need only one query to get the entire stat history of a record without resorting to the slower map-reduce method; on the con side, I'll have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise, there will be some disk reallocation going on.
I think this depends on the usage scenario. If the data set for a single aggregation is small, like those 700 records, and you want to do this in real time, I think it's best to choose yet another option: query all the individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain, and it does not suffer from reallocation or size limits. Index use should be efficient, and connection-wise I doubt there's much of a difference: most drivers batch transfers anyway.
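A rough sketch of that client-side approach in the shell, using the per-day schema from the question (the collection name stats and the date range are placeholders):

// Fetch the daily documents for one record over a date range and sum them client-side
var totals = { stat_value_1: 0, stat_value_2: 0 };
db.stats.find({
    record_id: 12345,
    date: { $gte: ISODate("2011-12-01"), $lt: ISODate("2012-01-01") }
}).forEach(function (doc) {
    totals.stat_value_1 += doc.stat_value_1;
    totals.stat_value_2 += doc.stat_value_2;
});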
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per record would go down in the subdocument approach. It's also generally easier to work with top-level documents than with subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.
I think you can refer to the answer here, and also see how foursquare solved this kind of problem here. They are both valuable.