I have a MongoDB collection in which I store user preferences. There is a very large number of objects in one particular collection, and a user can follow a key in the collection. For example:
collectionx { key1: value1, key2: value2, key3: value3, ..., keyn: valuen }
Now the user can follow any number of keys, i.e., "when key1 equals some value, update me" (very similar to the Twitter "follow" feature).
Now how can I efficiently do this?
Also, if I query Mongo with a query like this:
db.collection.find({ keyId: 290 })
or this:
db.collection.find({ keyId: { $in: [290] } })
will there be any drastic performance difference when there are millions of users and they all follow one show?
I think one of the biggest concerns with large amounts of data in any database is avoiding hitting the disk when you query. MongoDB does a fairly good job of keeping data in memory, but if your data set outgrows your memory you will start swapping, and that will hurt performance.
There shouldn't be much of a difference between an $eq query and an $in query as long as there is an index on the key you are querying. If there is no index, you'll do a full collection scan.
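A quick way to verify this in the shell, using the names from the question (the index itself is my addition; older shells call it ensureIndex()):
db.collection.createIndex({ keyId: 1 })
db.collection.find({ keyId: 290 }).explain()            // should report an index scan
db.collection.find({ keyId: { $in: [290] } }).explain() // same index, same bounds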
For large amounts of data it is highly recommended to work with sharding.
It allows the data to be split across shards, so each shard's index can fit in RAM. I think a findOne() by index should be quite efficient. The only thing that can hurt your performance in this case is massive write load on top of your read operations, since Mongo has a global lock.
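A minimal sketch of enabling that (the database name mydb and the shard key keyId are my assumptions):
sh.enableSharding("mydb")
sh.shardCollection("mydb.collectionx", { keyId: 1 })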
Related
A quick question on whether to index or not. There are frequent queries against a collection that look for a specific 'user_id' within an array in a document. See below:
_id:"bQddff44SF9SC99xRu",
participants:
[
{
type:"client",
user_id:"mi7x5Yphuiiyevf5",
screen_name:"Bob",
active:false
},
{
type:"agent",
user_id:"rgcy6hXT6hJSr8czX",
screen_name:"Harry",
active:false
}
]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.
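For reference, since participants is an array this is a multikey index; a sketch (the collection name conversations is my assumption):
db.conversations.createIndex({ "participants.user_id": 1 })
db.conversations.find({ "participants.user_id": "mi7x5Yphuiiyevf5" })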
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often the query is made
how many documents are in the collection
how many users are in each document
how often you add/remove users from a document after it is inserted
whether you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured your data.
I am designing my database with MongoDB with future scalability in mind. My main concern right now is how to represent the indexes; from what I have read, this is a crucial factor when scaling huge collections, both for RAM consumption and for sharding efficiency.
For simplicity, I have two different collections: a users collection which stores the username, email, and some metadata, and a devices collection, which contains a device name, some metadata, and a reference to its owner. One user can have millions of devices (so it is not feasible to store them all in a single user document).
The devices collection should support queries by the full device identifier (username, device_name), and also by username alone.
In this case I see several different approaches for storing the indexes (sketched in the shell just after this list):
1. Use a secondary compound index on username and device_name (in that order).
2. Use the primary _id index, storing a string of the form username#device_name.
3. Use an object in the _id field holding both values: {owner: username, device: device_name}.
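A rough sketch of the three alternatives (the collection names devices1/devices2/devices3 are my own placeholders, one per alternative):
// 1. Default ObjectId _id plus a secondary compound index
db.devices1.createIndex({ username: 1, device_name: 1 })
// 2. Concatenated string as _id; prefix queries reuse the _id index
db.devices2.insert({ _id: "alice#sensor42" })
db.devices2.find({ _id: /^alice#/ })
// 3. Object in _id plus a secondary compound index on its subfields
db.devices3.insert({ _id: { owner: "alice", device: "sensor42" } })
db.devices3.createIndex({ "_id.owner": 1, "_id.device": 1 })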
To test these indexes I generated some server load: I created three different collections, one per alternative, and filled each with 5M documents. Some data:
1. I do not use the automatically generated _id created by Mongo, as all my queries require username/device, so this approach takes some extra space for indexing. The index size is 524MB. It is efficient when querying either by user or by user/device.
2. As I replace the _id with my own string, the index takes less space, in this case 352MB. I am still able to query efficiently by user (with a regex like /^username#/, explain() reports almost the same results as in 1) and by the exact username/device.
3. The _id index cannot be changed to a compound index, so a secondary compound index on {_id.owner, _id.device} is required. This results in a huge index size of 1059MB! Queries perform as well as in the previous cases.
So I can discard alternative 3, as it is not efficient enough. Between alternatives 1 and 2 I prefer 1, as that approach is cleaner, but it carries an _id field I will never use. So at this moment the winning approach seems to be number 2, as it allows me to query efficiently by username or username/device, and it also takes less index space.
Is there a good reason not to use number 2 and to stay with number 1 instead, for example when selecting the shard key? Is there something I am missing? I am new to MongoDB and do not want to run into problems when scaling my schema.
Is there a lot of overhead in excluding nearly all of the data in a document when querying a mongo database?
For example, in the case where I only want field1 and field2, for a collection with a document structure of:
{
    "field1" : 1,
    "field2" : true,
    "field3" : ["big","array",...],
    "field4" : ["another","big","array",...]
}
would I benefit more from:
1. Creating a separate collection alongside this collection containing only field1 and field2, or
2. Using .find() on the original documents with inclusion/exclusion parameters?
Note: The inefficiency of saving the same data twice isn't a concern for me as much as the efficiency of querying the data
Many thanks!
Projection is somewhat similar to using explicit column names in SQL, so it seems a little counter-intuitive to ask whether returning a smaller amount of data would incur overhead compared to returning a larger amount (the full document).
So you still have to find the document (depending on how you .find() it, that may be fast or slow), but returning only the first two fields rather than all of them (the complete document) will make the query faster, not slower.
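A minimal sketch of such a projection, using the field names from the question (the filter is a placeholder):
db.collection.find(
    { /* your query */ },
    { field1: 1, field2: 1, _id: 0 }   // return only field1 and field2
)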
Having a second collection may only help if you are concerned about your collection fitting into RAM. If the documents in the duplicate collection are much smaller, then they can presumably fit into a smaller amount of total RAM, decreasing the chance that a page will need to be swapped in from disk. However, if you are writing to this collection as well as the original collection, then you have to keep a lot more data in RAM than if you just had the original collection.
So while the intricate details may depend on your individual set-up, the general answer is probably 2: you will benefit more from using projection and returning only the two fields you need.
Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (where we saw it), metadata
What we want to know with this information:
Unique visitors on one or more clusters for a given range of dates.
Unique Visitors by day
Grouped metadata for a given range (platform, browser, etc.)
The model I settled on in order to query this information easily is:
{
    VisitorId: 1,
    ClusterVisit: [
        { clusterId: 1, dates: [date1, date2] },
        { clusterId: 2, dates: [date1, date3] }
    ]
}
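To illustrate the read side, "unique visitors by day" against this model could look something like the following aggregation (the collection name visits is my assumption; $size requires MongoDB 2.6+):
db.visits.aggregate([
    { $unwind: "$ClusterVisit" },
    { $unwind: "$ClusterVisit.dates" },
    { $group: { _id: "$ClusterVisit.dates", visitors: { $addToSet: "$VisitorId" } } },
    { $project: { _id: 0, day: "$_id", uniqueVisitors: { $size: "$visitors" } } }
])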
Indexes (sketched in the shell below):
by VisitorId (to ensure uniqueness)
by ClusterVisit.clusterId + ClusterVisit.dates (for searching)
by VisitorId + ClusterVisit.clusterId (for updating)
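What those definitions would look like in the shell (the collection name visits is my assumption):
db.visits.createIndex({ VisitorId: 1 }, { unique: true })
db.visits.createIndex({ "ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1 })
db.visits.createIndex({ VisitorId: 1, "ClusterVisit.clusterId": 1 })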
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we look for an existing combination of VisitorId and clusterId and $addToSet the date.
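A sketch of that first step using the positional $ operator (the collection name visits is my assumption; date1 is the placeholder from the model above):
db.visits.update(
    { VisitorId: 1, "ClusterVisit.clusterId": 1 },
    { $addToSet: { "ClusterVisit.$.dates": date1 } }
)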
Second:
If the first doesn't match, we upsert:
db.visits.update(
    { VisitorId: 1 },
    { $addToSet: { ClusterVisit: { clusterId: 1, dates: [date1] } } },
    { upsert: true }
)
Between the first and second steps, the import covers both the case where the clusterId does not yet exist and the case where the VisitorId does not exist at all.
Problems:
Updates/inserts/upserts become totally inefficient (nearly impossible) as the collection grows, I guess because the documents keep getting bigger as new dates are added.
Difficult to maintain (mostly unsetting dates).
I have a collection with more than 50,000,000 documents that I can't grow any more; it updates only ~100 records/sec.
I think the model I'm using is not the best for this volume of information. What do you think would be best to get more upserts/sec and to query the information FAST, before I get into sharding, which will take more time while I learn it and gain confidence with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain():
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (a compound one is better):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
Improving performance stackoverflow links:
https://stackoverflow.com/a/7635093/602018
A good resource for further sharding/replica education is:
https://education.10gen.com/courses
From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is: is that correct? Shouldn't it be the opposite for better write distribution?
Also from the book:
This pattern continues: everything will always be added to the “last”
chunk, meaning everything will be added to one shard. This shard key
gives you a single, undistributable hot spot.
So, given that my app always searches by user_id and for the last entries in the collection:
What is the best shard key I should use, this:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So, given that my app always searches by user_id and for the last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sorting. If your _id happens to be date-based that would be beneficial as part of the shard key in order to find the last entries.
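A sketch of what that looks like end to end (the namespace mydb.posts, the ObjectId-based _id, and someUserId are my assumptions):
sh.shardCollection("mydb.posts", { user_id: 1, _id: 1 })
// "last entries for a user" can then be served from a single shard,
// since an ObjectId _id sorts by creation time:
db.posts.find({ user_id: someUserId }).sort({ _id: -1 }).limit(10)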