MongoDB model for uniqueness

Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (where we see it), metadata
What we want to know with this information:
Unique visitors on one or more clusters for a given range of dates.
Unique visitors by day.
Grouped metadata for a given range (platform, browser, etc.)
The model I stuck with in order to easily query this information is:
{
  VisitorId: 1,
  ClusterVisit: [
    {clusterId: 1, dates: [date1, date2]},
    {clusterId: 2, dates: [date1, date3]}
  ]
}
Indexes:
by VisitorId (to ensure uniqueness)
by ClusterVisit.clusterId - ClusterVisit.dates (for searching)
by VisitorId - ClusterVisit.clusterId (for updating)
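A minimal sketch of how those indexes might be created in the mongo shell (the collection name "visits" is a placeholder, not from the question):
db.visits.ensureIndex({VisitorId: 1}, {unique: true});                          // uniqueness
db.visits.ensureIndex({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1});  // searching
db.visits.ensureIndex({VisitorId: 1, "ClusterVisit.clusterId": 1});             // updating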
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we search for a combination of VisitorId and clusterId and $addToSet the date.
Second:
If the first doesn't match, we upsert on the VisitorId with:
$addToSet: {ClusterVisit: {clusterId: 1, dates: [date1]}}
With the first and second steps, importing covers the cases where the clusterId doesn't exist or the VisitorId doesn't exist.
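A minimal mongo-shell sketch of that two-step import (the collection name "visits" and the variable date1 are placeholders):
// Step 1: the visitor/cluster pair already exists, so just add the date to its array
db.visits.update(
    {VisitorId: 1, "ClusterVisit.clusterId": 1},
    {$addToSet: {"ClusterVisit.$.dates": date1}}
);
// Step 2: nothing matched, so upsert the visitor and add the new cluster entry
db.visits.update(
    {VisitorId: 1},
    {$addToSet: {ClusterVisit: {clusterId: 1, dates: [date1]}}},
    {upsert: true}
);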
Problems:
Totally inefficient (nearly impossible) update/insert/upsert performance as the collection grows; I guess it's because the document size keeps getting bigger as new dates are added.
Difficult to maintain (unsetting dates, mostly).
I have a collection with more than 50,000,000 documents that I can't grow any more. It updates only ~100 records/sec.
I think the model I'm using is not the best for this amount of information. What do you think would work better to get more upserts/sec and to query the information FAST, before I mess with sharding, which is going to take more time while I learn and get confident with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks

Arrays are expensive on large collections: mapReduce, aggregate...
Try .explain() (see the sketch after this list):
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for indexes (also in the sketch after this list):
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (a compound index is better):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
Stack Overflow links on improving performance:
https://stackoverflow.com/a/7635093/602018
A good starting point for further education on sharding and replicas is:
https://education.10gen.com/courses
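A minimal sketch of the .explain()/hint() checks mentioned above (collection, fields, and index are placeholders taken from the model in the question):
// see which index the query uses and how many documents are scanned
db.visits.find({"ClusterVisit.clusterId": 1}).explain();
// force a specific index when the optimizer picks a poor plan
db.visits.find({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": {$gte: date1}})
         .hint({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1});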

Related

Is it possible to update data without affecting Indexes?

I'm using MongoDB (version 5.0) with somewhat large data and indexes.
The data storage size is around 2.5TB, and the indexes (8 of them) also sum up to 1TB.
Since I'm pouring that data into a new MongoDB instance and building new indexes, it's taking approximately 1-2 weeks. (Estimated; still pouring.)
But with service updates it's highly likely that new columns will have to be added to documents (or existing columns modified).
And I think I will update them with bulkWrite and update.
Here comes the question: if the new column has nothing to do with the indexes, will it be possible for the update operation to work without accessing the indexes?
For example, with indexes set only on _id and memberNo, the old document
{
  _id: 's9023902',
  memberNo: '20219210',
  purchasedMurchantNo: 'M2937'
}
is updated into the document below (the purchasedMurchantNo column is updated and isReturned is added):
{
  _id: 's9023902',
  memberNo: '20219210',
  purchasedMurchantNo: 'NEW_FORMAT_NO_2937',
  isReturned: false
}
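For illustration, the kind of bulkWrite update described above might look like the following sketch (the collection name "members" is hypothetical; only non-indexed fields are written):
db.members.bulkWrite([
    {updateOne: {
        filter: {_id: 's9023902'},
        update: {$set: {purchasedMurchantNo: 'NEW_FORMAT_NO_2937', isReturned: false}}
    }}
    // ...one updateOne entry per document in the batch
]);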
My guess is that it won't change the structure of the index, and since MongoDB will know this from the index and the query structure, there will be no need to touch the index structure.
I'm asking because, with the indexes in place, insert operations have slowed down dramatically and CPU usage has increased dramatically. Since this MongoDB is not in service right now, the high CPU usage is not an issue, but at the point when we update our MongoDB it will be in service, and then it will be an issue.
Question in a sentence: in certain circumstances, can updateMany work at a fast speed, as if there were no index?
Thank you in advance for the brilliant answer given to a not so brilliantly asked question.

What are the production best practices to store a large number of documents when using MongoDB?

I need to store application transaction logs and decided to use MongoDB. Every day almost 200,000 (give or take) records are stored in a single-node MongoDB.
We have some reports and operations (if something happened, then do something) that depend on those logs, so we need to find documents matching different criteria. Going at that pace, is it vulnerable? Will queries be slow to execute?
Any suggestions to make using MongoDB efficient?
By the way, that data is in a single collection, and the MongoDB server version is 4.2.6.
Mongo collections can grow to many terabytes without much issue. To be able to query that data in a speedy manner, you will have to analyze your queries and create indexes for the fields that are used in those queries.
Indexes are not free, though: they take up both disk space and RAM, because for indexes to be useful they need to fit entirely in RAM.
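A minimal sketch of such an index, assuming a hypothetical "logs" collection whose report queries filter by event type and date:
db.logs.createIndex({eventType: 1, createdAt: -1});   // supports eventType filters plus date-range sorts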
In most cases, if indexes and collections grow beyond what your hardware can handle, you will have to archive/evict old data and trim down the collections.
If your queries need to include that evicted data in order to generate your reports, you will have to keep another collection with summarized values/data for the evicted records, which you combine with the present data when generating the reports.
Alternatively, sharding can help with big data, but there are some limitations on the queries you can run against sharded collections.

Slow Upserts with PyMongo

I'm trying to insert ~800 million records into MongoDB using PyMongo on a MacBook Air (1.7GHz i7) with no multi-threading. The documents are structured as below.
The records I'm reading are tuples of the form:
(user_id, imp_date, imp_creative, imp_pid, geo_id)
I'm creating my own _id field based on the user_id in the file I'm reading from.
{'_id': user_id,
 'imp_date': [array of dates],
 'imp_creative': [array of numeric ids],
 'imp_pid': [array of numeric ids],
 'geo_id': numeric id}
I'm using an upsert with $push to append the date, creative id, and pid to the corresponding arrays:
self.collection.update({'_id': uid},
                       {"$push": {'imp_date': <datevalue>,
                                  'imp_creative': <creative_id>,
                                  'imp_pid': <pid>}},
                       safe=True, upsert=True)
I'm using an upsert with $set to overwrite the geographic location (I only care about the most recent):
self.collection.update({'_id': uid},
                       {"$set": {'geo_id': <geo id>}},
                       safe=True, upsert=True)
I'm only writing about 1,500 records per second (8,000 if I set safe=False). My question is: what can I do to speed this up further (ideally to 20k/second or faster)?
Ideas I can't find a definitive recommendation on:
- Using multiple threads to insert the data
- Sharding
- Padding arrays (my arrays grow very slowly; each document's arrays will have an average length of ~4 at the end of the file)
- Turning journaling off
Apologies if I've left out any required information; this is my first post.
1 - You could add an index to speed things up; an index would help you find documents faster, although inserts would be slower (you have to update the index as well). Whether the improvement in the retrieval phase compensates for the extra time spent updating the index depends on how many records you have in the collection, how many indexes you have, and how complicated those indexes are.
However, in your case you are only querying by _id, so there's not much more you can do with indexes.
2 - Are you using two consecutive updates? I mean, one for the $set and one for the $push?
If that's true, then you should definitely use just one:
self.collection.update({'_id': uid},
                       {"$push": {'imp_date': <datevalue>,
                                  'imp_creative': <creative_id>,
                                  'imp_pid': <pid>},
                        "$set": {'geo_id': <geo id>}},
                       safe=True, upsert=True)
3 - An update is an atomic operation which might block other queries. If the document you are about to update is not already in RAM but on disk, Mongo will first have to fetch it from disk and then update it. If you do a find operation first (which doesn't block, as it's a read-only operation), the document will already be in RAM, so the subsequent update (the locking one) will be faster:
self.collection.find_one({'_id': uid})
self.collection.update({'_id': uid},
                       {"$push": {'imp_date': <datevalue>,
                                  'imp_creative': <creative_id>,
                                  'imp_pid': <pid>},
                        "$set": {'geo_id': <geo id>}},
                       safe=True, upsert=True)
4 - If your documents don't grow too much, as you have said, it won't be necessary to worry about the padding factor and reallocation issues. Furthermore, in some recent versions (I can't remember if it was since 2.2 or 2.4) collections are created with the powerOf2Sizes allocation option enabled by default.
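On versions where that option is not on by default, it can be enabled for an existing collection with collMod; a sketch (the collection name "imps" is a placeholder):
db.runCommand({collMod: "imps", usePowerOf2Sizes: true});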

How to insert quickly to a very large collection

I have a collection of over 70 million documents. Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that is because the Mongo engine is comparing the _ids of all the new documents against all 70 million existing ones to find duplicate _id entries, and since the _id index is disk-resident, it makes the inserts a lot slower.
Is there any way to avoid this? I just want Mongo to take the new documents and insert them as they are, without doing this check. Is that even possible?
Diagnosing "Slow" Performance
Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (i.e. serverStatus, mongostat, mongotop), system resource monitoring, and information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would strongly recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
A few interesting presentations that provide very relevant background material for performance troubleshooting and debugging are:
William Zola: The (Only) Three Reasons for Slow MongoDB Performance
Asya Kamsky: Diagnostics and Debugging with MongoDB
Improving efficiency of inserts
Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve efficiency of inserts by:
removing any unused or redundant secondary indexes on this collection
using the Bulk API to insert documents in batches (see the sketch below)
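A minimal sketch of an unordered bulk insert in the mongo shell (MongoDB 2.6+; the collection name "events" and the docs array are placeholders for your next batch of ~2K documents):
var bulk = db.events.initializeUnorderedBulkOp();
docs.forEach(function (doc) {    // docs: the batch of new documents to insert
    bulk.insert(doc);
});
bulk.execute();                  // sends the whole batch in as few round trips as possible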
Assessing Assumptions
Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that is because the Mongo engine is comparing the _ids of all the new documents against all 70 million existing ones to find duplicate _id entries. Since the _id index is disk-resident, it makes the inserts a lot slower.
If a collection has 70 million entries, that does not mean an index lookup involves 70 million comparisons. The indexed values are stored in B-trees, which allow for a small number of efficient comparisons. The exact number will depend on the depth of the tree, how your indexes are built, and the value you're looking up, but it will be on the order of tens (not millions) of comparisons; for example, a B-tree with a branching factor of 100 covers 70 million keys in only about four levels, since 100^4 = 100 million.
If you're really curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: Storage Visualizers and Commands for MongoDB.
Since the _id index is disk-resident, it makes the inserts a lot slower.
MongoDB loads your working set (portion of data & index entries recently accessed) into available memory.
If you are able to create your ids in an approximately ascending order (for example, the generated ObjectIds) then all the updates will occur at the right side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM").
Yes, I can let Mongo use the _id for itself, but I don't want to waste a perfectly good index on it. Moreover, even if I let Mongo generate the _id for itself, won't it still need to compare for duplicate key errors?
A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of returning a duplicate key exception, so your application will not get duplicate key exceptions and have to retry with a new _id).
If you have a better candidate for the unique _id in your documents, then feel free to use this field (or collection of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any fields that you might want to modify later.

general questions about using mongodb

I'm thinking about trying MongoDB for storing our stats, but I have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents; what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
  - millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about MongoDB was grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group(
{ cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
, key: {http_action: true}
, initial: {count: 0, total_time:0}
, reduce: function(doc, out){ out.count++; out.total_time+=doc.response_time }
, finalize: function(out){ out.avg_time = out.total_time / out.count }
} );
But my main concern is how hard that command, for example, would be on the server if there are, say, tens of millions of records across dozens of documents, on a server with 512MB-1GB of RAM on Rackspace, for example? Would it still run with low load?
Is there any limit to the number of documents MongoDB can have (in separate databases)? Also, is there any limit to the number of records in the tree I described above? And does the query I showed above run instantly, or is it some sort of map/reduce query? I'm not sure whether I can execute it upon page load in our control panel to get those stats instantly.
Thanks!
Every document has a size limit of 4MB (which in text is A LOT).
It's recommended to run MongoDB with replication or to use sharding, as you will otherwise have problems with single-server durability. Single-server durability is not guaranteed because MongoDB only fsyncs to disk every 60 seconds, so if your server goes down between two fsyncs, the data that was inserted/updated in that time will be lost.
There is no limit on the number of documents in MongoDB other than your disk space.
You should try to import a dataset that matches your data (or generate some test data) into MongoDB and analyse how fast your queries execute. Remember to set indexes on the fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (the log will get pretty huge otherwise).
Regarding your database layout, I suggest changing the "schema" (yeah yeah, schemaless...) to:
website (collection):
- some keys/values about the particular document
statistics (collection):
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
- a DBRef to website
See Database References
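A minimal sketch of that layout, using a plain reference field rather than a formal DBRef (collection and field values are made up):
var site = {_id: ObjectId(), domain: "example.com"};
db.website.insert(site);
// one document per page view, pointing back at its website
db.statistics.insert({
    website_id: site._id,
    timestamp: new Date(),
    ip: "203.0.113.7",
    browser: "Firefox"
});
// recent page views for that website
db.statistics.find({website_id: site._id}).sort({timestamp: -1});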
Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.
Basically, the number of page views a page can generate is unbounded, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends using a map-reduce query instead of group(), because group() has some limitations with large datasets and sharded environments.
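For comparison, on newer servers the same rollup can be written with the aggregation framework instead of group() or map-reduce; a sketch, assuming the same field names as the query above:
db.test.aggregate([
    {$match: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}},
    {$group: {_id: "$http_action",
              count: {$sum: 1},
              total_time: {$sum: "$response_time"}}},
    {$project: {count: 1,
                total_time: 1,
                avg_time: {$divide: ["$total_time", "$count"]}}}
]);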