I am currently working on a time series project that stores sensor data. To achieve maximum insertion/write throughput I used a capped collection (according to the MongoDB documentation, capped collections improve read/write performance). But when I tested inserting a few thousand documents with the Python driver, comparing a capped collection without indexes against a normal collection, I could not see much improvement in write performance. For example, I inserted 40K records on a single thread using the pymongo driver: the capped collection took around 25.4 seconds and the normal collection took 25.7 seconds.
Could anyone please explain when a capped collection actually delivers maximum insertion/write throughput? Is it the right choice for time series collections?
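For reference, a minimal sketch of the kind of comparison I ran (the collection names, document shape, and sizes here are illustrative assumptions, not my exact code):

import time
from pymongo import MongoClient

client = MongoClient()                     # assumes a local mongod on the default port
db = client["sensors_test"]

db.drop_collection("capped_test")
db.drop_collection("normal_test")
# size (in bytes) is mandatory for capped collections; 50 MB is an arbitrary choice here
db.create_collection("capped_test", capped=True, size=50 * 1024 * 1024)
db.create_collection("normal_test")

for name in ("capped_test", "normal_test"):
    docs = [{"sensorId": i % 10, "value": i * 0.1} for i in range(40000)]
    start = time.perf_counter()
    db[name].insert_many(docs)             # single-threaded bulk insert of 40K documents
    print(name, round(time.perf_counter() - start, 2), "seconds")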
Data stored in a capped collection is rotated out once the collection's fixed size is exceeded.
Capped collections preserve insertion order, so documents can be retrieved in natural order (the same order in which the database refers to them on disk) without any index. That is what gives them high performance for insertion and retrieval.
For a more detailed description of capped collections, please refer to the documentation at the following URL:
https://docs.mongodb.com/manual/core/capped-collections/
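As a hedged illustration of that behaviour with pymongo (the collection name and size are assumptions made for the example):

from pymongo import MongoClient

client = MongoClient()
db = client["sensors"]

# A capped collection must be created explicitly with a fixed size in bytes;
# once that size is exceeded, the oldest documents are overwritten.
db.create_collection("readings_capped", capped=True, size=10 * 1024 * 1024)

db.readings_capped.insert_many([{"sensorId": 1, "seq": n} for n in range(5)])

# A plain find() on a capped collection returns documents in insertion (natural) order.
for doc in db.readings_capped.find():
    print(doc["seq"])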
Related
I want to write a custom archive rule in MongoDB that archives data under a certain condition.
Let's say I have a collection A.
If the collection has more than 1000 docs, archive the oldest docs (I have createdAt) until the total document count is 1000. So basically, it should never exceed 1000.
You can implement this in a number of different ways; there is no OOTB solution for this.
I would personally use a capped collection with a maximum of 1000 documents, which lets Mongo handle the most difficult part of your requirement. Regarding the "archiving", I would create an additional collection for archiving purposes and insert each document into both collections.
This will allow you to have a lean capped collection for your queries, and an additional "archive" collection for historical queries.
There are additional points to consider that you didn't specify: are the capped collection limitations an issue? Do you need to support updates? What is the frequency of such operations? And so on.
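A minimal sketch of that approach with pymongo (the names are illustrative, and note that MongoDB also requires a size in bytes even when max is given; the 1000-document cap is enforced by the server, independent of createdAt):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()
db = client["mydb"]

# Capped collection that keeps at most 1000 documents (a size in bytes is still mandatory).
db.create_collection("A", capped=True, size=1024 * 1024, max=1000)
# Plain collection that keeps everything for historical queries.
archive = db["A_archive"]

def save(doc):
    doc = dict(doc, createdAt=datetime.now(timezone.utc))
    db.A.insert_one(doc)        # pymongo adds an _id to doc in place
    archive.insert_one(doc)     # reuse the same _id so both copies stay correlated

save({"payload": "example"})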
I wanted to insert around 4 million records into a normal collection, but the bulk insert was very slow, so I created a capped collection and loaded my data there. Someone suggested to me that there would not be any performance impact, so there was no need to create indexes.
But I am seeing that fetching the first 25 records with some filtering takes a lot of time. I have a few questions to understand it better:
What is the ideal situation where capped collections are suggested?
Can I create a compound index on a capped collection?
Is there any performance improvement with capped collections over a normal collection?
A capped collection limits how much data it stores. It does not make retrieval of the data it does store any faster.
Generally if you need fast (or, realistically, reasonably performant) reads you should be using indexes.
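On the compound-index question specifically: indexes, including compound ones, can be created on capped collections just as on normal ones. A hedged pymongo sketch with hypothetical field names:

from pymongo import ASCENDING, DESCENDING, MongoClient

coll = MongoClient()["mydb"]["events_capped"]   # assumed to be an existing capped collection

# Compound index on the fields the filter and sort actually use (hypothetical names).
coll.create_index([("deviceId", ASCENDING), ("createdAt", DESCENDING)])

# With the index in place, a filtered "first 25" query no longer scans the whole collection.
cursor = coll.find({"deviceId": 42}).sort("createdAt", DESCENDING).limit(25)
print(list(cursor))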
Can we save new records in descending order in MongoDB, so that the first saved document is returned last by a find query? I do not want to use $sort, so the data should be pre-saved in descending order.
Is it possible?
Based on the description above, if you do not want to use $sort, an alternative solution is to create a capped collection, which maintains the insertion order of documents in the MongoDB collection.
For a more detailed description of capped collections in MongoDB, please refer to the documentation at the following URL:
https://docs.mongodb.org/manual/core/capped-collections/
But please note that capped collections are fixed-size collections, so old documents are automatically removed once the collection exceeds its fixed size.
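A small pymongo sketch of that idea (names and sizes are assumptions). A plain find() on a capped collection returns documents in insertion order; getting them newest-first still requires a sort specification, although it can be the index-free {$natural: -1} rather than a sort on a field:

from pymongo import MongoClient

db = MongoClient()["mydb"]

db.create_collection("log_capped", capped=True, size=1024 * 1024)
db.log_capped.insert_many([{"seq": n} for n in range(3)])

# Insertion (natural) order: 0, 1, 2
print([d["seq"] for d in db.log_capped.find()])

# Reverse natural order (newest first): 2, 1, 0
print([d["seq"] for d in db.log_capped.find().sort("$natural", -1)])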
The order of the records is not guaranteed by MongoDB unless you add a $sort operator. Even if the records happen to be ordered on disk, there is no guarantee that MongoDB will always return the records in the same order. MongoDB does quite a bit of work under the hood and as your data grows in size, the query optimiser may pick a different execution plan and return the data in a different order.
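If an explicit sort turns out to be acceptable after all, sorting on _id descending is a common low-cost option, since every collection already has an _id index and ObjectIds roughly encode insertion time (a sketch, not part of the original answer):

from pymongo import MongoClient

db = MongoClient()["mydb"]

# Newest-inserted documents first, served by the default _id index
# (ObjectIds embed a timestamp, so this approximates reverse insertion order).
for doc in db.mycoll.find().sort("_id", -1).limit(10):
    print(doc)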
What I have:
I have data from 'n' departments.
Each department has more than 1000 datasets.
Each dataset has more than 10,000 CSV files (each larger than 10 MB), each with a different schema.
This data will grow even more in the future.
What I want to do:
I want to map this data into MongoDB.
What approaches I used:
I can't map each dataset to a single document in Mongo, since a document is limited to 4-16 MB.
I cannot create a collection for each dataset, as the maximum number of collections is also limited (<24000).
So finally I thought of creating a collection for each department, and in that collection one document for each record of the CSV files belonging to that department.
What I want to know from you:
Will there be a performance issue if we map each record to a document?
Is there any max limit on the number of documents?
Is there any other design I can do?
Will there be a performance issue if we map each record to a document?
Mapping each record to a document in MongoDB is not a bad design. You can have a look at the FAQ on the MongoDB site:
http://docs.mongodb.org/manual/faq/fundamentals/#do-mongodb-databases-have-tables .
It says,
...Instead of tables, a MongoDB database stores its data in collections,
which are the rough equivalent of RDBMS tables. A collection holds one
or more documents, which corresponds to a record or a row in a
relational database table....
Along with the limitation on BSON document size (16 MB), there is also a maximum of 100 levels of document nesting:
http://docs.mongodb.org/manual/reference/limits/#BSON Document Size
...Nested Depth for BSON Documents Changed in version 2.2.
MongoDB supports no more than 100 levels of nesting for BSON document...
So it's better to go with one document for each record.
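As a hedged sketch of the record-per-document approach (field names, file paths, and the collection naming are assumptions), each CSV row becomes one document and rows are loaded in batches:

import csv
from pymongo import MongoClient

db = MongoClient()["departments_db"]

def load_csv(department, dataset, path, batch_size=1000):
    # Insert every row of one CSV file as a document in the department's collection.
    coll = db[department]                      # one collection per department
    with open(path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):          # keys follow the CSV header, so schemas may differ per file
            row["_dataset"] = dataset          # tag the row so queries can filter by dataset
            batch.append(row)
            if len(batch) >= batch_size:
                coll.insert_many(batch)
                batch = []
        if batch:
            coll.insert_many(batch)

load_csv("physics", "experiment_42", "experiment_42.csv")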
Is there any max limit on the number of documents?
No. It is mentioned in the MongoDB reference manual:
...Maximum Number of Documents in a Capped Collection. Changed in version 2.4.
If you specify a maximum number of documents for a capped collection using the max parameter to create, the limit must be less than 2^32 documents. If you do not specify a maximum number of documents when creating a capped collection, there is no limit on the number of documents...
Is there any other design I can do?
If your documents are too large, you can think of partitioning them at the application level, but that comes with a high computation cost in the application layer.
Will there be a performance issue if we map each record to a document?
That depends entirely on how you search them. When you use a lot of queries that each affect only one document, it is likely even faster that way. When a higher document granularity results in a lot of document-spanning queries, it will get slower, because MongoDB cannot combine documents for you in a single query.
Is there any max limit on the number of documents?
No.
Is there any other design I can do?
Maybe, but that depends on how you want to query your data. If you are content with treating files as BLOBs that are retrieved as a whole but not searched or analyzed at the database level, you could consider storing them in GridFS. It is a way to store files larger than 16 MB in MongoDB.
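A minimal GridFS sketch with pymongo (the database name, file, and metadata fields are illustrative):

import gridfs
from pymongo import MongoClient

db = MongoClient()["departments_db"]
fs = gridfs.GridFS(db)                           # uses the default "fs" bucket

# Store a CSV file that may well exceed the 16 MB document limit.
with open("experiment_42.csv", "rb") as f:
    file_id = fs.put(f, filename="experiment_42.csv", department="physics")

# Retrieve it as a whole; GridFS does not let you query inside the file's contents.
data = fs.get(file_id).read()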
In general, MongoDB database design doesn't depend so much on what and how much data you have, but rather on how you want to work with it.
Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (where we saw it), metadata
What we want to know with this information:
Unique visitors on one or more clusters for a given range of dates.
Unique visitors by day.
Grouping of metadata for a given range (platform, browser, etc.).
The model I stick with in order to easily query this information is:
{
    VisitorId: 1,
    ClusterVisit: [
        {clusterId: 1, dates: [date1, date2]},
        {clusterId: 2, dates: [date1, date3]}
    ]
}
Indexes:
by VisitorId (to ensure uniqueness)
by ClusterVisit.clusterId + ClusterVisit.dates (for searching)
by VisitorId + ClusterVisit.clusterId (for updating)
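For reference, those indexes expressed with pymongo (a sketch; the field names follow the model above):

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["tracking"]["visits"]   # assumed collection holding the documents above

coll.create_index([("VisitorId", ASCENDING)], unique=True)        # uniqueness
coll.create_index([("ClusterVisit.clusterId", ASCENDING),
                   ("ClusterVisit.dates", ASCENDING)])            # searching
coll.create_index([("VisitorId", ASCENDING),
                   ("ClusterVisit.clusterId", ASCENDING)])        # updating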
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we search for a combination of VisitorId and clusterId and we $addToSet the date.
Second:
If the first step doesn't match, we upsert:
$addToSet: {VisitorId: 1,
            ClusterVisit: [{clusterId: 1, dates: [date1]}]
           }
With the first and second import steps I cover the cases where the clusterId doesn't exist yet or the VisitorId doesn't exist yet.
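A hedged pymongo sketch of that two-step import (the field names follow the model above; this is a reconstruction of the described logic, not the original code):

from datetime import datetime
from pymongo import MongoClient

coll = MongoClient()["tracking"]["visits"]

def record_visit(visitor_id, cluster_id, day):
    # Step 1: if this visitor already has an entry for the cluster, just add the date.
    res = coll.update_one(
        {"VisitorId": visitor_id, "ClusterVisit.clusterId": cluster_id},
        {"$addToSet": {"ClusterVisit.$.dates": day}},
    )
    if res.matched_count == 0:
        # Step 2: the visitor or the cluster entry is missing; upsert the cluster sub-document.
        coll.update_one(
            {"VisitorId": visitor_id},
            {"$addToSet": {"ClusterVisit": {"clusterId": cluster_id, "dates": [day]}}},
            upsert=True,
        )

record_visit(1, 2, datetime(2014, 5, 1))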
Problems:
Totally inefficient (nearly impossible) to update/insert/upsert once the collection grows; I guess this is because the documents get bigger every time a new date is added.
Difficult to maintain (mostly unsetting dates).
I have a collection with more than 50,000,000 documents that I can't grow any more. It updates only ~100 records/sec.
I think the model I'm using is not the best for this amount of information. What do you think would be best to get more upserts/sec and query the information fast, before I get into sharding, which is going to take more time while I learn it and get confident with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive to process on large collections: mapReduce, aggregate...
Try .explain():
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
Running out of memory space for the collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Tracking performance via slow query logs:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (a compound one is better):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
More Stack Overflow links on improving performance:
https://stackoverflow.com/a/7635093/602018
A good starting point for further education on sharding and replication is:
https://education.10gen.com/courses