I have been reading a few blog posts on this topic, like https://www.mongodb.com/blog/post/time-series-data-and-mongodb-part-1-introduction. It seems that storing related time-series data within a few documents (or a single document) is a better approach than storing each data point as its own document. But I am wondering whether storing it in a single doc (forget the size-bucket approach for a moment) fits my use case:
a) Regarding updates of the data: I occasionally need to replace the entire time series.
b) I potentially need to read the sorted [<date1,value1>,<date2,value2>,....] by date range.
A few questions on my mind right now:
1) The common suggestion I have seen is: don't embed a large array in a single doc if the size of the array is unbounded, because MongoDB may need to reallocate space for the document on update. I understand that would be an issue with the old MMAPv1 storage engine. But from other answers in "WiredTiger and in-place updates" and "Does performing a partial update on a MongoDb document in WiredTiger provide any advantage over a full document update?", this doesn't seem to be a problem, because WiredTiger reconstructs the entire document on update anyway before flushing it to disk, and it is optimized for large documents. So I am wondering whether a large array in a single doc is still an issue with WiredTiger.
2) [Given that each array may contain 5000+ key-value pairs, each doc is around 0.5 - 1 MB.] It seems the query to retrieve the sorted time series by date range would be less complicated (fewer aggregation stages, since I need to unwind the subdocuments to sort and filter) if I store each data point as its own doc. But in terms of disk and memory usage, the single-doc approach definitely has an advantage, since I am retrieving 1 doc vs n docs. So I am wondering where to draw the line in terms of performance.
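For illustration, here is a rough sketch of what the two read paths might look like in Python/pymongo (collection and field names such as series, points, series_id, date and value are placeholders for this example, not anything prescribed by MongoDB):

```python
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["metrics"]           # assumes a local mongod
start, end = datetime(2023, 1, 1), datetime(2023, 2, 1)

# Embedded-array model: one doc per series, points in a "ts" array.
# Filtering and sorting the array requires an aggregation pipeline.
embedded = db.series.aggregate([
    {"$match": {"series_id": "sensor-1"}},
    {"$unwind": "$ts"},
    {"$match": {"ts.date": {"$gte": start, "$lt": end}}},
    {"$sort": {"ts.date": 1}},
    {"$project": {"_id": 0, "date": "$ts.date", "value": "$ts.value"}},
])

# One-doc-per-point model: a plain indexed find covers the same query.
points = db.points.find(
    {"series_id": "sensor-1", "date": {"$gte": start, "$lt": end}},
    {"_id": 0, "date": 1, "value": 1},
).sort("date", 1)
```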
3) There are 3 approaches I could go for in scenario (a):
   1. Replace the whole doc (replaceOne / replaceOneModel)
   2. Update only the time-series part of the doc (updateOne / updateOneModel)
   3. Use a bulk update or insert to update each array element
Also, I am thinking about index rebuild issues. Given that indexes are not linked directly to the data files in WiredTiger, it seems approaches 1 and 2 are both acceptable, with 2 being better as it updates only the target part. Am I missing anything here? (A rough sketch of the three approaches is shown below.)
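For reference, a minimal sketch of the three approaches in Python/pymongo (the collection name series and fields series_id, meta and ts are assumptions for this example):

```python
from datetime import datetime
from pymongo import MongoClient, UpdateOne

coll = MongoClient()["metrics"]["series"]   # assumes a local mongod
new_points = [{"date": datetime(2023, 1, 1), "value": 42.0},
              {"date": datetime(2023, 1, 2), "value": 43.5}]

# Approach 1: replace the whole document.
coll.replace_one(
    {"series_id": "sensor-1"},
    {"series_id": "sensor-1", "meta": {"unit": "C"}, "ts": new_points},
    upsert=True,
)

# Approach 2: overwrite only the embedded time-series array.
coll.update_one({"series_id": "sensor-1"}, {"$set": {"ts": new_points}})

# Approach 3: bulk-update individual array elements, matching each by date.
ops = [
    UpdateOne(
        {"series_id": "sensor-1", "ts.date": p["date"]},
        {"$set": {"ts.$.value": p["value"]}},
    )
    for p in new_points
]
coll.bulk_write(ops, ordered=False)
```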
4) I am also wondering in what scenarios we should actually go for the single-data-point-per-doc approach.
I am not too familiar with how the storage engine or MongoDB itself works under the hood, so I would appreciate it if someone could shed some light on this. Thanks in advance.
Related
We are storing lots of data in MongoDB, say 30M docs, and these documents do not get modified very often. There is a high number of read queries (~15k qps), and many of these queries (by the _id field) will return an empty result because of the nature of our use case.
I want to understand whether MongoDB does some sort of optimisation for detecting that a doc is not present in the db or index. Are there any plugins to enable this? The other option I see is an application-level Bloom filter, but that would be another piece to maintain. AFAIK HBase has Bloom filter support to check whether a document is present or not.
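To make the application-level option concrete, here is a minimal sketch of a Bloom filter sitting in front of the _id lookup (pure Python plus pymongo; the collection name docs and the filter sizing are assumptions, and a real deployment would more likely use an existing Bloom filter library or a Redis-backed filter):

```python
import hashlib
from pymongo import MongoClient

class BloomFilter:
    """Tiny illustrative Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m_bits: int = 8_000_000, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

coll = MongoClient()["app"]["docs"]          # assumed collection
bloom = BloomFilter()
for doc in coll.find({}, {"_id": 1}):        # one-off seeding pass at startup
    bloom.add(str(doc["_id"]))

def lookup(doc_id):
    # Definitely-absent ids are answered without touching MongoDB;
    # "might contain" still requires the real query (false positives happen).
    if not bloom.might_contain(str(doc_id)):
        return None
    return coll.find_one({"_id": doc_id})
```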
Finding a non-existent document is the worst case of finding a document. Just as in real life: if what you're looking for doesn't exist, you have to check every possible place, whereas a successful search can stop as soon as the thing is found.
All of the find optimizations apply equally to finding documents that end up not existing (indexes, shard keys, etc.).
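As a rough illustration (pymongo, with the hypothetical collection name docs), explain() shows that a lookup for a missing _id is still resolved through the _id index rather than a collection scan; the exact plan name varies by server version:

```python
from pymongo import MongoClient

coll = MongoClient()["app"]["docs"]   # assumed collection

# explain() on the cursor reports the winning plan; for a query on _id it is
# an index scan on the _id index even when no document matches the value.
plan = coll.find({"_id": "does-not-exist"}).explain()
print(plan["queryPlanner"]["winningPlan"])
```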
Context:
I'm currently modeling data that follows a deep tree pattern consisting of 4 layers (categories, subcategories, subsubcategories, subsubsubcategories... those last two are of course not the real words I'll be using).
This collection is meant to grow larger and larger over time, and each layer will contain a list of dozens of elements.
Problem:
Modeling a fully embedded collection like that raises a big problem: the 16MB document limit of MongoDB is not really ideal in this context, because the document size will slowly approach the limit.
But at the same time, this data is not meant to be updated very often (at most a few times a day). Client-side, the API needs to return a fully constructed, big JSON document made of all those layers nested together. It can easily be made such that every time a layer is updated, the full JSON result is rebuilt too and kept in RAM, ready to be sent.
I was wondering if splitting a 4-layer tree like that across different collections would be a better idea: it would mean more queries, but it would be far more scalable and easier to understand. However, I don't really know whether that is how MongoDB documents are meant to be modeled. I may be doing something wrong (first time using MongoDB), and I want to be sure before committing to this way of doing things.
I'd suggest you take a look at the official MongoDB tree structure advice, and especially the solution with parent references. It will let you keep your structure without running into the 16MB maximum document size, and you can use $graphLookup aggregation stages to perform further queries on tree subdocuments.
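A minimal sketch of the parent-reference pattern with $graphLookup in Python/pymongo (the collection name categories and the sample node names are just for illustration):

```python
from pymongo import MongoClient

cats = MongoClient()["catalog"]["categories"]   # assumed collection

# Parent-reference pattern: every node stores a pointer to its parent,
# so each layer is just another document and no single doc grows unbounded.
cats.insert_many([
    {"_id": "electronics", "parent": None},
    {"_id": "computers",   "parent": "electronics"},
    {"_id": "laptops",     "parent": "computers"},
    {"_id": "ultrabooks",  "parent": "laptops"},
])

# $graphLookup walks the parent references to pull a whole subtree in one query.
subtree = cats.aggregate([
    {"$match": {"_id": "electronics"}},
    {"$graphLookup": {
        "from": "categories",
        "startWith": "$_id",
        "connectFromField": "_id",
        "connectToField": "parent",
        "as": "descendants",
        "depthField": "depth",
    }},
])
for root in subtree:
    print(root["_id"], [d["_id"] for d in root["descendants"]])
```

Reassembling the full nested JSON for the API then happens in application code (or with further pipeline stages), which fits the "rebuild on update and keep in RAM" approach described above.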
From this book: https://www.oreilly.com/library/view/50-tips-and/9781449306779/ch01.html
Specifically about "Tip #7: Pre-populate anything you can".
It claims that pre-populating data is better because
MongoDB does not need to find space for them. It merely updates the values you’ve already entered, which is much faster
Is there any truth in this? I've checked the MongoDB manual about data modeling and it does not mention anything about this.
The other tips also do not cite any sources, so I'm wondering whether there is any basis for them.
Is there any truth in this?
Yes, if you're using MMAPv1 storage engine.
https://docs.mongodb.com/manual/core/write-performance/#document-growth-and-the-mmapv1-storage-engine
Some update operations can increase the size of the document; for instance, if an update adds a new field to the document.
For the MMAPv1 storage engine, if an update operation causes a document to exceed the currently allocated record size, MongoDB relocates the document on disk with enough contiguous space to hold the document. Updates that require relocations take longer than updates that do not, particularly if the collection has indexes. If a collection has indexes, MongoDB must update all index entries. Thus, for a collection with many indexes, the move will impact the write throughput.
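To make the tip concrete, here is a hedged sketch of what pre-population looks like in practice (pymongo; the collection and field names are hypothetical, and the benefit described only applies to the legacy MMAPv1 engine, not WiredTiger):

```python
from pymongo import MongoClient

hits = MongoClient()["stats"]["daily_hits"]   # assumed collection

# Pre-populate: insert the document with every hourly slot already present,
# so later updates only overwrite existing values instead of growing the doc.
hits.insert_one({
    "_id": "2023-01-01",
    "hours": {str(h): 0 for h in range(24)},
})

# Under MMAPv1 this in-place $inc never forces a relocation, because the
# record was allocated at its final size up front.
hits.update_one({"_id": "2023-01-01"}, {"$inc": {"hours.13": 1}})
```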
I would like to hear some suggestions on implementing a database solution for the problem below.
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values through which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested by date range, by matching a list of IDs, as the top 10 newest documents, and as records that are new since the previous request.
Here is what I have done so far:
1) Checked whether I could use Redis. It is limited to a few data types and cannot apply multiple where-conditions to filter a hash, and since I need to index on date and several other fields, I was unable to choose the right data structure to store the data in a hash.
2) Investigated DynamoDB. It is again a key-value store where all the filter conditions would have to be stored as one value, and I am not sure it would be efficient to query a JSON document to filter out the right XML document.
3) Investigated Cassandra, and it looks like it may fit my requirements, but it has the limitation that read operations might be slow. Cassandra's advantage is faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server, and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra are necessarily slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra also doesn't have the search capabilities you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database designed for search operations.
I suggest looking at Lucene/Elasticsearch. Let me quote the features of Lucene from its main website:
Scalable, High-Performance Indexing
  - over 150GB/hour on modern hardware
  - small RAM requirements -- only 1MB heap
  - incremental indexing as fast as batch indexing
  - index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
  - ranked searching -- best results returned first
  - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  - fielded searching (e.g. title, author, contents)
  - sorting by any field
  - multiple-index searching with merged results
  - allows simultaneous update and searching
  - flexible faceting, highlighting, joins and result grouping
  - fast, memory-efficient and typo-tolerant suggesters
  - pluggable ranking models, including the Vector Space Model and Okapi BM25
  - configurable storage engine (codecs)
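If you do go the Elasticsearch route, a rough sketch of the write/read path might look like this (Python elasticsearch client, 8.x-style API; the index name xml-docs and the fields are assumptions):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumes a local cluster

# Index an XML document's filterable metadata (the XML body itself could
# live in a blob store or a stored field; only the filter values matter here).
es.index(index="xml-docs", id="doc-1", document={
    "doc_id": "doc-1",
    "created": datetime.now(timezone.utc),
    "tags": ["invoice", "eu"],
})

# Date-range query, newest first -- covers "top 10 new documents" and
# "records that are new since the previous request".
hits = es.search(
    index="xml-docs",
    query={"range": {"created": {"gte": "now-3d/d"}}},
    sort=[{"created": "desc"}],
    size=10,
)
for h in hits["hits"]["hits"]:
    print(h["_source"]["doc_id"])
```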
I want to use a capped collection in Mongo, but I don't want my documents to die when the collection loops around. Instead, I want Mongo to notice that I'm running out of space and move the old documents into another, permanent collection for archival purposes.
Is there a way to have Mongo do this automatically, or can I register a callback that would perform this action?
You shouldn't be using a capped collection for this. I'm assuming you're doing so because you want to keep the amount of "hot" data relatively small and move stale data to a permanent collection. However, this is effectively what happens anyway when you use MongoDB. Data that's accessed often will be in memory and data that is used less often will not be. Same goes for your indexes if they remain right-balanced. I would think you're doing a bit of premature optimization or at least have a suboptimal schema or index strategy for your problem. If you post exactly what you're trying to achieve and where your performance takes a dive I can have a look.
To answer your actual question; MongoDB does not have callbacks or triggers. There are some open feature requests for them though.
EDIT (small elaboration on the technical implementation): MongoDB is built on top of memory-mapped files for its storage engine. That basically means it acts as an LRU-based cache of "hot" data, where data in this case can be both actual data and index data. As a result, the data and associated index data you access often (in your case the data you'd typically have in your capped collection) will be in memory and thus very fast to query. In typical use cases, the performance difference between having an "active" collection plus an "archive" collection and just one big collection should be small. As you can imagine, having more memory available to the mongod process means more data can stay in memory, and as a result performance will improve. There are some nice presentations from 10gen available on mongodb.org that go into more detail and also explain how to keep indexes right-balanced, etc.
At the moment, MongoDB does not support triggers at all. If you want to move documents away before they reach the end of the "cap" then you need to monitor the data usage yourself.
However, I don't see why you would want a capped collection and also still want to move your items away. If you clarify that in your question, I'll update the answer.
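If you do end up monitoring the data usage yourself, a minimal sketch (pymongo; the collection names and the headroom threshold are assumptions) could periodically copy the oldest documents into a permanent collection before the cap overwrites them:

```python
from pymongo import MongoClient

db = MongoClient()["app"]                     # assumes a local mongod
hot, archive = db["events_capped"], db["events_archive"]

def archive_if_nearly_full(headroom_bytes=1_000_000, batch=500):
    # collStats reports the capped collection's current size and its max size.
    stats = db.command("collStats", "events_capped")
    if stats["maxSize"] - stats["size"] > headroom_bytes:
        return
    # Capped collections preserve insertion order ($natural), so the first
    # documents are the ones that will be overwritten next; copy them out.
    # (Deduplication on _id across runs is omitted for brevity.)
    oldest = list(hot.find().sort("$natural", 1).limit(batch))
    if oldest:
        archive.insert_many(oldest, ordered=False)
```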