I have documents that look like this:
{
    "_id": ObjectId("5444fc67931f8b040eeca671"),
    "meta": {
        "SessionID": "45",
        "configVersion": "1",
        "DeviceID": "55",
        "parentObjectID": "55",
        "nodeClass": "79",
        "dnProperty": "16"
    },
    "cfg": {
        "Name": "test"
    }
}
The names and the data are just for testing at the moment, but I have a total of 25 million documents in the DB. I'm using find() to fetch specific documents, and in this find() I use four arguments: dnProperty, nodeClass, DeviceID and configVersion. None of them is unique.
At the moment my index setup is as simple as:
ensureIndex([["nodeClass", 1],["DeviceID", 1],["configVersion", 1], ["dnProperty",1]])
In other words, I have an index on the four arguments. I still have huge problems when a search doesn't find any document at all. In my example all the "data" is random from 1-100, so if I do a find() with one of the values > 100, it takes anywhere from 30 to 180 seconds to perform the search. It also uses all of my 8 GB of RAM, and since there is no RAM left the computer becomes very slow.
What would be better indexes? Am I using indexes correctly? Do I simply need more RAM, since it will put "all" of the DB in its working memory? Would you recommend another DB (other than MongoDB) to handle this better?
Sorry for the multiple questions. I hope they are short enough that you can give me an answer.
MongoDB uses memory-mapped files, which means that as much of your data and indexes as possible is kept in RAM, and queries are served from RAM whenever the data is already there. In the current scenario your queries are slow because your data plus indexes are too large to fit in RAM, so there is a lot of I/O activity to fetch data from disk, and that is the bottleneck.
Sharding helps with this problem: if you partition/shard your data across, for example, 5 machines, you have 8 GB * 5 = 40 GB of RAM, which can hold your working set (dataset + indexes) in RAM. The I/O overhead is reduced, which improves performance.
So in this case your indexes will not improve performance beyond a certain point, and you will need to shard your data across multiple machines. Sharding tends to increase both read and write throughput roughly linearly.
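As a rough illustration of the arithmetic above, the sketch below estimates how many machines are needed to hold a working set in RAM. The 8 GB per machine comes from the question; the 30 GB working set and the helper name shards_needed are assumptions made up for the example, and the estimate ignores memory used by the OS and by mongod itself.

```python
import math

def shards_needed(working_set_gb, ram_per_node_gb):
    """Smallest number of equally sized nodes whose combined RAM holds the working set."""
    if working_set_gb <= 0 or ram_per_node_gb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(working_set_gb / ram_per_node_gb)

# 8 GB per machine is from the question; 30 GB is an assumed working set size.
nodes = shards_needed(30, 8)
print(nodes, "nodes give", nodes * 8, "GB of RAM in total")
```

You can get the real numbers for your collection from the data and index sizes reported by the database statistics and plug them in.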
Sharding in MongoDB
System Information:
OS: Ubuntu 20.04 LTS
System: 80 GB RAM, 1 TB SSD, i7-12700k
The documents in this collection are on average 16KB, and there are 500K documents in this collection. I noticed that as the collection grows larger, the time taken to insert documents also grows larger.
In what ways could I improve the speed of writes?
It is taking 10 hours to insert 150k documents, which is about what the graph predicts when we integrate the fitted line:
def f(num):
    return 0.0004*num+0.9594

total=0
for i in range(500,650):
    total+=f(i*1000)
>> total/3600
>> 9.61497
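The loop above is an arithmetic series, so the same figure can be cross-checked with a closed form. This is only a sanity check of the numbers in the question; the fitted coefficients are taken from it as they are.

```python
def f(num):
    # Fitted line from the question.
    return 0.0004 * num + 0.9594

# Direct summation, as in the question.
loop_total = sum(f(i * 1000) for i in range(500, 650))

# Closed form: sum of (0.4*i + 0.9594) for i = 500..649.
n = 650 - 500
sum_i = n * (500 + 649) // 2   # arithmetic series: 86175
closed_total = 0.4 * sum_i + 0.9594 * n

print(loop_total / 3600, closed_total / 3600)
```

Both values come out at roughly 9.6 hours, which agrees with the observed 10 hours.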
Potential upgrades in my mind:
Use the C++ mongo engine for writes
Allocate more RAM to Mongod
Logs
iotop showing mongod using < 1% of the IO capacity with write speeds around 10-20 KB/s
htop showing that mongod is only using ~16GB of RAM
Disks showing that some 300GB of SSD is free
EDIT:
Pseudocode:
from datetime import datetime

docs=[...]
for doc in docs:
    doc["last_updated"]=str(datetime.now())
    # look up the existing document by its key
    doc_from_db = collection.find_one({"key":doc["key"]})
    new_dict = minify(doc)
    if doc_from_db is None:
        collection.insert_one(new_dict)
    else:
        collection.replace_one({"key":doc["key"]},new_dict,upsert=True)
When it comes to writes there are a few things to consider. The most impactful one, which I'm assuming is the issue here, is index size, index complexity and unique indexes.
It's hard to give exact advice without more information, so I'll detail the most common write bottlenecks from my experience.
1. Indexes. Too many indexes, unique indexes, or indexes on very large arrays (when the documents you insert contain large arrays) all heavily impact insert performance. This also correlates with the graph you provided, as inserting gets slower the larger the index gets. There is no "real" solution to this issue: you should reconsider which indexes you actually need and find out which of them cause the bottleneck (focus on unique and array indexes). For example, if you have an index that enforces uniqueness, drop it and enforce uniqueness at the application level instead.
2. Write concern and replication lag. If you are using a replica set and you require a majority write concern, this can definitely cause issues because of the sync lag that builds up. Usually this is a side effect of a different issue: for example, because of #1 (large indexes) the insert takes too long, which causes sync lag, which in turn delays the write concern even further.
3. Unoptimized hardware (assuming you're hosted in the cloud). You'd be surprised how much write performance you can gain just by changing the disk type and increasing IOPS. This gives an immediate improvement, obviously at the cost of $$$.
4. Code. No code was provided, so I would also check that. If it's a for loop, you can parallelize the logic.
I recommend you test the same insert logic on an indexless collection to pinpoint the problem. I'd be glad to help think of other issues/solutions once you can provide more information.
EDIT:
Here is an example of how to avoid the for-loop issue by using bulk_write in Python with pymongo:
from pprint import pprint
from pymongo import ReplaceOne
from pymongo.errors import BulkWriteError

docs = [... input documents ]
requests = []
for doc in docs:
    # upsert=True inserts the document when no match exists
    requests.append(ReplaceOne({"docId": doc["docId"]}, doc, upsert=True))
try:
    # ordered=False lets the server continue past individual failures
    db.docs.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    pprint(bwe.details)
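With a very large docs list you probably don't want to build every request up front, so one option is to send the operations in fixed-size chunks. The sketch below shows only the chunking part in plain Python. The names batched and write_in_batches, the send callback and the batch size of 1000 are all assumptions for illustration; in real use send would be something like lambda ops: db.docs.bulk_write(ops, ordered=False).

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def write_in_batches(items, send, size=1000):
    """Call send(chunk) once per chunk and return the number of calls."""
    calls = 0
    for chunk in batched(items, size):
        send(chunk)
        calls += 1
    return calls

# Stand-in for the real write call: here we only record the chunk sizes.
sizes = []
calls = write_in_batches(range(2500), lambda chunk: sizes.append(len(chunk)))
print(calls, sizes)  # 3 [1000, 1000, 500]
```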
You can enable profiling in the database, but judging by the previous comments and your code, profiling just the Python code may be enough. For example, can you show the output of something similar to this example?
https://github.com/Tornike-Skhulukhia/cprofiler_python_example/blob/main/demo.py
But before that, please check that you have an index on the field you search on with find_one in the current code. Otherwise the database may need to do a full collection scan just to find one document, which means the time will grow considerably as the number of documents grows.
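For reference, a minimal profiling run with the standard library's cProfile could look like the sketch below. It is not the code from the linked repository; slow_lookup and process are made-up stand-ins for your own functions, with a linear scan playing the role of an unindexed lookup.

```python
import cProfile
import io
import pstats

def slow_lookup(items, key):
    # Linear scan, standing in for an unindexed find_one.
    for item in items:
        if item == key:
            return item
    return None

def process(n):
    items = list(range(n))
    return [slow_lookup(items, k) for k in range(0, n, 10)]

profiler = cProfile.Profile()
profiler.enable()
process(2000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

The functions at the top of the report are where the time goes. If that is the database call, look at the indexes; if it is your own processing, look at the Python code.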
I am designing a MongoDB database that looks something like this:
registry:{
    id:1,
    duration:123,
    score:3,
    text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
The text field is very big compared to the rest. I sometimes need to run analytics queries that average the duration or the score but never use the text.
I also have more specific queries that retrieve all the information about a single document. For these queries I could afford to spend more time and make two queries to retrieve all the data.
My question is, if I make a query like this:
db.registries.aggregate( [
    {
        $group: {
            _id: null,
            averageDuration: { $avg: "$duration" },
        }
    }
] )
Would it need to read the data from the text field? That would make the query much slower and it would take a lot of RAM. If that is the case, it would be better to split the records in two and have something like this, right?
registry:{
    id:1,
    duration:123,
    score:3,
}
registry_text:{
    id:1,
    text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
Thanks a lot!
I don't know how the server works in this case, but I expect that, for caching reasons, the server will load complete documents into memory when it reads them from disk. Disk reads are very slow (= expensive in time taken), and I expect the server will aggressively use memory if it can to avoid reads.
An important note here is that the documents are stored on disk as lists of key-value pairs comprising their contents. To not load a field from disk the server would have to rebuild the document in question as part of reading it since there are length fields involved. I don't see this happening in practice.
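To make the point about length fields concrete, here is a minimal sketch of the BSON layout for a document like the one in the question. It handles only int32 and UTF-8 string values and is an illustration of the format, not the server's storage code; the function names are made up for the example.

```python
import struct

def encode_element(name, value):
    """Encode one field; only int32 and UTF-8 string values are handled here."""
    key = name.encode("utf-8") + b"\x00"
    if isinstance(value, int):
        return b"\x10" + key + struct.pack("<i", value)
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    raise TypeError(f"unsupported type: {type(value).__name__}")

def encode_document(doc):
    """Length-prefixed document: int32 total size, elements, trailing zero byte."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

registry = {"id": 1, "duration": 123, "score": 3, "text": "a" * 28}
encoded = encode_document(registry)
total_len = struct.unpack_from("<i", encoded)[0]
text_len = len(encode_element("text", registry["text"]))
print(total_len, text_len)  # 77 39
```

The leading length covers the whole document, text included, so dropping the text field while reading would mean writing a new length and a new element list, which is the rebuilding mentioned above.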
So, once the documents are in memory I assume they are there with all of their fields and I don't expect you can tune this.
When you are querying, the server may or may not drop individual fields but this would only change the memory requirements for the particular query. Generally these memory requirements are dwarfed by the overall database cache size and aggregation pipelines. So I don't think it really matters at what point a large field is dropped from a document during query processing (assuming you project it out in the query).
I think this isn't a worthwhile matter to try to ponder/optimize. If you have a real system with real workloads, you'll be much more pressed to optimize something else.
If you are concerned with memory usage when the amount of available memory is consumer-sized (say, under 16 GB), just get more memory. It's very cheap compared with the time you'd spend working around the lack of it (whether we are talking about provisioning bigger AWS instances or buying more sticks of RAM).
You should be able to use $project to limit the fields read.
As general advice, don't try to normalize the data with MongoDB as you would with SQL. Also, it's often more performant to read documents plain from the DB and do the processing on your server.
I have found this answer, which seems to indicate that with a projection the database server still has to fetch each whole document, and that it only reduces bandwidth:
When using projection to remove unused fields, the MongoDB server will
have to fetch each full document into memory (if it isn't already
there) and filter the results to return. This use of projection
doesn't reduce the memory usage or working set on the MongoDB server,
but can save significant network bandwidth for query results depending
on your data model and the fields projected.
https://dba.stackexchange.com/questions/198444/how-mongodb-projection-affects-performance
I am planning to use a nested document structure for my MongoDB schema design. I don't want to go for a flat schema design because in my case I need to fetch my result in one query only.
MongoDB has a size limit for a document:
MongoDB Limits and Thresholds
A MongoDB document has a size limit of 16MB (an amount of data). If your subcollection can grow without limits, go flat.
I don't need to fetch my nested data; I only need it for filtering and querying purposes.
I want to know whether I will still be bound by the MongoDB size limit even if I use my embedded data only for querying and filtering and never fetch it. As per my understanding, in this case MongoDB won't load the complete document into memory but only the selected fields?
Nested schema design example
{
    clinicName: "XYZ Hospital",
    clinicAddress: "ABC place.",
    "doctorsWorking":{
        "doctorId1":{
            "doctorJoined": ISODate("2017-03-15T10:47:47.647Z")
        },
        "doctorId2":{
            "doctorJoined": ISODate("2017-04-15T10:47:47.647Z")
        },
        "doctorId3":{
            "doctorJoined": ISODate("2017-05-15T10:47:47.647Z")
        },
        ...
        ...
        //upto 30000-40000 more records suppose
    }
}
I don't think your understanding is correct when you say "because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?".
The MongoDB documentation reads:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So the limit is clearly 16 MB on document size. MongoDB should stop you from saving a document larger than that.
Suppose for a moment that your understanding were right: documents of any size could be saved, and only loading more than 16 MB into RAM was disallowed. When storing the data, the server cannot know what queries will later be run on it, so you would end up inserting big documents that cannot be used later (at insert time we don't declare the query pattern, and we might later try to fetch the full document in a single shot).
If the limit were on transmission (hypothetically), there are many ways, via code, for developers to bring data into RAM in chunks so that no single transfer ever crosses 16 MB (that's how I/O on large files is done). That would make the limit useless, and I assume the MongoDB creators did not want that to happen.
Also, if the limit were on transmission, there would be no need for separate collections. We could put everything into a single collection, write smart queries, and fetch anything larger than 16 MB in parts. But it doesn't work this way.
So the limit must be on document size; otherwise it would cause many issues.
In my opinion, if you need the "doctorsWorking" data only for filtering or querying (and if you also think that "doctorsWorking" will cause the document to cross the 16 MB limit), then it's good to keep it in a separate collection.
Ultimately everything depends on the query and data pattern. If a doctor can serve in multiple hospitals in shifts, then it makes sense to keep doctors in a separate collection.
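To judge whether "doctorsWorking" would actually get near the limit, a back-of-the-envelope estimate of its BSON size helps. The sketch below counts bytes for entries shaped like the example in the question. The 24-character key length is an assumption, and the other clinic fields and the wrapper around the entries are ignored, so treat the result as a rough lower bound.

```python
BSON_LIMIT = 16 * 1024 * 1024  # 16 MB maximum BSON document size

def embedded_entry_bytes(key_len, field_name="doctorJoined"):
    """Approximate BSON size of one "doctorIdN": {"doctorJoined": <date>} entry."""
    # Inner document: int32 length, one date element (type byte, name, zero byte,
    # 8-byte value), trailing zero byte.
    inner = 4 + (1 + len(field_name) + 1 + 8) + 1
    # Outer element: type byte, key plus zero byte, embedded document.
    return 1 + (key_len + 1) + inner

per_entry = embedded_entry_bytes(key_len=24)  # 24-character keys are an assumption
print(per_entry, 40000 * per_entry, BSON_LIMIT // per_entry)
```

Under these assumptions 40,000 entries take roughly 2 MB, well below 16 MB, but the margin shrinks quickly once each entry carries more fields.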
First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for from a technical point of view is the following;
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
    _id: 123234,
    type: "Car",
    color: "Blue",
    description: "bla bla"
}
Queries consist of finding documents with a specific field value. Like so;
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However the access pattern for this data will be completely random. At a given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at the most 100000 documents at a time.
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache are the indexes. An estimation of the working set is hard if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non SSD disc). Queries return in decent time and obviously instantly if the result set is already in the cache. But as already stated this will most likely not be the typical case. It is not critical that the queries are lightning fast, more so that they are dependable and that the database will run in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents each of these indexes are about 2.5GB and fit easily in RAM.
Regarding average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats() so I guess it is for the compressed data on disc, and not how much it actually takes once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly at random over type (which is what I'm taking "I have no idea what range of documents will be accessed" to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything to help you besides just better hardware. Indexes should remain in RAM because they'll constantly be in use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.
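As a rough way to put numbers on this, the sketch below estimates the expected cache hit ratio under uniformly random access. The 16 GB of RAM, the 180 GB total size and the 20k documents per operation are from the question. Treating all RAM as cache, assuming two 2.5 GB indexes (type and color), and ignoring compression are simplifying assumptions, so the result is an order-of-magnitude figure at best.

```python
def expected_hit_ratio(cache_gb, index_gb, data_gb):
    """Fraction of document reads served from cache under uniform random access."""
    room_for_data = max(cache_gb - index_gb, 0.0)
    return min(room_for_data / data_gb, 1.0)

# Assumed split: 5 GB of indexes kept in RAM, 175 GB of documents on disk.
ratio = expected_hit_ratio(cache_gb=16, index_gb=5, data_gb=175)
misses = round(20_000 * (1 - ratio))  # 20k documents per typical operation
print(f"hit ratio {ratio:.3f}, about {misses} of 20000 reads go to disk")
```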
We have a problem with aggregation queries running for a long time (a couple of minutes).
Collection:
We have a collection of 250 million documents with about 20 fields per document,
The total size of the collection is 110GB.
We have indexes over "our_id" and dtKey fields.
Hardware:
Memory:
24GB RAM (6 * 4GB DIMMs, 1333 MHz)
Disk:
LVM volume of 11TB built from four 3TB disks:
600MB/s maximum instantaneous data transfers.
7200 RPM spindle. Average latency = 4.16ms
RAID 0
CPU:
2 * E5-2420 0 @ 1.90GHz
Total of 12 cores with 24 threads.
Dell R420.
Problem:
We are trying to run the following aggregation query:
db.our_collection.aggregate(
    [
        {
            "$match":
            {
                "$and":
                [
                    {"dtKey":{"$gte":20140916}},
                    {"dtKey":{"$lt":20141217}},
                    {"our_id":"111111111"}
                ]
            }
        },
        {
            "$project":
            {
                "field1":1,
                "date":1
            }
        },
        {
            "$group":
            {
                "_id":
                {
                    "day":{"$dayOfYear":"$date"},
                    "year":{"$year":"$date"}
                },
                "field1":{"$sum":"$field1"}
            }
        }
    ]
);
This query takes a couple of minutes to run. While it is running we can see the following:
The current Mongo operation yields more than 300K times
iostat shows ~100% disk utilization
After this query is done, the data seems to be in cache and the query can be run again in a split second.
After running it for 3-4 users, the first one seems to have been evicted from the cache already, and the query takes a long time again.
We ran a count on the matching part and saw that we have users with 50K documents as well as users with 500K documents.
We tried to run only the matching part:
db.pub_stats.aggregate(
    [
        {
            "$match":
            {
                "$and":
                [
                    {"dtKey":{"$gte":20140916}},
                    {"dtKey":{"$lt":20141217}},
                    {"our_id":"112162107"}
                ]
            }
        }
    ]
);
These queries seem to take approximately 300-500 MB of memory,
but the full query seems to take 3.5 GB of memory.
Questions:
Why does the aggregation pipeline take so much memory?
How can we improve performance so that it runs in a reasonable time for an HTTP request?
Why does the aggregation pipeline take so much memory?
Just performing a $match doesn't have to read the actual data; it can be done on the indexes. Because the projection accesses field1, the actual documents have to be read, and they will probably be cached as well.
Also, grouping can be expensive. Normally, it should report an error if your grouping stage requires more than 100 MB of memory. What version are you using? Grouping has to scan the entire result set before yielding, and MongoDB has to store at least a pointer or an index for each element in the groups. I guess the key reason for the memory increase is the former.
How can we improve performance so that it runs in a reasonable time for an HTTP request?
Your dtKey appears to encode time, and the grouping is also done based on time. I'd try to exploit that fact - for instance, by precomputing aggregates for each day and our_id combination - makes a lot of sense if there's no more criteria and the data doesn't change much anymore.
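The precomputation idea can be sketched in plain Python as below. It assumes dtKey is a yyyymmdd day key and that the sums would be stored somewhere (for instance in a small summary collection) whenever new data arrives. The function names and the sample data are made up for the example.

```python
from collections import defaultdict

def daily_totals(docs):
    """Sum field1 per (our_id, dtKey) pair."""
    totals = defaultdict(int)
    for doc in docs:
        totals[(doc["our_id"], doc["dtKey"])] += doc["field1"]
    return dict(totals)

def total_for_range(totals, our_id, start, end):
    """Per-day totals for one our_id with start <= dtKey < end."""
    return {day: value for (oid, day), value in sorted(totals.items())
            if oid == our_id and start <= day < end}

docs = [  # made-up sample data
    {"our_id": "111111111", "dtKey": 20140916, "field1": 3},
    {"our_id": "111111111", "dtKey": 20140916, "field1": 4},
    {"our_id": "111111111", "dtKey": 20140917, "field1": 5},
    {"our_id": "222222222", "dtKey": 20140916, "field1": 9},
]
totals = daily_totals(docs)
print(total_for_range(totals, "111111111", 20140916, 20141217))
# {20140916: 7, 20140917: 5}
```

A request then reads at most one small record per day and user instead of tens of thousands of raw documents.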
Otherwise, I'd try to move the {"our_id":"111111111"} criterion to the first position, because equality should always precede range queries. I guess the query optimizer of the aggregation framework is smart enough, but it's worth a try. Also, you might want to try turning your two indexes into a single compound index { our_id: 1, dtKey: 1 }. Index intersections are supported now, but I'm not sure how efficient they really are. Use the built-in profiler and .explain() to analyze your query.
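Why equality first and range second works well can be shown with a sorted list standing in for the compound index. This is only an analogy in plain Python, not how MongoDB implements its B-trees; the function name and the sample entries are made up.

```python
from bisect import bisect_left

def range_scan(index, our_id, start, end):
    """Return entries with the given our_id and start <= dtKey < end."""
    lo = bisect_left(index, (our_id, start))
    hi = bisect_left(index, (our_id, end))
    return index[lo:hi]

# A compound index behaves like a list sorted by (our_id, dtKey).
index = sorted([
    ("111111111", 20140901), ("111111111", 20140916), ("111111111", 20141001),
    ("111111111", 20141217), ("222222222", 20140920), ("333333333", 20141101),
])
print(range_scan(index, "111111111", 20140916, 20141217))
# [('111111111', 20140916), ('111111111', 20141001)]
```

All matching entries sit next to each other, so the scan touches one contiguous slice. With the order reversed (dtKey first), the entries for one our_id would be scattered across the whole date range.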
Lastly, MongoDB is designed for write-heavy use and scanning data sets of hundreds of GB from disk in a matter of milliseconds isn't feasible computationally at all. If your dataset is larger than your RAM, you'll be facing massive IO delays on the scale of tens of milliseconds and upwards, tens or hundreds of thousands of times, because of all the required disk operations. Remember that with random access you'll never get even close to the theoretical sequential disk transfer rates. If you can't precompute, I guess you'll need a lot more RAM. Maybe SSDs help, but that is all just guesswork.
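A quick estimate shows where the minutes come from. The 4.16 ms average latency and the 50K and 500K document counts are from the question; one random read per document, no cache hits and no parallelism across the four disks are pessimistic assumptions, and seek time is not even included.

```python
def random_read_seconds(num_docs, latency_ms, hit_ratio=0.0):
    """Time spent waiting on disk if every cache miss costs one random read."""
    misses = num_docs * (1.0 - hit_ratio)
    return misses * latency_ms / 1000.0

for docs in (50_000, 500_000):
    secs = random_read_seconds(docs, 4.16)
    print(f"{docs} documents: about {secs / 60:.1f} minutes")
```

For the smaller users this gives a few minutes, which is in line with what you observe.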