I am querying a collection that contains 3 million documents (collection size is 15 GB).
// index
{ name: 1 }
// query
db.getCollection('contacts').find({ name: "kello" }).limit(500)
The machine has 2 cores and 8 GB of memory, and it takes about 30 seconds to finish this query. It is impossible to keep a client waiting for about half a minute.
What can I do to accelerate it?
Would it work to get a machine with 16 GB or 32 GB of memory and configure MongoDB to cache the whole collection in memory, so that the query can run entirely from RAM?
Or should I get several 8 GB machines and build a sharded cluster?
Will those methods improve the query speed?
You can create an index on the field you are searching to get a faster response. You can also use Elasticsearch to get quicker responses.
https://www.compose.com/articles/mongoosastic-the-power-of-mongodb-and-elasticsearch-together/
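For illustration, a minimal pymongo sketch of the indexing suggestion; the connection string and database name are assumptions, while the collection, field, and query values are taken from the question above:

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
contacts = client["mydb"]["contacts"]               # "mydb" is a placeholder

# Build the single-field index described in the question ({name: 1}).
contacts.create_index([("name", ASCENDING)])

# Confirm the query uses the index (IXSCAN) rather than a full collection scan.
plan = contacts.find({"name": "kello"}).limit(500).explain()
print(plan["queryPlanner"]["winningPlan"])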
Related
I have a big dataset in MongoDB to process, and sometimes we only have one machine.
For example,
1. There are 1 million records in MongoDB collection A; the data size is about 60 GB.
2. Only one machine: 4 cores, 24 GB RAM, with MongoDB running on the same machine.
3. Only an HDD, no SSD.
4. Every record contains HTML that needs to be parsed and regex-matched.
5. The results need to be inserted into another collection, B.
Because of point 4, MongoDB's built-in map-reduce is not a good fit, and I also prefer to use Python.
I did some testing and found:
If skip is large, the query becomes very slow (see the sketch after the update below):
cursor = collection.find({'result.type': 'detail'}, skip=start, limit=PRODUCER_CHUNKS)
As the number of records in collection B grows, insert performance slows down considerably.
Are there any settings that can improve the above?
Also, would it be faster if I deployed a Dockerized Spark cluster on this machine to do the same job?
Because there is only one machine, I think the most efficient way is to use multiprocessing. I implemented this, but the speed is 25 records per second, which I think is still slow.
Update
My thinking:
The HDD write speed is over 120 MB/s; I think that should be enough for this?
The per-record size is 60*1024*1024 KB / 10^6 ≈ 62 KB, so 25 records/s * 62 KB / 1024 ≈ 1.5 MB/s, much lower than 120 MB/s.
One producer and four workers (non-blocking, using asyncio), each running in a different process. The CPU is an E1231V3.
The queue size is 500, and it fills up easily once collection B becomes large (because the workers are responsible for writing to collection B, and insertion keeps slowing down during processing).
I think there may be some way to speed this up if the limitation is memory, for example:
the db splits my data into parts, maybe 60 GB / 5 GB = 12;
the db uses a 5 GB RAM cache for reads;
the db uses a 5 GB RAM cache for writes (like Redis, flushing to disk at some interval; MongoDB has a memory-only mode, but it does not persist).
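Not a definitive answer, but a rough pymongo sketch of two common workarounds for the problems described above: paginating on _id instead of skip, and writing to collection B in unordered batches. The chunk size, database name, and the parse step are assumptions:

from pymongo import MongoClient, InsertOne

client = MongoClient()
coll_a = client["mydb"]["A"]   # "mydb" is a placeholder
coll_b = client["mydb"]["B"]

CHUNK = 1000
last_id = None
while True:
    # Range query on _id instead of skip: the cost stays roughly constant,
    # whereas skip=N has to walk and discard N documents on every call.
    query = {"result.type": "detail"}
    if last_id is not None:
        query["_id"] = {"$gt": last_id}
    batch = list(coll_a.find(query).sort("_id", 1).limit(CHUNK))
    if not batch:
        break
    last_id = batch[-1]["_id"]

    # Hypothetical stand-in for the HTML parsing / regex matching step.
    results = [{"src_id": doc["_id"], "parsed": str(doc)[:100]} for doc in batch]

    # One unordered bulk write per chunk instead of one insert per record.
    coll_b.bulk_write([InsertOne(r) for r in results], ordered=False)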
I have just started working with MongoDB.
I created 10 thousand JSON documents and ran this search:
db.mycollection.find({"somenode1.somenode2.somenode3.somenode4.Value": "9999"}).count()
It returns the correct result. Execution time: 34 ms. Everything is OK.
Now I create a database with 1 million of the same documents. The total size of the database is 34 GB, and MongoDB split it into 2 GB files. I repeat the query above to count the matching documents. I waited about 2 hours for the result, all of the memory (16 GB) was occupied, and finally I shut Mongo down.
System: Windows 7 x64, 16 GB RAM.
Please tell me what I'm doing wrong. A production db will be much bigger.
In your particular case, it appears you simply do not have enough RAM. At a minimum, an index on "somenode4" would improve the query performance. Keep in mind that the indexes will want to be in RAM as well, so you may need more RAM anyway. Are you on a virtual machine? If so, I recommend you increase the size of the machine to account for the size of the working set.
As one of the other commenters stated, that nesting is a bit ugly, but I understand it is what you were dealt. So, other than RAM, indexing appears to be your best bet.
As part of your indexing effort, you may also want to experiment with pre-heating the indexes to ensure they are in RAM before that find and count(). Try executing a query that searches for something that does not exist; this should force the indexes and data into RAM prior to the real query. Depending on how often your data changes, you may want to do this once a day or more. You are essentially front-loading the slow operations.
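A minimal pymongo sketch of the indexing and pre-heating suggestions above; the database name is a placeholder, and the nested path and values are copied from the question:

from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["mycollection"]   # "mydb" is a placeholder

# Index the deeply nested field used by the count query.
coll.create_index([("somenode1.somenode2.somenode3.somenode4.Value", ASCENDING)])

# "Pre-heat": look up a value that cannot exist so the index pages are
# pulled into RAM before the real query runs.
coll.count_documents({"somenode1.somenode2.somenode3.somenode4.Value": "__warm_up__"})

# The original query, which can now walk the index instead of every document.
print(coll.count_documents({"somenode1.somenode2.somenode3.somenode4.Value": "9999"}))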
I have a feed system using fan-out on write. I keep a list of feed ids in a Redis sorted set and save the feed content in MongoDB, so every time I read 30 feeds I have to run 30 queries against MongoDB. Is there any way to improve this?
It depends on your database setup. MongoDB has extensive documentation about how to increase simultaneous reads and writes: MongoDB concurrency.
If you need many writes with low latency, start using sharding (Deployment Sharding).
If you need to increase the number of reads, deploy each shard as a replica set and route your read queries to the secondary nodes (Read Preferences).
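As a sketch of the read-preference suggestion, assuming pymongo, a replica set named "rs0", and placeholder hostnames, database, and field names:

from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")

# Reads through this handle prefer secondary members, leaving the primary
# free to absorb writes.
feeds = client.get_database("mydb", read_preference=ReadPreference.SECONDARY_PREFERRED)["feeds"]
doc = feeds.find_one({"feed_id": 42})   # "feed_id" is a hypothetical field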
Each query should also be covered by an index (Better Indexing). You can check your query time by simply appending explain() to a find; it will show the execution time and other details:
db.collection.find({a:"abcd"}).explain()
Make sure you have enough RAM so that your data set fits in RAM, or at the very least so your indexes fit in RAM, because fetching data from disk is about 10 times slower than from RAM.
Check your server status by running mongostat; it measures database performance, page faults, locks, query operations, and many other details.
Also measure your hardware performance with a program like iostat and make sure I/O wait is low, less than 1%.
A few good links on MongoDB deployment and performance tuning:
1. Production deployment of MongoDB
2. Performance tuning of MongoDB by 10gen
3. Using Redis in front of MongoDB to cache queries and result objects
4. Example of Redis and Mongo
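As a rough sketch of link 3 (Redis as a cache in front of MongoDB), assuming redis-py and pymongo; the key naming, TTL, and database/collection names are arbitrary choices:

import json

import redis
from bson import ObjectId
from pymongo import MongoClient

cache = redis.Redis()
feeds = MongoClient()["mydb"]["feeds"]          # placeholder names

def get_feed(feed_id):
    key = "feed:" + feed_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: MongoDB is not touched
    doc = feeds.find_one({"_id": ObjectId(feed_id)})
    if doc is None:
        return None
    doc["_id"] = str(doc["_id"])                # make the document JSON-serialisable
    cache.setex(key, 300, json.dumps(doc))      # keep it cached for 5 minutes
    return doc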
I have been struggling to deploy a large database.
I have deployed 3 shard clusters and started indexing my data.
However, it has been 16 days and I am only halfway through.
The question is: should I import all the data into a non-sharded cluster, activate sharding once the raw data is in the database, and then attach more shards and start indexing? Will this auto-balance my data?
Or should I wait another 16 days for the current method I am using...
Edit:
Here is more explanation of the setup and data that is being imported...
So we have 160 million documents that look like this:
"_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
"subject" : "<concept/resource/propert/122322xyz>",
"predicate" : "<concept/property/os/123ABCDXZYZ>",
"object" : "<http://host/uri_to_object_abcdy>"
Indexes: subject, predicate, object, subject > predicate, object > predicate
Shard keys: subject, predicate, object
Setup:
3 clusters on AWS (each with 3 Replica sets) with each node having 8 GiB RAM
(Config servers are within each cluster and Mongos is in a separate server)
The data gets imported by a Java program through the mongos.
What would be the ideal way to import, index, and shard this data (without waiting a month for the process to complete)?
If you are doing a massive bulk insert, it is often faster to perform the insert without an index and then index the collection. This has to do with the way Mongo manages index updates on the fly.
Also, MongoDB is particularly sensitive to memory when it indexes. Check the size of your indexes in your db.stats() and hook up your DBs to the Mongo Monitoring Service.
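A sketch of the "insert first, index afterwards" approach with pymongo; the database and collection names are hypothetical, and the index list mirrors the ones listed in the question:

from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["triples"]              # hypothetical names

def bulk_load(docs, batch_size=10000):
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            coll.insert_many(batch, ordered=False)   # no secondary indexes to update yet
            batch = []
    if batch:
        coll.insert_many(batch, ordered=False)

def build_indexes():
    # Built once, after the bulk load has finished.
    coll.create_index([("subject", ASCENDING)])
    coll.create_index([("predicate", ASCENDING)])
    coll.create_index([("object", ASCENDING)])
    coll.create_index([("subject", ASCENDING), ("predicate", ASCENDING)])
    coll.create_index([("object", ASCENDING), ("predicate", ASCENDING)])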
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
1. Running out of physical memory or getting itself into a poor I/O pattern. MMS can help diagnose both; check the page faults graph in particular.
2. Operating on unindexed collections, which does not apply in your case.
I run a single MongoDB instance that receives log inserts from an app server. The current insert rate in production is 10 inserts per second, and it is a capped collection. I DON'T USE ANY INDEXES. Queries ran faster when there was a small number of records. Only one collection has that amount of data, yet even querying a collection that has very few rows has become very slow. Is there any way to improve the performance?
-Avinash
This is a very difficult question to answer because we don't know much about your configuration or your document structure.
One thing that immediately pops into my head is that you are running out of memory. 10 inserts per second doesn't mean much because we do not know how big the inserted documents are.
If you are inserting larger documents at 10 per second, you could be eating up memory, causing the operating system to push some of your records to disk.
When you query without using an index, you are forced to scan every document. If your documents have been pushed to disk by the OS, you will begin having page faults. Mongo will need to fetch pages of data off the hard disk, and load them into memory so that they can be scanned. Before doing this, the operating system will need to make room for that data in memory by flushing other parts of memory out to disk.
It sounds like you are I/O bound, and the two biggest things you can do to fix this are:
Add more memory to the machine running mongod
Start using indexes so that the database does not need to do full collection scans
Use proper indexes, though that will have some effect on the efficiency of insertion in a capped collection.
It would be better if you can share the collection structure and the query you are using.
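For illustration, a pymongo sketch of the indexing suggestion, assuming a hypothetical "ts" timestamp field in the log documents and placeholder database/collection names:

from datetime import datetime, timedelta
from pymongo import MongoClient, DESCENDING

logs = MongoClient()["mydb"]["app_logs"]        # placeholder names

# An index keeps recent-log queries from scanning the whole capped collection,
# at the cost of one extra index update per insert.
logs.create_index([("ts", DESCENDING)])

cutoff = datetime.utcnow() - timedelta(hours=1)
recent = list(logs.find({"ts": {"$gte": cutoff}}).sort("ts", DESCENDING).limit(100))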