I'm a MongoDB / PyMongo newbie. I've noticed that MongoDB uses a lot of CPU (in htop) and narrowed it down to one line of code.
db[collection].update({}, {'$inc':large_python_dictionary})
Although there is only ONE document in this particular database, a few thousand fields are being updated by this one line via large_python_dictionary.
Is there something I can do to fix this or should I be looking into a different schema / different type of database?
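For reference, here is a minimal sketch of the same operation with a modern PyMongo client, assuming large_python_dictionary is the field-to-increment mapping from the question and that the database/collection names are placeholders; splitting the increments into smaller chunks is one thing to try, though whether it actually reduces CPU depends on where the time is going:

    from pymongo import MongoClient

    client = MongoClient()  # assumes a local mongod on the default port
    coll = client["mydb"]["mycollection"]  # hypothetical database/collection names

    # Apply the increments in chunks of 500 fields instead of one huge $inc,
    # so each individual update document stays small.
    items = list(large_python_dictionary.items())
    for i in range(0, len(items), 500):
        chunk = dict(items[i:i + 500])
        coll.update_one({}, {"$inc": chunk})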
Issue
I have at least 10 text files (CSV), each around 5 GB in size. There was no issue when I imported the first text file, but when I start importing the second text file the import fails with the maximum size limit (16 MB) error.
My primary purpose for using the database is searching for customers, using a customer_id index.
Given below are the details of one CSV file.
Collection Name | Documents | Avg. Document Size | Total Document Size | Num. Indexes | Total Index Size | Properties
Customers       | 8,874,412 | 1.8 KB             | 15.7 GB             | 3            | 262.0 MB         |
To overcome this, the MongoDB community recommended GridFS, but the problem with GridFS is that the data is stored as bytes and it is not possible to query for a specific index in the text file.
I don't know if it is possible to query for a specific index in a text file when using GridFS. If someone knows, any help is appreciated.
The other solution I thought about was creating multiple instances of MongoDB running on different ports. Is this method feasible?
But most tutorials on multiple instances show how to create a replica set, thereby storing the same data on the PRIMARY and the SECONDARY.
The SECONDARY instances don't allow writes and only allow reads.
Is it possible to create multiple instances of MongoDB without creating a replica set, with both read and write operations on them? If yes, how? Can this method overcome the 16 MB limit?
The second solution I thought about was creating shards of the collection, or simply sharding. Can this method overcome the 16 MB limit? If yes, any help regarding this is appreciated.
Of the two solutions, which is more efficient for searching data (in terms of speed)? As I mentioned earlier, I just want to search for customers in this database.
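For what it's worth, here is a rough PyMongo sketch of importing one of the CSVs row by row (the file, database, and collection names are made up); each row becomes its own document, so the 16 MB limit applies per customer document, not per file or per collection:

    import csv
    from pymongo import MongoClient

    coll = MongoClient()["shop"]["Customers"]  # hypothetical database name

    with open("customers.csv", newline="") as f:  # hypothetical file name
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == 1000:  # insert in batches of 1000 rows
                coll.insert_many(batch)
                batch = []
        if batch:
            coll.insert_many(batch)

    coll.create_index("customer_id")  # index for the customer_id lookups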
The error message shows exactly where the problem is: entry #8437: line 13530, column 627
Have a look at the file and correct it there.
The error extraneous " in field ... is quite clear: in your CSV file you have an opening quote " that is never closed, i.e. the rest of the entire file is treated as one single field.
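If the file is too large to inspect by hand, a rough Python sketch like the following (file name is a placeholder) can point to lines with an odd number of double quotes, which is usually where the unbalanced quote lives; note that legitimate multi-line quoted fields can also trigger it:

    # Flag CSV lines whose double quotes don't balance out.
    with open("customers.csv", encoding="utf-8") as f:  # hypothetical file name
        for lineno, line in enumerate(f, start=1):
            if line.count('"') % 2 != 0:
                print("possible unbalanced quote on line %d: %r" % (lineno, line[:80]))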
We are using MongoDB (3.0.6) for our application. We have around 150 entries in one of the collections, but it takes approximately 500 ms to fetch all of these records, which I didn't expect. Mongo is deployed on a company server. What can be done to reduce this time?
We are not getting many reads and there is no CPU load, etc. What mistake might be causing this, or what configuration should be changed to improve it?
Here is my schema: http://pastebin.com/x18iDPKf
I am just querying all the entries, which are 160 in number. I don't think the time taken is due to Mongoose or NodeJS, because when I query using RoboMongo it still takes the same time.
Output of db.<collection>.stats().size:
223000
The query I am doing is:
db.getCollection('collectionName').find({})
It definitely shouldn't be a problem with MongoDB itself. It is probably a temporary issue, or it might be due to system constraints such as internal memory and so on.
If the problem still exists, create indexes on the fields you are querying:
https://docs.mongodb.com/manual/indexes/
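As a concrete check, here is a small PyMongo sketch (database and collection names are placeholders) that times the full fetch on the client and prints the server-side execution time; if explain() reports close to 0 ms while the client-side time stays around 500 ms, the time is being spent on the network or in the driver rather than in MongoDB:

    import time
    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["collectionName"]  # hypothetical names

    start = time.perf_counter()
    docs = list(coll.find({}))  # fetch all documents, as in the question
    print(len(docs), "documents in", round(time.perf_counter() - start, 3), "s")

    # Server-side view of the same query.
    plan = coll.find({}).explain()
    print("server-side ms:", plan.get("executionStats", {}).get("executionTimeMillis"))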
I'm a beginner with non-SQL databases like MongoDB, and I can't find anyone talking about collections with lots of data, like 1,000,000 entries or more.
I saw a page of companies on the official site, but nothing about companies with really large data.
I heard about a combination with SQL: the large data is stored in SQL tables, and only the "cache" is in MongoDB. Is that the only solution for MongoDB and large data?
We're using MongoDB to power Where's It Up and the API behind it. We're currently pushing in >3 million documents per day. MongoDB is the only storage engine in use. We were keeping a bunch of them around for a while, but we're now using TTL indexes to delete old records.
Things are going super well; just make sure you have all the indexes you need. Querying a million+ records without an index is bad, regardless of your storage engine. Auto-failover has been super helpful.
Something to watch out for is updating records to include more information: it can be pretty expensive if the document grows past its pre-allocated space. We ended up changing how we store data to avoid updates and to create new documents instead.
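Here is a hedged sketch of those two points in PyMongo (the collection and field names are made up): a TTL index so old records expire on their own, plus inserting a fresh document per result rather than growing an existing one:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    coll = MongoClient()["monitoring"]["checks"]  # hypothetical names

    # TTL index: mongod removes documents ~30 days after their created_at value.
    coll.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

    # Insert a new document per measurement instead of updating an old one,
    # so documents never outgrow their originally allocated space.
    coll.insert_one({
        "target": "https://example.com",  # made-up fields
        "status": 200,
        "created_at": datetime.now(timezone.utc),
    })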
MongoDB in its current incarnation is explicitly designed to make it easy to scale out.
As for the numbers: one of my test databases has 10M records and runs easily on my MacBook Air, which is 4 years old now.
So here is what you can do when your current cluster cannot handle the data stored (either because the indices are too big for your RAM or because processing the queries takes too long): add another node to your MongoDB cluster. Your performance gain should be somewhere between slightly below linear (if your cluster was otherwise in perfect condition) and several orders of magnitude (when indices didn't fit into RAM and/or IO was pushed to its limits before, and that situation changed after scaling out).
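For illustration, registering the extra node as a new shard from Python might look roughly like this (the router and shard hosts and the replica set name are placeholders); the same step is commonly done with sh.addShard() in the mongo shell:

    from pymongo import MongoClient

    # Connect to a mongos router, not to a shard directly (placeholder host).
    mongos = MongoClient("mongodb://mongos.example.com:27017")

    # Register an additional shard: replica set "rs1" on a placeholder host.
    mongos.admin.command("addShard", "rs1/shard2.example.com:27018")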
A word of warning: you should have somebody who knows about MongoDB administration on hand if you want to put your deployment into production. Though MongoDB administration seems easy, it is by no means something to be done by a layman, especially not for production use.
I have deployed MongoDB 64-bit 2.x on an AWS m1.large instance.
I am trying to find the best performance that Mongo can give us on AWS, in light of http://www.snailinaturtleneck.com/blog/tag/mongodb/ (and mongodb read/write performance and mongo hosting in the cloud).
I have created one database with one collection, i.e. user, and inserted 100,000 records/JSON objects (each JSON object is 4 KB in size), using a random number as a suffix to "user-". I also created an index on the user id.
Further, I set the DB profiler to log slow queries taking 20 ms or more. I executed a Java program with 10 threads. Each thread generates a user id with a random number and looks it up in the user collection in an infinite loop. With such load I have observed read/query latency of up to 60 ms.
I also observed that when I run fewer threads, say 3 or 4 (putting a query load of about 5K finds per second on the user collection), I see no latency, or latency of less than 2 ms.
I fail to understand why increasing the load of finding users in the collection causes latency. I believe MongoDB can handle far more concurrent reads than what I am attempting, and this should not impact performance as such.
One possibility I can think of is that Mongo has performance issues when a large number of queries is executed against a single collection, as in our case; I expect to have 10K to 20K queries per second on a single collection.
We would appreciate your thoughts / suggestions.
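For comparison, here is a rough Python equivalent of that benchmark (database, collection, and field names are assumptions), timing random lookups on the indexed user id from ten concurrent threads:

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor
    from pymongo import MongoClient

    coll = MongoClient()["test"]["user"]  # hypothetical database/collection
    coll.create_index("uid")              # assumed name of the indexed user-id field

    def worker(n_queries=10000):
        worst = 0.0
        for _ in range(n_queries):
            uid = "user-%d" % random.randrange(100000)
            start = time.perf_counter()
            coll.find_one({"uid": uid})
            worst = max(worst, time.perf_counter() - start)
        return worst

    # Ten reader threads, mirroring the Java test.
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(lambda _: worker(), range(10)))
    print("worst single-query latency: %.1f ms" % (max(results) * 1000))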
Some information is missing - what is your disk configuration? EBS may contribute to the latency if everything is persisted to disk.
Amazon has released a whitepaper with best practices on how to run MongoDB on EC2: MongoDB on AWS. Here's its description:
This whitepaper provides an overview of general best practices that apply to all major NoSQL systems, highlights one of the popular NoSQL systems - MongoDB - and discusses how to best run it on the AWS cloud. It further examines different MongoDB configurations so you can optimize it for performance, durability, and security.
I have a simple data set: a few collections, not more than 20 documents in each, in MongoDB 2.0 (previously 1.8). I'm getting poor results when it comes to querying data (at least I think they could be much better, looking at http://mongoid.org/performance.html). At first, I thought that the mapper I use in Ruby (Mongoid) was the problem, but I made some more tests and it seems more related to the database itself.
I've made a simple benchmark where I query the same document 10000 times by its ID, first using the Ruby Mongo driver, then Mongoid. The results:
                 user     system      total        real
    driver   7.670000   0.380000   8.050000  (  8.770334)
    mongoid  9.180000   0.380000   9.560000  ( 10.384077)
The code is here: https://gist.github.com/1303536
The machine I'm testing this on is a Core 2 Duo P8400 2.27 GHz with 4 GB of RAM, running Ubuntu 11.04.
I also made a similar test using pymongo to check if the problem lies in the Ruby driver, but the result was only slightly better (5-6 s for 10000 requests).
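For reference, a pymongo loop of that kind would look roughly like this with a modern client (database and collection names are placeholders), timing 10000 find_one calls on the same _id:

    import time
    from pymongo import MongoClient

    coll = MongoClient()["benchmark_db"]["items"]  # placeholder names
    doc_id = coll.find_one({}, {"_id": 1})["_id"]  # any existing document's ID

    start = time.perf_counter()
    for _ in range(10000):
        coll.find_one({"_id": doc_id})
    print("10000 find_one calls in %.2f s" % (time.perf_counter() - start))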
The bsonsize of the document I'm fetching is 67. It has some small embedded documents, but not more than 100. Some of the embedded documents refer to documents from other collections by ID, but AFAIR this relationship is handled by the mapper, so it shouldn't influence the performance. Fetching this document directly in the database with explain() results in millis = 0.
The odd thing is that the HDD LED keeps blinking all the time during the tests. Shouldn't this document be cached in RAM by Mongo after the first read? Is there something obvious I could be missing? Or is this not a poor result at all (but comparing with http://mongoid.org/performance.html it does seem bad)?
I dropped and recreated the database. Maybe it was because of going from 1.8 to 2.0. Anyway, the HDD LED stopped blinking and everything is now 2-3x faster.
I also looked carefully at the test that was used to benchmark Mongoid, and this result (0.001 s) is just for one find(), not a million. I told Mongoid's author that I think it's not stated clearly on the website that the number of operations applies only to some of them.
Sorry for the confusion.