Slow insert() using MongoDB

Here is the issue:
If the collection only has the default "_id" index, the time to insert a set of documents stays constant as the collection grows.
But if I add the following index to the collection:
db.users.createIndex({"s_id": "hashed"}, {"background": true})
then the time to insert the same set of documents increases drastically (it looks like an exponential function).
Context:
I'm trying to insert about 80 million documents into a collection. I don't use MongoDB's sharding; there is only one instance.
I'm using the Python API (pymongo) and here is my code:
import pymongo

client = pymongo.MongoClient(ip_address, 27017)
users = client.get_database('local') \
              .get_collection('users')

bulk_op = users.initialize_unordered_bulk_op()
for s in iterator:   # 'iterator' yields the documents for this partition
    bulk_op.insert(s)
bulk_op.execute()

client.close()
There are 15 concurrent connections (I'm using Apache Spark, and each connection corresponds to a different partition).
The instance has 4GB of RAM.
The total size of the indexes is about 1.5GB when the upload is finished.
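For reference, here is roughly how that index size can be checked from pymongo; a minimal sketch assuming the same local.users namespace as above:
import pymongo

client = pymongo.MongoClient(ip_address, 27017)  # same connection parameters as above

# collStats reports index sizes for a single collection, in bytes.
stats = client.get_database('local').command('collStats', 'users')
print('totalIndexSize:', stats['totalIndexSize'])
print('indexSizes:', stats['indexSizes'])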
Thanks a lot for your help.

Related

Deleting large amounts of data from MongoDB

I have the following code that currently works. It goes through and finds every file that's newer than a specified date and that matches a regex, then deletes it, as well as the chunks that are pointing to it.
conn = new Mongo("<url>");
db = conn.getDB("<project>");
res = db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/});

while (res.hasNext()) {
    var tmp = res.next();
    db.getCollection('fs.chunks').remove({"files_id" : tmp._id});
    db.fs.files.remove({ "_id" : tmp._id});
}
It's extremely slow, and most of the time, the client I'm running it from just times out.
Also, I know that I'm deleting files from the filesystem, and not from the normal collections. It's a long story, but the code above does exactly what I want it to do.
How can I get this to run faster? It was pointed out to me earlier that I'm running this code on a client, but is it possible to run it on the server side? Before this I was trying to use the JavaScript driver, which is probably why. I assume using the Mongo shell executes everything on the server.
Any help would be appreciated. So close but so far...
I know that I'm deleting files from the filesystem, and not from the normal collections
GridFS is a specification for storing binary data in MongoDB, so you are actually deleting documents from MongoDB collections rather than files from the filesystem.
It was pointed out to me earlier that I'm running this code on a client, but is it possible to run it on the server side?
The majority of your code (queries and commands) is being executed by your MongoDB server. The client (mongo shell, in this case) isn't doing any significant processing.
It's extremely slow, and most of the time, the client I'm running it from just times out.
You need to investigate where the time is being spent.
If there is problematic network latency between your mongo shell and your deployment, you could consider running the query from a mongo shell session closer to the deployment (if possible) or use query criteria matching a smaller range of documents.
Another obvious candidate to look into would be server resources. For example, is deleting a large number of documents putting pressure on your I/O or RAM? Reducing the number of documents you delete in each script run may also help in this case.
db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/})
This query likely isn't doing what you intended: the filename regex is being provided as the second option to find() (so it is used for projection rather than as search criteria), and the regex matches filenames containing "png" anywhere (for example: typng.doc).
I assume using the Mongo shell executes everything on the server.
That's an incorrect general assumption. The mongo shell can evaluate local functions, so depending on your code there may be aspects that are executed/evaluated in a client context rather than a server context. Your example code is running queries/commands which are processed on the server, but fs.files documents returned from your find() query are being accessed in the mongo shell in order to construct the query to remove related documents in fs.chunks.
How can I get this to run faster?
In addition to the comments noted above, there are a few code changes you can make to improve efficiency. In particular, you are currently issuing a separate remove round trip for each file's chunks. The Bulk API in MongoDB 2.6+ will reduce the round trips required per batch of deletes.
Some additional suggestions to try to improve the speed:
Add an index on {uploadDate:1, filename: 1} to support your find() query:
db.fs.files.createIndex({uploadDate:1, filename: 1})
Use the Bulk API to queue the chunk removals and execute them in batches rather than issuing one remove call per file:
var bulk = db.fs.chunks.initializeUnorderedBulkOp();
while (res.hasNext()) {
    var tmp = res.next();
    // Queue removal of all chunks belonging to this file.
    bulk.find( {"files_id" : tmp._id} ).remove();
    db.fs.files.remove({ "_id" : tmp._id});
}
// Execute the queued chunk removals in batches.
bulk.execute();
Add a projection to the fs.files query to only include the fields you need:
var res = db.fs.files.find(
    // query criteria
    {
        uploadDate: { $gte: new ISODate("2017-04-04") },

        // Filenames that end in .png
        filename: /\.png$/
    },

    // Only include the _id field
    { _id: 1 }
)
Note: unless you've added a lot of metadata to your GridFS files (or have a lot of files to remove) this may not have a significant impact. The default fs.files documents are ~130 bytes, but the only field you require is _id (a 12 byte ObjectId).
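If you are driving this cleanup from Python rather than the mongo shell, a rough pymongo sketch of the same idea might look like the following. It is only an illustration under assumptions: the connection string and database name are placeholders, and it collects the matching file _ids first so the chunks and files can each be removed with a single bulk delete ($in) instead of per-file round trips.
import datetime
import pymongo

client = pymongo.MongoClient("mongodb://<url>")   # placeholder connection string
db = client["<project>"]                          # placeholder database name

# Fetch only the _id of files uploaded on/after the cutoff whose name ends in .png.
file_ids = [
    doc["_id"]
    for doc in db["fs.files"].find(
        {"uploadDate": {"$gte": datetime.datetime(2017, 4, 4)},
         "filename": {"$regex": r"\.png$"}},
        {"_id": 1},
    )
]

# Remove all related chunks and file documents with two bulk deletes.
db["fs.chunks"].delete_many({"files_id": {"$in": file_ids}})
db["fs.files"].delete_many({"_id": {"$in": file_ids}})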

Improve the speed of large collection querying

I am querying a collection that contains 3 million items (the collection size is 15GB).
//indexing
{name : 1}
//query
db.getCollection('contacts').find({name :"kello"}).limit(500)
The machine has 2 cores and 8GB of memory, and it takes about 30 seconds to finish this query. It is impossible to keep clients waiting for about half a minute.
What can I do to accelerate it?
Would it work to get a machine with 16GB/32GB of memory and configure MongoDB to cache the whole collection in memory, so that the query can be answered entirely from RAM?
Or should I get several 8GB machines and build a sharded cluster?
Will those methods improve the query speed?
You can create an index on the field you are querying to get a faster response. You can also use Elasticsearch to get quicker search responses:
https://www.compose.com/articles/mongoosastic-the-power-of-mongodb-and-elasticsearch-together/
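For the index route, a minimal pymongo sketch (the connection string and the database name "mydb" are assumptions; the collection and query come from the question) would be to create the index once and then confirm the query actually uses it with explain():
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder connection
contacts = client["mydb"]["contacts"]                      # "mydb" is an assumed database name

# Create the single-field index on "name" (a no-op if it already exists).
contacts.create_index([("name", pymongo.ASCENDING)])

# Verify the query is served by the index (IXSCAN) rather than a collection scan.
plan = contacts.find({"name": "kello"}).limit(500).explain()
print(plan["queryPlanner"]["winningPlan"])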

Mongodb: keeping a frequently written collection in RAM

I am collecting data from a streaming API and I want to create a real-time analytics dashboard. This dashboard will display a simple timeseries plotting the number of documents per hour. I am wondering if my current approach is optimal.
In the following example, on_data is fired for each new document in the stream.
# Mongo collections.
records = db.records
stats = db.records.statistics

def on_data(self, data):
    # Create a json document from data.
    document = simplejson.loads(data)
    # Insert the new document into records.
    records.insert(document)
    # Update a counter in records.statistics for the hour this document belongs to.
    stats.update({'hour': document['hour']}, {'$inc': {document['hour']: 1}}, upsert=True)
The above works. I get a beautiful graph which plots the number of documents per hour. My question is about whether this approach is optimal or not. I am making two Mongo requests per document. The first inserts the document, the second updates a counter. The stream sends approximately 10 new documents a second.
Is there, for example, any way to tell Mongo to keep db.records.statistics in RAM? I imagine this would greatly reduce disk access on my server.
MongoDB uses memory-mapped files to handle its file I/O, so it essentially treats all data as if it were already in RAM and lets the OS figure out the details. In short, you cannot force your collection to be in memory, but if the operating system handles things well, the stuff that matters will be. Check out this link to the docs for more info on mongo's memory model and how to optimize your OS configuration to best fit your use case: http://docs.mongodb.org/manual/faq/storage/
But to answer your issue specifically: you should be fine. Your 10 or 20 writes per second should not be a disk bottleneck in any case (assuming you are running on not-ancient hardware). The one thing I would suggest is to build an index over "hour" in stats, if you are not already doing that, to make your updates find documents much faster.
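A minimal pymongo sketch of that index suggestion (the connection and database name are placeholders; the collection path matches the question):
import pymongo

client = pymongo.MongoClient()   # placeholder: default localhost connection
db = client["mydb"]              # assumed database name
stats = db.records.statistics

# Index the field used in the upsert's query so each counter update is a quick index lookup.
stats.create_index([("hour", pymongo.ASCENDING)])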

Mongodb Sharding and Indexing

I have been struggling to deploy a large database.
I have deployed 3 shard clusters and started indexing my data.
However, it's been 16 days and I'm only halfway through.
The question is: should I import all the data into a non-sharded cluster, then activate sharding once the raw data is in the database, and then attach more clusters and start indexing? Will this auto-balance my data?
Or should I wait another 16 days for the current method I am using...
Edit:
Here is more explanation of the setup and data that is being imported...
So we have 160 million documents that look like this:
"_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
"subject" : "<concept/resource/propert/122322xyz>",
"predicate" : "<concept/property/os/123ABCDXZYZ>",
"object" : "<http://host/uri_to_object_abcdy>"
Indexes: subject, predicate, object, subject > predicate, object > predicate
Shard keys: subject, predicate, object
Setup:
3 clusters on AWS (each with 3 Replica sets) with each node having 8 GiB RAM
(Config servers are within each cluster and Mongos is in a separate server)
The data gets imported by a Java program through the mongos.
What would be the ideal way to import this data, index it, and shard it (without waiting a month for the process to be completed)?
If you are doing a massive bulk insert, it is often faster to perform the insert without an index and then index the collection. This has to do with the way Mongo manages index updates on the fly.
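As a rough illustration of that load-then-index approach in pymongo (a sketch only: the connection, the database/collection names, and the 'documents' iterable are assumptions; the index fields mirror the triples described in the question):
import pymongo

client = pymongo.MongoClient()          # placeholder connection
triples = client["mydb"]["triples"]     # assumed database/collection names

# 1. Bulk load with only the default _id index in place.
triples.insert_many(documents, ordered=False)   # 'documents' is your iterable of triple documents

# 2. Build the secondary indexes once the data is loaded.
triples.create_index([("subject", 1)])
triples.create_index([("predicate", 1)])
triples.create_index([("object", 1)])
triples.create_index([("subject", 1), ("predicate", 1)])
triples.create_index([("object", 1), ("predicate", 1)])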
Also, MongoDB is particularly sensitive to memory when it indexes. Check the size of your indexes in your db.stats() and hook up your DBs to the Mongo Monitoring Service.
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
It is running out of physical memory or has gotten itself into a poor I/O pattern. MMS can help diagnose both; check out the page faults graph in particular.
It is operating on unindexed collections, which does not apply in your case.

general questions about using mongodb

I'm thinking about trying MongoDB for storing our stats, but I have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents, what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about mongodb was the grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group(
    { cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
    , key: {http_action: true}
    , initial: {count: 0, total_time: 0}
    , reduce: function(doc, out){ out.count++; out.total_time += doc.response_time }
    , finalize: function(out){ out.avg_time = out.total_time / out.count }
    } );
But my main concern is how hard that command, for example, would be on the server if there are, say, tens of millions of records across dozens of documents, on a 512MB-1GB RAM server on Rackspace. Would it still run with low load?
Is there any limit to the number of documents MongoDB can have (separate databases)? Also, is there any limit to the number of records in a tree like the one I explained above? Also, does the query I showed above run instantly, or is it some sort of map/reduce query? I'm not sure whether I can execute it on page load in our control panel to get those stats instantly.
Thanks!
Every document has a size limit of 4MB (which in text is A LOT).
It's recommended to run MongoDB in replication mode or to use sharding as you otherwise will have problems with single-server durability. Single-server durability is not given because MongoDB only fsync's to the disk every 60 seconds, so if your server goes down between two fsync's the data that got inserted/updated in that time will be lost.
There is no limit on the number of documents in MongoDB other than your disk space.
You should try to import a dataset that matches your data (or generate some test data) to MongoDB and analyse how fast your query executes. Remember to set indexes on those fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (log will get pretty huge otherwise).
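If you are working from pymongo rather than the shell, a rough equivalent (assuming a db handle obtained from a MongoClient; the database name is a placeholder) is:
import pymongo

db = pymongo.MongoClient()["mydb"]       # placeholder connection and database name

db.command("profile", 2)                 # enable profiling for all operations
# ... run your query here ...
for entry in db["system.profile"].find():
    print(entry)                         # inspect the recorded operations
db.command("profile", 0)                 # turn profiling off again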
Regarding your database layout, I suggest changing the "schema" (yeah yeah, schemaless...) to:
website (collection):
- some keys/values about the particular document
statistics (collection)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
+ DBRef to website
See Database References
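A sketch of what that split layout could look like from pymongo; the collection and field names are illustrative, and a plain manual reference is used here instead of a formal DBRef:
import datetime
import pymongo

client = pymongo.MongoClient()              # placeholder connection
db = client["mydb"]                         # assumed database name

# One document per website, holding only the site-level keys/values.
site_id = db.website.insert_one({"domain": "example.com"}).inserted_id

# One document per pageview in a separate collection, referencing the parent website.
db.statistics.insert_one({
    "website_id": site_id,                  # reference back to the website document
    "timestamp": datetime.datetime.utcnow(),
    "ip": "203.0.113.7",
    "browser": "Firefox",
})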
Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.
Basically the amount of page views a page can generate is infinite, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends you use a map-reduce query instead of group(), because it has some limitations with large datasets and sharded environments.