Fast query and deletion of documents in a large collection in MongoDB

I have a collection (let's say CollOne) with several million documents. They all have a common field "id":
{...,"id":1}
{...,"id":2}
I need to delete some documents in CollOne by id. Those ids are stored in a document in another collection (CollTwo). This ids_to_delete document has the following structure:
{"action_type":"toDelete","ids":[4,8,9,....]}
As CollOne is quite large, finding and deleting one document will take quite a long time. Is there any way to speed up the process?

You can't really avoid a deletion operation in the database if you want to delete anything. If you're having performance issues, I would recommend making sure you have an index built on the id field; otherwise Mongo will use a COLLSCAN to satisfy the query, which means it will iterate over the entire collOne collection, which is, I guess, where you feel the pain.
Once you make sure an index is built, there is no "more" efficient way than using deleteMany:
db.collOne.deleteMany({id: {$in: [4, 8, 9, ...]}})
In case you don't have an index and wonder how to build one, you should use createIndex like so:
(Prior to version 4.2, building an index locks the entire database; at large scale this could take several hours if not more. To avoid this, use the background option.)
db.collOne.createIndex({id: 1})
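On a pre-4.2 server, a background build might look like this (a minimal sketch, reusing the collection name from the question):
// background: true avoids the database-wide lock on versions before 4.2
db.collOne.createIndex({id: 1}, {background: true})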
---- EDIT ----
In Mongo shell:
The Mongo shell is JavaScript based, so you just have to execute the same logic with JS syntax. Here's how I would do it:
let toDelete = db.collTwo.findOne({ ... })
db.collOne.deleteMany({id: {$in: toDelete.ids}})
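Putting it together, and assuming the lookup document is identified by the action_type field shown in the question, the whole flow might look like this (a sketch, not the only way to do it):
// Assumption: the ids document is found by its action_type field
let toDelete = db.collTwo.findOne({action_type: "toDelete"})
db.collOne.deleteMany({id: {$in: toDelete.ids}})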

Related

Parse-Server source _rperm and _wperm mongo queries, and should I index them?

I'm going through my mongo logs to add indexes for my unindexed queries. By default, mongo only logs queries that take over 100ms to complete.
I've found that I have several on _wperm and _rperm keys. I see that is how the ACL gets broken down. But what type of Parse.Query call might create a query like this in the logs?
query: { orderby: {}, $query: { _rperm: { $in: [ null, "*", "[UserId]" ] } } }
I'm even noticing that this query is on a class that has only 8 total objects, yet it takes 133ms to complete, which seems really slow for such a small class, even if it had to do an in-memory sort and scan.
Should I solve this at the code level, modifying my query to avoid this type of mongo query? Or should I add an index for these types of queries?
I notice I also have a few that are showing up in the Slow Queries tab on mLab. The query looks like {"_id":"<val>","_wperm":{"$in":["<vals>"]}}, with the suggested index {"_id": 1, "_wperm": 1}, but it has the following note:
"_id" is in the existing {"_id": 1} unique index. The following index recommendation should only be necessary in certain circumstances.
Yet this is one of my slower queries, taking 320 ms to complete. It's on the _User class. Is that just because the _User class has a lot of rows? Since _id is unique, I feel like adding a _wperm index shouldn't really make a difference, as I end up with only a single object.
I'm curious if I will see a benefit from taking action on these queries or if I should safely ignore them.
You should index your collections following the MongoDB recommendations. In the good old parse.com days, those indexes were created automatically based on the observed workloads; now you need to create them yourself. Both make sense: the _rperm index will be hit on every query run without a masterKey, and the _wperm index on every write. In the future, we could automatically create the _id + _wperm index, as all writes use it.
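A minimal sketch of what those index builds might look like in the mongo shell (the _User class is just the example from the question; repeat per class as needed):
// Read path: queries issued without the masterKey filter on _rperm
db.getCollection("_User").createIndex({_rperm: 1})
// Write path: the compound index suggested by mLab in the question
db.getCollection("_User").createIndex({_id: 1, _wperm: 1})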

Is there any way to restore a predefined schema to MongoDB?

I'm a beginner with MongoDB. I want to know: is there any way to load a predefined schema into MongoDB? (For example, like Cassandra, which uses a .cql file for this purpose.)
If there is, please point me to some documentation about the structure of that file and the way to restore it.
If there is not, how can I create an index only once, when I create a collection? I think it is wrong to create the index every time I call the insert method or run my program.
P.S.: I have a multi-threaded program in which every thread inserts into and updates my Mongo collection. I want to create the index only once.
Thanks.
To create an index on a collection you use the ensureIndex command. You only need to call it once to create an index on a collection.
If you call ensureIndex repeatedly with the same arguments, only the first call will create the index; all subsequent calls will have no effect.
So if you know what indexes you're going to use for your database, you can create a script that calls that command.
An example insert_index.js file that creates 2 indexes for collA and collB collections:
db.collA.ensureIndex({ a : 1});
db.collB.ensureIndex({ b : -1});
You can call it from a shell like this:
mongo --quiet localhost/dbName insert_index.js
This will create those indexes on a database named dbName on your localhost. It's worth noting that if your database and/or collections are not yet created, this will create both the database and the collections for which you're adding the indexes.
Edit
To clarify a little bit: MongoDB is schemaless, so you can't restore its schema.
You can only create indexes and collections (by using the createCollection helper).
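For example, a small sketch of pre-creating a collection explicitly alongside its index (names are placeholders):
// Creating the collection explicitly is optional; index builds and inserts create it implicitly
db.createCollection("collA")
db.collA.ensureIndex({ a : 1 })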
MongoDB is basically schemaless so there is no definition of a schema or namespaces to be restored.
In the case of indexes, these can be created at any time. There does not need to be a collection present, or even the required fields for the index, as this will all be sorted out as the collections are created and as documents are inserted that match the defined fields.
Commands to create an index are generally the same with each implementation language, for example:
db.collection.ensureIndex({ a: 1, b: -1 })
This will define the index on the target collection in the target database, referencing field "a" and field "b", the latter in descending order. It will happen even if the collection or even the database does not yet exist; in that case it will establish a blank namespace.
Subsequent calls to the same index creation method do not actually re-create the index. Where the specified index matches one that already exists, it is effectively skipped as a "no-operation".
As such, you can simply run all your required index creation statements at application startup; anything that is not already present will be created, and anything that already exists will be left alone.
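In startup code the same idea is just a list of index definitions applied in a loop; a rough sketch in mongo-shell JavaScript (collection and field names are placeholders):
// Run once at application startup; identical existing indexes are no-ops
var indexSpecs = [
  { coll: "collA", keys: { a: 1 } },
  { coll: "collB", keys: { b: -1 } }
];
indexSpecs.forEach(function (spec) {
  db.getCollection(spec.coll).ensureIndex(spec.keys);
});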

How to make the distinct operation faster in MongoDB

There are 30,000,000 records in one collection.
When I run the distinct command on this collection from Java, it takes about 4 minutes; the result count is about 40,000.
Is MongoDB's distinct operation really so inefficient?
And how can I make it more efficient?
Is MongoDB's distinct operation really so inefficient?
At 30m records? I would say 4 minutes is actually quite good; I think that's about as fast as, maybe a little faster than, SQL would do it.
I would probably test this in other databases before saying it is inefficient.
However, one way of looking at performance is to first check whether the field is indexed, and whether that index fits in RAM or can be loaded without page thrashing. distinct() can use an index, so long as the field has one.
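So a first step is simply to make sure the field has an index (a sketch; the collection and field names are placeholders):
// Without this index, distinct() has to scan all 30m documents
db.collection.ensureIndex({ field : 1 })
db.collection.distinct("field")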
And how can I make it more efficient?
You could use a couple of methods:
Incremental map-reduce that distincts the main collection into a separate unique collection every, say, 5 minutes
Pre-aggregation on save, writing to two collections, one detail and one unique (sketched below)
Those are the two most viable methods of getting around this performantly.
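A rough sketch of the pre-aggregation approach (collection and field names are placeholders, and this is just one way to do it):
// Save the detail document as usual
var doc = { field: "ipad", other: "..." };
db.detail.insert(doc);
// Upsert the distinct value into a small "unique" collection at the same time
db.unique.update(
  { _id: doc.field },          // the distinct value acts as the key
  { $inc: { count: 1 } },
  { upsert: true }
);
// The distinct values are then just the _ids of the small unique collection
db.unique.distinct("_id");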
Edit
distinct() is not outdated, and if it fits your needs it is actually more performant than $group since it can use an index.
The .distinct() operation is an old one, as is .group(). In general these have been superseded by .aggregate(), which should generally be used in preference to them:
db.collection.aggregate([
    { "$group": {
        "_id": "$field",
        "count": { "$sum": 1 }
    }}
])
Substitute "$field" with whatever field you wish to get a distinct count for. The $ prefix on the field name references its value.
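If what you actually want is the number of distinct values rather than the per-value counts, a second $group stage can count the groups (a sketch, using the same placeholder field name):
db.collection.aggregate([
    { "$group": { "_id": "$field" } },
    { "$group": { "_id": null, "distinctCount": { "$sum": 1 } } }
])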
Look at the documentation and especially $group for more information.

How to build an index in MongoDB in this situation

I have a MongoDB database in which documents have the following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
which is a reverse index over a log of docs (about 120 million).
There are two kinds of queries in my system.
One of them is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches the word "ipad" on date 20140113 and sorts all the docs by score.
Another query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of queries, what indexes should I build?
Should I build two indexes like this?
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1})
But I think building the two indexes uses too much hard disk space.
So do you have some good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed by .explain().
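For example, a quick check in the shell might look like this (a sketch using the question's own second query):
// The output shows which index (if any) was used and how many documents were scanned
db.index.find({"word":"ipad", "date":20140113, "docid":324}).explain()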
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes; keep in mind that every collection must have a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with a uniqueness constraint and you have no use for the _id field, you can move that unique constraint to the _id field to save an index. Your documents would then look like this:
{ _id: {
"word":"ipad",
"date":20140113,
"docid": 324
},
"score": 98
}
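With that layout, the exact lookup from the second query becomes a plain equality match on _id and is served by the built-in _id index (a sketch; note that the field order inside the sub-document must match the stored order exactly):
// Exact-match lookup via the built-in _id index; field order matters
db.index.find({_id: {"word":"ipad", "date":20140113, "docid": 324}})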

MongoDB index on many (nested) fields/attributes

In an e-commerce application I have documents like this:
{ category:'A', ..., price:122,
attr:{ width:6, height:4, hasLCD:true, lcdType:'some text', ..., a36:null }
}
I.e. every product has many attributes of various simple types.
Now I want to filter products by dynamic queries containing top level fields plus some attributes. For example:
find({category:'A', price:{$lt:200}, ...,
'attr.height':{$lt:6}, 'attr.hasLCD':true, 'attr.lcdType':{$in:[...]}, ...})
And I'd like this to perform fast.
Trying to index on all possible 'attr.*' variants gives me an error (too many compound keys). I also suspect that if I index it that way and then omit one of the attrs in the query, the index won't work.
Trying to index on 'attr' as a whole does not help either.
What is the proper way to model this under MongoDB?
Update
I have tried this approach (also mentioned here), i.e. storing attributes as an array of key-value pairs:
attr2: [ {tag:'lcdType', value:'some text'}, ...
And index it like this:
ensureIndex({ 'attr2.tag':1, 'attr2.value':1 })
And query like this:
find({attr2:{$all:[
{$elemMatch:{tag:'bestseller',value:true}},
{$elemMatch:{tag:'weight',value:{$lte:100}}}
]}})
Now explain() says that it is using "BtreeCursor attr2.tag_1_attr2.value_1", but "nscanned" is still 31607 and the whole execution time has actually increased (compared to the non-indexed scenario).
Something is wrong here.
Sub-question
What if I select some (fewer than 31) of the most frequently queried attributes and try to index on those? If I put all of them in a single compound index:
ensureIndex({'attr.a1':1, 'attr.a2':1, ...})
According to the docs, this index won't be used for queries missing the attr.a1 attribute.
How to define index in this case?
If you really have to allow a lot of filters, combinations, and possibly even sorts, MongoDB is not a good fit because it uses only one index per query. The number of indexes then grows far too fast, because compound keys are somewhat inflexible (that should answer the sub-question), and they become a performance hog.
Use a search database like ElasticSearch, SolR, etc. instead, as they come with the features you need. You can then use $in on the ids that the search server returned if you want to keep the base information in MongoDB (it's usually a good idea to have the search database simply replicate the information in the primary data store so you don't need to sync changes two-way, which would be a nightmare).
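A rough sketch of that hybrid flow (the search-server call is pseudocode and purely illustrative; only the MongoDB part is concrete, and the collection name is a placeholder):
// Assumption: searchServer.query(...) stands in for an ElasticSearch/SolR query returning matching product ids
var ids = searchServer.query({ category: 'A', 'attr.hasLCD': true });
// Fetch the full documents from MongoDB, which stays the primary data store
db.products.find({ _id: { $in: ids } });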