Should I use a sparse index for boolean flags in MongoDB?

I have a boolean flag :finished. Should I
A: index({ finished: 1 })
B: index({ finished: 1 }, {sparse: true})
C: use flag :unfinished instead, to query by that
D: other?
Ruby Mongoid syntax. Most of my records will have finished=true, and most operations fetch the unfinished ones, obviously. I'm not sure I understand when to use sparse and when not to. Thanks!

The sparse flag is a little weird. To understand when to use it, you have to understand why "sparse" exists in the first place.
When you create a simple index on one field, there is an entry for each document, even documents that don't have that field.
For example, if you have an index on {rarely_set_field : 1}, the index is filled mostly with null entries, because the field doesn't exist in most documents. This wastes space and is inefficient to search.
The {sparse: true} option gets rid of those null entries, so you get an index that only contains entries for documents where {rarely_set_field} is present.
Back to your case.
You are asking about using a boolean + sparse. But sparse doesn't really care about "boolean"; sparse cares about "is set" vs. "is not set".
In your case, you are trying to fetch unfinished documents. To leverage sparse, the key is not the boolean value, but the fact that unfinished entries have that field while "finished" entries have no such field at all:
{ _id: 1, data: {...}, unfinished: true }
{ _id: 2, data: {...} } // this entry is finished
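A minimal sketch of that pattern in mongo shell syntax (the collection name tasks is an assumption; Mongoid's index macro creates the same index):
db.tasks.createIndex({ unfinished: 1 }, { sparse: true })

// unfinished documents carry the flag, finished documents simply omit it
db.tasks.insertOne({ data: "pending work", unfinished: true })
db.tasks.insertOne({ data: "done work" })

// this query can be answered from the small sparse index
db.tasks.find({ unfinished: true })

// "finishing" a task means unsetting the flag rather than writing false
db.tasks.updateOne({ data: "pending work" }, { $unset: { unfinished: "" } })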
It sounds like you are using a Queue
You can definitely leverage the information above to implement a sparse index. However, it actually sounds like you are using a queue. MongoDB is serviceable as a queue, and there are examples of this pattern around.
If you look at those queue implementations, though, they are not doing it the way you are doing it. I'm personally using MongoDB as a queue for some production systems and it runs pretty well, but test your expected load, as a dedicated queue will perform much better.

Sparse only helps when the field is missing from most documents, not when it is set to false. Since you say most records will have finished=true, I'm guessing the field is present on nearly every document, which makes sparse not very beneficial.
And since most documents share a single value, I doubt any index on that field alone would help much unless the rest of your query is selective.


How should I efficiently delete a lot of records from a MongoDB collection?

I am using Mongo to store multi-tenant data. As part of data cleanup for a tenant I want to delete everything related to that tenant. The tenantId is indexed, but there are a lot of documents, the delete takes a long time, and I have no easy way to track progress.
Currently I do something like:
db.records.deleteMany({tenantId: x})
Is there a better way?
I'm thinking of doing it in batches: query for a chunk of records, build a list of ids, then delete those (see the sketch below). It seems very manual, but is it the recommended way?
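A minimal sketch of that batched idea in the shell (the batch size is arbitrary; x is the tenant id as above):
var batchSize = 50000;
var total = 0;
while (true) {
  var ids = db.records.find({ tenantId: x }, { _id: 1 })
                      .limit(batchSize)
                      .toArray()
                      .map(function (d) { return d._id; });
  if (ids.length === 0) break;
  // delete one batch and report progress
  total += db.records.deleteMany({ _id: { $in: ids } }).deletedCount;
  print("deleted so far: " + total);
}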
Some options that I can think of.
Drop the index before deleting; you can recreate the index after the deletion.
Change the write concern to a lower value, possibly 0, so the request won't wait for acknowledgement:
db.records.deleteMany({tenantId: x},{w : 0});
If there is another field with enough cardinality to reduce the number of documents per statement, try including it in the query.
For example, if anotherField has 0, 1, 2, 3 as its values, execute the delete command four times, once per value:
db.records.deleteMany({tenantId: x, anotherField: 0},{w : 0});
db.records.deleteMany({tenantId: x, anotherField: 1},{w : 0});
db.records.deleteMany({tenantId: x, anotherField: 2},{w : 0});
db.records.deleteMany({tenantId: x, anotherField: 3},{w : 0});
The performance may depend on a variety of factors, but here are some options you can try to improve it.
Bulk operations
Bulk operations might help here. bulk.find(query).remove() is a version of db.collection.remove(query) that is optimized for large numbers of operations. You can read more about it here.
You can use the following way:
Declare a search query:
var query= {tenantId: x};
Initialize and use a bulk:
var bulk = db.yourCollection.initializeUnorderedBulkOp()
bulk.find(query).remove() // or try delete() instead of remove()
bulk.execute()
The idea here is not so much to speed up the removal as to produce less load.
Also you could try bulkWrite()
db.yourCollection.bulkWrite([
  { deleteMany: { "filter": query } }
])
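For reference, the result returned by bulkWrite() includes a deletedCount field, so you can see how many documents the call removed.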
TTL indexes
It may not be suitable for your use case, but there's an entirely different approach that avoids doing the removal yourself at all.
If it is acceptable for you to delete data based on a timestamp, then a TTL index might help. The idea is that a record is removed automatically once its TTL expires.
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
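A minimal sketch, assuming documents carry a createdAt date field (the field name and the 30-day window are assumptions):
// documents become eligible for removal roughly 30 days after their createdAt value;
// the background TTL monitor runs about once a minute
db.records.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 })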
deleteMany: I think there must be something common between all the documents that you want to remove from the collection.
If you can find such a common property, you can build a query around it accordingly.
That will let you remove those records quickly.
Let me give you one example: I want to remove all the records where the username field does not exist.
db.collection.deleteMany({ username: {$exists: false} })
The best place to start is to find something that all the records have in common, in order to remove them all at once.
For example, the following code deletes all entries that don't contain an email address.
db.users.deleteMany({ email: { $exists: false } })
The MongoDB documentation has great examples; a link is provided below.
https://www.mongodb.com/docs/manual/reference/method/db.collection.deleteMany/#delete-multiple-documents
You might also want to consider dropping the index, since it can be recreated after you're done with the operation.
Finally, you might want to lower the write concern of your operation in order to speed things up. A complete list of options can be found here:
https://www.mongodb.com/docs/v5.0/reference/write-concern/#w-option
I found a good tutorial on https://www.geeksforgeeks.org/mongodb-delete-multiple-documents-using-mongoshell/ that might help you further.
apologies for any grammatical mistakes since English is not my native tongue
I would suggest two solutions. Also, please export your collection first so you have a backup of your data if anything goes wrong, or try this in your test DB first.
1) You can use the tenantId field itself as the condition, not matching on _id: if a record has a tenantId, delete it, so this way all of your tenant data is removed with a single query.
db.records.deleteMany({tenantId: {$exists: true}})
// suggestion: if some of your tenant data has a tenantId field that is null, you can also check for a null value to delete those records.
 
2) Find a field value that is common to all of the records, and use it as the condition to delete those records.
For example, if all of your tenant data has a common field called type with the same value, use a delete statement like:
db.records.deleteMany({type: 1})

Need to count boolean values in Tableau

I'm using mongo with Tableau and have a boolean called "verified" that shows as true vs false.
Each user can add "certifications" to his/her record, then we go in with an admin tool and flag the cert as verified:true or verified:false. I want to show a simple table that has the number of certifications for each user, then another column with the number verified.
Currently I'm using "COUNTD([Certifications.Verified])" to count the number verified, but I don't think it's counting accurately.
This just counts whether the "verified" sub-field exists with a true or false state, so the numbers are not accurate. Note that in some cases this node doesn't exist at all and shows up as null.
I need a way to count it as 1 if verified=true, and as 0 if no verified node exists or verified=false.
How do I add the logic to count this accurately in Tableau?
Update: Thanks for the Mongo queries but I'm looking for Tableau custom fields to show this.
You're going to want to use the $cond operator inside an .aggregate() pipeline. It allows you to specify what should be returned based on a conditional, which in your case is the Verified field. I don't know how your data is structured, but I would imagine using something like this:
$sum: { $cond: ["$Certifications.Verified", 1, 0] }
If Verified is true for that certification, it will return a 1 which will be accounted for in the $sum operator. Whether you want to use something like $group operator or a $project to create this summed field will depend on your preference/use case.
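A hedged sketch of a full pipeline, assuming one document per user with an array of certification sub-documents (the collection name users and the field names Certifications and Verified are guesses based on the question; $unwind with preserveNullAndEmptyArrays needs MongoDB 3.2+):
db.users.aggregate([
  { $unwind: { path: "$Certifications", preserveNullAndEmptyArrays: true } },
  { $group: {
      _id: "$_id",
      // a certification counts toward the total whenever the sub-document exists
      totalCerts: { $sum: { $cond: [{ $ifNull: ["$Certifications", false] }, 1, 0] } },
      // missing or false Verified counts as 0, true counts as 1
      verifiedCerts: { $sum: { $cond: [{ $eq: ["$Certifications.Verified", true] }, 1, 0] } }
  } }
])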
You can use this to return the count:
schemaName.find({ "Certifications.Verified": true }).count(function (error, count) {
  console.log(count);
});
It returns a non-zero value (the number of documents satisfying the condition) if any document with Certifications.Verified = true exists; otherwise it returns 0.
Replacing whatever your table's key is with RowID:
COUNTD(IIF([Certifications.Verified]=1, RowID, NULL))
COUNTD() is useful but can be computationally inefficient on large data sets, so don’t use COUNTD() in situations where a simpler, faster aggregation function will work equally well.
If you simply want to know how many records satisfy a condition, then from Tableau just use SUM(INT(<condition>)). The INT() type conversion function converts True to 1 and False to 0. So if the level of detail of your data has one record per THING, and you want to know how many THINGs satisfy your condition, SUM(INT(<condition>)) will do the trick, faster than using count distinct on record ids.
There are even some data sources, like MS Access, that don’t implement COUNTD()
Bottom line,
SUM(INT()) is the simplest way to count records that satisfy a condition
COUNTD() is very flexible and useful, but can be slow. For large data sets or high performance environments, consider alternatives such as reshaping your data.
BTW, similar advice applies to LOD calculations. They are very useful, and flexible, but introduce complexity and performance costs. Use them when necessary, but don’t use them when a simpler, faster approach will suffice. I see a lot of people just use FIXED LOD calcs for everything, presumably because it seems a lot like SQL. Overdone, that can lead to a very brittle solution.

MongoDB: Indexes, Sorting

After having read the official documentation on indexes, sorting, and index intersection, I'm a little confused about how everything works together.
I'm having trouble making my query use the indexes I've created. I'm on MongoDB 3.0.3, with a collection of ~4 million documents.
To simplify, let's say my document is composed of 6 fields:
{
  a: <text>,
  b: <boolean>,
  c: <text>,
  d: <boolean>,
  e: <date>,
  f: <date>
}
The query I want to achieve is the following :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
So intuitively I've created two indexes
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1, e:1 }, {background: true,name: "test1"})
db.mycoll.createIndex({f:1}, {background: true,name: "test2"})
But explain() shows that the first index is not used at all.
I know there is some kind of limitation when ranges are involved in the filter (on the e field), but I can't find my way around it.
Also, instead of having a single index on f, I tried a compound index on {e:1, f:1}, but it didn't change anything.
So what have I misunderstood?
Thanks for your support.
Update: I also found the following rule of thumb, stated for MongoDB 2.6:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values.
Second, the field(s) on which you will sort.
Finally, field(s) on which you will query for a range of values (e.g., $gt, $lt, $in)
An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading.
Does this also apply to the 3.x versions?
Update 2: following above predicate, I created the following index
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1 , f:1, e:1}, {background: true,name: "test1"})
And for the same query :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
the index is indeed used. However, too many keys seem to be scanned; I may need to find a better order for the fields in the query/index.
Mongo sometimes acts a bit strangely when it comes to index selection.
Mongo decides automagically which index to use. In my experience, the smaller an index is, the more likely it is to be used (especially indexes with only one field); maybe this happens because it is more often already loaded in RAM. To decide which index to use, Mongo evaluates candidate query plans, and the result is sometimes unexpected.
Therefore, if you know which index should be used, you can force a query to use a specific index with hint(). You should try that.
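A minimal sketch of forcing the index from the question (the name "test1" is the one given at index creation; appending .explain("executionStats") will show how many keys are then examined):
db.mycoll.find({ a: "OK", b: true, c: "ProviderA", d: true,
                 e: { $gte: ISODate("2016-10-28T12:00:01Z"), $lt: ISODate("2016-10-28T12:00:02Z") } })
         .sort({ f: 1 })
         .hint("test1")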
The two indexes you use for the query and for the sort do not overlap, so MongoDB cannot use them for index intersection:
Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.

Custom MongoDB Object _id vs Compound index

So I need to create a lookup collection in MongoDB to verify uniqueness. The requirement is to check whether the same two values are being repeated or not. In SQL, I would do something like this:
SELECT count(id) WHERE key1 = 'value1' AND key2 = 'value2'
If the above query returns a count then it means the combination is not unique. I have 2 solutions in mind but I am not sure which one is more scalable. There are 30M+ docs against which I need to create this mapping.
Solution1:
I create a collection of docs with compound index on key1 and key2
{
  _id: <MongoID>,
  key1: <value1>,
  key2: <value2>
}
Solution2:
I write application logic to create a custom _id by concatenating value1 and value2
{
  _id: <value1>_<value2>
}
Personally, I feel the second one is more optimised, as it only has a single index and the document size is also smaller. But I am not sure if it is good practice to create my own _id values, as they may not be completely random. What do you think?
Thanks in advance.
Update:
My database already has a lot of indexes which take up memory, so I want to keep index size as low as possible, especially for collections which are only used to verify uniqueness.
I would suggest Solution 1, i.e. use a compound index on the two separate properties key1 and key2:
db.yourCollection.ensureIndex( { "key1": 1, "key2": 1 }, { unique: true } )
You can still search easily by an individual field if required: a query on key1 alone can use the compound index's prefix. If you build _id from a combination of the keys as a string, searching by an individual field becomes hard.
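For example, a quick sketch of that prefix behaviour (collection name as above):
db.yourCollection.find({ key1: "value1" })   // can use the { key1: 1, key2: 1 } index prefix
db.yourCollection.find({ key2: "value2" })   // cannot use it efficiently; key2 is not a prefix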
Document size is rarely the main concern when designing documents in Mongo.
If in the near future you need to change the key values of a document, that stays easy with Solution 1. Keep this in mind if you reference this document from documents in other collections: a custom _id built from the keys would then have to change everywhere.
In terms of scalability, the _id index is roughly sequential, easily shardable, and you can let MongoDB manage it.
If you search by those keys, the query will use that index; otherwise it will use whatever other indexes your search requires.
If document size still matters more to you than searchability, you can go with the custom _id approach, but make _id an embedded document rather than a concatenated string:
{_id: {key1: <value1>, key2: <value2>}}
This way you can also query by _id.key1 specifically.
Update:
Yes, go the custom _id route if document size is your bigger concern. Provided you are sure the key values of a document will not change in the future (or, if they might change, that the document is not referenced from other collections), this is fine. Just use an embedded object for the keys rather than an underscore-joined string; you can also add more keys to it later if needed.
I think Solution 2 is more suitable for your requirement. It is absolutely fine to generate the _id value yourself; many applications populate the _id value with a UUID. In your case, it makes sense to concatenate value 1 and value 2 for the _id value, assuming this collection is primarily used for verifying uniqueness (i.e. a kind of temporary/lookup table).
Solution 1 is more expensive as it requires an additional index. Again, it depends on whether you are going to use this collection for verifying uniqueness alone or for some other use case as well.
Please note that with Solution 1 you need to create the compound index as unique, so that duplicate combinations cannot be inserted.
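A short sketch of that unique-index behaviour (the collection name lookup is an assumption):
db.lookup.createIndex({ key1: 1, key2: 1 }, { unique: true })
db.lookup.insertOne({ key1: "value1", key2: "value2" })   // ok
db.lookup.insertOne({ key1: "value1", key2: "value2" })   // rejected with an E11000 duplicate key error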

MongoDB workaround for not supported sparse unique compound index

I need a workaround because MongoDB does not support sparse unique compound indexes: in a compound index a missing field is indexed as null, whereas a single-field sparse index simply leaves the document out of the index when the field is absent. See https://jira.mongodb.org/browse/SERVER-2193
In my particular case I have events. They can either be one-time or recurring. I have a field parent which is only present when the event is an instance of a recurring event (I periodically create new copies of the parent to have the recurring events for the next weeks in the system).
I thought I'd just add this index in order to prevent duplicate copies when the cronjob runs twice
events.ensureIndex({ dateFrom: 1, dateTo: 1, parent: 1 }, { sparse: true, unique: true })
Unfortunately, as said above, MongoDB does not support sparse on compound indexes. What this means is that for one-time events the parent field is not present and is indexed as null by MongoDB. If I then have a second one-time event at the same time, it causes a duplicate key error, which I only want when parent is set.
Any ideas?
Edit: I've seen MongoDB: Unique and sparse compound indexes with sparse values , but checking for uniqueness on application level is a no-go. I mean that's what the database is there for, to guarantee uniqueness.
You can add a 4th field that is dateFrom+dateTo+parent (string concatenation). When parent is not set, use a unique value instead, for example one generated by the ObjectId function, and then create a unique index on that field.
This way you can enforce the uniqueness you want. However, you can hardly use that field for anything else than enforcing this constraint (although queries like "get docs where the string starts with ..." may be pretty efficient).
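A minimal sketch of that workaround in legacy mongo shell syntax, matching the question (the uniqueKey field name is an assumption; the cronjob would compute it before inserting):
db.events.ensureIndex({ uniqueKey: 1 }, { unique: true })

var from = ISODate("2016-11-01T10:00:00Z"), to = ISODate("2016-11-01T11:00:00Z");

// recurring instance: the key is deterministic, so a second identical copy hits the unique index
var parentId = ObjectId();
db.events.insert({ dateFrom: from, dateTo: to, parent: parentId,
                   uniqueKey: from.toISOString() + "|" + to.toISOString() + "|" + parentId.str });

// one-time event: no parent, so pad the key with a fresh ObjectId and identical times never collide
db.events.insert({ dateFrom: from, dateTo: to,
                   uniqueKey: from.toISOString() + "|" + to.toISOString() + "|" + ObjectId().str });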