I'm using mongo with Tableau and have a boolean called "verified" that shows as true vs false.
Each user can add "certifications" to his/her record, then we go in with an admin tool and flag the cert as verified:true or verified:false. I want to show a simple table that has the number of certifications for each user, then another column with the number verified.
Currently I'm using "COUNTD([Certifications.Verified])" to count the number of verified but I don't think it's accurately counting.
This just counts whether the "verified" sub-field exists with a true or false state, so the numbers are not accurate. Note that in some cases this node doesn't exist and shows up as null.
I need a way to count 1 when verified=true, and 0 when no verified node exists or verified:false.
How do I add the logic to count this accurately in Tableau?
Update: Thanks for the Mongo queries but I'm looking for Tableau custom fields to show this.
You're going to want to use the $cond operator within your .aggregate() pipeline. It'll allow you to specify what you would like returned based on a conditional, which in your case would be the Verified field. I don't know how your data is structured, but I would imagine using something like this:
$sum: { $cond: ["$Certifications.Verified", 1, 0] }
If Verified is true for that certification, it will return a 1, which will be accounted for in the $sum operator. Whether you want to use a $group stage or a $project to create this summed field will depend on your preference/use case.
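To make that concrete, here is a minimal sketch of a full pipeline, assuming a users collection where Certifications is an array of sub-documents each carrying a boolean Verified field (the collection name is a guess based on the question):

db.users.aggregate([
  { $unwind: "$Certifications" },
  { $group: {
      _id: "$_id",
      totalCerts: { $sum: 1 },
      verifiedCerts: { $sum: { $cond: [{ $eq: ["$Certifications.Verified", true] }, 1, 0] } }
  }}
])

The $eq comparison treats a missing or null Verified the same as false, which matches the counting rule described in the question.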
You can use this to return the count:
schemaName.find({"Certifications.Verified": true}).count(function (error, count) {
console.log(count);
});
It returns a non-zero value (the number of documents satisfying the condition) if any document with Certifications.Verified = true exists;
otherwise it returns 0.
Replacing RowID with whatever your table's key is:
COUNTD(IIF([Certifications.Verified]=1, RowID, NULL))
COUNTD() is useful but can be computationally inefficient on large data sets, so don’t use COUNTD() in situations where a simpler, faster aggregation function will work equally well.
If you simply want to know how many records satisfy a condition from Tableau, just use SUM(INT(<condition>)). The INT() type conversion function converts True to 1 and False to 0. So if the level of detail of your data has one record per THING, and you want to know how many THINGs satisfy your condition, SUM(INT(<condition>)) will do the trick, faster than using a count distinct on record ids.
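For the specific case in the question, a minimal calculated field along these lines should work, assuming [Certifications.Verified] comes through to Tableau as a boolean:

SUM(IF [Certifications.Verified] THEN 1 ELSE 0 END)

A null (no verified node) is not TRUE, so it falls through to the ELSE branch and counts as 0.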
There are even some data sources, like MS Access, that don’t implement COUNTD()
Bottom line,
SUM(INT()) is the simplest way to count records that satisfy a condition
COUNTD() is very flexible and useful, but can be slow. For large data sets or high performance environments, consider alternatives such as reshaping your data.
BTW, similar advice applies to LOD calculations. They are very useful, and flexible, but introduce complexity and performance costs. Use them when necessary, but don’t use them when a simpler, faster approach will suffice. I see a lot of people just use FIXED LOD calcs for everything, presumably because it seems a lot like SQL. Overdone, that can lead to a very brittle solution.
I am using Mongo to store multi-tenant data. As part of data cleanup for a tenant I want to delete everything related to that tenant. The tenantId is indexed, but there are a lot of rows, it takes a long time to query, and I have no easy way to get the progress.
Currently I do something like:
db.records.deleteMany({tenantId: x})
Is there a better way?
I'm thinking of doing it in batches, e.g. query for x records, then build a list of ids to delete. That seems very manual, but is it the recommended way?
Some options that I can think of:
Drop the index before deleting; you can recreate it after the deletion (see the sketch after this list).
Change the write concern to a lower value, possibly 0. The request won't wait for acknowledgement from the secondaries.
db.records.deleteMany({tenantId: x}, {writeConcern: {w: 0}});
If there is another field with enough cardinality to reduce the number of documents, try including that in the query.
Ex: if anotherField has 0, 1, 2, 3 as values, then execute the delete command 4 times, each time with a different value.
db.records.deleteMany({tenantId: x, anotherField: 0}, {writeConcern: {w: 0}});
db.records.deleteMany({tenantId: x, anotherField: 1}, {writeConcern: {w: 0}});
db.records.deleteMany({tenantId: x, anotherField: 2}, {writeConcern: {w: 0}});
db.records.deleteMany({tenantId: x, anotherField: 3}, {writeConcern: {w: 0}});
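A sketch of the first option, assuming a plain single-field index on tenantId (adjust to whatever index you actually have):

db.records.dropIndex({tenantId: 1})      // drop before the mass delete
db.records.deleteMany({tenantId: x})
db.records.createIndex({tenantId: 1})    // rebuild once the cleanup is done

Note that without the index the deleteMany becomes a collection scan, so the saving is mainly in not having to maintain the index while documents are removed.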
The performance may depend on a variety of factors, but here are some options you can try to improve it.
Bulk operations
Bulk operations might help here. bulk.find(query).remove() is a version of db.collection.remove(query) that is optimized for large numbers of operations. You can read more about it here.
You can use the following way:
Declare a search query:
var query= {tenantId: x};
Initialize and use a bulk:
var bulk = db.yourCollection.initializeUnorderedBulkOp()
bulk.find(query).remove() // or try delete() instead of remove()
bulk.execute()
The idea here is not so much to speed up the removal as to produce less load.
Also, you could try bulkWrite():
db.yourCollection.bulkWrite([
{ deleteMany: {
"filter" : query,
}}
])
TTL indexes
It may not be suitable for your use case, but there's an entirely different approach that doesn't require you to remove anything yourself.
If it is suitable for you to delete data based on a timestamp, then a TTL index might help you. The idea here is that the record is being removed when the TTL expires.
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
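A minimal sketch, assuming each record carries a creation timestamp (a hypothetical createdAt field here) and that a fixed retention period, e.g. 30 days, is acceptable:

db.records.createIndex({createdAt: 1}, {expireAfterSeconds: 2592000})  // 30 days

The background TTL monitor in mongod then removes expired documents in small batches, so the deletion load is spread out over time rather than happening in one large operation.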
DeleteMany: I think there must be something common between all the rows that you want to remove from the collection.
You can find that common factor and then create a query accordingly;
this will help you remove those records fast.
Let me give you one example: I want to remove all the records where the username field does not exist.
db.collection.deleteMany({ username: {$exists: false} })
The best place to start is to find something that all records have in common in order to remove them all at once.
For example the following code deletes all entries that don't contain an email address.
db.users.deleteMany({ email: { $exists: false } })
The MongoDB documentation has great examples; a link is provided below.
https://www.mongodb.com/docs/manual/reference/method/db.collection.deleteMany/#delete-multiple-documents
You might also want to consider dropping the index, since it can be recreated after you're done with the operation.
Finally, you might want to lower the write concern in your operation in order to speed things up. A complete list of options can be found here:
https://www.mongodb.com/docs/v5.0/reference/write-concern/#w-option
I found a good tutorial on https://www.geeksforgeeks.org/mongodb-delete-multiple-documents-using-mongoshell/ that might help you further.
apologies for any grammatical mistakes since English is not my native tongue
I would suggest two solutions. Also, please export your data first so you have a backup if anything goes wrong, or try this in your test DB first.
1) You can use tenantId as a condition, not matching on _id but with extra logic: if any of the records have the tenantId field, delete them, so that all of your tenant data is removed using a single query.
db.records.deleteMany({tenantId: {$exists: true}})
// suggestion: if any of your tenant data has a tenantId field that is null, you can also check for a null value to delete those records.
2) Find a field with common data across all of the records; if there is one, use it as a condition to delete those records.
For example, if all of your tenant data has a common field called type with the same value, use a delete statement like:
db.records.deleteMany({type : 1})
I'm working on my app and I just ran into a dilemma regarding what's the best way to handle indexes for firestore.
I have a query that searches for publications in a specific community that contain at least one of the tags and fall within a geohash range. The index for that query looks like this:
community Ascending tag Ascending location.geohash Ascending
Now if my user doesn't need to filter by tag, I run the query without the arrayContains(tag) clause, which prompts me to create another index:
community Ascending location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains in the query when the user wants no filter on tag?
Neither is inherently better; it's a typical space vs. time tradeoff.
Adding the extra tags in the query adds some overhead there, but it saves you the (storage) cost for the additional index. So you're trading some small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index and save the runtime cost of adding the extra clauses, but then take a (similarly small) performance hit on the query where you do specify one or more tags.
I'm doing a query where all I want to know is whether there is at least one row in the collection that matches the query, so I pass limit=1 to find(). All I care about is whether the count() of the returned cursor is > 0. Would it be faster to use count(with_limit_and_skip=True) or just count()? Intuitively it seems to me like I should pass with_limit_and_skip=True, because if there are a whole bunch of matching records then the count could stop at my limit of 1.
Maybe this merits an explanation of how limits and skips work under the covers in mongodb/pymongo.
Thanks!
Your intuition is correct. That's the whole point of the with_limit_and_skip flag.
With with_limit_and_skip=False, count() has to count all the matching documents, even if you use limit=1, which is pretty much guaranteed to be slower.
From the docs:
Returns the number of documents in the results set for this query. Does not take limit() and skip() into account by default - set with_limit_and_skip to True if that is the desired behavior.
I have two fields a and b, where b has substantially higher selectivity than a.
Now, if I am only querying on both a and b (never on either field by itself), which of the following two indexes is better and why:
{a: 1, b : 1}
{b: 1, a : 1}
Explain seems to return almost identical results, but I read somewhere that you should put higher selectivity fields first. I don't know why that would make sense though.
After some extensive work to improve queries on a 150,000,000-record database, I have found out the following:
it is not necessarily the higher-selectivity fields, but the fields that are "faster" to match, that should be moved to the first position; doing so can increase performance drastically.
I had an index composed of the following fields:
zip, address, city, first name, last name
Address is matched with an array ($all), not a simple string = string comparison, so it takes the most time to execute and is the slowest to match. The first index I created was address_zip_city_last_name_first_name, and the execution time for matching 1000 records against the whole DB would run for hours.
The address field probably has the highest selectivity of these, but since it is not matched by a simple string equality, it takes the most time. The match actually goes something like this:
{ address: { $all: ["1233", "main", "avenue"] } }
By changing the index to put the "faster" fields at the beginning, for example zip_city_first_name_last_name_address, the performance was much better. The same 1000 records would match in just one second instead of taking hours.
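For illustration, a sketch of how the two index shapes described above might be created (the people collection name is hypothetical):

// original, slow version: the $all-matched address field leads the index
db.people.createIndex({address: 1, zip: 1, city: 1, last_name: 1, first_name: 1})

// reordered version: cheap equality fields first, the expensive address field last
db.people.createIndex({zip: 1, city: 1, first_name: 1, last_name: 1, address: 1})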
Hope this helps someone
cheers
After doing some further analysis the two indexes are in fact pretty much identical from a performance point of view.
Really if you are in a similar situation, the real consideration should be whether in the future you might be more likely to query on a alone or b alone, and put that field first in the index.
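A quick sketch of that reasoning, assuming you might later query on b alone:

// b first: supports queries on {a, b} together and on b alone
db.collection.createIndex({b: 1, a: 1})

// check which index a given query actually uses
db.collection.find({a: 5, b: 7}).explain("executionStats")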
I believe the optimiser will choose the best index to use, although you can provide hints,
e.g.
db.collection.find({user:u, foo:d}).hint({user:1});
see http://www.mongodb.org/display/DOCS/Optimization
I have a boolean flag :finished. Should I
A: index({ finished: 1 })
B: index({ finished: 1 }, {sparse: true})
C: use flag :unfinished instead, to query by that
D: other?
Ruby Mongoid syntax. Most of my records will have finished=true, and most operations fetch the unfinished ones, obviously. I'm not sure I understand when to use sparse and when not to. Thanks!
The sparse flag is a little weird. To understand when to use it, you have to understand why "sparse" exists in the first place.
When you create a simple index on one field, there is an entry for each document, even documents that don't have that field.
For example, if you have an index on {rarely_set_field : 1}, you will have an index that is filled mostly with null because that field doesn't exist in most cases. This is a waste of space and it's inefficient to search.
The {sparse: true} option will get rid of those null entries, so you get an index that only contains entries where {rarely_set_field} is defined.
Back to your case.
You are asking about using a boolean + sparse. But sparse doesn't really affect "boolean", sparse affect "is set vs. is not set".
In your case, you are trying to fetch unfinished. To leverage sparse the key is not the boolean value, but the fact that unfinished entries have that key and that "finished" entries have no key at all.
{ _id: 1, data: {...}, unfinished: true }
{ _id: 2, data: {...} } // this entry is finished
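A minimal sketch of that layout, using a hypothetical tasks collection:

// index only the documents that still carry the unfinished flag
db.tasks.createIndex({unfinished: 1}, {sparse: true})

// fetch the queue of unfinished work; finished documents never appear in this index
db.tasks.find({unfinished: true})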
It sounds like you are using a Queue
You can definitely leverage the information above to implement a sparse index. However, it actually sounds like you are using a Queue. MongoDB is serviceable as a Queue, here are two examples.
However, if you look at those Queue implementations, they are not doing it the way you are doing it. I'm personally using MongoDB as a Queue for some production systems and it runs pretty well, but test your expected load, as a dedicated Queue will perform much better.
Sparse is only helpful when the field is missing entirely, not when it is false. When you say "most will have finished=true", I'm guessing that the finished field is set on most documents, making sparse not very beneficial.
And since most documents share a single value for that field, I doubt any type of index on it alone would help much unless your queries are more specific.