Find empty documents in a database - mongodb

I have queried an API which is quite inconsistent and therefore does not return objects for all numerical indexes (but most of them). So that I can keep using .count() to track the numerical index, I've been inserting empty documents with db.collection.insert({}) for the missing entries.
My question now is: how would I find and count these objects?
Something like db.collection.count({}) obviously won't work, since an empty filter matches every document.
Thanks for any idea!

Use the $where operator. The JavaScript expression below matches only documents that contain a single key, that key being the document's "_id":
db.collection.find({ "$where": "return Object.keys(this).length == 1" }).count()

For MongoDB 3.4.4 and newer, consider the following aggregation pipeline, which uses $objectToArray to list each document's keys and counts the documents that have no key other than "_id":
db.collection.aggregate([
{ "$project": {
"hashmaps": { "$objectToArray": "$$ROOT" }
} },
{ "$project": {
"keys": "$hashmaps.k"
} },
{ "$group": {
"_id": null,
"count": { "$sum": {
"$cond": [
{
"$eq":[
{
"$ifNull": [
{ "$arrayElemAt": ["$keys", 1] },
0
]
},
0
]
},
1,
0
]
} }
} }
]);
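To make the logic of the pipeline above easier to follow, here is a minimal sketch in plain JavaScript that runs on an in-memory array instead of a MongoDB collection. The helper name countEmptyDocuments and the sample data are made up for illustration; the steps mirror the stages (list the keys, look at the second key, count the documents where it is missing).

```javascript
// Hypothetical helper: mirrors the pipeline's logic on a plain array.
function countEmptyDocuments(docs) {
  return docs.reduce((count, doc) => {
    // Equivalent of $objectToArray followed by "$hashmaps.k": the key names.
    const keys = Object.keys(doc);
    // Equivalent of $ifNull around $arrayElemAt ["$keys", 1]: a missing
    // second key means the document holds nothing besides "_id".
    const secondKey = keys[1] ?? 0;
    return count + (secondKey === 0 ? 1 : 0);
  }, 0);
}

// Made-up sample data: two documents with only _id and one with a real field.
const sampleDocs = [{ _id: 1 }, { _id: 2, name: "a" }, { _id: 3 }];
const emptyCount = countEmptyDocuments(sampleDocs);
console.log(emptyCount);
```

Like the pipeline, this assumes every document carries an "_id" as its first key, so a second key is the sign of real content.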

Related

get document with same 3 fields in a collection

I have a collection with more than 1000 documents, and some documents have the same value in some fields. I need to get those.
The collection is:
[{_id,fields1,fields2,fields3,etc...}]
What query can I use to get all the elements that have the same 3 fields? For example:
[
{_id:1,fields1:'a',fields2:1,fields3:'z'},
{_id:2,fields1:'a',fields2:1,fields3:'z'},
{_id:3,fields1:'f',fields2:2,fields3:'g'},
{_id:4,fields1:'f',fields2:2,fields3:'g'},
{_id:5,fields1:'j',fields2:3,fields3:'g'},
]
I need to get:
[
{_id:2,fields1:'a',fields2:1,fields3:'z'},
{_id:4,fields1:'f',fields2:2,fields3:'g'},
]
This way I can easily get a list of "duplicates" that I can delete if needed. It's not really important whether I get ids 2 and 4 or 1 and 3,
but 5 should never be included, as it's not duplicated.
EDIT:
Sorry, I forgot to mention that some documents have null values in these fields; I need to exclude those.
This is a perfect use case for window functions. You can use $setWindowFields (MongoDB 5.0+) to compute $rank within the partition you want, then keep the documents whose rank is not equal to 1 to get the duplicates.
db.collection.aggregate([
{
$match: {
fields1: {
$ne: null
},
fields2: {
$ne: null
},
fields3: {
$ne: null
}
}
},
{
"$setWindowFields": {
"partitionBy": {
fields1: "$fields1",
fields2: "$fields2",
fields3: "$fields3"
},
"sortBy": {
"_id": 1
},
"output": {
"duplicateRank": {
"$rank": {}
}
}
}
},
{
$match: {
duplicateRank: {
$ne: 1
}
}
},
{
$unset: "duplicateRank"
}
])
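As a rough illustration of what the ranking approach computes, here is a sketch in plain JavaScript over an in-memory array. The helper name findDuplicates is made up, and the sketch assumes numeric _id values as in the question's sample data; it is not how MongoDB evaluates the pipeline, only the same logic spelled out.

```javascript
// Hypothetical helper mirroring the pipeline on a plain array.
// Assumes numeric _id values, as in the question's sample data.
function findDuplicates(docs) {
  const seen = new Set();
  return docs
    // $match: drop documents where any of the three fields is null or missing.
    .filter((d) => d.fields1 != null && d.fields2 != null && d.fields3 != null)
    // sortBy { _id: 1 }; filter returned a fresh array, so sorting it is safe.
    .sort((a, b) => a._id - b._id)
    // Rank 1 is the first document of each partition; the rest are duplicates.
    .filter((d) => {
      const key = JSON.stringify([d.fields1, d.fields2, d.fields3]);
      if (seen.has(key)) return true;
      seen.add(key);
      return false;
    });
}

// Sample data from the question.
const rankDocs = [
  { _id: 1, fields1: "a", fields2: 1, fields3: "z" },
  { _id: 2, fields1: "a", fields2: 1, fields3: "z" },
  { _id: 3, fields1: "f", fields2: 2, fields3: "g" },
  { _id: 4, fields1: "f", fields2: 2, fields3: "g" },
  { _id: 5, fields1: "j", fields2: 3, fields3: "g" },
];
const rankDuplicates = findDuplicates(rankDocs);
console.log(rankDuplicates.map((d) => d._id));
```

For the sample data this keeps the documents with _id 2 and 4, which matches the output the question asks for.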
I think you can try this aggregation query:
First, $group by the fields you want to check for repeated values.
This builds an array of the _ids that share those values.
Then keep only the groups with more than one _id ($match).
Finally, $project to get the desired output. I've used the first _id found.
db.collection.aggregate([
{
"$group": {
"_id": {
"fields1": "$fields1",
"fields2": "$fields2",
"fields3": "$fields3"
},
"duplicatesIds": {
"$push": "$_id"
}
}
},
{
"$match": {
"$expr": {
"$gt": [
{
"$size": "$duplicatesIds"
},
1
]
}
}
},
{
"$project": {
"_id": {
"$arrayElemAt": [
"$duplicatesIds",
0
]
},
"fields1": "$_id.fields1",
"fields2": "$_id.fields2",
"fields3": "$_id.fields3"
}
}
])
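The group, filter, and project steps can also be sketched in plain JavaScript on an in-memory array. The helper name firstOfEachDuplicateGroup is hypothetical, and unlike MongoDB's $group the Map used here keeps groups in insertion order, so treat the output order as a property of this sketch only.

```javascript
// Hypothetical helper: group by the three fields, keep groups with more
// than one member, and report the first _id collected for each group.
function firstOfEachDuplicateGroup(docs) {
  const groups = new Map();
  for (const d of docs) {
    const key = JSON.stringify([d.fields1, d.fields2, d.fields3]);
    if (!groups.has(key)) {
      groups.set(key, { fields1: d.fields1, fields2: d.fields2, fields3: d.fields3, ids: [] });
    }
    groups.get(key).ids.push(d._id); // $push: "$_id"
  }
  return [...groups.values()]
    .filter((g) => g.ids.length > 1) // $match: $size > 1
    .map((g) => ({ _id: g.ids[0], fields1: g.fields1, fields2: g.fields2, fields3: g.fields3 })); // $project
}

// Sample data from the question.
const groupDocs = [
  { _id: 1, fields1: "a", fields2: 1, fields3: "z" },
  { _id: 2, fields1: "a", fields2: 1, fields3: "z" },
  { _id: 3, fields1: "f", fields2: 2, fields3: "g" },
  { _id: 4, fields1: "f", fields2: 2, fields3: "g" },
  { _id: 5, fields1: "j", fields2: 3, fields3: "g" },
];
const groupDuplicates = firstOfEachDuplicateGroup(groupDocs);
console.log(groupDuplicates);
```

With the sample data this reports _id 1 and 3, one representative per duplicated combination, which the question says is just as acceptable as 2 and 4.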

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format shown above, you can split the _id into two parts on the dot character and use aggregation to find the maximum ts_utc for each value of the first (numeric) part. Note that this returns the latest timestamp per group, not the whole document.
That way you can do it in one shot instead of iterating over each group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
As danh mentioned in the comments, the best approach is probably to add an auxiliary field that stores the group. You can then index that field to improve performance.
Here is an ad hoc way to derive the field and get the latest document per group:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
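For comparison, the "latest document per group" logic can be sketched in plain JavaScript on an in-memory array. The helper name latestPerGroup is made up, and the sketch assumes ts_utc is stored as an ISO 8601 UTC string in a uniform format (as printed in the question), so that string comparison orders the timestamps correctly; with real Date values the comparison works the same way.

```javascript
// Hypothetical helper: one pass over the documents, keeping the newest per group.
function latestPerGroup(docs) {
  const latest = new Map();
  for (const doc of docs) {
    // Equivalent of $split on "." followed by $arrayElemAt index 0.
    const group = doc._id.split(".")[0];
    const current = latest.get(group);
    // Assumption: uniform ISO 8601 UTC strings compare correctly as strings.
    if (!current || doc.ts_utc > current.ts_utc) latest.set(group, doc);
  }
  return [...latest.values()];
}

// Sample data from the question.
const tsDocs = [
  { _id: "42.abc", ts_utc: "2019-05-27T23:43:16.963Z" },
  { _id: "42.def", ts_utc: "2019-05-27T23:43:17.055Z" },
  { _id: "69.abc", ts_utc: "2019-05-27T23:43:17.147Z" },
  { _id: "69.def", ts_utc: "2019-05-27T23:44:02.427Z" },
];
const newest = latestPerGroup(tsDocs);
console.log(newest.map((d) => d._id));
```

The pipeline reaches the same result by sorting on ts_utc descending and taking $first per group; the sketch simply keeps a running maximum instead of sorting.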

SELECT COUNT with HAVING clause

This is my input:
{"_id": "phd/Klink2006","type": "Phd", "title": "IQForCE - Intelligent Query (Re-)Formulation with Concept-based Expansion", "year": 2006, "publisher": "Verlag Dr. Hut, München", "authors": ["Stefan Klink"], "isbn": ["3-89963-303-2"]}
I want to count the books that have fewer than 3 authors. How can I achieve this?
Use $group with _id: null, and for each document add 1 to the count if the size of authors is less than 3, otherwise add 0:
db.collection.aggregate([
{
$group: {
_id: null,
count: {
$sum: {
$cond: [
{ $lt: [{ $size: "$authors" }, 3] },
1,
0
]
}
}
}
}
])
You can use the $where operator (note that this returns the matching documents themselves rather than a count).
db.collection.find({
"$where": "this.authors.length < 3"
});
Important consideration:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
In general, you should use $where only when you cannot express your query using another operator. If you must use $where, try to include at least one other standard query operator to filter the result set. Using $where alone requires a collection scan.
The best option in terms of performance is to store the number of authors in a new key, authorsLength, and query on that:
db.collection.aggregate([
{
"$match": {
"authorsLength": {
"$lt": 3
}
}
},
{
"$group": {
"_id": null,
"count": {
"$sum": 1
}
}
}
])
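The answer does not show how authorsLength gets populated. One possible way, sketched here under the assumption of MongoDB 4.2 or newer (where updateMany accepts an aggregation pipeline), is to backfill it from the existing authors array and then index it. The variable names are made up, and the field would still have to be kept up to date whenever authors changes.

```javascript
// Assumption: MongoDB 4.2+, where an update can be an aggregation pipeline.
// Only touch documents whose authors field is an array, because $size
// raises an error on non-array input.
const backfillFilter = { authors: { $type: "array" } };

// Pipeline-style update: copy the array length into authorsLength.
const backfillUpdate = [
  { $set: { authorsLength: { $size: "$authors" } } },
];

// Index specification so that the $match on authorsLength can use an index.
const authorsLengthIndex = { authorsLength: 1 };

// In the mongo shell these would be passed as:
//   db.collection.updateMany(backfillFilter, backfillUpdate);
//   db.collection.createIndex(authorsLengthIndex);
console.log(JSON.stringify(backfillUpdate));
```

The block only builds the specification objects, so it runs anywhere; the two commented shell calls are where they would actually be used.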

MongoDB $push aggregation won't keep the right order

I tried to make a $group aggregation with MongoDB, like the following example:
"$group": {
"_id": "$test_id",
"feeling": {
"$push": "$feeling"
},
"reference_id": {
"$push": "$_id"
},
"training_start": {
"$push": "$training_start"
},
"training_duration": {
"$push": "$duration_ms"
}
}
The aggregation works fine, but the created arrays are ordered differently from each other. That is, if I check the result of the aggregation by looking at reference_id[x] and training_start[x], the value of training_start in the source document with that _id is not equal to training_start[x].
Maybe an example shows my problem more precisely:
One document after the $group aggregation:
{
_id: "string_1",
reference_id: [1, 2, 3],
training_start: [01:00:00, 02:00:00, 03:00:00] (date times)
}
Documents from source collection:
{
_id:1,
training_start: 01:00:00,
test_id: "string_1"
},
{
_id:2,
training_start: 03:00:00,
test_id: "string_1"
},
{
_id:3,
training_start: 02:00:00,
test_id: "string_1"
}
The first elements in these arrays are always in the right order. So I checked whether each grouped field has the same number of entries by using the code below. Annoyingly, the number of entries in each array is equal, so the arrays are not shifted by missing values.
"$group": {
"_id": "$test_id",
"sum": {
"$sum": {
"$cond": {
"if": {
"$lte": [
"$training_start", null
]
},
"then": 0,
"else": 1
}
}
}
}
Does anybody know if there is another way to create arrays (I already tried $addToSet) that keep the order in which the elements were pushed? Or am I the problem?
Greetings Max

Embedded List ordering

I have the following object with an embedded list of items and I would like to write a query and return all or specific items ordered by date. Is it possible to do it or should I have a different collection for items and keep here their references?
I know that you can match specific element using $elemMatch.
{
"_id": "51cb12857124a215940cf2d4",
"level1":
[
{
"name":"item00",
"description":"item01",
"date": 1238492103
},
{
"name":"item10",
"description":"item11",
"date": 1238492104
}
]
}
If you want these items ordered by date more often than not, then your best option is to keep the list ordered in the first place. The $push operator has an additional $sort modifier explicitly for this purpose.
db.collection.update(
{ "_id": "51cb12857124a215940cf2d4" },
{ "$push": {
"level1":{
"$each":[{
"name":"item11",
"description":"item11",
"date": 1238492104
}],
"$sort": { "date": 1 }
}
}
}
)
That even works with an empty $each, so you can re-sort the array in every document of the collection in one statement:
db.collection.update(
{},
{ "$push": {
"level1":{
"$each":[], "$sort": { "date": 1 }
}
} },
{ "multi": true }
)
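To illustrate what $push with $each and $sort does to the array, here is a small sketch in plain JavaScript. The helper name pushEachSorted and the extra item "item20" are made up for illustration; the point is that the new elements are appended first and the whole array is then sorted, which is why an empty $each simply re-sorts what is already there.

```javascript
// Hypothetical helper: append the new elements, then sort the whole array
// by date ascending, as $push with $each and $sort: { date: 1 } does.
function pushEachSorted(existing, each) {
  return [...existing, ...each].sort((a, b) => a.date - b.date);
}

// Items from the question, deliberately stored out of order here.
const level1 = [
  { name: "item10", description: "item11", date: 1238492104 },
  { name: "item00", description: "item01", date: 1238492103 },
];

// Empty $each: nothing is added, the array is only re-sorted.
const resorted = pushEachSorted(level1, []);

// Made-up item with an earlier date: it ends up first after the sort.
const withNewItem = pushEachSorted(level1, [
  { name: "item20", description: "item21", date: 1238492100 },
]);

console.log(resorted.map((i) => i.name), withNewItem.map((i) => i.name));
```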
Without that, your only alternative is to order the results via the .aggregate() method. This really should not be your first choice, as it requires an $unwind of the array contents followed by a $sort of the elements within each document. Naturally this comes with some significant overhead on larger selections:
db.collection.aggregate([
{ "$unwind": "$level1" },
{ "$sort": { "_id": 1, "level1.date": 1 } },
{ "$group": {
"_id": "$_id",
"level1": { "$push": "$level1" }
}}
])