Handling optional/empty data in MongoDB

I remember reading somewhere that the mongo engine was more comfortable when the entire structure of a document was already in place in case of an update, so here is the question.
When dealing with "empty" data, for example when inserting an empty string, should I default it to null, "" or not insert it at all?
{
    _id: ObjectId("5192b6072fda974610000005"),
    description: ""
}
or
{
    _id: ObjectId("5192b6072fda974610000005"),
    description: null
}
or
{
    _id: ObjectId("5192b6072fda974610000005")
}
You have to remember that the description field may or may not be filled in every document (based on user input).

Introduction
If a document doesn't have a field, the DB considers that field's value to be null when querying. Suppose a database with the following documents:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
If you create a query to find documents whose desc field is different from null, you will get just one document:
db.test.find({desc: {$ne: null}})
// Output:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
The database doesn't distinguish between documents without a desc field and documents whose desc field is null. One more test:
db.test.find({desc: null})
// Output:
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
But the differences are only ignored in queries because, as shown in the last example above, the fields are still stored on disk, and the documents you read back have the same structure as the documents you sent to MongoDB.
Question
When dealing with "empty" data, for example when inserting an empty string, should I default it to null, "" or not insert it at all?
There isn't much difference between {desc: null} and {}, because most operators will give the same result. You should only pay special attention to these two operators (see the example after the list below):
$exists
$type
I'd save documents without the desc field, because the operators will continue to work as expected and I'd save some space.
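For example, here is a minimal sketch against the sample collection above, showing how $exists and $type separate the cases that ordinary equality treats as identical:
db.test.find({desc: {$exists: true}})
// Output:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
db.test.find({desc: {$type: 10}}) // 10 is the BSON type number for null
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }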
Padding factor
If you know that the documents in your database grow frequently, MongoDB might need to move a document during an update because there isn't enough space at its current location. To prevent moving documents around, MongoDB allocates extra space for each document.
The amount of extra space allocated by MongoDB per document is controlled by the padding factor. You cannot (and don't need to) choose the padding factor, because MongoDB will adaptively learn it, but you can help MongoDB preallocate internal space for each document by filling the possible future fields with null values. The difference is very small (depending on your application) and might be even smaller after MongoDB learns the best padding factor.
Sparse indexes
This section isn't too important to your specific problem right now, but may help you when you face similar problems.
If you create a unique index on the field desc, then you wouldn't be able to save more than one document with the same value, and in the database shown earlier we had more than one document with the same value in the field desc. Let's try to create a unique index on that database and see what error we get:
db.test.ensureIndex({desc: 1}, {unique: true})
// Output:
{
    "err" : "E11000 duplicate key error index: test.test.$desc_1 dup key: { : null }",
    "code" : 11000,
    "n" : 0,
    "connectionId" : 3,
    "ok" : 1
}
If we want to be able to create a unique index on some field while letting some documents omit that field, we should create a sparse index. Let's try to create the unique index again:
// No errors this time:
db.test.ensureIndex({desc: 1}, {unique: true, sparse: true})
So far, so good, but why am I explaining all this? Because there is an obscure behaviour of sparse indexes. In the following query, we expect to get ALL documents sorted by desc.
db.test.find().sort({desc: 1})
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
The result seems weird. What happened to the missing document? Let's try the query without sorting:
db.test.find()
// Output:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
All documents were returned this time. What's happening? It's simple, but not so obvious. When we sort the result by desc, we use the sparse index created previously, and there are no entries in it for the documents that don't have the desc field. The following query shows that the index is used to sort the result:
db.test.find().sort({desc: 1}).explain().cursor
// Output:
"BtreeCursor desc_1"
We can skip the index using a hint:
db.test.find().sort({desc: 1}).hint({$natural: 1})
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
Summary
A sparse unique index still indexes documents containing {desc: null}, so duplicate null values are rejected
A sparse unique index still indexes documents containing {desc: ""}, so duplicate empty strings are rejected
Sparse indexes might change the result of a query
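A quick way to see the first two points (a minimal sketch against the collection and sparse unique index above; the inserted values are made up):
db.test.insert({a: 2})               // OK: no desc field, so the sparse index skips it
db.test.insert({a: 3})               // OK: documents missing desc can repeat freely
db.test.insert({a: 4, desc: null})   // E11000 duplicate key: a document with desc: null is already indexed
db.test.insert({a: 5, desc: ""})     // E11000 duplicate key: a document with desc: "" is already indexed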

There is little difference between a field with a null value and a document without the field. The main difference is that the former consumes a little disk space, while the latter consumes none at all. They can be distinguished using the $exists operator.
A field with an empty string is quite different from both. Though it depends on your purpose, I don't recommend using it as a replacement for null. To be precise, they should be used to mean different things. For instance, think about voting: a person who casts a blank ballot is different from a person who wasn't permitted to vote. The former's vote is an empty string, while the latter's vote is null.
There is already a similar question here.

Related

Index vs Aggregation Pipeline for Sorting

I'm developing an application using MongoDB as its database, and for sorting data I encountered an interesting argument from a colleague: an index can be used instead of an aggregation pipeline to get sorted data.
I tried this and it actually works; using an index with the specified fields and directions does return sorted data when queried. Using an aggregation pipeline, I obtained the same result.
I have created an index with the following specification:
index name: batch_deleted_a_desc
num: asc
marked: asc
value: desc
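For reference, a minimal sketch of how an index matching that specification could be created (the collection name follows the queries below; treat the exact command as an assumption, not the asker's original):
db.test.createIndex(
    { num: 1, marked: 1, value: -1 },
    { name: "batch_deleted_a_desc" }
)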
Using aggregation pipeline:
> db.test.aggregate([{$match: {num:"3",marked:false}}, {$sort:{"value":-1}}])
{ "_id" : ObjectId("5d70b40ba7bebd3d7c135615"), "value" : 4, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b414a7bebd3d7c135616"), "value" : 2, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b3fea7bebd3d7c135614"), "value" : 1, "marked" : false, "num" : "3" }
Using index:
> db.test.find({num:"3",marked:false})
{ "_id" : ObjectId("5d70b40ba7bebd3d7c135615"), "value" : 4, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b414a7bebd3d7c135616"), "value" : 2, "marked" : false, "num" : "3" }
{ "_id" : ObjectId("5d70b3fea7bebd3d7c135614"), "value" : 1, "marked" : false, "num" : "3" }
As you can see, the results are the same. But I am unsure whether relying on an index to get sorted data is good practice, and an aggregation pipeline (sometimes) takes more effort than just creating an index.
So, which would be the best option?
In the context of the question, the better option would be the aggregation because it explicitly specifies the sort.
In the query example, the results are returned in the order defined by the index { num: 1, marked: 1, value: -1 } because the query happens to use that index. However, nothing in the query itself guarantees that ordering, meaning the results may change at some point in the future. For example, consider the case where an index { num: 1, marked: 1, updated_at: 1 } were created. The query planner may decide to use that index instead, which may return results in a different order.
In the absence of a sort, a query returns results in the order of the index being used, but you should not rely on that ordering without explicitly specifying it. Quoting the docs:
Unless you specify the sort() method or use the $near operator,
MongoDB does not guarantee the order of query results.
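Alternatively, the ordering can be made explicit on the find() itself, without the full pipeline; a minimal sketch using the same collection and fields as above:
// The index can still satisfy both the filter and the sort,
// but the ordering is now part of the query rather than an accident of the chosen plan:
db.test.find({num: "3", marked: false}).sort({value: -1})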

Add object to object array if an object property is not given yet

Use Case
I've got a collection band_profiles and a collection band_profiles_history. The history collection is supposed to store a band_profile snapshot every 24 hours, and therefore I am using MongoDB's recommended format for historical tracking: each month+year is its own document, and in an object array I store the bandProfile snapshot along with the current day of the month.
My models:
A document in band_profiles_history looks like this:
{
    "_id" : ObjectId("599e3bc406955db4cbffe0a8"),
    "month" : 7,
    "tag_lowercased" : "9yq88gg",
    "year" : 2017,
    "values" : [
        {
            "_id" : ObjectId("599e3bc41c073a7418fead91"),
            "profile" : {
                "_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
                "tag" : "9YQ88GG",
                "name_normalized" : "example name1"
            },
            "day" : 1
        },
        {
            "_id" : ObjectId("599e3bc41c073a7418fead91"),
            "profile" : {
                "_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
                "tag" : "9YQ88GG",
                "name_normalized" : "new name"
            },
            "day" : 2
        }
    ]
}
And a document in band_profiles:
{
    "_id" : ObjectId("5989a6190f39d9fd70cddeb1"),
    "tag" : "9V9LRGU",
    "name_normalized" : "example name",
    "tag_lowercased" : "9v9lrgu"
}
This is how I upsert my documents into band_profiles_history at the moment:
BandProfileHistory.update(
    { tag_lowercased: tag, year, month },
    { $push: {
        values: { day, profile }
      }
    },
    { upsert: true }
)
My problem:
I only want to insert ONE snapshot for every day. Right now it always pushes a new object into the values array, whether or not I already have an object for that day. How can I make it push the object only if there is no object for the current day yet?
Putting mongoose aside for a moment:
There is an operator, $addToSet, that will add an element to an array if it doesn't already exist.
Caveat:
If the value is a document, MongoDB determines that the document is a duplicate if an existing document in the array matches the to-be-added document exactly; i.e. the existing document has the exact same fields and values and the fields are in the same order. As such, field order matters and you cannot specify that MongoDB compare only a subset of the fields in the document to determine whether the document is a duplicate of an existing array element.
Since you are trying to add an entire document, you are subject to this restriction.
So I see the following solutions for you:
Solution 1:
Read in the array, see if it contains the element you want, and if not, push it to the values array with $push.
This has the disadvantage of NOT being an atomic operation, meaning that you could end up with duplicates anyway. This could be acceptable if you ran a periodic clean-up job to remove duplicates from this field on each document.
It's up to you to decide if this is acceptable.
Solution 2:
Assuming you are putting the field _id in the subdocuments of your values field, stop doing it. Assuming mongoose is doing this for you (because it does, from what I understand), stop it from doing so, as described here: Stop mongoose from creating _id for subdocument in arrays.
Next you need to ensure that the fields in the document always have the same order, because order matters when comparing documents in the $addToSet operation, as stated in the citation above.
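A minimal sketch of what Solution 2 would look like applied to the original update (assuming subdocument _id generation is disabled and the field order is stable; note that $addToSet only prevents exact duplicates, so a changed profile on the same day would still be added):
BandProfileHistory.update(
    { tag_lowercased: tag, year, month },
    { $addToSet: {
        values: { day, profile }   // added only if an identical subdocument isn't already present
      }
    },
    { upsert: true }
)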
Solution 3
Change the schema of your band_profiles_history to something like:
{
    "_id" : ObjectId("599e3bc406955db4cbffe0a8"),
    "month" : 7,
    "tag_lowercased" : "9yq88gg",
    "year" : 2017,
    "values" : {
        "1" : {
            "_id" : ObjectId("599e3bc41c073a7418fead91"),
            "profile" : {
                "_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
                "tag" : "9YQ88GG",
                "name_normalized" : "example name1"
            }
        },
        "2" : {
            "_id" : ObjectId("599e3bc41c073a7418fead91"),
            "profile" : {
                "_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
                "tag" : "9YQ88GG",
                "name_normalized" : "new name"
            }
        }
    }
}
Notice that the day field became the key for the subdocuments in values. Notice also that values is now an Object instead of an Array.
Now you can run an update query that updates values.<day> only if values.<day> doesn't exist.
Personally I don't like this as it is using the fact that JSON doesn't allow duplicate keys to support the schema.
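For completeness, a minimal sketch of such an update against the reshaped schema (the day key "2" and the profile variable are illustrative assumptions):
// Set values.2 only when the matched document has no snapshot for day 2 yet:
db.band_profiles_history.update(
    { tag_lowercased: tag, year: year, month: month, "values.2": { $exists: false } },
    { $set: { "values.2": { profile: profile } } }
)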
First of all, sadly MongoDB does not support uniqueness of a field in an array within a collection. You can see there is a major issue that has been open for 7 years and is not closed yet (that is a shame, in my opinion).
What you can do from here is limited, and all of it is at the application level. I had the same problem and solved it at the application level. Do something like this:
First, read your document by its _id and values.day.
If the read in step 1 returns null, there is no record in the values array for the given day, so you can push the new value (I assume band_profiles_history already has a record with that _id).
If the read in step 1 returns a document, the values array already has a record for the given day. In that case you can use a $set operation with the positional $ operator.
As others said, this will not be atomic, but since you are handling the problem at the application level, you can synchronize that block of code. Only 2 of the 3 queries below will actually run against MongoDB:
db.getCollection('band_profiles_history').find({"_id": "1", "values.day": 3})
If it returns null:
db.getCollection('band_profiles_history').update({"_id": "1"}, {$push: {"values": {<your new band profile history for given day>}}})
If it returns a document:
db.getCollection('band_profiles_history').update({"_id": "1", "values.day": 3}, {$set: {"values.$": {<your new band profile history for given day>}}})
To check if the field is missing:
{ field: { $exists: false } }
or, if it is an array, to check if it is empty:
{ field: { $eq: [] } }
Mongoose also supports field: { type: Date }, so you can use a date instead of counting days and do updates only for the current date.

A range query on a mixed-type field doesn't return all documents

I have a collection in MongoDB and one of its fields can have mixed-type values. Now I'd like to apply a range query on this field, but it doesn't return all matching documents.
Sorting all documents by the value field returns number values followed by a string value, as described in the documentation.
> db.test.find().sort({ value: 1 })
{ "_id" : ObjectId("54dc639e498f5e13a42b0383"), "value" : 1 }
{ "_id" : ObjectId("54dc639e498f5e13a42b0384"), "value" : 2 }
{ "_id" : ObjectId("54dc639e498f5e13a42b0385"), "value" : 11 }
{ "_id" : ObjectId("54dc7757498f5e13a42b0386"), "value" : "3" }
But when I use a range query, it only returns documents whose value has the same type as the value in the query. For example, the query below returns one document instead of two.
> db.test.find({ value: { $gte: 11 } }).sort({ value: 1 })
{ "_id" : ObjectId("54dc639e498f5e13a42b0385"), "value" : 11 }
Given that the documentation says it uses the same comparison order as sorting for different BSON types, it feels strange to me.
Are there any ways to get all matched documents in this case?
I don't think this is a good data structure at all, but you're right that the behavior is odd and should probably be considered a documentation bug.
This behavior is documented in the perl driver documentation ("You must query for data using the correct type.") and explained in a question related to indexing here on SO. However, the behavior is the same with or without indexes.
I was able to reproduce this with MongoDB 3.0-rc2. With your data,
> db.mixedType.find({"value" : {$gt : ""}});
{ "_id" : ObjectId("54dc7757498f5e13a42b0386"), "value" : "3" }
> db.mixedType.find({"value" : {$gt : 10}})
{ "_id" : ObjectId("54dc639e498f5e13a42b0385"), "value" : 11 }.
So comparison of different types, setting numeric types aside, yields false. I also tried empty objects, empty arrays and boolean values, e.g. $gt : false will return a document with value : true and $gte: {} will return a document with value : {}.
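If the goal is to match the BSON sort order shown earlier (where every string sorts after every number), one hedged workaround is to combine a numeric range with a type check; a minimal sketch, assuming a server recent enough (3.2+) to accept $type string aliases:
db.test.find({
    $or: [
        { value: { $gte: 11, $type: "number" } },   // numeric values >= 11
        { value: { $type: "string" } }              // every string sorts after every number
    ]
}).sort({ value: 1 })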

Can mongo return documents with empty/missing fields at the end in asc order?

db.jason.find().sort({"rank":1})
{ "_id" : ObjectId("51ae517372779b7eeeb81342"), "name" : "jason" }
{ "_id" : ObjectId("51ae517372779b7eeeb81343"), "name" : "jason" }
{ "_id" : ObjectId("51ae513772779b7eeeb8133a"), "name" : "jason", "rank" : 0 }
{ "_id" : ObjectId("51ae513772779b7eeeb8133b"), "name" : "jason", "rank" : 1 }
{ "_id" : ObjectId("51ae513772779b7eeeb8133c"), "name" : "jason", "rank" : 2 }
I'd like the documents without rank to come at the end of the sort instead of the beginning, while keeping the items with rank in ascending order. Is this possible, or should I default the empty ones to either a string or a high number like 99999?
When returning the result for db.jason.find().sort({"rank":1}), MongoDB will order the documents by "rank" type, and then by "rank" value. For the purpose of sort order, MongoDB treats documents where a field is missing as having a NULL type for that field. The NULL type is ordered before numeric types, and this cannot be changed (see http://docs.mongodb.org/manual/reference/method/cursor.sort/ for the built-in type sort order). I would suggest constructing two queries instead (one for documents containing "rank", and one for documents without "rank") and merging the results in your application. However, if you need to keep this a single query, then you will need to set "rank" in all documents to generate the order you desire (for example, by using a sentinel value with a type which sorts after numeric types).
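A minimal sketch of the two-query approach (merged in the shell here for illustration; in an application you would merge the two result sets in code):
var ranked   = db.jason.find({ rank: { $exists: true } }).sort({ rank: 1 }).toArray()
var unranked = db.jason.find({ rank: { $exists: false } }).toArray()
var merged   = ranked.concat(unranked)   // ranked documents first (ascending), missing-rank documents last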
To avoid this problem, I added another boolean key, {"isRanked": Boolean}, and then did a multi-field sort with sort({"isRanked": -1, "rank": 1}).
This way I got my results with "rank" missing at the end.
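A minimal sketch of that approach (the inserted values are illustrative assumptions):
db.jason.insert({ name: "jason", rank: 3, isRanked: true })
db.jason.insert({ name: "jason", isRanked: false })   // no rank yet
db.jason.find().sort({ isRanked: -1, rank: 1 })       // true sorts before false, then rank ascending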

matching fields internally in mongodb

I have the following documents in MongoDB:
{
    "_id" : ObjectId("517b88decd483543a8bdd95b"),
    "studentId" : 23,
    "students" : [
        {
            "id" : 23,
            "class" : "a"
        },
        {
            "id" : 55,
            "class" : "b"
        }
    ]
}
{
    "_id" : ObjectId("517b9d05254e385a07fc4e71"),
    "studentId" : 55,
    "students" : [
        {
            "id" : 33,
            "class" : "c"
        }
    ]
}
Note: not actual data, but the schema is exactly the same.
Requirement: find the documents where studentId matches students.id (the id inside the students array) using a single query.
I have tried code like the following:
db.data.aggregate({$match:{"students.id":"$studentId"}},{$group:{_id:"$student"}});
Result: an empty array. If I replace {"students.id":"$studentId"} with {"students.id":33}, it returns the second document in the JSON shown above.
Is it possible to get the documents for this scenario using single query?
If possible, I'd suggest that you set the condition while storing the data so that you can do a quick truth check (isInStudentsList). It would be super fast to do that type of query.
Otherwise, there is a relatively complex way of using the Aggregation framework pipeline to do what you want in a single query:
db.students.aggregate(
{$project:
{studentId: 1, studentIdComp: "$students.id"}},
{$unwind: "$studentIdComp"},
{$project : { studentId : 1,
isStudentEqual: { $eq : [ "$studentId", "$studentIdComp" ] }}},
{$match: {isStudentEqual: true}})
Given your input example the output would be:
{
    "result" : [
        {
            "_id" : ObjectId("517b88decd483543a8bdd95b"),
            "studentId" : 23,
            "isStudentEqual" : true
        }
    ],
    "ok" : 1
}
A brief explanation of the steps:
Build a projection of the document with just studentId and a new field containing an array of just the ids (so for the first document it would contain [23, 55]).
Using that structure, $unwind. That creates a new temporary document for each array element in the studentIdComp array.
Now, take those documents, and create a new document projection, which continues to have the studentId and adds a new field called isStudentEqual that compares the equality of two fields, the studentId and studentIdComp. Remember that at this point there is a single temporary document that contains those two fields.
Finally, check that the comparison value isStudentEqual is true and return those documents (which will contain the original document _id and the studentId).
If the student was in the list multiple times, you might need to group the results on studentId or _id to prevent duplicates (but I don't know that you'd need that).
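As a side note (not part of the original answer), on MongoDB 3.6+ the same field-to-field comparison can be written without $unwind by using $expr with the aggregation $in operator; a minimal sketch:
// "$students.id" resolves to the array of id values, e.g. [23, 55] for the first document:
db.data.find({ $expr: { $in: ["$studentId", "$students.id"] } })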
Unfortunately it's impossible ;(
To solve this problem it is necessary to use a $where statement (example: Finding embedded document in mongodb?), but $where cannot be used with the aggregation framework.
db.data.find({students: {$elemMatch: {id: 23}}, studentId: 23});
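A minimal sketch of the $where approach mentioned above (it evaluates JavaScript per document, so it is slow and cannot use indexes; shown only as an illustration):
db.data.find({
    $where: function () {
        var sid = this.studentId;
        // true if any element of the students array has an id equal to this document's studentId
        return this.students.some(function (s) { return s.id === sid; });
    }
})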