So, I have a problem in MongoDB... I have stored some data in MongoDB and it basically looks like:
{
_id: 1,
name: "aa",
importance: [0.5, 0.25, 0.25]
}
Where this importance attribute is an array that I will have to keep updating, e.g., after getting some data, it should be updated to [0.80, 0.10, 0.10]...
In this toy problem, I don't really understand how to replace a full array. Do I have to do it elementwise somehow? If so, that is not a tractable solution in my case, because the number of elements in my array reaches above 1000.
You can replace the entire importance array in an update by using $set:
db.test.update({_id: 1}, {$set: {importance: [0.80, 0.10, 0.10]}})
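For PyMongo users, the same idea can be sketched as below. This is only an illustration: the helper name build_replace_array_update is made up for this example, and the commented-out update_one call assumes a collection handle that is not defined here.

```python
def build_replace_array_update(field, new_values):
    """Return an update document that overwrites `field` with a whole new array.

    Hypothetical helper for illustration; $set replaces the stored array in
    one step, so there is no need to touch elements one by one.
    """
    return {"$set": {field: list(new_values)}}


update_doc = build_replace_array_update("importance", [0.80, 0.10, 0.10])
print(update_doc)

# With a live connection this could be sent as (assumed handle, not run here):
# collection.update_one({"_id": 1}, update_doc)
```

The array length does not matter here: a 1000-element list is passed the same way as a 3-element one.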
I'm using MongoDB 4.0 on a MongoDB Atlas cluster (3 replicas, 1 shard).
Assume I have a collection that contains multiple documents.
Each of these documents holds an array of subdocuments that represent cities in a certain year, with additional information. An example document looks like this (I removed unnecessary information to simplify the example):
{_id:123,
cities:[
{name:"vienna",
year:1985
},
{name:"berlin",
year:2001
},
{name:"vienna",
year:1985
}
]}
I have a compound index on cities.name and cities.year. What is the fastest way to count the occurrences of name and year combinations?
I already tried the following aggregation:
[{$unwind: {
path: '$cities'
}}, {$group: {
_id: {
name: '$cities.name',
year: '$cities.year'
},
count: {
$sum: 1
}
}}, {$project: {
count: 1,
name: '$_id.name',
year: '$_id.year',
_id: 0
}}]
Another approach I tried was a map-reduce in the following form. The map-reduce performed a bit better: roughly 30% less time needed.
map function:
function m() {
for (var i in this.cities) {
emit({
name: this.cities[i].name,
year: this.cities[i].year
},
1);
}
}
reduce function (also tried to replace sum with length, but surprisingly sum is faster):
function r(id, counts) {
return Array.sum(counts);
}
function call in mongoshell:
db.test.mapReduce(m,r,{out:"mr_test"})
Now I was asking myself: is it possible to access the index directly? As far as I know, it is a B+ tree that holds pointers to the relevant documents on disk, so from a technical point of view I think it would be possible to iterate through all leaves of the index tree and just count the pointers. Does anybody know if this is possible?
Does anybody know another way to solve this problem with high performance? (It is not possible to change the design because of other dependencies of the software, and we are running this on a very big dataset.) Does anybody have experience solving such a task via shards?
The index will not be very helpful in this situation.
MongoDB indexes were designed for identifying documents that match given criteria.
If you create an index on {"cities.name": 1, "cities.year": 1}, this document:
{_id:123,
cities:[
{name:"vienna",
year:1985
},
{name:"berlin",
year:2001
},
{name:"vienna",
year:1985
}
]}
will have 2 entries in the B-tree that refer to this document:
vienna|1985
berlin|2001
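Even if it were possible to count the occurrences of a specific key in the index, that count would not necessarily equal the number of matching array elements: the duplicate vienna|1985 element above produces only one index entry for the document.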
MongoDB does not provide a method to examine the raw entries in an index, and it explicitly refuses to use an index on a field containing an array for counting.
The MongoDB count command and helper functions all count documents, not elements inside of them. As you noticed, you can unwind the array and count the items in an aggregation pipeline, but at that point you've already loaded all of the documents into memory, so it's too late to make use of an index.
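To make the mismatch concrete, here is a small pure-Python sketch. It is a simplified model of the behaviour described above (one index key per distinct name/year pair per document), not MongoDB's actual index code, and the function names are invented for this example.

```python
from collections import Counter

# The example document from the question (with valid commas).
docs = [
    {"_id": 123, "cities": [
        {"name": "vienna", "year": 1985},
        {"name": "berlin", "year": 2001},
        {"name": "vienna", "year": 1985},
    ]},
]


def count_elements(documents):
    """Count every (name, year) array element; this is what $unwind + $group yields."""
    counts = Counter()
    for doc in documents:
        for city in doc["cities"]:
            counts[(city["name"], city["year"])] += 1
    return counts


def count_index_keys(documents):
    """Count one key per distinct (name, year) per document (simplified index model)."""
    counts = Counter()
    for doc in documents:
        distinct = {(c["name"], c["year"]) for c in doc["cities"]}
        counts.update(distinct)
    return counts


element_counts = count_elements(docs)
index_key_counts = count_index_keys(docs)
print(element_counts[("vienna", 1985)], index_key_counts[("vienna", 1985)])
```

Under this model the element count for vienna/1985 is 2 while the index holds a single key for it, so counting index entries would undercount.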
My MongoDB collection has this document structure:
{
_id: 1,
my_dict: {
my_key: [
{id: x, other_fields: other_values},
...
]
},
...
},
I need to update the array subdocuments very often, so an index on the id field seems like a good idea. However, I have many documents (millions), while the arrays inside them are small (at most ~20 elements). Would indexing still improve performance enough to justify its cost?
PS: I'm not using the id as a key (a dict instead of an array), because I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count as I am using MongoDB 3.2.
Followup question: If it would make a very big difference, I could instead use a dict like so:
{id: {others_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and update it myself (possible errors maybe, as I would need to use $inc each time I add/delete an item) instead of relying on "real" values. I would also have to manage the possibility that a key could be called _my_size, which would conflict with my logic. It would look then like this:
{
_id: 1,
my_dict: {
my_key: {
id: {other_fields: other_values},
_my_size: 1
},
},
},
Still not sure which is best for performance. I will need to update the subdocument (with id field) a lot, as well as computing the $size a lot (maybe 1/10 of the calls to update).
Which Schema/Strategy would give me a better performance? Or even more important, would it actually make a big difference? (possibly thousands of calls per second)
Update example:
update(
{_id: 1, "my_dict.my_key.id": update_data_id},
{$set: {"my_dict.my_key": update_data}}
)
Getting the size example:
aggregate([
{$match: {_id: 1}},
{$project: {_id: 0, nb_of_sub_documents: {$size: "$my_dict.my_key"}}}
])
Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
"_id" : {
"clienta" : NumberLong(460011766),
"clientb" : NumberLong(2886729962)
},
"states" : [
[
"fixed", "fixed.rotated","fixed.rotated.off"
]
],
"VBPP" : [
244,
182,
184,
11,
299,
],
"PPF" : 72.4,
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
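import numpy as np

# Assumes `db` is an existing PyMongo database handle.
records_list = []
cursor = db.clientAgg.find({}, {'_id': 0,
                                'states': 1,
                                'VBPP': 1,
                                'PPF': 1})
for record in cursor:
    records_list.append(record)
for dicts in records_list:
    dicts['states'] = len(dicts['states'])  # replace the array with its length
    dicts['VBPP'] = np.var(dicts['VBPP'])   # replace the array with its variance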
I have written various forms of this basic flow to optimize for speed, but bringing in 500k dictionaries in memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)
def iter():
    for item in cursor:
        yield item
l = []
for x in iter():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id':x['_id']},{'$set':{'x.VBPS': somefunction as above }},upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
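For reference, the calculation described in the previous sentence can be written out in plain Python as a minimal sketch. The function name is chosen for this example only, and the sample values are the VBPP array from the document shown earlier. It computes the population variance, which is also what np.var returns by default.

```python
def population_variance(values):
    """Mean of the squared deviations from the mean (population variance)."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / len(values)


vbpp = [244, 182, 184, 11, 299]
print(population_variance(vbpp))  # mean is 184, squared deviations sum to 46758
```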
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
Addendum:
I have two other ideas. First, run some JavaScript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
var collection = db.test_collection;
collection.find().forEach(function(doc) {
var b = 2 * doc.a;
collection.update({_id: doc._id}, {$set: {b: b}});
});
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
$project: {
a: '$a',
b: {$multiply: [2, '$a']}
}
}, {
$out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
My final thought on this is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.
How do I accomplish the following?
db.test.save( {a: [1,2,3]} );
db.test.find( {a: [1,2,3,4]} ); //must match because it contains all required values [1, 2, 3]
db.test.find( {a: [1,2]} ); //must NOT match because the required value 3 is missing
I know about $in and $all but they work differently.
Interesting. The problem is that the $in and $or operators are applied to the elements of the array you pass in, comparing them against each document in the collection, not to the elements of the arrays stored in the documents. To summarize your question: you want a match if the array in any document in the collection is a subset of the passed array. I can't think of a way to do this unless you swap your input and output. What I mean is this. Let's take your first input:
db.test.find( {a: [1,2,3,4]} );
Consider putting this in a temporary collection, say temp, as:
db.temp.save( {a: [1,2,3,4]} );
Now iterate over each document in test collection and 'find' it in temp, with the $all operator to ensure it is completely contained, i.e., do something like this:
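db.test.find().forEach(function (doc) {
    // doc matches if every element of doc.a appears in the array stored in temp
    if (db.temp.findOne({ a: { $all: doc.a } })) {
        printjson(doc);
    }
});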
This is definitely a workaround! I am not sure if I am missing any other operator that can do this job.
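The containment test behind this workaround can also be done client-side. The sketch below uses plain Python sets and invented names (stored_docs, find_subset_matches); it assumes the documents have already been fetched and that element order and duplicates do not matter.

```python
# Stand-in for the documents in the test collection.
stored_docs = [{"a": [1, 2, 3]}]


def find_subset_matches(documents, candidate):
    """Return documents whose array `a` is fully contained in `candidate`."""
    candidate_set = set(candidate)
    return [doc for doc in documents if set(doc["a"]) <= candidate_set]


print(find_subset_matches(stored_docs, [1, 2, 3, 4]))  # document is returned
print(find_subset_matches(stored_docs, [1, 2]))        # empty: 3 is missing
```

This scans every document, so it only illustrates the matching rule; it is not a replacement for a server-side operator.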
There is currently no operator that will find documents whose array is a subset of the given values. There is a proposed improvement to add a $subset operator. Please feel free to upvote the issue in JIRA:
https://jira.mongodb.org/browse/SERVER-974
Other potential workarounds, which may not be suitable for your use case, may involve map reduce or the new aggregation framework (available in MongoDB version 2.2):
http://docs.mongodb.org/manual/reference/aggregation/#_S_project
I've read and searched about MongoDB's JSON-BSON constructions but I do not understand (could not find either) how to have nested data and how to query it.
What I'd like to learn is, if somebody wants to store array within an array as in:
id: x,
name: y,
other: z,
multipleArray: [
(lab1: A, lab2: B, lab3:C),
(lab1: AB, lab2: BB, lab3:CB)
(lab1: AC, lab2: BC, lab3:CC)
..
..
]
How to store such data and then retrieve some, a specific or all elements of "multipleArray" contents?
Any resource on the subject would also be highly appreciated.
Bryan had some great advice which you should heed.
Also, as Manoj said, what you actually have is an array of objects. The following might help you out a bit...
Lists are just ordered sequences: [1,2,3...] or [2,292,111]
The first element in the last example is 2, the second is 292... lists/arrays are denoted by square brackets [ ]
Objects map keys to values: { name: "Tyler", age: 26, fav_color: "green" }
name maps to "Tyler", age maps to 26, etc... and objects are denoted by braces { }
A document in MongoDB is an object. So, like above, documents map keys to values. Those values can be strings, numbers, arrays... or even other (nested) objects.
So, let's take a look at your document. You have an object (document) that has keys id, name, other, and multipleArray. The value that multipleArray maps to is an array [ ] of objects { }.
{
id: x,
name: y,
other: z,
multipleArray: [
{lab1: "A", lab2: "B", lab3:"C"},
{lab1: "AB", lab2: "BB", lab3:"CB"},
{lab1: "AC", lab2: "BC", lab3:"CC"}
]
}
MongoDB has a feature called multikeys: it basically takes the value you are querying for and tries to match it against every value in the array.
If you wanted to find the document where multipleArray contained the document {lab1: "A", lab2: "B", lab3: "C"}, you query like this: db.data.find({multipleArray: {lab1: "A", lab2: "B", lab3: "C"}})
I'm assuming x, y, and z are defined already.
There are more subtleties and complexities, but if you want to learn more read the documentation on the mongodb site here or get a book.
Your question is a bit generic and as such is difficult to give a good answer.
When modeling your data to be stored and queried using MongoDB, you should consider how you plan to actually use and query your data. Based on the answer to that, you should be able to come up with a good data structure for storing the data.
It would be good for you to familiarize yourself with the MongoDB query methods (http://www.mongodb.org/display/DOCS/Querying) so you can understand the many ways to query data in MongoDB.
Whatever language you are using should have a nice library that abstracts away the low-level details of storing and querying the data, but it's still going to be important to know what query methods MongoDB supports.
In general, MongoDB queries let you "reach into" nested objects in a given document, including arrays.
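As a rough illustration of what "reaching into" means, the sketch below resolves a dotted path such as multipleArray.lab1 against a nested document in plain Python. It is a simplified model written for this example (resolve_path is not a MongoDB or PyMongo function) and it ignores numeric array positions in the path.

```python
def resolve_path(document, path):
    """Collect every value reachable via a dotted path, descending into arrays."""
    current = [document]
    for key in path.split("."):
        next_values = []
        for item in current:
            # An array is searched element by element, like a multikey match.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_values.append(candidate[key])
        current = next_values
    return current


doc = {
    "name": "y",
    "multipleArray": [
        {"lab1": "A", "lab2": "B", "lab3": "C"},
        {"lab1": "AB", "lab2": "BB", "lab3": "CB"},
        {"lab1": "AC", "lab2": "BC", "lab3": "CC"},
    ],
}

lab1_values = resolve_path(doc, "multipleArray.lab1")
print(lab1_values)
print("AB" in lab1_values)  # the idea behind a query on "multipleArray.lab1"
```

In the shell, a dotted-key query of the form db.data.find({"multipleArray.lab1": "AB"}) expresses the same lookup.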