I've read and searched about MongoDB's JSON-BSON constructions but I do not understand (could not find either) how to have nested data and how to query it.
What I'd like to learn is, if somebody wants to store array within an array as in:
id: x,
name: y,
other: z,
multipleArray: [
(lab1: A, lab2: B, lab3:C),
(lab1: AB, lab2: BB, lab3:CB),
(lab1: AC, lab2: BC, lab3:CC)
..
..
]
How to store such data and then retrieve some, a specific or all elements of "multipleArray" contents?
Any resource on the subject would also be highly appreciated.
Bryan had some great advice which you should heed.
Also, as Manoj said, what you actually have is an array of objects. The following might help you out a bit...
Lists are just ordered sequences: [1,2,3...] or [2,292,111]
The first element in the last example is 2, the second is 292... lists/arrays are denoted by square brackets [ ]
Objects map keys to values: { name: "Tyler", age: 26, fav_color: "green" }
name maps to "Tyler", age maps to 26, etc. Objects are denoted by braces { }
A document in MongoDB is an object, so, as above, it maps keys to values. Those values can be strings, numbers, arrays, or even other (nested) objects.
So, let's take a look at your document. You have an object (document) that has the keys id, name, other, and multipleArray. The value that multipleArray maps to is an array [ ] of objects { }.
{
  id: x,
  name: y,
  other: z,
  multipleArray: [
    {lab1: "A", lab2: "B", lab3: "C"},
    {lab1: "AB", lab2: "BB", lab3: "CB"},
    {lab1: "AC", lab2: "BC", lab3: "CC"}
  ]
}
MongoDB has a feature called multikeys: it takes the value you are querying for and tries to match it against every element in the array.
If you wanted to find the document where multipleArray contained the document {lab1: "A", lab2: "B", lab3: "C"}, you query like this: db.data.find({multipleArray: {lab1: "A", lab2: "B", lab3: "C"}})
I'm assuming x, y, and z are defined already.
There are more subtleties and complexities, but if you want to learn more, read the documentation on the MongoDB site or get a book.
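To make the matching rule above concrete, here is a minimal in-memory sketch in Python. It does not talk to MongoDB: `matches_any_element` and `find` are hypothetical helpers, and the concrete values for id, name, and other are made up. It only models the idea that the queried value is compared against each element of the array and the document matches if any element is equal.

```python
# In-memory model of matching a query value against every element of an array field.
# Not a MongoDB API: `matches_any_element` and `find` are hypothetical helpers.

def matches_any_element(array, wanted):
    """Return True if any element of `array` equals `wanted`."""
    return any(element == wanted for element in array)

def find(collection, field, wanted):
    """Return the documents whose array `field` contains `wanted`."""
    return [doc for doc in collection
            if matches_any_element(doc.get(field, []), wanted)]

# Sample data; the id/name/other values are invented for illustration.
data = [
    {"id": 1, "name": "y", "other": "z", "multipleArray": [
        {"lab1": "A", "lab2": "B", "lab3": "C"},
        {"lab1": "AB", "lab2": "BB", "lab3": "CB"},
        {"lab1": "AC", "lab2": "BC", "lab3": "CC"},
    ]},
    {"id": 2, "name": "q", "other": "r", "multipleArray": [
        {"lab1": "X", "lab2": "Y", "lab3": "Z"},
    ]},
]

hits = find(data, "multipleArray", {"lab1": "A", "lab2": "B", "lab3": "C"})
hit_ids = [doc["id"] for doc in hits]

# A partial subdocument is not equal to any whole element, so nothing matches.
partial_hits = find(data, "multipleArray", {"lab1": "A"})
```

In this model a partial subdocument such as {lab1: "A"} finds nothing, because whole elements are compared. In MongoDB itself, matching on a single key of the embedded objects is normally written with dot notation, for example "multipleArray.lab1".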
Your question is a bit generic, so it is difficult to give a good answer.
When modeling your data to be stored and queried using MongoDB, you should consider how you plan to actually use and query your data. Based on the answer to that, you should be able to come up with a good data structure for storing the data.
It would be good for you to familiarize yourself with the MongoDB query methods (http://www.mongodb.org/display/DOCS/Querying) so you can understand the many ways to query data in MongoDB.
Whatever language you are using should have a nice library that abstracts away the low-level details of storing and querying the data, but it's still going to be important to know what query methods MongoDB supports.
In general, MongoDB queries let you "reach into" nested objects in a given document, and that includes arrays as well.
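As a rough picture of what "reaching into" a document means, the following Python sketch follows a dotted path through nested objects and fans out over arrays along the way. It is an in-memory illustration only, not MongoDB's implementation; `reach_into` is a hypothetical helper and the sample values are invented.

```python
# In-memory sketch of following a dotted path such as "multipleArray.lab1".
# `reach_into` is a hypothetical helper, not part of any MongoDB driver.

def reach_into(value, path):
    """Collect every value found by following `path`, fanning out over lists."""
    if not path:
        return [value]
    if isinstance(value, list):
        found = []
        for element in value:
            found.extend(reach_into(element, path))
        return found
    if isinstance(value, dict):
        key, _, rest = path.partition(".")
        if key in value:
            return reach_into(value[key], rest)
    return []

doc = {
    "id": 1,
    "name": "y",
    "multipleArray": [
        {"lab1": "A", "lab2": "B", "lab3": "C"},
        {"lab1": "AB", "lab2": "BB", "lab3": "CB"},
        {"lab1": "AC", "lab2": "BC", "lab3": "CC"},
    ],
}

lab1_values = reach_into(doc, "multipleArray.lab1")
name_values = reach_into(doc, "name")
missing = reach_into(doc, "multipleArray.lab9")
```

A query on such a path can then be thought of as asking whether any of the collected values satisfies the condition.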
Literal Expression = PurchaseOrders.pledgedDocuments[valuation.value=62500]
Purchase Order Structure
PurchaseOrders:
[
{
"productId": "PURCHASE_ORDER_FINANCING",
"pledgedDocuments" : [{"valuation" : {"value" : "62500"} }]
}
]
The literal expression produces a null result.
However, the expression
PurchaseOrders.pledgedDocuments[valuation = null]
returns all results!
What am I doing wrong?
I was able to solve it using the flatten function, but I don't know how it worked :(
In your original question, it is not exactly clear to me what your end goal is, so I will try to provide some references.
value filtering
First, your PurchaseOrders -> pledgedDocuments -> valuation -> value appears to be a string, so in your original question trying to filter by
QUOTE:
... [valuation.value=62500]
will not help you.
You'll need to filter on something more like: valuation.value="62500"
list projection
In your original question, you are projecting on PurchaseOrders, which is a list, and accessing pledgedDocuments, which is again a list!
So when you do:
QUOTE:
PurchaseOrders.pledgedDocuments (...)
You don't have a simple list; you have a list of lists: a list of all the lists of pledged documents.
final solution
I believe what you wanted is:
flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"]
And let's do the exercise on paper about what is actually happening.
First,
Let's focus on PurchaseOrders.pledgedDocuments.
You supply PurchaseOrders which is a LIST of POs,
and you project on pledgedDocuments.
What is that intermediate result?
Referencing your original question input value for POs, it is:
[
[{"valuation" : {"value" : "62500"} }]
]
Notice how it is a list of lists?
With the first part of the expression, PurchaseOrders.pledgedDocuments, you have asked: for each PO, give me the list of pledged documents.
By hypothesis, if you supplied 3 POs, and each having 2 documents, you would have obtained with PurchaseOrders.pledgedDocuments a resulting list of again 3 elements, each element actually being a list of 2 elements.
Now,
With flatten(PurchaseOrders.pledgedDocuments) you achieve:
[{"valuation" : {"value" : "62500"} }]
So at this point you have a list containing all documents, regardless of which PO.
Now,
With flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"] the complete expression, you still achieve:
[{"valuation" : {"value" : "62500"} }]
Because you have asked on the flattened list, to keep only those elements containing a valuation.value equal to the "62500" string.
In other words, if you use this expression, what you achieve is:
From any PO, return the documents whose valuation's value
equals the string "62500", regardless of the PO the document belongs to.
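The paper exercise above can also be replayed with a small Python model. This is not a FEEL engine: `get_path`, `project`, `flatten`, and `filter_items` are hypothetical helpers, `flatten` here only removes one level of nesting, and the model assumes that looking up a key on something that is not a context (for example, on a list) yields null. Under that assumption it also suggests why the [valuation = null] filter returned everything.

```python
# In-memory model of the expressions discussed above (not a FEEL engine).
# All helper names are hypothetical.

def get_path(item, path):
    """Follow a dotted path; anything that is not a dict yields None (null)."""
    for key in path.split("."):
        item = item.get(key) if isinstance(item, dict) else None
    return item

def project(items, key):
    """Model of `list.key`: one result per list element."""
    return [get_path(item, key) for item in items]

def flatten(nested):
    """Flatten one level of nesting: a list of lists becomes one list."""
    return [element for sub in nested for element in sub]

def filter_items(items, path, wanted):
    """Model of `list[path = wanted]`."""
    return [item for item in items if get_path(item, path) == wanted]

purchase_orders = [
    {"productId": "PURCHASE_ORDER_FINANCING",
     "pledgedDocuments": [{"valuation": {"value": "62500"}}]},
]

nested = project(purchase_orders, "pledgedDocuments")  # a list of lists

# Filtering the list of lists: each item is itself a list, so the path is null.
unflattened = filter_items(nested, "valuation.value", "62500")
null_filter = filter_items(nested, "valuation", None)

# Filtering the flattened list: each item is a document.
flattened = filter_items(flatten(nested), "valuation.value", "62500")
as_number = filter_items(flatten(nested), "valuation.value", 62500)
```

In this model the unflattened filter matches nothing, the null filter keeps every inner list, the flattened string filter returns the document, and the numeric filter returns nothing because "62500" is a string.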
My Mongodb collection has this document structure:
{
_id: 1,
my_dict: {
my_key: [
{id: x, other_fields: other_values},
...
]
},
...
},
I need to update the array subdocuments very often, so an Index on the id field seems like a good idea. Still, I have many documents (millions) but my arrays inside them are small (max ~20 elements). Would it still improve performance a lot to index it, compared to the cost of indexing?
PS: I'm not using the id as a key (dict instead of an array), as I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count as I am using Mongodb 3.2.
Followup question: If it would make a very big difference, I could instead use a dict like so:
{id: {others_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and update it myself (possible errors maybe, as I would need to use $inc each time I add/delete an item) instead of relying on "real" values. I would also have to manage the possibility that a key could be called _my_size, which would conflict with my logic. It would look then like this:
{
_id: 1,
my_dict: {
my_key: {
id: {other_fields: other_values},
_my_size: 1
},
},
},
Still not sure which is best for performance. I will need to update the subdocument (with id field) a lot, as well as computing the $size a lot (maybe 1/10 of the calls to update).
Which Schema/Strategy would give me a better performance? Or even more important, would it actually make a big difference? (possibly thousands of calls per second)
Update example:
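db.collection.update(
  {_id: 1, "my_dict.my_key.id": update_data_id},
  // dotted paths must be quoted; "$" targets the array element matched by the filter
  {$set: {"my_dict.my_key.$": update_data}}
)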
Getting the size example:
db.collection.aggregate([
  {$match: {_id: 1}},
  {$project: {_id: 0, nb_of_sub_documents: {$size: "$my_dict.my_key"}}}
])
So, I have a problem in MongoDB... I have stored some data in MongoDB and it basically looks like:
{
_id: 1,
name: "aa",
importance: [0.5, 0.25, 0.25]
}
Where this importance attribute is an array that I will have to keep updating, e.g., after getting some data, it should be updated to [0.80, 0.10, 0.10]...
In this toy problem, I don't really understand how to replace a full array. Do I have to do it elementwise somehow? If so, it is not a tractable solution in my case, because the number of elements in my array reaches above 1000.
You can replace the entire importance array in an update by using $set:
db.test.update({_id: 1}, {$set: {importance: [0.80, 0.10, 0.10]}})
Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
"_id" : {
"clienta" : NumberLong(460011766),
"clientb" : NumberLong(2886729962)
},
"states" : [
[
"fixed", "fixed.rotated","fixed.rotated.off"
]
],
"VBPP" : [
244,
182,
184,
11,
299,
],
"PPF" : 72.4,
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
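import numpy as np

# Assumes `db` is an existing PyMongo database handle.
records_list = []
cursor = db.clientAgg.find({}, {'_id': 0,
                                'states': 1,
                                'VBPP': 1,
                                'PPF': 1})
for record in cursor:
    records_list.append(record)
for dicts in records_list:
    dicts['states'] = len(dicts['states'])  # replace the array with its length
    dicts['VBPP'] = np.var(dicts['VBPP'])   # replace the array with its variance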
I have written various forms of this basic flow to optimize for speed, but bringing in 500k dictionaries in memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)
def iter_items():  # renamed from iter() so the built-in is not shadowed
    for item in cursor:
        yield item
l = []
for x in iter_items():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id':x['_id']},{'$set':{'x.VBPS': somefunction as above }},upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
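For reference, the variance recipe described above (subtract the mean, square, then average) can be written out with nothing but the standard library. This is a minimal sketch; `population_variance` is a hypothetical helper name. It divides by the number of elements, which is also what np.var does by default.

```python
# Population variance written out step by step, standard library only.
# `population_variance` is a hypothetical helper name.

def population_variance(values):
    mean = sum(values) / len(values)
    # Subtract the mean from each element and square the result...
    squared_deviations = [(v - mean) ** 2 for v in values]
    # ...then average the squared deviations.
    return sum(squared_deviations) / len(squared_deviations)

vbpp = [244, 182, 184, 11, 299]  # the VBPP array from the sample document
result = population_variance(vbpp)
```

For the sample VBPP array the mean is 184 and the squared deviations sum to 46758, so the variance is 46758 / 5 = 9351.6.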
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
At the time of writing, MongoDB has no way to update a document with values calculated from the document's fields; you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value. (MongoDB 4.2 and later lift this restriction by accepting an aggregation pipeline as the update.)
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
Addendum:
I have two other ideas. First, run some JavaScript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
  var collection = db.test_collection;
  collection.find().forEach(function(doc) {
    var b = 2 * doc.a;
    collection.update({_id: doc._id}, {$set: {b: b}});
  });
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
  $project: {
    a: '$a',
    b: {$multiply: [2, '$a']}
  }
}, {
  $out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
My final thought on this is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.
Assuming a mapreduce function representing object relationships like:
function (doc) {
emit([doc.source, doc.target, doc.name], null);
}
The normal example of filtering a compound key is something like:
startKey = [ a_source ]
endKey = [ a_source, {} ]
That should provide a list of all objects referenced from a_source.
Now I want the opposite, and I am not sure if that is possible. I have not been able to find an example where the variant part comes first, like:
startKey = [ *simbol_first_match* , a_destination ]
endKey = [ {} , a_destination ]
Is that possible? Are the (1) filter and (2) sort operations on compound keys within a query limited to the order in which the elements appear in the key?
I know I could define another view/mapreduce, but I would like to avoid the extra disk space cost if possible.
No, you can't do that. See my explanation of how keys work in view requests with CouchDB.
Compound keys are nothing special, no filtering or anything. Internally you have to imagine that there is no array anymore. It's just syntactic sugar for us developers. So [a,b] - [a,c] is treated just like 'a_b' - 'a_c' (with _ being a special delimiter).
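A small Python model may help show why a range works on the leading key element but not on a later one. This is only an in-memory sketch of a sorted index, not a CouchDB API: `key_range` is a hypothetical helper, HIGH is an assumed stand-in for the {} end-key sentinel, and the sample rows are invented.

```python
# In-memory model of a sorted view index with compound keys (not a CouchDB API).
from bisect import bisect_left, bisect_right

HIGH = "\uffff"  # assumed to sort after every string used in the keys below

# Rows as the map function would emit them: (source, target, name), kept sorted.
rows = sorted([
    ("s1", "t1", "n1"),
    ("s1", "t2", "n2"),
    ("s2", "t1", "n3"),
    ("s2", "t3", "n4"),
    ("s3", "t1", "n5"),
])

def key_range(sorted_rows, start_key, end_key):
    """Return the contiguous slice of rows between start_key and end_key."""
    lo = bisect_left(sorted_rows, start_key)
    hi = bisect_right(sorted_rows, end_key)
    return sorted_rows[lo:hi]

# Fixing the first element selects one contiguous block of the index.
by_source = key_range(rows, ("s1",), ("s1", HIGH))

# Rows sharing a target are scattered through the index, so no single
# start/end pair can select exactly them.
positions_of_t1 = [i for i, row in enumerate(rows) if row[1] == "t1"]
```

Here the rows for source s1 sit next to each other, while the rows for target t1 land at positions 0, 2, and 4, which is why a second view keyed by target first would be needed.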