Assuming a mapreduce function representing object relationships like:
function (doc) {
emit([doc.source, doc.target, doc.name], null);
}
The normal example of filtering a compound key is something like:
startKey = [ a_source ]
endKey = [ a_source, {} ]
That should provide a list of all objects referenced from a_source
Now I want the oposite and I am not sure if that is possible. I have not been able to find an example where the variant part comes first, like:
startKey = [ *simbol_first_match* , a_destination ]
endKey = [ {} , a_destination ]
Is that posible? Are compound keys (1) filter and (2) sort operations within a query limited to the order of the elements appear in the key?
I know I could define another view/mapreduce, but I would like to avoid the extra disk space cost if possible.
No, you can't do that. See here where I explained how keys work in view requests with CouchDB.
Compound keys are nothing special, no filtering or anything. Internally you have to imagine that there is no array anymore. It's just syntactic sugar for us developers. So [a,b] - [a,c] is treated just like 'a_b' - 'a_c' (with _ being a special delimiter).
Related
Literal Expression = PurchaseOrders.pledgedDocuments[valuation.value=62500]
Purchase Order Structure
PurchaseOrders:
[
{
"productId": "PURCHASE_ORDER_FINANCING",
"pledgedDocuments" : [{"valuation" : {"value" : "62500"} }]
}
]
The literal Expression produces null result.
However if
PurchaseOrders.pledgedDocuments[valuation = null]
Return all results !
What am I doing wrong ?
I was able to solve using flatten function - but dont know how it worked :(
In your original question, it is not exactly clear to me what is your end-goal, so I will try to provide some references.
value filtering
First, your PurchaseOrders -> pledgedDocuments -> valuation -> value appears to be a string, so in your original question trying to filter by
QUOTE:
... [valuation.value=62500]
will not help you.
You'll need to filter to something more ~like: valuation.value="62500"
list projection
In your original question, you are projecting on the PurchaseOrders which is a list and accessing pledgedDocuments which again is a list !
So when you do:
QUOTE:
PurchaseOrders.pledgedDocuments (...)
You don't have a simple list; you have a list of lists, it is a list of all the lists of pledged documents.
final solution
I believe what you wanted is:
flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"]
And let's do the exercise on paper about what is actually happening.
First,
Let's focus on PurchaseOrders.pledgedDocuments.
You supply PurchaseOrders which is a LIST of POs,
and you project on pledgedDocuments.
What is that intermediate results?
Referencing your original question input value for POs, it is:
[
[{"valuation" : {"value" : "62500"} }]
]
notice how it is a list of lists?
With the first part of the expression, PurchaseOrders.pledgedDocuments, you have asked: for each PO, give me the list of pledged documents.
By hypothesis, if you supplied 3 POs, and each having 2 documents, you would have obtained with PurchaseOrders.pledgedDocuments a resulting list of again 3 elements, each element actually being a list of 2 elements.
Now,
With flatten(PurchaseOrders.pledgedDocuments) you achieve:
[{"valuation" : {"value" : "62500"} }]
So at this point you have a list containing all documents, regardless of which PO.
Now,
With flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"] the complete expression, you still achieve:
[{"valuation" : {"value" : "62500"} }]
Because you have asked on the flattened list, to keep only those elements containing a valuation.value equal to the "62500" string.
In other words iff you have used this expression, what you achieved is:
From any PO, return me the documents having the valuations' value
equals to the string 62500, regardless of the PO the document belongs to.
I have documents with four fields: A, B, C, D Now I need to find documents where at least three fields matches. For example:
Query: A=a, B=b, C=c, D=d
Returned documents:
a,b,c,d (four of four met)
a,b,c (three of four met)
a,b,d (another three of four met)
a,c,d (another three of four met)
b,c,d (another three of four met)
So far I created something like:
`(A=a AND B=b AND C=c)
OR (A=a AND B=b AND D=d)
OR (A=a AND C=c AND D=d)
OR (B=b AND C=c AND D=d)`
But this is ugly and error prone.
Is there a better way to achieve it? Also, query performance matters.
I'm using Spring Data but I believe it does not matter. My current code:
Criteria c = new Criteria();
Criteria ca = Criteria.where("A").is(doc.getA());
Criteria cb = Criteria.where("B").is(doc.getB());
Criteria cc = Criteria.where("C").is(doc.getC());
Criteria cd = Criteria.where("D").is(doc.getD());
c.orOperator(
new Criteria().andOperator(ca,cb,cc),
new Criteria().andOperator(ca,cb,cd),
new Criteria().andOperator(ca,cc,cd),
new Criteria().andOperator(cb,cc,cd)
);
Query query = new Query(c);
return operations.find(query, Document.class, "documents");
Currently in MongoDB we cannot do this directly, since we dont have any functionality supporting Permutation/Combination on the query parameters.
But we can simplify the query by breaking the condition into parts.
Use Aggregation pipeline
$project with records (A=a AND B=b) --> This will give the records which are having two conditions matching.(Our objective is to find the records which are having matches for 3 out of 4 or 4 out of 4 on the given condition)`
Next in the pipeline use OR condition (C=c OR D=d) to find the final set of records which yields our expected result.
Hope it Helps!
The way you have it you have to do all permutations in your query. You can use the aggregation framework to do this without permuting all combinations. And it is generic enough to do with any K. The downside is I think you need Mongodb 3.2+ and also Spring Data doesn't support these oparations yet: $filter $concatArrays
But you can do it pretty easy with the java driver.
[
{
$project:{
totalMatched:{
$size:{
$filter:{
input:{
$concatArrays:[ ["$A"], ["$B"], ["$C"],["$D"]]
},
as:"attr",
cond:{
$eq:["$$attr","a"]
}
}
}
}
}
},
{
$match:{
totalMatched:{ $gte:3 }
}
}
]
All you are doing is you are concatenating the values of all the fields you need to check in a single array. Then select a subset of those elements that are equal to the value you are looking for (or any condition you want for that matter) and finally getting the size of that array for each document.
Now all you need to do is to $match the documents that have a size of greater than or equal to what you want.
Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
"_id" : {
"clienta" : NumberLong(460011766),
"clientb" : NumberLong(2886729962)
},
"states" : [
[
"fixed", "fixed.rotated","fixed.rotated.off"
]
],
"VBPP" : [
244,
182,
184,
11,
299,
],
"PPF" : 72.4,
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
records_list = []
cursor = db.clientAgg.find({}, {'_id' : 0,
'states' : 1,
'VBPP' : 1,
'PPF': 1})
for record in cursor:
records_list.append(record)
for dicts in records_list:
dicts['states'] = len(dicts['states'])
dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing in 500k dictionaries in memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)
def iter():
for item in cursor:
yield item
l = []
for x in iter():
x['VBPP'] = np.var(x['VBPP'])
# Or
# db.clientAgg.update({'_id':x['_id']},{'$set':{'x.VBPS': somefunction as above }},upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
var collection = db.test_collection;
collection.find().forEach(function(doc) {
var b = 2 * doc.a;
collection.update({_id: doc._id}, {$set: {b: b}});
});
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
$project: {
a: '$a',
b: {$multiply: [2, '$a']}
}
}, {
$out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
My final thought on this, is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.
I seek to add a dictionary for sentences (that most are single words) but am undecided if I should organize them by words, such as:
{word:"bacon", en:"bacon", ro:"sunca", fr:"jambon"}
or by language:
{
en:{ bacon:bacon },
ro:{ bacon:sunca },
fr:{ bacon:jambon }
}
I realize both have pros and cons, are equally valid, but I am seeking the wisdom of those who met this problem before, made a choice, and are happy or regret to have made it, and of course, why.
Thank you.
The below representation is simple and elegant. But the document representation in mongodb (or most nosql databases for that matter) is heavily influenced by the usage pattern of the data.
{word:"bacon", en:"bacon", ro:"sunca", fr:"jambon"}
This representation has the below merits assuming you want to look-up the other language translation by passing in the word
Intitive
No duplication
You can have index on the word
Of the two options that you provide, the first one is better. But since there will be at most so many languages, in your case, I would actually go for arrays:
LANG = {en: 0, ro: 1, fr: 2}
DICT = [
[:ham, :sunca, :jambon],
[:egg, :ou, :œuf]
]
For translation, write eg:
def translate( word, from: nil, to: nil )
from = LANG[from]
to = LANG[to]
i = DICT.index { |entry| entry[from] == word }
return DICT[i][to]
end
translate( :egg, from: :en, to: :fr )
#=> :œuf
Note please, that although effort has been done to minimize the size of the dictionary, as for the speed, faster matching algorithms would be available, using eg. suffix trees.
I've read and searched about MongoDB's JSON-BSON constructions but I do not understand (could not find either) how to have nested data and how to query it.
What I'd like to learn is, if somebody wants to store array within an array as in:
id: x,
name: y,
other: z,
multipleArray: [
(lab1: A, lab2: B, lab3:C),
(lab1: AB, lab2: BB, lab3:CB)
(lab1: AC, lab2: BC, lab3:CC)
..
..
]
How to store such data and then retrieve some, a specific or all elements of "multipleArray" contents?
Any resource on the subject would also be highly appreciated.
Bryan had some great advice which you should heed.
Also, as Manoj said, what you actually have is an array of objects. The following might help you out a bit...
Lists are just ordered sequences: [1,2,3...] or [2,292,111]
The first element in the last example is 2, the second is 292... lists/arrays are denoted by square brackets [ ]
Objects map keys to values: { name: "Tyler", age: 26, fav_color: "green" }
name maps to "Tyler", age maps to 25, etc... and objects are denoted by braces { }
A document in mongodb is an object. So, like above, they map keys to values. Those values can be strings, numbers, arrays... or other even other (nested) objects)
So, lets take a look at your document. You have an object (document) that has keys id, name, other, and multipleArray. The value multiple array maps to is an array [ ] of Objects { }.
{
id: x,
name: y,
other: z,
multipleArray: [
{lab1: "A", lab2: "B", lab3:"C"},
{lab1: "AB", lab2: "BB", lab3:"CB"},
{lab1: "AC", lab2: "BC", lab3:"CC"}
]
}
MongoDB has this feature called multikeys, it basically takes the value you are querying for and tries to match it against every value in the array.
If you wanted to find the document where multipleArray contained the document {lab1: "A", lab2: "B", lab3: "C"}, you query like this: db.data.find({multipleArray: {lab1: "A", lab2: "B", lab3: "C"}})
I'm assuming x, y, and z are defined already.
There are more subtleties and complexities, but if you want to learn more read the documentation on the mongodb site here or get a book.
Your question is a bit generic and as such is difficult to give a good answer.
When modeling your data to be stored and queried using MongoDB, you should consider how you plan to actually use and query your data. Based on the answer to that, you should be able to come up with a good data structure for storing the data.
It would be good for you to to familiarize yourself with the MongoDB query methods (http://www.mongodb.org/display/DOCS/Querying) so you can understand the many ways to query data in MongoDB.
Whatever language you are using should have a nice library that should abstract away the low level details of storing and querying the data, but its still going to be important to know what query methods MongoDB supports.
In general, MongoDB queries let you "reach into" nested objects in a given document and that also includes arrays as well.