I would like to retrieve documents by the presence of a string in a nested array. For example, the data (representing a dependency parse of a sentence) looks like:
{'tuples': [['xcomp', 'multiply', 'using'],
['det', 'method', 'the'],
['nn', 'method', 'foil'],
['dobj', 'using', 'method']]}
The closest solution I've found assumes that ['nn', ...] sits at a fixed index (here index 2) of the tuples list-of-lists:
db.c.find({'tuples.2.0' : 'nn'})
Is there a way to relax the fixed position? The tuples (not their contents) can be in any order.
Secondly, it would be really great to be able to retrieve documents that have ['nn', 'method', X], meaning a noun "method" in their dependency parse.
Thank you!
Got it!
db.c.find({'tuples' : {$elemMatch : {$all : ['nn']}}})
db.c.find({'tuples' : {$elemMatch : {$all : ['nn','method']}}})
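For reference: $elemMatch applies its condition to each element of tuples (each element here being itself an array), and $all requires every listed string to appear somewhere within that element, regardless of position. If you also want to pin the positions (say 'nn' as the relation and 'method' in the second slot), a sketch that should work is to address positions by numeric key inside $elemMatch:
db.c.find({'tuples' : {$elemMatch : {'0' : 'nn', '1' : 'method'}}})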
Related
Literal Expression = PurchaseOrders.pledgedDocuments[valuation.value=62500]
Purchase Order Structure
PurchaseOrders:
[
{
"productId": "PURCHASE_ORDER_FINANCING",
"pledgedDocuments" : [{"valuation" : {"value" : "62500"} }]
}
]
The literal expression produces a null result.
However,
PurchaseOrders.pledgedDocuments[valuation = null]
returns all results!
What am I doing wrong?
I was able to solve it using the flatten function, but I don't know how it worked :(
In your original question it is not exactly clear to me what your end goal is, so I will try to provide some references.
value filtering
First, your PurchaseOrders -> pledgedDocuments -> valuation -> value appears to be a string, so in your original question filtering by
... [valuation.value=62500]
will not help you.
You'll need to filter with something more like: valuation.value="62500"
list projection
In your original question you are projecting on PurchaseOrders, which is a list, and accessing pledgedDocuments, which again is a list!
So when you do:
PurchaseOrders.pledgedDocuments (...)
you don't have a simple list; you have a list of lists: a list of all the lists of pledged documents.
final solution
I believe what you wanted is:
flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"]
Now let's do the exercise on paper to see what is actually happening.
First,
Let's focus on PurchaseOrders.pledgedDocuments.
You supply PurchaseOrders which is a LIST of POs,
and you project on pledgedDocuments.
What is that intermediate result?
Referencing your original question input value for POs, it is:
[
[{"valuation" : {"value" : "62500"} }]
]
notice how it is a list of lists?
With the first part of the expression, PurchaseOrders.pledgedDocuments, you have asked: for each PO, give me the list of pledged documents.
For example, if you supplied 3 POs, each having 2 documents, PurchaseOrders.pledgedDocuments would have produced a list of 3 elements, each element being a list of 2 documents.
Now,
With flatten(PurchaseOrders.pledgedDocuments) you achieve:
[{"valuation" : {"value" : "62500"} }]
So at this point you have a list containing all documents, regardless of which PO.
Now,
With flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"] the complete expression, you still achieve:
[{"valuation" : {"value" : "62500"} }]
Because you have asked, on the flattened list, to keep only those elements whose valuation.value equals the string "62500".
In other words, if you use this expression, what you achieve is:
from any PO, return the documents whose valuation value equals the string "62500", regardless of the PO the document belongs to.
I've been studying up on MongoDB and I understand that it is highly recommended that document structures are completely built out (pre-allocated) at the point of insert, so that future changes to that document do not require the document to be moved around on disk. Does this apply when using $addToSet or $push?
For example, say I have the following document:
{
"_id" : "rsMH4GxtduZZfxQrC",
"createdAt" : ISODate("2015-03-01T12:08:23.007Z"),
"market" : "LTC_CNY",
"type" : "recentTrades",
"data" : [
{
"date" : "1422168530",
"price" : 13.8,
"amount" : 0.203,
"tid" : "2435402",
"type" : "buy"
},
{
"date" : "1422168529",
"price" : 13.8,
"amount" : 0.594,
"tid" : "2435401",
"type" : "buy"
},
{
"date" : "1422168529",
"price" : 13.79,
"amount" : 0.594,
"tid" : "2435400",
"type" : "buy"
}
]
}
And I am using one of the following commands to add a new array of objects (newData) to the data field:
$addToSet to add to the end of the array:
Collection.update(
{ _id: 'rsMH4GxtduZZfxQrC' },
{
$addToSet: {
data: {
$each: newData
}
}
}
);
$push (with $position) to add to the front of the array:
Collection.update(
{ _id: 'rsMH4GxtduZZfxQrC' },
{
$push: {
data: {
$each: newData,
$position: 0
}
}
}
);
The data array in the document will grow due to new objects that were added from newData. So will this type of document update cause the document to be moved around on the disk?
For this particular system, the data array in these documents can grow to upwards of 75k objects, so if these documents are indeed being moved around on disk after every $addToSet or $push update, should the document be defined with 75k nulls (data: [null,null...null]) on insert, and then perhaps use $set to replace the values over time? Thanks!
I understand that it is highly recommended that document structures are completely built out (pre-allocated) at the point of insert, so that future changes to that document do not require the document to be moved around on disk. Does this apply when using $addToSet or $push?
It's recommended if it's feasible for the use case, which it usually isn't. Time series data is a notable exception. It doesn't really apply with $addToSet and $push because they tend to increase the size of the document by growing an array.
the data array in these documents can grow to upwards of 75k objects
Stop. Are you sure you want constantly growing arrays with tens of thousands of entries? Are you going to query wanting specific entries back? Are you going to index any fields in the array entries? You probably want to rethink your document structure. Maybe you want each data entry to be a separate document with fields like market, type, createdAt replicated in each? You wouldn't be worrying about document moves.
Why will the array grow to 75K entries? Can you do fewer entries per document? Is this time series data? It's great to be able to preallocate documents and do in-place updates with the mmap storage engine, but it's not feasible for every use case and it's not a requirement for MongoDB to perform well.
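If you did go the one-document-per-entry route, a hypothetical shape (field names borrowed from your sample; the trade's own type is renamed to tradeType here purely to avoid clashing with the document-level type) could be:
{
    "market" : "LTC_CNY",
    "type" : "recentTrades",
    "createdAt" : ISODate("2015-03-01T12:08:23.007Z"),
    "date" : "1422168530",
    "price" : 13.8,
    "amount" : 0.203,
    "tid" : "2435402",
    "tradeType" : "buy"
}
Each insert is then a small, fixed-size document and nothing ever grows in place, so document moves stop being a concern.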
should the document be defined with 75k nulls (data: [null,null...null]) on insert, and then perhaps use $set to replace the values over time?
No, this is not really helpful. The document size will be computed based on the BSON size of the null values in the array, so when you replace null with another type the size will increase and you'll get document rewrites anyway. You would need to preallocate the array with objects with all fields set to a default value for its type, e.g.
{
    "date" : ISODate("1970-01-01T00:00:00Z"), // use a date type instead of a string date
    "price" : 0,
    "amount" : 0,
    "tid" : "0000000", // assuming 7 character code - strings icky for default preallocation
    "type" : "none" // assuming it's "buy" or "sell", want a default as long as longest real values
}
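To make that concrete, here is a rough mongo-shell sketch of the preallocate-then-$set idea (the collection name db.trades and the 75000 figure are assumptions taken from your description, not a tested recipe):
// Build one placeholder entry at least as large (in BSON) as any real entry
var placeholder = {
    date : ISODate("1970-01-01T00:00:00Z"),
    price : 0,
    amount : 0,
    tid : "0000000",
    type : "none"
};
// Preallocate the whole array at insert time
var prealloc = [];
for (var i = 0; i < 75000; i++) { prealloc.push(placeholder); }
db.trades.insert({
    _id : "rsMH4GxtduZZfxQrC",
    createdAt : new Date(),
    market : "LTC_CNY",
    type : "recentTrades",
    data : prealloc
});
// Later, overwrite slots in place; the replacement is no larger than the
// placeholder, so the document should not have to move
db.trades.update(
    { _id : "rsMH4GxtduZZfxQrC" },
    { $set : { "data.0" : { date : new Date(1422168530 * 1000), price : 13.8, amount : 0.203, tid : "2435402", type : "buy" } } }
);
Keep in mind that 75k entries of this shape is already several megabytes per document, which is another reason to reconsider the structure as discussed above.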
MongoDB uses a power-of-two allocation strategy to store your documents, which means the record size is rounded up to the next power of two. Therefore, if your nested arrays don't grow the document beyond the power-of-two size it was originally allocated, MongoDB will not have to move the document.
See: http://docs.mongodb.org/manual/core/storage/
Bottom line here is that any "document growth" is pretty much always going to result in the "physical move" of the storage allocation unless you have "pre-allocated" by some means on the original document submission. Yes there is "power of two" allocation, but this does not always mean anything valid to your storage case.
The additional "catch" here is on "capped collections", where indeed the "hidden catch" is that such "pre-allocation" methods are likely not to be "replicated" to other members in a replica set if those instructions fall outside of the "oplog" period where the replica set entries are applied.
Growing any structure beyond its "initial allocation", or beyond what the general tricks that can be applied provide for, will result in that document being "moved" in storage space once it outgrows the space it was originally supplied with.
To ensure this does not happen, you always "pre-allocate" to the expected provisions of your data on the original creation, with the obvious caveat of the conditions already described.
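If you want to observe this behaviour rather than guess, two places to look under the mmapv1 engine (a hedged suggestion; field locations can vary by server version) are:
db.collection.stats().paddingFactor      // drifts above 1 as documents get moved
db.serverStatus().metrics.record.moves   // cumulative count of document moves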
Say I have a MongoDB collection of documents with only two fields - x and y - and one of them (say, x) has an index.
Will any of the following queries have better performance than the other?
Single-match query:
db.collection.aggregate({$match : {x : "value", y : "value"}})
Double-match query (indexed field matched first):
db.collection.aggregate({$match : {x : "value"}}, {$match : {y : "value"}})
Will any of the following queries have better performance than the other?
In a nutshell: no. The performance will be more or less the same, at least insofar as both of them will use the same index.
db.collection.aggregate({$match : {x : "value", y : "value"}})
This will use the index on {x:1} the same way that a regular find() on x and y would use it.
Double-match query (indexed field matched first):
db.collection.aggregate({$match : {x : "value"}}, {$match : {y : "value"}})
The first $match will use the index on x just like a find would.
In the first case the index is used to reduce the resulting set of documents that must be examined for a matching y value. In the second case the index is used to only pass through the pipeline the documents that match x, so the second stage has to examine them in memory to see whether they match y.
This is basically the same operation in both cases efficiency-wise.
The single match will have better performance since it can use a single index.
The double match is actually treated as a double $match, a.k.a. a $match within a $match; as such, an index is not actually used for the second $match.
This behaviour, however, has been changed in 2.5.4 (https://jira.mongodb.org/browse/SERVER-11184) so that multiple $match stages are coalesced into one call on the server. This is actually a bit of a bummer since it makes some queries that require a second, non-indexed part harder now :\.
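One way to check this on your own data is to ask for the plan of each pipeline (the explain option for aggregate is available from 2.6 onward) and compare the index usage:
db.collection.aggregate([{ $match : { x : "value", y : "value" } }], { explain : true })
db.collection.aggregate([{ $match : { x : "value" } }, { $match : { y : "value" } }], { explain : true })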
Currently I have an index on a geospatial field in one of my collections set up like this:
collection.ensureIndex({ loc : "2d" }, { min : -10000 , max : 10000,
bits : 32, unique : true });
However, I would like to include one more field in the index, so I can take advantage of covered index queries in one of my use cases. An ensureIndex with multiple fields (compound index) looks something like this:
collection.ensureIndex( { username : 1, password : 1, roles : 1} );
Question is - how do I write my first index spec with an additional field, so that I keep my min/max parameters? In other words, how to specify that the min/max/bits only apply to one of the index fields? My best guess so far is:
collection.ensureIndex({ loc : "2d", field2 : 1 },
{ min : -10000 , max : 10000 , bits : 32, unique : true });
But I have no confidence that this is working as it should!
UPDATE:
There is more info in the documentation here, but it still does not explicitly show how to specify the min/max in this case.
db.collection.getIndexes() should give you the parameters used to construct the index - the min/max/bits options will only ever apply to the "2d" field.
There's no way to see these parameters aside from getIndexes(), but you can easily verify that you aren't allowed to insert the same location/field pair, and that you aren't allowed to insert locs outside your bounds (but field2 can be anything). The bits setting is harder to verify directly, though you can indirectly verify by setting it very low and seeing that nearby points then trigger the unique key violations.
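A quick shell sanity check along those lines (collection name is a placeholder) might look like:
db.collection.getIndexes()                                // shows the key pattern plus min/max/bits
db.collection.insert({ loc : [1, 1], field2 : "a" })
db.collection.insert({ loc : [1, 1], field2 : "a" })      // expected: duplicate key error (unique on loc + field2)
db.collection.insert({ loc : [20000, 0], field2 : "b" })  // expected: error, point outside the min/max bounds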
After reading about MongoDB and Geospatial Indexing,
I was amazed that it does not support compound keys unless the 2d field comes first in the index.
I don't know if I would gain anything from it, but right now the MSSQL solution is just as slow/fast.
SELECT TOP 30 * FROM Villages WHERE SID = 10 ORDER BY (math to calc radius from the center point)
This works, but is slow because it is not smart enough to use an index, so it has to calculate the radius for all villages with that SID.
So in Mongo I wanted to create an index like {sid: 1, loc: "2d"} so I could filter out a lot from the start.
I'm not sure there are any solutions for this. I thought about creating a collection for each sid, since they don't share any information. But what are the disadvantages of this? Or is this how people do it?
Update
The maps are flat, from 800,800 to -800,-800; villages are placed from the center of the map outwards. There are about 300 different maps which are not related, so they could be in different collections, but I'm not sure about the overhead.
If more information is needed, please let me know.
What I have tried
> var res = db.Villages.find({sid: 464})
> db.Villages.find({loc: {$near: [50, 50]}, sid: {$in: res}})
error: { "$err" : "invalid query", "code" : 12580 }
>
Also tried this
db.Villages.find({loc: {$near: [50, 50]}, sid: {$in: db.Villages.find({sid: 464}, {sid: 1})}})
error: { "$err" : "invalid query", "code" : 12580 }
I'm not really sure what I'm doing wrong, but it's probably something about the syntax. Confused here.
As you stated already, MongoDB cannot accept location as a secondary key in a geo index; the 2d field has to come first in the index. So you are out of luck in changing the indexing pattern here.
But there is a workaround: instead of the compound geo index you wanted, you can create a separate index on sid plus a compound index with loc first and sid second
db.your_collection.ensureIndex({sid : 1})
db.your_collection.ensureIndex({loc : '2d',sid:1})
or two separate indexes on sid and loc
db.your_collection.ensureIndex({sid : 1})
db.your_collection.ensureIndex({loc : '2d'})
(I am not sure which of the two options is more efficient; you can try it yourself)
and you can make two different queries to get the results, filtered by sid first and by location next, kinda like this:
// First query: collect the sid values you want to filter on
var res_ids = db.your_collection.find({ sid : 10 }, { sid : 1, _id : 0 }).map(function (doc) { return doc.sid; });
// Second query: filter by location with $near, restricted to those sids
db.your_collection.find({ loc : { $near : [50, 50] }, sid : { $in : res_ids } })
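Alternatively, if you keep the compound {loc: '2d', sid: 1} index from the first option, a single query may be enough (a sketch mirroring the TOP 30 of your SQL example):
db.your_collection.find({ loc : { $near : [50, 50] }, sid : 10 }).limit(30)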