I am designing a generic notification subscription system where users can specify a compound rule at subscription time in the form of a MongoDB query, or more generally a JSON query. The subscription data is stored in a MongoDB collection. For example,
{ "userId": 1, "rule": {"p1": "a"} }
{ "userId": 2, "rule": {"p1": "a", "p2": "b"} }
{ "userId": 3, "rule": {"p3": {$gt: 3} } }
Later, when an event in the form of a JSON object, such as the following, arrives, I want to find all user rules the event matches:
{"p1": "a", "p3": 4}
The above event should match rules specified by userId 1 and 3 in the example. The event object doesn't have to be stored in MongoDB.
I can probably meet the requirement by writing a loop at the application layer, but for efficiency I really want to implement it at the db layer, preferably allowing distributed (sharded) execution due to volume and latency requirements.
Is it achievable? Any help is appreciated. In fact, I am open to other NoSQL dbs as long as they support a dynamic event schema and there is a way to specify compound rules.
What you are trying to achieve is not possible, at least in MongoDB.
If you reason about how a query engine works, you will realize that this does not have a straightforward solution.
In high-level terms, the engine generates a condition object from your query, which is then evaluated against each document in the set, producing a boolean value that determines whether the document belongs to the result set.
In your case you want to go the other way round: generate a condition object from each document and then apply it to something (e.g. an object) that you supply.
Even if it were possible, the cost of doing this on the DB would be too high, as it would require compiling an expression function for each object and executing it, and there would be no way to optimize the execution of the query.
It is more reasonable to do that outside the database, where you could have the expression functions already created.
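A minimal mongo-shell sketch of that application-side approach, assuming a subscriptions collection shaped like the examples above and a throwaway scratch collection (eventScratch here, purely an illustrative name), and assuming the operator-bearing rules can be stored at all (older releases reject $-prefixed field names, as the next answer notes). It lets the server's own matcher evaluate each stored rule against a single event, one rule at a time:
// Sketch only: evaluate each stored rule against one incoming event by
// writing the event into a one-document scratch collection and running
// the rule as an ordinary query against it (collection names assumed).
function matchSubscribers(evt) {
    db.eventScratch.drop();
    db.eventScratch.insert(evt);

    var matchedUserIds = [];
    db.subscriptions.find().forEach(function(sub) {
        if (db.eventScratch.findOne(sub.rule) !== null) {
            matchedUserIds.push(sub.userId);
        }
    });
    return matchedUserIds;
}

// Should return [1, 3] for the sample rules above
matchSubscribers({ "p1": "a", "p3": 4 });
This is still the per-rule loop from the question, just placed where a query matcher already exists, so it trades the wished-for single query for N small lookups against a one-document collection.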
You can't store comparison query operators in a Mongo document, but you can do this:
{ "userId": 1, "rule": {"p1": "a"} }
{ "userId": 2, "rule": {"p1": "a", "p2": "b"} }
{ "userId": 3, "rule": {"p3": {"value": 3, "operator":"gt"} } }
You store value AND OPERATOR, in string form, and you can make a query like this:
db.test.find({"rule.p3.comparator":"gt", "rule.p3.value":{$lt:4}})
Notice, if your "operator" is "gt", you must use $lt (the opposite comparison operator) in the query
Your complete example is something like this:
db.test.find({$or:[{"rule.p3.operator":"gt", "rule.p3.value":{$lt:4}}, {"rule.p1":"a"}]})
This query matches userId 1 and 3 as you want. (Note that the "rule.p1" clause also matches userId 2, whose rule additionally requires p2, so rules that reference fields the event does not carry still need an extra check.)
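A hedged sketch of how such a reversed query could be composed from an incoming event under this value-plus-operator encoding (the collection name test follows the example above; the helper and its operator table are illustrative assumptions):
// Sketch: build the reversed $or query from an event, assuming rules are
// stored either as plain equality values or as { value, operator } pairs.
var OPPOSITE = { "gt": "$lt", "gte": "$lte", "lt": "$gt", "lte": "$gte", "eq": "$eq" };

function buildRuleQuery(evt) {
    var clauses = [];
    Object.keys(evt).forEach(function(field) {
        // Plain equality rules, e.g. { "rule.p1": "a" }
        var eq = {};
        eq["rule." + field] = evt[field];
        clauses.push(eq);

        // Operator-encoded rules: match the stored operator and invert it
        Object.keys(OPPOSITE).forEach(function(op) {
            var clause = {};
            clause["rule." + field + ".operator"] = op;
            clause["rule." + field + ".value"] = {};
            clause["rule." + field + ".value"][OPPOSITE[op]] = evt[field];
            clauses.push(clause);
        });
    });
    return { "$or": clauses };
}

db.test.find(buildRuleQuery({ "p1": "a", "p3": 4 }));
Like the hand-written $or above, this still matches rules whose other fields are absent from the event (userId 2 here), so a final per-rule verification may be needed.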
Update: the following solution doesn't work. The problem with MongoDB is that it doesn't use Node.js to run map-reduce JavaScript, nor does it support any package manager, so it's hard to use any third-party libraries.
My own proposed solution, which hasn't been confirmed:
Compose the query in json-query syntax
Upon arrival of an event, call MongoDB's mapReduce on the user rules collection, invoking jsonQuery in the mapper
var jsonQuery = require('json-query')
var mapper = function(evt) {
    function map() {
        // Evaluate the stored rule against the event captured in the closure
        if (jsonQuery(this.rule, {data: evt}).value) {
            emit(this.userId, 1);
        }
    }
    return map;
};
db.userRules.mapReduce(mapper(evt), ...);
The reason to compose the query in json-query syntax instead of MongoDB query syntax is that only json-query offers a jsonQuery method that tries to match one rule against one object. For the above code to meet the requirements in the question, the following assumptions have to hold:
MongoDB can execute mapReduce on distributed nodes
In mapReduce I can use an external library such as json-query, which implies the library code has to be distributed to all MongoDB nodes, perhaps captured as part of the closure.
Related
We need to cache records for a service with a terrible API.
This service provides an API to query data about our employees, but it does not tell us whether employees are new or have been updated, nor can we filter our queries on that information.
Our proposed solution to the problems this creates for us is to periodically (e.g. every 15 minutes) query all our employee data and upsert it into a Mongo database. Then, when we write to MongoDB, we would like to include an additional property which indicates whether the record is new or has any changes since the last time it was upserted (obviously not including the field we are using for the timestamp).
The idea is that, instead of querying the source directly (which we can't filter by such timestamps), we would query our cache, which would include said timestamp, and use it as a filter.
(Ideally, we'd like to write this in C# using the MongoDB driver, but more important right now is whether we can do this in an upsert call or whether we'd need to load all the records into memory, do comparisons, and then add the timestamps before upserting them...)
There might be a way of doing that, but how efficient it is remains to be seen. The update command in MongoDB can take an aggregation pipeline to perform an update operation. We can use the $addFields stage to add a new field denoting the update status, and $function to compute its value. A short example is:
db.collection.update(
  { "key": 1 },
  [
    {
      "$addFields": {
        "changed": {
          "$function": {
            "lang": "js",
            "args": [
              "$$ROOT",
              { "key": 1, "data": "somedata" }
            ],
            "body": "function(originalDoc, newDoc) { return JSON.stringify(originalDoc) !== JSON.stringify(newDoc) }"
          }
        }
      }
    }
  ],
  { "upsert": true }
)
Some points to consider here are:
If the order of fields differs between the old and new versions of the doc, JSON.stringify will produce different strings, so the comparison will report a change even when the content is identical (a sketch of an order-insensitive comparison follows these points).
The function specified in $function runs server-side, so ideally it needs to be lightweight. If a large number of users get upserted, it may become a bottleneck.
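A hedged sketch of such an order-insensitive comparison, which recursively sorts object keys before stringifying; it would have to be inlined into the $function body string, and the helper name is purely illustrative:
// Sketch only: canonical() produces the same string regardless of key order,
// so it could replace the plain JSON.stringify comparison above.
function canonical(value) {
    if (Array.isArray(value)) {
        return "[" + value.map(canonical).join(",") + "]";
    }
    if (value !== null && typeof value === "object") {
        return "{" + Object.keys(value).sort().map(function(k) {
            return JSON.stringify(k) + ":" + canonical(value[k]);
        }).join(",") + "}";
    }
    return JSON.stringify(value);
}

// e.g. canonical({a: 1, b: 2}) === canonical({b: 2, a: 1})  // true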
My main object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
...
"intervalAbsenceDates" : [
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a06560fd7047318463b")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184467")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184468")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184469")),
An embedded ScheduleIntervalContainer object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318463b"),
"end" : ISODate("2022-08-23T07:06:00Z"),
"available" : true,
"confirmation" : true,
"start" : ISODate("2022-08-19T09:33:00Z")
}
Now I want to query all ScheduleIntervalContainers whose start and end are in a range.
I have tried a lot, but I cannot even query one ScheduleIntervalContainer by id.
This is my approach:
db.InstitutionUserConnection.find( {
"intervalAbsenceDates" : {
"$ref" : "ScheduleIntervalContainer",
"$id" : ObjectId("56eb0a05560fd7047318446d")
}
})
Could anyone give me a hint on how to query all ScheduleIntervalContainers which have start and end in a time range?
Using DBRef is fraught with problems and its usage is somewhat "antiquated", as it was more of a "shoehorn" solution to requests for a mechanism to reference external collection data than a really well-thought-out solution.
The general recommendation is to not use DBRef and rather simply use a plain ObjectId or other unique identifier and resolve the actual "collection" or even "database container" with other information, being either another standard property stored in the document or just a plain external reference in your code which identifies the target collection.
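For illustration only (the field naming the target collection is made up here, not part of the question's schema), such a reference might simply be stored as:
{
    "_id" : ObjectId("56eb0a06560fd7047318465a"),
    "intervalAbsenceDates" : [
        ObjectId("56eb0a06560fd7047318463b"),
        ObjectId("56eb0a05560fd70473184467")
    ],
    "intervalAbsenceCollection" : "ScheduleIntervalContainer"
}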
The Basic BSON Problem
A good reason for this is that people often make the common mistake (just like you have) of interpreting the "serialized" output from a stored object containing a DBRef as actual "properties" present on the object. This is not true; in fact the object has its own BSON type, just like Date and ObjectId. The only time that serialized form is valid is when used with a "strict mode" JSON parser, which would yield the actual DBRef object.
The manual itself is not particularly helpful with that fact and says "DBRefs have the following fields". This is misleading, since those properties are not actually available as fields for query and would not be exposed other than through the object inspection available to JavaScript processing in $where or mapReduce, and neither of those is a good approach for "query" purposes.
So those properties are not available for query, but you can of course specify the BSON form directly from your API. Such as:
db.InstitutionUserConnection.find({
"intervalAbsenceDates": DBRef(
"ScheduleIntervalContainer",
ObjectId("56eb0a05560fd70473184469")
)
})
This resolves and matches correctly, since the correct BSON form was sent in the query and the element will actually match.
From this principle we can then move on to the concept of "ranges" as asked in the question.
MongoDB does not "really" do Joins
Until recent releases ( MongoDB 3.2.x series ) the statement was actually "MongoDB does not do joins", and this has been a general distinction in design philosophy away from relational databases.
The general mantra "has" been that "joins are costly" and therefore do not scale well in distributed data systems such as what MongoDB is principally designed for.
So if you are asking to reference a property in the document which the DBRef refers to, then you are basically out of luck. The only possible way of filtering results based on an external property in this case would be:
Look at all the data in the master collection, either loaded to query in whole or processed individually
For each retrieved document, lookup and expand the DBRef values to their target collection data.
Filter out documents whose expanded data from external references do not actually meet the conditions.
This means all the "expansion" and "filtering" must take place on the "client" to the database rather than on the server itself. There simply is no mechanism to do this, so you end up pulling a lot of data over the network connection.
With a DBRef in place this is simply not possible to perform on the server, even with modern releases. Again it's the same BSON type problem, since the "source" contains DBRef types and the "target" contains ObjectId types.
If however you can simply live with the fact that your "range" is looking at the "creation date" data that would be inherently present in any ObjectId, then there is of course another approach that does not involve a "join".
Filtering on ObjectId "range"
Every ObjectId starts with 4 bytes that represent the current timestamp value (excluding milliseconds) at the time the ObjectId was created. This is generally a good indicator of the time of "insertion" for the document in question.
From this you can determine that the "created date" of the documents referenced with DBRef is approximately equal to that portion of the ObjectId value used by that document in the target collection. This allows you to construct a "range" of ObjectId values that fall between the given dates. As a consequence, you can then construct DBRef BSON objects that will work with range operators:
// Define start and end dates to query
var dateStart = new Date("2016-03-17T19:48:21Z"), // equal to 56eb0a05
dateEnd = new Date("2016-03-17T19:48:25Z"); // equal to 56eb0a09
// Convert to hex and pad to ObjectId length
var startRange = new ObjectId(
        ( dateStart.valueOf() / 1000 ).toString(16) + "0000000000000000"
    ),
    // Yields ObjectId("56eb0a050000000000000000")
    endRange = new ObjectId(
        ( dateEnd.valueOf() / 1000 ).toString(16) + "ffffffffffffffff"
    );
    // Yields ObjectId("56eb0a09ffffffffffffffff")
// Now query with constructed DBRef values
db.InstitutionUserConnection.find({
"intervalAbsenceDates": {
"$elemMatch": {
"$gte": DBRef("ScheduleIntervalContainer",startRange),
"$lt": DBRef("ScheduleIntervalContainer",endRange),
}
}
})
So as long as "created" is what you are looking for, then that method should suffice for selecting the matching parents without first expanding the DBRef values in the array for further inpection.
The Reverse Case Lookup
Of course the other case here is to simply query the "joined" collection first and then look for documents in the "master" collection that contain the matching ObjectId values within a DBRef. This does of course mean multiple queries being issued, but it does avoid expanding every DBRef just to match the related properties:
// Create array of matching DBRef values
var refs = db.ScheduleIntervalContainer.find({
"start" { "$lte": targetDate },
"end": { "$gte": targetDate }
}).map(function(doc) {
return DBRef("ScheduleIntervalContainer",doc._id)
});
// Find documents that match the DBRef's within the array
db.InstitutionUserConnection.find({
"intervalAbsenceDates": { "$in": refs }
})
The practicality of this varies with the number of matches from the related collection, which determines the size of the array passed to $in, but it does actually yield the desired result.
Actually doing Joins
I mentioned earlier that "modern" MongoDB releases now have an approach to "joining" data from different collections. This is the $lookup aggregation pipeline operator.
But while this can be used to "join" data, the current usage of DBRef does not work here. I also mentioned earlier that the basic problem is that the data in the array is a DBRef, while the data in the referenced collection is instead an ObjectId.
So if you wanted to use a $lookup approach, then you would first need to use plain ObjectId values in place of the existing DBRef values:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
"intervalAbsenceDates" : [
ObjectId("56eb0a06560fd7047318463b"),
ObjectId("56eb0a05560fd70473184467"),
ObjectId("56eb0a05560fd70473184468"),
ObjectId("56eb0a05560fd70473184469")
]
}
With data in that structure you could then use $lookup and other aggregation pipeline methods to return just the documents that actually match the value of a related property, i.e. "end" within the related object:
db.InstitutionUserConnection.aggregate([
// Presently you need to unwind the array first
{ "$unwind": "$intervalAbsenceDates" },
// Then $lookup to get a resulting array of matches for each member
{ "$lookup": {
"from": "ScheduleIntervalContainer",
"localField": "intervalAbsenceDates",
"foreignField": "_id",
"as": "absenceDates"
}},
// unwind the array result field as well
{ "$unwind": "$absenceDates" },
// Now reform the documents
{ "$group": {
"_id": "$_id",
"intervalAbsenceDates": { "$push": "$absenceDates" }
}},
// Then query on the "end" property for the range
{ "$match": {
"intervalAbsenceDates": {
"$elemMatch": {
"end": {
"$gte": new Date("2016-03-23"),
"$lt": new Date("2016-03-24")
}
}
}
}}
])
The current behaviour of $lookup is that you cannot directly process an array property in the document, so the procedure shown in "$lookup on ObjectId's in an array" is used to replace the current array with the expanded objects from the other collection.
Once the operations here actually produce a document that now has that related data embedded, it's a straightforward process of looking at the properties of the documents within the array to see if they match the query conditions.
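As a side note, and as an assumption about later releases rather than part of the workflow above: from around MongoDB 3.4, $lookup matches the elements of an array-valued localField directly, so the initial $unwind/$group reshaping may no longer be needed and a sketch like the following could suffice:
// Possible simplification on MongoDB 3.4+ where $lookup matches array elements
db.InstitutionUserConnection.aggregate([
    { "$lookup": {
        "from": "ScheduleIntervalContainer",
        "localField": "intervalAbsenceDates",
        "foreignField": "_id",
        "as": "absenceDates"
    }},
    { "$match": {
        "absenceDates": {
            "$elemMatch": {
                "end": { "$gte": new Date("2016-03-23"), "$lt": new Date("2016-03-24") }
            }
        }
    }}
])
This still presumes the array holds plain ObjectId values rather than DBRef, for the reasons already given.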
Conclusion
This all should show that DBRef is not a good idea for storing references. Whilst it is possible, as demonstrated, to work around the problem using the ObjectId values, you generally want to have a plain ObjectId or other key value as the reference and resolve them by other means. And even if the workaround is sufficient, it works just the same with plain ObjectId values or anything else that presents a natural range.
When it comes to using the "values" of referenced properties in such a "join", then of course regardless of using DBRef or another value, it is not possible without $lookup to use those values in query conditions on the server. All data would first need to be loaded to the client and then resolved with additional queries to the database before those properties could be inspected for filtering.
Since the mechanism of $lookup will in fact result in a form that looks exactly as if you had "embedded" the data in the first place, "embedding" is most often the correct approach, since the data is already present in the source collection and available for query.
There is a lot of "scare media" around regarding the BSON limit of 16MB, saying this is why you keep data in another collection. Sometimes this does indeed apply, but most of the time it does not. After all, 16MB is really quite a large amount of data, and more than most would actually use in general applications.
Citation from MongoDB: The Definitive Guide
To give you an idea of how much 16MB is, the entire text of War and Peace is just 3.14MB.
To examine the referenced content in queries you need to get to an "embedded form" anyway, and it is arguable that if you can store an array of DBRef or ObjectId values as embedded data, then storing all of the content they actually point to is not really that much more of a stretch.
The general lesson is that you should be designing based on the actual usage patterns your application applies to the data. If you are querying "related data" all of the time, then it makes most sense to keep that data all in one collection. Of course other factors apply, but always keep in mind the trade-off in performance considerations of what you are doing.
I have two collections viz. whitelist (id, count, expiry) and blacklist (id).
Now I would like to create an index such that when count >= 200, a JS function is called which removes the document from the whitelist and adds the id to the blacklist.
So can I do this in Mongo using db.collection.createIndex({"count": 1}, ???);
Or do I need to write a daemon to scan the entire collection? Or is there a better method for this?
You seem to be asking for what in a SQL relational database we would call a "trigger", which is something completely different from an "index" even in that world.
In the NoSQL world typically, and especially with MongoDB, that sort of "server logic" is relegated to the "client" code operations rather than the server. Think of it as another part of the "scalability" philosophy of these products, where certain functions like "triggers" are taken away due to the stance that they "cost" a lot with distributed data.
So in order to do what you want, you do it in "code" instead of defining a database "trigger". The process is simple enough, via .findAndModify() and other wrapping variants available to language APIs:
// Increment while below 200 and return the modified document
var doc = db.whitelist.findAndModify({
    "query": { "_id": myId, "count": { "$lt": 200 } },
    "update": { "$inc": { "count": 1 } },
    "new": true
});

// Then move the id to the blacklist where the count meets the condition
if ( doc && doc.hasOwnProperty("count") ) {
    if ( doc.count >= 200 ) {
        db.blacklist.insert({ "_id": myId });
        db.whitelist.remove({ "_id": myId });
    }
}
Be careful with the actual language API method variant, as the structure typically differs from the "query/update" keys provided in the shell method.
The basic principles remain the same. Modify and fetch, then move to the other collection if your conditions are met. But it is "two" trips to the server, and there is no way to make the server "trigger" when such a condition is met by itself.
db.whitelist.insert(doc);

if (db.whitelist.find(criterion).count() >= 200) {
    var bulkRemove = db.whitelist.initializeUnorderedBulkOp();
    var bulkInsert = db.blacklist.initializeUnorderedBulkOp();
    db.whitelist.find(criterion).forEach(
        function(doc) {
            bulkInsert.insert({ _id: doc._id });
            bulkRemove.find({ _id: doc._id }).removeOne();
        }
    );
    bulkInsert.execute();
    bulkRemove.execute();
}
First, you insert the document as usual. Since criterion uses an index, the if clause should be evaluated quickly and efficiently.
In case we have 200 or more documents matching that criterion, we use bulk operations to insert the ids into the blacklist and remove the documents from the whitelist, which are executed as bulk batches.
The problem with only writing the _id to the blacklist is that you need to check whether the criterion for being blacklisted is matched, so the _id needs to contain that criterion.
A better solution IMHO is to flag entries of a single collection using a field named blacklisted for individual entries, or to use the aggregation framework to find blacklisted documents and write them to a collection using the $out pipeline stage. Sadly, you didn't give example data or a proper description of your use case, so you get a rather unspecific answer.
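A hedged sketch of that aggregation variant, assuming a single whitelist collection with a count field and a threshold of 200 (the collection and field names are taken from the question, the pipeline itself is an assumption):
// Illustrative only: collect the _ids of documents whose count has reached the
// threshold and write them out to a "blacklist" collection.
db.whitelist.aggregate([
    { "$match": { "count": { "$gte": 200 } } },
    { "$project": { "_id": 1 } },
    { "$out": "blacklist" }
])
Note that $out replaces the target collection on each run, so this fits a periodic batch job rather than an incremental update.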
I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
# print item
items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
{ "$match" :
{
"basePath" : pathRE
}
},
# Group the keys
{"$group":
{
"_id": "$basePath"
}
},
# Output to a collection "tmp_unique_coll"
{"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
retItems += 1
items.add(item["_id"])
self.log.info("Recieved items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong database type for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
And this solution is finally a solution for 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because the only restriction is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently (and even more so in the upcoming release).
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result that is over 16MB into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
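A sketch of that command form, reusing the collection and pipeline from the example above (whether you invoke it via runCommand or a driver helper is up to you):
// 2.6-style runCommand form of the same pipeline with allowDiskUse enabled
db.runCommand({
    "aggregate": "collection",
    "pipeline": [
        { "$group": { "_id": "$basePath", "count": { "$sum": 1 } } },
        { "$sort": { "count": 1 } },
        { "$out": "output" }
    ],
    "allowDiskUse": true
})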
The main point here is that this is nearly live for production. And the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
A mapReduce, in the current version of Mongo, would avoid the problem of the results exceeding 16MB.
map = function() {
    if (this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit:
    // emit(this.basePath, 1);
};

reduce = function(key, values) {
    return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, as you'll need to store the results to prevent an error, use the out option, which specifies a destination collection:
db.yourCollectionName.mapReduce(
map,
reduce,
{ out: "distinctMR" }
)
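The distinct values can then be read back from the output collection with an ordinary cursor; each output document has the value as _id and the total count as value:
// Each document in the output collection looks like { "_id": "foo", "value": 3 }
db.distinctMR.find().forEach(function(doc) {
    print(doc._id + " : " + doc.value);
})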
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
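A hedged mongo-shell sketch of that walk (the collection and field names follow the earlier examples, and an index on basePath is assumed so that each step is a covered, index-only lookup):
// Illustrative: walk the distinct basePath values one query at a time,
// using $gt on the last value seen and projecting only the indexed field.
var last = null;
var distinctValues = [];
while (true) {
    var query = (last === null) ? {} : { "basePath": { "$gt": last } };
    var cursor = db.collection.find(query, { "basePath": 1, "_id": 0 })
                   .sort({ "basePath": 1 })
                   .limit(1);
    if (!cursor.hasNext()) break;
    last = cursor.next().basePath;
    distinctValues.push(last);
}
Each iteration is a single index lookup, so the total cost grows with the number of distinct values rather than the number of documents.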
I'm using PyMongo to fetch data from MongoDB. All documents in the collection look like the structure below:
{
"_id" : ObjectId("50755d055a953d6e7b1699b6"),
"actor":
{
"languages": ["nl"]
},
"language":
{
"value": "nl"
}
}
I'm trying to fetch all the conversations where the property language.value is inside the property actor.languages.
At the moment I know how to look for all conversations with a constant value inside actor.languages (e.g. all conversations with en inside actor.languages).
But I'm stuck on how to do the same comparison with a variable value (language.value) inside the current document.
Any help is welcome, thanks in advance!
db.testcoll.find({$where:"this.actor.languages.indexOf(this.language.value) >= 0"})
You could use a $where provided your query set is small, but at any real size you could start seeing problems, especially since this query seems like one that needs to run in real time on a page, and the JS engine is single-threaded, among other problems.
I would actually consider that a better way in this case is through the client side. It is quite straightforward: pull out records based on one of the values, then iterate and test the other (i.e. pull out based on language.value being nl and test actor.languages for that value).
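A sketch of that client-side pass in the shell (the same idea applies from PyMongo); the collection name follows the $where example above:
// Pull candidates on one side of the comparison, then test the other side client-side
db.testcoll.find({ "language.value": "nl" }).forEach(function(doc) {
    if (doc.actor && Array.isArray(doc.actor.languages) &&
        doc.actor.languages.indexOf(doc.language.value) !== -1) {
        printjson(doc);
    }
})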
I would imagine you might be able to do this with the aggregation framework; however, at the moment you cannot use computed fields within $match. I would imagine it would look something like this:
{$project:
    {language_value: "$language.value", languages: "$actor.languages"}
}, {$match: {languages: {$in: "$language_value"}}}
That is, if you could use computed fields in $match. But there might be a way.
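For what it's worth, and as a note about later releases rather than what was available at the time: from MongoDB 3.6 the $expr operator allows exactly this kind of same-document comparison in a plain find, so something along these lines should work without $where:
// MongoDB 3.6+: compare two fields of the same document without JavaScript
db.testcoll.find({
    "$expr": {
        "$in": [ "$language.value", { "$ifNull": [ "$actor.languages", [] ] } ]
    }
})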