Custom index comparator in MongoDB

I'm working with a dataset composed of probabilistically encrypted elements that are indistinguishable from random samples. This way, sequential encryptions of the same number result in different ciphertexts. However, these are still comparable through a special function that applies algorithms like SHA256 to compare two ciphertexts.
I want to add a list of the described ciphertexts to a MongoDB database and index it using a tree-based structure (e.g. an AVL tree). I can't simply apply the database's default indexing because, as described, the records must be compared using the special function.
An example: suppose I have a database db and a collection c composed of documents of the following type:
{
    "_id": ObjectId,
    "r": string
}
Moreover, let F(int,string,string) be the following function:
F(h,l,r) = ( SHA256(l | r) + h ) % 3
where the operator | denotes standard string concatenation.
I want to execute the following query in an efficient way, such as in a collection with some suitable indexing:
db.c.find( { F(h,l,r) :{ $eq: 0 } } )
for h and l chosen arbitrarily, but not as constants. That is: suppose I want to find all records that satisfy F(h1, l1, r) = 0 for some pair (h1, l1). Later, I want to do the same using a pair (h2, l2) such that h1 != h2 and l1 != l2. h and l may assume any value.
How can I do that?

You can execute this query with the $where operator, but that way it can't use an index, so query performance depends on the size of your dataset.
db.c.find({$where: function() { return F(1, "bb", this.r) == 0; }})
Before executing the code above, you need to store your function F on the MongoDB server:
db.system.js.save({
    _id: "F",
    value: function(h, l, r) {
        // the body of function F
    }
})
Links:
store javascript function on server

I've tried a solution that stores the result of the function in your collection, so I changed the schema, like below:
{
    "_id": ObjectId,
    "r": {
        "_key": F(H, L, value),
        "value": String
    }
}
The field r._key is the value of F(h, l, r) with constant h and l, and the field r.value is the original r field.
So you can create an index on the field r._key, and your query condition becomes:
db.c.find( { "r._key" : 0 } )

Related

Selecting data from MongoDB where K of N criteria are met

I have documents with four fields: A, B, C, D. Now I need to find documents where at least three fields match. For example:
Query: A=a, B=b, C=c, D=d
Returned documents:
a,b,c,d (four of four met)
a,b,c (three of four met)
a,b,d (another three of four met)
a,c,d (another three of four met)
b,c,d (another three of four met)
So far I created something like:
`(A=a AND B=b AND C=c)
OR (A=a AND B=b AND D=d)
OR (A=a AND C=c AND D=d)
OR (B=b AND C=c AND D=d)`
But this is ugly and error prone.
Is there a better way to achieve it? Also, query performance matters.
I'm using Spring Data but I believe it does not matter. My current code:
Criteria c = new Criteria();
Criteria ca = Criteria.where("A").is(doc.getA());
Criteria cb = Criteria.where("B").is(doc.getB());
Criteria cc = Criteria.where("C").is(doc.getC());
Criteria cd = Criteria.where("D").is(doc.getD());
c.orOperator(
    new Criteria().andOperator(ca, cb, cc),
    new Criteria().andOperator(ca, cb, cd),
    new Criteria().andOperator(ca, cc, cd),
    new Criteria().andOperator(cb, cc, cd)
);
Query query = new Query(c);
return operations.find(query, Document.class, "documents");
Currently we cannot do this directly in MongoDB, since there is no functionality supporting permutations/combinations of the query parameters.
But we can simplify the query by breaking the condition into parts.
Use an aggregation pipeline:
$match on (A=a AND B=b) --> this gives the records matching the first two conditions (our objective is to find the records matching 3 out of 4 or 4 out of 4 of the given conditions).
Next in the pipeline, use an OR condition (C=c OR D=d) to find the final set of records, which yields the expected result.
Hope it Helps!
The way you have it, you have to spell out all the combinations in your query. You can use the aggregation framework to do this without enumerating every combination, and it is generic enough to work with any K. The downside is that I think you need MongoDB 3.2+, and Spring Data doesn't support these operators yet: $filter, $concatArrays.
But you can do it pretty easy with the java driver.
[
    {
        $project: {
            totalMatched: {
                $size: {
                    $filter: {
                        input: {
                            $concatArrays: [ ["$A"], ["$B"], ["$C"], ["$D"] ]
                        },
                        as: "attr",
                        cond: {
                            $eq: ["$$attr", "a"]
                        }
                    }
                }
            }
        }
    },
    {
        $match: {
            totalMatched: { $gte: 3 }
        }
    }
]
All you are doing is concatenating the values of all the fields you need to check into a single array, then selecting the subset of those elements that are equal to the value you are looking for (or that satisfy any condition you want, for that matter), and finally getting the size of that array for each document.
Now all you need to do is to $match the documents that have a size of greater than or equal to what you want.
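If the four criteria have different target values (A=a, B=b, C=c, D=d, as in the question), a variant of the same idea that adds up one $cond per field may work; a minimal sketch for the shell, assuming the collection is called documents and literal target values:
db.documents.aggregate([
    {
        $project: {
            totalMatched: {
                $add: [
                    { $cond: [{ $eq: ["$A", "a"] }, 1, 0] },
                    { $cond: [{ $eq: ["$B", "b"] }, 1, 0] },
                    { $cond: [{ $eq: ["$C", "c"] }, 1, 0] },
                    { $cond: [{ $eq: ["$D", "d"] }, 1, 0] }
                ]
            }
        }
    },
    { $match: { totalMatched: { $gte: 3 } } }
])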

mongodb query opposite of value in range

I have a MongoDB collection with data like this:
{
    start: "5.3",
    end: "8.10",
    owner: "guy1"
}
{
    start: "9.0",
    end: "14.5",
    owner: "guy2"
}
Suppose I wish to know who got element 6.2.
We can see that guy1 has got element 6.2 because he own element 5.3 to element 8.10.
This is opposite to
db.collection.find( { field: { $gt: value1, $lt: value2 } } );
Here the fields specify the start and end of a range, and the supplied value falls inside that range.
Looking for the largest value of start which is smaller than the required value.
How to query this?
Negating the logic, this should work:
db.collection.find( { start : { $lte : 6.2 }, end : { $gt : 6.2 } })
The problem is that you need to ensure you have no overlapping intervals, which can be tricky, but that is a problem at the application level. Also, please make sure you're not storing doubles as strings, otherwise the $gt/$lt queries will not work as expected, i.e. make sure your data looks like this instead:
{
    start: 9.0,
    end: 14.5,
    owner: "guy2"
}
As pointed out by Michael, you should also make sure your interval logic works out, because there are different ways interval inclusion can be defined. In math notation it is ) vs. ], but the discussion gets more complicated with floating point numbers, where there are limits on the smallest representable epsilon, which depends on the value of the number itself...
If I understood you correctly, you can use the $not operator (put your original query inside $not), or, knowing that
not (x > a and x < b) is equivalent to x <= a or x >= b,
you can rewrite it as a $or query.
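For example, a sketch of the $or rewrite, keeping the question's field/value1/value2 placeholders:
db.collection.find({
    $or: [
        { field: { $lte: value1 } },   // at or below the lower bound
        { field: { $gte: value2 } }    // at or above the upper bound
    ]
})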

Searching a collection on parts of values

Suppose I have a collection C with items that have a property X. Suppose the values of X are themselves objects, like {a: 1, b: 2, c: 3}. Can I find (or findOne) on C, asking for items whose X property has a value whose a property == 1? I'd like to write C.find({X.a: 1}). Or maybe
C.find({X: function(value) {
    return value.a == 1;
}});
Your pseudocode just needs quotes around the property for Mongo to understand it: C.find({'X.a': 1}) would return any document where X.a equals 1.
The key words if you want to learn more are 'subdocuments' and 'dot notation' as described here.
You can use dot notation to access elements in nested documents, but you will need to use $unwind (in an aggregation pipeline) if you need to access elements within a list and then check whether X.a equals 1.
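A minimal sketch of that aggregation variant, assuming X is an array of subdocuments in collection C:
db.C.aggregate([
    { $unwind: "$X" },          // one output document per element of the X array
    { $match: { "X.a": 1 } }    // keep the elements whose a equals 1
])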

PyMongo updating array records with calculated fields via cursor

Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
    "_id" : {
        "clienta" : NumberLong(460011766),
        "clientb" : NumberLong(2886729962)
    },
    "states" : [
        [
            "fixed", "fixed.rotated", "fixed.rotated.off"
        ]
    ],
    "VBPP" : [
        244,
        182,
        184,
        11,
        299
    ],
    "PPF" : 72.4
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
records_list = []
cursor = db.clientAgg.find({}, {'_id': 0,
                                'states': 1,
                                'VBPP': 1,
                                'PPF': 1})
for record in cursor:
    records_list.append(record)
for dicts in records_list:
    dicts['states'] = len(dicts['states'])
    dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing in 500k dictionaries in memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)

def iter():
    for item in cursor:
        yield item

l = []
for x in iter():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id': x['_id']}, {'$set': {'x.VBPS': somefunction as above}}, upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
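In other words, for an array x of n values, the (population) variance is (1/n) * Σ (x_i − mean(x))², which is what np.var computes by default.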
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
    var collection = db.test_collection;
    collection.find().forEach(function(doc) {
        var b = 2 * doc.a;
        collection.update({_id: doc._id}, {$set: {b: b}});
    });
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
    $project: {
        a: '$a',
        b: {$multiply: [2, '$a']}
    }
}, {
    $out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
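For instance, a sketch of that last step in the shell, using the example collection names from above:
db.test_collection.drop()                                 // remove the original collection
db.test_collection2.renameCollection("test_collection")   // give the target the source's name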
My final thought on this, is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.
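For reference, the cursor form of the aggregate command looks roughly like this in the shell (the pipeline here is just a placeholder):
db.runCommand({
    aggregate: "clientAgg",
    pipeline: [ { $match: {} } ],   // your pipeline here
    cursor: { batchSize: 100 }      // stream results back in batches via a cursor
})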

Search in a limited number of records in MongoDB

I want to search within the first 1000 records of my collection, which is named CityDB. I used the following code:
db.CityDB.find({'index.2':"London"}).limit(1000)
but it does not work: it returns the first 1000 results of the find, whereas I want to search only within the first 1000 records, not all records. Could you please help me?
Thanks,
Amir
Note that there is no guarantee that your documents are returned in any particular order by a query as long as you don't sort explicitly. Documents in a new collection are usually returned in insertion order, but various things can cause that order to change unexpectedly, so don't rely on it. By the way: auto-generated _ids start with a timestamp, so when you sort by _id, the objects are returned in creation order.
Now about your actual question. When you first want to limit the documents and then perform a filter-operation on this limited set, you can use the aggregation pipeline. It allows you to use $limit-operator first and then use the $match-operator on the remaining documents.
db.CityDB.aggregate(
// { $sort: { _id: 1 } }, // <- uncomment when you want the first 1000 by creation-time
{ $limit: 1000 },
{ $match: { 'index.2':"London" } }
)
I can think of two ways to achieve this:
1) You have a global counter, and every time you insert data into your collection you add a field count = currentCounter and increase currentCounter by 1. When you need to select your first k elements, you find them this way:
db.CityDB.find({
    'index.2': "London",
    count: {
        '$gte': currentCounter - k
    }
})
This is not atomic and might sometimes give you more than k elements on a heavily loaded system (but it can use indexes).
Here is another approach which works nicely in the shell:
2) Create your dummy data:
var k = 100;
for (var i = 1; i < k; i++) {
    db.a.insert({
        _id: i,
        z: Math.floor(1 + Math.random() * 10)
    })
}
output = [];
And now find in the first k records where z == 3
k = 10;
db.a.find().sort({$natural: -1}).limit(k).forEach(function(el) {
    if (el.z == 3) {
        output.push(el)
    }
})
As you can see, output contains the matching elements:
output
I think it is pretty straightforward to modify my example for your needs.
P.S. Also take a look at the aggregation framework; there might be a way to achieve what you need with it.