How can I sort on multiple fields in MongoDB with Perl?
My current approach looks something like this:
my $sort = {"is_instock" => -1, "ua" => 1};
my $resultSet = $collection
->find({moderated => 1, markers => {'$all'=>$obj->{markers}}})
->sort($sort)
->limit(25);
#{$result} = $resultSet->all;
But the array I get back is only sorted by one field (ua). What did I do wrong?
The basic problem here is that a "hash" in Perl does not maintain any order for its keys. In order to get the "order of insertion" you need to use Tie::IxHash, as follows:
use Tie::IxHash;

tie( my %sort, 'Tie::IxHash' );
%sort = ( "is_instock" => -1, "ua" => 1 );
my $sort = \%sort;
Then, when you use this in your MongoDB query, the keys are sent in the order you inserted them, rather than in whatever order Perl's hash implementation happens to return them.
As it happens, the lexical order of these two keys matches the order you want, so the sort may even have appeared correct by accident; but Perl hashes guarantee no order at all, so you should make the insertion order explicit regardless.
The other possible reason is that "is_instock" does not exist, or is not the true path name to the field. You need to specify the full path to the field with "dot notation", otherwise the path is invalid.
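For comparison, here is the equivalent query in the mongo shell, where JavaScript object keys keep their insertion order; the embedded "stock" document and the marker values are hypothetical, just to illustrate the "dot notation" point:

    // Hypothetical layout where is_instock lives inside an embedded
    // "stock" document: the sort must name the full dotted path.
    db.collection.find(
        { "moderated": 1, "markers": { "$all": ["foo", "bar"] } }
    ).sort({ "stock.is_instock": -1, "ua": 1 }).limit(25)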
I'm currently using MongoDB 2.6 through MongoHQ. I have several map-reduce jobs which crunch raw data from a collection (c1) to produce a new collection (c2).
I also have an aggregation pipeline which parses (c2) to generate a new collection (c3) via the great $out operator.
However, I need to add extra fields to (c3) outside of the aggregation pipeline and keep them even after a new run of the aggregation. But it seems that the aggregation, keyed on _id, just overwrites the content instead of updating it. So if I have previously added an extra field like foo : 'bar' to (c3) and I re-run the aggregation, I lose the foo field.
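A quick shell illustration of the behaviour (the _id value is made up):

    // Manually add an extra field to one (c3) document
    db.c3.update({ "_id": "someId" }, { "$set": { "foo": "bar" } })

    // ... re-run the aggregation pipeline that ends with { $out: "c3" } ...

    // $out has atomically replaced the whole collection, so "foo" is gone
    db.c3.findOne({ "_id": "someId" })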
Based on the documentation (http://docs.mongodb.org/manual/reference/operator/aggregation/out/#pipe._S_out):
Replace Existing Collection
If the collection specified by the $out operation already exists, then upon completion of the aggregation, the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the pre-existing collection.
Is there a better way, or a tricky one :-), to update the $out collection instead of overwriting records with the same _id? I could write a Python or JavaScript script to do the job, but I would like to avoid making many database calls and do it in as smart a way as the aggregation. Maybe it is not possible, so I will look for a different and more 'classical' path.
Thanks for your help
Well, not directly with the $out operator; much like the mapReduce output, this is pretty much an "overwrite" operation (though mapReduce does also have "merge" and "reduce" output modes).
But since you are on MongoDB 2.6, aggregate does actually return a "cursor". So while the "client/server" interaction may not be as optimal as you would like, you also have "bulk update" operations at your disposal, so you can do something along the lines of this:
var cursor = db.collection.aggregate([
    // pipeline here
]);

var batch = [];

while ( cursor.hasNext() ) {
    var doc = cursor.next();
    var updoc = {
        "q": { "_id": doc._id },
        "u": {
            "$set": {
                // the recalculated fields from this aggregation run
            },
            "$setOnInsert": {
                // fields that should only be written when the document
                // is first created; existing extras such as foo: 'bar'
                // are simply left untouched by the update
            }
        },
        "upsert": true
    };
    batch.push(updoc);

    // keep each command comfortably under the 16MB BSON limit;
    // the batch size may vary
    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({
            "update": "newcollection",
            "updates": batch
        });
        batch = []; // reset the content
    }
}

// flush any remaining updates
if ( batch.length > 0 ) {
    db.runCommand({
        "update": "newcollection",
        "updates": batch
    });
}
And of course, though there will be many naysayers (and not without reason, because the consequences are very real and you need to weigh them up), you can always wrap what is essentially a JavaScript loop like this in db.eval() in order to get full server-side execution.
But where possible (that is, unless you have a completely remote database solution), it is generally advised to take the "client/server" option, while keeping the process as "close" (in networking terms) to the server as possible.
Unlike mapReduce, the $out operator in the aggregation framework has a very specific set of pre-defined behaviours (http://docs.mongodb.org/manual/reference/operator/aggregation/out/#behaviors). That said, the $out behaviour could change in the future; I did not find a JIRA ticket for this specific case, but others have requested changes (https://jira.mongodb.org/browse/SERVER-13201).
As for solving your problem now, you are either forced to revert to mapReduce (I don't know the scenario this is being run from) or to aggregate in a manner that allows you to feed in both the new data and the old data you need to keep.
The most common way of achieving this might be to update the original rows with the new data, perhaps by aggregating the original rows back down into themselves.
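A rough sketch of that idea, assuming a staging collection that already holds both the freshly computed documents and the previous (c3) contents; the collection name, fields, and choice of accumulators here are all illustrative:

    // Group back down to one document per key, summing the recalculated
    // fields while carrying a previously added extra field through,
    // so that $out writes it back out again.
    db.staging.aggregate([
        { "$group": {
            "_id": "$_id",
            "field1": { "$sum": "$field1" },
            "field2": { "$sum": "$field2" },
            // $max ignores missing values, so the manually added field
            // survives from whichever input document carries it
            "foo": { "$max": "$foo" }
        }},
        { "$out": "c3" }
    ])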
Thanks for all your messages.
As I do not want to use a cursor (too request-hungry), I tried to get the job done by combining two map-reduce jobs and one aggregation. It is quite 'fat', but it works and may give others some ideas.
Of course, I would be very pleased to hear about other great alternatives.
So, I have a collection c1 which is the result of a previous map-reduce job, as you can tell from the value object:
c1 : { _id: 'xxxx', value: { language: '...', keyword: '...', params: '...', field1: val1, field2: val2 } }
The xxxx unique ID key is the concatenation of value.language, value.keyword and value.params, as follows:
xxxx = <language>_<keyword>_<params>
I have another collection c2 : { _id : ObjectID, language: '...', keyword: '...', field1: val1, field2: val2, labels: 'yyyyy' }, which is more or less a projection of the c1 collection, but with an extra field labels: a string of comma-separated labels. This c2 collection is a central repository for all combinations of language and keyword, with their attached field values.
Target
The target is to group all records from the c1 collection based on the group key <language>_<keyword>, make some calculations on the other fields, and store the result in the c2 collection, keeping the old 'labels' field from c2 for the same key. So fields 1 & 2 of this c2 collection will be recalculated each time we launch the whole batch, but the labels field will stay unchanged.
As described in my first message, you cannot reach this target with aggregation or map-reduce jobs alone, as the 'labels' field would be removed.
As I do not want to use cursors and foreach loops, which are very network- and database-request-hungry (I have a big collection and I use the MongoHQ service), I tried to solve the problem using only map-reduce and aggregation jobs.
1st Phase
So, first I run a map-reduce job (m1) which is sort of a copy of the c2 collection, but with the values of fields 1 & 2 reset to 0. The result is stored in a c3 collection.
function m1Map(){
    // c2 documents are flat: { _id, language, keyword, field1, field2, labels }
    var language = this['language'];
    var keyword = this['keyword'];
    var labels = this['labels'];
    var key = language + '_' + keyword;
    emit(key, { 'language': language, 'keyword': keyword, 'field1': 0, 'field2': 0.0, 'labels': labels });
}

function m1Reduce(key, values){
    // c2 has one document per key, so values[0] is all we need
    var language = values[0]['language'];
    var keyword = values[0]['keyword'];
    var labels = values[0]['labels'];
    return { 'language': language, 'keyword': keyword, 'field1': 0, 'field2': 0.0, 'labels': labels };
}
So now, c3 is a copy of the c2 collection with fields 1 & 2 set to 0. Here is the shape of this collection:
c3 : { _id: '<language>_<keyword>', value: { language: '...', keyword: '...', field1: 0, field2: 0.0, labels: '...' } }
2nd Phase
In a second step I run a map-reduce job (m2) which groups the c1 collection values by the key <language>_<keyword>, and I project an extra field 'labels' with the fixed value 'x'. This 'x' value is never used in the c2 collection; it is a special marker. The output of this m2 map-reduce job is stored in the same c3 collection using the 'reduce' option of the out directive. The Python driver script is described further down.
function m2Map(){
    var language = this['value']['language'];
    var keyword = this['value']['keyword'];
    var field1 = this['value']['field1'];
    var field2 = this['value']['field2'];
    var key = language + '_' + keyword;
    emit(key, { 'language': language, 'keyword': keyword, 'field1': field1, 'field2': field2, 'labels': 'x' });
}
Then I make some calculations in the Reduce function:
function m2Reduce(key, values){
    // Init
    var language = values[0]['language'];
    var keyword = values[0]['keyword'];
    var field1 = 0;
    var field2 = 0;
    var labels = '';
    var bLabel = 0;
    for (var i = 0; i < values.length; i++){
        if (values[i]['labels'] == 'x') {
            // These emitted values come from the map and not from a
            // previous value in the c3 collection: 'x' is never used
            // in the c2 collection
            field1 += parseInt(values[i]['field1']);
            field2 += parseFloat(values[i]['field2']);
        } else {
            // These values come from the c2 collection
            if (bLabel == 0) {
                // Keep the former value of the 'labels' field
                labels = values[i]['labels'];
                bLabel = 1;
            } else {
                // Concatenate the 'labels' fields if we have 2 records,
                // though theoretically that is impossible as c2 has only
                // one record per unique key; a good check afterwards :-)
                labels += ',' + values[i]['labels'];
            }
        }
    }
    if (bLabel == 0) {
        // If the values only come from map emits, force the 'x' marker
        // for labels again, in case these values are re-used in another
        // reduce call
        labels = 'x';
    }
    return { 'language': language, 'keyword': keyword, 'field1': field1, 'field2': field2, 'labels': labels };
}
The Python script which calls the two m1 & m2 map-reduce jobs
(see the pymongo installation notes: http://api.mongodb.org/python/2.7rc0/installation.html):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pymongo import MongoReplicaSetClient
from bson.code import Code
from bson.son import SON

# MongoHQ replica set connection
uri = 'mongodb://user:passwd@url_node1:port,url_node2:port/mydb'
client = MongoReplicaSetClient(uri, replicaSet='set-xxxxxxx')
db = client.mydb
coll1 = db.c1
coll2 = db.c2

# Load the map and reduce functions
m1_map = Code(open('m1Map.js', 'r').read())
m1_reduce = Code(open('m1Reduce.js', 'r').read())
m2_map = Code(open('m2Map.js', 'r').read())
m2_reduce = Code(open('m2Reduce.js', 'r').read())

# Run the map-reduce jobs: m1 replaces c3, m2 reduces into it
results = coll2.map_reduce(m1_map, m1_reduce, "c3", query={})
results = coll1.map_reduce(m2_map, m2_reduce, out=SON([("reduce", "c3")]), query={})
3rd Phase
At this point, we have a c3 collection that is complete, with all field 1 & 2 values computed and the labels kept. Now we have to run one last aggregation pipeline to copy the c3 content (in map-reduce form, with a compound value object) into the more classical flat c2 collection, without the value wrapper.
db.c3.aggregate([
    { $project: {
        _id: 0,
        keyword: '$value.keyword',
        language: '$value.language',
        field1: '$value.field1',
        field2: '$value.field2',
        labels: '$value.labels'
    }},
    { $out: 'c2' }
])
Et voilà! The target is reached. This solution is quite long, with two map-reduce jobs and one aggregation pipeline, but it is an alternative for those who do not want to use resource-hungry cursors or external loops.
Thanks.
Suppose you have a collection of documents with the following structure:
_id
A_id = ObjectId
B_id = ObjectId
C_id = ObjectId
+ other stuff
Suppose the collection holds roughly 100 million to 1 billion documents. I have to run a query which returns all documents such that A_id, B_id, or C_id is in some list of ObjectIds, say L = [ObjectId, ...].
Something like this:
{ '$or' : [ { 'A_id' : { '$in' : L } },
            { 'B_id' : { '$in' : L } },
            { 'C_id' : { '$in' : L } } ] }
Q: Is it doable to run such a query? Is it normal to run such queries on MongoDB?
Q: How long might it take on a single server, and how long on a horizontally scaled database?
It is a doable query.
The real question is "is it a good query?"
The answer to that question is extremely dependent upon many variables.
First off, I am assuming you have an index on each of the fields you are querying. I am also assuming that the query stands as is, without a sort. It should be noted that there are problems which stop a sort index from being used here by the optimiser: https://jira.mongodb.org/browse/SERVER-1205
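For reference, those per-field indexes would be created along these lines in the shell (the collection name docs is just a placeholder):

    db.docs.ensureIndex({ "A_id": 1 })
    db.docs.ensureIndex({ "B_id": 1 })
    db.docs.ensureIndex({ "C_id": 1 })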
Assuming you have indexes on A_id, B_id and C_id, MongoDB will essentially run three queries and merge out the duplicates before returning your result.
This means that for small $or queries it might be faster within the database (or mongos) itself, since you don't have to merge duplicates in your application; that not only spares network traffic but also the costly iteration over the results of each clause of the $or.
So for a small $or like that, the query is okay. It isn't the best query in the world, but it will do if you have no choice but to use an $or.
Q: How long might it take on a single server, and how long on a horizontally scaled database?
Not sure anyone here can answer that. It depends upon the schema, the size of the $in lists, and much more.
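Rather than guessing, you can measure it on your own data with explain(); a minimal sketch, where docs and the ObjectId value are placeholders:

    var L = [ ObjectId("4e8cba7a0b7aabea08000006") ]; // sample list contents

    // Inspect the query plan and timings for the three-clause $or
    db.docs.find({
        "$or": [
            { "A_id": { "$in": L } },
            { "B_id": { "$in": L } },
            { "C_id": { "$in": L } }
        ]
    }).explain()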
It's certainly doable to run that query. However, you might want to consider an alternative structure that could be more easily searched.
Instead of
_id
A_id = ObjectId
B_id = ObjectId
C_id = ObjectId
+ other stuff
You might want to restructure it to be:
_id
idList = [
{ k: 'A', v: AObjectId },
{ k: 'B', v: BObjectId },
{ k: 'C', v: CObjectId }
]
+ other stuff
By using an array of sub-objects with a key and a value field, you can index the value field (idList.v) so that a single efficient query does the job:
{ 'idList.v' : { '$in' : listToCheck } }
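A quick sketch of that in the shell (the collection name and the ObjectId values are arbitrary examples):

    // Index the values inside the { k, v } sub-documents
    db.docs.ensureIndex({ "idList.v": 1 })

    // One indexed query replaces the three-clause $or
    var listToCheck = [
        ObjectId("4e8cba7a0b7aabea08000006"),
        ObjectId("4e8ccf7c0b7aabe508000007")
    ];
    db.docs.find({ "idList.v": { "$in": listToCheck } })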
This is my query in SphinxQL:
SELECT option_id FROM items WHERE cat IN (10,11) GROUP BY option_id LIMIT 100000 OPTION max_matches=100000
cat is an sql_attr_multi field, and this query does not return the correct result. Does anybody know how to filter on this Sphinx attribute type?
That query looks for items where the cat attribute contains either 10 OR 11; is that what you are trying to do?
If it's not, it would help to know what you are trying to query!
I had a similar problem.
When I passed an array to an IN condition for an MVA attribute, I got no results, even though there were several matching documents in the index.
When I debugged the condition (the attribute array(10, 11) in your case), I saw that the array values were strings instead of integers:
array(
    0 => "10",
    1 => "11"
)
Every single value in the condition goes through the quoteArr() function
(https://github.com/FoolCode/SphinxQL-Query-Builder/blob/master/src/SphinxQL.php#L518),
which escapes values according to https://github.com/FoolCode/SphinxQL-Query-Builder/blob/master/src/Drivers/ConnectionBase.php#L95.
The quote function uses PHP's internal is_int() function:
$a = "1";
var_dump(is_int($a)); // returns bool(false)
This means that instead of
cat IN (10, 11)
you have
cat IN ("10", "11")
But Sphinx can't filter an MVA attribute by non-integer (string) values, no matter whether you use the IN or WHERE notation:
[1064] index document : unsupported filter type 'string' on MVA column [ SELECT * FROM `document` WHERE MATCH('(some query)') AND `_category` = '5' LIMIT 0, 10]
You should cast to the strict value type:
foreach ($category as &$item) {
    $item = (int) $item;
}
unset($item);
I am not sure that this is your exact issue. Unfortunately, there isn't enough data to say so for sure in this case.
I have a MongoDB structure which currently looks like this:
[campaigns] => Array (
    [0] => Array (
        [campaign_id] => 4e8cba7a0b7aabea08000006
        [short_code] => IHEQnP
        [users] => Array (
        )
    )
    [1] => Array (
        [campaign_id] => 4e8ccf7c0b7aabe508000007
        [short_code] => QLU_IY
        [users] => Array (
        )
    )
)
What I would like to be able to do is search for the short code and just have the relevant array element returned. I initially tried:
db.users.find({'campaigns.short_code':'IHEQnP'}, {'campaigns.campaign_id':1})
However, that returns all of the arrays, as opposed to just the one (or the field) that I want.
Is there a way in Mongo to get just the matching array element (or even a field of it)? Or is that something I would have to do on the server side? I am using the Lithium framework to retrieve the results (in case that helps).
Thanks in advance :)
Dan
When you use a criterion like campaigns.short_code you are still searching the collection; campaigns is just a property of a document, and your find returns documents.
So, given this structure, you cannot achieve what you want directly with a query.
Arrays in MongoDB can be sliced, but not sorted:
db.users.find({}, {campaigns: { $slice : 1}})
This would give you the first campaign, but since you can't sort the array so that IHEQnP comes first, it is of no help in this situation.
Read more here.
You could, however, filter this quite simply in Lithium after retrieving the full document:
$id = 'id to match against';
$result = $user->campaigns->find(function($model) use ($id) {
    return $model->campaign_id === $id;
});
See docs for Entity::find here
My solution would be to keep it in User if the number of campaigns is low (fast to sort and filter in PHP, as long as the document size isn't too big).
If this is expected to grow, then look at moving it to its own model/collection, or re-think how you modelled your data.
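If the campaigns do move into their own collection, the lookup becomes a direct, indexable query. A minimal shell sketch, reusing the field names from the structure above (the campaigns collection name is assumed):

    // With campaigns in their own collection, the short code is a
    // top-level field that can be indexed and queried directly:
    db.campaigns.ensureIndex({ "short_code": 1 })
    db.campaigns.find({ "short_code": "IHEQnP" }, { "campaign_id": 1 })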