How can I limit the number of rows retrieved by the MongoDB Input step in a Kettle transformation?
I tried the queries below in the MongoDB Input query field, but neither of them works:
{"$query" : {"$limit" : 10}}
or {"$limit" : 10}
Please let me know where I am going wrong.
Thanks,
Deepthi
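There are several query modification operators you can use. Their names are not entirely intuitive and do not match the names of the functions you would use in the Mongo shell, but they do the same sorts of things.
In your case, you can try the $maxScan operator. Note that $maxScan caps the number of documents scanned rather than the number returned, so it only behaves like a limit when every scanned document matches; it was also deprecated in MongoDB 4.0. You could write your query as: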
{"$query": {...}, "$maxScan": 10}
We have to use the aggregation framework provided by MongoDB; here is the link: docs.mongodb.org/manual/applications/aggregation.
In my case, I use the query below in the MongoDB Input step of Kettle, and the "Query is aggregation pipeline" option also has to be selected.
{$match: {activity_type: " view "}},
{$sort: {activity_target: -1 } },
{$limit: 10}
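To make the effect of the three stages concrete, here is a minimal in-memory sketch in plain JavaScript. It does not talk to MongoDB or Kettle; the sample documents and the helper name runPipelineSketch are made up for illustration, and the code only imitates what $match, $sort and $limit do when applied in that order.

```javascript
// Hypothetical sample data; the field names follow the pipeline above.
const activities = [
  { activity_type: "view", activity_target: 3 },
  { activity_type: "click", activity_target: 9 },
  { activity_type: "view", activity_target: 7 },
  { activity_type: "view", activity_target: 5 },
];

// Imitates $match -> $sort (descending) -> $limit on a plain array.
function runPipelineSketch(docs, type, limit) {
  return docs
    .filter((d) => d.activity_type === type)                // like {$match: {activity_type: type}}
    .sort((a, b) => b.activity_target - a.activity_target)  // like {$sort: {activity_target: -1}}
    .slice(0, limit);                                       // like {$limit: limit}
}

const topTwo = runPipelineSketch(activities, "view", 2);
console.log(topTwo.map((d) => d.activity_target)); // [ 7, 5 ]
```

Because the limit comes last, it caps the rows that leave the pipeline, which is the behaviour the question asks for.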
The following screenshots can help you understand the steps more clearly.
I'm currently failing to build a MongoDB query that first sorts and then outputs one sample. Documents in the database look like this:
{
"_id" : "cynthia",
"crawled" : false,
"priority" : 2.0
}
I'm trying to achieve the following: Get me one random element with the highest priority.
I tested it with the following query:
db.getCollection('profiles').aggregate([
{$match: {crawled: false }},
{$sort: {priority: -1}},
{$sample: {size: 1}}
])
Unfortunately, this is not working. Mongo seems to ignore the $sort completely: I see no difference between running the pipeline with and without it.
Does anybody with more MongoDB experience have an idea? If you know a better way to implement the "priority" feature, just tell me.
Every idea is highly appreciated.
$sample is not what you're looking for. According to the docs:
Randomly selects the specified number of documents from its input.
So you'll get one random document from your filtered set of documents.
$limit is what you need, since it takes the first n documents from the previous stage. Your pipeline should look like this:
db.profiles.aggregate([
{$match: {crawled: false }},
{$sort: {priority: -1}},
{$limit: 1}
])
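The difference between the two final stages can be imitated on a plain array. The sketch below is not MongoDB code; the extra sample profiles and the function names are hypothetical. It shows that taking the first element after a descending sort always yields a highest-priority document, whereas a random pick from the filtered set does not depend on the sort order.

```javascript
// Hypothetical sample data in the shape shown in the question.
const profiles = [
  { _id: "cynthia", crawled: false, priority: 2.0 },
  { _id: "dave", crawled: false, priority: 5.0 },
  { _id: "erin", crawled: true, priority: 9.0 },
  { _id: "frank", crawled: false, priority: 1.0 },
];

// Shared first two stages: keep uncrawled profiles, highest priority first.
function matchAndSort(docs) {
  return docs
    .filter((d) => d.crawled === false)
    .sort((a, b) => b.priority - a.priority);
}

// Imitates a final {$limit: 1}: always the first document of the sorted input.
function sortThenLimit(docs) {
  return matchAndSort(docs).slice(0, 1);
}

// Imitates a final {$sample: {size: 1}}: any document of the input may be chosen.
// `random` is injectable only so the behaviour can be checked deterministically.
function sortThenSample(docs, random = Math.random) {
  const sorted = matchAndSort(docs);
  return [sorted[Math.floor(random() * sorted.length)]];
}

console.log(sortThenLimit(profiles)[0]._id); // dave
```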
Say you're querying documents based on 2 data points. One is a simple bool parameter, and the other is a complicated $geoWithin calculation.
db.collection.find( {"geoField": { "$geoWithin" : ...}, "boolField" : true} )
Will Mongo reorder these conditions so that it checks boolField first, before running the complicated check?
MongoDB uses indexes like any other database, so what matters is whether the queried fields are indexed, not the order in which they appear in the query. At least, nothing in the documentation says that MongoDB tries to check primitive fields first. In your example, if boolField has an index, MongoDB can use it to eliminate the documents whose boolField is false before evaluating the rest of the filter. If geoField has an index instead, MongoDB can run that part of the query first.
So what happens if neither field is indexed, or both are? My guess is that the fields are evaluated in the order given in the query, because the query optimization page of the MongoDB documentation mentions nothing beyond indexes. You can always test the performance of your queries by appending .explain("executionStats").
So check the performance of db.collection.find( {"geoField": { "$geoWithin" : ...}, "boolField" : true} ) and db.collection.find( { "boolField" : true, "geoField": { "$geoWithin" : ...} } ), and let us know :)
To add to the above response: if you want Mongo to use a specific index, you can use cursor.hint. https://docs.mongodb.com/manual/core/query-plans/ explains how the default index selection is done.
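One way to run that comparison is a small helper like the one below. It is only a sketch: the function names are made up, it assumes it is pasted into the mongo shell where find(...).explain("executionStats") returns the plan object synchronously, and it assumes an unsharded deployment, where the executionStats section carries the fields read here.

```javascript
// Runs one filter with explain("executionStats") and keeps a few headline numbers.
// `coll` is assumed to behave like a mongo shell collection object (e.g. db.collection).
function statsFor(coll, filter) {
  const plan = coll.find(filter).explain("executionStats");
  const stats = plan.executionStats;
  return {
    millis: stats.executionTimeMillis,
    docsExamined: stats.totalDocsExamined,
    keysExamined: stats.totalKeysExamined,
    returned: stats.nReturned,
  };
}

// Collects the same numbers for two filters that differ only in field order.
function compareFieldOrder(coll, filterA, filterB) {
  return { a: statsFor(coll, filterA), b: statsFor(coll, filterB) };
}
```

You would call compareFieldOrder with your collection and the two filters from the question, each with your own $geoWithin expression filled in. If both variants report the same numbers of examined documents and keys, the field order made no difference to the chosen plan.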
I am trying to match a list of values against a field in my documents, where the field is an array of subdocuments. Using $in gives OR semantics across the values I supply; $all seems more logical.
For example:
Collection: Phrases
sample doc:
{
"locales": [
{
"name": "BPT",
"internal_desc": "Entre 2 e 3 horas"
},
{
"name": "JPN",
"internal_desc": "2 ~ 3 時間"
}
]
}
Query:
db.phrases.find({"locales.name":{"$all":["BPT", "JPN"]}})
But some posts suggest that $all performs badly. Is there another way to achieve this?
Using $and instead of $all will result in equivalent performance. The bottom line is that, given what you are trying to accomplish, $all is your best bet (as far as my understanding goes). However, $all can be optimized by making the first element in the expression the most selective one. For example, if you know that "BPT" shows up in 2% of documents and "JPN" shows up in 20% of documents, it makes sense to list "BPT" first in the $all expression. That way Mongo has fewer documents to filter on each subsequent element of the $all expression. I'm sure you've seen the documentation, but here is a link nonetheless: $all - mongodb
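For readers unsure about the OR versus AND distinction, here is a minimal in-memory sketch of the matching rules on the sample document from the question. It is plain JavaScript, not a MongoDB query; the function names and the extra locale code "KOR" are made up for illustration.

```javascript
// Sample document from the question.
const phrase = {
  locales: [
    { name: "BPT", internal_desc: "Entre 2 e 3 horas" },
    { name: "JPN", internal_desc: "2 ~ 3 時間" },
  ],
};

// Imitates {"locales.name": {$in: names}}: at least one listed name is present.
function matchesIn(doc, names) {
  const present = doc.locales.map((l) => l.name);
  return names.some((n) => present.includes(n));
}

// Imitates {"locales.name": {$all: names}} (or an $and of one equality clause
// per name): every listed name must be present.
function matchesAll(doc, names) {
  const present = doc.locales.map((l) => l.name);
  return names.every((n) => present.includes(n));
}

console.log(matchesIn(phrase, ["BPT", "KOR"]));  // true
console.log(matchesAll(phrase, ["BPT", "KOR"])); // false
console.log(matchesAll(phrase, ["BPT", "JPN"])); // true
```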
You can use the $and syntax, as shown in the query below:
db.phrases.find({$and : [
{"locales.name" : "BPT"},
{"locales.name" : "JPN"}
]
});
To see what the database does when it executes your query, use the explain command, as shown below:
db.phrases.explain().find({$and : [
{"locales.name" : "BPT"},
{"locales.name" : "JPN"}
]
});
That said, the explain command is most useful for collections that have indexes, since it tells you which index the database used for the search.
Have a quick look at MongoDB indexes and explain() for further information.
I hope this helps.
Regards,
Nick.
I have a collection of documents, where each document looks like this:
{'name' : 'John', 'locations' :
[
{'place' : 'Paris', 'been' : true},
{'place' : 'Moscow', 'been' : false},
{'place' : 'Berlin', 'been' : true}
]
}
Where the locations array could have any length.
I want to match documents where the been field is true for all elements in the locations array. Looking at the documentation it looks like I should use $and somehow but I'm not sure if it works with sub-fields.
There are several options:
use $ne: db.destinations.find({"locations.been":{$ne:false}})
change your business logic to precompute that value before saving the document. Otherwise, this search must look through all records and then all places. This value could be indexed.
use the $where operator, but understand the performance implications. It may require a full collection scan. In this case, it would.
write a map-reduce function with the filter logic and only emit those that are valid. You'd need to incrementally update it per the docs.
write a query using the aggregation framework. There are a lot of good examples here. Although, like other solutions, this could end up looping through the entire collection.
I think it's impossible to do with standard MongoDB operators like $elemMatch or $all. The only possible way is to write a custom JS query:
db.test.find("return this.locations.every(function(loc){return loc.been});")
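The $ne suggestion and the JavaScript every() check are close but not identical, which the in-memory sketch below tries to show. It is plain JavaScript rather than a MongoDB query; the documents for "Ann" and "Bob" and the function names are hypothetical, and every document is assumed to have a locations array.

```javascript
const people = [
  { name: "John", locations: [
    { place: "Paris", been: true },
    { place: "Moscow", been: false },
    { place: "Berlin", been: true },
  ] },
  { name: "Ann", locations: [{ place: "Rome", been: true }] }, // hypothetical
  { name: "Bob", locations: [{ place: "Oslo" }] },             // hypothetical: `been` is missing
];

// Imitates {"locations.been": {$ne: false}}: no element has been === false.
const noneFalse = (doc) => !doc.locations.some((loc) => loc.been === false);

// Same test as the custom JS query: every element has a truthy `been`.
const allTrue = (doc) => doc.locations.every((loc) => loc.been);

console.log(people.filter(noneFalse).map((d) => d.name)); // [ 'Ann', 'Bob' ]
console.log(people.filter(allTrue).map((d) => d.name));   // [ 'Ann' ]
```

The two only disagree when an element lacks the been field, so if that field is always present either approach selects the same documents.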
I run this command:
db.ads_view.aggregate({$group: {_id : "$campaign", "action" : {$sum: 1} }});
ads_view contains 500,000 documents.
This query takes 1.8 s. Here is its profile: https://gist.github.com/afecec63a994f8f7fd8a
I created an index: db.ads_view.ensureIndex({campaign: 1});
But MongoDB doesn't use the index. Does anyone know whether the aggregation framework can use indexes, and how to index this query?
This is a late answer, but since $group in Mongo as of version 4.0 still won't make use of indexes, it may be helpful for others.
To speed up your aggregation significantly, perform a $sort before $group.
So your query would become:
db.ads_view.aggregate([{$sort:{"campaign":1}},{$group: {_id : "$campaign", "action" : {$sum: 1} }}]);
This assumes an index on campaign, which should have been created according to your question. In Mongo 4.0, create the index with db.ads_view.createIndex({campaign:1}).
I tested this on a collection containing more than 5.5 million documents. Without $sort, the aggregation had not finished even after several hours; with $sort preceding $group, the aggregation takes a couple of seconds.
The $group operator is not currently one of the operators that will use an index. The operators that do (as of 2.2) are:
$match
$sort
$limit
$skip
From here:
http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes
Based on the number of yields going on in the gist, I would assume you either have a very active instance or that a lot of this data is not in memory when you are doing the group (it will usually yield on a page fault too), hence the 1.8 s.
Note that even if $group could use an index, and your index covered everything being grouped, it would still involve a full scan of the index to do the group, and would likely not be terribly fast anyway.
$group doesn't use an index because it doesn't have to. When you $group your items you're essentially indexing all documents passing through the $group stage of the pipeline using your $group's _id. If you used an index that matched the $group's _id, you'd still have to pass through all the docs in the index so it's the same amount of work.
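The point about visiting every document can be illustrated with a tiny in-memory version of the same grouping. This is a sketch in plain JavaScript with made-up sample data and a hypothetical function name; it mimics the result of the $group stage from the question, not how the server implements it.

```javascript
// Hypothetical sample of ads_view documents.
const adsView = [
  { campaign: "a" },
  { campaign: "b" },
  { campaign: "a" },
  { campaign: "c" },
  { campaign: "a" },
];

// Imitates {$group: {_id: "$campaign", action: {$sum: 1}}} with a Map keyed by _id.
// `visited` counts how many documents had to be read to build the groups.
function groupCount(docs) {
  const counts = new Map();
  let visited = 0;
  for (const doc of docs) {
    visited += 1;
    counts.set(doc.campaign, (counts.get(doc.campaign) || 0) + 1);
  }
  const groups = [...counts].map(([id, action]) => ({ _id: id, action }));
  return { groups, visited };
}

const grouped = groupCount(adsView);
console.log(grouped.visited); // 5
console.log(grouped.groups);  // one entry per campaign: a -> 3, b -> 1, c -> 1
```

However the input is ordered, the count for each campaign is only known once every document has been seen, which is why visited always equals the size of the input.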