In ReactiveMongo my query looks like this:
val result = collName.find(BSONDocument("loc" -> BSONDocument("$near" ->
  BSONArray(51, -114)))).cursor[BSONDocument].enumerate()
result.apply(Iteratee.foreach { doc => println(BSONDocument.pretty(doc)) })
I want to print only the top 2 results, so I pass the maxDocs value to enumerate, and then the query is:
val result = collName.find(BSONDocument("loc" -> BSONDocument("$near" ->
  BSONArray(51, -114)))).cursor[BSONDocument].enumerate(2)
result.apply(Iteratee.foreach { doc => println(BSONDocument.pretty(doc)) })
But it's not working; it prints all the documents returned by the query.
How can I print only the top 2 results?
I basically stumbled over the same thing.
Turns out that the ReactiveMongo driver transfers the result documents in batches, taking the maxDocs setting into account only when it wants to load the next batch of documents.
You can configure the batch size to be equal to the maxDocs limit or to a proper divisor thereof:
val result = collName.
  find(BSONDocument("loc" -> BSONDocument("$near" -> BSONArray(51, -114)))).
  options(QueryOpts(batchSizeN = 2)).
  cursor[BSONDocument].enumerate(2)
Or, alternatively, let MongoDB choose the batch size and limit the documents you process using an Enumeratee:
val result = collName.
  find(BSONDocument("loc" -> BSONDocument("$near" -> BSONArray(51, -114)))).
  cursor[BSONDocument].
  enumerate(2) &> Enumeratee.take(2)
I have an array that I want to use for 2 feeders. I was expecting each feeder to be able to use all the values in the array, but it seems like the values run out:
val baseArray = Array(
  Map("transactionId" -> "q-1"),
  Map("transactionId" -> "q-2"),
  Map("transactionId" -> "q-3"))
val feeder_getA = baseArray.clone.queue
val scn_getInsuredOrPrincipals = scenario("getInsuredOrPrincipals")
  .feed(feeder_getA)
  .exec(http("request_getA").get("/getA/${transactionId}"))
val feeder_getB = baseArray.clone.queue
val scn_getInsuredOrPrincipal = scenario("getInsuredOrPrincipal")
  .feed(feeder_getB)
  .exec(http("request_getB").get("/getB/${transactionId}"))
setUp(
  scn_getInsuredOrPrincipals.inject(
    atOnceUsers(3), // 2
    rampUsers(3) over (5 seconds)
  ),
  scn_getInsuredOrPrincipal.inject(
    atOnceUsers(3), // 2
    rampUsers(3) over (5 seconds)
  )
)
I get an error saying the feeder is now empty after 3 values are consumed... I was assuming feeder_getA and feeder_getB would each get 3 values, so each scenario would get an equal number of values. That doesn't seem to be happening. Almost as if the clone isn't working.
The issue is that your feeders are defined using the queue strategy, which runs through the elements and then fails if no more are available:
val feeder_getA = baseArray.clone.queue
You need to use the circular strategy, which goes back to the beginning:
val feeder_getA = baseArray.clone.circular
For more information see the docs.
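Putting it together, a minimal sketch (Gatling 2.x DSL assumed; scenario names shortened here). Each scenario feeds from its own circular copy, so neither one can exhaust the other's data:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

val baseArray = Array(
  Map("transactionId" -> "q-1"),
  Map("transactionId" -> "q-2"),
  Map("transactionId" -> "q-3"))

// circular: wraps around instead of failing when the values run out
val feeder_getA = baseArray.clone.circular
val feeder_getB = baseArray.clone.circular

val scn_getA = scenario("getA")
  .feed(feeder_getA)
  .exec(http("request_getA").get("/getA/${transactionId}"))

val scn_getB = scenario("getB")
  .feed(feeder_getB)
  .exec(http("request_getB").get("/getB/${transactionId}"))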
I am reading data from MongoDB in a Spark job, using the com.mongodb.spark.sql connector (v 2.0.0).
It works fine for most DBs, but for one specific DB the stage takes a long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which works fine after some testing we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are applied on the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/)
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName
  ), None)

val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .schema(impressionSchema)
  .load()

println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions

val filteredDF = df.coalesce(128).filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)

println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce

filteredDF.select(
  $"iid",
  $"instance_id".as("instanceId"),
  $"_global_visitor_key".as("globalVisitorKey"),
  $"_timestamp".as("timestamp"),
  $"_timestamp".cast(DataTypes.DateType).as("date")
)
Data is not very big (Shuffle Write is 20MB for this stage) and even if I filter only 1 document, the run time is the same (only the Shuffle Write is much smaller).
How can I solve this?
Thanks
Nir
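One knob worth checking here: the connector itself decides the partition count through its partitioner settings, so it may be cheaper to enlarge the partitions at the source than to coalesce afterwards. A sketch, assuming the v2.0 partitioner option keys (MongoSamplePartitioner, partitionSizeMB):

import com.mongodb.spark.config.ReadConfig

val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName,
    // a larger partition size means fewer partitions (the default is 64 MB)
    "spark.mongodb.input.partitioner" -> "MongoSamplePartitioner",
    "spark.mongodb.input.partitionerOptions.partitionSizeMB" -> "512"
  ), None)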
I am using ReactiveMongo. I want to create a query that performs a LIKE-style match on numbers (BigDecimal) in MongoDB. For example, a number like 4321.3456 should be matched by 4321.34.
The following 2 queries achieve this in the mongo shell:
db.employee.find({"$where":"/^4321.34.*/.test(this.salary)"})
db.collection.find({
  "$where": function() {
    return Math.round(this.salary * 100) / 100 === 1.12;
  }
})
But I couldn't find a way to perform this query using ReactiveMongo.
How can I execute such queries using ReactiveMongo?
UPDATE
I have tried the following query:
val filter = Json.obj("$where" -> """/^4321.34.*/.test(this.salary)""")
collection.find(filter).cursor[JsObject]()
In my case I was sure that I would get only 2 digits after the decimal point, so I did a range query like this:
val lowerLimit = 4321.34
val upperLimit = lowerLimit + 0.01
val filter = Json.obj("salary" -> Json.obj("$gte" -> JsNumber(lowerLimit), "$lt" -> JsNumber(upperLimit)))
collection.find(filter).cursor[JsObject]()
The above query works only if we are sure that exactly two digits after the decimal point are sent. If three digits are sent, we have to use val upperLimit = lowerLimit + 0.001 instead.
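The epsilon doesn't have to be hard-coded: it can be derived from how many decimal digits the input actually has. A minimal sketch (prefixRange is a hypothetical helper; Play JSON assumed):

import play.api.libs.json.{JsNumber, Json}

// step is 0.01 for 2 decimal digits, 0.001 for 3, and so on
def prefixRange(prefix: BigDecimal): (BigDecimal, BigDecimal) = {
  val step = BigDecimal(1) / BigDecimal(10).pow(prefix.scale)
  (prefix, prefix + step)
}

val (lowerLimit, upperLimit) = prefixRange(BigDecimal("4321.34"))
val filter = Json.obj("salary" -> Json.obj("$gte" -> JsNumber(lowerLimit), "$lt" -> JsNumber(upperLimit)))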
I'm trying to integrate a MongoDB driver in Erlang.
After some coding, it appears to me that the only way to limit the number of retrieved rows is to deal with the cursor after the find() action.
Here's my code so far:
Cursor = mongo:find(Connection, Collection, Selector),
Result = case Limit of
  infinity ->
    mc_cursor:rest(Cursor);
  _ ->
    mc_cursor:take(Cursor, Limit)
end,
mc_cursor:close(Cursor)
What I'm afraid of is what will happen when the collection is huge.
Won't it be too big to fetch and fit in memory?
How does the cursor basically work?
Or is there just a better way to limit the fetch?
I think you could use the batch_size parameter.
The following code is from the mongo.erl file:
%% @doc Return projection of selected documents starting from Nth document in batches of batchsize.
%%      0 batchsize means default batch size.
%%      Negative batch size means one batch only.
%%      Empty projection means full projection.
-spec find(pid(), collection(), selector(), projector(), skip(), batchsize()) -> cursor(). % Action
find(Connection, Coll, Selector, Projector, Skip, BatchSize) ->
  mc_action_man:read(Connection, #'query'{
    collection = Coll,
    selector = Selector,
    projector = Projector,
    skip = Skip,
    batchsize = BatchSize
  }).
===============
Response to the comments:
In the mc_action_man.erl file, it still uses a cursor to save the "current position":
read(Connection, Request = #'query'{collection = Collection, batchsize = BatchSize}) ->
  {Cursor, Batch} = mc_connection_man:request(Connection, Request),
  mc_cursor:create(Connection, Collection, Cursor, BatchSize, Batch).
In mc_worker.erl, this is where the data is actually sent to the DB. I think you could add logging code (e.g. with lager) to monitor the actual requests and find the problem:
handle_call(Request, From, State = #state{socket = Socket, ets = Ets, conn_state = CS}) % read requests
    when is_record(Request, 'query'); is_record(Request, getmore) ->
  UpdReq = case is_record(Request, 'query') of
    true -> Request#'query'{slaveok = CS#conn_state.read_mode =:= slave_ok};
    false -> Request
  end,
  {ok, Id} = mc_worker_logic:make_request(Socket, CS#conn_state.database, UpdReq),
  inet:setopts(Socket, [{active, once}]),
  RespFun = fun(Response) -> gen_server:reply(From, Response) end, % save function, which will be called on response
  true = ets:insert_new(Ets, {Id, RespFun}),
  {noreply, State};
My query looks like this:
var x = db.collection.aggregate(...);
I want to know the number of items in the result set. The documentation says that this function returns a cursor; however, it contains far fewer methods/fields than the cursor returned by db.collection.find().
for (var k in x) print(k);
Produces
_firstBatch
_cursor
hasNext
next
objsLeftInBatch
help
toArray
forEach
map
itcount
shellPrint
pretty
No count() method! Why is this cursor different from the one returned by find()? itcount() returns some type of count, but the documentation says "for testing only".
Using a group stage in my aggregation ({$group:{_id:null,cnt:{$sum:1}}}), I can get the count, like this:
var cnt = x.hasNext() ? x.next().cnt : 0;
Is there a more straightforward way to get this count? As in db.collection.find(...).count()?
Barno's answer is correct to point out that itcount() is a perfectly good method for counting the number of results of the aggregation. I just wanted to make a few more points and clear up some other points of confusion:
No count() method! Why is this cursor different from the one returned by find()?
The trick with the count() method is that it counts the number of results of find() on the server side. itcount(), as you can see in the code, iterates over the cursor, retrieving the results from the server, and counts them. The "it" is for "iterate". There's currently (as of MongoDB 2.6) no way to get just the count of results from an aggregation pipeline without returning the cursor of results.
Using a group stage in my aggregation ({$group:{_id:null,cnt:{$sum:1}}}), I can get the count
Yes. This is a reasonable way to get the count of results and should be more performant than itcount() since it does the work on the server and does not need to send the results to the client. If the point of the aggregation within your application is just to produce the number of results, I would suggest using the $group stage to get the count. In the shell and for testing purposes, itcount() works fine.
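For completeness, the same server-side count can be issued from a driver too; a sketch using ReactiveMongo, assuming the AggregationFramework names (Match, Group, SumValue) from its 0.11.x line, where the count comes back as a single document with a cnt field:

import scala.concurrent.{ExecutionContext, Future}
import reactivemongo.api.collections.bson.BSONCollection
import reactivemongo.bson.{BSONDocument, BSONNull}

def countMatching(coll: BSONCollection, selector: BSONDocument)(
    implicit ec: ExecutionContext): Future[Int] = {
  import coll.BatchCommands.AggregationFramework.{Group, Match, SumValue}
  // $match then $group { _id: null, cnt: { $sum: 1 } }, evaluated on the server
  coll.aggregate(Match(selector), List(Group(BSONNull)("cnt" -> SumValue(1)))).map { res =>
    res.firstBatch.headOption.flatMap(_.getAs[Int]("cnt")).getOrElse(0)
  }
}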
Where have you read that itcount() is "for testing only"?
If in the mongo shell I do
var p = db.collection.aggregate(...);
printjson(p.help)
I receive
function () {
  // This is the same as the "Cursor Methods" section of DBQuery.help().
  print("\nCursor methods");
  print("\t.toArray() - iterates through docs and returns an array of the results")
  print("\t.forEach( func )")
  print("\t.map( func )")
  print("\t.hasNext()")
  print("\t.next()")
  print("\t.objsLeftInBatch() - returns count of docs left in current batch (when exhausted, a new getMore will be issued)")
  print("\t.itcount() - iterates through documents and counts them")
  print("\t.pretty() - pretty print each document, possibly over multiple lines")
}
If I do
printjson(p)
I find that
"itcount" : function (){
var num = 0;
while ( this.hasNext() ){
num++;
this.next();
}
return num;
}
This loop

while (this.hasNext()) {
  num++;
  this.next();
}

is very similar to var cnt = x.hasNext() ? x.next().cnt : 0;, and iterating like this is perfectly fine for counting.