MongoDB find median

Upon user request, I would like to graph the median values of many documents. I'd prefer not to transfer entire documents from the database to my application solely to determine median values.
I understand that a median aggregator is still planned for MongoDB; however, I see that the following operations are currently supported:
sort
count
limit
Short of editing the mongo source code, is there any reasonable way I can combine these operations to obtain median values; for example, sort the values, count them, and limit to return the median value?

It appears that editing Mongo source code is the only solution.
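That said, the combination the question itself describes (sort, count, then skip/limit to the middle) can be driven from the application so that only the middle document(s) cross the wire. A minimal PyMongo sketch, assuming a hypothetical collection records with a numeric field value:

from pymongo import MongoClient

coll = MongoClient().mydb.records          # hypothetical database/collection names

n = coll.count_documents({})               # count the values
middle = n // 2

# Sort on the field and skip to the middle; only the median document(s)
# are transferred, not the whole collection.
if n % 2:   # odd count: single middle value
    mid_doc = coll.find({}, {"value": 1}).sort("value", 1).skip(middle).limit(1)
    median = next(mid_doc)["value"]
else:       # even count: mean of the two middle values
    docs = list(coll.find({}, {"value": 1}).sort("value", 1).skip(middle - 1).limit(2))
    median = (docs[0]["value"] + docs[1]["value"]) / 2.0

An index on the sorted field keeps the sort from scanning the whole collection.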


"Random" sample from MongoDB returning heavily skewed results

I have a collection in MongoDB with ~600,000 documents. Of those, exactly half have a field set to 0, while the others have the same field set to 1. When I try to get a random sample from this collection using the sample operation in the aggregation pipeline (via PyMongo), it skews heavily toward the 1 value.
In a 25,000 record sample, there might be 300-400 records where the field is 0, and then 24,000+ records where the field in question is 1.
If the initial collection is equally distributed, why is this use of $sample returning results with such a vastly different distribution, and how can I get a representative sample from a collection?
Here's the PyMongo line I'm using for the query:
cursor = foo_database.bar_collection.aggregate([{"$sample": {"size": 25000}}])
As of MongoDB 3.4.9, part of the reason for the bias you've observed is that $sample relies almost entirely on the storage engine's random cursor implementation (see SERVER-19183). This is done so that $sample could be performant when the collection contains a lot of data. However, since the storage engine stores documents in a sorted order using a B-tree type implementation, it's not always possible to create a truly random result.
There are currently two feature requests for better $sample mechanics, namely SERVER-22069 and SERVER-22068.
Having said that, if you require a truly unbiased sample of your data, rolling your own $sample-like solution is likely the best way to proceed at this point. Something like this (a PyMongo sketch follows the steps):
Get a list of all _id in the collection.
Perform a random sampling on this list (e.g. using Python's random.sample).
Obtain all the relevant documents using the sampled _id, which will be reasonably performant depending on the sample size you want, since _id is always indexed.
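A minimal PyMongo sketch of those three steps, reusing the database/collection names and sample size from the question (random.sample draws without replacement):

import random
from pymongo import MongoClient

bar_collection = MongoClient().foo_database.bar_collection

# 1. Get a list of all _id in the collection (only the _id field is fetched).
all_ids = [doc["_id"] for doc in bar_collection.find({}, {"_id": 1})]

# 2. Perform the random sampling client-side, without replacement.
sampled_ids = random.sample(all_ids, 25000)

# 3. Fetch the sampled documents; _id is always indexed, so this is cheap.
sampled_docs = bar_collection.find({"_id": {"$in": sampled_ids}})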

MongoDB batch operations: Is it possible to order or limit data somehow?

Let's assume I have 10000k users.
I want to select the first 500 users and set a golden status for them.
Is it possible to do that with mongo batch (bulk) operations?
You can change multiple documents in MongoDB using update (with the multi option), but the query can't use $limit or $orderby. Check this issue. So you will need to write a query (as in find) and then update the documents. If your schema has the ranking of the users, it is trivial to do that:
db.users.update({rank:{'$lte':500}},{'$set':{status:'gold'}},{multi:true})
But if you only have a score by which you will rank them, it is not so trivial: there is no batch operation for this, so you will need to do it yourself, e.g. using forEach():
db.users.find({}).sort({score:-1}).limit(500).forEach(...)
In the forEach callback you can manually set the golden status on each document.
A shrewder way to avoid this is to make a query that finds the score of the 500th-ranked user and use that score in a single update.
db.users.find({}).sort({score:-1}).skip(499).limit(1)
or
db.users.find({}).sort({score:-1}).limit(500)
I don't know how it is done in PyMongo; let us assume that docs is the query output.
fooScore=docs[0].score
or
fooScore=docs[499].score
and
db.users.update({score:{'$gte':fooScore}},{'$set':{status:'gold'}},{multi:true})
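Since PyMongo came up above, a rough equivalent of that score-cutoff approach might look like the sketch below; update_many plays the role of the multi update, and the database name is an assumption:

from pymongo import MongoClient

users = MongoClient().mydb.users               # hypothetical database name

# Score of the 500th-ranked user (assumes there are at least 500 users).
cutoff = list(users.find({}, {"score": 1})
                   .sort("score", -1).skip(499).limit(1))
if cutoff:
    foo_score = cutoff[0]["score"]
    # One multi-document update; ties at the cutoff score are all promoted.
    users.update_many({"score": {"$gte": foo_score}},
                      {"$set": {"status": "gold"}})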
You will also have to handle the case where there are fewer than 500 users, and tweak the strategy for users with the same score (ties around the cutoff).
Also consider setting an index on rank or score for efficient queries.

MongoDB: Store average speed as a field or compute on-the-go

I am developing an Android app that uses MongoDB to store user records in document format. I will have several records that contain information about a GPS track such as start longitude and latitude, finish longitude and latitude, total time, top speed and total distance.
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
I will have thousands of records that should be sorted based on average speed, and the most reasonable approach seems to be to store the average speed in the document as well. However, that breaks away from traditional SQL ACID thinking, where speed would be calculated outside the DB.
The current document structure for the record collection is like this:
DocumentID (record)
DocumentID (user)
Start lnlt
Finish lnlt
Start time/date/gmt
End time/date/gmt
Total distance
Total time
Top speed
KMZ File
You should not be thinking about ACID properties once you have made the choice to use a document-oriented DB such as MongoDB. Now, you have answered the question yourself:
" the most reasonable seems to store the average speed in the document as well."
We programmers have a tendency to ignore the reasonable or simple approaches; we always question ourselves whenever the solution we find looks obvious or like common sense ;-).
Anyway, my suggestion is to store it, since you want the sort to be performed by the DB and not the application. This means that if any of the variables that influence the average speed change after the initial storage, you should remember to update the stored field as well.
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
As #Panegea rightly said, MongoDB does not rely on ACID properties. It relies on your app being able to control the distributed nature of itself. That being said, calculating the average speed outside of the DB isn't all that bad, and using an atomic operator like $set will prevent oddities even without full ACID guarantees.
What you and #Panegea are describing is a form of pre-aggregation of your needed value into a pre-defined field on the document. This is a widely recommended approach, not only in MongoDB but also in SQL (like the total shares on a Facebook wall post), where querying for the aggregation of a computed field on every read would be tiresome and very hard on the server, or just not wise.
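As an illustration, a small PyMongo sketch of that pre-aggregation (the collection and field names loosely mirror the document structure above and are otherwise assumptions):

from pymongo import MongoClient

records = MongoClient().mydb.records        # hypothetical database/collection names

def store_average_speed(record_id, total_distance, total_time):
    """Recompute and store averageSpeed whenever distance or time change."""
    records.update_one(
        {"_id": record_id},
        {"$set": {
            "totalDistance": total_distance,
            "totalTime": total_time,
            # Stored on the document so the DB can sort on it directly.
            "averageSpeed": total_distance / total_time,
        }},
    )

# Sorting thousands of records by average speed is then a plain indexed query.
fastest = records.find().sort("averageSpeed", -1).limit(10)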
Edit
You could also achieve this with the aggregation framework: http://docs.mongodb.org/manual/applications/aggregation/ is worth a look, but pre-aggregation is by far the fastest method.
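For comparison, a sketch of the on-the-fly alternative with the aggregation framework, computing the speed in a $project stage and sorting on it (reusing the records collection and the assumed field names from the sketch above):

pipeline = [
    {"$project": {
        "totalDistance": 1,
        "totalTime": 1,
        # Computed per query instead of stored on the document.
        "averageSpeed": {"$divide": ["$totalDistance", "$totalTime"]},
    }},
    {"$sort": {"averageSpeed": -1}},
    {"$limit": 10},
]
fastest_on_the_fly = records.aggregate(pipeline)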

Create aggregated user stats with MongoDB

I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The record documents consist of GPS tracks such as start and end coordinates, total time, top speed and distance. The user document has a user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am confused about whether I should do a map-reduce and create an aggregate collection for users, or whether I should add these stats to the user document with some kind of cron-job-type solution. I have read many guides about map-reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update the current coordinates, speed, etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process; I simply mean calculating on update of the user object.
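For instance, a small PyMongo sketch of that calculate-on-update idea (the field names under stats are assumptions): each finished track bumps the per-user totals atomically in one round trip.

from pymongo import MongoClient

db = MongoClient().mydb                     # hypothetical database name
users = db.users

def add_track(user_id, distance, duration, top_speed):
    """Fold a finished track into the user's aggregate stats."""
    users.update_one(
        {"_id": user_id},
        {
            "$inc": {"stats.totalDistance": distance,
                     "stats.totalTime": duration},
            "$max": {"stats.topSpeed": top_speed},
        },
    )
    # The average speed to date can be derived from the two totals on read,
    # or re-stored with a follow-up $set if you want to sort on it.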
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applied, then they'll need to be calculated from some snapshot of data. The two approaches here are:
the aggregation framework in 2.2
MapReduce
You would use MapReduce if, say, you have a lot of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By that definition, the data isn't changing frequently, but even if it were, you can use incremental MapReduce to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand. It won't, of course, be as quick as pre-calculated values, but it is much quicker than MapReduce when executed on demand. It can't cope with the high-volume result sets that MapReduce can, but it is better suited to queries where you don't know the parameter values in advance.
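For example, an on-demand cross-user roll-up with the aggregation framework might look like the sketch below (field names are assumptions, db is reused from the earlier sketch, and the commented $match stage is where an arbitrary filter would go):

pipeline = [
    # {"$match": {"region": "..."}},        # optional filter, chosen at query time
    {"$group": {
        "_id": "$userId",                   # one output document per user
        "totalDistance": {"$sum": "$totalDistance"},
        "totalTime": {"$sum": "$totalTime"},
        "topSpeed": {"$max": "$topSpeed"},
    }},
]
per_user_totals = db.records.aggregate(pipeline)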
By way of example, if you wanted to calculate the aggregate sums of users stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do that on the fly.
If, however, you wanted it by city, you could conceivably use MapReduce there, because you could stick to a finite set of cities and just pre-calculate them all.
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.

Query for set complement in CouchDB

I'm not sure that there is a good way to do this with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a (unique) username field. There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number of post documents with a given username, including none. It's trivial to create a view which counts the number of posts per username. The view can even include zero-counts by emitting zero post-counts for the user documents in the view map function. What I want to do though is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results a relatively small proportion of the total. Is there a way to do this server-side and retrieve back just the interesting results?
I would write a map function to iterate through the documents and emit the users (or just usernames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)
Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan). However, it can be done in constant memory and linear time by iterating through two query results simultaneously.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See the details in the previous question. The basic idea is that you iterate through both (sorted) lists simultaneously, and whenever a document id is listed in the full set but missing from the sub-set, that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)
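A language-agnostic sketch of that simultaneous walk, shown in Python here (all_ids and posting_ids stand for the two sorted lists of keys returned by the two CouchDB queries; the names are assumptions):

def complement(all_ids, posting_ids):
    """Yield ids that appear in all_ids but not in posting_ids.

    Both inputs must be sorted (CouchDB returns view rows sorted by key),
    so this runs in linear time with constant extra memory.
    """
    posting_iter = iter(posting_ids)
    current = next(posting_iter, None)
    for doc_id in all_ids:
        # Advance the sub-set until it catches up with the full set.
        while current is not None and current < doc_id:
            current = next(posting_iter, None)
        if current != doc_id:
            yield doc_id  # in the full set, missing from the sub-set: a hit

# e.g. users with zero posts:
# zero_post_users = list(complement(all_usernames, usernames_with_posts))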