Selecting weighted random document from MongoDB collection

I know random record selection is not actually supported by MongoDB yet, but I have found a few ways to work around it.
However, I want to select a weighted random item. This is fairly easy with MySQL, but I'm not sure of the best way to go about it with Mongo.
The problem I am solving is: I have a collection that holds sweepstakes entries, and based on the number of times a user shares/promotes the contest, they get an "extra entry", to increase their chance of winning. Rather than duplicate the user's entry, I have a field that records the number of times they have shared the contest. I want to use this number as a multiplier to weight the random selection of a "winner".
Here are a few approaches I have thought of:
Use a variation on the Cookbook random selection method, generating an array of random numbers (with length equal to the multiplier) so the record has a greater chance of being near the random point queried (though I'm not sure how well Mongo indexes array [multi-key] fields, so it might be slow)
Another variation on the Cookbook random method using a geospatial query, using a round polygon with a radius equal to the multiplier instead of a simple random number (if this is even possible, I've never used MongoDB geo indexes and queries)
Expand the entries in a new temporary collection, then use one of the MongoDB random selection methods
Avoid the problem and just store the duplicated entries in Mongo in the first place, and do a regular random select thingamajig
Keep a separate index of the MongoIDs and their weight multipliers in MySQL (either constantly synced, or generated on demand) and use MySQL to do a random weighted selection
Query out a huge array to do it in PHP and hope it doesn't run out of memory! :/
Am I on to anything here? Any other suggestions, or an obvious solution I'm missing? I'm going to do some experimenting to see what works, but any feedback on my initial ideas is welcome!
Performance needs to be "good," not great, since none of these contests is ever likely to have millions of entries (usually more like [tens of] thousands), so fairness/accuracy is more important than speed. Thanks.
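One fair approach that avoids both duplicated entries and pulling a huge array into PHP is the classic "sum the weights, draw one number, walk the cursor" pattern. Below is a minimal sketch in Python/pymongo; the entries collection, the shares field, and the weight formula shares + 1 are assumptions for illustration, not anything from the question. It makes two passes over the collection, which is cheap at tens of thousands of entries.

import random
from pymongo import MongoClient

db = MongoClient()["sweepstakes"]  # database name assumed

def pick_winner():
    # Weight of each entry = shares + 1 (the base entry plus one per share).
    total = sum(doc.get("shares", 0) + 1
                for doc in db.entries.find({}, {"shares": 1}))
    if total == 0:
        return None  # empty collection
    target = random.uniform(0, total)

    # Second pass: accumulate weights until we cross the random target.
    running = 0.0
    for doc in db.entries.find():
        running += doc.get("shares", 0) + 1
        if running >= target:
            return doc

winner = pick_winner()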

Related

Fast way to count all vertices (with property x)

I'm using Titan with Cassandra and have several (related) questions about querying the database with Gremlin:
1.) Is there a faster way to count all vertices than
g.V.count()
Titan claims to use an index, but how can I use an index without a property?
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [<>]. For better performance, use indexes
2.) Is there a faster way to count all vertices with property 'myProperty' than
g.V.has('myProperty').count()
Again, Titan reports the following:
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [(myProperty<> null)]. For better performance, use indexes
But again, how can I do this? I already have an index on 'myProperty', but it needs a value to query quickly.
3.) And the same questions with edges...
Iterating all vertices with g.V.count() is the only way to get the count. It can't be done "faster". If your graph is so large that it takes hours to get an answer or your query just never returns at all, you should consider using Faunus. However, even with Faunus you can expect to wait for your answer (such is the nature of Hadoop...no sub-second response here), but at least you will get one.
Any time you do a table scan (i.e. iterate all Vertices) you get that warning of "iterating over all vertices". Generally speaking, you don't want to do that, as you will never get a response. Adding an index won't help you count all vertices any faster.
Edges have the same answer. Use g.E.count() in Gremlin if you can. If it takes too long, then try Faunus so you can at least get an answer.
Doing a count is expensive in big distributed graph databases. You can have a node that keeps track of many of the database's frequently used aggregate numbers and update it from a cron job so you have it handy. Usually, if you have millions of vertices, having the count from the previous hour is not such a disaster.

MongoDB: Store average speed as a field or compute on-the-go

I am developing an Android app that uses MongoDB to store user records in document format. I will have several records that contain information about a GPS track such as start longitude and latitude, finish longitude and latitude, total time, top speed and total distance.
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
I will have thousands of records that should be sorted based on average speed, and the most reasonable approach seems to be to store the average speed in the document as well. However, that breaks away from traditional SQL/ACID thinking, where the speed would be calculated outside the DB.
The current document structure for the record collection is like this:
DocumentID (record)
DocumentID (user)
Start lng/lat
Finish lng/lat
Start time/date/gmt
End time/date/gmt
Total distance
Total time
Top speed
KMZ File
You should not talk about ACID properties once you have made the choice to use a document-oriented DB such as Mongo. Now, you have answered the question yourself:
" the most reasonable seems to store the average speed in the document as well."
We programmers have a tendency to ignore the reasonable or simple approaches. We always try to question ourselves whenever the solution we find looks obvious or like common sense ;-).
Anyway, my suggestion is to store it, since you want the sort to be performed by the DB and not the application. This means that if any of the variables that influence the average speed change after the initial storage, you should remember to update the result field as well.
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
As @Panegea rightly said, MongoDB does not rely on ACID properties. It relies on your app being able to control the distributed nature of itself. That being said, calculating the average speed outside of the DB isn't all that bad, and using an atomic operator like $set will stop oddities when not using full ACID queries.
What you and @Panegea are talking about is a form of pre-aggregation of the needed value into a pre-defined field on the document. This is by far a recommended approach, not only in MongoDB but also in SQL (like the total shares on a Facebook wall post), where querying for the aggregation of a computed field would be tiresome and very heavy for the server, or just not wise.
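As a rough sketch of that pre-aggregation (in Python/pymongo, with the collection and the field names totalDistance, totalTime and averageSpeed invented here), the app computes the average once and writes it atomically with $set, so the DB can then sort on a plain stored field:

from pymongo import MongoClient

records = MongoClient()["tracker"]["records"]  # assumed names

def save_totals(record_id, total_distance_m, total_time_s):
    avg = total_distance_m / total_time_s if total_time_s else 0.0
    records.update_one(
        {"_id": record_id},
        {"$set": {
            "totalDistance": total_distance_m,
            "totalTime": total_time_s,
            "averageSpeed": avg,  # pre-aggregated; rewrite whenever its inputs change
        }},
    )

# Sorting then happens on a stored (and indexable) field:
fastest = records.find().sort("averageSpeed", -1).limit(10)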
Edit
You could also achieve this with the aggregation framework (http://docs.mongodb.org/manual/applications/aggregation/), so you might want to take a look there, but pre-aggregation is by far the fastest method.
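For comparison, a sketch of the on-demand alternative with the aggregation framework, using the same assumed field names as above (the $match stage guards against dividing by a zero totalTime):

from pymongo import MongoClient

records = MongoClient()["tracker"]["records"]  # same assumed names as above
pipeline = [
    {"$match": {"totalTime": {"$gt": 0}}},  # avoid division by zero
    {"$project": {"averageSpeed": {"$divide": ["$totalDistance", "$totalTime"]}}},
    {"$sort": {"averageSpeed": -1}},
    {"$limit": 10},
]
on_the_fly = list(records.aggregate(pipeline))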

Create aggregated user stats with MongoDB

I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The records documents consist of GPS tracks such as start and end coordinates, total time, top speed and distance. The user document has a user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am confused about whether I should do a map-reduce and create an aggregate collection for users, or whether I should add these stats to the user document with some kind of cron-job-type solution. I have read many guides about map-reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update the current coordinates, speed, etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process, I simply mean calculate on update of the user object.
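Roughly, "calculate on update" might look like the following sketch in Python/pymongo, with invented collection and field names; note that the $max update operator needs MongoDB 2.6+, and on older versions you would compare in the app instead:

from pymongo import MongoClient

db = MongoClient()["tracker"]  # database, collection and field names are assumptions

def add_record(user_id, distance_m, time_s, top_speed):
    db.records.insert_one({
        "userId": user_id,
        "distance": distance_m,
        "time": time_s,
        "topSpeed": top_speed,
    })
    # Push the running per-user aggregates into the user document in the same write path.
    db.users.update_one(
        {"_id": user_id},
        {"$inc": {"totalDistance": distance_m, "totalTime": time_s},
         "$max": {"topSpeed": top_speed}},  # $max needs MongoDB 2.6+
    )
    # Average speed to date can be derived on read as totalDistance / totalTime.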
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as "tell me the total distance travelled by all users within region x", then depending on the number of combinations you may not be able to cover all of those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are;
the aggregation framework in 2.2
MapReduce
You would need to use MapReduce say, if you've a LOT of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By my definition, that data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand. It won't, of course, be as quick as pre-calculated values, but it is way quicker than MR when executed on demand. It can't cope with the high-volume result sets that you can produce with MR, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of users' stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do it on the fly.
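A sketch of that on-the-fly case (Python/pymongo, invented field names), where the bounding box is only known at query time:

from pymongo import MongoClient

db = MongoClient()["tracker"]  # collection and field names are assumptions
pipeline = [
    {"$match": {"startLat": {"$gte": 52.0, "$lte": 53.0},  # filter supplied at query time
                "startLng": {"$gte": 4.0, "$lte": 5.0}}},
    {"$group": {"_id": "$userId",
                "totalDistance": {"$sum": "$distance"},
                "topSpeed": {"$max": "$topSpeed"}}},
]
per_user_in_region = list(db.records.aggregate(pipeline))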
If, however, you wanted it by city, you could conceivably use MR there, because you could stick to a finite set of cities and just pre-calculate them all.
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.

CoreData fetch speed affected by number of parameters in NSPredicate?

I am trying to figure out a way to optimize some of my CoreData fetch requests. I currently have NSPredicates that have 2 or 3 parameters to search on. They are all indexed.
Is it faster to have a single index I can search on or several I can search against? Also, are indexes on ints faster than an index on a string?
What certainly helps is to make sure you select the records that discriminate the most first. For example, selecting only a couple of records with a certain key value is very much faster (with an index available) than selecting all active records when 90% of the records are active AND selecting something else at the same time. In that case you would probably even be better off removing the index on the non-discriminating field, to make sure the index on the discriminating field is used.
Also, a predicate with an or statement will be a lot slower than one without.
Selecting on integers will be faster than selecting on strings, but if both are indexed the difference will be small.
Selecting on a keypath instead of a key also negatively affects performance.
(One example I recently used: the predicate
product.subgroup.code == %@
selects the right ones out of 150,000 products almost instantly (within 0.1 sec), while:
product.subgroup.maingroup.code == %@
selects the right ones out of 150,000 products in about 1.5 sec.)
In Core Data you can only mark individual attributes as indexed in the data model editor. In real SQL databases, you would index on several attributes at once. AFAIK, no index advisor can be used with Core Data.
Testing with a real-life database in Instruments (use the Core Data fetches instrument) will help you find the bottlenecks and probably the best answer for your case.

Query for set complement in CouchDB

I'm not sure that there is a good way to do this with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a (unique) username field. There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number of post documents with a given username, including none. It's trivial to create a view which counts the number of posts per username. The view can even include zero counts by emitting zero post-counts for the user documents in the view map function. What I want to do, though, is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results are a relatively small proportion of the total. Is there a way to do this server-side and retrieve just the interesting results?
I would write a map function to iterate through the documents and emit the users (or just the usernames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)
Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan). However, it can be done in constant memory and linear time by iterating through two query results simultaneously.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See the details in the previous question. The basic idea is that you iterate through both (sorted) lists simultaneously, and when you see a document id that is listed in the full set but missing from the sub-set, that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)
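A small sketch of that merge, written here in Python, with plain sorted lists standing in for the two CouchDB query results:

def complement(all_keys, matching_keys):
    # Both inputs must be sorted, as CouchDB view results are.
    matching = iter(matching_keys)
    current = next(matching, None)
    for key in all_keys:
        # Advance the "matching" cursor until it catches up with key.
        while current is not None and current < key:
            current = next(matching, None)
        if key != current:
            yield key  # in the full set, missing from the sub-set: a hit

users_with_no_posts = list(complement(
    ["alice", "bob", "carol", "dave"],  # all usernames, sorted
    ["alice", "dave"],                  # usernames that have at least one post, sorted
))
# -> ["bob", "carol"]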