How to average millions of rows of NumberLong in Mongo?

I am trying to calculate the average of millions of records with NumberLong type in Mongo.
However, aggregate with $avg doesn't work because of the sizes involved.
Any good approach to solve it?

You can use MapReduce for this.
Your map function would take each document and emit an object with two fields: one field value with the value you want to average and one field count with a value of 1.
Your reduce function would then sum up both the field count and the field value of all objects passed to it, returning one object representing how many documents were summarized and what their sum is.
Your finalize function would then divide the value by the count of the resulting object and return this number.
The second MapReduce example in the official documentation is very close to your use-case, you should be able to use it as a reference. The only difference is that you only want one average value, not separate ones for subsets of your collection, so you would replace key with a constant value.
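For illustration, here is a minimal sketch of that structure in the mongo shell, assuming a collection named data with the number to average stored in a field called value (both names are placeholders for your schema):
// Map: emit every document under one constant key so that
// a single average is produced for the whole collection.
var map = function () {
    emit("avg", { count: 1, value: this.value });
};
// Reduce: sum up both the counts and the values of all objects passed in.
var reduce = function (key, values) {
    var result = { count: 0, value: 0 };
    values.forEach(function (v) {
        result.count += v.count;
        result.value += v.value;
    });
    return result;
};
// Finalize: turn the sum/count pair into the average.
var finalize = function (key, reduced) {
    return reduced.value / reduced.count;
};
db.data.mapReduce(map, reduce, { finalize: finalize, out: { inline: 1 } });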

Related

Order of results for `sort` using mongoose

If two documents have equal values for a field, what will be the order of results when sorting on that field? Random, or ordered by insertion date?
If two documents have equal values for the field you're sorting on, then MongoDB will return the results in the order they are found on disk (i.e. natural order).
From the MongoDB documentation:
natural order: The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This may coincide with insertion date in some cases, but not all of the time (especially when you perform insertions/deletions on your collection), so you should assume that this is a random ordering.
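If you need ties to break deterministically, a common approach is to add a unique field as a secondary sort key. A minimal sketch, assuming a hypothetical items collection sorted by score:
// Ties on "score" are broken deterministically by _id (which for
// default ObjectIds roughly tracks insertion time).
db.items.find().sort({ score: 1, _id: 1 });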

Possible to retrieve multiple random, non-sequential documents from MongoDB?

I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.
I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.
Instead, I'd like to retrieve a random set like MySQL does using one query (or a minimal number of queries), and I need this list to be random every time. I need this to be efficient -- roughly on par with such a query in MySQL. I want to reproduce the following, but in MongoDB:
SELECT * FROM products ORDER BY rand() LIMIT 50;
Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.
I've seen one method of adding a field to each document, generating a random value for each field, and using {rand: {$gte: rand()}} on each query we want randomized. But my concern is that two queries could theoretically return the same set.
You can do it in two requests, but still in an efficient way:
Your first request just gets the list of all "_id" values in your collection. Be sure to use a projection: db.products.find({}, { '_id' : 1 }).
From that list of "_id" values, just pick N randomly.
Do a second query using the $in operator.
What is especially important is that your first query is fully supported by an index (because it's "_id"). This index is likely fully in memory (else you'd probably have performance problems). So, only the index is read while running the first query, and it's incredibly fast.
Although the second query means reading actual documents, the index will help a lot.
If you can do things this way, you should try.
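A minimal sketch of this approach in the mongo shell, assuming a products collection and N = 50 (both are placeholders):
// 1) Fetch only the _id values; this query is covered by the _id index.
var ids = db.products.find({}, { _id: 1 }).toArray().map(function (doc) {
    return doc._id;
});
// 2) Pick N distinct ids at random (partial Fisher-Yates shuffle).
var n = 50;
for (var i = 0; i < n && i < ids.length; i++) {
    var j = i + Math.floor(Math.random() * (ids.length - i));
    var tmp = ids[i]; ids[i] = ids[j]; ids[j] = tmp;
}
var sample = ids.slice(0, n);
// 3) Fetch the corresponding documents in one query.
var randomDocs = db.products.find({ _id: { $in: sample } }).toArray();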
I don't think MySQL's ORDER BY rand() is particularly efficient: as I understand it, it essentially assigns a random number to each row, sorts the table on that random number column, and returns the top N results.
If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range.
Add a counter field to each document: each document is assigned a unique positive integer, sequentially. It doesn't matter which document gets which number, as long as the assignment is unique and the numbers are sequential, and you either never delete documents or you complicate the counter scheme to handle holes.
You can do this by making your inserts two-step. In a separate counter collection, keep a document with the first number that hasn't been used for the counter. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment the counter atomically, then insert the new document with that counter value.
To find N random values, find the max counter value, generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages have random libraries that will handle generating the N random integers in a range.
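A hedged sketch of the two-step insert and the sampling query, assuming a counters collection and a products collection with no deletions (all names hypothetical):
// Two-step insert: atomically claim the next counter value, then insert.
function insertWithCounter(doc) {
    var counter = db.counters.findAndModify({
        query: { _id: "products" },
        update: { $inc: { seq: 1 } },
        upsert: true,
        new: true   // return the just-assigned (i.e. last used) value
    });
    doc.seq = counter.seq;
    db.products.insert(doc);
}
// Sampling: draw n distinct random integers in [1, maxSeq], then fetch
// the matching documents with $in. Assumes maxSeq >= n and no holes in
// the sequence (i.e. no deletions).
function sampleRandom(n) {
    var maxSeq = db.counters.findOne({ _id: "products" }).seq;
    var picks = {};
    while (Object.keys(picks).length < n) {
        picks[1 + Math.floor(Math.random() * maxSeq)] = true;
    }
    return db.products.find({ seq: { $in: Object.keys(picks).map(Number) } }).toArray();
}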

Distinguish array from single value in a document

I have two type of documents in a mongodb collection:
one where key sessions has a simple value:
{"sessions": NumberLong("10000000000001")}
one where key sessions has an array of values.
{"sessions": [NumberLong("10000000000001")]}
Is there any way to retrieve all documents from the second category, i.e. only documents whose value is an array and not a simple value?
You can use this kind of query for that:
db.collectionName.find( { $where : "Array.isArray(this.sessions)" } );
but you'd be better off converting all the records to one type to keep things consistent.
The query can be as simple as this:
db.c.find({sessions:{$gte:[]}});
Explanation:
Because you only want to retrieve documents whose sessions value is of array type, and because of how $gte behaves (if the data types of the two operands differ, it returns false; Double, Integer32 and Integer64 are treated as the same type), giving an empty array as the other operand retrieves exactly the results required.
Also, $gt, $lt and $lte in standard queries (note: the operators with the same names in aggregation pipeline expressions behave differently) share this behavior. I verified this in practice on MongoDB v2.4.8 and v2.6.4.
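A quick demonstration, assuming the two example documents above live in a collection named c:
// Insert one scalar-valued and one array-valued document.
db.c.insert({ sessions: NumberLong("10000000000001") });
db.c.insert({ sessions: [NumberLong("10000000000001")] });
// Matches only the second document: any array compares >= [] (same BSON
// type), while the NumberLong value is a different type and is therefore
// never matched by the comparison.
db.c.find({ sessions: { $gte: [] } });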

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, because each query involves multiple arrays, it doesn't seem like we can create an index, because of Mongo's restriction on indexing more than one array field at a time. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes serving such queries, the value of the additional index fields drops significantly after the first range in the query.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees from the query. The first range chops off a large branch, the second a smaller one, the third smaller still, and so on. My general rule of thumb is that only the first range from the query adds value to the index.
The caveat to that rule is that additional fields in the index can be useful to aid sorting returned results.
For the first query I would create an index on the two array values followed by whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (both lte and gte). The integer and float are hard to judge without knowing the domain.
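As a sketch of that first-query index, with hypothetical field names (and bearing in mind Mongo's restriction: no document may hold arrays in more than one indexed field of a compound index):
// Hypothetical field names: the two array-valued fields first, then the
// range field expected to exclude the most documents (here the integer).
db.docs.ensureIndex({ arrayA: 1, arrayB: 1, intField: 1 });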
If the second query's two additional attributes also use ranges in the query, and do not have a significantly higher exclusion value, then I would just work with the one index.
Rob.

Maintaining order of mongodb collection

I have a collection that will have many documents (maybe millions). When a user inserts a new document, I would like to have a field that maintains the "order" of the data, which I can index. For example, suppose one field is time, in the format "1352392957.46516". If I have three documents, the first with time 1352392957.46516, the second with time 1352392957.48516 (20 ms later) and the third with time 1352392957.49516 (10 ms later), I would like another field where the first document has 0, the second 1, the third 2, and so on.
The reason I want this is so that I can index that field, and when I do a find I can do an efficient $mod operation to down-sample the data. For example, if I have a million docs and I only want 1000 of them evenly spaced, I could do a $mod [1000, 0] on the integer field, as shown below.
The reason I could not do that on the time field is that the values may not be perfectly spaced, or might be all even or all odd, so the mod would not work. The separate integer field would keep the order increasing linearly.
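For example (the field name seq is hypothetical), the down-sampling query would look like:
// Keep every 1000th document according to the sequential counter field.
db.data.find({ seq: { $mod: [1000, 0] } });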
Also, it should be possible to insert documents anywhere in the collection, so the order fields of all subsequent documents would need to be updated.
Is there a way to do this automatically? Or would I have to implement this? Or is there a more efficient way of doing what I am describing?
Updating several million documents for a single insert is well beyond "slower inserts": this approach makes your entire collection the active working set. Similarly, in order to do the $mod comparison on a key value, every key value in the index has to be compared.
Given your requirement for a sorted sampling order, I'm not sure there is a more efficient pre-aggregation approach you can take.
I would instead use skip() and limit() to fetch each sampled document. The skip() command will scan from the beginning of the index to skip over unwanted documents each time, but if you have enough RAM to keep the index in memory the performance should be acceptable:
// Add an index on time field
db.data.ensureIndex({'time':1})
// Count number of documents
var dc = db.data.count()
// Iterate and sample every 1000 docs
var i = 0; var sampleSize = 1000; var results = [];
while (i < dc) {
results.push(db.data.find().sort({time:1}).skip(i).limit(1)[0]);
i += sampleSize;
}
// Result array of sampled docs
printjson(results);