Select MongoDB documents with custom random probability distribution - mongodb

I have a collection that looks something like this:
[
{
"id": 1,
"tier": 0
},
{
"id": 2,
"tier": 1
},
{
"id": 3
"tier": 2
},
{
"id": 4,
"tier": 0
}
]
Is there a standard way to select n elements where the probabilty of choosing an element of the lowest tier is p, the next lowest tier is (1-p)*p, and so on, with standard random selection of element?
So for example, if the most likely thing happens and I run the query against the above example with n = 2 and any p > .5 (which I think will always be true), then I'd get back [{"id": 1, ...}, {"id": 4}]; with n = 3, then [{"id": 4}, {"id": 1}, {"id": 2}], etc.
E.g. here's some pseudo-Python code given a dictionary like that as objs:
def f(objs, p, n):
# get eligible tiers
tiers_set = set()
for o in objs:
eligible_tiers.add(o["tier"])
tiers_list = sorted(list(tiers_set))
# get the tier for each index of results
tiers = []
while len(tiers) < min(n, len(obis)):
tiers.append(select_random_with_initial_p(eligible_tiers, p))
# get res
res = []
for tier in tiers:
res.append(select_standard_random_in_tier(objs, tier)
return res

First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several document nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
To make your custom random selection, you can change that part [Math.random(), 0], so it best suits your random distribution
Source: Random record from MongoDB

Related

Querying array of nested objects with nested array of objects in Redshift

Let's say I have the following JSON
{
"id": 1,
"sets": [
{
"values": [
{
"value": 1
},
{
"value": 2
}
]
},
{
"values": [
{
"value": 5
},
{
"value": 6
}
]
}
]
}
If the table name is X I expect the query
SELECT x.id, v.value
FROM X as x,
x.sets as sets,
sets.values as v
to give me
id, value
1, 1
1, 2
2, 5
2, 6
and it does work if both sets and values has one object each. When there's more the query fails with column 'id' had 0 remaining values but expected 2. Seems to me I'm not iterating over "sets" properly?
So my question is: what's the proper way to query data structured like my example above in Redshift (using PartiQL)?

how to perform statistics for every n elements in MongoDB

How to perform basic statistics for every n elements in Mongodb. For example, if I have total of 100 records like below
Name
Count
Sample
a
10
x
a
20
y
a
10
z
b
10
x
b
10
y
b
5
z
how do I perform mean, median, std dev for every 10 records so I get 10 results. So I want to calculate mean/median/std dev for A for every 10 sample till all the elements of database. Similarly for b, c and so on
excuse me if it is a naive question
you need to have some sort of counter to keep track of count.... for example I have added here rownumber then applied bucket of 3 (here n=3) and then returning the sum and average of the group(3). this example can be modified to do some sorting and grouping before we create the bucket to get the desired result.
Pls refer to https://mongoplayground.net/p/CL7vQGUWD_S
db.collection.aggregate([
{
$set: {
"rownum": {
"$function": {
"body": "function() {try {row_number+= 1;} catch (e) {row_number= 0;}return row_number;}",
"args": [],
"lang": "js"
}
}
}
},
{
$bucket: {
groupBy: "$rownum",
// Field to group by
boundaries: [
1,
4,
7,
11,
14,
17,
21,
25
],
// Boundaries for the buckets
default: "Other",
// Bucket id for documents which do not fall into a bucket
output: {
// Output for each bucket
"countSUM": {
$sum: "$count"
},
"averagePrice": {
$avg: "$count"
}
}
}
}
])

Limit inserts in mongodb

I enter these documents in a table:
db.RoomUsers.insert( {userId: 1, roomId: 1} )
db.RoomUsers.insert( {userId: 2, roomId: 1} )
db.RoomUsers.insert( {userId: 3, roomId: 1} )
db.RoomUsers.insert( {userId: 4, roomId: 1} )
Now, my application requires that in RoomUsers there can be a limited number of user in each room. Let's say that there cannot be more than 5 user for each room.
How to fix that?
If I was using an RDBM I could maybe have this strategy(Im not sure its the best one, but still):
1 - Count number of entries in RoomUsers where roomId = X
2 - If number of users is less than Y then:
2A - Start a transaction
2B - Insert new user in RoomUsers
2C - Count number of entries in RoomUsers where roomId = X
2D - If number of users is greater than Y then: Rollback. Otherwise: commit
MongoDB doesn't really have transaction, what I understand. How to accomplish the same thing in noSql?
There is one approach will let you do it atomically.
You should embed userIds into RoomUsers collection. Something like
{ "userIds" : [ 1, 2, 3, 4 ], "roomId" : 1 }
Now you can use the below update query.
db.RoomUsers.update( { roomId : 1, "userIds": { $not: {$size: 5 } } }, { $push : { "userIds":5 } } )

Map Reduce Mongo DB: Sum of ODD and EVEN numbers with elements

I am trying to process a number series ( collection ) get sum of odd / even numbers separately along with elements considered for calculations of each.
The numberseries document structure is as follows:
{
_id: <Autogenerated>,
number: <any number, it can repeat. Even if it repeats, it should be added each time. >
}
The output is something like below( not exact but in general )
{
..
{
"odd":<result>, elements:{n1,n3,n5}
},
{
"even":<result>, elements:{n2,n4,n6}
}
..
}
Map Function:
mapf = function(){
var value = { sum : 0, elements :[] };
value.sum = this.number;
value.elements.push(this.number);
print(tojson(value));
if( this.number % 2 != 0 ){
emit( "odd", value );
}
if( this.number % 2 == 0 ){
emit( "even", value );
}
}
Reduce Values argument:
Values is an array of JSON emitted from map:
[{
"sum": 1,
"elements": [1]
}, {
"sum": 3,
"elements": [3]
} ... ]
Reduce Function:
reducef = function(key, values){
var result = { sum : 0 , elements:[] };
print("K " + key +"Values array " + tojson(values) );
for(var i = 0; i<values.length;i++ ){
v = values[i];
print("Key "+key+"V.JSON"+tojson(v)+" V.SUM -> "+v.sum);
result.sum += v.sum;
result.elements.push(v.elements[0]);
print(tojson(result));
}
return result;
}
I am getting sum correctly, but the elements array is not properly getting populated. It is containing only some of the elements considered for calculations.
UPDATE
As per the answer given by Neil, I further verified my code. I found that my code, without any modification, works for small dataset, but does not work for large data-set.
Below are points which I have verified as pointed out, I found my code to be correct.
print("K " + key +"Values array " + tojson(values) );
Above line in reduce function results in following values object printed.
[{
"sum": 1,
"elements": [1]
}, {
"sum": 3,
"elements": [3]
}, {
"sum": 5,
"elements": [5]
}, {
"sum": 7,
"elements": [7]
}, {
"sum": 9,
"elements": [9]
}, {
"sum": 11,
"elements": [11]
}, {
"sum": 13,
"elements": [13]
}, {
"sum": 15,
"elements": [15]
}, {
"sum": 17,
"elements": [17]
}, {
"sum": 19,
"elements": [19]
}]
Hence the line to push elements to array in final results result.elements.push(v.elements[0]); should be correct.
In map function, before emitting, I am modifying value.sum as follows
value.sum = this.number;
This ensures that sum is not zero and numbers are properly getting added due to this.
When I test this code with 20 records, 40 records, 100 records, it works perfectly.
When I test this code with 20000 records, the sum value is correct but the element array
does not contain 10000 elements each( Odd and even numbers are equally distributed in collection ) .
In later case, I get below message:
query not recording (too large)
Okay, there is a clear reason and you do appear to have read some of the documentation and at least applied this rule:
"the type of the return object must be identical to the type of the value emitted by the map function ..."
And by that this means that both the map function and the reduce function essentially have the same output, which you did:
{ sum : 0, elements :[] };
But there was a piece of documentation that has not been understood:
"MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key."
So where the whole thing goes wrong is that you have assumed that since your "map" function only emits one element, that then there will be only one element in the "elements" array. A careful re-read of the above says that this is not true. And in fact the output from "reduce" will very likely be fed back into the "reduce" function again. This is indeed how mapReduce deals with a large number of values for the "values" array.
To fix it, change this in the "reduce" function:
result.elements.push(v.elements[0]);
To this:
v.elements.forEach(function(element) {
result.elements.push(element);
}
And in that way, when the "reduce" function returns a result that has summed up a few "elements" already and pushed them to the list, then that "input" will be processed correctly and merged with any other "values" that come in with it.
BTW. I Think you actually meant this in your mapper:
var value = { sum : 1, elements :[] };
Otherwise this code down here would just be summing 0's:
result.sum += v.sum;
But aggregate does this better
All of that said the following aggregation framework statement does the same thing but better and faster with an implementation in native code:
db.collection.aggregate([
{ "$project": {
"type": { "$cond": [
{ "$eq": [ { "$mod": [ "$number", 2 ] }, 0 ] },
"even",
"odd"
]},
"number": 1
}},
{ "$group": {
"_id": "$type",
"sum": { "$sum": 1 },
"elements": { "$push": "$number" }
}}
])
And also note that in both cases you are not really "summing the elements", but rather "counting" them. So if your want the sum then the mapReduce part becomes:
//result.sum += v.sum;
v.elements.forEach(function(element) {
result.sum += element;
result.elements.push(element);
}
And the aggregate part becomes:
{ "$group": {
"_id": "$type",
"sum": { "$sum": "$number" },
"elements": { "$push": "$number" }
}}
Which truly sums the "odd" or "even" numbers as found in your collection.

Conditional $inc in a nested MongoDB array

My database looks like this:
{
_id: 1,
values: [ 1, 2, 3, 4, 5 ]
},
{
_id: 2,
values: [ 2, 4, 6, 8, 10 ]
}, ...
I'd like to update every value in every document's nested array ("values") that meets some criterion. For instance, I'd like to increment every value that's >= 4 by one, which ought to yield:
{
_id: 1,
values: [ 1, 2, 3, 5, 6 ]
},
{
_id: 2,
values: [ 2, 5, 7, 8, 11 ]
}, ...
I'm used to working with SQL, where the nested array would be a seperated table connected with a unique ID. I'm a little lost in this new NoSQL world.
Thank you kindly,
This sort of update is not really possible using nested arrays, the reason for this is given in the positional $ operator documentation, and that states that you can only match the first array element for a given condition in the query.
So a statement like this:
db.collection.update(
{ "values": { "$gte": 4 } },
{ "$inc": { "values.$": 1 } }
)
Will not work in the sense that only the "first" array element that was matched would be incremented. So on your first document you would get this:
{ "_id" : 1, "values" : [ 1, 2, 3, 6, 6 ] }
In order to update the values as you are suggesting you would need to iterate the documents and the array elements to produce the result:
db.collecction.find({ "values": { "$gte": 4 } }).forEach(function(doc) {
for ( var i=0; i < doc.values.length; i++ ) {
if ( doc.values[i] >= 4 ) {
doc.values[i]++;
}
}
db.collection.update(
{ "_id": doc._id },
{ "$set": { "values": doc.values } }
);
})
Or whatever code equivalent of that basic concept.
Generally speaking, this sort of update does not lend itself well to a structure that contains elements in an array. If that is really your need, then the elements are better off listed within a separate collection.
Then again, the presentation of this question is more of a "hypothetical" situation without understanding your actual use case for performing this sort of udpate. So if you possibly described what you actually need to do and how your data really looks in another question, then that might get a more meaningful response in terms of the best approach for you to use.