Linear funnel from a collection of events with MongoDB aggregation, is it possible?

I have a number of event documents. Each event has a number of fields, but the ones relevant to my query are:
person_id - a reference to the person that triggered the event
event - a string key identifying the event
occurred_at - the UTC time at which the event occurred
What I want to achieve is:
for a list of event keys, e.g. `['event_1', 'event_2', 'event_3']`
get counts of the number of people that performed each event and all the events previous to it, in order, i.e.:
the number of people who performed event_1
the number of people who performed event_1, and then event_2
the number of people who performed event_1, and then event_2, and then event_3
etc
a secondary goal is to be able to get the average occurred_at date for each event so that I can calculate the average time between each event
The best I have got is the following two map-reduces:
db.events.mapReduce(function () {
    emit(this.person_id, {
        e: [{
            e: this.event,
            o: this.occurred_at
        }]
    })
}, function (key, values) {
    return {
        e: [].concat.apply([], values.map(function (x) {
            return x.e
        }))
    }
}, {
    query: {
        account_id: ObjectId('52011239b1b9229f92000003'),
        event: {
            $in: ['event_a', 'event_b', 'event_c', 'event_d', 'event_e', 'event_f']
        }
    },
    out: 'people_funnel_chains',
    sort: { person_id: 1, occurred_at: 1 }
})
And then:
db.people_funnel_chains.mapReduce(function () {
    var funnel = ['event_a', 'event_b', 'event_c', 'event_d', 'event_e', 'event_f'];
    var events = this.value.e;
    for (var f in funnel) {
        var e = funnel[f];
        var i;
        if ((i = events.map(function (x) {
            return x.e
        }).indexOf(e)) > -1) {
            emit(e, { c: 1, o: events[i].o });
            events = events.slice(i + 1, events.length);
        } else {
            break;
        }
    }
}, function (key, values) {
    return {
        c: Array.sum(values.map(function (x) { return x.c })),
        o: new Date(Array.sum(values.map(function (x) { return x.o.getTime() })) / values.length)
    };
}, { out: { inline: 1 } })
I would like to achieve this in real time using the aggregation framework, but I can see no way to do it. For tens of thousands of records this takes tens of seconds. I can run it incrementally, which makes it fast enough for new data coming in, but if I want to modify the original query (e.g. change the event chain) it can't be done in a single request, which I would love to be able to do.
Update using Cursor.forEach()
Using Cursor.forEach() I've managed to get a huge improvement on this (essentially removing the need for the first map-reduce).
var time = new Date().getTime(),
    funnel_event_keys = ['event_a', 'event_b', 'event_c', 'event_d', 'event_e', 'event_f'],
    looking_for_i = 0,
    looking_for = funnel_event_keys[0],
    funnel = {},
    last_person_id = null;
for (var i in funnel_event_keys) { funnel[funnel_event_keys[i]] = [0, null] };
db.events.find({
    account_id: ObjectId('52011239b1b9229f92000003'),
    event: {
        $in: funnel_event_keys
    }
}, { person_id: 1, event: 1, occurred_at: 1 }).sort({ person_id: 1, occurred_at: 1 }).forEach(function (e) {
    var current_person_id = e['person_id'].str;
    if (last_person_id != current_person_id) {
        looking_for_i = 0;
        looking_for = funnel_event_keys[0];
    }
    if (e['event'] == looking_for) {
        var funnel_event = funnel[looking_for];
        funnel_event[0] = funnel_event[0] + 1;
        funnel_event[1] = ((funnel_event[1] || e['occurred_at'].getTime()) + e['occurred_at'].getTime()) / 2;
        looking_for_i = looking_for_i + 1;
        looking_for = funnel_event_keys[looking_for_i];
    }
    last_person_id = current_person_id;
})
funnel;
new Date().getTime() - time;
I wonder if something custom with the data in memory could improve on this? Getting hundreds of thousands of records out of MongoDB into memory (on a different machine) is going to be a bottleneck; is there a technology I'm not aware of that could do this?

I wrote up a complete answer on my MongoDB blog, but as a summary: project the actions you care about, mapping values of the action field into appropriate key names; group by person, aggregating when they did each of the three actions (and optionally how many times); then project new fields which check whether action2 was done after action1, and action3 after action2. The last phase just sums up the number of people who did just 1, or 1 and then 2, or 1 and then 2 and then 3.
Using a function to generate the aggregation pipeline, it's possible to generate results based on an array of actions passed in.
In my test case, the entire pipeline ran in under 200ms for a collection of 40,000 documents (this was on my small laptop).
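For concreteness, here is a minimal sketch of that pipeline shape, assuming the question's fields (person_id, event, occurred_at) and a three-step funnel; the event names are illustrative, not the blog post's exact code:
db.events.aggregate([
    { $match: { event: { $in: ['event_1', 'event_2', 'event_3'] } } },
    // Map each event name onto its own field holding the timestamp.
    { $project: {
        person_id: 1,
        e1: { $cond: [{ $eq: ['$event', 'event_1'] }, '$occurred_at', null] },
        e2: { $cond: [{ $eq: ['$event', 'event_2'] }, '$occurred_at', null] },
        e3: { $cond: [{ $eq: ['$event', 'event_3'] }, '$occurred_at', null] }
    }},
    // One document per person, keeping the earliest time of each event
    // ($min skips the nulls as long as a real timestamp exists).
    { $group: {
        _id: '$person_id',
        e1: { $min: '$e1' },
        e2: { $min: '$e2' },
        e3: { $min: '$e3' }
    }},
    // Flag each funnel stage; dates compare greater than null in BSON order.
    { $project: {
        did1: { $cond: [{ $gt: ['$e1', null] }, 1, 0] },
        did12: { $cond: [{ $and: [
            { $gt: ['$e1', null] },
            { $gt: ['$e2', '$e1'] }
        ] }, 1, 0] },
        did123: { $cond: [{ $and: [
            { $gt: ['$e1', null] },
            { $gt: ['$e2', '$e1'] },
            { $gt: ['$e3', '$e2'] }
        ] }, 1, 0] }
    }},
    // Sum people at each stage of the funnel.
    { $group: {
        _id: null,
        event_1: { $sum: '$did1' },
        event_1_then_2: { $sum: '$did12' },
        event_1_then_2_then_3: { $sum: '$did123' }
    }}
])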
As was correctly pointed out, the general solution I describe assumes that while an actor can take any action multiple times, they can only advance from action1 to action2 and cannot skip directly from action1 to action3 (interpreting action order as prerequisites: you cannot do action3 until you've done action2).
As it turns out, the aggregation framework can be used even for sequences of events where the order is completely arbitrary, but you still want to know how many people at some point did the sequence action1, action2, action3.
The main adjustment to the original answer is to add an extra two-stage step in the middle. This step unwinds the per-person document and re-groups it, finding the first occurrence of the second action that comes after the first occurrence of the first action.
Once we have that, the final comparison becomes: action1, followed by the earliest qualifying occurrence of action2, which is compared to the latest occurrence of action3.
This can probably be generalized to handle an arbitrary number of events, but every additional event past two would add two more stages to the aggregation.
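Sketched under the same illustrative assumptions (a per-person document carrying its events as an array of { e: event, o: occurred_at } pairs in a field evts, with the earliest event_1 time already computed into t1; these field names are mine, not the write-up's), the extra two stages look roughly like this:
// Hypothetical middle stages; evts, t1 and t2 are illustrative names.
var middleStages = [
    { $unwind: '$evts' },
    { $group: {
        _id: '$_id',
        t1: { $first: '$t1' },
        // earliest event_2 strictly after the first event_1;
        // $min skips the nulls produced for non-qualifying entries
        t2: { $min: { $cond: [
            { $and: [
                { $eq: ['$evts.e', 'event_2'] },
                { $gt: ['$evts.o', '$t1'] }
            ] },
            '$evts.o',
            null
        ] } }
    }}
];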
Here is my write-up of the modification of the pipeline to achieve the answer you are looking for.

Related

Long running Mongo queries in Meteor

How would one go about updating thousands of documents in a collection in Meteor, where forEach has to be used to first calculate the changes for each individual document?
There's a timeout of 10 minutes or so, as well as a limit of a certain number of megabytes. What I've done in the past is split the updates into groups of 300 and update like that. But is there a simpler way in Meteor to allow the forEach loop to run for an hour if needed?
Using percolate:synced-cron you could easily do this in batches.
SyncedCron.add({
    name: 'Update mass quantities',
    schedule: function(parser) {
        // parser is a later.parse object
        return parser.text('every 1 minute'); // or at any interval you wish
    },
    job: function() {
        var query = { notYetProcessed: true }; // or whatever your criteria are
        var batchSize = { limit: 300 }; // for example
        myCollection.find(query, batchSize).forEach(function(doc) {
            var update = { $set: { notYetProcessed: false }}; // along with everything else you want to update
            myCollection.update(doc._id, update);
        });
    }
});
This will run every minute until there are no more records to be processed. It will continue running after that of course but won't find anything to update.

How do I publish two random items from a Meteor collection?

I'm making an app where two random things from a collection are displayed to the user. Every time the user refreshes the page or clicks on a button, she would get another random pair of items.
For example, if the collection were of fruits, I'd want something like this:
apple vs banana
peach vs pineapple
banana vs peach
The code below is for the server side, and it works except that the random pair is generated only once. The pair doesn't update until the server is restarted. I understand this is because generate_pair() is only called once. I have tried calling generate_pair() from one of the Meteor.publish functions, but it only sometimes works. Other times, I get no items (errors) or only one item.
I don't mind publishing the entire collection and selecting random items from the client side. I just don't want to crash the browser if Items has 30,000 entries.
So to conclude, does anyone have any ideas of how to get two random items from a collection appearing on the client side?
var first_item, second_item;

// This is the best way I could find to get a random item from a Meteor collection.
// Every item in Items has a 'random_number' field with a randomly generated number between 0 and 1.
var random_item = function() {
    return Items.find({
        random_number: {
            $gt: Math.random()
        }
    }, {
        limit: 1
    });
};

// Generates a pair of items and ensures that they're not duplicates.
var generate_pair = function() {
    first_item = random_item();
    second_item = random_item();
    // Regenerate the second item if it is a duplicate
    while (first_item.fetch()[0]._id === second_item.fetch()[0]._id) {
        second_item = random_item();
    }
};

generate_pair();

Meteor.publish('first_item', function() {
    return first_item;
});

// Is this good Meteor style to have two publications doing essentially the same thing?
Meteor.publish('second_item', function() {
    return second_item;
});
The problem with your approach is that subscribing to the same publication with the same arguments (no arguments in this case) over and over on the client gets you subscribed to the server-side logic only once, because Meteor optimizes its internal pub/sub mechanism.
To truly discard the previous subscription and get the server-side publish code to re-execute and send two new random documents, you need to introduce a throwaway random argument to your publication. Your client-side code subscribes over and over with a new random number, and each time you get unsubscribed and resubscribed to new random documents.
Here is a full implementation of this pattern:
server/server.js
function randomItemId() {
    // get the total items count of the collection
    var itemsCount = Items.find().count();
    // get a random number (N) between [0, itemsCount - 1]
    var random = Math.floor(Random.fraction() * itemsCount);
    // choose a random item by skipping N items
    var item = Items.findOne({}, {
        skip: random
    });
    return item && item._id;
}

function generateItemIdPair() {
    // return an array of 2 random item ids
    var result = [
        randomItemId(),
        randomItemId()
    ];
    // re-pick the second id until it differs from the first
    while (result[0] == result[1]) {
        result[1] = randomItemId();
    }
    return result;
}

Meteor.publish("randomItems", function(random) {
    // the random argument is unused: it only defeats Meteor's subscription de-duplication
    var pair = generateItemIdPair();
    // publish the 2 items whose ids are in the random pair
    return Items.find({
        _id: {
            $in: pair
        }
    });
});
client/client.js
// every 5 seconds subscribe to 2 new random items
Meteor.setInterval(function() {
    Meteor.subscribe("randomItems", Random.fraction(), function() {
        console.log("fetched these random items:", Items.find().fetch());
    });
}, 5000);
You'll need to meteor add random for this code to work.
Another take, in CoffeeScript, samples two ids with underscore and publishes the matching docs:
Meteor.publish 'randomDocs', ->
  ids = _(Docs.find().fetch()).pluck '_id'
  randomIds = _(ids).sample 2
  Docs.find _id: $in: randomIds
Here's another approach: it uses the excellent publishComposite package to populate matches into a local (client-only) collection, so it doesn't conflict with other uses of the main collection:
if (Meteor.isClient) {
    randomDocs = new Mongo.Collection('randomDocs');
}

if (Meteor.isServer) {
    Meteor.publishComposite("randomDocs", function(select_count) {
        return {
            collectionName: "randomDocs",
            find: function() {
                let self = this;
                _.sample(baseCollection.find({}).fetch(), select_count).forEach(function(doc) {
                    self.added("randomDocs", doc._id, doc);
                }, self);
                self.ready();
            }
        };
    });
}
in onCreated: this.subscribe("randomDocs", 3);
(then in a helper): return randomDocs.find({}, { limit: 3 });

mapreduce between consecutive documents

Setup:
I have a large collection with the following fields:
Name - String
Begin - time stamp
End - time stamp
Problem:
I want to get the gaps between consecutive documents, using the map-reduce paradigm.
Approach:
I'm trying to build a new collection, mid, of consecutive-document pairs; after that I can compute the differences from it using $unwind and Pair[1].Begin - Pair[0].End
function map() {
    emit(0, this)
}

function reduce(key, values) {
    var i = 0;
    var pairs = [];
    while (i < values.length - 1) {
        pairs.push([values[i], values[i + 1]]);
        i = i + 1;
    }
    return { "pairs": pairs };
}

db.collection.mapReduce(map, reduce, { sort: { Begin: 1 }, out: { replace: "mid" } })
This works with a limited number of documents because of the 16MB document cap, since all the pairs end up in a single reduced document. I'm not sure if I need to pull the collection into memory and do it there. How else can I approach this problem?
MongoDB's mapReduce offers a different way of handling this than the approach you are using. The key factor is "keeping" the "previous" document in order to compare it to the next.
The mechanism that supports this is the "scope" option, which allows a sort of "global" variable in the overall code. As you will see, what you are asking for needs no "reduction" at all, since there is no "grouping": just emission of document "pair" data:
db.collection.mapReduce(
    function() {
        if (last == null) {
            last = this;
        } else {
            emit(
                {
                    "start_id": last._id,
                    "end_id": this._id
                },
                this.Begin - last.End
            );
            last = this;
        }
    },
    function() {},    // no reduction required
    {
        "out": { "inline": 1 },
        "scope": { "last": null }
    }
)
Switch the output to a collection if the inline results would be too large for your data size.
But in this way, by using a "global" to keep the last document, the code is both simple and efficient.
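For instance, the options might become the following (the output collection name "gaps" is illustrative; a sort is also worth keeping, since this pairwise trick assumes documents arrive in order):
// Same map function as above; write results to a collection instead
// of returning them inline. "gaps" is an illustrative name.
var options = {
    "out": { "replace": "gaps" },
    "scope": { "last": null },
    "sort": { "Begin": 1 }  // the pairwise comparison assumes sorted input
};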

Average Aggregation Queries in Meteor

Ok, still in my toy app, I want to find out the average mileage on a group of car owners' odometers. This is pretty easy on the client, but it doesn't scale. Right? But on the server, I don't exactly see how to accomplish it.
Questions:
How do you implement something on the server then use it on the client?
How do you use the $avg aggregation function of mongo to leverage its optimized aggregation function?
Or alternatively to (2) how do you do a map/reduce on the server and make it available to the client?
The suggestion by @HubertOG was to use Meteor.call, which makes sense, and I did this:
# Client side
Template.mileage.average_miles = ->
  answer = null
  Meteor.call "average_mileage", (error, result) ->
    console.log "got average mileage result #{result}"
    answer = result
  console.log "but wait, answer = #{answer}"
  answer

# Server side
Meteor.methods average_mileage: ->
  console.log "server mileage called"
  total = count = 0
  r = Mileage.find({}).forEach (mileage) ->
    total += mileage.mileage
    count += 1
  console.log "server about to return #{total / count}"
  total / count
That would seem to work fine, but it doesn't: as near as I can tell, Meteor.call is an asynchronous call, so answer will always be null when the helper returns. Handling stuff on the server seems like a common enough use case that I must have overlooked something. What would that be?
Thanks!
As of Meteor 0.6.5, the collection API doesn't support aggregation queries yet because there's no (straightforward) way to do live updates on them. However, you can still write them yourself and make the results available in a Meteor.publish, although the result will be static. In my opinion, doing it this way is still preferable, because you can merge multiple aggregations and use the client-side collection API.
Meteor.publish("someAggregation", function (args) {
var sub = this;
// This works for Meteor 0.6.5
var db = MongoInternals.defaultRemoteCollectionDriver().mongo.db;
// Your arguments to Mongo's aggregation. Make these however you want.
var pipeline = [
{ $match: doSomethingWith(args) },
{ $group: {
_id: whatWeAreGroupingWith(args),
count: { $sum: 1 }
}}
];
db.collection("server_collection_name").aggregate(
pipeline,
// Need to wrap the callback so it gets called in a Fiber.
Meteor.bindEnvironment(
function(err, result) {
// Add each of the results to the subscription.
_.each(result, function(e) {
// Generate a random disposable id for aggregated documents
sub.added("client_collection_name", Random.id(), {
key: e._id.somethingOfInterest,
count: e.count
});
});
sub.ready();
},
function(error) {
Meteor._debug( "Error doing aggregation: " + error);
}
)
);
});
The above is an example grouping/count aggregation. Some things of note:
When you do this, you'll naturally be doing an aggregation on server_collection_name and pushing the results to a different collection called client_collection_name.
This subscription isn't going to be live, and will probably be updated whenever the arguments change, so we use a really simple loop that just pushes all the results out.
The results of the aggregation don't have Mongo ObjectIDs, so we generate some arbitrary ones of our own.
The callback to the aggregation needs to be wrapped in a Fiber. I use Meteor.bindEnvironment here but one can also use a Future for more low-level control.
If you start combining the results of publications like these, you'll need to carefully consider how the randomly generated ids impact the merge box. However, a straightforward implementation of this is just a standard database query, except it is more convenient to use with Meteor APIs client-side.
TL;DR version: Almost anytime you are pushing data out from the server, a publish is preferable to a method.
For more information about different ways to do aggregation, check out this post.
I did this with the 'aggregate' method (ver 0.7.x):
if (Meteor.isServer) {
    Future = Npm.require('fibers/future');
    Meteor.methods({
        'aggregate': function(param) {
            var fut = new Future();
            MongoInternals.defaultRemoteCollectionDriver().mongo._getCollection(param.collection).aggregate(param.pipe, function(err, result) {
                fut.return(result);
            });
            return fut.wait();
        },
        'test': function(param) {
            var _param = {
                pipe: [
                    { $unwind: '$data' },
                    { $match: {
                        'data.y': "2031",
                        'data.m': '01',
                        'data.d': '01'
                    }},
                    { $project: {
                        '_id': 0,
                        'project_id': "$project_id",
                        'idx': "$data.idx",
                        'y': '$data.y',
                        'm': '$data.m',
                        'd': '$data.d'
                    }}
                ],
                collection: "yourCollection"
            };
            return Meteor.call('aggregate', _param);
        }
    });
}
If you want reactivity, use Meteor.publish instead of Meteor.call. There's an example in the docs where they publish the number of messages in a given room (just above the documentation for this.userId); you should be able to do something similar.
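For reference, a condensed version of that counts-by-room example from the Meteor docs (the Messages collection and the "counts" client-side collection follow the docs' naming):
// Server: publish a live count of messages in a room into a
// client-side "counts" collection.
Meteor.publish("counts-by-room", function (roomId) {
    var self = this;
    var count = 0;
    var initializing = true;

    // observeChanges fires 'added' for every existing document first,
    // then for each subsequent insert.
    var handle = Messages.find({ roomId: roomId }).observeChanges({
        added: function (id) {
            count++;
            if (!initializing)
                self.changed("counts", roomId, { count: count });
        },
        removed: function (id) {
            count--;
            self.changed("counts", roomId, { count: count });
        }
    });

    // Publish the initial count; later changes are streamed above.
    initializing = false;
    self.added("counts", roomId, { count: count });
    self.ready();

    // Stop observing when the client unsubscribes.
    self.onStop(function () {
        handle.stop();
    });
});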
You can use Meteor.methods for that.
// server
Meteor.methods({
    average: function() {
        ...
        return something;
    }
});

// client
var _avg = { /* Create an object to store value and dependency */
    dep: new Deps.Dependency()
};

Template.mileage.rendered = function() {
    _avg.init = true;
};

Template.mileage.averageMiles = function() {
    _avg.dep.depend(); /* Make the function rerun when _avg.dep is touched */
    if (_avg.init) { /* Fetch the value from the server if not yet done */
        _avg.init = false;
        Meteor.call('average', function(error, result) {
            _avg.val = result;
            _avg.dep.changed(); /* Rerun the helper */
        });
    }
    return _avg.val;
};

Auto increment in MongoDB to store sequence of Unique User ID

I am making an analytics system; the API call provides a Unique User ID, but it's not in sequence and is too sparse.
I need to give each Unique User ID an auto-increment id to mark an analytics datapoint in a bitarray/bitset. So the first user encountered would correspond to the first bit of the bitarray, the second user would be the second bit, etc.
So is there a solid and fast way to generate incremental Unique User IDs in MongoDB?
As the selected answer says, you can use findAndModify to generate sequential IDs.
But I strongly disagree with the opinion that you should not do that. It all depends on your business needs. Having a 12-byte ID may be very resource-consuming and cause significant scalability issues in the future.
I have a detailed answer here.
You can, but you should not
https://web.archive.org/web/20151009224806/http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
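For reference, a minimal sketch of the counter-collection pattern that the selected answer and the linked tutorial describe (the counters collection and the "userid" sequence key are the tutorial's illustrative names):
// Atomically increment and fetch the next value of a named sequence.
function getNextSequence(name) {
    var ret = db.counters.findAndModify({
        query: { _id: name },
        update: { $inc: { seq: 1 } },
        new: true,     // return the post-increment document
        upsert: true   // create the counter on first use
    });
    return ret.seq;
}

// Use the sequence value as the new document's _id.
db.users.insert({
    _id: getNextSequence("userid"),
    name: "Sarah C."
});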
Each object in mongo already has an id, and they are sortable in insertion order. What is wrong with getting a collection of user objects, iterating over it, and using each position as the incremented ID? Or go for some kind of map-reduce job entirely.
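A rough sketch of what this answer suggests, assuming a users collection and a seq field (both names are mine, for illustration):
// Walk the collection in insertion (_id) order and assign positions.
var seq = 0;
db.users.find().sort({ _id: 1 }).forEach(function (u) {
    db.users.update({ _id: u._id }, { $set: { seq: ++seq } });
});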
I know this is an old question, but I shall post my answer for posterity...
It depends on the system that you are building and the particular business rules in place.
I am building a moderate-to-large-scale CRM in MongoDb, C# (backend API), and Angular (frontend web app), and found ObjectId utterly terrible for use in Angular routing for selecting particular entities. The same goes for API controller routing.
The suggestion above worked perfectly for my project.
db.contacts.insert({
    "id": db.contacts.find().count() + 1,
    "name": "John Doe",
    "emails": [
        "john@doe.com",
        "john.doe@business.com"
    ],
    "phone": "555111322",
    "status": "Active"
});
The reason it is perfect for my case, but not all cases, is that, as the above comment states, if you delete 3 records from the collection, you will get collisions.
My business rules state that, due to our in-house SLAs, we are not allowed to delete correspondence data or client records for longer than the potential lifespan of the application I'm writing. Therefore, I simply mark records with an enum "Status" which is either "Active" or "Deleted". You can delete something from the UI, and it will say "Contact has been deleted", but all the application has done is change the status of the contact to "Deleted"; when the app calls the repository for a list of contacts, I filter out deleted records before pushing the data to the client app.
Therefore, db.collection.find().count() + 1 is a perfect solution for me...
It won't work for everyone, but if you will not be deleting data, it works fine.
Edit
In recent versions of pymongo:
db.contacts.count() + 1
The first record in your db should be added with "_id" = 1.
$database = "demo";
$collections = "democollaction";
echo getnextid($database, $collections);

function getnextid($database, $collections) {
    $m = new MongoClient();
    $db = $m->selectDB($database);
    // select the collection to query
    $collection = $db->selectCollection($collections);
    $cursor = $collection->find()->sort(array("_id" => -1))->limit(1);
    $array = iterator_to_array($cursor);
    foreach ($array as $value) {
        return $value["_id"] + 1;
    }
}
I had a similar issue: I was interested in generating unique numbers which can be used as identifiers, but don't have to be. I came up with the following solution. First, initialize the collection:
fun create(mongo: MongoTemplate) {
    mongo.db.getCollection("sequence")
        .insertOne(Document(mapOf("_id" to "globalCounter", "sequenceValue" to 0L)))
}
And then a service that returns unique (and ascending) numbers:
@Service
class IdCounter(val mongoTemplate: MongoTemplate) {

    companion object {
        const val collection = "sequence"
    }

    private val idField = "_id"
    private val idValue = "globalCounter"
    private val sequence = "sequenceValue"

    fun nextValue(): Long {
        val filter = Document(mapOf(idField to idValue))
        val update = Document("\$inc", Document(mapOf(sequence to 1)))
        val updated: Document = mongoTemplate.db.getCollection(collection).findOneAndUpdate(filter, update)!!
        return updated[sequence] as Long
    }
}
I believe this doesn't have the weaknesses related to concurrent environments that some of the other solutions may suffer from.
// await collection.insertOne({ autoIncrementId: 1 });
const { value: { autoIncrementId } } = await collection.findOneAndUpdate(
    { autoIncrementId: { $exists: true } },
    { $inc: { autoIncrementId: 1 } },
);
return collection.insertOne({ id: autoIncrementId, ...data });
I used something like nested queries in MySQL to simulate auto-increment, and it worked for me. To get the latest id and add one to it you can use:
lastContact = db.contacts.find().sort({ $natural: -1 }).limit(1)[0];
db.contacts.insert({
    "id": lastContact ? lastContact["id"] + 1 : 1,
    "name": "John Doe",
    "emails": ["john@doe.com", "john.doe@business.com"],
    "phone": "555111322",
    "status": "Active"
})
It solves the removal issue of Alex's answer: no duplicate id will appear if any record is removed.
More explanation: I just get the id of the latest inserted document, add one to it, and then set it as the id of the new record. The ternary covers the cases where we don't have any records yet, or all of the records have been removed.
This could be another approach:
const mongoose = require("mongoose");
const contractSchema = mongoose.Schema(
{
account: {
type: mongoose.Schema.Types.ObjectId,
required: true,
},
idContract: {
type: Number,
default: 0,
},
},
{ timestamps: true }
);
contractSchema.pre("save", function (next) {
var docs = this;
mongoose
.model("contract", contractSchema)
.countDocuments({ account: docs.account }, function (error, counter) {
if (error) return next(error);
docs.idContract = counter + 1;
next();
});
});
module.exports = mongoose.model("contract", contractSchema);
// First check the table length
const data = await table.find()
if (data.length === 0) {
    const id = 1
    // then post your query along with your id
} else {
    // find the last item and then its id
    const length = data.length
    const lastItem = data[length - 1]
    const lastItemId = lastItem.id // or { id } = lastItem
    const id = lastItemId + 1
    // now apply the new id to your new item
    // this works even if you delete an item from the middle
}