RethinkDB: Getting random documents for each category

RethinkDB: Getting random documents for each category - database-performance

I have a table with ~8000 quiz questions.
They are divided in 25 categories.
Each category has a attribute max_questions which tells me how many questions i have to pick randomly to generate a quiz.
e.g
Category 1 -> 2 questions
Category 2 -> 3 questions
Category 3 -> 1 question
I came up with a solution, but i takes approx. 2 seconds to perform.
r.table('categories').pluck('id', 'max_questions').orderBy('id').run(conn, function(err, cursor) {
if(err) return next(new Error(err.msg));
cursor.toArray(function(err, categories) {
if(err) return next(new Error(err.msg));
async.concat(categories, function(category, callback) {
r.table('questions').filter({category_id: category.id }).sample(category.max_questions).run(conn, callback);
}, function(err, questions) {
if(err) return next(new Error(err.msg));
res.json(questions);
});
});
});
Is there a faster way to retrieve the questions with RethinkDB? Making 25 requests and calling 25 times .sample() for one quiz doesn't sound good to me.
I really appreciate your help!

It'll be much faster if you do it all in one query rather than making multiple requests to the database. Here's a single query that's more or less equivalent to what you wrote:
categories.map(function (doc) {
return doc.merge(
{"questions":
questions
.filter({category_id:doc("id")})
.sample(doc("max_questions"))
.coerceTo("ARRAY")})
})
Notice I've bound the tables to variables here so categories is bound to r.table("categories").

Related

Get number of documents in a big collection in Cloud Firestore

I know this question was already asked but I'm being specific about my case: I've got a large database (approximately 1 million documents inside the collection users).
I wanna get the exact number of documents inside users. I'm trying this:
export const count_users = functions.https.onRequest((request, response) => {
corsHandler(request, response, () => {
db.collection('users').select().get().then(
(snapshot) => response.json(snapshot.docs.length)
)
.catch(function(error) {
console.error("[count_users] Error counting users: ", error);
response.json("Failed");
});
});
});
Although it seems right, it takes forever to give me a result. I'm not allowed to add or remove documents from the database.
Is there any possible approach for getting this quantity?

MongoDB requests very slow to return results

So, I only have a few documents in my Mongo DB. For example, I have this basic find request (see below) which takes 4 seconds to return a 1.12KB JSON, before the component re-render.
app.get('/mypath', (req, res) => {
MongoClient.connect(urlDb, (err, db) => {
let Mycoll = db.collection('Mycoll');
Mycoll.find({}).toArray( (err, data) => {
if (err) throw err;
else{
res.status(200).json(data);
}
})
db.close();
})
});
Sometimes for that same component to re-render, with the same request, it takes 8 seconds (which equals an eternity for an Internet user).
Is it supposed to take this long ? I can imagine a user of my app starting to think ("well, that doesn't work") and close it right before the results show.
Is there anything you could point me to to optimize the performance ? Any tool you would recommend to analyze what exactly causes this bottleneck ? Or anything I did wrong ?
At this stage, I don't incriminate React/Redux, because with no DB requests involved, my other components render fast.

How do I publish two random items from a Meteor collection?

I'm making an app where two random things from a collection are displayed to the user. Every time the user refreshes the page or clicks on a button, she would get another random pair of items.
For example, if the collection were of fruits, I'd want something like this:
apple vs banana
peach vs pineapple
banana vs peach
The code below is for the server side and it works except for the fact that the random pair is generated only once. The pair doesn't update until the server is restarted. I understand it is because generate_pair() is only called once. I have tried calling generate_pair() from one of the Meteor.publish functions but it only sometimes works. Other times, I get no items (errors) or only one item.
I don't mind publishing the entire collection and selecting random items from the client side. I just don't want to crash the browser if Items has 30,000 entries.
So to conclude, does anyone have any ideas of how to get two random items from a collection appearing on the client side?
var first_item, second_item;
// This is the best way I could find to get a random item from a Meteor collection
// Every item in Items has a 'random_number' field with a randomly generated number between 0 and 1
var random_item = function() {
return Items.find({
random_number: {
$gt: Math.random()
}
}, {
limit: 1
});
};
// Generates a pair of items and ensure that they're not duplicates.
var generate_pair = function() {
first_item = random_item();
second_item = random_item();
// Regenerate second item if it is a duplicate
while (first_item.fetch()[0]._id === second_item.fetch()[0]._id) {
second_item = random_item();
}
};
generate_pair();
Meteor.publish('first_item', function() {
return first_item;
});
// Is this good Meteor style to have two publications doing essentially the same thing?
Meteor.publish('second_item', function() {
return second_item;
});

The problem with your approach is that subscribing to the same publication with the same arguments (no arguments in this case) over and over in the client will only get you subscribed only once to the server-side logic, this is because Meteor is optimizing its internal Pub/Sub mechanism.
To truly discard the previous subscription and get the server-side publish code to re-execute and send two new random documents, you need to introduce a useless random argument to your publication, your client-side code will subscribe over and over to the publication with a random number and each time you'll get unsubscribed and resubscribed to new random documents.
Here is a full implementation of this pattern :
server/server.js
function randomItemId(){
// get the total items count of the collection
var itemsCount = Items.find().count();
// get a random number (N) between [0 , itemsCount - 1]
var random = Math.floor(Random.fraction() * itemsCount);
// choose a random item by skipping N items
var item = Items.findOne({},{
skip: random
});
return item && item._id;
}
function generateItemIdPair(){
// return an array of 2 random items ids
var result = [
randomItemId(),
randomItemId()
];
//
while(result[0] == result[1]){
result[1] = randomItemId();
}
//
return result;
}
Meteor.publish("randomItems",function(random){
var pair = generateItemIdPair();
// publish the 2 items whose ids are in the random pair
return Items.find({
_id: {
$in: pair
}
});
});
client/client.js
// every 5 seconds subscribe to 2 new random items
Meteor.setInterval(function(){
Meteor.subscribe("randomItems", Random.fraction(), function(){
console.log("fetched these random items :", Items.find().fetch());
});
}, 5000);
You'll need to meteor add random for this code to work.

Meteor.publish 'randomDocs', ->
ids = _(Docs.find().fetch()).pluck '_id'
randomIds = _(ids).sample 2
Docs.find _id: $in: randomIds

Here's another approach, uses the excellent publishComposite package to populate matches in a local (client-only) collection so it doesn't conflict with other uses of the main collection:
if (Meteor.isClient) {
randomDocs = new Mongo.Collection('randomDocs');
}
if (Meteor.isServer) {
Meteor.publishComposite("randomDocs",function(select_count) {
return {
collectionName:"randomDocs",
find: function() {
let self=this;
_.sample(baseCollection.find({}).fetch(),select_count).forEach(function(doc) {
self.added("randomDocs",doc._id,doc);
},self);
self.ready();
}
}
});
}
in onCreated: this.subscribe("randomDocs",3);
(then in a helper): return randomDocs.find({},{$limit:3});

Mongo - new vs processed approach

I am new to Mongo and have gotten close to where I want to be after 3 days of banging my head against the keyboard, but now I think I may just be misunderstanding certain key concepts:
What I am trying to do:
I have a node script that is pulling in feed items from various sources very frequently and storing them (title, link, origin, processed:false)
I have another script pulling out records at random, one at a time, using them, and updating processed:true
End Goal: Items should be unique by title - if it's been seen before it should not be written to DB, and once it's been processed one time, it should never be processed again.
INSERT SCRIPT:
key = {'title':title};
data = {'origin':origin, 'title':title, 'original_link':original_url, 'processed':false};
collection.update(key, data, {upsert:true}, function(err, doc) { ...
READ SCRIPT:
collection.findOne({processed:false}, function(err, doc){
if (err) throw err;
logger.info("Read out the following item from mongodb:...");
console.dir(doc);
thisId = doc._id;
markProcessed(thisId);
}
var markProcessed = function(id) {
collection.update({ _id:id },
{
$set: {'processed':true},
}, function(err, doc){
if (err) throw err;
logger.info("Marked record:"+id+" as processed");
console.dir(doc);
}
)
};
I've tried using collection.ensureIndex({'title':1}, {unique:true}) to no success either.
As the two scripts run in parallel the read script ends up repeating work on already processed records, and although the markProcessed function was working all yesterday it miraculously does not today :)
I would very much appreciate any guidance.

There is a problem with your insert script. When you use collection.update and you already have a document with the same key in the database, that document will be overwritten with the new one. An unique index doesn't prevent this, because there aren't two documents with the same title in the collection at the same time.
When you don't want to overwrite an existing record, use collection.insert which will fail when the inserted document violates an unique index.

MongoDB Social Network Adding Followers

I'm implementing a social network in MongoDB and I need to keep track of Followers and Following for each User. When I search for Users I want to display a list like Facebook with the User Name, Picture and number of Followers & Following. If I just wanted to display the User Name and Picture (info that doesn't change) it would be easy, but I also need to display the number of Followers & Following (which changes fairly regularly).
My current strategy is to embed the People a User follows into each User Document:
firstName: "Joe",
lastName: "Bloggs",
follows: [
{
_id: ObjectId("520534b81c9aac710d000002"),
profilePictureUrl: "https://pipt.s3.amazonaws.com/users/xxx.jpg",
name: "Mark Rogers",
},
{
_id: ObjectId("51f26293a5c5ea4331cb786a"),
name: "The Palace Bar",
profilePictureUrl: "https://s3-eu-west-1.amazonaws.com/businesses/xxx.jpg",
}
]
The question is - What is the best strategy to keep track of the number of Followers & Following for each User?
If I include the number of Follows / Following as part of the embedded document i.e.
follows: [
{
_id: ObjectId("520534b81c9aac710d000002"),
profilePictureUrl: "https://pipt.s3.amazonaws.com/users/xxx.jpg",
name: "Mark Rogers",
**followers: 10,**
**following: 400**
}
then every time a User follows someone requires multiple updates across all the embedded documents.
Since the consistency of this data isn't really important (i.e. Showing someone I have 10 instead of 11 followers isn't the end of the world), I can queue this update. Is this approach ok or can anyone suggest a better approach ?

You're on the right track. Think about which calculation is performed more - determining the number of followers/following or changing number of followers/following? Even if you're caching the output of the # of followers/following calculation it's still going to be performed one or two orders of magnitude more often than changing the number.
Also, think about the opposite. If you really need to display the number of followers/following for each of those users, you'll have to then do an aggregate on each load (or cache it somewhere, but you're still doing a lot of calcs).
Option 1: Cache the number of followers/following in the embedded document.
Upsides: Can display stats in O(1) time
Downsides: Requires O(N) time to follow/unfollow
Option 2: Count the number of followers/following on each page view (or cache invalidation)
Upsides: Can follow/unfollow in O(1) time
Downsides: Requires O(N) time to display
Add in the fact that follower/following stats can be eventually consistent whereas the counts have to be displayed on demand and I think it's a pretty easy decision to cache it.

I've gone ahead and implement the update followers/following based on the same strategy recommended by Mason (Option 1). Here's my code in NodeJs and Mongoose and using the AsyncJs Waterfall pattern in case anyone is interested or has any opinions. I haven't implemented queuing yet but the plan would be to farm most of this of to a queue.
async.waterfall([
function (callback) {
/** find & update the person we are following */
Model.User
.findByIdAndUpdate(id,{$inc:{followers:1}},{upsert:true,select:{fullName:1,profilePictureUrl:1,address:1,following:1,followers:1}})
.lean()
.exec(callback);
},
function (followee, callback) {
/** find & update the person doing the following */
var query = {
$inc:{following:1},
$addToSet: { follows: followee}
}
Model.User
.findByIdAndUpdate(credentials.username,query,{upsert:true,select:{fullName:1,profilePictureUrl:1,address:1,following:1,followers:1}})
.lean()
.exec(function(err,follower){
callback(err,follower,followee);
});
},
function(follower,followee,callback){
/** update the following count */
Model.User
.update({'follows._id':follower.id},{'follows.$.following':follower.following},{upsert:true,multi:true},function(err){
callback(err,followee);
});
},
function(followee,callback){
/** update the followers count */
Model.User
.update({'follows._id':followee.id},{'follows.$.followers':followee.followers},{upsert:true,multi:true},callback);
}
], function (err) {
if (err)
next(err);
else {
res.send(HTTPStatus.OK);
next();
}
});