How should I insert a bunch of data in Meteor Mongodb? - mongodb

It's taking a long time to save 500+ Facebook friends in MongoDB and I think I'm doing it so wrong. I'll paste how I'm doing the insertion:
models.js:
Friends = new Meteor.Collection('friends');
Friend = {
set : function(owner, friend) {
var user_id = get_user_by_uid(friend['uid']);
return Friends.update({uid: friend['uid'], owner: owner}, {$set:{
name : friend['name'],
pic_square : 'https://graph.facebook.com/'+friend['uid']+'/picture?width=150&height=150',
pic_cover : friend['pic_cover'],
uid : friend['uid'],
likes_count : friend['likes_count'],
friend_count : friend['friend_count'],
wall_count : friend['wall_count'],
age : get_age(friend['birthday_date']),
mutual_friend_count : friend['mutual_friend_count'],
owner : owner,
user_id : user_id ? user_id['_id'] : undefined
}}, {upsert: true});
}
}
server.js:
// First get facebook list of friends
friends = friends['data']['data'];
_.each(friends, function(friend){
Friend.set(user_id, friend);
});
The loads go high with 2+ users and it takes ages to insert on the database. What should I change here ?

The performance is bad for two reasons I think.
First, you are experiencing minimongo performance, not mongodb performance, on the client. minimongo can't index, so upsert is expensive: every upsert scans the collection, so inserting n documents costs O(n^2) overall. Simply add an if (this.isSimulation) return; statement inside your method, just before the database update.
Take a look at some sample code to see how to organize your code a bit, because Friend.set(user_id, friend) should be occurring in a method call, conventionally, defined in your model.js. Then, it should escape if it is detected as a client simulating the call, as opposed to the server executing it.
Second, you are using uid and owner like a key without making them a key. On your server startup code, add Friends._ensureIndex({uid:1, owner:1}).
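To see why the unindexed lookup hurts, here is a minimal plain-JavaScript sketch (no Meteor or MongoDB involved) that contrasts an upsert that scans every document with one that uses a keyed lookup, which is roughly what an index on uid and owner makes possible. The helper names upsertByScan and upsertByKey are made up for this illustration, and the comparison count is only a stand-in for real query cost.

```javascript
// Hypothetical helpers for illustration only; not Meteor or MongoDB APIs.

// Upsert without an index: every call scans the array for a matching document.
function upsertByScan(docs, selector, fields, stats) {
  for (const doc of docs) {
    stats.comparisons += 1;
    if (doc.uid === selector.uid && doc.owner === selector.owner) {
      Object.assign(doc, fields);
      return;
    }
  }
  docs.push(Object.assign({}, selector, fields));
}

// Upsert with a keyed lookup, standing in for an index on {uid: 1, owner: 1}.
function upsertByKey(index, selector, fields) {
  const key = JSON.stringify([selector.uid, selector.owner]);
  const doc = index.get(key) || Object.assign({}, selector);
  index.set(key, Object.assign(doc, fields));
}

const friends = Array.from({ length: 500 }, (_, i) => ({ uid: String(i), name: 'friend ' + i }));
const scanned = [];
const keyed = new Map();
const stats = { comparisons: 0 };

for (const friend of friends) {
  const selector = { uid: friend.uid, owner: 'me' };
  upsertByScan(scanned, selector, { name: friend.name }, stats);
  upsertByKey(keyed, selector, { name: friend.name });
}

console.log(scanned.length, keyed.size, stats.comparisons);
```

Both versions end up with the same 500 documents, but the scanning version performs roughly n²/2 comparisons for n first-time inserts, which is the quadratic growth described above.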
If none of this works, then your queries to Facebook might be rate-limited in some way.
Check out https://stackoverflow.com/a/8805436/1757994 for some discussion about the error message you'd receive if you are being rate limited.
They almost certainly do not want you copying the graph the way you are doing. You might want to consider not copying the graph at all and only getting the data on a use-basis, because it very rapidly becomes stale anyway.

Related

query too large issue with mongodb

Let's say we have a collection of users and each user is followed by other users. If I want to find the users that are NOT following me, I need to do something like:
db.users.find({_id: { $nin : followers_ids } } ) ;
If the number of followers_ids is huge, let's say 100k users, MongoDB will start saying the query is too large; besides, sending that much data over the network just to make the query is not good either. What are the best practices for running this query without sending all these ids over the network?
I recommend that you limit the number of query results to reduce network demand. According to the docs:
MongoDB cursors return results in groups of multiple documents. If you know the number of results you want, you can reduce the demand on network resources by issuing the limit() method.
This is typically used in conjunction with sort operations. For
example, if you need only 50 results from your query to the users
collection, you would issue the following command:
db.users.find({_id: {$nin : followers_ids}}).sort( { timestamp : -1 } ).limit(50)
You can then use the cursor to retrieve more user documents as needed.
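As a rough in-memory model of what that query returns (plain JavaScript, not the MongoDB driver), the sketch below filters out follower ids, sorts by a timestamp field and keeps the first page. The function name notFollowingPage and the sample documents are assumptions made for the example.

```javascript
// In-memory stand-in for find({_id: {$nin: ids}}).sort({timestamp: -1}).limit(n).
// notFollowingPage is a hypothetical helper, not part of any MongoDB API.
function notFollowingPage(users, followerIds, pageSize) {
  const excluded = new Set(followerIds);          // $nin
  return users
    .filter(user => !excluded.has(user._id))
    .sort((a, b) => b.timestamp - a.timestamp)    // sort({timestamp: -1})
    .slice(0, pageSize);                          // limit(pageSize)
}

const users = [
  { _id: 'a', timestamp: 10 },
  { _id: 'b', timestamp: 40 },
  { _id: 'c', timestamp: 30 },
  { _id: 'd', timestamp: 20 },
];
const page = notFollowingPage(users, ['b'], 2);
console.log(page.map(user => user._id));
```

Note that limit() only shrinks the result set; the list of excluded ids still travels with the query, which is why the schema change discussed next matters for very large follower lists.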
Recommendation to Restructure Followers Schema
I would recommend that you restructure your user documents if the followers will grow to a large amount. Currently user schema may be as such:
{
_id: ObjectId("123"),
username: "jobs",
email: "stevej#apple.com",
followers: [
ObjectId("12345"),
ObjectId("12375"),
ObjectId("12395"),
]
}
The good thing about this schema is that whenever this user does anything, all of the users you need to notify are right here inside the document. The downside is that if you need to find everyone a user is following, you will have to query the entire users collection. Also, your user document will become larger and more volatile as the followers grow.
You may want to further normalize your followers. You can keep a collection that matches followee to followers with documents that look like this:
{
_id: ObjectId("123"),//Followee's "_id"
followers: [
ObjectId("12345"),
ObjectId("12375"),
ObjectId("12395"),
]
}
This will keep your user documents slender, but will take an extra query to get the followers. As the "followers" array changes in size, you can enable the usePowerOf2Sizes allocation strategy to reduce fragmentation and moves.

MongoDB document model size/performance limits? A collection with an object that possibly houses 100k+ names?

I'm trying to build an event website that will host videos and such. I've set up a collection with the event name, event description, and an object with some friendly info of people "attending". If things go well there might be 100-200k people attending, and those people should have access to whoever else is in the event. (clicking on the friendly name will find the user's id and subsequently their full profile) Is that asking too much of mongo? Or is there a better way to go about doing something like that? It seems like that could get rather large rather quick.
{
_id : ...., // event id
name : "event name",
description : "event description",
attendees : [
{ username : "user's friendly name", avatarlink : "avatar url" },
{ username : "user's friendly name", avatarlink : "avatar url" },
{ username : "user's friendly name", avatarlink : "avatar url" },
{ username : "user's friendly name", avatarlink : "avatar url" }
]
}
Thanks for the suggestions!
In MongoDB many-to-many modeling (or one-to-many) in general, you should take a different approach depending if the many are few (up to few dozens usually) or "really" many as in your case.
It will be better for you not to use embedding in your case, and instead normalize. If you embed users in your events collection, adding attendees to a certain event will increase the array size. Since documents are updated in place, if the document no longer fits its allocated space on disk, it will have to be moved, a very expensive operation which will also cause fragmentation. There are a few techniques to deal with moves, but none is ideal.
Having an array of ObjectIds as attendees will be better in that documents will grow much less dramatically, but it still raises a few problems. How will you find all the events a user has participated in? You can have a multi-key index on attendees, but once a document moves, the index has to be updated for every user entry (the index contains a pointer to the document's place on disk). In your case, where you plan to have up to 200K users, that will be very painful.
Embedding is a very cool feature of MongoDB and other document-oriented databases, but it's naive to think it never comes with a price.
I think you should really rethink your schema: having an events collection, a users collection and a user_event collection with a structure similar to this:
{
_id : ObjectId(),
user_id : ObjectId(),
event_id : ObjectId()
}
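A minimal in-memory sketch of how such a link collection is used, assuming plain arrays in place of the collections and string ids in place of ObjectIds; the helper names attendeesOf and eventsOf are hypothetical.

```javascript
// Plain arrays stand in for the users and user_event collections.
const users = [
  { _id: 'u1', username: 'alice' },
  { _id: 'u2', username: 'bob' },
];
const userEvents = [
  { user_id: 'u1', event_id: 'e1' },
  { user_id: 'u2', event_id: 'e1' },
  { user_id: 'u1', event_id: 'e2' },
];

// Step 1: read the links for the event; step 2: fetch those users
// (in MongoDB the second step would be an $in query on _id).
function attendeesOf(eventId) {
  const ids = new Set(
    userEvents.filter(link => link.event_id === eventId).map(link => link.user_id)
  );
  return users.filter(user => ids.has(user._id));
}

// The reverse lookup needs only the link collection.
function eventsOf(userId) {
  return userEvents.filter(link => link.user_id === userId).map(link => link.event_id);
}

console.log(attendeesOf('e1').map(user => user.username));
console.log(eventsOf('u1'));
```

Adding an attendee is then a small insert into the link collection, so neither the event document nor the user document grows.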
Normalization is not a dirty word
Perhaps you should consider modeling your data in two collections and your attendees field in an event document would be an array of user ids.
Here's a sample of the schema:
db.events
{
_id : ...., // event Id,
'name' : // event name
'description' : //event description
'attendees' :[ObjectId('userId1'), ObjectId('userId2') ...]
}
db.users
{
_id : ObjectId('userId1'),
username: 'user friendly name',
avatarLink: 'url to avatar'
}
Then you could do 2 separate queries
db.events.find({_id: ObjectId('eventId')});
db.users.find( {_id: {$in: [ObjectId('userId1'), ObjectId('userId2')]}});

mongodb - add column to one collection find based on value in another collection

I have a posts collection which stores posts related info and author information. This is a nested tree.
Then I have a postrating collection which stores which user has rated a particular post up or down.
When a request is made to get a nested tree for a particular post, I also need to return if the current user has voted, and if yes, up or down on each of the post being returned.
In SQL this would be something like "posts.*, postrating.vote from posts join postrating on postID and postrating.memberID=currentUser".
I know MongoDB does not support joins. What are my options with MongoDB?
use map reduce - performance for a simple query?
in the post document store the ratings - BSON size limit?
Get list of all required posts. Get list of all votes by current user. Loop on posts and if user has voted add that to output?
Is there any other way? Can this be done using aggregation?
NOTE: I started on MongoDB last week.
In MongoDB, the simplest way is probably to handle this with application-side logic and not to try this in a single query. There are many ways to structure your data, but here's one possibility:
user_document = {
name : "User1",
postsIhaveLiked : [ "post1", "post2" ... ]
}
post_document = {
postID : "post1",
content : "my awesome blog post"
}
With this structure, you would first query for the user's user_document. Then, for each post returned, you could check if the post's postID is in that user's "postsIhaveLiked" list.
The main idea with this is that you get your data in two steps, not one. This is different from a join, but based on the same underlying idea of using one key (in this case, the postID) to relate two different pieces of data.
In general, try to avoid using map-reduce for performance reasons. And for this simple use case, aggregation is not what you want.
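The two-step approach (load the posts, load the current user's votes, merge in application code) could look roughly like this. It is a sketch on plain arrays; the field names postID, memberID and vote follow the SQL in the question, while attachVotes and the sample data are made up.

```javascript
// attachVotes is a hypothetical helper; in a real app the two arrays would come
// from two separate queries (posts for the tree, postrating for the current user).
function attachVotes(posts, ratings, currentUser) {
  const voteByPost = new Map();
  for (const rating of ratings) {
    if (rating.memberID === currentUser) voteByPost.set(rating.postID, rating.vote);
  }
  // null means the current user has not voted on that post.
  return posts.map(post => Object.assign({}, post, {
    vote: voteByPost.has(post.postID) ? voteByPost.get(post.postID) : null,
  }));
}

const posts = [{ postID: 'post1' }, { postID: 'post2' }, { postID: 'post3' }];
const ratings = [
  { postID: 'post1', memberID: 'me', vote: 'up' },
  { postID: 'post2', memberID: 'someone-else', vote: 'down' },
  { postID: 'post3', memberID: 'me', vote: 'down' },
];
const annotated = attachVotes(posts, ratings, 'me');
console.log(annotated);
```

Building the Map first keeps the merge linear in the number of posts plus votes, instead of searching the vote list once per post.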

Join like query in mongodb

I have a requirement where I need to get the friends of a user. I have made two collections named User and Friends.
The code that I use to access the data from Friends and User is:
var friend = Friends.find({acceptor:req.currentUser.id,status:'0'},function(err, friends) {
console.log('----------------friends-----------------'+friends.length);
});
console.log gives me the desired results for friends. But if I then use friend to access the User data, as shown below, I do not get the result that I need.
var user = User.find({_id:friend.requestor},function(err, users) {
console.log('----------------user-----------------'+users.length);
});
How can I join the two queries to get the desired result? Please help.
I'd suggest you try to denormalize the data instead of going down the SQL path:
User {
"FirstName" : "Jack",
"LastName" : "Doe",
// ...
// no friend info here
}
Put denormalized information in the list of friends. Don't use an embedded array, because you probably don't need to fetch all friend ids every time you fetch a user. The details of the data structure depend on the relations you want to support (directed vs. undirected, etc.), but it would roughly look like this:
FriendList {
OwnerUserId : ObjectId("..."),
FriendUserId : ObjectId("..."),
FriendName: "Jack Doe"
// add more denormalized information
}
Now, to display the list of friends of a user:
var friends = db.FriendList.find({"OwnerUserId" : currentUserId});
The downside is that, if a friend changes her name, you'll have to update all references to that name. On the other hand, that logic is trivial, and the (typically much more common) query "fetch all friends" is super fast, easy to code, and easy to page.
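The trade-off can be sketched on an in-memory array. In MongoDB the rename would be a multi-document update matching on FriendUserId; here renameFriend and friendsOf are hypothetical helpers and the ids are plain strings.

```javascript
// In-memory stand-in for the FriendList collection.
const friendList = [
  { OwnerUserId: 'u1', FriendUserId: 'u2', FriendName: 'Jack Doe' },
  { OwnerUserId: 'u3', FriendUserId: 'u2', FriendName: 'Jack Doe' },
  { OwnerUserId: 'u1', FriendUserId: 'u3', FriendName: 'Ann Lee' },
];

// Fetching a user's friends touches only this collection; no join is needed.
function friendsOf(ownerId) {
  return friendList.filter(entry => entry.OwnerUserId === ownerId);
}

// When a user changes their name, every denormalized copy must be rewritten.
function renameFriend(userId, newName) {
  let updated = 0;
  for (const entry of friendList) {
    if (entry.FriendUserId === userId) {
      entry.FriendName = newName;
      updated += 1;
    }
  }
  return updated;
}

const changed = renameFriend('u2', 'Jack Smith');
console.log(changed, friendsOf('u1').map(entry => entry.FriendName));
```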

MongoDB, return recent document for each user_id in collection

Looking for similar functionality to Postgres' Distinct On.
Have a collection of documents {user_id, current_status, date}, where status is just text and date is a Date. Still in the early stages of wrapping my head around mongo and getting a feel for best way to do things.
Would mapreduce be the best solution here, map emits all, and reduce keeps a record of the latest one, or is there a built in solution without pulling out mr?
There is a distinct command, however I'm not sure that's what you need. Distinct is kind of a "query" command and with lots of users, you're probably going to want to roll up data not in real-time.
Map-Reduce is probably one way to go here.
Map Phase: Your key would simply be an ID. Your value would be something like the following {current_status:'blah',date:1234}.
Reduce Phase: Given an array of values, you would grab the most recent and return only it.
To make this work optimally you'll probably want to look at a new feature from 1.8.0, the "re-reduce" feature, which will allow you to process only new data instead of re-processing the whole status collection.
The other way to do this is to build a "most-recent" collection and tie the status insert to that collection. So when you insert a new status for the user, you update their "most-recent".
Depending on the importance of this feature, you could possibly do both things.
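The two phases can be mimicked in plain JavaScript to check the logic before running it on the server. This is only a sketch: latestPerUser and reduceLatest are hypothetical helpers, the field names come from the question, and dates are plain numbers for brevity.

```javascript
// Reduce phase: given all values emitted for one user, keep the most recent.
function reduceLatest(values) {
  return values.reduce((best, value) => (value.date > best.date ? value : best));
}

// Map phase plus grouping: emit (user_id, {current_status, date}) and group by key.
function latestPerUser(statuses) {
  const grouped = new Map();
  for (const doc of statuses) {
    const value = { current_status: doc.current_status, date: doc.date };
    if (!grouped.has(doc.user_id)) grouped.set(doc.user_id, []);
    grouped.get(doc.user_id).push(value);
  }
  const result = {};
  for (const [userId, values] of grouped) result[userId] = reduceLatest(values);
  return result;
}

const latest = latestPerUser([
  { user_id: 'u1', current_status: 'old', date: 1 },
  { user_id: 'u2', current_status: 'only', date: 5 },
  { user_id: 'u1', current_status: 'new', date: 9 },
]);
console.log(latest);
```

Because reduceLatest returns a value of the same shape as its inputs, it can safely be applied again to its own output, which is the property re-reducing relies on.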
Current solution that seems to be working well.
map = function () {emit(this.user.id, this.created_at);}
//We call new date just in case somethings not being stored as a date and instead just a string, cause my date gathering/inserting function is kind of stupid atm
reduce = function(key, values) { return new Date(Math.max.apply(Math, values.map(function(x){return new Date(x)})))}
res = db.statuses.mapReduce(map,reduce);
Another way to achieve the same result would be to use the group command, which is a kind of a mr-shortcut that lets you aggregate on a specific key or set of keys.
In your case it would read like this:
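db.coll.group({ key : { user_id: true },
reduce : function(obj, prev) {
// keep the most recent status seen so far for this user_id
if (prev.date === null || new Date(obj.date) > prev.date) {
prev.status = obj.current_status;
prev.date = new Date(obj.date);
}
},
initial : { status : "", date : null }
})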
However, unless you have a rather small, fixed number of users, I strongly believe that a better solution would be, as previously suggested, to keep a separate collection containing only the latest status message for each user.
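A minimal sketch of that separate collection, using a Map in place of the collection. In MongoDB the same effect would come from an upsert keyed on user_id each time a status is inserted; insertStatus is a hypothetical helper, and dates are plain numbers for brevity.

```javascript
const statuses = [];            // stands in for the full status history
const latestStatus = new Map(); // stands in for the latest-status collection, keyed by user_id

function insertStatus(doc) {
  statuses.push(doc);
  const current = latestStatus.get(doc.user_id);
  // Only overwrite when the new status is at least as recent, so late arrivals do not win.
  if (!current || doc.date >= current.date) latestStatus.set(doc.user_id, doc);
}

insertStatus({ user_id: 'u1', current_status: 'first', date: 1 });
insertStatus({ user_id: 'u1', current_status: 'second', date: 3 });
insertStatus({ user_id: 'u1', current_status: 'late arrival', date: 2 });
insertStatus({ user_id: 'u2', current_status: 'hello', date: 2 });

console.log(latestStatus.get('u1').current_status, latestStatus.size);
```

Reading the latest status for every user is then a plain scan of the small collection, with no grouping at query time.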