Managing (indexing) large datasets with Meteor and Mongo

How does Meteor handle DB indexing? I've read that there are no indexes at this time, but I'm particularly concerned about very large data sets joined with multiple lookups, etc., which will really impact performance. Are these issues taken care of by Mongo and Meteor?
I am coming from a Rails/PostgreSQL background and am about 2 days into Meteor and Mongo.
Thanks.

Meteor does expose a method for creating indexes, which maps to the Mongo method db.collection.ensureIndex.
You can access it on each Meteor.Collection instance, on the server. For example:
if (Meteor.isServer) {
  var myCollection = new Meteor.Collection("dummy");
  // create a compound index on 'dummy' over field1 and field2
  myCollection._ensureIndex({field1: 1, field2: 1});
}
From a performance point of view, create indexes based on what you publish, but avoid over-indexing.
With oplog tailing, the initial query only runs occasionally, and subsequent changes come from the oplog.
Without oplog tailing, Meteor re-runs the query every 10 seconds, so good indexes give a large gain.
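For example, if a publication filters and sorts on a couple of fields, an index covering that selector keeps both the initial query and the periodic re-runs cheap. A minimal sketch, assuming a made-up posts collection and publication name:
if (Meteor.isServer) {
  var Posts = new Meteor.Collection("posts");

  // index matches the selector and sort used by the publication below
  Posts._ensureIndex({authorId: 1, createdAt: -1});

  Meteor.publish("recentPostsByAuthor", function (authorId) {
    return Posts.find(
      {authorId: authorId},
      {sort: {createdAt: -1}, limit: 50}
    );
  });
}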

Got a response from the Discover Meteor book folks:
Sacha Greif: Actually, we are in the process of writing a new sidebar to address migrations. You'll have access to it for free if you're on the Full or Premium packages :)
Regarding indexes, I think we might address that in an upcoming blog post :)
Thanks much for the reply. I'm looking forward to both.

Related

MongoDB indexing for a Parse Server application

We have a social app where users can chat with each other and we’ve reached 350K messages!
We recently noticed that as the number of messages is growing, the find operations are getting slower! I believe the issue here is that the Message collection is not indexed.
That’s what I want to do now! I found this piece of code at the MongoDB docs:
db.comments.ensure_index(('discussion_id', 1))
This is my Message collection:
{
  chatRoom: <Pointer>,
  user: <Pointer>,
  text: <String>,
  isSeen: <Bool>
}
So I guess this is all I have to do:
db.Message.ensure_index(('chatRoom', 1))
Is that just it? Run this command and I’m all set? All existing and future messages will be indexed after that?
Your index actually should depend on what your query looks like. Suppose your message query looks like this:
var query = new Parse.Query("Message");
query.equalTo("chatRoom", aChatRoom);
query.equalTo("user", someUser);
query.equalTo("isSeen", false);
query.descending("createdAt");
query.find().then(function(results){//whatever});
Then you would need to build an index on the Message collection specifically for this query. In this case:
db.Message.createIndex({_p_chatRoom:1, _p_user:1, isSeen: -1, _created_at: -1})
Alternatively, an index on just the chat room will still perform much better than no index at all:
db.Message.createIndex({_p_chatRoom:1})
To really understand which indexes to build, you'll need to do some reading on the Mongo docs https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/#db.collection.createIndex
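To check whether a given query actually uses the new index, you can run the equivalent find through explain() in the mongo shell. A rough sketch (the pointer value is made up, and explain("executionStats") assumes MongoDB 3.0+):
db.Message.find({_p_chatRoom: "ChatRoom$aBcDeF"}).sort({_created_at: -1}).explain("executionStats")
// An IXSCAN stage in the winning plan means an index is being used; a COLLSCAN means it is not.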
I personally use MLab for my Parse MongoDB because I'm not very knowledgeable about databases. They have a slow query analyzer that recommends indexes based on common queries in your application, so if you don't want to learn the finer points of MongoDB indexing, MLab is a great place to start.

Meteor serving large static collections

I started building an app and chose Meteor as a platform, but I stumbled upon a problem: I need to serve large collections of data to the user, say 2000-5000 records. I understand that having such a large reactive collection is a problem for Meteor, but I don't need it to be reactive; I just need to statically display it to the user whenever he requests it. I just started out with Meteor and don't know its capabilities, so I wonder if something like this is possible? For example, PHP queries ~3000 records from MySQL and prints them to the user in around 3 seconds.
But using Meteor, even for smaller collections of, say, 500 records, I have to wait a lot longer: ~1 min.
I suspect this slow loading might be caused by Meteor's default MongoDB implementation, and that using an external database would increase performance, though I have not tried it yet. Anyway, the question is: can I achieve fast loading of large data collections in Meteor, and if so, how? What are the best practices for handling large collections in Meteor?
PS. I chose Meteor because I do need its reactivity in some cases, with small collections. But I also need to serve larger static collections. Can I combine both in Meteor?
A couple of pointers, which may help with your static collections:
Use 'reactive: false' in the find queries that don't need to be reactive, as that stops Meteor watching for updates.
http://docs.meteor.com/#/full/find
Figure out which fields you need where and only return the bare minimum. You can use session variables to filter based on the context, which will make your publications a lot more efficient.
http://docs.meteor.com/#/full/meteor_publish
Surely the user doesn't need to see all 2000-5000 records at once? Could you implement some sort of paging mechanism?
Best pattern for pagination for Meteor
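Putting those pointers together, a non-reactive, field-limited, paged fetch might look roughly like this. A sketch only; the collection, publication name, field names and page size are made up:
// shared: a plain collection for the static data
var Records = new Meteor.Collection("records");

if (Meteor.isServer) {
  // publish one page at a time, with just the fields the view needs
  Meteor.publish("staticRecordsPage", function (page) {
    return Records.find({}, {
      fields: {name: 1, value: 1},
      sort: {name: 1},
      skip: page * 100,
      limit: 100
    });
  });
}

if (Meteor.isClient) {
  // read the local copy without registering reactive dependencies
  var rows = Records.find({}, {reactive: false}).fetch();
}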

Aggregate, Find, Group confusion?

I am building a web-based system for my organization using MongoDB. I have gone through the documentation provided by MongoDB and came to the following conclusions:
find: cannot pull data from a sub-array.
group: does not work in a sharded environment.
aggregate: best for sub-arrays, but has performance issues when the data set is large.
Map-reduce: too risky to write the map and reduce functions.
So, can someone help me out with the best approach for working with sub-array documents in a production environment with a sharded cluster?
Example:
{"testdata":{"studdet":[{"id","name":"xxxx","marks",80}.....]}}
now my "studdet" is a huge collection of more than 1000, rows for each document,
So suppose my query is:
"Find all the "name" from "studdet" where marks is greater than 80"
its definitely going to be an aggregate query, so is it feasible to go with aggregate in this case because ,"find" cannot do this and "group" will not work in sharded environment, so if I go with aggregate what will be the performance impact, i need to call this query most of the time.
Please have a look at:
http://docs.mongodb.org/manual/core/data-modeling/
and
http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
These documents describe the decisions involved in creating a good document schema in MongoDB. That is one of the hardest things to do in MongoDB, and one of the most important, since it affects your performance.
In your case, a student collection with an embedded array of grades looks to be the best bet.
{_id: ..., grades: [{type: "test", grade: 80}, ...]}
In general, and given your sample data set, the aggregation framework is the best choice. The aggregation framework is faster than map-reduce in most cases (certainly in execution speed; it is C++ versus JavaScript for map-reduce).
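As a rough illustration against the document shape from the question, the "all names where marks is greater than 80" query could be written with the aggregation framework like this. A sketch only; the collection name students is an assumption:
db.students.aggregate([
  // unpack the embedded array so each student entry becomes its own document
  {$unwind: "$testdata.studdet"},
  // keep only entries with marks above 80
  {$match: {"testdata.studdet.marks": {$gt: 80}}},
  // return just the names
  {$project: {_id: 0, name: "$testdata.studdet.name"}}
])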
If your data's working set becomes so large that you have to shard, then aggregation, and everything else, will be slower. Not, however, slower than putting everything on a single machine that has a lot of page faults. Generally, you need a working set larger than the RAM available on a modern machine for sharding to be the right choice, so that you can keep everything in RAM. (At that point a commercial support contract for Mongo will cost less than the hardware, and it includes extensive help with schema design.)
If you need anything else please don’t hesitate to ask.
Best,
Charlie

Mongodb model for Uniqueness

Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (where we saw it), metadata
What we want to know with this information:
Unique visitors on one or more clusters for a given range of dates.
Unique visitors by day.
Grouped metadata for a given range (platform, browser, etc.).
The model I stuck with in order to easily query this information is:
{
  VisitorId: 1,
  ClusterVisit: [
    {clusterId: 1, dates: [date1, date2]},
    {clusterId: 2, dates: [date1, date3]}
  ]
}
Index:
by VisitorId (to ensure Uniqueness)
by ClusterVisit.ClusterId-ClusterVisit.dates (for searching)
by IdUser-ClusterVisit.IdCluster (for updating)
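In the mongo shell those three indexes would look roughly like this (a sketch; the collection name visits is an assumption, and the field names follow the model above):
db.visits.ensureIndex({VisitorId: 1}, {unique: true})
db.visits.ensureIndex({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1})
db.visits.ensureIndex({VisitorId: 1, "ClusterVisit.clusterId": 1})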
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we search for a combination of VisitorId and clusterId and $addToSet the date.
Second:
If the first step doesn't match, we upsert:
$addToSet: {VisitorId: 1,
  ClusterVisit: [{clusterId: 1, dates: [date1]}]
}
With the first and second steps I cover the cases where the clusterId or the VisitorId doesn't exist yet.
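A rough sketch of that two-step import in the mongo shell (the collection name visits, the clusterId value and the newDate variable are placeholders):
// Step 1: if the visitor already has an entry for this cluster,
// add the date to that entry's dates array.
db.visits.update(
  {VisitorId: 1, "ClusterVisit.clusterId": 2},
  {$addToSet: {"ClusterVisit.$.dates": newDate}}
)

// Step 2: only if step 1 matched nothing, add a new cluster entry
// (or create the visitor document) with an upsert.
db.visits.update(
  {VisitorId: 1, "ClusterVisit.clusterId": {$ne: 2}},
  {$addToSet: {ClusterVisit: {clusterId: 2, dates: [newDate]}}},
  {upsert: true}
)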
Problems:
Updates/inserts/upserts become totally inefficient (nearly impossible) once the collection grows, I guess because the document size keeps growing as new dates are added.
Difficult to maintain (unsetting dates, mostly).
I have a collection with more than 50,000,000 documents that I can't grow any more. It updates only ~100 records/sec.
I think the model I'm using is not the best for this volume of information. What do you think would be best to get more upserts/sec and to query the information fast, before I mess with sharding, which is going to take more time while I learn and get confident with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain() (a quick sketch follows this list):
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (better is a compound one):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
More Stack Overflow links on improving performance:
https://stackoverflow.com/a/7635093/602018
A good starting point for further education on sharding and replicas is:
https://education.10gen.com/courses
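For the .explain() and hint pointers above, a quick shell sketch (the collection, query and index are made up):
// see which plan and index (if any) the query is using
db.visits.find({"ClusterVisit.clusterId": 1}).explain()

// force a specific index if the optimizer keeps choosing a bad plan
db.visits.find({"ClusterVisit.clusterId": 1})
         .hint({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1})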

Easiest way to scale Mongo with limited resources?

I have a web server (40 GB HD, 1 GB RAM) that runs Mongo and a Rails application.
The Mongo DB is a document store of Twitter tweets and users, which has several million records. I perform map-reduce queries on the data to extract things like the most common hashtags, words, mentions, etc. (very standard stuff). The metadata of each tweet is already stored, so the map-reduce is really as efficient as a single collect.
However, since it is run on a (fairly) large dataset, it can't be done in real time anymore; for example, I have a report generator that runs a whole bunch of these map-reduces in a row and takes about 2 minutes for 20 thousand tweets.
What is the quickest, cheapest way to scale Mongo, especially for map-reduce performance? I can set up an additional server and split the load, but wonder if I should use sharding, replication, or both? Sharding may be overkill for this situation.
I would also love some input on my MySQL-Mongo connection. MySQL contains Twitter profiles that store Twitter IDs for each profile. Each time a map-reduce is done, it collects all the IDs to be fed as options into the map-reduce, i.e.:
@profile_tweet_ids = current_profile_tweet_ids # array of ids
@daily_trend = TwitterTweet.daily_trend :query => {:twitter_id => {"$in" => @profile_tweet_ids}}
The map-reduce function in TwitterTweet looks like:
def daily_trend(options = {})
  options[:out] = "daily_trend"
  map = %Q( function() {
    if (this.created_at != null) {
      emit(this.created_at.toDateString(), 1);
    }
  })
  result = collection.map_reduce(map, standard_reduce, options)
  normalize_results(result)
end
Any advice is appreciated!
If you are doing simple counts, sums, uniques, etc., you may be able to avoid using map-reduce completely. You can use the $inc operator to get most of the stuff you need in real time.
I have explained this in detail in my blog post on real-time analytics with MongoDB.
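The idea, roughly, is to increment counters at write time instead of recomputing them with map-reduce afterwards. A sketch; the daily_stats collection and its field names are made up:
// on each incoming tweet, bump per-day, per-hashtag and per-mention counters in place
db.daily_stats.update(
  {_id: "2012-03-14"},
  {$inc: {tweets: 1, "hashtags.nosql": 1, "mentions.someuser": 1}},
  {upsert: true}
)

// reading the report is then a single document fetch rather than a map-reduce run
db.daily_stats.findOne({_id: "2012-03-14"})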
Sounds like your use case is more along the lines of online stream/event processing.
You can use Mongo or another database/caching product to store reference data, and an event processing framework for receiving and processing the events. There are a few tools that can help you with that; off the top of my head, here are a few: Twitter Storm, Apache S4, GigaSpaces XAP (disclaimer: I work for GigaSpaces) and GridGain.
Use one of the cloud services like MongoLab. Depends on your definition of cheap, though.
The answer about using operators rather than map-reduce has merit, and may be far more beneficial to your efforts to get real-time responses. Map-reduce on MongoDB does not lend itself to real-time responses.
Further to that, you may also benefit from the new aggregation framework (http://www.mongodb.org/display/DOCS/Aggregation+Framework), once that is available in the next release.
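For instance, the daily-trend map-reduce above could be expressed with the aggregation framework as something like the following. A sketch only; the collection name twitter_tweets and the profileTweetIds variable are assumptions:
db.twitter_tweets.aggregate([
  // same filter as the map-reduce: only tweets from the selected profiles with a created_at
  {$match: {twitter_id: {$in: profileTweetIds}, created_at: {$ne: null}}},
  // break the timestamp into calendar parts
  {$project: {year: {$year: "$created_at"},
              month: {$month: "$created_at"},
              day: {$dayOfMonth: "$created_at"}}},
  // count tweets per calendar day
  {$group: {_id: {year: "$year", month: "$month", day: "$day"},
            count: {$sum: 1}}}
])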
To answer the more general question about how to scale out map-reduce: adding a new server may not help if you simply add it as a secondary, because a secondary cannot store your M/R results in a collection, so inline output is your only option. If you do not need to store results in a collection, then this is your easiest way forward. For more information, see an in-depth discussion here: http://groups.google.com/group/mongodb-user/browse_thread/thread/bd8f5734dc64117a
Sharding can help with scaling out, but bear in mind that you will need to run everything through a mongos process and have config servers, and that the mongos will need to finalize the result sets returned from each shard. So you add a new potential bottleneck depending on your data, and you will need more than just one extra machine to have it working in a reliable manner.