Meteor website takes too much time to load - mongodb

I have a webApp built using meteor.
Following are the specifications :
Meteor version : 1.8
Mongo Version : 4.0.5
Following is the list of packages I have used :
jquery#1.11.10
twbs:bootstrap#3.3.6
iron:router
reactive-var#1.0.11
fortawesome:fontawesome
blaze#2.1.8
accounts-password#1.5.1
mrt:mathjax
email#1.2.3
momentjs:moment
ian:accounts-ui-bootstrap-3#1.2.89
meteor-base#1.4.0
mongo#1.6.0
blaze-html-templates#1.0.4
session#1.2.0
tracker#1.2.0
logging#1.1.20
reload#1.2.0
ejson#1.1.0
spacebars#1.0.12
standard-minifier-css#1.5.2
standard-minifier-js#2.4.0
jss:jstree
meteorhacks:subs-manager
aldeed:template-extension
reywood:publish-composite
shell-server#0.4.0
stylus#=2.513.13
accounts-base#1.4.3
iron:middleware-stack#1.1.0
http#1.4.1
ecmascript#0.12.4
dynamic-import#0.5.0
sha#1.0.9
simple:json-routes
underscore#1.0.10
aldeed:simple-schema
rafaelhdr:google-charts
meteorhacks:aggregate
The webApp is hosted on AWS ec2 instance with 16GB RAM and 04 processors. The application uses pub-sub method. Now the issue is that whenever there are more than 50 concurrent connections, the CPU usage crosses sixty-percent usage and the webApp becomes annoyingly slow to use. As per my findings, it could be because of two reasons, either the pub-sub schema that I have used is too heavy, i.e., I have used database subscriptions on each page extensively and meteor maintains an open connection continuously with it. Other reason that could be leading to extensive resource usage could be mongoDB usage. As per the dbStats, the db uses more than 06GB of RAM. Following are the details :
I am not sure why such behaviour. Only way I can think of is to hit and trial (remove subscriptions and then test), but it would be too time consuming and also not full proof.
Could someone help me out as to how to proceed.

Depending on the way your app is designed, data-wise, there can be several reasons for this lack of performance.
A few suggestions:
check that you have indexes in your collections
avoid doing aggregation in the publication process, i.e. denormalize the db, publish array of cursors instead, limit the size of the documents, etc.
filter the useless fields in the query
limit the amount of data to the relevant part (lazy load & paginated subscribe)
consider global pubs/subs for the collections you use a lot, instead of reloading them too often on a route based pattern
track small component based subs and try to put them at a higher level to avoid making for instance 30 subs instead of one
My best guess is that you probably need a mix of rationalizing the db "structure" and avoid the data aggregation as much as you can.
You may also have a misuse of the low level collection api (e.g. cursor.observe() stuff) somewhere.

Related

ElasticSearch vs. MongoDB for Caching User Data

Up to this point, I have been using MongoDB (Node.js + Mongoose) to save posts which belong to a user, so that I can later retrieve them to display in a stream (just like Facebook, Twitter, etc.)
It recently became necessary to allow the user to deeply search his stream; MongoDB's search was insufficient, so I implemented ElasticSearch on my servers (Amazon EC2 m1.large instances running CentOS, FWIW).
My question: I'm now in a position that I'm duplicating the data between MongoDB (where the user's stream is cached) and ElasticSearch (where it is searched).
Is there any disadvantage to moving my cache ENTIRELY into ElasticSearch, getting rid of the MongoDB all together? It seems a waste to double the storage, and there's no other place that I'm accessing this data (it is only used when presenting/searching the stream of posts).
Specifically, I want to make sure I'm not overlooking anything re: performance. I like the idea of reducing MongoDB as a bottleneck, yet I worry about the memory overhead of ElasticSearch. MongoDB runs on its own server in my cloud setup, whereas ElasticSearch is running on the same instances as node.js. This means I would have MORE ElasticSearch servers (the node.js servers are in an auto-scaling array), but they each are not DEDICATED servers (unlike MongoDB).
The only big obstacle to using ES as a "primary datasource" is that there isn't a good backup mechanism right now. The ES team is working on it and expect it to be out by the end of the year, but in the mean time, you'll have to implement your own backup scripts.
As far as performance, it's really hard to say because almost every situation is unique. ES benefits from memory - so more is always better. In particular, sorts/filters/facets/geo all like to eat memory. If you aren't doing much in the way of faceting, for example, you may be fine with less memory.
ES doesn't need to run on a dedicated node...but it will happily use as many resources as you give it.
Another option is to use just the elastic search indexes. You can choose to not save data in a readable format, so you search in ES and then retrieve documents from MongoDB to your user as needed.
The question bellow comments exactly on that.
Storing only selected fields and not storing _all in pyes/elasticsearch

messaging service: redis or mongodb?

I am working on a messaging system that is a bit more advanced than simply sending receiving messages; it is something that looks like facebook chat/messaging: it has chat aspects but also messaging ones, like group messages, read/unread messages, and other.
On redis, I would simply use lists to store received messages, for example like this:
myID = [ "amy|how are you?", "frank|long time no see!" ]
amyID = [ "john|I'm good! you?" ]
(I have simplified it all a lot for easier reading.
But in this way I would not be able to keep track of single conversations, as they will all be always flushed once the messages are received (so basically no "inbox" feature.
On the other hand, if I use mongodb, I could use something like this: How to keep track of a private messaging system using MongoDB?
I though of the following benefits/disadvantages:
MONGODB
advantages:
can see inbox view
can check read/unread messages on each conversation
disadvantages
not as fast as redis
storage size increases a lot
REDIS
advantages:
easy to pick up new messages
no storage problems (messages are flushed)
disadvantages:
once messages are sent to the client are lost, so no read/unread features and
no inbox
Any ideas?
Thanks in advance.
I cannot answer for Redis because I don't use it and never have so I won't pretend I have.
However, if for some reason, you are not using something like an XMPP client like Facebook does: http://www.ibm.com/developerworks/xml/tutorials/x-realtimeXMPPtut/section3.html (aka Jabber) for chat then I will describe about a pure MongoDB solution in this situation.
MongoDB uses the OS' LRU as a means to cache documents and queries, fair enough it provides no direct query cache however if you are smart you will not need one; instead you just read all your queries directly from RAM. With this in mind MongoDB can be just as fast as Redis, since Redis uses the computers RAM too.
Speed between the two on a optimised query is negligible I would think. The true measure of speed comes from your schema, indexes, cluster setup and the queries you perform.
A note about storage size here, taking your comment into consideration:
the problem with flushing mongodb is bigger than I initially though: apparently when you delete something on mongo you only delete its reference, so if you delete 4mb of documents, it won't free up that much space. the only way to actually free up that memory is to run a dbRepair (or something among this line) that basically blocks the db while running....
You seem to have some misconceptions about exactly how MongoDB works.
This link will be of help to you: http://www.10gen.com/presentations/storage-engine-internals it will describe some of the reasons why excessive disk space is used and will also explain some of the misconceptions you have about how a computer works and how MongoDB frees space and reuses it.
MongoDB does not free space on a record level. Instead it will send that "empty" record (record and document are two different things as the presentation will tell you), shove it into a deleted bucket list and then reuse that space when a new document (or a updated document that has been moved) comes along and fits in that space.
It is true that if you are not careful and understanding on how MongoDB works on this level that you will probably be forced to run repairDB fairly regularly to keep any sort of performance after fragmentation.
As for memory handling. The OS handles this as I said. A good explanation of when the OS will free memory is on Wikipedia: http://en.wikipedia.org/wiki/Paging
Until there is not enough RAM to store all the data needed, the process of obtaining an empty page frame does not involve removing another page from RAM.
So the OS will handle removing pages for you and you shouldn't concern yourself with that part, instead you should be concerned with making your working set fit into RAM.
If you are worried about storing messages and don't really want to, i.e. you want them to be "flushed" you can actually use the TTL feature that comes with the later MongoDB installations: http://docs.mongodb.org/manual/tutorial/expire-data/ which will basically allow you to set a time-out for when a message should be deleted from the collection.
So personally if set-up right MongoDB could do messaging and chat like Facebook do it, of course they use the XMPP protocol and then archive messages into Cassandra for search but you don't have to do it like they do, that is just one way to achieve the same goal.
Hope this makes sense and I haven't gone round in circles, it is a bit of a long answer.
I think the big point here is the storage problems. You would need a lot of machine or a good system of flushing some conversations for you to use MongoDB. Despite wanting a sort of "inbox" system... I think redis would be more conducive to a well-working chat system - you just need to come up with some very creative workaround... or give up that design goal.
We use a mixed design, so we when we need snappy performance as in messages, queues and caches it´s on Redis and when we need to search on secondary indexes or update whole documents, we use MongoDB.
You can also try Riak, which can grow more linearly and smoothly than MongoDB.

Frequent large, multi-record updates in MongoDB, Lucene, etc

I am working on the high-level design of a web application with the following characteristics:
Millions of records
Heavily indexed/searchable by various criteria
Variable document schema
Regular updates in blocks of 10K - 200K records at a time
Data needs to remain highly available during updates
Must scale horizontally effectively
Today, this application exists in MySQL and we suffer from a few huge problems, particularly that it is challenging to adapt to flexible schema, and that large bulk updates lock the data for 10 - 15 seconds at a time, which is unacceptable. Some of these things can be tackled by better database design within the context of MySQL, however, I am looking for a better "next generation" solution.
I have never used MongoDB, but its feature set seemed to most closely match what I am looking for, so that was my first area of interest. It has some things I am excited about, such as data sharding, the ability to find-update-return in a single statement, and of course the schema flexibility of NoSQL.
There are two things I am not sure about, though, with MongoDB:
I can't seem to find solid
information about the concurrency of
updates with large data sets (see my
use case above) so I have a hard
time understanding how it might
perform.
I do need open text search
That second requirement brought me to Lucene (or possibly to Solr if I kept it external) as a search store. I did read a few cases where Lucene was being used in place of a NoSQL database like MongoDB entirely, which made me wonder if I am over-complicating things by trying to use both in a single app -- perhaps I should just store everything directly in Lucene and run it like that?
Given the requirements above, does it seem like a combination of MongoDB and Lucene would make this work effectively? If not, might it be better to attempt to tackle it entirely in Lucene?
Currently with MongoDB, updates are locking at the server-level. There are a few JIRAs open that address this, planned for v1.9-2.0. I believe the current plan is to yield writes to allow reads to perform better.
With that said, there are plenty of great ways to scale MongoDB for super high concurrency - many of which are simiar for MySQL. One such example is to use RAID 10. Another is to use master-slave where you write to master and read from slave.
You also need to consider if your "written" data needs to be 1) durable and 2) accessible via slaves immediately. The mongodb drivers allow you to specify if you want the data to be written to disk immediately (or hang in memory for the next fsync) and allow you to specify how many slaves the data should be written to. Both of these will slow down MongoDB writing, which as noted above can affect read performance.
MongoDB also does not have nearly the capability for full-text search that Solr\Lucene have and you will likely want to use both together. I am currently using both Solr and MongoDB together and am happy with it.

MongoDB on EC2 server or AWS SimpleDB?

What scenario makes more sense - host several EC2 instances with MongoDB installed, or much rather use the Amazon SimpleDB webservice?
When having several EC2 instances with MongoDB I have the problem of setting the instance up by myself.
When using SimpleDB I have the problem of locking me into Amazons data structure right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers, to either write to MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding and it has higher latency than mongodb or cassandra, it has a throughput limit and it is priced higher than other options. Scalability is manual (you have to shard).
If you need wider query options and you have a high read rate and you don't have so much data mongodb is better. But for durability, you need to use at least 2 mongodb server instances as master/slave. Otherwise you can lose the last minute of your data. Scalability is manual. It's much faster than simpledb. Autosharding is implemented in 1.6 version.
Cassandra has weak query options but is as durable as postgresql. It is as fast as mongo and faster on higher data size. Write operations are faster than read operations on cassandra. It can scale automatically by firing ec2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data cassandra is your best bet. No need to shard your data, it was designed distributed from the 1st day. You can have any number of copies for all your data and if some servers are dead it will automatically return the results from live ones and distribute the dead server's data to others. It's highly fault tolerant. You can include any number of instances, it's much easier to scale than other options. It has strong .net and java client options. They have connection pooling, load balancing, marking of dead servers,...
Another option is hadoop for big data but it's not as realtime as others, you can use hadoop for datawarehousing. Neither cassandra or mongo have transactions, so if you need transactions postgresql is a better fit. Another option is Amazon RDS, but it's performance is bad and price is high. If you want to use databases or simpledb you may also need data caching (eg: memcached).
For web apps, if your data is small I recommend mongo, if it is large cassandra is better. You don't need a caching layer with mongo or cassandra, they are already fast. I don't recommend simpledb, it also locks you to Amazon as you said.
If you are using c#, java or scala you can write an interface and implement it for mongo, mysql, cassandra or anything else for data access layer. It's simpler in dynamic languages (eg rub,python,php). You can write a provider for two of them if you want and can change the storage maybe in runtime by a only a configuration change, they're all possible. Development with mongo,cassandra and simpledb is easier than a database, and they are free of schema, it also depends on the client library/connector you're using. The simplest one is mongo. There's only one index per table in cassandra, so you've to manage other indexes yourself, but with the 0.7 release of cassandra secondary indexes will bu possible as I know. You can also start with any of them and replace it in the future if you have to.
I think you have both a question of time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / setup server instances for all them and figure out how they work.
On the other hand, you don't have to per a "per transaction" cost directly, you just pay for the hardware which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source, installed everywhere, S is proprietary, can run only on amazon AWS. You should also pay for a whole bunch of staff for S
S has whole bunch of strange limitations. M limitations are way more reasonable. The most strange limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return you) - 1Mb
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like json (also it is straight-forward to learn the basics)
with M you have to learn how you architect your database. Because many people think that schemaless means that you can throw any junk in the database and extract this with ease, they might be surprised that Junk in, Junk out maxim works. I assume that the same is in S, but can not claim it with certainty.
both do not allow case insensitive search. In M you can use regex to somehow (ugly/no index) overcome this limitation without introducing the additional lowercase field/application logic.
in S sorting can be done only on one field
because of 5s timelimit count in S can behave strange. If 5 seconds passed and the query has not finished, you end up with a partial number and a token which allows you to continue query. Application logic is responsible for collecting all this data an summing up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non string values (like numbers, dates) in S. M type support is way richer.
both do not have transactions and joins
M supports compression which is really helpful for nosql stores, where the same field name is stored all-over again.
S support just a single index, M has single, compound, multi-key, geospatial etc.
both support replication and sharding
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum average, distinct as well as data manipulation is not supported, so the functionality is not really way richer than Redis/Memcached. On the other hand Mongo support a rich query language.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk, reading is for restarting the server or in case of a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messenging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list is the same structure; link clicked may have a user token, link name and URL, while the page viewed may only have the user token and URL. Your first concern is just getting the fact it happened and whatever absolutely neccesary data you need is pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
MongoDB is good at document storage. Unlike Redis it is able to deal with databases larger than RAM and it supports sharding/replication on it's own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema, you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. Best advice probably would be to benchmark on your typical dataset and see yourself.
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat fies) I would go with Redis. Its blazingly fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand its pretty straight forward but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hand-on experience with mongodb, couchdb and cassandra. I converted a lot of files to base64 string and insert these string into nosql.
mongodb is the fastest. cassandra is slowest. couchdb is slow too.
I think mysql would be much faster than all of them, but I didn't try mysql for my test case yet.