Bulk update and insert using Meteor method calls in a loop causing high CPU usage - mongodb

My application is on METEOR#1.6.0.1 and I am using reywood:publish-composite and matb33:collection-hooks for db relations.
I need to insert a list of 400 people into a collection from an Excel file. Currently I am inserting from the client by calling a Meteor method inside a loop, but when I watch Galaxy while this runs, CPU usage is very high, 70-80% and sometimes 100%.
Once all the data is inserted, I need to send a mail and update each record, so I am sending the mail and updating via Meteor method calls one by one, which again pushes CPU to 70-80%.
How can I do the above tasks in a correct and efficient way? Please help.
Thanks.

I suspect that you are not using oplog tailing and you are trying to insert while some other part of your app has subscriptions to publications open. Without oplog tailing, Meteor polls the collections and generates lots of slow queries on each document insert.
You can enable it by passing an oplog URL to Meteor at startup. See https://docs.meteor.com/environment-variables.html#MONGO-OPLOG-URL for more info.
Having oplog tailing eases the strain on the server and should reduce the high CPU usage to a manageable level.
If you are still having issues then you may have to set up some tracing, e.g. Monti APM: https://docs.montiapm.com/introduction
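Separately from oplog tailing, one common pattern for the insert itself (a sketch only; the collection, method name and field handling below are assumptions, not taken from the question or the answer) is to send the whole parsed spreadsheet to a single server method and insert it with one bulk call through the raw driver:

import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';
import { People } from '/imports/api/people'; // hypothetical collection module

Meteor.methods({
  async 'people.bulkInsert'(rows) {
    check(rows, [Object]);
    // One round trip to MongoDB instead of ~400 separate method calls.
    // Note: rawCollection() bypasses matb33:collection-hooks, so any hook
    // logic has to be run explicitly afterwards.
    const result = await People.rawCollection().insertMany(rows, { ordered: false });
    return result.insertedCount;
  },
});

// Client side: Meteor.call('people.bulkInsert', parsedRows, (err, count) => { /* ... */ });

The follow-up mail sending and record updates can likewise be batched inside one server method instead of being called once per record from the client.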

Related

Meteor website takes too much time to load

I have a webApp built using Meteor.
Following are the specifications:
Meteor version: 1.8
Mongo version: 4.0.5
Following is the list of packages I have used:
jquery#1.11.10
twbs:bootstrap#3.3.6
iron:router
reactive-var#1.0.11
fortawesome:fontawesome
blaze#2.1.8
accounts-password#1.5.1
mrt:mathjax
email#1.2.3
momentjs:moment
ian:accounts-ui-bootstrap-3#1.2.89
meteor-base#1.4.0
mongo#1.6.0
blaze-html-templates#1.0.4
session#1.2.0
tracker#1.2.0
logging#1.1.20
reload#1.2.0
ejson#1.1.0
spacebars#1.0.12
standard-minifier-css#1.5.2
standard-minifier-js#2.4.0
jss:jstree
meteorhacks:subs-manager
aldeed:template-extension
reywood:publish-composite
shell-server#0.4.0
stylus#=2.513.13
accounts-base#1.4.3
iron:middleware-stack#1.1.0
http#1.4.1
ecmascript#0.12.4
dynamic-import#0.5.0
sha#1.0.9
simple:json-routes
underscore#1.0.10
aldeed:simple-schema
rafaelhdr:google-charts
meteorhacks:aggregate
The webApp is hosted on an AWS EC2 instance with 16 GB RAM and 4 processors. The application uses the pub-sub method. The issue is that whenever there are more than 50 concurrent connections, CPU usage crosses sixty percent and the webApp becomes annoyingly slow to use. As per my findings, it could be because of two reasons: either the pub-sub schema I have used is too heavy, i.e., I have used database subscriptions extensively on each page and Meteor continuously maintains an open connection for them, or the extensive resource usage comes from MongoDB itself. As per dbStats, the db uses more than 6 GB of RAM. Following are the details:
I am not sure why it behaves this way. The only way I can think of is trial and error (remove subscriptions and then test), but that would be too time consuming and also not foolproof.
Could someone help me out as to how to proceed?
Depending on the way your app is designed, data-wise, there can be several reasons for this lack of performance.
A few suggestions:
check that you have indexes in your collections
avoid doing aggregation in the publication process, i.e. denormalize the db, publish arrays of cursors instead, limit the size of the documents, etc.
filter out the fields you don't need in the query
limit the amount of data to the relevant part (lazy load & paginated subscribe; see the sketch after this list)
consider global pubs/subs for the collections you use a lot, instead of reloading them too often on a route based pattern
track small component-based subs and try to move them to a higher level, to avoid making, for instance, 30 subs instead of one
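As a rough illustration of a few of these points (the collection, publication and field names below are assumptions, not from the question), an indexed, field-filtered, paginated publication could look like this:

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { check, Match } from 'meteor/check';

const Posts = new Mongo.Collection('posts'); // hypothetical collection

if (Meteor.isServer) {
  // Index the fields the publication filters and sorts on.
  Posts.rawCollection().createIndex({ authorId: 1, createdAt: -1 });

  // Publish only the fields the client needs, one page at a time.
  Meteor.publish('posts.byAuthor', function (authorId, page = 0, pageSize = 20) {
    check(authorId, String);
    check(page, Match.Integer);
    return Posts.find(
      { authorId },
      {
        fields: { title: 1, createdAt: 1, authorId: 1 }, // drop heavy fields
        sort: { createdAt: -1 },
        skip: page * pageSize,
        limit: pageSize,
      }
    );
  });
}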
My best guess is that you probably need a mix of rationalizing the db "structure" and avoiding data aggregation as much as you can.
You may also be misusing the low-level collection API (e.g. cursor.observe()) somewhere.

What is the optimal way to do server side paging in expressjs with mongoose

I'm currently doing a project with my own MEAN stack.
Now in a new project I'm creating, I've got a collection that I'm paging with Express on the server side, returning the page size every time (e.g. 10 results out of the total 2000) and the total rows found for the query the user performed (e.g. 193 for UserID 3).
Although this works fine, I'm afraid that this will create an enormous load on the server since a user can easily pull 50-60 pages a session with 10, 20, 50 or even 100 results each.
My question to you guys is: if I have say 1000 concurrent users paging every few seconds like this, will MongoDB be able to cope with this? If not, what might be my alternatives here?
Also, is there any way I can simulate such concurrent read tests on my app/MongoDB?
Please take into account that I must do server-side paging because the app will be quite dynamic and information can change very often.
If you're planning on only using a single webserver, you could cache the result set belonging to a certain page in memory. If you're planning on using multiple webservers, caching in-memory would lead to different result sets across servers, so in that case I'd recommend storing your cache either in MongoDB or in Redis.
A certain result set would be stored under a certain key in your cache. Your key would probably be composed of something like entityName + filterOptions + offset + resultsLimit. So for example you're loading movies with title=titanic, skipping the first 100, so offset=100 and loading only 50 per page so limit=50, which would all be concatenated into a single key.
When a request comes in, you would first try to load the result set from the cache. If the result set is inside the cache, you'll return that to the client. If it's not in the cache, you'd query the database for the latest result set, put that in the cache and return it to the client.
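A minimal sketch of that flow, assuming an Express route, a mongoose model and node-redis v4 (all names below are illustrative, not from the question):

const express = require('express');
const mongoose = require('mongoose');
const { createClient } = require('redis');

const app = express();
const cache = createClient();

// Hypothetical model; any paged collection works the same way.
const Movie = mongoose.model('Movie', new mongoose.Schema({ title: String, year: Number }));

app.get('/movies', async (req, res) => {
  const { title = '', offset = 0, limit = 10 } = req.query;
  // Key composed of entity + filter + offset + limit, as described above.
  const key = `movies:${title}:${offset}:${limit}`;

  const cached = await cache.get(key);
  if (cached) return res.json(JSON.parse(cached)); // cache hit

  // Cache miss: query MongoDB, then store the page with a short TTL.
  const filter = title ? { title } : {};
  const page = await Movie.find(filter)
    .skip(Number(offset))
    .limit(Number(limit))
    .lean();
  await cache.set(key, JSON.stringify(page), { EX: 30 });
  res.json(page);
});

(async () => {
  await mongoose.connect('mongodb://localhost:27017/paging_demo');
  await cache.connect(); // node-redis v4 needs an explicit connect
  app.listen(3000);
})();

A short TTL keeps the cache from serving stale pages for too long, which matters here since the data changes often.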
Whether or not you could pull it off with 1000 concurrent users depends a lot on your hardware, the data you are loading, how you're loading it and the efficiency of your implementation. There's one way to find out, and that's testing.
Of course by using the asynchronous capabilities of Node.js you can achieve the best scalability, so every call that can be executed async, such as database calls, should definitely be executed asynchronously.
You could load test your application for free from your local computer using Apache JMeter or let it be tested using for example Azure.

Using MongoDB to store immutable data?

We are investigating options to store and read a lot of immutable data (events), and I'd like some feedback on whether MongoDB would be a good fit.
Requirements:
We'll need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB. Would it be fine to store all of these events in the same collection?
A really important requirement is that we need to be able to replay all events in order. I've read here that MongoDB has a limit of 32 MB when sorting documents using cursors. For us it would be fine to read all data in insertion order (like a table scan), so an explicit sort might not be necessary? Are cursors the way to go, and would they be able to fulfill this requirement?
If MongoDB would be a good fit for this, is there some configuration or setting one can tune to increase performance or reliability for immutable data?
This is very similar to storing logs: lots of writes, and the data is read back in order. Luckily the Mongo Site has a recipe for this:
https://docs.mongodb.org/ecosystem/use-cases/storing-log-data/
Regarding immutability of the data, that's not a problem for MongoDB.
Edit 2022-02-19:
Replacement link:
https://web.archive.org/web/20150917095005/docs.mongodb.org/ecosystem/use-cases/storing-log-data/
Snippet of content from page:
This document outlines the basic patterns and principles for using MongoDB as a persistent storage engine for log data from servers and other machine data.
Problem: Servers generate a large number of events (i.e. logging) that contain useful information about their operation, including errors, warnings, and user behavior. By default, most servers store these data in plain text log files on their local file systems. While plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without holistic systems for aggregating and storing these data.
Solution: The solution described below assumes that each server generates events, also consumes event data, and that each server can access the MongoDB instance. Furthermore, this design assumes that the query rate for this logging data is substantially lower than is common for logging applications with a high-bandwidth event stream.
NOTE: This case assumes that you're using a standard uncapped collection for this event data, unless otherwise noted. See the section on capped collections.
Schema Design: The schema for storing log data in MongoDB depends on the format of the event data that you're storing. For a simple example, consider standard request logs in the combined format from the Apache HTTP Server. A line from these logs may resemble the following:
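Not part of the quoted recipe, but as one concrete way to meet the replay-in-order requirement from the question (names below are illustrative, MongoDB Node.js driver): index an insertion timestamp and sort on it, so the sort walks the index and the 32 MB in-memory sort limit never applies.

const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const events = client.db('events_demo').collection('events');

  // One-time setup: index the field that defines replay order.
  await events.createIndex({ insertedAt: 1 });

  await events.insertOne({ type: 'user.signup', insertedAt: new Date(), payload: {} });

  // Replay everything in insertion order; the cursor streams in batches,
  // so the full history never has to fit in memory at once.
  for await (const event of events.find().sort({ insertedAt: 1 })) {
    console.log(event.type, event.insertedAt);
  }

  await client.close();
}

main().catch(console.error);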

Processing large mongo collection offsite

I have a system writing logs into MongoDB (about 1kk logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-consuming, I want to copy the collection I'm working on to a powerful offsite machine. How do I keep the offsite collections up to date without copying everything? I modify the offsite collection by storing statistics within its elements, i.e. adding fields {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case or should I investigate other alternatives?
As to your question, yes, replication does partially resolve your issue, with limitations.
So there are several ways I know to resolve your issue:
The half-database, half-application way.
Replication keeps your data up to date. However, it doesn't allow you to modify the secondary nodes (which you call the "offsite collection"). So you have to do the calculation on the secondary and write the data to the primary. You need an application that runs the aggregation on the secondary and writes the result back to its primary.
This requires that you run an application, PHP, .NET, Python, whatever.
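A rough sketch of this first approach with the MongoDB Node.js driver (connection string, collection and field names below are illustrative): the heavy aggregation reads from a secondary via a read preference, and the computed stats are written back through the primary.

const { MongoClient } = require('mongodb');

async function weeklyStats() {
  const client = new MongoClient('mongodb://host1,host2,host3/?replicaSet=rs0');
  await client.connect();
  const db = client.db('logs');

  // The heavy read runs on a secondary so the primary stays responsive.
  const perAlgorithm = await db
    .collection('events')
    .aggregate(
      [
        { $match: { week: '2015-W32' } }, // hypothetical field
        { $group: { _id: '$algorithm', passed: { $sum: 1 } } },
      ],
      { readPreference: 'secondary' }
    )
    .toArray();

  // Writes always go to the primary; replication then propagates them.
  await db.collection('weekly_stats').insertOne({
    week: '2015-W32',
    computedAt: new Date(),
    perAlgorithm,
  });

  await client.close();
}

weeklyStats().catch(console.error);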
The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and doing the calculation directly online. This way you don't even need to run an application. Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution though, because of the Map/Reduce performance issues of current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculations except querying data. You can use a capped collection or a TTL index for removing expired data, and you just enumerate the data one by one in your application and do the calculation yourself.
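The TTL index mentioned here is a one-line setup with the Node.js driver (the field name is illustrative); MongoDB then removes documents roughly seven days after their createdAt value:

// db is a connected Db instance from the MongoDB Node.js driver.
async function ensureTtlIndex(db) {
  await db.collection('events').createIndex(
    { createdAt: 1 },
    { expireAfterSeconds: 7 * 24 * 60 * 60 } // deleted ~7 days after createdAt
  );
}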

Incrementing hundreds of counters at once, redis or mongodb?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US # of events
iOS - # of events that occurred in US
Android - # of events that occurred in US
CA # of events
iOS - # of events that occurred in CA
Android - # of events that occurred in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means in order to do the relational counters for something like the above I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked and I also want to do it in such a manner that the lookups on the data remains super quick. As such I have been looking into using redis or mongodb for this.
Questions:
Is there a better way to do this than counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in mongodb to increment 100+ counters at the same time in one operation be viable and not slow? The upside here being I can retrieve all of the statistics for one 'campaign' quickly in a single query. (A rough sketch of this option follows the question.)
Would this be better suited to redis and to do a zincrby for all of the applicable counters for the event?
Thanks
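For concreteness, the single-document $inc option from question 2 above could look roughly like this (a sketch assuming one document per campaign, MongoDB Node.js driver; field names are illustrative):

// Dotted paths let a single update increment many nested counters atomically.
// db is a connected Db instance; campaignId and the params are illustrative.
async function trackEvent(db, campaignId, { country, os, device }) {
  await db.collection('campaigns').updateOne(
    { _id: campaignId },
    {
      $inc: {
        [`${country}.total`]: 1,
        [`${country}.${os}.total`]: 1,
        [`${country}.${os}.${device}`]: 1,
        // ...and so on for the other parameter combinations
      },
    },
    { upsert: true }
  );
}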
Depending on how your key structure is laid out I would recommend pipelining the zincr commands. You have an easy "commit" trigger - the request. If you were to iterate over your parameters and zincr each key, then at the end of the request send the execute command, it will be very fast. I've implemented a system like you describe as both a cgi and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
And I was able to process something like 150,000-200,000 increments per second on the Redis side with a single process, which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a db cleanup process. I then had a cronjob that would do set operations to "roll up" stats into hourly, daily, and weekly buckets using variants of the aforementioned key pattern. I bring these ideas up as they are ways you can take advantage of the built-in capabilities of Redis to make the reporting side simpler. There are other ways of doing it, but this pattern seems to work well.
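A rough sketch of that per-request pipeline with ioredis (the key layout is illustrative and loosely follows the minute-bucket pattern above):

const Redis = require('ioredis');
const redis = new Redis();

async function trackEvent(params) {
  // e.g. params = { country: 'US', os: 'iOS', device: 'iPhone', ... }
  const minuteBucket = new Date().toISOString().slice(0, 16); // YYYY-MM-DDTHH:MM (UTC)

  const pipeline = redis.pipeline();
  for (const [name, value] of Object.entries(params)) {
    // One sorted set per parameter per minute: the member is the value,
    // the score is the running count.
    pipeline.zincrby(`${minuteBucket}:${name}`, 1, value);
    // Let the roll-up cronjob outlive the raw minute buckets.
    pipeline.expire(`${minuteBucket}:${name}`, 60 * 60 * 24);
  }
  await pipeline.exec(); // one round trip, no MULTI/EXEC needed
}

trackEvent({ country: 'US', os: 'iOS', device: 'iPhone' })
  .catch(console.error)
  .finally(() => redis.quit());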
As noted by eyossi, the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real-time system, the concurrency may well be an issue. If it is an "end of day" log parsing system, it would not likely trigger the contention unless you run multiple instances of the parser or reports at the time of input. With regards to keeping reads fast in Redis, I would consider setting up a read-only Redis instance slaved off of the main one. If you put it on the server running the report and point the reporting process at it, it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the Redis instance, you might consider running a 32-bit Redis server to keep memory usage down. A 32-bit instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64-bit Redis isn't taking too much memory, feel free to use it. As always, test your own usage patterns to validate.
In redis you could use multi to increment multiple keys at the same time.
I had some bad experiences with MongoDB; I have found that it can be really tricky when you have a lot of writes to it...
You can look at this link for more info, and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which may already have improved in version 2.x - I didn't check it).
On the other hand, I had a good experience with Redis; I am using it for a lot of reads / writes and it works great.
You can find more information about how I am using Redis (to get a feeling for the amount of concurrent reads / writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipeline than multi if you don't need the atomic feature.
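For reference, the distinction drawn in that last line, sketched with ioredis: multi() wraps the batch in MULTI/EXEC so it runs atomically, while pipeline() only batches the commands into a single round trip.

const Redis = require('ioredis');
const redis = new Redis();

async function demo() {
  // Atomic: both INCRs are applied inside MULTI/EXEC.
  await redis.multi().incr('counter:a').incr('counter:b').exec();
  // Pipelined only: same single round trip, but no transaction overhead.
  await redis.pipeline().incr('counter:a').incr('counter:b').exec();
}

demo().catch(console.error).finally(() => redis.quit());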