Architecture to create an uptime monitor in Node.js - mongodb

What's the best way to use Node.js and Redis to build an uptime monitoring system? Can I use Redis as a queue even though it may not be the best place to save the information? Maybe MongoDB is?
It seems pretty simple, but needing more than one server to confirm that a host is really down, and making everything work together, is not so easy.

To monitor uptime, you would use a cron job on the system. On each run, you would check whether the host is up and how long it takes to respond, and the script would save that data in Redis.
To do this in Node.js, you would create a script that checks the status of the server: just make an HTTP request to the server (or a ping, whatever) and record whether it fails or not. Then I would just record the result in Redis. How you do it does not matter much, because the script (if you run it every 30 seconds) has 30 seconds before the next run, so you don't have to worry about how long the write to the database takes. Note that standard cron only schedules down to one minute, so a 30-second interval needs two staggered entries or a small loop. How you save your data is up to you; in this case even MySQL would work (if you are only monitoring a small number of sites).
More on Cron # Wikipedia
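
A minimal sketch of such a check script, assuming plain HTTP (not HTTPS) and a made-up record shape. The names buildRecord and checkHost are illustrative and do not come from the answer above; writing the record to Redis is left out because it depends on the client library you choose.

```javascript
'use strict';
const http = require('http');

// Turn the outcome of one probe into a small record to store.
// `statusCode` is null when the request failed before a response arrived.
function buildRecord(host, statusCode, elapsedMs, checkedAt) {
  const up = statusCode !== null && statusCode >= 200 && statusCode < 400;
  return { host, up, statusCode, elapsedMs, checkedAt };
}

// Probe one URL with a plain HTTP GET and resolve with a record.
// The promise never rejects: a timeout or connection error counts as "down".
function checkHost(url, timeoutMs = 10000) {
  return new Promise((resolve) => {
    const started = Date.now();
    const finish = (statusCode) =>
      resolve(buildRecord(url, statusCode, Date.now() - started, started));

    const req = http.get(url, (res) => {
      res.resume(); // discard the body, only the status matters here
      finish(res.statusCode);
    });
    req.setTimeout(timeoutMs, () => req.destroy(new Error('timeout')));
    req.on('error', () => finish(null));
  });
}

// Example (not run here): checkHost('http://example.com/').then(console.log);
```

Treating 2xx and 3xx as "up" is only one possible rule; some sites may want to count redirects or slow responses as failures.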

Can I use Redis as a queue but is not
the best way to save information,
maybe MongoDB is?
You can (and should) use Redis as your queue. It is going to be extremely fast.
I also think saving the information inside Redis is going to be a very good option. Unfortunately, Redis does not do any timing (yet). I think you could/should use Beanstalkd to put messages on a queue that get delivered when needed (every x seconds). I also think cron is not a very good idea, because you would need a lot of cron jobs, and with a queue you could also do your work faster (sharing the load among multiple processes).
I also don't think you need that much memory to keep everything in memory (which makes the site fast), because the dataset is going to be relatively simple. Even if you aren't able to fit the entire dataset in memory (getting more memory would be the smart move, if you ask me), you can rely on Redis's virtual memory.
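
To illustrate the "delivered when needed" idea, here is a rough in-memory stand-in for a delayed job queue. It is not Beanstalkd's API; DelayedQueue, put, reserve and workOnce are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow.

```javascript
'use strict';

// In-memory stand-in for a delayed job queue.
class DelayedQueue {
  constructor() {
    this.jobs = []; // each entry: { readyAt, payload }
  }

  // Make `payload` available `delayMs` after `now`.
  put(payload, delayMs, now) {
    this.jobs.push({ readyAt: now + delayMs, payload });
    this.jobs.sort((a, b) => a.readyAt - b.readyAt);
  }

  // Return the next job that is ready at `now`, or null if none is due yet.
  reserve(now) {
    if (this.jobs.length === 0 || this.jobs[0].readyAt > now) return null;
    return this.jobs.shift().payload;
  }
}

// A worker reserves one host, runs the check, then re-queues the host
// so that it comes up again after `intervalMs`.
function workOnce(queue, intervalMs, now, runCheck) {
  const host = queue.reserve(now);
  if (host === null) return null;
  runCheck(host);
  queue.put(host, intervalMs, now);
  return host;
}
```

Several worker processes could pull from the same real queue in this way, which is how the load gets shared without one cron entry per site.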
It seems pretty simple but needing to
have more than 1 server to guarantee
the server is down and make everything
work together is not so easy.
Sharding/replication is what I think you should read up on to solve this (hard) problem. Luckily, Redis supports replication (sharding can also be achieved). MongoDB supports sharding and replication out of the box. To be honest, I don't think you need sharding yet, and your dataset is rather simple, so Redis is going to be faster:
http://redis.io/topics/replication
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Replication
http://ngchi.wordpress.com/2010/08/23/towards-auto-sharding-in-your-node-js-app/
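
On the question of needing more than one server to be sure a host is down, one simple approach is to let each monitoring server report independently and only declare the host down when enough of them agree. The sketch below assumes a hypothetical decideState function and report format; neither comes from the answers above.

```javascript
'use strict';

// Decide a host's state from the latest reports of several monitors.
// `reports` maps a monitor name to true (reachable) or false (unreachable).
// The host is declared down only when at least `quorum` monitors agree.
function decideState(reports, quorum) {
  const votes = Object.values(reports);
  const downVotes = votes.filter((up) => up === false).length;
  if (downVotes >= quorum) return 'down';
  if (downVotes === 0) return 'up';
  return 'suspect'; // some monitors fail, but not enough to alert
}

console.log(decideState({ eu: false, us: false, asia: true }, 2)); // down
```

With three monitors and a quorum of two, a network problem that affects a single monitoring location does not trigger a false alarm.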

Related

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds, which means that with about 15,000 clients there would be around 7,200,000 entries per day, which is a lot.
I am trying to understand the best possible way to store and retrieve this data. Obviously, as time goes on, this will become substantial. Will MongoDB handle this? A typical document is not much, something like this:
{
  id: 1,
  radiusId: uniqueId,
  start: "2017-01-01 14:23:23",
  upload: 102323,
  download: 1231556
}
However, there will be MANY of these records. I guess this is similar to the way SNMP NMS servers handle data; as far as I know, they use RRD to do this.
Currently in my testing I just push every document into a single collection. So I am asking,
A) Is Mongo the right tool for the job and
B) Is there a better/more preferred/more optimal way to store the data
EDIT:
OK, so just in case someone comes across this and needs some help.
I ran it for a while in Mongo, and I was really not satisfied with the performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge, or the framework I was using. However, I found a solution that works very well for me. InfluxDB pretty much handles all of this right out of the box; it's a time series database, which is effectively the kind of data I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night and day. Again, it could all be my fault; I'm just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using Sails.js as the framework, and it was searching by id using a regex, which obviously has a huge performance hit. I will eventually try to migrate back to Mongo instead of Influx and see if it's better.
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
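
The arithmetic behind those figures, written out as a quick check (the numbers are the ones given in the question):

```javascript
'use strict';

const clients = 15000;        // from the question
const intervalSeconds = 180;  // one update per client every 180 s

const insertsPerSecond = clients / intervalSeconds;
const updatesPerClientPerDay = (24 * 60 * 60) / intervalSeconds; // 480
const insertsPerDay = clients * updatesPerClientPerDay;

console.log(insertsPerSecond.toFixed(1)); // 83.3
console.log(insertsPerDay);               // 7200000
```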
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define and how many queries you're running. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries, then you'll need many more resources.
If you are running extensive queries, consider setting up a replica set. That way, your primary server can be reserved for writing data, ensuring reliability, while your secondaries can be configured to serve your queries without affecting the write reliability.
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
Anyway, those are my preliminary suggestions.

Incrementing hundreds of counters at once, redis or mongodb?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US - # of events
    iOS - # of events that occurred in US
    Android - # of events that occurred in US
CA - # of events
    iOS - # of events that occurred in CA
    Android - # of events that occurred in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means in order to do the relational counters for something like the above I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked and I also want to do it in such a manner that the lookups on the data remains super quick. As such I have been looking into using redis or mongodb for this.
Questions:
Is there a better way to do this than counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in mongodb to increment 100+ counters at the same time in one operation be viable and not slow? The upside here being I can retrieve all of the statistics for one 'campaign' quickly in a single query.
Would this be better suited to redis and to do a zincrby for all of the applicable counters for the event?
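
For the single-document option in the second question, the update could be built along these lines. This is only a sketch: buildIncUpdate and the field layout (totals.* and by.*) are assumptions made for illustration, and it assumes parameter names and values contain no "." and do not start with "$", since those characters are special in MongoDB field paths.

```javascript
'use strict';

// Build a MongoDB update document that bumps one counter per parameter and
// one nested counter per ordered pair of parameters (e.g. country -> os).
function buildIncUpdate(params) {
  const names = Object.keys(params);
  const inc = {};
  for (const a of names) {
    inc[`totals.${a}.${params[a]}`] = 1;
    for (const b of names) {
      if (a !== b) inc[`by.${a}.${params[a]}.${b}.${params[b]}`] = 1;
    }
  }
  return { $inc: inc };
}

// Hypothetical usage with the Node.js driver (collection name is made up):
// db.collection('campaigns').updateOne({ _id: campaignId }, buildIncUpdate(params), { upsert: true });
```

With the ten parameters in the example URL this yields 10 single counters plus 90 pair counters, so 100 fields in one $inc, which matches the scale described in the question.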
Thanks
Depending on how your key structure is laid out, I would recommend pipelining the ZINCRBY commands. You have an easy "commit" trigger: the request. If you iterate over your parameters and issue a ZINCRBY for each key, then execute the whole pipeline at the end of the request, it will be very fast. I've implemented a system like the one you describe as both a CGI and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
I was able to process something like 150,000-200,000 increments per second on the Redis side with a single process, which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a DB cleanup process. I then had a cron job that would do set operations to "roll up" stats into hourly, daily, and weekly buckets using variants of the aforementioned key pattern. I bring these ideas up as they are ways you can take advantage of the built-in capabilities of Redis to make the reporting side simpler. There are other ways of doing it, but this pattern seems to work well.
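
A rough sketch of how one request could be turned into a batch of ZINCRBY commands against the per-minute key pattern above. The member naming (name=value), the use of UTC, and the function names are assumptions; the answer does not say how members were laid out. Sending the batch is left to whichever Redis client you use, ideally as a single pipeline followed by an EXPIRE on the key.

```javascript
'use strict';

// Format a Date as the per-minute key pattern: YYYY-MM-DD:HH:MM (UTC).
function minuteKey(date) {
  const iso = date.toISOString(); // e.g. 2012-03-04T05:06:07.000Z
  return `${iso.slice(0, 10)}:${iso.slice(11, 16)}`;
}

// One ZINCRBY per request parameter, all against the current minute's sorted set.
// Each command is [name, key, increment, member].
function buildCommands(params, date) {
  const key = minuteKey(date);
  return Object.entries(params).map(
    ([name, value]) => ['ZINCRBY', key, 1, `${name}=${value}`]
  );
}

const when = new Date(Date.UTC(2012, 2, 4, 5, 6, 7));
console.log(buildCommands({ country: 'US', os: 'iOS' }, when));
```

Because every command in the batch targets the same key, an hourly roll-up can later combine sixty such sorted sets with a union operation.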
As noted by eyossi, the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real-time system, the concurrency may well be an issue. If it is an "end of day" log parsing system, then it would not likely trigger the contention unless you run multiple instances of the parser or run reports at the time of input. With regard to keeping reads fast in Redis, I would consider setting up a read-only Redis instance slaved off of the main one. If you put it on the server running the report and point the reporting process at it, it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the Redis instance, you might consider running a 32-bit Redis server to keep the memory usage down. A 32-bit instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64-bit Redis isn't taking too much memory, feel free to use it. As always, test your own usage patterns to validate.
In redis you could use multi to increment multiple keys at the same time.
I had some bad experiences with MongoDB; I have found that it can be really tricky when you have a lot of writes to it...
You can look at this link for more info, and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which may already have been improved in version 2.x; I didn't check).
On the other hand, I had a good experience with Redis; I am using it for a lot of reads/writes and it works great.
You can find more information about how I am using Redis (to get a feeling for the amount of concurrent reads/writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipeline than multi if you don't need the atomic feature.

Best way to update DB (mongo) every hour?

I am preparing a small app that will aggregate data on users on my website (via socket.io). I want to insert all the data into my MongoDB every hour.
What is the best way to do that? setInterval(60000) seems to be a lil bit lame :)
You can use cron for example and run your node.js app as scheduled job.
EDIT:
If the program has to run continuously, then setTimeout is probably one of the few possible choices (and it is quite simple to implement). Otherwise, you can offload your data to some temporary storage system, for example Redis, and then regularly run another Node.js program to move the data. However, this introduces a new dependency on another DB system and may increase complexity, depending on your scenario. In this case Redis can also serve as a kind of failsafe if your main Node.js app is unexpectedly terminated and would otherwise lose part or all of your data batch.
You should aggregate in real time, not once per hour.
I'd take a look at this presentation by BuddyMedia to see how they are doing real time aggregation down to the minute. I am using an adapted version of this approach for my realtime metrics and it works wonderfully.
http://www.slideshare.net/pstokes2/social-analytics-with-mongodb
Why not just hit the server with a curl request that triggers the database write? You can put the command on an hourly cron job and listen on a local port.
You could have mongo store the last time you copied your data and each time any request comes in you could check to see how long it's been since you last copied your data.
Or you could try a setInterval(checkRestore, 60000) for once-a-minute checks. checkRestore() would query the server to see if the last updated time is more than an hour old. There are a few ways to do that.
An easy way to store the date is to just store it as the value of Date.now() (https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date) and then check for something like db.logs.find({lastUpdate:{$lt:Date.now()-3600000}}) (3,600,000 ms is one hour).
I think I confused a few different solutions there, but hopefully something like that will work!
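
One way to combine the ideas in this answer is to buffer events in memory and let a once-a-minute timer decide whether an hour has passed. The sketch below is an assumption-laden illustration: HourlyBuffer, isFlushDue and the save callback are hypothetical names, and save stands in for whatever MongoDB insert you use.

```javascript
'use strict';

const ONE_HOUR_MS = 60 * 60 * 1000; // 3,600,000 ms

// True when the last flush happened at least `maxAgeMs` ago.
function isFlushDue(lastFlushAt, now, maxAgeMs = ONE_HOUR_MS) {
  return now - lastFlushAt >= maxAgeMs;
}

// Collects events in memory and hands them to `save` once the batch is due.
class HourlyBuffer {
  constructor(save, startedAt) {
    this.save = save;
    this.lastFlushAt = startedAt;
    this.events = [];
  }

  add(event) {
    this.events.push(event);
  }

  // Call this from a timer, e.g. setInterval(() => buffer.tick(Date.now()), 60000).
  // Returns true when a batch was handed to `save`.
  tick(now) {
    if (!isFlushDue(this.lastFlushAt, now) || this.events.length === 0) return false;
    this.save(this.events);
    this.events = [];
    this.lastFlushAt = now;
    return true;
  }
}
```

Anything still in the buffer is lost if the process dies, which is the risk the Redis suggestion earlier in the thread is meant to address.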
If you're using Node, a nice cron-like tool to use is Forever. It uses the same cron patterns to handle repetition of jobs.

Memcache inhibits the website

I added memcached to my website.
And the site started running very slowly.
If I disable memcached, the application goes back to working quickly.
Why is this happening? And how can I avoid it?
Thanks,
kukuwka
That is impossible to answer without knowing how you are using it and what data you are storing. For example, if you are using it as the HttpCache provider (if you are using ASP.NET), and you were previously using the in-process cache provider, then it will behave very differently; the in-process provider has no serialization or network costs, so you might be storing some insanely large objects in the cache. That is fine when it is in-process, but for any other provider this is very very bad; you will have to transfer and deserialize for every usage (and serialize and transfer for every storage).
There are ways to improve the serialization/deserialization/network times, but it sounds like you are simply storing too much data (or inappropriate data) in the cache at the moment. I'd address that first, and then look at tuning it.
Memcached doesn't mean "make things faster." It provides fast and very scalable access to a shared cache of something that is otherwise expensive to acquire.
If you add caching to something that's cheap, it may end up being slower.
For example, if it takes you five seconds to do something and you can cache that, then you'll save almost five seconds on each subsequent request assuming the results are still useful.
If it takes you a few nanoseconds to do it, then it'll slow you down considerably to fetch the results over the network.
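
The trade-off can be put into a back-of-the-envelope model. This sketch assumes a cache hit costs one network round trip and a miss costs the round trip plus the original work (the cost of storing the result is ignored); the function names and numbers are illustrative only.

```javascript
'use strict';

// Average time per request when a cache sits in front of an operation.
// hit:  cacheFetchMs
// miss: cacheFetchMs + computeMs
function averageWithCache(computeMs, cacheFetchMs, hitRate) {
  return cacheFetchMs + (1 - hitRate) * computeMs;
}

// Caching pays off only if the average with the cache beats doing the work every time.
function cacheHelps(computeMs, cacheFetchMs, hitRate) {
  return averageWithCache(computeMs, cacheFetchMs, hitRate) < computeMs;
}

console.log(cacheHelps(5000, 1, 0.9));   // true: expensive work, cheap lookup
console.log(cacheHelps(0.001, 1, 0.99)); // false: the lookup costs more than the work
```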

How to get a command line process to use less processing power

I am wondering how to get a process run at the command line to use less processing power. The problem I'm having is that the process is basically taking over the CPU and taking MySQL and the rest of the server with it. Everything is becoming very slow.
I have used nice before but haven't had much luck with it. If it is the answer, how would you use it?
I have also thought of putting in sleep commands, but it'll still be using up memory so it's not the best option.
Is there another solution?
It doesn't matter to me how long it runs for, within reason.
If it makes a difference, the script is a PHP script, but I'm running it at the command line as it already takes 30+ minutes to run.
Edit: the process is a migration script, so I really don't want to spend too much time optimizing it, as it only needs to be run for testing purposes and once to go live. Just for testing, it keeps bringing the server to pretty much a halt... and it's a shared server.
The best you can really do without modifying the program is to change the nice value to the maximum value using nice or renice. Your best bet is probably to profile the program to find out where it is spending most of its time or using most of its memory, and try to find a more efficient algorithm for what you are trying to do. For example, if you are operating on a large result set from MySQL, you may want to process records one at a time instead of loading the entire result set into memory, or perhaps you can optimize your queries or the processing being performed on the results.
You should use nice with a "niceness" of 19; this makes the process very unlikely to run if there are other processes waiting for the CPU.
nice -n 19 <command>
Be sure that the program does not have busy waits and also check the I/O wait time.
Which process is actually taking up the CPU? PHP or MySQL? If it's MySQL, 'nice' won't help at all (since the server is not 'nice'd up).
If it's MySQL in general you have to look at your queries and MySQL tuning as to why those queries are slamming the server.
Slamming your MySQL server process can show up as "the whole system being slow" if your primary view of the system is through MySQL.
You should also consider whether the command-line process is I/O intensive. That can be adjusted on some Linux distros using the 'ionice' command, though its usage is not nearly as simple as the CPU 'nice' command.
Basic usage:
ionice -n7 cmd
will run 'cmd' using 'best effort' scheduler at the lowest priority. See the man page for more usage details.
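
The nice and ionice suggestions from this thread can be combined in one launch. The sketch below builds the argument list for `nice -n 19 ionice -c 2 -n 7 <command>` from Node; lowPriorityArgs, runLowPriority and the script name migrate.php are made up for illustration, and it assumes a Linux system where both tools are installed.

```javascript
'use strict';
const { spawn } = require('child_process');

// Wrap a command so it runs at the lowest CPU priority (nice 19) and the
// lowest best-effort I/O priority (ionice class 2, level 7).
function lowPriorityArgs(command, args) {
  return ['-n', '19', 'ionice', '-c', '2', '-n', '7', command, ...args];
}

// Start the wrapped command; its output goes straight to the terminal.
function runLowPriority(command, args = []) {
  return spawn('nice', lowPriorityArgs(command, args), { stdio: 'inherit' });
}

// Example (not run here): runLowPriority('php', ['migrate.php']);
```

This only lowers the priority of the script itself; if MySQL is the process doing the heavy work, as another answer points out, it will not help much.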
Using CPU cycles alone shouldn't take over the rest of the system. You can show this by doing:
while true; do :; done
This is an infinite loop and will use as much of the CPU cycles it can get (stop it with ^C). You can use top to verify that it is doing its job. I am quite sure that this won't significantly affect the overall performance of your system to the point where MySQL dies.
However, if your PHP script is allocating a lot of memory, that certainly can make a difference. Linux has a tendency to go around killing processes when the system starts to run out of memory.
I would narrow down the problem and be sure of the cause, before looking for a solution.
You could mount your server's interesting directory/filesystem/whatever on another machine via NFS and run the script there (I know, this means avoiding the problem and is not really practical :| ).