Best way to update DB (mongo) every hour? - mongodb

I am preparing a small app that will aggregate data on users of my website (via socket.io). I want to insert all of that data into my MongoDB every hour.
What is the best way to do that? setInterval(60000) seems to be a lil bit lame :)

You can use cron, for example, and run your node.js app as a scheduled job.
EDIT:
If the program has to run continuously, then setTimeout/setInterval is probably one of the few possible choices (and it is quite simple to implement). Otherwise you can offload your data to some temporary storage system, for example Redis, and then regularly run another node.js program to move the data over; however, this introduces a dependency on another DB system and increases complexity depending on your scenario. Redis can also act as a kind of failsafe here, in case your main node.js app is unexpectedly terminated and would otherwise lose part or all of your data batch.
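For the in-process case, here is a minimal sketch of an hourly flush using the official mongodb driver; the buffer array, database/collection names, and connection string are illustrative assumptions, not part of the original question:

    // Minimal sketch: buffer events in memory and flush them to MongoDB hourly.
    // Assumes the official `mongodb` Node.js driver; names and URIs are illustrative.
    const { MongoClient } = require('mongodb');

    const buffer = [];                        // filled elsewhere, e.g. from socket.io handlers
    const ONE_HOUR = 60 * 60 * 1000;

    async function flushToMongo() {
      if (buffer.length === 0) return;
      const batch = buffer.splice(0, buffer.length);   // take everything and clear the buffer
      const client = new MongoClient('mongodb://localhost:27017');
      try {
        await client.connect();
        await client.db('analytics').collection('events').insertMany(batch);
      } finally {
        await client.close();
      }
    }

    // The timer itself is trivial; the real risk is losing the in-memory buffer
    // if the process dies, which is where a Redis staging area helps.
    setInterval(() => flushToMongo().catch(console.error), ONE_HOUR);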

You should aggregate in real time, not once per hour.
I'd take a look at this presentation by BuddyMedia to see how they are doing real-time aggregation down to the minute. I'm using an adapted version of this approach for my own real-time metrics and it works wonderfully.
http://www.slideshare.net/pstokes2/social-analytics-with-mongodb
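For illustration, here is a rough sketch of the kind of per-minute pre-aggregation that approach relies on (not taken from the deck itself); the collection name, field names, and the assumption of an already-connected db handle are mine:

    // Hedged sketch of per-minute pre-aggregation with $inc upserts.
    // Assumes `db` is a connected mongodb Db instance; names are illustrative.
    async function trackEvent(db, userId) {
      const minuteKey = new Date().toISOString().slice(0, 16);   // e.g. "2012-05-01T14:07"
      await db.collection('metrics_minute').updateOne(
        { _id: minuteKey },
        { $inc: { total: 1, [`byUser.${userId}`]: 1 } },
        { upsert: true }
      );
    }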

Why not just hit the server with a curl request that triggers the database write? You can put the command on an hourly cron job and listen on a local port.
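A minimal sketch of that idea, assuming a flush routine like the one sketched above; the port, route, and crontab entry are illustrative:

    // Hedged sketch: a local-only HTTP endpoint that an hourly cron job can curl.
    const http = require('http');
    const { flushToMongo } = require('./flush');   // your flush/aggregation routine (illustrative module path)

    http.createServer((req, res) => {
      if (req.method === 'POST' && req.url === '/flush') {
        flushToMongo()
          .then(() => res.end('ok'))
          .catch(err => { res.statusCode = 500; res.end(String(err)); });
      } else {
        res.statusCode = 404;
        res.end();
      }
    }).listen(3001, '127.0.0.1');

    // crontab entry (illustrative):
    // 0 * * * * curl -X POST http://127.0.0.1:3001/flush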

You could have Mongo store the last time you copied your data, and each time a request comes in, check how long it has been since the last copy.
Or you could try setInterval(checkRestore, 60000) for once-a-minute checks. checkRestore() would query the server to see if the last updated time is more than an hour old. There are a few ways to do that.
An easy way to store the date is to store the value of Date.now() (https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date) and then check for something like db.logs.find({lastUpdate: {$lt: Date.now() - 3600000}}) (3600000 ms = one hour).
I think I confused a few different solutions there, but hopefully something like that will work!
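Roughly, and purely as a sketch of the idea above (collection and field names are illustrative, and db is assumed to be an already-connected handle):

    // Hedged sketch of the once-a-minute checkRestore() idea.
    // `db`: an already-connected mongodb Db instance (assumed).
    const ONE_HOUR = 60 * 60 * 1000;

    async function checkRestore(db) {
      const meta = await db.collection('logs').findOne({ _id: 'lastUpdate' });
      const last = meta ? meta.at : 0;
      if (Date.now() - last >= ONE_HOUR) {
        // ... copy/aggregate the buffered data here ...
        await db.collection('logs').updateOne(
          { _id: 'lastUpdate' },
          { $set: { at: Date.now() } },
          { upsert: true }
        );
      }
    }

    setInterval(() => checkRestore(db).catch(console.error), 60000);   // once a minute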

If you're using Node, a nice cron-like tool to use is Forever. It uses the same cron patterns to handle repetition of jobs.

Related

Time of first occurrence of the query from pg_stat_statements. Possible to get?

I'm debugging a DB performance issue. There's a lead that the issue was introduced after a certain deploy, e.g. when the DB started to serve some new queries.
I'm looking to correlate deployment time with the performance issues, and would like to identify the queries that are causing this.
Using pg_stat_statements has been very handy so far. Unfortunately it does not store the time stamp of the first occurrence of each query.
Are there any auxiliary tables I could look into to see the time of the first occurrence of queries?
Ideally, if this information were available in pg_stat_statements, I'd make a query like this:
select queryid from pg_stat_statements where date(first_run) = '2020-04-01';
Additionally, it'd be cool to see last_run as well, so as to filter out old queries that no longer execute at all but remain in pg_stat_statements. That's more of a nice-to-have than a necessity, though.
This information is not stored anywhere, and indeed it would not be very useful. If the problem statement is a new one, you can easily identify it in your application code. If it is not a new statement, but something made the query slower, the first time the query was executed won't help you.
Is your source code not under version control?

Most efficient way to check time difference

I want to check an item in my database every x minutes for y minutes after its creation.
I've come up with two ways to do this and I'm not sure which will result in better efficiency/speed.
The first is to store a Date field in the model and do something like
Model.find({ time_created: { $gt: current_time - y } })
inside of a cron job every x minutes.
The second is to keep a times_to_check field that tracks how many more times, based on x and y, the object should be checked.
Model.find({ times_to_check: { $gt: 0 } })
My thought on why these two might be comparable is that the first involves a Date comparison, which would take longer, while the second requires a write to the database after the object has been checked.
Either way, you are going to have to check the database continuously to see if it is time to query your collection. In your "second solution" you do not have a way to run your background process, as you are only describing how you determine your collection delta.
Stick with running your Unix cron job, but make sure it is fault tolerant and has controls ensuring it is actually running when your application is up. Below is a pretty good answer for how to handle that.
How do I write a bash script to restart a process if it dies?
Based on that, I would ask: how does your application react if your cron job has not run for x minutes, hours, or days? How will your application recover if that happens?
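For reference, a minimal sketch of the first approach (Date field plus a cron-driven query), assuming Mongoose; the model name, field names, schedule, and connection string are illustrative:

    // check.js - run from cron, e.g. (illustrative): */5 * * * * node /path/to/check.js
    const mongoose = require('mongoose');

    const Item = mongoose.model('Item', new mongoose.Schema({
      time_created: { type: Date, index: true },   // index keeps the range query cheap
    }));

    const Y_MINUTES = 60;                          // keep checking items for y minutes after creation

    async function main() {
      await mongoose.connect('mongodb://localhost:27017/mydb');
      const cutoff = new Date(Date.now() - Y_MINUTES * 60 * 1000);
      const items = await Item.find({ time_created: { $gt: cutoff } });
      // ... perform the per-item check here ...
      await mongoose.disconnect();
    }

    main().catch(err => { console.error(err); process.exit(1); });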

wait/notify mechanism for multiple readers in Oracle sql?

We have multiple processes which read one database table, pick up an available record, and work with it. It works fine.
When there is no record in the table, each process waits 5 seconds and reads it again.
So a record could sit idle in the table for up to 5 seconds, which is not good.
What would be the recommended solution to eliminate such waiting and proceed immediately when a record is created? One solution could be a trigger that does something when a record is created, but that approach requires knowledge of the worker processes in order to deliver the record to one of the idle processes.
It seems the ideal solution would be for each process to block on a SQL read against something, so that when a record is created, one of the waiting processes receives it and the others continue to wait.
Does Oracle 10 provide such or similar mechanism?
Look at Database Change Notification in 10g, which has since been renamed Continuous Query Notification.
I normally like to include an example, but it's hard to find a 10g instance these days, and even a short example requires a lot of code. The process looks complicated; you might be better off using triggers as you suggested and dealing with the tight coupling.

Incrementing hundreds of counters at once, redis or mongodb?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating system. The desired effect would be:
US - # of events
    iOS - # of events that occurred in the US
    Android - # of events that occurred in the US
CA - # of events
    iOS - # of events that occurred in CA
    Android - # of events that occurred in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means that in order to do the relational counters for something like the above, I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked, and I also want to do it in such a manner that lookups on the data remain super quick. As such, I have been looking into using Redis or MongoDB for this.
Questions:
Is there a better way to do this than counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in MongoDB to increment 100+ counters at the same time in one operation be viable and not slow? The upside is that I could retrieve all of the statistics for one 'campaign' quickly in a single query (a rough sketch of this variant is shown after these questions).
Would this be better suited to redis and to do a zincrby for all of the applicable counters for the event?
Thanks
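For concreteness, the single-document $inc variant asked about in question 2 would look roughly like this; the document layout, collection name, and field paths are illustrative assumptions:

    // Hedged sketch: increment many nested counters in one document with a single $inc.
    // Assumes `db` is a connected mongodb Db instance; the tree layout is illustrative.
    async function trackEventMongo(db, campaignId, params) {
      const inc = { 'counters.total': 1 };
      for (const [field, value] of Object.entries(params)) {
        inc[`counters.${field}.${value}`] = 1;      // e.g. counters.country.US
      }
      await db.collection('campaigns').updateOne(
        { _id: campaignId },
        { $inc: inc },
        { upsert: true }
      );
    }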
Depending on how your key structure is laid out, I would recommend pipelining the ZINCRBY commands. You have an easy "commit" trigger: the request. If you iterate over your parameters and ZINCRBY each key, then send the execute command at the end of the request, it will be very fast. I've implemented a system like you describe as both a CGI and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
and was able to process something like 150,000-200,000 increments per second on the Redis side with a single process, which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a DB cleanup process. I then had a cron job that would do set operations to "roll up" stats into hourly, daily, and weekly buckets using variants of the aforementioned key pattern. I bring these ideas up because they are ways you can take advantage of the built-in capabilities of Redis to make the reporting side simpler. There are other ways of doing it, but this pattern seems to work well.
As noted by eyossi, the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real-time system, the concurrency may well be an issue. If it is an "end of day" log-parsing system, it would not likely trigger the contention unless you run multiple instances of the parser or reports at the time of input. With regard to keeping reads fast in Redis, I would consider setting up a read-only Redis instance slaved off the main one. If you put it on the server running the reports and point the reporting process at it, it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the Redis instance, you might consider running a 32-bit Redis server to keep memory usage down. A 32-bit instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64-bit Redis isn't taking too much memory, feel free to use it. As always, test your own usage patterns to validate.
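A rough sketch of the pipelined ZINCRBY idea, assuming the ioredis client; the per-field key layout is one possible variant of the minute-keyed structure above, and the parameter names are illustrative:

    // Hedged sketch: batch all counter increments for one request into a single
    // pipeline (one round trip). Assumes ioredis; key layout is illustrative.
    const Redis = require('ioredis');
    const redis = new Redis();

    function trackEvent(params) {
      // params e.g. { country: 'US', os: 'iOS', device: 'iPhone', ... }
      const minute = new Date().toISOString().slice(0, 16);   // "2012-05-01T14:07"
      const pipeline = redis.pipeline();
      for (const [field, value] of Object.entries(params)) {
        const key = `${minute}:${field}`;          // one sorted set per field per minute
        pipeline.zincrby(key, 1, value);           // member = value, score = running count
        pipeline.expire(key, 60 * 60 * 24 * 7);    // let Redis clean up old windows
      }
      return pipeline.exec();                      // flush everything in one round trip
    }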
In Redis you could use MULTI to increment multiple keys at the same time.
I had some bad experiences with MongoDB; I have found that it can be really tricky when you have a lot of writes to it...
You can look at this link for more info, and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which may already be improved in version 2.x - I didn't check it).
On the other hand, I had a good experience with Redis; I am using it for a lot of reads/writes and it works great.
You can find more information about how I am using Redis (to get a feeling for the amount of concurrent reads/writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipeline than multi if you don't need the atomic feature.
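For the record, a tiny hedged contrast of the two, assuming ioredis: multi() wraps the batch in MULTI/EXEC and is atomic, while pipeline() only batches the round trips:

    const Redis = require('ioredis');
    const redis = new Redis();

    async function demo() {
      await redis.multi().incr('a').incr('b').exec();     // atomic batch (MULTI/EXEC)
      await redis.pipeline().incr('a').incr('b').exec();  // non-atomic, just fewer round trips
    }

    demo().catch(console.error);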

Architecture to create an uptime monitor in Node.js

What's the best solution for using Node.js and Redis to create an uptime monitoring system? Can I use Redis as a queue even though it might not be the best way to save the information (maybe MongoDB is)?
It seems pretty simple, but needing more than one server to confirm that a monitored server is really down, and making everything work together, is not so easy.
To monitor uptime, you would use a cron job on the system. On each run, you would check whether the host is up and how long the check takes. In that script, you would save your data in Redis.
To do this in Node.js, you would create a script that checks the status of the server: just make an HTTP request to the server (or a ping, whatever) and record whether it fails or not, then write the result to Redis. How you do it does not matter much, because the script (if you run the cron every 30 seconds) has 30 seconds before the next run, so you don't have to worry about getting your query to the server in time. How you save your data is up to you, but in this case even MySQL would work (if you are only monitoring a small number of sites).
More on cron @ Wikipedia
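A minimal sketch of such a cron-driven check, assuming ioredis for storage; the target URL, key name, timeout, and "up" rule are illustrative:

    // check-uptime.js - run from cron, e.g. (illustrative): * * * * * node /path/to/check-uptime.js
    const https = require('https');
    const Redis = require('ioredis');

    const TARGET = 'https://example.com/';           // site to monitor (illustrative)

    function check(url) {
      return new Promise((resolve) => {
        const start = Date.now();
        const req = https.get(url, (res) => {
          res.resume();                              // drain the body
          resolve({ up: res.statusCode < 500, ms: Date.now() - start });
        });
        req.on('error', () => resolve({ up: false, ms: Date.now() - start }));
        req.setTimeout(10000, () => req.destroy());  // treat >10s as down
      });
    }

    async function main() {
      const redis = new Redis();
      const result = await check(TARGET);
      // one entry per check: score = timestamp, member = JSON payload
      await redis.zadd('uptime:example.com', Date.now(), JSON.stringify(result));
      redis.disconnect();
    }

    main().catch(console.error);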
"Can I use Redis as a queue even though it might not be the best way to save the information (maybe MongoDB is)?"
You can (and should) use Redis as your queue. It is going to be extremely fast.
I also think it is going to be a very good option to save the information inside Redis. Unfortunately, Redis does not do any timing (yet). I think you could/should use Beanstalkd to put messages on the queue that get delivered when needed (every x seconds). I also think cron is not a very good idea, because you would need a lot of cron entries, and when using a queue you can do your work faster (sharing the load among multiple processes).
Also, I don't think you need that much memory to save everything in memory (which makes the site fast), because the dataset is going to be relatively simple. Even if you aren't able (it would be smart to get more memory, if you ask me) to fit the entire dataset in memory, you can rely on Redis's virtual memory.
"It seems pretty simple, but needing more than one server to confirm that a monitored server is really down, and making everything work together, is not so easy."
Sharding/replication is what I think you should read up on to solve this (hard) problem. Luckily Redis supports replication (sharding can also be achieved), and MongoDB supports sharding/replication out of the box. To be honest, I don't think you need sharding yet, and your dataset is rather simple, so Redis is going to be faster:
http://redis.io/topics/replication
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Replication
http://ngchi.wordpress.com/2010/08/23/towards-auto-sharding-in-your-node-js-app/