Most efficient way to check time difference - mongodb

I want to check an item in my database every x minutes for y minutes after its creation.
I've come up with two ways to do this and I'm not sure which will result in better efficiency/speed.
The first is to store a Date field in the model and do something like
Model.find({ time_created: { $gt: current_time - y } })
inside of a cron job every x minutes.
The second is to keep a times_to_check field that keeps track of how many more times, based on x and y, the object should be checked.
Model.find({ times_to_check: { $gt: 0 } })
My thought on why these two might be comparable is that the first involves a comparison of Dates, which would take longer, but the second requires a write to the database after the object has been checked.
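For concreteness, here is a minimal sketch of the first approach (assuming Mongoose; the Model name, the time_created field, and the y value are just the placeholders from my description above):

// Run from a cron job every x minutes.
const mongoose = require('mongoose');
const Model = require('./model'); // hypothetical module exporting the Mongoose model

const Y_MINUTES = 30; // check items for y = 30 minutes after creation

async function checkRecentItems() {
  await mongoose.connect('mongodb://localhost/mydb');
  const cutoff = new Date(Date.now() - Y_MINUTES * 60 * 1000);
  const items = await Model.find({ time_created: { $gt: cutoff } });
  // ... check each item here ...
  await mongoose.disconnect();
}

checkRecentItems();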

Either way you are going to have to check the database continuously to see if it is time to query your collection. In your "second solution" you do not have a way to run your background process, as you are only describing how you determine your collection delta.
Stick with running your unix cron job, but make sure it is fault tolerant and has controls ensuring it is actually running when your application is up. Below is a pretty good answer for how to handle that.
How do I write a bash script to restart a process if it dies?
Based on that, I would ask how your application reacts if your cron job has not run for x minutes, hours, or days. How will your application recover if this does happen?

Related

How to create all agents at once by a database?

I'm generating my agents in AnyLogic based on a database table that I've created. In this DB I have some characteristics of my agent. This agent is supposed to be my "scheduling agent"; since my focus is on rescheduling, it is important that my production orders are saved as agents in a queue. My problem is that when generating the agents, I can't tell the system to generate all of them at once (i.e. import the table and turn each line of my DB into an agent with its characteristics).
I tried doing it by adding a 1 s difference between every production order, but when the last date is reached my simulation gives an error and stops working. Could someone help me achieve my task? Do you think there would be a better solution?
I am not 100% sure what you are trying to do, but I had a similar problem that I think I solved this way.
I have a database of batches that I want to load all at once.
[screenshot of the Source block settings]
This is going to load the batches one at a time with 0 interarrival time. This means that batches will flow continuously. Also important is the Limited number of arrivals option, which will stop the loading when the end of the database is reached.
Also, after the source, I added a queue with Maximum capacity set to infinite.
Hope that helps

Benchmarking Redshift Queries

I want to know how long my queries take to execute, so that I can see whether my changes improve the runtime or not.
Simply timing the execution of the whole query is unsuitable, since this also includes the (highly variable) time spent waiting in an execution queue.
Redshift provides the STL_WLM_QUERY table that contains separate columns for queue wait time and execution time. However, my queries do not reliably show up in this table. For example if I execute the same query multiple times the number of corresponding rows in STL_WLM_QUERY is often much smaller than the number of repetitions. Sometimes, but not always, only one row is generated no matter how often I run the query. I suspect some caching is going on.
Is there a better way to find the actual execution time of a Redshift query, or can someone at least explain under what circumstances exactly a row in STL_WLM_QUERY is generated?
My tips:
- If possible, ensure that your query has not waited at all; if it has waited, there should be a row in stl_wlm_query. If it did wait, then rerun it.
- Run the query once to compile it, then a second time to benchmark it. Compile time can be significant.
- Disable the new query result caching feature (if you have it yet - you probably don't): https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-redshift-introduces-result-caching-for-sub-second-response-for-repeat-queries/
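As a rough sketch of how you might read those two columns for the query you just ran (assuming you connect to Redshift over the Postgres protocol with node-postgres; the column names come from the STL_WLM_QUERY documentation, and a row only appears there if the query actually went through a WLM queue rather than being served from the result cache or handled entirely on the leader node):

const { Client } = require('pg');

// Hypothetical helper: run the query under test, then read its queue wait and execution time.
async function benchmark(sql) {
  const client = new Client({ /* Redshift connection settings */ });
  await client.connect();
  await client.query(sql); // the query being benchmarked
  const res = await client.query(
    'SELECT total_queue_time / 1000000.0 AS queue_seconds, ' +
    '       total_exec_time  / 1000000.0 AS exec_seconds ' +
    'FROM stl_wlm_query WHERE query = pg_last_query_id()');
  console.log(res.rows[0]); // queue wait vs. execution time, in seconds
  await client.end();
}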

Executing same query makes time difference in postgresql

I just want to know the reason for getting different times when executing the same query in PostgreSQL.
For example: select * from datas;
The first time it takes 45 ms.
The second time the same query takes 55 ms, and the next time it takes some other time. Can anyone say what the reason is for this non-constant time?
Simple: every time, the database has to read the whole table and retrieve the rows. There might be 100 different things happening in the database which can cause a difference of a few milliseconds. There is no need to panic; this is bound to happen. You can expect the operation to take the same time to within a few milliseconds. If there is a huge difference, then it is something that has to be looked into.
Have you applied indexing to your table? It also increases speed a great deal!
Compiling the explanation from the reference by matt b:
The EXPLAIN statement helps us display the execution plan that the PostgreSQL planner generates for the supplied statement. The execution plan shows how the table(s) referenced by the statement will be scanned (by plain sequential scan, index scan, etc.) and, if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table.
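For example, a quick way to see both the plan and a measured runtime for the query above is EXPLAIN ANALYZE, which actually executes the statement (a sketch, assuming node-postgres and the datas table from the question):

const { Client } = require('pg');

async function explainDatas() {
  const client = new Client();
  await client.connect();
  // EXPLAIN ANALYZE runs the statement and reports actual times per plan node.
  const res = await client.query('EXPLAIN ANALYZE SELECT * FROM datas');
  res.rows.forEach((row) => console.log(row['QUERY PLAN'])); // one plan line per row
  await client.end();
}

explainDatas();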
And from the reference by Pablo Santa Cruz:
You need to change your PostgreSQL configuration file.
Enable this property (for example, set it to 0 to log every statement and its duration):
log_min_duration_statement = 0  # -1 is disabled, 0 logs all statements
                                # and their durations, > 0 logs only
                                # statements running at least this number
                                # of milliseconds
After that, execution times will be logged and you will be able to figure out exactly how well (or badly) your queries are performing.
Well that's about the case with every app on every computer. Sometimes the operating system is busier than other times, so it takes more time to get the memory you ask it for or your app gets fewer CPU time slices or whatever.

Random Lookup Methodology

I have a postgres database with a table that contains rows I want to look up at pseudorandom intervals. Some I want to look up once an hour, some once a day, and some once a week. I would like the lookups to be at pseudorandom intervals inside their time window. So, the look up I want to do once a day should happen at a different time each time that it runs.
I suspect there is an easier way to do this, but here's the rough plan I have:
Have a settings column for each lookup item. When the script starts, it randomizes the epoch time for each lookup and sets it in the settings column, identifying the time for the next lookup. I then run a continuous loop with a wait 1 to see if the epoch time matches any of the requested lookups. Upon running the lookup, it recalculates when the next lookup should be.
My questions:
Even in the design phase, this looks like it's going to be a duct tape and twine routine. What's the right way to do this?
If by chance, my idea is the right way to do this, is my idea of repeating the loop with a wait 1 the right way to go? If I had 2 lookups back to back, there's a chance I could miss one but I can live with that.
Thanks for your help!
Add a column to the table for NextCheckTime. You could use either a timestamp or just an integer with the raw epoch time. Add a (non-unique) index on NextCheckTime.
When you add a row to the database, populate NextCheckTime by taking the current time, adding the base interval, and adding/subtracting a random factor (maybe 25% of the base interval, or whatever is appropriate for your situation). For example:
my $interval = 3600; # 1 hour in seconds
my $next_check = time + int($interval * (0.75 + rand 0.5));
Then in your loop, just SELECT * FROM table ORDER BY NextCheckTime LIMIT 1. Then sleep until the NextCheckTime returned by that (assuming it's not already in the past), perform the lookup, and update NextCheckTime as described above.
If you need to handle rows newly added by some other process, you might put a limit on the sleep. If the NextCheckTime is more than 10 minutes in the future, then sleep 10 minutes and repeat the SELECT to see if any new rows have been added. (Again, the exact limit depends on your situation.)
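If you happened to be doing this from Node rather than Perl, the loop might look roughly like this (a sketch only, assuming node-postgres; the lookups table and its id, base_interval, and next_check_time columns are placeholder names, with next_check_time stored as epoch seconds):

const { Client } = require('pg');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function run() {
  const client = new Client();
  await client.connect();
  while (true) {
    const { rows } = await client.query(
      'SELECT id, base_interval, next_check_time FROM lookups ORDER BY next_check_time LIMIT 1');
    if (rows.length === 0) { await sleep(600000); continue; } // nothing scheduled yet
    const row = rows[0];
    const waitMs = row.next_check_time * 1000 - Date.now();
    if (waitMs > 600000) { await sleep(600000); continue; } // cap the sleep so newly added rows get noticed
    if (waitMs > 0) await sleep(waitMs);
    // ... perform the lookup for row.id here ...
    const jitter = 0.75 + Math.random() * 0.5; // +/- 25% around the base interval
    const next = Math.floor(Date.now() / 1000 + row.base_interval * jitter);
    await client.query('UPDATE lookups SET next_check_time = $1 WHERE id = $2', [next, row.id]);
  }
}

run();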
How big is your data set? If it's a few thousand rows, then just randomizing the whole list and grabbing the first x rows is OK. As the size of your set grows, this becomes less and less scalable; the performance drops off at a non-linear rate. But if you only need to run this once an hour at most, then it's no big deal if it takes a minute or two, as long as it doesn't kill other processes on the same box.
If you have a gapless sequence, whether there from the beginning or added on, then you can use indexes with something like:
my $i = int(rand($sizeof_set));   # random index in 0 .. sizeof_set - 1
SELECT * FROM table WHERE seqid = $i;
and get good scalability to millions and millions of rows.

Best way to update DB (mongo) every hour?

I am preparing a small app that will aggregate data on users on my website (via socket.io). I want to insert all data into my MongoDB every hour.
What is the best way to do that? setInterval(60000) seems to be a lil bit lame :)
You can use cron, for example, and run your node.js app as a scheduled job.
EDIT:
In the case where the program has to run continuously, setTimeout is probably one of the few possible choices (and it is quite simple to implement). Otherwise you can offload your data to some temporary storage system, for example Redis, and then regularly run another node.js program to move your data; however, this may introduce a new dependency on another DB system and increase complexity depending on your scenario. Redis can also act as a kind of failsafe in case your main node.js app is unexpectedly terminated and loses part or all of your data batch.
You should aggregate in real time, not once per hour.
I'd take a look at this presentation by BuddyMedia to see how they are doing real time aggregation down to the minute. I am using an adapted version of this approach for my realtime metrics and it works wonderfully.
http://www.slideshare.net/pstokes2/social-analytics-with-mongodb
Why not just hit the server with a curl request that triggers the database write? You can put the command on an hourly cron job and listen on a local port.
You could have mongo store the last time you copied your data and each time any request comes in you could check to see how long it's been since you last copied your data.
Or you could try a setInterval(checkRestore, 60000) for once a minute checks. checkRestore() would query the server to see if the last updated time is greater than an hour old. There are a few ways to do that.
An easy way to store the date is to just store it as the value of Date.now() (https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date) and then check for something like db.logs.find({lastUpdate: {$lt: Date.now() - 3600000}}), where 3600000 is one hour in milliseconds.
I think I confused a few different solutions there, but hopefully something like that will work!
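Something along these lines is what I have in mind (a sketch only, assuming the official mongodb Node driver; the meta collection and lastUpdate field are placeholder names):

const { MongoClient } = require('mongodb');

const ONE_HOUR = 60 * 60 * 1000;

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('mydb');

  async function checkRestore() {
    const meta = await db.collection('meta').findOne({ _id: 'aggregation' });
    if (!meta || meta.lastUpdate < Date.now() - ONE_HOUR) {
      // ... aggregate and write the batched socket.io data here ...
      await db.collection('meta').updateOne(
        { _id: 'aggregation' },
        { $set: { lastUpdate: Date.now() } },
        { upsert: true });
    }
  }

  setInterval(checkRestore, 60000); // check once a minute
}

main();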
If you're using Node, a nice CRON-like tool to use is Forever. It uses the same CRON patterns to handle repetition of jobs.