Random Lookup Methodology - perl

I have a Postgres database with a table that contains rows I want to look up at pseudorandom intervals. Some I want to look up once an hour, some once a day, and some once a week. I would like the lookups to be at pseudorandom intervals inside their time window, so the lookup I want to do once a day should happen at a different time each day that it runs.
I suspect there is an easier way to do this, but here's the rough plan I have:
Have a settings column for each lookup item. When the script starts, it randomizes the epoch time for each lookup and sets it in the settings column, identifying the time of the next lookup. I then run a continuous loop with a wait 1 to see if the epoch time matches any of the requested lookups. Upon running a lookup, it recalculates when the next lookup should be.
My questions:
Even in the design phase, this looks like it's going to be a duct tape and twine routine. What's the right way to do this?
If, by chance, my idea is the right way to do this, is repeating the loop with a wait 1 the right way to go? If I had 2 lookups back to back, there's a chance I could miss one, but I can live with that.
Thanks for your help!

Add a column to the table for NextCheckTime. You could use either a timestamp or just an integer with the raw epoch time. Add a (non-unique) index on NextCheckTime.
When you add a row to the database, populate NextCheckTime by taking the current time, adding the base interval, and adding/subtracting a random factor (maybe 25% of the base interval, or whatever is appropriate for your situation). For example:
my $interval = 3600; # base interval: 1 hour in seconds
my $next_check = time + int($interval * (0.75 + rand 0.5)); # 75% to 125% of the base interval
Then in your loop, just SELECT * FROM table ORDER BY NextCheckTime LIMIT 1. Then sleep until the NextCheckTime returned by that (assuming it's not already in the past), perform the lookup, and update NextCheckTime as described above.
If you need to handle rows newly added by some other process, you might put a limit on the sleep. If the NextCheckTime is more than 10 minutes in the future, then sleep 10 minutes and repeat the SELECT to see if any new rows have been added. (Again, the exact limit depends on your situation.)
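A minimal sketch of that loop in Perl with DBI (the connection details and the lookups table with id, base_interval in seconds, and an indexed next_check_time in epoch seconds are hypothetical; adjust to your schema):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

sub do_lookup {                       # stand-in for the actual lookup work
    my ($id) = @_;
    warn "looking up item $id\n";
}

while (1) {
    # Grab the row that is due soonest.
    my $row = $dbh->selectrow_hashref(
        'SELECT id, base_interval, next_check_time
           FROM lookups ORDER BY next_check_time LIMIT 1');
    last unless $row;

    # Cap the sleep at 10 minutes so rows added by other processes get noticed.
    my $wait = $row->{next_check_time} - time;
    if ($wait > 600) { sleep 600; next; }
    sleep $wait if $wait > 0;

    do_lookup($row->{id});

    # Reschedule: base interval plus or minus up to 25%.
    my $next = time + int($row->{base_interval} * (0.75 + rand 0.5));
    $dbh->do('UPDATE lookups SET next_check_time = ? WHERE id = ?',
             undef, $next, $row->{id});
}

Sleeping until the next due time (rather than polling every second) means the process sits idle most of the time, and the 10-minute cap covers newly added rows.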

How big is your data set? If it's a few thousand rows, then just randomizing the whole list and grabbing the first x rows is OK. As the size of your set grows, this becomes less and less scalable; the performance drops off at a non-linear rate. But if you only need to run this once an hour at most, then it's no big deal if it takes a minute or two, as long as it doesn't kill other processes on the same box.
If you have a gapless sequence, whether there from the beginning or added on, then you can use indexes with something like:
my $i = int rand $size_of_set;   # random integer in [0, size_of_set - 1]
SELECT * FROM table WHERE seqid = $i;
and get good scalability to millions and millions of rows.
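For example, a short DBI sketch (mytable and seqid are placeholder names; the count could equally be cached or derived from max(seqid) since the sequence is gapless):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password', { RaiseError => 1 });

# Size of the set; cheap to keep track of for a gapless sequence.
my ($size_of_set) = $dbh->selectrow_array('SELECT count(*) FROM mytable');

my $i   = int rand $size_of_set;     # random seqid in [0, size_of_set - 1]
my $row = $dbh->selectrow_hashref('SELECT * FROM mytable WHERE seqid = ?', undef, $i);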

Related

Given a fixed (but non-linear) calendar of future equipment usage; how to calculate the life at a given date in time

I have a challenge that I'm really struggling with.
I have a table which contains a 'scenario' that a user has defined, describing how much 'usage' they will consume. For example, how many hours a machine will be turned on for.
In month 1 they will use 300 (hours, stored as an integer in minutes), in month 2, 100, in month 3, 450, and so on.
I then have a list of tasks which need to be performed at specific intervals of usage. Given the above scenario, how could I forecast the dates on which these tasks will be due? I also need to show repeat accomplishments and their dates.
Each task stores the number of consumed hours at the last point of accomplishment, the interval between accomplishments, and the expected life total at which it is next due (Last Done + Interval = Next Due).
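To make that roll-forward logic concrete, here is a purely illustrative sketch in Perl (made-up numbers, simplified 30-day months, everything in minutes): expand each month's usage into a per-day figure, accumulate it, and record the day on which the running total reaches each task's next-due value, rolling the task forward for repeat accomplishments.

use strict;
use warnings;

my @monthly_usage = (300 * 60, 100 * 60, 450 * 60);   # minutes consumed in months 1..3
my @tasks = ( { name => 'inspection', last_done => 0, interval => 200 * 60 } );

my ($cumulative, $day) = (0, 0);
for my $usage (@monthly_usage) {
    my $days_in_month = 30;            # simplified; use a real calendar in practice
    my $per_day       = $usage / $days_in_month;
    for (1 .. $days_in_month) {
        $day++;
        $cumulative += $per_day;
        for my $task (@tasks) {
            # a task can come due more than once, so keep rolling it forward
            while ($cumulative >= $task->{last_done} + $task->{interval}) {
                my $due_at = $task->{last_done} + $task->{interval};   # Last Done + Interval = Next Due
                printf "day %d: %s due at %.0f minutes\n", $day, $task->{name}, $due_at;
                $task->{last_done} = $due_at;                          # repeat accomplishment
            }
        }
    }
}

The hard part is then mapping "day N" back onto calendar dates and doing this per scenario and per measurement point, which is where the volume problem below comes from.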
I've tried SO many different options; ideally I want this to be compiled at run time (i.e. the only things saved into a permanent table are the forecast and the list of tasks). I have 700-800 scenarios, and given the number of pieces of equipment, there are 12,000 tasks to be carried out. I need to show at least the next 100 years of tasks.
At the minute, I get the scenario and cross apply a list of dates between now and the year 2118. I then (in the WHERE clause) filter out rows where the period number (the month number of the year) doesn't match the date, and divide the period usage by the number of days in that period. That gives me a day-by-day usage over the next 100 years (~36,000 rows). When I join on the 12,000 tasks and then try to filter where the due-at value matches my dates table, I can't return ANY rows, even selecting just the date column, and the query runs for 7-8 minutes.
We measure more than just hours of usage too; there are 25 different measurement points, all of which are specified in the scenario.
I can't use linear regression because, let's say, we shut down over the summer and don't utilize the equipment; an average usage over the year would then make tasks come due even when we are telling the scenario we're shut down.
Are there any strategies out there that I could apply? I'm not looking for a 'here's the SQL' answer; I'm just running out of strategies to form a solid query that can deal with the volume of data I'm dealing with.
I can get a query running perfectly with one task, but it just doesn't scale... at all...
If SQL isn't the answer, then I'm open to suggestions.
Thanks,
Harry

Most efficient way to check time difference

I want to check an item in my database every x minutes for y minutes after its creation.
I've come up with two ways to do this and I'm not sure which will result in better efficiency/speed.
The first is to store a Date field in the model and do something like
Model.find({time_created > current_time - y})
inside of a cron job every x minutes.
The second is to keep a times_to_check field that keeps track of how many more times, based on x and y, the object should be checked.
Model.find({times_to_check > 0})
My thought on why these two might be comparable is that the date comparison in the first would take longer, while the second requires a write to the database after the object has been checked.
So either way you are going to have to check the database continuously to see if it is time to query your collection. In your "second solution" you do not have a way to run your background process, as you are only describing how you determine your collection delta.
Stick with running your Unix cron job, but make sure it is fault tolerant and has controls ensuring it is actually running when your application is up. Below is a pretty good answer for how to handle that.
How do I write a bash script to restart a process if it dies?
Based on that, I would ask: how does your application react if your cron job has not run for x minutes, hours, or days? How will your application recover if this does happen?

How to ensure spark not read same data twice from cassandra

I'm learning spark and cassandra. My problem is as follows.
I have a Cassandra table which records row data from a sensor:
CREATE TABLE statistics.sensor_row (
    name text,
    date timestamp,
    value int,
    PRIMARY KEY (name, date)
)
Now I want to aggregate these rows through a Spark batch job (i.e. daily).
So I could write
val rdd = sc.cassandraTable("statistics","sensor_row")
//and do map and reduce to get what I want, and perhaps write back to an aggregated table.
But my problem is that I will be running this code periodically, and I need to make sure I don't read the same data twice.
One thing I can do is delete the rows which I have read, which looks pretty ugly; the other is to use a filter:
sensorRowRDD.where("date > '2016-02-05 07:32:23+0000'")
The second one looks much nicer, but then I need to record when the job last ran and continue from there. However, according to the DataStax driver's data locality, each worker will load data only from its local Cassandra node, which means that instead of tracking a global date I need to track a date for each Cassandra/Spark node. That still does not look very elegant.
Is there any better ways of doing this ?
The DataFrame filters will be pushed down to Cassandra, so this is an efficient solution to the problem. But you are right to worry about the consistency issue.
One solution is to set not just a start date, but an end date as well. When your job starts, it looks at the clock: it is 2016-02-05 12:00. Perhaps you have a few minutes' delay in collecting late-arriving data, and the clocks are not absolutely precise either, so you decide to use a 10-minute delay and set your end time to 2016-02-05 11:50. You record this in a file/database. The end time of the previous run was 2016-02-04 11:48, so your filter is date > '2016-02-04 11:48' and date < '2016-02-05 11:50'.
Because the date ranges cover all time, you will only miss events that were saved into a past range after that range had been processed. You can increase the delay beyond 10 minutes if this happens too often.

How to create KDB query which groups by time interval and do not bring RDB down?

We receive quotes from exchange and store them in KDB Ticker Plant. We want to analyze volume in RDB and HDB with minimum impact on performance of these databases since they are also used by other teams.
Firstly, how may we create a function which splits a day into 10-minute intervals and produces volume statistics for each interval? Which KDB functions do we need to use?
Secondly, how do we do it safely? Should we extract records in a loop, portion by portion, or in one go with a single query? We have around 150 million records for each day in our database.
I'll make some assumptions about table and column names, which I'm sure you can extrapolate
We receive quotes from exchange and store them in KDB Ticker Plant
As a matter of definition, tickerplant only stores data for a very small amount of time and then logs it to file and fires the data off to RDB (and other listeners).
with minimum impact on performance of these databases
It all depends on (a) your data volume and (b) a most optimal where clause. It also depends on whether you have enough RAM on your machine to cope with the queries. The closer you get to that limit, the harder it is for the OS to allocate memory, and therefore the longer the query takes (although memory-allocation time pales in comparison to getting data off a disk, so disk speed is also a factor).
Firstly, how we may create a function which splits a day in 10 minutes intervals and for each interval create a stat with volume?
Your friend here is xbar: http://code.kx.com/q/ref/arith-integer/#xbar
getBy10MinsRDB:{[instrument;mkt]
  / name the aggregates explicitly so the result columns don't clash
  select maxVol:max volume, minVol:min volume, sumVol:sum volume, avgVol:avg volume by 10 xbar `minute$time from table where sym=instrument, market=mkt
  };
For an HDB the most optimal query (for a date-parted database) filters by date, then sym, then time. In your case you haven't asked for a time constraint, so I omit it.
getBy10MinsHDB:{[dt;instrument;mkt]
  / for a date-parted HDB, constrain on date first
  select maxVol:max volume, minVol:min volume, sumVol:sum volume, avgVol:avg volume by 10 xbar `minute$time from table where date=dt, sym=instrument, market=mkt
  };
Should we extract records in a loop portion by portion or in one go with one query?
No, that's the absolute worst way of doing things in KDB :-) There is almost always a nice vectorised solution.
We have around 150 million records for each day in our database.
Since KDB is a columnar database, the types of the columns you have are as important as the number of records; as that impacts memory.
because they are also used by other teams
If simple queries like the above are causing issues, you need to consider splitting the table up, perhaps by market, to reduce query clashes and load. If memory isn't an issue, consider -s on the HDBs for multithreaded queries (over multiple days). Consider a negative port number on the HDB for a multithreaded input queue to minimise query clashes (although it doesn't necessarily make things faster).

Executing same query makes time difference in postgresql

I just want to know the reason for getting different times when executing the same query in PostgreSQL.
For example: select * from datas;
The first time it takes 45 ms.
The second time the same query takes 55 ms, and the next time it takes some other value again. Can anyone say what the reason is for this non-constant timing?
Simple: every time, the database has to read the whole table and retrieve the rows. There might be 100 different things happening in the database which can cause a difference of a few millis. There is no need to panic; this is bound to happen. You can expect the operation to take the same time only to within a few milliseconds. If there is a huge difference, then it is something that has to be looked into.
Have you applied indexing to your table? It also increases speed a great deal!
Compiling the explanation from the reference by matt b:
The EXPLAIN statement helps us display the execution plan that the PostgreSQL planner generates for the supplied statement.
The execution plan shows how the
table(s) referenced by the statement will be scanned — by plain
sequential scan, index scan, etc. — and if multiple tables are
referenced, what join algorithms will be used to bring together the
required rows from each input table
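For example, a quick way to look at the plan and the measured runtime from Perl/DBI (a sketch only; the connection details are placeholders and datas is the table from the question):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password', { RaiseError => 1 });

# EXPLAIN ANALYZE runs the query and reports the plan together with the real
# execution time, which makes run-to-run variation visible.
my $plan = $dbh->selectcol_arrayref('EXPLAIN ANALYZE SELECT * FROM datas');
print "$_\n" for @$plan;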
And from the reference by Pablo Santa Cruz:
You need to change your PostgreSQL configuration file.
Enable this property (the default of -1 disables it):
log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this number
# of milliseconds
After that, execution times will be logged and you will be able to figure out exactly how badly (or well) your queries are performing.
Well, that's about the case with every app on every computer. Sometimes the operating system is busier than at other times, so it takes more time to get the memory you ask it for, or your app gets fewer CPU time slices, or whatever.