KDB+/Q query too heavy to handle - kdb

I want to grab data from a KDB database for a list of roughly 200 days within the last two years. The 200 days follow no particular pattern.
I only need the data from 09:29:00.000 to 09:31:00.000 each day.
My first approach was to query all of the last two years' data with a timestamp between 09:29:00.000 and 09:31:00.000, because I didn't see a way to query just the particular 200 days that I need.
However, this proved to be too much for my server to handle.
Then I tried to summarize the 2-minute data for each date into an average and print only the average, so that the output would have just 200 rows. But somehow this still turns out to be too much. I'm not sure if this is because I'm not selecting the data correctly.
My other suspicion is that the query is grabbing all the data first and then averaging each date, which means the averaging does not make it any easier to handle.
Here's the code that I have:
select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty) by date from dms where date within(2015.01.01, 2016.06.10), time within(09:29:00.000, 09:31:00.000), sym = `ZF6
poms is the table that the data is in
ZFU6 is the symbol that I'm looking for
I tried adding the keyword distinct after select.
I want to know if there's any way to break up the query, or make the query lighter for the server to handle.
Thank you!

If you use 32-bit kdb+ and get the infamous 'wsfull error, you can try processing one day at a time, like this:
/ by date keeps the date on each result row; the range runs from 2015.01.01 to 2016.06.10 inclusive
raze{select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty) by date
  from dms where date=x,sym=`ZF6,time within 09:29:00.000 09:31:00.000}each 2015.01.01+til 1+2016.06.10-2015.01.01
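The question mentions roughly 200 dates in no particular pattern rather than a contiguous range. A minimal sketch, assuming the query text is assembled client-side in Python, of running a per-day select like the one above over an explicit date list. The helper names `q_date` and `build_q_query` are hypothetical, and the table and symbol defaults simply mirror the answer:

```python
from datetime import date

def q_date(d: date) -> str:
    """Format a Python date as a q date literal, e.g. 2015.01.05."""
    return d.strftime("%Y.%m.%d")

def build_q_query(dates, table="dms", sym="ZF6"):
    """Return q text that runs the per-day select over an explicit date list.

    Hypothetical helper: table and sym defaults are taken from the answer above.
    """
    # De-duplicate and sort so each date partition is read once, in order.
    date_list = " ".join(q_date(d) for d in sorted(set(dates)))
    return (
        "raze{select maxPriceB:max price, minPriceB:min price, "
        "avgPriceB:avg price, avgSizeB:avg qty by date "
        f"from {table} where date=x,sym=`{sym},"
        "time within 09:29:00.000 09:31:00.000}"
        # (), makes sure q sees a list even when only one date is supplied
        f"each (),{date_list}"
    )
```

The generated text can be pasted into a q session or sent through whichever client library is in use. Each date is still processed on its own, so peak memory stays at roughly one day's worth of rows.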

Related

Keep table synced with another but with accumulated / grouped data

If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
  ts TIMESTAMPTZ(6) NOT NULL,
  value FLOAT8 DEFAULT 'NaN'::REAL NOT NULL,
  sensor_id INT4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often though I need to extract and work with daily values over years (On a web frontend). To aid this I would like a sensor_values_days table that only has the daily sums for each point and then I can use this for faster queries over longer timespans.
I don't want a trigger on every write to the db, as I am afraid that would slow down the writes, which are already a bottleneck.
Is there a way to trigger only after so many rows have been inserted?
Or perhaps an index that maintains a sum of the entries per day? I don't think that is possible.
What would be the best way to do this. It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this.
Install ClickHouse and use the AggregatingMergeTree table engine.
With postgres:
Create a per-period aggregate table. You can have several with different granularity, like hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for periods that came after the last available one. This process will also work if the per-period table is empty, or if it missed the last update then it will catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamp of the rows that were aggregated, so later if you check the table you see it did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, which should help!
Then, repeat the same process for the "day" and "month" table.
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day out of the live table, since all the previous days' worth of data has been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
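A minimal sketch of that UNION ALL, again using Python's built-in sqlite3 as a stand-in for PostgreSQL. The function name `daily_totals` and the `sensor_values_days` table with `day`, `sensor_id` and `total` columns are assumptions:

```python
import sqlite3

def daily_totals(conn, sensor_id, today):
    """Closed days from the per-day table plus the current day from the live table.

    today is a 'YYYY-MM-DD' string marking the first day not yet summarized.
    """
    return conn.execute("""
        SELECT day, total FROM sensor_values_days WHERE sensor_id = ?
        UNION ALL
        SELECT date(ts) AS day, SUM(value) AS total
        FROM sensor_values
        WHERE sensor_id = ? AND date(ts) >= ?
        GROUP BY date(ts)
        ORDER BY day
    """, (sensor_id, sensor_id, today)).fetchall()
```

Only the rows for the current day are read from the large table; everything older comes from the much smaller summary.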
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized Views and a Cron every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
INCREMENTAL MATERIALIZED VIEW has been proposed for PG14, but for the moment it is still in development.

Statistical query to loop through different date periods

I have a massive query log table in PostgreSQL. I have been asked to get statistical data from it, but the table is so massive: it has about 170,000,000 rows.
I've been asked for statistics for the last 6 months: the count of each service for each day.
The issue is that since the table is so big, it will take forever to get this data.
Here's the current query I use:
SELECT ql.query_time::timestamp::date,count(ql.query_name),ql.query_name
FROM query_log ql
WHERE ql.query_time BETWEEN '2017-12-20 14:00:00.000'::timestamp AND '2018-06-20 14:00:00.000'::timestamp AND success=TRUE
GROUP BY ql.query_time::timestamp::date, ql.query_name;
Please suggest how to make this query faster and more efficient. I want to save the output to a CSV file.
I've been thinking of looping through each day of the past 6 months, but I don't know how to do it.
Oh, and ql.query_time is indexed.
Thx!
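One way to loop day by day, as the question suggests, can be sketched as follows. This is a sketch under assumptions: `run_query` is a hypothetical caller-supplied function that executes the GROUP BY query for a single window (with `query_time >= lo AND query_time < hi`) and returns `(query_name, count)` rows, and the function names are illustrative:

```python
import csv
from datetime import datetime, timedelta

def daily_windows(start, end):
    """Yield (lo, hi) half-open windows covering [start, end), split at midnight."""
    lo = start
    while lo < end:
        next_midnight = datetime.combine(
            lo.date() + timedelta(days=1), datetime.min.time())
        hi = min(next_midnight, end)
        yield lo, hi
        lo = hi

def export_daily_counts(run_query, start, end, out_file):
    """Write one CSV row per (day, query_name), querying one window at a time."""
    writer = csv.writer(out_file)
    writer.writerow(["day", "query_name", "count"])
    for lo, hi in daily_windows(start, end):
        # run_query is hypothetical: it should return (query_name, count) rows
        for name, count in run_query(lo, hi):
            writer.writerow([lo.date().isoformat(), name, count])
```

Each window touches only one day of the indexed `query_time` range, so every individual query stays small and the export can be resumed from the last day written. The windows are half-open, which differs slightly from the inclusive BETWEEN in the original query.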

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a field DATE. Before, the dates that came as input (after some processing) were put in a table as output. Now, I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come with 0, or equivalent.
Example. I have two inputs. One with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field set to 0, and the same for the 20th and 21st and so on.
I know how to do this programmatically, but not in PowerCenter. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields, holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, like sorting the dates and a union between them, to get the data ready for a joiner. The idea of the joiner is to do a Full Outer Join between the original data and the data that is going to have all fields set to 0, which we got from the previous aggregator.
Then a router sends the rows that had actual dates (and fields without nulls) to one group and the rows where all fields are null to another group; the null fields are then set to 0 before finally being written to a table.
I am wondering how this can be achieved by, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task would get ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function that would cut the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates, a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do.
The same goes for the 'full outer' join. A normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the 'joiner'? In that case a lookup configured to return 'all rows' and a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use the time dimension in our warehouse, which has one row per day from 1753-01-01 through the next 200000 days, and a primary integer column with values from 1 and up ...
You've said you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution. But that doesn't exclude PowerCenter by any means: just feed the 2 dates into a Java transformation and apply some code that produces all dates between them and outputs a record for each. A Java transformation is ideal for record generation.
Ok... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT (m.min_date + levquery.n - 1) AS Port1
FROM (SELECT MIN(DATEFIELD) AS min_date FROM tablea) m,
     (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
(Check if the query works for you - haven't access to pc to test it at the minute)
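The record-generation logic that the answers describe (produce every date in the window, then keep real rows and zero-fill the rest) can be sketched as below. It is written in Python for brevity rather than as PowerCenter Java transformation code, and the function name `fill_missing_days` and the dict-based row shape are assumptions:

```python
from datetime import date, timedelta

def fill_missing_days(rows, field_names, days=365):
    """Zero-fill every missing day from the earliest input date onwards.

    rows maps date -> dict of field values. The result is a sorted list of
    (date, fields) covering the earliest date through `days` days later.
    """
    start = min(rows)
    out = {}
    for offset in range(days + 1):
        day = start + timedelta(days=offset)
        # Real input wins; otherwise emit a row with every field set to 0.
        out[day] = rows.get(day, {name: 0 for name in field_names})
    # Keep any input dates that fall outside the generated window.
    for day, fields in rows.items():
        out.setdefault(day, fields)
    return sorted(out.items())
```

Because the window length is a parameter, covering 10 years instead of one only changes `days`; no extra fields or steps are needed.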

Tableau Future and Current References

Tough problem I am working on here.
I have a table of CustomerIDs and CallDates. I want to measure whether there is a 'repeat call' within a certain period of time (up to 30 days).
I plan on creating a parameter called RepeatTime which is a range from 0 - 30 days, so the user can slide a scale to see the number/percentage of total repeats.
In Excel, I have this working. I sort CustomerID in order and then sort CallDate from earliest to latest. I then have formulas like:
=IF(AND(CurrentCustomerID = FutureCustomerID, FutureCallDate - CurrentCallDate <= RepeatTime), 1,0)
CurrentCustomerID = the current row, and the FutureCustomerID = the following row (so it is saying if the customer ID is the same).
FutureCallDate = the following row and the CurrentCallDate = the current row. It subtracts the current call date from the future call date to measure the time in between.
The goal is to be able to see, dynamically, how many customers called in for a specific reason within maybe 4 hours or 1 day or 5 days, etc. All of the way up until 30 days (this is our actual metric but it is good to see the calls which are repeats within a shorter time frame so we can investigate).
I had a similar problem; see here for the detailed version: Array calculation in Tableau, maxif routine
Your case is basically the same as mine, so you could apply that solution, but I find the one I'm about to give easier to understand. I would do the following:
1) Create a calculated field called RepeatTime:
DATEDIFF('day',LOOKUP(MAX(CallDates),-1),MAX(CallDates))
This calculates how many days have passed between the previous call and the current one. You can wrap it in IFNULL to avoid Null values for the first entry.
2) Drag CustomersID, CallDates and RepeatTime to the worksheet (can be on the marks tab, don't need to be on rows or column).
3) Configure the table calculation of RepeatTime: Compute using Advanced..., partitioning by CustomersID, addressing by CallDates
Also Sort by Field CallDates, Maximum, Ascending.
This will guarantee the table calculation works properly
4) Now you have a base that you can use for what you need. You can either export it to csv or mdb and connect to it.
The best approach, actually, is to have this RepeatTime field calculated outside Tableau, on your database, so it's already there when you connect to it. But this is a way to use Tableau to do the calculation for you.
Unfortunately, there's no direct way to do this with your database.
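As a sketch of computing the repeat flag outside Tableau, the following mirrors the Excel formula from the question: sort by customer and call date, then compare each row with the next one. The function name `flag_repeats` and the tuple-based input are assumptions:

```python
from datetime import date

def flag_repeats(calls, repeat_days):
    """Flag calls that are followed by another call from the same customer.

    calls is an iterable of (customer_id, call_date). Returns a list of
    (customer_id, call_date, flag), where flag is 1 when the customer's next
    call is at most repeat_days days later, and 0 otherwise.
    """
    ordered = sorted(calls)  # by customer, then by call date
    flagged = []
    for i, (cust, day) in enumerate(ordered):
        flag = 0
        if i + 1 < len(ordered):
            next_cust, next_day = ordered[i + 1]
            if next_cust == cust and (next_day - day).days <= repeat_days:
                flag = 1
        flagged.append((cust, day, flag))
    return flagged
```

Summing the flags for each value of `repeat_days` from 0 to 30 gives the counts that the RepeatTime slider is meant to explore.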

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, if a query has a where condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
this will fetch a year's data if executed today. Now if the same query is executed tomorrow, 365 days of data will be fetched again. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data with better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query, which will be executed in the next run. However, deleting the single day's data is proving tricky when the "date" column does not appear in the SELECT clause but does appear in the WHERE condition, as the temp table schema will not have the "date" column.
So I thought of executing the bulk query in chunks and assign an ID to that chunk. This way, I can delete a chunk and add a chunk and other data remains unaffected.
Is there a way to achieve the same in Postgres or Greenplum, such as some built-in functionality? I went through the whole documentation but could not find any.
Also, if not, is there any better solution to this problem?
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). The aggregates you need can be stored per day, so you cut down to one record per day of closed data, plus the non-closed data. Restricting the aggregates to data which cannot change is what is required to avoid the normal insert/update anomalies that normalization prevents.
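A minimal sketch of that idea, keeping one aggregate per closed day and refreshing only what changed between runs. The function name `refresh_window` is an assumption, and `compute_day` is a hypothetical caller-supplied function that runs the heavy aggregate for a single day:

```python
from datetime import date, timedelta

def refresh_window(daily, today, window_days, compute_day):
    """Slide a per-day aggregate cache forward to cover [today - window_days, today).

    daily maps date -> aggregate for closed days and is updated in place.
    Returns the list of days that had to be computed on this run.
    """
    first = today - timedelta(days=window_days)
    # Drop days that have fallen out of the window.
    for day in [d for d in daily if d < first]:
        del daily[day]
    # Compute only the days that are not cached yet.
    computed = []
    day = first
    while day < today:
        if day not in daily:
            daily[day] = compute_day(day)  # hypothetical per-day aggregate
            computed.append(day)
        day += timedelta(days=1)
    return computed
```

On a normal daily run this computes exactly one new day and deletes one old day. Because the day is the key of the stored aggregate, the delete works even when the date column is not part of the original SELECT list.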