How to use multiple arguments in a kdb where query?

I want to select the max elements from a table within the next 5, 10, 30 minutes, etc.
I suspect this is not possible with multiple elements in the where clause.
Using both the normal < and </: fails. My query is below:
`select max price from dat where time</: (09:05:00; 09:10:00; 09:30:00)`
Any ideas what I am doing wrong here?
The idea is to get the max price for each row within the next 5, 10, 30... minutes of the time in that row, and not just 3 max prices for the entire table.
select max price from dat where time</: time+\:(5 10 30)
This won't work but should give the general idea.
To further clarify, I want to calculate the max price in 5, 10, 30 minute intervals from time[i] of each row of the table. So for each table row, I want the max price within x+5, x+10, x+30 minutes, where x is the time entry in that row.
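To make the requirement concrete, here is a minimal sketch in plain Python (not q) of the per-row forward-looking maximum described above. The function name `forward_window_max` is hypothetical, and the sketch assumes times are given in seconds and sorted ascending, that the current row counts as part of its own window, and that the window is closed on the right; the question does not pin down either boundary.

```python
from bisect import bisect_right

def forward_window_max(times, prices, windows_min=(5, 10, 30)):
    """For each row i, return the max price over rows i..j where
    times[j] <= times[i] + w*60, for every window w (in minutes).

    Assumes `times` is sorted ascending and expressed in seconds.
    """
    result = []
    for i, t in enumerate(times):
        row = []
        for w in windows_min:
            # Index just past the last row whose time is inside the window.
            hi = bisect_right(times, t + w * 60)
            row.append(max(prices[i:hi]))
        result.append(row)
    return result
```

Each row produces one value per window, which is the shape the question asks for (one 5, 10 and 30 minute maximum per table row) rather than three maxima for the whole table.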

You could try something like this:
select c1:max price[where time <09:05:00],c2:max price[where time <09:10:00],c3:max price from dat where time< 09:30:00
You can parameterize this query however you like. So if you have a list of times, l:09:05:00 09:10:00 09:15:00 09:20:00 ..., you can create a function using the functional form of the query above that works for different lengths of l, something like:
q)f:{[t]?[dat;enlist (<;`time;max t);0b;(`$"c",/:string til count t)!flip (max;flip (`price;flip (where;((<),/:`time,/:t))))]}
q)f l
You can extend f to take different functions instead of max, work for different tables etc.

This works but takes a lot of time: ~20 seconds for 20k records, which is too much. Is there any way to make it faster?
dat: update tmlst: time+\:mtf*60 from dat;
dat[`pxs]: {[x;y] {[x; ts] raze flip raze {[x;y] select min price from x where time<y}[x] each ts }[x; y`tmlst]} [dat] each dat;

this constructs a step dictionary to map the times to your buckets:
q)-1_select max price by(`s#{((neg w),x)!x,w:(type x)$0W}09:05:00 09:10:00 09:30:00)time from dat
you may also be able to abuse wj:
q)wj[{(prev x;x)}09:05:00 09:10:00 09:30:00;`time;([]time:09:05:00 09:10:00 09:30:00);(delete sym from dat;(max;`price))]
if all your buckets are the same size, it's much easier:
q)select max price by 300 xbar time from dat where time<09:30:00 / 300-second (5-min) buckets
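As an illustration of what the equal-width bucketing above computes, here is a small plain-Python sketch (not q). It assumes times are integer seconds since midnight; `max_by_bucket` and its parameters are hypothetical names, not part of kdb.

```python
def max_by_bucket(times, prices, width=300, cutoff=None):
    """Group rows into fixed-width time buckets and keep the max price per bucket.

    Each time is floored to the start of its bucket (a multiple of `width`).
    Rows at or after `cutoff` are skipped when a cutoff is given.
    """
    buckets = {}
    for t, p in zip(times, prices):
        if cutoff is not None and t >= cutoff:
            continue
        start = (t // width) * width  # floor to the bucket boundary
        buckets[start] = max(p, buckets.get(start, p))
    return dict(sorted(buckets.items()))
```

Note that this yields one maximum per bucket for the whole table, which is different from a separate forward-looking window per row.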

Optimize KDB query time to get rolling average price from each contributor

Each time a contributor gives an updated price I want to use this quote along with the latest prices of other quotes to calculate the total average at that moment.
t:`time xasc flip (`userID`time`price)!(`quote1`quote2`quote3`quote3`quote3`quote3`quote4`quote2`quote4`quote3`quote2`quote3`quote1`quote3`quote4`quote1`quote4`quote2`quote2`quote4;(21:11:37 03:13:29 15:35:39 09:59:13 04:34:15 13:09:01 21:21:55 16:54:39 04:03:04 18:22:39 17:05:44 05:08:40 07:35:50 15:46:15 17:32:29 19:42:47 03:28:48 04:20:03 14:16:55 09:02:12);86.4 84.4 54.26 7.76 63.75 97.61 53.97 71.63 38.86 52.23 87.25 65.69 96.25 37.15 17.45 58.97 95.51 61.59 70.25 35.5)
Desired output below
delete userIDPriceList,userIDComps from t,'raze {[idx;tab] select avgPrice:avg price, userIDPriceList:price,userIDComps:userID from select last price by userID from t where i <= idx}[;t] each til count t
userIDPriceList,userIDComps columns are not required in final output
Performance is slow and looking for better way to calculate.
q) \t do[200000;delete userIDPriceList,userIDComps from t,'raze {[idx;tab] select avgPrice:avg price, userIDPriceList:price,userIDComps:userID from select last price by userID from t where i <= idx}[;t] each til count t]
10152j
Thanks in advance
Based on your clarified requirements, another approach is to accumulate using scan:
update avgPrice:avg each{x,(1#y)!1#z}\[();userID;price] from t
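The accumulation idea can be sketched in plain Python (not q) as follows: keep a dictionary of the latest price per user, update it row by row, and average its values after each update. The function name `running_avg_of_latest` is hypothetical, and the rows are assumed to be already sorted by time.

```python
def running_avg_of_latest(user_ids, prices):
    """After each row, average the most recent price of every user seen so far."""
    latest = {}   # userID -> latest price seen up to the current row
    averages = []
    for user, price in zip(user_ids, prices):
        latest[user] = price
        averages.append(sum(latest.values()) / len(latest))
    return averages
```

This makes a single pass over the table, instead of re-running a grouped query for every row index as the original query does.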
Igor's solution is faster if the data is static (i.e. you can prep the table with the attribute once).
Below code gives average of all previous prices for given userID including current row:
ungroup 0!select time, price, avgPrice: avgs price by userID from t
Just ensure that t is appropriately sorted by time before getting averages.
According to your comment to one of the answers, you're "trying to take the average prices of each userID as of the time of the record while ignoring any future records."
This query will do exactly that:
select userID,time,price,avgPrice:(avgs;price)fby userID from t
A query of yours (delete userIDPriceList ...) results in something different, as @Anton Dovzhenko pointed out in his comment on your original question.
Update
After reading your comment I think I understood your requirement. Probably you could do this.
prices:exec `s#time!price by userID from t;
update avgPrice:avg each flip prices[;time] from t

How can I count elements satisfying a condition in a group, with PostgreSQL

with this query:
SELECT date_trunc('minute', ts) ts, instrument
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
I am grouping rows by minutes but I would like to generate a boolean value that tells me if, in the group, there is at least one row with the timestamp where the seconds are < 10 and at least one row with the timestamp where the seconds are > 50.
In short, something like:
lessThan10 = false
moreThan50 = false
for each row in the one minute group:
    if row.ts.seconds < 10 then lessThan10 = true
    if row.ts.seconds > 50 then moreThan50 = true
return lessThan10 && moreThan50
What I am trying to achieve is to find out if all the records I aggregate cover the beginning and the end of the minute; it's ok if there are holes here and there, but it's possible the data we capture stops and restarts at second 40 for example and, in that case, I'd like to be able to discard the whole minute.
As the data rate varies quite a lot, I can't check for a minimum number of rows. There may be a better solution to achieve this, so I'm open to it as well.
Use EXTRACT() to get the seconds of the min and max values of ts:
SELECT date_trunc('minute', ts) ts, instrument,
EXTRACT(SECOND FROM MIN(ts)) < 10 lessThan10,
EXTRACT(SECOND FROM MAX(ts)) > 50 moreThan50
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
See the demo.
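The reason MIN and MAX are enough: within one minute, some row has seconds < 10 exactly when the earliest row does, and some row has seconds > 50 exactly when the latest row does. A minimal plain-Python sketch of the same check, assuming rows are (timestamp, instrument) pairs; the function name `minute_coverage` is hypothetical:

```python
from datetime import datetime

def minute_coverage(rows):
    """Map (minute, instrument) -> True when the group has a row with
    seconds < 10 and a row with seconds > 50."""
    bounds = {}  # (minute, instrument) -> (min seconds, max seconds)
    for ts, instrument in rows:
        key = (ts.replace(second=0, microsecond=0), instrument)
        sec = ts.second + ts.microsecond / 1_000_000
        lo, hi = bounds.get(key, (sec, sec))
        bounds[key] = (min(lo, sec), max(hi, sec))
    return {key: (lo < 10 and hi > 50) for key, (lo, hi) in bounds.items()}
```

Combining the two flags with a logical AND gives the single boolean that the pseudocode in the question returns.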

PostGIS: Query z and m dimensions (linestringzm)

Question
I have a system with multiple linestringzm, where the data is structured the following way: [x, y, speed:int, time:int]. The data is structured this way to be able to use ST_SimplifyVW on the x, y and z dimensions, but I still want to be able to query the linestring based on the m dimension e.g. get all linestrings between a time interval.
Is this possible with PostGIS or am I structuring the data incorrectly for my use case?
Example
z = speed e.g. km/h
m = Unix epoch time
CREATE TABLE t (id int NOT NULL, geom geometry(LineStringZM,4326), CONSTRAINT t_pkey PRIMARY KEY (id));
INSERT INTO t VALUES (1, 'SRID=4326;LINESTRING ZM(30 10 5 1620980688, 30 15 10 1618388688, 30 20 15 1615710288, 30 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (2, 'SRID=4326;LINESTRING ZM(50 10 5 1620980688, 50 15 10 1618388688, 50 20 15 1615710288, 50 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (3, 'SRID=4326;LINESTRING ZM(20 10 5 1620980688, 20 15 10 1618388688, 20 20 15 1615710288, 20 25 20 1620980688)'::geometry);
Use case A: Simplify the geometry based on x, y and z
This can be accomplished with e.g. ST_SimplifyVW, which keeps the m dimension after simplification.
Use case B: Query geometry based on the m dimension
I have a set of linestringzm which I want to query based on my time dimension (m). The result is either the full geometry, if all m values are between e.g. 1618388000 and 1618388700, or the part of the geometry which satisfies the predicate. What is the most efficient way to query the data?
If you want to check every single point of your LineString you could ST_DumpPoints them and get the M dimension with ST_M. After that extract the subset as a LineString containing the overlapping M values and apply ST_MakeLine with a GROUP BY:
WITH j AS (
SELECT id,geom,(ST_DumpPoints(geom)).geom AS p
FROM t
)
SELECT id,ST_AsText(ST_MakeLine(p))
FROM j
WHERE ST_M(p) BETWEEN 1618388000 AND 1618388700
GROUP BY id;
Demo: db<>fiddle
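Conceptually, the dump-and-filter query keeps, for each line, only the vertices whose M value lies in the requested range (BETWEEN is inclusive on both ends). A plain-Python sketch of that filtering step, assuming each line is a list of (x, y, z, m) tuples; `clip_by_m` is a hypothetical helper, not a PostGIS function:

```python
def clip_by_m(points, m_min, m_max):
    """Keep the vertices whose M value is within [m_min, m_max], in original order."""
    return [p for p in points if m_min <= p[3] <= m_max]

# Vertices of the line with id 1 from the example table.
line_1 = [
    (30, 10, 5, 1620980688),
    (30, 15, 10, 1618388688),
    (30, 20, 15, 1615710288),
    (30, 25, 20, 1620980688),
]
clipped = clip_by_m(line_1, 1618388000, 1618388700)
```

With the example data only one vertex of each line falls inside this interval, so the rebuilt geometry has a single point.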
Note: Depending on your table and LineString sizes this query may become pretty slow, as values are being parsed in query time and therefore aren't indexed. Imho a more elegant alternative would be ..
.. 1) to create a tstzrange column
ALTER TABLE t ADD COLUMN line_interval tstzrange;
.. 2) to properly index it
CREATE INDEX idx_t_line_interval ON t USING gist (line_interval);
.. and 3) to populate it with the time of geom's first and last points:
UPDATE t SET line_interval =
tstzrange(
to_timestamp(ST_M(ST_PointN(geom,1))),
to_timestamp(ST_M(ST_PointN(geom,ST_NPoints(geom)))));
After that you can speed things up by checking whether the indexed column overlaps with a given interval. This will significantly improve query time:
SELECT * FROM t
WHERE line_interval && tstzrange(
to_timestamp(1618138148),
to_timestamp(1618388700));
Demo: db<>fiddle
Further reading:
ST_M
ST_PointN
ST_NPoints
PostgreSQL Built-in Range Types

Postgres Function: how to return the first full set of data that occurs after specified date/time

I have a requirement to extract rows of data, but only if all said rows make a full set. We have a sequence table that is updated every minute, with data for 80 bins. We need to know the status of bins 1 thru 80 every minute as part of our production process.
I am generating a new report (postgres function) that needs to take a snapshot at roughly 00:01:00 AM (i.e. 1 minute past midnight). Initially I thought this would be an easy task: just grab the first 80 rows of data that occur at or after this time. However, I see that, depending on network activity and industrial computer priorities, the table is not religiously updated at exactly 00:01:00 AM, or at any exact minute for that matter. Updates can occur milliseconds or even seconds later, and take 500ms to 800ms to update the database. Sometimes a given minute can be missing altogether (production processes take precedence over data capture, but the sequence data is not super critical anyway).
My thinking is it would be more reliable to look for the first complete set of data anytime from 00:01:00AM onwards. So effectively, I have a table that looks a bit like this:
Apologies, I know you prefer for images of this manner to not be pasted in this manner, but I could not figure out how to create a textual table like this here (carriage return or Enter button is ignored!)
Basically, the above table is typical, but 1st minute is not guaranteed, and for that matter, I would not be 100% confident that all 80 bins are logged for a given minute. Hence my question: how to return the first complete set of data, where all 80 bins (rows) have been captured for a particular minute?
Thinking about it, I could do some sort of rowcount in the function, ensuring there are 80 rows for a given minute, but this seems less intuitive. I would like to know for sure that for each row of a given minute, bin 1 is represented, bin 2, bin 3...
Ultimately a call to this function will supply a min/max date/time and that period of time will be checked for the first available minute with a full set of bins data.
I am reasonably sure this will involve a window function, as all rows have to be assessed prior to data extraction. I've used window functions a few times now, but I'm still a green newbie compared to others here, so help is appreciated.
My final code, thanks to help from @klin:
StartTime = DATE_TRUNC('minute', tme1);
EndTime = DATE_TRUNC('day', tme1) + '23 hours'::interval;
SELECT "BinSequence".*
FROM "BinSequence"
JOIN(
SELECT "binMinute" AS binminute, count("binMinute")
FROM "BinSequence"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
GROUP BY 1
HAVING COUNT (DISTINCT "binBinNo") = 80 -- verifies that each and every bin is represented in returned data
) theseTuplesOnly
ON theseTuplesOnly.binminute = "binMinute"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
GROUP BY 1
ORDER BY 1
LIMIT 80
Use the aggregate function count(*) grouping data by minutes (date_trunc('minute', datestamp) gives full minutes from datestamp), e.g.:
create table bins(datestamp time, bin int, param text);
insert into bins values
('00:01:10', 1, 'a'),
('00:01:20', 2, 'b'),
('00:01:30', 3, 'c'),
('00:01:40', 4, 'd'),
('00:02:10', 3, 'e'),
('00:03:10', 2, 'f'),
('00:03:10', 3, 'g'),
('00:03:10', 4, 'h');
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
order by 1
minute | count
----------+-------
00:01:00 | 4
00:02:00 | 1
00:03:00 | 3
(3 rows)
If you are not sure that all bins are unique in consecutive minutes, use distinct (this will make the query slower):
select date_trunc('minute', datestamp) as minute, count(distinct bin)
...
You cannot select counts in aggregated minutes and all columns of the table in a single simple select. If you want to do that, you should join a derived table, use the IN operator, or use a window function. A join seems to be the simplest:
select b.*, count
from bins b
join (
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
having count(bin) = 4
) s
on date_trunc('minute', datestamp) = minute
order by 1;
datestamp | bin | param | count
-----------+-----+-------+-------
00:01:10 | 1 | a | 4
00:01:20 | 2 | b | 4
00:01:30 | 3 | c | 4
00:01:40 | 4 | d | 4
(4 rows)
Note also the use of the HAVING clause to filter results in the above query.
You can test the query here.
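The "first complete minute" logic can also be sketched outside SQL. The following plain-Python version assumes rows are (minute, bin number) pairs and that bins are numbered 1 to n_bins; `first_complete_minute` is a hypothetical name. It checks for the presence of every bin number, which is stricter than a plain row count.

```python
def first_complete_minute(rows, n_bins=80):
    """Return the earliest minute in which every bin 1..n_bins appears, or None."""
    seen = {}  # minute -> set of bin numbers logged in that minute
    for minute, bin_no in rows:
        seen.setdefault(minute, set()).add(bin_no)
    expected = set(range(1, n_bins + 1))
    for minute in sorted(seen):
        if seen[minute] >= expected:  # superset test: all expected bins present
            return minute
    return None
```

Using a set per minute plays the same role as count(distinct bin) in the queries above: duplicate rows for one bin cannot make an incomplete minute look complete.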

Divide records into groups - quick solution

I need to divide rows (selected by a subselect) in a PostgreSQL table into groups with an UPDATE command. Each group will be identified by an integer value in one of the table's columns. The groups should all be the same size. The source table contains billions of records.
For example, I need to divide 213 selected rows into groups of 50 records each. The result will be:
1 - 50. row => 1
51 - 100. row => 2
101 - 150. row => 3
151 - 200. row => 4
201 - 213. row => 5
There is no problem to do it with some loop (or use PostgreSQL window functions), but I need to do it very efficiently and quickly. I can't use sequence in id because there should be gaps in these ids.
I have an idea to use random integer number generator and set it as default value for a row. But this is not useable when I need to adjust group size.
The query below should display 213 rows with a group-number from 0-4. Just add 1 if you want 1-5
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than @Richard's row_number version, but the difference may not be relevant depending on the specifics.
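Both answers rely on integer division of a zero-based row counter by the group size. A minimal plain-Python sketch of that arithmetic, using the hypothetical name `assign_groups` and adding 1 so that groups are numbered from 1 as in the question:

```python
def assign_groups(n_rows, size=50):
    """Return a 1-based group number for each of n_rows consecutive rows."""
    return [i // size + 1 for i in range(n_rows)]

groups = assign_groups(213, size=50)
```

For 213 rows this gives four full groups of 50 and a final group of 13, and changing `size` is all that is needed to adjust the group size.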