Join columns to a table in parallel - kdb

Currently I use a function to run an aj join for a single column from a large table onto a smaller table whose time column is shifted t milliseconds ahead, joining on a sym column as well as time. I then compute and programmatically name a new column based on this joined column, before deleting the original joined column from the small table. This returns the small table with a new column based on values joined from the larger table t milliseconds ahead.
I then use Over (/) to repeat this over a list of different delays t, passing the table through as the accumulator so one new column is added for each delay in the list.
My issue is the query, join and processing are slow on a large table. I have many cores so I would like to parallelise this operation to take advantage of all available cores, as well as optimising the steps taken to add the new columns. The large table is partitioned on disk by date and sym.
[Edit:] Here is an example of what I have at the moment.
smallT: ([] sym: (20#`AAPL),(20#`MSFT); time: (asc 00:00:00+20?til 100), (asc 00:00:00+20?til 100));
bigT: ([] sym: (100#`AAPL),(100#`MSFT); time: (asc 00:00:00+til 100), (asc 00:00:00+til 100); price: (til 100),(til 100));
delays: 00:00:00 + (7 * til 5);
foo: {[bigTab; smallTab2; delays]
smallTab2: aj[`sym`time; `sym`time xasc update time:time+delays from smallTab2; `sym`time xasc select sym, time, future_price:price from bigTab];
smallTab2: ![smallTab2; (); 0b; enlist[`$"colnametime_",string `int$delays] ! enlist (%;`future_price;100)];
delete future_price from smallTab2
}[bigT];
smallT:foo/[select from smallT; delays];
smallT
I am relatively new to q and kdb so verbose explanations of how and why a solution works with working code on a toy example would be very greatly appreciated.

Your function repeatedly sorts the tables inside the loop, which will slow you down.
Also, as noted in the documentation, attributes should be applied to the tables, which will greatly improve performance.
https://code.kx.com/q/ref/aj/#performance
Rather than looping through the delay offsets, you can instead create the full list of delayed rows and run aj only once.
https://community.kx.com/t5/New-kdb-q-users-question-forum/How-do-you-start-thinking-in-vectors/td-p/12722
smallT: ([] sym: (20#`AAPL),(20#`MSFT); time: (asc 00:00:00+20?til 100), (asc 00:00:00+20?til 100));
bigT: ([] sym: (100#`AAPL),(100#`MSFT); time: (asc 00:00:00+til 100), (asc 00:00:00+til 100); price: (til 100),(til 100));
delays: 00:00:00 + (7 * til 5);
bigT:update `p#sym from `sym`time xasc bigT
res:raze {[x;y] update row:i, delay:`$"delay_",string`int$y, time:time+y from x}[smallT] each delays
res:update `g#sym from `sym`time xasc res
res:aj[ `sym`time;res; select sym, time, price from bigT]
delete row from `sym`time xcols 0!(exec ({`$"delay_",string`int$x} each delays)#(delay!price) by row:row from res) lj 1!select row,sym,time from res
sym time delay_0 delay_7 delay_14 delay_21 delay_28
--------------------------------------------------------
AAPL 00:00:17 17 24 31 38 45
AAPL 00:00:18 18 25 32 39 46
AAPL 00:00:18 18 25 32 39 46
AAPL 00:00:28 28 35 42 49 56
AAPL 00:00:33 33 40 47 54 61
...
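Outside q, the same "replicate with shifted times, join once, pivot" plan can be sketched with pandas' merge_asof, a rough analogue of aj. This is only a cross-check of the logic, not the author's code; all names and toy values are illustrative.

```python
import pandas as pd

# Toy stand-ins for smallT and bigT; in bigT the price simply equals the time.
small = pd.DataFrame({"sym": ["AAPL"] * 3, "time": [17, 18, 28]})
big = pd.DataFrame({"sym": ["AAPL"] * 100,
                    "time": list(range(100)),
                    "price": list(range(100))})
delays = [0, 7, 14]

# Replicate each small row once per delay, shifting time forward -
# the vectorised equivalent of looping one aj per delay.
rows = []
for d in delays:
    s = small.copy()
    s["row"] = s.index              # remember the original row
    s["delay"] = "delay_" + str(d)  # future column name
    s["time"] = s["time"] + d
    rows.append(s)
stacked = pd.concat(rows).sort_values("time")  # merge_asof needs sorted keys

# One as-of join against the big table (aj's counterpart).
joined = pd.merge_asof(stacked, big.sort_values("time"),
                       on="time", by="sym", direction="backward")

# Pivot the long result back to one column per delay.
wide = joined.pivot(index="row", columns="delay", values="price")
print(wide)
```

As in the q answer above, the cost is one sort and one as-of join regardless of how many delays there are.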

PostGIS: Query z and m dimensions (linestringzm)

Question
I have a system with multiple linestringzm, where the data is structured the following way: [x, y, speed:int, time:int]. The data is structured this way to be able to use ST_SimplifyVW on the x, y and z dimensions, but I still want to be able to query the linestring based on the m dimension e.g. get all linestrings between a time interval.
Is this possible with PostGIS or am I structuring the data incorrectly for my use case?
Example
z = speed e.g. km/h
m = Unix epoch time
CREATE TABLE t (id int NOT NULL, geom geometry(LineStringZM,4326), CONSTRAINT t_pkey PRIMARY KEY (id));
INSERT INTO t VALUES (1, 'SRID=4326;LINESTRING ZM(30 10 5 1620980688, 30 15 10 1618388688, 30 20 15 1615710288, 30 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (2, 'SRID=4326;LINESTRING ZM(50 10 5 1620980688, 50 15 10 1618388688, 50 20 15 1615710288, 50 25 20 1620980688)'::geometry);
INSERT INTO t VALUES (3, 'SRID=4326;LINESTRING ZM(20 10 5 1620980688, 20 15 10 1618388688, 20 20 15 1615710288, 20 25 20 1620980688)'::geometry);
Use case A: Simplify the geometry based on x, y and z
This can be accomplished by e.g. ST_SimplifyVW which keep the m dimension after simplification.
Use case B: Query geometry based on the m dimension
I have a set of linestringzm which I want to query based on my time dimension (m). The result is either the full geometry, if all M values lie between e.g. 1618388000 and 1618388700, or the part of the geometry which satisfies the predicate. What is the most efficient way to query the data?
If you want to check every single point of your LineString you could ST_DumpPoints them and get the M dimension with ST_M. After that extract the subset as a LineString containing the overlapping M values and apply ST_MakeLine with a GROUP BY:
WITH j AS (
SELECT id,geom,(ST_DumpPoints(geom)).geom AS p
FROM t
)
SELECT id,ST_AsText(ST_MakeLine(p))
FROM j
WHERE ST_M(p) BETWEEN 1618388000 AND 1618388700
GROUP BY id;
Demo: db<>fiddle
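The point-level filtering that query performs can be mimicked in plain Python, with hand-rolled (x, y, z, m) tuples standing in for PostGIS geometries (purely illustrative, not the PostGIS API):

```python
# Each linestring is a list of (x, y, z, m) vertices, as in the LINESTRING ZM above.
lines = {
    1: [(30, 10, 5, 1620980688), (30, 15, 10, 1618388688),
        (30, 20, 15, 1615710288), (30, 25, 20, 1620980688)],
}

def clip_by_m(points, m_lo, m_hi):
    """Keep only vertices whose M value lies in [m_lo, m_hi] -
    the ST_DumpPoints + ST_M filter + ST_MakeLine pipeline in miniature."""
    return [p for p in points if m_lo <= p[3] <= m_hi]

sub = {lid: clip_by_m(pts, 1618388000, 1618388700) for lid, pts in lines.items()}
print(sub)   # only the vertex with m = 1618388688 survives for line 1
```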
Note: Depending on your table and LineString sizes this query may become pretty slow, as values are being parsed in query time and therefore aren't indexed. Imho a more elegant alternative would be ..
.. 1) to create a tstzrange column
ALTER TABLE t ADD COLUMN line_interval tstzrange;
.. 2) to properly index it
CREATE INDEX idx_t_line_interval ON t USING gist (line_interval);
.. and 3) to populate it with the time of geom's first and last points:
UPDATE t SET line_interval =
tstzrange(
to_timestamp(ST_M(ST_PointN(geom,1))),
to_timestamp(ST_M(ST_PointN(geom,ST_NPoints(geom)))));
After that you can speed things up by checking whether the indexed column overlaps with a given interval. This will significantly improve query time:
SELECT * FROM t
WHERE line_interval && tstzrange(
to_timestamp(1618138148),
to_timestamp(1618388700));
Demo: db<>fiddle
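The && operator succeeds when the two ranges overlap; the underlying test is the classic a_start <= b_end AND b_start <= a_end. A quick sketch in Python (closed intervals for simplicity; tstzrange is half-open by default):

```python
def overlaps(a, b):
    """Closed-interval overlap test - the logic behind PostgreSQL's && operator."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi

# First/last M values of a geometry vs. the queried interval.
line_interval = (1615710288, 1620980688)
query = (1618138148, 1618388700)
print(overlaps(line_interval, query))   # True: the row would be returned
```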
Further reading:
ST_M
ST_PointN
ST_NPoints
PostgreSQL Built-in Range Types

Postgres Function: how to return the first full set of data that occurs after specified date/time

I have a requirement to extract rows of data, but only if all said rows make a full set. We have a sequence table that is updated every minute, with data for 80 bins. We need to know the status of bins 1 thru 80 every minute as part of our production process.
I am generating a new report (a Postgres function) that needs to take a snapshot at roughly 00:01:00 AM (i.e. 1 minute past midnight). Initially I thought this would be an easy task: just grab the first 80 rows of data that occur at/after this time. However I see that, depending on network activity and industrial computer priorities, the table is not religiously updated at exactly 00:01:00 AM, or any minute for that matter. Updates can occur milliseconds or even seconds later, and take 500 ms to 800 ms to update the database. Sometimes a given minute can be missing altogether (production processes take precedence over data capture, but the sequence data is not super critical anyway).
My thinking is it would be more reliable to look for the first complete set of data anytime from 00:01:00AM onwards. So effectively, I have a table that looks a bit like this:
Apologies, I know you prefer for images of this manner to not be pasted in this manner, but I could not figure out how to create a textual table like this here (carriage return or Enter button is ignored!)
Basically, the above table is typical, but 1st minute is not guaranteed, and for that matter, I would not be 100% confident that all 80 bins are logged for a given minute. Hence my question: how to return the first complete set of data, where all 80 bins (rows) have been captured for a particular minute?
Thinking about it, I could do some sort of row count in the function, ensuring there are 80 rows for a given minute, but this seems less intuitive. I would like to know for sure that for each row of a given minute, bin 1 is represented, bin 2, bin 3...
Ultimately a call to this function will supply a min/max date/time and that period of time will be checked for the first available minute with a full set of bins data.
I am reasonably sure this will involve a window function, as all rows have to be assessed prior to data extraction. I've used windows functions a few times now, but still a green newbie compared to others here, so help is appreciated.
My final code, thanks to help from @klin:
StartTime = DATE_TRUNC('minute', tme1);
EndTime = DATE_TRUNC('day', tme1) + '23 hours'::interval;
SELECT "BinSequence".*
FROM "BinSequence"
JOIN(
SELECT "binMinute" AS binminute, count("binMinute")
FROM "BinSequence"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
GROUP BY 1
HAVING COUNT (DISTINCT "binBinNo") = 80 -- verifies that each and every bin is represented in returned data
) theseTuplesOnly
ON theseTuplesOnly.binminute = "binMinute"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
GROUP BY 1
ORDER BY 1
LIMIT 80
Use the aggregate function count(*) grouping data by minutes (date_trunc('minute', datestamp) gives full minutes from datestamp), e.g.:
create table bins(datestamp time, bin int, param text);
insert into bins values
('00:01:10', 1, 'a'),
('00:01:20', 2, 'b'),
('00:01:30', 3, 'c'),
('00:01:40', 4, 'd'),
('00:02:10', 3, 'e'),
('00:03:10', 2, 'f'),
('00:03:10', 3, 'g'),
('00:03:10', 4, 'h');
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
order by 1
minute | count
----------+-------
00:01:00 | 4
00:02:00 | 1
00:03:00 | 3
(3 rows)
If you are not sure that all bins are unique in consecutive minutes, use distinct (this will make the query slower):
select date_trunc('minute', datestamp) as minute, count(distinct bin)
...
You cannot select counts of aggregated minutes and all columns of the table in a single simple select. If you want to do that, you should join a derived table, use the operator IN, or use a window function. A join seems to be the simplest:
select b.*, count
from bins b
join (
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
having count(bin) = 4
) s
on date_trunc('minute', datestamp) = minute
order by 1;
datestamp | bin | param | count
-----------+-----+-------+-------
00:01:10 | 1 | a | 4
00:01:20 | 2 | b | 4
00:01:30 | 3 | c | 4
00:01:40 | 4 | d | 4
(4 rows)
Note also how having is used to filter results in the above query.
You can test the query here.
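The same "first minute that has a complete set of bins" logic can be sketched in Python with a toy 4-bin set (mirroring the COUNT(DISTINCT ...) approach above; names and data are illustrative):

```python
from collections import defaultdict

# (minute, bin) observations; minute 1 is missing bin 4, minute 3 is complete.
rows = [(1, 1), (1, 2), (1, 3),
        (2, 2),
        (3, 1), (3, 2), (3, 3), (3, 4)]
REQUIRED_BINS = 4   # would be 80 in the real table

bins_per_minute = defaultdict(set)
for minute, b in rows:
    bins_per_minute[minute].add(b)   # a set gives distinct bins, like COUNT(DISTINCT ...)

# Earliest minute where every required bin is present.
first_full = min(m for m, s in bins_per_minute.items() if len(s) == REQUIRED_BINS)
print(first_full)  # 3
```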

Scala RDD Operation

I am new to Scala.
I have a CSV file stored in HDFS. I am reading that file in Scala using
val salesdata = sc.textFile("hdfs://localhost:9000/home/jayshree/sales.csv")
Here is a small sample of data "sales".
C_ID T_ID ITEM_ID ITEM_Price
5 199 1 500
33 235 1 500
20 249 3 749
35 36 4 757
19 201 4 757
17 94 5 763
39 146 5 763
42 162 5 763
49 41 6 824
3 70 6 824
24 161 6 824
48 216 6 824
I have to perform the following operation on it.
1. Apply a discount to each item, on the column d (ITEM_Price), say a 30% discount. The formula is d = d - (30% of d).
2. Find the customer-wise minimum and maximum item value after applying the 30% discount to each item.
I tried to multiply the values of the ITEM_Price column by 30. The problem is that the value of d is taken as a string: when I multiply it by a number, the result repeats the string that many times, e.g. "500" * 3 = "500500500".
I can convert it into a DataFrame and do it there, but I just want to know whether we can do these operations on an RDD without converting it into a DataFrame.
Discount
case class Transaction(cId: Int, tId: Int, itemId: Int, itemPrice: Int)
Starting from salesdata: RDD[String], map the RDD; inside the map, split each line by your separator and convert the resulting Array to the Transaction case class, calling arr(i).toInt to cast the fields. In this step your target is to get an RDD[Transaction].
Map the RDD again and copy your transaction applying the discount ( t => t.copy(itemPrice=0.7*t.itemPrice))
You will have a new RDD[Transaction]
Customer wise
Take the last RDD and apply keyBy(_.cId) to get an RDD[(Int, Transaction)] where the key is the client.
Reduce by key, adding the prices for each item. Goal: an RDD[(Int, Int)] where you get the total for each client.
Find your target clients!
Since you want more of a guide, let's look at this outside of Spark for a second and think about things as typical Scala collections.
Your data would look like this:
val data = Array(
(5, 199, 1, 500),
(33, 235, 1, 500),
...
)
I think you will have no trouble mapping your salesdata RDD of strings to an RDD of Array or Tuple4 using a split or regular expression or something.
Let's go with a tuple. Then you can do this:
data.map {
case (cId, tId, item, price) => (cId, tId, item, price * .7)
}
That maps the original RDD of tuples to another RDD of tuples where the last values, the prices, are reduced by 30%. So the result is a Tuple4[Int, Int, Int, Double].
To be honest, I don't know what you mean by customer-wise min and max, but maybe it is something like this:
data.map {
case (cId, tId, item, price) => (cId, tId, item, price * .7)
}.groupBy(_._1)
.mapValues { tuples =>
val discountedPrices = tuples.map(_._4)
(discountedPrices.min, discountedPrices.max)
}
First, I do a groupBy, which produces a Map from cId (the first value in the tuple, which explains the ._1) to a collection of full tuples--so a Map of cId to a collection of rows pertaining to that cId. In Spark, this would produce a PairRDD.
Map and PairRDD both have a mapValues function, which allows me to preserve the keys (the cIds) while transforming each collection of tuples. In this case, I simply map the collection to a collection of discounted prices by getting the 4th item in each tuple, the discounted prices. Then I call min and max on that collection and return a tuple of those values.
So the result is a Map of customer ID to a tuple of the min and max of the discounted prices. The beauty of the RDD API is that it follows the conventional Scala collection API so closely, so it is basically the same thing.
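The groupBy / mapValues pipeline above translates almost verbatim to plain Python, with ordinary collections standing in for RDDs (a sketch only; the tuple values are taken from the sample data, everything else is illustrative):

```python
from collections import defaultdict

# (cId, tId, itemId, price) rows, mirroring the tuple example above.
data = [(5, 199, 1, 500), (33, 235, 1, 500), (5, 201, 4, 757)]

# Apply the 30% discount (the first map step).
discounted = [(c, t, i, p * 0.7) for c, t, i, p in data]

# groupBy(_._1) followed by mapValues: collect prices per customer,
# then reduce each group to its (min, max).
by_customer = defaultdict(list)
for c, _, _, p in discounted:
    by_customer[c].append(p)
min_max = {c: (min(ps), max(ps)) for c, ps in by_customer.items()}
print(min_max)
```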

how to use multiple arguments in kdb where query?

I want to select max elements from a table within next 5, 10, 30 minutes etc.
I suspect this is not possible with multiple elements in the where clause.
Using both normal < and </: is failing. My code/ query below:
select max price from dat where time</: (09:05:00; 09:10:00; 09:30:00)
Any ideas what I am doing wrong here?
The idea is to get the max price for each row within next 5, 10, 30... minutes of the time in that row and not just 3 max prices in the entire table.
select max price from dat where time</: time+\:(5 10 30)
This won't work but should give the general idea.
To further clarify, I want to calculate the max price in 5, 10, 30 minute intervals from time[i] of each row of the table. So, for each table row, the max price within x+5, x+10, x+30 minutes, where x is the time entry in that row.
You could try something like this:
select c1:max price[where time <09:05:00],c2:max price[where time <09:10:00],c3:max price from dat where time< 09:30:00
You can paramatize this query however you like. So if you have a list of times, l:09:05:00 09:10:00 09:15:00 09:20:00 ... You can create a function using a functional form of the query above to work for different lengths of l, something like:
q)f:{[t]?[dat;enlist (<;`time;max t);0b;(`$"c",/:string til count t)!flip (max;flip (`price;flip (where;((<),/:`time,/:t))))]}
q)f l
You can extend f to take different functions instead of max, work for different tables etc.
This works but takes a lot of time: for 20k records, ~20 seconds, which is too much! Any way to make it faster?
dat: update tmlst: time+\:mtf*60 from dat;
dat[`pxs]: {[x;y] {[x; ts] raze flip raze {[x;y] select min price from x where time<y}[x] each ts }[x; y`tmlst]} [dat] each dat;
this constructs a step dictionary to map the times to your buckets:
q)-1_select max price by(`s#{((neg w),x)!x,w:(type x)$0W}09:05:00 09:10:00 09:30:00)time from dat
you may also be able to abuse wj:
q)wj[{(prev x;x)}09:05:00 09:10:00 09:30:00;`time;([]time:09:05:00 09:10:00 09:30:00);(delete sym from dat;(max;`price))]
if all your buckets are the same size, it's much easier:
q)select max price by 300 xbar time from dat where time<09:30:00 / 300-second (5-min) buckets
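For the original per-row requirement (max price within time+5, time+10, ... of each row), the brute-force logic looks like this in Python. This is only a sketch of the semantics; on real data sizes the wj or step-dictionary approaches above will be far faster.

```python
# (time_in_seconds, price) rows, sorted by time; values are illustrative.
rows = [(0, 10.0), (60, 12.0), (120, 11.0), (300, 15.0), (600, 9.0)]
windows = [300, 600]   # 5 and 10 minutes, in seconds

def forward_max(rows, horizon):
    """For each row, the max price over rows with row_time <= t < row_time + horizon."""
    return [max(p2 for t2, p2 in rows if t <= t2 < t + horizon)
            for t, _ in rows]

for w in windows:
    print(w, forward_max(rows, w))
```

This is O(n^2) per window, which is exactly why the question's each-based q version was slow; wj does the same forward-window aggregation in a single pass.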

Divide records into groups - quick solution

I need to divide, with an UPDATE command, rows (selected from a subselect) of a PostgreSQL table into groups. These groups will be identified by an integer value in one of the columns, and should all be the same size. The source table contains billions of records.
For example I need to divide 213 selected rows into groups, every group should contains 50 records. The result will be:
1 - 50. row => 1
51 - 100. row => 2
101 - 150. row => 3
151 - 200. row => 4
201 - 213. row => 5
There is no problem doing it with some loop (or with PostgreSQL window functions), but I need to do it very efficiently and quickly. I can't use the id sequence, because there can be gaps in these ids.
I had an idea to use a random integer number generator and set it as the default value for a row. But this is not usable when I need to adjust the group size.
The query below should display 213 rows with a group-number from 0-4. Just add 1 if you want 1-5
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
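The arithmetic behind that query is plain integer division of a zero-based row number; in Python:

```python
def group_of(row_number, group_size=50):
    """1-based row number -> 1-based group id, 50 rows per group
    (the (row_number() - 1) / 50 + 1 of the SQL above)."""
    return (row_number - 1) // group_size + 1

print(group_of(1), group_of(50), group_of(51), group_of(213))
# 1 1 2 5
```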
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than @Richard's row_number version, but the difference may not be relevant depending on the specifics.