I'm trying to replay messages from a tickerplant log directly into a partitioned table on disk by appending them, since I don't have much primary memory compared to the size of the tplog.
The tplog looks like this:
q)9 2 sublist get `:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23
`upd `trade (,0D22:38:00.083960000;,`MSFT.O;,45.15104;,710)
`upd `quote (,0D22:38:01.082882000;,`VOD.L;,341.2765;,341.3056;,732;,481)
I'm using the following 'upsert' approach to append these tplog messages to the partitioned table, but it fails with a type error on the upsert:
quote:([]time:`timespan$();sym:`symbol$();bid:`float$();ask:`float$();bsize:`int$();asize:`int$());
trade:([]time:`timespan$();sym:`symbol$();price:`float$();size:`int$());
`:/Users/uts/db/2020.05.23/quote/ set .Q.en[`:/Users/uts/db;]quote;
`:/Users/uts/db/2020.05.23/trade/ set .Q.en[`:/Users/uts/db;]trade;
upd:{[t;d]
if[`trade~t;[show raze d;`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze d]];
};
-11!`:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23
Error:
'type
[1] upd:{[t;d]
if[`trade~t;[show raze d;`:/Users/utsav/db/2020.05.23/trade/ upsert .Q.en[`:/Users/utsav/db;]enlist (cols trade)!raze d]];
^
}
But if I manually append the same message to the partitioned table, it works fine:
`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze (enlist 0D22:39:00.083960000;enlist `MSFT.O;enlist 45.15104; enlist 710)
I'm not sure why the upsert fails inside the upd function when driven by -11!.
Please share details/links if there is a better way (most probably there is) to replay tplogs directly to disk without using much primary memory.
I'm not sure if it'll solve your exact problem, but a few suggestions:
Your code assumes that every tickerplant log record contains a single row. This may not be the case: many tickerplant logs record multiple rows in a single update, in which case your enlist (cols trade)!raze d wouldn't work (though I would expect a length error rather than a type error there). A more general alternative is to use:
$[0>type first d;enlist cols[trade]!d;flip cols[trade]!d]
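For illustration, here is a minimal sketch (using the trade schema from the question; the sample values and the helper name rows are made up) of how that conditional handles both a single-row update, where d is a simple list of atoms, and a batched update, where d is a list of column vectors:
/ single-row update: d is a list of atoms, so type of first d is negative
d1:(0D22:38:00.083960000;`MSFT.O;45.15104;710i);
enlist cols[trade]!d1                                  / one-row table
/ batched update: d is a list of column vectors
d2:(2#0D22:38:00.083960000;`MSFT.O`VOD.L;45.15 341.27;710 732i);
flip cols[trade]!d2                                    / two-row table
/ combined, as suggested above
rows:{[d] $[0>type first d;enlist cols[trade]!d;flip cols[trade]!d]};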
You should not try to write to disk for every single upd record from a tickerplant log - that is simply too many disk writes in a short space of time. It is inefficient and could lead to disk I/O constraints. It is better to insert in memory until the table reaches a certain size, then write that batch to disk and wipe the in-memory table. I would suggest something like:
write:{`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;value x];delete from x};
upd:{[t;d]
if[`trade~t;t insert d];
if[10000<count value t;write[t]];
};
Then your replay would look like:
-11!`:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23;
if[0<count trade;write[`trade]]; /need to write the leftovers
`sym`time xasc `:/Users/uts/db/2020.05.23/trade/;
@[`:/Users/uts/db/2020.05.23/trade/;`sym;`p#];
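If you also want to replay quote with the same batching pattern, a hedged generalisation of the above (reusing the db root, date, schemas and 10000-row threshold from the question and the answer; not tested against your exact log) might look like:
/ target splayed directories per table (paths assumed from the question)
paths:`trade`quote!(`:/Users/uts/db/2020.05.23/trade/;`:/Users/uts/db/2020.05.23/quote/);
write:{[t]
  paths[t] upsert .Q.en[`:/Users/uts/db;value t];   / enumerate and append the batch
  delete from t};                                    / wipe the in-memory buffer
upd:{[t;d]
  if[t in key paths;
    t insert $[0>type first d;enlist cols[value t]!d;flip cols[value t]!d];
    if[10000<count value t;write[t]]];
  };
After -11! completes you would still flush the leftovers for each table, then sort and apply the parted attribute per table as shown above.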
For a normal table, we can select one row using select[1] from t. How can I do this for an HDB table?
I tried select[1] from t where date=2021.02.25 but it gives an error.
The error is 'nyi - "not yet implemented: it probably makes sense, but it's not defined nor implemented, and needs more thinking about as the language evolves".
The select[n] syntax works only if the table is already loaded into memory.
The easiest way to get the first row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied to already-loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all of the data into memory just to select a couple of rows. If your query needs to apply constraints first before pulling a subset, this is a reasonable solution.
It will, however, pull N rows for each date partition selected, due to the way q queries work under the covers. So YMMV, and this might not be the best solution if it sits behind an API, for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sym=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from the 2021.02.25 partition and avoid loading every record into memory.
As for your first request (which is different from the above), select[1] from t, you can achieve that with:
.Q.ind[t;enlist 0]
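.Q.ind indexes a partitioned table by global row number, so the same idea extends to the first few rows or the last row (a small usage sketch, assuming t is the memory-mapped partitioned table):
.Q.ind[t;til 3]              / first three rows across all partitions
.Q.ind[t;enlist -1+count t]  / last row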
I am working in the Microsoft Azure Databricks environment using Spark SQL and PySpark.
I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which contains either NULL (if everything looks good for that record) or a non-null value (say, if a particular lookup mapping for a particular column is not found). Additionally, my process includes another folder called "mapping" which gets refreshed on a periodic basis - let's say nightly, to keep it simple - and is where mappings are found.
On a daily basis, there is a good chance that about 100-200 rows get errored out (the status column containing non-null values). From these files, on a daily basis (hence the partition by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records and waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also check whether a mapping is now available for the errored records and, if so, take those forward as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally I would first update the Delta table/lake with the correct mapping and set the status column to, say, "available_for_reprocessing"; my downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems very difficult to do with Delta.
I was looking at https://docs.databricks.com/delta/delta-update.html, and the update example there only covers a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for the last, say, 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is very inefficient, as we would be reading hundreds of millions of records and writing everything back just to process perhaps 1,000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at https://docs.databricks.com/delta/delta-update.html, the example provided is again for very simple updates. Without a unique key, it seems impossible to update specific records. Can someone please help?
I can use PySpark or Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records based on joins.
For your scenario, I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
I have a large table on a remote server with an unknown number (millions) of rows. I'd like to fetch the data in batches of 100,000 rows at a time, update my local table with each fetched batch, and repeat until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the function below on the client, along with the variable index set to 0 and normTable, which is an empty copy of the remote table...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
if[not idx=divy;
batch:select[(anty;100000)] from seltr;
`normTable upsert batch;
idx+::1;
divy:divy;
anty+:100000;
seltr:seltr;
batches[idx;divy;anty;seltr]];
idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all of the rows at once (and multiple times - once for each batch of 100,000 that it upserts into my local normTable).
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to create batchX variables so that a new one is generated for each batch, up to the count of divyUP?
I would suggest a few changes, since you are connecting to a remote server:
Do not run synchronous requests, as that slows down the server's own processing. Make asynchronous requests using callbacks instead.
Do not do a full table scan (a heavy comparison, especially for a regex match) on each call. Most of the data might be in the cache on the next call, but that is not guaranteed, and the scan will again impact the server's normal operations.
Do not make data requests in bursts. Either use a timer, or make the next data request only once the last batch has arrived.
The approach below is based on these suggestions. It avoids scanning the full table for columns other than the index column (which is lightweight) and makes the next request only when the last batch has arrived.
Create a batch processing function
This function runs on the server; it reads a small batch of rows from the table using indices and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*";
     neg[.z.w](`upd;d;$[ni<count t;ni+1;0]) }
It takes 2 arguments - the starting index and the batch size to work on.
It finishes by asynchronously calling the upd function on the local machine, passing 2 arguments:
the data from the current batch request;
the table index to start the next batch from (0 is returned once all rows are done, to stop further batch processing).
Create a callback function
The result from the batch processing function arrives in this function.
If the index is greater than 0, there is more data to process and the next batch should start from that index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create a main function to start the process
q)fetch:{[ind] h (batch;ind;size)}
Finally, open the connection, create the table variable and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Note that the above method assumes the server table is static. If it is being updated in real time, changes would be required depending on how the table is updated on the server.
Other optimisations may also be possible, depending on the attributes set on the remote table, which can improve performance.
If you're ok sending sync messages it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
And you can easily change it to control the number of batches (rather than the batch size) by instead using 10 0N# - that would do it in 10 batches.
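For example, the batch-count variant of the same one-liner (still a sketch, using the question's handle h and remote table t) would be:
/ pull the data in exactly 10 batches instead of fixed-size batches of 100000
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 10 0N#til h"count t"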
Rather than having individual variables, the cut function can split the result of the select into chunks of 100,000 rows. Indexing into each element gives a table.
batches:100000 cut select from t where sym like "A*"
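A hedged sketch of then pulling those server-side chunks to the client one at a time (the handle h and local table normTable are as in the question; batches is assumed to be defined on the server as above):
nb:h"count batches";                               / number of chunks held on the server
{[h;j]`normTable upsert h({batches x};j)}[h] each til nb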
To give you a bit of background: I have a process which does a large, complex calculation that takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the internal q function union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns, the first of which is a symbol. Table A is actually the compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]sym:n?`4;val:n?100)
small:([]sym:500?`4;val:500?100)
\ts big union small
I tried keying both columns and upserting, joining and then applying distinct, and big,small where not small in big, but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with the 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert in place:
`big upsert small
If big is local:
big:big upsert small
As a result, big will have 5000250 rows, because 250 of the keys (the id column) in small already exist in big, while the other 250 are appended.
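As a quick sanity check of that behaviour (a tiny sketch with made-up sizes, not the question's 5m rows):
b:1!([]id:til 5;val:5?100)     / keyed "big" with ids 0..4
s:([]id:3 4 5 6;val:4?100)     / "small": ids 3 and 4 collide, 5 and 6 are new
count b upsert s               / 7 rows: 5 existing, 2 appended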
This may not be relevant, but just a quick thought: if your big table has a column of type sym and that column does not really show up much throughout your program, why not cast it to string or some other type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe that is the part which actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the table's schema to minimise the amount of rehashing (not sure if that's the right term!) of your sym file, or, as mentioned above, trying to apply an attribute to your table - that may reduce the time too.
I have a C application that streams data into an in-memory kdb+ table all day, eventually outgrowing my server's RAM. The goal ultimately is to store the data on disk, so I decided to run a timer-driven partition function to transfer data gradually. I came up with this code:
part_timer:{[]
(`$db) upsert .Q.en[`$sym_path] select [20000] ts,exch,ticker,side,price,qty,bid,ask from md;
delete from `md where i<20000
}
.z.ts: part_timer
.z.zd: 17 2 6i
\t 1000
Is this the correct way to partition streaming data in real-time? How would you write this code? I'm concerned about the delete statement not being synchronized with the select.
While not an explicit solution to your issue, take a look at w.q. It is a write-only alternative to the traditional RDB: it buffers up updates and, every MAXROWS records, writes the data to disk.
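The core idea (a simplified sketch of that buffer-and-flush pattern, not the actual w.q code; MAXROWS is arbitrary and db/sym_path are reused from the question's snippet) is to append to the in-memory table and flush the whole buffer once a threshold is reached:
MAXROWS:20000
flush:{[]
  if[0=count md;:()];                    / nothing buffered yet
  (`$db) upsert .Q.en[`$sym_path] md;    / enumerate and append the whole buffer
  delete from `md}                       / then clear it
.z.ts:{[x] if[MAXROWS<count md;flush[]]}
\t 1000
A final flush[] at end of day picks up the remainder. Because each flush writes and clears the buffer within a single timer callback on the main thread, the write and the delete cannot get out of step with incoming inserts, which addresses the synchronisation concern.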
In the above comment you asked:
"If not, how can I reorganize the db effectively at the end of the day to store symbols sequentially?"
I know this answer is a bit delayed, but this might help someone else who is trying to do the same thing.
Run the following to sort the data on disk (this is slower than pulling it into RAM, sorting, and then writing back to disk):
par:.Q.par[PATH;.z.D;TABLE];
`sym xasc par;
@[par;`sym;`p#];
Where:
PATH: `:path/on/disk/to/db/root;
For single file tables:
TABLE: `tableName;
For splayed tables:
TABLE: `$"tablename/"
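As a concrete instantiation for the question's md table (a sketch: the db root is a made-up placeholder, .z.D assumes you partition by the current date, and ticker stands in for the sym column used above, since that is md's symbol column):
par:.Q.par[`:/path/on/disk/to/db/root;.z.D;`$"md/"];
`ticker xasc par;      / sort the splayed partition on disk by its symbol column
@[par;`ticker;`p#];    / then apply the parted attribute to that column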
At the end of the day (i.e. when you don't expect any more data to be appended), you can call the following from your C program:
Write to a location for 2013.01.01
.Q.dpft[`:/path/to/location;2013.01.01;`sym;`tableName];
Clear the table
delete from `tableName
Clear some memory up
.Q.gc peach til system"s"
Of course that assumes you have time/sym columns and that you want to splay by date. Otherwise
`:/path/to/location/tableName/ set tableName
will splay the table.
You can also append if you wish (see the I/O chapter of Q for Mortals for examples).
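For appending (a sketch, keeping the same assumed path and table name as above), you can upsert to the splayed directory, enumerating symbol columns against the same root first:
`:/path/to/location/tableName/ upsert .Q.en[`:/path/to/location;tableName]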