How to update a local table remotely? - kdb

I have a large table on a remote server with an unknown number (millions) of rows of data. I'd like to fetch the data in batches of 100,000 rows at a time, updating my local table with each fetched batch, until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the function below on the client, along with the variable index set to 0 and normTable, an empty copy of the remote table...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
  if[not idx=divy;
    batch:select[(anty;100000)] from seltr;
    `normTable upsert batch;
    idx+:1;
    anty+:100000;
    batches[idx;divy;anty;seltr]];
  idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all the rows at once, and does so multiple times: once for each batch of 100,000 that it upserts to my local normTable.
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to generate batchX variables so that the number of variables created equals divyUP?

I would suggest a few changes, since you are connecting to a remote server:
Do not make synchronous requests, as they force the server to slow down its own processing. Make asynchronous requests with callbacks instead.
Do not do a full table scan (a heavy comparison, especially for a regex match) on every call. Most of the data might still be in cache on the next call, but that is not guaranteed, and the scan will again impact the server's normal operations.
Do not make data requests in bursts. Either use a timer or issue the next data request only when the last batch has arrived.
The approach below is based on these suggestions. It avoids scanning the full table for columns other than the index column (which is lightweight) and makes the next request only when the last batch has arrived.
Create Batch processing function
This function runs on the server. It reads a small batch of rows from the table using indices and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni-1), sym like "A*";
neg[.z.w](`upd;d;$[ni<count t;ni;0]) }
It takes 2 arguments: the starting index and the batch size.
It finishes by asynchronously calling the upd function on the local machine, passing it 2 arguments:
the data from the current batch request;
the table index to start the next batch from (0 is returned once all rows are done, which stops further batch processing).
Create Callback function
The result from the batch processing function arrives in this function.
If index > 0, there is more data to process and the next batch should start from this index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create Main function to start process
q)fetch:{[ind] (neg h)(`batch;ind;size)}
Note that batch is called by name (it is defined on the server, not the client) and asynchronously, per the first suggestion above.
Finally, open the connection, create the table variable, and start the process with the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Note that the above method assumes the server table is static. If it is updated in real time, changes would be required, depending on how the table is updated on the server.
Other optimizations may also be possible, depending on the attributes set on the remote table, which can improve performance.

If you're OK sending sync messages, it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
You can easily change it to control the number of batches (rather than the batch size) by using 10 0N# instead; that would do it in 10 batches.
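For example, a minimal sketch assuming the remote table is t, as in the question, and that the local table is initialised from the remote schema first:
`mytab set h"0#t"
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"   / batches of 100,000
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 10 0N#til h"count t"      / 10 batches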

Rather than having individual variables, the cut function can split the result of the select into chunks of 100,000 rows. Each element of the result is a table.
batches:100000 cut select from t where sym like "A*"
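Continuing from the question's example, the chunks can then be indexed directly:
count batches    / number of chunks - the divyUP value from the question
batches 0        / first chunk: a table of up to 100,000 rows
batches 1        / second chunk, and so on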


Spark Delta Table Updates

I am working in a Microsoft Azure Databricks environment using Spark SQL and PySpark.
I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which contains either NULL (if everything looks good on that specific record) or a non-null value (say, if a particular lookup mapping for a particular column is not found). Additionally, my process includes another folder called "mapping" which is refreshed periodically, say nightly to keep it simple, and is where mappings are found.
On a daily basis, there is a good chance that about 100-200 rows error out (the status column contains non-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. In addition to the valid-status records, the downstream job should also check whether a mapping is now available for the errored records and, if so, take those down the pipeline as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go about this? Ideally, I would first update the delta table/lake with the correct mapping and set the status column to "available_for_reprocessing", then have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data, and after processing update the status back to "processed". But this seems very difficult to do with Delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there only covers a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records, and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process at most a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I can use PySpark or Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records based on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.status = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location

Replay tplog file by appending messages to the partitioned table in kdb

I'm trying to replay messages from a tickerplant log directly into a partitioned table on disk, appending as I go, since I don't have much primary memory relative to the tplog size.
The tplog looks like this:
q)9 2 sublist get `:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23
`upd `trade (,0D22:38:00.083960000;,`MSFT.O;,45.15104;,710)
`upd `quote (,0D22:38:01.082882000;,`VOD.L;,341.2765;,341.3056;,732;,481)
I'm using the 'upsert' method below to append these tplog messages to the partitioned table, but it fails with a type error on the upsert:
quote:([]time:`timespan$();sym:`symbol$();bid:`float$();ask:`float$();bsize:`int$();asize:`int$());
trade:([]time:`timespan$();sym:`symbol$();price:`float$();size:`int$());
`:/Users/uts/db/2020.05.23/quote/ set .Q.en[`:/Users/uts/db;]quote;
`:/Users/uts/db/2020.05.23/trade/ set .Q.en[`:/Users/uts/db;]trade;
upd:{[t;d]
if[`trade~t;[show raze d;`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze d]];
};
-11!`:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23
Error:
'type
[1] upd:{[t;d]
if[`trade~t;[show raze d;`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze d]];
^
}
But if I manually append a message to the partitioned table, it works fine:
`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze (enlist 0D22:39:00.083960000;enlist `MSFT.O;enlist 45.15104; enlist 710)
Not sure why 'upsert' is not working within the upd function along with -11!.
Please share details/links if there is a better way (most probably there is) to replay tplogs directly to disk without using much primary memory.
I'm not sure if it'll solve your exact problem but a few suggestions:
Your code assumes that every tickerplant log record contains a single row. This may not be the case, as many tickerplant logs record multiple rows in a single update. This means your enlist (cols trade)!raze d code wouldn't work (though I would expect a length error in that case). A more general alternative is to use:
$[0>type first d;enlist cols[trade]!d;flip cols[trade]!d]
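To illustrate (a sketch using the trade schema from the question; the values are made up): a single-row message arriving as a list of atoms takes the first branch, while a batched message arriving as a list of column vectors takes the second:
q)d1:(0D22:38:00.083960000;`MSFT.O;45.15104;710i)                  / one row as atoms
q)$[0>type first d1;enlist cols[trade]!d1;flip cols[trade]!d1]     / 1-row table
q)d2:(0D22:38:00 0D22:38:01;`MSFT.O`VOD.L;45.15 341.27;710 732i)   / two rows as column vectors
q)$[0>type first d2;enlist cols[trade]!d2;flip cols[trade]!d2]     / 2-row table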
You should not write to disk for every single upd record from a tickerplant log - that is simply too many disk writes in a short space of time. It's inefficient and could lead to disk I/O constraints. It is better to insert in memory until the table reaches a certain size, then write to disk in a batch and wipe the table. I would suggest something like:
write:{`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;value x];delete from x};
upd:{[t;d]
  if[`trade~t;t insert d];           /buffer rows in memory
  if[10000<count value t;write[t]];  /flush to disk once the buffer is big enough
  };
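Combining the two suggestions (a sketch; the path in write and the 10,000-row threshold are illustrative), the row-shape handling from the first point can slot straight into the buffered upd:
upd:{[t;d]
  if[`trade~t;`trade insert $[0>type first d;enlist cols[trade]!d;flip cols[trade]!d]];
  if[10000<count trade;write[`trade]];
  };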
Then your replay would look like:
-11!`:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23;
if[0<count trade;write[`trade]]; /need to write the leftovers
`sym`time xasc `:/Users/uts/db/2020.05.23/trade/; /sort the on-disk table by sym and time
#[`:/Users/uts/db/2020.05.23/trade/;`sym;`p#];    /apply the parted attribute to sym

Pentaho Data Integration - Assure that one step will be run before another

I have a transformation in Pentaho Data Integration that stores data in several tables of a database.
But this database has constraints, meaning I can't insert rows into one table before the related data has been inserted into another table.
Sometimes it works, sometimes it doesn't; it depends on concurrency luck.
So I need to assure that Table Output 1 gets entirely run before Table Output 2 starts.
How can I do this?
You can use a step named "Block this step until steps finish".
You place it before the step that needs to wait, and inside the block you define which steps are to be waited for.
Suppose Table Output 2 contains a foreign key to a field in table 1, but the rows you're going to reference in table 2 don't exist yet in table 1. This means Table Output 2 needs to wait until Table Output 1 finishes.
Place the "block" step in the stream just before Table Output 2.
Then open the properties of the "block" step and add Table Output 1 to the list (along with any other steps you want to wait for).
Alternatively, you can use a job instead of a transformation, because in a transformation all steps run in parallel. In the job, add a first transformation in which Table Output 1 executes, then a second transformation in which Table Output 2 executes.

Downloading KDB tables into a [q] session from a web-facing database

What I have: the hostname/port number of an always-running [q] session that exposes several KDB tables via our internal web. I can easily run [q] commands against it in a browser (or even, through the use of [hopen], via a local [q] session invoked on the command line).
What I need: a [q] script, or the knowledge of how to write one, that will automatically connect to the web-facing database, and copy over all of its tables into the localhost [q] session's working memory (without knowing all the table names in advance).
Concerns include:
The tables are huge. I'm prepared to wait on my machine if need be, but I do need this to work eventually.
While I can get a legible list of all the server's table names, I can never get it in a useful format (ideally it'd be a list, rather than the symbol output that the hopen-ed [tables] command always gives me). Also, I'm told it may be possible to accomplish the transfers without ever explicitly querying the table names, though I can't imagine how; bonus points if you manage that.
You can implement something like this:
.data.oc:1000;
/connect to the session using hopen
h:hopen `::1234;
/get the table names
tabs:h"tables[]";
/create local tables with the same names
{ .[x;();:;()] } each tabs;
/for each table name
{[tab]
/get the table count
c:h({count value x};tab);
oc:.data.oc;
/split the row count into index ranges of size .data.oc, e.g. with oc=1000: (0;999) (1000;1999) ...
idxl:$[c>oc; [ l: c div oc; ( (0;oc-1)+/:oc*til l),enlist (l*oc;c-1) ] ; enlist (0; c-1)];
/now iterate over the list and use them as indexes to query the table.
{[t;idx] t upsert h ({[t;y] ?[t; enlist (within;`i;y);0b;()] } ; t;idx ) }[tab] each idxl;
}each tabs
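If the tables were small enough to transfer in one message, a simpler sync variant (a sketch; only sensible when the total size is manageable) keeps the name handling entirely server-side:
/build a name->table dictionary on the server, then define each table locally
d:h"tables[]!value each tables[]";
{x set y}'[key d;value d];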

How to correctly partition a table in real-time in kdb?

I have a C application that streams data into an in-memory kdb table all day, eventually outgrowing the size of my server's RAM. The ultimate goal is to store the data on disk, so I decided to run a timer-driven partition function to transfer data gradually. I came up with this code:
part_timer:{
  (`$db) upsert .Q.en[`$sym_path] select[20000] ts,exch,ticker,side,price,qty,bid,ask from md;
  delete from `md where i<20000
  }
.z.ts: part_timer
.z.zd: 17 2 6i
\t 1000
Is this the correct way to partition streaming data in real time? How would you write this code? I'm concerned that the delete statement is not synchronized with the select.
While not an explicit solution to your issue, take a look at w.q. It is a write-only alternative to the traditional RDB: it buffers up requests and, every MAXROWS records, writes the data to disk.
In the above comment you asked:
If not, how can I reorganize the db effectively at the end of the day
to store symbols sequentially?
I know this answer is a bit delayed, but this might help someone else who is trying to do the same thing.
Run the following to sort the data on disk (this is slower than pulling it into RAM, sorting, and then writing to disk):
par:.Q.par[PATH;.z.D;TABLE];
`sym xasc par;
#[par;`sym;`p#];
Where:
PATH: `:path/on/disk/to/db/root;
For single file tables:
TABLE: `tableName;
For splayed tables:
TABLE: `$"tablename/"
At end of day (i.e. once you don't expect any more data to be appended), you can call the following from your C program:
Write to a location for 2013.01.01
.Q.dpft[`:/path/to/location;2013.01.01;`sym;`tableName];
Clear the table
delete from `tableName
Clear some memory up
.Q.gc peach til system"s"
Of course, that assumes you have time/sym columns and want to splay by date. Otherwise
`:/path/to/location/tableName/ set tableName
will splay the table.
You can also append if you wish (see the I/O chapter of Q for Mortals for examples).
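Putting those steps together (a minimal sketch; the path and table name are the illustrative ones from above, and the table is assumed to have a sym column):
/end-of-day flush: splay today's data, clear the in-memory table, reclaim heap
eod:{[dt]
  .Q.dpft[`:/path/to/location;dt;`sym;`tableName];  /write the partition, enumerate syms, apply `p#
  delete from `tableName;                           /clear the in-memory table
  .Q.gc[];                                          /return freed memory to the OS
  };
eod .z.D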