We have a Postgres process that works as follows:
1) A CSV file is split into a table with one row per record.
2) pgAgent runs a Postgres function that reads each record and writes it to a new table, as either a new record or an update.
3) A trigger runs on the new table and, depending on the record value, runs a plv8 function to update its data (there's a fair bit of JSON processing involved and plv8 was the easiest way to code it). The second update comes from plv8, and we've used the pattern below:
query = plv8.prepare('...');
query.execute(<params>);
query.free();
When we monitor this we see that processing 5000 records uses 14 GB of virtual memory, so something is awry, as each CSV record is < 1 KB in size. This became acute after we added a new index to the table.
Where should we look for solutions to this? Is it normal, and is it linked to the indexes being updated in the transaction, or to another factor?
I'm trying to replay messages from a tickerplant log directly to a partitioned table on disk by appending them, since I don't have much primary memory compared to the tplog size.
The tplog looks like this:
q)9 2 sublist get `:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23
`upd `trade (,0D22:38:00.083960000;,`MSFT.O;,45.15104;,710)
`upd `quote (,0D22:38:01.082882000;,`VOD.L;,341.2765;,341.3056;,732;,481)
I'm using the upsert approach below to append these tplog messages to the partitioned table, but it fails with a type error on the upsert:
quote:([]time:`timespan$();sym:`symbol$();bid:`float$();ask:`float$();bsize:`int$();asize:`int$());
trade:([]time:`timespan$();sym:`symbol$();price:`float$();size:`int$());
`:/Users/uts/db/2020.05.23/quote/ set .Q.en[`:/Users/uts/db;]quote;
`:/Users/uts/db/2020.05.23/trade/ set .Q.en[`:/Users/uts/db;]trade;
upd:{[t;d]
if[`trade~t;[show raze d;`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze d]];
};
-11!`:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23
Error:
'type
[1] upd:{[t;d]
if[`trade~t;[show raze d;`:/Users/utsav/db/2020.05.23/trade/ upsert .Q.en[`:/Users/utsav/db;]enlist (cols trade)!raze d]];
^
}
But if I manually append the message to the partitioned table, it works fine:
`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;]enlist (cols trade)!raze (enlist 0D22:39:00.083960000;enlist `MSFT.O;enlist 45.15104; enlist 710)
I'm not sure why upsert is not working within the upd function when replaying with -11!.
Please share details/links if there is a better way (most probably there is) to replay tplogs directly to disk without using much primary memory.
I'm not sure if it'll solve your exact problem, but here are a few suggestions:
Your code assumes that every tickerplant log record contains a single row. This may not be the case, as many tickerplant logs record multiple rows in a single update. This means your enlist (cols trade)!raze d code wouldn't work for those records (though I would suspect a length error in that case). A more general alternative is to use:
$[0>type first d;enlist cols[trade]!d;flip cols[trade]!d]
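As a rough illustration (using the trade schema above; the values here are made up), the same expression handles both shapes of d:
/ single-row update where d is a list of atoms
d1:(0D22:38:00.083960000;`MSFT.O;45.15104;710i)
/ multi-row update where d is a list of column vectors
d2:(0D22:38:00.083960000 0D22:38:01.082882000;`MSFT.O`VOD.L;45.15104 341.2765;710 732i)
$[0>type first d1;enlist cols[trade]!d1;flip cols[trade]!d1]   / 1-row table
$[0>type first d2;enlist cols[trade]!d2;flip cols[trade]!d2]   / 2-row table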
You should not try to write to disk for every single upd record from a tickerplant log - it's simply too many disk writes in such a short space of time. It's inefficient and could run into disk I/O constraints. It's better to insert into the in-memory table until it reaches a certain size, then write that batch to disk and wipe the table. I would suggest something like:
write:{`:/Users/uts/db/2020.05.23/trade/ upsert .Q.en[`:/Users/uts/db;value x];delete from x};
upd:{[t;d]
  if[`trade~t;t insert d];
  if[10000<count value t;write[t]];
  };
Then your replay would look like:
-11!`:/Users/uts/Desktop/repos/ktick/tick/sym2020.05.23;
if[0<count trade;write[`trade]]; /need to write the leftovers
`sym`time xasc `:/Users/uts/db/2020.05.23/trade/;
#[`:/Users/uts/db/2020.05.23/trade/;`sym;`p#];
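Once the replay completes, a quick sanity check (hypothetical q session, using the same paths as above) is to load the database root and query the date partition:
\l /Users/uts/db
select count i by sym from trade where date=2020.05.23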
I have a large table on a remote server with an unknown number of rows (millions). I'd like to fetch the data in batches of 100,000 rows at a time, update my local table with each fetched batch, and repeat until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the function below on the client, along with the variables index (set to 0) and normTable (an empty copy of the remote table)...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
if[not idx=divy;
batch:select[(anty;100000)] from seltr;
`normTable upsert batch;
idx+::1;
divy:divy;
anty+:100000;
seltr:seltr;
batches[idx;divy;anty;seltr]];
idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all of the rows at once (and multiple times: once for each batch of 100,000 that it upserts to my local normTable).
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to create batchX variables programmatically, so that the number of variables created equals divyUP?
I would suggest a few changes, since you are connecting to a remote server:
Do not run synchronous requests, as that would slow down the server's processing. Make asynchronous requests using callbacks instead.
Do not do a full table scan (a heavy comparison, especially for a pattern match like sym like "A*") on each call. Most of the data might still be in cache on the next call, but that is not guaranteed, and the scans will again impact the server's normal operations.
Do not make data requests in a burst. Either use a timer, or request the next batch only when the previous batch has arrived.
The approach below is based on these suggestions. It avoids scanning the full table for columns other than the index column (which is lightweight) and makes the next request only when the last batch has arrived.
Create a batch processing function
This function runs on the server, reads a small batch of data from the table using indices, and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*"; neg[.z.w](`upd;d;$[ni<count t;ni+1;0])}
It takes 2 arguments: the starting index and the batch size to work on.
It then asynchronously calls the upd function on the local machine, passing 2 arguments:
Data from the current batch request
The table index to start the next batch from (0 is returned once all rows are done, to stop further batch processing)
Create a callback function
The result from the batch processing function arrives in this function.
If the index is > 0, there is more data to process and the next batch should start from that index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create a main function to start the process
q)fetch:{[ind] h (batch;ind;size)}
Finally, open the connection, create the table variable, and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Note that the above method assumes the server table is static. If it is being updated in real time, changes would be required depending on how the table is updated on the server.
Also, other optimizations are possible depending on the attributes set on the remote table, which can improve performance.
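For example (hypothetical, run on the server), a grouped attribute on the sym column would speed up equality filters such as sym=`A or sym in`A`Ab, although it does not help the like pattern used above:
update `g#sym from `t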
If you're ok sending sync messages it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
You can easily change it to control the number of batches (rather than the batch size) by using 10 0N# instead; that would do it in 10 batches.
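For instance, the same one-liner with a fixed batch count (reusing the mytab and h names from above) would be:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 10 0N#til h"count t"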
Rather than having individual variables, the cut function can split the result of the select into chunks of 100000 rows. Indexing into the result gives each chunk as a table.
batches:100000 cut select from t where sym like "A*"
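For example, each element of batches is itself a table:
first batches    / the first chunk, a table of up to 100000 rows
count batches    / the number of chunks, i.e. divyUP from the question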
I have a database table with unique record numbers created by a generator, but because of an error in the code (setting the generators) the record numbers suddenly became large and many numbers are skipped. I would like to renumber all records, starting at 1 and finishing at the total number of records. Doing it through the application would take a lot of time.
From the Firebird documentation it looks like this should be a simple task using a loop, but I have no experience with Firebird programming, I only use simple SQL statements. Can somebody help?
Actually there is no need to program a loop, a simple update statement should do. First, reset the generator:
SET GENERATOR my_GEN TO 0;
and then update all the records, assigning them new ids:
update tab set recno = gen_id(my_GEN, 1) order by recno asc;
This assumes that all references to the recno field are via foreign keys with ON UPDATE CASCADE; otherwise you will either mess up your data or the update will fail.
During this operation there should be no other users in the database!
That being said, you really shouldn't care about gaps in your record numbers.
I have a C application that streams data to a kdb+ in-memory table all day, eventually outgrowing my server's RAM. The ultimate goal is to store the data on disk, so I decided to run a timer function to transfer the data to the on-disk partition gradually. I came up with this code:
part_timer : { []
(`$db) upsert .Q.en[`$sym_path] select [20000] ts,exch,ticker,side,price,qty,bid,ask from md;
delete from `md where i<20000
}
.z.ts: part_timer
.z.zd: 17 2 6i
\t 1000
Is this the correct way to partition streaming data in real-time? How would you write this code? I'm concerned about the delete statement not being synchronized with the select.
While not an explicit solution to your issue, take a look at w.q here. It is a write-only alternative to the traditional RDB: it buffers up requests and, every MAXROWS records, writes the data to disk.
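As a rough sketch of that buffer-and-flush idea (this is not the actual w.q code; it reuses the db and sym_path names from your snippet):
MAXROWS:20000
upd:{[t;x]
  t insert x;                                  / buffer the update in memory
  if[MAXROWS<=count value t;                   / flush once the buffer is large enough
    (`$db) upsert .Q.en[`$sym_path] value t;
    delete from t];
  }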
In the above comment you asked:
If not, how can I reorganize the db effectively at the end of the day
to store symbols sequentially?
I know this answer is a bit delayed, but this might help someone else who is trying to do the same thing.
Run the following to sort the data on disk (this is slower than pulling it into RAM, sorting, and then writing back to disk):
par:.Q.par[PATH;.z.D;TABLE];
par xasc `sym;
#[par;`sym;`p#];
Where:
PATH: `:path/on/disk/to/db/root;
For single file tables:
TABLE: `tableName;
For splayed tables:
TABLE: `$"tablename/"
At end of day (i.e. when you don't expect more data to be appended), from your C program you can call:
Write to a location for 2013.01.01
.Q.dpft[`:/path/to/location;2013.01.01;`sym;`tableName];
Clear the table
delete from `tableName
Clear some memory up
.Q.gc peach til system"s"
Of course, that assumes you have time/sym columns and want to splay by date. Otherwise
`:/path/to/location/tableName/ set tableName
Will splay.
You can also append if you wish (see the I/O chapter of Q for Mortals for examples).
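A rough sketch of such an append (the paths are the same placeholders as above; .Q.en is only needed if the table has symbol columns):
`:/path/to/location/tableName/ upsert .Q.en[`:/path/to/location;tableName]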
I have a table that we just enabled FILESTREAM on. We created a new varbinary column and set it to store to a filestream, then copied everything from the existing column to the new one in order to push the file data out to the file system.
So far so good.
However, we weren't able to take the DB offline while doing this (uptime SLA), and 2 records out of 7400 came in after the update statement ran but before we renamed the columns. We currently have 2 columns, FileData and FileDataOld, where FileData is the one tied to the filestream.
The average file size is a little over 2 MB. So I decided to run a very simple select statement to find the records that didn't get migrated:
select DocumentId, FileName
from docslist
where FileData is null
When I ran this query, the CPU spiked to 80% and sat there for quite a while. Ultimately I killed the select after 2 minutes because that was just insane.
If I run something like:
select DocumentId, FileName from docslist
It returns almost instantly.
However, as soon as I query for FileData or FileDataOld being null, it spins off into forever land.
When I watch Resource Monitor while querying for 'FileData is null', I can see it pulling every byte of every single document off the file system. Which is pretty odd; you'd think that information would be stored within the table itself.
If I query for 'FileDataOld is null', it looks like it's trying to load the entire table (16 GB) into memory.
How can I improve this? I just need to find the 2 records that came in after the update statement and force those documents to move over.
Can't you do:
select DocumentId, FileName from docslist WHERE DATALENGTH(FileData)>0
On MSDN it says:
DATALENGTH is especially useful with varchar, varbinary, text, image,
nvarchar, and ntext data types because these data types can store
variable-length data.
The DATALENGTH of NULL is NULL.
Reference here