What is the right way to iterate through a kdb partitioned table in a client application? - kdb

I want to process all the rows of a kdb table in an R program (I use qserver.R). One way to do this is to initialize a memory handler and then iterate through all the rows one at a time, as explained here:
t: select from mytable where ts>12:30:00,ts<15:00:00,price,msg="A"
t[0]
t[1]
t[2]
...
I want to limit the number of client/server calls from R so that the loop runs as fast as possible.
How can I fetch multiple rows for each call?

NOTE: my answer below assumes that mytable is the partitioned database, but that you now have t in memory.
Another option is to use cut (with "chunks" of 1,000,000, as per your earlier post):
(`int$1e6) cut t
Now you have a list of table "chunks" of your desired size, which you can use accordingly.
I frequently use this for certain functions (particularly in combination with peach).
A pattern I've found useful is:
f: <function that does something useful on each chunk>
fa: <function that re-aggregates the chunk results into the final result>
r:fa raze f peach (`int$size) cut t  / size is your desired chunk size, e.g. 1e6
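A toy instantiation of that pattern (my own illustration, not from the answer, assuming t has a numeric price column and that secondary threads have been started, e.g. with q -s 4, so peach actually runs in parallel):
f:{sum x`price}                     / per chunk: total the price column
fa:sum                              / re-aggregate: add up the per-chunk totals
r:fa raze f peach (`int$1e6) cut t  / total price over all of t, computed in chunks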
If your t is really large (both vertically and horizontally) you might want to avoid cutting the table directly, for memory reasons. Instead you can cut a list of indices into the table into chunks of the appropriate size, feed those indices to your f, and have f index into t and grab what it wants.
Below is a quick comparison of both approaches (note that f here is pointless, but it proves the point of cutting t directly versus cutting indices):
q)t:flip (`$"c",/:string til 100)!{(`int$1e7)?100} each til 100
q)\ts a:raze {select c1,c99 from x}each 1000 cut t
3827 4108103072j
q)\ts b:raze {select c1,c99 from t[x]}each 1000 cut til count t
3057 217623200j
q)4108103072j%217623200j
18.87714
q)a~b
1b

From your previous questions I assume this is a one-person system, so what benefit are you getting from kdb? Why not work fully in R and just use flat memory-mapped files directly there, avoiding unneeded complexity and overhead? If all you want to do is stream the data through R in order, that should be simple.
Rather than "ts>12:30:00,ts<15:00:00", use "ts within (12:30:00;15:00:00)"; it's quicker.
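Applied to the query from the question, that would look something like this (only the within rewrite is mine; the rest of the where clause is left as posted):
t: select from mytable where ts within (12:30:00;15:00:00), price, msg="A"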
The larger the chunks you process, the more efficient it is likely to be; 100 rows seems quite small.
Regards,
Ryan Hamilton

Sorted out, this returns 100 rows each time:
\l /data/mydb
t: select from mytable where ts>12:30:00,ts<15:00:00,price,msg="A"
select [0 100] from t
select [100 100] from t
select [200 100] from t
..
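If you want a single helper that R can call repeatedly with an offset and a chunk size, one possible sketch (the name getChunk and its arguments are mine, not from the post) uses sublist, which takes a (start;length) pair:
getChunk:{[offset;n] (offset,n) sublist t}  / n rows of t starting at row offset
getChunk[0;100]    / rows 0-99
getChunk[100;100]  / rows 100-199
Each call from R then fetches one chunk, so the number of client/server round trips is count[t] div n (plus one) rather than one per row.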

Related

What kind of index should I use in postgresql for a column with 3 values

I have a table with 100Mil+ records and 237 fields.
One of the fields is a varchar(1) field with three possible values (Y, N, I).
I need to find all of the records with N.
Right now I have a b-tree index built and the query below takes about 20 min to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally distributed (say at least 15% of each value) and roughly evenly spread throughout the table (some physically at the beginning, some in the middle, some at the end), then no.
If you think about it you'll see why. You'll have to look up tens of millions of disk blocks in the index and then fetch them from the disk one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the values as they match. The planner knows this and would probably not use the index at all.
However, if you only have 17 rows with "N", or they are all very recently added to the table and so physically happen to be close to each other, then yes, an index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that one.
If, however, you mostly append to this table, you might find a BRIN index helpful. It can let the planner see that e.g. the first 80% of your table doesn't contain any "N" values, so it only needs to look at the last part.
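A minimal sketch of creating such an index, using the table and column names from the question (BRIN requires PostgreSQL 9.5 or later):
-- per-block-range summary index; tiny compared to a b-tree on 100M+ rows
CREATE INDEX tableone_export_value_brin
    ON tableone USING brin (export_value);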

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the output of the process.
You can do this, but it takes a bit of extra work in your recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, you can filter rows on whichever value between 1 and 1,000 you decide to keep, and your output will then contain only that randomized subset. Just have the last part of the recipe remove this temporary column.
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number from a range: RANDBETWEEN or RAND.
These can be used to slice an "even" portion of your data. As suggested, use RANDBETWEEN(1,1000) and then filter on a single value between 1 and 1,000, because that keeps roughly 1/1000 of the data (a million rows out of a billion).
Alternatively, if you just want a million records in your output but either
* don't want to rely on knowing the size of the entire table, or
* just want the first million rows, regardless of how many rows there are in total,
you can use two of the three row-filtering methods ("top rows" or "range").
P.S. Using the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as per the first scenario) in one step, i.e. without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the "search transformation" pane (accessed via Ctrl+K). Searching for "filter" will give you most of the relevant options for your problem.
Cheers!

How attributes work internally in aj for performance benefits in kdb

Consider the trade table 't' and quote table 'q' in memory:
q)t:([] sym:`GOOG`AMZN`GOOG`AMZN; time:10:01 10:02 10:02 10:03; px:10 20 11 19)
q)q:([] sym:`GOOG`AMZN`AMZN`GOOG`AMZN; time:10:01 10:01 10:02 10:02 10:03; vol:100 200 210 110 220)
To get performance benefits, I apply the grouped attribute to the 'sym' column of the quote table and sort the 'time' column within sym, giving a second table q1.
Using this, I can clearly see the performance benefit.
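q1 isn't defined in the post; as a sketch, one way to build it (sort time within sym, then apply `g# to sym) would be:
q)q1:update `g#sym from `sym`time xasc q
The timings below then compare joining t against q (no attribute) and against q1: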
q)\t:1000000 aj[`sym`time;t;q]
9573
q)\t:1000000 aj[`sym`time;t;q1]
8761
q)\t:100000 aj[`sym`time;t;q]
968
q)\t:100000 aj[`sym`time;t;q1]
893
And on large tables the performance improvement is far greater.
Now I'm trying to understand how it works internally when we apply the grouped attribute to the sym column and sort time within sym.
My understanding is that internally the aj should work as below; can someone please confirm the correct internal behaviour?
* Since the grouped attribute is applied on sym, it creates a hash table for table q1; then, since time is sorted within sym, the internal representation of q1 might look like:
GOOG|(10:01;10:02)|(100;110)
AMZN|(10:01;10:02;10:03)|(200;210;220)
So in the case of q1, if the interpreter has to join (AMZN;10:02) from table t, it will find it directly in q1's hash table in less time; but to join the same value (AMZN;10:02) from table t against table q, the interpreter will have to search linearly through table q, hence taking more time.
I believe you're on the right track, though we can't know for sure as we don't have access to the kdb source code to see precisely what it does.
If you look at the definition of aj you'll see that it's based on bin:
q)aj
k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
specifically,
(`sym`time#q)bin `sym`time#t
and the bin documentation provides some more details on how bin behaves: https://code.kx.com/q/ref/bin/
I believe in the two-column case it will first match on the sym column and then use bin on the second column. Like you said, the grouped attribute on sym speeds up the sym-matching part, and the sorting on time ensures that bin returns the correct results. Note that for on-disk queries it's optimal to put `p# on sym rather than `g#, as the parted attribute is optimal for matching/retrieving by sym from disk.
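As a quick toy illustration of what bin does on a sorted list (my own example, not from the answer): it returns the index of the last element less than or equal to the lookup value, which is exactly the as-of semantics aj needs.
q)0 2 4 6 8 bin 5
2
q)10:01 10:02 10:03 bin 10:02
1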

Query distinct values from historical database

If I run this query on a large historical database without specifying a date, will kdb be smart enough to retrieve the status values from an index and not bring the database down?
select distinct status from trades
The only way kdb can possibly determine all the distinct status values is by reading from every partition. Yes, this will take a lot of memory, but unless you want to maintain a cache of all distinct status values yourself, there is nothing else you can do. As previously mentioned, an attribute will speed the query up, but the query time will still scale with the number of partitions.
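If you did want to maintain such a cache yourself, a rough sketch of building it one partition at a time (assuming a date-partitioned database has been loaded, so .Q.pv holds the partition values):
/ collect the distinct status values per date, then dedupe the combined list
statusCache:distinct raze {exec distinct status from trades where date=x} each .Q.pv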
To retrieve using an index, kdb provides the `g# attribute. distinct alone can take more time, depending on the size of your table (it will be a linear search without the `g# attribute).
Check this: http://code.kx.com/q4m3/8_Tables/#88-attributes
Let's look at a simple example:
q) a: 10000000#1 2 3 5
q) b:`g#a
q) \ts distinct a
68 134217888
q) \ts distinct b
0 288
The difference shows that the `g# attribute makes a big difference in the time and space taken during the search. This is because the `g# attribute creates and maintains an index on the vector.

Pagination on large data sets? – Abort count(*) after a certain time

We use the following pagination technique here:
get count(*) of given filter
get first 25 records of given filter
-> render some pagination links on the page
This works pretty well as long as count(*) is reasonably fast. In our case the data size has grown to a point where a non-indexed query (although most stuff is covered by indices) takes more than a minute. So at this point the user waits for a mostly unimportant number (total records matching the filter, number of pages). The first N records are often ready pretty fast.
Therefore I have two questions:
can I limit the count(*) to a certain number?
or would it be possible to limit it by time (i.e. no count(*) result if it isn't known after, say, 20 ms)?
Or just in general: are there some easy ways to avoid that problem? We would like to keep the system as untouched as possible.
Database: Oracle 10g
Update
There are several scenarios:
a) there's an index -> neither count(*) nor the actual select should be a problem
b) there's no index
count(*) is HUGE, and it takes ages to determine it -> rownum would help
count(*) is zero or very low; here a time limit would help. Or I could just skip the count(*) if the result set is already below the page limit.
You could use 'WHERE ROWNUM < x' to limit the number of rows counted. And if you need to show the user that there are more records, you could count up to x+1 rows, just to see whether there are more than x records.
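A minimal sketch of that idea (the table and filter are placeholders of mine, not from the question): count at most 1001 matching rows, so anything above 1000 can be displayed as "more than 1000".
-- ROWNUM is assigned as rows pass the other predicates,
-- so Oracle stops counting once 1001 matches have been seen
SELECT COUNT(*) AS capped_count
FROM   orders
WHERE  status = 'OPEN'
  AND  ROWNUM <= 1001;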