Parameterize select query in unary kdb function - kdb

I'd like to be able to select rows in batches from a very large keyed table being stored remotely on disk. As a toy example to test my function I set up the following tables t and nt...
t:([sym:110?`A`aa`Abc`B`bb`Bac];px:110?10f;id:1+til 110)
nt:0#t
I select only the records whose sym begins with the character "A", count the rows, divide the count by the number of rows I would like to fetch per function call (10), and round that up to the nearest whole number...
aRec:select from t where sym like "A*"
counter:count aRec
divy:counter%10
divyUP:ceiling divy
Next I set an idx variable to 0 and write an if statement as the parameterized function. This checks if idx equals divyUP. If not, then it should select the first 10 rows of aRec, upsert those to the nt table, increment the function argument, x, by 10, and increment the idx variable by 1. Once the idx variable and divyUP are equal it should exit the function...
idx:0
batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
However when I call the function it returns a type error...
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
^
I've tried using it with sublist too, though I get the same result...
batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
^
However issuing either of those above commands outside of the function both return the expected results...
q)select[0 10] from aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
q)0 10 sublist aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25

The issue is that select[] and sublist expect a list as input, but x 10 is not a list. Space-separated values form a simple list only when every item is a literal, as in 0 10. As soon as one item is a variable, q reads x 10 as applying x to 10, which raises 'type when x is an atom. To build a list that contains a variable, wrap the items in parentheses and separate them with semicolons.
q) x:2
q) (1;x) / (1 2)
Select command: Change input to (x;10) to make it work.
q) t:([]id:1 2 3; v: 3 4 5)
q) {select[(x;2)] from t} 1
id v
----
2  4
3  5
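Another alternative is to use the virtual index column i. Note that within is inclusive at both ends, so the upper bound is one less than the number of rows wanted:
q) {select from t where i within x + 0 1} 1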
Sublist Command: Convert left input of the sublist function to a list (x;10).
q) {(x;2) sublist t}1
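Putting this together for the batching problem in the question, here is a minimal sketch. It assumes an unkeyed copy of the toy table (so that upsert appends rather than overwrites by key), and the names fetchBatch and starts are hypothetical. Instead of tracking idx and divyUP inside the function, it computes the start offsets up front and calls the function once per offset.

```q
/ toy data; unkeyed here (an assumption) so that upsert appends rows
t:([]sym:110?`A`aa`Abc`B`bb`Bac;px:110?10f;id:1+til 110)
aRec:select from t where sym like "A*"
nt:0#aRec

/ fetchBatch (hypothetical helper): append n rows of aRec, starting at row x, to nt
fetchBatch:{[x;n] `nt upsert (x;n) sublist aRec;}

/ one start offset per batch: 0 10 20 ...
starts:10*til ceiling (count aRec)%10
fetchBatch[;10] each starts;

/ check that every row was copied, in order
nt~aRec
```

The last batch may be shorter than 10 rows; sublist simply returns whatever rows remain.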

The select[x 10] form fails with a variable written like that. An alternative is a functional select, described in https://code.kx.com/q4m3/9_Queries_q-sql/#912-functional-forms, where you pass the rows you want as the 5th argument.
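As a rough sketch of that functional form, using a small stand-in table (the table t and the function name f here are illustrative only):

```q
t:([]id:1 2 3;v:3 4 5)

/ ?[table; where-clauses; by; columns; rows]
/ ()  no constraints
/ 0b  no grouping
/ ()  all columns
/ (x;2)  two rows starting at row index x
f:{?[t;();0b;();(x;2)]}

f 1
```

Because the row specification is built with (x;2), the variable is part of a proper list and no 'type error arises.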
Hope this helps!

Related

Spark PIVOT performance is very slow on High volume Data

I have a DataFrame with 3 columns and 20,000 rows. I need to convert all 20,000 transid values into columns.
Input table:
|prodid|transid|flag|
|------|-------|----|
|A     |1      |1   |
|B     |2      |1   |
|C     |3      |1   |
and so on.
Expected output, with up to 20,000 columns:
|prodid|1|2|3|
|------|-|-|-|
|A     |1|1|1|
|B     |1|1|1|
|C     |1|1|1|
I have tried the PIVOT/transpose function, but it takes too long on high-volume data: converting 20,000 rows to columns takes around 10 hours.
For example:
val array =a1.select("trans_id").distinct.collect.map(x => x.getString(0)).toSeq
val a2=a1.groupBy("prodid").pivot("trans_id",array).sum("flag")
When I use pivot on 200-300 rows it is fast, but it slows down badly as the number of rows grows.
Can anyone help me find a solution? Is there a way to avoid the PIVOT function, since it only seems suitable for low-volume conversion? How should I deal with high-volume data?
I need this conversion for matrix multiplication.
My input for the matrix multiplication looks like the table below, and the final result will be the matrix product.
|col1|col2|col3|col4|
|----|----|----|----|
|1 | 0 | 1 | 0 |
|0 | 1 | 0 | 0 |
|1 | 1 | 1 | 1 |

Reshape [cols;table]

How do I get columns from a table? If they don't exist it's ok to get them as null columns.
Trying reshape#:
q)d:`a`b!1 2
q)enlist d
a b
---
1 2
q)`a`c#d
a| 1
c|
q)`a`c#enlist d
'c
[0] `a`c#enlist d
^
Why does the reshape (#) operator not work on a table? It could easily act on each row (which is a dict) and combine the results. So I'm forced to write:
q)`a`c#/:enlist d
a c
---
1
Is it the shortest way?
Any key you try to take (#) which is not present in a dictionary will be assigned a null value of the same type as the first value in the dictionary. Similar behaviour is not available for tables.
q)`a`c#`a`b!(1 2;())
a| 1 2
c| `long$()
q)`b`c#`a`b!(();1 2)
b| 1 2
c| ()
As you mentioned, each-right (/:) acts on each row of the table, i.e. each dictionary. If you start from a dictionary, you can instead take from the dictionary first and enlist the result, rather than using an iterator to split the table back into dictionaries. This returns the same output and is slightly faster.
q)d:`a`b!1 2
q)enlist`a`c#d
a c
---
1
q)(`a`c#/:enlist d)~enlist`a`c#d
1b
q)\ts:1000000 enlist`a`c#d
395 864
q)\ts:1000000 `a`c#/:enlist d
796 880
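The enlist shortcut only applies when the starting point is a single dictionary. For a table that already has several rows, each-right is still the way to go. A small sketch, assuming an illustrative two-row table t and result name r:

```q
t:([]a:1 2;b:3 4)

/ take keys a and c from every row dictionary;
/ c is missing, so each row gets a null of the first value's type
r:`a`c#/:t

cols r
r
```

The list of conforming dictionaries collapses back into a table, so r is a table with columns a and c.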

Creating nested columns in kdb table

I'd like to create a nested list for one of my table's columns, but I'm unsure of the syntax to use. If for instance I had the following table...
q)t:([]submitter:`A`B`C; code:3?100; status:110b)
q)t
submitter code status
---------------------
A 2 1
B 39 1
C 64 0
I want to do something similar to below. However this will add the additional column x to the table and place the value there instead of creating a compound list for the code column....
q)update code,:77 from t where status<>1b
submitter code status x
------------------------
A 2 1
B 39 1
C 64 0 77
If it were a dictionary with a single value I would do the following...
q)d:`submitter`code`status!(`A;1?100;1)
q)d
submitter| `A
code     | ,88
status   | 1
q)d[`code],:99
q)d
submitter| `A
code     | 88 99
status   | 1
How do I perform the same operation on a table with multiple rows?
My desired output would look like...
q)t
submitter code status
----------------------
A 2 1
B 39 1
C 64 77 0
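This also works, and it doesn't require you to change the column type in advance. The two-item list (77;()) is indexed by the boolean status column, and join-each (,') appends the result to each code: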
q)update code:(code,'(77;())status) from t
submitter code status
---------------------
A ,12 1
B ,10 1
C 1 77 0
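To see why this works, the expression can be taken apart outside the table. This is only an illustrative walk-through with literal values from the question; the name extra is made up for the example.

```q
status:110b
code:2 39 64

/ index the pair (77;()) with the booleans:
/ 1b picks the empty list, 0b picks 77
extra:(77;())status

/ join each code with its matching item:
/ joining with () gives a one-item list, joining with 77 gives a two-item list
code,'extra
```

Every item of the result is a list, which is why the first two rows display as ,12 and ,10 in the output above.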
You can't change the column type of your code column on-the-fly like you intend to do.
Instead, you first have to update the type of the column code to a list of long instead of long:
q)meta t
c | t f a
---------| -----
submitter| s
code | j
status | b
Update the type:
t: update enlist each code from t
Now the type of code is "J", which is indeed a list of long:
q)meta t
c | t f a
---------| -----
submitter| s
code | J
status | b
And then you can append an element to the code like this:
t:update code:{x,77} each code from t where status<>1b
q)t
submitter code status
----------------------
A ,2 1
B ,39 1
C 64 77 0

Kdb upsert with conditional syntax?

Is there a way I can upsert in kdb where the following occurs:
If key is not present, insert values
If key is present, check if current value is greater
A) If so, perform no action
B) If not, update values
Something like:
job upsert ([title: job1] time: enlist 1 where time > 1)
Since you're using a keyed table, and you want to change values only if they're greater and add in new keys and values, you can try avoiding upsert entirely:
t:([job:`a`b`c] val: 4 4 4) /current table
nt:([job:`a`c`d]val: 6 1 5) /new values to check
t|nt
job| val
---| ---
a | 6
b | 4
c | 4
d | 5
This will automatically add keys that aren't there, and update the current value to the new value if the new value is larger.
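One caveat worth checking before relying on this: with more than one value column, | is applied column by column, so a result row can mix values from both tables. A small sketch with made-up two-column tables (t1, t2 and r are illustrative names):

```q
t1:([name:`john`jim] age:57 64; height:156 167)
t2:([name:`john`jim] age:98 50; height:220 240)

r:t1|t2

/ for jim the larger age comes from t1 and the larger height from t2
r`jim
```

If whole rows should be replaced based on a single column, a conditional upsert is the safer choice.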
Please find a solution and explanation below. I'll edit if I come up with a better way. I also hope I interpreted the question correctly.
q)t1
name | age height
-------| ----------
michael| 26 173
john | 57 156
sam | 23 134
jimmy | 83 183
conor | 32 145
jim | 64 167
q)t2
name age height
---------------
john 98 220
mary 24 230
jim 50 240
q)t1 upsert t2 where{$[all null n:x[y`name];1b;y[`age]>n[`age]]}[t1;]each t2
name | age height
-------| ----------
michael| 26 173
john | 98 220
sam | 23 134
jimmy | 83 183
conor | 32 145
jim | 64 167
mary | 24 230
q)
Explanation:
The function takes 2 args: x is the keyed table t1 and y is each record from t2 (as a dictionary). First we extract the name from the t2 record (y`name), index into the keyed table with it, and store the result in the local variable n. If the name exists, the corresponding record is returned from x as a dictionary (and all null n is false); otherwise a record of nulls is returned (and all null n is true).
If the name is not found in t1, the function returns 1b. Otherwise we compare the ages of the two records: n[`age] is the age in t1 for the matching name, and y[`age] is the age in this record of t2. If y[`age] is greater, we return 1b; otherwise we return 0b.
Applied to each record of t2, the function yields a list of booleans, one per record. 1b is returned in two scenarios:
(1) This particular name from t2 has no match in t1.
(2) This name from t2 does have a match in t1 and the age is greater than the corresponding age in t1.
0b is returned when the age in t2 is not greater than the corresponding age in t1.
In our example the result of the function is 110b and after we apply where to this, the result is the indexes where the list value is true i.e. where 110b --> 0 1. We use this list to index into t2 which returns the first 2 records from t2(these are either new records or records where the age is greater than what is referenced in t1), then we simply upsert this into t1.
I hope this helps and hope some better solutions come along.
For a table, a key, and a value: upsert the tuple if the key is new or the value exceeds the existing value.
q)t:([job:`a`b`c] val: 4 4 4) /current table
q)t[`a]|:6 /old key, higher value
q)t
job| val
---| ---
a | 6
b | 4
c | 4
q)t[`c]|:1 /old key, lower value
q)t
job| val
---| ---
a | 6
b | 4
c | 4
q)t[`d]|:5 /new key
q)t
job| val
---| ---
a | 6
b | 4
c | 4
d | 5
Remarks
A keyed table with a single data column could perhaps be a dictionary.
Amending through an operator works also with a new key.
Upserting a table (or dictionary) of new records is more efficient and simpler than updating a single tuple.
q)nt:([job:`a`c`d]val: 6 1 5) /new values to check
q)t|nt /maximum of two tables
job| val
---| ---
a | 6
b | 4
c | 4
d | 5
or just
q)t[([]job:`a`c`d)]|:([]val:6 1 5)
Simple-looking primitives such as maximum (|) repay careful study.

How to aggregate more than one column of many rows inside one in PostgreSQL?

In case my question is a bit obscure, here is what I mean. We can aggregate one column of multiple rows using array_agg. For instance, I have this table:
foo | bar | baz
-------+-------+-------
1 | 10 | 20
1 | 12 | 23
1 | 15 | 26
1 | 16 | 21
If I invoke this query :
select
foo,
array_agg(bar) as bars
from table
group by (foo)
resulting in :
foo | bars
-------+----------------
1 | {10,12,15,16}
What would be the query to have this table (using bar,baz) ?
foo | barbazs
-------+------------------------------------
1 | {{10,20},{12,23},{15,26},{16,21}}
I checked functions-aggregate (postgresql.org), but there doesn't seem to be any function that has that effect. Or am I missing something?
array_agg has arrays as possible input values.
We just need a way to build an array from the two input columns bar and baz, which can be done using the ARRAY constructor:
SELECT foo, array_agg(ARRAY[bar, baz]) as barbaz FROM table GROUP BY foo;
foo | barbaz
-----+-----------------------------------
1 | {{10,20},{12,23},{15,26},{16,21}}
Note : It also works with DISTINCT (...array_agg(distinct array[bar,baz])...)