Adding a column based on other columns in kdb+

I have a table ("ibmqt") with a number of columns, and I would like to add a new column, containing boolean values indicating for each row whether one column ("bid") is greater than or equal to another column ("ask").
My most successful attempt so far is this:
ibmqt: update (pricecross:select bid>=ask from ibmqt) from ibmqt
However, this results in the following:
time sym bid ask bsize asize pricecross
----------------------------------------------------
00:00:59.063 IBM 43.53 43.57 10000 9000 (,`ask)!,0b
00:01:03.070 IBM 43.54 43.59 6500 3000 (,`ask)!,0b
00:02:31.911 IBM 43.56 43.6 500 4500 (,`ask)!,0b
00:03:43.070 IBM 43.56 43.56 10000 2500 (,`ask)!,1b
00:06:01.170 IBM 43.54 43.56 8500 4500 (,`ask)!,0b
00:06:11.081 IBM 43.56 43.58 500 1500 (,`ask)!,0b
00:08:15.126 IBM 43.55 43.57 1500 9000 (,`ask)!,0b
Obviously in the "pricecross" column I just want 0, 0, 0, 1, 0 etc.
Any suggestions?

There is no need for a nested select. This will do what you need:
ibmqt:update pricecross:bid>=ask from ibmqt
Or you can update ibmqt in place:
update pricecross:bid>=ask from `ibmqt
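To see the difference between the two forms, here is a minimal sketch with a throwaway table t (by value returns a modified copy; by name amends the global in place):
q)t:([]a:1 2 3)
q)update b:2*a from t    / by value: returns a new table, t itself is unchanged
a b
---
1 2
2 4
3 6
q)update b:2*a from `t   / by name: amends t in place and returns the name
`t
q)t
a b
---
1 2
2 4
3 6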
q is an array language, so bid>=ask compares the two columns pairwise and returns a list of booleans. This illustrates the idea:
1 2 3 >= 0 2 4 / 110b
The list of booleans is then assigned to a new column pricecross.

Since >= is overloaded to work with both atoms and lists, pricecross:bid>=ask is the best solution here:
q)update pricecross:bid>=ask from ibmqt
But there are slightly different ways of getting the same result:
q)update pricecross:bid>='ask from ibmqt
q)update pricecross:>='[bid;ask] from ibmqt
This is particularly useful when a dyadic function works only with atoms:
q)update pricetolerance:tolerance'[bid;ask] from ibmqt
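For illustration, a minimal sketch with a hypothetical tolerance function (the name and the 0.05 threshold are invented for this example); because Cond ($[;;]) only accepts an atom condition, the function works on atoms but fails on vectors, which is exactly when each-both (') earns its keep:
q)tolerance:{[b;a] $[b>=a-0.05;1b;0b]}   / hypothetical atom-only dyad: $[;;] needs an atom condition
q)tolerance[43.53;43.57]
1b
q)tolerance[43.53 43.54;43.57 43.59]     / errors: the condition is not an atom
q)update pricetolerance:tolerance'[bid;ask] from ibmqt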

Related

What is the meaning of the `s attribute on a table?

In the Abridged Q Language Manual, Arthur mentioned:
`s#table marks the table to use binary search and marks first column sorted
And if we look into version 3.6:
N:1000000;
t1:t2:([]n:til N; m:N?`6);
t1:update `p#n from t1;
t2:`s#t2;
(meta t1)[`n]`a / `p
(meta t2)[`n]`a / `p
attr t1 / `
attr t2 / `s
\ts:10000 select count i from t1 where n in 1000?N
/ ~7000
\ts:10000 select count i from t2 where n in 1000?N
/ ~7000
we find that yes, t2 has the s attribute.
But for some reason the attribute on the first column is not s but p. The search times are also the same, and the sizes of both tables with attributes are the same - I used the objsize function described in an AquaQ blog post to check.
So, in 3.6+ versions of q, are there any differences between `s#table and a table with the `p# attribute on its first column?
I think the only way that `s# on the table itself would improve search times is if you were doing lookups using ? as described here: https://code.kx.com/q/ref/find/#searching-tables
q)\ts:100000 t1?t1[0]
105 800
q)\ts:100000 t2?t2[0]
86 800
q)
q)\ts:100000 t1?t1[500000]
108 800
q)\ts:100000 t2?t2[500000]
83 800
q)
q)\ts:100000 t1?t1[999999]
107 800
q)\ts:100000 t2?t2[999999]
83 800
It behaves differently for a keyed table (turns it into a step function) but I think that's beyond the scope of your original question.
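For completeness, a minimal sketch of that step behaviour (the keyed table here is invented for the example):
q)kt:([k:1 5 10]v:`a`b`c)
q)kt 7        / no exact key match: a null result
v|
q)kt:`s#kt
q)kt 7        / step lookup: the value at the last key <= 7
v| b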

Creating "running total" in Scala

I have a history DataFrame that has the following structure
id amount date
12345 150 1/1/2016
12345 50 1/4/2016
12345 250 1/4/2016
12345 950 1/9/2016
I would like a cumulative sum of the amount with respect to date, where the sum for a given day covers all entries with the same ID up to and including that day. Results should be generated even for dates that have no entries in the source DataFrame, as long as they fall between the start and end dates. The expected output for the example input is shown below.
ID date cumulative_sum
12345 1/1/2016 150
12345 1/2/2016 150
12345 1/3/2016 150
12345 1/4/2016 450
12345 1/5/2016 450
12345 1/6/2016 450
12345 1/7/2016 450
12345 1/8/2016 450
12345 1/9/2016 1400
Does anyone know how to calculate this sort of running total?
Basically, you first find subtotals for each date (doesn't really have to happen as a separate step, but this makes things a little more generic - I'll explain why below):
val subtotals = data
  .groupBy(_.date)
  .mapValues(_.map(_.amount).sum)
  .withDefault(_ => 0)
Now, you can scan through the date range, and sum things up with something like this:
(0 to numberOfDays)
  .map(n => startDate.plusDays(n))
  .scanLeft((null: LocalDate) -> 0) { case ((_, sum), date) =>
    date -> (subtotals(date) + sum)
  }
  .drop(1)
This is how you would do it in "plain Scala". Now, because you mentioned a "DataFrame" in your question, I suspect you are actually using Spark. That makes it a little more complicated, because the data may be distributed. The good news is that, while you may have a huge number of transactions, there aren't enough days in the history of the world to make it impossible to process the aggregated data as a single task.
So, you just need to replace the first step above with a distributed equivalent:
val subtotals = dataFrame
  .rdd
  .map(tx => tx.date -> tx.amount)
  .reduceByKey(_ + _)
  .collect
  .toMap
  .withDefault(_ => 0) // as above, so dates with no entries contribute 0
And now you can do the second step exactly as shown above.

Iterate over current row values in kdb query

Consider the table:
q)trade
stock price amt time
-----------------------------
ibm 121.3 1000 09:03:06.000
bac 5.76 500 09:03:23.000
usb 8.19 800 09:04:01.000
and the list:
q)x: 10000 20000
The following query:
q)select from trade where price < x[first where (x - price) > 100f]
'length
fails as above. How can I pass the current row value of price in each iteration of the search query?
While price[0] in the square brackets above works, that's obviously not what I want. I even tried price[i] but that gives the same error.
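For reference, the 'length is thrown by x - price: x has two items while the price column has three, so the pairwise subtraction cannot conform. One way around it (a sketch, untested against the original data) is to evaluate the condition row by row with each, so that x - p subtracts a single price atom from the whole list:
q)x - 121.3 5.76 8.19    / 'length: 2 items on the left, 3 on the right
q)select from trade where {[p]p < x first where (x - p)>100f} each price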

Divide records into groups - quick solution

Using an UPDATE command, I need to divide the rows of a PostgreSQL table (selected by a subselect) into groups, identified by an integer value in one of the columns. The groups should all be the same size. The source table contains billions of records.
For example, I need to divide 213 selected rows into groups of 50 records each. The result will be:
1 - 50. row => 1
51 - 100. row => 2
101 - 150. row => 3
151 - 200. row => 4
201 - 213. row => 5
It is no problem to do this with a loop (or with PostgreSQL window functions), but I need it to be very efficient and fast. I can't derive the groups from a sequential id because there may be gaps in the ids.
One idea was to use a random integer generator as a default value for the column, but that is not usable when I need to adjust the group size.
The query below should display 213 rows with a group number from 0 to 4. Just add 1 if you want 1 to 5:
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
An alternative uses a temporary sequence:
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than the row_number version from @Richard, but the difference may not be relevant depending on the specifics.

Stop Crosstab summary display string count from Resetting on each page

I have a crosstab in Crystal Reports XI with a display string that displays a count for each time the condition is met:
Basically, I have a summary column amountspent that I need to compare to avblcredit, counting every order where amountspent (the summary) exceeds avblcredit within each customer group. I then have to display the total number of orders for that customer where the available credit is exceeded.
After much struggle, because I cannot use calculated members in Crystal XI, I created a second, duplicate summary for item expenditures and edited the display string of the second summary to compare itself to avblcredit and count:
global numbervar count;
if currentvalue > avblcredit
then count := count + 1;
count;
The count then increments everywhere it finds the current value (sum of items) greater than the available credit.
This works correctly if the crosstab prints fully on one page; however, if the crosstab extends to the next page, the count resets back to 0.
So basically as an example page 1 looks as follows:
customer 1
orders avblcredit amountspent count itema itemb itemc
ord1 4000 6000 1 2000 3000 1000
ord2 3734 5001 2 1000 2000 2001
ord3 4123 5000 3 4000 1000 0
ord4 2321 5000 4 5000 0 0
ord5 4000 5003 5 1200 3800 3
ord6 4000 6000 6 1000 2000 3000
page 2 with customer 1 group continued:
orders avblcredit amountspent count itema itemb itemc
ord7 4000 6000 1 2000 3000 1000
ord8 3734 5001 2 1000 2000 2001
ord9 4123 5000 3 4000 1000 0
ord10 2321 5000 4 5000 0 0
My question is: how can I keep my count from resetting on each new page?
Thanks
If you can modify your data source, you can do the calculation upstream. Just extend your existing dataset with a new field named something like spentExceedsCredit with a value of 0 or 1. Then you can sum it per customer, use custom formatting in the report, or let the list of customer expenditures exceed a page, without depending on Crystal's global variables.