I'm trying to write a function using kdb+ which will look at the list, and find the values that simply meet two conditions.
Let's call the list DR (for data range). And I want a function that will combine these two conditions
"DR where (DR mod 7) in 2"
and
"DR where (DR.dd) in 1"
I'm able to apply them one at a time but I really need to combine them into one function. I was hoping I could do this
"DR was (DR.dd mod 7) in 2 and DR where (DR.dd) in 1"
but this obviously didn't work. Any advice?
You can utilize the and function to help with this, which is the same as &:
q)dr:.z.d+til 100
q)and
&
q)2=dr mod 7
10000001000000100000010000001000000100000010000001000000100000010000001000000..
q)1=dr.dd
00000000000000000000000001000000000000000000000000000000100000000000000000000..
q)(1=dr.dd)&2=dr mod 7
00000000000000000000000000000000000000000000000000000000100000000000000000000..
q)dr where(1=dr.dd)&2=dr mod 7
2021.02.01 2021.03.01
Its necessary wrap the first part in brackets due to how kdb reads code from right to left. This format changes slightly when doing this in a where clause, the brackets arent needed due to how each where clause is parsed, that is each clause between the commas are parsed seperately. However it is essentially doing the same thing as the code above.
q)t:([]date:dr)
q)select from t where 1=date.dd,2=date mod 7
date
----------
2021.02.01
2021.03.01
You could also do this using min to achieve similar results, like so:
DR where min(1=DR.dd;2=DR mod 7)
Related
I have a table with about 500 variables and 2000 cases. The type of these variables varies. My supervisor has asked me to produce a table listing all the numeric variables, along with their maximums and minimums. I am supposed to use SPSS because R apparently messes up the value labels.
I've only done very basic things in SPSS before this, like finding statistics for single variables, and I'm not sure how to do this. I think I should probably do something like:
*Create new table*
DATASET DECLARE maxAndMin.
*Loop through all variables: Use conditional statement to identify numeric variables*
DO REPEAT R=var1 TO varN.
FREQUENCIES VARIABLES /STATISTICS=MINIMUM
END REPEAT
*Find max and minimum*
I'm not sure how to go about this though. Any suggestions would be appreciated.
The following code will first make a list of all numeric variables in the dataset (and store it in a macro called !nums) and then it will run an analysis of those variables to tell you the mean, maximum and minimum of each:
SPSSINC SELECT VARIABLES MACRONAME="!nums" /PROPERTIES TYPE= NUMERIC.
DESCRIPTIVES !nums /STATISTICS=MEAN MIN MAX.
You can use the following code to create a tiny dataset to test the above code on:
data list list/n1 (f1) t1(a1) n2(f1) t2(a1).
begin data
1 "a" 34 "b"
2 "a" 23 "b"
3 "a" 52 "b"
4 "a" 71 "b"
end data.
If SUMMARIZE produces a nice enough table for you, here is a "non-extension" way of doing it.
file handle mydata /name="<whatever/wherever>".
data list free /x (f1) y (a5) z (F4.2).
begin data.
1 yes 45.67
2 no 32.00
3 maybe .
4 yes 22.02
5 no 12.79
end data.
oms select tables
/destination format=sav outfile=mydata
/if subtypes="Descriptive Statistics" /tag="x".
des var all.
omsend tag="x".
get file mydata.
summarize Var1 Mean Minimum Maximum /format list nocasenum nototal
/cells none /statistics none /title "Numeric Variables Only".
or use a DATASET command instead of file handle if you don't need the file on disk.
Using LIKE '%[0-9]%' to find "yellow 2 x 4 plate".
This is a simplified example that isn't working as I believe it should.
Table (command) with single column (commandtext, varchar).
Single entry in commandtext: yellow 2 x 4 plate.
SELECT
*
FROM command
WHERE commandtext LIKE '%[0-9]%'
Results = Empty set.
I expect that all this should be looking for is a digit between 0-9 surrounded by anythings else.
I am CLEARLY not getting something here...
As you have found out, you can use SQL standard SIMILAR TO:
... WHERE col SIMILAR TO '%[[:digit:]]%'
You could also use POSIX regular expressions:
... WHERE col ~ '[[:digit:]]'
I'm looking for a way to write functional select in KDB such that the where phrases is only apply if the column exists (on order to avoid error). If the column doesn't exist, it defaults to true.
I tried this but it didn't work
enlist(|;enlist(in;`colname;key flip table);enlist(in;`colname;filteredValues[`colname]));
I tried to write a simple boolean expression and use parse to get my functional form
(table[`colname] in values)|(not `colname in key flip table)
But kdb doesn't have short circuit so the left-hand expression is still evaluated despite the right-hand expression evaluating to true. This caused a weird output boolean$() which is a list of booleans all evaluating to false 0b
Any help is appreciated. Thanks!
EDIT 1: I have to join a series of condition with parameter specified in the dictionary filters
cond,:(,/) {[l;k] enlist(in;k;enlist l[k])}[filters]'[a:(key filters)]
Then I pass this cond on and it gets executed on a few different selects on different tables. How can I make sure that whatever conditional expression I put in place of enlist(in;k;enlist l[k] will only get evaluated as the select statement gets executed.
You can use the if-else conditional $ here to do what you want
For example:
q)$[`bid in cols`quotes;enlist (>;`bid;35);()]
> `bid 35
q)$[`bad in cols`quotes;enlist (>;`bad;35);()]
Note that in the second example, the return is an empty list, as this column isn't in quotes table
So you can put this into the functional select like so:
?[`quotes;$[`bid in cols`quotes;enlist (>;`bid;35);()];0b;()]
and the where clause will be applied the the column is present, otherwise no where clause will be applied:
q)count ?[`quotes;$[`bid in cols`quotes;enlist (>;`bid;35);()];0b;()]
541 //where clause applied, table filtered
q)count ?[`quotes;$[`bad in cols`quotes;enlist (>;`bad;35);()];0b;()]
1000 //where clause not applied, full table returned
Hope this helps
Jonathon
AquaQ Analytics
EDIT: If I'm understanding your updated question correctly, you might be able to do something a like the following. Firstly, let's define an example "filters" dictionary:
q)filters:`a`b`c!(1 2 3;"abc";`d`e`f)
q)filters
a| 1 2 3
b| a b c
c| d e f
So here we are assuming a few different columns of different types, for illustration purposes. You can build up your list of where clauses like so:
q)(in),'flip (key filters;value filters)
in `a 1 2 3
in `b "abc"
in `c `d`e`f
(this is equivalent to the code you had to generate cond, but it's a little neater & more efficient - you also have the values enlisted, which isn't necessary)
You could then use a vector conditional to generate your list of where clauses to apply to a given table e.g.
q)t:([] a:1 2 3 4 5 6;b:"adcghf")
q)?[key[filters] in cols[t];(in),'flip (key filters;value filters);count[filters]#()]
(in;`a;,1 2 3)
(in;`b;,"abc")
()
As you can see, in this example the table "t" has columns a and b, but not c. So using the vector conditional, you get the where clauses for a and b but not c.
Finally to actually apply this list of output where clauses to the table, you can make use of an over to apply each in turn:
q)l:?[key[filters] in cols[t];(in),'flip (key filters;value filters);count[filters]#()]
q){?[x;$[y~();y;enlist y];0b;()]}/[t;l]
a b
---
1 a
3 c
One thing to note here is that in the where clause of the functional select we need to check if y is an empty list - this is so we can enlist it if it is not an empty list
Hope this helps
I want to understand what is happening with the below statement:
sum(til 100) where 000011110000b
That line evaluates to 22, and I can't figure out why. sum(til 100) is 4950. where 000011110000b returns the list 4 5 6 7. The kdb reference page doesn't seem to explain this use case. Why does the above line evaluate to 22?
Also, why does the below line result in error
4950 where 000011110000b
Square brackets are used for function call arguments, rather than parentheses. So the above is being interpreted as:
sum[(til 100) where 000011110000b]
And you can probably see now why that would evaluate to 22, i.e. it is the sum of the 5th,6th,7th,8th values of the list til 100, i.e. (0...99), because where is indexing into til 100 using the boolean list 000011110000b
q)til[100] where 000011110000b
4 5 6 7
As you will have noticed, you can omit square brackets when calling functions; but when doing so you need to be sure it will be parsed/interpreted as intended. In general, code is evaluated right-to-left so in this case (til 100) where 000011110000b was evaluated first, and the result passed to the sum function.
Regarding 4950 where 000011110000b - again if we think right to left, the interpreter is trying to pass the result of where 000011110000b to the function 4590. Although 4590 isn't a named function in kdb - this is the format used for IPC, i.e. executing commands on other kdb processes over TCP. So you are telling kdb to use the OS file descriptor number 4590 (which almost certainly won't correspond to a valid TCP connection) to run the command where 000011110000b. See the "message formats" section here for more details:
http://code.kx.com/q/tutorials/startingq/ipc/
I need to take a pipe that has a column of labels with associated values, and pivot that pipe so that there is a column for each label with the correct values in each column. So f example if I have this:
Id Label Value
1 Red 5
1 Blue 6
2 Red 7
2 Blue 8
3 Red 9
3 Blue 10
I need to turn it into this:
ID Red Blue
1 5 6
2 7 8
3 9 10
I know how to do this using the pivot command, but I have to explicitly know the values of the labels. How can I can dynamically read the labels from the “label” column into a list that I can then pass into the pivot command? I have tried to create list with:
pipe.groupBy('id) {_.toList('label) }
, but I get a type mismatch saying it found a symbol but is expecting (cascading.tuple.Fields, cascading.tuple.Fields). Also, from reading online, it sounds like using toList is frowned upon. The number of things in 'label is finite and not that big (30-50 items maybe), but may be different depending on what sample of data I am working with.
Any suggestions you have would be great. Thanks very much!
I think you're on the right track, you just need to map the desired values to Symbols:
val newHeaders = lines
.map(_.split(" "))
.map(a=>a(1))
.distinct
.map(f=>Symbol(f))
.toList
The Execution type will help you to combine with the subsequent pivot, for performance reasons.
Note that I'm using a TypedPipe for the lines variable.
If you want your code to be super-concise, you could combine lines 1 & 2, but it's just a stylistic choice:
map(_.split(" ")(1))
Try using Execution to get the list of values from the data. More info on executions: https://github.com/twitter/scalding/wiki/Calling-Scalding-from-inside-your-application