KDB get substring - kdb

How can I add a column containing a substring of another column that contains symbols? That is, go from
t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbA1`aaaA2`aaaA3;1 2 3 1)
date sym pos
d1 aaaA1 1
d1 bbA1 2
d1 aaaA2 3
d2 aaaA3 1
to
t:flip `date`sym`pos`ext!(`d1`d1`d1`d2;`aaaA1`bbA1`aaaA2`aaaA3;1 2 3 1;`aaa`bb`aaa`aaa)
date sym pos ext
d1 aaaA1 1 aaa
d1 bbA1 2 bb
d1 aaaA2 3 aaa
d2 aaaA3 1 aaa
EDIT: The substring should always contain the first len(symbol) - 2 characters of the symbol, so in my example above, aaa for aaaAx and bb for bbAx.

If the substring you wish to extract has a constant length, you can do something like the following:
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA3;1 2 3 1)
q)update ext:`$3#'string sym from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbbA1 2 bbb
d1 aaaA2 3 aaa
d2 aaaA3 1 aaa
If that's not the case, please provide some more detail on how the substring to be extracted can be identified.
Hope this helps
Jonathon

There may be a cleverer way of doing this, but this is what I first came up with.
t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA3;1 2 3 1)
t: update ctr: {-2 + count string x} each sym from t;
t:{[x] :update ext:x[`ctr]#string(x[`sym]) from x} each t;
The 2nd line applies your logic of len(symbol) - 2 and stores the result in 'ctr'.
The 3rd line takes the first 'ctr' characters of each symbol's string form.
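The same len(symbol) - 2 logic can also be written as a single vectorised update, without the intermediate ctr column or the row-by-row each. This is only a sketch of an alternative, not the approach above; the local name s is introduced here purely for illustration. It relies on q evaluating right to left, so s is assigned before count each s is computed, and it casts the result back to symbols so that ext has the same type as sym.

```q
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA3;1 2 3 1)
q)update ext:{`$(-2+count each s)#'s:string x}sym from t  / take the first (count-2) chars of each sym
```

For the sample table this should give aaa, bbb, aaa and aaa in the ext column.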

You didn’t say so, but this is kdb+, so let’s assume that your table is long and that your sym column has duplicates.
You don’t need to convert all the symbols to strings and back: only the distinct ones. (In this example, I’ve changed one of the symbols to create a duplicate.)
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbbA1`aaaA2`aaaA1;1 2 3 1)
q)update ext:{nub:distinct x;(`$-2 _'string nub)nub?x}sym from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbbA1 2 bbb
d1 aaaA2 3 aaa
d2 aaaA1 1 aaa
The utility .Q.fu applies a function to the distinct items.
q)update ext:.Q.fu[{`$-2 _'string x};sym] from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbbA1 2 bbb
d1 aaaA2 3 aaa
d2 aaaA1 1 aaa
This operation would be faster if the sym column were already stored as an enumeration, because the distinct values would then be available without calculation.

Using drop:
q)t:flip `date`sym`pos!(`d1`d1`d1`d2;`aaaA1`bbA1`aaaA2`aaaA3;1 2 3 1)
q)update ext:`$-2_'string sym from t
date sym pos ext
------------------
d1 aaaA1 1 aaa
d1 bbA1 2 bb
d1 aaaA2 3 aaa
d2 aaaA3 1 aaa

Related

spark scala - How to create a UDT (cassandra user data type [list <UDT>]) from a CSV file

I have a CSV file with ID, ID1, ID2, col1, col2, and col3 fields...I need to group the record based on the ID field and convert it to a UDT list.
ex:
ID ID1 ID2 COL1 COL2 COL3 COL4
1 AA 01 A B C D
1 AA 02 A B C D
1 AA 02 B C D E
1 AA 03 A B C D
2 BB 01 A B C D
2 BB 02 A B C D
3 CC 01 A B C D
3 CC 01 B C D E
THE OUTPUT SHOULD BE
1,[{ID1:"AA",ID2:"01"},{ID1:"AA",ID2:"02"},{ID1:"AA",ID2:"03"}]
2,[{ID1:"BB",ID2:"01"},{ID1:"BB",ID2:"02"}]
3,[{ID1:"CC",ID2:"01"}] (grouped by ID; rest of the ID fields in a list array)
I tried collect_list / collect_set to group the fields but could not convert them to an array.

kdb/q -- how to number the rows by certain groupings

I have tables with date and sym columns, and each date might have multiple syms. I want to number the occurrences of each symbol within each date.
For example:
date sym
-------------------
2019.06.04 ABC
2019.06.04 DEF
2019.06.04 ABC
2019.06.05 DEF
2019.06.05 ABC
will give me
date sym c
-------------------
2019.06.04 ABC 1
2019.06.04 DEF 1
2019.06.04 ABC 2 / here ABC appears for the second time on this date.
2019.06.05 DEF 1
2019.06.05 ABC 1
This may be a little cleaner: here the c column is just a running count of the rows within each combination of date and sym.
q)t:([]date:2019.06.04+0 0 0 1 1;sym:`ABC`DEF`ABC`DEF`ABC)
q)update c:sums i=i by date,sym from t
date sym c
----------------
2019.06.04 ABC 1
2019.06.04 DEF 1
2019.06.04 ABC 2
2019.06.05 DEF 1
2019.06.05 ABC 1
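An equivalent way to number the rows within each group, shown here only as a sketch, is to build the sequence 1, 2, ... directly from the group size with til, instead of taking a running sum of i=i.

```q
q)t:([]date:2019.06.04+0 0 0 1 1;sym:`ABC`DEF`ABC`DEF`ABC)
q)update c:1+til count i by date,sym from t  / c runs from 1 to the group size within each date,sym group
```

This should produce the same c values as the sums version.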
To count the occurrences of syms by date across all of the tables in an HDB, we can run a count by date for each of the partitioned tables listed in .Q.pt and then combine the results with pj (plus join), since each table is keyed on the same columns. As pj keeps only the keys of its left argument, we also need to append any keys that appear only in the right argument, so that no rows are dropped when tables are missing different syms.
q)cntTabs:{2!0!update c:count each sym,sym:first each sym from select sym by date from x} each .Q.pt
q){t:pj[x;y];t,k!y k:key[y] except key[t]}/[cntTabs]

How to concatenate columns in update statement

I have this table:
t:([] name:("aaa";"bbb";"ccc";"dddd"); side:(1;2;1;2))
Now I want to add a new column "concatenated", which contains a symbol that is the concatenation of both values for each row.
I would assume that I have to do this with an each-both adverb, but this does not work:
update concatenated:((`$name),'(`$side)) from t
How would I have to change this? Thanks.
Your attempt is close; it works if you convert the 'side' column to strings first.
I've added two versions: one where the two values are kept as a pair of separate symbols, and one where they are merged into a single symbol.
q)t:([] name:("aaa";"bbb";"ccc";"dddd"); side:(1;2;1;2))
q)update conc:((`$name),'`$string side) from t
name side conc
------------------
"aaa" 1 aaa 1
"bbb" 2 bbb 2
"ccc" 1 ccc 1
"dddd" 2 dddd 2
q)update conc:(`$name,'string side) from t
name side conc
-----------------
"aaa" 1 aaa1
"bbb" 2 bbb2
"ccc" 1 ccc1
"dddd" 2 dddd2
Hope this helps

Filtering inside LOD expression

Product Item Status
A aa 0
A aaa 0
A aaaa 0
A aaaaa 1
B bb 2
B bbb 0
B bbbb 3
C cc 4
C cccc 5
I need to calculate the count of items which have status = '0' at the Product level. So my output shall be:
Product Count
A 3
B 1
My formula is as follows:
{FIXED [Product]: COUNTD([Item]) }
and I dragged the status into filters.
But this is not working. Can someone help?
Try this:
{ FIXED [Product]:COUNT(IF [Status]=0 THEN [Status] END)}

Getting wrong Count of the Combinations Between Two Fields of a File thru OUTFIL

I need to include this condition:
1) Total number of records per combination of field1 and field3 (INCLUDE=(1,2,8,3,CH,A))
INPUT FILE: FIELD1 and FIELD3 have 5 combinations, as you can see in the example below
field1 field2 field3 field4
AA 00000 123 ABC
AA 00000 123 ABC
AA 00000 456 ABC
BB 00000 123 ABC
BB 00000 123 ABC
BB 00000 789 ABC
AA 00000 567 ABC
OUTPUT FILE: gets 5 rows, one for each combination, giving the number of occurrences of each
FIELD1 FIELD3 COUNT-OF-COMBINATION
AA 123 2
AA 456 1
AA 567 1
BB 123 2
BB 789 1
My method is:
//SYSIN DD *
SORT FIELDS=COPY
OPTION COPY
OUTFIL REMOVECC,NODETAIL,
TRAILER1=(1,2,'ON',8,3,'=',COUNT=(M11,LENGTH=10)))
/*
The answer I got is:
AA ON 123 = 7
which is wrong.
It should have been:
AA ON 123 = 2
AA ON 456 = 1
AA ON 567 = 1
BB ON 123 = 2
BB ON 789 = 1
You have:
SORT FIELDS=COPY
OPTION COPY
OUTFIL REMOVECC,NODETAIL,
TRAILER1=(1,2,'ON',8,3,'=',COUNT=(M11,LENGTH=10)))
First problem is you have SORT FIELDS=COPY and OPTION COPY. These mean the same thing. Remove one or the other (I tend to use OPTION COPY).
Next, you have a spare right parenthesis.
Then you are using TRAILER1. There are three types of TRAILERn: 1 is "at the end of the report"; 2 is at the end of a page; 3 is at a control break.
You use TRAILER1, so at the end of your file you get one record, containing file totals.
After that, your positions for the TRAILER1 match the output, not the input file.
Which brings us to the fact that you are not running those Sort control cards with that data. The control cards have a syntax error which means they don't run. Correcting the cards and retaining the TRAILER1 gets AAON567=0000000007.
Which brings us to the control break, which is what you have missing.
You define a control break with SECTIONS. TRAILER3 is part of SECTIONS.
Fixing everything except your output format:
OPTION COPY
OUTFIL REMOVECC,NODETAIL,
SECTIONS=(1,2,
15,3,
TRAILER3=(1,2,
'ON',
15,3,
'=',
COUNT=(M11,
LENGTH=10)))
Which gives you:
AAON123=0000000002
AAON456=0000000001
BBON123=0000000002
BBON789=0000000001
AAON567=0000000001
If you want column headings, look at how to use HEADER3 (HEADER1 and HEADER2 would also work in this simple case). If you want "page totals" look at TRAILER2. If you want file totals, use TRAILER1.