I am new to kdb and am researching it for a use case: generating time series data from a table of various function inputs. Each row of the table consists of function inputs keyed by an id and segment, and each row will call one function. I have figured out how to identify which function to apply, albeit using brute-force nested conditionals.
My question is in two parts:
1. How does one kick off the execution of these functions?
2. Once the time series data is generated for each id and segment, how best can the output be compiled into a single table? (Sample output noted below. I have thought about one table per id, compiled in a second step, which would also work, but we'll have thousands of ids.)
Below is a sample table, plus some conditions to add metadata, including which function to apply.
//Create sample table and add columns to identify unknown and desired function
t:([id:`AAA`AAA`AAA`BBB`CCC;seg:1 2 3 1 1];aa: 1500 0n 400 40 900;bb:0n 200 30 40 0n;cc: .40 .25 0n 0n .35)
t: update Uknown:?[0N = aa;`aa;?[0N = bb;`bb;?[0N = cc;`cc;`UNK]]] from t
t: update Call_Function:?[0N = aa;`Solveaa;?[0N = bb;`Solvebb;?[0N = cc;`Solvecc;`NoFunction]]] from t
A sample function below uses the inputs from table t to generate time series data (limited to 5 periods for this example) and is tested using x.
//dummy function to generate output for first 5 time periods
Solvebb:{[aa;cc]
(aa%cc)*(1-exp(neg cc*1+til 5))
}
//test the function as an example for dummy output in result table below
x: flip enlist Solvebb[1500;.40] //sample output for AAA seg1 from t for example
The result would ideally be a table similar to the one below:
t2: `id`seg xkey ("SIIIS";enlist",") 0:`:./Data/sampleOutput.csv
id seg| seg_idx tot_idx result
-------| ------------------------
AAA 1 | 1 1 1,236.30
AAA 1 | 2 2 2,065.02
AAA 1 | 3 3 2,620.52
AAA 1 | 4 4 2,992.89
AAA 1 | 5 5 3,242.49
AAA 2 | 1 6
AAA 2 | 2 7
AAA 2 | 3 8
AAA 2 | 4 9
AAA 2 | 5 10
AAA 3 | 1 11
AAA 3 | 2 12
AAA 3 | 3 13
AAA 3 | 4 14
AAA 3 | 5 15
BBB 1 | 1 1
BBB 1 | 2 2
BBB 1 | 3 3
BBB 1 | 4 4
BBB 1 | 5 5
..
It's difficult without more details, but something like the following may help.
First, it may be easier to define Solvebb so that it takes 3 inputs and simply ignores the middle one:
q)Solvebb:{[aa;bb;cc](aa%cc)*(1-exp(neg cc*1+til 5))}
And add dummy functions for the other two in your table (NB. for the later use of ungroup it's important that the outputs of these functions are lists):
q)Solveaa:{[aa;bb;cc] (bb+cc;bb*cc)}
q)Solvecc:{[aa;bb;cc] (aa+bb;aa*bb)}
You can then call each function on all three vectors of input with:
q)update result:first[Call_Function]'[aa;bb;cc] by Call_Function from t
id seg| aa bb cc Uknown Call_Function result
-------| -------------------------------------------------------------------------------
AAA 1 | 1500 0.4 bb Solvebb 1236.3 2065.016 2620.522 2992.888 3242.493
AAA 2 | 200 0.25 aa Solveaa 200.25 50
AAA 3 | 400 30 cc Solvecc 430 12000f
BBB 1 | 40 40 cc Solvecc 80 1600f
CCC 1 | 900 0.35 bb Solvebb 759.3735 1294.495 1671.589 1937.322 2124.581
You can then unravel this table by applying the ungroup function:
q)ungroup update result:first[Call_Function]'[aa;bb;cc] by Call_Function from t
id seg aa bb cc Uknown Call_Function result
---------------------------------------------------
AAA 1 1500 0.4 bb Solvebb 1236.3
AAA 1 1500 0.4 bb Solvebb 2065.016
AAA 1 1500 0.4 bb Solvebb 2620.522
AAA 1 1500 0.4 bb Solvebb 2992.888
AAA 1 1500 0.4 bb Solvebb 3242.493
AAA 2 200 0.25 aa Solveaa 200.25
AAA 2 200 0.25 aa Solveaa 50
AAA 3 400 30 cc Solvecc 430
AAA 3 400 30 cc Solvecc 12000
BBB 1 40 40 cc Solvecc 80
BBB 1 40 40 cc Solvecc 1600
CCC 1 900 0.35 bb Solvebb 759.3735
CCC 1 900 0.35 bb Solvebb 1294.495
CCC 1 900 0.35 bb Solvebb 1671.589
CCC 1 900 0.35 bb Solvebb 1937.322
CCC 1 900 0.35 bb Solvebb 2124.581
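From there it is a short step to the desired keyed output with its seg_idx and tot_idx counters: two further grouped updates can number the rows within each id+seg pair and within each id (a sketch; the final column selection and xkey are cosmetic):

```q
t3:ungroup update result:first[Call_Function]'[aa;bb;cc] by Call_Function from t
t3:update seg_idx:1+til count i by id,seg from t3   / row number within id+seg
t3:update tot_idx:1+til count i by id from t3       / row number within id
`id`seg xkey `id`seg`seg_idx`tot_idx`result#t3
```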
I have a datasource that changed recently (same table, however) and I am trying to clean up my table. I'm having an issue with a pesky " character that I am trying to replace with null.
When the table is pushed to kdb, it is a symbol column that can contain a single double-quote character (ASCII 34). I have been running ssr to replace it with null and then using fills to populate, which had worked at one point before the datasource change. I thought it might be leading/trailing spaces, so I checked with trim; that seems fine, so no rogue spaces are involved.
For some reason I am unable to perform the ssr on it, even though I've verified that the character is correct. I thought I had it working, but my update below doesn't. Any thoughts? I'm assuming it's my regex?
P.S. I was hoping to avoid casting seg to its ASCII codes, but that is my next idea.
// update query does not fail but doesn't update the single, double quote in seg
t: update seg: fills `$ssr[;"\"\"";""] each string seg from t;
//verify the data types are symbols
meta t;
c | t f a
-------------| -----
id | s
seg | s
Here is my attempt at casting seg to a string, to show how the quote is escaped, alongside the desired goal.
id seg Displaystring Ticker ---->> Desired output
------------------------------------------------------
AAA 1 GOOG "GOOG" GOOG
AAA 2 " ,"\"" GOOG
AAA 3 " ,"\"" GOOG
AAA 4 " ,"\"" GOOG
AAA 5 " ,"\"" GOOG
AAA 6 " ,"\"" GOOG
BBB 1 AMZN "AMZN" AMZN
BBB 2 " ,"\"" AMZN
BBB 3 " ,"\"" AMZN
CCC 1 AAPL "AAPL" AAPL
CCC 2 " ,"\"" AAPL
CCC 3 " ,"\"" AAPL
DDD 1 TSLA ,"\"" TSLA
DDD 2 " ,"\"" TSLA
This should do the job
q)tab:([]id:`$raze{x,/:string 1+til y}'[("AAA ";"BBB ";"CCC ";"DDD ");6 3 3 2];seg:#[`$'14#"\"";0 6 9 12;:;`GOOG`AMZN`AAPL`TSLA])
q)tab
id seg
----------
AAA 1 GOOG
AAA 2 "
AAA 3 "
AAA 4 "
AAA 5 "
AAA 6 "
BBB 1 AMZN
BBB 2 "
BBB 3 "
CCC 1 AAPL
CCC 2 "
CCC 3 "
DDD 1 TSLA
DDD 2 "
q)update fills?[seg=`$"\"";`;seg]from tab
id seg
----------
AAA 1 GOOG
AAA 2 GOOG
AAA 3 GOOG
AAA 4 GOOG
AAA 5 GOOG
AAA 6 GOOG
BBB 1 AMZN
BBB 2 AMZN
BBB 3 AMZN
CCC 1 AAPL
CCC 2 AAPL
CCC 3 AAPL
DDD 1 TSLA
DDD 2 TSLA
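As an aside, a likely reason the original ssr update had no visible effect: the search pattern "\"\"" is a two-character string (two quote characters), whereas each offending cell contains only a single quote character, so nothing ever matches. A minimal sketch of the difference:

```q
s:string `$"\""    / the cell holds exactly one quote character
ssr[s;"\"\"";""]   / two-quote pattern: no match, string unchanged
ssr[s;"\"";""]     / one-quote pattern: quote removed
```

That said, comparing symbols directly as above avoids string manipulation altogether.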
Input Table:
prod acct acctno newcinsfx
John A01  1      89
John A01  2      90
John A01  2      92
Mary A02  1      92
Mary A02  3      81
Desired output table:
prod acct newcinsfx1 newcinsfx2
John A01  89
John A01  90         92
Mary A02  92
Mary A02  81
I tried to do it with the distinct function:
df.select('prod',"acctno").distinct()
df.show()
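distinct alone won't produce the desired shape: the goal appears to be grouping by prod, acct and acctno and spreading the (up to two) newcinsfx values into separate columns. The reshaping is sketched in plain Python below for clarity (in PySpark the same effect could come from groupBy plus collect_list rather than distinct; all names mirror the sample data):

```python
from collections import defaultdict

# Sample rows mirroring the input table: (prod, acct, acctno, newcinsfx)
rows = [
    ("John", "A01", 1, 89),
    ("John", "A01", 2, 90),
    ("John", "A01", 2, 92),
    ("Mary", "A02", 1, 92),
    ("Mary", "A02", 3, 81),
]

# Collect newcinsfx values per (prod, acct, acctno) group
grouped = defaultdict(list)
for prod, acct, acctno, sfx in rows:
    grouped[(prod, acct, acctno)].append(sfx)

# Spread each group's values into newcinsfx1 / newcinsfx2 columns
result = []
for (prod, acct, acctno), sfxs in grouped.items():
    result.append({
        "prod": prod,
        "acct": acct,
        "newcinsfx1": sfxs[0] if len(sfxs) > 0 else None,
        "newcinsfx2": sfxs[1] if len(sfxs) > 1 else None,
    })
```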
I have data made of varying periodic strings that are effectively a time-value list with a periodicity flag contained within. Unfortunately, each string can have a different number of elements, though no more than 7.
Example below: # or #/M at the end of a string means the values are monthly (starting at 8/2020 in the first row), while #/Y means annual numbers, which we divide by 12, for example, to get a monthly value. # at the beginning simply means continue from the prior period.
copied from CSV
ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column
I would normally tokenize and then pass each of the 7 tokens to a function, but the limit is 8 arguments and I would like to send other metadata in addition to these 7 arguments.
t:#[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];
t: ungroup t;
//Desired result
ID seg iDate result
AAA 1 8/31/2020 2333
AAA 1 9/30/2020 2456
AAA 1 10/31/2020 2544
AAA 1 11/30/2020 2632
AAA 1 12/31/2020 2678
AAA 2 1/31/2021 3333
AAA 2 2/28/2021 3456
AAA 2 3/31/2021 3544
AAA 2 4/30/2021 3632
AAA 2 5/31/2021 3678
AAA 3 6/30/2021 4333
AAA 3 7/31/2021 4456
AAA 3 8/31/2021 4544
AAA 3 9/30/2021 4632
AAA 3 10/31/2021 4678
AAA 4 11/30/2021 5333
AAA 4 12/31/2021 5456
AAA 5 1/31/2022 527.75 <-- 6333/12
AAA 5 2/28/2022 527.75
AAA 5 3/31/2022 527.75
AAA 5 4/30/2022 527.75
AAA 5 5/31/2022 527.75
AAA 5 6/30/2022 527.75
AAA 5 7/31/2022 527.75
AAA 5 8/31/2022 527.75
AAA 5 9/30/2022 527.75
AAA 5 10/31/2022 527.75
AAA 5 11/30/2022 527.75
AAA 5 12/31/2022 527.75
AAA 5 1/31/2023 538.00 <--6456/12
AAA 5 2/28/2023 538.00
AAA 5 3/31/2023 538.00
AAA 5 4/30/2023 538.00
AAA 5 5/31/2023 538.00
AAA 5 6/30/2023 538.00
AAA 5 7/31/2023 538.00
AAA 5 8/31/2023 538.00
AAA 5 9/30/2023 538.00
AAA 5 10/31/2023 538.00
AAA 5 11/30/2023 538.00
AAA 5 12/31/2023 538.00
AAA 5 1/31/2024 etc..
AAA 5 2/29/2024
AAA 5 3/31/2024
AAA 5 4/30/2024
AAA 5 5/31/2024
AAA 5 6/30/2024
AAA 5 7/31/2024
ddonelly is correct that a dictionary or list gets around the 8-parameter limit on functions, but I think it is not the right approach here. Below achieves the desired output:
t:("SSS";enlist",") 0:`:so.csv;
// This will process each distinct ID separately as the date logic I have here would break if you had a BBB entry that starts date over
{[t]
t:#[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];
raze {[t]
t:#[t;`strField;{" "vs string x}'];
t:ungroup update`$date from delete strField from #[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
`Y =fills #[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each
"/" vs/: string date;year] from t
} each t
}[t]
ID seg date result
AAA 1 2020.08 2333
AAA 1 2020.09 2456
AAA 1 2020.10 2544
AAA 1 2020.11 2632
AAA 1 2020.12 2678
AAA 2 2021.01 3333
AAA 2 2021.02 3456
AAA 2 2021.03 3544
AAA 2 2021.04 3632
AAA 2 2021.05 3678
AAA 3 2021.06 4333
AAA 3 2021.07 4456
AAA 3 2021.08 4544
AAA 3 2021.09 4632
AAA 3 2021.10 4678
AAA 4 2021.11 5333
AAA 4 2021.12 5456
AAA 5 2022.01 527.75
AAA 5 2022.02 527.75
AAA 5 2022.03 527.75
...
AAA 5 2023.01 538
AAA 5 2023.02 538
AAA 5 2023.03 538
AAA 5 2023.04 538
...
AAA 5 2024.01 545.3333
AAA 5 2024.02 545.3333
...
Below is a full breakdown of what's going on inside the nested function, should you need it for understanding.
// vs (vector from scalar) is useful for string manipulation, separating the strField column into a more manageable list of separate strings
t:#[t;`strField;{" "vs string x}'];
// split the strField out to more manageable columns
t:#[t;`date`result`year;:;
// date column from the first part of strField
({first x}each t[`strField];
// result for the actual value fields in the middle
"J"${-1_1_x}each t[`strField];
// year column which is a boolean to indicate special handling is needed.
// I also forward fill to account for rows which are continuation of
// the previous rows time period,
// e.g. if you had 2 or 3 lines in a row of continuous yearly data
`Y =fills #[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;
t:update
// divide yearly rows where necessary with a vector conditional
result:?[year;result%12;result],
// change year into a progressive month list
date:{x+til count x} each
// check if a month exists, if not take previous month + 1.
// If a year, previous month + 12 and convert to Jan
// create a list of Jans for the year which I convert to Jan->Dec above
{max($[z;12#(x+12-x mod 12);1#x+1];y)}\
// reformat date to kdb month to feed with year into the scan iterator above
[0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;
// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t
Could you pass the columns into a dictionary and then pass the dictionary into the function? This would circumvent the maximum of 8 arguments, since the dictionary can hold as many entries as you require.
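For example, a minimal sketch of the dictionary-argument pattern (function and key names hypothetical):

```q
f:{[d] d[`aa] % d[`cc]}        / a single parameter holding a dictionary
f `aa`cc!(1500;0.4)            / pass as many keys as you need
t:([]aa:1500 900;cc:.4 .35)
f each t                       / each row of a table is itself a dictionary
```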
I'm looking for a way to save a q/kdb table into a parquet file. The most straightforward way I've found is to convert the q table into a pandas dataframe using embedPy. Has anyone achieved this?
cheers,
didier
To convert a q table to a pandas dataframe, you can use this function:
tab2df:{
r:.p.import[`pandas;`:DataFrame;x][#;cols x];
$[count k:keys x;r[`:set_index]k;r]}
To convert the pandas dataframe to a q table, you can use this function:
df2tab:{
  n:$[.p.isinstance[x`:index;.p.import[`pandas]`:RangeIndex]`;0;x[`:index.nlevels]`];
  n!flip $[n;x[`:reset_index][];x][`:to_dict;`list]`}
df2tab requires the pandas dataframe to be an embedPy object; you can use .p.wrap for that.
See an example below
q)\l p.q
q)tab:([]a:10?10.;b:10?10;c:10?`aaa`bbb`ccc)
q)tab
a b c
---------------
1.086824 2 ccc
9.598964 7 aaa
0.3668341 8 aaa
6.430982 5 ccc
6.708738 6 aaa
6.789082 4 bbb
4.12317 1 aaa
9.877844 3 aaa
3.867353 3 aaa
7.26781 7 ccc
q)tab2df[tab]
{[f;x]embedPy[f;x]}[foreign]enlist
q)print tab2df[tab]
a b c
0 1.086824 2 ccc
1 9.598964 7 aaa
2 0.366834 8 aaa
3 6.430982 5 ccc
4 6.708738 6 aaa
5 6.789082 4 bbb
6 4.123170 1 aaa
7 9.877844 3 aaa
8 3.867353 3 aaa
9 7.267810 7 ccc
q)pdtab:tab2df[tab]
q)df2tab[pdtab]
a b c
-----------------
1.086824 2 "ccc"
9.598964 7 "aaa"
0.3668341 8 "aaa"
6.430982 5 "ccc"
6.708738 6 "aaa"
6.789082 4 "bbb"
4.12317 1 "aaa"
9.877844 3 "aaa"
3.867353 3 "aaa"
7.26781 7 "ccc"
Hope this helps!!
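To complete the round trip to parquet, the resulting dataframe's to_parquet and read_parquet methods can be called through embedPy as well (this assumes a pandas installation with a parquet engine such as pyarrow available):

```q
q)df:tab2df[tab]
q)df[`:to_parquet]["tab.parquet"]
q)df2tab .p.import[`pandas;`:read_parquet]["tab.parquet"]
```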
Table t1:
person | visit | code_num1 | code_desc1
1 1 100 OTD
1 2 101 SED
2 3 102 CHM
3 4 103 OTD
3 4 103 OTD
4 5 101 SED
Table t2:
person | visit | code_num2 | code_desc2
1 1 104 DME
1 6 104 DME
3 4 103 OTD
3 4 103 OTD
3 7 103 OTD
4 5 104 DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED
1 6 104 DME
2 3 102 CHM
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 103 OTD
4 5 101 SED 104 DME
I want to replicate this in a hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED null null
1 6 null null 104 DME
2 3 102 CHM null null
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 null null 103 OTD
4 5 101 SED 104 DME
Which is almost the same as the SAS output, but with 2 extra rows for (person=3, visit=4). I assume this is because Hive matches each of the two rows in one table with both rows in the other, producing 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?
If you merge two data sets that have variables with the same names (besides the by variables), then variables from the second data set will overwrite any same-named variables in the first. So your SAS code creates an overlaid dataset; a full outer join does not do this.
It seems to me that if you first dedupe the right-side table and then do a full outer join, you should get the equivalent table in Hive. I don't see a need for the case-when statements either, as Joe pointed out; just join on the key values:
create table t3 as
select coalesce(a.person, b.person) as person
, coalesce(a.visit, b.visit) as visit
, a.code_num1
, a.code_desc1
, b.code_num2
, b.code_desc2
from
(select * from t1) a
full outer join
(select person, visit, code_num2, code_desc2
from t2
group by person, visit, code_num2, code_desc2) b
on a.person=b.person and a.visit=b.visit
;
I can't test this code currently, so be sure to test it. Good luck.