Add a column of differences to tables of summary statistics in Stata

If I make a two-way table of summary statistics in Stata using table, can I add another column that is the difference of two other columns?
Say that I have three variables (a, b, c). I generate quintiles on a and b then generate a two-way table of means of c in each quintile-quintile intersection. I would like to generate a sixth column that is the difference of mean c between the top and bottom quintiles of b for each quintile of a.
I can generate the table of mean c for each quintile-quintile intersection, but I can't figure out the difference column.
* generate data
clear
set obs 2000
generate a = rnormal()
generate b = rnormal()
generate c = rnormal()
* generate quantiles for a and b
xtile a_q = a, nquantiles(5)
xtile b_q = b, nquantiles(5)
* calculate the means of each quintile intersection
table a_q b_q, c(mean c)
* if I want the top and bottom b quantiles
table a_q b_q if b_q == 1 | b_q == 5, c(mean c)
Update: Here's an example of what I would like to do.

With the collapse command you can create customized tables like the one you have in mind.
preserve
collapse (mean) c, by(a_q b_q)
keep if inlist(b_q, 1, 5)
reshape wide c, i(a_q) j(b_q)
gen c5_c1 = c5 - c1
set obs `=_N + 1'
replace c1 = c1[`=_N - 1'] - c1[1] if mi(a_q)
replace c5 = c5[`=_N - 1'] - c5[1] if mi(a_q)
replace c5_c1 = c5_c1[`=_N - 1'] - c5_c1[1] if mi(a_q)
list, sep(0) noobs
restore
Then you should obtain something like this in your output:
+-----------------------------------------+
| a_q          c1          c5       c5_c1 |
|-----------------------------------------|
|   1    .2092651    .1837719   -.0254932 |
|   2    .0256483   -.0118134   -.0374617 |
|   3     .022957    .0586441    .0356871 |
|   4    .0431809    .0876745    .0444935 |
|   5   -.0859874    .0199202    .1059076 |
|   .   -.2952525   -.1638517    .1314008 |
+-----------------------------------------+
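As a quick sanity check on the arithmetic, the difference column and the extra bottom row can be recomputed from the means printed in the listing above. This is only a minimal Python sketch, not Stata; the name `means` is made up for illustration and the numbers are copied from the listing, so small differences in the last digit are expected.

```python
# Means of c copied from the listing above: a_q -> (c1, c5),
# i.e. bottom and top quintile of b for each quintile of a.
means = {
    1: (0.2092651, 0.1837719),
    2: (0.0256483, -0.0118134),
    3: (0.0229570, 0.0586441),
    4: (0.0431809, 0.0876745),
    5: (-0.0859874, 0.0199202),
}

# Difference column: top b quintile minus bottom b quintile.
c5_c1 = {q: c5 - c1 for q, (c1, c5) in means.items()}

# Extra row: top a quintile minus bottom a quintile, column by column.
bottom_row = (
    means[5][0] - means[1][0],
    means[5][1] - means[1][1],
    c5_c1[5] - c5_c1[1],
)

for q in sorted(means):
    print(q, means[q][0], means[q][1], round(c5_c1[q], 7))
print(".", *(round(v, 7) for v in bottom_row))
```

The last printed line corresponds to the row with a missing a_q in the Stata output.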
If you are not very familiar with Stata, the following help pages might be useful in understanding the code:
help _variables
help subscripting

Related

Processing each row in kdb table and appending arbitrary results in a new table

I have a table
t:([]a:`a`b`c;b:1 2 3;c:`x`y`z)
I would like to iterate and process each row.
The thing is that the processing logic for each row may produce an arbitrary number of rows of data, so after the full iteration the result may look like this, e.g.
results:([]a:`a1`b1`b2`b3`c1`c2;x:1 2 2 2 3 3)
I have the following idea so far, but it doesn't seem to work:
uj { // some processing function } each t
But how does one return an arbitrary number of rows per input row and append the results into a new table?

Assuming you are using something from the table entries to indicate your arbitrary value, you can use a dictionary to indicate a number (or a function) which can be used to apply these values.
In this example, I use the c column of the original table to indicate the number of rows to return (and the number from 1 to count to).
As each entry of the table is a dictionary, I can index using the column names to get the values and build a new table.
I also use raze to join each of the results together, as they will each have the same schema.
raze {[x]
  d:`x`y`z!1 3 2;
  ([]a:((),`$string[x[`a]],/:string 1+til d[x[`c]]);x:((),d[x[`c]])#x[`b])
  } each t
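For readers less used to q, the same idea (each input row yields a variable number of output rows, and the per-row results are then flattened) can be sketched in Python. This is an illustrative translation under the same assumption as the q code above, namely that column c decides how many rows to emit; the function name `process` is hypothetical.

```python
from itertools import chain

# The table t from the question, as a list of row dictionaries.
t = [
    {"a": "a", "b": 1, "c": "x"},
    {"a": "b", "b": 2, "c": "y"},
    {"a": "c", "b": 3, "c": "z"},
]

# Same mapping as d in the q code: value of column c -> number of output rows.
d = {"x": 1, "y": 3, "z": 2}

def process(row):
    """Return a variable-length list of output rows for one input row."""
    n = d[row["c"]]
    return [{"a": f"{row['a']}{i}", "x": row["b"]} for i in range(1, n + 1)]

# Rough equivalent of `raze f each t`: apply per row, then flatten one level.
results = list(chain.from_iterable(process(row) for row in t))

for r in results:
    print(r["a"], r["x"])
```

Flattening works here because every per-row result has the same keys, which mirrors the requirement that the tables passed to raze share a schema.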

Not sure if this is what you want, but you can try something like this:
ungroup select a:`${y,/:x}[string b]'[string a],b from t
Or you can use accumulators if you need the result of the previous row calculations like this:
{y[`b]+:last[x]`b;x,y}/[t;t]

If your processing function is outputting tables that conform, just raze should suffice:
raze {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
Otherwise use (uj/)
(uj/) {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z

Your best answer will depend very much on how you want to use the results computed from each row of t. It might suit you to normalise t; it might not. The key point here:
A table cell can be any q data structure.
The minimum you can do in this regard is to store the result of your processing function in a new column.
Below, an arbitrary binary function f returns its result as a dictionary.
q)f:{n:1+rand 3;(`$string[x],/:"123" til n)!n#y}
q)f [`a;2]
a1| 2
a2| 2
q)update d:a f'b from t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y (,`b1)!,2
c 3 z `c1`c2!3 3
But its result could be any q data structure.
You were considering a unary processing function:
q)pf:{@[x;`d;:;] f . x`a`b}
q)pf each t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y `b1`b2!2 2
c 3 z `c1`c2`c3!3 3 3
You might find other suggestions at KX Community.

If I understand your question correctly, you need something like this:
(uj/){}each t
Check this bit :
(uj/)enlist[t],{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
This part :
x:update x:i from
// functional form of a function that takes random rows/columns
?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];
// some form of if-else and an update to generate column a (not bullet proof)
{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]
Basically the above gives something like :
q){x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7;1 1 1 1 1 1 1 1;`x`x`x`x`x`x`x`x;0 1 2 3 ..
+`a`x!(`a0`a1`a2`a3`a4`a5;0 1 2 3 4 5)
+`a`b`c`x!(`a0`a1`a2;1 1 1;`x`x`x;0 1 2)
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7`a8`a9`a10`a11;1 1 1 1 1 1 1 1 1 1 1 1;`x`..
or taking the first one :
q)first{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
a b x
--------
a0 1 0
a1 1 1
a2 1 2
a3 1 3
a4 1 4
a5 1 5
a6 1 6
a7 1 7
a8 1 8
a9 1 9
a10 1 10
You can do
(uj/)enlist[t],{ // some function }each t
to get what you want. Drop the enlist[t] if you don't want the table you start with in your result
Hope this helps.

Kafka Streams: group by subsequent identical keys and time windows

I have a scenario that could be described like this. Imagine there are 2 types of keys coming in: A and B (in reality, there are more). Let's say (different) records enter the KStream in following order:
A1
A2
A3
B1
B2
A4
A5
In my scenario, the order of operations is important and I cannot process A4 before B1 or B2. However, I'd like to be able to process the records as batches. The way I see it, the best option to batch this input looks like this, i.e. reduce the input to 3 "batch" objects:
[A1 A2 A3]
[B1 B2]
[A4 A5]
I could then apply a function to each batch using forEach.
Complexity: The application is time-sensitive, and it's not acceptable to keep aggregating records until the type changes (and a new batch is required). In other words, if the time between A1 and A2 is more than some time t, a batch containing only A1 should be generated when t expires. The output would then look like this (assuming all other records entered the stream in close succession):
[A1]
[A2 A3]
[B1 B2]
[A4 A5]
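To make the intended batching rule concrete, here is a minimal offline sketch in plain Python. It is not Kafka Streams code and does not use any Kafka API: it only illustrates the semantics described above, assuming that a new batch starts whenever the key changes or the gap to the previous record exceeds t. The function name `batch_records`, the parameter `max_gap`, and the timestamps are all hypothetical.

```python
def batch_records(records, max_gap):
    """Group consecutive records that share a key and arrive within max_gap.

    records: iterable of (key, value, timestamp) tuples in arrival order.
    A new batch starts when the key changes or when the gap to the
    previous record exceeds max_gap.
    """
    batches = []
    prev_key = prev_ts = None
    for key, value, ts in records:
        same_run = batches and key == prev_key and ts - prev_ts <= max_gap
        if same_run:
            batches[-1].append(value)
        else:
            batches.append([value])
        prev_key, prev_ts = key, ts
    return batches

# Hypothetical timestamps (seconds): A2 arrives 5 s after A1, the rest 0.1 s apart.
records = [
    ("A", "A1", 0.0), ("A", "A2", 5.0), ("A", "A3", 5.1),
    ("B", "B1", 5.2), ("B", "B2", 5.3),
    ("A", "A4", 5.4), ("A", "A5", 5.5),
]
print(batch_records(records, max_gap=1.0))
```

Note that this sketch only closes a batch when the next record arrives. A real streaming implementation would also need some kind of timer so that the batch with A1 is emitted as soon as t expires, even if nothing else arrives.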
Question: How do I obtain a KStream with such batch objects while also taking time windows into account?
My initial solution (which probably doesn't work) for the final scenario (with delay between A1 and A2):
[KStream] incoming data with 2 possible keys: A or B, e.g. [ (A, A1), (A, A2), ...]
|
| selectKey(key + something to separate A1, A2, A3 from A4 and A5 because B1 and B2 are inbetween)
v
[KStream] e.g. [ (A-group1, A1), (A-group1, A2), ... , (B-group2, B1), ..., (A-group3, A4) ]
|
| groupBy(key) // So either A-group1, B-group2 or A-group3
v
[KGroupedStream] 3 different streams
|
| WindowedBy e.g. 1 second
v
[TimeWindowedKStream] (still 3 different streams I guess?)
|
| reduce() --> Make "batch" objects out of window, e.g. a batch object is [A2, A3]
v
[KTable] (no idea what this looks like, I guess one row per time window?)
|
| toStream()
v
[KStream] 1 stream with 4 entries like I described in the final scenario above
Could this work? Is this efficient? What are your thoughts?

Joining multiple times in kdb

I have two tables
table 1 (orders) columns: (date,symbol,qty)
table 2 (marketData) columns: (date,symbol,close price)
I want to add the close for T+0 to T+5 to table 1.
{[nday]
value "temp0::update date",string[nday],":mdDates[DateInd+",string[nday],"] from orders";
value "temp::temp0 lj 2! select date",string[nday],":date,sym,close",string[nday],":close from marketData";
table1::temp
} each (1+til 5)
I'm sure there is a better way to do this, but I get a 'loop error when I try to run this function. Any suggestions?

See here for common errors. Your loop error is because you're setting views with value, not globals. Inside a function value evaluates as if it's outside the function so you don't need the ::.
That said there's lots of room for improvement, here's a few pointers.
You don't need value at all in your case. For example, the first line can be reduced to the following (I'm assuming mdDates is some kind of function you're just dropping in to work out the date from an integer, and DateInd some kind of global):
{[nday]
  temp0:update date:mdDates[nday;DateInd] from orders;
  ....
  } each (1+til 5)
In this bit it just looks like you're trying to append something to the column name:
select date",string[nday],":date
Remember that tables are flipped dictionaries... you can mess with their column names via the keys, as illustrated (very noddily) below:
q)t:flip `a`b!(1 2; 3 4)
q)t
a b
---
1 3
2 4
q)flip ((`$"a","1"),`b)!(t`a;t`b)
a1 b
----
1 3
2 4
You can also use functional select, which is much neater IMO:
q)?[t;();0b;((`$"a","1"),`b)!(`a`b)]
a1 b
----
1 3
2 4

It seems you want columns p0, p1, ... holding the prices for date+0, date+1, and so on.
Using the over adverb (/) to iterate over the day offsets (0 to 4 in this example; use til 6 for T+0 to T+5):
q)orders:([] date:(2018.01.01+til 5); sym:5?`A`G; qty:5?10)
q)data:([] date:20#(2018.01.01+til 10); sym:raze 10#'`A`G; price:20?10+10.)
q)delete d from {c:`$"p",string[y]; (update d:date+y from x) lj 2!(`d`sym,c )xcol 0!data}/[ orders;0 1 2 3 4]
date sym qty p0 p1 p2 p3 p4
---------------------------------------------------------------
2018.01.01 A 0 10.08094 6.027448 6.045174 18.11676 1.919615
2018.01.02 G 3 13.1917 8.515314 19.018 19.18736 6.64622
2018.01.03 A 2 6.045174 18.11676 1.919615 14.27323 2.255483
2018.01.04 A 7 18.11676 1.919615 14.27323 2.255483 2.352626
2018.01.05 G 0 19.18736 6.64622 11.16619 2.437314 4.698096
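The logic of the repeated left join can also be written out step by step. The following Python sketch is only an illustration with made-up sample data: for each order it looks up the price for (date+k, sym) and stores it in a column pk. Like the q code above, it shifts by calendar days, which is an assumption; the question's mdDates suggests trading days may be what is really wanted. The name `add_forward_prices` is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical sample data; names and prices are made up for illustration.
orders = [
    {"date": date(2018, 1, 1), "sym": "A", "qty": 10},
    {"date": date(2018, 1, 2), "sym": "G", "qty": 5},
]
# Market data keyed by (date, sym), mirroring the 2! keyed table in the q code.
market = {
    (date(2018, 1, 1) + timedelta(days=i), sym): base + i
    for i in range(10)
    for sym, base in (("A", 100.0), ("G", 200.0))
}

def add_forward_prices(orders, market, ndays):
    """Return copies of the orders with columns p0..p{ndays-1} added."""
    out = []
    for order in orders:
        row = dict(order)
        for k in range(ndays):
            key = (order["date"] + timedelta(days=k), order["sym"])
            # A missing key gives None, like a null from a left join.
            row[f"p{k}"] = market.get(key)
        out.append(row)
    return out

for row in add_forward_prices(orders, market, 5):
    print(row)
```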

how to handle type F in a table in Q/KDB

I started learning q/KDB a while ago, so forgive me in advance for a trivial question, but I am facing the following problem that I don't know how to solve.
I have a table named "res" showing the side, the sum of orders and the average price of some symbols
sym side | sum_order avg_price
----------| -------------------
ALPHA B | 95109 9849.73
ALPHA S | 91662 9849.964
BETA B | 47 9851.638
BETA S | 60 9853.383
with these types
c | t f a
---------| -----
sym | s p
side | s
sum_order| f
avg_price| f
I would like to calculate the close and open positions, the average points made by the close position, and the average price of the open position.
I have used this query, which I believe is pretty bizarre (I am sure there is a more professional way to do it), but it works as expected
position_summary:select
close_position:?[prev[sum_order]>sum_order;sum_order;prev[sum_order]],
average_price:avg_price-prev[avg_price],
open_pos:prev[sum_order]-sum_order,
open_wavgprice:?[sum_order>next[sum_order];avg_price;next[avg_price]][0]
by sym from res
giving me the following table
sym | close_position average_price open_pos open_wavgprice
----------| ----------------------------------------------------
ALPHA | 91662 0.2342456 3447 9849.73
BETA | 47 1.745035 -13 9853.38
and types are
c | t f a
--------------| -----
sym | s s
close_position| F
average_price | F
open_pos | F
open_wavgprice| f
Now my problem starts here: imagine I join the position_summary table with another table, appending another column "current_price" of type f.
What I want to do is determine the points of the open positions.
I have tried this way:
select
?[open_pos>0;open_price-open_wavgprice;open_wavgprice-open]
from position_summary
but I got a 'type error,
surely because sum_order is type F while open_wavgprice and current_price are type f. I have searched the internet but did not find much about the F type.
First: how can I handle this? I have tried casting and using raze, but with no effect, and moreover I am not sure whether they are right for this particular case.
Second: is there a better way to use "if-then" when querying tables (for example, in plain English: if this row of this column meets a condition, then take the previous/next value of another column, or the second or third previous/next value)?
Thank you for your help

Let me rephrase your question using a slightly simpler table:
q)show res:([sym:`A`A`B`B;side:`B`S`B`S]size:95 91 47 60;price:49.7 49.9 51.6 53.3)
sym side| size price
--------| ----------
A B | 95 49.7
A S | 91 49.9
B B | 47 51.6
B S | 60 53.3
You are trying to find the closing position for each symbol using a query like this:
q)show summary:select close:?[prev[size]>size;size;prev[size]] by sym from res
sym| close
---| -----
A | 91
B | 47
The result seems to have one number in each row of the "close" column, but in fact it has two. You may notice an extra space before each number in the display above or you can display the first row
q)first 0!summary
sym | `A
close| 0N 91
and see that the first row in the "close" column is 0N 91. Since the missing values such as 0N are displayed as a space, it was hard to see them in the earlier display.
It is not hard to understand how you've got these two values. Since you select by sym, each column gets grouped by symbol and for the symbol A, you have
q)show size:95 91
95 91
and
q)prev size
0N 95
that leads to
q)?[prev[size]>size;size;prev[size]]
0N 91
(Recall that 0N is smaller than any other integer.)
As a side note, ?[a>b;b;a] is element-wise minimum and can be written as a & b in q, so your conditional expression could be written as
q)size & prev size
0N 91
Now we can see why ? gave you the type error
q)close:exec close from summary
q)close
91
47
While the display is deceiving, "close" above is a list of two vectors:
q)first close
0N 91
and
q)last close
0N 47
The vector conditional does not support that:
q)?[close>0;10;20]
'type
[0] ?[close>0;10;20]
^
One can probably cure that by using each:
q)?[;10;20]each close>0
20 10
20 10
But I don't think this is what you want. Your problem started when you computed the summary table. I would expect the closing position to be the sum of "B" orders minus the sum of "S" orders that can be computed as
q)select close:sum ?[side=`B;size;neg size] by sym from res
sym| close
---| -----
A | 4
B | -13
Now you should be able to fix the rest of the columns in your summary query. Just make sure that you use an aggregation function such as sum in the expression for every column.
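The difference between the two queries can be mimicked outside q. The Python sketch below is only an analogy, using the simplified res data from above: a non-aggregating expression evaluated per group leaves a list in each cell (the situation q reports as type F or J), whereas an aggregation such as sum leaves a single atom per group. The helper names `prev` and `q_min` are hypothetical stand-ins for q's prev and &.

```python
# Same data as the simplified res table: (sym, side, size).
res = [("A", "B", 95), ("A", "S", 91), ("B", "B", 47), ("B", "S", 60)]

# Group sizes by symbol, as `by sym` does.
groups = {}
for sym, side, size in res:
    groups.setdefault(sym, []).append((side, size))

def prev(xs):
    """Shift a list right by one, padding with None (q's prev pads with a null)."""
    return [None] + xs[:-1]

def q_min(a, b):
    """Element-wise minimum where None (null) sorts below every number."""
    return None if a is None or b is None else min(a, b)

# Non-aggregating expression: one LIST per symbol (a nested column).
nested = {}
for sym, rows in groups.items():
    sizes = [size for _, size in rows]
    nested[sym] = [q_min(p, s) for p, s in zip(prev(sizes), sizes)]

# Aggregating expression: one ATOM per symbol (a flat column).
net = {
    sym: sum(size if side == "B" else -size for side, size in rows)
    for sym, rows in groups.items()
}

print(nested)  # {'A': [None, 91], 'B': [None, 47]}
print(net)     # {'A': 4, 'B': -13}
```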

Type F means the "cell" in the column contains a vector of floats rather than an atom. So your column is actually a vector of vectors rather than a flat vector.
If each cell holds a vector with a single element, you could just do:
select first each close_position, first each average_price.....
which will give you a type f.
I'm not 100% on what you were trying to do in the first query, and I don't have a q terminal to hand to check but you could put this into your query:
select close_position:?[prev[sum_order]>sum_order;last sum_order; last prev[sum_order].....
i.e. get the last sum_order in the list.

Simplify Boolean Function with don't care

Can you help me with this problem:
"Simplify the Boolean Function together with the don't care condition d in sum of the products and product of sum.
F(x,y,z) = ∑(0,1,2,4,5)
d(x, y, z) = ∑(3,6,7)"
I try to solve it but I came up with 1 and 0.

I would use a Karnaugh map for this problem. The order of the minterms would be (in the top row), 0,2,6,4 and (in the bottom row) 1, 3, 7, 5. This evaluates to 1 since the 'don't cares' can be whatever value (1 or 0).
| 1 | 1 | d | 1 |
| 1 | d | d | 1 |
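The claim that the function reduces to the constant 1 can be checked by brute force. The following Python sketch is just a verification aid, assuming the usual convention that x is the most significant bit of the minterm index; the name `candidate` is hypothetical.

```python
from itertools import product

minterms = {0, 1, 2, 4, 5}   # F(x, y, z) must be 1 here
dont_cares = {3, 6, 7}       # d(x, y, z): output may be 0 or 1

def candidate(x, y, z):
    """The simplified function proposed above: the constant 1."""
    return 1

required_zeros = []
for x, y, z in product((0, 1), repeat=3):
    index = 4 * x + 2 * y + z          # x is the most significant bit
    if index in dont_cares:
        continue                        # any output is acceptable
    expected = 1 if index in minterms else 0
    assert candidate(x, y, z) == expected
    if expected == 0:
        required_zeros.append(index)

print(required_zeros)  # [] : no input is forced to 0, so F = 1 is valid
```

Because the minterms and don't cares together cover all eight input combinations, no cell of the map is forced to 0, which is why both the sum-of-products and product-of-sums forms collapse to 1.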
