Spark window functions that depend on themselves - Scala

Say I have a column of sorted timestamps in a DataFrame. I want to write a function that adds a column to this DataFrame that cuts the timestamps into sequential time slices according to the following rules:
start at the first row and keep iterating down to the end
for each row, if you've walked n rows in the current group OR you have walked more than time interval t in the current group, make a cut
return a new column with the group assignment for each row, which should be an increasing integer
In English: each group should contain no more than n rows and should not span more than time t
For example: (Using integers for timestamps to simplify)
INPUT
time
---------
1
2
3
5
10
100
2000
2001
2002
2003
OUTPUT (after slice function with n = 3 and t = 5)
time | group
----------|------
1 | 1
2 | 1
3 | 1
5 | 2 // cut because there were no cuts in the last 3 rows
10 | 2
100 | 3 // cut because 100 - 5 > 5
2000 | 4 // cut because 2000 - 100 > 5
2001 | 4
2002 | 4
2003 | 5 // cut because there were no cuts in the last 3 rows
I have a feeling this can be done with window functions in Spark. After all, window functions were created to help developers compute moving averages: you basically calculate an aggregate (in this case an average) of a column (stock price) per window of n rows.
The same approach ought to work here. For each row, if the last n rows contain no cut, or the time span between the last cut and the current timestamp is greater than t, then cut = true, otherwise cut = false. But what I can't figure out is how to make the window function aware of itself: that would be like making the moving average at a particular row depend on the previous moving average.
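One fallback, since a window function can't see its own output, is to compute the group assignments in a single sequential pass over the sorted timestamps and join them back onto the DataFrame. Below is a minimal sketch of that pass in plain Scala; it is not Spark code, and the function name and types are illustrative assumptions:
// Sketch only: assign group ids to already-sorted timestamps.
// n = max rows per group, t = max time span per group.
def slice(times: Seq[Long], n: Int, t: Long): Seq[Int] = {
  var group = 0   // current group id
  var rows = 0    // rows already in the current group
  var start = 0L  // timestamp of the first row in the current group
  times.map { ts =>
    if (group == 0 || rows >= n || ts - start > t) {
      group += 1  // cut: start a new group
      rows = 1
      start = ts
    } else {
      rows += 1
    }
    group
  }
}
// slice(Seq(1L, 2, 3, 5, 10, 100, 2000, 2001, 2002, 2003), 3, 5)
// => Seq(1, 1, 1, 2, 2, 3, 4, 4, 4, 5)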

Related

95% Confidence Interval for a weighted column in KDB+/Q

I have a table like the following where each row corresponds to an execution:
table:([]name:`account1`account1`account1`account2`account2`account1`account1`account1`account1`account2;
Pnl:13.7,13.2,74.1,57.8,29.9,15.9,7.8,-50.4,2.3,-16.2;
markouts:.01,.002,-.003,-.02,.004,.001,-.008,.04,.011,.09;
notional:1370,6600,-24700,-2890,7475,15900,-975,-1260,210,-180)
I'd like to create a 95% confidence interval of Pnl for `account1. The problem is, Pnl is the product of markouts and notional values, so it's weighted and the mean wouldn't be a simple mean. I'm pretty sure the standard deviation calculation would also be a bit different than normal.
Is there a way to still do this in KDB? I'm not really sure how to go about this. Any advice is greatly appreciated!
Statistics isn't my strong point, but most of this can be done with some keywords for the standard calculation:
q)select { avg[x] + -1 1* 1.960*sdev[x]%sqrt count x } Pnl by name from table
name | Pnl
--------| ------------------
account1| -15.90856 37.76571
account2| -18.45611 66.12278
https://code.kx.com/q/ref/avg/#avg
https://code.kx.com/q/ref/sqrt/
https://code.kx.com/q/ref/dev/#sdev
As shown on the kx reference, the sdev calculation is as follows, which you could use as a base to create your own to suit what you want/expect.
{sqrt var[x]*count[x]%-1+count x}
There is also wavg if you want to do weighted average:
https://code.kx.com/q/ref/avg/#wavg
Edit: Assuming this can work by substituting in weighted calculations, here's a weighted sdev I've thrown together wsdev:
table:update weight:2 6 3 5 2 4 5 6 7 3 from table;
wsdev:{[w;v] sqrt (sum ( (v-wavg[w;v]) xexp 2) *w)%-1+sum w }
// substituting avg and sdev above
w95CI:{[w;v] wavg[w;v] + -1 1* 1.960*wsdev[w;v]%sqrt count v };
select w95CI[weight;Pnl] by name from table
name | Pnl
--------| ------------------
account1| -19.70731 28.47701
account2| -8.201463 68.24146
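For readers less fluent in q, here is roughly the same weighted calculation sketched in Scala; it simply mirrors the wavg/wsdev formulas above, and the names are illustrative:
// Weighted mean: sum(w*v) / sum(w)
def wavg(w: Seq[Double], v: Seq[Double]): Double =
  w.zip(v).map { case (wi, vi) => wi * vi }.sum / w.sum
// Weighted sdev, matching the wsdev lambda above:
// sqrt( sum(w * (v - wavg)^2) / (sum(w) - 1) )
def wsdev(w: Seq[Double], v: Seq[Double]): Double = {
  val m = wavg(w, v)
  math.sqrt(w.zip(v).map { case (wi, vi) => wi * math.pow(vi - m, 2) }.sum / (w.sum - 1))
}
// 95% CI: wavg +/- 1.960 * wsdev / sqrt(n)
def w95CI(w: Seq[Double], v: Seq[Double]): (Double, Double) = {
  val half = 1.960 * wsdev(w, v) / math.sqrt(v.size)
  (wavg(w, v) - half, wavg(w, v) + half)
}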

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshape, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
    forvalues price = 0(5)500 {
        gen new_r`risk'_p`price' = `price' * (`risk'/200) * beta - alpha
        gen probnew_r`risk'_p`price' = 0
        replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
        sum probnew_r`risk'_p`price', mean
        gen mnew_r`risk'_p`price' = r(mean)
        drop new_r`risk'_p`price' probnew_r`risk'_p`price'
    }
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
    clear
    use simresults.dta
    reshape long mnew_r`risk'_p, i(simulationid) j(price)
    keep simulationid price mnew_r`risk'_p
    rename mnew_r`risk'_p risk`risk'
    save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
    merge m:m price using risk`risk'.dta, nogen
    save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and you would need to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so and get your results. The tricky part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.
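As a cross-check on what is actually being computed: for each (risk, price) cell you want the share of simulations in which price * (risk/200) * beta - alpha is positive. A rough sketch of that calculation in Scala (illustrative only, not Stata):
// Sketch: fraction of simulations where new = price * (risk/200) * beta - alpha > 0,
// for every combination of risk (0..100) and price (0..500 in steps of 5).
case class Sim(beta: Double, alpha: Double)
def probabilityGrid(sims: Seq[Sim]): Map[(Int, Int), Double] =
  (for {
    risk  <- 0 to 100
    price <- 0 to 500 by 5
  } yield {
    val positives = sims.count(s => price * (risk / 200.0) * s.beta - s.alpha > 0)
    (risk, price) -> positives.toDouble / sims.size
  }).toMap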

In MATLAB, how do you compare the difference between two rows

The input is a data file in which some ID numbers occur multiple times (e.g. ID #123). What I want is to gather all rows with the same ID number, compare them column by column, and see in which columns they differ.
After that I move on to the next ID number with multiple occurrences (e.g. ID #456) and do the same.
I repeat this until I finish with the last ID number that has multiple occurrences.
So my output will look like this:
(1) The column headers will be the same.
(2) The ID# column will have unique entries. Only ID numbers with multiple occurrences will be included in this column.
(3) I will add an extra column whose entry is the number of times the ID number occurred. For example, if it occurred 5 times, the entry is 5.
(4) For the other columns, if the column has the same entries for all occurrences of a given ID number, we write "0", otherwise "1". E.g. if for ID #123 the entries in column "Section" are the same for all occurrences of ID #123, then in the output table the column "Section" will contain "0". If there is any difference, the output will be "1".
Your question is not very clear, but I think you want the unique rows and the number of times each unique row occurs. The table below might demonstrate this.
+----+---------+-----+----------+--------------------+
| ID | Column1 | ... | Column n | num of occurrences |
+----+---------+-----+----------+--------------------+
This can be done with unique and accumarray
In the example below, A is the original data and output is your desired output. The first n columns of output are your unique data and the last column contains the number of times this row occurred. The row [1 5] occurred twice, [2 3] once etc.
A = [1 5
1 5
2 3
2 4
3 9];
[k,~,idx]= unique(A,'rows');
n = accumarray(idx(:),1);
output = [k n]
output =
1 5 2
2 3 1
2 4 1
3 9 1
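If the goal really is the per-column difference flags described in the question (one row per repeated ID, an occurrence count, and a 0/1 per column depending on whether that column is constant within the ID), the underlying logic is just a group-by; here is a rough sketch in Scala rather than MATLAB, with made-up types, purely to illustrate the steps:
// Sketch: rows are (id, remaining columns); keep only ids occurring more than once,
// report the occurrence count and, per column, 0 if constant within the id, else 1.
// Assumes every row has the same number of columns.
def summarize(rows: Seq[(Int, Seq[String])]): Seq[(Int, Int, Seq[Int])] =
  rows.groupBy(_._1).collect {
    case (id, group) if group.size > 1 =>
      val cols  = group.map(_._2)
      val flags = cols.head.indices.map(i => if (cols.map(_(i)).distinct.size == 1) 0 else 1)
      (id, group.size, flags)
  }.toSeq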

Conditional Distinct Count in Crystal Reports

I have a dataset like this:
ID PersonID ClassID Attended Converted
1 1 1 1 0
2 1 1 1 1
3 1 1 1 1
4 2 1 1 1
5 3 2 0 0
6 3 2 1 1
7 4 2 1 0
I'm building a report that groups by ClassID (actually I'm using a parameter that allows grouping on a few different columns, but for simplicity here I'm just using ClassID). I need to do a calculation in each group footer. In order to do the calculation, I need to count records with distinct PersonIDs within that group. The catch is that, in one case, these records also need to match a criterion. E.g.:
X = [Count of records where Converted = 1 with distinct PersonID]
Y = [Count of records where Attended = 1]
Then I need to display the quotient as a percentage:
(X/Y)*100
So the final report would look something like this:
ID PersonID Attended Converted
CLASS 1 GROUP
1 1 1 0
2 1 1 1
3 1 1 1
4 2 1 1
Percent= 2/4 = 50%
CLASS 2 GROUP
5 3 0 0
6 3 1 1
7 4 1 0
Percent= 1/2 = 50%
Notice in Class 1 Group, there are 3 records with Converted = 1 but 'X' (the numerator) is equal to 2 because of the duplicate PersonID. How can I calculate this in Crystal Reports?
I had to create a few different formulas to make this work with the help of this site.
First I created a formula called fNull, as suggested by that site, that is just blank. I was wondering whether just typing null in its place would do the job, but I didn't get around to testing it. Next I created formulas to evaluate whether a row was attended and whether a row was converted.
fTrialAttended:
//Allows row to be counted if AttendedTrial is true
if {ConversionData.AttendedTrial} = true
then CStr({ConversionData.PersonID})
else {#fNull}
fTrialsConverted:
//Allows row to be counted if Converted is true
if {ConversionData.Converted} = true
then CStr({ConversionData.PersonID})
else {#fNull}
Note that I'm returning the PersonID if attended or converted is true. This lets me do the distinct count in the next formula (X from the original question):
fX:
DistinctCount({#fTrialsConverted}, {ConversionData.ClassID})
This is placed in the group footer. Again, remember #fTrialsConverted returns the PersonID of converted trials (or fNull, which won't be counted). One thing I don't understand is why I had to explicitly include the group-by field (ClassID) when the formula is already in the group footer, but I did, or it would count the total across all groups. Next, Y was just a straight count.
fY:
//Counts the number of trials attended in the group
Count({#fTrialsAttended}, {ConversionData.ClassID})
And finally a formula to calculate the percentage:
if {#fY} = 0 then 0
else ({#fX}/{#fY})*100
The last thing I'll share is that I also wanted to calculate the total across all groups in the report footer. Counting total Y was easy; it's the same as the fY formula except you leave out the group-by parameter. Counting total X was trickier because I needed the sum of the X from each group, and Crystal can't sum another sum. So I updated my X formula to also keep a running total in a global variable:
fX:
//Counts the number of converted trials in the group, distinct to a personID
whileprintingrecords;
Local NumberVar numConverted := DistinctCount({#fTrialsConverted}, {#fGroupBy});
global NumberVar rtConverted := rtConverted + numConverted; //Add to global running total
numConverted; //Return this value
Now I can use rtConverted in the footer for the calculation. This led to just one other bewildering thing that took me a couple of hours to figure out: rtConverted was not being treated as a global variable until I explicitly added the global keyword, despite all the documentation I've seen saying global is the default. Once I figured that out, it all worked great.
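Outside of Crystal syntax, the per-group arithmetic is simply a distinct count over one filter divided by a plain count over another. A small Scala sketch of the same calculation (field and class names are illustrative):
// X = distinct PersonIDs among rows with Converted = 1,
// Y = count of rows with Attended = 1, result = (X / Y) * 100 per ClassID.
case class Rec(personId: Int, classId: Int, attended: Boolean, converted: Boolean)
def percentByClass(rows: Seq[Rec]): Map[Int, Double] =
  rows.groupBy(_.classId).map { case (classId, group) =>
    val x = group.filter(_.converted).map(_.personId).distinct.size
    val y = group.count(_.attended)
    classId -> (if (y == 0) 0.0 else x.toDouble / y * 100)
  }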

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u = unique(x(:,1))
res = arrayfun(@(y) length(x(x(:,1)==y & x(:,2)==1)), u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element in the array, and puts the results in a new array, which it returns.
Here fun is the anonymous function @(y)length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (this is called logical indexing). So for each of the unique elements, it counts the rows of x where the first column is that unique element and the second column is 1.
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
2
2
>> counts = accumarray(d(:),1,[],@sum)
counts =
1
2
>> res = [c,counts]
Suppose you have an array of various integers in 'array'. The tabulate function will sort the unique values and count the occurrences:
table = tabulate(array)
Look for your unique counts in column 2 of table.
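For comparison, the same per-group count (for each unique value in the first column, how many rows have a 1 in the second) written in Scala; this is only meant to make the logic of the one-liners above explicit:
// Count, for each unique value in column 1, the rows with 1 in column 2.
val x = Seq((20, 2), (20, 2), (30, 2), (30, 1), (40, 1), (40, 1))
val counts: Map[Int, Int] =
  x.groupBy(_._1)                                      // group rows by the first column
   .map { case (k, rows) => k -> rows.count(_._2 == 1) }
// counts: Map(20 -> 0, 30 -> 1, 40 -> 2)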