PySpark groupBy - multiply and divide gives wrong results

Data:
Profit Amount Rate Account Status Yr
0.3065 56999 1 Acc3 S1 1
0.3956 57000 1 Acc3 S1 1
0.3065 57001 1 Acc3 S1 1
0.3956 57002 1 Acc3 S1 1
0.3065 57003 1 Acc3 S1 2
0.3065 57004 0.89655 Acc3 S1 3
0.3956 57005 0.89655 Acc3 S1 3
0.2984 57006 0.89655 Acc3 S1 3
0.3956 57007 1 Acc3 S2 2
0.3956 57008 1 Acc3 S2 2
0.2984 57009 1 Acc3 S2 2
0.2984 57010 1 Acc1 S1 1
0.3956 57011 1 Acc1 S1 1
0.3065 57012 1 Acc1 S1 1
0.3065 57013 1 Acc1 S1 1
0.3065 57013 1 Acc1 S1 1
Code:
df = df1\
.join(df12,(df12.code == df2.code),how = 'left').drop(df2.code).filter(col('Date') == '20Jan2019')\
.join(df3,df1.id== df3.id,how = 'left').drop(df3.id)\
.join(df4,df1.id == df4.id,how = 'left').drop(df4.id)\
.join(df5,df1.id2 == df5.id2,how ='left').drop(df5.id2)\
.withColumn("Account",concat(trim(df3.name1),trim(df4name1)))\
.withColumn("Status",when(df1.FB_Ind == 1,"S1").otherwise("S2"))\
.withColumn('Year',((df1['date'].substr(6, 4))+df1['Year']))
df6 = df.distinct()
df7 = df6.groupBy('Yr','Status','Account')\
.agg(sum((Profit * amount)/Rate).alias('output'))
The output I am receiving is in decimals, such as 0.234, instead of in the thousands, such as 23344.2.
I am trying to convert Sum((Profit*amount)/Rate) as Output into PySpark.

This is how I would write your code. Also, I don't understand why you are adding df1['Year'].
from pyspark.sql import functions as F

df = df1\
.join(df12, "code", how='left') \
.filter(F.col('Date') == '20Jan2019') \
.join(df3, df1.id == df3.id, how='left') \
.drop(df3.id)\
.join(df4, "id", how='left') \
.join(df5, "id2", how='left') \
.withColumn("Account", F.concat(F.trim(df3.name1), F.trim(df4.name1)))\
.withColumn("Status", F.when(df1.FB_Ind == 1, "S1").otherwise("S2"))\
.withColumn('Year', F.col('date').substr(6, 4) + F.col('Year'))
df6 = df.distinct()
# Divide each row by its own Rate inside the sum, then alias the aggregated column
df7 = df6.groupBy('Yr', 'Status', 'Account')\
.agg(F.sum((F.col("Profit") * F.col("amount")) / F.col("Rate")).alias('output'))
For details on how to apply groupBy, partitionBy and other functions in PySpark, please refer to: Analysis using Pyspark
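As a sanity check on what the aggregation is expected to return, here is a minimal sketch in plain Python (no Spark needed) that applies Sum((Profit*Amount)/Rate) per (Yr, Status, Account) group. The rows are a few lines copied from the sample data in the question; the helper name grouped_output is only illustrative.

```python
from collections import defaultdict

# (Profit, Amount, Rate, Account, Status, Yr): a few rows from the sample data
rows = [
    (0.3065, 57004, 0.89655, "Acc3", "S1", 3),
    (0.3956, 57005, 0.89655, "Acc3", "S1", 3),
    (0.2984, 57006, 0.89655, "Acc3", "S1", 3),
    (0.3956, 57007, 1.0, "Acc3", "S2", 2),
    (0.3956, 57008, 1.0, "Acc3", "S2", 2),
    (0.2984, 57009, 1.0, "Acc3", "S2", 2),
]


def grouped_output(rows):
    """Sum Profit * Amount / Rate per (Yr, Status, Account) group."""
    totals = defaultdict(float)
    for profit, amount, rate, account, status, yr in rows:
        # The division happens per row, before the values are summed
        totals[(yr, status, account)] += profit * amount / rate
    return dict(totals)


output = grouped_output(rows)
for key, value in sorted(output.items()):
    print(key, round(value, 2))
```

With these inputs every group total lands in the tens of thousands, so a result like 0.234 from Spark suggests the columns or the grouping are not what you expect.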

Related

How to repeat the simulation 1000 times to have 1000 different datasets

I have a binary covariate x and want to use the logistic formula to generate a binary outcome y. I then want to simulate data based on the probabilities from the logistic equation.
I have tried this and I am able to simulate the first dataset, but I am having trouble repeating the process and adding a subject id and a simulation number. The output should look like this:
simulation id y x
1 1 1 0
1 2 0 1
1 3 0 0
1 4 1 1
1 5 0 1
2 1 1 1
2 2 0 0
2 3 1 1
2 4 1 1
2 5 1 0
3 1 1 1
3 2 1 0
3 3 1 0
3 4 0 0
3 5 0 1
The code is as follows:
nsample <- 100
set.seed(1234)
id <- rep(1:nsample)
p <- 0.01
b0 <- log(p/(1-p))
b1 <- 0.5
b2 <- 0.1
b5 <- -4
x1 <- rbinom(nsample, 1, 0.4)
x2 <- rbinom(nsample, 1, 0.6)
z2 <- b0 + b1*x1+b2*x2
p_vector <- 1/(1+exp(-z2))
y <- rbinom(n = nsample, size = 1, prob = p_vector)  # one draw per subject; length(nsample) is 1
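One way to repeat the process is to wrap the single-dataset code in a loop over a simulation counter and store the counter and the subject id with every row. The sketch below shows that idea in Python rather than R, reusing the coefficients from the code above; the function name simulate and the row layout (simulation, id, y, x1, x2) are assumptions made for illustration.

```python
import math
import random


def simulate(n_sims=1000, n_sample=100, seed=1234):
    """Return rows (simulation, id, y, x1, x2) for n_sims independent datasets."""
    rng = random.Random(seed)
    p = 0.01
    b0 = math.log(p / (1 - p))
    b1, b2 = 0.5, 0.1
    rows = []
    for sim in range(1, n_sims + 1):
        for subject in range(1, n_sample + 1):
            x1 = int(rng.random() < 0.4)      # Bernoulli(0.4) covariate
            x2 = int(rng.random() < 0.6)      # Bernoulli(0.6) covariate
            z = b0 + b1 * x1 + b2 * x2        # linear predictor
            prob = 1 / (1 + math.exp(-z))     # logistic probability
            y = int(rng.random() < prob)      # binary outcome
            rows.append((sim, subject, y, x1, x2))
    return rows


rows = simulate()
print(len(rows))
print(rows[0])
```

The same structure carries over to R, for example by building one data frame per simulation and binding them together.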

How to compute survival rates in SAS Guide?

I would like to compute a survival rate, which I will call Z. I was thinking about using a macro, but I can't find an easy way to do it. The table below shows an example of the intended result.
The purpose is to perform the Z calculation by id.
X(i,j) is the probability of default of id i at id_time j, where i = 1,...,3 and j = 1,...,4.
Y(i,j) = 1 - X(i,j) always.
Z(i,j) = Y(i,j-1) * Z(i,j-1), except when j = 1, where Z(i,1) = 1 = 100%.
If you need more details, just let me know.
Here is the example:
id id_time x y z
1 1 0,010 0,990 1
1 2 0,015 0,985 0,990
1 3 0,020 0,980 0,975
1 4 0,025 0,975 0,956
2 1 0,010 0,990 1
2 2 0,015 0,985 0,990
2 3 0,020 0,980 0,975
2 4 0,020 0,980 0,956
3 1 0,005 0,995 1
3 2 0,010 0,990 0,995
3 3 0,020 0,980 0,985
3 4 0,030 0,970 0,965
I implemented it according to your formula.
data test;
format x y 12.3;
input id: id_time: x: comma9. y: comma9.;
x = x * 0.001;
y = y * 0.001;
cards;
1 1 0,010 0,990
1 2 0,015 0,985
1 3 0,020 0,980
1 4 0,025 0,975
2 1 0,010 0,990
2 2 0,015 0,985
2 3 0,020 0,980
2 4 0,020 0,980
3 1 0,005 0,995
3 2 0,010 0,990
3 3 0,020 0,980
3 4 0,030 0,970
;
run;
data _null_;
retain Z;
set test;
by id notsorted;
LagY = Lag(y);
if first.id then LagY = .;
if first.id then Z = 1;
if not first.id then Z = round(LagY * Z,0.001);
put (id id_time x y z)(=);
run;
Output:
id=1 id_time=1 x=0.010 y=0.990 Z=1
id=1 id_time=2 x=0.015 y=0.985 Z=0.99
id=1 id_time=3 x=0.020 y=0.980 Z=0.975
id=1 id_time=4 x=0.025 y=0.975 Z=0.956
id=2 id_time=1 x=0.010 y=0.990 Z=1
id=2 id_time=2 x=0.015 y=0.985 Z=0.99
id=2 id_time=3 x=0.020 y=0.980 Z=0.975
id=2 id_time=4 x=0.020 y=0.980 Z=0.956
id=3 id_time=1 x=0.005 y=0.995 Z=1
id=3 id_time=2 x=0.010 y=0.990 Z=0.995
id=3 id_time=3 x=0.020 y=0.980 Z=0.985
id=3 id_time=4 x=0.030 y=0.970 Z=0.965
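For readers who want to check the recurrence outside SAS, here is a minimal Python sketch of Z(i,j) = Y(i,j-1) * Z(i,j-1) with Z(i,1) = 1. The function name survival and the dictionary layout are assumptions; unlike the data step above, it rounds only for display and not at each step.

```python
def survival(x_by_time):
    """Return Z for one id, given its default probabilities X ordered by id_time."""
    z = 1.0
    out = []
    for j, _ in enumerate(x_by_time):
        if j > 0:
            # Multiply by the previous period's Y = 1 - X
            z *= 1 - x_by_time[j - 1]
        out.append(z)
    return out


x = {
    1: [0.010, 0.015, 0.020, 0.025],
    2: [0.010, 0.015, 0.020, 0.020],
    3: [0.005, 0.010, 0.020, 0.030],
}
for id_, xs in x.items():
    print(id_, [round(z, 3) for z in survival(xs)])
```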

Legend entries not working in matlab

I am having trouble getting legend entries for a scatter plot in matlab.
I should have four different entries for each combination of two colors and two shapes.
colormap jet
x = rand(1,30); %x data
y = rand(1,30); %y data
c = [1 2 2 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1]; %color
s = [2 2 1 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 1 1 2]; %shape
%index data for each shape (s)
s1 = s == 1; %square
s2 = s == 2; %circle
xsq = x(s1);
ysq = y(s1);
csq = c(s1);
xcirc = x(s2);
ycirc = y(s2);
ccirc = c(s2);
%plot data with different colors and shapes
h1 = scatter(xsq, ysq, 50,csq,'s','jitter','on','jitterAmount',0.2);
hold on
h2 = scatter(xcirc, ycirc, 50, ccirc, 'o','jitter','on','jitterAmount',0.2);
This plots a scatter plot with red circles and squares and blue circles and squares. Now I want a legend (this doesn't work).
%legend for each combination
legend([h1(1) h1(2) h2(1) h2(2)],'red+square','red+circle','blue+square','blue+circle')
Any ideas? Thanks :)
scatter is very limited when you want to place more than one set of points together. I would use plot instead as you can chain multiple sets in one command. Once you do that, it's very easy to use legend. Do something like this:
colormap jet
x = rand(1,30); %x data
y = rand(1,30); %y data
c = [1 2 2 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1]; %color
s = [2 2 1 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 1 1 2]; %shape
%index data for each shape (s)
s1 = s == 1; %square
s2 = s == 2; %circle
c1 = c == 1; %first colour (red) %// NEW
c2 = c == 2; %second colour (blue) %// NEW
red_squares = s1 & c1; %// NEW
blue_squares = s1 & c2; %// NEW
red_circles = s2 & c1; %// NEW
blue_circles = s2 & c2; %// NEW
plot(x(red_squares), y(red_squares), 'rs', x(blue_squares), y(blue_squares), 'bs', x(red_circles), y(red_circles), 'ro', x(blue_circles), y(blue_circles), 'bo');
legend('red+square','blue+square','red+circle','blue+circle');
What's important is this syntax:
red_squares = s1 & c1;
blue_squares = s1 & c2;
red_circles = s2 & c1;
blue_circles = s2 & c2;
This uses logical indexing to select the circles and squares that belong to each colour. For example, red_squares picks only those points that are square and that belong to the first colour. There are four different combinations:
s1, c1
s1, c2
s2, c1
s2, c2
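The same mask logic can be sketched in plain Python to confirm that the four combinations split the data cleanly. The c and s vectors are copied from the code above; the dictionary name groups is only illustrative.

```python
c = [1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1]  # colour
s = [2, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2]  # shape

# One boolean mask per (shape, colour) combination, like s1 & c1 in MATLAB
groups = {
    "red+square": [si == 1 and ci == 1 for si, ci in zip(s, c)],
    "blue+square": [si == 1 and ci == 2 for si, ci in zip(s, c)],
    "red+circle": [si == 2 and ci == 1 for si, ci in zip(s, c)],
    "blue+circle": [si == 2 and ci == 2 for si, ci in zip(s, c)],
}

counts = {name: sum(mask) for name, mask in groups.items()}
print(counts)
```

Every point falls into exactly one mask, so each plotted series maps to one legend entry.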

Multiple definitions of node winbugs

I have a problem with this code in WinBUGS. The model is syntactically correct and the data are loaded, but when I compile, the software outputs "multiple definitions of node Z". I don't know how to solve the problem.
This is the model:
#BUGS Model
model {
for (i in 1:n){
for (j in 1:p){
Y[i , j] ~ dcat ( prob [i , j , 1: M[j]])
B <- sum(alpha[j])
}
theta [i] ~ dnorm (0.0 , 1.0)
}
for (i in 1:n){
for (j in 1:p){
for (k in 1:M[j]){
Z <- sum(delta [k ])
eta [i , j , k] <- 1.7* alpha [j] * (B * (theta [i] - beta [j] ) + Z)
exp.eta[i , j , k] <- exp( eta[i , j , k])
psum[ i , j , k] <- sum(eta[i , j , 1:k])
prob[i , j , k] <- exp.eta[i , j , k] / psum[i , j , 1:M[j]]
}
}
}
for (j in 1:p){
alpha [j] ~ dnorm (0 , pr.alpha) I(0 , )
for (k in 2:M[j]){
delta [k] ~ dnorm (0.0 , 1.0)
}
for (k in 1:M[j]){
beta [j] ~ dnorm (0 , pr.beta )
}
}
delta [1] <- 0.0
pr.alpha <- pow(1 , -2)
pr.beta <- pow(1, -2)
}
#data
list(n=10, p=8)
M[] M[] M[] M[] M[] M[] M[] M[]
2 2 4 2 2 3 4 2
2 1 1 2 1 2 2 3
1 2 1 3 1 1 4 4
2 1 1 2 1 1 2 4
3 4 4 3 3 3 1 1
4 3 4 4 4 4 4 4
1 1 2 2 1 2 4 4
2 1 1 3 1 4 2 4
3 4 1 1 1 2 2 2
2 2 2 1 4 4 4 4
END
Thanks to everyone that will answer.
Your problem is that some nodes are defined multiple times inside BUGS loops. For example, B is defined n*p times in the first i and j loop. BUGS does not allow this: you cannot overwrite a node value. You need to either
1) Add indexes to B, Z, delta[k] and beta[j] so that BUGS can store simulated values in separate elements of each node during the loops, e.g. replace B with B[i,j] and Z with Z[i,j,k]
or
2) Move B, Z, delta[k] and beta[j] to loops that only cover the indexes they already have, i.e. B and Z outside any loop as they have no index, and delta[k] only in a for(k in 1:...) loop.
The decision depends on what you have in mind for your model and which parameters you want to store.
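To see why the compiler complains, it can help to count how often each node appears on a left-hand side. The Python sketch below only mimics the first i and j loop with the question's n = 10 and p = 8; it is an illustration of the single-definition rule, not a description of how WinBUGS works internally, and the name B_indexed stands in for a hypothetical B[i,j].

```python
from collections import Counter


def count_definitions(n, p):
    """Count how often each node (name plus index) is defined in the i, j loop."""
    defs = Counter()
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            defs[("B", ())] += 1               # B <- ... has no index: same node every pass
            defs[("B_indexed", (i, j))] += 1   # B[i, j] <- ... is a new node every pass
    return defs


defs = count_definitions(n=10, p=8)
repeated = {node: k for node, k in defs.items() if k > 1}
print(repeated)  # {('B', ()): 80}
```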

specific tuples generation and counting (matlab)

I need to generate (preferably in MATLAB) all "unique" integer tuples k = (k_1, k_2, ..., k_r) and
their corresponding multiplicities, satisfying two additional conditions:
1. sum(k) = n
2. 0<=k_i<=w_i, where vector w = (w_1,w_2, ..., w_r) contains predefined limits w_i.
A "unique" tuple is identified by its unordered set of elements
(k_1,k_2, ..., k_r)
[t,m] = func(n,w)
t ... matrix of tuples, m ... vector of tuple multiplicities
Typical problem dimensions are about:
n ~ 30, n <= sum(w) <= n+10, 5 <= r <= n
(I hope a polynomial-time algorithm exists!)
Example:
n = 8, w = (2,2,2,2,2), r = length(w)
[t,m] = func(n,w)
t =
2 2 2 2 0
2 2 2 1 1
m =
5
10
In this case there exist only two "unique" tuples:
(2,2,2,2,0) with multiplicity 5
there are 5 "identical" tuples with the same set of elements
0 2 2 2 2
2 0 2 2 2
2 2 0 2 2
2 2 2 0 2
2 2 2 2 0
and
(2,2,2,1,1) with multiplicity 10
there are 10 "identical" tuples with the same set of elements
1 1 2 2 2
1 2 1 2 2
1 2 2 1 2
1 2 2 2 1
2 1 1 2 2
2 1 2 1 2
2 1 2 2 1
2 2 1 1 2
2 2 1 2 1
2 2 2 1 1
Thanks in advance for any help.
A very rough (extremely inefficient) solution. The FOR loop over 2^nvec-1 test samples (nvec = r*(maxw+1)) and the growing variable res are really terrible!
This solution is based on the following question.
Is there a more efficient way?
function [tup,mul] = tupmul(n,w)
r = length(w);
maxw = max(w);
w = repmat(w,1,maxw+1);
vec = 0:maxw;
vec = repmat(vec',1,r);
vec = reshape(vec',1,r*(maxw+1));
nvec = length(vec);
res = [];
for i = 1:(2^nvec - 1)
ndx = dec2bin(i,nvec) == '1';
if sum(vec(ndx)) == n && all(vec(ndx)<=w(ndx)) && length(vec(ndx))==r
res = [res; vec(ndx)];
end
end
tup = unique(res,'rows');
ntup = size(tup,1);
mul = zeros(ntup,1);
for i=1:ntup
mul(i) = size(unique(perms(tup(i,:)),'rows'),1);
end
end
Example:
>> [tup mul] = tupmul(8,[2 2 2 2 2])
tup =
0 2 2 2 2
1 1 2 2 2
mul =
5
10
Or same case but with changed limits for first two positions:
>> [tup mul] = tupmul(8,[1 1 2 2 2])
tup =
1 1 2 2 2
mul =
10
This is a far better algorithm, created by Bruno Luong (a phenomenal MATLAB programmer):
function [t, m, v] = tupmul(n, w)
v = tmr(length(w), n, w);
t = sort(v,2);
[t,~,J] = unique(t,'rows');
m = accumarray(J(:),1);
end % tupmul
function v = tmr(p, n, w, head)
if p==1
if n <= w(end)
v = n;
else
v = zeros(0,1);
end
else
jmax = min(n,w(end-p+1));
v = cell2mat(arrayfun(@(j) tmr(p-1, n-j, w, j), (0:jmax)', ...
'UniformOutput', false));
end
if nargin>=4 % add a head column
v = [head+zeros(size(v,1),1,class(head)) v];
end
end %tmr
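The recursive idea above (fix one position, recurse on the remaining sum and limits, then sort each tuple and count duplicates) can be sketched in Python as follows. This is a rough port for illustration, not Bruno Luong's code; the helper name bounded_tuples is an assumption, and the enumeration is still exponential in the worst case rather than polynomial.

```python
from collections import Counter


def bounded_tuples(n, w):
    """Yield every tuple k with sum(k) == n and 0 <= k[i] <= w[i]."""
    if len(w) == 1:
        if 0 <= n <= w[0]:
            yield (n,)
        return
    # Fix the first position, then recurse on the remaining sum and limits
    for first in range(min(n, w[0]) + 1):
        for rest in bounded_tuples(n - first, w[1:]):
            yield (first,) + rest


def tupmul(n, w):
    """Map each sorted ("unique") tuple to the number of ordered tuples behind it."""
    counts = Counter(tuple(sorted(k)) for k in bounded_tuples(n, w))
    return dict(sorted(counts.items()))


print(tupmul(8, [2, 2, 2, 2, 2]))
# {(0, 2, 2, 2, 2): 5, (1, 1, 2, 2, 2): 10}
```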