Using the same column for lag on which the lag is performed

I have an Excel sheet with the logic below, and I am looking to convert that logic to PySpark.
col1 is either Y or N. derivedvalue follows this logic: if col1 = N then 1, else the previous row's derivedvalue + 1.
col1    derivedvalue
N       1
Y       2
Y       3
N       1
N       1
Y       2
Y       3
Y       4

counter_var = 0
def cal_duplicate(input_flag):
    global counter_var
    if input_flag == 'N':
        counter_var = 1
        return 1
    else:
        counter_var += 1
        return counter_var
df['duplicate_counter'] = df.apply(lambda x: cal_duplicate(x['flag']), axis=1)
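One way to express the same logic without a global counter is to split it into two steps: a running count of 'N' rows gives every row a group id, and the derived value is the row's position inside its group. Below is a minimal plain-Python sketch of that idea; the function name derived_values is made up for illustration. In PySpark the two steps would presumably map onto a running sum and a row number over windows, but that needs an explicit ordering column, which the question does not show.

```python
from itertools import accumulate

def derived_values(flags):
    """Return 1 for every 'N' row, otherwise the previous value + 1."""
    # Step 1: running count of 'N' rows; rows sharing a count form one group.
    group_ids = list(accumulate(1 if f == 'N' else 0 for f in flags))
    # Step 2: 1-based position of each row inside its group.
    seen = {}
    out = []
    for g in group_ids:
        seen[g] = seen.get(g, 0) + 1
        out.append(seen[g])
    return out

flags = ['N', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y']
print(derived_values(flags))  # [1, 2, 3, 1, 1, 2, 3, 4]
```

Leading 'Y' rows (before any 'N') fall into group 0 and count up from 1, which matches the behaviour of the original counter starting at 0.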

Related

Shuffle vector elements such that two similar elements coming together at most twice

For the sake of an example:
I have a vector named vec containing ten 1s and ten 2s. I am trying to randomly arrange it but with one condition that two same values must not come together more than twice.
What I have done till now is generating random indexes of vec using the randperm function and shuffling vec accordingly. This is what I have:
vec = [1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2];
atmost = 2;
indexes = randperm(length(vec));
vec = vec(indexes);
>> vec =
2 1 1 1 2 1 2 1 2 1 1 1 2 2 1 2 1 2 2 2
My code randomly arranges elements of vec but does not fulfill the condition of two similar values coming at most two times. How can I do this? Any ideas?
You can first determine the lengths of the runs of one value, and then fit the other values around them.
In the explanation I'll use values of a and b in the vector, so as not to confuse the values of the elements of the vector (1 and 2) and the lengths of the runs for each element (1 and 2).
The end goal is to use repelem to construct the shuffled vector. repelem takes a vector of the elements to repeat, and a vector of how many times to repeat each element. For example, if we have:
v = [b a b a b a b a b a b a b a b a b]
n = [1 1 1 2 1 1 2 1 2 1 1 1 1 2 1 1 0]
repelem will return:
shuffled_vec = [b a b a a b a b b a b b a b a b a a b a]
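For readers without MATLAB at hand, the expansion that repelem performs here can be sketched in a few lines of Python. The helper below is a hypothetical stand-in that only covers this vector-with-counts use, not the full MATLAB function.

```python
def repelem(values, counts):
    """Repeat values[i] exactly counts[i] times, keeping the order."""
    return [v for v, n in zip(values, counts) for _ in range(n)]

v = list("babababababababab")  # alternating b/a, 17 entries
n = [1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 0]
print(" ".join(repelem(v, n)))
# b a b a a b a b b a b b a b a b a a b a
```

A count of 0 (the last entry of n) simply drops that element, which is how an empty leading or trailing run of b's is represented.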
As a first step, I'll generate random values for the counts corresponding to the a values. In this example, that would be:
a_grouping = [1 2 1 1 1 1 2 1]
First, randomly select the number of 2's in the grouping vector. There can be at most n/2 of them. Then add 1's to make up the desired total.
num_total = 10; % number of a's (and b's)
% There can be 0..num_total/2 groups of two a's in the string.
two_count = randi(num_total/2 + 1) - 1;
% The remainder of the groups of a's have length one.
one_count = num_total - (2*two_count);
% Generate random permutation of two_count 2's and one_count 1's
a_grouping = [repmat(2, 1, two_count), repmat(1, 1, one_count)];
This will give us something like:
a_grouping = [2 2 1 1 1 1 1 1]
Now shuffle:
a_grouping = a_grouping(randperm(numel(a_grouping)));
With the result:
a_grouping = [1 2 1 1 1 1 2 1]
Now we need to figure out where the b values go. There must be at least one b between each run of a values (and at most two), and there may be 0, 1 or 2 b values at the beginning and end of the string. So we need to generate counts for each of the x and y values below:
all_grouping = [y 1 x 2 x 1 x 1 x 1 x 1 x 2 x 1 y]
The x values must be at least 1, so we'll assign them first. Since the y values can be either 0, 1 or 2, we'll leave them set to 0.
% Between each grouping of a's, there must be at least one b.
% There can be 0, 1, or 2 b's before and after the a's,
% so make room for them as well.
b_grouping = zeros(1, numel(a_grouping) - 1 + 2);
b_grouping(2:end-1) = 1; % insert one b between each a group
For each of the remaining counts we need to assign, just select a random slot. If it's not filled yet (i.e. if it's < 2), increment the count, otherwise find a different slot.
% Assign location of remaining 2's
for s = numel(a_grouping):num_total
    unassigned = true;
    while unassigned
        % generate random indices until we find one that's open
        idx = randi(numel(b_grouping));
        if b_grouping(idx) < 2
            b_grouping(idx) = b_grouping(idx) + 1;
            unassigned = false;
        end
    end
end
Now we've got separate counts for the a's and b's:
a_grouping = [1 2 1 1 1 1 2 1]
b_grouping = [1 1 1 2 2 1 1 1 0]
We'll build the value vector (v from the start of the example) and interleave the groupings (the n vector).
% Interleave the a and b values
group_values = zeros(1, numel(a_grouping) + numel(b_grouping));
group_values(1:2:end) = 2;
group_values(2:2:end) = 1;
% Interleave the corresponding groupings
all_grouping = zeros(size(group_values));
all_grouping(2:2:end) = a_grouping;
all_grouping(1:2:end) = b_grouping;
Finally, repelem puts everything together:
shuffled_vec = repelem(group_values, all_grouping)
The final result is:
shuffled_vec =
1 2 2 1 1 2 1 1 2 2 1 1 2 2 1 1 2 2 1 2
Full code:
num_total = 10; % number of a's (and b's)
% There can be 0..num_total/2 groups of two a's in the string.
two_count = randi(num_total/2 + 1) - 1;
% The remainder of the groups of a's have length one.
one_count = num_total - (2*two_count);
% Generate random permutation of two_count 2's and one_count 1's
a_grouping = [repmat(2, 1, two_count), repmat(1, 1, one_count)];
a_grouping = a_grouping(randperm(numel(a_grouping)));
% disp(a_grouping)
% Between each grouping of a's, there must be at least one b.
% There can be 0, 1, or 2 b's before and after the a's,
% so make room for them as well.
b_grouping = zeros(1, numel(a_grouping) - 1 + 2);
b_grouping(2:end-1) = 1; % insert one b between each a group
% Assign location of remaining 2's
for s = numel(a_grouping):num_total
    unassigned = true;
    while unassigned
        % generate random indices until we find one that's open
        idx = randi(numel(b_grouping));
        if b_grouping(idx) < 2
            b_grouping(idx) = b_grouping(idx) + 1;
            unassigned = false;
        end
    end
end
% Interleave the a and b values
group_values = zeros(1, numel(a_grouping) + numel(b_grouping));
group_values(1:2:end) = 2;
group_values(2:2:end) = 1;
% Interleave the corresponding groupings
all_grouping = zeros(size(group_values));
all_grouping(2:2:end) = a_grouping;
all_grouping(1:2:end) = b_grouping;
shuffled_vec = repelem(group_values, all_grouping)
This should generate fairly random non-consecutive vectors. Whether it covers all possibilities uniformly, I'm not sure.
out=[];
for i=1:10
    if randi(2)==1
        out=[out,1,2];
    else
        out=[out,2,1];
    end
end
disp(out)
Example results
1,2,1,2,1,2,1,2,2,1,2,1,1,2,2,1,2,1,1,2,
2,1,2,1,1,2,2,1,2,1,1,2,2,1,2,1,1,2,1,2,
1,2,2,1,2,1,1,2,1,2,2,1,2,1,2,1,2,1,2,1,
2,1,2,1,2,1,1,2,1,2,2,1,1,2,2,1,2,1,1,2,
2,1,1,2,1,2,1,2,2,1,2,1,1,2,1,2,1,2,1,2,
2,1,1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2,2,1,
1,2,2,1,1,2,1,2,2,1,2,1,2,1,1,2,2,1,1,2,
2,1,1,2,2,1,2,1,1,2,2,1,1,2,1,2,1,2,1,2,
1,2,2,1,1,2,1,2,2,1,2,1,1,2,1,2,1,2,1,2,
1,2,1,2,2,1,2,1,2,1,2,1,1,2,1,2,2,1,2,1,
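The pair-based idea above (append either 1,2 or 2,1 ten times) can be sketched in Python together with a check of the constraint. The function names and the rng parameter are illustrative assumptions. Because every aligned pair contains one 1 and one 2, a run can never exceed two elements, but for the same reason some valid arrangements (for example any sequence starting with 1 1) can never be produced.

```python
import random
from itertools import groupby

def shuffle_pairs(n_pairs, rng):
    """Concatenate n_pairs blocks, each randomly [1, 2] or [2, 1]."""
    out = []
    for _ in range(n_pairs):
        pair = [1, 2]
        if rng.random() < 0.5:
            pair.reverse()
        out.extend(pair)
    return out

def longest_run(seq):
    """Length of the longest run of equal neighbouring values."""
    return max(len(list(group)) for _, group in groupby(seq))

vec = shuffle_pairs(10, random.Random())
print(vec, "longest run:", longest_run(vec))
```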

How to zero out the centre k by k matrix in an input matrix with odd number of columns and rows

I am trying to solve this problem:
Write a function called cancel_middle that takes A, an n-by-m
matrix, as an input where both n and m are odd numbers and k, a positive
odd integer that is smaller than both m and n (the function does not have to
check the input). The function returns the input matrix with its center k-by-k
matrix zeroed out.
Check out the following run:
>> cancel_middle(ones(5),3)
ans =
1 1 1 1 1
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 1 1 1 1
My code works only when k=3. How can I generalize it for all odd values of k? Here's what I have so far:
function test(n,m,k)
    A = ones(n,m);
    B = zeros(k);
    A((end+1)/2,(end+1)/2)=B((end+1)/2,(end+1)/2);
    A(((end+1)/2)-1,((end+1)/2)-1)= B(1,1);
    A(((end+1)/2)-1,((end+1)/2))= B(1,2);
    A(((end+1)/2)-1,((end+1)/2)+1)= B(1,3);
    A(((end+1)/2),((end+1)/2)-1)= B(2,1);
    A(((end+1)/2),((end+1)/2)+1)= B(2,3);
    A(((end+1)/2)+1,((end+1)/2)-1)= B(3,1);
    A(((end+1)/2)+1,((end+1)/2))= B(3,2);
    A((end+1)/2+1,(end+1)/2+1)=B(3,3)
end
You can simplify your code. Please have a look at Matrix Indexing in MATLAB: "one or both of the row and column subscripts can be vectors", i.e. you can address a whole submatrix at once. Then you only need to get the indices right: since n, m and k are all odd, n-k and m-k are even and give the number of rows and columns left over from your original matrix A. Dividing by 2 gives the padding on the top/bottom and left/right. Add 1 to get the first row and column of the block, because MATLAB indices start at 1.
% Generate test data
n = 13;
m = 11;
A = reshape( 1:m*n, n, m )
k = 3;
% Do the calculations
start_row = (n-k)/2 + 1
start_col = (m-k)/2 + 1
A( start_row:start_row+k-1, start_col:start_col+k-1 ) = zeros( k )
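The same index arithmetic can be written in Python with 0-based indices, in which case the +1 disappears. This is only a sketch on nested lists; the function name mirrors the exercise, and no input checking is done (the exercise says it is not required).

```python
def cancel_middle(a, k):
    """Return a copy of the n-by-m nested list a with its centre k-by-k block zeroed."""
    n, m = len(a), len(a[0])
    row0 = (n - k) // 2            # padding above the block = first block row
    col0 = (m - k) // 2            # padding left of the block = first block column
    b = [row[:] for row in a]      # copy so the input is left untouched
    for r in range(row0, row0 + k):
        for c in range(col0, col0 + k):
            b[r][c] = 0
    return b

for row in cancel_middle([[1] * 5 for _ in range(5)], 3):
    print(row)
# [1, 1, 1, 1, 1]
# [1, 0, 0, 0, 1]
# [1, 0, 0, 0, 1]
# [1, 0, 0, 0, 1]
# [1, 1, 1, 1, 1]
```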
function b = cancel_middle(a,k)
    [n,m] = size(a);
    start_row = (n-k)/2 + 1;
    start_column = (m-k)/2 + 1;
    end_row = (n-k)/2 + k;
    end_column = (m-k)/2 + k;
    a(start_row:end_row,start_column:end_column) = 0;
    b = a;
end
I wrote a function in an m-file called cancel_middle; it replaces the central k-by-k submatrix with zeros and leaves the rest of the matrix unchanged. It is a general function that takes two inputs: the matrix you want to modify and the order k of the submatrix.

count co-occurrence neighbors in a vector

I have a vector, for example S=(0,3,2,0,1,2,0,1,1,2,3,3,0,1,2,3,0).
I want to count co-occurring neighbors. For example, how many times does the first pair "0,3" occur up to the end of the sequence? Then the next pair "2,0" is examined, and similarly for the other pairs.
Below is a part of my code:
s=size(pic);
S=s(1)*s(2);
V = reshape(pic,1,S);
min= min(V);
Z=zeros(1,S+1);
Z(1)=min;
Z(2:end)=V;
for i=[0:2:length(Z)-1];
contj=0
for j=0;length(Z)-1;
if Z(i,i+1)= Z(j,j+1)
contj=contj+1
end
end
count(i)= contj
end
It gives me this error:
The expression to the left of the equals sign is not a valid target for an assignment
in this line:
if Z(i,i+1)= Z(j,j+1)
I read similar questions and applied the tips from them, but they didn't work!
If pairs are defined without overlapping (according to comments):
S = [0,3,2,0,1,2,0,1,1,2,3,3,0,1,2,3]; %// define data
S2 = reshape(S,2,[]).'; %'// arrange in pairs: one pair in each row
[~, jj, kk] = unique(S2,'rows'); %// get unique labels for pairs
pairs = S2(jj,:); %// unique pairs
counts = accumarray(kk, 1); %// count of each pair. Or use histc(kk, 1:max(kk))
Example: with S as above (I introduce blanks to make pairs stand out),
S = [0,3, 2,0, 1,2, 0,1, 1,2, 3,3, 0,1, 2,3];
the result is
pairs =
0 1
0 3
1 2
2 0
2 3
3 3
counts =
2
1
2
1
1
1
If pairs are defined without overlapping but counted with overlapping:
S = [0,3,2,0,1,2,0,1,1,2,3,3,0,1,2,3]; %// define data
S2 = reshape(S,2,[]).'; %'// arrange in pairs: one pair in each row
[~, jj] = unique(S2,'rows'); %// get unique labels for pairs
pairs = S2(jj,:); %// unique pairs
P = [S(1:end-1).' S(2:end).']; %// all pairs, with overlapping
counts = sum(pdist2(P,pairs,'hamming')==0);
If you don't have pdist2 (Statistics Toolbox), replace last line by
counts = sum(all(bsxfun(@eq, pairs.', permute(P, [2 3 1]))), 3);
Result:
>> pairs
pairs =
0 1
0 3
1 2
2 0
2 3
3 3
>> counts
counts =
3 1 3 2 2 1
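As a cross-check of both results, here is a short Python sketch that counts the pairs with collections.Counter. It is not a translation of the MATLAB code above, just the same two definitions of "pair" applied to the same data.

```python
from collections import Counter

S = [0, 3, 2, 0, 1, 2, 0, 1, 1, 2, 3, 3, 0, 1, 2, 3]

# Non-overlapping pairs: (S[0], S[1]), (S[2], S[3]), ...
non_overlapping = Counter(zip(S[0::2], S[1::2]))

# Overlapping pairs: every element paired with its right-hand neighbour.
overlapping = Counter(zip(S, S[1:]))

for pair in sorted(non_overlapping):
    print(pair, non_overlapping[pair], overlapping[pair])
# (0, 1) 2 3
# (0, 3) 1 1
# (1, 2) 2 3
# (2, 0) 1 2
# (2, 3) 1 2
# (3, 3) 1 1
```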
You can do it using the sparse command:
os = - min(S) + 1; % convert S into indices
% if you want all pairs, i.e., for S = (2,0,1) count (2,0) AND (0,1):
S = sparse( S(1:end-1) + os, S(2:end) + os, 1 );
% if you don't want all pairs, i.e., for S = (2,0,1,3) count (2,0) and (1,3) ONLY:
S = sparse( S(1:2:end)+os, S(2:2:end) + os, 1 );
[f s c] = find(S);
f = f - os; % convert back
s = s - os;
The co-occurrences and their counts are given by the pairs (f,s) and the corresponding c:
>> [f s c]
ans =
2 0 2 % i.e. the pair (2,0) appears twice in S...
3 0 2
0 1 3
1 1 1
1 2 3
3 2 1
0 3 1
2 3 2
3 3 1

For each element in vector, sum previous n elements

I am trying to write a function that, for each element of a vector, sums the previous n elements:
v = [1 1 1 1 1 1];
res = sumLastN(v,3);
res = [0 0 3 3 3 3];
Until now, I have written the following function
function [res] = sumLastN(vec,ppts)
    if iscolumn(vec)~=1
        error('First argument must be a column vector')
    end
    sz_x = size(vec,1);
    res = zeros(sz_x,1);
    if sz_x > ppts
        for jj = 1:ppts
            res(ppts:end,1) = res(ppts:end,1) + ...
                vec(jj:end-ppts+jj,1);
        end
        % for jj = ppts:sz_x
        %     res(jj,1) = sum(vec(jj-ppts+1:jj,1));
        % end
    end
end
There are around 2000 vectors of about 1 million elements, so I was wondering if anyone could give me any advice of how I could speed up the function.
Using cumsum should be much faster:
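function [res] = sumLastN(vec,ppts)
    % vec is assumed to be a row vector here
    w = cumsum(vec);
    % the first full window ends at index ppts, so its sum is w(ppts) itself
    res = [zeros(1,ppts-1), w(ppts), w(ppts+1:end)-w(1:end-ppts)];
end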
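The reasoning behind the cumulative-sum approach is that the sum of the n elements ending at position i equals w(i) - w(i-n), where w is the running total and w(0) is taken as 0. A small Python sketch of that identity, with a hypothetical function name and zeros where no full window exists yet (as in the question's example):

```python
from itertools import accumulate

def sum_last_n(vec, n):
    """Sum of the n elements ending at each position; 0 where fewer than n exist."""
    w = [0] + list(accumulate(vec))   # w[i] = sum of the first i elements
    return [w[i] - w[i - n] if i >= n else 0 for i in range(1, len(vec) + 1)]

print(sum_last_n([1, 1, 1, 1, 1, 1], 3))  # [0, 0, 3, 3, 3, 3]
```

Each output element costs one subtraction regardless of n, which is where the speed-up over summing every window separately comes from.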
You basically want a moving average filter, just without the averaging.
Use a digital filter:
n = 3;
v = [1 1 1 1 1 1];
res = filter(ones(1,n),1,v)
res =
1 2 3 3 3 3
I don't get why the first two elements should be zero, but why not:
res(1:n-1) = 0
res =
0 0 3 3 3 3

Replacing zeros (or NANs) in a matrix with the previous element row-wise or column-wise in a fully vectorized way

I need to replace the zeros (or NaNs) in a matrix with the previous element row-wise, so basically I need this Matrix X
[0,1,2,2,1,0;
5,6,3,0,0,2;
0,0,1,1,0,1]
To become like this:
[0,1,2,2,1,1;
5,6,3,3,3,2;
0,0,1,1,1,1],
Please note that if the first element of a row is zero, it stays zero.
I know that this has been solved for a single row or column vector in a vectorized way, and this is one of the nicest ways of doing that:
id = find(X);
X(id(2:end)) = diff(X(id));
Y = cumsum(X)
The problem is that matrix indexing in MATLAB/Octave is linear and increments column-wise, so the trick works for a single row or column but cannot be applied as-is to multiple rows: each row/column starts fresh and must be treated independently. I've tried my best and searched extensively, but couldn't find a way out. Applying the same idea in a loop is too slow because my matrices contain at least 3000 rows. Can anyone help me out, please?
Special case when zeros are isolated in each row
You can do it using the two-output version of find to locate the zeros and NaN's in all columns except the first, and then using linear indexing to fill those entries with their row-wise preceding values:
[ii jj] = find( (X(:,2:end)==0) | isnan(X(:,2:end)) );
X(ii+jj*size(X,1)) = X(ii+(jj-1)*size(X,1));
General case (consecutive zeros are allowed on each row)
X(isnan(X)) = 0; %// handle NaN's and zeros in a unified way
aux = repmat(2.^(1:size(X,2)), size(X,1), 1) .* ...
[ones(size(X,1),1) logical(X(:,2:end))]; %// positive powers of 2 or 0
col = floor(log2(cumsum(aux,2))); %// col index
ind = bsxfun(@plus, (col-1)*size(X,1), (1:size(X,1)).'); %'// linear index
Y = X(ind);
The trick is to make use of the matrix aux, which contains 0 if the corresponding entry of X is 0 and its column number is greater than 1, and otherwise contains 2 raised to the column number. Applying cumsum row-wise to this matrix, taking log2 and rounding down (matrix col) therefore gives, for each row, the column index of the rightmost nonzero entry up to the current entry (so this is a kind of row-wise "cumulative max" function). It only remains to convert from column number to linear index (with bsxfun; could also be done with sub2ind) and use that to index X.
This is valid for moderate sizes of X only. For large sizes, the powers of 2 used by the code quickly approach realmax and incorrect indices result.
Example:
X =
0 1 2 2 1 0 0
5 6 3 0 0 2 3
1 1 1 1 0 1 1
gives
>> Y
Y =
0 1 2 2 1 1 1
5 6 3 3 3 2 3
1 1 1 1 1 1 1
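The "column index of the rightmost nonzero entry so far" can also be computed with a plain running maximum instead of the powers-of-2 trick, which sidesteps the overflow concern for wide matrices. The Python sketch below shows that variant for one row at a time; it is a swapped-in technique for illustration, not the MATLAB code above, and it handles zeros only (NaNs would first have to be mapped to 0, as in the MATLAB version).

```python
from itertools import accumulate

def fill_row(row):
    """Replace zeros by the nearest nonzero value to their left (column 0 is kept)."""
    # Column index where the entry is usable (nonzero, or the first column), else 0.
    idx = [j if (j == 0 or x != 0) else 0 for j, x in enumerate(row)]
    # Running maximum = index of the rightmost usable column seen so far.
    last = list(accumulate(idx, max))
    return [row[j] for j in last]

X = [[0, 1, 2, 2, 1, 0, 0],
     [5, 6, 3, 0, 0, 2, 3],
     [1, 1, 1, 1, 0, 1, 1]]
for row in X:
    print(fill_row(row))
# [0, 1, 2, 2, 1, 1, 1]
# [5, 6, 3, 3, 3, 2, 3]
# [1, 1, 1, 1, 1, 1, 1]
```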
You can generalize your own solution as follows:
Y = X.'; %'// Make a transposed copy of X
Y(isnan(Y)) = 0;
idx = find([ones(1, size(X, 1)); Y(2:end, :)]);
Y(idx(2:end)) = diff(Y(idx));
Y = reshape(cumsum(Y(:)), [], size(X, 1)).'; %'// Reshape back into a matrix
This works by treating the input data as a long vector, applying the original solution and then reshaping the result back into a matrix. The first column is always treated as non-zero so that the values don't propagate throughout rows. Also note that the original matrix is transposed so that it is converted to a vector in row-major order.
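To make the flatten, diff, cumsum and reshape steps concrete, here is a rough Python sketch of the same idea on nested lists. The function name is made up, and zeros are the only fill marker handled. Flattening a list of rows is already row-major, so no transpose is needed.

```python
from itertools import accumulate

def fill_rows(X):
    """Row-wise forward fill of zeros via differences and one cumulative sum."""
    ncols = len(X[0])
    flat = [v for row in X for v in row]              # row-major long vector
    # Positions kept as anchors: nonzero entries plus the first entry of every row.
    keep = [i for i, v in enumerate(flat) if v != 0 or i % ncols == 0]
    y = list(flat)
    for prev, cur in zip(keep, keep[1:]):
        y[cur] = flat[cur] - flat[prev]               # store the jump between anchors
    filled = list(accumulate(y))                      # cumulative sum restores the values
    return [filled[r:r + ncols] for r in range(0, len(filled), ncols)]

X = [[0, 1, 2, 2, 1, 0],
     [5, 6, 3, 0, 0, 2],
     [0, 0, 1, 1, 0, 1]]
for row in fill_rows(X):
    print(row)
# [0, 1, 2, 2, 1, 1]
# [5, 6, 3, 3, 3, 2]
# [0, 0, 1, 1, 1, 1]
```

Treating the first entry of each row as an anchor even when it is zero is what stops the last value of one row from leaking into the next.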
Modified version of Eitan's answer to avoid propagating values across rows:
Y = X'; %'
tf = Y > 0;
tf(1,:) = true;
idx = find(tf);
Y(idx(2:end)) = diff(Y(idx));
Y = reshape(cumsum(Y(:)),fliplr(size(X)))';
x=[0,1,2,2,1,0;
   5,6,3,0,1,2;
   1,1,1,1,0,1];
%Doing it column by column is easier
x=x';
rm=0;
while 1
    %fields to replace
    l=(x==0);
    %do nothing for the first row/column
    l(1,:)=0;
    rm2=sum(sum(l));
    if rm2==rm
        %nothing to do
        break;
    else
        rm=rm2;
    end
    %replace zeros
    x(l) = x(find(l)-1);
end
x=x';
I have a function I use for a similar problem for filling NaNs. This can probably be cut down or sped up further; it's extracted from pre-existing code that has a bunch more functionality (forward/backward filling, maximum distance, etc.).
X = [
0 1 2 2 1 0
5 6 3 0 0 2
1 1 1 1 0 1
0 0 4 5 3 9
];
X(X == 0) = NaN;
Y = nanfill(X,2);
Y(isnan(Y)) = 0
function y = nanfill(x,dim)
    if nargin < 2, dim = 1; end
    if dim == 2, y = nanfill(x',1)'; return; end
    i = find(~isnan(x(:)));
    j = 1:size(x,1):numel(x);
    j = j(ones(size(x,1),1),:);
    ix = max(rep([1; i],diff([1; i; numel(x) + 1])),j(:));
    y = reshape(x(ix),size(x));
function y = rep(x,times)
    i = find(times);
    if length(i) < length(times), x = x(i); times = times(i); end
    i = cumsum([1; times(:)]);
    j = zeros(i(end)-1,1);
    j(i(1:end-1)) = 1;
    y = x(cumsum(j));