apply np.sign to pyspark series not working even using udf - pyspark

I am trying to convert every row value to its sign using NumPy's built-in function np.sign.
My code:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# `spark` is an existing SparkSession
pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10,5,3,-1,0,-10,-4,10,0,10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)
sign_acc_row = F.udf(lambda x: np.sign([x]), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()
Errors:
Py4JJavaError: An error occurred while calling o2586.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 320.0 failed 1 times, most recent failure: Lost task 0.0 in stage 320.0 (TID 3199, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Expected output:
id values sign
0 0 10 1
1 1 5 1
2 2 3 1
3 3 -1 -1
4 4 0 0
5 5 -10 -1
6 6 -4 -1
7 7 10 1
8 8 0 0
9 9 10 1
Side question if allowed:
I also want to create another column that increases by 1 whenever the sign differs from the previous row.
Expected output:
id values sign numbering
0 0 10 1 1
1 1 5 1 1
2 2 3 1 1
3 3 -1 -1 2
4 4 0 0 3
5 5 -10 -1 4
6 6 -4 -1 4
7 7 10 1 5
8 8 0 0 6
9 9 10 1 7

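You are almost there. np.sign returns a numpy.int64 object, which PySpark does not understand. To make them compatible, convert the result to a plain Python int (note that np.sign is applied to x itself rather than to the list [x], so the result is a scalar and not an array):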
sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
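The side question (a counter that goes up whenever the sign changes from one row to the next) is not covered by the answer above. A minimal sketch of the logic in plain Python, not Spark, follows; the helper names `sign` and `number_sign_runs` are made up for illustration. In Spark the same idea would presumably be expressed with a lag over a window ordered by id followed by a cumulative sum, but that is an assumption and is not shown here.

```python
def sign(x):
    # Pure-Python equivalent of int(np.sign(x)) for ordinary numbers
    return (x > 0) - (x < 0)


def number_sign_runs(values):
    # Start at 0 and add 1 every time the sign differs from the previous row
    numbering = []
    counter = 0
    previous = None
    for v in values:
        s = sign(v)
        if s != previous:
            counter += 1
            previous = s
        numbering.append(counter)
    return numbering


values = [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]
signs = [sign(v) for v in values]
numbering = number_sign_runs(values)
print(signs)      # [1, 1, 1, -1, 0, -1, -1, 1, 0, 1]
print(numbering)  # [1, 1, 1, 2, 3, 4, 4, 5, 6, 7]
```

Both lists match the expected "sign" and "numbering" columns in the question.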

Implementation of FIFO pnl in kdb/q

Consider the table below:
Id Verb Qty Price
1 Buy 6 10.0
2 Sell 5 11.0
3 Buy 4 10.0
4 Sell 3 12.0
5 Sell 8 9.0
6 Buy 7 8.0
I would like to compute the PnL in a FIFO way. For example, for Id=1 the PnL is -6*(10.0) + 5*(11.0) + 1*(12.0) = +$7.00. Id=5 is a bit different: our position is +2 at that point, so the sell first closes that position (which does not count towards the PnL of Id=5) and then sells the remaining 6 assets short. At Id=6 the -6 position is closed, which gives the PnL of Id=5: +6*(9.0) - 6*(8.0) = +$6.00. Hence this is the table with PnL that I want:
Id Verb Qty Price PnL
1 Buy 6 10.0 7.0
2 Sell 5 11.0 0.0
3 Buy 4 10.0 2.0
4 Sell 3 12.0 0.0
5 Sell 8 9.0 6.0
6 Buy 7 8.0 0.0 (with 1 asset remaining)
I have read this post and KDB: pnl in FIFO manner and https://code.kx.com/q4m3/1_Q_Shock_and_Awe/#114-example-fifo-allocation. But those approaches do not take the order between buy orders and sell orders into account, which my case requires.
My idea is to first produce the FIFO allocation matrix, whose dimension is the number of trades:
Id 1 2 3 4 5 6
1 6 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 4 0 0 0
4 0 0 2 0 0 0
5 0 0 0 0 -6 0
6 0 0 0 0 0 1
Then I compute diff(price). The inner product of each column with diff(price) is the PnL of each trade.
I am having trouble implementing this allocation matrix. Alternatively, is there any advice on solving this problem more directly?
Here's one approach. It's more convoluted than I'd like but it covers a lot of the intermediary steps and generates a type of allocation matrix as you suggested. There are likely edge-cases and tweaks needed but this should give you some ideas at least.
t:([]id:1+til 6;side:`b`s`b`s`s`b;qty:6 5 4 3 8 7;px:10 11 10 12 9 8f);
t:update pos:sums delta from update delta:qty*(1;-1)side=`s from t;
f:{signum[x]*x,{#[(-). z;x;:;abs[y]-sum z 1]}[y;x y]{(x;deltas y&sums x)}[abs where[signum[x]<>signum x y]#x;abs x y]};
t:update fifo:deltas[id!delta;f\[id!delta;id]] from t;
q)update pnl:sum each(id!px)*/:fifo from t
id side qty px delta pos fifo pnl
-----------------------------------------------------
1 b 6 10 6 6 1 2 3 4 5 6!-6 5 0 1 0 0 7
2 s 5 11 -5 1 1 2 3 4 5 6!0 0 0 0 0 0 0
3 b 4 10 4 5 1 2 3 4 5 6!0 0 -4 2 2 0 2
4 s 3 12 -3 2 1 2 3 4 5 6!0 0 0 0 0 0 0
5 s 8 9 -8 -6 1 2 3 4 5 6!0 0 0 0 6 -6 6
6 b 7 8 7 1 1 2 3 4 5 6!0 0 0 0 0 0 0
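For readers who do not read q, the FIFO rule described in the question can be cross-checked with a small queue-based sketch in Python. This is not a translation of the q code above; it is a separate illustration, and the function name `fifo_pnl` and the lot layout are assumptions made for this example. Each trade first closes the oldest open lots of the opposite sign, the realised PnL is credited to the trade that opened the lot, and any leftover quantity opens a new lot.

```python
from collections import deque


def fifo_pnl(trades):
    """trades: list of (id, side, qty, price) with side 'b' (buy) or 's' (sell)."""
    pnl = {trade_id: 0.0 for trade_id, _, _, _ in trades}
    open_lots = deque()  # each lot is [id, signed remaining qty, price]
    for trade_id, side, qty, price in trades:
        remaining = qty if side == "b" else -qty
        # Close the oldest lots first while they have the opposite sign
        while remaining and open_lots and open_lots[0][1] * remaining < 0:
            lot = open_lots[0]
            matched = min(abs(lot[1]), abs(remaining))
            if lot[1] > 0:
                # Long lot closed by a sell: gain is sell price minus buy price
                pnl[lot[0]] += matched * (price - lot[2])
                lot[1] -= matched
                remaining += matched
            else:
                # Short lot closed by a buy: gain is sell price minus buy price
                pnl[lot[0]] += matched * (lot[2] - price)
                lot[1] += matched
                remaining -= matched
            if lot[1] == 0:
                open_lots.popleft()
        # Whatever is left opens a new lot in the direction of this trade
        if remaining:
            open_lots.append([trade_id, remaining, price])
    return pnl, list(open_lots)


trades = [
    (1, "b", 6, 10.0),
    (2, "s", 5, 11.0),
    (3, "b", 4, 10.0),
    (4, "s", 3, 12.0),
    (5, "s", 8, 9.0),
    (6, "b", 7, 8.0),
]
pnl, still_open = fifo_pnl(trades)
print(pnl)         # {1: 7.0, 2: 0.0, 3: 2.0, 4: 0.0, 5: 6.0, 6: 0.0}
print(still_open)  # [[6, 1, 8.0]]
```

The result agrees with the pnl column above and leaves one asset of Id=6 open, as the question expects.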

Find rows of a matrix whose certain columns all match a condition

Suppose I have a matrix with many rows and columns, for example a small subset would be:
1 2 3 4 5 6
1 1 5 6 0 0
1 2 2 3 2 1
1 2 0 3 4 5
1 9 5 7 3 0
I want to find the rows whose columns #4, #5 and #6 all contain elements greater than zero, so in this case I would get a vector like this:
1
3
4
I have tried using the find() function this way:
idx = find(y(:, 4:6) > 0)
but I get this:
1
2
3
4
5
6
8
9
10
11
13
14
You can use a combination of find and all like this:
idx = find(all(y(:,4:6) > 0, 2))
This gives:
>> y = [1 2 3 4 5 6; 1 1 5 6 0 0; 1 2 2 3 2 1; 1 2 0 3 4 5; 1 9 5 7 3 0]
y =
1 2 3 4 5 6
1 1 5 6 0 0
1 2 2 3 2 1
1 2 0 3 4 5
1 9 5 7 3 0
>> idx = find(all(y(:,4:6) > 0, 2))
idx =
1
3
4
The idea is that we extract columns 4 to 6, check which values are greater than 0, operate along the 2nd dimension with an AND condition (all), and then extract which indices (rows) are 1/true in the resulting column vector.
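The same row-wise "all" reduction can be sketched in plain Python for comparison. This is only an illustration of the logic, not MATLAB; the variable names are chosen for this example, and row numbers are reported 1-based to match the MATLAB output.

```python
y = [
    [1, 2, 3, 4, 5, 6],
    [1, 1, 5, 6, 0, 0],
    [1, 2, 2, 3, 2, 1],
    [1, 2, 0, 3, 4, 5],
    [1, 9, 5, 7, 3, 0],
]

# row[3:6] holds columns 4 to 6; all(...) is the AND along the row
idx = [
    row_number
    for row_number, row in enumerate(y, start=1)
    if all(value > 0 for value in row[3:6])
]
print(idx)  # [1, 3, 4]
```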

Bin interaction frequencies

I have 10 bins, and each bin contains a specific number of observations, e.g.:
a = [0,0,1,0,0,2,0,0,0,2]
I'd like to subsequently tally how many times any given pair of (non-zero) bins co-occur - based on the number of observations.
Given the above example, bin#3 = 1, bin#6 = 2 and bin#10 = 2.
This means that bin 3 and 6 co-occurred once, bin 3 and 10 co-occurred once, and bin 6 and 10 co-occurred twice (the minimum value of the pair is taken).
My desired output is a full matrix, listing every possible bin combination (columns 1-2) and the tally of what was observed (column 3):
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
1 7 0
1 8 0
1 9 0
1 10 0
2 3 0
2 4 0
2 5 0
2 6 0
2 7 0
2 8 0
2 9 0
2 10 0
3 4 0
3 5 0
3 6 1
3 7 0
3 8 0
3 9 0
3 10 1
4 5 0
4 6 0
4 7 0
4 8 0
4 9 0
4 10 0
5 6 0
5 7 0
5 8 0
5 9 0
5 10 0
6 7 0
6 8 0
6 9 0
6 10 2
7 8 0
7 9 0
7 10 0
8 9 0
8 10 0
9 10 0
Is there a short and/or fast way of doing this?
You can get all combinations of the bin numbers in many ways. I'll use combvec for ease.
Then it's relatively simple to vectorise this using min...
a = [0,0,1,0,0,2,0,0,0,2];
n = 1:numel(a);
% Use unique and sort to get rid of duplicate pairs when order doesn't matter
M = unique( sort( combvec( n, n ).', 2 ), 'rows' );
% Get rid of rows where columns 1 and 2 are equal
M( M(:,1) == M(:,2), : ) = [];
% Get the overlap count for bins
M( :, 3 ) = min( a(M), [], 2 );
Try this.
bin_output = [....]; % assumed: the full list of bin pairs, with the tally in column 3 set to 0
bin_matrix = [0,0,1,0,0,2,0,0,0,2];
bin_nonzero = find(bin_matrix); % bin numbers that hold at least one observation
for j = 1:length(bin_nonzero)-1
    for k = (j+1):length(bin_nonzero)
        for m = 1:size(bin_output,1)
            % Match the row by bin number, not by the loop indices j and k
            if isequal(bin_output(m,1),bin_nonzero(j)) && isequal(bin_output(m,2),bin_nonzero(k))
                bin_output(m,3) = bin_output(m,3) + min([bin_matrix(bin_nonzero(j)),bin_matrix(bin_nonzero(k))]);
            end
        end
    end
end
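As a quick cross-check of the expected table, the pairwise minimum can also be sketched in plain Python with itertools.combinations, which yields each unordered pair of bins exactly once. The variable names here are chosen for illustration only.

```python
from itertools import combinations

a = [0, 0, 1, 0, 0, 2, 0, 0, 0, 2]

# One row per unordered pair of 1-based bin numbers: (bin_i, bin_j, tally)
rows = [
    (i, j, min(a[i - 1], a[j - 1]))
    for i, j in combinations(range(1, len(a) + 1), 2)
]

nonzero_rows = [row for row in rows if row[2] > 0]
print(len(rows))     # 45
print(nonzero_rows)  # [(3, 6, 1), (3, 10, 1), (6, 10, 2)]
```

The three non-zero rows are the ones listed in the desired output; every other pair has a tally of 0 because at least one of its bins is empty.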

Every possible sum combination in a vector

Assume I have a vector of numbers A, for example A=[1 3 5 3 9 6] (A's length >= 2), and an integer X=6. I need to find how many pairs (A[i],A[j]) with i<j exist in the vector that satisfy the condition A[i]+A[j]=X. The number of pairs is printed.
for/while loops are not allowed. Only these functions are allowed: ceil, floor, mod, repmat, reshape, size, length, transpose, sort, isempty, all, any, find, sum, max, min.
With repmat, length and sum -
A = [1 3 5 3 9 6] %// Input vector from the question
integer1 = 6; %// One of the parameters
A_ind = 1:length(A) %// Get the indices array
%// Expand A_ind into rows and A_ind' into columns, to form a meshgrid structure
A_ind_mat1 = repmat(A_ind,[length(A) 1])
A_ind_mat2 = repmat(A_ind',[1 length(A)]) %//'
%// Expand A into rows and A' into columns, to form a meshgrid structure
A_mat1 = repmat(A,[length(A) 1])
A_mat2 = repmat(A',[1 length(A)]) %//'
%// Form the binary matrix of -> (A[i],A[j]) where i<j
cond1 = A_ind_mat1 < A_ind_mat2
%// Use the binary matrix as a logical mask to select elements from the two
%// matrices and see which element pairs satisfy -> A[i]+A[j]=X and get a
%// count of those pairs with SUM
pairs_count = sum((A_mat1(cond1) + A_mat2(cond1))==integer1)
Outputs from code run to make it clearer -
A =
1 3 5 3 9 6
A_ind =
1 2 3 4 5 6
A_ind_mat1 =
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
A_ind_mat2 =
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
A_mat1 =
1 3 5 3 9 6
1 3 5 3 9 6
1 3 5 3 9 6
1 3 5 3 9 6
1 3 5 3 9 6
1 3 5 3 9 6
A_mat2 =
1 1 1 1 1 1
3 3 3 3 3 3
5 5 5 5 5 5
3 3 3 3 3 3
9 9 9 9 9 9
6 6 6 6 6 6
cond1 =
0 0 0 0 0 0
1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
1 1 1 1 0 0
1 1 1 1 1 0
pairs_count =
2
A bit more explanation -
Taking a few more steps to clarify why pairs_count must be 2 here -
Set all values in A_mat1 and A_mat2 that do not satisfy the less-than criterion to zero
>> A_mat1(~cond1)=0
A_mat1 =
0 0 0 0 0 0
1 0 0 0 0 0
1 3 0 0 0 0
1 3 5 0 0 0
1 3 5 3 0 0
1 3 5 3 9 0
>> A_mat2(~cond1)=0
A_mat2 =
0 0 0 0 0 0
3 0 0 0 0 0
5 5 0 0 0 0
3 3 3 0 0 0
9 9 9 9 0 0
6 6 6 6 6 0
Now, add A_mat1 and A_mat2 and see how many 6's you got -
>> A_mat1 + A_mat2
ans =
0 0 0 0 0 0
4 0 0 0 0 0
6 8 0 0 0 0
4 6 8 0 0 0
10 12 14 12 0 0
7 9 11 9 15 0
As you can see there are two 6's and thus our result is verified.
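The count can be double-checked outside MATLAB with a brute-force sketch in Python. It ignores the no-loop restriction of the exercise and is meant only as an independent check of the result; the names are chosen for this example.

```python
from itertools import combinations

A = [1, 3, 5, 3, 9, 6]
X = 6

# combinations(A, 2) yields every (A[i], A[j]) with i < j exactly once
pairs = [(x, y) for x, y in combinations(A, 2) if x + y == X]
pairs_count = len(pairs)
print(pairs)        # [(1, 5), (3, 3)]
print(pairs_count)  # 2
```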

How to create a matrix with different elements using Matlab [closed]

How can I create a 9×9 matrix in which the first 3 rows are all zeros, rows 4 to 6 are all filled with 5, and in the remaining rows the first element is 1 and the remaining elements are 5, using MATLAB?
Here's an answer that'll teach you how to use MATLAB if you're interested enough:
A = bsxfun(@times, ones(9), kron([0 5 5], [1 1 1])') - ...
[kron([0 0 4], [1 1 1])' zeros(9,8)]
result:
A =
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5
1 5 5 5 5 5 5 5 5
1 5 5 5 5 5 5 5 5
1 5 5 5 5 5 5 5 5
subZero = zeros(3, 9);
subFive = 5*ones(3, 9);
subsubOnes = ones(3, 1);
subsubFive = 5*ones(3, 8);
subOneFive = [subsubOnes subsubFive];
yourMatrix = [subZero; subFive; subOneFive];
Have you tried creating the matrix with its values at initialization time, like this:
myMatrix = [...
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5
1 5 5 5 5 5 5 5 5
1 5 5 5 5 5 5 5 5
1 5 5 5 5 5 5 5 5];
I know there are simpler ways to initialize.
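The block-stacking idea used in the answers above (build the three row groups separately, then concatenate them vertically) can be sketched in plain Python with nested lists. This is an illustration of the structure only, not MATLAB code, and the variable names are made up for this example.

```python
# Three groups of three rows each, stacked top to bottom
zero_rows = [[0] * 9 for _ in range(3)]
five_rows = [[5] * 9 for _ in range(3)]
one_five_rows = [[1] + [5] * 8 for _ in range(3)]

matrix = zero_rows + five_rows + one_five_rows

for row in matrix:
    print(" ".join(str(value) for value in row))
```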