Groupby with Pyspark through filters - pyspark

I have a df derived from clustering that looks like this:
Cluster  Variable 1  Variable 2
0        334         32
0        0           45
3        453         0
3        320         0
0        0           28
1        467         49
3        324         16
1        58          2
And I'm trying to achieve the following result for each cluster and every variable:
Variable 1
Cluster  %of0  %ofvals != 0  Count of vals != 0  Sum of values  %universe
0        67    33            1                   334            17
1        0     100           2                   525            27
3        0     100           3                   1097           56

Variable 2
Cluster  %of0  %ofvals != 0  Count of vals != 0  Sum of values  %universe
0        0     100           0                   105            61
1        0     100           0                   51             29
3        67    33            1                   16             10
Note: % universe is each cluster's share of the total sum of values of a variable. For Variable 1 the total is 334 + 525 + 1097 = 1956 (this is 100%), so 334 is 17% of this total.
I'm learning PySpark and struggling with the syntax. This is the code I'm trying, but I'm at a loss because I don't know how to set up the filters so that they iterate over each variable and each cluster:
for i in list_of_variables:
    print(i)
    df.groupBy('Cluster').agg((count((col(i) == 0) / df.filter(col('Cluster') == 0).count()) * 100).alias('% of 0'), (count((col(i) != 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of vals diff than 0')..
I would be very grateful for any ideas that could give me light on how to materialize this objective. Have an awesome day!

Maybe you could try something like this to obtain the counts:
from pyspark.sql.functions import col, count

for i in list_of_variables:
    print(i)
    # Keep rows where the variable is non-zero, then count them per cluster
    df.filter(col(i) != 0).groupBy('Cluster').agg(
        count('*').alias('Count_vals_dif_0')).show()
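The counts are only one of the requested columns. As a rough way to check the arithmetic behind the whole table, here is a plain-Python sketch (deliberately not PySpark) that reproduces the Variable 1 figures from the question. The function name `cluster_stats` and the output keys are made up for illustration, and percentages are rounded to whole numbers as in the question.

```python
from collections import defaultdict

def cluster_stats(rows, variable):
    """Per-cluster summary of one variable; rows are dicts with a 'Cluster' key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Cluster"]].append(row[variable])
    # "Universe" is the sum of the variable over all clusters.
    universe = sum(sum(vals) for vals in groups.values())
    stats = {}
    for cluster, vals in sorted(groups.items()):
        nonzero = sum(1 for v in vals if v != 0)
        total = sum(vals)
        stats[cluster] = {
            "pct_zero": round(100 * (len(vals) - nonzero) / len(vals)),
            "pct_nonzero": round(100 * nonzero / len(vals)),
            "count_nonzero": nonzero,
            "sum": total,
            "pct_universe": round(100 * total / universe) if universe else 0,
        }
    return stats

# The sample data from the question.
rows = [
    {"Cluster": 0, "Variable 1": 334, "Variable 2": 32},
    {"Cluster": 0, "Variable 1": 0, "Variable 2": 45},
    {"Cluster": 3, "Variable 1": 453, "Variable 2": 0},
    {"Cluster": 3, "Variable 1": 320, "Variable 2": 0},
    {"Cluster": 0, "Variable 1": 0, "Variable 2": 28},
    {"Cluster": 1, "Variable 1": 467, "Variable 2": 49},
    {"Cluster": 3, "Variable 1": 324, "Variable 2": 16},
    {"Cluster": 1, "Variable 1": 58, "Variable 2": 2},
]

for cluster, summary in cluster_stats(rows, "Variable 1").items():
    print(cluster, summary)
```

In Spark, the same quantities would presumably come from a single groupBy('Cluster') with conditional aggregates, plus a grand total for the universe column.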

kdb - KDB Apply logic where column exists - data validation

I'm trying to perform some simple logic on a table, but I'd like to verify that the columns exist prior to doing so as a validation step. My data consists of standard table names, though they are not always present in each data source.
While the following seems to work (just validating AAA at present), I need to expand it to ensure that PRI_AAA (and eventually many other variables) is present as well.
t: $[`AAA in cols `t; temp: update AAA_VAL: AAA*AAA_PRICE from t;()]
Two-part question:
1. This seems quite tedious for each variable (imagine AAA-ZZZ inputs and their derivatives). Is there a clever way to leverage a dictionary (or table) to check whether a number of variables exist, or to insert a placeholder column of zeros if they do not?
2. Similarly, can we store a formula or instructions to apply within a dictionary (or table) to validate and return a calculation (i.e. BBB_VAL: BBB*BBB_PRICE)? Some calculations would depend on others (e.g. BBB_Tax_Basis = BBB_VAL - BBB_COSTS), so there could be iterative issues.
Thanks in advance!
A functional update may be the best way to achieve this if your intention is to update many columns of a table in a similar fashion.
func:{[t;x]
  if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
  :$[x in cols t;
    ![t;();0b;(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")];
    t;
    ];
  };
This function will update t with *_VAL columns for any column you pass as an argument, while first also adding a zero column for any missing columns passed as an argument.
q)t:([]AAA:10?100;BBB:10?100;CCC:10?100;AAA_PRICE:10*10?10;BBB_PRICE:10*10?10;CCC_PRICE:10*10?10;DDD_PRICE:10*10?10)
q)func/[t;`AAA`BBB`CCC`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL CCC_VAL DDD DDD_VAL
---------------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0 0
39 17 97 50 90 40 10 1950 1530 3880 0 0
76 11 11 0 0 50 10 0 0 550 0 0
26 55 99 20 60 80 90 520 3300 7920 0 0
91 51 3 30 20 0 60 2730 1020 0 0 0
83 81 7 70 60 40 90 5810 4860 280 0 0
76 68 98 40 80 90 70 3040 5440 8820 0 0
88 96 30 70 0 80 80 6160 0 2400 0 0
4 61 2 70 90 0 40 280 5490 0 0 0
56 70 15 0 50 30 30 0 3500 450 0 0
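For readers less familiar with functional updates, the same idea can be sketched in plain Python, with the table as a dict of column lists. This is only an illustration under my own assumptions: `add_val_columns` is a hypothetical helper, and unlike the q function it also pads a missing *_PRICE column with zeros.

```python
def add_val_columns(table, names):
    """Return a copy of table (dict of column -> list) with NAME_VAL columns added."""
    n_rows = len(next(iter(table.values())))
    out = {col: list(vals) for col, vals in table.items()}
    for name in names:
        # Missing input columns become placeholder columns of zeros.
        for col in (name, name + "_PRICE"):
            out.setdefault(col, [0] * n_rows)
        # NAME_VAL is the row-wise product NAME * NAME_PRICE.
        out[name + "_VAL"] = [a * b for a, b in zip(out[name], out[name + "_PRICE"])]
    return out

t = {"AAA": [1, 2], "AAA_PRICE": [10, 20], "DDD_PRICE": [5, 5]}
result = add_val_columns(t, ["AAA", "DDD"])
print(result)
```

The input table is left untouched; DDD is added as zeros, so DDD_VAL is zero as well.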
As you've already mentioned, to cover point 2, a dictionary of functions might be the best way to go.
q)dict:raze{(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")}each`AAA`BBB`DDD
q)dict
AAA_VAL| * `AAA `AAA_PRICE
BBB_VAL| * `BBB `BBB_PRICE
DDD_VAL| * `DDD `DDD_PRICE
And then a slightly modified function...
func:{[dict;t;x]
  if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
  :$[x in cols t;
    ![t;();0b;(enlist`$string[x],"_VAL")!enlist(dict`$string[x],"_VAL")];
    t;
    ];
  };
yields a similar result.
q)func[dict]/[t;`AAA`BBB`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL DDD DDD_VAL
-------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0
39 17 97 50 90 40 10 1950 1530 0 0
76 11 11 0 0 50 10 0 0 0 0
26 55 99 20 60 80 90 520 3300 0 0
91 51 3 30 20 0 60 2730 1020 0 0
83 81 7 70 60 40 90 5810 4860 0 0
76 68 98 40 80 90 70 3040 5440 0 0
88 96 30 70 0 80 80 6160 0 0 0
4 61 2 70 90 0 40 280 5490 0 0
56 70 15 0 50 30 30 0 3500 0 0
Here's another approach which handles dependent/cascading calculations and also figures out which calculations are possible or not depending on the available columns in the table.
q)show map:`AAA_VAL`BBB_VAL`AAA_RevenueP`AAA_RevenueM`BBB_Other!((*;`AAA;`AAA_PRICE);(*;`BBB;`BBB_PRICE);(+;`AAA_Revenue;`AAA_VAL);(%;`AAA_RevenueP;1e6);(reciprocal;`BBB_VAL));
AAA_VAL | (*;`AAA;`AAA_PRICE)
BBB_VAL | (*;`BBB;`BBB_PRICE)
AAA_RevenueP| (+;`AAA_Revenue;`AAA_VAL)
AAA_RevenueM| (%;`AAA_RevenueP;1000000f)
BBB_Other | (%:;`BBB_VAL)
func:{c:{$[0h=type y;.z.s[x]each y;-11h<>type y;y;y in key x;.z.s[x]each x y;y]}[y]''[y];
  ![x;();0b;where[{all in[;cols x]r where -11h=type each r:(raze/)y}[x]each c]#c]};
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB AAA_VAL AAA_RevenueP AAA_RevenueM
---------------------------------------------------------------
1 1 10 4 1 11 1.1e-05
2 2 20 5 4 24 2.4e-05
3 3 30 6 9 39 3.9e-05
/if the right columns are there
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6;BBB_PRICE:4 5 6f);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB BBB_PRICE AAA_VAL BBB_VAL AAA_RevenueP AAA_RevenueM BBB_Other
--------------------------------------------------------------------------------------------
1 1 10 4 4 1 16 11 1.1e-05 0.0625
2 2 20 5 5 4 25 24 2.4e-05 0.04
3 3 30 6 6 9 36 39 3.9e-05 0.02777778
The only caveat is that your map can't have the same column name both as the key and in the value of your map, i.e. it cannot re-use column names. It is also assumed that all symbols in your map are column names (not global variables), though it could be extended to cover that.
EDIT: if you have a large number of column maps then it will be easier to define it in a more vertical fashion like so:
map:(!). flip(
  (`AAA_VAL;     (*;`AAA;`AAA_PRICE));
  (`BBB_VAL;     (*;`BBB;`BBB_PRICE));
  (`AAA_RevenueP;(+;`AAA_Revenue;`AAA_VAL));
  (`AAA_RevenueM;(%;`AAA_RevenueP;1e6));
  (`BBB_Other;   (reciprocal;`BBB_VAL))
  );
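To make the cascading behaviour easier to follow, here is a rough Python sketch of the same goal: apply every formula whose inputs are available and skip the rest. It uses a simple repeat-until-no-progress loop rather than the parse-tree substitution of the q version, and `apply_formulas` is a hypothetical name of my own. Strings in a formula are assumed to be column names; anything else is treated as a constant.

```python
import operator

def apply_formulas(table, formulas):
    """Add every column from formulas whose inputs can be resolved."""
    out = {col: list(vals) for col, vals in table.items()}
    n_rows = len(next(iter(out.values())))
    pending = dict(formulas)
    progress = True
    # Keep sweeping until a full pass adds no new column.
    while pending and progress:
        progress = False
        for name, (func, *args) in list(pending.items()):
            if all(a in out for a in args if isinstance(a, str)):
                cols = [out[a] if isinstance(a, str) else [a] * n_rows for a in args]
                out[name] = [func(*vals) for vals in zip(*cols)]
                del pending[name]
                progress = True
    return out

formulas = {
    "AAA_VAL": (operator.mul, "AAA", "AAA_PRICE"),
    "BBB_VAL": (operator.mul, "BBB", "BBB_PRICE"),
    "AAA_RevenueP": (operator.add, "AAA_Revenue", "AAA_VAL"),
    "AAA_RevenueM": (operator.truediv, "AAA_RevenueP", 1e6),
    "BBB_Other": (lambda v: 1 / v, "BBB_VAL"),
}
table = {"AAA": [1, 2, 3], "AAA_PRICE": [1.0, 2.0, 3.0],
         "AAA_Revenue": [10, 20, 30], "BBB": [4, 5, 6]}
result = apply_formulas(table, formulas)
print(sorted(result))
```

With BBB_PRICE missing, BBB_VAL and BBB_Other are silently skipped, which mirrors the first q example above.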

How can I make a diamond of zeroes in a matrix of any size? [closed]

I have a square N x N matrix, where N is odd, and I want to put a diamond of zeroes in it. For example, the 5 x 5 matrix:
1 3 2 4 2
5 7 8 9 5
3 2 4 6 3
6 8 2 1 3
3 3 3 3 3
is transformed into:
1 3 0 4 2
5 0 8 0 5
0 2 4 6 0
6 0 2 0 3
3 3 0 3 3
How can this be done efficiently?
I'll bite, here is one approach:
% NxN matrix
N = 5;
assert(N>1 && mod(N,2)==1);
A = magic(N);
% diamond mask
N2 = fix(N/2);
[I,J] = meshgrid(-N2:N2);
mask = (abs(I) + abs(J)) == N2;
% fill with zeros
A(mask) = 0;
The result:
>> A
A =
17 24 0 8 15
23 0 7 0 16
0 6 13 20 0
10 0 19 0 3
11 18 0 2 9
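The mask is just "Manhattan distance from the centre equals N2". For comparison, here is a minimal Python sketch of the same test on nested lists, applied to the 5 x 5 matrix from the question; `zero_diamond` is a name I made up.

```python
def zero_diamond(matrix):
    """Return a copy of an odd-sized square matrix with a diamond of zeros."""
    n = len(matrix)
    if n % 2 == 0 or any(len(row) != n for row in matrix):
        raise ValueError("expected a square matrix with odd size")
    half = n // 2
    # An entry lies on the diamond when its Manhattan distance
    # from the centre equals half the matrix size.
    return [
        [0 if abs(i - half) + abs(j - half) == half else value
         for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]

question_matrix = [
    [1, 3, 2, 4, 2],
    [5, 7, 8, 9, 5],
    [3, 2, 4, 6, 3],
    [6, 8, 2, 1, 3],
    [3, 3, 3, 3, 3],
]
for row in zero_diamond(question_matrix):
    print(row)
```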
I also had some time to play around. My solution places no restriction on N being odd, even, or larger than 1. Every integer is fine (even 0 works, though it does not make sense).
% NxN matrix
N = 7;
A = magic(N);
half = ceil( N/2 );
mask = ones( half );
mask( 1 : half+1 : half*half ) = 0;
mask = [ fliplr( mask ) mask ];
mask = [ mask; flipud( mask ) ];
if( mod(N,2) == 1 )
    mask(half, :) = [];
    mask(:, half) = [];
end
A( ~mask ) = 0;
A
I am first creating a square sub-matrix mask of "quarter" size (half the number of columns and half the number of rows, ceil() to get one more in the case N is odd).
Example for N=7 -> half=4.
mask =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
I then set its diagonal values to zero:
mask =
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
Mirror the mask horizontally:
mask =
1 1 1 0 0 1 1 1
1 1 0 1 1 0 1 1
1 0 1 1 1 1 0 1
0 1 1 1 1 1 1 0
Then mirror it vertically:
mask =
1 1 1 0 0 1 1 1
1 1 0 1 1 0 1 1
1 0 1 1 1 1 0 1
0 1 1 1 1 1 1 0
0 1 1 1 1 1 1 0
1 0 1 1 1 1 0 1
1 1 0 1 1 0 1 1
1 1 1 0 0 1 1 1
As N is odd we got a redundant row and redundant column that are then removed:
mask =
1 1 1 0 1 1 1
1 1 0 1 0 1 1
1 0 1 1 1 0 1
0 1 1 1 1 1 0
1 0 1 1 1 0 1
1 1 0 1 0 1 1
1 1 1 0 1 1 1
The logical not is then used as a mask to select the values in the original matrix that are set to 0.
Probably not as efficient as @Amro's solution, but it works. :D
My solution:
Looking at the left half of the matrix first:
in the first row, the 0 is in the middle column (let's call it mc)
in the second row, the 0 is in column mc-1
and so on as the rows increase
when you reach column 1, the sequence continues with the column index increasing again (2, 3, ..., mc) while the rows keep increasing
The right half of the matrix is handled in a similar way.
n=7
a=randi([20 30],n,n)
% Centre of the matrix
p=ceil(n/2)
% Identify the column sequence
col=[p:-1:1 2:p p+1:n n-1:-1:p]
% Identify the row sequence
row=[1:n n-1:-1:1]
% Transform the row and column indices into linear indices
idx=sub2ind(size(a),row,col)
% Set the zeros
a(idx)=0
a =
22 29 23 27 27 21 23
29 29 21 27 24 26 24
30 28 21 27 29 28 25
28 22 24 20 27 24 25
23 26 21 20 30 20 29
26 20 26 23 25 22 25
21 24 25 25 23 21 30
a =
22 29 23 0 27 21 23
29 29 0 27 0 26 24
30 0 21 27 29 0 25
0 22 24 20 27 24 0
23 0 21 20 30 0 29
26 20 0 23 0 22 25
21 24 25 0 23 21 30
Hope this helps.
Qapla'
Using indexing (only works when N is odd):
N = 7;
% Random matrix
A = randi(100, N);
idx = [N-1:-2:1; 2:2:N];
A(cumsum([ceil(N/2) idx(:)' idx(end-1:-1:1)])) = 0
A =
60 77 74 0 54 83 9
8 48 0 76 0 28 67
6 0 32 78 83 0 10
0 27 25 5 11 39 0
76 0 49 43 67 0 16
79 7 0 86 0 70 78
57 28 85 0 81 44 81

Change vector if even/odd in a matrix

I'm new to Matlab programming and I've only had 3 classes so far. I'm having problems with my homework. (Also, I am from Iceland and English is not my first language, so please forgive my grammar.)
I'm given a matrix A, and I'm supposed to change the value of each element to 0 if it is an even number and to 1 if it is an odd number.
This is what I have so far.
A = [90 100 87 43 20 58; 29 5 12 94 8 62; 75 21 36 83 35 24; 47 51 70 59 82 33];
B = zeros(size(A));
for k = 1:length(A)
if mod(A(k),2)== 0 %%number is even
B(k) = 0;
else
B(k) = 1; %%number is odd
end
end
B(A,2==0) = 0;
B(A,2~=0) = 1
What I am getting is this:
B =
0 0 0 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
1 0 0 0 0 0
1 0 0 0 0 0
If anyone could please help me, it would be greatly appreciated :)
You are very close. Don't use length(A) - use numel(A). length(A) returns the number of elements along the largest dimension. As such, because you have 6 columns and 4 rows, this loop will only iterate 6 times. numel returns the total number of elements in the array A, which is what you want as you want to iterate over each value in A.
Therefore:
A = [90 100 87 43 20 58; 29 5 12 94 8 62; 75 21 36 83 35 24; 47 51 70 59 82 33];
B = zeros(size(A));
for k = 1:numel(A) %// Change
    if mod(A(k),2)== 0 %%number is even
        B(k) = 0;
    else
        B(k) = 1; %%number is odd
    end
end
The above loop will go through every single element in the matrix and set the corresponding element to 0 if even and 1 if odd.
However, I encourage you to use vectorized operations on your code. Don't use loops for this. Specifically, you can do this very easily with a single mod call:
B = mod(A,2);
mod(A,2) computes the remainder of every value in the matrix A after division by 2 and outputs a matrix B of the same size. That remainder is exactly the parity of each number: 0 for even and 1 for odd.
We get for B:
>> A = [90 100 87 43 20 58; 29 5 12 94 8 62; 75 21 36 83 35 24; 47 51 70 59 82 33];
>> B = mod(A,2)
B =
0 0 1 1 0 0
1 1 0 0 0 0
1 1 0 1 1 0
1 1 0 1 0 1
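To see why the length(A) version only touched a few entries, it helps to remember that MATLAB linear indices run down the columns. Here is a small Python sketch that mimics a loop over the first k linear indices; `parity_first_k` is a made-up helper and the zero-based index arithmetic is my own translation, not MATLAB code.

```python
def parity_first_k(matrix, k):
    """Mimic a MATLAB loop over linear indices 1..k, which run down columns."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0] * n_cols for _ in range(n_rows)]
    for idx in range(k):
        col, row = divmod(idx, n_rows)  # column-major order
        result[row][col] = matrix[row][col] % 2
    return result

A = [
    [90, 100, 87, 43, 20, 58],
    [29, 5, 12, 94, 8, 62],
    [75, 21, 36, 83, 35, 24],
    [47, 51, 70, 59, 82, 33],
]
longest_side = max(len(A), len(A[0]))  # what length(A) returns here: 6
element_count = len(A) * len(A[0])     # what numel(A) returns here: 24
partial = parity_first_k(A, longest_side)
full = parity_first_k(A, element_count)
for row in partial:
    print(row)
```

With k = 6 only the first column and the top two entries of the second column are visited, which matches the first four rows of the output shown in the question.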

strange result with JPEG compression

I want to implement JPEG compression in MATLAB. At the point where the symbols' probabilities (for Huffman coding) are calculated, I see some negative values. I am sure this is not correct. If someone can give some help or directions, I would really appreciate it. Thanks in advance. I use MATLAB R2012b. Here is the code:
clc;
clear all;
a = imread('test.png');
b = rgb2gray(a);
b = imresize(b, [256 256]);
b = double(b);
final = zeros(256, 256);
mask = [1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 0
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 0 0 0 0
1 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0];
qv1 = [ 16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99];
t = dctmtx(8);
DCT2D = @(block_struct) t*block_struct.data*t';
msk = @(block_struct) mask.*block_struct.data;
for row = 1:8:256
for column = 1:8:256
x = (b(row:row+7, column:column+7));
xf = blockproc(x, [8 8], DCT2D);
xf1 = blockproc(xf, [8 8], msk);
xf1 = round(xf1./qv1).*qv1;
final(row:row+7, column:column+7) = xf1;
end
end
[symbols,p] = hist(final,unique(final));
bar(p, symbols);
p = p/sum(p); %NEGATIVE VALUES????
I think you might have the outputs of hist (symbols and p) swapped. The probability should be calculated from the bin counts, which is the first output of hist.
[nelements,centers] = hist(data,xvalues) returns an additional row vector, centers, indicating the location of each bin center on the x-axis. To plot the histogram, you can use bar(centers,nelements).
In other words, instead of your current line,
[symbols,p] = hist(final,unique(final));
just use,
[p,symbols] = hist(final,unique(final));
Also, final is a matrix rather than a vector, so nelements will be a matrix:
If data is a matrix, then a histogram is created separately for each column. Each histogram plot is displayed on the same figure with a different color.
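Put differently: the symbols (quantized DCT coefficients) can legitimately be negative, but the counts cannot, so the probabilities must come from the counts over the flattened matrix. A minimal Python sketch of that bookkeeping, using made-up coefficient values and a hypothetical `symbol_probabilities` helper:

```python
from collections import Counter

def symbol_probabilities(matrix):
    """Map each distinct value in a 2-D list to its relative frequency."""
    # Flatten first, so the whole matrix is treated as one stream of symbols.
    flat = [value for row in matrix for value in row]
    counts = Counter(flat)
    return {symbol: count / len(flat) for symbol, count in sorted(counts.items())}

# Illustrative values only, not output of the code in the question.
coeffs = [[-16, 0, 0, 16], [0, 0, -16, 0]]
probs = symbol_probabilities(coeffs)
print(probs)
```

The keys may be negative; the probabilities are non-negative and sum to 1.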

How to compare a matrix element with its neighbours without using a loop in MATLAB?

I have a matrix in MATLAB. I want to check the 4-connected neighbours (left, right, top, bottom) for every element. If the current element is less than any of the neighbours then we set it to zero otherwise it will keep its value. It can easily be done with loop, but it is very expensive as I have thousands of these matrices.
You might recognize it as nonmaxima suppression after edge detection.
If you have the Image Processing Toolbox, you can do this with a morphological dilation to find local maxima and suppress all other elements.
array = magic(6); %# make some data
msk = [0 1 0;1 0 1;0 1 0]; %# make a 4-neighbour mask
%# dilation will replace the center pixel with the
%# maximum of its neighbors
maxNeighbour = imdilate(array,msk);
%# set pix to zero if less than neighbors
array(array<maxNeighbour) = 0;
array =
35 0 0 26 0 0
0 32 0 0 0 25
31 0 0 0 27 0
0 0 0 0 0 0
30 0 34 0 0 16
0 36 0 0 18 0
Edited to use the same data as @gnovice, and to fix the code.
One way to do this is with the function NLFILTER from the Image Processing Toolbox, which applies a given function to each M-by-N block of a matrix:
>> A = magic(6) %# A sample matrix
A =
35 1 6 26 19 24
3 32 7 21 23 25
31 9 2 22 27 20
8 28 33 17 10 15
30 5 34 12 14 16
4 36 29 13 18 11
>> B = nlfilter(A,[3 3],@(b) b(5)*all(b(5) >= b([2 4 6 8])))
B =
35 0 0 26 0 0
0 32 0 0 0 25
31 0 0 0 27 0
0 0 0 0 0 0
30 0 34 0 0 16
0 36 0 0 18 0
The above code defines an anonymous function which uses linear indexing to get the center element of a 3-by-3 submatrix b(5) and compare it to its 4-connected neighbors b([2 4 6 8]). The value in the center element is multiplied by the logical result returned by the function ALL, which is 1 when the center element is greater than or equal to all of its 4-connected neighbors and 0 otherwise.
If you don't have access to the Image Processing Toolbox, another way to accomplish this is by constructing four matrices representing the top, right, bottom and left first differences for each point and then searching for corresponding elements in all four matrices that are non-negative (i.e. the element is at least as large as each of its neighbours).
Here's the idea broken down...
Generate some test data:
>> sizeA = 3;
A = randi(255, sizeA)
A =
254 131 94
135 10 124
105 191 84
Pad the borders with zero-elements:
>> A2 = zeros(sizeA+2);
A2(2:end-1,2:end-1) = A
A2 =
0 0 0 0 0
0 254 131 94 0
0 135 10 124 0
0 105 191 84 0
0 0 0 0 0
Construct the four first-difference matrices:
>> leftDiff = A2(2:end-1,2:end-1) - A2(2:end-1,1:end-2)
leftDiff =
254 -123 -37
135 -125 114
105 86 -107
>> topDiff = A2(2:end-1,2:end-1) - A2(1:end-2,2:end-1)
topDiff =
254 131 94
-119 -121 30
-30 181 -40
>> rightDiff = A2(2:end-1,2:end-1) - A2(2:end-1,3:end)
rightDiff =
123 37 94
125 -114 124
-86 107 84
>> bottomDiff = A2(2:end-1,2:end-1) - A2(3:end,2:end-1)
bottomDiff =
119 121 -30
30 -181 40
105 191 84
Find the elements that are at least as large as all of their neighbours:
indexKeep = find(leftDiff >= 0 & topDiff >= 0 & rightDiff >= 0 & bottomDiff >= 0)
Create the resulting matrix:
>> B = zeros(sizeA);
B(indexKeep) = A(indexKeep)
B =
254 0 0
0 0 124
0 191 0
After wrapping this all into a function and testing it on 1000 random 100x100 matrices, the algorithm appears to be quite fast:
>> tic;
for ii = 1:1000
A = randi(255, 100);
B = test(A);
end; toc
Elapsed time is 0.861121 seconds.
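When comparing the approaches above, a slow but obvious reference implementation is handy for checking results. Here is a plain-Python sketch (loops on purpose, so it is not a performance suggestion); `suppress_non_maxima` is a hypothetical name, and neighbours outside the matrix are simply ignored, which gives the same result as zero padding when all values are positive.

```python
def suppress_non_maxima(matrix):
    """Zero every element that is smaller than any of its 4-connected neighbours."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = []
    for i, row in enumerate(matrix):
        out_row = []
        for j, value in enumerate(row):
            # Collect the up, down, left and right neighbours that exist.
            neighbours = [
                matrix[r][c]
                for r, c in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                if 0 <= r < n_rows and 0 <= c < n_cols
            ]
            out_row.append(value if all(value >= nb for nb in neighbours) else 0)
        result.append(out_row)
    return result

small = [[254, 131, 94], [135, 10, 124], [105, 191, 84]]
for row in suppress_non_maxima(small):
    print(row)
```

On the 3 x 3 test data it keeps 254, 124 and 191, the same entries as the first-difference approach.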