Array in pyspark don't return the groupby - pyspark

I have the df spark that I need build a array field.
I tried:
df_array = (df_csv.select(df_csv.depth,df_csv.height,df_csv.weight,df_csv.width,df_csv.seller_id,df_csv.sku,df_csv.navigation_id,df_csv.category,
df_csv.subcategory,df_csv.max_dimension_p,df_csv.max_side_p,
sf.struct(df_csv.id_distribution_center,df_csv.id_modality,df_csv.id_modality,
df_csv.zipcode_initial,df_csv.zipcode_final,df_csv.cost,df_csv.city,df_csv.state).alias("infos_gerais_product")))
But return the follow df:
depth height weight width seller_id sku navigation_id category subcategory max_dimension_p max_side_p infos_gerais_product
0.21 0.06 1.417 0.38 aabb 111122 333333333 QU GGGG 0.65 0.38 {2500, 2, 2, 8680...
0.21 0.06 1.417 0.38 aabb 111122 333333333 QU GGGG 0.65 0.38 {490, 1, 1, 85685...
0.21 0.06 1.417 0.38 aabb 111122 333333333 QU GGGG 0.65 0.38 {2500, 1, 1, 68556...
I need the return only one row, because all columns have the same informations how observation in df and only agregation the informations 'infos_gerais_product', can anyone help me?

Related

vectorise foor loop with a variable that is incremented in each iteration

I am trying to optimise the running time of my code by getting rid of some for loops. However, I have a variable that is incremented in each iteration in which sometimes the index is repeated. I provide here a minimal example:
a = [1 4 2 2 1 3 4 2 3 1]
b = [0.5 0.2 0.3 0.4 0.1 0.05 0.7 0.3 0.55 0.8]
c = [3 5 7 9]
for i = 1:10
c(a(i)) = c(a(i)) + b(i)
end
Ideally, I would like to compute it by writting:
c(a) = c(a) + b
but obviously it would not give me the same results since I have to recalculate the value for the same index several times so this way to vectorise it would not work.
Also, I am working in Matlab or Octave in case that this is important.
Thank you very much for any help, I am not sure that it is possible to be vectorise.
Edit: thank you very much for your answers so far. I have discovered accumarray, which I did not know before and also understood why changing the for loop between Matlab and Octave was giving me such different times. I also understood my problem better. I gave a too simple example which I thought I could extend, however, what if b was a matrix?
(Let's forget about c at the moment):
a = [1 4 2 2 1 3 4 2 3 1]
b =[0.69 -0.41 -0.13 -0.13 -0.42 -0.14 -0.23 -0.17 0.22 -0.24;
0.34 -0.39 -0.36 0.68 -0.66 -0.19 -0.58 0.78 -0.23 0.25;
-0.68 -0.54 0.76 -0.58 0.24 -0.23 -0.44 0.09 0.69 -0.41;
0.11 -0.14 0.32 0.65 0.26 0.82 0.32 0.29 -0.21 -0.13;
-0.94 -0.15 -0.41 -0.56 0.15 0.09 0.38 0.58 0.72 0.45;
0.22 -0.59 -0.11 -0.17 0.52 0.13 -0.51 0.28 0.15 0.19;
0.18 -0.15 0.38 -0.29 -0.87 0.14 -0.13 0.23 -0.92 -0.21;
0.79 -0.35 0.45 -0.28 -0.13 0.95 -0.45 0.35 -0.25 -0.61;
-0.42 0.76 0.15 0.99 -0.84 -0.03 0.27 0.09 0.57 0.64;
0.59 0.82 -0.39 0.13 -0.15 -0.71 -0.84 -0.43 0.93 -0.74]
I understood now that what I would be doing is rowSum per group, and given that I am using Octave I cannot use "splitapply". I tried to generalise your answers, but accumarray would not work for matrices and also I could not generalise #rahnema1 solution. The desired output would be:
[0.34 0.26 -0.93 -0.56 -0.42 -0.76 -0.69 -0.02 1.87 -0.53;
0.22 -1.03 1.53 -0.21 0.37 1.54 -0.57 0.73 0.23 -1.15;
-0.20 0.17 0.04 0.82 -0.32 0.10 -0.24 0.37 0.72 0.83;
0.52 -0.54 0.02 0.39 -1.53 -0.05 -0.71 1.01 -1.15 0.04]
that is "equivalent" to
[sum(b([1 5 10],:))
sum(b([3 4 8],:))
sum(b([6 9],:))
sum(b([2 7],:))]
Thank you very much, If you think I should include this in another question instead of adding the edit I will do so.
Original question
It can be done with accumarray:
a = [1 4 2 2 1 3 4 2 3 1];
b = [0.5 0.2 0.3 0.4 0.1 0.05 0.7 0.3 0.55 0.8];
c = [3 5 7 9];
c(:) = c(:) + accumarray(a(:), b(:));
This sums the values from b in groups defined by a, and adds that to the original c.
Edited question
If b is a matrix, you can use
full(sparse(repmat(a, 1, size(b,1)), repelem(1:size(b,2), size(b,1)), b))
or
accumarray([repmat(a, 1, size(b,1)).' repelem(1:size(b,2), size(b,1)).'], b(:))
Matrix multiplication and implicit expansion and can be used (Octave):
nc = numel(c);
c += b * (1:nc == a.');
For input of large size it may be more memory efficient to use sparse matrix:
nc = numel(c);
nb = numel(b);
c += b * sparse(1:nb, a, 1, nb, nc);
Edit: When b is a matrix you can extend this solution as:
nc = numel(c);
na = numel(a);
out = sparse(a, 1:na, 1, nc, na) * b;

Fitting a curve through scipy.optimize.curve_fit with endpoints fixed

The following script fits a curve bowing-like via curve_fit (from scipy.optimize), see below:
ydata = numpy.array[ 1.6504 1.63928044 1.62855028 1.6181874 1.60817119 1.59848249 1.58910347 1.58001759 1.57120948 1.56266487 1.55437054 1.54631424 1.5384846 1.53087109 1.52346397 1.5162542 1.5092334 1.50239383 1.4957283 1.48923013 1.48289315 1.47671162 1.4706802 1.46479393 1.45904821 1.45343874 1.44796151 1.44261281 1.43738913 1.43228723 1.42730406 1.42243677 1.4176827 1.41303936 1.40850439 1.40407561 1.39975096 1.39552851 1.39140647 1.38738314 1.38345695 1.37962642 1.37589018 1.37224696 1.36869555 1.36523487 1.36186389 1.35858169 1.35538741 1.35228028 1.34925958 1.34632469 1.34347504 1.34071015 1.33802957 1.33543295 1.33291998 1.33049042 1.32814407 1.32588081 1.32370057 1.32160331 1.31958908 1.31765795 1.31581005 1.31404556 1.31236472 1.3107678 1.30925513 1.30782709 1.30648411 1.30522666 1.3040553 1.30297062 1.30197327 1.30106398 1.30024355 1.29951286 1.29887287 1.29832464 1.29786933 1.29750821 1.29724268 1.29707426 1.29700463 1.29703564 1.29716927 1.29740773 1.2977534 1.29820885 1.29877688 1.29946049 1.3002629 1.30118751 1.30223793 1.30341792 1.30473139 1.30618232 1.30777475 1.30951267 1.3114 ]
xdata = numpy.array[ 0. 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1. ]
sigma = np.ones(len(xdata))
sigma[[0, -1]] = 0.01
def function_cte(x, b):
return 1.31*x + 1.57*(1-x) - b*x*(1-x)
def function_linear(x, c1, c2):
return 1.31*x + 1.57*(1-x) - (c1+c2*x)*x*(1-x)
popt_cte, pcov_cte = curve_fit(function_cte, xdata, ydata, sigma=sigma)
popt_lin, pcov_lin = curve_fit(function_linear, xdata, ydata, sigma=sigma)
But I'm getting the plot in figure ,
i.e., the initial points from both functions disagree with the data to fit (xdata, ydata).
I would like a fit constrained on the endpoints (0.0, 1.57) and (1.0, 1.31) at the same point that minimize the error. Any idea based on this code or it is better taking another way?
thanks!

Matrix columns correlations, excluding self-correlation, Matlab

I have a couple of matrices (1800 x 27) that represent subjects and their recordings (3 minutes equivalent for each of 27 subjects). Each column represents a subject.
I need to do intercorrelation between subjects, let's say to correlate F to G, G to H, and H to F for all 27 subjects.
I use CORR command corr(B) where B is a matrix and it returns the next example:
1 0.07 -0.05 0.10 0.04 0.12
0.07 1 -0.02 -0.08 0.17 0.03
-0.05 -0.02 1 0.04 0.16 0.13
0.10 -0.08 0.04 1 -0.04 0.34
0.04 0.18 0.16 -0.04 1 0.13
How can I adjust the code to exclude self-correlation (eg F to F) so I won't get "1" numerals?
(it's present in each row/column)
I have to perform some transformations afterwards, like Fisher Z-Transformation, which returns inf for each "1", and as result, I can't use further calculations.

Use matlab to arrange excel data

I have data in excel like this:
id date value
a 9/17/2012 0.25
a 9/18/2012 0.48
a 9/19/2012 0.29
a 9/20/2012 0.46
a 9/21/2012 0.17
a 9/24/2012 0.89
a 9/25/2012 0.20
a 9/26/2012 0.65
a 9/27/2012 0.26
b 9/17/2012 0.83
b 9/18/2012 0.87
b 9/19/2012 0.40
b 9/20/2012 0.33
b 9/21/2012 0.71
b 9/24/2012 0.13
b 9/25/2012 0.91
b 9/26/2012 0.73
b 9/27/2012 0.87
c 9/17/2012 0.47
c 9/18/2012 0.15
c 9/19/2012 0.73
c 9/20/2012 0.47
c 9/21/2012 0.03
c 9/24/2012 0.23
c 9/25/2012 0.21
c 9/26/2012 0.39
c 9/27/2012 0.77
and I would like to use Matlab to re-arrange to:
date a b c
9/17/2012 0.25 0.83 0.47
9/18/2012 0.48 0.87 0.15
9/19/2012 0.29 0.40 0.73
9/20/2012 0.46 0.33 0.47
9/21/2012 0.17 0.71 0.03
9/24/2012 0.89 0.13 0.23
9/25/2012 0.20 0.91 0.21
9/26/2012 0.65 0.73 0.39
9/27/2012 0.26 0.87 0.77
What's the easiest way to do this?
Use:
importdata or xlsread
join (statistics toolbox) or see MATLAB Combine matrices of different dimensions, filling values of corresponding indices.

matlab: sorting and random

I need to sort out few small matrices from 1 huge raw matrix ...according to sorting 1st column (1st column contain either 1, 2, or 3)...
if 1st column is 1, then randomly 75% of the 1 save in file A1, 25% of the 1 save in file A2.
if 1st column is 2, then randomly 75% of the 2 save in file B1, 25% of the 2 save in file B2.
if 1st column is 3, then randomly 75% of the 3 save in file C1, 25% of the 3 save in file C2.
how am i going to write the code?
Example:
a raw matrix has 15 rows x 6 columns:
7 rows are 1 in 1st column, 5 rows are 2 in 1st column, and 3 rows are 3 in 1st column.
1 -0.05 -0.01 0.03 0.07 0.11
1 -0.4 -0.36 -0.32 -0.28 -0.24
1 0.3 0.34 0.38 0.42 0.46
1 0.75 0.79 0.83 0.87 0.91
1 0.45 0.49 0.53 0.57 0.61
1 0.8 0.84 0.88 0.92 0.96
1 0.05 0.09 0.13 0.17 0.21
2 0.5 0.54 0.58 0.62 0.66
2 0.4 0.44 0.48 0.52 0.56
2 0.9 0.94 0.98 1.02 1.06
2 0.85 0.89 0.93 0.97 1.01
2 0.75 0.79 0.83 0.87 0.91
3 0.36 0.4 0.44 0.48 0.52
3 0.6 0.64 0.68 0.72 0.76
3 0.4 0.44 0.48 0.52 0.56
7 rows got 1 in 1st column, randomly take out 75% of 7 rows (which is 7*0.75=5.25) to be new matrix (5rows x 6 columns), the rest of 25% become another new matrix
5 rows got 2 in 1st column, randomly take out 75% of 5 rows (which is 5*0.75=3.75) to be new matrix (4rows x 6 columns), the rest of 25% become another new matrix
3 rows got 3 in 1st column, randomly take out 75% of 3 rows (which is 3*0.75=2.25) to be new matrix (2rows x 6 columns), the rest of 25% become another new matrix
Result:
A1=
1 -0.4 -0.36 -0.32 -0.28 -0.24
1 0.3 0.34 0.38 0.42 0.46
1 0.75 0.79 0.83 0.87 0.91
1 0.8 0.84 0.88 0.92 0.96
1 -0.05 -0.01 0.03 0.07 0.11
B1=
2 0.9 0.94 0.98 1.02 1.06
2 0.85 0.89 0.93 0.97 1.01
2 0.5 0.54 0.58 0.62 0.66
2 0.75 0.79 0.83 0.87 0.91
C1=
3 0.36 0.4 0.44 0.48 0.52
3 0.4 0.44 0.48 0.52 0.56
here is one possible solution to your problem using the function randperm:
% Create matrices
firstcol=ones(15,1);
firstcol(8:12)=2;
firstcol(13:15)=3;
mat=[firstcol rand(15,5)];
% Sort according to first column
A=mat(mat(:,1)==1,:);
B=mat(mat(:,1)==2,:);
C=mat(mat(:,1)==3,:);
% Randomly rearrange lines
A=A(randperm(size(A,1)),:);
B=B(randperm(size(B,1)),:);
C=C(randperm(size(C,1)),:);
% Select first 75% lines (rounding)
A1=A(1:round(0.75*size(A,1)),:);
A2=A(round(0.75*size(A,1))+1:end,:);
B1=B(1:round(0.75*size(B,1)),:);
B1=B(round(0.75*size(B,1))+1:end,:);
C1=C(1:round(0.75*size(C,1)),:);
C1=C(round(0.75*size(C,1))+1:end,:);
Hope it helps.