Again, sorry for asking a stupid question; I'm new to PySpark.
I'm trying to transpose my dataset from:
id dif BA AB
1 1 100 30
1 2 200 40
1 3 300 20
1 4 100 15
2 1 99 18
2 2 89 9
3 1 29 20
3 2 70 32
3 3 39 13
4 1 24 14
4 2 56 23
4 3 24 26
5 1 27 31
5 2 30 17
6 1 100 19
to become
id BA1 BA2 BA3 BA4 AB1 AB2 AB3 AB4
1 100 200 300 100 30 40 20 15
2 99 89 NaN NaN 18 9 NaN NaN
3 29 70 39 NaN 20 32 13 NaN
4 24 56 24 NaN 14 23 26 NaN
5 27 30 NaN NaN 31 17 NaN NaN
6 100 NaN NaN NaN 19 NaN NaN NaN
In Python (pandas), I used the following code and it works:
df['AB_idx'] = 'AB' + df.dif.astype(str)
AB = df.pivot(index='id',columns='AB_idx',values='AB').reset_index()
df['BA_idx'] = 'BA' + df.dif.astype(str)
BA = df.pivot(index='id',columns='BA_idx',values='BA').reset_index()
df1=pd.merge(BA, AB, on='id', how='left')
Now the problem is that I have to translate this into PySpark, since I have more than 1 billion records.
Any thoughts? Thanks.
I did see some examples, like this post:
Pyspark: reshape data without aggregation
But it doesn't seem to work for me, since I have more than 100 different dif values.
You can try pivot with first as the aggregation and then rename the columns:
import pyspark.sql.functions as F

output = (df.groupBy("id").pivot("dif")
          .agg(*[F.first(i).alias(i) for i in ["BA", "AB"]])
          .orderBy("id"))

# pivot() names the columns "1_BA", "1_AB", ...; flip them to "BA1", "AB1", ...
d = dict(zip(output.columns, [''.join(i.split('_')[::-1]) for i in output.columns]))
output.select(*[F.col(k).alias(v) for k, v in d.items()]).show()
+---+---+---+----+----+----+----+----+----+
| id|BA1|AB1| BA2| AB2| BA3| AB3| BA4| AB4|
+---+---+---+----+----+----+----+----+----+
| 1|100| 30| 200| 40| 300| 20| 100| 15|
| 2| 99| 18| 89| 9|null|null|null|null|
| 3| 29| 20| 70| 32| 39| 13|null|null|
| 4| 24| 14| 56| 23| 24| 26|null|null|
| 5| 27| 31| 30| 17|null|null|null|null|
| 6|100| 19|null|null|null|null|null|null|
+---+---+---+----+----+----+----+----+----+
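One performance note at this scale: if the distinct dif values are known up front (or are cheap to collect once), passing them to pivot explicitly spares Spark a separate pass over the data to discover them. A minimal sketch of that variant, reusing the df and F names from above:

# Collect the pivot values once (or hard-code them if you already know them).
dif_values = [r["dif"] for r in df.select("dif").distinct().collect()]

output = (df.groupBy("id")
          .pivot("dif", dif_values)  # explicit values: no discovery pass
          .agg(*[F.first(c).alias(c) for c in ["BA", "AB"]])
          .orderBy("id"))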
I have been stuck on this for a while now but cannot come up with a solution; any help would be appreciated.
I have 2 tables:
q)x
a b c d
--------
1 x 10 1
2 y 20 1
3 z 30 1
q)y
a b| c d
---| ----
1 x| 1 10
3 h| 2 20
I would like to sum the value columns where the keys match and append the rows with new keys. The expected result should be:
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
pj seems to only update the (1,x) row but doesn't insert the new (3,h) one. I am assuming there has to be a way to do some sort of union+plus join in kdb.
You can take advantage of the plus (+) operator here by simply keying x and adding the table y to get the desired table:
q)(2!x)+y
a b| c d
---| -----
1 x| 11 11
2 y| 20 1
3 z| 30 1
3 h| 2 20
The same "plus if there's a matching key, insert if not" behaviour works for dictionaries too:
q)(`a`b!1 2)+`a`c!10 30
a| 11
b| 2
c| 30
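As a cross-language aside, the same "sum on matching keys, insert otherwise" behaviour is what collections.Counter addition gives you in Python, at least for positive values (Counter's + drops non-positive results). A rough analogue of the dictionary example above, just for intuition:

from collections import Counter

# q: (`a`b!1 2)+`a`c!10 30  ->  `a`b`c!11 2 30
merged = Counter({"a": 1, "b": 2}) + Counter({"a": 10, "c": 30})
print(dict(merged))  # {'a': 11, 'b': 2, 'c': 30}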
got it :)
q) (x pj y), 0!select from y where not ([]a;b) in key 2!x
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
Always open for a better implementation :D I am sure there is one.
q)(`a`b`c!101 0N 103)~100 101 102 103 104#`a`b`c!1 5 3
1b
I noticed that the two lines below are equivalent:
100 101 102 103 104#`a`b`c!1 5 3
`a`b`c!100 101 102 103 104#1 5 3
Is it in general true that a#b!c is equivalent to b!a#c?
On the overloaded glyphs page for #, is this usage Index At or Apply At or something else? For x#y is there documentation on the exact behavior when y is a dictionary?
An operation that works on a list will also work on a list that just so happens to be the value of a dictionary (it applies "through" the dictionary and keeps the keys untouched). E.g.:
q)1+`a`b!1 2
a| 2
b| 3
q)2*`a`b!1 2
a| 2
b| 4
Thus an index at would apply the same way:
q)10 20 30#`a`b!1 2
a| 20
b| 30
since it's the same as:
q)10 20 30#1 2
20 30
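For intuition, the same "apply to the values, keep the keys" idea can be written in Python (not a q feature, just an analogue of the example above):

# q: 10 20 30#`a`b!1 2  ->  `a`b!20 30
src = [10, 20, 30]
d = {"a": 1, "b": 2}
print({k: src[v] for k, v in d.items()})  # {'a': 20, 'b': 30}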
It's not always true that a#b!c is equivalent to b!a#c; it fails when a#c doesn't make sense on its own. E.g.:
q)10 20 30#1.1 2.2
'type
[0] 10 20 30#1.1 2.2
^
q)10 20 30#`a`b!1.1 2.2
'type
[0] 10 20 30#`a`b!1.1 2.2
^
It would only be true if the underlying datatype of the indexes is boolean/short/int/long.
Answered above, mainly: it's index at. There's no explicit documentation for when y is a dictionary, as it is essentially the same behaviour as when y is a list.
In MATLAB, I want to fit a piecewise regression and find where on the x-axis the first change-point occurs. For example, for the following data the output might be changepoint=20 (I don't actually want to plot it, just want the change point).
data = [1 4 4 3 4 0 0 4 5 4 5 2 5 10 5 1 4 15 4 9 11 16 23 25 24 17 31 42 35 45 49 54 74 69 63 46 35 31 27 15 10 5 10 4 2 4 2 2 3 5 2 2];
x = 1:52;
plot(x,data,'.')
If you have the Signal Processing Toolbox, you can directly use the findchangepts function (see https://www.mathworks.com/help/signal/ref/findchangepts.html for documentation):
data = [1 4 4 3 4 0 0 4 5 4 5 2 5 10 5 1 4 15 4 9 11 16 23 25 24 17 31 42 35 45 49 54 74 69 63 46 35 31 27 15 10 5 10 4 2 4 2 2 3 5 2 2];
x = 1:52;
ipt = findchangepts(data);
x_cp = x(ipt);
data_cp = data(ipt);
plot(x,data,'.',x_cp,data_cp,'o')
The index of the change point in this case is 22.
Plot of data and its change point circled in red:
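If you ever need the same thing outside MATLAB, the Python ruptures package offers comparable offline change-point detection; a minimal sketch on the same data (the choice of 2 breakpoints is just an assumption for illustration):

import numpy as np
import ruptures as rpt  # pip install ruptures

data = np.array([1, 4, 4, 3, 4, 0, 0, 4, 5, 4, 5, 2, 5, 10, 5, 1, 4, 15, 4, 9,
                 11, 16, 23, 25, 24, 17, 31, 42, 35, 45, 49, 54, 74, 69, 63, 46,
                 35, 31, 27, 15, 10, 5, 10, 4, 2, 4, 2, 2, 3, 5, 2, 2], float)

# Binary segmentation with a least-squares cost; ask for e.g. 2 breakpoints.
algo = rpt.Binseg(model="l2").fit(data)
bkps = algo.predict(n_bkps=2)  # end (exclusive) index of each segment; last = len(data)
print(bkps)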
I know this is an old question but I just want to provide some extra thoughts. In MATLAB, an alternative is a Bayesian changepoint detection algorithm I implemented; it estimates not just the number and locations of the changepoints but also reports the probability of a changepoint occurring. In its current implementation it deals only with time-series-like data (i.e., 1D sequential data). More info about the tool is available at this FileExchange entry (https://www.mathworks.com/matlabcentral/fileexchange/72515-bayesian-changepoint-detection-time-series-decomposition).
Here is its quick application to your sample data:
% Automatically install the Rbeast or BEAST library to local drive
eval(webread('http://b.link/beast'))
data = [1 4 4 3 4 0 0 4 5 4 5 2 5 10 5 1 4 15 4 9 11 16 23 25 24 17 31 42 35 45 49 54 74 69 63 46 35 31 27 15 10 5 10 4 2 4 2 2 3 5 2 2];
out = beast(data, 'season','none') % season='none': there is no seasonal/periodic variation in the data
printbeast(out)
plotbeast(out)
Below is a summary of the changepoint, given by printbeast():
#####################################################################
# Trend Changepoints #
#####################################################################
.-------------------------------------------------------------------.
| Ascii plot of probability distribution for number of chgpts (ncp) |
.-------------------------------------------------------------------.
|Pr(ncp = 0 )=0.000|* |
|Pr(ncp = 1 )=0.000|* |
|Pr(ncp = 2 )=0.000|* |
|Pr(ncp = 3 )=0.859|*********************************************** |
|Pr(ncp = 4 )=0.133|******** |
|Pr(ncp = 5 )=0.008|* |
|Pr(ncp = 6 )=0.000|* |
|Pr(ncp = 7 )=0.000|* |
|Pr(ncp = 8 )=0.000|* |
|Pr(ncp = 9 )=0.000|* |
|Pr(ncp = 10)=0.000|* |
.-------------------------------------------------------------------.
| Summary for number of Trend ChangePoints (tcp) |
.-------------------------------------------------------------------.
|ncp_max = 10 | MaxTrendKnotNum: A parameter you set |
|ncp_mode = 3 | Pr(ncp= 3)=0.86: There is a 85.9% probability |
| | that the trend component has 3 changepoint(s).|
|ncp_mean = 3.15 | Sum{ncp*Pr(ncp)} for ncp = 0,...,10 |
|ncp_pct10 = 3.00 | 10% percentile for number of changepoints |
|ncp_median = 3.00 | 50% percentile: Median number of changepoints |
|ncp_pct90 = 4.00 | 90% percentile for number of changepoints |
.-------------------------------------------------------------------.
| List of probable trend changepoints ranked by probability of |
| occurrence: Please combine the ncp reported above to determine |
| which changepoints below are practically meaningful |
'-------------------------------------------------------------------'
|tcp# |time (cp) |prob(cpPr) |
|------------------|---------------------------|--------------------|
|1 |33.000000 |1.00000 |
|2 |42.000000 |0.98271 |
|3 |19.000000 |0.69183 |
|4 |26.000000 |0.03950 |
|5 |11.000000 |0.02292 |
.-------------------------------------------------------------------.
Here is the graphic output. Three major changepoints are detected:
You can use the sgolayfilt function, which fits a polynomial to the data, or reproduce the OLS method yourself: http://www.utdallas.edu/~herve/Abdi-LeastSquares06-pretty.pdf (note it uses a+bx notation instead of ax+b).
For a linear fit of ax+b:
If you use a constant abscissa vector of length 2n+1, x = [-n, ..., 0, ..., n], for every window, you get the following code for the sliding regression coefficients:
n = 5;                          % half-window; each window has 2*n+1 samples
x = -n:n;                       % constant local abscissa for every window
sum_x2 = sum(x.^2);             % denominator of the slope estimate
for i = 1+n : length(y)-n
    yi = y(i-n : i+n);
    a(i) = sum(yi.*x) / sum_x2; % least-squares slope
    b(i) = sum(yi) / (2*n+1);   % sliding average
end
Notice that in this code b is the sliding average of your data and a is the least-squares slope estimate (the first derivative).
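The same two sliding estimates can also be computed without the explicit loop, as correlations with fixed kernels; a minimal NumPy sketch of the same idea, assuming y is a 1-D array:

import numpy as np

def sliding_slope_and_mean(y, n):
    # Least-squares slope and mean over a centred window of length 2*n+1.
    x = np.arange(-n, n + 1)
    # np.convolve flips its kernel, so pass x reversed to get a correlation.
    a = np.convolve(y, x[::-1], mode='valid') / np.sum(x**2)        # slope
    b = np.convolve(y, np.ones(2*n + 1), mode='valid') / (2*n + 1)  # sliding mean
    return a, b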
Not sure how to explain it, so here it is as an example:
A=[1 0 0 1 4 4 4 4
0 0 0 0 2 3 2 2
0 0 0 0 0 0 0 1
2 3 4 5 2 3 4 1 ]
result:
b=[ 1 1 13 12
5 9 5 6];
Each element is computed by summing an N-by-N submatrix of the original, in this case N=2.
So b(1,1) is A(1,1)+A(1,2)+A(2,1)+A(2,2), and b(1,4) is A(1,7)+A(2,7)+A(1,8)+A(2,8).
Visually and more clearly:
A=[|1 0| 0 1| 4 4| 4 4|
|0 0| 0 0| 2 3| 2 2|
____________________
|0 0| 0 0| 0 0| 0 1|
|2 3| 4 5| 2 3| 4 1| ]
b is the sum of the elements on those squares, in this example of size 2.
I can imagine how to do it with loops, but it just feels vectorizable. Any ideas of how it could be done?
Assume that the sizes of A are multiples of N.
If you have the Image Processing Toolbox, blockproc could be an option as well:
B = blockproc(A,[2 2],@(x) sum(x.data(:)))
Method 1:
Using mat2cell and cellfun
n = 2;
AC = mat2cell(A,repmat(n,size(A,1)/n,1),repmat(n,size(A,2)/n,1));
out = cellfun(@(x) sum(x(:)), AC)
Method 2:
Using permute and reshape
n = 2;
[rows,cols] = size(A);
out = reshape(sum(sum(permute(reshape(A,n,rows/n,n,[]),[1 3 2 4]))),rows/n,[]);
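As a side note, the same reshape trick reads a little differently in NumPy because of row-major ordering; a small sketch with the example matrix, in case it helps to see it in another language:

import numpy as np

A = np.array([[1, 0, 0, 1, 4, 4, 4, 4],
              [0, 0, 0, 0, 2, 3, 2, 2],
              [0, 0, 0, 0, 0, 0, 0, 1],
              [2, 3, 4, 5, 2, 3, 4, 1]])
n = 2
rows, cols = A.shape
# Split each axis into (block index, offset within block), then sum the offsets.
B = A.reshape(rows // n, n, cols // n, n).sum(axis=(1, 3))
print(B)  # [[ 1  1 13 12]
          #  [ 5  9  5  6]]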
PS: Here is a closely related question, which you might find useful: that question asks for the mean while this one asks for the sum.
Here are two alternative methods:
Method #1 - im2col
Another method using the image processing toolbox is to use im2col with the distinct flag and sum over all of the resulting columns. You would then need to reshape the matrix back to the right size:
n = 2;
B = im2col(A, [n n], 'distinct');
C = reshape(sum(B, 1), size(A,1)/n, size(A,2)/n);
We get for C:
>> C
C =
1 1 13 12
5 9 5 6
Method #2 - accumarray and kron
We can generate an index matrix with kron which we can use as bins into accumarray and invoke sum as the custom function. We would again have to reshape the matrix back to the right size:
n = 2;
M = reshape(1:prod([size(A,1)/n, size(A,2)/n]), size(A,1)/n, size(A,2)/n);
ind = kron(M, ones(n));
C = reshape(accumarray(ind(:), A(:), [], @sum), size(A,1)/n, size(A,2)/n);
Again we get for C:
C =
1 1 13 12
5 9 5 6
I am creating an iPhone app where I have a grid view of 25 images:
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
Now when any 5 consecutive images are selected it should say Bingo; for example, if 0, 6, 12, 18 and 24 are selected, it should say Bingo.
How will I do that? Please help me.
Many thanks for your help.
------------------------------
|  0 |  1 |  2 |  3 |  4 |
------------------------------
|  5 |  6 |  7 |  8 |  9 |
------------------------------
| 10 | 11 | 12 | 13 | 14 |
------------------------------
| 15 | 16 | 17 | 18 | 19 |
------------------------------
| 20 | 21 | 22 | 23 | 24 |
------------------------------
This is the 5x5 grid from your question.
Associate each cell with an array containing all of that cell's neighbour elements.
For example, the neighbour array of cell [6] would look like array(0, 1, 2, 5, 7, 10, 11, 12), which holds all the immediate neighbours of [6], including the diagonal ones.
Set counter = 0.
When someone clicks an element, increment the counter (now counter = 1).
When they click the second element, check whether it is in the neighbour array of the previous element.
If it is, increment the counter (now counter = 2).
ELSE
If it is not in the neighbour array, reset the counter (counter = 0) and start over.
When the counter reaches 5, say Bingo!
The algorithm is not fully correct, but I hope you get the idea :) (a more direct check is sketched below).
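A simpler alternative to the neighbour-counting idea is to test the complete lines directly: a 5x5 board has only 12 winning lines (5 rows, 5 columns, 2 diagonals). A rough Python sketch of that check (the index arithmetic carries over to Objective-C one-to-one):

def is_bingo(selected):
    # selected: set of chosen cell indices, 0..24, on a 5x5 grid
    size = 5
    lines = [[r * size + c for c in range(size)] for r in range(size)]   # rows
    lines += [[r * size + c for r in range(size)] for c in range(size)]  # columns
    lines.append([i * (size + 1) for i in range(size)])        # diagonal 0,6,12,18,24
    lines.append([(i + 1) * (size - 1) for i in range(size)])  # diagonal 4,8,12,16,20
    return any(all(cell in selected for cell in line) for line in lines)

print(is_bingo({0, 6, 12, 18, 24}))  # True: the example from the question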