Pivot table in PySpark

I am very new to PySpark. My data frame looks like this:

id  value  subject
1   75     eng
1   80     his
2   83     math
2   73     science
3   88     eng

I want my data frame to look like this:

id  eng  his  math  science
1   .49  .51  0     0
2   0    0    .53   .47
3   1    0    0     0

That is, I want to sum each row and then divide each cell by its row total, i.e. calculate the percentage each cell contributes to its row.
I have tried the following code, but it's not working:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('rank',F.dense_rank().over(Window.orderBy("id","value","subject")))
df.withColumn('combcol',F.concat(F.lit('col_'),df['rank'])).groupby('id').pivot('combcol').agg(F.first('value')).show()

Check if the following code works for you.
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 75, 'eng'), (1, 80, 'his'), (2, 83, 'math'), (2, 73, 'science'), (3, 88, 'eng')],
    ['id', 'value', 'subject']
)
# create the pivot table
df1 = df.groupby('id').pivot('subject').agg(F.first('value')).fillna(0)
# column names used to sum up for total
cols = df1.columns[1:]
# calculate the total and then percentage accordingly for each cols
df1.withColumn('total', sum([F.col(c) for c in cols])) \
.select('id', *[ F.format_number(F.col(c)/F.col('total'),2).alias(c) for c in cols] ) \
.show()
#+---+----+----+----+-------+
#| id| eng| his|math|science|
#+---+----+----+----+-------+
#| 1|0.48|0.52|0.00| 0.00|
#| 3|1.00|0.00|0.00| 0.00|
#| 2|0.00|0.00|0.53| 0.47|
#+---+----+----+----+-------+
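A small variation on the above, in case it helps: if the set of subjects is known up front, you can pass it to pivot so Spark does not need an extra pass over the data to discover the distinct pivot values (the subject list below is just taken from the sample data):
# passing the pivot values explicitly avoids a separate scan for distinct subjects
subjects = ['eng', 'his', 'math', 'science']
df1 = df.groupby('id').pivot('subject', subjects).agg(F.first('value')).fillna(0)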

Related

kdb union join (with plus join)

I have been stuck on this for a while now and cannot come up with a solution; any help would be appreciated.
I have 2 tables like:
q)x
a b c d
--------
1 x 10 1
2 y 20 1
3 z 30 1
q)y
a b| c d
---| ----
1 x| 1 10
3 h| 2 20
I would like to sum the values for the common keys and append the new ones. The expected result should be:
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
pj only updates the (1,x) row but doesn't insert the new (3,h) one. I am assuming there has to be a way to do some sort of union+plus join in kdb.
You can take advantage of the plus (+) operator here by simply keying x and adding the table y to get the desired table:
q)(2!x)+y
a b| c d
---| -----
1 x| 11 11
2 y| 20 1
3 z| 30 1
3 h| 2 20
The same "plus if there's a matching key, insert if not" behaviour works for dictionaries too:
q)(`a`b!1 2)+`a`c!10 30
a| 11
b| 2
c| 30
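Not kdb, but for comparison: the same keyed "sum on match, carry over otherwise" behaviour can be mimicked in pandas with DataFrame.add and fill_value=0 on frames keyed by a and b (a purely illustrative sketch, not part of the original answer):
import pandas as pd

x = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z'],
                  'c': [10, 20, 30], 'd': [1, 1, 1]}).set_index(['a', 'b'])
y = pd.DataFrame({'a': [1, 3], 'b': ['x', 'h'],
                  'c': [1, 2], 'd': [10, 20]}).set_index(['a', 'b'])

# matching keys are summed; keys present in only one frame are carried over as-is
# (the result comes back as floats because of the alignment step)
print(x.add(y, fill_value=0))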
got it :)
q) (x pj y), 0!select from y where not ([]a;b) in key 2!x
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
Always open for a better implementation :D I am sure there is one.

How to flatten a data frame in apache spark | Scala

I have the following data frame :
df1
uid  text  frequency
1    a     1
1    b     0
1    c     2
2    a     0
2    b     0
2    c     1
I need to flatten it on the basis of uid to :
df2
uid  a  b  c
1    1  0  2
2    0  0  1
I've done something along these lines in R but haven't been able to translate it into SQL or Scala.
Any suggestions on how to approach this?
You can group by uid, use text as a pivot column and sum frequencies:
df1
.groupBy("uid")
.pivot("text")
.sum("frequency").show()

How to remove 'u'(unicode) from all the values in a column in a DataFrame?

How does one remove 'u'(unicode) from all the values in a column in a DataFrame?
table.place.unique()
array([u'Newyork', u'Chicago', u'San Francisco'], dtype=object)
>>> df = pd.DataFrame([u'c%s'%i for i in range(11,21)], columns=["c"])
>>> df
c
0 c11
1 c12
2 c13
3 c14
4 c15
5 c16
6 c17
7 c18
8 c19
9 c20
>>> df['c'].values
array([u'c11', u'c12', u'c13', u'c14', u'c15', u'c16', u'c17', u'c18',
u'c19', u'c20'], dtype=object)
>>> df['c'].astype(str).values
array(['c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c20'], dtype=object)
>>>
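Applied to the column from the question (the frame there is called table), the same conversion is a one-liner:
# replace the unicode values in 'place' with plain str values
table['place'] = table['place'].astype(str)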

Split 2 x N matrix into two submatrices in MATLAB

I have a 2 x N matrix (let's call it MyMatrix) containing pairs of elements (the element in (1,1) corresponds to the element in (2,1), the element in (1,2) corresponds to the element in (2,2), and so on). Entries in the first row are sorted in ascending order. What I would like to do is split this matrix into two matrices, 2 x K and 2 x N-K. The first matrix will contain the part of MyMatrix where the entries in row 1 are less than some given value (in my example it will be (max-min)/2, where max = maximum value in row 1 and min = minimum value in row 1), and the second matrix will consist of the rest of MyMatrix. I'm sorry if it is confusing, but I tried my best to explain what I would like to achieve.
Here is an example:
MyMat =
| 1    2   4   6  13  52  65  120  125 |
| 4  132  53   1  64  34   5    2   66 |
min = 1 , max = 125, avg = (125-1)/2 = 62.
so result will be as follows:
a =
| 1    2   4   6  13  52 |
| 4  132  53   1  64  34 |
b =
| 65  120  125 |
|  5    2   66 |
Thanks in advance for your help.
Kind regards,
Tom.
You can simply do
a=MyMat(:,MyMat(1,:)<avg);
b=MyMat(:,MyMat(1,:)>=avg);
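Purely for comparison (not part of the MATLAB question), the same column-mask split in Python/NumPy looks like this, using the sample data above:
import numpy as np

MyMat = np.array([[1,   2,  4, 6, 13, 52, 65, 120, 125],
                  [4, 132, 53, 1, 64, 34,  5,   2,  66]])

avg = (MyMat[0].max() - MyMat[0].min()) / 2   # 62, the threshold from the question
a = MyMat[:, MyMat[0] < avg]                  # columns whose first-row entry is below avg
b = MyMat[:, MyMat[0] >= avg]                 # the remaining columns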

MatLab to convert a matrix with respect to 1st col

This question is an outgrowth of "MatLab (or any other language) to convert a matrix or a csv to put 2nd column values to the same row if 1st column value is the same?" and "Group values in different rows by their first-column index".
If
A = [2 3 234 ; 2 44 99999; 2 99999 99999; 3 123 99; 3 1232 45; 5 99999 57]
1st column | 2nd column | 3rd column
--------------------------------------
2 3 234
2 44 99999
2 99999 99999
3 123 99
3 1232 45
5 99999 57
I want to make
1st col | 2nd col | 3rd col | 4th col | 5th col | 6th col| 7th col
--------------------------------------------------------------------
2 3 234 44
3 123 99 1232 45
5 57
That is, for each number in the 1st col of A, I want to collect the values from its rows EXCEPT "99999".
If we disregard the "except 99999" part, we can use the code from "Group values in different rows by their first-column index":
[U, ix, iu] = unique(A(:,1));
vals = reshape(A(:, 2:end).', [], 1);                      % columnize values
subs = reshape(iu(:, ones(size(A, 2) - 1, 1)).', [], 1);   % replicate indices
r = accumarray(subs, vals, [], @(x) {x'});
But obviously this code won't ignore 99999.
I guess there are two ways
1. first make r, and then remove 99999
2. remove 99999 first, and then make r
Whichever it is, I just want the faster one.
Thank you in advance!
I think option 1 is better, i.e. first make r, and then remove 99999. Having r, you can remove 99999 as follows:
r2 = {};                         % new cell array without 99999
for i = 1:numel(r)
    rCell = r{i};
    whereIs9999 = rCell == 99999;
    rCell(whereIs9999) = [];     % remove 99999
    r2{i} = rCell;
end
Or, a fancier one-liner:
r2 = cellfun(@(c) {c(c~=99999)}, r);
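For comparison only, a plain-Python sketch of the same "group by the first column, dropping the 99999 sentinel" idea (variable names are just illustrative):
A = [[2, 3, 234], [2, 44, 99999], [2, 99999, 99999],
     [3, 123, 99], [3, 1232, 45], [5, 99999, 57]]

groups = {}
for row in A:
    key, rest = row[0], [v for v in row[1:] if v != 99999]
    groups.setdefault(key, []).extend(rest)

# groups == {2: [3, 234, 44], 3: [123, 99, 1232, 45], 5: [57]}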