PySpark - Add column with number of occurrences


Let's say I have a dataframe with a column A, and I would like to add another column holding the number of occurrences of each distinct value in A, without ending up with a grouped dataframe. That is, I would like to have the following:
A A_counts
a 3
a 3
b 1
c 2
c 2
a 3
Thanks in advance
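
One way to read the requested result: every row stays in place and gains the number of rows that share its value of A. The sketch below shows only that logic, in plain Python, on the sample values from the question; the name add_counts is hypothetical. In PySpark the same idea is usually written as a count over a window partitioned by A, for example F.count("A").over(Window.partitionBy("A")), which adds a column without collapsing rows.

```python
from collections import Counter

def add_counts(values):
    """Pair each value with the number of times it occurs in the whole list."""
    counts = Counter(values)                  # first pass: count every distinct value
    return [(v, counts[v]) for v in values]   # second pass: keep one output row per input row

# Sample column A from the question.
rows = add_counts(["a", "a", "b", "c", "c", "a"])
for a, a_count in rows:
    print(a, a_count)
```

The key point is that counting and attaching the count are separate steps, so the number of rows never changes.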

Related

[spark-scalapi]calculate correlation between multiple columns and one specific column after groupby the spark data frame

I have a dataframe like below:
groupid datacol1 datacol2 datacol3 datacol* corr_col
00001 1 2 3 4 5
00001 2 3 4 6 5
00002 4 2 1 7 5
00002 8 9 3 2 5
00003 7 1 2 3 5
00003 3 5 3 1 5
I want to calculate the correlation between each datacol* column and the corr_col column, per groupid.
So I used the following Spark Scala code:
df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"), functions.corr("datacol2", "corr_col"), functions.corr("datacol3", "corr_col"), .....)
This is very inefficient. Is there a more efficient way to do this?
[EDIT] I mean that if I have 30 datacol columns, I need to write functions.corr 30 times to calculate the correlations.
I have searched, and it seems that functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.
So is there any way to do this efficiently? I would prefer to use the Spark Scala API.
Thanks

I have found one solution. The steps are as follows:
Use the following code to create a mutable data frame variable, df_all:
df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))
Iterate over all remaining datacol columns, creating a temp data frame in each iteration. Within the iteration, join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
After the iteration, I get a data frame which contains all the correlation data. I still need to verify the data.
UPDATE:
Found an efficient way to do it: generate a list of column expressions that calculate the correlations, like List(corr(), corr(), ..., corr()), then pass this list to the agg function to generate the correlation data frame.
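
The idea in the UPDATE (build the list of aggregates from the column names, then apply them all in one grouped pass) can be sketched in plain Python. This is an illustration of the pattern, not Spark code: the helper names pearson, make_corr and grouped_corr are hypothetical, and the data is made up, because corr_col is constant in the question's table and a correlation with a constant column is undefined. In the Scala API the equivalent is typically a list such as cols.map(c => corr(c, "corr_col")) passed as agg(exprs.head, exprs.tail: _*).

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)

def make_corr(col, target):
    """Build one aggregate: correlation of `col` with `target` over a group's rows."""
    return lambda rows: pearson([r[col] for r in rows], [r[target] for r in rows])

def grouped_corr(rows, key, data_cols, target):
    # Build the aggregates once from the column names (no hand-written repetition).
    aggs = {c: make_corr(c, target) for c in data_cols}
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    # Apply every aggregate to every group in a single pass over the groups.
    return {g: {c: f(rs) for c, f in aggs.items()} for g, rs in groups.items()}

# Hypothetical data, chosen so that the correlations are easy to check by hand.
data = [
    {"groupid": "00001", "datacol1": 1, "datacol2": 6, "corr_col": 2},
    {"groupid": "00001", "datacol1": 2, "datacol2": 4, "corr_col": 4},
    {"groupid": "00001", "datacol1": 3, "datacol2": 2, "corr_col": 6},
    {"groupid": "00002", "datacol1": 1, "datacol2": 5, "corr_col": 1},
    {"groupid": "00002", "datacol1": 3, "datacol2": 5, "corr_col": 2},
]
result = grouped_corr(data, "groupid", ["datacol1", "datacol2"], "corr_col")
print(result)
```

Adding a thirty-first column then only means adding one more name to the list.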

generate serial number in decreasing order given a variable in tableau

I would like to find out the syntax in Tableau: given a column NUMBER, I am trying to generate rows for each number in decreasing order down to 0.
Below is an example of what I'm trying to do
BEFORE
ID NUMBER
A 4
B 5
AFTER
ID NUMBER
A 4
A 3
A 2
A 1
B 5
B 4
B 3
B 2
B 1

If your Tableau version supports it, try using ROW_NUMBER with ORDERBY. Something like this:
{ ORDERBY [Your column]:ROW_NUMBER()}
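
To make the target transformation concrete, here is a minimal sketch in plain Python (not Tableau syntax) that turns the BEFORE rows into the AFTER rows. The name expand_descending is hypothetical. The question text says "down to 0", but the example stops at 1, so the sketch follows the example.

```python
def expand_descending(rows):
    """For each (id, number) emit (id, number), (id, number - 1), ..., (id, 1)."""
    out = []
    for row_id, number in rows:
        for n in range(number, 0, -1):  # stops at 1, matching the AFTER example
            out.append((row_id, n))
    return out

# BEFORE rows from the question.
expanded = expand_descending([("A", 4), ("B", 5)])
for row_id, n in expanded:
    print(row_id, n)
```

Changing the range to range(number, -1, -1) would include 0 as well.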

Aggregating from multiple columns in Tableau

I have a table that looks like:
id aff1 aff2 aff3 value
1 a x b 5
2 b c x 4
3 a b g 1
I would like to aggregate the aff columns to calculate the sum of "value" for each aff. For example, the above gives:
aff sum
a 6
b 10
c 4
g 1
x 9
Ideally, I'd like to do this directly in tableau without remaking the table by unfolding it along all the aff columns.

You can use Tableau's built-in pivot feature, without reshaping the data at the source.
Ctrl-select all 3 dimensions you want to merge, and click Pivot.
You will get your reshaped data; delete the other columns.
Finally, build your view.
I hope this answers the question. Other options for getting the above result include a JOIN at the DB level, or creating multiple calculated fields for each attribute value, neither of which is scalable.
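
What the pivot followed by a sum computes can be sketched in plain Python on the table from the question. This is only an illustration of the reshaping logic, not of Tableau itself, and the name sum_by_aff is hypothetical.

```python
from collections import defaultdict

def sum_by_aff(rows, aff_cols=("aff1", "aff2", "aff3")):
    """Unpivot the aff columns into (aff, value) pairs and sum value per aff."""
    totals = defaultdict(int)
    for r in rows:
        for col in aff_cols:              # the pivot step: one pair per aff column
            totals[r[col]] += r["value"]  # the aggregation step
    return dict(sorted(totals.items()))

# Table from the question.
table = [
    {"id": 1, "aff1": "a", "aff2": "x", "aff3": "b", "value": 5},
    {"id": 2, "aff1": "b", "aff2": "c", "aff3": "x", "value": 4},
    {"id": 3, "aff1": "a", "aff2": "b", "aff3": "g", "value": 1},
]
sums = sum_by_aff(table)
print(sums)
```

Note that if the same aff appeared twice in a single row, this sketch would count that row's value twice for it.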

Comparing, matching and combining columns of data

I need some help matching data and combining it. I currently have four columns of data in an Excel sheet, similar to the following:
Column: 1 2 3 4
U 3 A 0
W 6 B 0
R 1 C 0
T 9 D 0
... ... ... ...
Column two is a data value that corresponds to the letter in column one. What I need to do is compare column 3 with column 1 and whenever it matches copy the corresponding value from column 2 to column 4.
You might ask why I don't do this manually. I have a spreadsheet with around 100,000 rows, so this really isn't an option!
I do have access to MATLAB and have the information imported, if this would be more easily completed within that environment, please let me know.

As mentioned by @bla:
a formula similar to =IF(A1=C1,B1,0)
should serve (Excel).
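
The formula compares columns 1 and 3 within the same row. A minimal Python sketch of that rule follows, together with a lookup variant for the case where a match may sit on a different row; that second reading is an assumption about the question's intent. The function names and the column-3 values are hypothetical.

```python
def match_same_row(col1, col2, col3):
    """Row-wise rule, like =IF(A1=C1,B1,0) filled down the sheet."""
    return [b if a == c else 0 for a, b, c in zip(col1, col2, col3)]

def match_any_row(col1, col2, col3):
    """Lookup variant (assumed intent): find each column-3 key anywhere in column 1."""
    lookup = dict(zip(col1, col2))          # letter -> its data value
    return [lookup.get(c, 0) for c in col3]  # 0 when the key is absent

col1 = ["U", "W", "R", "T"]
col2 = [3, 6, 1, 9]
col3 = ["U", "T", "C", "W"]   # hypothetical keys, chosen so that some match

same_row = match_same_row(col1, col2, col3)
any_row = match_any_row(col1, col2, col3)
print(same_row, any_row)
```

The dictionary lookup does a single pass over each column, which matters with around 100,000 rows.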

Multiple Columns, way to select closest to a value

I'm trying to analyze data sets that are obtained from CSV files. After the data is read into MATLAB, I am left with a variable containing only my data. The number of columns and rows changes from file to file. Is there a way to average each column and then create a variable for the column whose average is closest to a certain value? I would also like to select the columns directly before and after this middle column and create variables for them, as well as create a variable for the column with the lowest average. Currently, I am selecting the columns manually and creating variables for them that way.
For example:
I have this table of numbers. (I used the same number in each column for the sake of easy averaging in this example.)
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
Let's say I want the column whose average is closest to 3.2.
That would be column 3, whose average is 3. Then I would want the code to select the column before (column 2) and the column after (column 4), as well as the column with the lowest average (column 1).

First get the averages (I assume the data matrix is in variable X):
Xmns = mean(X);
Then to find the minimum, use "min":
[val,ind] = min(Xmns);
"val" holds the minimum value, "ind" the corresponding index in Xmns, which is the corresponding column.
To find the column mean closest to a particular value, again you can use min:
[val,ind] = min(abs(Xmns-key_val));
Now "ind" holds the column index with mean closest to "key_val". The next column is just "ind+1" and the previous "ind-1" - just be sure to check you are not beyond the ends of the matrix (i.e. ind may already be 1 or size(X,2)).
Also, given the column index "ind", to create a new variable with that column, you just use:
sc= X(:,ind);
and if you want to remove that column from X:
X(:,ind) = [];
and that is all.
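
For comparison, the same steps can be sketched in plain Python, including the bounds check mentioned above. The function names are hypothetical, and indices here are 0-based, whereas the MATLAB indices are 1-based.

```python
def column_means(matrix):
    """Mean of each column of a row-major list of lists."""
    n_rows = len(matrix)
    return [sum(col) / n_rows for col in zip(*matrix)]

def select_columns(matrix, key_val):
    means = column_means(matrix)
    idx = range(len(means))
    lowest = min(idx, key=lambda i: means[i])                  # like min(Xmns)
    closest = min(idx, key=lambda i: abs(means[i] - key_val))  # like min(abs(Xmns-key_val))
    # Neighbours exist only if `closest` is not at either edge of the matrix.
    before = closest - 1 if closest > 0 else None
    after = closest + 1 if closest < len(means) - 1 else None
    return {"closest": closest, "before": before, "after": after, "lowest": lowest}

# Example table from the question: five identical rows 1..5.
X = [[1, 2, 3, 4, 5] for _ in range(5)]
picked = select_columns(X, 3.2)
print(picked)

# Extract the chosen column as its own variable, like sc = X(:,ind).
closest_col = [row[picked["closest"]] for row in X]
```

With key_val = 3.2 this picks the third column, its two neighbours, and the first column as the lowest, matching the example in the question.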
