Replace values of one pyspark dataframe with another - pyspark

I have a PySpark dataframe df2:
ID  Total_Count  Final_A  Final_B  Final_C  Final_D
11  80           36       30       8        6
4   80           36       30       8        6
13  65           30       24       6        5
12  56           26       21       5        4
2   65           30       24       6        5
1   56           26       21       5        4
I have another dataframe df1:
ID  Total_Count  A   B   C   D
4   80           0   0   3   0
11  80           0   0   0   0
13  65           0   0   0   0
12  56           0   4   0   0
2   65           0   0   0   0
1   56           0   0   0   0
10  34           10  10  10  4
I want to replace the values in df1 with those from df2 for each matching ID (the primary key).
Expected df1:
ID  Total_Count  A   B   C   D
11  80           36  30  8   6
4   80           36  30  8   6
13  65           30  24  6   5
12  56           26  21  5   4
2   65           30  24  6   5
1   56           26  21  5   4
10  34           10  10  10  4

from pyspark.sql.functions import when

# Assumes df1.csv and df2.csv hold the two dataframes shown in the question
df1 = spark.read.option("header", "True").option("inferSchema", "True").csv("df1.csv")
df2 = spark.read.option("header", "True").option("inferSchema", "True").csv("df2.csv")
# Rename the join keys in df2 so they do not clash with df1's columns
df2 = df2.withColumnRenamed("ID", "df2_ID").withColumnRenamed("Total_Count", "df2_Total_Count")
final_df = df1.join(df2, (df1.ID == df2.df2_ID) & (df1.Total_Count == df2.df2_Total_Count), "left")
# Take the value from df2 only where df1 holds 0
for i in ("A", "B", "C", "D"):
    final_df = final_df.withColumn(i, when(final_df[i] == 0, final_df["Final_{}".format(i)]).otherwise(final_df[i]))
# Drop the helper columns that came from df2
cols = df2.columns
final_df = final_df.drop(*cols)

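from pyspark.sql.functions import coalesce

# Left join on ID keeps every row of df1; rows without a match get null in the Final_* columns
df = df1.join(df2.select('ID', 'Final_A', 'Final_B', 'Final_C', 'Final_D'), ['ID'], 'left')
# coalesce takes the df2 value when it exists and falls back to the df1 value otherwise
df = df.withColumn('A', coalesce(df['Final_A'], df['A'])).\
    withColumn('B', coalesce(df['Final_B'], df['B'])).\
    withColumn('C', coalesce(df['Final_C'], df['C'])).\
    withColumn('D', coalesce(df['Final_D'], df['D']))
df1 = df.select('ID', 'Total_Count', 'A', 'B', 'C', 'D')
df1.show()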
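
To make the intended result concrete without a Spark session, the replacement rule can be sketched in plain Python. This is only an illustration of the logic (take A to D from df2's Final_* columns where the ID matches, keep the df1 row otherwise), not a substitute for the join. The names `df1_rows`, `df2_rows` and `replace_by_id` are made up for this sketch; the data is copied from the question.

```python
# Rows from the question, keyed by ID: (Total_Count, A, B, C, D)
df1_rows = {
    4: (80, 0, 0, 3, 0),
    11: (80, 0, 0, 0, 0),
    13: (65, 0, 0, 0, 0),
    12: (56, 0, 4, 0, 0),
    2: (65, 0, 0, 0, 0),
    1: (56, 0, 0, 0, 0),
    10: (34, 10, 10, 10, 4),
}
# Keyed by ID: (Total_Count, Final_A, Final_B, Final_C, Final_D)
df2_rows = {
    11: (80, 36, 30, 8, 6),
    4: (80, 36, 30, 8, 6),
    13: (65, 30, 24, 6, 5),
    12: (56, 26, 21, 5, 4),
    2: (65, 30, 24, 6, 5),
    1: (56, 26, 21, 5, 4),
}


def replace_by_id(left, right):
    """Hypothetical helper mimicking a left join followed by coalesce:
    keep Total_Count from `left`, take A..D from `right` when the ID exists there."""
    result = {}
    for row_id, row in left.items():
        if row_id in right:
            result[row_id] = (row[0],) + right[row_id][1:]
        else:
            result[row_id] = row  # no match: the df1 row is kept unchanged
    return result


result = replace_by_id(df1_rows, df2_rows)
for row_id, values in result.items():
    print(row_id, *values)
```

ID 10 has no counterpart in df2, so it passes through untouched, which is the behaviour the left join provides in the PySpark version.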

Related

Adding 0 for missing data rather than excluding the category in matlab

I have the following two tables of data, one named data1 and the other named data2. The left-hand column is a categorical variable and the right-hand column is a frequency. I would like to rewrite these tables so that, wherever a category is missing from the left-hand column, the missing category is inserted with a '0' in the right-hand frequency column.
data1 = [
1 170
2 120
3 100
4 40
5 30
6 20
7 10
9 8
10 2
11 1
14 1
];
data2 = [
1 240
2 200
3 180
4 60
5 50
6 40
7 30
8 20
9 8
10 2
12 1
19 1
];
To be clearer I will explain with an example. In data1, 8 12 and 13 are missing in the left-hand column. I would like matlab to recreate this table but with 0 values for 8, 12 and 13 so it looks as follows. I would also like it to have additional empty categories after '14' because data2 is longer and has more categories. I have also included what data2 should look like with filled in values.
data1 = [
1 170
2 120
3 100
4 40
5 30
6 20
7 10
8 0
9 8
10 2
11 1
12 0
13 0
14 1
15 0
16 0
17 0
18 0
19 0
];
data2 = [
1 240
2 200
3 180
4 60
5 50
6 40
7 30
8 20
9 8
10 2
11 0
12 1
13 0
14 0
15 0
16 0
17 0
18 0
19 1
];
I have a handful of datasets which generally all start with 1, 2, 3, 4, 5, etc., but they all have slightly different categories in the left-hand column, because where values are missing the category is simply omitted rather than given a 0. How do I write code that automatically fills in any blanks with a 0? It would be good if the code could identify the highest category number amongst all the datasets and then fill in blanks based on this.
My aim is to put together a grouped bar chart with data series that are all the same length.
UPDATED OUTPUT WITH 3 DATASETS
This is what your AllJoins code outputs in my MATLAB:
A table1 table2 table3
__ ______ ______ ______
1 170 240 2400
2 120 200 2000
3 100 180 0
4 40 60 0
5 30 50 0
6 20 40 0
7 10 30 0
8 0 20 0
9 8 8 0
10 2 2 0
11 1 0 0
12 0 1 0
14 1 0 0
19 0 1 0
20 0 0 1800
I would like the code to fill in the missing consecutive numbers in column A so that it looks as follows:
A table1 table2 table3
__ ______ ______ ______
1 170 240 2400
2 120 200 2000
3 100 180 0
4 40 60 0
5 30 50 0
6 20 40 0
7 10 30 0
8 0 20 0
9 8 8 0
10 2 2 0
11 1 0 0
12 0 1 0
13 0 0 0
14 1 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 0 0
19 0 1 0
20 0 0 1800
You can convert the datasets to a table and then use outerjoin. Then you can replace the NaNs with whatever you want using fillmissing.
table1 = array2table(data1);
table1.Properties.VariableNames = {'A', 'B'};
table2 = array2table(data2);
table2.Properties.VariableNames = {'A', 'B'};
newTable = outerjoin(table1, table2, 'LeftKeys', {'A'}, 'RightKeys', {'A'}, 'MergeKeys', true)
which produces:
A B_table1 B_table2
__ ________ ________
1 170 240
2 120 200
3 100 180
4 40 60
5 30 50
6 20 40
7 10 30
8 NaN 20
9 8 8
10 2 2
11 1 NaN
12 NaN 1
14 1 NaN
19 NaN 1
And then get your zeros with newTable2 = fillmissing(newTable, 'constant', 0), which prints:
A B_table1 B_table2
__ ________ ________
1 170 240
2 120 200
3 100 180
4 40 60
5 30 50
6 20 40
7 10 30
8 0 20
9 8 8
10 2 2
11 1 0
12 0 1
14 1 0
19 0 1
UPDATE
To combine multiple tables, you can either nest the outerjoin or write a function to loop over it (see similar Matlab forum question). Here's an example.
Given data1 and data2 in OP, plus a new data3:
data3 = [
1 2400
2 2000
20 1800
];
Contents of myscript.m:
table1 = MakeTable(data1);
table2 = MakeTable(data2);
table3 = MakeTable(data3);
AllJoins = MultiOuterJoin(table1, table2, table3);

% Functions
function Table = MakeTable(Array)
    Table = array2table(Array);
    Table.Properties.VariableNames = {'A', 'B'}; % set your column names, e.g. {'freq', 'count'}
end

function Joined = MultiOuterJoin(varargin)
    Joined = varargin{1};
    Joined.Properties.VariableNames{end} = inputname(1); % set #2 column name to be based on table name
    for k = 2:nargin
        Joined = outerjoin(Joined, varargin{k}, 'LeftKeys', {'A'}, 'RightKeys', {'A'}, 'MergeKeys', true);
        name = inputname(k);
        Joined.Properties.VariableNames{end} = name; % set merged column name to be based on table name
    end
end
Which returns AllJoins:
A table1 table2 table3
__ ______ ______ ______
1 170 240 2400
2 120 200 2000
3 100 180 NaN
4 40 60 NaN
5 30 50 NaN
6 20 40 NaN
7 10 30 NaN
8 0 20 NaN
9 8 8 NaN
10 2 2 NaN
11 1 0 NaN
12 0 1 NaN
13 0 0 NaN
14 1 0 NaN
15 0 0 NaN
16 0 0 NaN
17 0 0 NaN
18 0 0 NaN
19 0 1 NaN
20 NaN NaN 1800
Feel free to change the maximum length of the array; this is a generic answer. Here the maximum length is max(data1(:,1)), but you can compute it in any way, e.g. as the maximum value across multiple arrays.
% make new data: one row per category, two columns (category, frequency)
new_data = zeros(max(data1(:,1)), 2);
new_data(:,1) = 1:max(data1(:,1));
% Fill data. You can do this in a loop if it's easier for you to understand.
% In essence, it says: in the rows of new_data indexed by data1(:,1), put data1(:,2) in the second column
new_data(data1(:,1),2) = data1(:,2);
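
The same fill-with-zero idea, extended to find the highest category across all datasets as the question asks, can be sketched in plain Python. This is an illustrative sketch rather than a translation of either answer; `fill_missing` is a made-up helper name, and the data is copied from the question.

```python
data1 = [(1, 170), (2, 120), (3, 100), (4, 40), (5, 30), (6, 20),
         (7, 10), (9, 8), (10, 2), (11, 1), (14, 1)]
data2 = [(1, 240), (2, 200), (3, 180), (4, 60), (5, 50), (6, 40),
         (7, 30), (8, 20), (9, 8), (10, 2), (12, 1), (19, 1)]


def fill_missing(datasets):
    """Hypothetical helper: pad every dataset to categories 1..highest,
    using a frequency of 0 for categories a dataset does not contain."""
    highest = max(category for data in datasets for category, _ in data)
    filled = []
    for data in datasets:
        counts = dict(data)  # category -> frequency
        filled.append([(c, counts.get(c, 0)) for c in range(1, highest + 1)])
    return filled


filled1, filled2 = fill_missing([data1, data2])
for (category, freq1), (_, freq2) in zip(filled1, filled2):
    print(category, freq1, freq2)
```

Because every output list covers the same range, the series line up row by row, which is what a grouped bar chart needs.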

Pyspark dataframe conditional filter and imputation

I have a PySpark dataframe df:
ID  Total_Count  A   B   C   D   Group  Name        Chain
1   56           0   0   0   0   1      Apple       Fruits1
2   65           0   0   0   0   1      Apple       Fruits1
3   72           0   0   30  0   1      Banana      Fruits1
4   80           0   0   0   0   1      Strawberry  Fruits1
5   142          58  58  14  12  1      Apple       Fruits1
6   130          63  50  9   8   1      Apple       Fruits1
7   145          74  44  17  10  1      Apple       Fruits1
8   119          54  48  8   9   1      Apple       Fruits1
11  161          71  63  16  11  1      Banana      Fruits1
12  124          54  43  19  8   1      Banana      Fruits1
I want to impute columns A, B, C and D wherever they contain 0 (IDs 1, 2, 3 and 4).
1.) Logic: use the average over Group x Name (if available), otherwise the average over Group x Chain (if available), otherwise the average over Group.
Taking IDs 1 and 2 as a demo:
After filtering for Group 1 and Name Apple (the rows that share the Group and Name of IDs 1 and 2), the proportions are calculated as A/Total_Count, B/Total_Count and so on:
A_PROP    B_PROP       C_PROP    D_PROP
0.408451  0.408450704  0.098592  0.084507042
0.484615  0.384615385  0.069231  0.061538462
0.510345  0.303448276  0.117241  0.068965517
0.453782  0.403361345  0.067227  0.075630252
2.) The average of the above 4 rows is then taken (for IDs 1 and 2 in this example).
A, B, C and D in df2 are calculated as X_prop_avg * Total_Count (the expected values below are rounded to whole numbers).
Expected output (df2):
ID  Total_Count  A_prop_avg   B_prop_avg   C_prop_avg   D_prop_avg   A   B   C  D
1   56           0.46429811   0.37496893   0.08807265   0.07266032   26  21  5  4
2   65           0.464298107  0.374968927  0.088072647  0.072660318  30  24  6  5
3   72           0.43823883   0.369039271  0.126302344  0.066419555  32  27  9  5
4   80           0.455611681  0.372992375  0.10081588   0.070580064  36  30  8  6
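
The fallback logic described above can be checked on the sample data with a plain-Python sketch. It is not PySpark and not an answer from the original thread; it only mirrors the stated rule. Two assumptions are made here: a row counts as a "donor" for the averages only if all of A, B, C and D are non-zero, and the final values are rounded to the nearest integer. The helper names `is_donor` and `avg_props` are hypothetical.

```python
from statistics import mean

COLS = ("A", "B", "C", "D")
KEYS = ("ID", "Total_Count", "A", "B", "C", "D", "Group", "Name", "Chain")
raw = [
    (1, 56, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (2, 65, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (3, 72, 0, 0, 30, 0, 1, "Banana", "Fruits1"),
    (4, 80, 0, 0, 0, 0, 1, "Strawberry", "Fruits1"),
    (5, 142, 58, 58, 14, 12, 1, "Apple", "Fruits1"),
    (6, 130, 63, 50, 9, 8, 1, "Apple", "Fruits1"),
    (7, 145, 74, 44, 17, 10, 1, "Apple", "Fruits1"),
    (8, 119, 54, 48, 8, 9, 1, "Apple", "Fruits1"),
    (11, 161, 71, 63, 16, 11, 1, "Banana", "Fruits1"),
    (12, 124, 54, 43, 19, 8, 1, "Banana", "Fruits1"),
]
rows = [dict(zip(KEYS, r)) for r in raw]


def is_donor(row):
    # Assumption: only rows with no zero in A..D contribute to the averages
    return all(row[c] != 0 for c in COLS)


donors = [r for r in rows if is_donor(r)]


def avg_props(row):
    """Average proportions from the most specific level that has donors:
    Group x Name, then Group x Chain, then Group alone."""
    for level in (("Group", "Name"), ("Group", "Chain"), ("Group",)):
        pool = [d for d in donors if all(d[k] == row[k] for k in level)]
        if pool:
            return {c: mean(d[c] / d["Total_Count"] for d in pool) for c in COLS}
    return None  # no donor at any level


imputed = {}
for row in rows:
    if is_donor(row):
        continue
    props = avg_props(row)
    if props is not None:
        imputed[row["ID"]] = tuple(round(props[c] * row["Total_Count"]) for c in COLS)

for row_id, values in imputed.items():
    print(row_id, *values)
```

With these assumptions, ID 3 falls back on the two complete Banana rows and ID 4 (Strawberry, no donors by Name) falls back on Group x Chain, which reproduces the A, B, C and D values in the expected output.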

How to get list of neighbors with distance N from index in matrix? [closed]

I have a matrix like this:
35 1 6 26 19 24
3 32 7 21 23 25
31 9 2 22 27 20
8 28 33 17 10 15
30 5 34 12 14 16
4 36 29 13 18 11
I want a list of neighbors with distance 3 for each cell. For example,
the list of neighbors with distance 3 for (1, 1) is:
[8, 28, 33, 17, 26, 21, 22, 17]
Visual explanation:
[35] 1 6 |26| 19 24
3 32 7 |21| 23 25
31 9 2 |22| 27 20
-------------------
8 28 33 |17| 10 15
-------------------
30 5 34 12 14 16
4 36 29 13 18 11
The list of neighbors with distance 3 for (3, 3) is
[4, 36, 29, 13, 18, 11, 24, 25, 20, 15, 16]
Visual explanation:
35 1 6 26 19 |24|
3 32 7 21 23 |25|
31 9 [2] 22 27 |20|
8 28 33 17 10 |15|
30 5 34 12 14 |16|
------------------------
4 36 29 13 18 |11|
------------------------
Generate an all-zero "index matrix" idx with the same size of your matrix A, and set the "seed" to 1:
A = [ ...
35 1 6 26 19 24; ...
3 32 7 21 23 25; ...
31 9 2 22 27 20; ...
8 28 33 17 10 15; ...
30 5 34 12 14 16; ...
4 36 29 13 18 11 ...
]
idx = zeros(size(A));
idx(3, 2) = 1
We get:
A =
[...]
idx =
0 0 0 0 0 0
0 0 0 0 0 0
0 1 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
Now, we use 2-D convolution, i.e. MATLAB's conv2 function, to create the correct index matrix for the distance d (here d = 3):
idx = logical(conv2(idx, ones(2*d+1), 'same') - conv2(idx, ones(2*d-1), 'same'))
(Convolution is the key to success.)
Then, we get:
idx =
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 1 0
1 1 1 1 1 0
Since we have already cast the indices to logical, we can directly access the proper elements in the matrix A:
B = A(idx).'
The final result:
B =
4 36 29 13 19 23 27 10 14 18
Please notice the difference in the result as you wrote (3, 2) in your second example, but actually marked (3, 3) as "seed".
Hope that helps!
Disclaimer: Tested with Octave 5.1.0, but also works with MATLAB Online.
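
The two convolutions above select exactly the cells whose Chebyshev distance max(|row offset|, |column offset|) from the seed equals d. As a cross-check, that condition can be written directly in plain Python. This sketch swaps the convolution for an explicit distance test; `ring_neighbors` is a made-up name, and the 1-based indices follow the MATLAB convention used in the answer.

```python
A = [
    [35, 1, 6, 26, 19, 24],
    [3, 32, 7, 21, 23, 25],
    [31, 9, 2, 22, 27, 20],
    [8, 28, 33, 17, 10, 15],
    [30, 5, 34, 12, 14, 16],
    [4, 36, 29, 13, 18, 11],
]


def ring_neighbors(matrix, row, col, d):
    """Values whose Chebyshev distance from the 1-based cell (row, col) is exactly d.
    Cells that would fall outside the matrix are simply skipped."""
    r0, c0 = row - 1, col - 1  # convert to 0-based indices
    return [
        value
        for r, line in enumerate(matrix)
        for c, value in enumerate(line)
        if max(abs(r - r0), abs(c - c0)) == d
    ]


print(ring_neighbors(A, 3, 2, 3))
print(ring_neighbors(A, 3, 3, 3))
```

The values come out in row-major order, whereas MATLAB's A(idx) returns them column by column, so the two results contain the same elements in a different order.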

How can I select rows with specific column values from a matrix?

I have a matrix train3.
1 2 3 4 5 6 7
2 12 13 14 15 16 17
3 62 53 44 35 26 17
4 52 13 24 15 26 37
I want to select only those rows whose first column contains specific values (in my case 1 and 2).
I have tried the following,
>> train3
train3 =
1 2 3 4 5 6 7
2 12 13 14 15 16 17
3 62 53 44 35 26 17
4 52 13 24 15 26 37
>> ind1 = train3(:,1) == 1
ind1 =
1
0
0
0
>> ind2 = train3(:,1) == 2
ind2 =
0
1
0
0
>> mat1 = train3(ind1, :)
mat1 =
1 2 3 4 5 6 7
>> mat2 = train3(ind2, :)
mat2 =
2 12 13 14 15 16 17
>> mat3 = [mat1 ; mat2]
mat3 =
1 2 3 4 5 6 7
2 12 13 14 15 16 17
>>
Is there any better way to do this?
Presumably you are trying to get mat3 in a single step which you can do with:
mat3 = train3(train3(:,1)==1 | train3(:,1)==2,:)
A more general way to do this would be to use ismember to get all of the rows that match the values in a list:
train3 =[
1 2 3 4 5 6 7
2 12 13 14 15 16 17
3 62 53 44 35 26 17
4 52 13 24 15 26 37];
chooseList = [1 2];
colIndex = ismember(train3(:, 1), chooseList);
subset = train3(colIndex, :);
subset =
1 2 3 4 5 6 7
2 12 13 14 15 16 17
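
For comparison, the membership test that ismember performs can be sketched in plain Python with a set lookup. This is only an illustration of the same filtering idea on the question's matrix; the variable names `choose_list` and `subset` simply echo the ones in the MATLAB answer.

```python
train3 = [
    [1, 2, 3, 4, 5, 6, 7],
    [2, 12, 13, 14, 15, 16, 17],
    [3, 62, 53, 44, 35, 26, 17],
    [4, 52, 13, 24, 15, 26, 37],
]
choose_list = {1, 2}  # values the first column is allowed to take

# Keep a row only when its first element is one of the chosen values
subset = [row for row in train3 if row[0] in choose_list]
for row in subset:
    print(*row)
```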

Matlab: how I can transform this algorithm associated with matrices manipulation?

(For my actual problem, I use a 4x500000 matrix A, and the values of A(4,k) vary between 1 and 200.)
Here I give an example for a 4x16 matrix A where A(4,k) varies between 1 and 10.
First, I want to match a name to each value from 1 to 5 (= 10/2):
1 = XXY;
2 = ABC;
3 = EFG;
4 = TXG;
5 = ZPF;
My goal is to find,for a vector X, a matrix M from the matrix A:
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10]
A(4,k) takes all values between 1 and 10. These values can be repeated and they all appear on the 4th line.
X = [20; 32; 40] = A(1:3,1) = A(1:3,6) = A(1:3,8) = A(1:3,9) = A(1:3,12) = A(1:3,16)
A(4,1) = 2;
A(4,6) = 5;
A(4,8) = 1;
A(4,9) = 3;
A(4,12) = 6;
A(4,16) = 10;
For each A(4,k) corresponding to X, I assign 2 if A(4,k) <= 5 and 1 if A(4,k) > 5. For the remaining values of A(4,k), which do not correspond to X, I assign 0:
ZX = [ 1 2 3 4 5      %% values of the fourth row of A between 1 and 5
       2 2 2 0 2
       6 7 8 9 10     %% values of the fourth row of A between 6 and 10
       1 0 0 0 1
       2 2 2 0 2 ]    %% = max(ZX(2,k),ZX(4,k))
the ultimate goal is to find the matrix M:
M = [ 1 2 3 4 5
XXY ABC EFG TXG ZPF
2 2 2 0 2 ] %% M(3,:)=ZX(5,:)
Code -
%// Assuming A, X and names to be given to the solution
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10];
X = [20 ; 32 ; 40];
names = {'XXY','ABC','EFG','TXG','ZPF'};
limit = 10; %// The maximum limit of A(4,:). Edit this to 200 for your actual case
%// Find matching 4th row elements
matches = A(4,ismember(A(1:3,:)',X','rows'));
%// Matches are compared against all possible numbers between 1 and limit
matches_pos = ismember(1:limit,matches);
%// Finally get the line 3 results of M
vals = max(2*matches_pos(1:limit/2),matches_pos( (limit/2)+1:end ));
Output -
vals =
2 2 2 0 2
For a better way to present the results, you can use a struct -
M_struct = cell2struct(num2cell(vals),names,2)
Output -
M_struct =
XXY: 2
ABC: 2
EFG: 2
TXG: 0
ZPF: 2
For writing the results to a text file -
output_file = 'results.txt'; %// Edit if needed to be saved to a different path
fid = fopen(output_file, 'w+');
for ii=1:numel(names)
fprintf(fid, '%d %s %d\n',ii, names{ii},vals(ii));
end
fclose(fid);
Text contents of the text file would be -
1 XXY 2
2 ABC 2
3 EFG 2
4 TXG 0
5 ZPF 2
A bsxfun() based approach.
Suppose your inputs are (where N can be set to 200):
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10]
X = [20; 32; 40]
N = 10;
% Match first 3 rows and return 4th
idxA = all(bsxfun(@eq, X, A(1:3,:)));
Amatch = A(4,idxA);
% Match [1:5; 6:10] to 4th row
idxZX = ismember([1:N/2; N/2+1:N], Amatch)
idxZX =
1 1 1 0 1
1 0 0 0 1
% Return M3
M3 = max(bsxfun(@times, idxZX, [2;1]))
M3 =
2 2 2 0 2
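
The same computation can be traced step by step in plain Python, which may help when checking either MATLAB version against the example. This is an illustrative sketch, not a port of either answer; the names `matches`, `vals` and `M` are chosen here for readability, and `limit` plays the role of the 10 (or 200) mentioned in the question.

```python
A = [
    [20, 52, 70, 20, 52, 20, 52, 20, 20, 10, 52, 20, 11, 1, 52, 20],
    [32, 24, 91, 44, 60, 32, 24, 32, 32, 12, 11, 32, 2, 5, 24, 32],
    [40, 37, 24, 30, 11, 40, 37, 40, 40, 5, 10, 40, 40, 3, 37, 40],
    [2, 4, 1, 3, 4, 5, 2, 1, 3, 3, 8, 6, 7, 9, 6, 10],
]
X = (20, 32, 40)
names = ["XXY", "ABC", "EFG", "TXG", "ZPF"]
limit = 10          # maximum value in the 4th row of A
half = limit // 2

# Transpose A so that each element is one column (a 4-tuple)
columns = list(zip(*A))
# 4th-row values of the columns whose first three entries equal X
matches = {col[3] for col in columns if col[:3] == X}

# For k = 1..half: 2 if k was matched, else 1 if k + half was matched, else 0
vals = [2 if k in matches else (1 if k + half in matches else 0)
        for k in range(1, half + 1)]

M = dict(zip(names, vals))
for index, (name, value) in enumerate(M.items(), start=1):
    print(index, name, value)
```

The conditional expression is equivalent to taking max(2 * in_lower_half, in_upper_half), which is the element-wise maximum used in both MATLAB answers.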