How to join two different dataframes in pyspark? - pyspark

I have two dataframes. The first has the columns X, year, month and measure,
plus the columns X1 and X2, which correspond to the first day and the second
day. The first dataframe is:
X year month measure X1 X2
1 1 2014 12 Max.TemperatureF 64 42
2 2 2014 12 Mean.TemperatureF 52 38
3 3 2014 12 Min.TemperatureF 39 33
The second dataframe contains only the days:
X3 X4 X5 X6 X7
1 51 43 42 45
2 44 37 34 42
3 37 30 26 38
So I want to join the two dataframes in pyspark and obtain:
X year month measure X1 X2 X3 X4 X5 X6
1 1 2014 12 Max.TemperatureF 64 42 1 51 43 42
2 2 2014 12 Mean.TemperatureF 52 38 2 44 37 34
3 3 2014 12 Min.TemperatureF 39 33 3 37 30 26
I have joined them, but one dataframe ends up stacked above the other instead of the columns being placed side by side in the same rows:
from functools import reduce
from pyspark.sql import DataFrame
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)
td = unionAll(*[weather1, weather2])
X year month measure X1 X2
1 1 2014 12 Max.TemperatureF 64 42
2 2 2014 12 Mean.TemperatureF 52 38
3 3 2014 12 Min.TemperatureF 39 33
X3 X4 X5 X6
1 51 43 42 45
2 44 37 34 42
3 37 30 26 38
So this is the wrong kind of join.

I suppose what you are trying to do is join two tables. To join two tables, you need a common column and since you don't have a common column, you will have to create something. This is how I would tackle this:
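# weather2 has no key column, and a column of weather1 cannot be attached
# to weather2 with withColumn, so generate a 1-based row number instead
# (assumes the row order of weather2 matches the 'X' values of weather1)
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number
w = Window.orderBy(monotonically_increasing_id())
weather2 = weather2.withColumn('X', row_number().over(w))
# Join the two tables on 'X'
td = weather1.join(weather2, 'X')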
This should solve the problem.
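To make the difference between the union in the question and the join suggested here concrete, the following is a minimal plain-Python sketch, not Spark code. The helpers `add_row_number` and `join_on` and the key name `row_id` are hypothetical, and the rows are illustrative values loosely based on the question's tables.

```python
def add_row_number(rows, key="row_id"):
    """Attach a 1-based position to every row (a hypothetical helper)."""
    return [{key: i, **row} for i, row in enumerate(rows, start=1)]


def join_on(left, right, key):
    """Inner join of two lists of dicts on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]


# Illustrative rows only, loosely based on the question's tables.
left = [{"measure": "Max.TemperatureF", "X1": 64, "X2": 42},
        {"measure": "Mean.TemperatureF", "X1": 52, "X2": 38}]
right = [{"X3": 51, "X4": 43},
         {"X3": 44, "X4": 37}]

stacked = left + right  # union: rows of one table below the other
joined = join_on(add_row_number(left), add_row_number(right), "row_id")

print(len(stacked))  # number of rows after a union
print(joined[0])     # one combined row after a join
```

A union keeps the column layout and adds rows, while a join keeps the rows and adds columns, which is why a shared key (here the generated row number) is needed.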

Joining two different dataframes in pyspark
Assuming:
two DataFrames, df and df1, registered as temporary tables under those names (for example with registerTempTable)
X1 as the common column in both DataFrames
sqlContext.sql("select * from df, df1 where df.X1 == df1.X1").show()

Related

How do I delete rows that contain the same combination of numbers from a matrix, keeping only one of each combination?

for a=1:50; %numbers 1 through 50
    for b=1:50;
        c=sqrt(a^2+b^2);
        if c<=50&c(rem(c,1)==0);%if display only if c<=50 and c=c/1 has remainder of 0
            pyth=[a,b,c];%pythagorean matrix
            disp(pyth)
        else c(rem(c,1)~=0);%if remainder doesn't equal to 0, omit output
        end
    end
end
answer=
3 4 5
4 3 5
5 12 13
6 8 10
7 24 25
8 6 10
8 15 17
9 12 15
9 40 41
10 24 26
12 5 13
12 9 15
12 16 20
12 35 37
14 48 50
15 8 17
15 20 25
15 36 39
16 12 20
16 30 34
18 24 30
20 15 25
20 21 29
21 20 29
21 28 35
24 7 25
24 10 26
24 18 30
24 32 40
27 36 45
28 21 35
30 16 34
30 40 50
32 24 40
35 12 37
36 15 39
36 27 45
40 9 41
40 30 50
48 14 50
This problem involves the Pythagorean theorem, but we cannot use the built-in function, so I had to write one myself. The problem is that, for example, columns 1 and 2 of the first two rows contain the same numbers. How do I code it so that only one of the rows is kept when columns 1 and 2 contain the same combination of numbers? I've tried the unique function, but it doesn't really delete the combinations. I have read about deleting duplicates in previous posts, but those have confused me even more. Any help on how to go about this problem will help me immensely!
Thank you
Welcome to Stack Overflow.
The problem in your code seems to be that pyth only ever contains 3 values, [a, b, c]. The unique() function used in the next line has no effect in that case, because pyth contains only one row. Another issue is that the values idx and out are calculated in each loop cycle; this should be done after the loops. Example code could look like this:
pyth = zeros(0,3);
for a=1:50
    for b=1:50
        c = sqrt(a^2 + b^2);
        if c<=50 && rem(c,1)==0
            abc_sorted = sort([a,b,c]);
            pyth = [pyth; abc_sorted];
        end
    end
end
% do final sorting outside of the loop
[~,idx] = unique(pyth, 'rows', 'stable');
out = pyth(idx,:);
disp(out)
A few other tips for writing MATLAB code:
You do not need to end for or if/else statements with a semicolon.
else statements cover every case not handled before, so they do not need a condition.
Some performance recommendations:
Due to the symmetry of a and b (a^2 + b^2 = b^2 + a^2), the b loop could be constrained to for b=1:a, which would save you roughly half of the loop cycles.
If you use && to combine scalar conditions, the second operand is not evaluated when the first one already fails (source).
Regards,
Chris
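The symmetry tip above can be sketched in a few lines. This is a plain-Python illustration rather than MATLAB, the function name `pythagorean_triples` is made up for the example, and it swaps the `rem(c,1)==0` test for an integer square root check to avoid floating-point comparisons.

```python
import math


def pythagorean_triples(limit):
    """Return sorted (a, b, c) triples with a <= b and c <= limit."""
    triples = []
    for b in range(1, limit + 1):
        for a in range(1, b + 1):      # a <= b, so each pair is visited once
            c_squared = a * a + b * b
            c = math.isqrt(c_squared)  # integer square root, no float rounding
            if c <= limit and c * c == c_squared:
                triples.append((a, b, c))
    return sorted(triples)


triples = pythagorean_triples(50)
print(len(triples))
print(triples[:3])
```

Because every unordered pair is generated only once, no duplicate removal is needed afterwards.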
You can also vectorize your algorithm (although this is still brute force):
[X,Y] = meshgrid(1:50,1:50); %generate all the combination
C = (X(:).^2+Y(:).^2).^0.5; %sums of two square for every combination
ind = find(rem(C,1)==0 & C<=50); %get the index
res = unique([sort([X(ind),Y(ind)],2),C(ind)],'rows'); %check for uniqueness
You could optimize your algorithm much further using math; you should read this question. It will be useful if n>>50.

How to remove zero columns from array

I have an array which looks similar to:
0 2 3 4 0 0 7 8 0 10
0 32 44 47 0 0 37 54 0 36
I wish to remove all the columns
0
0
from this to get:
2 3 4 7 8 10
32 44 47 37 54 36
I've tried x(x == 0) = []
but I get:
x =
2 32 3 44 4 47 7 37 8 54 10 36
How can I remove all zero columns?
Here is a possible solution:
x(:,all(x==0))=[]
You had the right approach with x(x == 0) = [];. Doing this removes all the zeros and returns the remaining non-zero values as a vector. As long as the zeros occur only in all-zero columns, the number of remaining elements still fits a matrix with 2 rows, so all you have to do is reshape the vector back to its original form with 2 rows:
x(x == 0) = [];
y = reshape(x, 2, [])
y =
2 3 4 7 8 10
32 44 47 37 54 36
Another way is with any:
y = x(:,any(x,1));
In this case, we look for any columns that are non-zero and use these locations to index into x and extract out those corresponding columns.
Result:
y =
2 3 4 7 8 10
32 44 47 37 54 36
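For comparison, the column test that `any` performs here can be sketched in plain Python on a list of rows. The function name `drop_zero_columns` is hypothetical; the data is the array from the question.

```python
def drop_zero_columns(rows):
    """Keep only the columns that contain at least one non-zero entry."""
    keep = [j for j, column in enumerate(zip(*rows)) if any(column)]
    return [[row[j] for j in keep] for row in rows]


x = [[0, 2, 3, 4, 0, 0, 7, 8, 0, 10],
     [0, 32, 44, 47, 0, 0, 37, 54, 0, 36]]
y = drop_zero_columns(x)
for row in y:
    print(row)
```

Unlike removing every zero element, this keeps a column that contains a zero next to a non-zero value, so the result stays rectangular.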
Another way, which is more for academic purposes, is to use unique. Assuming that your matrix has only non-negative values and contains at least one all-zero column:
[~,~,id] = unique(x.', 'rows');
y = x(:, id ~= 1)
y =
2 3 4 7 8 10
32 44 47 37 54 36
We transpose x so that each column becomes a row, and we look for all unique rows. The matrix needs to have only non-negative values because the third output of unique assigns an ID to each unique row in sorted order. Therefore, if no value is negative, a row of all zeroes sorts first and is assigned an ID of 1. Using this array, we search for IDs that were not assigned a value of 1, and use those to index into x to extract the necessary columns.
You could also use sum.
Sum over the columns: any column that contains only zeros also sums to zero (this assumes there are no negative values that could cancel out).
sum(x,1)
ans =
0 34 47 51 0 0 44 62 0 46
x(:,sum(x,1)>0)
ans =
2 3 4 7 8 10
32 44 47 37 54 36
Also by reshaping nonzeros(x) as follows:
reshape(nonzeros(x), size(x,1), [])
ans =
2 3 4 7 8 10
32 44 47 37 54 36

Matlab replace the nan with average of previous and next non-missing value

all,
I have a large dataset with a lot of consecutive NaNs. Is there a fast way to replace the NaNs with the average of the previous and next non-missing value, column by column?
Thanks a lot
Lou
Interesting question... if only you explained clearly what you want. Maybe it's this?
data = [1 3 NaN 7 6 NaN NaN 2].'; %'// example data: column vector
isn = isnan(data); %// determine which values are NaN
inum = find(~isn); %// indices of numbers
inan = find(isn); %// indices of NaNs
comp = bsxfun(@lt,inan.',inum); %'// for each (number,NaN): 1 if NaN precedes num
[~, upper] = max(comp); %// next number to each NaN (max finds *first* maximum)
data(isn) = (data(inum(upper))+data(inum(upper-1)))/2; %// fill with average
In this example: original data:
>> data.'
ans =
1 3 NaN 7 6 NaN NaN 2
Result:
>> data.'
ans =
1 3 5 7 6 4 4 2
If you have a 2D array and want to work by columns, a for loop over columns is probably the best option.
And of course, if there can be NaN's at the beginning or end of a column, the problem is undefined.
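The same fill rule can be sketched in plain Python for a single column, which may help when checking results by hand. The function name `fill_nan_with_neighbor_mean` is hypothetical, and raising an error for a NaN run at the start or end is one possible choice for the undefined case mentioned above, not something the answer prescribes.

```python
import math


def fill_nan_with_neighbor_mean(values):
    """Replace each run of NaNs with the mean of the values around the run."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if math.isnan(filled[i]):
            start = i
            while i < n and math.isnan(filled[i]):
                i += 1                   # skip to the end of the NaN run
            if start == 0 or i == n:
                raise ValueError("NaN run touches the start or end of the data")
            mean = (filled[start - 1] + filled[i]) / 2
            filled[start:i] = [mean] * (i - start)
        else:
            i += 1
    return filled


nan = float("nan")
data = [1, 3, nan, 7, 6, nan, nan, 2]
result = fill_nan_with_neighbor_mean(data)
print(result)
```

For a 2D table, the function would be applied to each column separately.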
Assuming there are no NaNs in the first or last row of any column, here is how I would do it.
(If there are multiple consecutive NaNs, it searches for the previous and next non-missing values and averages them.)
% Creating A
A=magic(7);
newA=A; %Result will be in newA
A(3,4)=NaN;
A(2,1)=NaN;
A(5,6)=NaN;
A(6,6)=NaN;
A(4,6)=NaN;
% Finding NaN position and calculating positions where we have to average numbers
ind=find(isnan(A));
otherInd=setdiff(1:numel(A(:)),ind);
for i=1:size(ind,1)
    temp=otherInd(otherInd<ind(i));
    prevInd(i,1)=temp(end);
    temp=otherInd(otherInd>ind(i));
    nextInd(i,1)=temp(1);
end
% For faster processing purposes
allInd(1:2:2*length(prevInd))=prevInd;
allInd(2:2:2*length(prevInd))=nextInd;
fun=@(block_struct) mean(block_struct.data)
prevNextNums=A(allInd);
A
newA(ind)=blockproc(prevNextNums,[1 2],fun)
%-----------------------Answer--------------------------
A =
30 39 48 1 10 19 28
NaN 47 7 9 18 27 29
46 6 8 NaN 26 35 37
5 14 16 25 34 NaN 45
13 15 24 33 42 NaN 4
21 23 32 41 43 NaN 12
22 31 40 49 2 11 20
newA =
30 39 48 1 10 19 28
38 47 7 9 18 27 29
46 6 8 17 26 35 37
5 14 16 25 34 23 45
13 15 24 33 42 23 4
21 23 32 41 43 23 12
22 31 40 49 2 11 20

Extract matrix elements using a vector of column indices per row

I have an MxN matrix and I want a column vector v, using the vector s that tells me for each row in the matrix what column I will take.
Here's an example:
Matrix =
[ 4 13 93 20 42;
31 18 94 64 02;
7 44 24 91 15;
11 20 43 38 31;
21 42 72 60 99;
13 81 31 87 50;
32 22 83 24 04]
s = [4 4 5 4 4 4 3].'
And the desired output is:
v = [20 64 15 38 60 87 83].'
I thought that using the expression
Matrix(:,s)
would work, but it doesn't. Is there a solution that does not use a for loop to access the rows separately?
It's not pretty, and there might be better solutions, but you can use the function sub2ind like this:
M(sub2ind(size(M),1:numel(s),s'))
You can also do it with linear indexing, here is an example:
M=M'; s=s';
M([0:size(M,1):numel(M)-1]+s)
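Both ideas, pairing each row with its column and computing linear indices, can be sketched in plain Python with the data from the question. Note that the sketch flattens the matrix row by row, whereas MATLAB stores it column by column, which is why the MATLAB version transposes first.

```python
matrix = [
    [4, 13, 93, 20, 42],
    [31, 18, 94, 64, 2],
    [7, 44, 24, 91, 15],
    [11, 20, 43, 38, 31],
    [21, 42, 72, 60, 99],
    [13, 81, 31, 87, 50],
    [32, 22, 83, 24, 4],
]
s = [4, 4, 5, 4, 4, 4, 3]  # 1-based column per row, as in the question

# Pair each row with its column index (the role sub2ind plays above).
v = [row[col - 1] for row, col in zip(matrix, s)]

# Same result through linear indices into a row-major flattened copy.
n_cols = len(matrix[0])
flat = [value for row in matrix for value in row]
v_linear = [flat[r * n_cols + (col - 1)] for r, col in enumerate(s)]

print(v)
```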

Matrix division & permutation to achieve Baker map

I'm trying to implement the Baker map.
Is there a function that would allow one to divide an 8 x 8 matrix by providing, for example, a sequence of divisors 2, 4, 2, and to rearrange the pixels in the order shown in the matrices below?
X = reshape(1:64,8,8);
After applying divisors 2,4,2 to the matrix X one should get a matrix like A shown below.
A=[31 23 15 7 32 24 16 8;
63 55 47 39 64 56 48 40;
11 3 12 4 13 5 14 6;
27 19 28 20 29 21 30 22;
43 35 44 36 45 37 46 38;
59 51 60 52 61 53 62 54;
25 17 9 1 26 18 10 2;
57 49 41 33 58 50 42 34]
The link to the document which I am working on is:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.5132&rep=rep1&type=pdf
This is what I want to achieve:
Edit: a little more generic solution:
%function Z = bakermap(X,divisors)
function Z = bakermap()
X = reshape(1:64,8,8)'
divisors = [ 2 4 2 ];
[x,y] = size(X);
offsets = sum(divisors)-fliplr(cumsum(fliplr(divisors)));
% invalid if a divisor does not divide the row length
% or if the divisors do not sum to the row length
if any(mod(y,divisors)) || ~(sum(divisors) == y)
    disp('invalid divisor vector')
    return
end
blocks = @(div) cell2mat( cellfun(@mtimes, repmat({ones(x/div,div)},div,1),...
                                  num2cell(1:div)',...
                                  'UniformOutput',false) );
%create index matrix
I = [];
for ii = 1:numel(divisors);
    I = [I, blocks(divisors(ii))+offsets(ii)];
end
%create Baker map
Y = flipud(X);
Z = [];
for jj=1:I(end)
    Z = [Z; Y(I==jj)'];
end
Z = flipud(Z);
end
returns:
index matrix:
I =
1 1 3 3 3 3 7 7
1 1 3 3 3 3 7 7
1 1 4 4 4 4 7 7
1 1 4 4 4 4 7 7
2 2 5 5 5 5 8 8
2 2 5 5 5 5 8 8
2 2 6 6 6 6 8 8
2 2 6 6 6 6 8 8
Baker map:
Z =
31 23 15 7 32 24 16 8
63 55 47 39 64 56 48 40
11 3 12 4 13 5 14 6
27 19 28 20 29 21 30 22
43 35 44 36 45 37 46 38
59 51 60 52 61 53 62 54
25 17 9 1 26 18 10 2
57 49 41 33 58 50 42 34
But have a look at the if-condition: the function only works for the cases it allows. I don't know if that's enough. I also tried something like divisors = [ 1 4 1 2 ], and it worked. As long as the sum of all divisors equals the row length and every divisor divides the row length without remainder, there shouldn't be any problems.
Explanation:
% definition of anonymous function with input parameter div: a single divisor
blocks = @(div) cell2mat( ...        % converts final result into matrix
    cellfun(@mtimes, ...             % multiplies the next two inputs A,B
        repmat(...                   % A...
            {ones(x/div,div)},...    % cell with a matrix of ones in the size
            ...                      % of one subblock, e.g. [1,1,1,1;1,1,1,1]
            div,1),...               % which is replicated div times according
            ...                      % to the divisor currently being processed
        num2cell(1:div)',...         % B: creates a cell vector {1;2;3;...;div},
            ...                      % so that finally every block A gets an
            ...                      % increasing factor
        'UniformOutput',false...     % necessary additional property of cellfun
    ));
Also have a look at this revision for a simpler insight into what is happening. You requested a generic solution, which is the one above; the linked one needs more manual input.
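For readers who want to check the index-matrix idea outside MATLAB, here is a minimal plain-Python sketch of the same construction. The function name `baker_map` is hypothetical, the input is assumed to be a square list of lists, and the element order within each output row mimics MATLAB's column-wise logical indexing.

```python
def baker_map(X, divisors):
    """Rearrange a square matrix X following the index-matrix construction."""
    n = len(X)
    if sum(divisors) != n or any(n % d for d in divisors):
        raise ValueError("invalid divisor vector")

    # Index matrix: each divisor d owns d columns, split into d stacked
    # blocks of height n // d; labels increase from left to right.
    I = [[0] * n for _ in range(n)]
    col = 0
    offset = 0
    for d in divisors:
        height = n // d
        for r in range(n):
            for c in range(col, col + d):
                I[r][c] = offset + r // height + 1
        col += d
        offset += d

    Y = X[::-1]  # flip upside down, like flipud
    rows = []
    for label in range(1, n + 1):
        # Collect the block column by column, as Y(I==jj) does in MATLAB.
        rows.append([Y[r][c] for c in range(n) for r in range(n)
                     if I[r][c] == label])
    return rows[::-1]  # flip the stacked rows back


X = [[8 * r + c + 1 for c in range(8)] for r in range(8)]
Z = baker_map(X, [2, 4, 2])
for row in Z:
    print(row)
```

Here the validity check raises an error when the divisors do not sum to the matrix size or do not divide it, which matches the two conditions named in the answer.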