How to flatten a data frame in apache spark | Scala - scala

I have the following data frame :
df1
uid text frequency
1 a 1
1 b 0
1 c 2
2 a 0
2 b 0
2 c 1
I need to flatten it on the basis of uid to :
df2
uid a b c
1 1 0 2
2 0 0 1
I've worked on similar lines in R but haven't been able to translate it into sql or scala.
Any suggestions on how to approach this?

You can group by uid, use text as a pivot column and sum frequencies:
df1
.groupBy("uid")
.pivot("text")
.sum("frequency").show()

Related

Counting presence or absence of tallies by rows

Hi I am not sure how to explain what I need but I'll try. I need a query (if there is one) for counting if a tally was present in a point (column). So all species with more than one tally in a point will count only as one.
This is how the data looks:
Sp Site Pnt1 Pnt2 Pnt3 Total
A 1 1 1 1 3
A 2 1 1 2
A 3 1 1
B 1 1 1 1 3
B 2 1 1
C 1 1 1 2
C 2 0
I want to count if the sites have tally or not and if they are repeated by points (for the same species) I want to count them as one. I would like the resulting table to look like this.
Sp Pnt1 Pnt2 Pnt3 Total
A 1 1 1 3
B 1 1 1 3
C 0 1 1 2
Thanks for all the help you can provide.

How do I implement convolution in KDB?

I'm trying to convolve a small vector (kernel) across a longer series.
The simplest possible kernel is (-1 1), which is the equivalent of the diff operator.
My failed attempt is:
{sum (x;y)*(-1;1)} scan til 10
0 1 1 2 2 3 3 4 4 5
This doesn't work, as it supplies the previous evaluation as the right component of the binary fn. What I should be doing is evaluating the function on each pair and storing the result. The result I'm looking for is:
1 1 1 1 1 1 1 1 1
What I can't figure out is an elegant KDB way of doing this calculation.
Is there an elegant way to do this in KDB?
You could use the each prior iterator to do this - https://code.kx.com/q/ref/maps/#each-prior
q)-':[til 10]
0 1 1 1 1 1 1 1 1 1
Maybe something like:
q)1_sum(1;-1)*1 prev\til 10
1 1 1 1 1 1 1 1 1
Can you provide more examples?

Create a Boolean column displaying comparison between 2 other columns in kdb+

I'm currently learning kdb+/q.
I have a table of data. I want to take 2 columns of data (just numbers), compare them and create a new Boolean column that will display whether the value in column 1 is greater than or equal to the value in column 2.
I am comfortable using the update command to create a new column, but I don't know how to ensure that it is Boolean, how to compare the values and a method to display the "greater-than-or-equal-to-ness" - is it possible to do a simple Y/N output for that?
Thanks.
/ dummy data
q) show t:([] a:1 2 3; b: 0 2 4)
a b
---
1 0
2 2
3 4
/ add column name 'ge' with value from b>=a
q) update ge:b>=a from t
a b ge
------
1 0 0
2 2 1
3 4 1
Use a vector conditional:
http://code.kx.com/q/ref/lists/#vector-conditional
q)t:([]c1:1 10 7 5 9;c2:8 5 3 4 9)
q)r:update goe:?[c1>=c2;1b;0b] from t
c1 c2 goe
-------------
1 8 0
10 5 1
7 3 1
5 4 1
9 9 1
Use meta to confirm the goe column is of boolean type:
q)meta r
c | t f a
-------| -----
c1 | j
c2 | j
goe | b
The operation <= works well with vectors, but in some cases when a function needs atoms as input for performing an operation, you might want to use ' (each-both operator).
e.g. To compare the length of symbol string with another column value
q)f:{x<=count string y}
q)f[3;`ab]
0b
q)t:([] l:1 2 3; s: `a`bc`de)
q)update r:f'[l;s] from t
l s r
------
1 a 1
2 bc 1
3 de 0

MATLAB what lines are different between matrices

I am trying to find the number of the lines where the values of two matrices are not the same
I found only a way to know the indexs on the not same items by:
find(a~=b)
where a is N*N and b is N*N
How can I know the rows numbers of the not same items
ps
looking for nicer way then
dint the find and then having some vector in a loop filling with
ind2sub(size(A), 6)
You can use max on the logical array of such matches or mis-mistaches in this case along a certain dimension, alongwith find.
If you are looking to find unique row IDs for mismatches, do this -
find(max(a~=b,[],2))
For unique column IDs, just change the dimension specifier -
find(max(a~=b,[],1))
Sample run -
>> a
a =
1 2 2 2 1
1 2 1 1 1
2 2 2 2 1
1 1 2 1 1
>> b
b =
1 2 1 1 2
1 2 1 2 1
2 2 2 2 1
1 1 2 2 2
>> a~=b
ans =
0 0 1 1 1
0 0 0 1 0
0 0 0 0 0
0 0 0 1 1
>> find(max(a~=b,[],2)) %// unique row IDs
ans =
1
2
4
>> find(max(a~=b,[],1)) %// unique col IDs
ans =
3 4 5
here I found an easy way if any one will need it
indexs=find(a~=b)
[~,rows]=ind2sub(size(a),indexs)
rows=unique( sort( rows ) )
now rows are only the different rows
NotSame = 0;
for ii = 1:size(a,1)
if a(ii,:) ~= b(ii,:)
NotSame = NotSame+1;
end
end
This checks it row by row and when a row in a is not the same as the row in b this will increase the count of NotSame. Not the fastest way, I'm sure someone can produce a solution using bsxfun, but I'm not an expert in that.
You can also use the double output of find
[row, col] = find(a~=b)
myrows = unique(row);
You can also have the columns where a & b have different values
mycols = unique(col);

How to select specific rows based upon column attribute values in matlab?

I have [sentence cross words] logical matrix where value = 1 shows presence of a word in that sentence and 0 shows absence like as follows:
0 0 1 1
1 0 1 0
0 0 0 1
1 1 0 0
I have done some processing and selected specific words i.e. 2 & 3
result = 2 3
Now, I want to select only those rows in which value of words 2 & 3 are equal to 1 and return there row number as follows:
row = 1 2 4
This should be done for every word that is in result variable - thanks.
Think you are looking for something like this, assuming A as the input binary array -
result = [2 3]; %// select words by IDs
row = find(any(A(:,result),2))
Sample run -
A =
0 0 1 1
1 0 1 0
0 0 0 1
1 1 0 0
row =
1
2
4
For fun-sake, you can also use matrix-multiplication as an alternative approach -
row = find(A(:,result)*ones(numel(result),1))
First choose the columns that you want to extract and create a matrix that concatenates all of these columns together. Next, use any and operate along the columns in combination with find to obtain the desired locations.
Therefore, given your logical matrix stored in X, do:
ind = [2 3];
matr = X(:,ind);
vals = find(any(matr, 2));
With your above example, we get:
vals =
1
2
4