scala, filter RDD

I have these values:
val key: RDD[String] = sc.parallelize(Seq("0000005", "0000001", "0000007")) // sc is the SparkContext
and
val file2: Array[(String, Int, Int, Int, Int, Int)] = Array(("0000005", 82, 79, 16, 21, 80),
("0000001", 46, 39, 8, 5, 21),
("0000004", 58, 71, 20, 10, 6),
("0000009", 60, 89, 33, 18, 6),
("0000003", 30, 50, 71, 36, 30),
("0000007", 50, 2, 33, 15, 62))
I would like to keep only the rows of file2 whose first element exists in key.
I want something like this:
0000005 82 79 16 21 80
0000001 46 39 8 5 21
0000007 50 2 33 15 62

I simplified this to standard Scala collection types:
val keys = Seq("0000005","0000001","0000007")
val all = Seq("0000005 82 79 16 21 80",
"0000001 46 39 8 5 21",
"0000004 58 71 20 10 6",
"0000009 60 89 33 18 6",
"0000003 30 50 71 36 30",
"0000007 50 2 33 15 62")
Here is a filter that gives your result:
val filtered = all.map(_.split(" ").toList)
.filter{ case x::_ => keys.contains(x) }
.map(_.mkString(" "))
println(filtered) // -> List(0000005 82 79 16 21 80, 0000001 46 39 8 5 21, 0000007 50 2 33 15 62)
See Scalafiddle

First, map file2 into a key -> value structure (I assume each row of file2 is an Array[String], i.e. all the numbers are actually strings):
val file2Map: RDD[(String, Array[String])] = sc.parallelize(file2.map(value => (value.head, value)))
join is only defined on pair RDDs, so the RDD of keys has to be keyed as well. Now, if you do:
key.map(k => (k, k)).join(file2Map).take(10).foreach(println)
The output would be something like:
(0000005, (0000005, 0000005 82 79 16 21 80))
(0000001, (0000001, 0000001 46 39 8 5 21))
(0000007, (0000007, 0000007 50 2 33 15 62))
And from that it's easy to keep only the second element of the value tuple.
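For comparison, the same idea of keeping only the rows whose key is known can be sketched outside Spark in plain Python. This is only an illustration and not part of either answer: filter_rows_by_key is a hypothetical helper, and the rows are assumed to be space-separated strings as in the simplified Scala example.

```python
def filter_rows_by_key(rows, keys):
    """Keep the rows whose first space-separated field is one of `keys`."""
    wanted = set(keys)  # a set keeps each membership test cheap
    return [row for row in rows if row.split(" ", 1)[0] in wanted]


keys = ["0000005", "0000001", "0000007"]
rows = [
    "0000005 82 79 16 21 80",
    "0000001 46 39 8 5 21",
    "0000004 58 71 20 10 6",
    "0000009 60 89 33 18 6",
    "0000003 30 50 71 36 30",
    "0000007 50 2 33 15 62",
]

filtered = filter_rows_by_key(rows, keys)
for row in filtered:
    print(row)
```

Turning the keys into a set first keeps each lookup cheap. The Scala collection version could do the same with keys.toSet.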

Related

How to sum columns of a matrix for a specified number of columns?

I have a matrix A of size 2500 x 500. I want to sum every 10 columns and get the result as a matrix B of size 2500 x 50. That is, the first column of B is the sum of the first 10 columns of A, the second column of B is the sum of the second 10 columns of A, and so on.
How can I do that without a for loop? I have to do this hundreds of times, and using a for loop is very time consuming.

First, we "block reshape" A, such that we have the desired number of columns. Therefore, we shamelessly steal the code from the great Divakar, and put in some minimal effort to generalize it. Then, we just need to sum along the second axis, and reshape to the original form.
Here's an example with five columns to be summed:
% Sample input data
A = reshape(1:100, 10, 10).'
[r, c] = size(A);
% Number of columns to be summed
n_cols = 5;
% Block reshape to n_cols, see https://stackoverflow.com/a/40508999/11089932
B = reshape(permute(reshape(A, r, n_cols, []), [1, 3, 2]), [], n_cols);
% Sum along second axis
B = sum(B, 2);
% Reshape to original form
B = reshape(B, r, c / n_cols)
That's the output:
A =
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
B =
15 40
65 90
115 140
165 190
215 240
265 290
315 340
365 390
415 440
465 490
Hope that helps!

This can be done with splitapply. An advantage of this approach is that it works even if the group size does not divide the number of columns (the last group is smaller):
A = reshape(1:120, 12, 10).'; % example 10×12 data (borrowed from HansHirse)
n_cols = 5; % number of columns to sum over
result = splitapply(@(x)sum(x,2), A, ceil((1:size(A,2))/n_cols));
In this example,
A =
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118 119 120
result =
15 40 23
75 100 47
135 160 71
195 220 95
255 280 119
315 340 143
375 400 167
435 460 191
495 520 215
555 580 239
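The grouping logic used in both answers (sum each run of n_cols consecutive columns, with a possibly shorter last group) can also be written out as a small plain-Python sketch. This is an illustration only: sum_column_blocks is a hypothetical helper, the matrix is assumed to be a list of equal-length rows, and no claim is made about its speed relative to the MATLAB versions.

```python
def sum_column_blocks(matrix, n_cols):
    """Sum every run of n_cols consecutive columns in each row.

    If the column count is not a multiple of n_cols, the last
    group is simply shorter.
    """
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    return [
        [sum(row[start:start + n_cols]) for start in range(0, len(row), n_cols)]
        for row in matrix
    ]


# Same 10x12 example as in the splitapply answer: rows of 1..120
A = [[12 * i + j + 1 for j in range(12)] for i in range(10)]
result = sum_column_blocks(A, 5)

print(result[0])   # first row of the grouped sums
print(result[-1])  # last row of the grouped sums
```

Each output row has one entry per column group, so a 10x12 input with n_cols = 5 gives a 10x3 result.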

Why is there an 'expected an element of either String' error?

import networkx as nx
from bokeh.transform import linear_cmap
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import Circle, HoverTool, TapTool, BoxSelectTool
from bokeh.models.graphs import from_networkx
from bokeh.palettes import Spectral6
output_notebook()
G=nx.Graph(G_fb)
plot = figure(x_range=(-1.1, 1.1), y_range=(-1.1, 1.1))
plot.add_tools(HoverTool(tooltips=[("Name", "@name"),
("Club", "@club")]),
TapTool(),
BoxSelectTool())
graph = from_networkx(G, nx.spring_layout, iterations=1000, scale=1, center=(0,0))
graph.node_renderer.data_source.data['name'] = list(G.nodes())
graph.node_renderer.glyph = Circle(size=10)
graph.node_renderer.glyph = Circle(
size=10,
fill_color=linear_cmap(degrees, palette=Spectral6, low=0, high=100, low_color='blue', high_color='red'))
plot.renderers.append(graph)
show(plot)
It comes up with an error saying:
expected an element of either String, Dict(Enum('expr', 'field', 'value', 'transform'), Either(String, Instance(Transform), Instance(Expression), Color)) or Color, got {'field': [5, 6, 12, 7, 8, 6, 6, 9, 8, 4, 3, 5, 5, 7, 5, 7, 4, 4, 3, 6, 3, 5, 10, 1, 7, 4, 8, 2, 9, 6, 5, 5, 9, 11, 11, 3, 9, 1, 10, 1, 6, 6, 5, 8, 7, 7, 4, 7, 2, 1, 3, 1, 6, 3, 1, 1, 2, 2, 1, 2, 2, 1], 'transform': LinearColorMapper(id='1194', ...)}
Part of G_fb:
11 1
15 1
16 1
41 1
43 1
48 1
18 2
20 2
27 2
28 2
29 2
37 2
42 2
55 2
11 3
43 3
45 3
62 3
9 4
15 4
60 4
52 5
10 6
14 6
57 6
58 6
10 7
14 7
18 7
55 7
57 7
58 7
20 8
28 8
31 8
41 8
55 8
21 9
29 9
38 9
46 9
60 9
14 10
18 10
33 10
42 10
58 10
30 11
43 11
48 11
52 12
34 13
18 14
33 14
42 14
55 14
58 14
17 15
25 15
34 15
35 15
38 15
39 15
41 15
44 15
51 15
53 15
19 16

range of number to subset slices

I would like to split a vector into a number of 'slices' (in Matlab), but I find myself in a brain freeze and can't come up with a good way (e.g. a one-liner) to do it:
a=1:119;
slices=[47 24 1 47];
result={1:47,48:71,...};
The result doesn't need to be stored in a cell array.
Thanks

This is what mat2cell does:
>> a=1:119;
>> slices=[47 24 1 47];
>> result = mat2cell(a, 1, slices) % 1 is # of rows in result
result =
{
[1,1] =
Columns 1 through 15:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Columns 16 through 30:
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Columns 31 through 45:
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
Columns 46 and 47:
46 47
[1,2] =
Columns 1 through 15:
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
Columns 16 through 24:
63 64 65 66 67 68 69 70 71
[1,3] = 72
[1,4] =
Columns 1 through 13:
73 74 75 76 77 78 79 80 81 82 83 84 85
Columns 14 through 26:
86 87 88 89 90 91 92 93 94 95 96 97 98
Columns 27 through 39:
99 100 101 102 103 104 105 106 107 108 109 110 111
Columns 40 through 47:
112 113 114 115 116 117 118 119
}
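What mat2cell does here, cutting a vector into consecutive pieces of given lengths, can be sketched in plain Python with running totals of the slice lengths. This is an illustration under assumptions, not MATLAB's implementation: split_by_lengths is a hypothetical helper, and it insists that the lengths add up to the length of the input, as mat2cell does.

```python
from itertools import accumulate


def split_by_lengths(values, lengths):
    """Cut values into consecutive pieces with the given lengths."""
    if sum(lengths) != len(values):
        raise ValueError("slice lengths must add up to len(values)")
    ends = list(accumulate(lengths))  # running totals: 47, 71, 72, 119
    starts = [0] + ends[:-1]          # each piece starts where the previous one ended
    return [values[s:e] for s, e in zip(starts, ends)]


a = list(range(1, 120))               # 1..119, like a = 1:119
slices = [47, 24, 1, 47]
result = split_by_lengths(a, slices)

for piece in result:
    print(piece[0], "to", piece[-1])  # first and last element of each piece
```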

Problems with matrix indexing using for loop and if condition

I want to do some indexing, something like what follows:
for c=1:size(params,1)
for d=1:size(data,2)
if params(c,8)==1
value1(c,d)=data(params(c,11),d);
elseif params(c,8)==2
value2(c,d)=data(params(c,11),d);
elseif params(c,8)==3
value3(c,d)=data(params(c,11),d);
end
end
end
The problem with this is that if we have params(:,8)=1,3,1,3,2,3,1... then value1 will contain all zeros in rows 2, 4, 5, 6, etc. These are the rows that do not have 1 in column 8 of params. Similarly, value2 will contain all zeros in rows 1, 2, 3, 4, 6, 7... and value3 will contain all zeros in rows 1, 3, 5, 7, .... Could anyone tell me how to index so that I don't have 'gaps' of zeros in between rows? Thanks!
Edit: below is a sample dataset:
data (1080x15 double)
168 432 45 86
170 437 54 82
163 423 52 83
178 434 50 84
177 444 42 87
177 444 58 85
175 447 48 77
184 451 59 86
168 455 52 104
174 437 62 88
175 443 55 85
179 456 51 92
168 450 73 82
175 454 60 68
params (72x12 double; we are interested only in columns 8 and 11, so I'm showing only columns 8-11 for the sake of space):
1 10 15 1
3 12 16 16
2 10 15 32
3 12 16 47
1 8 14 63
2 10 15 77
2 8 14 92
3 10 15 106
1 12 16 121
3 8 14 137
2 10 15 151
The expected output for value1, value2, and value3 should each be 24x15. This is because there are 15 columns in data, and the values 1, 2, and 3 each occur 24 times in column 8 of params.

You can use bsxfun to avoid the for loop (note that this is not actually vectorizing):
value1 = bsxfun(@times,data(params(:,11),:),(params(:,8)==1));
value2 = bsxfun(@times,data(params(:,11),:),(params(:,8)==2));
value3 = bsxfun(@times,data(params(:,11),:),(params(:,8)==3));
But it still gives you results with zero rows, so you can remove the zero rows with:
value1(all(value1==0,2),:)=[];
value2(all(value2==0,2),:)=[];
value3(all(value3==0,2),:)=[];
You can also use the commands above to remove the zero rows from the results of your own loop, without using bsxfun. It is not always good to lose the transparency.
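A different way to avoid the gaps is to pick out only the matching rows in the first place, so that no zero rows are created and nothing has to be deleted afterwards. The plain-Python sketch below illustrates that idea. It is not the answer's method: rows_for_group is a hypothetical helper, row numbers are 1-based to mirror MATLAB, and the small data set is made-up toy data rather than the asker's.

```python
def rows_for_group(data, row_numbers, groups, wanted):
    """Collect data rows (1-based row_numbers) whose group label equals wanted."""
    return [
        list(data[row - 1])  # convert the 1-based row number to a 0-based index
        for row, group in zip(row_numbers, groups)
        if group == wanted
    ]


# Toy stand-ins: `groups` plays the role of params(:,8),
# `row_numbers` plays the role of params(:,11).
data = [[10, 11], [20, 21], [30, 31], [40, 41]]
groups = [1, 3, 1, 2]
row_numbers = [1, 2, 4, 3]

value1 = rows_for_group(data, row_numbers, groups, 1)
value2 = rows_for_group(data, row_numbers, groups, 2)
value3 = rows_for_group(data, row_numbers, groups, 3)
print(value1, value2, value3)
```

Unlike deleting all-zero rows afterwards, this selection cannot accidentally drop a genuine data row that happens to contain only zeros.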

Matlab: How to replace certain elements of a matrix A by other values of A in both directions?

For a matrix A (10x100000) containing numbers between 1 and 100, how can I interchange some elements of A with other values of A in both directions?
example:
replace numbers [5 7 9 18 55 4] by [47 78 41 1 99 98] and [47 78 41 1 99 98] by [5 7 9 18 55 4]

Use the two outputs of ismember:
n1 = [1 2 3]; %// first set of numbers
n2 = [4 5 6]; %// second set of numbers
[v1, i1] = ismember(A,n1);
[v2, i2] = ismember(A,n2);
A(v1) = n2(i1(v1));
A(v2) = n1(i2(v2));
Example:
>> A = randi(8,4,5)
A =
2 2 8 4 6
2 5 3 8 2
5 4 3 2 5
4 3 2 3 4
is transformed into
A =
5 5 8 1 3
5 2 6 8 5
2 1 6 5 2
1 6 5 6 1

bsxfun based approach -
%// Input matrix
A = randi(100,10,10)
vec1 = [5 7 9 18 55 4 , 47 78 41 1 99 98]; %// Numbers to be replaced
vec2 = [47 78 4 1 99 98, 5 7 9 18 55 4]; %// Numbers to be used as replacements
[v1,v2] = max(bsxfun(@eq,A(:),vec1),[],2);
A(find(v1)) = vec2(v2(v1))
Sample run -
Input A
A =
27 37 27 59 37 13 55 45 29 16
84 41 58 46 75 39 75 51 49 16
100 37 88 87 71 82 85 54 69 16
65 47 7 67 71 99 17 86 21 9
71 51 45 36 1 87 91 68 61 46
94 92 9 35 38 9 11 81 33 67
69 21 57 26 91 34 75 54 89 84
57 34 54 96 32 24 73 96 14 80
39 58 77 30 60 32 72 7 11 72
64 49 24 16 30 99 14 55 96 48
Output A
A =
27 37 27 59 37 13 99 45 29 16
84 9 58 46 75 39 75 51 49 16
100 37 88 87 71 82 85 54 69 16
65 5 78 67 71 55 17 86 21 4
71 51 45 36 18 87 91 68 61 46
94 92 4 35 38 4 11 81 33 67
69 21 57 26 91 34 75 54 89 84
57 34 54 96 32 24 73 96 14 80
39 58 77 30 60 32 72 78 11 72
64 49 24 16 30 55 14 99 96 48
As can be seen, the 7s from (4,3) and (9,8) in the original A are replaced by 78s and 47 in (4,2) by 5.

Matlab is a strange and mysterious place. Searching through the documentation, I found a function called changem in the Mapping Toolbox. I've never used it, but apparently if you have your original matrix A and two substitution vectors v1 and v2:
v1 = [ 5 7 9 18 55 4];
v2 = [47 78 41 1 99 98];
All you have to do is:
B = changem(A, [v1 v2], [v2 v1]);
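The two-way replacement that all three answers implement can be pictured as a single lookup table holding both directions, applied once to each element so that a swapped value is never swapped back. The plain-Python sketch below illustrates that. It is only an illustration: swap_values is a hypothetical helper, and the two lists are assumed to have equal length and no values in common.

```python
def swap_values(matrix, first, second):
    """Replace first[i] by second[i] and second[i] by first[i] everywhere."""
    if len(first) != len(second):
        raise ValueError("both lists must have the same length")
    mapping = dict(zip(first, second))  # forward direction
    mapping.update(zip(second, first))  # backward direction
    # Each element is looked up exactly once; values not in the table stay as they are.
    return [[mapping.get(value, value) for value in row] for row in matrix]


# Same 4x5 example as in the ismember answer
A = [
    [2, 2, 8, 4, 6],
    [2, 5, 3, 8, 2],
    [5, 4, 3, 2, 5],
    [4, 3, 2, 3, 4],
]
swapped = swap_values(A, [1, 2, 3], [4, 5, 6])
for row in swapped:
    print(row)
```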
