Breeze extract columns from DenseMatrix based on list - scala

I just started working with Breeze in Scala and I am trying to figure out how to extract columns from a DenseMatrix when the column indexes I want are in a list. For example, to extract columns 1, 3 and 6 from a numpy array myArray in Python you can write
myArray[:,[1,3,6]]
In Breeze I have tried something similar
myArray(::,(1,3,6))
but it produces a syntax error. I have looked at the Breeze Linear Algebra Cheat Sheet, but I only see ways to extract individual columns or contiguous columns. Is there any way to directly specify the columns that I want to extract, as in my example?

I had the same problem; the fix that I found was to import numerics as well as linalg from breeze:
import breeze.numerics._
It's mentioned at the top of the cheat sheet you found, but it's very easy to miss, especially if you've searched for a particular operation (which is what happened to me).

We assign a dense matrix "dm" with the following contents:
import breeze.linalg._
val dm = DenseMatrix(
  (0, 1, 2, 3, 4, 5, 6, 7),
  (8, 9, 10, 11, 12, 13, 14, 15),
  (16, 17, 18, 19, 20, 21, 22, 23),
  (24, 25, 26, 27, 28, 29, 30, 31))
dm
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
To extract specific columns of the matrix (e.g. 1, 3 and 6), pass the column indexes as an IndexedSeq:
dm(::,IndexedSeq(1,3,6))
1 3 6
9 11 14
17 19 22
25 27 30
or select rows by index in the same way:
dm(IndexedSeq(1,3),::)
8 9 10 11 12 13 14 15
24 25 26 27 28 29 30 31
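For readers coming from the numpy side of the question, here is a minimal plain-Python sketch of what the two selections above do. It does not use Breeze or numpy; the helper names select_columns and select_rows are made up for illustration.

```python
# Same 4 x 8 contents as the dm matrix above, built as a list of rows.
dm = [list(range(r * 8, r * 8 + 8)) for r in range(4)]

def select_columns(matrix, cols):
    """Keep only the listed column indexes, in the order given (hypothetical helper)."""
    return [[row[j] for j in cols] for row in matrix]

def select_rows(matrix, rows):
    """Keep only the listed row indexes, in the order given (hypothetical helper)."""
    return [list(matrix[i]) for i in rows]

cols_136 = select_columns(dm, [1, 3, 6])  # analogous to dm(::, IndexedSeq(1,3,6))
rows_13 = select_rows(dm, [1, 3])         # analogous to dm(IndexedSeq(1,3), ::)

for row in cols_136:
    print(*row)
```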
Thank you. Please feel free to comment

Related

Displaying data to a map, creating a choropleth

What I would like to do is create a choropleth map which is darker or lighter based on the number of data points in a particular area.
I have the following data:
RO-B, 9
PL-MZ, 24
SE-C, 3
DE-NI, 5
PL-DS, 14
ES-CM, 11
RO-IS, 2
DE-BY, 51
SE-Z, 18
CH-BE, 10
PL-WP, 1
ES-IB, 1
DE-BW, 21
DE-BE, 24
DE-BB, 1
IE-M, 26
ES-PV, 1
DE-SN, 6
CH-ZH, 31
ES-GA, 1
NL-GE, 2
IE-U, 1
ES-AN, 4
FR-J, 82
DE-HH, 34
PL-PD, 1
PL-LD, 6
GB-WLS, 60
GB-ENG, 8619
RO-BV, 45
CH-VD, 2
PL-SL, 1
DE-HE, 17
SE-I, 1
HU-PE, 4
PL-MA, 4
SE-AB, 3
CH-BS, 20
ES-CT, 31
DE-TH, 25
IE-C, 1
CZ-ST, 1
DE-NW, 29
NL-NH, 3
DE-RP, 9
CZ-PR, 4
IE-L, 134
HU-BU, 10
RO-CJ, 1
GB-NIR, 29
ES-MD, 33
CH-LU, 11
GB-SCT, 172
CH-GE, 3
BE-BRU, 30
BE-VLG, 25
Each row gives the ISO 3166-2 code of a country and subregion, followed by the number of data points associated with that region.
I've seen this project on GitHub which seems to also use the same ISO3166-2 to reference countries.
I'm trying to figure out how I could modify their code to display my data points, such that if the number is higher the area would be darker, if the number is less it would be lighter.
It seems like it should be possible. The first thing I tried was modifying this jsfiddle code, which seems very close to what I need, but I couldn't get it to work.
For instance this line:
osmeRegions.geoJSON('RU-MOW',
seems to directly reference an ISO 3166-2 code, but it's not as simple as just changing that (or maybe it is, but I couldn't get it to work properly).
Does anyone know if I could possibly adapt the code from that project to create the map rendering I've described?
Or perhaps there's a different way to achieve the same goal?
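Independent of which map library ends up drawing the regions, the "higher is darker" part can be sketched on its own. The snippet below is only an illustration under stated assumptions: the function name shade_for, the blue colour ramp, and the choice of a logarithmic scale (so that GB-ENG at 8619 does not wash out every other region) are assumptions, not something taken from the GitHub project or the jsfiddle.

```python
import math

# A few of the counts listed above, keyed by ISO 3166-2 code.
counts = {"GB-ENG": 8619, "GB-SCT": 172, "IE-L": 134, "FR-J": 82, "PL-WP": 1}

def shade_for(count, max_count):
    """Map a count to a hex colour: white for 0, darker blue for larger counts.

    Hypothetical helper; the log scale and the 0.85 darkness cap are arbitrary choices.
    """
    t = math.log1p(count) / math.log1p(max_count)  # 0.0 .. 1.0
    level = round(255 * (1 - 0.85 * t))            # 255 = lightest red/green channel
    return "#{0:02x}{0:02x}ff".format(level)

max_count = max(counts.values())
fill_colours = {code: shade_for(n, max_count) for code, n in counts.items()}

for code, colour in fill_colours.items():
    print(code, colour)
```

Each resulting colour would then be passed as the fill style of the corresponding region polygon in whatever mapping library is used.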

visualizing a distance matrix

Sorry if there's already an answer to this. I can't seem to find it.
I'm working on an application that pulls legislators' voting records on bills, and I'm trying to come up with some interesting ways of visualizing the data. There's one idea in my head right now, but I'm not sure it's mathematically possible to do the visualization I want to in two dimensions.
The data begins like this:
       HB1  HB2  HB3
Smith    1    0    1
Hill     1    1    1
Davis    0    1    0
Where 1 = aye, 0 = nay.
The next step I take is to measure the "distance" of each legislator from the other by summing the XORs of their voting records, so that each time one legislator disagrees with another they get a distance "point" with that legislator. That creates a table like this:
       Smith  Hill  Davis
Smith      0     1      3
Hill       1     0      2
Davis      3     2      0
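The distance step described above (summing the XORs of two voting records) can be sketched in a few lines of Python. This is only an illustration using the three legislators from the example; the names votes, vote_distance and distance are made up here.

```python
# Voting records from the example: 1 = aye, 0 = nay, for HB1, HB2, HB3.
votes = {
    "Smith": [1, 0, 1],
    "Hill":  [1, 1, 1],
    "Davis": [0, 1, 0],
}

def vote_distance(a, b):
    """Number of bills on which two legislators voted differently (sum of XORs)."""
    return sum(x ^ y for x, y in zip(a, b))

names = list(votes)
distance = {a: {b: vote_distance(votes[a], votes[b]) for b in names} for a in names}

for a in names:
    print(a, [distance[a][b] for b in names])
```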
So my idea is to graph each legislator as a point on a plane, and to have the distances between those points reflect the distance rating in the table. I think it presents an interesting opportunity to see if there are clusters of legislators with similar voting patterns, etc.
Now, obviously, this is easy to do with 3 points since you can always draw a triangle with three given lengths for sides. But I can't figure out whether it's mathematically possible to graph lots more (35-70) legislators and still have all the distances right within a 2-dimensional space, or whether you potentially need one additional dimension with each legislator after three.
So, for example, is it possible to preserve all the distances if the data table looks like this?
0 13 6 8 10 14 12 14 12 12
13 0 13 13 13 7 9 11 9 7
6 13 0 12 8 16 14 10 12 14
8 13 12 0 12 10 6 10 10 8
10 13 8 12 0 10 12 12 14 14
14 7 16 10 10 0 10 10 12 8
12 9 14 6 12 10 0 12 8 10
14 11 10 10 12 10 12 0 8 10
12 9 12 10 14 12 8 8 0 10
12 7 14 8 14 8 10 10 10 0
If so, does Octave have a built-in function for this, or can anyone point me to an algorithm?
Ok, found the answer.
No, it's generally not mathematically possible to do what I wanted to do.
The best approximation is an algorithm called multidimensional scaling. Octave has a built-in function: cmdscale.
Hope others may find this helpful.

How to display all x-labels on 'bar' plot?

I have the following data that I wish to plot as a bar graph in MATLAB:
publications = [15 12 35 12 19 14 21 15 7 16 40 28 6 13 16 6 7 22 23 16 45];
bar(publications,0.4)
set(gca,'XTickLabel',{'G1','G2','G3','G4','G5','G6','G7','G8','G9','G10',...
'G11','G12','G14','G16','G17','G18','G19','G20','G21','G22','G23'})
However, when I execute this, I get the following plot:
Obviously the x-label is incorrect here as the first bar should have the x-label 'G1', the second should have 'G2', etc, until we get to the last bar which is supposed to have 'G23'.
If anyone knows how I can fix this, I would really, really appreciate it!
Add the following line:
set(gca,'XTick',1:numel(publications))
before you set the labels.
Now it depends how big your resulting plot is, because the labels are a little packed.
You may adjust fontsize or Orientation or the gaps between the bars.
Probably the publication names are a little longer so a 90° rotation is the best and you may find this answer or this link helpful.
Another suggestion would be to use barh and rotate the figure after printing:
publications = [15 12 35 12 19 14 21 15 7 16 40 28 6 13 16 6 7 22 23 16 45];
bh = barh(publications,0.4);
set(gca,'XAxisLocation','top')
set(gca,'YTick',1:numel(publications))
set(gca,'YTickLabel',{'G1','G2','G3','G4','G5','G6','G7','G8','G9','G10',...
'G11','G12','G14','G16','G17','G18','G19','G20','G21','G22','G23'})

How does Matlab extract a subset of a bigger matrix by specifying the indices?

I have a matrix A
A =
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
I have another matrix that specifies the indices
index =
1 2
3 4
Now I get a third matrix C from A
C = A(index)
C =
1 6
11 16
Problem: I am unable to understand why I get this matrix C. What is the logic behind it?
The logic behind it is linear indexing: When you provide a single index, Matlab moves along columns first, then along rows, then along further dimensions (according to their order).
So in your case (4 x 5 matrix) the entries of A are being accessed in the following order (each number here represents order, not the value of the entry):
1 5 9 13 17
2 6 10 14 18
3 7 11 15 19
4 8 12 16 20
Once you get used to it, you'll see linear indexing is a very powerful tool.
As an example: to obtain the maximum value in A you could just use max(A(1:20)). This could be further simplified to max(A(1:end)) or max(A(:)). Note that "A(:)" is a common Matlab idiom for turning any array into a column vector, which is sometimes called linearizing the array.
See also ind2sub and sub2ind, which are used to convert from linear index to standard indices and vice versa.
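To make the column-first ordering concrete, here is a small plain-Python sketch that mimics 1-based, column-major linear indexing on the matrix from the question. It is not Matlab code, and the helper name linear_get is made up for illustration.

```python
# The 4 x 5 matrix A from the question, stored as a list of rows.
A = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]

def linear_get(matrix, k):
    """Return the element at 1-based linear index k, counting down columns first."""
    n_rows = len(matrix)
    col, row = divmod(k - 1, n_rows)  # quotient = column, remainder = row
    return matrix[row][col]

index = [[1, 2], [3, 4]]
C = [[linear_get(A, k) for k in row] for row in index]  # same shape as index

for row in C:
    print(*row)
```

The result has the shape of index, and each entry is looked up by walking down the first column of A before moving to the next one, which is why C contains 1, 6, 11 and 16.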

Stata longwise average

I'm using Stata and trying to compute conditional means based on time/date. For each store I want to calculate mean (inventory) per year. If there are missing year gaps, then I want to take the mean from the closest two dates' inventory values.
I've used (below) to get overall means per store, but I need more granularity.
egen mean_inv = mean(inventory), by (store)
I've also tried this loop with similar results:
by id, sort: gen v1'=_n'
forvalues x = 1/'=n'{
by store: sum inventory if v1==`x'
replace mean_inv= r(mean) if v1==`x'
}
Visually, I want mean inventory per store: (store id is not sequential)
5/1/2003 2/3/2006 8/9/2006 3/5/2007 6/9/2007 2/1/2008
13 18 12 15 24 11
[mean1] [mean2] [mean3] [mean4] [mean5]
store date inventory
1 16750 17
1 18234 16
1 15844 13
1 17111 14
1 17870 13
1 16929 13.5
1 17503 13
4 15987 18
4 15896 16
4 18211 16
4 17154 18
4 17931 24
4 16776 23
12 16426 26
12 17681 17
12 16386 17
12 16603 18
12 17034 16
12 17205 16
42 15798 18
42 16022 18
42 17496 16
42 17870 18
42 16204 18
42 16778 14
33 18053 23
33 16086 13
33 16450 21
33 17374 19
33 16814 19
33 15834 16
33 16167 16
56 17686 16
56 17623 18
56 17231 20
56 15978 16
56 16811 15
56 17861 20
It is hard to relate your code to the word description of your problem.
Your egen call calculates means by store, not year.
Your longer fragment does not make complete sense given lack of definitions and at least one typo.
Note that your variable v1 contains identifiers that run 1 up within groups of store, and does not distinguish different values of store, as you (seem to) imply. It strains credibility that it produces results anywhere near those by the egen call.
n is not defined and the code evaluating it is presumably intended to be
`=n'
If you calculate
by store: sum inventory if v1 == `x'
several means will be calculated in turn but only the last to be calculated will be accessible as r(mean).
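The sample data do not match the problem as described: there is no year variable. If date holds Stata daily dates (days since 01jan1960), the values shown fall between 2003 and 2009, so a year variable would first have to be derived from them, for example with year(date).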
Setting all that aside, suppose you have variables store, inventory and year. You can try
collapse inventory, by(store year)
fillin store year
ipolate inventory year, gen(inventory2) by(store)
The collapse produces a reduced dataset of means. The ipolate interpolates across gaps, as you ask. fillin may not be adequate to give all the store and year combinations you want and you may need to add further years manually before the interpolation. If you want to put these results back with the original data, that's a merge.
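For readers who want to see the logic of those three commands spelled out, here is a rough plain-Python sketch of the same idea: average per store and year, then interpolate linearly across missing years. It is not Stata and does not reproduce fillin exactly (it fills every integer year between a store's first and last observed year); the records and the function names yearly_means and fill_gaps are hypothetical.

```python
from collections import defaultdict

# Hypothetical (store, year, inventory) records; not taken from the question's data.
records = [
    (1, 2003, 13.0), (1, 2003, 15.0),
    (1, 2006, 18.0), (1, 2006, 12.0),
    (1, 2007, 20.0),
]

def yearly_means(rows):
    """Mean inventory per (store, year), like collapse inventory, by(store year)."""
    sums = defaultdict(lambda: [0.0, 0])
    for store, year, inv in rows:
        acc = sums[(store, year)]
        acc[0] += inv
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def fill_gaps(means):
    """Linearly interpolate each store's mean across years with no observations."""
    by_store = defaultdict(dict)
    for (store, year), m in means.items():
        by_store[store][year] = m
    filled = {}
    for store, known in by_store.items():
        years = sorted(known)
        for lo, hi in zip(years, years[1:]):
            for y in range(lo, hi):
                w = (y - lo) / (hi - lo)  # 0 at lo, approaching 1 at hi
                filled[(store, y)] = known[lo] + w * (known[hi] - known[lo])
        filled[(store, years[-1])] = known[years[-1]]
    return filled

result = fill_gaps(yearly_means(records))
for key in sorted(result):
    print(key, round(result[key], 2))
```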
In total, this is a pretty messy question.