Plot portfolio composition map in Julia (or Matlab) - matlab

I am optimizing portfolio of N stocks over M levels of expected return. So after doing this I get the time series of weights (i.e. a N x M matrix where where each row is a combination of stock weights for a particular level of expected return). Weights add up to 1.
Now I want to plot something called portfolio composition map (right plot on the picture), which is a plot of these stock weights over all levels of expected return, each with a distinct color and length (at every level of return) is proportional to it's weight.
My questions is how to do this in Julia (or MATLAB)?

I came across this and the accepted solution seemed so complex. Here's how I would do it:
using Plots
#userplot PortfolioComposition
#recipe function f(pc::PortfolioComposition)
weights, returns = pc.args
weights = cumsum(weights,dims=2)
seriestype := :shape
for c=1:size(weights,2)
sx = vcat(weights[:,c], c==1 ? zeros(length(returns)) : reverse(weights[:,c-1]))
sy = vcat(returns, reverse(returns))
#series Shape(sx, sy)
end
end
# fake data
tickers = ["IBM", "Google", "Apple", "Intel"]
N = 10
D = length(tickers)
weights = rand(N,D)
weights ./= sum(weights, dims=2)
returns = sort!((1:N) + D*randn(N))
# plot it
portfoliocomposition(weights, returns, labels = tickers)

matplotlib has a pretty powerful polygon plotting capability, e.g. this link on plotting filled polygons:
ploting filled polygons in python
You can use this from Julia via the excellent PyPlot.jl package.
Note that the syntax for certain things changes; see the PyPlot.jl README and e.g. this set of examples.
You "just" need to calculate the coordinates from your matrix and build up a set of polygons to plot the portfolio composition graph. It would be nice to see the code if you get this working!

So I was able to draw it, and here's my code:
using PyPlot
using PyCall
#pyimport matplotlib.patches as patch
N = 10
D = 4
weights = Array(Float64, N,D)
for i in 1:N
w = rand(D)
w = w/sum(w)
weights[i,:] = w
end
weights = [zeros(Float64, N) weights]
weights = cumsum(weights,2)
returns = sort!([linspace(1,N, N);] + D*randn(N))
##########
# Plot #
##########
polygons = Array(PyObject, 4)
colors = ["red","blue","green","cyan"]
labels = ["IBM", "Google", "Apple", "Intel"]
fig, ax = subplots()
fig[:set_size_inches](5, 7)
title("Problem 2.5 part 2")
xlabel("Weights")
ylabel("Return (%)")
ax[:set_autoscale_on](false)
ax[:axis]([0,1,minimum(returns),maximum(returns)])
for i in 1:(size(weights,2)-1)
xy=[weights[:,i] returns;
reverse(weights[:,(i+1)]) reverse(returns)]
polygons[i] = matplotlib[:patches][:Polygon](xy, true, color=colors[i], label = labels[i])
ax[:add_artist](polygons[i])
end
legend(polygons, labels, bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0)
show()
# savefig("CompositionMap.png",bbox_inches="tight")
Can't say that this is the best way, to do this, but at least it is working.

Related

N-dimensional GP Regression

I'm trying to use GPflow for a multidimensional regression. But I'm confused by the shapes of the mean and variance.
For example: A 2-dimensional input space X of shape (20,20) is supposed to be predicted. My training samples are of shape (8,2) which means 8 training samples overall for the two dimensions. The y-values are of shape (8,1) which of course means one value of the ground truth per combination of the 2 input dimensions.
If I now use model.predict_y(X) I would expect to receive a mean of shape (20,20) but obtain a shape of (20,1). Same goes for the variance. I think that this problem comes from the shape of the y-values but I have have no idea how to fix it.
bound = 3
num = 20
X = np.random.uniform(-bound, bound, (num,num))
print(X_sample.shape) # (8,2)
print(Y_sample.shape) # (8,1)
k = gpflow.kernels.RBF(input_dim=2)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
gpflow.train.ScipyOptimizer().minimize(m)
mean, var = m.predict_y(X)
print(mean.shape) # (20, 1)
print(var.shape) # (20, 1)
It sounds like you may be confused between the shape of a grid of input positions and the shape of the numpy arrays: if you want to predict on a 20 x 20 grid in two dimensions, you have 400 points in total, each with 2 values. So X (the one that you pass to m.predict_y()) should have shape (400, 2). (Note that the second dimension needs to have the same shape as X_sample!)
To construct this array of shape (400,2) you can use np.meshgrid (e.g., see What is the purpose of meshgrid in Python / NumPy?).
m.predict_y(X) only predicts the marginal variance at each test point, so the returned mean and var both have shape (400,1) (same length as X). You can of course reshape them to the 20 x 20 values on your grid.
(It is also possible to compute the full covariance, for the latent f this is implemented as m.predict_f_full_cov, which for X of shape (400,2) would return a 400x400 matrix. This is relevant if you want consistent samples from the GP, but I suspect that goes well beyond this question.)
I was indeed making the mistake to not flatten the arrays which in return produced the mistake. Thank you for the fast response STJ!
Here is an example of the working code:
# Generate data
bound = 3.
x1 = np.linspace(-bound, bound, num)
x2 = np.linspace(-bound, bound, num)
x1_mesh,x2_mesh = np.meshgrid(x1, x2)
X = np.dstack([x1_mesh, x2_mesh]).reshape(-1, 2)
z = f(x1_mesh, x2_mesh) # evaluation of the function on the grid
# Draw samples from feature vectors and function by a given index
size = 2
np.random.seed(1991)
index = np.random.choice(range(len(x1)), size=(size,X.ndim), replace=False)
samples = utils.sampleFeature([x1,x2], index)
X1_sample = samples[0]
X2_sample = samples[1]
X_sample = np.column_stack((X1_sample, X2_sample))
Y_sample = utils.samplefromFunc(f=z, ind=index)
# Change noise parameter
sigma_n = 0.0
# Construct models with initial guess
k = gpflow.kernels.RBF(2,active_dims=[0,1], lengthscales=1.0,ARD=True)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
#print(X.shape)
mean, var = m.predict_y(X)
mean_square = mean.reshape(x1_mesh.shape) # Shape: (num,num)
var_square = var.reshape(x1_mesh.shape) # Shape: (num,num)
# Plot mean
fig = plt.figure(figsize=(16, 12))
ax = plt.axes(projection='3d')
ax.plot_surface(x1_mesh, x2_mesh, mean_square, cmap=cm.viridis, linewidth=0.5, antialiased=True, alpha=0.8)
cbar = ax.contourf(x1_mesh, x2_mesh, mean_square, zdir='z', offset=offset, cmap=cm.viridis, antialiased=True)
ax.scatter3D(X1_sample, X2_sample, offset, marker='o',edgecolors='k', color='r', s=150)
fig.colorbar(cbar)
for t in ax.zaxis.get_major_ticks(): t.label.set_fontsize(fontsize_ticks)
ax.set_title("$\mu(x_1,x_2)$", fontsize=fontsize_title)
ax.set_xlabel("\n$x_1$", fontsize=fontsize_label)
ax.set_ylabel("\n$x_2$", fontsize=fontsize_label)
ax.set_zlabel('\n\n$\mu(x_1,x_2)$', fontsize=fontsize_label)
plt.xticks(fontsize=fontsize_ticks)
plt.yticks(fontsize=fontsize_ticks)
plt.xlim(left=-bound, right=bound)
plt.ylim(bottom=-bound, top=bound)
ax.set_zlim3d(offset,np.max(z))
which leads to (red dots are the sample points drawn from the function). Note: Code not refactored what so ever :)

Finding the longest linear section of non-linear plot in MATLAB

Apologies for the long post but this takes a bit to explain. I'm trying to make a script that finds the longest linear portion of a plot. Sample data is in a csv file here, it is stress and strain data for calculating the shear modulus of 3D printed samples. The code I have so far is the following:
x_data = [];
y_data = [];
x_data = Data(:,1);
y_data = Data(:,2);
plot(x_data,y_data);
grid on;
answer1 = questdlg('Would you like to load last attempt''s numbers?');
switch answer1
case 'Yes'
[sim_slopes,reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
case 'No'
disp('Take a look at the plot, find a range estimate, and press any button to continue');
pause;
prompt = {'Eliminate values ABOVE this x-value:','Eliminate values BELOW this x-value:','Size of divisions on x-axis:','Factor for similarity of slopes:'};
dlg_title = 'Point elimination';
num_lines = 1;
defaultans = {'0','0','0','0.1'};
if isempty(answer2) < 1
defaultans = {answer2{1},answer2{2},answer2{3},answer2{4}};
end
answer2 = inputdlg(prompt,dlg_title,num_lines,defaultans);
uv_of_x_range = str2num(answer2{1});
lv_of_x_range = str2num(answer2{2});
x_div_size = str2num(answer2{3});
K = str2num(answer2{4});
close all;
iB = find(x_data > str2num(answer2{1}),1,'first');
iS = find(x_data > str2num(answer2{2}),1,'first');
new_x_data = x_data(iS:iB);
new_y_data = y_data(iS:iB);
[sim_slopes, reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
end
[longest_section0, Midx]= max(sim_slopes(:,4)-sim_slopes(:,3));
longest_section=1+longest_section0;
long_sec_x_data_start = x_div_size*(sim_slopes(Midx,3)-1)+lv_of_x_range;
long_sec_x_data_end = x_div_size*(sim_slopes(Midx,4)-1)+lv_of_x_range;
long_sec_x_data_start_idx=find(new_x_data >= long_sec_x_data_start,1,'first');
long_sec_x_data_end_idx=find(new_x_data >= long_sec_x_data_end,1,'first');
long_sec_x_data = new_x_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
long_sec_y_data = new_y_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
[b_long_sec, longes_section_reg_data] = robustfit(long_sec_x_data,long_sec_y_data);
plot(long_sec_x_data,b_long_sec(1)+b_long_sec(2)*long_sec_x_data,'LineWidth',3,'LineStyle',':','Color','k');
function [sim_slopes,reg_data] = regr_and_longest_part(x_points,y_points,x_div,lv,K)
reg_data = cell(1,3);
scatter(x_points,y_points,'.');
grid on;
hold on;
uv = lv+x_div;
ii=0;
while lv <= x_points(end)
if uv > x_points(end)
uv = x_points(end);
end
ii=ii+1;
indices = find(x_points>lv & x_points<uv);
temp_x_points = x_points((indices));
temp_y_points = y_points((indices));
if length(temp_x_points) <= 2
break;
end
[b,stats] = robustfit(temp_x_points,temp_y_points);
reg_data{ii,1} = b(1);
reg_data{ii,2} = b(2);
reg_data{ii,3} = length(indices);
plot(temp_x_points,b(1)+b(2)*temp_x_points,'LineWidth',2);
lv = lv+x_div;
uv = lv+x_div;
end
sim_slopes = NaN(length(reg_data),4);
sim_slopes(1,:) = [reg_data{1,1},0,1,1];
idx=1;
for ii=2:length(reg_data)
coff =sim_slopes(idx,1);
if abs(reg_data{ii,1}-coff) <= K*coff
C=zeros(ii-sim_slopes(idx,3)+1,1);
for kk=sim_slopes(idx,3):ii
C(kk)=reg_data{kk,1};
end
sim_slopes(idx,1)=mean(C);
sim_slopes(idx,2)=std(C);
sim_slopes(idx,4)=ii;
else
idx = idx + 1;
sim_slopes(idx,1)=reg_data{ii,1};
sim_slopes(idx,2)=0;
sim_slopes(idx,3)=ii;
sim_slopes(idx,4)=ii;
end
end
end
Apologies for the code not being well optimized, I'm still relatively new to MATLAB. I did not use derivatives because my data is relatively noisy and derivation might have made it worse.
I've managed to get the get the code to find the longest straight part of the plot by splitting the data up into sections called x_div_size then performing a robustfit on each section, the results of which are written into reg_data. The code then runs through reg_data and finds which lines have the most similar slopes, determined by the K factor, by calculating the average of the slopes in a section of the plot and makes a note of it in sim_slopes. It then finds the longest interval with max(sim_slopes(:,4)-sim_slopes(:,3)) and performs a regression on it to give the final answer.
The problem is that it will only consider the first straight portion that it comes across. When the data is plotted, it has a few parts where it seems straightest:
As an example, when I run the script with answer2 = {'0.2','0','0.0038','0.3'} I get the following, where the black line is the straightest part found by the code:
I have the following questions:
It's clear that from about x = 0.04 to x = 0.2 there is a long straight part and I'm not sure why the script is not finding it. Playing around with different values the script always seems to pick the first longest straight part, ignoring subsequent ones.
MATLAB complains that Warning: Iteration limit reached. because there are more than 50 regressions to perform. Is there a way to bypass this limit on robustfit?
When generating sim_slopes there might be section of the plot whose slope is too different from the average of the previous slopes so it gets marked as the end of a long section. But that section sometimes is sandwiched between several other sections on either side which instead have similar slopes. How would it be possible to tell the script to ignore one wayward section and to continue as if it falls within the tolerance allowed by the K value?
Take a look at the Douglas-Peucker algorithm. If you think of your (x,y) values as the vertices of an (open) polygon, this algorithm will simplify it for you, such that the largest distance from the simplified polygon to the original is smaller than some threshold you can choose. The simplified polygon will be the set of straight lines. Find the two vertices that are furthest apart, and you're done.
MATLAB has an implementation in the Mapping Toolbox called reducem. You might also find an implementation on the File Exchange (but be careful, there is also really bad code on there). Or, you can roll your own, it's quite a simple algorithm.
You can also try using the ischange function to detect changes in the intercept and slope of the data, and then extract the longest portion from that.
Using the sample data you provided, here is what I see from a basic attempt:
>> T = readtable('Data.csv');
>> T = rmmissing(T); % Remove rows with NaN
>> T = groupsummary(T,'Var1','mean'); % Average duplicate timestamps
>> [tf,slopes,intercepts] = ischange(T.mean_Var2, 'linear', 'SamplePoints', T.Var1); % find changes
>> plot(T.Var1, T.mean_Var2, T.Var1, slopes.*T.Var1 + intercepts)
which generates the plot
You should be able to extract the longest segment based on the indices given by find(tf).
You can also tune the parameters of ischange to get fewer or more segments. Adding the name-value pair 'MaxNumChanges' with a value of 4 or 5 produces more linear segments with a tighter fit to the curve, for example, which effectively removes the kink in the plot that you see.

Calculate standard error of contrast using a linear mixed-effect model (fitlme) in MATLAB

I would like to calculate standard errors of contrasts in a linear mixed-effect model (fitlme) in MATLAB.
y = randn(100,1);
area = randi([1 3],100,1);
mea = randi([1 3],100,1);
sub = randi([1 5],100,1);
data = array2table([area mea sub y],'VariableNames',{'area','mea','sub','y'});
data.area = nominal(data.area,{'A','B','C'});
data.mea = nominal(data.mea,{'Baseline','+1h','+8h'});
data.sub = nominal(data.sub);
lme = fitlme(data,'y~area*mea+(1|sub)')
% Plot Area A on three measurements
coefv = table2array(dataset2table(lme.Coefficients(:,2)));
bar([coefv(1),sum(coefv([1 4])),sum(coefv([1 5]))])
Calculating the contrast means, e.g. area1-measurement1 vs area1-measurement2 vs area1-measurement3 can be done by summing the related coefficient parameters. However, does anyone know how to calculate the related standard errors?
I know a hypothesis test can be done by coefTest(lme,H), but only p values can be extracted.
An example for Area A is shown below:
I have resolved this issue!
Matlab uses the 'predict' function to estimate contrasts. To find confidence intervals for area A, at measurement +8h in this particular example use:
dsnew = dataset();
dsnew.area = nominal('A');
dsnew.mea = nominal('+8h');
dsnew.sub = nominal(1);
[yh yCI] = predict(lme,dsnew,'Conditional',false)
A result is shown below:

Solving a system of equations using Python/Scipy for a set of measurements

I have an physical instrument of measurement (force platform with load cells) which gives me three values, A, B and C. It happens, though, that these values - that should be orthogonal - actually are somewhat coupled, due to physical characteristics of the measuring device, which causes cross-talk between applied and returned values of force and torque.
Then, it is recommended that a calibration matrix be used to transform the measured values into a better estimate of the actual values, like this:
The problem is that it is necessary to perform a SET of measurements, so that different measured(Fz, Mx, My) and actual(Fz, Mx, My) are least-squared to get some C matrix that works best for the system as a whole.
I can solve Ax = B problems with scipy.linalg.lststq, or even scipy.linalg.solve (giving an exact solution) for ONE measurement, but how should I proceed to consider a set of different measurements, each one with its own equation giving a potentially different 3x3 matrix?
Any help is much appreciated, thanks for reading.
I posted a similar question containing just the mathematical part of this at math.stackexchange.com, and this answer solved the problem:
math.stackexchange.com/a/232124/27435
In case anyone have a similar problem in the future, here is the almost literal Scipy implementation of that answer (first lines are initialization boilerplate code):
import numpy
import scipy.linalg
### Origin of the coordinate system: upper left corner!
"""
1----------2
| |
| |
4----------3
"""
platform_width = 600
platform_height = 400
# positions of each load cell (one per corner)
loadcell_positions = numpy.array([[0, 0],
[platform_width, 0],
[platform_width, platform_height],
[0, platform_height]])
platform_origin = numpy.array([platform_width, platform_height]) * 0.5
# applying a known force at known positions and taking the measurements
measurements_per_axis = 5
total_load = 50
results = []
for x in numpy.linspace(0, platform_width, measurements_per_axis):
for y in numpy.linspace(0, platform_height, measurements_per_axis):
position = numpy.array([x,y])
for loadpos in loadcell_positions:
moments = platform_origin-loadpos * total_load
load = numpy.array([total_load])
result = numpy.hstack([load, moments])
results.append(result)
results = numpy.array(results)
noise = numpy.random.rand(*results.shape) - 0.5
measurements = results + noise
# now expand ("stuff") the 3x3 matrix to get a linearly independent 3x3 matrix
expands = []
for n in xrange(measurements.shape[0]):
k = results[n,:]
m = measurements[n,:]
expand = numpy.zeros((3,9))
expand[0,0:3] = m
expand[1,3:6] = m
expand[2,6:9] = m
expands.append(expand)
expands = numpy.vstack(expands)
# perform the actual regression
C = scipy.linalg.lstsq(expands, measurements.reshape((-1,1)))
C = numpy.array(C[0]).reshape((3,3))
# the result with pure noise (not actual coupling) should be
# very close to a 3x3 identity matrix (and is!)
print C
Hope this helps someone!

Matlab fourier descriptors what's wrong?

I am using Gonzalez frdescp function to get Fourier descriptors of a boundary. I use this code, and I get two totally different sets of numbers describing two identical but different in scale shapes.
So what is wrong?
im = imread('c:\classes\a1.png');
im = im2bw(im);
b = bwboundaries(im);
f = frdescp(b{1}); // fourier descriptors for the boundary of the first object ( my pic only contains one object anyway )
// Normalization
f = f(2:20); // getting the first 20 & deleting the dc component
f = abs(f) ;
f = f/f(1);
Why do I get different descriptors for identical - but different in scale - two circles?
The problem is that the frdescp code (I used this code, that should be the same as referred by you) is written also in order to center the Fourier descriptors.
If you want to describe your shape in a correct way, it is mandatory to mantain some descriptors that are symmetric with respect to the one representing the DC component.
The following image summarize the concept:
In order to solve your problem (and others like yours), I wrote the following two functions:
function descriptors = fourierdescriptor( boundary )
%I assume that the boundary is a N x 2 matrix
%Also, N must be an even number
np = size(boundary, 1);
s = boundary(:, 1) + i*boundary(:, 2);
descriptors = fft(s);
descriptors = [descriptors((1+(np/2)):end); descriptors(1:np/2)];
end
function significativedescriptors = getsignificativedescriptors( alldescriptors, num )
%num is the number of significative descriptors (in your example, is was 20)
%In the following, I assume that num and size(alldescriptors,1) are even numbers
dim = size(alldescriptors, 1);
if num >= dim
significativedescriptors = alldescriptors;
else
a = (dim/2 - num/2) + 1;
b = dim/2 + num/2;
significativedescriptors = alldescriptors(a : b);
end
end
Know, you can use the above functions as follows:
im = imread('test.jpg');
im = im2bw(im);
b = bwboundaries(im);
b = b{1};
%force the number of boundary points to be even
if mod(size(b,1), 2) ~= 0
b = [b; b(end, :)];
end
%define the number of significative descriptors I want to extract (it must be even)
numdescr = 20;
%Now, you can extract all fourier descriptors...
f = fourierdescriptor(b);
%...and get only the most significative:
f_sign = getsignificativedescriptors(f, numdescr);
I just went through the same problem with you.
According to this link, if you want invariant to scaling, make the comparison ratio-like, for example by dividing every Fourier coefficient by the DC-coefficient. f*1 = f1/f[0], f*[2]/f[0], and so on. Thus, you need to use the DC-coefficient where the f(1) in your code is not the actual DC-coefficient after your step "f = f(2:20); % getting the first 20 & deleting the dc component". I think the problem can be solved by keeping the value of the DC-coefficient first, the code after adjusted should be like follows:
% Normalization
DC = f(1);
f = f(2:20); % getting the first 20 & deleting the dc component
f = abs(f) ; % use magnitudes to be invariant to translation & rotation
f = f/DC; % divide the Fourier coefficients by the DC-coefficient to be invariant to scale