How to get a new solution with high probability from previously found, incomplete solutions with different probabilities? - matlab

I am working on a AI algorithm. first when program runs a random solution is generated from which ,in first iteration of the program 10 solution vectors are created, by analyzing these solutions we could give each of them a probability ( highest , second highest, third highest and so on) towards the optimal solution , for the second input of the program I want it to be a vector (possible solution) obtained from those 10 vectors previously found. But i need the vector solution to consider all the previous solutions with a different impact depending on their probability ...
i.e A=[4.7 ,5.6, 3.5,9 ] b=[-7.9 ,8 ,-2.8 ,4.6] c=[7 ,9.7 , 4,6,3.9] ......
i used mean in my program
NextPossibleSolution = mean(([A;B;C;]))
But do you think mean is the right move ? i don't think because all the solution contributes equal to Next Possible Solution (next input) regardless of their likelihood ... Please if there is a method formula or anything , Let me know that ... I really need it badly .... A Billion Thanks

Related

Johansen test on two stocks (for pairs trading) yielding weird results

I hope you can help me with this one.
I am using cointegration to discover potential pairs trading opportunities within stocks and more precisely I am utilizing the Johansen trace test for only two stocks at a time.
I have several securities, but for each test I only test two at a time.
If two stocks are found to be cointegrated using the Johansen test, the idea is to define the spread as
beta' * p(t-1) - c
where beta'=[1 beta2] and p(t-1) is the (2x1) vector of the previous stock prices. Notice that I seek a normalized first coefficient of the cointegration vector. c is a constant which is allowed within the cointegration relationship.
I am using Matlab to run the tests (jcitest), but have also tried utilizing Eviews for comparison of results. The two programs yields the same.
When I run the test and find two stocks to be cointegrated, I usually get output like
beta_1 = 12.7290
beta_2 = -35.9655
c = 121.3422
Since I want a normalized first beta coefficient, I set beta1 = 1 and obtain
beta_2 = -35.9655/12.7290 = -2.8255
c =121.3422/12.7290 = 9.5327
I can then generate the spread as beta' * p(t-1) - c. When the spread gets sufficiently low, I buy 1 share of stock 1 and short beta_2 shares of stock 2 and vice versa when the spread gets high.
~~~~~~~~~~~~~~~~ The problem ~~~~~~~~~~~~~~~~~~~~~~~
Since I am testing an awful lot of stock pairs, I obtain a lot of output. Quite often, however, I receive output where the estimated beta_1 and beta_2 are of the same sign, e.g.
beta_1= -1.4
beta_2= -3.9
When I normalize these according to beta_1, I get:
beta_1 = 1
beta_2 = 2.728
The current pairs trading literature doesn't mention any cases where the betas are of the same sign - how should it be interpreted? Since this is pairs trading, I am supposed to long one stock and short the other when the spread deviates from its long run mean. However, when the betas are of the same sign, to me it seems that I should always go long/short in both at the same time? Is this the correct interpretation? Or should I modify the way in which I normalize the coefficients?
I could really use some help...
EXTRA QUESTION:
Under some of my tests, I reject both the hypothesis of r=0 cointegration relationships and r<=1 cointegration relationships. I find this very mysterious, as I am only considering two variables at a time, and there can, at maximum, only be r=1 cointegration relationship. Can anyone tell me what this means?

Periodic to incremental (data reduction)

It's some time I was thinking about solving this problem. I have a registration of angular data (Angle(~20000,1)) variating between 0 and 355 (a potentiometer attached to a rotary testing machine), and I wanted to convert it in an incremental form, since I want the final total angular displacement. The main issue is that between 355 and the next 0 there are no jumps but a fast decrement (with strongly negative slope in time vs angle space). I've tried 2 ways up to now:
Calculate the Angslope=diff(Angle), extract with find the indexes j1=find(Angslope>0.2 & Angslope<0.2) to avoid the negative slopes due to the inversion of angular signal, then try to apply those indexes to the original Angle(n,1), as Angle2=Angle(j1). The trouble is the n-1 length of Angslope and the fact that somehow there is not a simple shift of my indexes of one position.
For cycles and logical, wanting to exclude data if the previous one is < the current value,etc
Angle2=zeros(size(Angle,1),1);
for i=2:size(Angle,1)
if Angle(i,1)<Angle(i-1,1)
Angle2(i,1)=NaN;
else Angle2(i,1)=Angle(i,1);
end
end
Which works good, but I don't know how to "match up" the single increment steps I obtain!
Any help or simple comment would be of great help!!
You are probably looking for the unwrap function. For this you have to convert your angles into radians, but that's not a big deal.
You can get the increments in one line:
Inc = diff(unwrap(Angle*pi/180))*180/pi;
and your total angular displacement:
Tot = sum(Inc);
Best,

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criteria is that each partition must be in between 60 to 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem into a linear programming problem. You would choose an objective function that would maximize the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
(source: codecogs.com)
.
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
(source: codecogs.com)
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are matrices and vectors that define your inequality constraints. Because you want equality, you'd set this to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set that you are bounding with, so this would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each value of the parameters are bounded between these two ranges.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Due to the fact that linprog can generate floating point solutions, what you can do is if you want to ensure that for a given value of n, you could loop over your results, take the floor of the result and subtract with the final result, and sum over the result. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(#(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem. This problem is NP-complete. Basically, what this means is that it is not known whether there is a polynomial-time algorithm that can be used to solve this kind of problem and that the only way to get a valid solution is to brute-force each possible solution and check if it works with the specified problem. Usually, brute-forcing solutions requires exponential time, which is very intractable for large problems. Another interesting fact is that modern cryptography algorithms use NP-Complete intractability as part of their ciphertext and encrypting. Basically, they're banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key will take more than the current age of the universe. Check out this cool Wikipedia post for more details in intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
(source: quickmeme.com)
The NP problem is amongst one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

Random numbers that add to 1 with a minimum increment: Matlab

Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1, however I want an added constraint that the minimum increment (or if you like number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However though this works for small values of n & small values of increment, if for example n=1,000 & the increment is 0.1% then over a large number of trials the first and last numbers have a mean which is consistently below 0.1%.
I am sure there is a logical explanation/solution to this but I have been tearing my hair out to try & find it & wondered anybody would be so kind as to point me in the right direction. To put the problem into context create random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far, just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1% so 0%, 0.1%, 0.2% etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse, the mean for the 1st & last weights is well below 0.1%.....
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
stockmatrix=1:stocks;
base_weight=1/stocks;
r=randi(stocks,stocks,simulations);
x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which was important considering I want to run a total of 10,000,000 simulations, 10,000 simulations on 1,000 stocks takes just over 2 seconds with a single core & I am running the whole code on an 8 core machine using the parallel toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to geta portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue issue is that although it works for 1,000 stocks & an increment of 0.1% and I guess it works for 100 stocks & an increment of 1%, as the number of stocks decreases then each pick becomes a very large percentage (in the extreme with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited)
However my question has arrisen because I would like to make my approach more flexible & have the possibility of say 3 stocks & an increment of 1% which my current code will not handle correctly, hence how I stumbled accross the original question on stackoverflow
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights & numbers of stock in the index, but it appears my programming & probability theory ability are a limiting factor.......
One problem I can see is that your formula allows for numbers to be zero - when the rounding operation results in two consecutive numbers to be the same after sorting. Not sure if you consider that a problem - but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want them to be: If you have uniformly distributed numbers from 0 to 1000, and you round them, the numbers that round to 0 were in the interval [0 0.5>; the ones that round to 1 came from [0.5 1.5> - twice as big. The last number (rounding to 1000) is again from a smaller interval: [999.5 1000]. Thus you will not get the first and last number as often as you think. If instead of round you use floor I think you will get the answer you expect.
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
% returns an array of n random integers
% whose sum is limit
% calls itself recursively; slow but accurate
if n>1
v = binomialRandom(limit, 1 / n);
r = [v randomInt(n-1, limit - v)];
else
r = limit;
end
function b = binomialRandom(N, p)
b = sum(rand(1,N)<p); % slow but direct
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100, and standard deviation of sqrt(p*(1-p)*n). In this case, p=0.1 so we expect s = 9.4868. The values I actually got were again quite close.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency. I find that clarity trumps speed when you develop something new. But for instance you could pre-compute the cumulative binomial distributions for p=1./(1:10), then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute; unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
Eventually I have solved this problem!
I found a paper by 2 academics at John Hopkins University "Sampling Uniformly From The Unit Simplex"
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorthms don't work, in a way very similar to woodchips answer to the Random numbers that add to 100 question. They then go on to show that the method suggested by David Schwartz can also be slightly biased and propose a modified algorithm which appear to work.
If you want x numbers that sum to y
Sample uniformly x-1 random numbers from the range 1 to x+y-1 without replacement
Sort them
Add a zero at the beginning & x+y at the end
difference them & subtract 1 from each value
If you want to scale them as I do, then divide by y
It took me a while to realise why this works when the original approach didn't and it come down to the probability of getting a zero weight (as highlighted by Floris in his answer). To get a zero weight in the original version for all but the 1st or last weights your random numbers had to have 2 values the same but for the 1st & last ones then a random number of zero or the maximum number would result in a zero weight which is more likely.
In the revised algorithm, zero & the maximum number are not in the set of random choices & a zero weight occurs only if you select two consecutive numbers which is equally likely for every position.
I coded it up in Matlab as follows
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
scaling=1/min_increment;
sample=NaN(num_simulations,num_stocks-1);
for i=1:num_simulations
allcomb=randperm(scaling+num_stocks-1);
sample(i,:)=allcomb(1:num_stocks-1);
end
temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];
weights = (diff(temp,[],2)-1)/scaling;
end
Obviously the loop is a bit clunky and as I'm using the 2009 version the randperm function only allows you to generate permutations of the whole set, however despite this I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop which is fast enough.
The mean weights are now correct & as a quick test I replicated woodchips generating 3 numbers that sum to 1 with the minimum increment being 0.01% & it also look right
Thank you all for your help and I hope this solution is useful to somebody else in the future
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with a constraint of a minimum increment of 5. I would first find 11 numbers that sum to 45, with no lower bound on the samples (other than zero.) I could use a tool from the file exchange for this. Simplest is to simply sample 10 numbers in the interval [0,45]. Sort them, then find the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of numbers that sums to 45. But the vector Y sums to 100, with a minimum value of 5.
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.

Problem using the find function in MATLAB

I have two arrays of data that I'm trying to amalgamate. One contains actual latencies from an experiment in the first column (e.g. 0.345, 0.455... never more than 3 decimal places), along with other data from that experiment. The other contains what is effectively a 'look up' list of latencies ranging from 0.001 to 0.500 in 0.001 increments, along with other pieces of data. Both data sets are X-by-Y doubles.
What I'm trying to do is something like...
for i = 1:length(actual_latency)
row = find(predicted_data(:,1) == actual_latency(i))
full_set(i,1:4) = [actual_latency(i) other_info(i) predicted_info(row,2) ...
predicted_info(row,3)];
end
...in order to find the relevant row in predicted_data where the look up latency corresponds to the actual latency. I then use this to created an amalgamated data set, full_set.
I figured this would be really simple, but the find function keeps failing by throwing up an empty matrix when looking for an actual latency that I know is in predicted_data(:,1) (as I've double-checked during debugging).
Moreover, if I replace find with a for loop to do the same job, I get a similar error. It doesn't appear to be systematic - using different participant data sets throws it up in different places.
Furthermore, during debugging mode, if I use find to try and find a hard-coded value of actual_latency, it doesn't always work. Sometimes yes, sometimes no.
I'm really scratching my head over this, so if anyone has any ideas about what might be going on, I'd be really grateful.
You are likely running into a problem with floating point comparisons when you do the following:
predicted_data(:,1) == actual_latency(i)
Even though your numbers appear to only have three decimal places of precision, they may still differ by very small amounts that are not being displayed, thus giving you an empty matrix since FIND can't get an exact match.
One feature of floating point numbers is that certain numbers can't be exactly represented, since they aren't an integer power of 2. This occurs with the numbers 0.1 and 0.001. If you repeatedly add or multiply one of these numbers you can see some unexpected behavior. Amro pointed out one example in his comment: 0.3 is not exactly equal to 3*0.1. This can also be illustrated by creating your look-up list of latencies in two different ways. You can use the normal colon syntax:
vec1 = 0.001:0.001:0.5;
Or you can use LINSPACE:
vec2 = linspace(0.001,0.5,500);
You'd think these two vectors would be equal to one another, but think again!:
>> isequal(vec1,vec2)
ans =
0 %# FALSE!
This is because the two methods create the vectors by performing successive additions or multiplications of 0.001 in different ways, giving ever so slightly different values for some entries in the vector. You can take a look at this technical solution for more details.
When comparing floating point numbers, you should therefore do your comparisons using some tolerance. For example, this finds the indices of entries in the look-up list that are within 0.0001 of your actual latency:
tolerance = 0.0001;
for i = 1:length(actual_latency)
row = find(abs(predicted_data(:,1) - actual_latency(i)) < tolerance);
...
The topic of floating point comparison is also covered in this related question.
You may try to do the following:
row = find(abs(predicted_data(:,1) - actual_latency(i))) < eps)
EPS is accuracy of floating-point operation.
Have you tried using a tolerance rather than == ?