Is my subset sum algorithm of polynomial time?

I came up with a new algorithm to solve the subset sum problem, and I think it's in polynomial time. Tell me I'm either wrong or a total genius.
Quick starter facts:
Subset sum problem is an NP-complete problem. Solving it in polynomial time means that P = NP. The number of subsets in a set of length N, is 2^N.
On the more useful hand, the number of unique subsets in the length N is at maximum the sum of the whole set minus the smallest element, or, the range of sums that any subset can possibly produce is between the sum of all the negative elements and the sum of all the positive elements, since no sum can possibly be bigger or smaller than all the positive or negative sums, which grow at a linear rate when we add extra elements.
What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly. If an algorithm could be devised that could remove the duplicate subsets at the earliest possible opportunity, we would run in polynomial time. A quick example is easily taken from binary. From only the numbers that are powers of two, we can create unique subsets for any integral value. As such, any subset involving any other number (if we had all powers of two) is a duplicate and a waste. By not computing them and their derivatives, we can save virtually all the running time of the algorithm, since the numbers which are powers of two are logarithmically occurring compared to any integer.
As such, I propose a simple algorithm that will remove these duplicates and save having to compute them and all their derivatives.
To begin with, we'll sort the set which is only O(N log N), and split it into two halves, positive and negative. The procedure for the negative numbers is identical, so I'll only outline the positive numbers (the set now means just the positive half, just for clarification).
Imagine an array indexed by sum, which has entries for all the possible result sums of the positive side (which is only linear, remember). When we add an entry, the value is the entries in the subset. So like, array[3] = { 1, 2 }.
In general, we now move to enumerate all subsets in the set. We do this by starting with the subsets of one length, then adding to them. When we have all the unique subsets, they form an array, and we simply iterate them in the fashion used in Horowitz/Sahni.
Now we start with the "first generation" values. That is, if there were no duplicate numbers in the original data set, there are guaranteed to be no duplicates in these values. That is, all single-value subsets, and all length of the set minus one length subsets. These can easily be generated by summing the set, and subtracting each element in turn. In addition the set itself is a valid first generation sum and subset, as well as each individual element of the subset.
Now we do the second generation values. We loop through each value in the array and for each unique subset, if it doesn't have it, we add it and compute the new unique subset. If we have a duplicate, fun occurs. We add it to a collision list. When we come to add new subsets, we check if they're on the collision list. We key by the less desirable (normally larger, but can be arbitrary) equal subset. When we come to add to subsets, if we would generate a collision, we simply do nothing. When we come to add the more desirable subset, it misses the check and adds, generating the common subset. Then we just repeat for the other generations.
By removing duplicate subsets in this manner, we don't have to keep combining the duplicates with the rest of the set, nor keep checking them for collisions, nor sum them. Most importantly, by not creating new subsets that are non-unique, we're not generating new subsets from them, which can, I believe, turn the algorithm from NP to P, since the growth of subsets is no longer exponential- we discard the vast majority of them before they can "reproduce" in the next generation and create more subsets by being combined with the other non-duplicate subsets.
I don't think I've explained this too well. I have pictures... they're in my head. The important thing is that by discarding duplicate subsets, you could remove virtually all of the complexity.
For example, imagine (because I'm doing this example by hand) a simple dataset that goes -7 to 7 (not zero) for which we aim at zero. Sort and split, so we're left with (1, 2, 3, 4, 5, 6, 7). The sum is 28. But 2^7 is 128. So 128 subsets fit in the range 1 .. 28, meaning that we know in advance that 100 sets are duplicates. If we had 8, then we'd only have 36 slots, but now 256 subsets. So you can easily see that the number of dupes would now be 220, greater than double what it was before.
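The counting argument above can be checked by brute force. This is only an illustrative sketch: the helper name `subset_sum_counts` is made up here, and it counts non-empty subsets only, so the exact figures differ by one from the rough numbers quoted above.

```python
from itertools import combinations

def subset_sum_counts(values):
    """Return (number of non-empty subsets, number of distinct sums)."""
    values = list(values)
    sums = []
    for size in range(1, len(values) + 1):
        for combo in combinations(values, size):
            sums.append(sum(combo))
    return len(sums), len(set(sums))

for n in (7, 8):
    subsets, distinct = subset_sum_counts(range(1, n + 1))
    # The gap between the two counts is the number of "duplicate" subsets.
    print(n, subsets, distinct, subsets - distinct)
```

Note that the enumeration itself still visits every subset, so this only confirms the counts; it says nothing about how to avoid the exponential work.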
In this case, the first generation values are 1, 2, 3, 4, 5, 6, 7, 28, 27, 26, 25, 24, 23, 22, 21, and we map them to their constituent components, so
1 = { 1 }
2 = { 2 }
...
28 = { 1, 2, 3, 4, 5, 6, 7 }
27 = { 2, 3, 4, 5, 6, 7 }
26 = { 1, 3, 4, 5, 6, 7 }
...
21 = { 1, 2, 3, 4, 5, 6 }
Now to generate the new subsets, we take each subset in turn and add it to each other subset, unless they have a mutual subsubset, e.g. 28 and 27 have a hueg mutual subsubset. So when we take 1 and we add it to 2, we get 3 = { 1, 2 } but owait! It's already in the array. What this means is that we now don't add 1 to any subset that already has 2 in it, and vice versa, because that's a duplicate on 3's subsets.
Now we have
1 = { 1 }
2 = { 2 }
// Didn't add 1 to 2 to get 3 because that's a dupe
3 = { 3 } // Add 1 to 3, amagad, get a duplicate. Repeat the process.
4 = { 4 } // And again.
...
8 = { 1, 7 }
21? Already has 1 in.
...
27? We already have 28
Now we add 2 to all.
1? Existing duplicate
3? Get a new duplicate
...
9 = { 2, 7 }
10 = { 1, 2, 7 }
21? Already has 2 in
...
26? Already have 28
27? Got 2 in already.
3?
1? Existing dupe
2? Existing dupe
4? New duplicate
5? New duplicate
6? New duplicate
7? New duplicate
11 = { 1, 3, 7 }
12 = { 2, 3, 7 }
13 = { 1, 2, 3, 7 }
As you can see, even though I am still adding new subsets each time, the quantity is only going up linearly.
4?
1? Existing dupe
2? Existing dupe
3? Existing dupe
5? New duplicate
6? New duplicate
7? New duplicate
8? New duplicate
9? New duplicate
14 = {1, 2, 4, 7}
15 = {1, 3, 4, 7}
16 = {2, 3, 4, 7}
17 = {1, 2, 3, 4, 7}
5?
1,2,3,4 existing duplicate
6,7,8,9,10,11,12 new duplicate
18 = {1, 2, 3, 5, 7}
19 = {1, 2, 4, 5, 7}
20 = {1, 3, 4, 5, 7}
21 = new duplicate
Now we have every value in the range, so we stop, and add to our list 1-28. Repeat for negative numbers, iterate through lists.
Edit:
This algorithm is totally wrong in any case. Subsets which have duplicate sums are not duplicates for the purposes of which subsets can be spawned from them, because they are arrived at differently- i.e., they cannot be folded.

This does not prove P = NP.
You have failed to consider the possibility where the positive numbers are: 1, 2, 4, 8, 16, etc... and so there will be no duplicates when you sum subsets, so it will run in O(2^N) time in this case.
You can treat this as a special case, but the algorithm is still not polynomial for other similar cases. The following assumption is where you depart from the NP-complete version of subset sum and end up solving only easy (polynomial time) problems:
[assume the sum of the positive numbers grows] at a linear rate when we add extra elements.
Here you are effectively assuming that P (i.e. number of bits required to state the problem) is smaller than N. Quote from Wikipedia:
Thus, the problem is most difficult if N and P are of the same order.
If you assume that N and P are of the same order then you can't assume that the sum grows linearly indefinitely as you add more elements. As you add more elements to your set, those elements also need to get larger to ensure that the problem remains hard to solve.
If P (the number of place values) is a small fixed number, then there are dynamic programming algorithms that can solve it exactly.
You have rediscovered one of these algorithms. It's a nice piece of work but it isn't something new and it doesn't prove P = NP. Sorry!
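As a rough illustration of the dynamic programming idea mentioned here, the sketch below keeps one representative subset per reachable sum. It is not the asker's exact procedure; the function name `reachable_sums` and the two test inputs are chosen for this example only.

```python
def reachable_sums(values):
    """Map each reachable sum to one subset (a tuple) that produces it."""
    table = {0: ()}  # the empty subset reaches 0
    for v in values:
        # Iterate over a snapshot so each value is used at most once per subset.
        for s, subset in list(table.items()):
            # Keep only the first subset found for a sum; later ones are "duplicates".
            table.setdefault(s + v, subset + (v,))
    return table

dense = reachable_sums(range(1, 8))                    # 1..7: sums 0..28
sparse = reachable_sums([2 ** k for k in range(10)])   # 1, 2, 4, ..., 512
print(len(dense), len(sparse))
```

The table never holds more entries than there are distinct sums, which is why the dense input stays small, while ten powers of two already produce 2^10 entries because no two subsets share a sum.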

Dear MG,
It has been almost half a year since you posted but I will answer anyway.
Mark Byers wrote most of what should be written.
The algorithm is known.
Such algorithms are known as generating functions algorithms or simply as dynamic programming algorithms.
Your algorithm has a very important feature: so-called pseudopolynomial complexity.
Traditional complexity is a function of the size of the problem. In terms of traditional complexity, your algorithm has O(2^n) pessimistic complexity (that is, for the numbers 1, 2, 4, ... as was mentioned earlier).
The complexity of your algorithm can alternatively be expressed as a function of the size of the problem and the size of some numbers in the problem. For your algorithm it would be something like O(nw), where w is the number of distinct sums.
This is pseudopolynomial complexity. It is a VERY important feature. Such algorithms can solve lots of real-world problem instances, despite the problem's complexity class.
The Horowitz and Sahni algorithm has pessimistic complexity O(2^(N/2)). This is not two times better than your algorithm but a lot more: 2^(N/2) times better. What Greg probably meant was that the Horowitz and Sahni algorithm can solve instances of the problem twice as big (having twice as many numbers in the subset sum).
That's true in theory, but in practice Horowitz and Sahni can solve (on home computers) instances with about 60 numbers, while an algorithm similar to yours can handle even instances with 1000 numbers (provided that the numbers aren't too big themselves).
In fact the two kinds of algorithm, yours and Horowitz and Sahni's, can even be mixed. Such a solution has both pseudopolynomial complexity and pessimistic complexity of O(2^(n/2)).
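A minimal sketch of such a mix, assuming a hash-set lookup in place of the sorted-list merge used in the original Horowitz and Sahni formulation. The function names are invented for this example, and the empty subset is allowed, so a target of 0 is always reported as reachable.

```python
def all_subset_sums(values):
    """Return the set of sums over every subset of values (empty subset included)."""
    sums = {0}
    for v in values:
        # Duplicate sums collapse in the set, which gives the pseudopolynomial bound.
        sums |= {s + v for s in sums}
    return sums

def has_subset_sum(values, target):
    """Meet-in-the-middle decision: does some subset of values sum to target?"""
    values = list(values)
    half = len(values) // 2
    left = all_subset_sums(values[:half])    # at most 2^(n/2) sums
    right = all_subset_sums(values[half:])   # at most 2^(n/2) sums
    # A solution splits into a left part s and a right part target - s.
    return any(target - s in right for s in left)

print(has_subset_sum([3, 34, 4, 12, 5, 2], 9))
print(has_subset_sum([3, 34, 4, 12, 5, 2], 30))
```

Each half stores at most 2^(n/2) sums in the worst case, and far fewer when many subsets share a sum.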
A trained computer scientist can construct such an algorithm as yours by means of generating functions theory.
Both trained and untrained can think it up the way you did.
Do not necessarily think in terms "is it known?". It should be important to you that you can invent such algorithm on your own. It means that you probably can invent other important algorithms on your own and someday one that isn't known maybe. Knowing current progress in the field and what's in the literature helps. Otherwise you will keep on reinventing the wheel.
When I was way back in high school I reinvented Dijkstra algorithm. My version had terrible complexity because I didn't know anything about data structures. Anyway, I am still proud of myself.
If you are still studying pay attention to generating functions theory.
You may also want to check out on wiki:
pseudopolynomial time
weakly NP-complete
strongly NP-complete
generating functions
Megli

What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly.
Not necessarily - the number of duplicate subset sums is also determined by the value of the number closest to zero in the set (that is, the greater the minimum distance to zero, the fewer the duplicate subset sums for the set).
In general, we now move to enumerate all subsets in the set.
Unfortunately, enumerating all the sums of the subsets of the set requires performing an exponential number of addition operations (2^7 or 128 in your example). Otherwise, how would the algorithm determine what the unique sums happen to be? So, although the steps that follow the first step could very well have a polynomial running time, the algorithm as a whole has exponential complexity (because an algorithm is only as fast as its slowest part).
Incidentally, the best known algorithm for solving the subset sum problem (Horowitz and Sahni, 1974) has O(2^(N/2)) complexity - which makes it about twice as fast as this algorithm.

Related

Finding permutations under a set of linear constraints

I have two variables A and B, which both are 1x250 doubles.
Variables A and B are related to one another.
In other words, A(1) and B(1) are a pair (cause and effect).
A = [100, 1000, 254, 21,.....]
B = [-30, -100, -1254, -821,.....]
The problem is to find combinations in A that can meet two conditions below
sums of any combinations in A should be <= 500
At the same time, sum of corresponding value in B should be less than -5000
I tried to use nchoosek, but it really exploded after a few iterations due to the size of my variables.
What you said is pretty much impossible to do in terms of time complexity. If you need to check every combination in set of size of 250, you will need at least 2^250 operations, so that's order of magnitude of 10^75.
Unless you have some data about the values of the set, which lets you skip most of the combinations, or you limit yourself to combinations of 2 or 3, it's impossible to be done in sensible time.
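To put the figure in perspective, here is a back-of-the-envelope calculation. The rate of one billion combinations checked per second is purely an assumption made for this illustration.

```python
n_subsets = 2 ** 250             # every subset of a 250-element set
checks_per_second = 1e9          # assumed rate, not a measured one
seconds_per_year = 3600 * 24 * 365

years = n_subsets / checks_per_second / seconds_per_year
print(f"{n_subsets:.3e} subsets, about {years:.1e} years at the assumed rate")
```

Even under that generous assumption the run time is far beyond anything practical, which is why pruning or a small fixed combination size is needed.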

Online Algorithm approach for alternating subsequence

Consider a sequence A = a1, a2, a3, ..., an of integers. A subsequence B of A is a sequence B = b1, b2, ..., bm which is created from A by removing some elements but keeping the order. Given an integer sequence A, the goal is to compute an alternating subsequence B, i.e. a sequence b1, ..., bm such that for all i in {2, 3, ..., m-1}, if b{i-1} < b{i} then b{i} > b{i+1} and if b{i-1} > b{i} then b{i} < b{i+1}.
Consider an online version of the problem, where the sequence A is given element-by-element and each time, one needs to directly decide whether to include the next element in the subsequence B. Is it possible to achieve a constant competitive ratio (by using a deterministic online algorithm)? Either give an online algorithm which achieves a constant competitive ratio or show that it is not possible to find such an online algorithm.
Assume sequence [9,8,9,8,9,8, .... , 9,8,9,8,2,1,2,9,8,9, ... , 8,9,8,9,8,9]
My Argumentation:
The algorithm must decide immediately if it inserts an incoming number into the subsequence. If the algorithm now gets the numbers 1 then 2 then 2 it will eventually decide that they are part of the sequence and thus by a nonlinear factor worse than the optimal solution of n-3.
-> No constant competitive ratio!
Is this a proper argumentation?
If I understood what you meant, your argument is correct, but the sequence you gave in the example is wrong. For example, the algorithm may choose all the 9's and 8's.
You can alter your argument slightly to make it more accurate, for example consider the sequence
3,4,3,4,3,4,......, 1/5,2/6,1/5,2/6,....
Explanation:
You start the sequence with 3,4,3,4,... etc. until the algorithm picks two numbers. If it never does, it's obviously not competitive (it gets 0/1 out of n)
If the algorithm picked a 3, then 4, the algorithm must next take a number lower than 4. By continuing with 5,6,5,6,... the algorithm cannot take another number.
If the algorithm chose to take a 4 then a 3, by similar reasoning we can easily see how continuing with 1,2,1,2,... prevents the algorithm from taking another number.
Thus, in any case, the algorithm cannot take more than 2 numbers for every n, which, as you stated, isn't a constant competitive ratio.

How to embed a sentence into vector

I have a sentence. I use word2vec to embed each word into a vector. For example, consider a sentence of 5 words, so I get 5 different vectors (one for each word) for the sentence. Which is the best method to turn the complete sentence into a single vector which I will pass to the ANN?
This is an open problem; many approaches exist to creating meaningful sentence vectors.
BoW models, as Fabrizio_P explained
Element-wise vector operations (http://www.aclweb.org/anthology/P/P08/P08-1028.pdf)
Addition (i.e. simply add all the word vector together, possibly normalizing afterwards)
Multiplication (i.e. multiply all vectors together, element-wise, resulting in a logically grounded embedding)
Arbitrary task-specific recurrent functions (http://www.aclweb.org/anthology/D12-1110)
More sophisticated general-purpose approaches (https://arxiv.org/abs/1508.02354, https://arxiv.org/abs/1506.06726)
Element-wise operations, such as vector addition, suffice for most simple tasks, but obviously exhibit a high amount of information loss as sentences grow larger or the task at hand gets more demanding. Recurrent neural networks are quite good at creating task specific sentence embeddings, but obviously these require training data and some familiarity with machine learning. General purpose sentence embeddings are the most interesting ones from a research perspective, but probably not what you're looking for.
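A minimal sketch of the addition approach, averaging the word vectors so the result does not depend on sentence length. The three-dimensional `toy_embeddings` are invented for illustration; real word2vec vectors would come from a trained model.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical embeddings; a real model would supply these.
toy_embeddings = {
    "hello": [1.0, 0.0, 2.0],
    "my":    [2.0, 3.0, 0.0],
    "name":  [3.0, 3.0, 1.0],
}

sentence = "hello my name".split()
sentence_vector = average_vectors([toy_embeddings[w] for w in sentence])
print(sentence_vector)
```

Word order is lost entirely with this approach, which is one source of the information loss mentioned above.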
You could use the bag of words concept, as explained here https://machinelearningmastery.com/gentle-introduction-bag-words-model/. So that you collect all of you words and put them in a vocabulary. After that you can represent your sentence as a vector, where each element is either 1 or 0, depending on whether the word is in the sentence or not.
For example if your sentence is
Hello my name is Peter.
Your dictionary will be
[Hello, my, name, is, Peter]
The vector for your sentence will be
[1, 1, 1, 1, 1]
If you have another sentence like
I am happy.
Your dictionary will extend including also those words. So it will be
[Hello, my, name, is, Peter, I, am, happy]
And your vector sentence will also extend
[1, 1, 1, 1, 1, 0, 0, 0]
As an alternative you can also create a vocabulary where each word is represented by a number, so that
{Hello: 1, my: 2, name: 3, is: 4, Peter: 5, I: 6, am: 7, happy: 8}
And the vector for your sentence will be
[1, 2, 3, 4, 5]
For each new sentence you will convert the words into numbers using the vocabulary as reference.
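The binary bag-of-words encoding described above could be sketched as follows. The helper names are made up for this example, and it assumes punctuation has already been stripped and that words are separated by spaces.

```python
def build_vocabulary(sentences):
    """Collect words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def binary_vector(sentence, vocab):
    """1 where the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.split())
    return [1 if w in words else 0 for w in vocab]

sentences = ["Hello my name is Peter", "I am happy"]
vocab = build_vocabulary(sentences)
vectors = [binary_vector(s, vocab) for s in sentences]
print(vocab)
print(vectors)
```

Every sentence gets a vector of the same length as the vocabulary, so the vectors can be fed to a network with a fixed input size.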
word2vec is an algorithm to create word embeddings, you can read the details here https://www.tensorflow.org/tutorials/word2vec
You can run this algorithm on your own dataset, or use saved word embeddings that Google (or other parties) have trained on billions of documents.
The idea is to map each word as dense vector in some n-dimensional vector space, thus containing much more information about words and their relationships.
Put simply each word is represented by a unique list of numbers, and now mathematical operations are possible on words, sentences and documents.

Generate matrix of random number with constraints in matlab

I want to generate a matrix of random numbers (normrnd with mean == 0) that satisfy the following constraints using MATLAB (or any other language)
The sum of the absolute values in the matrix must equal X
The largest abs(single number) must equal Y
The difference between the number and its 8 neighbors (3 if in corner, 5 if on edge) must be less than Z
It would be relatively easy to satisfy one of the constraints, but I can't think of an algorithm that satisfies all of them...
Any ideas?
I am not sure whether to edit my post or to reply here, so I am editing... @MZimmerman6, you have a point. Though these constraints won't produce a unique solution, how would I obtain multiple solutions without using rand?
A very simple 3 x 3 example, where 5 is the max element value, 30 is the sum, and 2 is the difference
5 4 3
4 4 2
3 2 3
Rody, that actually may help...I need to think more :)
Luis ...Hmmm...why not? I can add up the absolute value of a normally distributed sample...right?
Here is an algorithm to get the 'random' numbers that you need.
1. Generate a valid number (for example in the middle)
2. Determine the feasible range for one of the numbers next to it
3. If there is no range, you go to step 1, otherwise generate a number and continue
Depending on your constraints it may take a while of course. You could add an other step to see if changing the existing numbers would help before going back to step 1.

Random numbers that add to 1 with a minimum increment: Matlab

Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1, however I want an added constraint that the minimum increment (or if you like number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However, though this works for small values of n & small values of increment, if for example n=1,000 & the increment is 0.1%, then over a large number of trials the first and last numbers have a mean which is consistently below 0.1%.
I am sure there is a logical explanation/solution to this, but I have been tearing my hair out trying to find it & wondered if anybody would be so kind as to point me in the right direction. To put the problem into context, the aim is to create random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far, just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1% so 0%, 0.1%, 0.2% etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse, the mean for the 1st & last weights is well below 0.1%.....
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
stockmatrix=1:stocks;
base_weight=1/stocks;
r=randi(stocks,stocks,simulations);
x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which was important considering I want to run a total of 10,000,000 simulations, 10,000 simulations on 1,000 stocks takes just over 2 seconds with a single core & I am running the whole code on an 8 core machine using the parallel toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to get a portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue is that although it works for 1,000 stocks & an increment of 0.1%, and I guess it works for 100 stocks & an increment of 1%, as the number of stocks decreases each pick becomes a very large percentage (in the extreme, with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited).
However, my question has arisen because I would like to make my approach more flexible & have the possibility of, say, 3 stocks & an increment of 1%, which my current code will not handle correctly, hence how I stumbled across the original question on stackoverflow.
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights & numbers of stock in the index, but it appears my programming & probability theory ability are a limiting factor.......
One problem I can see is that your formula allows for numbers to be zero - when the rounding operation results in two consecutive numbers to be the same after sorting. Not sure if you consider that a problem - but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want them to be: If you have uniformly distributed numbers from 0 to 1000, and you round them, the numbers that round to 0 were in the interval [0 0.5>; the ones that round to 1 came from [0.5 1.5> - twice as big. The last number (rounding to 1000) is again from a smaller interval: [999.5 1000]. Thus you will not get the first and last number as often as you think. If instead of round you use floor I think you will get the answer you expect.
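The interval argument can be checked without any randomness by pushing an evenly spaced grid of points through both rounding rules. This is only a sketch: `bucket_counts` and the grid resolution are choices made for this example, and `math.floor(x + 0.5)` stands in for MATLAB's round.

```python
import math
from collections import Counter

def bucket_counts(to_bucket, top=1000, points_per_unit=100):
    """Count how many evenly spaced points in (0, top) land in each integer bucket."""
    counts = Counter()
    for i in range(top * points_per_unit):
        x = (i + 0.5) / points_per_unit  # grid points never sit on a bucket boundary
        counts[to_bucket(x)] += 1
    return counts

rounded = bucket_counts(lambda x: math.floor(x + 0.5))  # round half up
floored = bucket_counts(math.floor)

print(rounded[0], rounded[1], rounded[1000])  # end buckets get half the share
print(floored[0], floored[1], floored[999])   # every bucket gets the same share
```

With floor the buckets are 0 to 999 rather than 0 to 1000, so the top value is never produced; whether that matters depends on how the cut points are used afterwards.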
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
% returns an array of n random integers
% whose sum is limit
% calls itself recursively; slow but accurate
if n>1
    v = binomialRandom(limit, 1 / n);
    r = [v randomInt(n-1, limit - v)];
else
    r = limit;
end

function b = binomialRandom(N, p)
b = sum(rand(1,N)<p); % slow but direct
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100, and standard deviation of sqrt(p*(1-p)*n). In this case, p=0.1 so we expect s = 9.4868. The values I actually got were again quite close.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency. I find that clarity trumps speed when you develop something new. But for instance you could pre-compute the cumulative binomial distributions for p=1./(1:10), then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute; unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
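For readers without MATLAB, the same idea might look like this in Python. It is a loose port rather than a line-for-line translation: the recursion is unrolled into a loop, and the names `binomial_random` and `random_int_parts` are invented here.

```python
import random

def binomial_random(n, p, rng):
    """Binomial(n, p) variate by direct simulation; slow but simple."""
    return sum(rng.random() < p for _ in range(n))

def random_int_parts(n, limit, rng=random):
    """Return n non-negative integers that sum to limit."""
    parts = []
    remaining = limit
    # With k slots left, the next slot takes Binomial(remaining, 1/k).
    for slots_left in range(n, 1, -1):
        v = binomial_random(remaining, 1 / slots_left, rng)
        parts.append(v)
        remaining -= v
    parts.append(remaining)  # the last slot takes whatever is left
    return parts

print(random_int_parts(10, 1000, random.Random(1)))
```

Passing a seeded `random.Random` instance makes runs repeatable; the default uses the module-level generator.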
Eventually I have solved this problem!
I found a paper by 2 academics at Johns Hopkins University, "Sampling Uniformly From The Unit Simplex"
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorithms don't work, in a way very similar to woodchips' answer to the Random numbers that add to 100 question. They then go on to show that the method suggested by David Schwartz can also be slightly biased and propose a modified algorithm which appears to work.
If you want x numbers that sum to y
Sample uniformly x-1 random numbers from the range 1 to x+y-1 without replacement
Sort them
Add a zero at the beginning & x+y at the end
difference them & subtract 1 from each value
If you want to scale them as I do, then divide by y
It took me a while to realise why this works when the original approach didn't, and it comes down to the probability of getting a zero weight (as highlighted by Floris in his answer). To get a zero weight in the original version, for all but the 1st or last weights your random numbers had to have 2 values the same, but for the 1st & last ones a random number of zero or the maximum number would result in a zero weight, which is more likely.
In the revised algorithm, zero & the maximum number are not in the set of random choices & a zero weight occurs only if you select two consecutive numbers which is equally likely for every position.
I coded it up in Matlab as follows
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
    scaling=1/min_increment;
    sample=NaN(num_simulations,num_stocks-1);
    for i=1:num_simulations
        allcomb=randperm(scaling+num_stocks-1);
        sample(i,:)=allcomb(1:num_stocks-1);
    end
    temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];
    weights = (diff(temp,[],2)-1)/scaling;
end
Obviously the loop is a bit clunky and as I'm using the 2009 version the randperm function only allows you to generate permutations of the whole set, however despite this I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop which is fast enough.
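The steps listed earlier (x numbers that sum to y) can also be sketched in Python for a single draw. The function name `simplex_grid_sample` is made up for this example; `random.sample` draws without replacement, which removes the need for a full permutation.

```python
import random

def simplex_grid_sample(x, y, rng=random):
    """Return x non-negative integers that sum to y."""
    # Choose x-1 distinct cut points from 1 .. x+y-1, then sort them.
    cuts = sorted(rng.sample(range(1, x + y), x - 1))
    # Add 0 at the beginning and x+y at the end.
    edges = [0] + cuts + [x + y]
    # Each gap is at least 1; subtracting 1 lets a part be zero.
    return [b - a - 1 for a, b in zip(edges, edges[1:])]

parts = simplex_grid_sample(10, 1000, random.Random(0))
weights = [p / 1000 for p in parts]  # scale to weights with increment 0.001
print(parts)
```

Here x plays the role of num_stocks and y the role of scaling in the MATLAB function above.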
The mean weights are now correct & as a quick test I replicated woodchips generating 3 numbers that sum to 1 with the minimum increment being 0.01% & it also looks right.
Thank you all for your help and I hope this solution is useful to somebody else in the future
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with a constraint of a minimum increment of 5. I would first find 11 numbers that sum to 45, with no lower bound on the samples (other than zero.) I could use a tool from the file exchange for this. Simplest is to simply sample 10 numbers in the interval [0,45]. Sort them, then find the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of numbers that sums to 45. But the vector Y sums to 100, with a minimum value of 5.
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.
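A Python sketch of the same transformation, for one sample. The function name and parameters are illustrative only; as in the MATLAB line above, the constraint is treated as a minimum value for each number, and the sorted-uniform-differences scheme is used for the unconstrained part.

```python
import random

def sample_with_minimum(count, total, minimum, rng=random):
    """Return count floats, each >= minimum, that sum to total."""
    slack = total - count * minimum  # what is left once every number has its minimum
    if slack < 0:
        raise ValueError("count * minimum exceeds total")
    cuts = sorted(rng.random() for _ in range(count - 1))
    edges = [0.0] + cuts + [1.0]
    # Split the slack by the gaps between sorted cut points, then shift by the minimum.
    return [minimum + slack * (b - a) for a, b in zip(edges, edges[1:])]

y = sample_with_minimum(11, 100, 5, random.Random(0))
print(y)
```

With 11 numbers, a total of 100 and a minimum of 5, the slack is 45, matching the example above.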