Cumulative counting of the number of occurrences of integers in a MATLAB array, including zero occurrences

I am constructing a classification tree in MATLAB, but I have too many features and I want to select the subset of features that maximizes my accuracy. To do that, I have a brute-force algorithm that tries all possible subsets of features (e.g. if my data has 4 features, I would try selecting features (1, 2), (1, 3), (1, 4), (2, 1), ...; (1, 2, 3), (1, 2, 4), etc.). If, for a given subset of features, my accuracy is above 84%, I want to save the feature numbers that were used.
So what I would like to do is have an array that can count the number of occurrences of each feature number for each feature set that produced an accuracy above 84%. This array would also have to be cumulative over all iterations of feature combinations.
I've seen other posts about counting integer occurrences in MATLAB arrays, but in my case (a) the size of the array to be analyzed changes throughout execution, and (b) if there are no occurrences of a number, I want the counting array to show 0 in its place.
For example, if I have two feature subsets [1 2 7] and [1 3 6 8] I would like my counting array to be
[2 1 1 0 0 1 1 1 0 0] (assuming I have ten features total)
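One way to handle both requirements (a subset whose size changes every iteration, and explicit zeros for features that never occur) is to preallocate a fixed-length counting array and accumulate into it whenever a subset passes the accuracy threshold. A minimal sketch, assuming ten features in total and that the 84% check happens elsewhere in the loop (variable names are illustrative):

numFeatures = 10;                          % total number of features (assumed)
featureCounts = zeros(numFeatures, 1);     % cumulative counts, initialised once

% inside the brute-force loop, for every subset that scores above 84%:
goodSubset = [1 3 6 8];                    % example qualifying subset
featureCounts = featureCounts + ...
    accumarray(goodSubset(:), 1, [numFeatures 1]);

Because accumarray is given an explicit size, feature numbers that never occur keep a count of 0, which produces exactly the [2 1 1 0 0 1 1 1 0 0] style of output after processing [1 2 7] and [1 3 6 8].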

Related

Finding the rows of a matrix with specified elements

I want to find the rows of a matrix which contain specified elements of another matrix.
For example, a=[1 2 3 4 5 6 7] and b=[1 2 0 4;0 9 10 11;3 1 2 12]. Now, I want to find the rows of b which contain at least three elements of a. For this purpose, I used the bsxfun command as follows:
c = find(sum(any(bsxfun(@eq, b, reshape(a,1,1,[])), 2), 3) >= 3);
It works well for small matrices, but when I use it on large matrices, for example when the number of rows of b is 192799, MATLAB gives the following error:
Requested 192799x4x48854 (35.1GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause MATLAB
to become unresponsive. See array size limit or preference panel for more information.
Is there any other command that performs this task without running into the problem above for large matrices?
A possible solution:
a = [1 2 3 4 5 6 7];
b = [1 2 0 4; 0 9 10 11; 3 1 2 12];
mask = ismember(b, a);     % logical matrix: which entries of b appear in a
counts = sum(mask, 2);     % number of matching entries per row of b
idx = find(counts >= 3)    % rows of b containing at least three elements of a

MATLAB help: shuffling a predefined vector without consecutively repeating numbers (with equal occurrences of all values)

I'm having trouble randomly shuffling a vector so that no number repeats consecutively (e.g. 1 1 is not acceptable, but 1 2 is acceptable), given that each value occurs equally often.
More specifically, I would like to repeat the vector 1:4 ten times (40 elements in total) so that 1, 2, 3 and 4 each appear 10 times without any value occurring twice in a row.
If there is any clarification needed please let me know, I hope this question was clear.
This is what I have so far:
cond_order = repmat([1:4],10,1); %make matrix
cond_order = cond_order(:); %make sequence
I know randperm is quite relevant but I'm not sure how to use it with the one condition of non-repeating numbers.
EDIT: Thank you for all the responses.
I realize I was quite unclear. These are the kinds of sequences I would like to reject: [1 1 2 2 4 4 4...].
So it doesn't matter if [1 2 3 4] occurs in that order as long as individual values are not repeated. (so both [1 2 3 4 1 2 3 4...] and [4 3 1 2...] are acceptable)
Preferably I am looking for a shuffled vector meeting the criteria that
it is random
there are no consecutively repeating values (e.g. 1 1 4 4)
all four values appear an equal number of times
Kind of working with the rejection sampling idea, just repeating with randperm until a permutation is found that has no consecutively repeated values.
cond_order = repmat(1:4,10,1); %//make matrix
N = numel(cond_order); %//number of elements
sequence_found = false;
while ~sequence_found
    candidate = cond_order(randperm(N));
    if all(diff(candidate) ~= 0) %// check that no value repeats consecutively
        sequence_found = true;
    end
end
result = candidate;
The solution from mikkola got it methodically right, but I think there is a more efficient way:
He chose to sample sequences with equal quantities and then check the consecutive differences. I chose to do it the other way round and ended up with a solution requiring far fewer iterations.
n = 4;
k = 10;
d = 42; %// arbitrary value guaranteed to fail the first check
while ~all(sum(bsxfun(@eq, d, (1:n).'), 2) == k) %// check that every number appears k times
    d = mod(cumsum([randi(n,1,1), randi(n-1,1,(n*k)-1)]), n) + 1; %// new random sample, enforcing a difference of at least 1 between neighbours
end
A subtle but important distinction: does the author need an equal probability of picking any feasible sequence?
A number of people have mentioned answers of the form, "Let's use randperm and then rearrange the sequence so that it's feasible." That may not work. What will make this problem quite hard is if the author needs an equal chance of choosing any feasible sequence. Let me give an example to show the problem.
Imagine the set of numbers [1 2 2 3 4]. First let's enumerate the set of feasible sequences:
6 sequences beginning with 1: [1 2 3 2 4], [1 2 3 4 2], [1 2 4 2 3], [1 2 4 3 2], [1 3 2 4 2], [1 4 2 3 2].
Then there are 6 sequences beginning with [2 1]: [2 1 2 3 4], [2 1 2 4 3], [2 1 3 2 4], [2 1 3 4 2], [2 1 4 2 3], [2 1 4 3 2]. By symmetry, there are 18 sequences beginning with 2 (i.e. 6 of [2 1], 6 of [2 3], 6 of [2 4]).
By symmetry there are 6 sequences beginning with 3 and another 6 starting with 4.
Hence there are 6 * 3 + 18 = 36 possible sequences.
Sampling uniformly from the feasible sequences, the probability that the first number is 2 is 18/36 = 50 percent. BUT if you just went with a random permutation, the probability that the first digit is 2 would be 40 percent (i.e. 2 of the 5 numbers in the set are 2).
If equal probability of every feasible sequence is required, you want a 50 percent chance of a 2 as the first number, but naive use of randperm followed by rejiggering the numbers at positions 2:end to make the sequence feasible would give you a 40 percent probability of the first digit being 2.
Note that rejection sampling would get the probabilities right as every feasible sequence would have an equal probability of being accepted. (Of course rejection sampling becomes very slow as probability of being accepted goes towards 0.)
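For what it's worth, the 36 and 18/36 figures above can be verified by brute force; a small sketch of my own, not part of the original discussion:

vals = [1 2 2 3 4];
P = unique(perms(vals), 'rows');        % 60 distinct orderings of the multiset
ok = all(diff(P, 1, 2) ~= 0, 2);        % rows with no equal neighbours
feasible = P(ok, :);
size(feasible, 1)                       % 36 feasible sequences
mean(feasible(:, 1) == 2)               % 0.5, i.e. half of them start with 2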
Following some of the discussion on here, I think that there is a trade-off between performance and the theoretical requirements of the application.
If a completely uniform draw from the set of all valid permutations is required, then a pure rejection sampling method will probably be needed. The problem with this, of course, is that as the size of the problem increases, the rejection rate becomes very high: if we take the base example in the question to be n repetitions of [1 2 3 4], the number of samples rejected for each valid draw grows rapidly with n.
My alternative method is to randomly sort the array and then, whenever a consecutive duplicate is detected, randomly re-sort the remaining elements:
cond_order = repmat(1:4,10,1); %make matrix
cond_order = reshape(cond_order, numel(cond_order), 1);
cond_order = cond_order(randperm(numel(cond_order)));
i = 2;
while i <= numel(cond_order) %<= so the final pair is also checked
    if cond_order(i) ~= cond_order(i - 1)
        i = i + 1;
    else
        %consecutive duplicate found: reshuffle the tail and re-check from here
        tmp = cond_order(i:end);
        cond_order(i:end) = tmp(randperm(numel(tmp)));
    end
end
cond_order
Note that there is no guarantee that this will converge, but if it becomes clear that it will not, we can simply start again, and that is still cheaper than re-computing the whole sequence each time.
This definitely meets the last two requirements of the question:
B) there are no consecutively repeating values
C) all 4 values appear an equal number of times
The question is whether it meets the first 'Random' requirement.
If we take the simplest version of the problem, with the input [1 2 3 4 1 2 3 4], then there are 864 valid permutations (empirically determined!). If we run both methods for 100,000 draws, we would expect an approximately Gaussian distribution of counts centred on 115.7 draws per permutation (100,000 / 864).
As expected, the pure rejection sampling method matches this expectation. My algorithm, however, does not: there is clearly a bias towards certain samples.
In the end, it depends on the requirements. Both methods sample over the whole distribution, so both fulfil the core requirements of the problem. I have not included performance comparisons, but for anything other than the simplest of cases I am confident that my algorithm would be much faster. However, the distribution of the draws is not perfectly uniform. Whether it is good enough depends on the application and the size of the actual problem.

Calculating the Local Ternary Pattern of a depth image

I found the details and an implementation of the Local Ternary Pattern (LTP) in Calculating the Local Ternary Pattern of an image?. I want to ask for more detail on the best way to choose the threshold t, and I am also confused about the role of reorder_vector = [8 7 4 1 2 3 6 9];
Unfortunately there isn't a principled way to determine the threshold for LTPs; it's mostly trial and error or experimentation. However, I would suggest making the threshold adaptive. You can use Otsu's algorithm to dynamically determine the best threshold for your image. This assumes that the distribution of intensities in the image is bimodal, i.e. that there is a clear separation between objects and background. MATLAB implements this in the graythresh function. However, it returns a threshold between 0 and 1, so you will need to multiply the result by 255, assuming that the type of your image is uint8.
Therefore, do:
t = 255*graythresh(im);
Here im is the image for which you want to compute the LTPs. Now, I can certainly provide insight into what the reorder_vector is doing. Look at the following figure on how to calculate LTPs:
[Figure: computing the ternary code from a 3 x 3 neighbourhood; source: hindawi.com]
When we generate the ternary code matrix (matrix in the middle), we need to generate an 8 element sequence that doesn't include the middle of the neighbourhood. We start from the east most element (row 2, column 3), then traverse the elements in counter-clockwise order. The reorder_vector variable allows you to select those specific elements that respect that order. If you recall, MATLAB can access matrices using column-major linear indices. Specifically, given a 3 x 3 matrix, we can access an element using a number from 1 to 9 and the memory is laid out like so:
1 4 7
2 5 8
3 6 9
Therefore, the first element of reorder_vector is index 8, which is the east most element. Next is index 7, which is the top right element, then index 4 which is the north facing element, then 1, 2, 3, 6 and finally 9.
If you follow these numbers, you will determine how I got the reorder_vector:
reorder_vector = [8 7 4 1 2 3 6 9];
By using this variable for accessing each 3 x 3 local neighbourhood, we would thus generate the correct 8 element sequence that respects the ordering of the ternary code so that we can proceed with the next stage of the algorithm.
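To make the indexing concrete, here is a rough sketch of how reorder_vector could be applied to a single 3 x 3 neighbourhood. The window values, centre handling and threshold below are illustrative assumptions, not code from the linked implementation:

reorder_vector = [8 7 4 1 2 3 6 9];        % counter-clockwise, starting from the east neighbour
window = [90 85 70; 95 80 60; 100 75 50];  % an example 3 x 3 neighbourhood
centre = window(5);                        % linear index 5 is the centre pixel
t = 5;                                     % example threshold (or 255*graythresh(im))

neighbours = window(reorder_vector);       % the 8 neighbours in ternary-code order
ternary_code = zeros(1, 8);
ternary_code(neighbours >= centre + t) =  1;   % sufficiently brighter than the centre
ternary_code(neighbours <= centre - t) = -1;   % sufficiently darker than the centre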

How to sum matrix entries indexed by the values of another vector?

Suppose I have vectors B=[1 1 2 2] and A=[5 6 7 4], where B indicates which entries of A need to be summed together. That is, we sum 5 and 6 to get the first entry of the result array, and sum 7 and 4 to get the second entry. If B is [1 2 1 2], then the first element of the result is 5+7 and the second element is 6+4.
How could I do this in MATLAB in a generic way?
A flexible and general approach would be to use accumarray().
accumarray(B',A')
The function accumulates the values in A into the positions specified by B.
Since the documentation is not simple to understand, I will summarize why it is so flexible (a worked example on the question's data follows the list). You can:
choose your accumulating function (sum by default)
specify the positions as a set of coordinates for accumulation into ND arrays
preset the dimensions of the accumulated array (by default it expands to the maximum position)
pad the non-accumulated positions with custom values (pads with 0 by default)
set the accumulated array to be sparse, thus potentially avoiding out-of-memory errors
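Applied to the data from the question, a quick usage sketch (expected outputs shown in comments):

A = [5 6 7 4];

B = [1 1 2 2];
accumarray(B', A')   % returns [11; 11]  (5+6 and 7+4)

B = [1 2 1 2];
accumarray(B', A')   % returns [12; 10]  (5+7 and 6+4)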
[sum(A(1:2:end));sum(A(2:2:end))] %// works only for the specific pattern B = [1 2 1 2] (odd/even positions)

Is my subset sum algorithm of polynomial time?

I came up with a new algorithm to solve the subset sum problem, and I think it's in polynomial time. Tell me I'm either wrong or a total genius.
Quick starter facts:
The subset sum problem is NP-complete. Solving it in polynomial time would mean that P = NP. The number of subsets of a set of size N is 2^N.
More usefully, the number of unique subset sums is at most the sum of the whole set minus the smallest element; in other words, the range of sums that any subset can possibly produce runs from the sum of all the negative elements to the sum of all the positive elements, since no sum can be bigger or smaller than those totals, and they grow only at a linear rate when we add extra elements.
What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly. If an algorithm could be devised that removed the duplicate subsets at the earliest possible opportunity, it would run in polynomial time. A quick example comes from binary: using only the numbers that are powers of two, we can create a unique subset for any integral value. As such, any subset involving any other number (if we had all the powers of two) is a duplicate and a waste. By not computing those subsets and their derivatives, we can save virtually all of the running time of the algorithm, since powers of two occur only logarithmically often compared with the integers as a whole.
As such, I propose a simple algorithm that will remove these duplicates and save having to compute them and all their derivatives.
To begin with, we'll sort the set which is only O(N log N), and split it into two halves, positive and negative. The procedure for the negative numbers is identical, so I'll only outline the positive numbers (the set now means just the positive half, just for clarification).
Imagine an array indexed by sum, which has entries for all the possible result sums of the positive side (which is only linear, remember). When we add an entry, its value is the list of elements in the subset. For example, array[3] = { 1, 2 }.
In general, we now move on to enumerating all subsets of the set. We do this by starting with the subsets of length one, then adding to them. When we have all the unique subsets, they form an array, and we simply iterate over them in the fashion used in Horowitz/Sahni.
Now we start with the "first generation" values: if there were no duplicate numbers in the original data set, there are guaranteed to be no duplicates among these values. They are all single-element subsets, and all subsets whose length is the size of the set minus one. The latter can easily be generated by summing the whole set and subtracting each element in turn. In addition, the set itself is a valid first-generation sum and subset, as is each individual element of the set.
Now we do the second generation values. We loop through each value in the array and for each unique subset, if it doesn't have it, we add it and compute the new unique subset. If we have a duplicate, fun occurs. We add it to a collision list. When we come to add new subsets, we check if they're on the collision list. We key by the less desirable (normally larger, but can be arbitrary) equal subset. When we come to add to subsets, if we would generate a collision, we simply do nothing. When we come to add the more desirable subset, it misses the check and adds, generating the common subset. Then we just repeat for the other generations.
By removing duplicate subsets in this manner, we don't have to keep combining the duplicates with the rest of the set, nor keep checking them for collisions, nor sum them. Most importantly, by not creating new subsets that are non-unique, we're not generating new subsets from them, which can, I believe, turn the algorithm from NP to P, since the growth of subsets is no longer exponential- we discard the vast majority of them before they can "reproduce" in the next generation and create more subsets by being combined with the other non-duplicate subsets.
I don't think I've explained this too well. I have pictures... they're in my head. The important thing is that by discarding duplicate subsets, you could remove virtually all of the complexity.
For example, imagine (because I'm doing this example by hand) a simple data set that runs from -7 to 7 (excluding zero), for which we aim at a target of zero. Sort and split, so we're left with (1, 2, 3, 4, 5, 6, 7). The sum is 28, but 2^7 is 128, so 128 subsets fit in the range 1 .. 28, meaning that we know in advance that at least 100 subsets are duplicates. If we had 8 elements, we'd have only 36 slots but 256 subsets, so the number of duplicates would now be at least 220, more than double what it was before.
In this case, the first generation values are 1, 2, 3, 4, 5, 6, 7, 28, 27, 26, 25, 24, 23, 22, 21, and we map them to their constituent components, so
1 = { 1 }
2 = { 2 }
...
28 = { 1, 2, 3, 4, 5, 6, 7 }
27 = { 2, 3, 4, 5, 6, 7 }
26 = { 1, 3, 4, 5, 6, 7 }
...
21 = { 1, 2, 3, 4, 5, 6 }
Now, to generate the new subsets, we take each subset in turn and add it to each other subset, unless they have a mutual sub-subset, e.g. 28 and 27 have a huge mutual sub-subset. So when we take 1 and add it to 2, we get 3 = { 1, 2 }, but wait: it's already in the array. What this means is that we now don't add 1 to any subset that already has 2 in it, and vice versa, because that would duplicate 3's subsets.
Now we have
1 = { 1 }
2 = { 2 }
// Didn't add 1 to 2 to get 3 because that's a dupe
3 = { 3 } // Add 1 to 3, and we get a duplicate. Repeat the process.
4 = { 4 } // And again.
...
8 = { 1, 7 }
21? Already has 1 in.
...
27? We already have 28
Now we add 2 to all.
1? Existing duplicate
3? Get a new duplicate
...
9 = { 2, 7 }
10 = { 1, 2, 7 }
21? Already has 2 in
...
26? Already have 28
27? Got 2 in already.
3?
1? Existing dupe
2? Existing dupe
4? New duplicate
5? New duplicate
6? New duplicate
7? New duplicate
11 = { 1, 3, 7 }
12 = { 2, 3, 7 }
13 = { 1, 2, 3, 7 }
As you can see, even though I am still adding new subsets each time, the quantity is only going up linearly.
4?
1? Existing dupe
2? Existing dupe
3? Existing dupe
5? New duplicate
6? New duplicate
7? New duplicate
8? New duplicate
9? New duplicate
14 = {1, 2, 4, 7}
15 = {1, 3, 4, 7}
16 = {2, 3, 4, 7}
17 = {1, 2, 3, 4, 7}
5?
1,2,3,4 existing duplicate
6,7,8,9,10,11,12 new duplicate
18 = {1, 2, 3, 5, 7}
19 = {1, 2, 4, 5, 7}
20 = {1, 3, 4, 5, 7}
21 = new duplicate
Now we have every value in the range, so we stop, and add to our list 1-28. Repeat for negative numbers, iterate through lists.
Edit:
This algorithm is totally wrong in any case. Subsets which have duplicate sums are not duplicates for the purposes of which subsets can be spawned from them, because they are arrived at differently, i.e. they cannot be folded together.
This does not prove P = NP.
You have failed to consider the case where the positive numbers are 1, 2, 4, 8, 16, etc.; then there are no duplicate subset sums, so the algorithm runs in O(2^N) time.
You can treat this as a special case, but the algorithm is still not polynomial for other similar cases. The assumption you made is where you move away from the NP-complete version of subset sum to solving only easy (polynomial-time) instances:
[assume the sum of the positive numbers grows] at a linear rate when we add extra elements.
Here you are effectively assuming that P (i.e. number of bits required to state the problem) is smaller than N. Quote from Wikipedia:
Thus, the problem is most difficult if N and P are of the same order.
If you assume that N and P are of the same order, then you can't assume that the sum grows linearly indefinitely as you add more elements. As you add more elements to your set, those elements also need to get larger to ensure that the problem remains hard to solve.
If P (the number of place values) is a small fixed number, then there are dynamic programming algorithms that can solve it exactly.
You have rediscovered one of these algorithms. It's a nice piece of work but it isn't something new and it doesn't prove P = NP. Sorry!
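For reference, the standard pseudopolynomial dynamic programming approach referred to above can be sketched in MATLAB roughly as follows. This assumes positive integer inputs and a target T, and is not the original poster's formulation:

function reachable = subset_sum_dp(nums, T)
%SUBSET_SUM_DP  True if some subset of nums (positive integers) sums to T.
%   Runs in O(N*T) time, i.e. pseudopolynomial in the value of T.
    can = false(1, T + 1);     % can(s+1) == true  <=>  some subset sums to s
    can(1) = true;             % the empty subset sums to 0
    for x = nums
        for s = T:-1:x         % walk downwards so each number is used at most once
            can(s + 1) = can(s + 1) || can(s - x + 1);
        end
    end
    reachable = can(T + 1);
end

For example, subset_sum_dp([3 34 4 12 5 2], 9) returns true (4 + 5 = 9), and the work scales with T rather than with 2^N, which is exactly the pseudopolynomial behaviour described in the next answer.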
Dead MG,
It has been almost half a year since you posted but I will answer anyway.
Mark Byers wrote most of what should be written.
The algorithm is known.
Such algorithms are known as generating functions algorithms or simply as dynamic programming algorithms.
Your algorithm has a very important feature, so-called pseudopolynomial complexity.
Traditional complexity is a function of the size of the problem. In terms of traditional complexity your algorithm has O(2^N) pessimistic (worst-case) complexity (that is, for numbers like 1, 2, 4, ... as was mentioned earlier).
The complexity of your algorithm can alternatively be expressed as a function of the size of the problem and the size of some of the numbers in the problem. For your algorithm it would be something like O(nw), where w is the number of distinct sums.
This is pseudopolynomial complexity. It is a VERY important feature: such algorithms can solve many real-world problem instances, despite the complexity class of the problem.
The Horowitz and Sahni algorithm has pessimistic complexity O(2^(N/2)). This is not two times better than your algorithm but a lot more: 2^(N/2) times better. What Greg probably meant was that the Horowitz and Sahni algorithm can solve instances of twice the size (having twice as many numbers in the subset sum).
That's true in theory, but in practice Horowitz and Sahni can solve (on home computers) instances with about 60 numbers, while an algorithm similar to yours can handle even instances with 1000 numbers (provided that the numbers aren't too big themselves).
In fact the two algorithms can even be mixed, that is, one of your kind combined with the Horowitz and Sahni algorithm. Such a solution has both pseudopolynomial complexity and a pessimistic complexity of O(2^(N/2)).
A trained computer scientist could construct an algorithm like yours by means of generating-functions theory.
Both trained and untrained people can come up with it the way you did.
Do not think only in terms of "is it known?". What should matter to you is that you can invent such an algorithm on your own. It means that you can probably invent other important algorithms on your own, and maybe someday one that isn't yet known. Knowing the current progress in the field and what's in the literature helps; otherwise you will keep on reinventing the wheel.
Back in high school I reinvented Dijkstra's algorithm. My version had terrible complexity because I didn't know anything about data structures. Anyway, I am still proud of myself.
If you are still studying, pay attention to generating-functions theory.
You may also want to check out on wiki:
pseudopolynomial time
weakly NP-complete
strongly NP-complete
generating functions
Megli
What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly.
Not necessarily: the number of duplicate subset sums is also determined by the value of the number closest to zero in the set (the greater the minimum distance to zero, the fewer duplicate subset sums the set has).
In general, we now move to enumerate all subsets in the set.
Unfortunately, enumerating all the sums of the subsets of the set requires performing an exponential number of addition operations (2^7 or 128 in your example). Otherwise, how would the algorithm determine what the unique sums happen to be? So, although the steps that follow the first step could very well have a polynomial running time, the algorithm as a whole has exponential complexity (because an algorithm is only as fast as its slowest part).
Incidentally, the best known algorithm for solving the subset sum problem (Horowitz and Sahni, 1974) has O(2^(N/2)) complexity, which makes it about twice as fast as this algorithm.
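For completeness, a rough MATLAB sketch of the Horowitz and Sahni meet-in-the-middle idea (my illustration; it enumerates the subset sums of each half exhaustively and uses a membership lookup, whereas the original algorithm sorts the two half-lists and scans them):

function found = subset_sum_mitm(nums, T)
%SUBSET_SUM_MITM  Meet-in-the-middle subset sum, roughly O(2^(N/2)) sums per half.
    nums = nums(:).';                            % ensure we iterate over elements
    n = numel(nums);
    sumsA = half_sums(nums(1:floor(n/2)));       % all subset sums of the first half
    sumsB = half_sums(nums(floor(n/2)+1:end));   % all subset sums of the second half
    found = any(ismember(T - sumsA, sumsB));     % look up each complement in the other half
end

function s = half_sums(v)
    s = 0;                     % the empty subset sums to 0
    for x = v
        s = [s, s + x];        % each new element doubles the list: without x / with x
    end
end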