How can I generate a binary matrix with specific patterns? - matlab

I have a binary matrix of size m-by-n. Given below is a sample binary matrix (the real matrix is much larger):
1010001
1011011
1111000
0100100
Given p = m*n, I have 2^p possible matrix configurations. I would like to get some patterns which satisfy certain rules. For example:
I want not less than k cells in the jth column as zero
I want the sum of cell values of the ith row greater than a given number Ai
I want at least g cells in a column continuously as one
etc....
How can I get such patterns satisfying these constraints strictly without sequentially checking all the 2^p combinations?
In my case, p can be a number like 2400, giving approximately 2.96476e+722 possible combinations.

Instead of iterating over all 2^p combinations, one way you could generate such binary matrices is by performing repeated row- and column-wise operations based on the given constraints you have. As an example, I'll post some code that will generate a matrix based on the three constraints you have listed above:
A minimum number of zeroes per column
A minimum sum for each row
A minimum sequential length of ones per column
Initializations:
First start by initializing a few parameters:
nRows = 10; % Row size of matrix
nColumns = 10; % Column size of matrix
minZeroes = 5; % Constraint 1 (for columns)
minRowSum = 5; % Constraint 2 (for rows)
minLengthOnes = 3; % Constraint 3 (for columns)
Helper functions:
Next, create a couple of functions for generating column vectors that match constraints 1 and 3 from above:
function vector = make_column
vector = [false(minZeroes,1); true(nRows-minZeroes,1)]; % Create vector
[vector,maxLength] = randomize_column(vector); % Randomize order
while maxLength < minLengthOnes, % Loop while constraint 3 is not met
[vector,maxLength] = randomize_column(vector); % Randomize order
end
end
function [vector,maxLength] = randomize_column(vector)
vector = vector(randperm(nRows)); % Randomize order
edges = diff([false; vector; false]); % Find rising and falling edges
maxLength = max(find(edges == -1)-find(edges == 1)); % Find longest
% sequence of ones
end
The function make_column will first create a logical column vector with the minimum number of 0 elements and the remaining elements set to 1 (using the functions TRUE and FALSE). This vector will undergo random reordering of its elements until it contains a sequence of ones greater than or equal to the desired minimum length of ones. This is done using the randomize_column function. The vector is randomly reordered using the RANDPERM function to generate a random index order. The edges where the sequence switches between 0 and 1 are detected using the DIFF function. The indices of the edges are then used to find the length of the longest sequence of ones (using FIND and MAX).
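As a small illustration of the edge-detection step described above, the sketch below applies the same DIFF, FIND and MAX idea to a fixed vector. The vector v and the names runStarts, runEnds and runLengths are made up for this example and are not part of the original code.

```matlab
% Illustrative input (not from the original post): runs of ones of length 2 and 3
v = logical([0 1 1 0 1 1 1 0]');

% Pad with zeros so that runs touching either end still have both edges
edges = diff([0; double(v); 0]);    % +1 at the start of a run, -1 just past its end

runStarts  = find(edges == 1);      % first index of each run of ones
runEnds    = find(edges == -1);     % index just past the end of each run
runLengths = runEnds - runStarts;   % length of each run of ones
maxLength  = max(runLengths);       % longest run of ones
```

Because every rising edge is paired with the next falling edge, subtracting the two index vectors gives all run lengths at once.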
Generate matrix columns:
With the above two functions we can now generate an initial binary matrix that will at least satisfy constraints 1 and 3:
binMat = false(nRows,nColumns); % Initialize matrix
for iColumn = 1:nColumns,
binMat(:,iColumn) = make_column; % Create each column
end
Satisfy the row sum constraint:
Of course, now we have to ensure that constraint 2 is satisfied. We can sum across each row using the SUM function:
rowSum = sum(binMat,2);
If any elements of rowSum are less than the minimum row sum we want, we will have to adjust some column values to compensate. There are a number of different ways you could go about modifying column values. I'll give one example here:
while any(rowSum < minRowSum), % Loop while constraint 2 is not met
[minValue,rowIndex] = min(rowSum); % Find row with lowest sum
zeroIndex = find(~binMat(rowIndex,:)); % Find zeroes in that row
randIndex = randi(numel(zeroIndex)); % Uniform random position in zeroIndex
columnIndex = zeroIndex(randIndex); % Choose a zero at random
column = binMat(:,columnIndex);
while ~column(rowIndex), % Loop until zero changes to one
column = make_column; % Make new column vector
end
binMat(:,columnIndex) = column; % Update binary matrix
rowSum = sum(binMat,2); % Update row sum vector
end
This code will loop until all the row sums are greater than or equal to the minimum sum we want. First, the index of the row with the smallest sum (rowIndex) is found using MIN. Next, the indices of the zeroes in that row are found and one of them is randomly chosen as the index of a column to modify (columnIndex). Using make_column, a new column vector is continuously generated until the 0 in the given row becomes a 1. That column in the binary matrix is then updated and the new row sum is computed.
Summary:
For a relatively small 10-by-10 binary matrix, and the given constraints, the above code usually completes in no more than a few seconds. With more constraints, things will of course get more complicated. Depending on how you choose your constraints, there may be no possible solution (for example, setting minRowSum to 6 will cause the above code to never converge to a solution).
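The non-convergence case mentioned above can be caught before looping with a simple counting argument: make_column places exactly nRows-minZeroes ones in each column, so the matrix holds a fixed number of ones in total. The sketch below is only a necessary condition, not a guarantee that a solution exists, and the names onesPerColumn, totalOnes, neededOnes and isFeasible are assumptions introduced here.

```matlab
% Values from the example above
nRows = 10; nColumns = 10; minZeroes = 5; minRowSum = 5;

onesPerColumn = nRows - minZeroes;          % make_column places exactly this many ones
totalOnes     = nColumns * onesPerColumn;   % ones available in the whole matrix
neededOnes    = nRows * minRowSum;          % ones required to meet every row sum

isFeasible = totalOnes >= neededOnes;       % necessary condition only, not sufficient
```

With minRowSum = 6 the required count is 60 while only 50 ones are available, which is why the loop can never finish in that case.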
Hopefully this will give you a starting point to begin generating the sorts of matrices you want using vectorized operations.

If you have enough constraints, exploring all possible matrices could be attempted:
// Explore all possibilities starting at POSITION (0..P-1)
explore(int position)
{
// Check if one or more constraints can't be verified anymore with
// all values currently set.
invalid = ...;
if (invalid) return;
// Do we have a solution?
if (position >= p)
{
// print the matrix
return;
}
// Set one more value and continue exploring
for (int value=0;value<2;value++)
{ matrix[position] = value; explore(position+1); }
}
If the number of constraints is low, this approach will take too much time.
In this case, for the kind of constraints you gave as examples, simulated annealing may be a good solution.
You must design an energy function that is high when all constraints are met. The procedure would be something like this:
1. Generate a random matrix
2. Compute energy E0
3. Change one cell
4. Compute energy E1
5. If E1>E0, or E0-E1 is smaller than f(temperature), keep it, otherwise reverse the move
6. Update temperature, and goto 2 unless stop criterion is reached
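The steps above could be sketched in MATLAB roughly as follows. This is a minimal sketch under stated assumptions, not a tuned implementation: the matrix size, the two constraints, the cooling schedule and the Metropolis-style acceptance rule exp((E1-E0)/T) are all choices made here for illustration, standing in for the unspecified f(temperature).

```matlab
% Hypothetical parameters chosen for illustration
nRows = 6; nColumns = 6; minZeroes = 2; minRowSum = 3;
nSteps = 20000; T = 1.0; cooling = 0.999;

% Energy: minus the total constraint shortfall, so 0 means all constraints hold
energy = @(M) -( sum(max(0, minZeroes - sum(~M, 1))) ...
               + sum(max(0, minRowSum - sum(M, 2))) );

M  = rand(nRows, nColumns) > 0.5;   % step 1: random starting matrix
E0 = energy(M);                     % step 2: current energy
for step = 1:nSteps
    if E0 == 0, break; end          % stop criterion: every constraint satisfied
    k = randi(numel(M));
    M(k) = ~M(k);                   % step 3: flip one cell
    E1 = energy(M);                 % step 4: new energy
    if E1 > E0 || rand < exp((E1 - E0) / T)
        E0 = E1;                    % step 5: keep the move
    else
        M(k) = ~M(k);               % step 5: otherwise reverse it
    end
    T = T * cooling;                % step 6: lower the temperature
end
```

When the loop exits with E0 equal to 0, M satisfies both constraints; otherwise the run ended without finding a feasible matrix and can be restarted.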

If all the constraints relate to columns (as is the case in the question), then you can find all possible valid columns and check that each column in the matrix is in this set. (That is, when you consider each column independently, you reduce the number of possibilities a lot.)
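For a small number of rows, the set of valid columns can be enumerated directly. The sketch below assumes only two column constraints (a minimum number of zeros and a minimum run of ones) and a hypothetical column height m = 4; enumeration grows as 2^m, so it is only practical when m is small.

```matlab
m = 4; minZeroes = 1; minLengthOnes = 2;     % hypothetical small example

candidates = dec2bin(0:2^m - 1) - '0';       % each row is one possible column (2^m-by-m)
isValid = false(size(candidates, 1), 1);
for c = 1:size(candidates, 1)
    v = candidates(c, :);
    edges = diff([0, v, 0]);                              % run boundaries
    runLengths = find(edges == -1) - find(edges == 1);    % lengths of runs of ones
    longestRun = max([0, runLengths]);                    % 0 when there are no ones
    isValid(c) = (sum(v == 0) >= minZeroes) && (longestRun >= minLengthOnes);
end
validColumns = candidates(isValid, :).';     % m-by-K, one valid column per column
```

Any matrix built by picking its columns from validColumns satisfies the column constraints by construction, which leaves only the row constraints to check.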

I might be way off here, but I remember doing something similar once with some genetic algorithm.

Check out pseudo boolean constraints (also called 0-1 integer programming).

This is virtually impossible if your constraint set is complex enough. You might try to use a stochastic optimizer, like simulated annealing, particle swarm optimization, or a genetic algorithm to find a feasible solution.
However, if you can generate one (non-random) solution to such a problem, then often you can generate others by random permutations made to the existing solution.

Related

Optimize nested for loop for calculating xcorr of matrix rows

I have 2 nested loops which do the following:
Get two rows of a matrix
Check if indices meet a condition or not
If they do: calculate xcorr between the two rows and put it into new vector
Find the index of the maximum value of sub vector and replace element of LAG matrix with this value
I dont know how I can speed this code up by vectorizing or otherwise.
b=size(data,1);
F=size(data,2);
LAG= zeros(b,b);
for i=1:b
for j=1:b
if j>i
x=data(i,:);
y=data(j,:);
d=xcorr(x,y);
d=d(:,F:(2*F)-1);
[M,I] = max(d);
LAG(i,j)=I-1;
d=xcorr(y,x);
d=d(:,F:(2*F)-1);
[M,I] = max(d);
LAG(j,i)=I-1;
end
end
end
First, a note on floating point precision...
You mention in a comment that your data contains the integers 0, 1, and 2. You would therefore expect a cross-correlation to give integer results. However, since the calculation is being done in double-precision, there appears to be some floating-point error introduced. This error can cause the results to be ever so slightly larger or smaller than integer values.
Since your calculations involve looking for the location of the maxima, then you could get slightly different results if there are repeated maximal integer values with added precision errors. For example, let's say you expect the value 10 to be the maximum and appear in indices 2 and 4 of a vector d. You might calculate d one way and get d(2) = 10 and d(4) = 10.00000000000001, with some added precision error. The maximum would therefore be located in index 4. If you use a different method to calculate d, you might get d(2) = 10 and d(4) = 9.99999999999999, with the error going in the opposite direction, causing the maximum to be located in index 2.
The solution? Round your cross-correlation data first:
d = round(xcorr(x, y));
This will eliminate the floating-point errors and give you the integer results you expect.
Now, on to the actual solutions...
Solution 1: Non-loop option
You can pass a matrix to xcorr and it will perform the cross-correlation for every pairwise combination of columns. Using this, you can forego your loops altogether like so:
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1,:), [], 1);
LAG = reshape(I-1, b, b).';
Solution 2: Improved loop option
There are limits to how large data can be for the above solution, since it will produce large intermediate and output variables that can exceed the maximum array size available. In such a case for loops may be unavoidable, but you can improve upon the for-loop solution above. Specifically, you can compute the cross-correlation once for a pair (x, y), then just flip the result for the pair (y, x):
% Loop over rows:
for row = 1:b
% Loop over upper matrix triangle:
for col = (row+1):b
% Cross-correlation for upper triangle:
d = round(xcorr(data(row, :), data(col, :)));
[~, I] = max(d(:, F:(2*F)-1));
LAG(row, col) = I-1;
% Cross-correlation for lower triangle:
d = fliplr(d);
[~, I] = max(d(:, F:(2*F)-1));
LAG(col, row) = I-1;
end
end
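The flip used in the loop above relies on the cross-correlation of (y, x) being the mirror image of that of (x, y) for real-valued data. The sketch below checks this on two made-up integer vectors. It computes the cross-correlation with conv and fliplr instead of xcorr, which is a substitution made here so that the check runs without the Signal Processing Toolbox.

```matlab
% Made-up integer rows for illustration
x = [1 2 0];
y = [0 1 2];

% For real vectors of equal length, conv(x, fliplr(y)) gives the same
% sequence of lags as a cross-correlation of x with y
c_xy = conv(x, fliplr(y));
c_yx = conv(y, fliplr(x));

isMirror = isequal(c_yx, fliplr(c_xy));   % true when swapping inputs mirrors the result
```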

Matlab calculate outliers from data and time they occur

In Matlab I have a large matrix A. The first column of the matrix contains a time in seconds. The second to 13th column contain results from a calculation. For each column (except the first) I calculated the whisker by:
quantile(A,[.75])-1.5*(quantile(A,[.75])-quantile(A,[.25]))
Now I would like to know how many outliers (= values below whisker) there are in each column, and when they occur. This will give me the ability to calculate how much the outliers are spread over time.
I prefer to create a loop which gives me 12 matrices containing two columns. The second column should contain the values of the outliers (= values of cells below whisker) without any zeros in between, and the first column should contain the time at which an outlier occurs (chronologically).
How can I create this?
regards,
Vincent
let,
A =
0.6260 0.7690 0.1209 0.5523 0.0495
0.6609 0.5814 0.8627 0.6299 0.4896
0.7298 0.9283 0.4843 0.0320 0.1925
0.8908 0.5801 0.8449 0.6147 0.1231
0.9823 0.0170 0.2094 0.3624 0.2055
for second column:
B = quantile(A(:,2),[.75])-1.5*(quantile(A(:,2),[.75])-quantile(A(:,2),[.25]))
Then,
index = find(A(:,2) < B)
value_outliner = A(index,2)
outliner_time = A(index,1)
No need to loop: use matrix operations and logical indexing instead.
Assuming you have a matrix A and outlier threshold thr is a 1x12 vector with the threshold for each column:
vals = A(:,2:13);
outliers = bsxfun(@lt, vals, thr); % @lt is the 'less than' function handle
% outliers is a Nx12 logical matrix with true(1) where the value < threshold
% and false(0) otherwise.
To get the time when these outliers occurred (for a given column, let's say column 2 of the data portion of the original matrix):
t = A(outliers(:,2), 1);
% ^____________ logical index of rows where outliers occurred in that column
You can also easily get the number of outliers in each column (or row) by summing:
num_outliers = sum(outliers,1);
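If separate two-column matrices per data column are still wanted, as asked in the question, logical indexing can be combined with a short loop that fills a cell array. The data, the thresholds in thr and the name outlierTables below are made up for illustration; in practice thr would hold the whisker values computed for each column.

```matlab
% Made-up data for illustration: column 1 is time, columns 2-3 are results
A = [ 1 0.9 0.2
      2 0.1 0.8
      3 0.7 0.7
      4 0.8 0.1
      5 0.2 0.9 ];
thr = [0.5 0.3];                      % assumed per-column thresholds

vals = A(:, 2:end);
nCols = size(vals, 2);
outlierTables = cell(1, nCols);       % one [time, value] matrix per data column
for c = 1:nCols
    isOut = vals(:, c) < thr(c);      % logical index of outlier rows in this column
    outlierTables{c} = [A(isOut, 1), vals(isOut, c)];
end
```

Each cell keeps the rows in their original order, so the times stay chronological as long as the first column of A is sorted.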

MATLAB: Indexing a large matrix for Monte Carlo Simulation

I'm trying to index a large matrix in MATLAB that contains numbers monotonically increasing across rows, and across columns, i.e. if the matrix is called A, for every (i,j), A(i+1,j) > A(i,j) and A(i,j+1) > A(i,j).
I need to create a random number n and compare it with the values of the matrix A, to see where that random number should be placed in the matrix A. In other words, the value of n may not equal any of the contents of the matrix, but it may lie in between any two rows and any two columns, and that determines a "bin" that identifies its position in A. Once I find this position, I increment the corresponding index in a new matrix of the same size as A.
The problem is that I want to do this 1,000,000 times. I need to create a random number a million times and do the index-checking for each of these numbers. It's a Monte Carlo Simulation of a million photons coming from a point landing on a screen; the matrix A consists of angles in spherical coordinates, and the random number is the solid angle of each incident photon.
My code so far goes something like this (I haven't copy-pasted it here because the details aren't important):
for k = 1:1000000
n = rand(1,1)*pi;
for i = 2:size(A,1)-1
for j = 2:size(A,2)-1
if (n > A(i-1,j)) && (n < A(i+1,j)) && (n > A(i,j-1)) && (n < A(i,j+1))
new_img(i,j) = new_img(i,j) + 1; % new_img defined previously as zeros
end
end
end
end
The "if" statement is just checking to find the indices of A that form the bounds of n.
This works perfectly fine, but it takes ridiculously long, especially since my matrix A is an image of dimensions 11856 x 11000. Is there a quicker / cleverer / easier way of doing this?
Thanks in advance.
You can get rid of the inner loops by performing the calculation on all elements of A at once. Also, you can create the random numbers all at once, instead of one at a time. Note that the outermost pixels of new_img can never be different from zero.
randomNumbers = rand(1,1000000)*pi;
new_img = zeros(size(A));
tmp_img = zeros(size(A)-2);
for r = randomNumbers
tmp_img = tmp_img + (A(1:end-2,2:end-1)<r & A(3:end,2:end-1)>r & A(2:end-1,1:end-2)<r & A(2:end-1,3:end)>r);
end
new_img(2:end-1,2:end-1) = tmp_img;
/aside: If the arrays were smaller, I'd have used bsxfun for the comparison, but with the array sizes in the OP, the approach would run out of memory.
Are the values in A bin edges? Ie does A specify a grid? If this is the case then you can QUICKLY populate A using hist3.
Here is an example:
numRand = 1e
data = randi(100,1e6,1);
dataMat = [floor(data./10), mod(data,10)];
edges = {0:1:9, 0:10:99};
A = hist3(dataMat, edges);
If your A doesn't specify a grid, then you should create all of your random values once and sort them. Then iterate through those values.
Because you know that n(i) >= n(i-1) you don't have to check bins that were too small for n(i-1). This is a very easy way to optimize away most redundant checks.
Here is a snippet that should help a lot in the inner loop, it finds the location of the greatest point that is smaller than your value.
idx1 = A<value
idx2 = A(idx1) == max(A(idx1))
If you want to find the exact location, you can wrap it with a find.
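One way to recover the row and column of that element in A itself, rather than a position inside the reduced vector A(idx1), is to mask out the values that are too large and take a single max. The matrix, the query value and the names candidates, linIdx, row and col are assumptions made for this sketch.

```matlab
A = [1 2 3; 4 5 6; 7 8 9];    % made-up monotonically increasing matrix
value = 5.5;                  % made-up query value

candidates = A;
candidates(A >= value) = -Inf;            % ignore entries that are not below value
[~, linIdx] = max(candidates(:));         % largest remaining entry
[row, col] = ind2sub(size(A), linIdx);    % its position in A
```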

How to master the random generator in MATLAB for application to bootstrap artificial neural network models

I am working with hydrological time series data and I am attempting to construct Bootstrap Artificial Neural Network models. In order to provide an uncertainty assessment using confidence intervals, one must make sure when resampling/Bootstrapping the original time series data set, that every value in the original time series is held back at least twice within all bootstrap samples in order to calculate the variance and confidence intervals at that point in time.
To give some background:
I am using a hydrological time series that contains Standard Precipitation Index values at monthly time steps, this time series spans 429 (rows) x 1 (column), let's call this time series vector X. All elements/values of X are normalized and standardized between 0 and 1.
Time series X is then trained against some Target values (same length and conditions as X) in a Neural Network to produce new estimates of the Target values, we'll call this output vector, O (same length and conditions as X).
I am now to take X and resample it ii =1:1:200 times (i.e. Bootstrap size = 200) for length(429) with replacement. Let's call the matrix where all the bootstrap samples are placed, M. I use B = randsample(X, length(X), true) and fill M using a for loop such that M(:,ii) = B. Note: I also make sure to include rng('shuffle') after my randsample statement to keep the RNG moving to new states in hopes that it will provide more random results.
Now I am to test how "well" my data was resampled for use in creating confidence intervals.
My procedure is as follows:
1. Generate a for loop to create M using above procedure
2. Create a new variable Xc, this will hold all values of X that were not resampled in bootstrap sample ii for ii = 1:1:200
3. For j=1:1:length(X) fill 'Xc' using the Xc(j,ii) = setdiff(X, M(:,ii)), if element j exists in M(:,ii) fill Xc(j,ii) with NaN.
4. Xc is now a matrix the same size and dimensions as M. Count the number of NaN values in each row of Xc and place in vector CI.
5. If any row in CI is > [Bootstrap sample size, for this case (200) - 1], then no confidence interval can be created at this point.
When I run this I find that the values chosen from my set X are almost always repeated, i.e. the same values of X are used to generate all the samples in M. It's roughly the same ~200 data points in my original time series that are always chosen to create the new bootstrap samples.
How can I effectively alter my program or use any specific functions that will allow me to avoid the negative solution in (5)?
Here is an example of my code, but please keep in mind the variables used in the script may differ from my text in here.
Thank you for the help and please see the code below.
for ii = 1:1:Blen % for loop to create 'how many bootstraps we desire'
B = randsample(Xtrain, wtrain, true); % bootstrap resamples of data series 'X' for 'how many elements' with replacement
rng('shuffle');
M(:,ii) = B; % creates a matrix of all bootstrap resamples with respect to the amount created by the for loop
[C,IA] = setdiff(Xtrain,B); % creates a vector containing all elements of 'Xtrain' that were not included in bootstrap sample 'ii' and the location of each element
[IAc] = setdiff(k,IA); % creates a vector containing locations of elements of 'Xtrain' used in bootstrap sample 'ii' --> ***IA + IAc = wtrain***
for j = 1:1:wtrain % for loop that counts each row of vector
if ismember(j,IA)== 1 % if the count variable is equal to a value of 'IA'
XC(j,ii) = Xtrain(j,1); % place variable in matrix for sample 'ii' in position 'j' if statement above is true
else
XC(j,ii) = NaN; % hold position with a NaN value to state that this value has been used in bootstrap sample 'ii'
end
dum1(:,ii) = wtrain - sum(isnan(XC(:,ii))); % dummy variable to permit transposing of 'IAs' limited by 'isnan' --> used to calculate amt of elements in IA
dum2(:,ii) = sum(isnan(XC(:,ii))); % dummy variable to permit transposing of 'IAsc' limited by 'isnan'
IAs = transpose(dum1) ; % variable counting amount of elements not resampled in 'M' at set 'i', ***i.e. counts 'IA' for each resample set 'i'
IAsc = transpose(dum2) ; % variable counting amount of elements resampled in 'M' at set 'i', ***i.e. counts 'IAc' for each resample set 'i'
chk = isnan(XC); % returns 1 in position of NaN and 0 in position of actual value
chks = sum(chk,2); % counts how many NaNs are in each row for length of time training set
chks_cnt = sum(chks(:)<(Blen-1)); % counts how many values of the original time series that can be provided a confidence interval, should = wtrain to provide complete CIs
end
end
This doesn't appear to be a problem with randsample, but rather a problem in your other code somewhere. randsample does the right thing. For example:
x = (1:10)';
nSamples = 10;
for iter = 1:100;
data(:,iter) = randsample(x,nSamples ,true);
end;
hist(data(:)) %this is approximately uniform
randsample samples quite randomly...
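One way to make the bookkeeping easier to check is to resample row indices instead of values, and record directly which indices each bootstrap sample left out. This is a sketch of an alternative, not the asker's code: the random placeholder data and the names idx, heldOut, timesHeldOut and nUsable are assumptions, and randi is used here in place of randsample. Tracking indices avoids relying on setdiff, which compares values and returns only unique ones, so repeated values in the series cannot blur the count.

```matlab
n = 429;            % series length from the question
nBoot = 200;        % number of bootstrap samples from the question
X = rand(n, 1);     % placeholder data, stands in for the real series

idx = randi(n, n, nBoot);          % bootstrap sample ii uses rows idx(:, ii) of X
M = X(idx);                        % n-by-nBoot matrix of resampled values

% heldOut(j, ii) is true when element j does not appear in sample ii
heldOut = true(n, nBoot);
for ii = 1:nBoot
    heldOut(idx(:, ii), ii) = false;
end
timesHeldOut = sum(heldOut, 2);    % how often each element was left out
nUsable = sum(timesHeldOut >= 2);  % elements left out at least twice
```

Comparing nUsable with n shows how many time steps were held back often enough to estimate a variance.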

N-Dimensional Histogram Counts

I am currently trying to code up a function to assign probabilities to a collection of vectors using a histogram count. This is essentially a counting exercise, but requires some finesse to be able to achieve efficiently. I will illustrate with an example:
Say that I have a matrix X = [x1, x2....xM] with N rows and M columns. Here, X represents a collection of M, N-dimensional vectors. In other words, each of the columns of X is an N-dimensional vector.
As an example, we can generate such an X for M = 10000 vectors and N = 5 dimensions using:
X = randint(5,10000)
This will produce a 5 x 10000 matrix of 0s and 1s, where each column represents a 5-dimensional vector of 1s and 0s.
I would like to assign a probability to each of these vectors through a basic histogram count. The steps are simple: first find the unique columns of X; second, count the number of times each unique column occurs. The probability of a particular occurrence is then the #of times this column was in X / total number of columns in X.
Returning to the example above, I can do the first step using the unique function in MATLAB as follows:
UniqueXs = unique(X','rows')'
The code above will return UniqueXs, a matrix with N rows that only contains the unique columns of X. Note that the transposes are due to weird MATLAB input requirements.
However, I am unable to find a good way to count the number of times each of the columns in UniqueX is in X. So I'm wondering if anyone has any suggestions?
Broadly speaking, I can think of two ways of achieving the counting step. The first way would be to use the find function, though I think this may be slow since find is an elementwise operation. The second way would be to call unique recursively as it can also provide the index of one of the unique columns in X. This should allow us to remove that column from X and redo unique on the resulting X and keep counting.
Ideally, I think that unique might already be doing some counting so the most efficient way would probably be to work without the built-in functions.
Here are two solutions, one assumes all values are either 0's or 1's (just like the example in your description), the other does not. Both codes should be very fast (more so the one with binary values), even on large data.
1) only zeros and ones
%# random vectors of 0's and 1's
x = randi([0 1], [5 10000]); %# RANDINT is deprecated, use RANDI instead
%# convert each column to a binary string
str = num2str(x', repmat('%d',[1 size(x,1)])); %'
%# convert binary representation to decimal number
num = (str-'0') * (2.^(size(str,2)-1:-1:0))'; %'# num = bin2dec(str);
%# count frequency of how many each number occurs
count = accumarray(num+1,1); %# num+1 since it starts at zero
%# assign probability based on count
prob = count(num+1)./sum(count);
2) any positive integer
%# random vectors with values 0:MAX_NUM
x = randi([0 999], [5 10000]);
%# format vectors as strings (zero-filled to a constant length)
nDigits = floor(log10( max(x(:)) )) + 1; %# also correct when the maximum is an exact power of 10
frmt = repmat(['%0' num2str(nDigits) 'd'], [1 size(x,1)]);
str = cellstr(num2str(x',frmt)); %'
%# find unique strings, and convert them to group indices
[G,GN] = grp2idx(str);
%# count frequency of occurrence
count = accumarray(G,1);
%# assign probability based on count
prob = count(G)./sum(count);
Now we can see for example how many times each "unique vector" occurred:
>> table = sortrows([GN num2cell(count)])
table =
'000064850843749' [1] # original vector is: [0 64 850 843 749]
'000130170550598' [1] # and so on..
'000181606710020' [1]
'000220492735249' [1]
'000275871573376' [1]
'000525617682120' [1]
'000572482660558' [1]
'000601910301952' [1]
...
Note that in my example with random data, the vector space becomes very sparse (as you increase the maximum possible value), thus I wouldn't be surprised if all counts were equal to 1...
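A variant that avoids the string conversion is to take the third output of unique, which maps every column of X to its unique column, and feed it to accumarray. This is a sketch on a made-up 2-by-5 matrix; the names uniqueRows and group are introduced here for illustration.

```matlab
% Made-up example: each column is one vector
X = [0 1 0 1 0
     1 1 1 0 1];

[uniqueRows, ~, group] = unique(X.', 'rows');   % group(k): unique-row index of column k
count = accumarray(group(:), 1);                % occurrences of each unique column
prob  = count(group) ./ size(X, 2);             % probability assigned to each column of X
UniqueXs = uniqueRows.';                        % unique columns, as in the question
```

Here the column [0;1] occurs three times out of five, so the three matching entries of prob are 0.6 and the other two are 0.2.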