how to make Sliding window model for data stream mining? - matlab

we have a situation that a stream (data from sensor or click stream data at server) is coming with sliding window algorithm we have to store the last (say) 500 samples of data in memory. These samples are then used to create histograms, aggregations & capture information about anomalies in the input data stream.
please tell me how to make such sliding window.

If you are asking how to store and maintain these values in a sliding-window manner, consider this simple example which keep tracks of the running mean of the last 10 values of some random stream of data:
WINDOW_SIZE = 10;
x = nan(WINDOW_SIZE,1);
%# init
counter = 0;
stats = [NaN NaN]; %# previous/current value
%# prepare figure
SHOW_LIM = 200;
hAx = axes('XLim',[1 SHOW_LIM], 'YLim',[200 800]);
hLine = line('XData',1, 'YData',nan, 'EraseMode','none', ...
'Parent',hAx, 'Color','b', 'LineWidth',2);
%# infinite loop!
while true
val = randi([1 1000]); %# get new value from data stream
x = [ x(2:end) ; val ]; %# add to window in a cyclic manner
counter = counter + 1;
%# do something interesting with x
stats(1) = stats(2); %# keep track of the previous mean
stats(2) = nanmean(x); %# update the current mean
%# show and update plot
set(hLine, 'XData',[counter-1 counter], 'YData',[stats(1) stats(2)])
if rem(counter,SHOW_LIM)==0
%# show only the last couple of means
set(hAx, 'XLim', [counter counter+SHOW_LIM]);
end
drawnow
pause(0.02)
if ~ishandle(hAx), break, end %# break in case you close the figure
end
Update
The EraseMode=none property was deprecated and removed in recent versions. Use the animatedline function instead for a similar functionality.

Related

How to force MATLAB function area to hold on in figure

I'm working on this function which gets axis handler and data, and is supposed to plot it correctly in the axis. The function is called in for loop. It's supposed to draw the multiple data in one figure. My resulted figure is shown below.
There are only two correctly plotted graphs (those with four colors). Others miss areas plotted before the final area (red area is the last plotted area in each graph). But the script is same for every axis. So where can be the mistake? The whole function is written below.
function [] = powerSpectrumSmooth(axis,signal,fs)
N= length(signal);
samplesPer1Hz = N/fs;
delta = int16(3.5*samplesPer1Hz); %last sample of delta frequncies
theta = int16(7.5*samplesPer1Hz); %last sample of theta frequncies
alpha = int16(13*samplesPer1Hz); %last sample of alpha frequncies
beta = int16(30*samplesPer1Hz); %last sample of beta frequncies
x=fft(double(signal));
powerSpectrum = 20*log10(abs(real(x)));
smoothPS=smooth(powerSpectrum,51);
PSmin=min(powerSpectrum(1:beta));
y1=[(smoothPS(1:delta)); zeros(beta-delta,1)+PSmin];
y2=[zeros(delta-1,1)+PSmin; (smoothPS(delta:theta)); zeros(beta-theta,1)+PSmin];
y3=[zeros(theta-1,1)+PSmin; (smoothPS(theta:alpha)); zeros(beta-alpha,1)+PSmin];
y4=[zeros(alpha-1,1)+PSmin; (smoothPS(alpha:beta))];
a1=area(axis,1:beta,y1);
set(a1,'FaceColor','yellow')
hold on
a2=area(axis,1:beta,y2);
set(a2,'FaceColor','blue')
a3=area(axis,1:beta,y3);
set(a3,'FaceColor','green')
a4=area(axis,1:beta,y4);
set(a4,'FaceColor','red')
ADDED
And here is the function which calls the function above.
function [] = drawPowerSpectrum(axesContainer,dataContainer,fs)
size = length(axesContainer);
for l=1:size
powerSpectrumSmooth(axesContainer{l},dataContainer{l},fs)
set(axesContainer{l},'XTickLabel','')
set(axesContainer{l},'YTickLabel','')
uistack(axesContainer{l}, 'top');
end
ADDED 29th July
Here is a script which reproduces the error, so you can run it in your computer. Before running it again you might need to clear variables.
len = 9;
axesContainer = cell(len,1);
x = [0.1,0.4,0.7,0.1,0.4,0.7,0.1,0.4,0.7];
y = [0.1,0.1,0.1,0.4,0.4,0.4,0.7,0.7,0.7];
figure(1)
for i=1:len
axesContainer{i} = axes('Position',[x(i),y(i),0.2,0.2]);
end
dataContainer = cell(len,1);
N = 1500;
for i=1:len
dataContainer{i} = rand(1,N)*100;
end
for l=1:len
y1=[(dataContainer{l}(1:N/4)) zeros(1,3*N/4)];
y2=[zeros(1,N/4) (dataContainer{l}(N/4+1:(2*N/4))) zeros(1,2*N/4)];
y3=[zeros(1,2*N/4) (dataContainer{l}(2*N/4+1:3*N/4)) zeros(1,N/4)];
y4=[zeros(1,3*N/4) (dataContainer{l}(3*N/4+1:N))];
axes=axesContainer{l};
a1=area(axes,1:N,y1);
set(a1,'FaceColor','yellow')
hold on
a2=area(axes,1:N,y2);
set(a2,'FaceColor','blue')
hold on
a3=area(axes,1:N,y3);
set(a3,'FaceColor','green')
hold on
a4=area(axes,1:N,y4);
set(a4,'FaceColor','red')
set(axes,'XTickLabel','')
set(axes,'YTickLabel','')
end
My result of this script is plotted below:
Again only one picture contains all areas.
It looks like that every call to plot(axes,data) deletes whatever was written in axes.
Important note: Do not use a variable name the same as a function. Do not call something sin ,plot or axes!! I changed it to axs.
To solve the problem I just used the classic subplot instead of creating the axes as you did:
len = 9;
axesContainer = cell(len,1);
x = [0.1,0.4,0.7,0.1,0.4,0.7,0.1,0.4,0.7];
y = [0.1,0.1,0.1,0.4,0.4,0.4,0.7,0.7,0.7];
figure(1)
dataContainer = cell(len,1);
N = 1500;
for i=1:len
dataContainer{i} = rand(1,N)*100;
end
for l=1:len
y1=[(dataContainer{l}(1:N/4)) zeros(1,3*N/4)];
y2=[zeros(1,N/4) (dataContainer{l}(N/4+1:(2*N/4))) zeros(1,2*N/4)];
y3=[zeros(1,2*N/4) (dataContainer{l}(2*N/4+1:3*N/4)) zeros(1,N/4)];
y4=[zeros(1,3*N/4) (dataContainer{l}(3*N/4+1:N))];
axs=subplot(3,3,l);
a1=area(axs,1:N,y1);
set(a1,'FaceColor','yellow')
hold on
a2=area(axs,1:N,y2);
set(a2,'FaceColor','blue')
hold on
a3=area(axs,1:N,y3);
set(a3,'FaceColor','green')
hold on
a4=area(axs,1:N,y4);
set(a4,'FaceColor','red')
set(axs,'XTickLabel','')
set(axs,'YTickLabel','')
axis tight % this is to beautify it.
end
As far as I know, you can still save the axs variable in an axescontainer and then modify the properties you want (like location).
I found out how to do what I needed.
len = 8;
axesContainer = cell(len,1);
x = [0.1,0.4,0.7,0.1,0.4,0.7,0.1,0.4];
y = [0.1,0.1,0.1,0.4,0.4,0.4,0.7,0.7];
figure(1)
for i=1:len
axesContainer{i} = axes('Position',[x(i),y(i),0.2,0.2]);
end
dataContainer = cell(len,1);
N = 1500;
for i=1:len
dataContainer{i} = rand(1,N)*100;
end
for l=1:len
y1=[(dataContainer{l}(1:N/4)) zeros(1,3*N/4)];
y2=[zeros(1,N/4) (dataContainer{l}(N/4+1:(2*N/4))) zeros(1,2*N/4)];
y3=[zeros(1,2*N/4) (dataContainer{l}(2*N/4+1:3*N/4)) zeros(1,N/4)];
y4=[zeros(1,3*N/4) (dataContainer{l}(3*N/4+1:N))];
axes=axesContainer{l};
Y=[y1',y2',y3',y4'];
a=area(axes,Y);
set(axes,'XTickLabel','')
set(axes,'YTickLabel','')
end
The area is supposed to work with matrices like this. The tricky part is, that the signal in every next column is not plotted absolutely, but relatively to the data in previous column. That means, if at time 1 the data in first column has value 1 and data in second column has value 4, the second column data is ploted at value 5. Source: http://www.mathworks.com/help/matlab/ref/area.html

intermittent loops in matlab

Hello again logical friends!
I’m aware this is quite an involved question so please bear with me! I think I’ve managed to get it down to two specifics:- I need two loops which I can’t seem to get working…
Firstly; The variable rollers(1).ink is a (12x1) vector containing ink values. This program shares the ink equally between rollers at each connection. I’m attempting to get rollers(1).ink to interact with rollers(2) only at specific timesteps. The ink should transfer into the system once for every full revolution i.e. nTimesSteps = each multiple of nBins_max. The ink should not transfer back to rollers(1).ink as the system rotates – it should only introduce ink to the system once per revolution and not take any back out. Currently I’ve set rollers(1).ink = ones but only for testing. I’m truly stuck here!
Secondly; The reason it needs to do this is because at the end of the sim I also wish to remove ink in the form of a printed image. The image should be a reflection of the ink on the last roller in my system and half of this value should be removed from the last roller and taken out of the system at each revolution. The ink remaining on the last roller should be recycled and ‘re-split’ in the system ready for the next rotation.
So…I think it’s around the loop beginning line86 where I need to do all this stuff. In pseudo, for the intermittent in-feed I’ve been trying something like:
For k = 1:nTimeSteps
While nTimesSteps = mod(nTimeSteps, nBins_max) == 0 % This should only output when nTimeSteps is a whole multiple of nBins_max i.e. one full revolution
‘Give me the ink on each segment at each time step in a matrix’
End
The output for averageAmountOfInk is the exact format I would like to return this data except I don’t really need the average, just the actual value at each moment in time. I keep getting errors for dimensional mismatches when I try to re-create this using something like:
For m = 1:nTimeSteps
For n = 1:N
Rollers(m,n) = rollers(n).ink’;
End
End
I’ll post the full code below if anyone is interested to see what it does currently. There’s a function at the end also which of course needs to be saved out to a separate file.
I’ve posted variations of this question a couple of times but I’m fully aware it’s quite a tricky one and I’m finding it difficult to get my intent across over the internets!
If anyone has any ideas/advice/general insults about my lack of programming skills then feel free to reply!
%% Simple roller train
% # Single forme roller
% # Ink film thickness = 1 micron
clc
clear all
clf
% # Initial state
C = [0,70; % # Roller centres (x, y)
10,70;
21,61;
11,48;
21,34;
27,16;
0,0
];
R = [5.6,4.42,9.8,6.65,10.59,8.4,23]; % # Roller radii (r)
% # Direction of rotation (clockwise = -1, anticlockwise = 1)
rotDir = [1,-1,1,-1,1,-1,1]';
N = numel(R); % # Amount of rollers
% # Find connected rollers
isconn = #(m, n)(sum(([1, -1] * C([m, n], :)).^2)...
-sum(R([m, n])).^2 < eps);
[Y, X] = meshgrid(1:N, 1:N);
conn = reshape(arrayfun(isconn, X(:), Y(:)), N, N) - eye(N);
% # Number of bins for biggest roller
nBins_max = 50;
nBins = round(nBins_max*R/max(R))';
% # Initialize roller struct
rollers = struct('position',{}','ink',{}','connections',{}',...
'rotDirection',{}');
% # Initialise matrices for roller properties
for ii = 1:N
rollers(ii).ink = zeros(1,nBins(ii));
rollers(ii).rotDirection = rotDir(ii);
rollers(ii).connections = zeros(1,nBins(ii));
rollers(ii).position = 1:nBins(ii);
end
for ii = 1:N
for jj = 1:N
if(ii~=jj)
if(conn(ii,jj) == 1)
connInd = getConnectionIndex(C,ii,jj,nBins(ii));
rollers(ii).connections(connInd) = jj;
end
end
end
end
% # Initialize averageAmountOfInk and calculate initial distribution
nTimeSteps = 1*nBins_max;
averageAmountOfInk = zeros(nTimeSteps,N);
inkPerSeg = zeros(nTimeSteps,N);
for ii = 1:N
averageAmountOfInk(1,ii) = mean(rollers(ii).ink);
end
% # Iterate through timesteps
for tt = 1:nTimeSteps
rollers(1).ink = ones(1,nBins(1));
% # Rotate all rollers
for ii = 1:N
rollers(ii).ink(:) = ...
circshift(rollers(ii).ink(:),rollers(ii).rotDirection);
end
% # Update all roller-connections
for ii = 1:N
for jj = 1:nBins(ii)
if(rollers(ii).connections(jj) ~= 0)
index1 = rollers(ii).connections(jj);
index2 = find(ii == rollers(index1).connections);
ink1 = rollers(ii).ink(jj);
ink2 = rollers(index1).ink(index2);
rollers(ii).ink(jj) = (ink1+ink2)/2;
rollers(index1).ink(index2) = (ink1+ink2)/2;
end
end
end
% # Calculate average amount of ink on each roller
for ii = 1:N
averageAmountOfInk(tt,ii) = sum(rollers(ii).ink);
end
end
image(5:20) = (rollers(7).ink(5:20))./2;
inkPerSeg1 = [rollers(1).ink]';
inkPerSeg2 = [rollers(2).ink]';
inkPerSeg3 = [rollers(3).ink]';
inkPerSeg4 = [rollers(4).ink]';
inkPerSeg5 = [rollers(5).ink]';
inkPerSeg6 = [rollers(6).ink]';
inkPerSeg7 = [rollers(7).ink]';
This is an extended comment rather than a proper answer, but the comment box is a bit too small ...
Your code overwhelms me, I can't see the wood for the trees. I suggest that you eliminate all the stuff we don't need to see to help you with your immediate problem (all those lines drawing figures for example) -- I think it will help you to debug your code yourself to put all that stuff into functions or scripts.
Your code snippet
For k = 1:nTimeSteps
While nTimesSteps = mod(nTimeSteps, nBins_max) == 0
‘Give me the ink on each segment at each time step in a matrix’
End
might be (I don't quite understand your use of the while statement, the word While is not a Matlab keyword, and as you have written it the value returned by the statement doesn't change from iteration to iteration) equivalent to
For k = 1:nBins_max:nTimeSteps
‘Give me the ink on each segment at each time step in a matrix’
End
You seem to have missed an essential feature of Matlab's colon operator ...
1:8 = [1 2 3 4 5 6 7 8]
but
1:2:8 = [1 3 5 7]
that is, the second number in the triplet is the stride between successive elements.
Your matrix conn has a 1 at the (row,col) where rollers are connected, and a 0 elsewhere. You can find the row and column indices of all the 1s like this:
[ri,ci] = find(conn==1)
You could then pick up the (row,col) locations of the 1s without the nest of loops and if statements that begins
for ii = 1:N
for jj = 1:N
if(ii~=jj)
if(conn(ii,jj) == 1)
I could go on, but won't, that's enough for one comment.

Rolling window for averaging using MATLAB

I have the following code, pasted below. I would like to change it to only average the 10 most recently filtered images and not the entire group of filtered images. The line I think I need to change is: Yout(k,p,q) = (Yout(k,p,q) + (y.^2))/2;, but how do I do it?
j=1;
K = 1:3600;
window = zeros(1,10);
Yout = zeros(10,column,row);
figure;
y = 0; %# Preallocate memory for output
%Load one image
for i = 1:length(K)
disp(i)
str = int2str(i);
str1 = strcat(str,'.mat');
load(str1);
D{i}(:,:) = A(:,:);
%Go through the columns and rows
for p = 1:column
for q = 1:row
if(mean2(D{i}(p,q))==0)
x = 0;
else
if(i == 1)
meanvalue = mean2(D{i}(p,q));
end
%Calculate the temporal mean value based on previous ones.
meanvalue = (meanvalue+D{i}(p,q))/2;
x = double(D{i}(p,q)/meanvalue);
end
%Filtering for 10 bands, based on the previous state
for k = 1:10
[y, ZState{k}] = filter(bCoeff{k},aCoeff{k},x,ZState{k});
Yout(k,p,q) = (Yout(k,p,q) + (y.^2))/2;
end
end
end
% for k = 2:10
% subplot(5,2,k)
% subimage(Yout(k)*5000, [0 100]);
% colormap jet
% end
% pause(0.01);
end
disp('Done Loading...')
The best way to do this (in my opinion) would be to use a circular-buffer to store your images. In a circular-, or ring-buffer, the oldest data element in the array is overwritten by the newest element pushed in to the array. The basics of making such a structure are described in the short Mathworks video Implementing a simple circular buffer.
For each iteration of you main loop that deals with a single image, just load a new image into the circular-buffer and then use MATLAB's built in mean function to take the average efficiently.
If you need to apply a window function to the data, then make a temporary copy of the frames multiplied by the window function and take the average of the copy at each iteration of the loop.
The line
Yout(k,p,q) = (Yout(k,p,q) + (y.^2))/2;
calculates a kind of Moving Average for each of the 10 bands over all your images.
This line calculates a moving average of meanvalue over your images:
meanvalue=(meanvalue+D{i}(p,q))/2;
For both you will want to add a buffer structure that keeps only the last 10 images.
To simplify it, you can also just keep all in memory. Here is an example for Yout:
Change this line: (Add one dimension)
Yout = zeros(3600,10,column,row);
And change this:
for q = 1:row
[...]
%filtering for 10 bands, based on the previous state
for k = 1:10
[y, ZState{k}] = filter(bCoeff{k},aCoeff{k},x,ZState{k});
Yout(i,k,p,q) = y.^2;
end
YoutAvg = zeros(10,column,row);
start = max(0, i-10+1);
for avgImg = start:i
YoutAvg(k,p,q) = (YoutAvg(k,p,q) + Yout(avgImg,k,p,q))/2;
end
end
Then to display use
subimage(Yout(k)*5000, [0 100]);
You would do sth. similar for meanvalue

Kmean plotting in matlab

I am on a project thumb recognition system on matlab. I implemented Kmean Algorithm and I got results as well. Actually now I want to plot the results like here they done. I am trying but couldn't be able to do so. I am using the following code.
load training.mat; % loaded just to get trainingData variable
labelData = zeros(200,1);
labelData(1:100,:) = 0;
labelData(101:200,:) = 1;
k=2;
[trainCtr, traina] = kmeans(trainingData,k);
trainingResult1=[];
for i=1:k
trainingResult1 = [trainingResult1 sum(trainCtr(1:100)==i)];
end
trainingResult2=[];
for i=1:k
trainingResult2 = [trainingResult2 sum(trainCtr(101:200)==i)];
end
load testing.mat; % loaded just to get testingData variable
c1 = zeros(k,1054);
c1 = traina;
cluster = zeros(200,1);
for j=1:200
testTemp = repmat(testingData(j,1:1054),k,1);
difference = sum((c1 - testTemp).^2, 2);
[value index] = min(difference);
cluster(j,1) = index;
end
testingResult1 = [];
for i=1:k
testingResult1 = [testingResult1 sum(cluster(1:100)==i)];
end
testingResult2 = [];
for i=1:k
testingResult2 = [testingResult2 sum(cluster(101:200)==i)];
end
in above code trainingData is matrix of 200 X 1054 in which 200 are images of thumbs and 1054 are columns. actually each image is of 25 X 42. I reshaped each image in to row matrix (1 X 1050) and 4 other (some features) columns so total of 1054 columns are in each image. Similarly testingData I made it in the similar manner as I made testingData It is also the order of 200 X 1054. Now my Problem is just to plot the results as they did in here.
After selecting 2 features, you can just follow the example. Start a figure, use hold on, and use plot or scatter to plot the centroids and the data points. E.g.
selectedFeatures = [42,43];
plot(trainingData(trainCtr==1,selectedFeatures(1)),
trainingData(trainCtr==1,selectedFeatures(2)),
'r.','MarkerSize',12)
Would plot the selected feature values of the data points in cluster 1.

SET game odds simulation (MATLAB)

I have recently found the great card came - SET. Briefly, there are 81 cards with the four features: symbol (oval, squiggle or diamond), color (red, purple or green), number (one, two or three) and shading (solid, striped or open). The task is to locate (from selected 12 cards) a SET of 3 cards, in which each of the four features is either all the same on each card or all different on each card (no 2+1 combination).
I've coded it in MATLAB to find a solution and to estimate odds of having a set in randomly selected cards.
Here is my code to estimate odds:
%% initialization
K = 12; % cards to draw
NF = 4; % number of features (usually 3 or 4)
setallcards = unique(nchoosek(repmat(1:3,1,NF),NF),'rows'); % all cards: rows - cards, columns - features
setallcomb = nchoosek(1:K,3); % index of all combinations of K cards by 3
%% test
tic
NIter=1e2; % number of test iterations
setexists = 0; % test results holder
% C = progress('init'); % if you have progress function from FileExchange
for d = 1:NIter
% C = progress(C,d/NIter);
% cards for current test
setdrawncardidx = randi(size(setallcards,1),K,1);
setdrawncards = setallcards(setdrawncardidx,:);
% find all sets in current test iteration
for setcombidx = 1:size(setallcomb,1)
setcomb = setdrawncards(setallcomb(setcombidx,:),:);
if all(arrayfun(#(x) numel(unique(setcomb(:,x))), 1:NF)~=2) % test one combination
setexists = setexists + 1;
break % to find only the first set
end
end
end
fprintf('Set:NoSet = %g:%g = %g:1\n', setexists, NIter-setexists, setexists/(NIter-setexists))
toc
100-1000 iterations are fast, but be careful with more. One million iterations takes about 15 hours on my home computer. Anyway, with 12 cards and 4 features I've got around 13:1 of having a set. This is actually a problem. The instruction book said this number should be 33:1. And it was recently confirmed by Peter Norvig. He provides the Python code, but I didn't test it yet.
So can you find an error? Any comments on performance improvement are welcome.
I tackled the problem writing my own implementation before looking at your code. My first attempt was very similar to what you already had :)
%# some parameters
NUM_ITER = 100000; %# number of simulations to run
DRAW_SZ = 12; %# number of cards we are dealing
SET_SZ = 3; %# number of cards in a set
FEAT_NUM = 4; %# number of features (symbol,color,number,shading)
FEAT_SZ = 3; %# number of values per feature (eg: red/purple/green, ...)
%# cards features
features = {
'oval' 'squiggle' 'diamond' ; %# symbol
'red' 'purple' 'green' ; %# color
'one' 'two' 'three' ; %# number
'solid' 'striped' 'open' %# shading
};
fIdx = arrayfun(#(k) grp2idx(features(k,:)), 1:FEAT_NUM, 'UniformOutput',0);
%# list of all cards. Each card: [symbol,color,number,shading]
[W X Y Z] = ndgrid(fIdx{:});
cards = [W(:) X(:) Y(:) Z(:)];
%# all possible sets: choose 3 from 12
setsInd = nchoosek(1:DRAW_SZ,SET_SZ);
%# count number of valid sets in random draws of 12 cards
counterValidSet = 0;
for i=1:NUM_ITER
%# pick 12 cards
ord = randperm( size(cards,1) );
cardsDrawn = cards(ord(1:DRAW_SZ),:);
%# check for valid sets: features are all the same or all different
for s=1:size(setsInd,1)
%# set of 3 cards
set = cardsDrawn(setsInd(s,:),:);
%# check if set is valid
count = arrayfun(#(k) numel(unique(set(:,k))), 1:FEAT_NUM);
isValid = (count==1|count==3);
%# increment counter
if isValid
counterValidSet = counterValidSet + 1;
break %# break early if found valid set among candidates
end
end
end
%# ratio of found-to-notfound
fprintf('Size=%d, Set=%d, NoSet=%d, Set:NoSet=%g\n', ...
DRAW_SZ, counterValidSet, (NUM_ITER-counterValidSet), ...
counterValidSet/(NUM_ITER-counterValidSet))
After using the Profiler to discover hot spots, some improvement can be made mainly by early-break'ing out of loops when possible. The main bottleneck is the call to the UNIQUE function. Those two lines above where we check for valid sets can be rewritten as:
%# check if set is valid
isValid = true;
for k=1:FEAT_NUM
count = numel(unique(set(:,k)));
if count~=1 && count~=3
isValid = false;
break %# break early if one of the features doesnt meet conditions
end
end
Unfortunately, the simulation is still slow for larger simulation. Thus my next solution is a vectorized version, where for each iteration, we build a single matrix of all possible sets of 3 cards from the hand of 12 drawn cards. For all these candidate sets, we use logical vectors to indicate what feature is present, thus avoiding the calls to UNIQUE/NUMEL (we want features all the same or all different on each card of the set).
I admit that the code is now less readable and harder to follow (thus I posted both versions for comparison). The reason being that I tried to optimize the code as much as possible, so that each iteration-loop is fully vectorized. Here is the final code:
%# some parameters
NUM_ITER = 100000; %# number of simulations to run
DRAW_SZ = 12; %# number of cards we are dealing
SET_SZ = 3; %# number of cards in a set
FEAT_NUM = 4; %# number of features (symbol,color,number,shading)
FEAT_SZ = 3; %# number of values per feature (eg: red/purple/green, ...)
%# cards features
features = {
'oval' 'squiggle' 'diamond' ; %# symbol
'red' 'purple' 'green' ; %# color
'one' 'two' 'three' ; %# number
'solid' 'striped' 'open' %# shading
};
fIdx = arrayfun(#(k) grp2idx(features(k,:)), 1:FEAT_NUM, 'UniformOutput',0);
%# list of all cards. Each card: [symbol,color,number,shading]
[W X Y Z] = ndgrid(fIdx{:});
cards = [W(:) X(:) Y(:) Z(:)];
%# all possible sets: choose 3 from 12
setsInd = nchoosek(1:DRAW_SZ,SET_SZ);
%# optimizations: some calculations taken out of the loop
ss = setsInd(:);
set_sz2 = numel(ss)*FEAT_NUM/SET_SZ;
col = repmat(1:set_sz2,SET_SZ,1);
col = FEAT_SZ.*(col(:)-1);
M = false(FEAT_SZ,set_sz2);
%# progress indication
%#hWait = waitbar(0./NUM_ITER, 'Simulation...');
%# count number of valid sets in random draws of 12 cards
counterValidSet = 0;
for i=1:NUM_ITER
%# update progress
%#waitbar(i./NUM_ITER, hWait);
%# pick 12 cards
ord = randperm( size(cards,1) );
cardsDrawn = cards(ord(1:DRAW_SZ),:);
%# put all possible sets of 3 cards next to each other
set = reshape(cardsDrawn(ss,:)',[],SET_SZ)';
set = set(:);
%# check for valid sets: features are all the same or all different
M(:) = false; %# if using PARFOR, it will complain about this
M(set+col) = true;
isValid = all(reshape(sum(M)~=2,FEAT_NUM,[]));
%# increment counter if there is at least one valid set in all candidates
if any(isValid)
counterValidSet = counterValidSet + 1;
end
end
%# ratio of found-to-notfound
fprintf('Size=%d, Set=%d, NoSet=%d, Set:NoSet=%g\n', ...
DRAW_SZ, counterValidSet, (NUM_ITER-counterValidSet), ...
counterValidSet/(NUM_ITER-counterValidSet))
%# close progress bar
%#close(hWait)
If you have the Parallel Processing Toolbox, you can easily replace the plain FOR-loop with a parallel PARFOR (you might want to move the initialization of the matrix M inside the loop again: replace M(:) = false; with M = false(FEAT_SZ,set_sz2);)
Here are some sample outputs of 50000 simulations (PARFOR used with a pool of 2 local instances):
» tic, SET_game2, toc
Size=12, Set=48376, NoSet=1624, Set:NoSet=29.7882
Elapsed time is 5.653933 seconds.
» tic, SET_game2, toc
Size=15, Set=49981, NoSet=19, Set:NoSet=2630.58
Elapsed time is 9.414917 seconds.
And with a million iterations (PARFOR for 12, no-PARFOR for 15):
» tic, SET_game2, toc
Size=12, Set=967516, NoSet=32484, Set:NoSet=29.7844
Elapsed time is 110.719903 seconds.
» tic, SET_game2, toc
Size=15, Set=999630, NoSet=370, Set:NoSet=2701.7
Elapsed time is 372.110412 seconds.
The odds ratio agree with the results reported by Peter Norvig.
Here's a vectorized version, where 1M hands can be calculated in about a minute. I got about 28:1 with it, so there might still be something a little off with finding 'all different' sets. My guess is that this is what your solution has trouble with, as well.
%# initialization
K = 12; %# cards to draw
NF = 4; %# number of features (this is hard-coded to 4)
nIter = 100000; %# number of iterations
%# each card has four features. This means that a card can be represented
%# by a coordinate in 4D space. A set is a full row, column, etc in 4D
%# space. We can even parallelize the iterations, at least as long as we
%# have RAM (each hand costs 81 bytes)
%# make card space - one dimension per feature, plus one for the iterations
cardSpace = false(3,3,3,3,nIter);
%# To draw cards, we put K trues into each cardSpace. I can't think of a
%# good, fast way to draw exactly K cards that doesn't involve calling
%# unique
for i=1:nIter
shuffle = randperm(81) + (i-1) * 81;
cardSpace(shuffle(1:K)) = true;
end
%# to test, all we have to do is check whether there is any row, column,
%# with all 1's
isEqual = squeeze(any(any(any(all(cardSpace,1),2),3),4) | ...
any(any(any(all(cardSpace,2),1),3),4) | ...
any(any(any(all(cardSpace,3),2),1),4) | ...
any(any(any(all(cardSpace,4),2),3),1));
%# to get a set of 3 cards where all symbols are different, we require that
%# no 'sub-volume' is completely empty - there may be something wrong with this
%# but since my test looked ok, I'm not going to investigate on Friday night
isDifferent = squeeze(~any(all(all(all(~cardSpace,1),2),3),4) & ...
~any(all(all(all(~cardSpace,1),2),4),3) & ...
~any(all(all(all(~cardSpace,1),3),4),2) & ...
~any(all(all(all(~cardSpace,4),2),3),1));
isSet = isEqual | isDifferent;
%# find the odds
fprintf('odds are %5.2f:1\n',sum(isSet)/(nIter-sum(isSet)))
I found my error. Thanks Jonas for the hint with RANDPERM.
I used RANDI to randomly drawn K cards, but there is about 50% chance to get repeats even in 12 cards. When I substituted this line with randperm, I've got 33.8:1 with 10000 iterations, very close to the number in instruction book.
setdrawncardidx = randperm(81);
setdrawncardidx = setdrawncardidx(1:K);
Anyway, it would be interesting to see other approaches to the problem.
I'm sure there's something wrong with my calculation of these odds, since several others have confirmed with simulations that it's close to 33:1 as in the instructions, but what's wrong with the following logic?
For 12 random cards, there are 220 possible combinations of three cards (12!/(9!3!) = 220). Each combination of three cards has a 1/79 chance of being a set, so there's a 78/79 chance of three arbitrary cards not being a set. So if you examined all 220 combinations and there were a 78/79 chance that each one weren't a set, then your chance of not finding a set examining all possible combinations would be 78/79 raised to the 220th power, or 0.0606, which is approx. 17:1 odds.
I must be missing something...?
Christopher