Matlab CUDA basic experiment - matlab

(correctly and instructively asnwered, see below)
I'm beginning to do experiments with matlab and gpu (nvidia gtx660).
Now, I wrote this simple monte carlo algorithm to calculate PI. The following is the CPU version:
function pig = mc1vecnocuda(n)
countr=0;
A=rand(n,2);
for i=1:n
if norm(A(i,:))<1
countr=countr+1;
end
end
pig=(countr/n)*4;
end
This takes very little time to be executed on CPU with 100000 points "thrown" into the unit circle:
>> tic; mc1vecnocuda(100000);toc;
Elapsed time is 0.092473 seconds.
See, instead, what happens with gpu-ized version of the algorithm:
function pig = mc1veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
if norm(A(i,:))<1
gpucountr=gpucountr+1;
end
end
pig=(gpucountr/n)*4;
end
Now, this takes a LONG time to be executed:
>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.
I don't understand why. I used parfor loop with 1024 workers, because querying my nvidia card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the gtx660.
Can someone help me? Thanks.
Edit: this is the updated version that avoids IF:
function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
gpucountr = gpucountr+nnz(norm(A(i,:))<1);
end
pig=(gpucountr/n)*4;
end
And this is the code written following Bichoy's guidelines (the
right code to achieve result):
function pig = mc3veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
Asq = A.^2;
Asqsum_big_column = Asq(:,1)+Asq(:,2);
Anorms=Asqsum_big_column.^(1/2);
gpucountr=gpucountr+nnz(Anorms<1);
pig=(gpucountr/n)*4;
end
Please note execution time with n=10 millions:
>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.
I didn't test my original cuda version (for/parfor), for its execution would require hours with n=10000000.
Great Bichoy! ;)

I guess the problem is with parfor!
parfor is supposed to run on MATLAB workers, that is your host not the GPU!
I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This result in the tremendous time your code is taking.
Try to re-write your code to use matrix and array operations, not for-loops! This will show some speed-up. Also, remember that you should have much more calculations to do in the GPU otherwise, memory transfer will just dominate your code.
Code:
This is the final code after including all corrections and suggestions from several people:
function pig = mc2veccuda(n)
A=gpuArray.rand(n,2); % An nx2 random matrix
Asq = A.^2; % Get the square value of each element
Anormsq = Asq(:,1)+Asq(:,2); % Get the norm squared of each point
gpucountr = nnz(Anorm<1); % Check the number of elements < 1
pig=(gpucountr/n)*4;

Many reasons like:
Movement of data between host & device
Computation within each loop is very small
Call to rand on GPU may not be parallel
if condition within the loop can cause divergence
Accumulation to a common variable may run in serial, with overhead
It is difficult to profile Matlab+CUDA code. You should probably try in native C++/CUDA and use parallel Nsight to find the bottleneck.

As Bichoy said, CUDA code should always be done vectorized. In MATLAB, unless you're writing a CUDA Kernal, the only large speedup that you're getting is that the vectorized operations are called on the GPU which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.
Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generating algorithms than the "standard" ones. So to add to Bichoy's answer, adding the parameter 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html

Related

Why accessing 2d matrix in parfor so slow?

Let's say I have a large matrix A:
A = rand(10000,10000);
The following serial code took around 0.5 seconds
tic
for i=1:5
r=9999*rand(1);
disp(A(round(r)+1, round(r)+1))
end
toc
Whereas the following code with parfor took around 47 seconds
tic
parfor i=1:5
r=9999*rand(1);
disp(A(round(r)+1, round(r)+1))
end
toc
How can I speed up the parfor code?
EDIT: If instead of using disp, I try to compute the sum with the following code
sum=0;
tic
for i=1:5000
r=9999*rand(1);
sum=sum+(A(round(r)+1, round(r)+1));
end
toc
This takes .025 sec
But parfor it takes 42.5 sec:
tic
parfor i=1:5000
r=9999*rand(1);
sum=sum+(A(round(r)+1, round(r)+1));
end
toc
Your issue is in not considering node communication overheads.
When you use a parfor to loop using parallel computation, you have to think about the structure of several worker nodes doing small tasks for the client node.
Here are some issues with the tests you present:
The function disp is serial, since you can only display results one at a time to the client node. Communication between nodes is needed to schedule this task.
Creating a summation external to the loop means all of the nodes have to communicate the current value back to the client node.
A is a broadcast variable in all of your examples. From the docs:
This type of variable can be useful or even essential for particular tasks. However, large broadcast variables can cause significant communication between client and workers and increase parallel overhead.
The MATLAB editor warns you about this, underlining the variable in orange with the following tooltip:
The entire array or structure 'A' is a broadcast variable. This might result in unnecessary communication overhead.
Instead, we can calculate some random indices up front and slice A into temporary variables to use in the loop. Then do gathering operations (like summing all of the parts) after the loop.
k = 50;
sumA = zeros( k, 1 ); % Output values for each loop index
idx = randi( [1,1e4], k, 1 ); % Calculate our indices outside the loop
randA = A( idx, idx ); % Slice A outside the loop
parfor ii = 1:k
sumA( ii ) = randA( ii ); % All that's left to do in the loop
end
sumA = sum( sumA ); % Collate results from all nodes
I did a quick benchmark to compare your 2 summation tests with the above code, using R2017b and 12 workers, here are my results:
Serial loop: ~ 0.001 secs
Parallel with broadcasting: ~ 100 secs
Parallel no broadcasting: ~ 0.1 secs
Parallel loops are overkill for this operation, the overhead isn't justified, but it's clear that with some pre-allocation and avoiding of broadcast variables, they are at least not 5 orders of magnitude slower!
See how the version of the code without broadcast variables uses more vectorisation too, which will speed up the code without even having to use parfor. Optimising your code before using parallel computation will not only speed things up for serial computation, but often make the transition easier too!
Side note: sum and i are bad variable names because they are the names of built-in functions.
So there are a few main causes,
MATLAB parallel toolbox sucks. It just does unless you're using the GPU portion.
The only time it's beneficial is if the individual tasks are large enough. Your computer has to dedicate a core to assigning jobs to all the other cores. This is expensive and has a lot of overhead unless the jobs are of sufficient size. Your computer is running overtime assigning small jobs. If you were assigning jobs that would each take a minute it would be a different story.
You're running too few jobs. You're only looping through 5 times on very small jobs. Why would you even bother trying to multithread this? When I assign it to loop through 500,000 times it finally gains a speedup with parfor if I reduce the matrix size to 1000 x 1000
When you run parfor, MATLAB has to duplicate memory across all of the treads, you have a 10,000 x 10,000 matrix which takes up 800 MB. Duplicated across a 4 core machine is 3,200 MB or probably half of your RAM. Operating on these arrays costs extra memory, potentially doubling the size -> 6,400 MB. Probably more than you can afford to use.
Simply put, "how do you I speed up this parfor code?"
You don't

I want advice about how to optimize my code. It takes too long for execution

I wrote a MATLAB code for finding seismic signal (ex. P wave) from SAC(seismic) file (which is read via another code). This algorithm is called STA/LTA trigger algorithm (actually not that important for my question)
Important thing is that actually this code works well, but since my seismic file is too big (1GB, which is for two months), it takes almost 40 minutes for executing to see the result. Thus, I feel the need to optimize the code.
I heard that replacing loops with advanced functions would help, but since I am a novice in MATLAB, I cannot get an idea about how to do it, since the purpose of code is scan through the every time series.
Also, I heard that preallocation might help, but I have mere idea about how to actually do this.
Since this code is about seismology, it might be hard to understand, but my notes at the top might help. I hope I can get useful advice here.
Following is my code.
function[pstime]=classic_LR(tseries,ltw,stw,thresh,dt)
% This is the code for "Classic LR" algorithm
% 'ns' is the number of measurement in STW-used to calculate STA
% 'nl' is the number of measurement in LTW-used to calculate LTA
% 'dt' is the time gap between measurements i.e. 0.008s for HHZ and 0.02s for BHZ
% 'ltw' and 'stw' are long and short time windows respectively
% 'lta' and 'sta' are long and short time windows average respectively
% 'sra' is the ratio between 'sta' and 'lta' which will be each component
% for a vector containing the ratio for each measurement point 'i'
% Index 'i' denotes each measurement point and this will be converted to actual time
nl=fix(ltw/dt);
ns=fix(stw/dt);
nt=length(tseries);
aseries=abs(detrend(tseries));
sra=zeros(1,nt);
for i=1:nt-ns
if i>nl
lta=mean(aseries(i-nl:i));
sta=mean(aseries(i:i+ns));
sra(i)=sta/lta;
else
sra(i)=0;
end
end
[k]=find(sra>thresh);
if ~isempty(k)
pstime=k*dt;
else
pstime=0;
end
return;
If you have MATLAB 2016a or later, you can use movmean instead of your loop (this means you also don't need to preallocate anything):
lta = movmean(aseries(1:nt-ns),nl+1,'Endpoints','discard');
sta = movmean(aseries(nl+1:end),ns+1,'Endpoints','discard');
sra = sta./lta;
The only difference here is that you will get sra with no leading and trailing zeros. This is most likely to be the fastest way. If for instance, aseries is 'only' 8 MB than this method takes less than 0.02 second while the original method takes almost 6 seconds!
However, even if you don't have Matlab 2016a, considering your loop, you can still do the following:
Remove the else statement - sta(i) is already zero from the preallocating.
Start the loop from nl+1, instead of checking when i is greater than nl.
So your new loop will be:
for i=nl+1:nt-ns
lta = mean(aseries(i-nl:i));
sta = mean(aseries(i:i+ns));
sra(i)=sta/lta;
end
But it won't be so faster.

best way to parallelize calculations on time series data in matlab

I have a linux cluster with Matlab & PCT installed (128 workers with Torque Manager), and I am looking for a good way to parallelize my calculations.
I have a time-series Trajectory data (100k x 2) matrix. I perform maximum likelihood (ML) calculations that involve matrix diagonalization, exponentiation & multiplications, which is running fast for smaller matrices. I divide the Trajectory data into small chunks and perform the calculations on many workers (coarse parallelization) and don't have any problems here as it works fine (gets done in ~30s)
But the calculations also depend on a number of parameters that I need to vary & test the effect on ML. (something akin to parameter sweep).
When I try to do this using a loop, the calculations becomes progressively very slow, for some reason I am unable to figure out.
%%%%%%% Pseudo- Code Example:
% a [100000x2], timeseries data
load trajectoryData
% p1,p2,p3,p4 are parameters
% but i want to do this over a multiple values fp3 & fp4 ;
paramsMat = [p1Vect; p2Vect;p3Vect ;p4Vect];
matlabpool start 128
[ML] = objfun([p1 p2 p3 p4],trajectoryData) % runs fast ~ <30s
%% NOTE: this runs progressively slow
for i = 1:length(paramsMat)
currentparams = paramsMat(i,:);
[ML] = objfun(currentparams,trajectoryData)
end
matlabpool close
The objFunc function is as follows:
% objFunc.m
[ML] = objFunc(Params, trajectoryData)
% b = 2 always
[a b] = size(trajectoryData) ;
% split into fragments of 1000 points (or any other way)
fragsMat = reshape(trajectoryData,1000, a*2/1000) ;
% simple parallelization. do the calculation on small chunks
parfor ix = 1: numFragments
% do heavy calculations
costVal(ix) = costValFrag;
end
% just an example;
ML = sum(costVal) ;
%%%%%%
Just a single calculation oddly takes ~30s (using the full cluster) but within the for loop, for some weird reason there is damping of speed & even within the 100th calculation, it becomes very slow. The workers are using only 10-20% of CPU.
If you have any suggestions including alternative parallelization suggestions it would be of immense help.
If I read this correctly, each parameter set is completely independent of all the others, and you have more parameter sets than you do workers.
The simple solution is to use a batch job instead of parfor.
job_manager = findresource( ... look up the args that fit your cluster ... )
job = createJob(job_manager);
for i = 1:num_param_sets
t = createTask(job, #your_function, 0, {your params});
end
submit(job);
This way you avoid any communications overhead you have from the parfor of the inner function, and you keep your matlabs separate. You can even tell it to automatically restart the workers between tasks (I think), as one of the job parameters.
What is the value of numFragments? If this is not always larger than your number of workers, then you will see things slowing down.
I would suggest trying to make your outer for loop be the parfor. It's generally better to apply the parallelism at the outermost level.

Matlab's fftn gets slower with multithreading?

I have access to a 12 core machine and some matlab code that relies heavily on fftn. I would like to speed up my code.
Since the fft can be parallelized I would think that more cores would help but I'm seeing the opposite.
Here's an example:
X = peaks(1028);
ncores = feature('numcores');
ntrials = 20;
mtx_power_times = zeros(ncores,ntrials);
fft_times = zeros(ncores, ntrials);
for i=1:ncores
for j=1:ntrials
maxNumCompThreads(i);
tic;
X^2;
mtx_power_times(i,j) = toc;
tic
fftn(X);
fft_times(i,j) = toc;
end
end
subplot(1,2,1);
plot(mtx_power_times,'x-')
title('mtx power time vs number of cores');
subplot(1,2,2);
plot(fft_times,'x-');
title('fftn time vs num of cores');
Which gives me this:
The speedup for matrix multiplication is great but it looks like my ffts go almost 3x slower when I use all my cores. What's going on?
For reference my version is 7.12.0.635 (R2011a)
Edit: On large 2D arrays taking 1D transforms I get the same problem:
Edit: The problem appears to be that fftw is not seeing the thread limiting that maxNumCompThreads enforces. I'm getting all the cpus going full speed no matter what I set maxNumCompThreads at.
So... is there a way I can specify how many processors I want to use for an fft in Matlab?
Edit: Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...
Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...
To use different cores, you should use the Parallel Computing Toolbox. For instance, you could use a parfor loop, and you have to pass the functions as a list of handles:
function x = f(n, i)
...
end
m = ones(8);
parfor i=1:8
m(i,:) = f(m(i,:), i);
end
More info is available at:
High performance computing
Multithreaded computation
Multithreading

Timing program execution in MATLAB; weird results

I have a program which I copied from a textbook, and which times the difference in program execution runtime when calculating the same thing with uninitialized, initialized array and vectors.
However, although the program runs somewhat as expected, if running several times every once in a while it will give out a crazy result. See below for program and an example of crazy result.
clear all; clc;
% Purpose:
% This program calculates the time required to calculate the squares of
% all integers from 1 to 10000 in three different ways:
% 1. using a for loop with an uninitialized output array
% 2. Using a for loop with a pre-allocated output array
% 3. Using vectors
% PERFORM CALCULATION WITH AN UNINITIALIZED ARRAY
% (done only once because it is so slow)
maxcount = 1;
tic;
for jj = 1:maxcount
clear square
for ii = 1:10000
square(ii) = ii^2;
end
end
average1 = (toc)/maxcount;
% PERFORM CALCULATION WITH A PRE-ALLOCATED ARRAY
% (averaged over 10 loops)
maxcount = 10;
tic;
for jj = 1:maxcount
clear square
square = zeros(1,10000);
for ii = 1:10000
square(ii) = ii^2;
end
end
average2 = (toc)/maxcount;
% PERFORM CALCULATION WITH VECTORS
% (averaged over 100 executions)
maxcount = 100;
tic;
for jj = 1:maxcount
clear square
ii = 1:10000;
square = ii.^2;
end
average3 = (toc)/maxcount;
% Display results
fprintf('Loop / uninitialized array = %8.6f\n', average1)
fprintf('Loop / initialized array = %8.6f\n', average2)
fprintf('Vectorized = %8.6f\n', average3)
Result - normal:
Loop / uninitialized array = 0.195286
Loop / initialized array = 0.000339
Vectorized = 0.000079
Result - crazy:
Loop / uninitialized array = 0.203350
Loop / initialized array = 973258065.680879
Vectorized = 0.000102
Why is this happening ?
(sometimes the crazy number is on vectorized, sometimes on loop initialized)
Where did MATLAB "find" that number?
That is indeed crazy. Don't know what could cause it, and was unable to reproduce on my own Matlab R2010a copy over several runs, invoked by name or via F5.
Here's an idea for debugging it.
When using tic/toc inside a script or function, use the "tstart = tic" form that captures the output. This makes it safe to use nested tic/toc calls (e.g. inside called functions), and lets you hold on to multiple start and elapsed times and examine them programmatically.
t0 = tic;
% ... do some work ...
te = toc(t0); % "te" for "time elapsed"
You can use different "t0_label" suffixes for each of the tic and toc returns, or store them in a vector, so you preserve them until the end of your script.
t0_uninit = tic;
% ... do the uninitialized-array test ...
te_uninit = toc(t0_uninit);
t0_prealloc = tic;
% ... test the preallocated array ...
te_prealloc = toc(t0_prealloc);
Have the script break in to the debugger when it finds one of the large values.
if any([te_uninit te_prealloc te_vector] > 5)
keyboard
end
Then you can examine the workspace and the return values from tic, which might provide some clues.
EDIT: You could also try testing tic() on its own to see if there's something odd with your system clock, or whatever tic/toc is calling. tic()'s return value looks like a native timestamp of some sort. Try calling it many times in a row and comparing the subsequent values. If it ever goes backwards, that would be surprising.
function test_tic
t0 = tic;
for i = 1:1000000
t1 = tic;
if t1 <= t0
fprintf('tic went backwards: %s to %s\n', num2str(t0), num2str(t1));
end
t0 = t1;
end
On Matlab R2010b (prerelease), which has int64 math, you can reproduce a similar ridiculous toc result by jiggering the reference tic value to be "in the future". Looks like an int rollover effect, as suggested by gary comtois.
>> t0 = tic; toc(t0+999999)
Elapsed time is 6148914691.236258 seconds.
This suggests that if there were some jitter in the timer that toc were using, you might get rollover if it occurs while you're timing very short operations. (I assume toc() internally does something like tic() to get a value to compare the input to.) Increasing the number of iterations could make the effect go away because a small amount of clock jitter would be less significant as part of longer tic/toc periods. Would also explain why you don't see this in your non-preallocated test, which takes longer.
UPDATE: I was able to reproduce this behavior. I was working on some unrelated code and found that on one particular desktop with a CPU model we haven't used before, a Core 2 Q8400 2.66GHz quad core, tic was giving inaccurate results. Looks like a system-dependent bug in tic/toc.
On this particular machine, tic/toc will regularly report bizarrely high values like yours.
>> for i = 1:50000; t0 = tic; te = toc(t0); if te > 1; fprintf('elapsed: %.9f\n', te); end; end
elapsed: 6934787980.471930500
elapsed: 6934787980.471931500
elapsed: 6934787980.471899000
>> for i = 1:50000; t0 = tic; te = toc(t0); if te > 1; fprintf('elapsed: %.9f\n', te); end; end
>> for i = 1:50000; t0 = tic; te = toc(t0); if te > 1; fprintf('elapsed: %.9f\n', te); end; end
elapsed: 6934787980.471928600
elapsed: 6934787980.471913300
>>
It goes past that. On this machine, tic/toc will regularly under-report elapsed time for operations, especially for low CPU usage tasks.
>> t0 = tic; c0 = clock; pause(4); toc(t0); fprintf('Wall time is %.6f seconds.\n', etime(clock, c0));
Elapsed time is 0.183467 seconds.
Wall time is 4.000000 seconds.
So it looks like this is a bug in tic/toc that is related to particular CPU models (or something else specific to the system configuration). I've reported the bug to MathWorks.
This means that tic/toc is probably giving you inaccurate results even when it doesn't produce those insanely large numbers. As a workaround, on this machine, use etime() instead, and time only longer chunks of work to compensate for etime's lower resolution. You could wrap it in your own tick/tock functions that use the for i=1:50000 test to detect when tic is broken on the current machine, use tic/toc normally, and have them warn and fall back to using etime() on broken-tic systems.
UPDATE 2012-03-28: I've seen this in the wild for a while now, and it's highly likely due to an interaction with the CPU's high resolution performance timer and speed scaling, and (on Windows) QueryPerformanceCounter, as described here: http://support.microsoft.com/kb/895980/. It is not a bug in tic/toc, the issue is in the OS features that tic/toc is calling. Setting a boot parameter can work around it.
Here's my theory about what might be happening, based on these two pieces of data I found:
There is a function maxNumCompThreads which controls the maximum number of computational threads used by MATLAB to perform tasks. Quoting the documentation:
By default, MATLAB makes use of the
multithreading capabilities of the
computer on which it is running.
Which leads me to think that perhaps multiple copies of your script are running at the same time.
This newsgroup thread discusses a bug in an older version of MATLAB (R14) "in the way that MATLAB accelerates M-code with global structure variables", which it appears the TIC/TOC functions may use. The solution there was to disable the accelerator using the undocumented FEATURE function:
feature accel off
Putting these two things together, I'm wondering if the multiple versions of your script that are running in the workspace may be simultaneously resetting global variables used by the TIC/TOC functions and screwing one another up. Maybe this isn't a problem when converting your script to a function as Amro did since this would separate the workspaces that the two programs are running in (i.e. they wouldn't both be running in the main workspace).
This could also explain the exceedingly large numbers you get. As gary and Andrew have pointed out, these numbers appear to be due to an integer roll-over effect (i.e. an integer overflow) whereby the starting time (from TIC) is larger than the ending time (from TOC). This would result in a huge number that is still positive because TIC/TOC are internally using unsigned 64-bit integers as time measures. Consider the following possible scenario with two scripts running at the same time on different threads:
The first thread calls TIC, initializing a global variable to a starting time measure (i.e. the current time).
The first thread then calls TOC, and the immediate action the TOC function is likely to make is to get the current time measure.
The second thread calls TIC, resetting the global starting time measure to the current time, which is later than the time just measured by the TOC function for the first thread.
The TOC function for the first thread accesses the global starting time measure to get the difference between it and the measure it previously took. This difference would result in a negative number, except that the time measures are unsigned integers. This results in integer overflow, giving a huge positive number for the time difference.
So, how might you avoid this problem? Changing your scripts to functions like Amro did is probably the best choice, as that seems to circumvent the problem and keeps the workspace from becoming cluttered. An alternative work-around you could try is to set the maximum number of computational threads to one:
maxNumCompThreads(1);
This should keep multiple copies of your script from running at the same time in the main workspace.
There are at least two possible error sources. Can you try to differentiate between 'tic/toc' and 'fprintf' by just looking at the computed values without formatting them.
I don't understand the braces around 'toc' but they shouldn't do any harm.
Here is a hypothesis which is testable. Matlab's tic()/toc() have to be using some high-resolution timer. On Windows, because their return value looks like clock cycles, I think they're using the Win32 QueryPerformanceCounter() call, or maybe something else hitting the CPU's RDTSC time stamp counter. These apparently have glitches on some multiprocessor systems, mentioned in the linked articles. Perhaps your machine is one of those, getting different results if the Matlab process is moved from core to core by the process scheduler.
http://msdn.microsoft.com/en-us/library/ms644904(VS.85).aspx
http://www.virtualdub.org/blog/pivot/entry.php?id=106
This would be hardware and system configuration dependent, which would explain why other posters haven't been able to reproduce it.
Try using Windows Task Manager to set the affinity on your Matlab.exe process to a single CPU. (On the Processes tab, right-click MATLAB.exe, "Set affinity...", un-check all but CPU 0.) If the crazy timing goes away while affinity is set, looks like you found the cause.
Regardless, the workaround looks like to just increase maxcount so you're timing longer pieces of work, and the noise you're apparently getting in tic()/toc() is small compared to the measured value. (You don't want to have to muck around with CPU affinity; Matlab is supposed to be easy to run.) If there's a problem in there that's causing int overflow, the other small positive numbers are a bit suspect too. Besides, hi-res timing in a high level language like Matlab is a bit problematic. Timing workloads down to a couple hundred microseconds subjects them to noise from other transient conditions in your machine's state.