Matlab firpm fails for large AFR data arrays - matlab

Here is a quick & dirty code for trying to create a high precision equalizer:
bandPoints = 355;
for n = 1:bandPoints
x = (n / (bandPoints + 2));
f = (x*x)*(22000-20)+20; % 20...22000
freqs(n) = f;
niqfreqs(n) = f/22050.0;
amps(n) = 0;
end
amps(bandPoints+1) = 0; % firpm needs even numbers
niqfreqs(bandPoints+1) = 1; % firpm needs even numbers
% set some point to have a high amplitude
amps(200) = 1;
fircfs = firpm(101,niqfreqs,amps);
[h,w] = freqz(fircfs,1,512);
plot(w/pi,abs(h));
legend('firpm Design')
but it gives me
Warning:
*** FAILURE TO CONVERGE ***
Probable cause is machine rounding error.
and all FIR coefficients are 0.
If I lower the n parameter from 101 to 91, firpm works without errors, but the frequency response is far from desired. And taking into account, that I want to calculate FIR coefficients for a hardware DSP FIR module, which supports up to 12288 taps, how can I make Matlab calculate the needed coefficients? Is firpm capable of doing this or do I need to use another approach (inverse FFT) in both Matlab and later in my application C++ code?

Oh, it seems MP algorithm really cannot handle this, so I need some other solution:
http://www.eetimes.com/design/embedded/4212775/Designing-very-high-order-FIR-filters-with-zero-stuffing
I guess, I'll have to stick with inverse FFT then.

Related

Moving maximum and LTI in MATLAB

I've implement a moving maximum filter in matlab and now I have to determine if it's LTI or not. I've already prove it's linear but I can't prove if it's time ivnariant. To prove this I have to give as an input a signal x[n] and get the output X[n]. After that I'm giving the same signal but this time it have to delay it that is x[n-a] and the output must be X[n-a] the same as the X[n] but delayed. My problem is that I don't know what input to give. Also, the moving maximum is a LTI system or not?
Thanks in advance.
Max-Filter Using For-Loop and Testing System for Linear and Time-Invariance
Filter Construction:
To filter an array/matrix/signal a for-loop can be applied. Within this for-loop, portions of the signal that overlaps the window must be stored in this case called Window_Sample. To apply the maximum filter the max() of the Window_Sample must be taken upon each iteration of the loop. Upon each iteration of the loop the window moves. Here I push all the code that calculates the maximum filtered signal into a function called Max_Filter().
Time-Invariant or Not-Time Invariant:
With proper padding, I think the maximum filter is time-invariant/shift-invariant. Concatenating zeros to the start and ends of the signal is a good way to combat this. The amount of padding I used on each end is the size of the window, the requirement would be half the window but there's no harm in having a little bit of buffer. Unfortunately, without proper padding, this may appear to break down in MATLAB since the max filter will be introduced to points early. In this case without padding the ramp the max filter will take the 5th point as the max right away. Here you can apply a scaling factor to test if the system is linear. You'll see that the system is not linear since the output is not a scaled version of the input.
clf;
%**********************************************************%
%ADJUSTABLE PARAMETERS%
%**********************************************************%
Window_Size = 3; %Window size for filter%
Delay_Size = 5; %Delay for the second test signal%
Scaling_Factor = 1; %Scaling factor for the second test signal%
n = (0:10);
%Ramp function with padding%
Signal = n; %Function for test signal%
%**********************************************************%
Signal = [zeros(1,Window_Size) Signal zeros(1,Window_Size)];
Filtered_Result_1 = Max_Filter(Signal,Window_Size);
%Ramp function with additional delay%
Delayed_Signal = Scaling_Factor.*[zeros(1,Delay_Size) Signal];
Filtered_Result_2 = Max_Filter(Delayed_Signal,Window_Size);
plot(Filtered_Result_1);
hold on
plot(Filtered_Result_2);
xlabel("Samples [n]"); ylabel("Amplitude");
title("Moving Maximum Filtering Signal");
legend("Original Signal","Delayed Signal");
xticks([0: 1: length(Filtered_Result_2)]);
xlim([0 length(Filtered_Result_2)]);
grid;
function [Filtered_Signal] = Max_Filter(Signal,Filter_Size)
%Initilizing an array to store filtered signal%
Filtered_Signal = zeros(1,length(Signal));
%Filtering array using for-loop and acccesing chunks%
for Sample_Index = 1: length(Signal)
Start_Of_Window = Sample_Index - floor(Filter_Size/2);
End_Of_Window = Sample_Index + floor(Filter_Size/2);
if Start_Of_Window < 1
Start_Of_Window = 1;
end
if End_Of_Window > length(Signal)
End_Of_Window = length(Signal);
end
Window_Sample = Signal(Start_Of_Window:End_Of_Window);
Filtered_Signal(1,Sample_Index) = max(Window_Sample);
end
end
Alternatively Using Built-In Functions:
To test the results of the created filter it is a good idea to compare this to the output results of MATLAB's built-in function movmax(). Here the results appear to be the same as the for-loop implementation.
clf;
Window_Size = 3;
n = (0:10);
%Ramp function with padding%
Signal = n;
Signal = [zeros(1,Window_Size) Signal zeros(1,Window_Size)];
Filtered_Result_1 = movmax(Signal,Window_Size);
%Ramp function with additional delay%
Delay_Size = 5;
Delayed_Signal = [zeros(1,Delay_Size) Signal];
Filtered_Result_2 = movmax(Delayed_Signal,Window_Size);
plot(Filtered_Result_1);
hold on
plot(Filtered_Result_2);
xlabel("Samples [n]"); ylabel("Amplitude");
title("Moving Maximum Filtering Signal");
legend("Original Signal","Delayed Signal");
xticks([0: 1: length(Filtered_Result_2)]);
xlim([0 length(Filtered_Result_2)]);
grid;
Ran using MATLAB R2019b

Vectorizing a matlab loop with internal functions

I have a 3D Mesh grid, X, Y, Z. I want to create a new 3D array that is a function of X, Y, & Z. That function comprises the sum of several 3D Gaussians located at different points. Currently, I have a for loop that runs over the different points where I have my gaussians, and I have an array of center locations r0(nGauss, 1:3)
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
where my 3D gaussian function is
function output=Gauss3D(X,Y,Z,r0)
output=exp(-(X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2);
end
I'm happy to redesign the function, which is the slowest part of my code and has to happen many many time, but I can't figure out how to vectorize this so that it will run faster. Any suggestions would be appreciated
*****NB the Original function had a square root in it, and has been modified to make it an actual gaussian***
NOTE! I've modified your code to create a Gaussian, which was:
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
That does not make a Gaussian. I changed this to:
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
(note no sqrt). This is a Gaussian with sigma = sqrt(1/2).
If this is not what you want, then this answer might not be very useful to you, because your function does not go to 0 as fast as the Gaussian, and therefore is harder to truncate, and it is not separable.
Vectorizing this code is pointless, as the other answers attest. MATLAB's JIT is perfectly capable of running this as fast as it'll go. But you can reduce the amount of computation significantly by noting that the Gaussian goes to almost zero very quickly, and is separable:
Most of the exp evaluations you're doing here yield a very tiny number. You don't need to compute those, just fill in 0.
exp(-x.^2-y.^2) is the same as exp(-x.^2).*exp(-y.^2), which is much cheaper to compute.
Let's put these two things to the test. Here is the test code:
function gaussian_test
N = 100;
r0 = rand(N,3)*20 - 10;
% Original
tic
[X,Y,Z] = meshgrid(-10:.1:10);
Psi1 = zeros(size(X));
for index = 1:N
Psi1 = Psi1 + Gauss3D(X,Y,Z,r0(index,:));
end
t = toc;
fprintf('original, time = %f\n',t)
% Fast, large truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi2 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi2 = Gauss3D_fast(Psi2,X,Y,Z,r0(index,:),5);
end
t = toc;
fprintf('tuncation = 5, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi2-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi2-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi2-Psi1),[],1)))
% Fast, smaller truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi3 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi3 = Gauss3D_fast(Psi3,X,Y,Z,r0(index,:),3);
end
t = toc;
fprintf('tuncation = 3, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi3-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi3-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi3-Psi1),[],1)))
% DIPimage, same smaller truncation
tic
Psi4 = newim(201,201,201);
coords = (r0+10) * 10;
Psi4 = gaussianblob(Psi4,coords,10*sqrt(1/2),(pi*100).^(3/2));
t = toc;
fprintf('DIPimage, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi4-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi4-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi4-Psi1),[],1)))
end % of function gaussian_test
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function Psi = Gauss3D_fast(Psi,X,Y,Z,r0,trunc)
% sigma = sqrt(1/2)
x = X-r0(1);
y = Y-r0(2);
z = Z-r0(3);
mx = abs(x) < trunc*sqrt(1/2);
my = abs(y) < trunc*sqrt(1/2);
mz = abs(z) < trunc*sqrt(1/2);
Psi(my,mx,mz) = Psi(my,mx,mz) + exp(-x(mx).^2) .* reshape(exp(-y(my).^2),[],1) .* reshape(exp(-z(mz).^2),1,1,[]);
% Note! the line above uses implicit singleton expansion. For older MATLABs use bsxfun
end
This is the output on my machine, reordered for readability (I'm still on MATLAB R2017a):
| time(s) | mean abs | mean sq. | max abs
--------------+----------+----------+----------+----------
original | 5.035762 | | |
tuncation = 5 | 0.169807 | 0.000000 | 0.000000 | 0.000005
tuncation = 3 | 0.054737 | 0.000452 | 0.000002 | 0.024378
DIPimage | 0.044099 | 0.000452 | 0.000002 | 0.024378
As you can see, using these two properties of the Gaussian we can reduce time from 5.0 s to 0.17 s, a 30x speedup, with hardly noticeable differences (truncating at 5*sigma). A further 3x speedup can be gained by allowing a small error. The smallest the truncation value, the faster this will go, but the larger the error will be.
I added that last method, the gaussianblob function from DIPimage (I'm an author), just to show that option in case you need to squeeze that bit of extra time from your code. That function is implemented in C++. This version that I used you will need to compile yourself. Our current official release implements this function still in M-file code, and is not as fast.
Further chance of improvement is if the fractional part of the coordinates is always the same (w.r.t. the pixel grid). In this case, you can draw the Gaussian once, and shift it over to each of the centroids.
Another alternative involves computing the Gaussian once, at a somewhat larger scale, and interpolating into it to generate each of the 1D Gaussians needed to generate the output. I did not implement this, I have no idea if it will be faster or if the time difference will be significant. In the old days, exp was expensive, I'm not sure this is still the case.
So, I am building off of the answer above me #Durkee. I enjoy these kinds of problems, so I thought a little about how to make each of the expansions implicit, and I have the one-line function below. Using this function I shaved .11 seconds off of the call, which is completely negligible. It looks like yours is pretty decent. The only advantage of mine might be how the code scales on a finer mesh.
xLin = [-10:.1:10]';
tic
psi2 = sum(exp(-sqrt((permute(xLin-r0(:,1)',[3 1 4 2])).^2 ...
+ (permute(xLin-r0(:,2)',[1 3 4 2])).^2 ...
+ (permute(xLin-r0(:,3)',[3 4 1 2])).^2)),4);
toc
The relative run times on my computer were (all things kept the same):
Original - 1.234085
Other - 2.445375
Mine - 1.120701
So this is a bit of an unusual problem where on my computer the unvectorized code actually works better than the vectorized code, here is my script
clear
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
nGauss = 20; %Sample nGauss as you didn't specify
r0 = rand(nGauss,3); % Just make this up as it doesn't really matter in this case
% Your original code
tic
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
toc
% Vectorize these functions so we can use implicit broadcasting
X1 = X(:);
Y1 = Y(:);
Z1 = Z(:);
tic
val = [X1 Y1 Z1];
% Change the dimensions so that r0 operates on the right elements
r0_temp = permute(r0,[3 2 1]);
% Perform the gaussian combination
out = sum(exp(-sqrt(sum((val-r0_temp).^2,2))),3);
toc
% Check to make sure both functions match
sum(abs(vec(Psi)-vec(out)))
function output=Gauss3D(X,Y,Z,r0)
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function out = vec(in)
out = in(:);
end
As you can see, this is probably about as vectorized as you can get. The whole function is done using broadcasting and vectorized operations which normally improve performance ten-one hundredfold. However, in this case, this is not what we see
Elapsed time is 1.876460 seconds.
Elapsed time is 2.909152 seconds.
This actually shows the unvectorized version as being faster.
There could be a few reasons for this of which I am by no means an expert.
MATLAB uses a JIT compiler now which means that for loops are no longer inefficient.
Your code is already reasonably vectorized, you are operating at 8 million elements at once
Unless nGauss is 1000 or something, you're not looping through that much, and at that point, vectorization means you will run out of memory
I could be hitting some memory threshold where I am using too much memory and that is making my code inefficient, I noticed that when I lowered the resolution on the meshgrid the vectorized version worked better
As an aside, I tested this on my GTX 1060 GPU with single precision(single precision is 10x faster than double precision on most GPUs)
Elapsed time is 0.087405 seconds.
Elapsed time is 0.241456 seconds.
Once again the unvectorized version is faster, sorry I couldn't help you out but it seems that your code is about as good as you are going to get unless you lower the tolerances on your meshgrid.

How to improve the time consuming of log and exponentiation operation

The theoretical result is a mixture of gamma ratios, like:sum(
AiGamma(Bi)/Gamma(Ci)), in which A is a binomial coeff, and would be very hard to calculate by using nchoosek directly in matlab. So my solution is to decompose all elements in the results to prod(vector), however, as the vector getting longer, I meet digit problem. So I changed the solution to get x(1:n) = log(vector) and then rst = sum(exp(x)). In practice, I found this is quite time consuming, especially when the # of gamma terms is very large.
Here is a code section:
gamma_sum = zeros(1,x2+1);
coef = ones(1,x2+1);
% sub_gamma_sum = zeros(1,x2+1);
% coef(1) = prod(1./sqrt(1:x2));
coef(1) = sum(log(1:x2))/2-sum(log([1:1-1 1:x2-1+1]));
if x1>0
% gamma_sum(1) = gamma(beta)/gamma(alpha+beta)/...
% prod((alpha+beta:alpha+beta+x1-1));
% gamma_sum(1) = prod(1./(alpha+beta:alpha+beta+x1-1));
gamma_sum(1) = sum(log(1./(alpha+beta:alpha+beta+x1-1)));
else
% gamma_sum(1) = gamma(beta)/gamma(alpha+beta);
% gamma_sum(1) = 1;
gamma_sum(1) = log(1);
end
for i = 2:x2+1
% coef(i) = prod((1:x2)./[1:i-1 1:x2-i+1]);
% coef(i) = exp(sum(log(1:x2))/2-sum(log([1:i-1 1:x2-i+1])));
coef(i) = sum(log(1:x2))/2-sum(log([1:i-1 1:x2-i+1]));
% coef(i) = prod(1./[1:i-1 1:x2-i+1])*exp(sum(log(1:x2))/2);
% gamma_sum(i) = prod((beta:beta+i-2)./(alpha+beta:alpha+beta+i-2))*prod(1./(alpha+beta+i-1:alpha+beta+x1+i-2));%% den has x1+i-1 terms
gamma_sum(i) = sum(log((beta:beta+i-2)./(alpha+beta:alpha+beta+i-2)))+sum(log(1./(alpha+beta+i-1:alpha+beta+x1+i-2)));
end
In the code, coef is the Ai, and gamma_sum is the rest part. Just found that when x2, i.e. the number of the terms of the gamma terms, the computing time is really troublesome. P.S: I tried to replace all for loop with matrix operation, but when x2 increases the matrix size also makes the computing time consuming. Is there any way to solve the problem, like use some other method to solve the digit problem(number exceeds 1e300 or number less than e-200) more efficiently, i.e. guarantee the precision and increase the speed.
This might make your system slower, but you can try vpa() for your big numbers, if you need a high precision and if you have a Symbolic Math toolbox. Here is the example:
>> exp(1000)
ans =
Inf
>> vpa('exp(1000)',1000)
ans =
197007111401704699388887935224332312531693798532384578995280299138506385078244119347497807656302688993096381798752022693598298173054461289923262783660152825232320535169584566756192271567602788071422466826314006855168508653497941660316045367817938092905299728580132869945856470286534375900456564355589156220422320260518826112288638358372248724725214506150418881937494100871264232248436315760560377439930623959705844189509050047074217568.2267578083308102070668818911968536445918206584929433885943734416066833995904928281627706135987730904979566512246702227965470280600169740154332169201122794194769119334980240147712089576923975942544366215939426101781299421858554271852298015286303411058042095685866168239536053428580900735188184273075136717125183129388223688310255949141146674987544438726686065824907707203395789112200325628195551034220107289821072957315749621922062772097208051047568893649549635990627082681006282905378167473398226026683503867394140748723651685213836918959449223430784235236845739442
In this way you would be able to use enormous numbers in your calculations at the expense of increased memory usage and lower speed.

Simulation of Markov chains

I have the following Markov chain:
This chain shows the states of the Spaceship, which is in the asteroid belt: S1 - is serviceable, S2 - is broken. 0.12 - the probability of destroying the Spaceship by a collision with an asteroid. 0.88 - the probability of that a collision will not be critical. Need to find the probability of a serviceable condition of the ship after the third collision.
Analytical solution showed the response - 0.681. But it is necessary to solve this problem by simulation method using any modeling tool (MATLAB Simulink, AnyLogic, Scilab, etc.).
Do you know what components should be used to simulate this process in Simulink or any other simulation environment? Any examples or links.
First, we know the three step probability transition matrix contains the answer (0.6815).
% MATLAB R2019a
P = [0.88 0.12;
0 1];
P3 = P*P*P
P(1,1) % 0.6815
Approach 1: Requires Econometrics Toolbox
This approach uses the dtmc() and simulate() functions.
First, create the Discrete Time Markov Chain (DTMC) with the probability transition matrix, P, and using dtmc().
mc = dtmc(P); % Create the DTMC
numSteps = 3; % Number of collisions
You can get one sample path easily using simulate(). Pay attention to how you specify the initial conditions.
% One Sample Path
rng(8675309) % for reproducibility
X = simulate(mc,numSteps,'X0',[1 0])
% Multiple Sample Paths
numSamplePaths = 3;
X = simulate(mc,numSteps,'X0',[numSamplePaths 0]) % returns a 4 x 3 matrix
The first row is the X0 row for the starting state (initial condition) of the DTMC. The second row is the state after 1 transition (X1). Thus, the fourth row is the state after 3 transitions (collisions).
% 50000 Sample Paths
rng(8675309) % for reproducibility
k = 50000;
X = simulate(mc,numSteps,'X0',[k 0]); % returns a 4 x 50000 matrix
prob_survive_3collisions = sum(X(end,:)==1)/k % 0.6800
We can bootstrap a 95% Confidence Interval on the mean probability to survive 3 collisions to get 0.6814 ± 0.00069221, or rather, [0.6807 0.6821], which contains the result.
numTrials = 40;
ProbSurvive_3collisions = zeros(numTrials,1);
for trial = 1:numTrials
Xtrial = simulate(mc,numSteps,'X0',[k 0]);
ProbSurvive_3collisions(trial) = sum(Xtrial(end,:)==1)/k;
end
% Mean +/- Halfwidth
alpha = 0.05;
mean_prob_survive_3collisions = mean(ProbSurvive_3collisions)
hw = tinv(1-(0.5*alpha), numTrials-1)*(std(ProbSurvive_3collisions)/sqrt(numTrials))
ci95 = [mean_prob_survive_3collisions-hw mean_prob_survive_3collisions+hw]
maxNumCollisions = 10;
numSamplePaths = 50000;
ProbSurvive = zeros(maxNumCollisions,1);
for numCollisions = 1:maxNumCollisions
Xc = simulate(mc,numCollisions,'X0',[numSamplePaths 0]);
ProbSurvive(numCollisions) = sum(Xc(end,:)==1)/numSamplePaths;
end
For a more complex system you'll want to use Stateflow or SimEvents, but for this simple example all you need is a single Unit Delay block (output = 0 => S1, output = 1 => S2), with a Switch block, a Random block, and some comparison blocks to construct the logic determining the next value of the state.
Presumably you must execute the simulation a (very) large number of times and average the results to get a statistically significant output.
You'll need to change the "seed" of the random generator each time you run the simulation.
This can be done by setting the seed to be "now" (or something similar to that).
Alternatively you could quite easily vectorize the model so that you only need to execute it once.
If you want to simulate this, it is fairly easy in matlab:
servicable = 1;
t = 0;
while servicable =1
t = t+1;
servicable = rand()<=0.88
end
Now t represents the amount of steps before the ship is broken.
Wrap this in a for loop and you can do as many simulations as you like.
Note that this can actually give you the distribution, if you want to know it after 3 times, simply add && t<3 to the while condition.

MATLAB code for calculating MFCC

I have a question if that's ok. I was recently looking for algorithm to calculate MFCCs. I found a good tutorial rather than code so I tried to code it by myself. I still feel like I am missing one thing. In the code below I took FFT of a signal, calculated normalized power, filter a signal using triangular shapes and eventually sum energies corresponding to each bank to obtain MFCCs.
function output = mfcc(x,M,fbegin,fs)
MF = #(f) 2595.*log10(1 + f./700);
invMF = #(m) 700.*(10.^(m/2595)-1);
M = M+2; % number of triangular filers
mm = linspace(MF(fbegin),MF(fs/2),M); % equal space in mel-frequency
ff = invMF(mm); % convert mel-frequencies into frequency
X = fft(x);
N = length(X); % length of a short time window
N2 = max([floor(N+1)/2 floor(N/2)+1]); %
P = abs(X(1:N2,:)).^2./N; % NoFr no. of periodograms
mfccShapes = triangularFilterShape(ff,N,fs); %
output = log(mfccShapes'*P);
end
function [out,k] = triangularFilterShape(f,N,fs)
N2 = max([floor(N+1)/2 floor(N/2)+1]);
M = length(f);
k = linspace(0,fs/2,N2);
out = zeros(N2,M-2);
for m=2:M-1
I = k >= f(m-1) & k <= f(m);
J = k >= f(m) & k <= f(m+1);
out(I,m-1) = (k(I) - f(m-1))./(f(m) - f(m-1));
out(J,m-1) = (f(m+1) - k(J))./(f(m+1) - f(m));
end
end
Could someone please confirm that this is all right or direct me if I made mistake> I tested it on a simple pure tone and it gives me, in my opinion, reasonable answers.
Any help greatly appreciated :)
PS. I am working on how to apply vectorized Cosinus Transform. It looks like I would need a matrix of MxM of transform coefficients but I did not find any source that would explain how to do it.
You can test it yourself by comparing your results against other implementations like this one here
you will find a fully configurable matlab toolbox incl. MFCCs and even a function to reverse MFCC back to a time signal, which is quite handy for testing purposes:
melfcc.m - main function for calculating PLP and MFCCs from sound waveforms, supports many options.
invmelfcc.m - main function for inverting back from cepstral coefficients to spectrograms and (noise-excited) waveforms, options exactly match melfcc (to invert that processing).
the page itself has a lot of information on the usage of the package.