I'm relatively new to abaqus and currently working on an optimisation project for a beam under concentrated load where I have to run multiple files with different parameters(length,width,load) in matlab to create a .inp file to be run on abaqus. My question is, how would I go about tackling this problem and is there a way to generate the nodes and elements automatically?
You can parametrize your Abaqus project using the Python interface. It requires some getting used to, but it can be quite powerful depending on what you want to do. I assume your optimizer is programmed in Matlab, but that should not be a problem. You can write a wrapper function in Matlab which calls Abaqus with the script you want. The command would be abaqus cae noGUI=<script.py> to run the script that creates your input file.
Alternatively you can create a base input file and create a script that looks for the lines that have to be changed. This can be based on the input file keywords, you can put some own parameters (dummy words) that you substitute using your script or a comment that can easily be recognized.
The best approach depends a bit on what exactly you want to change.
Abaqus also offers some parameterization, I think this link may be handy.
Regarding generating the nodes and elements automatically, it very much possible but it is time consuming and rather complicated process.
Nodes can be created by defining the x,y and z coordinates and Elements can be created by defining the Nodal Connectivity. As the model gets complicated, defining the x,y and z coordinates and Nodal Connectivity also get difficult. However, for simple models you can still create the nodes and elements automatically. If I am guessing correctly, your model is rectangular shaped beam, hence creating mesh may not be complicated.
In Abaqus, Nodal Connectivity for each element type is given in Analysis Help Manuals. Following is the example of Nodal Connectivity for 4-Noded Quad and 8-Noded Hex Element.
To get you started, I am providing here the Python code to create Abaqus INPUT mesh file for 2D and 3D beam (Later you can replicate in MATLAB code):
fnm = 'shell.inp'
w = 100 # x-direction
h = 10 # y-direction
npx = 30; npy = 5
fout = open(fnm, 'w')
# Creating node information
fout.write('*Node, Nset=All_Nodes\n')
nds = []; elms = []
for i in range(npy):
y = (i+1)/npy * h
for j in range(npx):
nid = i*npx + j + 1
x = (j+1)/npx * w
nds.append([nid,x,y,0.0])
fout.write('%10d,%10f,%10f,%10f\n'%(nid,x,y,0.0))
# Creating Nodal connectivity
fout.write('*Element, Type=S4, Elset=All_Elements\n')
for m in range(npy-1):
for n in range(npx-1):
eid = m*(npx-1) + n + 1
e1 = n + 2 + npx + m*npx
e2 = e1 - 1
e3 = n + 1 + m*npx
e4 = e3 + 1
elms.append([eid,e1,e2,e3,e4])
fout.write(('%10d,'*4+'%10d\n')%(eid,e1,e2,e3,e4))
fout.close()
# ====================================================================
fnm = 'rect_beam.inp'
w = 10 # z-direction
h = 10 # y-direction
l = 100 # X-direction
npx = 20;npy = 5;npz = 3
fout = open(fnm, 'w')
# Creating node information
fout.write('*Node, Nset=All_Nodes\n')
nds = [];elms = []
for i in range(npz):
z = (i+1)/npz * w
for j in range(npy):
y = (j+1)/npy * h
for k in range(npx):
nid = i*(npx*npy) + j*npx + k + 1
x = (k+1)/npx * l
nds.append([nid,x,y,z])
fout.write('%10d,%10f,%10f,%10f\n'%(nid,x,y,z))
# Creating Nodal connectivity
fout.write('*Element, Type=C3D8, Elset=All_Elements\n')
for m in range(npz-1):
for n in range(npy-1):
for p in range(npx-1):
eid = m*(npy-1)*(npx-1) + n*(npx-1) + p + 1
e1 = p + 2 + npx + n*npx + m*npx*npy
e2 = e1 - 1
e3 = p + 1 + n*npx + m*npx*npy
e4 = e3 + 1
e5 = e1 + (npx*npy)
e6 = e5 - 1
e7 = e3 + (npx*npy)
e8 = e7 + 1
elms.append([eid,e1,e2,e3,e4,e5,e6,e7,e8])
fout.write(('%6d,'*8+'%6d\n')%(eid,e1,e2,e3,e4,e5,e6,e7,e8))
print(m,n,p,'|',eid,e1,e2,e3,e4,e5,e6,e7,e8)
fout.close()
Related
I'm given the a(k) matrix and e(n) and I need to compute y(n) from the following eq:
y(n) = sum(k = 1 to 10 )( a(k)*y(n-k) ) + e(n).
I have the a matrix(filter coefficients) and the e matrix (the residual), therefore there is only 1 unknown: y, which is built by the previous 10 samples of y.
An example to this equation:
say e(0) (my first residual sample) = 3
and y(-10) to y(-1) = 0
then y(0), my first sample in the signal y, would just be e(0) = 3:
y(0) = a(1)*y(-1) + a(2)*y(-2) + .... + a(10)*y(-10) + e(0) = e(0) = 3
and if e(1) = 4, and a(1) = 5, then
y(1) = a(1)*y(0) + a(2)*y(-1) + a(3)&y(-2) + ... + a(10)*y(-9) + e(1) = 19
The problem is
I don't know how to do this without loops because, say, y(n) needs y(n-1), so I need to immediately append y(n-1) into my matrix in order to get y(n).
If n (the number of samples) = say, 10,000,000, then using a loop is not ideal.
What I've done so far
I have not implemented anything. The only thing I've done for this particular problem is research on what kind of Matlab functions I could use.
What I need
A Matlab function that, given an equation and or input matrix, computes the next y(n) and automatically appends that to the input matrix, and then computes the next y(n+1), and automatically append that to the input matrix and so on.
If there is anything regarding my approach
That seems like it's on the wrong track, or my question isn't clear enough, of if there is no such matlab function that exists, then I apologize in advance. Thank you for your time.
Thank you for the CNTK Tool, the examples are running pretty fast. Since some days, I try to set up a simple network, but I dont get it. I need a network with 2 input and 3 output, for example:
|features 0.3 0.5 |labels 0.2 0.7 0.9
The output is not a one-hot-vector, the network has to learn the label-values 0.2 0.7 0.9. But most examples have a one-hot-vector as output, so it is not clear to me how to solve this. I have tried to change the tutorial with 3 classification, but it does not work, the network does not learn the output correctly. The network I have tried is:
BrainScriptNetworkBuilder = {
SDim = 2 # feature dimension
H1Dim = 50 # hidden dimension
H2Dim = 50 # hidden dimension
LDim = 3 # number of classes (labels)
model (features) = {
W0 = ParameterTensor {(H1Dim:SDim)} ; b0 = ParameterTensor {H1Dim}
W1 = ParameterTensor {(H2Dim:H1Dim)} ; b1 = ParameterTensor {H2Dim}
W2 = ParameterTensor {(LDim:H2Dim)} ; b2 = ParameterTensor {LDim}
r1 = ReLU(W0 * features + b0) # hidden layer 1
r2 = ReLU(W1 * r1 + b1) # hidden layer 2
z = ReLU(W2 * r2 + b2)
}.z
# define inputs
features = Input {SDim, sparse = false}
labels = Input {LDim, sparse = false}
# apply model to features
z = model (features)
# define criteria and output(s)
ce = SquareError(labels, z) # criterion (loss)
err = SquareError(labels, z) # additional metric
# connect to the system. These five variables must be named exactly like this.
featureNodes = (features)
inputNodes = (labels)
criterionNodes = (ce)
evaluationNodes = (err)
outputNodes = (z)
}
So my question is: How to set up a network in CNTK, so that the output is not a one hot vector?
Thank you for help.
When your label is not a one-hot vector, squareError is a good loss function to minimize. If some examples have a one-hot label you can still user squareError. So I think you are doing everything right, you might have to just tune the learning rate to get it to work well.
I know this question has been asked in various forms, but I can't really find any answer I can understand and use. So forgive me if this is a basic question, 'cause I'm a newbie to these tools(theano/keras)
Problem to Solve
Monitor variables in Neural Networks
(e.g. input/forget/output gate values in LSTM)
What I'm currently getting
no matter in which stage I'm getting those values, I'm getting something like :
Elemwise{mul,no_inplace}.0
Elemwise{mul,no_inplace}.0
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
Subtensor{int64}.0
Subtensor{int64}.0
Is there any way I can't monitor(e.g. print to stdout, write to a file, etc) them?
Possible Solution
Seems like callbacks in Keras can do the job, but it doesn't work either for me. I'm getting same thing as above
My Guess
Seems like I'm making very simple mistakes.
Thank you very much in advance, everyone.
ADDED
Specifically, I'm trying to monitor input/forget/output gating values in LSTM.
I found that LSTM.step() is for computing those values:
def step(self, x, states):
h_tm1 = states[0] # hidden state of the previous time step
c_tm1 = states[1] # cell state from the previous time step
B_U = states[2] # dropout matrices for recurrent units?
B_W = states[3] # dropout matrices for input units?
if self.consume_less == 'cpu': # just cut x into 4 pieces in columns
x_i = x[:, :self.output_dim]
x_f = x[:, self.output_dim: 2 * self.output_dim]
x_c = x[:, 2 * self.output_dim: 3 * self.output_dim]
x_o = x[:, 3 * self.output_dim:]
else:
x_i = K.dot(x * B_W[0], self.W_i) + self.b_i
x_f = K.dot(x * B_W[1], self.W_f) + self.b_f
x_c = K.dot(x * B_W[2], self.W_c) + self.b_c
x_o = K.dot(x * B_W[3], self.W_o) + self.b_o
i = self.inner_activation(x_i + K.dot(h_tm1 * B_U[0], self.U_i))
f = self.inner_activation(x_f + K.dot(h_tm1 * B_U[1], self.U_f))
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1 * B_U[2], self.U_c))
o = self.inner_activation(x_o + K.dot(h_tm1 * B_U[3], self.U_o))
with open("test_visualization.txt", "a") as myfile:
myfile.write(str(i)+"\n")
h = o * self.activation(c)
return h, [h, c]
And as it's in the code above, I tried to write the value of i into a file, but it only gave me values like :
Elemwise{mul,no_inplace}.0
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
Subtensor{int64}.0
So I tried i.eval() or i.get_value(), but both failed to give me values.
.eval() gave me this:
theano.gof.fg.MissingInputError: An input of the graph, used to compute Subtensor{::, :int64:}(<TensorType(float32, matrix)>, Constant{10}), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.
and .get_value() gave me this:
AttributeError: 'TensorVariable' object has no attribute 'get_value'
So I backtracked those chains(which line calls which functions..) and tried to get values at every steps I found but in vain.
Feels like I'm in some basic pitfalls.
I use the solution described in the Keras FAQ:
http://keras.io/getting-started/faq/#how-can-i-visualize-the-output-of-an-intermediate-layer
In detail:
from keras import backend as K
intermediate_tensor_function = K.function([model.layers[0].input],[model.layers[layer_of_interest].output])
intermediate_tensor = intermediate_tensor_function([thisInput])[0]
yields:
array([[ 3., 17.]], dtype=float32)
However I'd like to use the functional API but I can't seem to get the actual tensor, only the symbolic representation. For example:
model.layers[1].output
yields:
<tf.Tensor 'add:0' shape=(?, 2) dtype=float32>
I'm missing something about the interaction of Keras and Tensorflow here but I'm not sure what. Any insight much appreciated.
One solution is to create a version of your network that is truncated at the LSTM layer of which you want to monitor the gate values, and then replace the original layer with a custom layer in which the stepfunction is modified to return not only the hidden layer values, but also the gate values.
For instance, say you want to access the access the gate values of a GRU. Create a custom layer GRU2 that inherits everything from the GRU class, but adapt the step function such that it returns a concatenation of the states you want to monitor, and then takes only the part containing the previous hidden layer activations when computing the next activations. I.e:
def step(self, x, states):
# get prev hidden layer from input that is concatenation of
# prev hidden layer + reset gate + update gate
x = x[:self.output_dim, :]
###############################################
# This is the original code from the GRU layer
#
h_tm1 = states[0] # previous memory
B_U = states[1] # dropout matrices for recurrent units
B_W = states[2]
if self.consume_less == 'gpu':
matrix_x = K.dot(x * B_W[0], self.W) + self.b
matrix_inner = K.dot(h_tm1 * B_U[0], self.U[:, :2 * self.output_dim])
x_z = matrix_x[:, :self.output_dim]
x_r = matrix_x[:, self.output_dim: 2 * self.output_dim]
inner_z = matrix_inner[:, :self.output_dim]
inner_r = matrix_inner[:, self.output_dim: 2 * self.output_dim]
z = self.inner_activation(x_z + inner_z)
r = self.inner_activation(x_r + inner_r)
x_h = matrix_x[:, 2 * self.output_dim:]
inner_h = K.dot(r * h_tm1 * B_U[0], self.U[:, 2 * self.output_dim:])
hh = self.activation(x_h + inner_h)
else:
if self.consume_less == 'cpu':
x_z = x[:, :self.output_dim]
x_r = x[:, self.output_dim: 2 * self.output_dim]
x_h = x[:, 2 * self.output_dim:]
elif self.consume_less == 'mem':
x_z = K.dot(x * B_W[0], self.W_z) + self.b_z
x_r = K.dot(x * B_W[1], self.W_r) + self.b_r
x_h = K.dot(x * B_W[2], self.W_h) + self.b_h
else:
raise Exception('Unknown `consume_less` mode.')
z = self.inner_activation(x_z + K.dot(h_tm1 * B_U[0], self.U_z))
r = self.inner_activation(x_r + K.dot(h_tm1 * B_U[1], self.U_r))
hh = self.activation(x_h + K.dot(r * h_tm1 * B_U[2], self.U_h))
h = z * h_tm1 + (1 - z) * hh
#
# End of original code
###########################################################
# concatenate states you want to monitor, in this case the
# hidden layer activations and gates z and r
all = K.concatenate([h, z, r])
# return everything
return all, [h]
(Note that the only lines I added are at the beginning and end of the function).
If you then run your network with GRU2 as last layer instead of GRU (with return_sequences = True for the GRU2 layer), you can just call predict on your network, this will give you all hidden layer and gate values.
The same thing should work for LSTM, although you might have to puzzle a bit to figure out how to store all the outputs you want in one vector and retrieve them again afterwards.
Hope that helps!
You can use theano's printing module for printing during execution (and not during definition, which is what you're doing and the reason why you're not getting values, but their abstract definition).
Print
Just use the Print function. Don't forget to use the output of Print to continue your graph, otherwise the output will be disconnected and Print will most likely be removed during optimisation. And you will see nothing.
from keras import backend as K
from theano.printing import Print
def someLossFunction(x, ref):
loss = K.square(x - ref)
loss = Print('Loss tensor (before sum)')(loss)
loss = K.sum(loss)
loss = Print('Loss scalar (after sum)')(loss)
return loss
Plot
A little bonus you might enjoy.
The Print class has a global_fn parameter, to override the default callback to print. You can provide your own function and directly access to the data, to build a plot for instance.
from keras import backend as K
from theano.printing import Print
import matplotlib.pyplot as plt
curve = []
# the callback function
def myPlottingFn(printObj, data):
global curve
# Store scalar data
curve.append(data)
# Plot it
fig, ax = plt.subplots()
ax.plot(curve, label=printObj.message)
ax.legend(loc='best')
plt.show()
def someLossFunction(x, ref):
loss = K.sum(K.square(x - ref))
# Callback is defined line below
loss = Print('Loss scalar (after sum)', global_fn=myplottingFn)(loss)
return loss
BTW the string you passed to Print('...') is stored in the print object under property name message (see function myPlottingFn). This is useful for building multi-curves plot automatically
I have a backwards recursion for a binomial tree. At each node an unknown C enters in such a way that at the starting node we get a formula, A(1,1), that depends upon C. The code is as follows:
A=sym(zeros(1,Steps));
B=zeros(1,Steps);
syms C; % The unknown that enters A at every node
tic
for t=Steps-1:-1:1
% Values needed in A and B
Lambda=1-exp(-(1./S(t,1:t).^b).*h);
Q=((1./D(t))./(1-Lambda)-d)/(u-d);
R=normcdf(a0+a1*Lambda);
% the backward recursion for A and B
A(1:t)=D(t)*C+D(t)*...
(Q.*(1-Lambda).*A(1:t) ...
+ (1-Q).*(1-Lambda).*A(2:t+1));
B(1:t)=Lambda.*(1-R)+D(t)*...
(Q.*(1-Lambda).*B(1:t)...
+ (1-Q.*(1-Lambda).*B(2:t+1)));
end
C = solve(A(1,1)==sym(B(1,1)),C);
This code takes around 4 seconds if Steps = 104. If however we remove C and set matrix A to a regular double matrix, it only takes about 0.02 seconds. Using syms thus increases the calculation time by a factor 200. This seems too much to me. Any suggestions into speeding this up?
I am using Matlab 2013b on a MacBook air 13-inch spring 2013. Furthermore, if you're interested in the code before the above part (not sure whether it is relevant):
a0 = 0.9;
a1 = -3.2557;
b = 1.2594;
S0=18.57;
sigma=0.6579;
h=1/104;
T=1;
Steps=T/h;
f=transpose(normrnd(0.04, 0.001 [1 pl]));
D=exp(-h*f); % discount values
pl=T/h; % pathlength - amount of steps in maturity
u=exp(sigma*sqrt(h));
d=1/u;
u_row = repmat(cumprod([1 u*ones(1,pl-1)]),pl,1);
d_row = cumprod(tril(d*ones(pl),-1)+triu(ones(pl)),1);
path = tril(u_row.*d_row);
S=S0*path;
Unless I'm missing something, there's no need to use symbolic math or use an unknown variable. You can effectively assume that C = 1 in your recursion relation and solve for the actual value at the end. Here's the full code with some other improvements:
rng(1); % Always seed your random number generator
a0 = 0.9;
a1 = -3.2557;
b = 1.2594;
S0 = 18.57;
sigma = 0.6579;
h = 1/104;
T = 1;
Steps = T/h;
pl = T/h;
f = 0.04+0.001*randn(pl,1);
D = exp(-h*f);
u = exp(sigma*sqrt(h));
d = 1/u;
u_row = repmat(cumprod([1 u*ones(1,pl-1)]),pl,1);
d_row = cumprod(tril(d*ones(pl),-1)+triu(ones(pl)),1);
pth = tril(u_row.*d_row);
S = S0*pth;
A = zeros(1,Steps);
B = zeros(1,Steps);
tic
for t = Steps-1:-1:1
Lambda = 1-exp(-h./S(t,1:t).^b);
Q = ((1./D(t))./(1-Lambda)-d)/(u-d);
R = 0.5*erfc((-a0-a1*Lambda)/sqrt(2)); % Faster than normcdf
% Backward recursion for A and B
A = D(t)+D(t)*(Q.*(1-Lambda).*A(1:end-1) + ...
(1-Q).*(1-Lambda).*A(2:end));
B = Lambda.*(1-R)+D(t)*(Q.*(1-Lambda).*B(1:end-1) + ...
(1-Q.*(1-Lambda).*B(2:end)));
end
C = B/A
toc
This take about 0.005 seconds to run on my MacBook Pro. There are certainly other improvements you could make. There are many combinations of variables that are used in multiple places (e.g., 1-Lambda or D(t)*(1-Lambda)), that could be calculated once. Matlab may try to optimize this a bit. And you can try moving Lambda, Q, and R out of the loop – or at least calculate parts of them outside and save the results in arrays.
I need to filter an image using a bank of filters in Matlab. My first attempt was to use a simple for loop to repeatedly call the "imfilter" function for each filter in the bank.
I will need to repeat this process many times for my application, so I need to this step to be as efficient as possible. Therefore, I was wondering if there was any way this operation could be vectorized to speed up the process. In an effort to simplify things, all of my filter kernels are the same size (9x9).
As an example of what I am going for, my filters are set up in a 9x9x32 element block, which needs to be applied to my image. I thought about replicating the image into a block (e.g. 100x100x32), but I'm not sure if there's a way to apply an operation like convolution without resorting to loops. Does anyone have suggestions for a good way of tackling this problem?
Other than pre allocating the space, there is not a faster way to arrive at an exact solution. If approximations are ok, then you might be able to decompose the 32 filters into a set of linear combinations of a smaller number of filters, say eight. See for instance Steerable filters.
http://people.csail.mit.edu/billf/papers/steerpaper91FreemanAdelson.pdf
edit: here is a tool to help apply filters to images.
function FiltIm = ApplyFilterBank(im,filters)
%#function FiltIm = ApplyFilterBank(im,filters)
%#
%#assume im is a single layer image, and filters is a cell array
nFilt = length(filters);
maxsz = 0;
for i = 1:nFilt
maxsz = max(maxsz,max(size(filters{i})));
end
FiltIm = zeros(size(im,1), size(im,2), nFilt);
im = padimage(im,maxsz,'symmetric');
for i = 1:nFilt
FiltIm(:,:,i) = unpadimage(imfilter(im,filters{i}),maxsz);
end
function o = padimage(i,amnt,method)
%#function o = padimage(i,amnt,method)
%#
%#padarray which operates on only the first 2 dimensions of a 3 dimensional
%#image. (of arbitrary number of layers);
%#
%#String values for METHOD
%# 'circular' Pads with circular repetion of elements.
%# 'replicate' Repeats border elements of A.
%# 'symmetric' Pads array with mirror reflections of itself.
%#
%#if(amnt) is length 1, then pad all sides same amount
%#
%#if(amnt) is length 2, then pad y direction amnt(1), and x direction amnt(2)
%#
%#if(amnt) is length 4, then pad sides unequally with order LTRB, left top right bottom
if(nargin < 3)
method = 'replicate';
end
if(length(amnt) == 1)
o = zeros(size(i,1) + 2 * amnt, size(i,2) + 2* amnt, size(i,3));
for n = 1:size(i,3)
o(:,:,n) = padarray(i(:,:,n),[amnt,amnt],method,'both');
end
end
if(length(amnt) == 2)
o = zeros(size(i,1) + 2 * amnt(1), size(i,2) + 2* amnt(2), size(i,3));
for n = 1:size(i,3)
o(:,:,n) = padarray(i(:,:,n),amnt,method,'both');
end
end
if(length(amnt) == 4)
o = zeros(size(i,1) + amnt(2) + amnt(4), size(i,2) + amnt(1) + amnt(3), size(i,3));
for n = 1:size(i,3)
o(:,:,n) = padarray(padarray(i(:,:,n),[amnt(2), amnt(1)],method,'pre'),[amnt(4), amnt(3)],method,'post');
end
end
function o = unpadimage(i,amnt)
%#un does padimage
%#if length(amnt == 1), unpad equal on each side
%#if length(amnt == 2), first amnt is left right, second up down
%#if length(amnt == 4), then [left top right bottom];
switch(length(amnt))
case 1
sx = size(i,2) - 2 * amnt;
sy = size(i,1) - 2 * amnt;
l = amnt + 1;
r = size(i,2) - amnt;
t = amnt + 1;
b = size(i,1) - amnt;
case 2
sx = size(i,2) - 2 * amnt(1);
sy = size(i,1) - 2 * amnt(2);
l = amnt(1) + 1;
r = size(i,2) - amnt(1);
t = amnt(2) + 1;
b = size(i,1) - amnt(2);
case 4
sx = size(i,2) - (amnt(1) + amnt(3));
sy = size(i,1) - (amnt(2) + amnt(4));
l = amnt(1) + 1;
r = size(i,2) - amnt(3);
t = amnt(2) + 1;
b = size(i,1) - amnt(4);
otherwise
error('illegal unpad amount\n');
end
if(any([sx,sy] < 1))
fprintf('unpadimage newsize < 0, returning []\n');
o = [];
return;
end
o = zeros(sy, sx, size(i,3));
for n = 1:size(i,3)
o(:,:,n) = i(t:b,l:r,n);
end
New Answer: Use colfilt() or block filtering style. Matlab can transform your image into large matrix where each distinct 9x9 pixel area is a single column (81 elements). Make it using im2col() method. If your image is N by M the result matrix would be 81 X (N-8)*(M-8).
Then you can concatenate all your filters to single matrix (each filter is a row) and multiply those huge matrices. This will give you the result of all filters. Now you have to reconstruct back 32 result images from the result matrix. use col2im() method.
For more information type 'doc colfilt'
This method works almost as fast as mex file and doesnt require any 'for' loop
Old answer:
Do you want to get different 32 results or single result for combination of filters?
If it is a sungle result than there is an easy way.
If you use linear filters (like convolutions) then apply filters one on another. Finally apply the resulting filter on the image. Thus image will be convolved only once.
If you filters are symmetric (x and y direction) then instead of applying the 9x9 filter apply 9x1 on y direction and 1x9 on x direction. Works a bit faster.
Finally, you can try using Mex file