Fixed point conversion in Simulink

I have a generic question. I have developed a control model in Simulink and it works fine. I was trying to convert this model to fixed point, with little success. The coefficients in the float model are greater than unity, but my inputs come from a 16-bit ADC (Q15). As a result there seem to be scaling issues and my outputs are going haywire. Please advise. I have two coefficients in my design, a1 and a2, with values a1 = 0.16 and a2 = 3600. I am unable to scale them to Q15 in a way that gives proper convergence.
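A minimal sketch of where this likely goes wrong, using fi objects from Fixed-Point Designer (the word/fraction lengths below are my own illustrative choices, not from your model): Q15 only covers [-1, 1), so a1 = 0.16 fits but a2 = 3600 saturates. A large coefficient needs its own Q format with enough integer bits, or it must be pre-scaled with the shift compensated elsewhere in the model.

a1 = fi(0.16, 1, 16, 15);    % Q15: range [-1, 1), 0.16 fits
a2 = fi(3600, 1, 16, 3);     % s16,3 (Q12.3): range [-4096, 4096), step 0.125
a2_bad = fi(3600, 1, 16, 15) % would saturate to ~0.99997 - outputs go haywire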


Clarification on NN residual layer back-prop derivation

I've looked everywhere and can't find anything that explains the actual derivation of backprop for residual layers. Here's my best attempt and where I'm stuck. It is worth mentioning that I'm hoping for a derivation from a generic perspective that need not be limited to convolutional NNs.
If the formula for calculating the output of a normal hidden layer is F(x), then the formula for a hidden layer with a residual connection is F(x) + o, where x is the weight-adjusted output of a previous layer, o is the output of a previous layer, and F is the activation function. To get the delta for a normal layer during back-propagation, one needs to calculate the gradient of the output, ∂F(x)/∂x. For a residual layer this is ∂(F(x) + o)/∂x, which is separable into ∂F(x)/∂x + ∂o/∂x (1).
If all of this is correct, how does one deal with ∂o/∂x? It seems to me that it depends on how far back in the network o comes from.
If o is just from the previous layer then o*w = x, where w are the weights connecting the previous layer to the layer for F(x). Taking the derivative of each side with respect to o gives ∂(o*w)/∂o = ∂x/∂o, and the result is w = ∂x/∂o, which is just the inverse of the term that comes out at (1) above. Does it make sense that in this case the gradient of the residual layer is just ∂F(x)/∂x + 1/w? Is it accurate to interpret 1/w as a matrix inverse? If so, is that actually getting computed by NN frameworks that use residual connections, or is there some shortcut for adding in the error from the residual?
If o is from further back in the network then, I think, the derivation becomes slightly more complicated. Here is an example where the residual comes from one layer further back in the network. The architecture is Input--w1--L1--w2--L2--w3--L3--Out, with a residual connection from L1 to L3. The symbol o from the first example is replaced by the layer output L1 to avoid ambiguity. We are trying to calculate the gradient at L3 during back-prop, which has a forward function of F(x) + L1 where x = F(F(L1*w2)*w3). The derivative of this relationship is ∂x/∂L1 = ∂F(F(L1*w2)*w3)/∂L1, which is more complicated but doesn't seem too difficult to solve numerically.
If the above derivation is reasonable then it's worth noting that there is a case where it fails: when a residual connection originates from the Input layer. This is because the input cannot be broken down into an o*w = x expression (where x would be the input values). I think this must suggest that residual connections cannot originate from the input layer, but since I've seen network architecture diagrams that have residual connections originating from the input, this casts my above derivations into doubt. I can't see where I've gone wrong though. If anyone can provide a derivation or code sample for how they calculate the gradient at residual merge points correctly, I would be deeply grateful.
EDIT:
The core of my question is, when using residual layers and doing vanilla back-propagation, is there any special treatment of the error at the layers where residuals are added? Since there is a 'connection' between the layer where the residual comes from and the layer where it is added, does the error need to get distributed backwards over this 'connection'? My thinking is that since residual layers provide raw information from the beginning of the network to deeper layers, the deeper layers should provide raw error to the earlier layers.
Based on what I've seen (reading the first few pages of googleable forums, reading the essential papers, and watching video lectures) and Maxim's answer below, I'm starting to think that the answer is that ∂o/∂x = 0 and that we treat o as a constant.
Does anyone do anything special during back-prop through a NN with residual layers? If not, then does that mean residual layers are an 'active' part of the network on only the forward pass?
I think you've over-complicated residual networks a little bit. Here's the link to the original paper by Kaiming He et al.
In section 3.2, they describe the "identity" shortcuts as y = F(x, W) + x, where W are the trainable parameters. You can see why it's called "identity": the value from the previous layer is added as is, without any complex transformation. This does two things:
F now learns the residual y - x (discussed in 3.1), in short: it's easier to learn.
The network gets an extra connection to the previous layer, which improves the gradients flow.
The backward flow through the identity mapping is trivial: the error message is passed unchanged, no inverse matrices are involved (in fact, they are not involved in any linear layer).
Now, the paper's authors go a bit further and consider a slightly more complicated version of F that changes the output dimensions (which is probably what you had in mind). They write it generally as y = F(x, W) + Ws * x, where Ws is the projection matrix. Note that, though it's written as a matrix multiplication, this operation is in fact very simple: it adds extra zeros to x to make its shape larger. You can read a discussion of this operation in this question. But this changes the backward pass very little: the error message is simply clipped to the original shape of x.
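To make the last point concrete, here is a minimal sketch (my own illustration, not from the paper) of the backward pass through an identity block y = f(W*x) + x with a ReLU f. The identity path simply adds the upstream gradient; no matrix inverse appears anywhere:

x = randn(4,1);                     % layer input
W = randn(4,4);                     % trainable weights
z = W*x;                            % pre-activation
y = max(z,0) + x;                   % forward: F(x,W) + x

dLdy = randn(4,1);                  % upstream gradient arriving at y
mask = double(z > 0);               % ReLU derivative
dLdW = (dLdy .* mask) * x';         % gradient for the weights
dLdx = W' * (dLdy .* mask) + dLdy;  % chain rule + identity shortcut:
                                    % the residual just ADDS dLdy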

Error function and ReLU in a CNN

I'm trying to get a better understanding of neural networks by programming a Convolutional Neural Network myself.
So far, I'm keeping it pretty simple by not using max-pooling and using only ReLU activation. I'm aware of the disadvantages of this setup, but the point is not to make the best image detector in the world.
Now, I'm stuck understanding the details of the error calculation, how it is propagated back, and how it interplays with the activation function used when calculating the new weights.
I read this document (A Beginner's Guide To Understand CNN), but it doesn't help me understand much. The formula for calculating the error already confuses me.
This sum function doesn't have defined start and end points, so I basically can't read it. Maybe you can simply provide me with the correct one?
After that, the author assumes a variable L that is just "that value" (I assume he means E_total?) and gives an example for how to define the new weight:
where W denotes the weights of a particular layer.
This confuses me, as I was always under the impression that the activation function (ReLU in my case) played a role in how the new weight is calculated. Also, this seems to imply that I simply use the same error for all layers. Doesn't the error value I propagate back into the next layer somehow depend on what I calculated in the previous one?
Maybe all of this is just incomplete and you can point me in the direction that helps me best for my case.
Thanks in advance.
You do not backpropagate errors, but gradients. The activation function plays a role in calculating the new weight, depending on whether the weight in question is before or after said activation, and whether they are connected. If a weight w is after your non-linearity layer f, then the gradient dL/dw won't depend on f. But if w is before f, and they are connected, then dL/dw will depend on f. For example, suppose w is the weight vector of a fully connected layer, and assume that f directly follows this layer. Then,
dL/dw = (dL/df) * (df/dw)  // notation might change according to the shape
                           // of the tensors/matrices/vectors you chose,
                           // but this is just the chain rule
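As a hedged numeric illustration of this (the variable names are mine, not from any framework), take a fully connected layer w followed directly by a ReLU f:

x = randn(3,1);           % layer input
w = randn(1,3);           % fully connected weight vector
z = w*x;                  % pre-activation
f = max(z, 0);            % ReLU output
dLdf = 1;                 % pretend the upstream gradient is 1
dfdz = double(z > 0);     % ReLU derivative at z
dLdw = dLdf * dfdz * x';  % dL/dw = (dL/df)*(df/dz)*(dz/dw)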
As for your cost function, it is correct. Many people write these formulas in this informal style so that you get the idea and can adapt it to your own tensor shapes. By the way, this sort of MSE function is better suited to continuous label spaces. You might want to use a softmax or an SVM loss for image classification (I'll come back to that). Anyway, as you requested a correct form for this function, here is an example. Imagine you have a neural network that predicts a vector field of some kind (like surface normals). Assume that it takes a 2d pixel x_i and predicts a 3d vector v_i for that pixel. Now, in your training data, x_i will already have a ground truth 3d vector (i.e. label), which we'll call y_i. Then your cost function will be (the index i runs over all data samples):
sum_i{(y_i - v_i)^T (y_i - v_i)} = sum_i{||y_i - v_i||^2}
But as I said, this cost function works if the labels form a continuous space (here, R^3). This is also called a regression problem.
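A minimal sketch of that regression cost in code (sizes and names are my own assumptions):

N = 100;
Y = randn(3, N);           % ground-truth 3d vectors y_i
V = Y + 0.1*randn(3, N);   % network predictions v_i
L = sum(sum((Y - V).^2));  % sum_i ||y_i - v_i||^2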
Here's an example if you are interested in (image) classification. I'll explain it with a softmax loss; the intuition for other losses is more or less similar. Assume we have n classes, and imagine that in your training set, for each data point x_i, you have a label c_i that indicates the correct class. Now, your neural network should produce scores for each possible label, which we'll denote s_1,..,s_n. Let's denote the score of the correct class of a training sample x_i as s_{c_i}. Now, if we use a softmax function, the intuition is to transform the scores into a probability distribution and maximise the probability of the correct classes. That is, we maximise the function
sum_i { exp(s_{c_i}) / sum_j(exp(s_j))}
where i runs over all training samples, and j = 1,..,n over all class labels. (In practice one maximises the log of each term, i.e. minimises the cross-entropy loss.)
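A hedged sketch of this objective in its usual cross-entropy form (sizes and names are my own assumptions):

n = 5; N = 20;                   % n classes, N training samples
S = randn(n, N);                 % scores s_1..s_n for each sample
c = randi(n, 1, N);              % correct class labels c_i
P = exp(S) ./ sum(exp(S), 1);    % softmax: scores -> probabilities
idx = sub2ind(size(P), c, 1:N);  % pick p(correct class) per sample
loss = -sum(log(P(idx)));        % cross-entropy loss to minimise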
Finally, I don't think the guide you are reading is a good starting point. I recommend this excellent course instead (essentially the Andrej Karpathy parts, at least).

Matlab: Help in running toolbox for Kalman filter

I have an AR(1) model with data samples $N=500$ that is driven by a random input sequence x. The observation y is corrupted with measurement noise $v$ of zero mean. The model is
y(t) = 0.195 y(t-1) + x(t) + v(t), where x(t) is generated as randn(). I am unsure how to represent this as a state space model and how to estimate the parameter $a$ and the states. I think the state space representation would be
d(t) = \mathbf{a}^T d(t-1) + x(t)
y(t) = \mathbf{h}^T d(t) + \sigma v(t)
with \sigma = 2.
I cannot understand how to perform parameter and state estimation. Using the toolbox mentioned below, I checked that the KF equations match those in textbooks. However, the approach for parameter estimation is different. I would appreciate a recommendation for the implementation procedure.
Implementation 1:
I am following the implementation here: Learning Kalman Filter. This implementation does not use Expectation Maximization to estimate the parameters of the AR model, and it finds the covariance of the process noise. In my case, I don't have process noise, but an input $x$.
Implementation 2: Kalman Filter by Kevin Murphy is another toolbox, which uses EM for parameter estimation of the AR model. Now it is confusing, since the two implementations use different approaches for parameter estimation.
I am having a tough time figuring out the correct approach, the state space model representation and the code. I would appreciate recommendations on how to proceed.
I ran the first implementation with the KalmanARSquareRoot technique and the result is completely different. There is Exponential Moving Average Smoothing being performed, and an MA filter of length 30 is used. The toolbox runs fine on the demo examples, but on changing the model the result is extremely poor. Maybe I am doing something wrong. Do I need to change the KF equations for my time series?
In the second implementation, I cannot figure out what to change in the equations, or where.
In general, if I have to use these tools, do I need to change the KF equations for every time series model? How do I write the equations myself if these toolboxes are inappropriate for the various time series models - AR, MA, ARMA?
I only have a bit of experience with Kalman Filters, so take this with a grain of salt.
It seems you shouldn't need to change the equations at all. Working with the second package (learn_kalman), you can create an A0 matrix of size [length(d(t)) length(d(t))]. C0 is the same, and in your case the initial state probably makes sense to be the identity matrix (unlike your A0). All you need to do is choose a good initial condition.
However, I took a look at your system (plotted an example) and it seems noise dominates your system. KF is an optimal estimator but I have not known it to reject that much noise. It only guarantees a reduced covariance...which means that if your system is mostly dominated by noise, you will calculate a bad model that estimates your system given the noise!
Try plotting [d f], where d is the initial data and f is calculated using the regressive formula:
f(t) = A * f(t-1)
y(t) = C * f(t)
That is, pretend as if there is no noise but using the estimated AR model. You will see that it rejects all the noise and 'technically' models the system well (since the only unique behaviour is at the very beginning).
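A minimal sketch of that check (A and C below are placeholders for your estimated values):

A = 0.2207; C = 1;       % placeholder estimates from the KF fit
N = 500;
f = zeros(1, N);
f(1) = 110;              % a large initial condition, as suggested below
for t = 2:N
    f(t) = A * f(t-1);   % noise-free state recursion
end
y = C * f;               % model output; plot([d(:) y(:)]) to compare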
For example, if you have a system with A = 0.195 and Q = R = 0.1, then you will converge to an A = 0.2207, but that still isn't good enough. Here the problem is that your initial state is so low that within a few steps the data is essentially at 0 up to noise. Naturally the KF can converge to a LOT of model solutions that are similar. Any noise will throw off even the best initial condition.
If you increase the resolution of your data in some way (e.g. a larger initial condition, more refined timesteps) you will see a good fit. For example, change your initial condition to 110 and you'll find the two curves similar, though the model is still fairly different.
I am not aware of any approach that would model your data well. If the noise variance is in fact 1 and your system converges to 0 that quickly, it seems doomed to not be modelled effectively, since you just don't capture any unique behaviour in the dataset.
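Since you asked how to write the equations yourself: here is a hedged sketch of a plain scalar Kalman filter for the AR(1) model above (all names are mine). Note that because x(t) = randn(), it plays exactly the role of unit-variance process noise, so Q = 1 and R = sigma^2:

a = 0.195; sigma = 2; N = 500;
Q = 1;                         % variance of the input x(t)
R = sigma^2;                   % measurement noise variance
d = zeros(1,N); y = zeros(1,N);
for t = 2:N
    d(t) = a*d(t-1) + randn;   % simulate the true state
    y(t) = d(t) + sigma*randn; % noisy observation
end
dhat = 0; P = 1; est = zeros(1,N);
for t = 2:N
    dpred = a*dhat; Ppred = a^2*P + Q;  % predict
    K = Ppred/(Ppred + R);              % Kalman gain
    dhat = dpred + K*(y(t) - dpred);    % update with the innovation
    P = (1 - K)*Ppred;
    est(t) = dhat;
end
plot(1:N, d, 1:N, est)         % true state vs. filtered estimate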

Matlab Simulink model of a nonlinear equation

I'm trying to create a Matlab Simulink model of the following equation:
I am very new to Simulink and need some help getting started.
OK, this is very easy to do.
Set up the equation so that the result is the highest derivative; in your case d^3y/dt^3.
There you have it; nothing more to do.
How to continue from here, you may ask:
You've got x, and you can differentiate it, or apply any equation you want to it. The only doubt that may arise is: where should I get y from?
Easy! You have the equation: integrate the result once and use that value for 4*(d^2y/dt^2)^2, integrate it again and use it for the last term, and integrate it again and use it to multiply x. That's the advantage of Simulink: you can close a loop, using the "result" in the equation to calculate the "result" (this is not 100% true, as each integration uses the value from one step before, but it works).
This is the power of Simulink. Still, I strongly recommend that you read a bit about it, so you can understand why to use Simulink, but playing with it is necessary to learn, so: go!
In general when setting up equations in Simulink you should set up a number of integrator blocks to get all your states. When that is done you can sum the different factors together.
Unfortunately I cannot post the model I made for the equation because of my low reputation points (new here).
    dddy          ddy           dy            y
+ --------> 1/s --------> 1/s -------> 1/s ------->
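If you prefer to script it, here is a hedged sketch (the block and model names are my own) that builds the integrator chain above programmatically; the summed right-hand side of your equation would feed the first integrator's input to close the loop:

mdl = 'integrator_chain';
new_system(mdl); open_system(mdl);
add_block('simulink/Continuous/Integrator', [mdl '/Int1']); % dddy -> ddy
add_block('simulink/Continuous/Integrator', [mdl '/Int2']); % ddy  -> dy
add_block('simulink/Continuous/Integrator', [mdl '/Int3']); % dy   -> y
add_line(mdl, 'Int1/1', 'Int2/1');
add_line(mdl, 'Int2/1', 'Int3/1');
% wire a Sum block computing the highest derivative into Int1/1, and
% feed the Int1..Int3 outputs back into it to close the loop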

Pattern recognition in Neural Network using matlab simulation

I am new to neural networks in Matlab. I want to create a neural network using Matlab simulation.
This Matlab simulation uses pattern recognition.
I am running on a Windows XP platform.
For example, I have a set of waveforms of circular shape.
I have extracted the poles.
These poles will teach my neural network that the waveform is circular in shape, so whenever I input another, slightly different circular waveform, the neural network is able to distinguish the shape.
Currently, I have extracted the poles of these 3 shapes: cylinder, circle and rectangle.
But I am clueless about how I should go about creating my neural network.
I'd recommend using a SOM (self-organizing map) for pattern recognition, since it's really robust. There's also a SOM Toolbox for Matlab you might be interested in. However, to make it learn waves while neglecting their offsets, you'd need to make some changes to the "similarity function". These changes will increase the SOM's training time quite a lot, but if that's not a problem, keep reading.
For the SOM you'll have to sample your waves into constant-sized vectors, let's say:
sin x -> sin_vector = (a1, a2, a3, ..., aN)
cos x -> cos_vector = (b1, b2, b3, ..., bN)
Usually the similarity of "SOM vectors" is calculated with the Euclidean distance. The Euclidean distance between those two vectors is huge, since they have different offsets. In your case they should be considered similar, i.e. the distance should be small. So, if you don't sample all similar waves from the same starting point, they will be classified into different classes. That is probably a problem. But! Similarity of vectors in a SOM is calculated in order to find the BMU (best-matching unit) in the map and to pull the BMU's and its neighborhood's vectors towards the values of the given sample. So all you need to change is the way those vectors are compared and the way the vectors' values are pulled towards the sample, so that both become "offset-tolerant".
A slow but working solution is to first find the best offset index for each vector. The best offset index is the one that produces the smallest Euclidean distance to the sample. The smallest distance calculated over the nodes of the net then determines the BMU. Then the BMU's and its neighborhood's vectors are pulled towards the given sample, using the offset index calculated for each node just before. Everything else should work out of the box.
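A hedged sketch of that offset search (the function name is my own): try every circular shift of the node's vector and keep the smallest Euclidean distance; the node minimising this over the map is the BMU, and its best shift is reused when pulling the vectors towards the sample.

function [dmin, kbest] = offsetDistance(sample, node)
% Smallest Euclidean distance between SAMPLE and all circular
% shifts of NODE, plus the shift index that achieves it.
    dmin = inf; kbest = 0;
    for k = 0:numel(node)-1
        d = norm(sample - circshift(node, k)); % distance at offset k
        if d < dmin
            dmin = d; kbest = k;               % remember best offset
        end
    end
end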
This solution is relatively slow but should work great. I'd recommend studying the concept of SOMs thoroughly and then reading this post (and the angry comments) again :)
PLEASE comment if you know a mathematical solution that would be better than the previous one!
You can try Matlab's neural network pattern recognition tool nprtool, as it is specialized for training and testing neural networks for pattern recognition.