How to solve 2D Markov Chains with Infinite State Space

I have a two-dimensional Markov chain and I want to calculate the steady-state probabilities and then basic performance measures such as the expected number of customers, the expected waiting time, etc. You can check the transition rate diagram at the link below:
http://tinypic.com/view.php?pic=2n063dd&s=8
As I searched for solution methods, the matrix-geometric and spectral expansion methods came up. I tried the matrix-geometric method; however, since my Markov chain is not repetitive, it did not work.
I read some papers (e.g. "Spectral expansion solution for a class of Markov models: application and comparison with the matrix-geometric method"), but I could not figure out how to create the matrices and what the steady-state probabilities are.
Does the spectral expansion method require a "repetitive process" as the matrix-geometric method does? If not, how do I apply it to my problem?
Are there any other methods I could use?
Thanks for all your help!
Ali

First, there is no stable solution method for a two-way infinite lattice strip. At least one variable should be capacitated.
Second, the following are the best-known solution methods for two-dimensional Markov chains with a semi-infinite or finite state space:
Spectral Expansion Method
Matrix Geometric Method
Block Gauss-Seidel Method
Seelen's Method
All of these methods require substantial computational work. Experimental studies show that for a semi-infinite lattice strip, once the capacitated variable exceeds about 50, the solution may not be trustworthy. There is also a state-explosion problem beyond that threshold. To overcome the state-explosion problem, iterative methods such as the Gauss-Seidel and Seelen's methods are used.
Regarding my problem, I set a capacity for both variables. After a search of the literature, the block Gauss-Seidel iterative method seems to be the most appropriate method to apply to my problem; a small illustration of the idea is sketched below.
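For illustration only, here is a minimal MATLAB sketch of the underlying idea (a toy M/M/1/2 queue, not the 2D model above): Gauss-Seidel sweeps on pi*Q = 0 with renormalization. For a capacitated 2D chain you would build Q by enumerating the states (i,j) as k = i*(c2+1) + j + 1 and sweep the same way, blockwise if desired.

    % Toy generator: M/M/1/2 queue with lambda = 1, mu = 2 (invented example).
    lambda = 1; mu = 2;
    Q = [-lambda,          lambda,      0;
          mu,     -(lambda + mu), lambda;
          0,               mu,       -mu];

    n = size(Q, 1);
    p = ones(1, n) / n;                 % initial guess: uniform
    for sweep = 1:10000
        old = p;
        for k = 1:n
            % pi*Q = 0  =>  p(k) = -sum_{m ~= k} p(m)*Q(m,k) / Q(k,k)
            s = p * Q(:, k) - p(k) * Q(k, k);
            p(k) = -s / Q(k, k);
        end
        p = p / sum(p);                 % renormalize after each sweep
        if max(abs(p - old)) < 1e-12, break; end
    end
    disp(p)    % -> approximately [0.5714 0.2857 0.1429]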
Thank you.

Related

Basic Hidden Markov Model, Viterbi algorithm

I am fairly new to Hidden Markov Models and I am trying to wrap my head around a pretty basic part of the theory.
I would like to use an HMM as a classifier: given a time series of data, I have two classes, background and signal.
How are the emission probabilities estimated for each class? Does the Viterbi algorithm need a template of the background and signal to estimate prob(data|state)? Or have I completely missed the point?
To do classification with Viterbi you need to already know the model parameters.
Background and Signal are your two hidden states. With the model parameters and observed data you want to use Viterbi to calculate the most likely sequence of hidden states.
To quote the hmmlearn documentation:
The HMM is a generative probabilistic model, in which a sequence of observable X variables is generated by a sequence of internal hidden states Z. The hidden states are not observed directly. The transitions between hidden states are assumed to have the form of a (first-order) Markov chain. They can be specified by the start probability vector π and a transition probability matrix A. The emission probability of an observable can be any distribution with parameters θ conditioned on the current hidden state. The HMM is completely determined by π, A and θ.
There are three fundamental problems for HMMs:
Given the model parameters and observed data, estimate the optimal sequence of hidden states.
Given the model parameters and observed data, calculate the likelihood of the data.
Given just the observed data, estimate the model parameters.
The first and the second problem can be solved by the dynamic programming algorithms known as the Viterbi algorithm and the Forward-Backward algorithm, respectively. The last one can be solved by an iterative Expectation-Maximization (EM) algorithm, known as the Baum-Welch algorithm.
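For concreteness, here is a small MATLAB sketch of the first problem (decoding); hmmviterbi is in the Statistics and Machine Learning Toolbox, all probabilities are invented, and note that hmmviterbi assumes the model starts in state 1:

    % Two hidden states (1 = background, 2 = signal), binary observations
    % coded as symbols 1 and 2.
    TRANS = [0.9 0.1;     % row i: P(next state | current state i)
             0.2 0.8];
    EMIS  = [0.8 0.2;     % row i: P(symbol 1 or 2 | state i)
             0.1 0.9];
    seq = [1 1 2 2 2 1 2];
    states = hmmviterbi(seq, TRANS, EMIS)   % most likely hidden-state path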
So we've got two states for our Hidden Markov model, noise and signal. We also must have something we observe, which could be ones and zeroes. Basically, ones are signal and zeroes are noise, but you get a few zeros mixed in with your signal and a few ones with the noise. So you need to know
Probability of 0,1 when in the "noise" state
Probability of 0,1 when in the "signal" state
Probability of transitioning to "signal" when in the "noise" state.
Probability of transitioning to "noise" when in the "signal" state.
So we keep track of the probability of each state for each time slot and, crucially, the most likely route we got there (based on the transition probabilities). Then we assume that the most probable state at the end of the time series is our actual final state, and trace backwards.
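To make that bookkeeping concrete, here is a minimal hand-rolled Viterbi in MATLAB for the noise/signal example (all numbers invented; observations coded as 1 = "zero", 2 = "one"):

    start = [0.6 0.4];              % P(initial state): [noise signal]
    trans = [0.9 0.1; 0.2 0.8];     % trans(i,j) = P(state j | state i)
    emis  = [0.8 0.2; 0.1 0.9];     % emis(i,k)  = P(obs k | state i)
    obs   = [1 1 2 2 2 1 2];

    T = numel(obs); N = numel(start);
    logp = -inf(T, N);              % best log-prob of any path ending in state j
    back = zeros(T, N);             % backpointers: where that best path came from
    logp(1, :) = log(start) + log(emis(:, obs(1))');
    for t = 2:T
        for j = 1:N
            [best, back(t, j)] = max(logp(t-1, :) + log(trans(:, j))');
            logp(t, j) = best + log(emis(j, obs(t)));
        end
    end
    path = zeros(1, T);             % assume the most probable final state...
    [~, path(T)] = max(logp(T, :));
    for t = T-1:-1:1                % ...and trace backwards
        path(t) = back(t+1, path(t+1));
    end
    disp(path)                      % 1 = noise, 2 = signal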
The Viterbi algorithm requires knowing the HMM.
The HMM parameters can be estimated by maximum-likelihood estimation (MLE); the standard procedure for this is the Baum–Welch algorithm.
If you have trouble with the Viterbi algorithm, there's a working implementation here.

Why do we take the derivative of the transfer function in calculating back propagation algorithm?

What is the concept behind taking the derivative? It's interesting that for somehow teaching a system, we have to adjust its weights. But why are we doing this using the derivative of the transfer function? What is it about the derivative that helps us? I know the derivative is the slope of a continuous function at a given point, but what does it have to do with the problem?
You must already know that the cost function is a function with the weights as the variables.
For now consider it as f(W).
Our main motive here is to find a W for which we get the minimum value for f(W).
One of the ways of doing this is to plot the function f on one axis and W on another... but remember that here W is not just a single variable but a collection of variables.
So what can the other way be?
It can be as simple as changing the values of W and seeing whether we get a lower value of f(W) than before.
But trying random values for all the variables in W would be a tedious task.
So what we do is: we first take random values for W, look at the output f(W), and compute the slope with respect to each variable (we get this by partially differentiating the function with respect to the i-th variable and plugging in the current value of the i-th variable).
Now, once we know the slope at that point in the space, we move a little further toward the lower side of the slope (this little factor is termed alpha in gradient descent), and this goes on until the slope gives an opposite value, stating that we have already reached the lowest point in the graph (a graph with n dimensions, function vs. W, W being a collection of n variables).
The reason is that we are trying to minimize the loss. Specifically, we do this by a gradient descent method. It basically means that from our current point in the parameter space (determined by the complete set of current weights), we want to go in a direction which will decrease the loss function. Visualize standing on a hillside and walking down the direction where the slope is steepest.
Mathematically, the direction that gives you the steepest descent from your current point in parameter space is the negative gradient. And the gradient is nothing but the vector made up of all the derivatives of the loss function with respect to each single parameter.
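To make that concrete, here is a tiny MATLAB sketch of gradient descent on an invented two-parameter loss (just the geometry of the idea, not a neural network):

    f    = @(w) (w(1) - 3)^2 + 2*(w(2) + 1)^2;   % made-up loss f(W)
    grad = @(w) [2*(w(1) - 3); 4*(w(2) + 1)];    % vector of partial derivatives

    w = [0; 0];
    alpha = 0.1;                   % step size (learning rate)
    for k = 1:100
        w = w - alpha * grad(w);   % walk against the gradient, downhill
    end
    disp(w')     % -> approximately [3 -1], where the loss is minimal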
Backpropagation is an application of the Chain Rule to neural networks. If the forward pass involves applying a transfer function, the gradient of the loss function with respect to the weights will include the derivative of the transfer function, since the derivative of f(g(x)) is f’(g(x))g’(x).
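As a minimal illustration (a single sigmoid neuron with squared loss, all numbers invented), the transfer function's derivative shows up as one factor in that chain:

    sigmoid = @(z) 1 ./ (1 + exp(-z));

    x = [0.5; -1.2]; target = 1;   % one input pattern and its label (made up)
    w = [0.1; 0.4];                % current weights

    z = w' * x;                    % pre-activation
    y = sigmoid(z);                % transfer (activation) function
    loss = 0.5 * (y - target)^2;

    % Chain rule: dL/dw = dL/dy * dy/dz * dz/dw
    dL_dy = y - target;
    dy_dz = sigmoid(z) * (1 - sigmoid(z));   % derivative of the transfer function
    dL_dw = dL_dy * dy_dz * x;
    disp(dL_dw')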
Your question is a really good one! Why should I move the weight more in one direction when the slope of the error w.r.t. the weight is high? Does that really make sense? In fact it does make sense if the error function w.r.t. the weight is a parabola. However, it is a wild guess to assume it is a parabola. As rcpinto says, assuming the error function is a parabola makes the derivation of the updates simple with the chain rule.
However, there are some other parameter update rules that actually address this non-intuitive assumption. You can make an update rule that moves the weight a fixed-size step in the down-slope direction and then, perhaps later in the training, decreases the step size logarithmically as you train. (I'm not sure if this method has a formal name.)
There are also some alternative error functions that can be used. Look up cross entropy in your neural network textbook. This is an adjustment to the error function such that the derivative (of the transfer function) factor in the update rule cancels out. Just remember to pick the right cross entropy function based on your output transfer function.
When I first started getting into Neural Nets, I had this question too.
The other answers here have explained the math which makes it pretty clear that a derivative term will appear in your calculations while you are trying to update the weights. But all of those calculations are being done in order to implement Back-propagation, which is just one of the ways of updating weights! Now read on...
You are correct in assuming that at the end of the day, all a neural network tries to do is update its weights to fit the data you feed into it. Within this statement lies your answer too. What you are getting confused with here is the idea of the Back-propagation algorithm. Many textbooks use backprop to update neural nets by default but do not mention that there are other ways to update weights too. This leads to the confusion that neural nets and backprop are the same thing and are inherently connected. This also leads to the false belief that neural nets need backprop to train.
Please remember that Back-propagation is just ONE of the ways out there to train your neural network (although it is the most famous one). Now, you must have seen the math involved in backprop, and hence you can see where the derivative term comes in from (some other answers have also explained that). It is possible that other training methods won't need the derivatives, although most of them do. Read on to find out why....
Think about this intuitively: we are talking about CHANGING weights, and the direct mathematical operation related to change is the derivative, so it makes sense that you need to evaluate derivatives in order to change the weights.
Do let me know if you are still confused and I'll try to modify my answer to make it better. Just as a parting piece of information, another common misconception is that gradient descent is a part of backprop, just like it is assumed that backprop is a part of neural nets. Gradient descent is just one way to minimize your cost function, there are plenty of others you can use. One of the answers above makes this wrong assumption too when it says "Specifically Gradient Descent". This is factually incorrect. :)
Training a neural network means minimizing an associated "error" function with respect to the network's weights. Now there are optimization methods that use only function values (the simplex method of Nelder and Mead, Hooke and Jeeves, etc.), methods that in addition use first derivatives (steepest descent, quasi-Newton, conjugate gradient), and Newton methods using second derivatives as well. So if you want to use a derivative method, you have to calculate the derivatives of the error function, which in turn involves the derivatives of the transfer or activation function.
Back propagation is just a nice algorithm to calculate the derivatives, and nothing more.
Yes, the question is really good; it also came up in my head while I was trying to understand backpropagation. After doing forward propagation on a neural network, we do backpropagation in the network to minimize the total error, and there are also many other ways to minimize the error. Your question is: why do we take derivatives in backpropagation? The reason is that the derivative gives the slope of a function, or in other words the change of one particular quantity with respect to another. So here we take the derivative of the total error with respect to the corresponding weights of the network.
By taking the derivative of the total error with respect to a weight, we find its slope, or in other words how the total error changes with a small change of that weight, so that we can update the weight to minimize the error with the help of the gradient descent formula: weight = weight - alpha*(d(total error)/d(weight)), or in other words, new weights = old weights - learning rate x partial derivatives of the loss function w.r.t. the parameters.
Here alpha is the learning rate, which controls the weight update: because of the minus sign in the formula, a negative derivative produces a positive update and a positive derivative a negative one, so the weight always moves in the direction that reduces the total error. Also, since the derivative is multiplied by alpha, the step size shrinks as the weight converges to its optimal value (minimum error). That is why we take derivatives to minimize the error.

What's the best way to calculate a numerical derivative in MATLAB?

(Note: This is intended to be a community Wiki.)
Suppose I have a set of points xi = {x0,x1,x2,...xn} and corresponding function values fi = f(xi) = {f0,f1,f2,...,fn}, where f(x) is, in general, an unknown function. (In some situations, we might know f(x) ahead of time, but we want to do this generally, since we often don't know f(x) in advance.) What's a good way to approximate the derivative of f(x) at each point xi? That is, how can I estimate values of dfi == d/dx fi == df(xi)/dx at each of the points xi?
Unfortunately, MATLAB doesn't have a very good general-purpose, numerical differentiation routine. Part of the reason for this is probably because choosing a good routine can be difficult!
So what kinds of methods are there? What routines exist? How can we choose a good routine for a particular problem?
There are several considerations when choosing how to differentiate in MATLAB:
Do you have a symbolic function or a set of points?
Is your grid evenly or unevenly spaced?
Is your domain periodic? Can you assume periodic boundary conditions?
What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
Does it matter to you that your derivative is evaluated on the same points as your function is defined?
Do you need to calculate multiple orders of derivatives?
What's the best way to proceed?
These are just some quick-and-dirty suggestions. Hopefully somebody will find them helpful!
1. Do you have a symbolic function or a set of points?
If you have a symbolic function, you may be able to calculate the derivative analytically. (Chances are, you would have done this if it were that easy, and you would not be here looking for alternatives.)
If you have a symbolic function and cannot calculate the derivative analytically, you can always evaluate the function on a set of points, and use some other method listed on this page to evaluate the derivative.
In most cases, you have a set of points (xi,fi), and will have to use one of the following methods....
2. Is your grid evenly or unevenly spaced?
If your grid is evenly spaced, you probably will want to use a finite difference scheme (see either of the Wikipedia articles here or here), unless you are using periodic boundary conditions (see below). Here is a decent introduction to finite difference methods in the context of solving ordinary differential equations on a grid (see especially slides 9-14). These methods are generally computationally efficient, simple to implement, and the error of the method can be simply estimated as the truncation error of the Taylor expansions used to derive it.
If your grid is unevenly spaced, you can still use a finite difference scheme, but the expressions are more difficult and the accuracy varies very strongly with how uniform your grid is. If your grid is very non-uniform, you will probably need to use large stencil sizes (more neighboring points) to calculate the derivative at a given point. People often construct an interpolating polynomial (often the Lagrange polynomial) and differentiate that polynomial to compute the derivative. See for instance, this StackExchange question. It is often difficult to estimate the error using these methods (although some have attempted to do so: here and here). Fornberg's method is often very useful in these cases....
Care must be taken at the boundaries of your domain because the stencil often involves points that are outside the domain. Some people introduce "ghost points" or combine boundary conditions with derivatives of different orders to eliminate these "ghost points" and simplify the stencil. Another approach is to use right- or left-sided finite difference methods.
Here's an excellent "cheat sheet" of finite difference methods, including centered, right- and left-sided schemes of low orders. I keep a printout of this near my workstation because I find it so useful.
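As a minimal illustration of such schemes (synthetic test function; second-order centered differences in the interior, with second-order one-sided formulas from such cheat sheets at the two edges):

    N = 41; x = linspace(0, 1, N); h = x(2) - x(1);
    f = x.^3;                                           % test function, f' = 3x^2

    df = zeros(size(f));
    df(2:N-1) = (f(3:N) - f(1:N-2)) / (2*h);            % centered, interior
    df(1)     = (-3*f(1) + 4*f(2) - f(3)) / (2*h);      % one-sided, left edge
    df(N)     = ( 3*f(N) - 4*f(N-1) + f(N-2)) / (2*h);  % one-sided, right edge

    max(abs(df - 3*x.^2))    % O(h^2) error everywhere, boundaries included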
3. Is your domain periodic? Can you assume periodic boundary conditions?
If your domain is periodic, you can compute derivatives to a very high order accuracy using Fourier spectral methods. This technique sacrifices performance somewhat to gain high accuracy. In fact, if you are using N points, your estimate of the derivative is approximately N^th order accurate. For more information, see (for example) this WikiBook.
Fourier methods often use the Fast Fourier Transform (FFT) algorithm to achieve roughly O(N log(N)) performance, rather than the O(N^2) algorithm that a naively-implemented discrete Fourier transform (DFT) might employ.
If your function and domain are not periodic, you should not use the Fourier spectral method. If you attempt to use it with a function that is not periodic, you will get large errors and undesirable "ringing" phenomena.
Computing derivatives of any order requires 1) a transform from grid-space to spectral space (O(N log(N))), 2) multiplication of the Fourier coefficients by their spectral wavenumbers (O(N)), and 3) an inverse transform from spectral space to grid space (again O(N log(N))).
Care must be taken when multiplying the Fourier coefficients by their spectral wavenumbers. Every implementation of the FFT algorithm seems to have its own ordering of the spectral modes and normalization parameters. See, for instance, the answer to this question on the Math StackExchange, for notes about doing this in MATLAB.
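Here is a small MATLAB sketch of that procedure (synthetic periodic function; note the wavenumber ordering that fft uses, and that the Nyquist mode is zeroed for an odd-order derivative):

    N = 64; L = 2*pi;
    x = (0:N-1) * (L/N);
    f = exp(sin(x));                          % smooth and periodic

    k = (2*pi/L) * [0:N/2-1, 0, -N/2+1:-1];   % fft wavenumber ordering
    df = real(ifft(1i * k .* fft(f)));        % transform, multiply, invert

    max(abs(df - cos(x).*exp(sin(x))))        % error near machine precision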
4. What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
For many purposes, a 1st or 2nd order finite difference scheme may be sufficient. For higher precision, you can use higher order Taylor expansions, dropping higher-order terms.
If you need to compute the derivatives within a given tolerance, you may want to look around for a high-order scheme that has the error you need.
Often, the best way to reduce error is reducing the grid spacing in a finite difference scheme, but this is not always possible.
Be aware that higher-order finite difference schemes almost always require larger stencil sizes (more neighboring points). This can cause issues at the boundaries. (See the discussion above about ghost points.)
5. Does it matter to you that your derivative is evaluated on the same points as your function is defined?
MATLAB provides the diff function to compute differences between adjacent array elements. This can be used to calculate approximate derivatives via a first-order forward-differencing (or forward finite difference) scheme, but the estimates are low-order estimates. As described in MATLAB's documentation of diff (link), if you input an array of length N, it will return an array of length N-1. When you estimate derivatives using this method on N points, you will only have estimates of the derivative at N-1 points. (Note that this can be used on uneven grids, if they are sorted in ascending order.)
In most cases, we want the derivative evaluated at all points, which means we want to use something besides the diff method.
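A quick sanity check contrasting the two on synthetic data (uniform grid):

    x = linspace(0, 2*pi, 50);
    f = sin(x);

    df_fwd = diff(f) ./ diff(x);   % forward differences: 49 values, 1st order
    df_ctr = gradient(f, x);       % central differences: 50 values, 2nd order

    max(abs(df_fwd - cos(x(1:end-1))))   % error is O(h)
    max(abs(df_ctr - cos(x)))            % error is O(h^2) in the interior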
6. Do you need to calculate multiple orders of derivatives?
One can set up a system of equations in which the grid point function values and the 1st and 2nd order derivatives at these points all depend on each other. This can be found by combining Taylor expansions at neighboring points as usual, but keeping the derivative terms rather than cancelling them out, and linking them together with those of neighboring points. These equations can be solved via linear algebra to give not just the first derivative, but the second as well (or higher orders, if set up properly). I believe these are called combined finite difference schemes, and they are often used in conjunction with compact finite difference schemes, which will be discussed next.
Compact finite difference schemes (link). In these schemes, one sets up a design matrix and calculates the derivatives at all points simultaneously via a matrix solve. They are called "compact" because they are usually designed to require fewer stencil points than ordinary finite difference schemes of comparable accuracy. Because they involve a matrix equation that links all points together, certain compact finite difference schemes are said to have "spectral-like resolution" (e.g. Lele's 1992 paper--excellent!), meaning that they mimic spectral schemes by depending on all nodal values and, because of this, they maintain accuracy at all length scales. In contrast, typical finite difference methods are only locally accurate (the derivative at point #13, for example, ordinarily doesn't depend on the function value at point #200).
A current area of research is how best to solve for multiple derivatives in a compact stencil. The results of such research, combined, compact finite difference methods, are powerful and widely applicable, though many researchers tend to tune them for particular needs (performance, accuracy, stability, or a particular field of research such as fluid dynamics).
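As a small sketch of the idea (the classical fourth-order Padé scheme on a uniform periodic grid, with coefficients alpha = 1/4 and a = 3/2 as in Lele's paper; the matrix solve links all the nodal values together):

    % (1/4) f'(i-1) + f'(i) + (1/4) f'(i+1) = (3/2) * (f(i+1) - f(i-1)) / (2h)
    N = 64; L = 2*pi; h = L/N;
    x = (0:N-1)' * h;
    f = sin(x);

    e = ones(N, 1);
    A = spdiags([e/4, e, e/4], -1:1, N, N);
    A(1, N) = 1/4; A(N, 1) = 1/4;             % periodic wrap-around
    rhs = (3/2) * (circshift(f, -1) - circshift(f, 1)) / (2*h);

    df = A \ rhs;                             % all derivatives at once
    max(abs(df - cos(x)))                     % small: scheme is 4th-order accurate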
Ready-to-Go Routines
As described above, one can use the diff function (link to documentation) to compute rough derivatives between adjacent array elements.
MATLAB's gradient routine (link to documentation) is a great option for many purposes. It implements a second-order, central difference scheme. It has the advantages of computing derivatives in multiple dimensions and supporting arbitrary grid spacing. (Thanks to #thewaywewalk for pointing out this glaring omission!)
I used Fornberg's method (see above) to develop a small routine (nderiv_fornberg) to calculate finite differences in one dimension for arbitrary grid spacings. I find it easy to use. It uses sided stencils of 6 points at the boundaries and a centered, 5-point stencil in the interior. It is available at the MATLAB File Exchange here.
Conclusion
The field of numerical differentiation is very diverse. For each method listed above, there are many variants with their own set of advantages and disadvantages. This post is hardly a complete treatment of numerical differentiation.
Every application is different. Hopefully this post gives the interested reader an organized list of considerations and resources for choosing a method that suits their own needs.
This community wiki could be improved with code snippets and examples particular to MATLAB.
I believe there is more to these particular questions, so I have elaborated on the subject further as follows:
(4) Q: What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
A: The accuracy of numerical differentiation is subjective to the application of interest. Usually it works like this: if you are using the ND in a forward problem to approximate derivatives in order to estimate features from the signal of interest, then you should be aware of noise perturbations. Such artifacts usually contain high-frequency components, and by the definition of the differentiator, the noise effect will be amplified by a factor on the order of (iω)^n. So increasing the accuracy of the differentiator (increasing the polynomial accuracy) will not help at all. In this case you should cancel the effect of the noise before differentiation. This can be done in cascade order: first smooth the signal, then differentiate. But a better way of doing this is to use a "lowpass differentiator". A good example of a MATLAB library can be found here.
However, if this is not the case and you're using ND in inverse problems, such as solving PDEs, then the global accuracy of the differentiator is very important. Depending on what kind of boundary condition (BC) suits your problem, the design will be adapted accordingly. The rule of thumb is to increase the numerical accuracy of what is known as the full-band differentiator. You need to design a derivative matrix that takes care of a suitable BC. You can find comprehensive solutions to such designs using the above link.
(5) Does it matter to you that your derivative is evaluated on the same points as your function is defined?
A: Yes, absolutely. Evaluating the ND on the same grid points is called a "centralized" scheme, and off the points a "staggered" scheme. Note that for odd orders of derivatives, a centralized ND will degrade the accuracy of the frequency response of the differentiator. Therefore, if you're using such a design in inverse problems, this will perturb your approximation. The opposite applies to even orders of differentiation utilized by staggered schemes. You can find a comprehensive explanation of this subject using the link above.
(6) Do you need to calculate multiple orders of derivatives?
A: This depends entirely on your application at hand. You can refer to the same link I have provided, which also covers designs for multiple derivatives.

How calculating hessian works for Neural Network learning

Can anyone explain to me in an easy and less mathematical way what a Hessian is and how it works in practice when optimizing the learning process for a neural network?
To understand the Hessian you first need to understand the Jacobian, and to understand the Jacobian you need to understand the derivative.
The derivative is a measure of how fast the function value changes with a change of the argument. So if you have the function f(x)=x^2 you can compute its derivative and learn how fast f(x+t) changes for small enough t. This gives you knowledge of the basic dynamics of the function.
The gradient shows you, for multidimensional functions, the direction of the biggest value change (which is based on the directional derivatives). Given a function, e.g. g(x,y)=-x+y^2, you will know that it is better to minimize the value of x while strongly maximizing the value of y. This is the basis of gradient-based methods, like the steepest descent technique (used in the traditional backpropagation methods).
The Jacobian is yet another generalization, for the case where your function has many output values, like g(x,y)=(x+1, xy, x-y); you then have 2·3 partial derivatives, one gradient per output value (each of the 3 values), together forming a matrix of 2*3=6 values.
Now, the derivative shows you the dynamics of the function itself. But you can go one step further: if you can use this dynamics to find the optimum of the function, maybe you can do even better if you find out the dynamics of this dynamics, i.e. compute derivatives of second order? This is exactly what the Hessian is: a matrix of the second-order derivatives of your function. It captures the dynamics of the derivatives: how fast (and in what direction) the change changes. It may seem a bit complex at first sight, but if you think about it for a while it becomes quite clear. You want to go in the direction of the gradient, but you do not know "how far" (what the correct step size is). So you define a new, smaller optimization problem, where you ask "OK, I have this gradient, how can I tell where to go?" and solve it analogously, using derivatives (and the derivatives of the derivatives form the Hessian).
You may also look at this in a geometrical way: gradient-based optimization approximates your function with a line. You simply try to find the line which is closest to your function at the current point, and it defines the direction of change. Now, lines are quite primitive; maybe we could use some more complex shapes, like... parabolas? Second-derivative (Hessian) methods are just trying to fit a parabola (a quadratic function, f(x)=ax^2+bx+c) to your current position and, based on this approximation, choose a valid step.
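A tiny sketch of that picture (invented quadratic loss): a Newton step solves against the Hessian, so it picks both the direction and the step size, and because the local parabola is exact here it lands on the minimum in one step:

    f    = @(w) (w(1) - 3)^2 + 10*(w(2) + 1)^2;   % made-up quadratic loss
    grad = @(w) [2*(w(1) - 3); 20*(w(2) + 1)];    % first derivatives (gradient)
    H    = [2 0; 0 20];                           % second derivatives (Hessian)

    w = [0; 0];
    w = w - H \ grad(w);    % Newton step: solve H * dw = -grad
    disp(w')                % -> [3 -1], the exact minimum, in one step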

Return elements of the Groebner Basis as they are found

This question could refer to any computer algebra system which has the ability to compute a Gröbner basis from a set of polynomials (Mathematica, Singular, GAP, Macaulay2, MATLAB, etc.).
I am working with an overdetermined system of polynomials for which the full Gröbner basis is too difficult to compute; however, it would be valuable for me to be able to print out the Gröbner basis elements as they are found, so that I may know whether a particular polynomial is in the Gröbner basis. Is there any way to do this?
If you implement Buchberger's algorithm on your own, then you can simply print out the elements as they are found.
If you have Mathematica, you can use this code as your starting point.
https://www.msu.edu/course/mth/496/snapshot.afs/groebner.m
See the function BuchbergerSteps.
Due to the way the Buchberger algorithm works (see, for instance, Wikipedia or IVA), the partial results that you could obtain by printing intermediate results are not guaranteed to constitute a Gröbner basis.
Depending on your ultimate goal, you may want to try instead an algorithm for triangularization of ideals, such as Ritt-Wu's algorithm (see IVA or Shang-Ching Chou's book). This is somewhat similar to reduction to row echelon form in Linear Algebra, and you may interrupt the algorithm at any point to get a partially reduced system of polynomial equations.