Which way non-symmetric cost matrices work? - traminer

I was trying to answer to this question Traminer substitution cost
And it occurred to me that I don't really know in what direction the matrix is treated by TraMineR. Let's say for example that i have the following matrix
A B
A 0 1
B 2 0
does that mean that TraMineR considers A->B cost is 2 or that B->A cost is 2 ?
thanks !

Your question is about substitution costs in TraMineR, and TraMineR uses substitution costs for computing optimal matching and related dissimilarity measures. Costs are supposed to reflect the dissimilarity between states.
The algorithm used for determining the minimal cost of editing one sequence into the other is essentially Needleman & Wunsch , and this algorithm assumes the costs are symmetric.
So, your question is not a concern for TraMineR.
If you really want to use a concept of non-symmetric costs, you will have to look for algorithms---or define your own algorithm---to evaluate the dissimilarity between sequences from such non-symmetric costs. No such function is provided by TraMineR.

Related

better building of kd-trees

Has anyone ever tried improving kd-trees using the following method?
Dividing each numeric dimension via some 1-d clustering method (e.g. Jenks Natural Breaks Optimization, or FayyadIranni or xyz...)
Sorting the dimensions on the expected value of the variance reduction within each division of that dimension
Building the KD-tree top-down selecting attributes from the order found in (2)
Breaking dimensions at each level of the KD-tree using the divisions found in (1)
And just to say the obvious. If (3) terminates when #rows is (say) less than 30 then nearest neighbor would require 30 distance measures, not N.
You want the tree to be balanced, so there is not much leeway in terms of where to split.
Also, you want the construction to be fast.
If you put in an O(n^2) method during construction, construction will likely be the new bottleneck.
In many cases, the very simple (original) k-d-tree is just as fast as any of the "optimized" techniques that try to determine the "best" splitting axis.

What's the best way to calculate a numerical derivative in MATLAB?

(Note: This is intended to be a community Wiki.)
Suppose I have a set of points xi = {x0,x1,x2,...xn} and corresponding function values fi = f(xi) = {f0,f1,f2,...,fn}, where f(x) is, in general, an unknown function. (In some situations, we might know f(x) ahead of time, but we want to do this generally, since we often don't know f(x) in advance.) What's a good way to approximate the derivative of f(x) at each point xi? That is, how can I estimate values of dfi == d/dx fi == df(xi)/dx at each of the points xi?
Unfortunately, MATLAB doesn't have a very good general-purpose, numerical differentiation routine. Part of the reason for this is probably because choosing a good routine can be difficult!
So what kinds of methods are there? What routines exist? How can we choose a good routine for a particular problem?
There are several considerations when choosing how to differentiate in MATLAB:
Do you have a symbolic function or a set of points?
Is your grid evenly or unevenly spaced?
Is your domain periodic? Can you assume periodic boundary conditions?
What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
Does it matter to you that your derivative is evaluated on the same points as your function is defined?
Do you need to calculate multiple orders of derivatives?
What's the best way to proceed?
These are just some quick-and-dirty suggestions. Hopefully somebody will find them helpful!
1. Do you have a symbolic function or a set of points?
If you have a symbolic function, you may be able to calculate the derivative analytically. (Chances are, you would have done this if it were that easy, and you would not be here looking for alternatives.)
If you have a symbolic function and cannot calculate the derivative analytically, you can always evaluate the function on a set of points, and use some other method listed on this page to evaluate the derivative.
In most cases, you have a set of points (xi,fi), and will have to use one of the following methods....
2. Is your grid evenly or unevenly spaced?
If your grid is evenly spaced, you probably will want to use a finite difference scheme (see either of the Wikipedia articles here or here), unless you are using periodic boundary conditions (see below). Here is a decent introduction to finite difference methods in the context of solving ordinary differential equations on a grid (see especially slides 9-14). These methods are generally computationally efficient, simple to implement, and the error of the method can be simply estimated as the truncation error of the Taylor expansions used to derive it.
If your grid is unevenly spaced, you can still use a finite difference scheme, but the expressions are more difficult and the accuracy varies very strongly with how uniform your grid is. If your grid is very non-uniform, you will probably need to use large stencil sizes (more neighboring points) to calculate the derivative at a given point. People often construct an interpolating polynomial (often the Lagrange polynomial) and differentiate that polynomial to compute the derivative. See for instance, this StackExchange question. It is often difficult to estimate the error using these methods (although some have attempted to do so: here and here). Fornberg's method is often very useful in these cases....
Care must be taken at the boundaries of your domain because the stencil often involves points that are outside the domain. Some people introduce "ghost points" or combine boundary conditions with derivatives of different orders to eliminate these "ghost points" and simplify the stencil. Another approach is to use right- or left-sided finite difference methods.
Here's an excellent "cheat sheet" of finite difference methods, including centered, right- and left-sided schemes of low orders. I keep a printout of this near my workstation because I find it so useful.
3. Is your domain periodic? Can you assume periodic boundary conditions?
If your domain is periodic, you can compute derivatives to a very high order accuracy using Fourier spectral methods. This technique sacrifices performance somewhat to gain high accuracy. In fact, if you are using N points, your estimate of the derivative is approximately N^th order accurate. For more information, see (for example) this WikiBook.
Fourier methods often use the Fast Fourier Transform (FFT) algorithm to achieve roughly O(N log(N)) performance, rather than the O(N^2) algorithm that a naively-implemented discrete Fourier transform (DFT) might employ.
If your function and domain are not periodic, you should not use the Fourier spectral method. If you attempt to use it with a function that is not periodic, you will get large errors and undesirable "ringing" phenomena.
Computing derivatives of any order requires 1) a transform from grid-space to spectral space (O(N log(N))), 2) multiplication of the Fourier coefficients by their spectral wavenumbers (O(N)), and 2) an inverse transform from spectral space to grid space (again O(N log(N))).
Care must be taken when multiplying the Fourier coefficients by their spectral wavenumbers. Every implementation of the FFT algorithm seems to have its own ordering of the spectral modes and normalization parameters. See, for instance, the answer to this question on the Math StackExchange, for notes about doing this in MATLAB.
4. What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
For many purposes, a 1st or 2nd order finite difference scheme may be sufficient. For higher precision, you can use higher order Taylor expansions, dropping higher-order terms.
If you need to compute the derivatives within a given tolerance, you may want to look around for a high-order scheme that has the error you need.
Often, the best way to reduce error is reducing the grid spacing in a finite difference scheme, but this is not always possible.
Be aware that higher-order finite difference schemes almost always require larger stencil sizes (more neighboring points). This can cause issues at the boundaries. (See the discussion above about ghost points.)
5. Does it matter to you that your derivative is evaluated on the same points as your function is defined?
MATLAB provides the diff function to compute differences between adjacent array elements. This can be used to calculate approximate derivatives via a first-order forward-differencing (or forward finite difference) scheme, but the estimates are low-order estimates. As described in MATLAB's documentation of diff (link), if you input an array of length N, it will return an array of length N-1. When you estimate derivatives using this method on N points, you will only have estimates of the derivative at N-1 points. (Note that this can be used on uneven grids, if they are sorted in ascending order.)
In most cases, we want the derivative evaluated at all points, which means we want to use something besides the diff method.
6. Do you need to calculate multiple orders of derivatives?
One can set up a system of equations in which the grid point function values and the 1st and 2nd order derivatives at these points all depend on each other. This can be found by combining Taylor expansions at neighboring points as usual, but keeping the derivative terms rather than cancelling them out, and linking them together with those of neighboring points. These equations can be solved via linear algebra to give not just the first derivative, but the second as well (or higher orders, if set up properly). I believe these are called combined finite difference schemes, and they are often used in conjunction with compact finite difference schemes, which will be discussed next.
Compact finite difference schemes (link). In these schemes, one sets up a design matrix and calculates the derivatives at all points simultaneously via a matrix solve. They are called "compact" because they are usually designed to require fewer stencil points than ordinary finite difference schemes of comparable accuracy. Because they involve a matrix equation that links all points together, certain compact finite difference schemes are said to have "spectral-like resolution" (e.g. Lele's 1992 paper--excellent!), meaning that they mimic spectral schemes by depending on all nodal values and, because of this, they maintain accuracy at all length scales. In contrast, typical finite difference methods are only locally accurate (the derivative at point #13, for example, ordinarily doesn't depend on the function value at point #200).
A current area of research is how best to solve for multiple derivatives in a compact stencil. The results of such research, combined, compact finite difference methods, are powerful and widely applicable, though many researchers tend to tune them for particular needs (performance, accuracy, stability, or a particular field of research such as fluid dynamics).
Ready-to-Go Routines
As described above, one can use the diff function (link to documentation) to compute rough derivatives between adjacent array elements.
MATLAB's gradient routine (link to documentation) is a great option for many purposes. It implements a second-order, central difference scheme. It has the advantages of computing derivatives in multiple dimensions and supporting arbitrary grid spacing. (Thanks to #thewaywewalk for pointing out this glaring omission!)
I used Fornberg's method (see above) to develop a small routine (nderiv_fornberg) to calculate finite differences in one dimension for arbitrary grid spacings. I find it easy to use. It uses sided stencils of 6 points at the boundaries and a centered, 5-point stencil in the interior. It is available at the MATLAB File Exchange here.
Conclusion
The field of numerical differentiation is very diverse. For each method listed above, there are many variants with their own set of advantages and disadvantages. This post is hardly a complete treatment of numerical differentiation.
Every application is different. Hopefully this post gives the interested reader an organized list of considerations and resources for choosing a method that suits their own needs.
This community wiki could be improved with code snippets and examples particular to MATLAB.
I believe there is more in to these particular questions. So I have elaborated on the subject further as follows:
(4) Q: What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
A: The accuracy of numerical differentiation is subjective to the application of interest. Usually the way it works is, if you are using the ND in forward problem to approximate the derivatives to estimate features from signal of interest, then you should be aware of noise perturbations. Usually such artifacts contain high frequency components and by the definition of the differentiator, the noise effect will be amplified in the magnitude order of $i\omega^n$. So, increasing the accuracy of differentiator (increasing the polynomial accuracy) will no help at all. In this case you should be able to cancelt the effect of noise for differentiation. This can be done in casecade order: first smooth the signal, and then differentiate. But a better way of doing this is to use "Lowpass Differentiator". A good example of MATLAB library can be found here.
However, if this is not the case and you're using ND in inverse problems, such as solvign PDEs, then the global accuracy of differentiator is very important. Depending on what kind of bounady condition (BC) suits your problem, the design will be adapted accordingly. The rule of thump is to increase the numerical accuracy known is the fullband differentiator. You need to design a derivative matrix that takes care of suitable BC. You can find comprehensive solutions to such designs using the above link.
(5) Does it matter to you that your derivative is evaluated on the same points as your function is defined?
A: Yes absolutely. The evaluation of the ND on the same grid points is called "centralized" and off the points "staggered" schemes. Note that using odd order of derivatives, centralized ND will deviate the accuracy of frequency response of the differentiator. Therefore, if you're using such design in inverse problems, this will perturb your approximation. Also, the opposite applies to the case of even order of differentiation utilized by staggered schemes. You can find comprehensive explanation on this subject using the link above.
(6) Do you need to calculate multiple orders of derivatives?
This totally depends on your application at hand. You can refer to the same link I have provided and take care of multiple derivative designs.

How to define the maximum k of the kNN classifier?

I am trying to use kNN classifier to perform some supervised learning. In order to find the best number of 'k' of kNN, I used cross validation. For example, the following codes load some Matlab standard data and run the cross validation to plot various k values with respect to the cross validation error
load ionosphere;
[N,D] = size(X)
resp = unique(Y)
rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k=1:numel(K)
knn = ClassificationKNN.fit(X,Y,...
'NumNeighbors',K(k),'CrossVal','On');
cvloss(k) = kfoldLoss(knn);
end
figure; % Plot the accuracy versus k
plot(K,cvloss);
xlabel('Number of nearest neighbors');
ylabel('10 fold classification error');
title('k-NN classification');
The result looks like
The best k in this case is k=2 (it is not an exhaustive search). From the figure, we can see that the cross validation error goes up dramatically after k>50. It gets to a large error and become stable after k>100.
My question is what is the maximum k we should test in this kind of cross validation framework?
For example, there are two classes in the 'ionosphere' data. One class labeled as 'g' and one labeled as 'b'. There are 351 instances in total. For 'g' there are 225 cases and for 'b' there are 126 cases.
In the codes above, it chooses the largest k=351 to be tested. But should we only test from 1 to 126 or up to 225? Is there a relation between the test cases and the maximum number of k? Thanks. A.
The best way to choose a parameter in a classification problem, is to choose it by expertness. What you are doing certainly is not this. If your data is small enough to do a lot of classification with different values of parameters, you will do that, but to be reasonable, you need to show that the parameter you chose is not randomly chosen, you need to explain the behavior of plot you drawn.
In this case, the function is ascending, so you can tell 2 is the best choice.
In most cases you will not choose K more than 20, but there is no proof and you need to do the classification until you can proof your choice.
You don't want k to be too large (i.e. too close to the number of examples), because then the k neighborhood of each query example contains a large fraction of the space, so the prediction depends less and less on the actual location of the query and more on the overall statistics. This explains why the performance is not good for large k. Your classifier essentially chooses always 'g', and gets it wrong 126/351=35% as you see in the plot.
Theory suggests that k needs to grow as the number of labeled examples grow, but sub-linearly.
When you have lots of training data, you want k to be large because you want to have a good estimate of the likelihood of a point near the query point to get each label. This allows to imitate the maximum aposteriori decision rule (which is optimal, assuming you know the actual distribution).
So here are some practical tips:
Get more data if you can. Then run the experiment again.
Focus on small values of k. My bet is that k=3 is better than k=2. Usually for binary classification k is at least 3, and usually an odd number (to avoid ties).
The fact that you see that k=2 is better does not make sense. Therefore the only case in which k=1 is different than k=2 is when the 2 nearest neighbors have different labels. However, in this case the decision is made either randomly or arbitrarily (e.g. always choose 'g'). It depends on the implementation of the knn algorithm. My guess is that in the algorithm you are using the decision is fixed, and that in cases of a tie it chooses 'g' which just happens to be more likely overall. If you switch the roles of the labels you will probably see that k=1 is better than k=2.
Would be interesting to see the the plot for small values of k (e.g. 1 - 20).
References:
nearest neighbor classification
Increasing the number of neighbors to be taken into account during the classification makes your classifier a mean value choice. You only need to check the ratio of your classes to see that it is equal to the error rate.
Since you are using cross validation the k that corresponds to the minimum of your error rate is what you should select as value. In this case it is 3 if not mistaken.
Keep in mind that the cross validation parameter introduces bias in your selection of k. A more elaborate analysis is needed there, but your 10 should be fine for this case.

How to efficiently solve linear system with Laplacian + diagonal matrix?

In my implementation of an image processing algorithm, I have to solve a large linear system of the form A*x=b, where:
Matrix A=L+D is the sum of a Laplacian matrix L and a diagonal matrix D
Laplacian matrix L is sparse, with about 25 non-zeros per row
The system is large, with as many unknowns as there are pixels in the input image (typically > 1 million).
The Laplacian matrix L does not change between successive runs of the algorithm; I can construct this matrix in preprocessing, and possibly compute its factorization. The diagonal matrix D and right-side vector b change at each run of the algorithm.
I am trying to find out what would be the fastest method to solve the system at runtime; I do not mind spending time on preprocessing (for computing a factorization of L, for example).
My initial idea was to pre-compute a Cholesky factorization of L, then update the factorization at runtime with values from D (rank-1 update with cholupdate), and solve quickly the problem with back-substitution. Unfortunately, the Cholesky factorization is not as sparse as the original L matrix, and just loading it from disk already takes 5.48s; as a comparison, it takes 8.30s to directly solve the system with backslash.
Given the shape of my matrices, is there any other method that you would recommend to speedup the solving at runtime, no matter how long it takes at preprocessing time?
Assuming that you are working on a grid (since you mention images - although this is not guaranteed), that you are more interested in speed than precision (since 5s seems already too slow for 1 million unknowns), I see several options.
First, forget about exact methods such as Cholesky (+reordering). Even if they allow to store the factorization and reuse it for multiple rhs, you'll likely need to store gigantic matrices that appear to be intractable in your case (I hope you're re-ordering rows/columns with reverse Cuthill McKee or anything else though - that sparsifies the factorization a lot).
Depending on your boundary conditions, I would first try a Matlab poisolv that solves a Poisson problem using an FFT, and possible reprojections if you want Dirichlet boundary conditions instead of periodic ones. It's very fast, but might not be appropriate for your problem (you mention having 25 nnz for a Laplacian matrix+identity : why ? is-it a high order Laplace matrix, in which case you may be more interested in precision than what I assume ? or is-it in fact a different problem than the one you describe ?).
Then, you can try multigrid solvers that are very fast for images and smooth problems. You can use a simple relaxation method for each iteration and each level of the multigrid, or use fancier methods (for instance, a preconditioned conjugate gradient par level).
Alternatively, you can do a simpler preconditioned conjugate gradient (or even SSOR) without multigrid, and if you're only interested in an approximate solution, you can stop the iterations before full convergence.
My arguments for iterative solvers are:
you can stop before convergence if you want an approximate problem
you can still re-use other results to initialize your solution (for instance, if your different runs correspond to different frames of a video, then using the solution of the previous frame as an initialization of the next would make some sense).
Of course, a direct solver for which you can precompute, store and keep the factorization also makes sense (although I don't understand your argument for a rank-1 update if your matrix is constant) since only the backsubstitution remains to be done at runtime. But given this ignores the structure of the problem (a regular grid, a possible interest in limited precision results etc.), I'd opt for methods which have been designed for these cases such as Fourier-like methods or multigrids. Both methods can be implemented on the GPU for faster results (recall that GPUs are rather tailored for dealing with images/textures!).
Finally, you can get interesting answers from scicomp.stackexchange which is more targeted to numerical analysis.

the comparation of the computing time of multiplication

Let a, b be two integers with n digits.
I am wondering does the computing time of the square of a is shorter than a*b.
Thank you for your help.
I don't think there's a way to square A without using an IMUL on x86. I could be wrong.
To find out how long something takes, microbenchmark it!
Edit: oh wait, I've got it! ab takes two memory reads and aa takes one! So a*a is faster :-).
True answer: there's no reason a*b would be slower unless you have some outside factor influencing things.
I assume your question is:
*Let a, b be two integers with n digits. I am wondering if the computing time of calculating the square of a is shorter than the computing time of calculating a*b.*
If n is large enough that you cannot just use a single multiply instruction, then any algorithm that I know can take advantage of the fact that both factors are the same. That's true for the algorithm that you learned at school, since almost half the products of pairs of digits don't need to be multiplied. At the extreme end for very large n, using convolution with FFTs, the FFT for both factors is the same for the square and needs to be calculated only once.
Take a look at the benchmarks in Bentley's "Programming Pearls", you could hack up something from there to measure.