BigBird, or Sparse self-attention: How to implement a sparse matrix? - neural-network

This question is related to the new paper: Big Bird: Transformers for Longer Sequences. It is mainly about the implementation of the sparse attention (specified in the supplemental material, part D). Currently, I am trying to implement it in PyTorch.
They suggest a new way to speed up the computation by blocking the original query and key matrices (see below).
When you do the matrix multiplication in step (b), you end up with something like this:
[image: result of the blocked multiplication in step (b)]
So I was wondering: how would you go from that representation (image above) to a sparse matrix (using PyTorch, see below)? In the paper, they just say "simply reshape the result", and I do not know of any easy way to do so, especially when I have multiple blocks in different positions (see step (c) in the first image).
RESOLUTION:
Hugging Face has an implementation of BigBird in PyTorch.

I ended up following the guidelines in the paper. To unpack the result, I use torch.sparse_coo_tensor.
EDIT: Sparse tensors are still memory-hungry! The more efficient solution is described here
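The unpacking step mentioned in the resolution can be sketched roughly as follows. This is a minimal illustration in plain Python, not the paper's or Hugging Face's implementation: the helper name blocks_to_coo, the dictionary layout and the 2x2 example blocks are all assumptions made for the example. It turns dense blocks, each identified by its (block_row, block_col) position, into the row/column/value triplets that a COO sparse matrix needs.

```python
def blocks_to_coo(blocks, block_size):
    """Convert dense attention blocks into COO triplets.

    `blocks` maps (block_row, block_col) -> block_size x block_size nested list
    (a hypothetical layout chosen for this sketch).
    Returns (rows, cols, values) indexing into the full score matrix.
    """
    rows, cols, values = [], [], []
    for (block_row, block_col), block in sorted(blocks.items()):
        for i in range(block_size):
            for j in range(block_size):
                # Offset the local (i, j) position by the block's origin.
                rows.append(block_row * block_size + i)
                cols.append(block_col * block_size + j)
                values.append(block[i][j])
    return rows, cols, values


# Hypothetical example: a 4x4 score matrix split into 2x2 blocks,
# with only the two diagonal blocks computed.
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (1, 1): [[5.0, 6.0], [7.0, 8.0]],
}
rows, cols, values = blocks_to_coo(blocks, block_size=2)
```

With PyTorch installed, the triplets could then be passed on as torch.sparse_coo_tensor([rows, cols], values, (4, 4)). For real sequence lengths the index computation would be vectorised rather than looped.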

Related

Dot product with huge vectors

I am facing the following problem: I have a system of 160000 linear equations in 160000 variables. I am going to write two programs to solve it, one using the conjugate gradient method and one using the steepest descent method. The matrix is block tridiagonal with only 5 nonzero diagonals, so it is not necessary to create and store the matrix. But I have the following problem: the iteration step involves dot products of vectors. I have tried the commonly used commands dot(u,v) and u'*v, but when I run the program, MATLAB tells me the data size is too large for the memory.
To resolve this problem, I tried to decompose the huge vector into sparse vectors with small support, then calculate the dot products of the small vectors and finally glue them together. But this method seems more complicated and not very efficient, and it is easy (especially for beginners like me) to make mistakes. I wonder if there are any more efficient ways to deal with this problem. Thanks in advance.
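For reference, the chunking idea described above can be sketched in a few lines. This is a plain-Python illustration rather than MATLAB, and the function name chunked_dot and the chunk size are assumptions. Note that two double-precision vectors of length 160000 take about 1.28 MB each, so the dot product on its own should not need much memory.

```python
def chunked_dot(u, v, chunk_size=10_000):
    """Dot product accumulated over fixed-size chunks of u and v."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    total = 0.0
    for start in range(0, len(u), chunk_size):
        stop = start + chunk_size
        # Only one chunk of each vector is processed at a time.
        total += sum(a * b for a, b in zip(u[start:stop], v[start:stop]))
    return total


n = 160_000
u = [1.0] * n
v = [0.5] * n
result = chunked_dot(u, v)
```

The partial sums are simply added up, so no "gluing" step is needed beyond the running total.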

CT projection (distance-driven) operator implementation?

I am trying to use MATLAB to implement a CT (computed tomography) projection operator, A, which I think is also often referred to as the "system matrix".
Basically, for an N x N image M, the projection data, P, can be obtained by multiplying the projection operator by the image:
P = AM
and the backprojection procedure can be performed by multiplying the (conjugate) transpose of the projection operator by the projection data:
M = A'P
Does anyone have any idea/example/sample code on how to implement matrix A (for example: Radon transform)? I would really like to start with a small matrix, say 8 x 8 or 16 x 16, if possible.
My question really is: how to implement the projection operator, such that by multiplying the operator with an image, I can get the projections, and by multiplying the (conjugate) transpose of the operator with the projections, I can get the original image back.
EDIT:
In particular, I would like to implement a distance-driven projector, in which case the beam trajectory (parallel, fan, etc.) would not matter. A very simple example (MATLAB preferred) would be the best for me to start with.
You have different examples:
Here is a MATLAB example for 3D cone beam. It can be a good starting point.
Here is another operator.
Here is a brief explanation of the distance-driven method. Using the first example and the explanation in this book, you can get some ideas.
If not, you can always go to the distance-driven operator paper and implement it using the first example.
As far as I'm aware, there are no freely available implementations of the distance-driven projector/backprojector (it is patented). You can, however, code it yourself without too much difficulty.
Start by reading the papers and understanding what the projector is doing. There are only a few key parts that you need:
Projecting pixel boundaries onto an axis.
Projecting detector boundaries onto an axis.
The overlap kernel.
The first two are simple geometry. The overlap kernel is described in good detail (and mostly usable pseudocode) in the papers.
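To give a rough idea of the third part, here is a minimal 1-D sketch of an overlap kernel in plain Python. It assumes the pixel and detector boundaries have already been projected onto a common axis and sorted. The names overlap_weights and project_row are made up for this example, and the normalisation by detector cell width is one common convention, so check the papers for the exact scaling.

```python
def overlap_weights(pixel_edges, detector_edges):
    """Return {(detector_index, pixel_index): overlap_length}.

    Both edge lists must be sorted ascending and lie on the same axis.
    """
    weights = {}
    i = j = 0
    while i < len(pixel_edges) - 1 and j < len(detector_edges) - 1:
        left = max(pixel_edges[i], detector_edges[j])
        right = min(pixel_edges[i + 1], detector_edges[j + 1])
        if right > left:
            weights[(j, i)] = right - left
        # Advance whichever interval ends first.
        if pixel_edges[i + 1] < detector_edges[j + 1]:
            i += 1
        else:
            j += 1
    return weights


def project_row(pixel_values, pixel_edges, detector_edges):
    """Overlap-weighted projection of one image row onto detector cells."""
    out = [0.0] * (len(detector_edges) - 1)
    for (j, i), length in overlap_weights(pixel_edges, detector_edges).items():
        width = detector_edges[j + 1] - detector_edges[j]
        out[j] += pixel_values[i] * length / width
    return out


pixel_edges = [0.0, 1.0, 2.0, 3.0]
detector_edges = [0.0, 1.5, 3.0]
weights = overlap_weights(pixel_edges, detector_edges)
projection = project_row([2.0, 4.0, 6.0], pixel_edges, detector_edges)
```

The single sweep over both boundary lists visits each pixel and detector cell once, which is what makes the method cheap per view.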
Note that you won't wind up with an actual matrix that does the projection. The system would be too large for all but the tiniest examples. Instead, you should write a function that implements the linear operator corresponding to distance-driven projection.
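A matrix-free operator of this kind can be sketched as a pair of functions, one for A and one for its transpose. The toy projector below (row sums and column sums of a square image) is only a stand-in chosen for brevity, not a distance-driven projector, and all names are hypothetical. The useful part is the check that the two functions really are transposes of each other: <A x, y> must equal <x, A' y>.

```python
def forward(image):
    """Toy projector: row sums followed by column sums of a square image."""
    n = len(image)
    row_sums = [sum(image[r]) for r in range(n)]
    col_sums = [sum(image[r][c] for r in range(n)) for c in range(n)]
    return row_sums + col_sums


def adjoint(projections, n):
    """Matching transpose: smear each projection value back along its ray."""
    row_part, col_part = projections[:n], projections[n:]
    return [[row_part[r] + col_part[c] for c in range(n)] for r in range(n)]


def inner(a, b):
    """Plain dot product of two flat lists."""
    return sum(x * y for x, y in zip(a, b))


def flatten(image):
    return [value for row in image for value in row]


image = [[1.0, 2.0], [3.0, 4.0]]
data = [1.0, -1.0, 2.0, 0.5]

lhs = inner(forward(image), data)                     # <A x, y>
rhs = inner(flatten(image), flatten(adjoint(data, 2)))  # <x, A' y>
backprojection = adjoint(forward(image), 2)
```

Here lhs and rhs agree, which is the property iterative solvers rely on. Note that backprojection is not equal to image: applying the transpose to the projections gives a backprojection, not an inverse.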
Although there are already a lot of satisfactory answers, I would like to mention that I have implemented the distance-driven method for 2D computed tomography (CT) and 3D digital breast tomosynthesis (DBT) in MATLAB.
So far, for 2D CT, these codes are available:
Simple Distance-Driven, based on the original papers [1] and [2],
Branchless Distance-Driven, for acceleration on GPU, based on the papers [3] and [4],
and for 3D DBT:
Simple Distance-Driven, based on the book [5].
Note that:
1 - The code for DBT is strictly for limited-angle tomography; however, it is straightforward to extend it to a full rotation.
2 - All the codes are implemented for CPU.
Please report any issues with the codes so we can keep improving them.
Distance-driven projection is not implemented in stock MATLAB. For forward projection, there are the fanbeam() and radon() commands, depending on what geometry you're looking for. I don't consider fanbeam to be very good. It exhibits nonlinear behavior, as of R2013a; see here for details.
As for a matching transpose, there is no function for that for either fanbeam or parallel geometry. Note that iradon and ifanbeam are not operator implementations of the matching transpose. However, you might consider using FUNC2MAT. It will let you convert any linear operator from function form to matrix form, and then you can transpose freely.
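The general idea of going from function form to matrix form can be sketched as follows: apply the operator to each unit vector and collect the results as columns. This is a plain-Python illustration of that idea only, not a description of how FUNC2MAT works internally. The names operator_to_matrix and running_sum are invented for the example.

```python
def operator_to_matrix(operator, n_inputs):
    """Build the dense matrix of a linear operator, one column per unit vector."""
    columns = []
    for k in range(n_inputs):
        unit = [0.0] * n_inputs
        unit[k] = 1.0
        columns.append(operator(unit))
    n_outputs = len(columns[0])
    # columns[k][r] is entry (r, k); rearrange into row-major nested lists.
    return [[columns[k][r] for k in range(n_inputs)] for r in range(n_outputs)]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def running_sum(x):
    """Hypothetical linear operator used for the demo: cumulative sum."""
    out, total = [], 0.0
    for value in x:
        total += value
        out.append(total)
    return out


A = operator_to_matrix(running_sum, 3)
At = transpose(A)
```

This needs one operator call per input element and stores a dense matrix, so it is only practical for small problems such as the 8 x 8 or 16 x 16 images mentioned in the question.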

Simple Sequential feature selection in Matlab

I have a 40x3249 noisy dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in MATLAB. The MATLAB example is complicated and I can't follow it. Even a few examples on Stack Overflow didn't help. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain in simple terms?
Also is it a problem that my dataset has very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
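The point about random variables can be checked with a small simulation. The sketch below is plain Python with an arbitrary seed, and the one-threshold rule is a deliberately simplified stand-in for a decision tree. It generates 40 random labels and 3249 features of pure noise, then reports the best training accuracy that any single feature achieves.

```python
import random


def best_stump_accuracy(feature, labels):
    """Best training accuracy of a one-threshold rule on a single feature."""
    order = sorted(range(len(labels)), key=lambda i: feature[i])
    n = len(labels)
    n_pos = sum(labels)
    best = max(n_pos, n - n_pos)  # no split at all: predict the majority class
    pos_left = 0
    for k, idx in enumerate(order, start=1):
        pos_left += labels[idx]
        neg_left = k - pos_left
        # Rule A: left side -> 1, right side -> 0. Rule B: the opposite.
        correct_a = pos_left + (n - n_pos - neg_left)
        correct_b = neg_left + (n_pos - pos_left)
        best = max(best, correct_a, correct_b)
    return best / n


rng = random.Random(0)  # arbitrary seed, for repeatability only
n_samples, n_features = 40, 3249
labels = [rng.randint(0, 1) for _ in range(n_samples)]
accuracies = [
    best_stump_accuracy([rng.gauss(0.0, 1.0) for _ in range(n_samples)], labels)
    for _ in range(n_features)
]
best_accuracy = max(accuracies)
```

Even though no feature carries any information, best_accuracy can be expected to come out well above chance, and searching over combinations of features would only push it higher. This is why training accuracy alone says little here.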
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
This error comes from the use of the classify function in that question, which performs LDA. The error occurs when the data is rank deficient (in other words, some features are almost exactly correlated, or there are more features than samples). To overcome this, you should project the data down to a lower-dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use the pca function in MATLAB's Statistics Toolbox.
[basis, ~, latent] = pca(X); % X holds one observation per row; basis holds the components as columns, latent the variance of each component
indices = find(latent > eps(2*max(latent))); % Keep components whose variance exceeds machine precision of the biggest component, with a little extra tolerance (2x)
new_basis = basis(:, indices); % This gets us the relevant components, which are stored in variable "basis" as column vectors
X_new = bsxfun(@minus, X, mean(X, 1)) * new_basis; % Centre X (as pca does), then take inner products with the retained basis vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.

Essential philosophy behind Support Vector Machine

I am studying Support Vector Machines (SVM) by reading a lot of material. However, it seems that most of it focuses on how to classify the input 2D data by mapping it using several kernels such as linear, polynomial, RBF / Gaussian, etc.
My first question is, can SVM handle high-dimensional (n-D) input data?
According to what I found, the answer is YES!
If my understanding is correct, n-D input data will be
(1) constructed in Hilbert hyperspace, then those data will be
(2) simplified by using some approaches (such as PCA?) to combine it together / project it back to a 2D plane, so that
(3) the kernel methods can map it into an appropriate shape, such that a line or curve can separate it into distinct groups.
It means most of the guides / tutorials focus on step (3). But some toolboxes I've checked cannot plot the input data if it has more than 2 dimensions. How is the data projected to 2D afterwards?
If there is no projection of data, how can they classify it?
My second question is: is my understanding correct?
My first question is, can SVM handle high-dimensional (n-D) input data?
Yes. I have dealt with data where n > 2500 when using LIBSVM software: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. I used linear and RBF kernels.
My second question is: is my understanding correct?
I'm not entirely sure on what you mean here, so I'll try to comment on what you said most recently. I believe your intuition is generally correct. Data is "constructed" in some n-dimensional space, and a hyperplane of dimension n-1 is used to classify the data into two groups. However, by using kernel methods, it's possible to generate this information using linear methods and not consume all the memory of your computer.
I'm not sure if you've seen this already, but if you haven't, you may be interested in some of the information in this paper: http://pyml.sourceforge.net/doc/howto.pdf. I've copied and pasted a part of the text that may appeal to your thoughts:
A kernel method is an algorithm that depends on the data only through dot-products. When this is the case, the dot product can be replaced by a kernel function which computes a dot product in some possibly high dimensional feature space. This has two advantages: First, the ability to generate non-linear decision boundaries using methods designed for linear classifiers. Second, the use of kernel functions allows the user to apply a classifier to data that have no obvious fixed-dimensional vector space representation. The prime example of such data in bioinformatics are sequence, either DNA or protein, and protein structure.
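A small numerical check may make the quoted idea more concrete. The sketch below is plain Python with function names chosen for this example. It shows that a degree-2 polynomial kernel on 2-D inputs gives the same number as an ordinary dot product in an explicit 3-D feature space, and that an RBF kernel can be evaluated directly on 2500-dimensional inputs without any projection to 2D.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs whose dot product equals poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel; works for vectors of any matching dimension."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


x, y = [1.0, 2.0], [3.0, -1.0]
kernel_value = poly2_kernel(x, y)
explicit_value = dot(poly2_features(x), poly2_features(y))

# The kernel only needs the two input vectors, whatever their dimension.
high_dim_value = rbf_kernel([0.0] * 2500, [0.1] * 2500)
```

The classifier never has to build the feature space explicitly, and the inputs never have to be reduced to two dimensions; plotting is the only step that needs 2D.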
It would also help if you could explain what "guides" you are referring to. I don't think I've ever had to project data on a 2-D plane before, and it doesn't make sense to do so anyway for data with a ridiculous amount of dimensions (or "features" as it is called in LIBSVM). Using selected kernel methods should be enough to classify such data.

What is the best way to implement a tree in matlab?

I want to write an implementation of a (not binary) tree and run some algorithms on it. The reason for using MATLAB is that all the other programs are in MATLAB, and it would be useful for some analysis and plotting. From an initial search I found that there is nothing like pointers in MATLAB. So I'd like to know the best (in terms of convenience) way to do this in MATLAB. Or are there any other ways?
You can do this with MATLAB objects but you must make sure you use handle objects and not value objects because your nodes will contain cross-references to other nodes (i.e. parent, next sibling, first child).
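The structure being described looks roughly like this. The sketch is in Python, where objects already behave like references; the class and method names are made up for the example. In MATLAB the corresponding class would derive from handle (classdef TreeNode < handle) so that parent and child properties refer to the same node objects instead of copies.

```python
class TreeNode:
    """General (non-binary) tree node holding references to parent and children."""

    def __init__(self, value):
        self.value = value
        self.parent = None
        self.children = []

    def add_child(self, value):
        """Create a child node, link it both ways and return it."""
        child = TreeNode(value)
        child.parent = self
        self.children.append(child)
        return child

    def depth(self):
        """Number of edges between this node and the root."""
        node, d = self, 0
        while node.parent is not None:
            node, d = node.parent, d + 1
        return d

    def preorder(self):
        """Yield values in depth-first, parent-before-children order."""
        yield self.value
        for child in self.children:
            yield from child.preorder()


root = TreeNode("root")
a = root.add_child("a")
b = root.add_child("b")
c = root.add_child("c")
a1 = a.add_child("a1")
order = list(root.preorder())
```

With value objects, a1.parent would be a copy of a rather than a itself, so later changes to a would not be visible through the link. That is the reason for insisting on handle objects.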
This question is very old but still open. So I would just like to point readers to this implementation in plain MATLAB made by yours truly. Here is a tutorial that walks you through its use.
MATLAB is very well suited to handling any kind of graph (not only trees) represented as an adjacency matrix or an incidence matrix.
Matrices (representing graphs) can be either dense or sparse, depending on the properties of your graphs.
Last but not least, graph theory and linear algebra are related to each other in very fundamental ways, so MATLAB provides a very nice platform for harnessing such relationships.
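As a small illustration of the matrix view, here is a sketch in plain Python (in MATLAB the same thing would be a few matrix operations). It builds the adjacency matrix of a tree from a parent list and uses a matrix product to find grandchildren. The parent-list encoding and the function names are assumptions made for the example.

```python
def adjacency_from_parents(parents):
    """Directed adjacency matrix with A[p][c] = 1 for each parent -> child edge.

    parents[c] is the parent index of node c, or -1 for the root.
    """
    n = len(parents)
    A = [[0] * n for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            A[parent][child] = 1
    return A


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Hypothetical tree: node 0 is the root with children 1, 2 and 3;
# node 4 is a child of node 1.
parents = [-1, 0, 0, 0, 1]
A = adjacency_from_parents(parents)
children_of_root = [j for j, edge in enumerate(A[0]) if edge]
A2 = matmul(A, A)  # A2[i][j] counts directed paths of length 2 from i to j
grandchildren_of_root = [j for j, count in enumerate(A2[0]) if count]
```

Row i of A lists the children of node i, and powers of A count longer paths, which is one simple example of the link between graph theory and linear algebra. For large trees the matrix is mostly zeros, so a sparse representation is the natural choice.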