Sparse boolean matrix multiplication

Does anybody know of an efficient implementation of sparse boolean matrix multiplication? I'm interested in both CPU and GPGPU implementations because I need to multiply matrices of very different sizes (from 8x8 up to 10^8x10^8). Currently, I use the cuSPARSE library, but it supports only numerical matrices (float, double, etc.), which leads to a huge overhead in both memory and time, and that is critical for my task.

Since a boolean matrix can be viewed as the adjacency matrix of some (bipartite) graph, its product with another matrix can be interpreted as the distance-2 connections between the nodes of two subgraphs linked by a common set of nodes.
To avoid wasting space and to exploit some amount of bit parallelism, you could try using some form of succinct data structure for graph storage and manipulation.
One such family of data structures which could be useful in your case is the K2-tree (or Kn in general), which stores the adjacencies using an approach similar to spatial decompositions such as quad- and octrees.
Ultimately, the best algorithm and data structure will heavily depend on the dimension and sparsity patterns of your matrices.
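To make the graph interpretation concrete, here is a minimal Python sketch. It is not a K2-tree and not a GPU kernel; it simply assumes each matrix is stored as a dictionary mapping a row index to the set of column indices that are true (an adjacency list). The name `bool_spmm` is made up for this illustration.

```python
def bool_spmm(a_rows, b_rows):
    """Boolean product C = A.B; each matrix is {row: set(cols)} holding only true entries."""
    c_rows = {}
    for i, ks in a_rows.items():
        out = set()
        for k in ks:
            # C[i, j] is true if some k has A[i, k] and B[k, j] both true,
            # so row i of C is the union (OR) of the rows of B selected by A[i].
            out |= b_rows.get(k, set())
        if out:
            c_rows[i] = out
    return c_rows


A = {0: {1, 2}, 2: {0}}
B = {0: {3}, 1: {0, 3}, 2: {1}}
C = bool_spmm(A, B)
print(C)
```

Memory use is proportional to the number of true entries, with no numeric values stored. For dense-ish rows, replacing the sets with packed bit vectors would give the bit parallelism mentioned above.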

Related

How to do binary linear algebra on a sparse matrix in Matlab (or any other language)?

I have sparse binary matrices whose properties I want to analyze over the binary field. The application is to analyze some sparse, binary error-correcting codes. The matrices themselves are too big to handle as full dense matrices, with sizes on the order of 10,000 x 30,000 and bigger, even though only a small percentage of entries are going to be filled. I want to be able to do binary linear algebra while exploiting the matrices' sparsity.
The two main things I will need to do are:
- finding a basis of the intersection of its row space with the row space of another sparse matrix
- finding its rank
I've seen that there are some packages to find subspace intersections (e.g. this MuPAD function) and to find the rank of a matrix over different fields (like gfrank), but they take a prohibitively long time for the matrices I'm working with.
Is there anything like this available? Or any tricks that can be used to do this? If this is possible in another programming language that would also be helpful.
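One possible trick, sketched in Python under stated assumptions: store each row as an integer whose set bits mark the columns holding a 1, so that adding two rows over GF(2) is a single XOR. This packs bits densely per row rather than exploiting sparsity as such, and the function names are hypothetical. The second function only gives the dimension of the row-space intersection (via dim(U ∩ W) = rank(U) + rank(W) - rank of the stacked rows), not a basis.

```python
def gf2_rank(rows):
    """Rank over GF(2); each row is an int whose set bits are the columns holding a 1."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = row
                break
            row ^= pivots[lead]  # addition over GF(2) is XOR
    return len(pivots)


def gf2_intersection_dim(u_rows, w_rows):
    """Dimension of the intersection of the two row spaces over GF(2)."""
    return gf2_rank(u_rows) + gf2_rank(w_rows) - gf2_rank(list(u_rows) + list(w_rows))


U = [0b110, 0b011]
W = [0b101, 0b001]
print(gf2_rank(U), gf2_rank(W), gf2_intersection_dim(U, W))
```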

Multiplication of large sparse Matrices without null values in scala

I have two very sparse distributed matrices of dimension 1,000,000,000 x 1,000,000,000 and I want to compute their product efficiently.
I tried to create a BlockMatrix from a CoordinateMatrix, but it uses a lot of memory (even though in reality there are only around 500,000,000 non-zero entries) and the computation time is enormous.
Is there another way to create a sparse matrix and compute a multiplication efficiently in a distributed way in Spark? Or do I have to compute it manually?
You must obviously use a storage format for sparse matrices that makes use of their sparsity.
Now, without knowing anything about how you handle matrices and which libraries you use, there's no helping you other than to ask you to look at the linear algebra libraries of your choice and look for sparse storage formats; the "good old" Fortran-based libraries that underlie a lot of modern math libs support them, so chances are that you only need to do a little googling with yourlibraryname + "sparse matrix".
Second thoughts:
Sparse matrices really don't lend themselves to distribution very well; think about the operations you'd have to do to coordinate distribution compared to the actual multiplications/additions.
Also, ~5e8 non-zero elements in a 1e18 element matrix are definitely a lot of memory, and since you don't specify how much you consider a lot to be, it's very possible there's nothing wrong with it. Assuming you're using the default double precision, that's 5e8 * 8B = 4GB of pure numbers, not counting the coordinates needed for sparse storage. So, if you've got ~10GB of memory, I wouldn't be surprised at all.
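The arithmetic above can be extended to include the coordinates. The following Python sketch assumes plain coordinate (row, column, value) storage; the index width is an assumption (4-byte indices are enough for dimensions of 1e9, 8-byte indices are also shown), and real libraries add their own overhead on top.

```python
def coo_bytes(nnz, value_bytes=8, index_bytes=8):
    """Bytes for coordinate storage: one value plus a row and a column index per entry."""
    return nnz * (value_bytes + 2 * index_bytes)


nnz = 500_000_000
print(nnz * 8 / 1e9)                          # values only, in GB
print(coo_bytes(nnz, index_bytes=4) / 1e9)    # values plus two 32-bit indices, in GB
print(coo_bytes(nnz, index_bytes=8) / 1e9)    # values plus two 64-bit indices, in GB
```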
There is no built-in method in Spark to perform a matrix multiplication with sparse matrices, so I worked around it by reducing the sparsity of the matrices as much as possible before performing the multiplication with BlockMatrix (which does not support sparse matrices).
Last edit: Even with the sparsity optimization I had a lot of problems with large datasets. Finally, I decided to implement it myself. Now it runs very fast. I hope that sparse matrix multiplication will be implemented in Spark, as I think there are a lot of applications that could make use of it.
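The poster does not show their implementation, so the following is only a guess at the general shape of a manual approach, written as a single-machine Python sketch. It assumes both matrices are given as (row, column, value) triples, matches entries of A and B on the shared inner index, and sums the partial products per output cell. In a distributed setting the same two steps would correspond to a join on the inner index followed by a reduction by output coordinate.

```python
from collections import defaultdict


def coo_matmul(a_entries, b_entries):
    """Multiply two sparse matrices given as iterables of (row, col, value) triples."""
    # Group B by its row index so it can be matched against A's column index.
    b_by_row = defaultdict(list)
    for k, j, v in b_entries:
        b_by_row[k].append((j, v))

    c = defaultdict(float)
    for i, k, a in a_entries:
        for j, b in b_by_row.get(k, ()):
            c[(i, j)] += a * b  # accumulate partial products per output cell
    return dict(c)


A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]
B = [(0, 1, 4.0), (1, 0, 5.0), (1, 1, 1.0)]
C = coo_matmul(A, B)
print(C)
```

Only pairs of non-zero entries that share an inner index are ever touched, so the work depends on the sparsity pattern rather than on the nominal 1e9 x 1e9 dimensions.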

Large and Sparse Matrix Multiplication

I have a very large and sparse matrix of size 180GB (as text, 30k * 3M) containing only the entries and no additional data. I have to do matrix multiplication, inversion and some similar linear algebra operations on it. I tried Octave and simple single-threaded C code for the multiplication, but my system's 40GB of RAM gets used up very quickly and then the program starts thrashing. Are there any other options available to me? I am not familiar with MATLAB or any other matrix library that could help me do this.
When I run a simple matrix multiplication of two matrices with 10 rows and 3 M cols, and its transpose, it gives the following error :
memory exhausted or requested size too large for range of Octave's index type
I am not sure whether the same would work in MATLAB or not. Is there another library or code for sparse matrix representation and matrix multiplication?
If there are few enough nonzero entries, I suggest creating a sparse matrix S with appropriate dimensions and max nonzero entries; see matlab create sparse matrix. Then, as #oleg komarov described, load the matrix in blocks and assign the nonzero entries from each block into the correct address in the sparse matrix S. I feel that if your matrix is sparse enough, then loading it is really the only difficulty you face. I had similar issues with large transfer operators.
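The same idea, sketched in Python rather than MATLAB: read the text one row at a time and keep only the nonzero entries. The file layout is an assumption (one matrix row per line, whitespace-separated numbers), and `load_sparse_rows` is a hypothetical helper, not a library function.

```python
import io


def load_sparse_rows(lines):
    """Read a dense text matrix row by row, keeping only nonzeros as {(row, col): value}."""
    entries = {}
    for i, line in enumerate(lines):
        for j, token in enumerate(line.split()):
            value = float(token)
            if value != 0.0:
                entries[(i, j)] = value
    return entries


# A small in-memory stand-in for the real file; an open file object works the same way.
text = io.StringIO("0 0 3\n0 0 0\n1.5 0 0\n")
S = load_sparse_rows(text)
print(S)
```

Only one line of the dense text is held in memory at a time, so peak memory is governed by the number of nonzeros rather than by the 180GB file size.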
Have you considered performing your processing in blocks? Transposition and multiplications work very well with block matrix processing (see https://en.wikipedia.org/wiki/Block_matrix) and that will get you around any limitations about the indices.
This wouldn't help you with matrix inversion, though, unless you can decompose your matrix into blocks such that all blocks off the diagonal are completely empty, which isn't stated in your assumptions.
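As a rough illustration of block processing for multiplication, here is a small Python sketch using plain nested lists. The block size and the function name are arbitrary choices for the example; in practice each block would be loaded from disk and multiplied with an optimized routine, but the loop structure is the same.

```python
def block_matmul(a, b, block=2):
    """Compute a @ b by accumulating products of block x block tiles."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):          # tile row of A and C
        for k0 in range(0, m, block):      # tile column of A, tile row of B
            for j0 in range(0, p, block):  # tile column of B and C
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        if aik == 0:
                            continue       # skip zero entries of A
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


A = [[1, 2, 0], [0, 1, 3]]
B = [[1, 0], [2, 1], [0, 4]]
C = block_matmul(A, B)
print(C)
```

At any moment only one tile of each matrix is needed, which is what keeps both the memory footprint and the per-matrix index range small.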
Octave has limits on both memory, about 2GB, and the maximum number of elements a matrix can hold, about 2^32 (for 32-bit Octave). MATLAB doesn't have such a memory limit, since it will use all of your memory resources, swap file included. Thus you could try MATLAB with a huge swap file and then run your operations (though it will still take quite a long time).
If you are interested in other approaches, you may take a look at out-of-core computing, which covers methods for processing huge datasets that cannot reside entirely in memory by storing them on disk and efficiently loading only the parts that are needed.
For a practical approach, you may take a look into Blaze for Python (notice: still in development!).

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, each represented by a set of extracted features. Generally, clustering algorithms require all input data to have the same dimensions (the same number of features); that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is likely to differ from one data point to another (I mean, I have a dataset X where the data points do not all have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require the data to have the same dimensions?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques. For example Sparse SVD (e.g. Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about how your data is represented. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, as the features have no order that would let you compare the first keypoint with the first keypoint, and so on. A possible approach here is to derive another, uniform, set of features. Typically, bag-of-words features are used in such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
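A minimal Python sketch of cosine distance for objects with different numbers of features. It assumes each object has already been turned into a bag of "words", stored as a dictionary from word to count, so that any feature missing from an object counts as 0. The example word names are invented.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)  # absent features contribute 0
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_u * norm_v)


doc_a = {"loop": 2, "stroke": 1}
doc_b = {"loop": 1, "stroke": 1, "dot": 3}
doc_c = {"bar": 4}
print(cosine_distance(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

A function like this can be passed to any clustering method that only needs pairwise distances, which is the property of DBSCAN that the answer relies on.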
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.

Data clustering algorithm

What is the most popular text clustering algorithm that handles high-dimensional data and huge datasets and is fast?
I am getting confused after reading so many papers and so many approaches. Now I just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in Latent Semantic Indexing, since SVD can be demonstrated to be PCA : )
Remember that you can lose interpretation when you obtain the principal components of your dataset or its factors, so you maybe wanna go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
There is no one-size-fits-all approach. Hierarchical clustering is always an option. If you want to have distinct groups formed out of the data, you can go with k-means clustering (it is also supposedly less computationally intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, whereas hierarchical clustering is quadratic but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
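The tf-idf representation can be sketched in a few lines of Python. This assumes documents are already tokenized into lists of words and uses one common variant of the score (raw term count times log(N / document frequency)); other weightings exist, and the function name is only illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
vectors = tfidf_vectors(docs)
print(vectors[1])
```

A word such as "the" that occurs in every document gets weight log(1) = 0, which is how the score downplays high-frequency words.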
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say, occurring in less than 1% or more than 60% of documents; this is somewhat arbitrary, and you need to try different ranges for each dataset to see the impact on results), b) stopping: throw out all words in a stop list of common English words (lists can be found online), and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is a stemmer designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
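Step a) is easy to sketch in Python. The thresholds default to the 1% and 60% figures mentioned above; the function name and the tokenized-list input format are assumptions made for the example. Stop lists and stemming are not shown.

```python
from collections import Counter


def filter_vocabulary(docs, min_frac=0.01, max_frac=0.60):
    """Drop words that appear in too few or too many documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, count in df.items() if min_frac <= count / n_docs <= max_frac}
    return [[t for t in doc if t in keep] for doc in docs]


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "x"]]
# Thresholds widened for this tiny example: keep words found in 50% to 90% of documents.
filtered = filter_vocabulary(docs, min_frac=0.5, max_frac=0.9)
print(filtered)
```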
I would stick with k-medoids, since you can compute the distance from every point to every other point once at the beginning of the algorithm, which saves time, especially if there are many dimensions. The algorithm chooses as the center of each cluster an actual data point (the medoid), namely the member closest to the other points of that cluster, rather than a centroid computed by averaging the points belonging to the cluster. Therefore all the distance calculations the algorithm needs are already done up front.
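A compact Python sketch of this idea, using the simple alternating variant of k-medoids (assign points to the nearest medoid, then pick the best medoid within each cluster). The initialisation with the first k points is a naive assumption made to keep the example deterministic; better seeding and the classic PAM swap procedure are not shown.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """dist: full n x n distance matrix (list of lists), computed once up front."""
    medoids = list(range(k))  # naive initialisation with the first k points (assumption)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid for an empty cluster
                continue
            # The medoid is the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Note that the loop only ever reads from `dist`; the original feature vectors are never needed again, whatever their dimensionality.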
In the case where you aren't looking for semantic text clustering (I can't tell if this is a requirement or not from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering through the use of silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through the use of thresholds and other methods.
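One way a threshold can speed things up, sketched in Python: stop as soon as the distance is known to exceed a bound. The `max_dist` parameter and its return convention (any result above the bound is reported as `max_dist + 1`) are choices made for this example, not part of a standard API.

```python
def levenshtein(a, b, max_dist=None):
    """Edit distance between a and b; reports max_dist + 1 once the bound is exceeded."""
    if max_dist is not None and abs(len(a) - len(b)) > max_dist:
        return max_dist + 1  # the length difference alone already exceeds the bound
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        if max_dist is not None and min(cur) > max_dist:
            return max_dist + 1  # no later row can bring the distance back under the bound
        prev = cur
    if max_dist is not None and prev[-1] > max_dist:
        return max_dist + 1
    return prev[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("kitten", "sitting", max_dist=1))
```

When building the similarity matrix, pairs that hit the bound can be treated as "far apart" without paying for the full computation.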
Another way to deal with the curse of dimensionality would be to find 'contrasting sets', conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or together with a restricted number of attributes.