There have been a few questions inquiring about general math/stats frameworks for Scala.
I am interested in only one specific problem, that of solving a large sparse linear system. Essentially I am looking for an equivalent of scipy.sparse.linalg.spsolve.
Currently I am looking into breeze-math of ScalaNLP Breeze, which looks like it would do the job, except that the focus of this library collection is natural language processing, so it feels a bit strange to use that.
Saddle also looks promising, but it is not very mature yet. Looking at its dependencies, EJML doesn't seem to have sparse functionality, while Apache Commons Math did have it, but it was flaky.
Has anyone got a reasonably simple and efficient solution that is currently available?
Although ScalaNLP Breeze says it's for NLP, its linear algebra library is fairly general and not specialized to NLP. With that said, you could easily do something like this:
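import breeze.linalg.{CSCMatrix, DenseVector}
val n = 1000                           // system size (placeholder value)
val A = CSCMatrix.zeros[Double](n, n)  // fill in the non-zero entries of the system matrix
val b = DenseVector.zeros[Double](n)   // right-hand side
val x = A \ b                          // assumes your Breeze version provides a solver for CSCMatrix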
I am trying to apply a One-Class SVM, but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any methods for feature selection that do not need the class label?
If so, and you are aware of an existing implementation, please let me know.
You'd probably get better answers asking this on Cross Validated instead of Stack Exchange, although since you ask for implementations I will answer your question.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. They work by looking for features that convey similar information and then either eliminate some of those features or reduce them to fewer features whilst retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which language you're working in, but I am going to presume it's Python. sklearn has implementations of PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python, but I have not used it and don't know how convenient it is. I don't know if there's an unsupervised implementation of redundancy analysis in Python, but you could consider writing your own. Depending on what you decide to do, it might not be too tricky (especially if you use a purely correlation-based criterion).
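To give an idea of what a home-made, correlation-based reduction could look like, here is a minimal sketch. It is written in plain Scala (standard library only) to match the other code in this thread, and the logic ports directly to Python. The object name, the function names and the greedy keep-or-drop rule are my own assumptions for illustration, not an established redundancy analysis implementation.

```scala
object CorrelationFilter {
  // Pearson correlation between two equally long columns.
  // A constant column has zero variance; its correlation is reported as 0.0 here.
  def pearson(x: Vector[Double], y: Vector[Double]): Double = {
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
    val varX = x.map(a => (a - meanX) * (a - meanX)).sum
    val varY = y.map(b => (b - meanY) * (b - meanY)).sum
    if (varX == 0.0 || varY == 0.0) 0.0 else cov / math.sqrt(varX * varY)
  }

  // Greedy filter: walk the columns in order and keep a column only if its
  // absolute correlation with every column kept so far is below the threshold.
  // Returns the indices of the retained columns. No class labels are used.
  def selectFeatures(columns: Vector[Vector[Double]], threshold: Double): Vector[Int] =
    columns.indices.foldLeft(Vector.empty[Int]) { (kept, j) =>
      val redundant =
        kept.exists(k => math.abs(pearson(columns(k), columns(j))) >= threshold)
      if (redundant) kept else kept :+ j
    }
}
```

Each inner Vector is one feature (column) of the data set. The result depends on the column order and on the threshold, so treat it as a starting point rather than a tuned method.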
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.
I am programming a machine learning algorithm in Scala. For that one I don't think I will need matrices, but I will need vectors. However, the vectors won't even need a dot product, but just element-wise operations. I see two options:
use a linear algebra library like Breeze
implement my vectors as Scala collections like List or Vector and work on them in a functional manner
The advantage of using a linear algebra library would be less programming for me... but will it really be less, once I count the time spent learning it? I have already started learning and using Breeze, and it does not seem that straightforward (the documentation is so-so). Another disadvantage is having an extra dependency. I don't have much experience in setting up my own projects (so far I have only programmed on the job, where library usage was dictated).
So, is a linear algebra library - e.g. Breeze - worth the learning and dependency compared to programming needed functionality myself, in my particular case?
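For reference, the second option (plain Scala collections) might look roughly like the sketch below. The object name ElementWise and its helper functions are hypothetical names chosen for illustration; only standard-library calls are used.

```scala
object ElementWise {
  type Vec = Vector[Double]

  // Apply a binary operation position by position. Fail fast on a length
  // mismatch, because zip would otherwise silently truncate to the shorter input.
  def zipWith(a: Vec, b: Vec)(f: (Double, Double) => Double): Vec = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => f(x, y) }
  }

  def add(a: Vec, b: Vec): Vec = zipWith(a, b)(_ + _)
  def multiply(a: Vec, b: Vec): Vec = zipWith(a, b)(_ * _)
  def scale(a: Vec, factor: Double): Vec = a.map(_ * factor)
}
```

One thing to keep in mind with this approach: Vector[Double] stores boxed values, so for long vectors in a hot loop an Array[Double] is likely to be faster, at the cost of mutability.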
I'm considering learning Scala for my algorithm development, but I first need to know whether complex inverse and pseudo-inverse functions have been implemented (or are being implemented). I looked at the documentation (here, here), and although it states that these functions are for real matrices, I don't see from the code why it wouldn't accept complex matrices.
There's also the following comment left in the code:
pinv for anything that can be transposed, multiplied with that transposed, and then solved
Is this just my wishful thinking, or will it not accept complex matrices?
Breeze implementer here:
I haven't implemented inv etc. for complex numbers yet, because I haven't figured out a good way to store complex numbers unboxed in a way that is compatible with blas and lapack and doesn't break the current API. You can set the call up yourself using netlib java following a similar recipe to the code you linked.
I currently have a couple of algorithms in Matlab that I am looking to code in Java. I will do so using one of the following (Colt, Apache Commons Math, jblas). However, since I am really looking to improve upon the speed of these algorithms, I am looking for suggestions, and hopefully existing implementations, for parallelizing these algorithms to increase performance.
From what I can tell, Hadoop is not a good option for distributing matrix operations. I have also looked at Mahout but it is not clear to me if this will be helpful in achieving this objective.
Many thanks for all your tips and suggestions.
Where are you getting the information that Hadoop "is not a good option for distributing matrix operations"? It is certainly a good option, but only as long as your data is huge, say 50 GB or more. If your data fits in memory, Hadoop is not a good option, but if you expect to run on multiple TB of data, then Hadoop is a good tool for the job. There are also many other things to consider when optimizing matrix multiplication, such as the structure of your data (is it sparse? does it occur in clusters? etc.).
There's plenty of information on Google about implementing matrix multiplication on MapReduce. Jeffrey Ullman's book might be a good place to start if you choose this route.
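To make the idea concrete, here is a small in-memory sketch of how sparse matrix multiplication can be phrased as map, join and reduce steps. It uses plain Scala collections to mimic the phases and does not touch the Hadoop API. The names Entry and MapReduceMatMul are made up for this illustration, and a real job would distribute the grouping across machines.

```scala
object MapReduceMatMul {
  // One non-zero entry of a sparse matrix.
  final case class Entry(row: Int, col: Int, value: Double)

  // Computes C = A * B, returning only the non-zero-producing cells of C.
  def multiply(a: Seq[Entry], b: Seq[Entry]): Map[(Int, Int), Double] = {
    // "Map" phase: key A's entries by column j and B's entries by row j.
    val aByJ = a.groupBy(_.col)
    val bByJ = b.groupBy(_.row)

    // Join phase: for every shared j, emit one partial product per pair,
    // keyed by the output cell (i, k).
    val partials = for {
      (j, aEntries) <- aByJ.toSeq
      bEntries <- bByJ.get(j).toSeq
      ae <- aEntries
      be <- bEntries
    } yield ((ae.row, be.col), ae.value * be.value)

    // "Reduce" phase: sum all partial products that land in the same cell.
    partials.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }
}
```

Only non-zero entries are ever emitted, which is why the sparsity of your data matters so much for the cost of the job.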
I have to solve a multiobjective problem, but I don't know whether I should use CPLEX or Matlab. Can you explain the advantages and disadvantages of both tools?
Thank you very much!
This is really a question about choosing the most suitable modeling approach in the presence of multiple objectives, rather than deciding between CPLEX or MATLAB.
Multi-criteria decision making is a whole sub-field in itself. Take a look at: http://en.wikipedia.org/wiki/Multi-objective_optimization.
Once you have decided on the approach and formulated your problem (either by collapsing your multiple objectives into a single weighted one, or as a series of linear programs), either tool will do the job for you.
Since you are familiar with MATLAB, you can start by using it to solve a series of linear programs (a goal programming approach). To get you started, this page by MathWorks has a few examples with step-by-step details: http://www.mathworks.com/discovery/multiobjective-optimization.html
Probably this question is not a matter of your current concern. However my answer is rather universal, so let me post it here.
If solving a multiobjective problem means deriving a specific Pareto optimal solution, then you need to solve a single-objective problem obtained by scalarizing (aggregating) the objectives. The type of scalarization and values of its parameters (if any) depend on decision maker's preferences, e.g. how he/she/you want(s) to prioritize different objectives when they conflict with each other. Weighted sum, achievement scalarization (a.k.a. weighted Chebyshev), and lexicographic optimization are the most widespread types. They have different advantages and disadvantages, so there is no universal recommendation here.
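As a rough illustration of two of these scalarizations, here is a sketch in plain Scala. It assumes all objectives are minimized and, purely for demonstration, that the candidate solutions are already enumerated as objective vectors. In a real problem the scalarized function would instead be handed to a solver as its single objective. The object and function names are hypothetical.

```scala
object Scalarization {
  // Weighted sum of the objective values (all objectives are minimized).
  def weightedSum(weights: Vector[Double])(objectives: Vector[Double]): Double =
    weights.zip(objectives).map { case (w, f) => w * f }.sum

  // Weighted Chebyshev distance to a reference point, e.g. the ideal vector:
  // the largest weighted deviation over all objectives.
  def weightedChebyshev(weights: Vector[Double], reference: Vector[Double])(
      objectives: Vector[Double]): Double =
    weights.indices
      .map(i => weights(i) * math.abs(objectives(i) - reference(i)))
      .max

  // Pick the candidate whose objective vector minimizes the given scalarization.
  def best(candidates: Vector[Vector[Double]])(
      scalarize: Vector[Double] => Double): Vector[Double] =
    candidates.minBy(scalarize)
}
```

Changing the weights changes which Pareto optimal solution is selected, which is exactly where the decision maker's preferences enter.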
CPLEX is preferred in the case where (A) your scalarized problem belongs to the class solved by CPLEX (obviously), e.g. it is a [mixed integer] linear/quadratic problem, and (B) the problem is complex enough for computational time to be essential. CPLEX is specialized in a narrow class of problems, and should be much faster than Matlab in complex cases.
You do not have to limit the choice of multiobjective methods to the ones offered by Matlab/CPLEX or other solvers (which are usually narrow). It is easy to formulate a scalarized problem by yourself, and then run appropriate single-objective optimization (source: it is one of my main research fields, see e.g. implementation for the class of knapsack problems). The issue boils down to finding a suitable single-objective solver.
If you want to obtain some general information about the whole Pareto optimal set, I recommend starting by deriving the nadir and the ideal objective vectors.
If you want to derive a representation of the Pareto optimal set, besides the mentioned population-based heuristics such as GAs, there are exact methods developed for specific classes of problems. Examples: a library implemented in Julia, a recently published method.
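For a finite, explicitly enumerated set of objective vectors, the two vectors can be computed as in the sketch below (plain Scala, minimization assumed, all names hypothetical). For a real optimization problem the ideal vector is usually obtained by minimizing each objective separately, and the nadir vector is generally much harder to compute exactly, so take this only as an illustration of the definitions.

```scala
object ParetoVectors {
  type Point = Vector[Double]

  // a dominates b (minimization): no worse in every objective and
  // strictly better in at least one.
  def dominates(a: Point, b: Point): Boolean =
    a.zip(b).forall { case (x, y) => x <= y } &&
      a.zip(b).exists { case (x, y) => x < y }

  // Keep only the points that no other point dominates.
  def paretoFront(points: Vector[Point]): Vector[Point] =
    points.filter(p => !points.exists(q => dominates(q, p)))

  // Ideal vector: best (smallest) value of each objective over the front.
  def ideal(front: Vector[Point]): Point = front.transpose.map(_.min)

  // Nadir vector: worst (largest) value of each objective over the front.
  def nadir(front: Vector[Point]): Point = front.transpose.map(_.max)
}
```

Note that the nadir vector is taken over the Pareto front only, so dominated points with worse values do not affect it.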
All concepts mentioned here are described in the comprehensive book by Miettinen (1999).
Can CPLEX solve a Pareto-type multiobjective problem? As far as I know, it can solve a simple goal program by defining lexicographic objectives, or it can use a weighted sum, changing the weights gradually with sensitivity information to "enumerate" the Pareto front. The result depends heavily on the weights and looks very subjective.
See here for how CPLEX solves the bi-objective case, which does not look good.
For a true Pareto approach that includes ranking, I only know that some GA variants such as NSGA-II can do it.
A different approach would be to use a domain-specific modeling language for mathematical optimization, such as YALMIP (or JuMP.jl if you would like to give Julia a try). With YALMIP you write your optimization problem in Matlab using some extra YALMIP functionality, and use CPLEX (or any other supported solver) as a backend, without restricting yourself to one solver.