Two-sample test using Scala Spark

I'm trying to conduct a two-sample test in Scala Spark. I'm thinking of using the Kolmogorov–Smirnov (K-S) test or the maximum mean discrepancy (MMD) test. However, the following issues are blocking me:
The K-S test provided in spark.mllib.stat is one-sample. Technically, I can convert one of the samples into an empirical CDF, but I don't know how to do so in Scala.
There is no available library for calculating MMD between two datasets. The brute-force implementation is to write for-loops and calculate the kernel values item by item. How can I implement it leveraging parallelization?
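
For the first issue, a possible workaround (a rough sketch, not a tested solution) is to build the empirical CDF of the second sample on the driver and pass it as the cdf argument to the one-sample test in spark.mllib.stat.Statistics. The helper name ksTwoSample below is hypothetical, and the sketch assumes the second sample is small enough to collect into driver memory:

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult
import org.apache.spark.rdd.RDD

// Hypothetical helper: approximate a two-sample K-S test by testing
// sampleA against the empirical CDF of sampleB. This ignores the extra
// variance introduced by estimating the CDF from a finite sampleB.
def ksTwoSample(sampleA: RDD[Double], sampleB: RDD[Double]): KolmogorovSmirnovTestResult = {
  val sortedB = sampleB.collect().sorted   // assumes sampleB fits on the driver
  val n = sortedB.length.toDouble

  // Empirical CDF of sampleB: fraction of points <= x, found via binary search.
  val ecdf: Double => Double = { x =>
    var lo = 0
    var hi = sortedB.length                // first index whose value is > x
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (sortedB(mid) <= x) lo = mid + 1 else hi = mid
    }
    lo / n
  }

  Statistics.kolmogorovSmirnovTest(sampleA, ecdf)
}

The resulting statistic and p-value are only approximate for the two-sample setting, since the test treats the empirical CDF as if it were a known reference distribution.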
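For the second issue, the brute-force kernel computation parallelizes naturally with RDD.cartesian, which spreads the full pairwise grid across the cluster. The sketch below computes a biased (V-statistic) estimate of MMD^2 with a Gaussian kernel over one-dimensional samples; the bandwidth sigma is an assumed parameter you would normally pick with something like the median heuristic:

import org.apache.spark.rdd.RDD

// Gaussian kernel over scalars; sigma is a user-chosen bandwidth.
def gaussianKernel(sigma: Double)(a: Double, b: Double): Double =
  math.exp(-(a - b) * (a - b) / (2.0 * sigma * sigma))

// Biased (V-statistic) estimate of MMD^2:
//   MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 * E[k(x, y)]
// Each expectation is a mean over a cartesian product, computed in parallel.
def mmdSquared(x: RDD[Double], y: RDD[Double], sigma: Double): Double = {
  val k = gaussianKernel(sigma) _
  def meanKernel(u: RDD[Double], v: RDD[Double]): Double =
    u.cartesian(v).map { case (a, b) => k(a, b) }.mean()
  meanKernel(x, x) + meanKernel(y, y) - 2.0 * meanKernel(x, y)
}

Note that cartesian materializes n*m pairs, so the cost is quadratic in the sample sizes; it is distributed, but for large samples you would want an approximation (e.g. random Fourier features) instead.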

Related

How to create a "Denoising Autoencoder" in Matlab?

I know MATLAB has the function trainAutoencoder(input, settings) to create and train an autoencoder. The resulting object supports the two functions encode and decode.
But this is only applicable to normal autoencoders. What if you want a denoising autoencoder? I searched and found some sample code where they used the network function to convert the autoencoder to a normal network and then called train(network, noisyInput, smoothOutput) like a denoising autoencoder.
But there are multiple missing parts:
How do I use this new network object to encode new data points? It doesn't support encode().
How do I get the latent variables (the features) out of this network?
I would appreciate it if anyone could help me resolve this issue.
Thanks,
-Moein
At present (2019a), MATLAB does not permit users to add layers manually to an autoencoder. If you want to build your own, you will have to start from scratch using the layers provided by MATLAB.
In order to use trainNetwork(...) to train your model, you will have to find a way to get your data into an object called an imageDatastore. The difficulty with an autoencoder's data is that there are NO labels, which imageDatastore requires, so you will have to find a smart way around it -- essentially you are dealing with a so-called OCC (One-Class Classification) problem.
https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.imagedatastore.html
Use activations(...) to dump outputs from intermediate (hidden) layers
https://www.mathworks.com/help/deeplearning/ref/activations.html?searchHighlight=activations&s_tid=doc_srchtitle
I swung between using MATLAB and Python (Keras) for deep learning for a couple of weeks, and eventually I chose the latter, albeit as a long-term and loyal user of MATLAB and a rookie in Python. My two cents: there are too many restrictions in the former when it comes to deep learning.
Good luck. :-)
If by 'simulation' you mean prediction/inference, simply use activations(...) to dump outputs from any intermediate (hidden) layer, as I mentioned earlier, so that you can check them.
Another way is to construct an identical network with only the encoding part, copy your trained parameters into it, and feed it your simulated signals.

Converting Dataframe from Spark to the type used by DL4j

Is there any convenient way to convert a Dataframe from Spark to the type used by DL4j? Currently, using a Dataframe in algorithms with DL4j I get an error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use DataVec for that. I can point you at examples if you want. Dataframes make too many assumptions that make them too brittle to be used for real-world deep learning.
Beyond that, a data frame is not typically a good abstraction for representing linear algebra. (It falls down when dealing with images, for example.)
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a DataSet is just a pair of ndarrays, just like in numpy. If you have to use Spark tools and only want to use ndarrays on the last mile, then my advice would be to get the dataframe to match some form of schema that is purely numerical, and map that to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off heap.
Spark has many limitations when it comes to working with its data pipelines, and it uses the JVM for things it shouldn't be used for (matrix math) - we took a different approach that allows us to use GPUs and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map dataframes onto a double/float array and then use Nd4j.create(float/doubleArray), or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(inputFloatArray), yourLabelINDArray))
That will give you a DataSet. You need a pair of ndarrays matching your input data and a label.
The label from there depends on the kind of problem you're solving, whether that's classification or regression.
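To make that mapping concrete, here is a rough Scala sketch under the stated assumption that the DataFrame is already purely numerical. The column names (featureCols, labelCol) are placeholders, and the label here is a single numeric value, which fits regression; for classification you would one-hot encode it instead.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

// Map each purely numerical Row to a DL4j DataSet (features ndarray + label ndarray).
def toDataSetRdd(df: DataFrame, featureCols: Seq[String], labelCol: String): RDD[DataSet] = {
  df.rdd.map { row =>
    val features = featureCols.map(c => row.getAs[Double](c)).toArray
    val label    = Array(row.getAs[Double](labelCol))
    new DataSet(Nd4j.create(features), Nd4j.create(label))
  }
}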

Spark-mllib retraining saved models

I am trying to do classification with spark-mllib, specifically using RandomForestModel.
I have taken a look at this example from Spark (RandomForestClassificationExample.scala), but I need a somewhat expanded approach.
I need to be able to train a model and save it for future use, but also to be able to load it and train it further, for example by extending the dataset and training again.
I completely understand the need to export and import a model for future usage.
Unfortunately, training "further" isn't possible with Spark, nor does it make sense. Thus it's recommended to retrain the model with the data used to train the first model plus the new data.
Your first training values/metrics don't mean much anymore if you want to add more data (e.g. features, intercept, coefficients, etc.).
I hope that this answers your question.
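As a rough sketch of that recommendation (the hyperparameters below are placeholders, not tuned values): union the original training data with the new data, train a fresh model, and save it with the built-in save/load so you can repeat the cycle later.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Retrain from scratch on old + new data; all hyperparameters are placeholders.
def retrain(sc: SparkContext,
            oldData: RDD[LabeledPoint],
            newData: RDD[LabeledPoint],
            modelPath: String): RandomForestModel = {
  val combined = oldData.union(newData)
  val model = RandomForest.trainClassifier(
    combined,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 50,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32)
  model.save(sc, modelPath)            // persist for future use
  model
}

// Later, in another job:
// val model = RandomForestModel.load(sc, modelPath)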
You may need to look for some reinforcement learning technique instead of Random Forest if you want to use the old model and retrain it with new data.
As far as I know, there's deeplearning4j, which implements deep reinforcement learning algorithms on top of Spark (and Hadoop).
If you only need to save a JavaRDD[Object], you can do (in Java):
model.saveAsObjectFile(path)
Values will be written out using Java serialization. Then, to read your data back, you do:
JavaRDD<Object> model = jsc.objectFile(pathOfYourModel);
Be careful: object files are not available in Python. But you could use saveAsPickleFile() to write your model and pickleFile() to read it.
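The Scala equivalent is the same idea (a sketch only; the path and the element type are placeholders): any serializable RDD can be written with saveAsObjectFile and read back with sc.objectFile.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Persist an RDD with Java serialization and read it back later.
def saveAndReload(sc: SparkContext, data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  data.saveAsObjectFile("hdfs:///tmp/training-data")        // placeholder path
  sc.objectFile[LabeledPoint]("hdfs:///tmp/training-data")
}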

framework for distributed algorithm

I have to do a project where I have a dynamic graph, and each node executes my algorithm to calculate the PageRank.
My question is: is there a framework that allows me to run an algorithm on each node at the same time (the algorithm is not centralized)?
Yes, Giraph is probably the most common example of this and can do exactly what you are looking for. However, it isn't trivial to set up; there is a question from yesterday on SO about materials for Giraph: https://stackoverflow.com/questions/22817423/material-related-to-giraph/
Another example would be GraphX (http://amplab.github.io/graphx/) from Spark, and GraphLab (http://graphlab.org/projects/index.html), but I don't have any experience with those. However, all of these frameworks enable writing code for a node and executing it for each node in a graph. They also allow you to distribute the algorithm across multiple servers for large graphs, but that isn't necessary if your graph is small enough.
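As a small illustration of the GraphX route (a sketch only; the edge-list path and tolerance are placeholders), the built-in PageRank runs a vertex program on every node of the graph in parallel. Note that it operates on a static snapshot, so for a dynamic graph you would re-run it per snapshot.

import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

// Load a graph from an edge list and run the built-in PageRank.
def pageRankExample(sc: SparkContext): Unit = {
  val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")   // placeholder path
  val ranks = graph.pageRank(0.0001).vertices
  ranks.take(10).foreach { case (vertexId, rank) => println(s"$vertexId -> $rank") }
}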

Why is this not faster using parallel collections?

I just wanted to test the parallel collections a bit, and I used the following line of code (in the REPL):
(1 to 100000).par.filter(BigInt(_).isProbablePrime(100))
against:
(1 to 100000).filter(BigInt(_).isProbablePrime(100))
But the parallel version is not faster. In fact, it even feels a bit slower (but I haven't really measured that).
Does anyone have an explanation for that?
Edit 1: Yes, I do have a multi-core processor
Edit 2: OK, I "solved" the problem myself. The implementation of isProbablePrime seems to be the problem, not the parallel collections. I replaced isProbablePrime with another function to test for primality, and now I get the expected speedup.
With both sequential and parallel ranges, filter will generate a vector data structure - a Vector or a ParVector, respectively.
This is a known problem with parallel vectors that get generated from range collections - transformer methods (such as filter) for parallel vectors do not construct the vector in parallel.
A solution for this that allows efficient parallel construction of vectors has already been developed, but has not yet been implemented. I suggest you file a ticket so that it can be fixed for the next release.
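For reference, a minimal sketch along the lines of the questioner's Edit 2: swapping BigInt.isProbablePrime for a plain trial-division test gives each element a similar, cheap cost, and the parallel speedup becomes visible. The timing helper is crude but fine for a REPL experiment (on Scala 2.13+ you would also need the separate scala-parallel-collections module and import scala.collection.parallel.CollectionConverters._).

// Simple trial-division primality test (cheap, allocation-free).
def isPrime(n: Int): Boolean =
  n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

// Crude wall-clock timer for REPL experiments.
def time[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

time("sequential") { (1 to 100000).filter(isPrime) }
time("parallel")   { (1 to 100000).par.filter(isPrime) }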