For a Kolmogorov-Smirnov two-sample test on very large data, is there a way to split the data into multiple samples and do the computation per sample? - distributed-computing

Suppose I have two very large lists with many millions of values. Is there any way to do the computation on smaller samples individually and then combine the results from the different samples?
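One observation (my own sketch, not from the thread): the KS statistic depends only on the two empirical CDFs, so each machine can sort just its own chunk and ship the sorted array (or a fine-grained histogram) back; merging the sorted chunks and evaluating both ECDFs on the pooled values reproduces the monolithic result exactly. A minimal numpy illustration, with made-up function names:

```python
import numpy as np

def ks_statistic(x, y):
    # Two-sample KS statistic: sup |ECDF_x - ECDF_y|, evaluated at
    # every observed value (where the supremum is attained).
    xs, ys = np.sort(x), np.sort(y)
    data = np.concatenate([xs, ys])
    cdf_x = np.searchsorted(xs, data, side="right") / len(xs)
    cdf_y = np.searchsorted(ys, data, side="right") / len(ys)
    return np.max(np.abs(cdf_x - cdf_y))

def ks_statistic_chunked(x_chunks, y_chunks):
    # Each worker sorts only its own chunk; the driver merges the
    # sorted chunks (here via a plain sort, in practice a k-way merge)
    # and computes the statistic exactly as above.
    xs = np.sort(np.concatenate([np.sort(c) for c in x_chunks]))
    ys = np.sort(np.concatenate([np.sort(c) for c in y_chunks]))
    data = np.concatenate([xs, ys])
    cdf_x = np.searchsorted(xs, data, side="right") / len(xs)
    cdf_y = np.searchsorted(ys, data, side="right") / len(ys)
    return np.max(np.abs(cdf_x - cdf_y))
```

Note this parallelizes only the sorting and ECDF work; the final merge still happens in one place, which is usually fine since only sorted values (or bin counts), not raw work, cross the network.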

Related

Statistical testing of classifier performance on different populations subsets

I have a dataset of people's medical records which contains people of different races, income levels, genders, etc.
I have a binary classification problem, and want to compare the accuracy of the two models:
Model 1: I have trained and tested a classifier on a random sample from the population.
Model 2: I have trained and tested a classifier on a sample of the population where the income level is above say 100k.
The test sets are of different sizes (both contain over 50,000 people). The test sets may of course contain some of the same people - I have a lot of data and will be running the above experiment for many different conditions, so I am not stating how many people overlap between the sets, as it will change depending on the condition.
I believe I can't use a standard or modified t-test to compare performance on the separate test sets, since the demographics differ between the two test sets - is this correct?
Instead, I think the only option is to evaluate model 1 on the same test set as model 2 to figure out whether model 2 performs better?
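For what it's worth, once both models are scored on the same test set the predictions are paired, and a paired test such as McNemar's is the usual choice for comparing two classifiers. A hedged sketch (the `mcnemar_exact` helper is my own, not from the thread):

```python
import numpy as np
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    # Exact (binomial) McNemar test on the discordant pairs: cases
    # where exactly one of the two models is correct.
    correct_a = (pred_a == y_true)
    correct_b = (pred_b == y_true)
    b = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence either way
    # Two-sided exact binomial p-value under H0: p = 0.5
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, p)
```

This only tests whether the two models differ on that shared test set; it says nothing by itself about why (demographics vs. model quality).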

Sampling from two different datasets for training and testing a model

I am doing an experiment to test the ability of a neural network model to generalize. I have two datasets of different size. The second dataset is different from the first one, it contains some words that are not in the first dataset. I want to use examples from the first dataset for training and test on examples from the second dataset. Is it correct to take a sample from the first dataset and use this sample as a training set and take a sample from the second dataset and use this sample as a test set? More precisely, if the first dataset contains 66360 examples and the second one contains 56112 examples, can I sample 50000 examples from the first dataset and use those 50000 examples as a training set and sample 50000 examples from the second dataset and use those 50000 examples as a test set?

Latent Dirichlet Allocation and Analyzing Two Data Sets using MALLET

I am currently analyzing two datasets. Dataset A has about 600,000+ documents, whereas Dataset B has about 7,000+ documents. Does this mean the topic outputs will be mostly about Dataset A because it has a larger N? The output of MALLET in RapidMiner still accounts for which documents fall under each topic. I wonder if there is a way to make the two datasets be interpreted with equal weight?
I am assuming you're mixing the two document sets together in the training corpus and performing the training. Under this assumption, it is very likely that the topic outputs will be mostly about documents coming from A rather than B, as Gibbs sampling constructs topics according to the co-occurrence of tokens, which will mostly come from A as well. However, inter-topic overlap, or similarity of topics across the two datasets, is also possible.
You can instead sample dataset A so that it has the same number of documents as B, assuming their topic structures are not that different. Or you can check the log output from the --output-state parameter to see exactly the assigned topic (z) for each token.
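The downsampling suggestion above can be as simple as the following sketch (the `balance_corpora` helper name is mine; fixing the seed keeps the subsample reproducible across runs):

```python
import random

def balance_corpora(docs_a, docs_b, seed=0):
    # Randomly subsample the larger corpus A down to the size of B so
    # both datasets contribute equally many documents to training.
    rng = random.Random(seed)
    return rng.sample(list(docs_a), k=len(docs_b)) + list(docs_b)
```

The balanced list can then be written out and fed to MALLET as a single training corpus.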

Using Multiple GPUs outside of training in PyTorch

I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory using a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such calculations across two GPUs.
The code follows:
def ac_distance(layer):
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total
Where layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, since I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done using only one GPU, even though it trains using both.
Parallelism while training neural networks can be achieved in two ways.
Data Parallelism - Split a large batch into two and do the same set of operations but individually on two different GPUs respectively
Model Parallelism - Split the computations and run them on different GPUs
As you have asked in the question, you would like to split the calculation, which falls into the second category. There is no out-of-the-box way to achieve model parallelism. PyTorch provides primitives for parallel processing via the torch.distributed package. This tutorial goes through the details of the package comprehensively, and you can cook up an approach to achieve the model parallelism you need.
However, model parallelism can be very complex to achieve. The common approach is data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods, you run the same model on two different GPUs, but one huge batch is split into two smaller chunks. In DataParallel, the gradients are accumulated on a single GPU, where optimization takes place; in DistributedDataParallel, optimization happens in parallel across GPUs using multiprocessing.
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try using DistributedDataParallel, which is the fastest way to train on multiple GPUs according to the docs.
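Whichever device strategy you pick, note that the pairwise sum itself decomposes cleanly: each block of rows of the outer loop is an independent partial sum, so each block could live on its own GPU before a final reduction. A framework-free numpy sketch of that decomposition (function names are mine):

```python
import numpy as np

def pairwise_total(weights):
    # Full O(n^2) accumulated distance between every pair of kernels.
    n = len(weights)
    return sum(np.abs(weights[i] - weights[j]).sum()
               for i in range(n) for j in range(n))

def pairwise_total_split(weights, n_parts=2):
    # Split the outer loop into row blocks; each block's partial sum
    # is independent of the others, so each could be computed on its
    # own device and the scalars summed at the end.
    n = len(weights)
    bounds = np.linspace(0, n, n_parts + 1).astype(int)
    partials = [sum(np.abs(weights[i] - weights[j]).sum()
                    for i in range(lo, hi) for j in range(n))
                for lo, hi in zip(bounds[:-1], bounds[1:])]
    return sum(partials)
```

In PyTorch the same idea would mean moving each row block (and a full copy of the weights) to a different device; autograd still tracks each partial sum, so the graph is preserved.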
There are other ways to process very large batches too. This article goes through them in detail, and I'm sure it will be helpful. A few important points:
Do gradient accumulation for larger batches
Use DataParallel
If that doesn't suffice, go with DistributedDataParallel
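On the gradient-accumulation point above: the key detail is dividing each micro-batch loss by the number of accumulation steps, so the summed gradients reproduce the full-batch gradient. A framework-free sketch on a linear least-squares loss (equal-sized micro-batches assumed; function names are mine):

```python
import numpy as np

def grad_full(w, X, y):
    # Gradient of the mean-squared-error loss over the whole batch.
    return 2 * X.T @ (X @ w - y) / len(y)

def grad_accumulated(w, X, y, steps=4):
    # Accumulate gradients over equal micro-batches, scaling each
    # micro-batch loss by 1/steps; the sum matches grad_full exactly.
    g = np.zeros_like(w)
    for Xs, ys in zip(np.array_split(X, steps), np.array_split(y, steps)):
        g += 2 * Xs.T @ (Xs @ w - ys) / len(ys) / steps
    return g
```

In PyTorch this corresponds to calling `loss.backward()` on each scaled micro-batch loss and `optimizer.step()` only every `steps` iterations, which trades compute time for peak memory.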

How to generate a 'clusterable' dataset in MATLAB

I need to test my Gap Statistics algorithm (which should tell me the optimal k for a dataset), and in order to do so I need to generate a big, easily clusterable dataset, so that I know the optimal number of clusters a priori. Do you know any fast way to do this?
It very much depends on what kind of dataset you expect - 1D, 2D, 3D, normal distribution, sparse, etc? And how big is "big"? Thousands, millions, billions of observations?
Anyway, my general approach to creating easy-to-identify clusters is concatenating sequential vectors of random numbers with different offsets and spreads:
DataSet = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];
Groups = [1*ones(1000,1);2*ones(1000,1);3*ones(1000,1)];
This can be extended to N features by using e.g.
randn(1000,5)
or concatenating horizontally
DataSet1 = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];
DataSet2 = [-100+7*randn(1000,1); 1+0.1*randn(1000,1); 20+3*randn(1000,1)];
DataSet = [DataSet1 DataSet2];
and so on.
randn also takes multidimensional inputs like
randn(1000,10,3);
for looking at higher-dimensional clusters.
If you don't have details on what kind of datasets this is going to be applied to, you should try to find them out first.
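If MATLAB isn't a requirement, a Python/numpy version of the same idea (Gaussian blobs with known centers and spreads, so the true k is known a priori; the function name is mine):

```python
import numpy as np

def make_clusters(n_per_cluster, centers, spreads, seed=0):
    # Stack Gaussian blobs with chosen offsets and spreads, returning
    # the data and the ground-truth cluster label for each point.
    rng = np.random.default_rng(seed)
    data, labels = [], []
    for k, (c, s) in enumerate(zip(centers, spreads), start=1):
        blob = np.asarray(c) + s * rng.standard_normal(
            (n_per_cluster, len(c)))
        data.append(blob)
        labels.append(np.full(n_per_cluster, k))
    return np.vstack(data), np.concatenate(labels)
```

With well-separated centers relative to the spreads, any reasonable gap-statistic implementation should recover k equal to the number of centers passed in.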