How to structure dask to stream to a random forest classifier - classification

I have a number of h5py datasets in one file, where the class label is the dataset name and each dataset has shape (20000, 250000), stored as float64 and compressed with gzip.
How would the community suggest I use dask to enable random forest training without needing to load the entire datasets into memory?
I'm working with a high-core, high-memory instance.
I should have mentioned that I have 3 class labels.
Update:
My current thinking for loading the data was to create a dask array for each class label with shape (20000, 250000) and then concatenate the 3 arrays together. If I did that, would I be able to use the distributed random forest mentioned in the comments to then create the smaller training and test data frames needed?
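For what it's worth, here is a minimal sketch of that loading step (the file name data.h5 and the dataset names class_a / class_b / class_c are placeholders for whatever the h5py file actually contains):
import h5py
import dask.array as da
# data.h5 and the dataset names below are placeholders for the actual file contents
f = h5py.File('data.h5', 'r')
labels = ['class_a', 'class_b', 'class_c']
# wrap each (20000, 250000) dataset lazily, chunking along the rows so only a few row-blocks are in memory at once
arrays = [da.from_array(f[name], chunks=(100, 250000)) for name in labels]
X = da.concatenate(arrays, axis=0)  # lazy (60000, 250000) feature array
# integer class labels (0, 1, 2), one per row, also built lazily
y = da.concatenate([da.full((f[name].shape[0],), i, dtype='i1') for i, name in enumerate(labels)])
Everything stays lazy until you compute or sample it, so only the chunks you actually touch are read from the HDF5 file.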

Related

Trying to get feature importance in Random Forest (PySpark)

I have customer data with close to 15k columns.
I'm trying to run RF on the data to reduce the number of columns and then run other ML algorithms on it.
I am able to run RF on PySpark but am unable to extract the feature importance of the variables.
Does anyone have any clue about this, or about any other technique that would help me reduce the 15k variables to some 200-odd variables?
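For reference, the pyspark.ml API (as opposed to the older mllib one) exposes a featureImportances vector on the fitted model; a minimal sketch, assuming a DataFrame df that already has an assembled 'features' vector column and a numeric 'label' column:
from pyspark.ml.classification import RandomForestClassifier
# df is assumed to already have 'features' and 'label' columns
rf = RandomForestClassifier(featuresCol='features', labelCol='label', numTrees=100)
model = rf.fit(df)
importances = model.featureImportances  # SparseVector, indexed like the feature vector
top_200 = sorted(enumerate(importances.toArray()), key=lambda kv: kv[1], reverse=True)[:200]
The indices in top_200 can then be mapped back to column names to pick the ~200 variables to keep.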

Caffe is telling me that the LMDB created by convert_imageset is empty

I am working on a CNN.
For that, I need 2 pairs of LMDBs (one pair for testing and one for training).
Each LMDB in a pair is made out of images, where one LMDB contains the GT (ground truth).
So I make 4 lists of the images and feed these into convert_imageset. After that I feed the LMDBs into Caffe. Caffe constructs the training net without a problem, but when it comes to the testing net, it tells me that the file is empty. Where did I go wrong?

How to get the dataset size of a Caffe net in python?

I look at the Python example for LeNet and see that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. However, can this value not be hard-coded at all? How can I get the number of samples of the dataset pointed to by a network in Python?
You can use the lmdb library to access the LMDB directly:
import lmdb
# open the LMDB environment and read the record count from its stats (requires the lmdb package)
db = lmdb.open('/path/to/lmdb_folder')
num_examples = int(db.stat()['entries'])
Should do the trick for you.
It seems that you have mixed up iterations and the number of samples in one question. In the provided example we can see only the number of iterations, i.e. how many times the training phase will be repeated. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network input).
Some more detailed explanation:
EDIT: Caffe will load (batch size x iterations) samples in total for training or testing, but there is no relation between the number of loaded samples and the actual database size: it will start reading from the beginning again after reaching the database's last record - in other words, the database in Caffe acts like a circular buffer.
The mentioned example points to this configuration. We can see that it expects LMDB input and sets the batch size to 64 (some more info about batches and BLOBs) for the training phase and 100 for the testing phase. It really makes no assumption about the input dataset size, i.e. the number of samples in the dataset: batch size is only the processing chunk size, and iterations is how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't point to any number of samples in the database - only to the dataset name and format and the desired number of samples per batch. As far as I know, there is currently no way to determine the database size with Caffe itself.
Thus, if you want to load the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify the corresponding values for batch size and iterations.
You have some options for this:
Look at the ./examples/mnist/create_mnist.sh console output - it prints the number of samples while converting from the initial format (I believe you followed this tutorial);
follow @Shai's advice (read the lmdb file directly), as in the sketch below.
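As a sketch combining those two points (the LMDB path is a placeholder, and batch_size must match the TEST-phase batch_size in your prototxt):
import math
import lmdb
db = lmdb.open('/path/to/mnist_test_lmdb', readonly=True)
num_examples = int(db.stat()['entries'])  # actual number of samples in the database
batch_size = 100  # must match the TEST-phase batch_size in the prototxt
test_iter = int(math.ceil(num_examples / float(batch_size)))  # iterations needed to see every sample once
print(num_examples, test_iter)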

PyBrain: MemoryError: while loading training dataset

I am trying to train a feedforward neural network for binary classification. My dataset is 6.2M with 1.5M dimensions. I am using PyBrain. I am unable to load even a single datapoint: I am getting a MemoryError.
My code snippet is:
import numpy
from pybrain.datasets import SupervisedDataSet
Train_ds = SupervisedDataSet(FV_length, 1)  # FV_length is a computed value: 150000
feature_vector = numpy.zeros((FV_length,), dtype=numpy.int)
# activate feature values
for index in nonzero_index_list:
    feature_vector[index] = 1
Train_ds.addSample(feature_vector, class_label)  # both arguments are tuples
It looks like your computer just does not have enough memory to add your feature and class-label arrays to the supervised data set Train_ds.
If there is no way for you to allocate more memory on your system, it might be a good idea to randomly sample from your data set and train on the smaller sample.
This should still give accurate results, assuming the sample is large enough to be representative.
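A minimal sketch of that idea, reusing Train_ds and FV_length from the snippet above; the list samples of (nonzero_index_list, class_label) pairs and the sample size are assumptions:
import random
import numpy
sample_size = 50000  # tune this to whatever fits in memory
subset = random.sample(samples, sample_size)  # samples: assumed list of (nonzero_index_list, class_label) pairs
for nonzero_index_list, class_label in subset:
    feature_vector = numpy.zeros((FV_length,), dtype=numpy.int)
    feature_vector[nonzero_index_list] = 1  # set all active features at once
    Train_ds.addSample(feature_vector, class_label)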

LMDB files and how they are used for caffe deep learning network

I am quite new to deep learning and I am having some problems using the Caffe deep learning framework. Basically, I haven't found any documentation explaining how to solve a series of questions and problems I am dealing with right now.
Please, let me explain my situation first.
I have thousands of images and I must run a series of pre-processing operations on them. For each pre-processing operation, I have to save the pre-processed images as 4D matrices and also store a vector with the image labels. I will store this information as LMDB files that will be used as input for the Caffe GoogLeNet network.
I tried to save my images as HDF5 files, but the final file size is 80GB, which is impossible to process with the memory I have.
So, the other option is using LMDB files, right? I am quite a newbie with this file format and I would appreciate your help in understanding how to create them in Matlab. Basically, my rookie questions are:
1- These LMDB files have the extension .MDB, right? Is this the same extension used by Microsoft Access, or is the right format .lmdb and they are different?
2- I found this solution for creating .mdb files (https://github.com/kyamagu/matlab-leveldb); does it create the file format needed by Caffe?
3- For Caffe, do I have to create one .mdb file for the labels and another for the images, or can both be fields of the same .mdb file?
4- When I create an .mdb file I have to label the database fields. Can I label one field as image and another as label? Does Caffe understand what each field means?
5- What do the calls database.put('key1', 'value1') and database.put('key2', 'value2') (in https://github.com/kyamagu/matlab-leveldb) do? Should I save my 4D matrices in one field and the label vector in another?
There is no connection between LMDB files and MS Access files.
As I see it you have two options:
Use the "convert_imageset" tool - it is located in caffe under the tools folder to convert a list of image files and label to lmdb.
Instead of "data layer" use "image data layer" as an input to the network. This type of layer takes a file with a list of image file names and labels as source so you don't have to build a database (another benefit for training - you can use the shuffle option and get slightly better training results)
In order to use an image data layer just replace the layer type from Data to ImageData. The source file is the path to a file containing in each line a path of an image file and the label seperated by space. For example:
/path/to/filnename.png 23
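For instance, a small Python sketch for generating such a listing file (the directory layout with one subfolder per class and the label mapping are assumptions):
import os
root = '/data/train'  # assumed layout: one subfolder per class, e.g. /data/train/cat/*.png
class_to_id = {'cat': 0, 'dog': 1}  # hypothetical class-to-label mapping
with open('train_listing.txt', 'w') as out:
    for class_name, label in class_to_id.items():
        class_dir = os.path.join(root, class_name)
        for fname in sorted(os.listdir(class_dir)):
            out.write('{} {}\n'.format(os.path.join(class_dir, fname), label))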
If you want to do some preprocessing of the data without saving the preprocessed files to disk, you can use the transformations available in Caffe (mirroring and cropping; see http://caffe.berkeleyvision.org/tutorial/data.html for information) or implement your own DataTransformer.
Caffe doesn't use LevelDB here - it uses LMDB, the 'Lightning' DB from Symas.
You can try using this Matlab LMDB wrapper
I personally have no experience using LMDB with Matlab, but there is a nice library for doing this from Python: py-lmdb.
An LMDB database is a key/value store (similar to a HashMap in Java or a dict in Python). In order to store 4D matrices you need to understand the convention Caffe uses to save images in LMDB format.
This means that the best approach to converting images to LMDB for Caffe is to do it with Caffe itself.
There are examples in Caffe of how to convert images into LMDB - I would try to repeat them and then modify the scripts to use your images.
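As an illustration of that convention, here is a hedged sketch of writing images and labels into a single LMDB from Python with py-lmdb and Caffe's Datum message (the map_size and key format are my assumptions; images are assumed to be uint8 arrays in channels x height x width order, which is the layout Caffe expects):
import lmdb
from caffe.proto import caffe_pb2  # requires pycaffe on the PYTHONPATH
def write_images_to_lmdb(images, labels, lmdb_path):
    # images: iterable of uint8 arrays shaped (channels, height, width); labels: integers
    env = lmdb.open(lmdb_path, map_size=10 * 1024 ** 3)  # generous upper bound on database size
    with env.begin(write=True) as txn:
        for i, (img, label) in enumerate(zip(images, labels)):
            datum = caffe_pb2.Datum()
            datum.channels, datum.height, datum.width = img.shape
            datum.data = img.tobytes()  # raw bytes in C-H-W order
            datum.label = int(label)
            txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())
    env.close()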