Which preprocessing filters should I use for a large dataset with multiple class attributes?

I have a dataset based on movies: their genres, director likes, whether they are in black & white or in colour, and so on.
Which preprocessing filters should I run in Weka to clean up this data?
My goal is to find which class attribute values lead to the highest movie gross, but I am not sure how to start on the preprocessing.
I have attached images below showing the important parts of the dataset, containing all the classes.
Thanks

Related

How do word embeddings / word vectors work, and how are they created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from the commoncrawl website), but I am getting the same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is better. How can I select the best model out of all of these, and why am I getting the same word vectors from different models?
Sorry if the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
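For example, a minimal sketch of those two checks, assuming gensim's Word2Vec (toy corpus and parameters, purely illustrative):

```python
import logging
from gensim.models import Word2Vec

# INFO logging prints vocabulary counts and incremental training progress.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

# A real corpus must be a restartable iterable of token lists; a one-shot
# generator (already exhausted after the vocabulary scan) is a classic cause
# of "no training happened" and identical, randomly-initialized vectors.
sentences = [["hot", "warm", "sun"], ["cold", "ice", "snow"]] * 100  # toy data
model = Word2Vec(sentences, min_count=1)

# Ad-hoc sanity check: neighbors of 'hot' should look thematically related.
print(model.wv.most_similar("hot"))
```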
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive, rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.
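An illustrative sketch of that scoring (the accuracy() method exists in the gensim version linked above; newer releases renamed it to evaluate_word_analogies(); model and file paths here are hypothetical):

```python
from gensim.models import Word2Vec

model = Word2Vec.load("my_model.w2v")  # a previously saved model

# Each entry in `sections` names an analogy section and lists the analogies
# the vectors solved correctly and incorrectly.
sections = model.wv.accuracy("questions-words.txt")
for section in sections:
    total = len(section["correct"]) + len(section["incorrect"])
    if total:
        print(section["section"], len(section["correct"]) / total)
```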

What kind of features are extracted with the AlexNet layers?

My question is regarding this method, which extracts features from the FC7 layer of AlexNet.
What kind of features is it actually extracting?
I used this method on images of paintings done by two artists. The training set is about 150 images from each artist (so the features are stored in a 300x4096 matrix); the validation set is 40 images. This works really well, with 85-90% correct classification. I would like to know why it works so well.
WHAT FEATURES?
FC8 is the classification layer; FC7 is the one before it, where all of the prior kernel pixels are linearised and concatenated. These represent the abstract, top-level features that the model training has "discovered". To examine these features, try one of the many layer-visualization tools available online (don't ask for references here; SO bans requests for resources).
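For concreteness, a minimal sketch of this kind of FC7 extraction, assuming a torchvision-style pretrained AlexNet (the exact method the question refers to may differ in details):

```python
import torch
from torchvision import models, transforms

net = models.alexnet(pretrained=True)
net.eval()
# classifier = [Dropout, FC6, ReLU, Dropout, FC7, ReLU, FC8]; dropping the
# last module (FC8) makes the forward pass end at the 4096-dim FC7 output.
net.classifier = torch.nn.Sequential(*list(net.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# For a PIL image `img`:
#   with torch.no_grad():
#       feats = net(preprocess(img).unsqueeze(0))  # shape (1, 4096)
```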
The features vary from one training to another, depending on the kernel initialization (usually random) and very dependent on the training set. However, the features tend to be simple in the early layers, with greater variety and detail in the later ones. For instance, on the original AlexNet target (ILSVRC 2012, a.k.a. ImageNet data set), the FC7 features often include vehicle tires, animal faces, various types of flower petals, green leaves and stems, two-legged animal torsos, airplane sections, car/truck/bus grill work, etc.
Does that help?
WHY DOES IT WORK SO WELL?
That depends on the data set and training parameters. How different are the images from the two artists? There are plenty of features to extract: choice of subject, palette, compositional complexity, hard/soft edges, even the direction of brush strokes. For instance, differentiating any two early Cubists could be a little tricky; telling Rembrandt from Jackson Pollock should hit 100%.

Choosing train images for convolutional neural network

The goal is to localise objects in images. I decided to modify and train an existing model, but I can't decide whether I should train the model using masks or only with ROIs.
For example, for class 1 data only the class 1 object will be visible in the image and every other region will be filled with 0s; for the 2nd class I'll do the same thing and leave only the 2nd class's object in the mask, and so on for the 3rd and 4th classes.
The second way, using ROIs: I'll crop each class from the image without a mask, keeping only the region of interest.
Then I hope to continue with something similar to this: https://github.com/jazzsaxmafia/Weakly_detector
Should I choose the first way or the second? Any comment like "Your plan won't work, try this" is also appreciated.
--Edit--
To be clear:
Original image: http://s31.postimg.org/btyn660bf/image.jpg
1st approach, using masks:
1st class: http://s31.postimg.org/4s0pjywpn/class11.png
2nd class: http://s31.postimg.org/3zy1krsij/class21.png
3rd class: http://s31.postimg.org/itcp5j09n/class31.png
4th class: http://s31.postimg.org/yowxv31gb/class41.png
2nd approach, using ROIs:
1st class: http://s31.postimg.org/4x4gtn40r/class1.png
2nd class: http://s31.postimg.org/8s7uw7n6j/class2.png
3rd class: http://s31.postimg.org/mxdny0w7v/class3.png
4th class: http://s31.postimg.org/qfpnuex3v/class4.png
P.S.: The locations of the objects will be very similar in new examples, so the mask approach may be a bit more useful. For the ROI approach I would need to normalise each object, and the objects have very different sizes; normalising the whole masked image, by contrast, should keep the variance from the original much smaller.
CNNs are generally quite robust to varying backgrounds assuming they're trained on a large amount of high-quality data. So I would guess that the difference between using the mask and ROI approaches won't be very substantial. For what it's worth, you will need to normalize the size of the images you're feeding to the CNN, regardless of which approach you use.
I have implemented some gesture recognition software and encountered a similar question. I could just use the raw, unprocessed ROI, or I could use a pre-processed version that filtered out much of the background. I basically tried it both ways and compared the accuracy of the models. In my case, I was able to get slightly better results from the pre-processed images. On the other hand, the backgrounds in my images were much more complex and varied. Anyway, my recommendation would be to build a solid mechanism for testing the accuracy of your model and experiment to see what works best.
Honestly, the most important thing is collecting lots of good samples for each class. In my case, I kept seeing substantial improvements until I hit about 5000 images per class. Since collecting lots of data takes a long time, it's best to capture and store the raw, full size images, along with any meta-data involved in the actual collection of the data so that you can experiment with different approaches (masking vs. ROI, varying input image sizes, other pre-processing such as histogram normalization, etc.) without having to collect new data.
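To make the two pre-processing options concrete, here is a minimal sketch (NumPy/Pillow; the bounding-box coordinates and file name are hypothetical):

```python
import numpy as np
from PIL import Image

INPUT_SIZE = (224, 224)  # assumed fixed CNN input resolution

def mask_object(img, bbox):
    # Approach 1: keep the full frame, zero out everything outside the object.
    arr = np.array(img)
    out = np.zeros_like(arr)
    x0, y0, x1, y1 = bbox
    out[y0:y1, x0:x1] = arr[y0:y1, x0:x1]
    return Image.fromarray(out)

def crop_roi(img, bbox):
    # Approach 2: keep only the region of interest.
    return img.crop(bbox)

img = Image.open("original.jpg")
bbox = (50, 80, 300, 260)  # hypothetical object location (x0, y0, x1, y1)

masked = mask_object(img, bbox).resize(INPUT_SIZE)  # same transform for every example
roi = crop_roi(img, bbox).resize(INPUT_SIZE)        # per-object distortion
```

Note that resizing the masked image applies the same transform to every example, while resizing crops of differently-sized objects distorts each one differently, which is the point raised in the question's P.S.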

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify pages/blocks as either English (class 1), Hindi (class 2), or Mixed using LIBSVM in MATLAB. The problem is that the training data I have consists of samples for Hindi and English pages/blocks only, with no Mixed pages.
The test data I want to classify may contain Mixed pages/blocks as well, and in that case I want them to be labelled "Mixed". I am planning to do this using confidence scores or probability values: if the probability of class 1 is greater than a threshold (say 0.8) and the probability of class 2 is less than a threshold (say 0.05), then classify as class 1, and vice versa for class 2. If neither condition is satisfied, I want to classify it as "Mixed".
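In code, the decision rule I have in mind is roughly this (threshold values are placeholders):

```python
def decide(prob_class1, prob_class2, hi=0.8, lo=0.05):
    # Class 1 = English, class 2 = Hindi; anything ambiguous becomes Mixed.
    if prob_class1 > hi and prob_class2 < lo:
        return "English"
    if prob_class2 > hi and prob_class1 < lo:
        return "Hindi"
    return "Mixed"
```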
The third return value from svmpredict() is prob_values, and I was planning to use these prob_values to decide whether the test data is Hindi, English, or Mixed. But in a few places I have read that svmpredict() does not produce the actual probability values.
Is there any way to classify the test data into three classes (Hindi, English, Mixed) using SVM training data that contains only two classes?
This is not the modus operandi for SVMs.
An SVM simply cannot predict a given class without having seen it, i.e. without having learned how to separate that class from all the other classes.
The function svmpredict() in LIBSVM does show the probability estimates, and the greater this value is, the more confident you can be about your prediction. But you cannot rely on such values to predict a third class when you trained on just two: svmpredict() will return only as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based), but it will most likely fail or perform poorly. Think about it: you have to set up two thresholds and combine them with a logical AND, so the chance of correctly classifying even the non-Mixed documents will drop drastically.
My suggestion: instead of wasting time setting up thresholds that will probably perform badly, join some of these texts together, or create some new files containing both Hindi and English lines, so that you add proper Mixed documents to your training data, and then train a standard 3-class SVM.
To create such files you can use MATLAB itself, which has pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata(), and so on.
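If Python happens to be handier than MATLAB for this step, here is a minimal sketch (file names hypothetical) of building such Mixed documents:

```python
import itertools

with open("hindi_block.txt", encoding="utf-8") as h, \
     open("english_block.txt", encoding="utf-8") as e:
    hindi = h.read().splitlines()
    english = e.read().splitlines()

# Interleave lines from both sources to simulate a Mixed page.
mixed = [line
         for pair in itertools.zip_longest(hindi, english, fillvalue="")
         for line in pair if line]

with open("mixed_block.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(mixed))
```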

RapidMiner: Ability to classify based off user set support threshold?

I have built a small text-analysis model that classifies small text files as either good, bad, or neutral. I am using a Support Vector Machine as my classifier. I was wondering whether, instead of classifying into all three, I could classify into either Good or Bad, but if the support for a text file is below 0.7 (or some user-specified threshold) it would be classified as neutral. I know this isn't regarded as the best way of doing this; I am just trying to see what would happen if I took a different approach.
The operator Drop Uncertain Predictions might be what you want.
After you have applied your model to some test data, the resulting example set will have a prediction and two new attributes called confidence(Good) and confidence(Bad). These confidences are between 0 and 1, and for the two-class case they sum to 1 for each example within the example set. The highest confidence dictates the value of the prediction.
The Drop Uncertain Predictions operator requires a min confidence parameter and will set the prediction to missing if the maximum confidence it finds is below this value (you can also have different confidences for different class values for more advanced investigations).
You could then use the Replace Missing Values operator to change all missing predictions to be a text value of your choice.
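Outside RapidMiner, the equivalent drop-below-threshold logic is only a few lines; a sketch (the 0.7 threshold mirrors the question):

```python
def label_with_threshold(conf_good, conf_bad, min_confidence=0.7):
    # For the two-class case the confidences sum to 1; the larger one wins,
    # unless it falls below the user-specified threshold.
    label, conf = ("Good", conf_good) if conf_good >= conf_bad else ("Bad", conf_bad)
    return label if conf >= min_confidence else "Neutral"

print(label_with_threshold(0.92, 0.08))  # -> Good
print(label_with_threshold(0.55, 0.45))  # -> Neutral
```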