I'm using Itti's saliency map. So given an example image, I can get a saliency map as shown below (compare the saliency map with the color photo):
The problem is that although the algorithm pinpoints roughly where the salient object is, it fails to reliably capture the dimensions of the object. Thus, if I want my program to automatically crop out the most salient object in an image, I can only guess the dimensions based on the shape of the saliency blob for that object. This is pretty unreliable, since the blob's shape can vary greatly.
Are there more reliable methods to do this?
In order to find a better estimate of the salient object's boundaries, I suggest you use a foreground extraction algorithm. One popular algorithm for that task is GrabCut. It requires an initial estimate of the object boundaries; for the task at hand, the initial estimate would be the boundaries of the blobs from the saliency map.
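A minimal sketch of that pipeline with OpenCV's grabCut, assuming the saliency map is available as a grayscale image and taking the bounding box of its largest thresholded blob as the initial estimate (the file names and the Otsu thresholding are just placeholders to adapt):

```python
import cv2
import numpy as np

# File names are placeholders: the color photo and its grayscale saliency map.
img = cv2.imread("photo.jpg")
sal = cv2.imread("saliency_map.png", cv2.IMREAD_GRAYSCALE)

# Initial estimate: bounding box of the largest blob in the thresholded saliency map.
_, blobs = cv2.threshold(sal, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x
rect = cv2.boundingRect(max(contours, key=cv2.contourArea))  # (x, y, w, h)

# GrabCut refines that rectangle into a foreground mask.
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep definite + probable foreground, then crop to its bounding box.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
ys, xs = np.nonzero(fg)
crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
cv2.imwrite("salient_object.png", crop)
```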
I also know that a saliency map is, in a sense, a form of image segmentation task. But it has been used very widely for interpretable deep learning (see Grad-CAM etc.). I also came across this paper (http://img.cs.uec.ac.jp/pub/conf16/161011shimok_0.pdf), which talks about class saliency maps - something that rings a bell when it comes to image segmentation. Please tell me whether this concept exists for image segmentation, or whether I need to read more on the subject.
As described in Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, a class saliency map describes, per pixel, how much changing that pixel would influence the prediction. Hence I see no reason why this could not be applied to image segmentation tasks.
The outputs of a segmentation task and of a saliency map have to be interpreted differently, however. In an image segmentation task, the output is a per-pixel prediction of whether or not a pixel belongs to a class, sometimes in the form of a certainty score. A class saliency map, in contrast, describes per pixel how much changing that pixel would change the score of the classifier. Or, to quote the paper above: "which pixels need to be changed the least to affect the class score the most".
Edit: Added example.
Say a pixel gets a score of 99% for the class "Dog"; we can then be rather certain that this pixel is actually part of a dog. The saliency map, however, can show a low score for the same pixel, which means that changing it slightly would not influence the prediction that it belongs to the class "Dog". In my experience so far, the per-pixel class probability map and the saliency map show somewhat similar patterns, but that does not mean they should be interpreted the same way.
A piece of code I came across that can be applied to pytorch models (from Nikhil Kasukurthi, not mine) can be found on github.
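For a concrete picture of what such code computes, here is a minimal sketch of the gradient-based class saliency map from the paper, written for PyTorch; the pretrained ResNet and the class index are placeholders I've added, not part of the original answer:

```python
import torch
import torchvision.models as models

# Placeholder classifier; any differentiable model with a class score output works.
model = models.resnet18(weights="DEFAULT").eval()

def class_saliency(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: preprocessed (1, 3, H, W) tensor; returns an (H, W) saliency map."""
    image = image.clone().requires_grad_(True)
    score = model(image)[0, class_idx]   # scalar class score
    score.backward()                     # gradient of the score w.r.t. every pixel
    # As in the paper: max over color channels of the absolute gradient.
    return image.grad.abs().max(dim=1).values.squeeze(0)

# Usage (hypothetical): saliency = class_saliency(preprocessed_image, dog_class_idx)
```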
Given are two monochromatic images of the same size. Both are pre-aligned/anchored to one common point. Some points of the original image have moved to new positions in the new image, but not in a linear fashion.
Below you see an overlay of the original (red) and transformed (green) image. What I am looking for is a measure of how much the individual points shifted.
At first I thought of a simple average correlation of the whole matrix or some kind of phase correlation, but I was wondering whether there is a better way of doing so.
I already found that link, but it didn't help much. Currently I am implementing this in Matlab, but that shouldn't really matter, I guess.
Update, for clarity: I have hundreds of these image pairs and I want to compare how similar each pair is. It doesn't have to be the fanciest algorithm; it should rather be easy to implement and yield a good estimate of similarity.
An unorthodox approach uses RASL to align an image pair. A Python implementation is here: https://github.com/welch/rasl, and it also provides a link to the RASL authors' original MATLAB implementation.
You can give RASL a pair of related images, and it will solve for the transformation (scaling, rotation, translation, you choose) that best overlays the pixels in the images. A transformation parameter vector is found for each image, and the difference in parameters tells how "far apart" they are (in terms of transform parameters).
This is not the intended use of RASL, which is designed to align large collections of related images while being indifferent to changes in alignment and illumination. But I just tried it out on a pair of jittered images and it worked quickly and well.
I may add a shell command that explicitly does this (I'm the author of the Python implementation) if I receive encouragement :) (today, you'd need to write a few lines of Python to load your images and return the resulting alignment difference).
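If you would rather not pull in RASL itself, the same idea of comparing images through alignment parameters can be sketched with OpenCV's ECC alignment; to be clear, this is not RASL, just an illustration of the "distance in transform parameters" notion under a Euclidean (rotation + translation) model:

```python
import cv2
import numpy as np

def alignment_distance(img1, img2):
    """Estimate a Euclidean warp taking img2 onto img1 and summarize it as
    (rotation in degrees, translation in pixels)."""
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(img1, img2, warp, cv2.MOTION_EUCLIDEAN,
                                   criteria, None, 5)
    angle = np.degrees(np.arctan2(warp[1, 0], warp[0, 0]))  # rotation component
    shift = float(np.hypot(warp[0, 2], warp[1, 2]))         # translation magnitude
    return angle, shift

# Usage: pass both images as float32 grayscale, e.g.
# cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
```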
You can try using optical flow: http://www.mathworks.com/discovery/optical-flow.html
It is usually used to measure the movement of objects from frame T to frame T+1, but you can also use it in your case. You would get a map that tells you the offset by which each point in Image1 moved to reach its position in Image2.
Then, if you want a metric that gives you a "distance" between the images, you can, for example, average the magnitudes of those per-pixel offsets or something similar.
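A hedged sketch of that idea with OpenCV's dense Farnebäck optical flow, using the mean flow magnitude as the single "distance" number per image pair (file names are placeholders):

```python
import cv2
import numpy as np

# File names are placeholders for one image pair.
img1 = cv2.imread("original.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("transformed.png", cv2.IMREAD_GRAYSCALE)

# Dense flow: a per-pixel (dx, dy) offset from img1 to img2.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(img1, img2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Summarize the shift field as one number.
magnitude = np.hypot(flow[..., 0], flow[..., 1])
print("mean shift (pixels):", magnitude.mean())
```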
I am looking for code or an application that can extract the salient object from a video, considering both context and motion, or an algorithm just for motion saliency detection (motion contrast), so I can fuse it with a context-aware salient object detector that I already have.
I have actually already tested a context-aware saliency detector, but in some frames it detects parts of the background as the salient object. I want to involve motion and time in the detection so that I can extract the salient object as precisely as possible.
Can anyone help me?
One of the most popular approaches (although a bit dated) in the computer vision community is the graph-based visual saliency (GBVS) model.
It uses a graph-based method to compute visual saliency. First, the same feature maps as in the FSM model are extracted, which leads to three multiscale feature maps: colors, intensity, and orientations. Then, a fully connected graph is built over all grid locations of each feature map, and a weight is assigned between each pair of nodes; this weight depends on the spatial distance between the nodes and on the values of the feature map at those nodes. Finally, each graph is treated as a Markov chain to build an activation map, in which nodes that are highly dissimilar to their surrounding nodes are assigned high values. All activation maps are ultimately merged into the final saliency map.
You can find MATLAB source code here: http://www.vision.caltech.edu/~harel/share/gbvs.php
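To make the Markov-chain step concrete, here is a heavily simplified, unofficial Python/NumPy sketch of the activation map for a single small (downsampled) feature map; the real GBVS code at the link above works across multiple scales and feature channels and uses a different dissimilarity measure:

```python
import numpy as np

def activation_map(feature, sigma_frac=0.15):
    """feature: small 2-D feature map (e.g. 32x24). Builds a fully connected graph
    over grid locations, weights edges by feature dissimilarity attenuated by spatial
    distance, and returns the stationary distribution of the resulting Markov chain."""
    h, w = feature.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    val = feature.ravel().astype(float)

    dissim = np.abs(val[:, None] - val[None, :])                # value dissimilarity
    dist2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)  # squared grid distance
    sigma = sigma_frac * max(h, w)
    W = dissim * np.exp(-dist2 / (2.0 * sigma ** 2))            # edge weights

    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)              # Markov transition matrix
    evals, evecs = np.linalg.eig(P.T)                           # stationary distribution =
    stat = np.abs(np.real(evecs[:, np.argmax(np.real(evals))])) # principal left eigenvector
    return (stat / stat.sum()).reshape(h, w)                    # high where nodes stand out
```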
Which Matlab functions or examples should be used to (1) track the distance from a moving object to the stereo (binocular) cameras, and (2) track the centroid (X, Y, Z) of moving objects, ideally in the range of 0.6 m to 6 m from the cameras?
I've used the Matlab example that uses the PeopleDetector function, but this becomes inaccurate when a person is within 2 m, because it begins clipping heads and legs.
The first thing you need to deal with is how to detect the object of interest (I suppose you have already resolved this issue). There are a lot of approaches for detecting moving objects. If your cameras are in a fixed position, you can work with only one camera and use background subtraction to get the objects that appear in the scene (some info here). If your cameras are moving, I think the best approach is to work with the optical flow of the two cameras (instead of using a previous frame to get the flow map, the stereo pair images are used to get the optical flow map in each frame).
In Matlab there is a disparity computation function; this could help you detect the objects in the scene. After this you need to add a stage to extract the objects of interest, for which you can use some thresholds. Once you have the desired objects, put them in a binary mask. On this mask you can use image moments (check this and this) to calculate the centroids. If the binary mask looks noisy, you can use some morphological operations to improve the results (watch this).
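The answer above is phrased for Matlab; purely as an illustration of the background-subtraction → mask → moments → centroid chain it describes, here is a hedged OpenCV/Python sketch (the video file name is a placeholder, and the Z coordinate would still have to come from the disparity map and your stereo calibration):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("left_camera.avi")        # placeholder: footage from one camera
backsub = cv2.createBackgroundSubtractorMOG2()   # background subtraction stage

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = backsub.apply(frame)                            # raw foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,          # morphological clean-up
                            np.ones((5, 5), np.uint8))
    m = cv2.moments(mask, binaryImage=True)                # image moments of the mask
    if m["m00"] > 0:
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # centroid in the image plane
        # Z would be read from the disparity map at (cx, cy) using the stereo calibration.
        print("centroid (x, y):", cx, cy)
cap.release()
```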
I have several images of pugmarks with a lot of irrelevant background. I cannot use intensity-based algorithms to separate the background from the foreground.
I have tried several methods; one of them is detecting an object in a homogeneous-intensity image,
but this does not work with rough-texture images like
http://img803.imageshack.us/img803/4654/p1030076b.jpg
http://imageshack.us/a/img802/5982/cub1.jpg
http://imageshack.us/a/img42/6530/cub2.jpg
There could be three possible methods:
1) If I can reduce the roughness of the image and obtain a smoother texture, i.e. a flatter surface.
2) If I could detect the pugmark-like shape in these images by defining a rough pugmark shape in a database and then removing the background, to obtain an image like http://i.imgur.com/W0MFYmQ.png
3) If I could detect regions with depth and separate them from the background based on differences in their depths.
Please tell me whether any of these methods would work and, if yes, how to implement them.
I have a hunch that this problem could benefit from using polynomial texture maps.
See here: http://www.hpl.hp.com/research/ptm/
You might want to consider top-down information in the process. See, for example, this work.
It looks like you're close enough to the pugmark, so I think you should be able to detect pugmarks using the Viola-Jones algorithm. Maybe a PCA-like algorithm such as Eigenfaces would work too; even if you're not trying to recognize a particular pugmark, it can still be used to tell whether or not there is a pugmark in the image.
Have you tried edge detection on your image? I guess it should be possible to fine-tune the Canny edge detector thresholds to get rid of the noise (if that's not good enough, low-pass filter your image first), and then do shape recognition on what remains (you would then be in the field of geometric feature learning and structural matching). Viola-Jones and possibly a PCA-like algorithm would be my first try, though.
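A minimal sketch of the low-pass-then-Canny step in Python/OpenCV (the kernel size and the two thresholds are only starting points to tune for these textures):

```python
import cv2

img = cv2.imread("pugmark.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Low-pass filter first to suppress the rough texture, then run Canny.
smooth = cv2.GaussianBlur(img, (9, 9), 2.0)
edges = cv2.Canny(smooth, 50, 150)                     # tune these two thresholds

# Whatever remains can then go into contour/shape analysis.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
```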