I am looking for code or an application that can extract the salient object from a video, considering both context and motion,
or
an algorithm just for motion saliency map detection (motion contrast), so that I can fuse it with a context-aware salient object detector that I already have.
I have already tested a context-aware saliency map detector, but in some frames it detects parts of the background as the salient object. I want to bring motion and time into the detection so that I can extract the exact salient object as accurately as possible.
Can anyone help me?
One of the most popular approaches (although a bit dated) in the computer vision community is the graph-based visual saliency (GBVS) model.
It uses a graph-based method to compute visual saliency. First, the same feature maps as in the FSM model are extracted, which leads to three multiscale feature maps: colors, intensity, and orientations. Then a fully connected graph is built over all grid locations of each feature map, and a weight is assigned between each pair of nodes; this weight depends on the spatial distance between the nodes and on the feature-map values at those nodes. Each graph is then treated as a Markov chain to build an activation map, in which nodes that are highly dissimilar to their surrounding nodes are assigned high values. Finally, all activation maps are merged into the final saliency map.
You can find MATLAB source code here: http://www.vision.caltech.edu/~harel/share/gbvs.php
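A minimal usage sketch, assuming the downloaded package is on the MATLAB path; the gbvs() entry point and the master_map_resized field are the ones used in the package's own demo script, and the image file name is a placeholder:
img = imread('frame_001.png');     % placeholder image
out = gbvs(img);                   % struct containing the computed maps
imshow(out.master_map_resized)     % saliency map resized to the input resolution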
I also know that a saliency map can be seen as a form of image segmentation.
But it has been used very widely for interpretable deep learning (read about Grad-CAM, etc.).
I also came across this paper (http://img.cs.uec.ac.jp/pub/conf16/161011shimok_0.pdf)
which talks about class saliency maps - something that rings a bell when it comes to image segmentation. Please tell me whether this concept exists for image segmentation, or whether I need to read more on this subject.
Class saliency maps, as described in Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, describe per pixel how much changing that pixel will influence the prediction. Hence I see no reason why this could not be applied to image segmentation tasks.
The resulting images from the segmentation task and the saliency map have to be interpreted differently, however. In an image segmentation task the output is a per-pixel prediction of whether or not a pixel belongs to a class, sometimes in the form of a certainty score.
A class saliency map describes per pixel how much changing that pixel would change the score of the classifier. Or, to quote the above paper: "which pixels need to be changed the least to affect the class score the most".
Edit: Added example.
Say that a pixel gets a score of 99% for being of the class "Dog"; we can be rather certain that this pixel actually is part of a dog. The saliency map can still show a low score for this same pixel. This means that changing this pixel slightly would not influence the prediction of that pixel belonging to the class "Dog". In my experience so far, both the per-pixel class probability map and the saliency map show somewhat similar patterns, but this does not mean they are to be interpreted equally.
A piece of code I came across that can be applied to PyTorch models (from Nikhil Kasukurthi, not mine) can be found on GitHub.
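For reference, here is a minimal sketch of the same gradient computation in MATLAB's Deep Learning Toolbox (the linked PyTorch code follows the same idea). The names net (a dlnetwork classifier), img (an image already at the network's input size) and classIdx are placeholders, not something from the post:
function map = classSaliencyMap(net, img, classIdx)
    % Gradient of one class score with respect to the input pixels.
    dlImg = dlarray(single(img), 'SSCB');            % height, width, channels, batch of 1
    grad  = dlfeval(@scoreGradient, net, dlImg, classIdx);
    map   = max(abs(extractdata(grad)), [], 3);      % max over colour channels, as in the paper
end

function grad = scoreGradient(net, dlImg, classIdx)
    scores = predict(net, dlImg);                    % class scores for the image
    grad   = dlgradient(scores(classIdx), dlImg);    % d(class score) / d(pixel)
end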
So I am using MATLAB, and I've managed to modify one of their examples so that I can now plot the flow lines as people walk below (the camera is above a door).
I use Lucas-Kanade optical flow and the Computer Vision Toolbox.
The lines and the tracked points are defined as shown below. These tracked points include cases where the original points haven't changed, so real(tmp(:)) in this case will be zero and those points will be the same as the originally identified feature points.
vel_Lines = [Y(:) X(:) Y(:)+real(tmp(:)) X(:)+imag(tmp(:))];
allTrackedPoints = [Y(:)+real(tmp(:)) X(:)+imag(tmp(:))];
My question is: how can I get only the points that have successfully been tracked over a certain distance? I want to retain only the points for which the change is large enough.
I'm not great with MATLAB's syntax, so I was hoping this would be easy for someone.
I want to get the points that were successfully tracked and correspond to the motion, then cluster these points to determine how many people there are, and then track these sets of points using a multiple-object tracker.
If your camera is not moving, then background subtraction may work better for you than optical flow. See this example.
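A rough sketch of that background-subtraction route (the object names, parameter values, and the video file name are illustrative, not taken from the linked example):
detector = vision.ForegroundDetector('NumGaussians', 3, 'NumTrainingFrames', 40);
blobber  = vision.BlobAnalysis('MinimumBlobArea', 200);
reader   = VideoReader('overhead_door.avi');      % placeholder file name
while hasFrame(reader)
    frame = readFrame(reader);
    mask  = detector(rgb2gray(frame));            % foreground (moving) pixels
    mask  = imopen(mask, strel('square', 3));     % suppress speckle noise
    [areas, centroids, bboxes] = blobber(mask);   % roughly one blob per moving person
end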
You can also use the vision.PeopleDetector object to detect people. See this example.
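Minimal vision.PeopleDetector usage along the lines of that example (the image file name is a placeholder):
peopleDetector   = vision.PeopleDetector;
frame            = imread('door_view.png');
[bboxes, scores] = peopleDetector(frame);                           % detect upright people
annotated = insertObjectAnnotation(frame, 'rectangle', bboxes, scores);
imshow(annotated)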
If you insist on using optical flow, try the Farneback optical flow algorithm, available as of the R2015b release.
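If you do stay with optical flow, here is a sketch of the Farneback route plus the magnitude threshold you asked about (the threshold value and file name are illustrative):
flowModel = opticalFlowFarneback;
reader    = VideoReader('overhead_door.avi');     % placeholder file name
while hasFrame(reader)
    gray  = rgb2gray(readFrame(reader));
    flow  = estimateFlow(flowModel, gray);        % dense flow for the current frame
    moved = flow.Magnitude > 2;                   % keep only pixels that moved far enough
    [y, x] = find(moved);                         % candidate points for clustering/tracking
end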
Which MATLAB functions or examples should be used to (1) track the distance from a moving object to a stereo (binocular) camera pair, and (2) track the centroid (X,Y,Z) of moving objects, ideally in the range of 0.6 m to 6 m from the cameras?
I've used the MATLAB example that uses the PeopleDetector function, but this becomes inaccurate when a person is within 2 m, because it begins clipping heads and legs.
The first thing you need to deal with is how to detect the object of interest (I assume you have resolved this issue). There are a lot of approaches for detecting moving objects. If your cameras stand in a fixed position, you can work with only one camera and use some background subtraction to get the objects that appear in the scene (some info here). If your cameras are moving, I think the best approach is to work with the optical flow of the two cameras (instead of using a previous frame to get the flow map, the stereo pair images are used to get the optical flow map in each frame).
In MATLAB there is an option called disparity computation; this could help you detect the objects in the scene. After this you need to add a stage to extract the objects of interest, for example using some thresholds. Once you have the desired objects, put them into a binary mask. On this mask you can use an image-moments extractor (check this and this) to calculate the centroids. If the objects in the binary mask look noisy, you can use some morphological operations to improve the results (see this). A rough sketch of this pipeline is given below.
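A hedged sketch of that pipeline; stereoParams is assumed to come from a prior calibration (e.g. the Stereo Camera Calibrator app), I1/I2 are a synchronized left/right frame pair, and the threshold values are illustrative only:
[J1, J2]     = rectifyStereoImages(I1, I2, stereoParams);
disparityMap = disparitySGM(rgb2gray(J1), rgb2gray(J2));

mask = disparityMap > 20;                 % keep nearby (high-disparity) pixels; value is illustrative
mask = bwareaopen(mask, 500);             % drop small noisy blobs
mask = imclose(mask, strel('disk', 5));   % morphological clean-up

stats = regionprops(mask, 'Centroid', 'Area');           % centroids via image moments
xyz   = reconstructScene(disparityMap, stereoParams);    % older-style call; recent releases take a reprojection matrix from rectifyStereoImages
c     = round(stats(1).Centroid);
centroid3D = squeeze(xyz(c(2), c(1), :));                % (X,Y,Z) of the first blob's centroid, in calibration units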
Recently I was given a multi-view 3D scanning project to complete within two weeks, and I have searched through books, journals and websites on 3D reconstruction, including the MathWorks examples. I wrote code to track matched points between two images and reconstruct them into a 3D plot. However, despite using the detectSURFFeatures() and extractFeatures() functions, some of the object points are still not tracked. How can I also reconstruct those points in my 3D model?
What you are looking for is called "dense reconstruction". The best way to do this is with calibrated cameras. Then you can rectify the images, compute disparity for every pixel (in theory), and then get 3D world coordinates for every pixel. Please check out this Stereo Calibration and Scene Reconstruction example.
The tracking approach you are using is fine but will only give sparse correspondences. The idea is that you use the best of these to determine the difference in camera orientation between the two images. You can then use the camera orientation to get better matches, and ultimately to produce a dense match which you can use to produce a depth image; a sketch of the sparse-to-pose step follows below.
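A hedged sketch of that sparse-to-pose step, assuming I1 and I2 are the two views converted to grayscale and cameraParams comes from a prior calibration (none of these names are from the original question):
pts1 = detectSURFFeatures(I1);
pts2 = detectSURFFeatures(I2);
[f1, vpts1] = extractFeatures(I1, pts1);
[f2, vpts2] = extractFeatures(I2, pts2);
idx = matchFeatures(f1, f2);                                   % sparse correspondences
m1  = vpts1(idx(:,1));
m2  = vpts2(idx(:,2));
[E, inliers]  = estimateEssentialMatrix(m1, m2, cameraParams); % keep only geometrically consistent matches
[orient, loc] = relativeCameraPose(E, cameraParams, m1(inliers), m2(inliers));  % relative camera orientation and location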
Tracking every point in an image from frame to frame is hard (it's called scene flow), and you won't achieve it by identifying individual features (such as SURF, ORB, FREAK, SIFT, etc.), because these features are by definition 'special' in that they can be clearly identified between images.
If you have access to the Computer Vision Toolbox of MATLAB, you could use its matching functions.
You can start, for example, by checking out this article about disparity and the related MATLAB functions.
In addition, you can read about different matching techniques such as block matching, semi-global block matching, and global optimization procedures, just to name a few keywords. But be aware that the topic of stereo matching is a huge one.
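Just to give a first feel for the two local techniques named above, assuming J1 and J2 are an already rectified grayscale image pair (placeholder names) and the disparity range is illustrative:
dBM  = disparityBM(J1, J2, 'DisparityRange', [0 64]);    % local block matching
dSGM = disparitySGM(J1, J2, 'DisparityRange', [0 64]);   % semi-global matching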
I'm using Itti's saliency map. So given an example image, I can get a saliency map as shown below (compare the saliency map with the color photo):
The problem is that although the algorithm pinpoints roughly where the salient object is, it fails to reliably capture the dimensions of the object. Thus, if I want my program to automatically crop out the most salient object in an image, I can only guess the dimensions based on the shape of the saliency map for the object. This is pretty unreliable, since the salient shape can vary greatly.
Are there more reliable methods to do this?
In order to find a better estimate of the salient objects' boundaries, I suggest you use a foreground extraction algorithm. One popular algorithm for that task is GrabCut. It requires an initial estimate of the object boundaries; for the task at hand, the initial estimate would be the boundaries of the blobs from the saliency map.
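A hedged sketch of that refinement using MATLAB's grabcut function (Image Processing Toolbox, R2018b or later), assuming img is the colour image and salMap the saliency map already resized to the image size:
roi = imbinarize(mat2gray(salMap));            % rough foreground estimate from the salient blobs
roi = imdilate(roi, strel('disk', 15));        % loosen it so the whole object fits inside
L   = superpixels(img, 500);                   % grabcut works on a superpixel label matrix
BW  = grabcut(img, L, roi);                    % refined object mask
stats   = regionprops(BW, 'BoundingBox');
cropped = imcrop(img, stats(1).BoundingBox);   % crop the first detected object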