I have a situation where I have many images, and I compare them using a specific fuzz factor (say 10%), looking for images that match. Works fine.
However, I sometimes want to compare every image against every other image (e.g. 1000 images). Doing 5000+ ImageMagick compares is way too slow.
Hashing all the files and comparing the hashes 5000 times is lightning fast, but of course only works when the images are identical (no fuzz factor).
I'm wondering if there is some way to produce an ID or fingerprint - or maybe a range of IDs - where I could very quickly determine what images are close enough to each other, and then pay the ImageMagick compare cost only for those likely matches. Ideas or names of existing algorithms/approaches are very welcome.
There are quite a few image hashing algorithms out there. pHash (http://www.phash.org/) is the one that springs to mind first; it is robust to the basic transformations one might apply to an image. If you want to be more sophisticated and roll your own, you can take a pre-trained image classifier such as an ImageNet model (https://www.learnopencv.com/keras-tutorial-using-pre-trained-imagenet-models/), lop off the final layer, and use the output of the penultimate layer as a feature vector. For a small number of images, you can easily do a brute-force nearest-neighbor search. If you have more, you can use Annoy (https://github.com/spotify/annoy) to make the nearest-neighbor search a bit more efficient.
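As a rough illustration of the "cheap fingerprint first, expensive compare second" idea, here is a minimal sketch of a difference hash (dHash). This is a simpler relative of pHash, not pHash itself. It assumes each image has already been decoded into a 2D list of grayscale values; the function names and the `max_distance` threshold are made up for this example and would need tuning against your fuzz factor.

```python
from itertools import combinations


def shrink(gray, width, height):
    """Downscale a 2D list of grayscale values by averaging pixel blocks."""
    src_h, src_w = len(gray), len(gray[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max((y + 1) * src_h // height, y0 + 1)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max((x + 1) * src_w // width, x0 + 1)
            block = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def dhash(gray, size=8):
    """Return a size*size-bit integer: one bit per 'left pixel brighter than right'."""
    small = shrink(gray, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def candidate_pairs(hashes, max_distance=10):
    """Return the name pairs whose hashes are close enough to deserve a full compare."""
    names = sorted(hashes)
    return [(m, n) for m, n in combinations(names, 2)
            if hamming(hashes[m], hashes[n]) <= max_distance]
```

Only the pairs returned by `candidate_pairs` would then be handed to the ImageMagick compare. The pair loop is still quadratic, but each step is a single XOR and bit count rather than a full image comparison.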
I have some questions about making tiff/box files for tesseract 4.
The TrainingTesseract 4.00 document says:
Making Box Files As with base Tesseract, there is a choice between
rendering synthetic training data from fonts, or labeling some
pre-existing images (like ancient manuscripts for example).
But it does not explain how to train with pre-existing images.
I want to train Tesseract 4 (LSTM) for the Persian language. I have some images of ancient manuscripts and want to train with images and their texts instead of fonts, so I can’t use the text2image command. I know that the old-format box files will not work for LSTM training.
How can I make tif/box pairs for Tesseract 4 LSTM training and then label them, and how do the tesseract commands change?
Should I use other tools for generating box files (given that Persian is written right to left)?
Should I fine-tune or train from scratch?
I was struggling just like you, until I found this github repository:
https://github.com/OCR-D/ocrd-train
It will make your life super easy. All you need to do is provide your images in tif format, with each transcription in a text file that has the same name as its image and the extension .gt.txt. It takes care of the rest for you. (You might need to update the Makefile to suit your local machine.)
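Before kicking off training, it can help to check that every image really has a matching transcription. The following is only a sketch of such a check: the helper name is made up, and the directory you pass in is whatever folder your copy of the Makefile expects, which is an assumption you should verify locally.

```python
from pathlib import Path


def unpaired_files(ground_truth_dir):
    """Return (images without text, texts without image) as sorted base names."""
    root = Path(ground_truth_dir)
    images = {p.name[:-len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[:-len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Both returned lists should be empty when the folder is ready for training.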
Whether to train from scratch or fine-tune depends on your language, your data and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and just need to build on it.
All the useful details you might need can be found in this answer
1) Use below command to make lstmbox:
tesseract test.tif test-lstmbox -l eng --psm 6 lstmbox
It will make an lstmbox file for you, but you have to correct the characters in the box file.
2) Training from scratch requires a lot of data, so I suggest fine-tuning is the better option.
I am using Caffe to extract features with the MATLAB wrapper. I have 5011 images as a test data set. I chopped all the layers after 'relu7' in 'deploy.prototxt'. I found that if you feed the same image to matcaffe_demo.m and matcaffe_batch.m, you get different 4096-dim features.
Could someone tell me why?
What is the difference between extracting features from these images one by one with matcaffe_demo.m and extracting them by listing all the images for matcaffe_batch.m?
You can find the answer to this question at caffe github.
Basically, matcaffe_demo is used for classification and averages the results over 10 crops of the input image, while matcaffe_batch uses only a single input.
Moreover, note that these m-files are no longer available in recent caffe versions.
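To see why the two paths give different vectors, here is a small language-neutral sketch in Python (not the original MATLAB code). It assumes the usual ten-crop scheme of four corners plus the centre, each with its horizontal mirror, and it assumes the single-input path uses the centre crop. `extract` stands in for the network's forward pass and is purely hypothetical.

```python
def ten_crops(image, crop):
    """Four corner crops plus the centre crop, each with its horizontal mirror."""
    h, w = len(image), len(image[0])
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = []
    for top, left in offsets:
        patch = [row[left:left + crop] for row in image[top:top + crop]]
        crops.append(patch)
        crops.append([row[::-1] for row in patch])  # mirrored copy
    return crops


def averaged_features(image, crop, extract):
    """Demo-style path: average the feature vectors of all ten crops."""
    feats = [extract(c) for c in ten_crops(image, crop)]
    return [sum(vals) / len(feats) for vals in zip(*feats)]


def center_features(image, crop, extract):
    """Batch-style path (assumed): features of the single centre crop."""
    h, w = len(image), len(image[0])
    top, left = (h - crop) // 2, (w - crop) // 2
    return extract([row[left:left + crop] for row in image[top:top + crop]])
```

Unless the feature extractor is insensitive to cropping and mirroring, the averaged vector and the single-crop vector will differ for the same input image.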
At work I have to record a lot of data from PNG images. Every time, I have to manually type the digits (e.g. mean\SD 101.1\11) into an Excel sheet and then read it with Matlab. Could Matlab read the digits directly from the PNG image, so that a lot of work could be saved?
I know it might involve pattern recognition, but still hope that there may be someone who has done this before.
You can make use of Optical Character Recognition (OCR). The code for it is available here
Is there an efficient way to get a fingerprint of an image for duplicate detection?
That is, given an image file, say a jpg or png, I'd like to be able to quickly calculate a value that identifies the image content and is fairly resilient to other aspects of the image (eg. the image metadata) changing. If it deals with resizing that's even better.
[Update] Regarding the meta-data in jpg files, does anyone know if it's stored in a specific part of the file? I'm looking for an easy way to ignore it - eg. can I skip the first x bytes of the file or take x bytes from the end of the file to ensure I'm not getting meta-data?
Stab in the dark, if you are looking to circumvent meta-data and size related things:
Edge Detection and scale-independent comparison
Sampling and statistical analysis of grayscale/RGB values (average lum, averaged color map)
FFT and other transforms (Good article Classification of Fingerprints using FFT)
And numerous others.
Basically:
Convert JPG/PNG/GIF whatever into an RGB byte array which is independent of encoding
Use a fuzzy pattern classification method to generate a 'hash of the pattern' in the image ... not a hash of the RGB array as some suggest
Then you want a distributed method of fast hash comparison based on matching threshold on the encapsulated hash or encoding of the pattern. Erlang would be good for this :)
Advantages are:
Will, if you use any AI/Training, spot duplicates regardless of encoding, size, aspect, hue and lum modification, dynamic range/subsampling differences and in some cases perspective
Disadvantages:
Can be hard to code .. something like OpenCV might help
Probabilistic ... false positives are likely but can be reduced with neural networks and other AI
Slow unless you can encapsulate pattern qualities and distribute the search (MapReduce style)
Checkout image analysis books such as:
Pattern Classification 2ed
Image Processing Fundamentals
Image Processing - Principles and Applications
And others
If you are scaling the image, then things are simpler. If not, then you have to contend with the fact that scaling is lossy in more ways than sample reduction.
Using the byte size of the image for comparison would be suitable for many applications. Another way would be to:
Strip out the metadata.
Calculate the MD5 (or other suitable hashing algorithm) for the image.
Compare that to the MD5 (or whatever) of the potential dupe image (provided you've stripped out the metadata for that one too)
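For JPEG files, the "strip the metadata, then hash" steps can be sketched without an image library, because metadata such as EXIF, XMP, ICC profiles and comments is stored in APPn and COM marker segments that normally precede the compressed scan data. The sketch below drops every APPn and COM segment and hashes what remains. The function names are made up, the stripped bytes are meant only as hashing input (not as a file to save), and unusual files may need more careful parsing.

```python
import hashlib
import struct


def jpeg_without_metadata(data):
    """Return the JPEG bytes with all APPn (0xE0-0xEF) and COM (0xFE) segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: copy the compressed image data verbatim and stop.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")


def content_digest(data):
    """MD5 of the JPEG content with metadata segments ignored."""
    return hashlib.md5(jpeg_without_metadata(data)).hexdigest()
```

Two files that differ only in their metadata segments then produce the same digest, while any re-encoding or resizing still changes it.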
You could use an algorithm like SIFT (Scale Invariant Feature Transform) to determine key points in the pictures and match these.
See http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
It is used e.g. when stitching images in a panorama to detect matching points in different images.
You want to perform an image hash. Since you didn't specify a particular language I'm guessing you don't have a preference. At the very least there's a Matlab toolbox (beta) that can do it: http://users.ece.utexas.edu/~bevans/projects/hashing/toolbox/index.html. Most of the google results on this are research results rather than actual libraries or tools.
The problem with MD5ing it is that MD5 is very sensitive to small changes in the input, and it sounds like you want to do something a bit "smarter."
Pretty interesting question. The fastest and easiest approach would be to calculate the CRC32 of the content byte array, but that would work only on 100% identical images. For a more intelligent comparison you would probably need some kind of fuzzy logic analysis...
I've implemented at least a trivial version of this. I transform and resize all images to a very small (fixed size) black and white thumbnail. I then compare those. It detects exact, resized, and duplicates transformed to black and white. It gets a lot of duplicates without a lot of cost.
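A minimal sketch of that thumbnail approach might look like the following. It assumes each image is already decoded into a 2D list of (r, g, b) tuples, reads "black and white" as grayscale, and uses an 8x8 thumbnail with a tolerance of 4 gray levels. All of these are illustrative choices, not the values I actually used.

```python
def to_thumbnail(rgb, size=8):
    """Shrink an RGB image to a size x size grayscale thumbnail by block averaging."""
    h, w = len(rgb), len(rgb[0])
    thumb = []
    for ty in range(size):
        y0 = ty * h // size
        y1 = max((ty + 1) * h // size, y0 + 1)
        row = []
        for tx in range(size):
            x0 = tx * w // size
            x1 = max((tx + 1) * w // size, x0 + 1)
            total = 0.0
            for y in range(y0, y1):
                for x in range(x0, x1):
                    r, g, b = rgb[y][x]
                    total += 0.299 * r + 0.587 * g + 0.114 * b  # luma weights
            row.append(total / ((y1 - y0) * (x1 - x0)))
        thumb.append(row)
    return thumb


def thumbnail_distance(a, b):
    """Mean absolute difference between two thumbnails of equal size."""
    diffs = [abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def looks_like_duplicate(rgb_a, rgb_b, size=8, tolerance=4.0):
    """True when the two thumbnails differ by at most `tolerance` gray levels on average."""
    return thumbnail_distance(to_thumbnail(rgb_a, size),
                              to_thumbnail(rgb_b, size)) <= tolerance
```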
The easiest thing to do is to do a hash (like MD5) of the image data, ignoring all other metadata. You can find many open source libraries that can decode common image formats so it's quite easy to strip metadata.
But that doesn't work when the image itself is manipulated in any way, including scaling or rotating.
To do exactly what you want, you have to use Image Watermarking but it's patented and can be expensive.
This is just an idea: Possibly low frequency components present in the DCT of the jpeg could be used as a size invariant identifier.
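To make the idea concrete, here is a rough sketch that keeps only the low-frequency corner of a 2D DCT and turns it into a bit signature. Note the assumptions: it works on a small, fixed-size grayscale grid that has already been decoded and shrunk (for example 32x32), not on the 8x8 block coefficients stored inside the JPEG file, and the function names and the choice of an 8x8 corner are illustrative.

```python
import math


def dct_1d(values):
    """Unnormalised DCT-II of a sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(values))
            for k in range(n)]


def dct_2d(grid):
    """Separable 2D DCT-II: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in grid]
    cols = [dct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def low_frequency_signature(gray, keep=8):
    """One bit per low-frequency coefficient: is it above the median of that corner?"""
    coeffs = dct_2d(gray)
    # Take the keep x keep low-frequency corner and skip the DC term (overall brightness).
    block = [coeffs[v][u] for v in range(keep) for u in range(keep)][1:]
    median = sorted(block)[len(block) // 2]
    return [c > median for c in block]
```

Two signatures can then be compared by counting the positions where they differ, with a small count suggesting the images are likely the same picture at different sizes.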