How to extract text from images in a folder with Tesseract

My laptop runs Ubuntu. I have a folder called testdata that contains lots of jpg images, and I want to know how to run Tesseract on all of these images and save the output to another folder, e.g. "testresult". The output can be a single txt file containing the text extracted from all images, or one txt file per image containing only that image's text.
For a single image, the command line I know is: tesseract test_01.jpg test_01
Could anyone help me, please?
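One way to do this is to loop over the folder and issue the same single-image command for each file. The following is only a sketch in Python, assuming the tesseract binary is on the PATH; the helper names build_jobs and run_tesseract and the combined file name are made up for illustration. It writes one txt file per image and, optionally, one combined file.

```python
import subprocess
from pathlib import Path


def build_jobs(src_dir, dst_dir, pattern="*.jpg"):
    """Return (image_path, output_base) pairs, one per matching image."""
    src, dst = Path(src_dir), Path(dst_dir)
    return [(img, dst / img.stem) for img in sorted(src.glob(pattern))]


def run_tesseract(src_dir, dst_dir, combined_name=None):
    """OCR every image; optionally concatenate the results into one file."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    jobs = build_jobs(src_dir, dst)
    for img, out_base in jobs:
        # Same call as on the command line: tesseract test_01.jpg test_01
        # (tesseract appends ".txt" to the output base itself).
        subprocess.run(["tesseract", str(img), str(out_base)], check=True)
    if combined_name:
        with open(dst / combined_name, "w", encoding="utf-8") as out:
            for _, out_base in jobs:
                out.write(Path(f"{out_base}.txt").read_text(encoding="utf-8"))
    return jobs
```

Calling run_tesseract("testdata", "testresult", combined_name="all_text.txt") would then produce testresult/test_01.txt and so on, plus the combined file. Note that the "*.jpg" pattern is case-sensitive on Linux, so files ending in .JPG would need a second pattern.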

Related

Full text search on content of documents

I need to store several kinds of files, such as pdf, docx, xlsx, etc.
I also need to run full-text search on those files. For example, take a file named test.pdf that contains three lines:
firt test line
two my test line
three test line
Given an input like 'my', I need to retrieve the file test.pdf.
Can I do this with MongoDB or other tools (AWS, Alfresco)?

.HEIC file to .JPG not identified by PIL.Image

I am running a script to extract the EXIF data from a list of images in a folder that are imported from an iPhone using Python's pillow:
from PIL import Image
image = Image.open(path)
But before anything, some of the pictures need to be converted from iOS' format .HEIC to .jpg. I successfully managed to do so but when I try to open the image that was converted I get the following:
PIL.UnidentifiedImageError: cannot identify image file '.../pictures/IMG_4294.jpg'
See this image comparing the info of two files. The one on the left was converted from .HEIC to .jpg and doesn't work. The one on the right is originally a .jpg and works just fine.
Any thoughts on how I can solve this?
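One possible cause, which the question does not confirm, is that the conversion only changed the file extension and left the HEIC data inside untouched; Pillow without a HEIF plugin cannot identify such a file. A quick way to check is to look at the first bytes of the file. The sketch below uses only the standard library; the function names are made up for illustration.

```python
def sniff_image_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    # JPEG files start with the bytes FF D8 FF.
    if header[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # ISO base media files (HEIC/HEIF among them) carry "ftyp" at offset 4,
    # followed by a four-character brand such as "heic".
    if header[4:8] == b"ftyp":
        brand = header[8:12].decode("ascii", errors="replace")
        return f"iso-bmff ({brand})"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 12 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_image_format(fh.read(12))
```

If sniff_file reports an iso-bmff brand for a file named .jpg, the file was renamed rather than re-encoded and needs an actual conversion step.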

Extracting specific file from zip in matlab

Currently I have a zip file containing several thousand .xml files; extracted, the folder is 1.5 GB in size.
I have a function that matches data with specific files inside this zip file. I then want to read this specific file and extract additional data.
My question:
Is there any way to extract these specific files from the archive without unzipping the entire archive?
The built-in unzip.m function can only unzip the entire archive, so it won't work here. I am thinking I have to use the COM interface or some other approach.
Matlab version: R2013a
While searching for solutions I found this: Read the data of CSV file inside Zip File without extracting the contents in Matlab
But I can't get the code in the answer to work for my situation.
Edit:
Credit to Hoki and Intelk
zipFilename = 'HMDB.zip';
zipJavaFile = java.io.File(zipFilename);
zipFile = org.apache.tools.zip.ZipFile(zipJavaFile);
entries = zipFile.getEntries;
cnt = 1;
% collect the name of every entry in the archive
while entries.hasMoreElements
    tempObj = entries.nextElement;
    file{cnt,1} = tempObj.getName.toCharArray';
    cnt = cnt + 1;
end
% keep only the entries whose names end in .xml
ind = regexp(file, '\.xml$');
ind = find(~cellfun(@isempty, ind));
file = file(ind);
file = cellfun(@(x) fullfile('.', x), file, 'UniformOutput', false);
And not forgetting the
zipFile.close
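For comparison, the same idea of pulling only selected entries out of an archive can be sketched in Python rather than MATLAB, using the standard zipfile module. This is an illustration of the technique, not the approach from the question; the function names are made up.

```python
import zipfile


def extract_matching(zip_path, wanted_names, dest_dir="."):
    """Extract only the archive members listed in wanted_names."""
    wanted = set(wanted_names)
    with zipfile.ZipFile(zip_path) as zf:
        members = [n for n in zf.namelist() if n in wanted]
        zf.extractall(path=dest_dir, members=members)
    return members


def read_member(zip_path, name):
    """Read one member into memory without writing anything to disk."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(name)
```

Only the requested members are decompressed; the rest of the archive is left untouched.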

C# folder and subfolder

After numerous searches, I am here to see if someone has any idea how I should go about tackling this issue.
I have a folder with sub-folders. Each sub-folder contains files of different types, e.g. pdf, png, jpeg, tiff, avi and Word documents.
My goal is to write code in C# that goes into each sub-folder and combines all of its files into one pdf named after the folder. The only exception is that a file such as an avi will not be converted to pdf, in which case I want to be told which folder it is in and, if possible, the file name. I am trying to use a form, so that you can paste in the folder path and also the destination of the created pdf.
Thanks.
To start, create a FolderBrowserDialog to get the root folder. Alternatively, just make a textbox into which you paste the folder name (less preferred, since the first method gives you nicer error handling straight out of the box).
To iterate through the folders, see How to: Iterate Through a Directory Tree
To find the file type, check System.IO.FileInfo.Extension for each file you iterate through. Add those files to a list with the data you need (hint: create a list of objects in which each object holds the data you need, such as path, type, etc.). If it's an avi, don't add it to the list but show a warning (a message box?) instead.
From here the original question gets fuzzy. What exactly do you need in the pdf? Just the filenames and locations, or do you actually want to put the contents of each file into the pdf?
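The iterate-and-classify part of these steps can be sketched as follows. This is written in Python rather than C#, purely to illustrate the logic; the set of convertible extensions and the name plan_merges are assumptions, and the actual PDF creation is left out because the question does not say what should go into the pdf.

```python
import os

# Assumed list of extensions that can be turned into PDF pages.
CONVERTIBLE = {".pdf", ".png", ".jpeg", ".jpg", ".tiff", ".doc", ".docx"}


def plan_merges(root):
    """Map each sub-folder to the files to merge and the files to flag."""
    plan = {}
    for folder, _dirs, files in os.walk(root):
        if folder == root or not files:
            continue
        keep, flagged = [], []
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lower()
            (keep if ext in CONVERTIBLE else flagged).append(name)
        # The merged PDF is named after the folder, as the question asks.
        plan[folder] = {
            "output": os.path.basename(folder) + ".pdf",
            "merge": keep,
            "flagged": flagged,
        }
    return plan
```

Everything under "flagged" (an avi, for example) is what the form would report back to the user along with its folder.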

How to load many images at the same time?

I have a problem loading multiple images at the same time using MATLAB. Could anybody help me?
How about writing a small program?
1. Use 'uigetdir' [http://www.mathworks.com/help/techdoc/ref/uigetdir.html] to let the user select the directory where the image files are.
2. Use 'dir' to determine the names of the files in that directory.
3. Use 'listdlg' to create a list of files in a GUI, with 'SelectionMode' set to 'multiple'.
4. Check the file extension (you can also do this before #3 to show only image files in the list).
5. Count how many image files (N) the user wants to load and plot (the 'length' of the selected filenames).
6. Loop N times through the list of filenames and open each one with the appropriate loader function (determined by the file extension of each file before loading).
7. As you load the data from the files, you can plot it however you like, in a single figure or in several.
Best,
Y.T.