pytesseract results different from tesseract command line results - tesseract

I am trying to convert a scanned page to text using both pytesseract and the tesseract command line on Ubuntu. The results are remarkably different (pytesseract performs way better than the tesseract command line) and I am unable to understand why. I looked at the default values for the parameters and tried altering some of them on the tesseract command line (like psm), but I am unable to get the same result as pytesseract. Because pytesseract lacks proper documentation, I cannot figure out which default parameter values it uses.
Here is my pytesseract code
print(pytesseract.image_to_string(Image.open('test.tiff')))

Looking at the source code of pytesseract, it seems the image is always converted into a .bmp file.
Working with a .bmp file and a psm of 6 at the command line with Tesseract gives the same result as pytesseract.
Also, tesseract can work with uncompressed bmp files only. Hence, if ImageMagick is used to convert .pdf to .bmp, the following will work
convert -density 300 -quality 100 mypdf.pdf BMP3:mypdf.bmp
tesseract mypdf.bmp -psm 6 mypdf txt
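To make the comparison repeatable, the command-line run can be driven from Python with the page segmentation mode set explicitly. This is only a sketch: the helper names `build_tesseract_cmd` and `run_tesseract` and the `legacy_flag` parameter are made up for illustration, and it assumes a `tesseract` executable on the PATH.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, output_base, psm=6, legacy_flag=False):
    """Return the argument list for a Tesseract CLI run with an explicit psm."""
    # Tesseract 3.x spells the option "-psm"; 4.x and later use "--psm".
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, psm_flag, str(psm)]


def run_tesseract(image_path, output_base, psm=6, legacy_flag=False):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_cmd(image_path, output_base, psm, legacy_flag)
    subprocess.run(cmd, check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

On the pytesseract side, the same mode can be requested through the `config` argument, for example `pytesseract.image_to_string(img, config='--psm 6')`, so both runs use the same setting before their output is compared.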

In Tesseract v5.3.0+
pytesseract does not convert images to BMP. You can verify this by commenting out cleanup(f.name) in the save context manager, which is found in the source file /pytesseract/pytesseract.py. The filename of the temp file will also need to be retrieved (pytesseract was saving files in the user's temp files directory, i.e. "[path-to-user]\AppData\Local[file-name]"). I found that what pytesseract actually does to the image is in the prepare function.
Basically, taking the temp file and using that same file with the tesseract command directly will yield the same results.

Related

Is there any way to convert lammps_file.data to Gromacs files (top and gro) or, if not, to CHARMM files (psf and pdb)?

I have a lammps_file.data and I need to convert it to Gromacs files (gro and top) to run my simulations.
Does anyone know how to do this?
Another choice is to convert from lammps to charmm files (psf and pdb). Once I get the charmm files I can just use Topotools to get the gromacs files I need.
Thanks
Indeed, NOW I am trying to do the same myself.
So far, you can use InterMol; it should work fine for converting LAMMPS data files to Gromacs files. Once you install InterMol and create a path to the InterMol converter, you can use a command like:
python2.7 $conv/convert.py --lmp_in topology.data --gromacs -v
CHECK the format of your data file; I am still having problems converting mine.
If you wish to create the psf file, you will need VMD (google it). Open the Tcl terminal and write:
topo readlammpsdata topology.data full
animate write psf topology.psf
The first line loads your LAMMPS data file (assuming you are in the folder where that file is located); the second writes the data out as a CHARMM psf file.
Also, you could try this. In this paper, they provide a tool to convert CHARMM topologies to gromacs here. Thus, you convert to psf, then to gro and top.
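If you need to repeat the psf step for several data files, the two Tcl commands above can be written to a script and run by VMD in text mode. The following is a sketch under assumptions: the helper names are hypothetical, the `vmd` executable is assumed to be on the PATH, and file names are assumed to contain no spaces.

```python
def make_vmd_script(data_file, psf_file, atom_style="full"):
    """Return Tcl text that loads a LAMMPS data file and writes a psf file."""
    return "\n".join([
        "package require topotools",  # makes the topo command available
        f"topo readlammpsdata {data_file} {atom_style}",
        f"animate write psf {psf_file}",
        "quit",
    ]) + "\n"


def vmd_command(script_path, executable="vmd"):
    """Argument list that runs VMD without a display on the given script."""
    return [executable, "-dispdev", "text", "-e", script_path]
```

Write the returned text to a file such as to_psf.tcl, then pass the list from `vmd_command("to_psf.tcl")` to `subprocess.run`.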

load unix executable file to ascii

I am simply trying to load ascii files with two columns of data (spectral data).
They were saved originally as .asc.
I need to open them in a text editor and erase the headers before I can load them into Matlab, but some of them somehow got converted to Unix executable format while keeping the .asc extension. Others are plain text documents with the same extension. I have no idea why files that I handled in the same way, and that share the same extension, ended up as different kinds.
When I use the load command in Matlab, the plain text docs load normally as expected but the ones saved as unix executable kinds give me this error:
Error using load
Unable to read file filename.asc: No such file or directory.
How can I either resave them (still with the same extension) or otherwise load them to be read by Matlab as standard two-column data matrices?
Thanks!
If these are truly plain text files, try renaming the file from xxx.asc to xxx.txt. Then, see if you are able to edit them as desired.
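If editing each file by hand is the bottleneck, the header removal can be scripted before the files reach Matlab. Below is a minimal sketch in Python rather than Matlab. It assumes every data line holds exactly two numbers separated by whitespace or a comma, and it treats every other line as header text. The function names are made up for illustration.

```python
def read_two_column(path):
    """Return a list of (x, y) float pairs, skipping header or malformed lines."""
    rows = []
    # errors="replace" keeps stray non-ASCII bytes from aborting the read.
    with open(path, encoding="ascii", errors="replace") as fh:
        for line in fh:
            parts = line.replace(",", " ").split()
            if len(parts) != 2:
                continue  # not a two-column line
            try:
                rows.append((float(parts[0]), float(parts[1])))
            except ValueError:
                continue  # two tokens, but not numeric: header text
    return rows


def write_two_column(rows, path):
    """Write the pairs back out as tab-separated plain text."""
    with open(path, "w", encoding="ascii") as fh:
        for x, y in rows:
            fh.write(f"{x!r}\t{y!r}\n")
```

The cleaned file is plain two-column text, which is the shape the load command already reads without trouble for the other files.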

ImageMagick crop with row/column in file name only saving last image

I'm attempting to crop an image using ImageMagick and via PowerShell. I can crop the image fine with the following command, and it creates the 2000+ images:
convert -crop 16x16 .\original.png tileOut%d.png
However, I would like to take advantage of ImageMagick's ability to dynamically set the file name.
According to a post on their forums I should be able to run something like the following via a batch file:
convert ^
bigimage.jpg ^
-crop 256x256 ^
-set filename:tile "%%[fx:page.x/256+1]_%%[fx:page.y/256+1]" ^
+repage +adjoin ^
tiled_%%[filename:tile].gif
I shouldn't need to escape the % since I'm running this in PowerShell directly, so I used the following:
convert -crop 16x16 .\original.png -set filename:tile "%[fx:page.x/16+1]_%[fx:page.y/16+1]" +repage +adjoin directory\tiled_%[filename:tile].png
However, when I run this command I end up with one file called tiled_%[filename and another called tiled_45_47.png.
So while it does seem to create the last file, it only creates the one. The first file is 0 bytes in size, but takes up over 8 MB of space on disk, according to the file's properties.
Trying to run the command in a batch file results in the same behavior, which makes me think PowerShell itself isn't the issue, but rather the command is.
According to the documentation +adjoin is required since I want different images. +repage doesn't make much sense to me, but I've kept it in the command since the original had it, and excluding it doesn't seem to change the output. -set filename seems pretty straightforward.
The large size of the first file leads me to believe that all the previous images might be getting added to it. However, the file name also suggests it's getting hung up on the :, but that doesn't appear to be a special character in PowerShell. It's also creating an image for the very last crop. Baffling.
So what am I doing wrong?
Thanks in advance!
EDIT:
PowerShell 5.0.10586.0, on Windows 10.
ImageMagick 6.9.2 Q16 (64-bit)
From the comments, I'm thinking the issue might be with the ImageMagick command.
I'm not using Powershell, but I think you will have more success by specifying your image first, then the crop, then setting the filename:
convert original.png -crop 16x16 -set filename:tile "%[fx:page.x/16+1]_%[fx:page.y/16+1]" +repage "tiled_%[filename:tile].png"
So in the past I was using the following command to crop images, with the %d being automatically converted to a number based upon the sequence.
convert -crop 16x16 .\original.png directory\tileOut%d.png
That works perfectly fine. However, the example provided on that forum had the original file name listed as the first argument to the convert command. Changing my command so that it was listed first results in the expected behavior.
convert .\original.png -crop '16x16' -set 'filename:tile' '%[fx:page.x/16+1]_%[fx:page.y/16+1]' +repage +adjoin 'directory\tiled_%[filename:tile].png'
The use of single quotes in so many locations may not be required, but it works.
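One way to sidestep shell quoting altogether is to build the argument list in a script and hand it to the process directly. The sketch below keeps the input image first, as in the working command. The helper names are hypothetical, and `expected_tile_names` assumes the fx expressions produce 1-based column and row indices, which matches the tiled_45_47.png name seen above.

```python
import math
import os


def build_tile_command(src, out_dir, tile=16, executable="convert"):
    """Argument list that crops src into tiles named by column and row."""
    name_expr = f"%[fx:page.x/{tile}+1]_%[fx:page.y/{tile}+1]"
    return [
        executable,
        src,  # the input image comes before any operators
        "-crop", f"{tile}x{tile}",
        "-set", "filename:tile", name_expr,
        "+repage", "+adjoin",
        os.path.join(out_dir, "tiled_%[filename:tile].png"),
    ]


def expected_tile_names(width, height, tile=16):
    """File names the command should produce for a width x height image."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [f"tiled_{c}_{r}.png"
            for r in range(1, rows + 1)
            for c in range(1, cols + 1)]
```

Passing the list to `subprocess.run(cmd, check=True)` means no shell interprets the % or : characters, and comparing the output directory against `expected_tile_names` is a quick check that every tile was written.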

PDF output from MATLAB and inclusion in LaTeX

I'm printing some figures in MATLAB in PDF form, and can view them fine with the Evince PDF viewer on Fedora 16.
When I try to include them in LaTeX (TeXLive 2011), however, I get an error
!pdfTeX error: /usr/local/texlive/2011/bin/x86_64-linux/pdflatex (file ./caroti
d_amp_mod_log.pdf): xpdf: reading PDF image failed
However, I can take an example PDF image generated in Mathematica and include it just fine, which tells me that the problem is with the PDFs generated by MATLAB and not with PDFs in general.
Might it have something to do with the set(0,'defaultfigurepaperpositionmode','auto') I put in my startup.m file so that pages would auto-fit the images?
EDIT: I just tried using saveas(figure(1), 'filename.pdf') instead of print(figure(1), 'filename.pdf') and it worked fine, but the PaperPositionMode property is ignored. Any way around this?
Finally found the problem. The correct way to print images is to use the print(handle, '-dformat', 'filename') syntax.
So, for PDFs, we need print(figure(1), '-dpdf', 'myfigure'). See the MATLAB documentation on graphics file formats for more information.
Using print(figure(1), 'filename.pdf') still produces a valid PDF for viewing, but it can't be included in LaTeX.
You can try using pdfpages or pgf to include PDF files. However, you need to use pdflatex only, as you are doing right now.

Batch merging image files to pdf files using perl in windows

I have a bunch of image files in this naming format:
313024_Page_1_Image_0001.png
313024_Page_1_Image_0002.png
313025_Page_1_Image_0001.png
313025_Page_1_Image_0002.png
313025_Page_2_Image_0001.png
And I would like to convert the files with the same numbers (pre "Page_") to a single pdf with that name. For example, using the above five files:
313024_Page_1_Image_0001.png
313024_Page_1_Image_0002.png
would merge to 313024.pdf
and
313025_Page_1_Image_0001.png
313025_Page_1_Image_0002.png
313025_Page_2_Image_0001.png
would merge to 313025.pdf
I would like to be able to run this script in Perl in windows.
Thanks in advance,
Jake
ImageMagick includes a convert program that will take PNG files and make PDF files from them, e.g.:
$ convert source.png -compress zip source.pdf
You can also append image files into a larger image file, before converting to PDF:
$ convert {listOfImageFilenames} -append -compress zip verticallyStitchedFilename.pdf
You can run this within a Perl script via system() or through the ImageMagick API (example).
You'll probably need to adjust these calls for the special way that Microsoft Windows does things, but it shouldn't be too hard.
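The part the convert calls do not cover is grouping the files by their leading number. Here is a rough sketch of that step, written in Python rather than Perl (a Perl version would follow the same shape). The function names are made up, and the pattern assumes every file follows the `<number>_Page_<n>_Image_<nnnn>.png` naming shown in the question.

```python
import re
from collections import defaultdict

# Leading document number, page number, image number.
PATTERN = re.compile(r"^(\d+)_Page_(\d+)_Image_(\d+)\.png$")


def group_pages(filenames):
    """Map each leading number to its files, ordered by page then image."""
    groups = defaultdict(list)
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            doc_id, page, image = match.groups()
            groups[doc_id].append((int(page), int(image), name))
    return {doc_id: [name for _, _, name in sorted(items)]
            for doc_id, items in groups.items()}


def build_pdf_commands(filenames, executable="convert"):
    """One convert argument list per output PDF."""
    return [[executable, *pages, "-compress", "zip", f"{doc_id}.pdf"]
            for doc_id, pages in sorted(group_pages(filenames).items())]
```

Each list can be passed to `subprocess.run`. Without `-append`, convert writes each input image as its own page of the PDF; add `-append` before `-compress` if you would rather stitch the images into one tall page.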