Can pytesseract use ChoiceIterator to search over multiple matches?

Can pytesseract use ChoiceIterator to search over multiple matches? It seems to me that pytesseract is only an interface to the binary. tesserocr gives access to the Tesseract API, which allows the use of ChoiceIterator; see, for example, "How do I use the Tesseract API to iterate over words?"

pytesseract "wraps" tesseract executable, which does not provide this feature. So you need to use tesserocr or you can use tesseract API via cffi.
You did not specify why you need ChoiceIterator. Maybe have a look at hocr output (which is supported by pytesseract.
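For reference, a minimal sketch of iterating over per-symbol choices with tesserocr; image.png is a hypothetical input file, and the save_blob_choices variable applies to the legacy (non-LSTM) engine:

from tesserocr import PyTessBaseAPI, RIL, iterate_level

with PyTessBaseAPI() as api:
    api.SetImageFile('image.png')  # hypothetical input image
    # needed by the legacy engine so that alternative choices are kept
    api.SetVariable('save_blob_choices', 'T')
    api.Recognize()
    ri = api.GetIterator()
    for r in iterate_level(ri, RIL.SYMBOL):
        print(r.GetUTF8Text(RIL.SYMBOL), r.Confidence(RIL.SYMBOL))
        # the ChoiceIterator exposes the alternative classifications
        # considered for this symbol, with their confidences
        for choice in r.GetChoiceIterator():
            print('\t', choice.GetUTF8Text(), choice.Confidence())

If per-word boxes and confidences are enough (hOCR does not include alternative choices), pytesseract can produce hOCR directly with pytesseract.image_to_pdf_or_hocr('image.png', extension='hocr').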

Related

How can Eclipse binaries search be configured to recognize .srec format?

I see that Eclipse CDT/Embedded has a capability to find binaries such as .elf, .bin, .exe... I was wondering whether this is a configurable setting or not, since I would like it to also recognize the Motorola binary format called .srec.
Any hint on how or where I can add this onto Eclipse CDT?
Thanks in advance,
SREC is an ASCII object file format, so a binary search makes little sense. You could use a text search, but searching for particular binary sequences that span more than one record would be complicated.
What you could do is convert the SREC file to a raw binary file, then use the binary search on that. Conversion to raw binary can for example be done using the SRecord utility, e.g.:
srec_cat myobject.srec -o myobject.bin -binary
If you add that as a post-build step, the binary version of your SREC file will always be available for searching.

Spacy Convert and Train UTF-8 Encoding CLI issues

I am training a NER model on a foreign language that contains a lot of Unicode characters. I create an IOB file and use the CLI spacy convert functionality to make it spaCy-compatible so I can train on that set. However, the file is turned into a us-ascii file format. I saw a thread here saying that spaCy solves this automatically, but how does this work at inference time? I also couldn't find where spaCy loads in the data using ujson anymore.
So my question is: does spaCy handle this automatically? And how do I best feed my text to spaCy at inference time?
The escaped UTF-8 will be read in correctly by the library srsly (which uses a fork of ujson internally). If you're worried, you can double-check with srsly.read_json("file.json"). You provide Python strings (Python 3 str) as input.
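For example, a quick sanity check (train.json is a hypothetical file name; substitute the actual output of spacy convert):

import srsly

# the escaped \uXXXX sequences in the us-ascii file are decoded back
# into ordinary Python str objects on load
data = srsly.read_json('train.json')
print(type(data))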

Does Tesseract correct spelling (UB Mannheim, Windows installation)?

I'm using Tesseract to perform OCR tasks.
Can someone confirm whether Tesseract has an LSTM module for automatic post-OCR spell correction? And if so, what did I do wrong so that it's not triggered?
It appears it does have this module (e.g. see slide 4 in the first link):
https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
However, many example outputs give me misspelled words such as ambuiatory/ambutatory for ambulatory, and sometimes informat for information. It seems that Tesseract picked the characters with the highest likelihood and assembled them, but didn't perform any post-OCR processing.
Here's how I'm using Tesseract.
I followed the UB Mannheim installation:
https://github.com/UB-Mannheim/tesseract/wiki
and installed pytesseract through pip:
https://pypi.org/project/pytesseract/
Here's my sample code (sorry, my image is internal so it cannot be posted):
import pytesseract

# point pytesseract at the UB Mannheim Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\username\AppData\Local\Tesseract-OCR\tesseract'
# path to the input image (internal, so not shown in full)
pth_test = r'C:\Users\username\image_path_continues'
# run OCR with the English model and print the recognized text
print(pytesseract.image_to_string(pth_test, lang='eng'))
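One diagnostic worth trying (a sketch; whether the dictionary is the culprit here is an assumption): Tesseract's wordlist-based correction is controlled by config variables such as load_system_dawg and load_freq_dawg, which are enabled by default. Comparing the normal output against output with the wordlists disabled shows whether the dictionary influences the result at all:

# disable the built-in wordlists ("dawgs") to compare against the
# default, dictionary-assisted output
config = '-c load_system_dawg=0 -c load_freq_dawg=0'
no_dict = pytesseract.image_to_string(pth_test, lang='eng', config=config)
print(no_dict)

If this output is identical to the run above, the dictionary is not being consulted, or is not changing the result.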

Index PDF files and generate keywords summary

I have a large number of PDF files in my local filesystem that I use as a documentation base, and I would like to create an index of these files.
I would like to:
Parse the contents of the PDF files to get keywords.
Select the most relevant keywords to make a summary.
Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are:
Is there an existing tool to perform the whole job?
What is the most appropriate tool to parse PDF file content, filter words (by size), and count them?
I am considering using Perl, swish-e, and pdfgrep to make a script. Do you know of other tools that could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: use an external tool to parse the PDF, process its output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of text processing you'll need and also provides support for working with all kinds of file formats via modules.
As for reading PDF, here are some options if your needs aren't too elaborate:
Use the CAM::PDF (and CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in poppler-utils package)
Use pdftohtml with the -xml option, then read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system.
The following text processing, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that are mentioned take a few lines of code.
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse PDFs, so you may still be better off with your own script.
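To make the shape of the pipeline concrete, here is a minimal sketch (in Python, matching the other snippets on this page; the structure maps one-to-one onto a Perl script using system and regexes). It assumes pdftotext from poppler-utils is on the PATH and a hypothetical docs/ directory of PDFs:

import collections
import html
import pathlib
import re
import subprocess

counts = collections.Counter()
sources = collections.defaultdict(set)
for pdf in pathlib.Path('docs').glob('*.pdf'):
    # '-' makes pdftotext write the extracted text to stdout
    text = subprocess.run(['pdftotext', str(pdf), '-'],
                          capture_output=True, text=True).stdout
    for word in re.findall(r'[A-Za-z]{5,}', text):  # filter by word size
        word = word.lower()
        counts[word] += 1
        sources[word].add(pdf.name)

# write a crude static index: the 50 most frequent keywords, each
# linking to the files it appears in
items = []
for word, n in counts.most_common(50):
    links = ', '.join(f'<a href="docs/{f}">{html.escape(f)}</a>'
                      for f in sorted(sources[word]))
    items.append(f'<li>{html.escape(word)} ({n}): {links}</li>')
pathlib.Path('index.html').write_text('<ul>\n' + '\n'.join(items) + '\n</ul>\n')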

How can I search and replace in a PDF document using Perl?

Does anyone know of a free Perl program (command line preferably), module, or any other way to search and replace text in a PDF file without using it like an editor?
Basically, I want to write a program (in Perl preferably) to automate replacing certain words (e.g. our old address) in a few hundred PDF files. I could use any program that supports command line arguments. I know there are many modules on CPAN that manipulate or create PDFs, but they don't have (that I've seen) any sort of simple search and replace.
Thanks in advance for any and all advice!!!
Take a look at CAM::PDF, more specifically its changeString method.
How did you generate those PDFs in the first place? Searching and replacing in the original sources and regenerating the PDFs seems more viable. Editing PDFs directly can be very difficult, and I'm not aware of any free tools that can do it easily.