OCR with fixed-template forms (like a passport)

I am trying to perform OCR with Tesseract. I can convert PDF to text using the Tesseract Java library as expected. My requirements have now been extended a bit: I need to extract metadata from a template-based form (a passport, for example, where the first name, date of birth, etc. are always in fixed places). The input could be either a PDF or an image of the same template form.
I am having a hard time finding any example or article that shows how to achieve this with Tesseract.
So my basic questions are:
Is this possible using Tesseract?
Are there any examples or articles about how to achieve this using Tesseract?
Is there any other software or library that is recommended for this?
Thanks for reading this.
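For a form whose fields sit in fixed places, one common approach (often called zonal OCR) is to describe each field as a rectangle, crop that rectangle from the page image, and run OCR on the crop only. The following is a minimal Java sketch of the cropping step under stated assumptions: the class `TemplateFieldExtractor`, the `Zone` type, and the field names are hypothetical, and the OCR engine is passed in as a plain function so the sketch does not depend on any particular library. With a Tesseract wrapper such as Tess4J, that function would presumably wrap a call like `doOCR(BufferedImage)`; a PDF input would first have to be rendered to an image.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of zonal OCR: crop each known field region, then OCR only that crop. */
class TemplateFieldExtractor {

    /** A field region expressed as fractions (0..1) of the page width and height. */
    static final class Zone {
        final double x, y, width, height;

        Zone(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** Converts the relative zone into pixel coordinates for a given page size. */
        Rectangle toPixels(int pageWidth, int pageHeight) {
            int px = (int) Math.round(x * pageWidth);
            int py = (int) Math.round(y * pageHeight);
            int pw = (int) Math.round(width * pageWidth);
            int ph = (int) Math.round(height * pageHeight);
            // Clamp so the rectangle always stays inside the image.
            px = Math.max(0, Math.min(px, pageWidth - 1));
            py = Math.max(0, Math.min(py, pageHeight - 1));
            pw = Math.max(1, Math.min(pw, pageWidth - px));
            ph = Math.max(1, Math.min(ph, pageHeight - py));
            return new Rectangle(px, py, pw, ph);
        }
    }

    /**
     * Crops every zone of the template from the page and passes each crop
     * to the supplied OCR function (an assumption: any engine can be plugged in).
     */
    static Map<String, String> extract(BufferedImage page,
                                       Map<String, Zone> template,
                                       Function<BufferedImage, String> ocr) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Zone> entry : template.entrySet()) {
            Rectangle r = entry.getValue().toPixels(page.getWidth(), page.getHeight());
            BufferedImage crop = page.getSubimage(r.x, r.y, r.width, r.height);
            fields.put(entry.getKey(), ocr.apply(crop).trim());
        }
        return fields;
    }
}
```

Storing zones as fractions of the page rather than as pixels keeps one template usable for scans at different resolutions. It does assume the scans are not rotated, shifted, or cropped differently; real input usually needs an alignment step first.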

Related

Can I convert .docx Word documents using the DocX .NET Library?

I am currently attempting to convert a couple of .NET desktop applications that I have developed into a web application harnessing AngularJS and RESTful services.
One of the key components of these applications is their ability to generate Word documents on the fly using a .dotx Word template. I am currently exploring the possibility of using a third-party library called DocX to generate these Word documents without resorting to a template.
I guess my question is: Can I use this library to read an existing Word document in .docx format and generate a source code representation of the document? If this is possible could someone point me in the direction of any code samples that I could use? I have looked around and have been unable to find anything that could help me get started.
Generating a code representation of the document and using it with DocX seems like a time-consuming effort to me. Why not use a template instead and fill it with data at runtime?
I have some experience with Docentric, which is a third-party OpenXML toolkit. It features a Word add-in for template design and libraries for document generation and manipulation. It took me less than a week to generate pretty complex documents. If I were in your shoes, I would definitely try some third-party toolkits. They cost money but save time, so do some math and see if they can be useful for you.
It is possible to read an existing Word document in .docx format with the following code:
DocX document = DocX.Load(filename);
However, it is impossible to generate a source code representation of a document.

PDF generation from templated Word documents

I have a Word document (in some template format) containing placeholders for the data to be filled in, and there are several Word documents like this in a directory. When data comes in, I will choose a template (based on some criteria) and fill in the data, and then the documents have to be converted to PDF.
I have been investigating Apache POI for this. If anyone has a good suggestion, it would be much appreciated.
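To make the placeholder step concrete, here is a minimal Java sketch of the substitution logic on plain text. It is only an illustration under assumptions: the `${name}` placeholder syntax and the class `PlaceholderFiller` are hypothetical (the question does not say how placeholders are marked), and it works on a string rather than on a real Word file. In an actual .docx, a placeholder can be split across several text runs, so a library-based solution would have to handle that before a replacement like this applies.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: replaces ${key} placeholders in template text with supplied values. */
class PlaceholderFiller {

    // Assumed placeholder syntax: ${name}. The original question does not specify one.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    static String fill(String templateText, Map<String, String> data) {
        Matcher matcher = PLACEHOLDER.matcher(templateText);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            // Leave unknown placeholders untouched so missing data is easy to spot.
            String value = data.containsKey(key) ? data.get(key) : matcher.group();
            // quoteReplacement keeps '$' and '\' in the value from being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, filling `"Dear ${name}"` with a map containing `name` = `Ann` yields `"Dear Ann"`. The conversion to PDF is a separate step that this sketch does not cover.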
As mbeckish mentioned, you should indicate how you are going to run or automate this. For example, is it a one-off, run by hand, or part of another program (and if so, what programming language do you use)?
If you are trying to automate it, JODReports and Docmosis are tools that can use templates like the ones you describe and can produce PDF. JODReports is free. Docmosis is not, but it has several APIs. Please note I work for the company that develops Docmosis.
Hope that helps.
I've just uploaded this presentation, which presents three approaches for doing this.
Why not use one of the existing virtual PDF printers?

ASP.NET web application to convert PDF to Word

Is there any clear and proper process to convert a PDF file into a Word file, with all formatting and images, in an ASP.NET web application?
The best way to do that is by using OCR. It will recognize the text and the images in the PDF file, and then you can save the result to a DOC file. I know of a third-party toolkit named LEADTOOLS that should help you meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
You can also check their website for more information, or contact their support team.
PDF is a presentational format in which all content is placed at absolute positions. There are no paragraphs or other structural elements (unless it is a Tagged PDF). Technically, you can output every word character by character in any order, and visually it would still look like normal text. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that can extract text, and some that convert pages to images (obviously, this is not a desirable approach for converting to Word).
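The point about absolute positioning can be illustrated with a small Java sketch. It assumes text fragments have already been extracted together with their coordinates (the `Fragment` type and the `ReadingOrder` class are hypothetical, and no particular PDF library is implied) and rebuilds lines by grouping fragments with similar vertical positions and sorting each group from left to right. It also assumes the default PDF coordinate system, where the origin is at the bottom-left, so a larger y value is higher on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: rebuilds text lines from fragments that carry only absolute positions. */
class ReadingOrder {

    /** A piece of text with the position where it was drawn (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into lines (top to bottom) and orders each line left to right. */
    static List<String> toLines(List<Fragment> fragments, double lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Larger y is higher on the page, so sort by descending y.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            // A fragment joins the current row if it is close to that row's first fragment.
            if (last != null && Math.abs(last.get(0).y - f.y) <= lineTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            lines.add(line.toString());
        }
        return lines;
    }
}
```

Even this simple heuristic breaks on multi-column layouts, tables, and rotated text, which is why converters rely on heavier content recognition.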

Editing PDF contents in a UIWebView on iPhone

Hi, I'm working on PDF manipulation.
My requirement is to edit an existing PDF document.
It looks like there is no direct way to do it. I found out that I can edit HTML content using JavaScript.
So, now that my PDF is in a UIWebView, is there any way to convert the PDF document to HTML content?
I have to do it programmatically.
The preferred language is Objective-C, but suggestions in C/C++ are OK too.
Thanks in advance.
You will have to drop down to C if you want to do this. Basically you need to get hold of a CGPDFDocumentRef reference, and through that iterate each CGPDFPageRef. From the page you can get access to the CGPDFContentStreamRef.
From the content stream you can parse out the primitive data that makes up the PDF document. From there, only a good understanding of the PDF document format can help you.
I would advise you to find a commercial tool, hire an experienced contractor, or change your plan. What you have your sights on is a lot of hard work.

Integrating xPDF in an iOS app? (feasibility check)

I am developing an app in which PDF text searching and highlighting is needed. I found that it is very difficult to highlight text in a PDF, so I thought of converting the PDF to HTML and then using JavaScript to search for the string and highlight it. I have actually succeeded in searching and highlighting HTML text using JavaScript. If anyone needs the source code, send me your email ID.
My obstacle is the PDF-to-HTML conversion. I know it is very hard, because PDF is rich text and HTML doesn't support all of its features. In the meantime I found some source code in Python, namely PDFMiner, but without jailbreaking it is hard to use Python on iOS, so I dropped that idea as well.
Now I am looking at xPDF, which is C++-based code to convert PDF to HTML. Has anyone succeeded in integrating xPDF into an iOS app? I want to know whether this is feasible.
Thanks in advance for your thoughtful reply,
Naveen Thunga.
Here you can find an example. It still has some problems, but it is a good start:
https://github.com/KurtCode/PDFKitten