What do I get back from Tesseract when I OCR a checkbox (not a form)? - tesseract

We parse a large number of PDFs from many vendors. The PDFs are similar but not identical, and items are not always in exactly the same position on the same page. In some cases we can parse them by extracting the strings from the PDF, where the checkboxes are Unicode characters. However, many vendors do not use Unicode, so the checkbox is an image. These are never forms. If I use iText to OCR the whole document, what does it produce for these checkboxes, so that I can look for that output and tell whether a checkbox is checked? Or am I out of luck, and the only way to get the data into our application is manual entry? Thanks.

Related

Can BIRT read HTML tables from a database and display them dynamically in a PDF report file?

I have a scenario where I have to read HTML data from a database and display it in PDF reports. This HTML data also contains table structures (<table></table> tags) and other HTML elements. Previously we used JasperReports for our reporting needs, but we recently learned that this functionality is not supported in Jasper, so I want to know which reporting tool can do this and can be integrated with Servoy. Does BIRT provide this functionality?
AFAIK none of the well-known reporting tools supports this, although in BIRT it works "somehow", but not well enough to be usable.
The reason for this is simple, I think: A reporting tool would have to incorporate a complete browser engine like WebKit or others to achieve this, because it would have to "understand" the structure for its page-breaking algorithm.
Yes, BIRT has a text element where we can set the display type to HTML. If the HTML table is in a dataset field, you just have to include it in the expression of the text using the "value-of" tag, something like this:
<VALUE-OF format="HTML">row["htmlTableField"]</VALUE-OF>
The PDF output takes such HTML elements into account, including most simple style settings such as background color, text-align, borders etc.
Usually the reports render just fine with HTML.
There are some tricks to displaying HTML correctly in BIRT.
You may use a Dynamic Text element and set it to html or auto.
Here are some tricks for handling free-form text.
Make sure your XML is valid. I recommend replacing line breaks, or you may hit a scenario where the rptdocument will not export.
Also, if possible, keep these in auto layout when using run + render. The page breaks may actually be calculated once on run and again on render, so you might experience page-breaking issues with fixed layout. During the RUN() phase, in the web viewer or the rptdocument, the page may attempt to display all the HTML before breaking a page. Then, when rendering to PDF with fixed layout, the breaks are applied differently.

Extract data from many PDF forms

I regularly receive large numbers of the same PDF form. I want to extract the data from them into a text file. I'd like to do this via a script of some sort. I'm working in a UNIX environment.
Is this possible? I've googled my brains out and can't find anything.
Text in PDF is represented by text elements in page content streams. The streams are commonly compressed. If you have the time and resources you can use ISO 32000-1:2008 or Adobe PDF 1.7 specification to build your own PDF parser. Or it may be more practical to use a 3rd party app as an intermediate translation step.
There are utilities that will decode the stream and give you clear text. One option is PDFtk Server which will work in your environment. Another option is to use the Poppler PDF Rendering Library which has a command line utility "pdftotext" useful for searching for strings in PDFs.
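As a rough sketch of the PDFtk route, assuming the files are fillable (AcroForm) PDFs and that the `pdftk` binary is on the PATH: its `dump_data_fields` operation prints one record per form field, separated by `---` lines, and a short Python script can turn that into name/value pairs. The function names `parse_data_fields` and `read_form` are made up for this illustration. For PDFs that are not real forms, `pdftotext` plus string searching would be the fallback.

```python
import subprocess


def parse_data_fields(dump):
    """Turn `pdftk <file> dump_data_fields` output into {FieldName: FieldValue}."""
    fields = {}
    entry = {}

    def flush():
        # Store the record collected so far, if it describes a named field.
        if "FieldName" in entry:
            fields[entry["FieldName"]] = entry.get("FieldValue", "")
        entry.clear()

    for line in dump.splitlines():
        if line.strip() == "---":  # record separator used by pdftk
            flush()
            continue
        key, sep, value = line.partition(":")
        if sep:
            # Keys such as FieldStateOption can repeat; keep the first one.
            entry.setdefault(key.strip(), value.strip())
    flush()  # the last record has no trailing separator
    return fields


def read_form(pdf_path):
    """Run pdftk on one file (assumes pdftk is installed) and parse its output."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data_fields"],
        capture_output=True, text=True, check=True,
    )
    return parse_data_fields(result.stdout)
```

Looping `read_form` over a directory and writing the dictionaries out as tab-separated lines would give the text file the question asks for.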

ASP.NET web application to convert PDF to Word

Is there a clear and proper process to convert a PDF file into a Word file, with all formatting and images, in an ASP.NET web application?
The best way to do that is by using OCR. It will recognize the text and the images in the PDF file, and then you can save the result to a DOC file. I know a third-party toolkit named LEADTOOLS that should help with your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
Also, you can check their website for more information, or contact their support team.
PDF is a presentational format where all the content is placed by absolute positions. There are no paragraphs or other structured elements (unless it is a Tagged PDF). Technically, you can output every word character by character in any order, and visually it would still look like normal text. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that do text extraction, and some that convert pages to images (obviously, this is not a desired approach for converting into Word).
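To illustrate why absolute positioning makes conversion hard, here is a minimal Python sketch. It assumes some extractor has already produced text fragments as hypothetical `(x, y, text)` tuples, with y growing upward as in PDF user space. The fragments may arrive in any order, so reading order has to be guessed from coordinates. The function name `rebuild_lines` and the tolerance value are invented for this example, and real converters do far more (columns, fonts, tables).

```python
from collections import defaultdict


def rebuild_lines(fragments, line_tolerance=2.0):
    """Group positioned text fragments into lines, top to bottom, left to right.

    fragments: iterable of (x, y, text) tuples; this shape is an assumption.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Crude bucketing: fragments with nearly equal y share a line.
        rows[round(y / line_tolerance)].append((x, text))

    lines = []
    for key in sorted(rows, reverse=True):  # larger y is higher on the page
        words = [text for _, text in sorted(rows[key])]
        lines.append(" ".join(words))
    return lines


# Fragments deliberately listed out of reading order.
page = [(120, 700, "world"), (72, 650, "Second"), (72, 700, "Hello"), (130, 650, "line")]
print(rebuild_lines(page))
```

Even this toy version has to guess where lines begin and end, which is the kind of content recognition the answer refers to.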

Converting large amounts of text and dynamic data into PDF

I have a three page Word document that needs to be converted into PDF. This Word document was given to me as a template to show me what the PDF output should look like. I tried converting this document into PDF, created a PDF form and used iTextSharp to open the form, populate it with data and return it back to the client. This is all great but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden.
My second attempt was to create an MVC 2 View without master page, pass the model to the view, take the HTML representation of the View, pass it over to iTextSharp and render the PDF. The problem here was that iTextSharp failed on some tags (one of them was <hr> tag). I managed to get rid of the problematic tag, but then tables were not rendered properly. Namely, the border attribute was ignored so I ended up with borderless tables. That attempt failed.
I need a suggestion or advice on the most efficient way to create a PDF document in MVC 2 which would be maintainable in the long run. I really don't want my actions to be 200+ lines long. Working directly with the Word document is not the best solution as I have never worked with VSTO so I don't quite know what it would look like to open Word and manipulate text inside of it and add dynamic data and then convert that dynamically into PDF.
Any suggestion is highly welcome.
Best regards!
One thing that I've done in the past is to save the Word file as a DOCX and unzip it, since DOCX is just a renamed zip file. Within the archive, open up /word/document.xml and you'll see your document. There are a lot of weird XML tags in there, but overall you should get a pretty good idea of where your content is. Then just add placeholder text like {FIRST_NAME}, save the file and re-zip.
Then from code you can just perform the same steps, unzipping with something like SharpZipLib or DotNetZip, swapping placeholder copy, re-zipping and then using very simple Word automation to Save-As a PDF.
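The unzip, swap and re-zip steps can be sketched in a few lines. The answer uses .NET libraries; this illustration uses Python's standard `zipfile` module instead, and the function name `fill_docx` is made up. It assumes each placeholder such as {FIRST_NAME} sits intact inside word/document.xml; Word sometimes splits typed text across several runs, in which case a plain string replace will not find it. The final Save-As-PDF step through Word automation is not covered here.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx(template_bytes, values):
    """Return a copy of a DOCX with {NAME} placeholders replaced in document.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for name, value in values.items():
                    # Escape &, < and > so the result stays valid XML.
                    xml = xml.replace("{" + name + "}", escape(str(value)))
                data = xml.encode("utf-8")
            # Every other part of the package is copied unchanged.
            dst.writestr(item.filename, data)
    return out.getvalue()
```

A call would look like `fill_docx(open("template.docx", "rb").read(), {"FIRST_NAME": "Ann"})`, with the returned bytes written to a new .docx file.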
The other route is to fully utilize iTextSharp and actually write Paragraphs and PdfPTables and everything else. It takes a lot longer to set up but would give you the most control.
Q: you say "... but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden"
How do you end up having too much data? If the Word template can "hold" the data in 3 pages, it should fit in 3 PDF pages.
I used to use iTextSharp to create my PDFs, but I also almost always ended up building the PDF document from scratch myself (not really a <200 line solution). Have you considered another library? I recently switched to MigraDoc's PDFSharp. Way simpler to use than iText, lots of examples and docs.
Just my two cents
The Word document object model is quite easy to understand. The document body contains a series of paragraphs and tables. Using the Open XML SDK, you can iterate through each paragraph or table in the Word document and retrieve its content and styles. Then you can generate the PDF document on the fly using the retrieved information. This will work under MVC too.
But if your Word document contains complex elements, this approach will take more time to implement. Also, it only works with Word 2007 and 2010 files.
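The paragraph-and-table structure can be seen without the Open XML SDK. The following Python sketch is an illustration only, not the SDK's API: it walks the children of the body element in word/document.xml with the standard library and collects the text of each paragraph and table. The function name `walk_body` is hypothetical, and styles, images and nested tables are ignored.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element):
    """Concatenate the text of every w:t node below the element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def walk_body(document_xml):
    """Return the top-level blocks of a document as (kind, content) pairs."""
    body = ET.fromstring(document_xml).find(W + "body")
    blocks = []
    for child in body:
        if child.tag == W + "p":
            blocks.append(("paragraph", _text(child)))
        elif child.tag == W + "tbl":
            # One string per table cell, in document order.
            blocks.append(("table", [_text(tc) for tc in child.iter(W + "tc")]))
        # Other children, such as section properties, are skipped.
    return blocks
```

Each returned block could then be handed to whatever PDF library writes the output, one paragraph or table at a time.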
Also, the HTML-to-PDF options currently available in the iTextSharp library work with only a known set of tags, as far as I know.
Another suggestion is to make use of commercially available .NET components. There are a lot of good solutions available, for example Syncfusion.

Printed documentation from Sandcastle

We're using Sandcastle for conceptual documentation and have clients that we would like to give documentation to in a non-CHM, non-HTML form, i.e. printed. It could be Word or PDF, something simple to attach to an email. The use case usually involves someone wanting to send along a topic.
The best we've been able to do is to print from the CHM viewer or to print to PDF from Chrome when viewing the HTML. These have issues: anchor links stop being clickable, images turn black and white, etc.
There's a thread on the SHFB discussions on Codeplex stating that there isn't any known alternative - http://shfb.codeplex.com/discussions/260489. I'm re-posting the question here in hopes to get more input and visibility.
I had the same need some time ago and came to the conclusion that using a CHM to PDF converter is the best recourse. I could not find one that was open-source, though many have trial versions available, and I only needed to convert one document, so that served my needs at the time. Note that trial/demo versions typically add a watermark or a label emblazoned across the page saying "unregistered version" or some such.
A general web search reveals quite a number of candidates. While I cannot vouch for any of them, here are a few that seem reputable: Universal Document Converter, Theta CHM To PDF Converter, Softany CHM to PDF Converter.
2014.07.16 Update
Per @J0e3gan's comment, here is a different online converter (limited to 100MB CHM input) that looks quite promising, though I have not yet had occasion to try it.