iText form filling missing PDF content

I am running into an odd problem with iText. I have a document with a few form fields. On my server, I open the local document, set the fields, and send the output of the stamper to the browser.
This works perfectly on my local development machine.
The PDF generated on the server is missing the original PDF content. I only see the values of the fields I set; the rest of the page is completely blank.
Any tips?

On your local machine, your application preserves the bytes of the PDF you're using as a template. On the server, it doesn't. Maybe the template was copied to the server using the wrong encoding (for instance, as text rather than as binary), which corrupts the binary data. Or maybe your application reads the template using the wrong encoding, with the same result.
You can find out by opening both PDF files in a text editor (not in a PDF viewer). Look for the keyword stream and inspect the bytes that follow it. In the PDF produced on your local machine, the bytes look like a normal binary stream. In the PDF produced on your server, the bytes look wrong: for instance, they consist largely of question marks.
How to solve it: first check whether the template was copied correctly. If it was, check the way you're reading the document. For instance, read the PDF template into a byte array without using iText and write it to a new byte array. Can you reproduce the corruption? If so, fix that part of your application (the part that doesn't involve iText) until the bytes come through unchanged.
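The byte-copy check described above can be tried outside iText. Here is a minimal Python sketch of the idea; the function names and the sample bytes are made up for illustration, and ASCII is only an assumed example of a "wrong encoding". It compares a binary copy with a copy that passes through text decoding.

```python
import hashlib


def copy_bytes(data: bytes) -> bytes:
    """Binary copy: every byte is preserved."""
    return bytes(data)


def copy_as_text(data: bytes, encoding: str = "ascii") -> bytes:
    """Lossy copy: the bytes are decoded as text and encoded again.

    Bytes that are invalid in the chosen encoding are replaced while
    decoding and written back as '?'.
    """
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace")


def same_bytes(a: bytes, b: bytes) -> bool:
    """Compare two byte strings by their SHA-256 digest."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


# Illustrative stand-in for a template: a header plus a tiny binary stream.
sample = (
    b"%PDF-1.4\n1 0 obj\n<< /Length 4 >>\nstream\n"
    b"\x9c\xff\x00\x81\nendstream\n"
)

good = copy_bytes(sample)
bad = copy_as_text(sample)

# Skip "stream" plus the newline and look at the four stream bytes.
start = bad.index(b"stream") + len(b"stream\n")
print("binary copy intact:", same_bytes(sample, good))
print("text copy intact:  ", same_bytes(sample, bad))
print("stream bytes after text copy:", bad[start:start + 4])
```

If the text copy differs from the original while the binary copy does not, the corruption comes from an encoding step and not from iText.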

Related

Browser's view-source: Can files be "downloaded" this way?

As you probably know, one can view the original response HTML code for any website URL by prefixing it with view-source: in the browser (e.g. view-source:https://www.google.de/).
Interestingly, this also works for URLs that lead to files of types other than HTML. For instance, view-source:https://d3.7-zip.org/a/7z2107.exe will show the .exe file (here of 7zip) as a byte stream (probably interpreted as latin1 or another encoding). You would get a similar result if you downloaded the .exe file normally and then opened it in Notepad.
My question is this: when I manually copy the code that view-source: gives me for a .exe file, paste it into Notepad, and save it as .exe, the file is roughly the correct size but corrupted. Can anything be done to fix this?
(If you wonder why anyone would want to do this, the admittedly exotic case is browser automation with Selenium, which is not really able to download files normally, for a resource that is protected in such a way that it can practically only be downloaded by real browsers.)

When an application is compiled, it contains static references to parts of the executable, calculated as offsets in bytes. These can be as broad as the .text and .data sections of the executable, or more low-level, like function call addresses and jumps.
If you open an exe in a real disassembler, you'll see hard-coded jumps in bytes, function addresses in bytes, and so on. When the exe passes through a text view or a text editor, bytes that are not valid characters in the assumed encoding can be replaced or dropped, so the bytes change and the offsets no longer point to the right places. Jumps would then land in random code and cause an exception, and Windows concludes that the file is not a valid executable anymore.
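To illustrate how a trip through text can move data away from its expected offset, here is a small Python sketch. The bytes and the TARGET marker are made up, and decoding as UTF-8 with replacement characters is only an assumption about what a browser or editor might do with the data.

```python
# Made-up bytes standing in for part of an executable: a few non-text
# bytes followed by a marker that other code expects at a fixed offset.
original = b"MZ\x90\x00\xff\xfe" + b"TARGET"
expected_offset = original.index(b"TARGET")

# Simulate copying through a text view: decode as UTF-8, replacing each
# invalid byte with U+FFFD, then save the text as UTF-8 again.
round_tripped = original.decode("utf-8", errors="replace").encode("utf-8")
actual_offset = round_tripped.index(b"TARGET")

print("offset before:", expected_offset, "length:", len(original))
print("offset after: ", actual_offset, "length:", len(round_tripped))
```

Every replaced byte comes back as a three-byte sequence, so everything after it shifts and the original values are lost. The original bytes cannot be recovered from such a copy.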

itext pdfreader not working in unix [duplicate]

I have some code that reads PDF files. The code fails at this point in the stack trace:
iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader() at
iTextSharp.text.pdf.PdfReader.ReadPdf()
I know from other entries that this issue comes from some invalid formatting in the PDF. However, I'm not in a position to tell my users to redo their PDFs. Is there some other way around this issue that allows reading the PDF despite this problem?

If a file doesn't start with %PDF- then there's nothing to fix: the file isn't a PDF file.
However, there may be another problem: maybe you're trying to read a file that has zero length because something went wrong while creating the InputStream. I've also seen this happen when a PDF was loaded from a server and the server returned a 404 message in HTML instead of a PDF file ;-)
Whenever that exception happens, you should store the bytes somewhere, and examine them. Without those bytes, nobody will be able to give you useful advice.
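A minimal Python sketch of that advice: classify the bytes that were rejected and store them for examination. The function names, the labels, and the file name rejected-input.bin are hypothetical, and the HTML check is only a rough heuristic.

```python
import tempfile
from pathlib import Path


def diagnose_pdf_bytes(data: bytes) -> str:
    """Return a rough label for bytes that were supposed to be a PDF."""
    if not data:
        return "empty"    # e.g. the input stream delivered nothing
    if data.startswith(b"%PDF-"):
        return "pdf"
    head = data[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"     # e.g. an error page instead of the file
    return "unknown"


def save_for_inspection(data: bytes, directory: Path) -> Path:
    """Write the raw bytes to disk in binary mode so they can be examined."""
    target = directory / "rejected-input.bin"
    target.write_bytes(data)
    return target


# Illustrative input: an HTML error page where a PDF was expected.
with tempfile.TemporaryDirectory() as tmp:
    page = b"<html><body>404 Not Found</body></html>"
    label = diagnose_pdf_bytes(page)
    saved = save_for_inspection(page, Path(tmp))
    print(label, saved.name, saved.stat().st_size)
```

In a real application the bytes would be stored somewhere permanent, not in a temporary directory, so they can be attached to a bug report.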

Extract data from many PDF forms

I regularly receive large numbers of the same PDF form. I want to extract the data from them into a text file. I'd like to do this via a script of some sort. I'm working in a UNIX environment.
Is this possible? I've googled my brains out and can't find anything.

Text in PDF is represented by text elements in page content streams. The streams are commonly compressed. If you have the time and resources you can use ISO 32000-1:2008 or Adobe PDF 1.7 specification to build your own PDF parser. Or it may be more practical to use a 3rd party app as an intermediate translation step.
There are utilities that will decode the stream and give you clear text. One option is PDFtk Server which will work in your environment. Another option is to use the Poppler PDF Rendering Library which has a command line utility "pdftotext" useful for searching for strings in PDFs.
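For form data specifically, PDFtk has a dump_data_fields operation that prints each field as a block of "Key: value" lines separated by "---". Below is a rough Python sketch of a script built on that, assuming pdftk is installed and on the PATH. The function names and the sample text are hypothetical, so check the output of your own forms before relying on the parsing.

```python
import subprocess


def parse_data_fields(dump: str) -> dict:
    """Turn dump_data_fields text into {field name: field value}."""
    fields = {}
    record = {}

    def flush():
        # Store the finished record, if it describes a named field.
        if "FieldName" in record:
            fields[record["FieldName"]] = record.get("FieldValue", "")
        record.clear()

    for line in dump.splitlines():
        if line.strip() == "---":      # separator between two fields
            flush()
            continue
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    flush()
    return fields


def read_form(pdf_path: str) -> dict:
    """Run pdftk on one file and parse its field dump (needs pdftk)."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data_fields"],
        capture_output=True, text=True, check=True,
    )
    return parse_data_fields(result.stdout)


# Hypothetical sample of the dump format, used here instead of a real file.
SAMPLE = """---
FieldType: Text
FieldName: name
FieldFlags: 0
FieldValue: Ada
---
FieldType: Text
FieldName: city
FieldFlags: 0
FieldValue: London
"""

values = parse_data_fields(SAMPLE)
print("\t".join(f"{name}={value}" for name, value in values.items()))
```

Calling read_form on each received file in a loop and writing one line per file gives the text file the question asks for.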

How to read pdf table content data?

I have a requirement to read a PDF file that contains only tabular data, like an Excel file. I need to extract the cell values from the given PDF file.
Is this possible using the iText API? If you have something to share, or know of any other solution, please share it.

The PDF format is just a canvas where text and graphics are placed without any structure information. As such there aren't any iText-objects in a PDF file. In each page there will probably be a number of Strings, but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines.
In short: parsing the content of a PDF-file is NOT POSSIBLE with iText.
You can try this! This lets you read PDF pages.

I recently ran into this problem. I wasn't able to make it work with iText.
An alternative solution I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, it preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.
The other issue I ran into was that Adobe only lets you export one file at a time, and I had lots of files. Luckily, Adobe also has a merge function. I ended up merging all the files together, exporting them as one big XML file, and working with that file to generate what I needed.
