itext pdfreader not working in unix [duplicate] - itext

I have some code that reads pdf files. The code fails at the line :
iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader() at
iTextSharp.text.pdf.PdfReader.ReadPdf()
I know from other entries that this issue is coming from some invalid formatting in the pdf. However I'm not in a position to tell my users to redo their pdfs. Is there some other way around this issue, that can allow reading of the pdf despite this problem?

If a file doesn't start with %PDF- then there's nothing to fix: the file isn't a PDF file.
However, there may be another problem: maybe you're trying to access a file that has zero length due to some problem while creating the InputStream. Another context in which I've seen this happen, is a PDF loaded from a server, where the server returned a 404 message in HTML instead of a PDF file ;-)
Whenever that exception happens, you should store the bytes somewhere, and examine them. Without those bytes, nobody will be able to give you useful advice.

Related

WOPI corrupting files on edit

I have a WOPI host running in a Blazor server application with all of the .wopitest tests passing for the desired functionality (others skipped).
When I upload a word document, I am able to view the document with no issues. I am also able to edit the document, however when I try and edit the document a second time, I get an error.
The error doesn't appear to be handled and seems to originate in the Office online javascript file.
Error on attempting second edit
Following the error, I am still able to open the document for viewing. It is the same behaviour if I use the 'Editing' button in the Office Online page or directly navigate to the editing page using an edit action url.
Supplementary information:
Using ngrok to debug locally
.NET 6
Using SQLite database for holding file information (including path to file)
Using local folders for storing file contents (e.g. 'data' folder containing all files)
Similar issues with .xslx files beign corrupted upon editing and requiring a 'repair' when opened with Excel. This repair removes cells containing text and indicates that it removes the theme.
Viewing a word document gives the following console errors View document error
The first editing of a word document gives the following console errors Edit document error
I was expecting to be able to repeatedly edit the document.
I tried opening the file in the Desktop version of Word and got the following error Desktop Word recover
Following a recover, the document appears to work as expected in Word (desktop) but still won't open for editing through WOPI.
Turns out it was the way the POST http request body was being saved.
Still not certain what was going wrong but somewhere along the way of writing the stream into a buffer and then saving that to a file corrupted the file.
I suspect the file stream was either truncating or adding a few bytes.
The interesting part being that Office Online was still able to view the file.
This indicates there is some tolerance for malformed files still being served.

Browser's view-source: Can files be "downloaded" this way?

As you probably know, one can view the original response HTML code for any website URL by prefixing it with view-source: in the browser (e.g. view-source:https://www.google.de/).
Now interestingly, this also works for URLs that lead to files with types other than HTML. For instance, view-source:https://d3.7-zip.org/a/7z2107.exe will show the .exe file (here of 7zip) as byte stream (probably interpreted as latin1 or another encoding). You would get a similar result if you downloaded the .exe file normally and then open it in Notepad.
My question is this: When I just manually copy the code view-source: gives me for a .exe file, paste it in Notepad and then save it as .exe, the file is of roughly correct size but corrupted. Can there anything be done to fix this?
(If you wonder why anyone would want to do this, the admittedly exotic case is browser automatization with Selenium, which is not really able to download files normally, for a resource that is protected in such a way that it practically can only be downloaded by real browsers.)
When an application is compiled, there are static references to parts of the executable, calculated as offset in bytes. These can be as broad as the .text and .data sections of the executable, or more low-level like function call addresses and jumps.
If you open an exe in a real disassembler, you'll see that there are hard coded jumps in bytes, function addresses in bytes, etc. When you open exe in text editor, these jumps make the processor start running random code, which causes an exception. That causes Windows to believe its not a valid executable anymore.

what is the simplest way to check if a .pdf file is a valid pdf document using itext?

what is the simplest way to check if a .pdf file is a valid pdf document using itext?
I have written some code to open a .pdf using the itextsharp.dll, and if i try to open a bad .pdf (one that I have edited in notepad++ to corrupt it) then i get a PDFInvalid exception - which is all good (as opposed to no exception if the pdf is valid).
However, I was wondering if there is another way to check for issues with pdf files.
I am about to extract a very large number of pdf documents from a database, and ideally would like to check that they are all valid pdf files - ie at least that they can be open.
Thanks for any replies in advance.

iText form filling missing PDF content

I am running into an odd problem with iText. I have a document with a few fields. On my server, I open the local document, set the fields and send the output of the stamper to the browser.
Works perfectly on my local devel machine.
The pdf generated on the server is missing the PDF contents. I only see the content of the fields I set, the rest is completely blank.
Any tips?
Your application on your local machine respects the bytes of the PDF you're using as a template. Your application on the server doesn't respect those bytes. Maybe you've copied the template using the wrong encoding, making all the binary characters corrupt. Or maybe your application is reading the template using the wrong encoding with the same result.
You can find out by opening your PDF file in a text editor (not inside a PDF viewer). Look for the keyword stream and inspect the bytes that follow this keyword. Do you see the difference? In the PDF produced on your local machine, the bytes look like a normal binary stream. In the PDF produced on your server, the bytes look awkward. For instance: it consists of plenty of question marks.
How to solve: check if the template was copied correctly. If so, check the way you're reading the document. For instance: read the PDF template into a byte array without using iText and write it to a new byte array. Can you reproduce the process of corruption? If so, tweak your application (the one that doesn't involve iText) until you've got the correct encoding.

Invalidpdfexception pdf header signature not found

I have some code that reads pdf files. The code fails at the line :
iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader() at
iTextSharp.text.pdf.PdfReader.ReadPdf()
I know from other entries that this issue is coming from some invalid formatting in the pdf. However I'm not in a position to tell my users to redo their pdfs. Is there some other way around this issue, that can allow reading of the pdf despite this problem?
If a file doesn't start with %PDF- then there's nothing to fix: the file isn't a PDF file.
However, there may be another problem: maybe you're trying to access a file that has zero length due to some problem while creating the InputStream. Another context in which I've seen this happen, is a PDF loaded from a server, where the server returned a 404 message in HTML instead of a PDF file ;-)
Whenever that exception happens, you should store the bytes somewhere, and examine them. Without those bytes, nobody will be able to give you useful advice.