ITEXT PDFReader not able to read PDF - itext

I am not able to read a PDF file using the iText PdfReader. The PDF seems to be valid, since I can open it in a PDF viewer.
URL Of PDF: http://www.fundslibrary.co.uk/FundsLibrary.DataRetrieval/Documents.aspx?type=fund_class_kiid&id=f096b13b-3d0e-4580-8d3d-87cf4d002650&user=fidelitydocumentreport

The PDF in question is encrypted.
According to the PDF specification,
Encryption applies to all strings and streams in the document's PDF file, with the following exceptions:
The values for the ID entry in the trailer
Any strings in an Encrypt dictionary
Any strings that are inside streams such as content streams and compressed object streams, which themselves are encrypted
Later on there is information on special cases in which the document-level metadata stream is not encrypted either, or in which only attachments are encrypted.
The Cross-Reference Stream Dictionary of the PDF looks like this:
<<
/Root 101 0 R
/Info 63 0 R
/XRef(stream)
/Encrypt 103 0 R
/ID[<D034DE62220E1CBC2642AC517F0FE9C7><D034DE62220E1CBC2642AC517F0FE9C7>]
/Type/XRef
/W[1 3 2]
/Index[0 107]
/Size 107
/Length 642
>>
As you can see, there is a non-encrypted string here, (stream), which is neither the value for the ID entry, nor in an Encrypt dictionary, nor inside a stream. Furthermore, the aforementioned special cases do not apply here either.
Thus, this file violates the PDF specification here and is therefore not a valid PDF.
Furthermore, according to the PDF specification
The last line of the file shall contain only the end-of-file marker, %%EOF.
The file at hand ends like this
Thus, the last line of the file contains something other than the end-of-file marker (which is on the line before): a 0x06 byte and a 0x0c byte.
The file, therefore, violates the PDF specification here, too.
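The trailing-bytes observation can be checked mechanically. The following is a minimal sketch, not part of iText; the class and method names are made up for illustration. It assumes the whole file has been read into a byte array and reports whatever follows the last %%EOF marker.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not an iText API.
final class EofTailCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the bytes after the last %%EOF marker, or null if no marker is found. */
    static byte[] bytesAfterLastEof(byte[] pdf) {
        for (int i = pdf.length - EOF_MARKER.length; i >= 0; i--) {
            byte[] window = Arrays.copyOfRange(pdf, i, i + EOF_MARKER.length);
            if (Arrays.equals(window, EOF_MARKER)) {
                return Arrays.copyOfRange(pdf, i + EOF_MARKER.length, pdf.length);
            }
        }
        return null;
    }

    /** True if nothing but CR and LF bytes (or nothing at all) follows the last %%EOF. */
    static boolean endsCleanly(byte[] pdf) {
        byte[] tail = bytesAfterLastEof(pdf);
        if (tail == null) {
            return false;
        }
        for (byte b : tail) {
            if (b != '\r' && b != '\n') {
                return false;
            }
        }
        return true;
    }
}
```

For a file like the one described here, endsCleanly would return false because of the 0x06 and 0x0c bytes after the marker.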

Has anyone ever heard of ASCIIHEX encoding?

This type of encoding is used in SOAP messages.
I'm receiving a message encoded in ASCIIHEX and I have no idea how this encoding actually works, although I have a clear description of the encoding method:
"If this mode is used, every single original byte is encoded as a sequence of two characters representing it in hexadecimal. So, if the original byte was 0x0a, the transmitted bytes are 0x30 and 0x41 (‘0’ and ‘a’ in ASCII)."
The buffer received : "1f8b0800000000000000a58e4d0ac2400c85f78277e811f2e665329975bbae500f2022dd2978ff95715ae82cdcf9415efec823c6710247582d5965c32c65aab0f5fc0a5204c415855e7c190ef61b34710bcdc7486d2bab8a7a4910d022d5e107d211ed345f2f37a103da2ddb1f619ab8acefe7fdb1beb6394998c7dfbde3dcac3acf3f399f3eeae152012e010000"
The actual file contains this : "63CD13C1697540000000662534034000030000120011084173878R 00000001000018600050000000100460000009404872101367219 000000000000 DNSO_038114 000000002001160023Replacem000000333168625 N0000 00000000"
The provider sent me the file that contains the string above. I tried to start from the buffer string and arrive at the same content as the file sent by the provider, but without success. I also searched for this "asciihex" encoding and found nothing. If someone knows anything about this encoding or can give me any advice, I would really appreciate it. I have pretty much no experience with SOAP services.
Based on the comments above, it's possible the buffer is compressed. It starts with 1F 8B which is a signature for GZIP compression. See the following list of signatures.
Write the bytes that correspond to the hex strings into a file. Name that file with a gz or tar.gz extension and try to extract it or open it with some file archiver tool.
Another thing you could try would be to not send the Compress element in your request, assuming it's an optional field and you can do that. If you can, check if the buffer changes and has the proper length and you can see similar patterns as the original content (for those zeros at the end, for example).
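The first suggestion (turn the hex text back into bytes, then treat those bytes as GZIP data) can be sketched in Java as follows. This is only an illustration under the assumption that the buffer really is a complete GZIP stream; the class and method names are invented here, and nothing is claimed about what the buffer in the question actually decompresses to.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only.
final class AsciiHexGzip {
    /** Turns a string such as "1f8b08" into the raw bytes 0x1f 0x8b 0x08. */
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Checks for the two-byte GZIP signature 1F 8B mentioned above. */
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    /** Decompresses GZIP data; throws IOException if the stream is corrupt or truncated. */
    static byte[] gunzip(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Calling gunzip(hexToBytes(buffer)) on the received buffer and comparing the result with the provider's file would show whether the compression theory holds.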

Reading file names inside .zip file

I am familiar with the .zip file format, and able to read the internal file table content so far.
The problem occurs with non-English characters in the file name.
The specification states that file names use the OEM character set, yet sometimes I get a UTF-8 representation and sometimes an OEM representation.
The specification states that the "version made by" field should be in the range 0-20, yet I get versions 31 and 63, which may or may not affect the character set.
Another related problem: when I read the "extra field" there is an "up" record (Unicode path, id=0x7075), which is supposed to store the UTF-8 representation of the file name. However, it starts with 5 seemingly redundant bytes before the actual UTF-8 string (in an archive created by WinRAR), yet other software seems to read it correctly.
Any input about the issue?
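For reference, the ZIP application note (APPNOTE) defines bit 11 of the general purpose bit flag as the marker that the file name is UTF-8, and it describes the 0x7075 record as a 1-byte version, a 4-byte CRC-32 of the header file name, and then the UTF-8 name. That accounts for 5 bytes in front of the string. The sketch below shows how such a record could be read; the class and method names are made up, and it assumes the raw extra-field bytes of one entry are already available.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ZipUnicodePath {
    static final int UNICODE_PATH_ID = 0x7075; // stored little-endian, so it reads as "up"
    static final int UTF8_FLAG = 0x0800;       // general purpose bit 11

    /** True if the general purpose flags say the header file name is UTF-8. */
    static boolean nameIsUtf8(int generalPurposeFlags) {
        return (generalPurposeFlags & UTF8_FLAG) != 0;
    }

    /** Returns the name from a 0x7075 record in the extra field, or null if there is none. */
    static String unicodePath(byte[] extra) {
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xffff;
            int size = buf.getShort() & 0xffff;
            if (size > buf.remaining()) {
                return null; // truncated record
            }
            if (id == UNICODE_PATH_ID && size >= 5) {
                buf.get();    // version byte, skipped in this sketch
                buf.getInt(); // CRC-32 of the header file name, not verified in this sketch
                byte[] name = new byte[size - 5];
                buf.get(name);
                return new String(name, StandardCharsets.UTF_8);
            }
            buf.position(buf.position() + size); // skip unrelated records
        }
        return null;
    }
}
```

A stricter reader would compare the stored CRC-32 with the CRC-32 of the header file name and ignore the record on a mismatch, as the application note recommends.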

Determine whether file is a PDF in perl?

Using perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of a file. (I've seen some cases where a job control prefix was added to the start of a PDF file, so '%PDF-1.' wasn't the first seven bytes of the file.)
The subsequent implementation note from the third edition (PDF 1.4) states: Acrobat viewers will also accept a header of the form: %!PS-Adobe-N.n PDF-M.m but note that this isn't part of the ISO 32000-1:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. So a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF'.
The third edition of the PDF Reference has an implementation note that Acrobat viewer requires only that the %%EOF marker appears within the last 1024 bytes of a file.
Two lines above the %%EOF should be the 'startxref' token and the line in between should be a number for the byte offset from the start of the file to the last cross reference table.
In sum, read the first and last 1 KB of the file into a byte buffer and check that the relevant identifying byte string tokens are approximately where they are supposed to be. If they are, you have a reasonable expectation that you have a PDF file on your hands.
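The question asks about Perl, but the check itself is language-neutral. Here is a minimal sketch of it in Java, with invented names and the simplifying assumption that the whole file is already in a byte array. It looks for "%PDF-" in the first 1024 bytes and for "startxref" followed by "%%EOF" in the last 1024 bytes; it does not handle the %!PS-Adobe header variant.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class PdfSniffer {
    private static final int WINDOW = 1024;
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] STARTXREF = "startxref".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Index of the first occurrence of token fully inside data[from, to), or -1. */
    static int indexOf(byte[] data, byte[] token, int from, int to) {
        int end = Math.min(to, data.length);
        outer:
        for (int i = Math.max(from, 0); i + token.length <= end; i++) {
            for (int j = 0; j < token.length; j++) {
                if (data[i + j] != token[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Heuristic only: a true result means "probably a PDF", not "a valid PDF". */
    static boolean looksLikePdf(byte[] file) {
        int headEnd = Math.min(file.length, WINDOW);
        int tailStart = Math.max(0, file.length - WINDOW);
        if (indexOf(file, HEADER, 0, headEnd) < 0) {
            return false;
        }
        int xref = indexOf(file, STARTXREF, tailStart, file.length);
        if (xref < 0) {
            return false;
        }
        // %%EOF has to come after the startxref token.
        return indexOf(file, EOF, xref + STARTXREF.length, file.length) >= 0;
    }
}
```

For large files it would be enough to read only the first and last 1024 bytes from disk instead of loading everything.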
The module PDF::Parse has a method called IsaPDF, which
Returns true, if the file could be parsed and is a PDF-file.

When stamping document - Danish characters disappear and PDF becomes invalid

I have a PDF generated in Oracle BI Publisher. It contains a graph and some text. When I try to stamp the document with an image, the image gets added, but the Danish characters are destroyed.
I run the iText stamper like this:
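static void stampPdf() throws IOException, DocumentException {
    // Open the source PDF and a stamper that writes the modified copy.
    PdfReader reader = new PdfReader(PDF_SOURCE_FILE);
    PdfStamper stamper = new PdfStamper(reader,
            new FileOutputStream(PDF_STAMPED_FILE));
    // Load the watermark image and position it on the page.
    Image img = Image.getInstance(WATERMARK);
    img.setAbsolutePosition(10, 100);
    // Add the image to the layer underneath the existing content of page 1.
    PdfContentByte under = stamper.getUnderContent(1);
    under.addImage(img);
    // Closing the stamper writes the stamped file.
    stamper.close();
}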
As a result, I get the following message: Document invalid. But the document displays, including the added image. The Danish characters have been substituted.
All fonts have been removed from the Document properties.
Has anyone seen something like this before? I have done this several times before without problems.
I have taken a look at the PDF and it's not an iText problem. It's a "Garbage In, Garbage Out" problem. Please open the PDF in Acrobat and analyze it for syntax errors. You'll get the following message:
The content stream of the PDF is wrong in a way that even Acrobat can't analyze it and tell you what is wrong.
So I've looked inside the file, and it looks as if iText can't see the page resources for the page. The page resources refer to the fonts. If iText can't see the page resources, iText can't see the fonts and they get lost in the process.
If Acrobat would allow me to "Analyze and fix", then I could create a fixed PDF and compare what was fixed. But as Acrobat can't fix the file, it's a lot of work to go through the complete file manually to find out what exactly is wrong with it. Out of curiosity, I opened the document in a text editor, and I found this:
4 0 obj
<<
/ProcSet [ /PDF /Text ]
/Font <<
/F1 7 0 R
/F2 8 0 R
/F3 11 0 R
>>
/Shading <<
/grad0 10 0 R
/grad0#2 15 0 R
/grad1#2 17 0 R
/grad2#2 19 0 R
/grad3#2 21 0 R
/grad4#2 23 0 R
/grad5#2 25 0 R
>>
>>
endobj
The problem is caused by the names /grad0#2, /grad1#2, etc... Those aren't valid names. Let me quote from ISO-32000-1:
When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used
to introduce a name. The SOLIDUS is not part of the name but is a
prefix indicating that what follows is a sequence of characters
representing the name in the PDF file and shall follow these rules:
a) A NUMBER SIGN (23h) (#) in a name shall be written by using its
2-digit hexadecimal code (23), preceded by the NUMBER SIGN.
b) Any character in a name that is a regular character (other than NUMBER
SIGN) shall be written as itself or by using its 2-digit hexadecimal
code, preceded by the NUMBER SIGN.
c) Any character that is not a
regular character shall be written using its 2-digit hexadecimal code,
preceded by the NUMBER SIGN only.
In your case, you have a NUMBER SIGN (#) followed by a 1-digit number. That doesn't make any sense. The PDF is invalid.
Long story short: contact the producer of the PDF and ask him to fix the problem or never use his tools again.
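The rule quoted above (a NUMBER SIGN must be followed by a 2-digit hexadecimal code) is easy to test for. The following is a small sketch with invented names, not an iText API; it takes a name token as it appears in the file, without the leading SOLIDUS, and checks only the escape syntax.

```java
// Hypothetical helper for illustration only.
final class PdfNameCheck {
    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** True if every '#' in the name token is followed by exactly two hex digits. */
    static boolean hasValidEscapes(String nameToken) {
        for (int i = 0; i < nameToken.length(); i++) {
            if (nameToken.charAt(i) != '#') {
                continue;
            }
            if (i + 2 >= nameToken.length()
                    || !isHexDigit(nameToken.charAt(i + 1))
                    || !isHexDigit(nameToken.charAt(i + 2))) {
                return false;
            }
            i += 2; // skip the two hex digits that were just checked
        }
        return true;
    }
}
```

With this check, hasValidEscapes("grad0") is true while hasValidEscapes("grad0#2") is false, because only one character follows the NUMBER SIGN.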

onMetaData marker in FLV file

I want to know what the onMetaData marker in FLV files looks like. When I open FLV files as plain text I get this:
FLV[][][][][](TAB)[][][][][][][]8[][][][][][][][][]
onMetaData[]
duration...
The docs say the first 3 bytes are the signature "FLV". The next byte is the FLV version, and the byte after that tells us whether audio or video tags are present. The next 4 bytes are the data offset (the size of the header), which is 9; in ASCII that is the TAB code. After the TAB the body starts with the first "previous tag size" field, which is 0 (4 bytes). Next come the tag type (1 byte), the data size (3 bytes), the timestamp (4 bytes) and the stream ID (always 0, 3 bytes). After that, this remains:
[]
onMetaData[]
[][][][][][]
duration...
I suppose the onMetaData marker is: 1 byte, newline, "onMetaData", 1 byte, newline. But what are the 7 bytes between the onMetaData marker and duration?
You would need to view this file in a hex editor to get anything useful from it; a text editor will just show you unprintable characters.
The ASCII "onMetaData" bit in the file is the tag header, which is wrapping the "duration" field. The three bytes immediately after "onMetaData" are the BodyLength of the tag (uint24, big-endian), and the next 4 bytes ("\x00\x00\x00\x08") describe the length of the name for the next tag, which is "duration."
I suggest using the hexedit tool: http://www.hexedit.com/
It lets you see the data in string format next to the raw bytes,
and it has very nice navigation for analyzing bytes.
In addition, use https://www.adobe.com/content/dam/Adobe/en/devnet/flv/pdfs/video_file_format_spec_v10.pdf to get details about all the bytes in an FLV file.
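Remember that the metadata is encoded using AMF. This means that after the string "onMetaData" you have a 0x08 byte marking the start of an ECMA array, then 4 bytes giving the approximate number of elements in the array, and then 2 bytes giving the length, in bytes, of the first element's name. Together these are the 7 bytes between "onMetaData" and "duration".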
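As a rough illustration of that layout, the sketch below walks the start of a script data tag body as the FLV specification describes it: a 0x02 string marker, a 2-byte length, the name, a 0x08 ECMA array marker, a 4-byte element count, and then the first key as a 2-byte length plus characters. The class and method names are invented, and the input is assumed to be the tag payload, that is, the bytes after the 11-byte tag header.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class FlvMetaStart {
    /** Reads a 2-byte big-endian length followed by that many bytes as a string. */
    static String readString(ByteBuffer buf) {
        int len = buf.getShort() & 0xffff;
        byte[] raw = new byte[len];
        buf.get(raw);
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Returns { tag name, ECMA array element count, name of the first key }. */
    static String[] describe(byte[] body) {
        ByteBuffer buf = ByteBuffer.wrap(body); // ByteBuffer is big-endian by default
        if (buf.get() != 0x02) {
            throw new IllegalArgumentException("expected AMF string marker 0x02");
        }
        String name = readString(buf);          // "onMetaData" in a metadata tag
        if (buf.get() != 0x08) {
            throw new IllegalArgumentException("expected ECMA array marker 0x08");
        }
        long count = buf.getInt() & 0xffffffffL; // approximate number of elements
        String firstKey = readString(buf);       // for example "duration"
        return new String[] { name, Long.toString(count), firstKey };
    }
}
```

The length byte 0x0A in front of "onMetaData" is also the ASCII newline, which is why a text editor shows a line break there.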