Sign concatenated PDF in append mode with CERTIFIED_NO_CHANGES_ALLOWED - itext

I tried to sign a PDF in append mode with certification level CERTIFIED_NO_CHANGES_ALLOWED, but certain PDF files are shown as modified, and therefore invalid, in Acrobat.
itext 5.5.6, code:
PdfStamper stp = PdfStamper.createSignature(reader, os,'\0',null,true);
PdfSignatureAppearance app = stp.getSignatureAppearance();
app.setCertificationLevel(PdfSignatureAppearance.CERTIFIED_NO_CHANGES_ALLOWED);
The PDF file was created with wkhtmltopdf and concatenated with itself using pdfunite (CentOS 7).
Here is zip with sample PDF: https://www.dropbox.com/s/lea6r9fup6th44c/test_pdf.zip?dl=0
g.pdf - original PDF
2g.pdf - concatenated version (pdfunite g.pdf g.pdf 2g.pdf)
signed_g.pdf - original signed file, looks OK in Acrobat
signed_2g.pdf - concatenated signed file, appears corrupted in Acrobat
So is this correct behavior, or is something going wrong with Acrobat, pdfunite, itext or myself )))?
Thanks.

Certifying the sample 2g.pdf using the OP's code and verifying the result with tools other than Adobe Reader, one obtains the information that the certification signature is valid.
Something like this (i.e. Adobe Reader complaining about a perfectly valid signature) usually happens with documents which cause Adobe Reader to manipulate the document upon loading. In such a case Adobe Reader checks the signatures in the changed document and, therefore, sees an invalid signature. Such manipulations may especially be repairs of invalid files.
This is also the case here: 2g.pdf is not completely valid (even though in a way that PDF parsers usually ignore). Its cross reference table is segmented into multiple subsections:
xref
0 1
0000000001 65535 f
3 2
0000000015 00000 n
0000000107 00000 n
6 41
0000000146 00000 n
...
0000015682 00000 n
48 14
0000015864 00000 n
...
0000025433 00000 n
66 2
0000025455 00000 n
0000025548 00000 n
69 41
0000025588 00000 n
...
0000041144 00000 n
111 14
0000041327 00000 n
...
0000050929 00000 n
126 4
0000050952 00000 n
0000051004 00000 n
0000051075 00000 n
0000051242 00000 n
But segmented cross reference tables are valid only in case of incremental updates, not in case of initial document revisions, and this document is constructed as an initial revision.
For a file that has never been incrementally updated, the cross-reference section shall contain only one subsection, whose object numbering begins at 0.
(section 7.5.4 Cross-Reference Table of ISO 32000-1)
Thus, this segmented table is invalid.
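The rule quoted above can be checked mechanically. The following is only a minimal sketch, assuming the text of the cross-reference section (from the xref keyword up to the trailer) has already been extracted as a string; the function names are made up for illustration and are not part of iText or any other library.

```python
import re

# A subsection header line holds exactly two integers:
# the first object number and the number of entries.
# Entry lines have a third token (n or f), so they do not match.
SUBSECTION_HEADER = re.compile(r"^(\d+) (\d+)\s*$")


def xref_subsections(xref_text):
    """Return (first_object, entry_count) for every subsection header found."""
    subsections = []
    for line in xref_text.splitlines():
        match = SUBSECTION_HEADER.match(line.strip())
        if match:
            subsections.append((int(match.group(1)), int(match.group(2))))
    return subsections


def is_valid_initial_xref(xref_text):
    """True if the table has a single subsection whose numbering starts at 0."""
    subsections = xref_subsections(xref_text)
    return len(subsections) == 1 and subsections[0][0] == 0
```

For the beginning of the table shown above, xref_subsections reports (0, 1) and (3, 2), so is_valid_initial_xref returns False for a file that has never been incrementally updated.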
So I fixed the cross reference table to contain only one subsection (by adding f(ree) entries for the indices left out): 2g-fix.pdf. And indeed, certifying this document using the OP's code, one gets a certification signature that Adobe Reader (at least version XI, which I currently have installed here) is happy with.
So this is the disadvantage of using incremental updates: One keeps the errors of the original document and has to cope with them...

Related

ITEXT PDFReader not able to read PDF

I am not able to read a PDF file using the iText PdfReader. The PDF appears to be valid when I open it in a viewer.
URL Of PDF: http://www.fundslibrary.co.uk/FundsLibrary.DataRetrieval/Documents.aspx?type=fund_class_kiid&id=f096b13b-3d0e-4580-8d3d-87cf4d002650&user=fidelitydocumentreport
The PDF in question is encrypted.
According to the PDF specification,
Encryption applies to all strings and streams in the document's PDF file, with the following exceptions:
The values for the ID entry in the trailer
Any strings in an Encrypt dictionary
Any strings that are inside streams such as content streams and compressed object streams, which themselves are encrypted
Later on there is information on special cases in which the document-level metadata stream is not encrypted either, or in which only attachments are encrypted.
The Cross-Reference Stream Dictionary of the PDF looks like this:
<<
/Root 101 0 R
/Info 63 0 R
/XRef(stream)
/Encrypt 103 0 R
/ID[<D034DE62220E1CBC2642AC517F0FE9C7><D034DE62220E1CBC2642AC517F0FE9C7>]
/Type/XRef
/W[1 3 2]
/Index[0 107]
/Size 107
/Length 642
>>
As you can see, there is a non-encrypted string here, (stream), which is neither the value for the ID entry, nor in an Encrypt dictionary, nor inside a stream. Furthermore, the aforementioned special cases do not apply here either.
Thus, this file violates the PDF specification here. Therefore, this file is not a valid PDF.
Furthermore, according to the PDF specification
The last line of the file shall contain only the end-of-file marker, %%EOF.
The file at hand ends like this:
Thus, the last line of the file contains something other than the end-of-file marker (which is in the line before): a 0x06 and a 0x0c.
The file, therefore, violates the PDF specification here, too.
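A check for stray bytes after the end-of-file marker could look roughly like the sketch below. It assumes the whole file is available as a bytes object and it tolerates one end-of-line sequence directly after %%EOF, which is a leniency chosen for this example rather than something taken from the specification. The function name is hypothetical.

```python
def trailing_bytes_after_eof(data):
    """Return the bytes following the last %%EOF marker, or None if there is no marker.

    One end-of-line sequence directly after the marker is ignored.
    """
    marker = b"%%EOF"
    position = data.rfind(marker)
    if position < 0:
        return None
    tail = data[position + len(marker):]
    # Skip a single end-of-line sequence that terminates the marker line.
    for eol in (b"\r\n", b"\n", b"\r"):
        if tail.startswith(eol):
            tail = tail[len(eol):]
            break
    return tail
```

An empty result means nothing follows the marker line; for a file like the one discussed here, the result would be the two extra bytes.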

Determine whether file is a PDF in perl?

Using perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of a file. (I've seen some cases where a job control prefix was added to the start of a PDF file, so '%PDF-1.' wasn't the first seven bytes of the file.)
The subsequent implementation note from the third edition (PDF 1.4) states: Acrobat viewers will also accept a header of the form: %!PS-Adobe-N.n PDF-M.m but note that this isn't part of the ISO32000:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. So a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF'.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of a file.
Two lines above the %%EOF should be the 'startxref' token and the line in between should be a number for the byte offset from the start of the file to the last cross reference table.
In sum, read the first and last 1 KB of the file into a byte buffer and check that the relevant identifying byte string tokens are approximately where they are supposed to be. If they are, you have a reasonable expectation that you have a PDF file on your hands.
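The question asks about Perl, but the approach summarized above is language independent, so here is a rough sketch of it in Python. The 1024-byte window and the tokens come from the description above; the function names and the exact regular expressions are assumptions made for this example, and the %!PS-Adobe-N.n PDF-M.m header variant is deliberately not handled.

```python
import os
import re

# Header such as %PDF-1.4 somewhere in the first window.
HEADER = re.compile(rb"%PDF-\d\.\d")
# startxref, a byte offset, then %%EOF somewhere in the last window.
TRAILER = re.compile(rb"startxref\s+\d+\s+%%EOF")


def tokens_present(head, tail):
    """Check the first and last chunk of a file for the PDF marker tokens."""
    return bool(HEADER.search(head)) and bool(TRAILER.search(tail))


def looks_like_pdf(data, window=1024):
    """Heuristic check on a bytes object holding the whole file."""
    return tokens_present(data[:window], data[-window:])


def file_looks_like_pdf(path, window=1024):
    """Same heuristic, reading only the first and last window of the file."""
    with open(path, "rb") as handle:
        head = handle.read(window)
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - window))
        tail = handle.read(window)
    return tokens_present(head, tail)
```

This is only a plausibility test. A file that passes may still be damaged, and a PDF with more than 1 KB of junk before the header or after the trailer will be rejected.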
The module PDF::Parse has a method called IsaPDF, which
Returns true, if the file could be parsed and is a PDF-file.

When stamping document - Danish characters disappear and PDF becomes invalid

I have a PDF generated in Oracle BI Publisher. It contains a graph and some text. When I try to stamp the document with an image, the image gets added, but the Danish characters are destroyed.
I run iText Stamp like this:
static void stampPdf() throws IOException, DocumentException {
    PdfReader reader = new PdfReader(PDF_SOURCE_FILE);
    PdfStamper stamper = new PdfStamper(reader,
            new FileOutputStream(PDF_STAMPED_FILE));
    Image img = Image.getInstance(WATERMARK);
    img.setAbsolutePosition(10, 100);
    // Draw the image underneath the existing content of page 1.
    PdfContentByte under = stamper.getUnderContent(1);
    under.addImage(img);
    stamper.close();
}
As a result, I get the following message: Document invalid. But the document displays, including the added image. The Danish characters have been substituted.
All fonts have been removed from Document properties.
Has anyone seen something like this before? I have done it several times before, without problems.
I have taken a look at the PDF and it's not an iText problem. It's a "Garbage In, Garbage Out" problem. Please open the PDF in Acrobat and analyze it for syntax errors. You'll get the following message:
The content stream of the PDF is wrong in a way that even Acrobat can't analyze it and tell you what is wrong.
So I've looked inside the file, and it looks as if iText can't see the page resources for the page. The page resources refer to the fonts. If iText can't see the page resources, iText can't see the fonts and they get lost in the process.
If Acrobat would allow me to "Analyze and fix", then I could create a fixed PDF and compare what was fixed. But as Acrobat can't fix the file, it's a lot of work to go through the complete file manually to find out what exactly is wrong with it. Out of curiosity, I opened the document in a text editor, and I found this:
4 0 obj
<<
/ProcSet [ /PDF /Text ]
/Font <<
/F1 7 0 R
/F2 8 0 R
/F3 11 0 R
>>
/Shading <<
/grad0 10 0 R
/grad0#2 15 0 R
/grad1#2 17 0 R
/grad2#2 19 0 R
/grad3#2 21 0 R
/grad4#2 23 0 R
/grad5#2 25 0 R
>>
>>
endobj
The problem is caused by the names /grad0#2, /grad1#2, etc... Those aren't valid names. Let me quote from ISO-32000-1:
When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules:
a) A NUMBER SIGN (23h) (#) in a name shall be written by using its 2-digit hexadecimal code (23), preceded by the NUMBER SIGN.
b) Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.
c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
In your case, you have a NUMBER SIGN (#) followed by a 1-digit number. That doesn't make any sense. The PDF is invalid.
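The rule about the NUMBER SIGN is easy to test for. Below is a small sketch, assuming the name token has already been cut out of the file as a string (everything from the SOLIDUS up to the next whitespace or delimiter). The function name is made up for this example and is not an iText API.

```python
import re

# A NUMBER SIGN that is not followed by exactly two hexadecimal digits.
BAD_ESCAPE = re.compile(r"#(?![0-9A-Fa-f]{2})")


def invalid_name_escapes(name):
    """Return the positions of '#' characters that do not start a 2-digit hex code."""
    return [match.start() for match in BAD_ESCAPE.finditer(name)]
```

An empty list means every # in the name introduces a proper 2-digit code; for /grad0#2 the list contains the position of the offending #.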
Long story short: contact the producer of the PDF and ask him to fix the problem or never use his tools again.

Encoding that minimizes misreading / mistyping / misspeaking?

Let's say you have a system in which a fairly long key value can be accurately communicated to a user on-screen, via email or via paper; but the user needs to be able to communicate the key back to you accurately by reading it over the phone, or by reading it and typing it back into some other interface.
What is a "good" way to encode the key to make reading / hearing / typing it easy & accurate?
This could be an invoice number, a document ID, a transaction ID or some other abstract value. Let's say for the sake of this discussion the underlying key value is a big number, say 40 digits in base 10.
Some thoughts:
Shorter keys are generally better
a 40-digit base 10 value may not fit in the space given, and is easy to get lost in the middle of
the same value could be represented in base 16 in 33-34 digits
the same value could be represented in base 36 in 26 digits
the same value could be represented in base 64 in 22-23 digits
Characters that can't be visually confused with each other are better
e.g. an encoding that includes both O (oh) and 0 (zero), or S (ess) and 5 (five), could be bad
This issue depends on the font / face used to display the key, which you may be able to control in some cases (like printing on paper) but can't control in others (like web pages and email).
Also depends on whether you can control the exclusive use of upper and / or lower case -- e.g. capital D (dee) may look like O (oh) but lower case d (dee) would not; while lower case l (ell) looks like a 1 (one) while capital L (ell) would not. (With exceptions for especially exotic fonts / faces).
Characters that can't be verbally / aurally confused with each other are better
a (ay) 8 (eight)
B (bee) C (cee) D (dee) E (ee) g (gee) p (pee) t (tee) v (vee) z (zee) 3 (three)
This issue depends on the audio quality of the end-to-end channel -- bigger challenge if the expected user base could have a speech impediment, or may have to speak through a gas mask, or the communication channel could include CB radios or choppy VOIP phone systems.
Adding a check digit or two would detect errors but not help resolve errors.
An alpha - bravo - charlie - delta type dialog can help with hearing errors, but not reading errors.
Possible choices of encoding:
Base 64 -- compact, but too many hard-to-verbalize characters (underscore, dash etc.)
Base 34 -- 0-9 and A-Z but with O (oh) and I (eye) left out as the easiest to confuse with digits
Base 32 -- same as base 34 but leave out the 0 (zero) and 1 (one) as well
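To make the Base 32 option concrete, here is a minimal sketch of such an encoding, assuming the alphabet is exactly the one described above (digits 2-9 plus A-Z without I and O) and that the key is a non-negative integer. The function names are hypothetical, and this is not any standardized Base 32 variant.

```python
# 8 digits (2-9) plus 24 letters (A-Z without I and O) give 32 symbols.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def encode_key(number):
    """Encode a non-negative integer using the 32-symbol alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    digits = []
    while True:
        number, remainder = divmod(number, 32)
        digits.append(ALPHABET[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))


def decode_key(text):
    """Decode a key; lower-case input is accepted and folded to upper case."""
    value = 0
    for char in text.upper():
        value = value * 32 + ALPHABET.index(char)  # ValueError on unknown symbols
    return value
```

With this alphabet, the largest 40-digit decimal value needs 27 symbols, so it sits between the base 36 and base 16 lengths listed earlier.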
Is there a generally recognized encoding that is a reasonable solution for this scenario?
When I first heard of it, I liked the article A Proposal for Proquints: Identifiers that are Readable, Spellable, and Pronounceable. It encodes data as a sequence of consonants and vowels. It's tied to the English language though. (In German, f and v sound the same, so they should not both be used.) But I like the general idea.
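For a feel of how that works, here is a short sketch of the proquint scheme as I understand the proposal: each 16-bit word becomes five letters (consonant, vowel, consonant, vowel, consonant), using 16 consonants for 4 bits and 4 vowels for 2 bits. The function names and the default of two words (32 bits) are choices made for this example only.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def word_to_quint(word):
    """Encode one 16-bit word as consonant-vowel-consonant-vowel-consonant."""
    return (CONSONANTS[(word >> 12) & 0xF]
            + VOWELS[(word >> 10) & 0x3]
            + CONSONANTS[(word >> 6) & 0xF]
            + VOWELS[(word >> 4) & 0x3]
            + CONSONANTS[word & 0xF])


def to_proquint(value, words=2):
    """Encode an integer as `words` quintets, most significant word first."""
    quints = [word_to_quint((value >> (16 * index)) & 0xFFFF)
              for index in reversed(range(words))]
    return "-".join(quints)


def from_proquint(text):
    """Decode a proquint string back into an integer."""
    value = 0
    for char in text.replace("-", ""):
        if char in CONSONANTS:
            value = (value << 4) | CONSONANTS.index(char)
        else:
            value = (value << 2) | VOWELS.index(char)  # ValueError on other characters
    return value
```

A 40-digit decimal key needs about 133 bits, so it would take nine such quintets (words=9), which is long to read but easy to pronounce.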

How can I generate this hash?

I'm new to programming (just started!) and have hit a wall recently. I am making a fansite for World of Warcraft, and I want to link to a popular site (wowhead.com). The following page shows what I'm trying to figure out: http://www.wowhead.com/?talent#ozxZ0xfcRMhuVurhstVhc0c
From what I understand, the "ozxZ0xfcRMhuVurhstVhc0c" part of the link is a hash. It contains all the information about that particular talent spec on the page, and changes whenever I add or remove points in a talent. I want to be able to recreate this part, so that I can then link my users directly to wowhead to view their talent trees, but I haven't the foggiest idea how to do that. Can anyone provide some guidance?
The first character indicates the class:
0 Druid
c Hunter
o Mage
s Paladin
b Priest
f Rogue
h Shaman
I Warlock
L Warrior
j Death Knight
The remaining characters indicate where in each tree points have been allocated. Each tree is separate, delimited by 'Z'. So if e.g. all the points are in the third tree, then the 2nd and 3rd characters will be "ZZ" indicating "end of first tree" and "end of second tree".
To generate the code for a given tree, split the talents up into pairs, going left-to-right and top-to-bottom. Each pair of talents is represented by a single character. So for example, in the DK's Blood tree segment, the first character will indicate the number of points allocated to Butchery and Subversion, and the second character will stand for Blade Barrier and Bladed Armor.
What character represents each allocation among the pair? I'm sure there's an algorithm, probably based on the ASCII character set, but all I've worked out so far is this lookup table. Find the number of points in the first talent along the top, and the number of points in the second talent along the left side. The encoded character is at the intersection.
    0 1 2 3 4 5
0   0 o b h L x
1   z k d u p t
2   M R r G T g
3   c s f I j e
4   m a w N n v
5   V q i A y E
So if our Death Knight has one point in Butchery and two points in Subversion, the first character is 'R'. If instead we put no points in those two and five in Blade Barrier, the first two characters will be "0x". Trailing '0's (all the other pairs in the tree with no points allocated) can be omitted, as can trailing 'Z' delimiters (when there are no points in the subsequent trees). For one final example, the entire code for a DK with just a single point in Toughness would be "jZ0o": "Death Knight", "End of the first tree", "No points in the first pair of talents", "one point in the first talent of the second pair".
Can anyone work out what function generates the lookup table above? There's probably a clue in the codes for the classes: in alphabetical order (except for the DK which was added to the game after the others), they correspond to a series in the lookup table of (0,0), (0,3), (1,0), (1,3), (2,0), etc.
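The scheme described above can be sketched in a few lines. This is only an illustration built from the lookup table and the examples in this answer, not wowhead's actual code, and all names are made up. It also shows one way to read the table: taken column by column it is a single 36-character string, so the character for a pair is simply the one at index 6 * first + second, and the class codes are every third character of that string.

```python
# Lookup table from the answer: row = points in the second talent,
# column = points in the first talent.
ROWS = ["0obhLx", "zkdupt", "MRrGTg", "csfIje", "mawNnv", "VqiAyE"]

# The same table read column by column; FLAT[6 * first + second] gives the character.
FLAT = "".join(ROWS[second][first] for first in range(6) for second in range(6))


def encode_pair(first, second):
    """Character for a pair of talents with the given point counts (0-5 each)."""
    return ROWS[second][first]


def encode_tree(points):
    """Encode one tree, given its talents left-to-right, top-to-bottom."""
    padded = list(points) + [0] * (len(points) % 2)  # pad an odd talent count
    chars = [encode_pair(padded[i], padded[i + 1])
             for i in range(0, len(padded), 2)]
    return "".join(chars).rstrip("0")  # trailing empty pairs are omitted


def encode_build(class_char, trees):
    """Join the trees with 'Z' and drop trailing delimiters."""
    body = "Z".join(encode_tree(tree) for tree in trees).rstrip("Z")
    return class_char + body
```

For example, encode_build("j", [[], [0, 0, 1], []]) reproduces the "jZ0o" code for the Death Knight with a single point in Toughness. The real site may apply further rules that this sketch does not know about.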
If you go to http://www.wowhead.com/?talent and start using the talent tree, you can see the mysterious code being built up in the address bar as you click on the various boxes. So it's definitely not a hash but some kind of encoded structured data.
As the code is built up as you click, the logic for building the code will be in the JavaScript on that page.
So my advice is do a view source on the page, download the JavaScript files and have a look at them.
I think it isn't a hash value, because hash values are normally one-way values. This means you cannot (easily) restore the original information from which the hash code was generated.
Best thing would be to contact someone from wowhead.com and ask them how to interpret this information. I am sure they will help you out with some information about what type of encoding they use for the parameters. But without any help of the developers from wowhead.com it is almost impossible to figure out what information is encoded into this parameter.
I am not even sure the parameter you mentioned contains the talents of your character. Maybe it's just a session id or something like that. Take a look into the post data your browser sends to the server, it may contain a hidden field with the value you are searching for (you can use Tamper Data Firefox Addon).
I don't think ozxZ0xfcRMhuVurhstVhc0c is a hash value. I think it is a key (probably encrypted/encoded in some way). The server uses this key to retrieve information from its database. Since you don't have access to the database, you don't know which key is needed, let alone how to encode it.
You need the original function that generates the hash.
I don't think that's public though :(
Check this out: hash wikipedia
Good luck learning how to program!
These hashes are hard to 'reverse engineer' unless you know how they were generated.
For example, it could be:
s1 = "random_string-" + score;
hash = encrypt(s1)
...etc
So it is hard to get the original data back from the hash (that is the whole point anyway).
Your best bet would be to link to the profile that would have the latest score, etc.