Determine whether a file is a PDF in Perl?

Using perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528

Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of the file. (I've seen cases where a job-control prefix was added to the start of a PDF file, so '%PDF-1.' wasn't the first seven bytes of the file.)
The subsequent implementation note from the third edition (PDF 1.4) states that Acrobat viewers will also accept a header of the form %!PS-Adobe-N.n PDF-M.m, but note that this isn't part of the ISO 32000-1:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. So a check for the PDF file trailer is a good idea as well.
The end of a PDF will contain a line with '%%EOF'. The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of the file.
Two lines above the %%EOF there should be the 'startxref' token, and the line in between should be a number giving the byte offset from the start of the file to the last cross-reference table.
In sum: read the first and last 1 KB of the file into byte buffers, check that the identifying byte-string tokens are approximately where they are supposed to be, and if they are, you have a reasonable expectation that you have a PDF file on your hands.
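The question asks for Perl, but the byte-level logic is language-agnostic; here is a minimal sketch of the above heuristic in Python (the function name is mine):

```python
def looks_like_pdf(path):
    """Heuristic check: %PDF header in the first 1 KB, trailer tokens in the last 1 KB."""
    with open(path, "rb") as f:
        head = f.read(1024)
        f.seek(0, 2)                      # jump to the end to learn the file size
        size = f.tell()
        f.seek(max(0, size - 1024))
        tail = f.read()
    # The header may be preceded by junk such as a job-control prefix,
    # so search the whole first kilobyte rather than only the first bytes.
    has_header = b"%PDF-" in head
    # Trailer: a startxref token and the %%EOF marker near the end.
    has_trailer = b"startxref" in tail and b"%%EOF" in tail
    return has_header and has_trailer
```

The same two reads and substring checks translate directly into Perl's open/seek/read.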

The module PDF::Parse has a method called IsaPDF, which, per its documentation:
Returns true, if the file could be parsed and is a PDF-file.

How to save the text on a web page in a particular encoding?

I read the following sentences at that link:
Content authors need to find out how to declare the character encoding
used for the document format they are working with.
Note that just declaring a different encoding in your page won't
change the bytes; you need to save the text in that encoding too.
As per my knowledge, the characters of the text are stored in the computer as one or more bytes, irrespective of the 'character encoding' specified in the web page.
I understood the quoted text, except for the last sentence, in bold in the original:
you need to save the text in that encoding too
What does this sentence mean?
Is it saying that the content author/developer has to manually save the same text (which is already stored in the computer as one or more bytes) in the encoding he or she specified? If yes, how is that done, and why is it needed? If not, what does this sentence actually mean?
When you make a web page publicly available, in the most basic sense you make a text file (located on hardware you control) public, so that when a certain address is requested you return this file. That file may be saved on your local hardware, or it may not be (dynamic content). Whatever the case, the user accessing your web page is provided a file.

Once the user gains possession of the file, he should be able to read it; that is where the encoding comes into play. If you have a raw binary file, you can only guess what it contains and what encoding it is in, so most web pages declare, alongside the file, the encoding the file is returned in.

This is where the bold text you ask about relates to my answer: if you declare one encoding alongside the file (for example UTF-8) but deliver the file in another encoding (say, ASCII), the user may see only parts of the text, or may not see it at all. So if you serve a static file, it should be saved in the correct encoding, that is, the one you declared the file to be in.
As for how to store it: that is highly specific to the way you provide the file. Most text editors provide a means to save a file in a specific encoding, and most tools that serve page content provide convenient ways to deliver the file in a form that is easy for the user to decode.
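To make that concrete, here is a minimal Python sketch (the file names are made up) showing that the same text saved under two encodings yields different bytes on disk, which is why the declaration and the saved bytes must agree:

```python
text = "café"

# Save the same text twice, in two different encodings.
with open("page-utf8.txt", "w", encoding="utf-8") as f:
    f.write(text)
with open("page-latin1.txt", "w", encoding="latin-1") as f:
    f.write(text)

# The bytes on disk differ: 'é' is 0xC3 0xA9 in UTF-8 but 0xE9 in Latin-1,
# so declaring one encoding cannot make bytes saved in the other readable.
with open("page-utf8.txt", "rb") as f:
    print(f.read())    # b'caf\xc3\xa9'
with open("page-latin1.txt", "rb") as f:
    print(f.read())    # b'caf\xe9'
```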
It is just a note, probably added because some users were confused.
The text tells us that one should specify, in some form, the encoding of the file. This is straightforward: a web server usually cannot know the encoding of a file on its own. Note that if pages are delivered from, e.g., a database, the encoding could be implicit, but the web treats files as first-class citizens, so we still need to specify the encoding.
The note just makes clear that changing the declared encoding does not cause the page to be transcoded. The page will remain byte-for-byte the same; clients (browsers) will simply misinterpret the content. So if you want to change the encoding, you should declare the new encoding, but also save (or convert and save) the file in that encoding. No magic will (usually) be done by web servers.
There is no text but encoded text.
The fundamental rule of character encodings is that the reader must use the same encoding as the writer. That requires communication, conventions, specifications or standards to establish an agreement.
"Is it saying that the content author/developer has to manually save the same text(which is already stored in the computer as one or more bytes) in the encoding specified by him/her? If yes, how to do it and why it is needed to do?"
Yes, it is always the case for every text file that a character encoding is chosen. Obviously, if the file already exists, it is probably best not to change the encoding. You do it via some editor option (try the Save As… dialog or its equivalent) or via some library property or configuration.
"save the text in that encoding too"
Actually, it's usually the other way around. You decide on the encoding you want or need to use, and the HTML editor or library updates the contents with a matching declaration and any newly necessary character entity references (e.g., does 🚲 need to be written as &#x1F6B2;? Does ¡ need to be written as &iexcl;?) as it writes or streams the document. (If your editor doesn't do that, get a real HTML editor.)
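For instance, an encoder can fall back to numeric character references for characters the target encoding cannot represent; here is a Python sketch of that behaviour (not specific to any particular HTML editor):

```python
s = "¡Hola! \U0001F6B2"   # Latin-1 punctuation plus an emoji (U+1F6B2, BICYCLE)

# The 'xmlcharrefreplace' error handler substitutes a numeric character
# reference for every character the target encoding cannot represent.
print(s.encode("ascii", errors="xmlcharrefreplace"))
# b'&#161;Hola! &#128690;'
```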

iText PdfReader not able to read PDF

I am not able to read a PDF file using the iText PdfReader. The PDF is valid; it opens fine if I open it directly.
URL Of PDF: http://www.fundslibrary.co.uk/FundsLibrary.DataRetrieval/Documents.aspx?type=fund_class_kiid&id=f096b13b-3d0e-4580-8d3d-87cf4d002650&user=fidelitydocumentreport
The PDF in question is encrypted.
According to the PDF specification,
Encryption applies to all strings and streams in the document's PDF file, with the following exceptions:
The values for the ID entry in the trailer
Any strings in an Encrypt dictionary
Any strings that are inside streams such as content streams and compressed object streams, which themselves are encrypted
Later on there is information about special cases in which the document-level metadata stream is not encrypted either, or in which only attachments are encrypted.
The Cross-Reference Stream Dictionary of the PDF looks like this:
<<
/Root 101 0 R
/Info 63 0 R
/XRef(stream)
/Encrypt 103 0 R
/ID[<D034DE62220E1CBC2642AC517F0FE9C7><D034DE62220E1CBC2642AC517F0FE9C7>]
/Type/XRef
/W[1 3 2]
/Index[0 107]
/Size 107
/Length 642
>>
As you can see, there is a non-encrypted string here, (stream), which is neither the value of the ID entry, nor in an Encrypt dictionary, nor inside a stream. Furthermore, the aforementioned special cases do not apply here either.
Thus, this file violates the PDF specification here. Therefore, this file is not a valid PDF.
Furthermore, according to the PDF specification
The last line of the file shall contain only the end-of-file marker, %%EOF.
The file at hand, however, ends differently: the last line contains something other than the end-of-file marker (which is on the line before), namely a 0x06 byte and a 0x0C byte.
The file, therefore, violates the PDF specification here, too.
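You can check this on any file by dumping its last few bytes; a small Python helper (name mine) for inspection:

```python
def file_tail(path, n=16):
    """Return the last n bytes of a file."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to the end to learn the size
        f.seek(max(0, f.tell() - n))
        return f.read()

# For a conforming PDF the tail ends with b'%%EOF' plus at most an end-of-line;
# for the file in question it ends with b'%%EOF' followed by 0x06 0x0C.
```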

Why importdata is not working here?

I have two data text files with the same text content but they have different sizes. The following snapshot compares them (using Beyond Compare).
It seems that the hex content of the files is different.
The MATLAB function importdata reads the file on the left fine, but gives the following error for the file on the right (the larger file):
Unable to load file. Use TEXTSCAN or FREAD for more complex formats.
What exactly is the difference between the two files?
How can I make importdata work with the file on the right?
The problem and solution is already mentioned in the comments:
it seems like one text file is in ascii encoding (that is 8 bits per char) while the second one is unicode (16 bits per char). try converting the big file into simple ascii and re-read it
You can see the difference already in BeyondCompare: on the left you have an ANSI file, on the right a Unicode file with a BOM (the hex code looks like UTF-16LE; I'm not sure which version of BeyondCompare you used for the screenshot, but mine shows 'UTF16-LE' rather than 'Unicode').
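importdata itself is MATLAB, but the suggested conversion can be done with any tool. A Python sketch (function name and the ANSI fallback are my assumptions) that detects a UTF-16 BOM and re-saves the file as plain UTF-8 (equivalent to "simple ascii" when the content is ASCII):

```python
import codecs

def to_utf8(src, dst):
    """Re-save a text file as UTF-8, honouring a UTF-16 BOM if present."""
    with open(src, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        text = raw.decode("utf-16")       # the 'utf-16' codec consumes the BOM
    else:
        text = raw.decode("latin-1")      # assume a single-byte 'ANSI' file
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```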

read() function for FileReader reads 8212

I'm trying to read the binary content of a text file (which contains the compressed version of a different text file). The first two characters (01111011 and 00100110) are correct, going by the values that were originally put there during the compression process.
However, when it gets to the third character, it should be reading 10010111 (again, going by what was added during the compression process), but instead it reads 10000000010100 (aka 8212). Does anyone know what is causing this discrepancy, or how to fix it? Thanks!
The Java FileReader should not be used to read binary data from files, since it reads one character at a time using the platform default encoding, which is almost certainly wrong for binary data.
Instead, use FileInputStream, which has read methods that return actual raw bytes with no encoding applied.
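The reported value 8212 fits this explanation exactly if the platform default charset happened to be windows-1252 (an assumption, but a common Windows default): the expected byte 10010111 is 0x97, which windows-1252 decodes to U+2014 EM DASH, i.e. decimal 8212. A Python sketch of the decoded-versus-raw difference:

```python
data = b"\x7b\x26\x97"   # the first three bytes from the question: 01111011 00100110 10010111

# Character-by-character reading (what FileReader does): each byte is decoded.
decoded = data.decode("cp1252")
print([ord(c) for c in decoded])   # [123, 38, 8212] -- 0x97 became U+2014 (EM DASH)

# Raw byte reading (what FileInputStream does): the values are untouched.
print(list(data))                  # [123, 38, 151]  -- 151 == 0x97 == 0b10010111
```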

DFM file became binary and corrupted

We have a DFM file which began as a text file.
After some years, in one of our newer versions, Borland Developer Studio changed it into binary format.
In addition, the file became corrupted.
Can someone explain what I should do now? Where can I find out how the binary file structure is read?
Well, I found what happened to the DFM file, but I don't know why.
The change from text file to binary is a known behaviour, covered in another Stack Overflow question. I'll describe only the corruption of the file.
In Pascal, the original language of DFM files, a string is defined like so: the first byte is the length of the string (0-255) and the following bytes are the characters. (Unlike C, where the end of a string is marked by a null terminator.)
Someone (maybe BDS?), while changing the file from text to binary, also changed every string-length byte of 13 (0x0D) to 10 (0x0A). As a result, such strings ended after 10 characters, and the following characters were consumed as the value of the property.
I downloaded a binary editor, fixed all the occurrences of length 10, and the file displayed and compiled fine.
(Not only property lengths were affected: one byte in the Icon.Data property was also changed from 0x0D to 0x0A.)
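Wholesale rewriting of 0x0D to 0x0A is exactly what a text-mode CR-to-LF newline translation does to binary data, which may be the missing "why". A Python sketch of how that breaks a length-prefixed Pascal short string (the property name is chosen only because it is 13 characters long):

```python
def read_short_string(buf, pos):
    """Read a Pascal short string: one length byte, then that many bytes."""
    length = buf[pos]
    return bytes(buf[pos + 1 : pos + 1 + length]), pos + 1 + length

good = bytes([13]) + b"OnCreateOrder" + b"\x02\x00"  # length 13, then the next field
bad = good.replace(b"\x0d", b"\x0a", 1)              # CR->LF translation hits the length byte

print(read_short_string(good, 0))  # (b'OnCreateOrder', 14)
print(read_short_string(bad, 0))   # (b'OnCreateOr', 11) -- truncated, parser now misaligned
```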