Is there any PDF parser written in Objective-C or C? - iphone

I'm writing a PDF reader iPhone application.
I know how to show a PDF file in a view using the CGPDF* classes in iOS.
What I want to do now is to search for text in the PDF file and highlight the matches.
So, I need a library that can detect which text is at which position.
Besides, I want the library to be able to handle Unicode and Chinese characters.
I've searched for a few days but still cannot find anything suitable.
I've tried xpdf, but it is written in C++, and I don't know how to use C++ code in an iPhone app.
I've also tried
http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx
but it does not handle Chinese characters.
I've tried to code it myself, but the encoding in PDF is really complicated.
For example, I don't know what to refer to when I want to decode text rendered with the following font:
8 0 obj
<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /BaseFont /RNXJTV+PMingLiU
/DescendantFonts [ 157 0 R ] >>
endobj
157 0 obj
<< /Type /Font /Subtype /CIDFontType2 /BaseFont /RNXJTV+PMingLiU /CIDSystemInfo
<< /Registry (Adobe) /Ordering (CNS1) /Supplement 0 >> /FontDescriptor 158 0 R
/W 161 0 R /DW 1000 /CIDToGIDMap 162 0 R >>
endobj
158 0 obj
<< /Type /FontDescriptor /Ascent 801 /CapHeight 711 /Descent -199 /Flags 32
/FontBBox [0 -199 999 801] /FontName /RNXJTV+PMingLiU /ItalicAngle 0 /StemV
0 /Leading 199 /MaxWidth 1000 /XHeight 533 /FontFile2 159 0 R >>
endobj

Take a look at the CGPDFScanner type; it can be used to parse through a PDF document for strings and particular PDF operators.
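For context, here is a minimal sketch (path handling, error checking and the collection logic are illustrative only) of how CGPDFScanner is typically wired up to intercept the Tj text-showing operator; a real extractor would also hook TJ, ' and ":

#include <CoreGraphics/CoreGraphics.h>

// Called once for every Tj operator found in the content stream.
static void tjCallback(CGPDFScannerRef scanner, void *info) {
    CGPDFStringRef pdfString = NULL;
    if (CGPDFScannerPopString(scanner, &pdfString)) {
        CFStringRef text = CGPDFStringCopyTextString(pdfString);
        if (text) {
            CFShow(text);      // replace with your own collection logic
            CFRelease(text);
        }
    }
}

void scanFirstPage(CFURLRef pdfURL) {
    CGPDFDocumentRef doc = CGPDFDocumentCreateWithURL(pdfURL);
    CGPDFPageRef page = CGPDFDocumentGetPage(doc, 1);   // pages are 1-based

    CGPDFOperatorTableRef table = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(table, "Tj", &tjCallback);

    CGPDFContentStreamRef cs = CGPDFContentStreamCreateWithPage(page);
    CGPDFScannerRef scanner = CGPDFScannerCreate(cs, table, NULL);
    CGPDFScannerScan(scanner);   // fires tjCallback for every Tj encountered

    CGPDFScannerRelease(scanner);
    CGPDFContentStreamRelease(cs);
    CGPDFOperatorTableRelease(table);
    CGPDFDocumentRelease(doc);
}

Be aware that CGPDFStringCopyTextString often cannot decode strings from CID-keyed (e.g. Chinese) fonts, which is exactly the difficulty described in the question.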

This code has some bugs, but they are easily fixable. It is well-presented Objective-C code:
https://github.com/KurtCode/PDFKitten

CGPDFScanner can only scan the PDF contents; there is no way to find the location of a word in the PDF, so highlighting is not possible using the CGPDF functions. Also, the scanner output is encoded text for FlateDecode and other PDF stream types.
It can only scan simple PDFs, i.e. linearized PDFs. (Open the PDF as a text file and at the top you will find the word Linearized.)
A possible solution would be to use a C or C++ parsing library, if you can find one.
Also, the C++ project from Code Project will only parse the content but won't give any location information.
Writing a PDF parser on your own is complex because the PDF format is complicated and not fixed. PDF content can be encoded in different ways, such as FlateDecode.
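To illustrate just the FlateDecode part mentioned above: a content stream whose dictionary says /Filter /FlateDecode contains ordinary zlib data, so once the raw bytes between stream and endstream have been extracted they can be decompressed with zlib. A rough sketch, assuming no /DecodeParms predictor is in play:

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

// Inflate a FlateDecode stream (the raw bytes between stream/endstream).
// Returns a malloc'd buffer (caller frees) or NULL; *outLen gets the size.
unsigned char *inflatePdfStream(const unsigned char *in, size_t inLen, size_t *outLen) {
    size_t cap = inLen * 4 + 64;              // initial guess, grown as needed
    unsigned char *out = malloc(cap);
    if (!out) return NULL;

    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (inflateInit(&zs) != Z_OK) { free(out); return NULL; }
    zs.next_in = (Bytef *)in;
    zs.avail_in = (uInt)inLen;

    int ret;
    do {
        zs.next_out = out + zs.total_out;
        zs.avail_out = (uInt)(cap - zs.total_out);
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret == Z_OK && zs.avail_out == 0) {   // output buffer full: grow
            cap *= 2;
            unsigned char *tmp = realloc(out, cap);
            if (!tmp) { ret = Z_MEM_ERROR; break; }
            out = tmp;
        }
    } while (ret == Z_OK);

    *outLen = zs.total_out;
    inflateEnd(&zs);
    if (ret != Z_STREAM_END) { free(out); return NULL; }  // truncated/corrupt
    return out;
}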

Related

How to implement Unicode (UTF-8) support for a CID-keyed font (Adobe's Type 0 Composite font) converted from ttf?

This post is a sequel to Conversion from ttf to type 2 CID font (type 42 base font).
It is futile to have a CID-keyed font (containing a CIDMap that enforces an identity mapping, i.e. glyph index = CID) without inherent Unicode support. How, then, can application software provide UTF-8 support for such a CID-keyed font externally?
Note: The application program that uses the CID-keyed font can be written in C, C++, PostScript or any other language.
The CID-keyed font NotoSansTamil-Regular.t42 has been converted from Google's Tamil ttf font.
This conversion is needed because, without it, a PostScript program can't access a TrueType font!
Refer to the post Conversion from ttf to type 2 CID font (type 42 base font) for the conversion.
The CIDMap of the t42 font enforces an identity mapping as follows:
Character code 0 maps to Glyph index 0
Character code 1 maps to Glyph index 1
Character code 2 maps to Glyph index 2
......
......
Character code NumGlyphs-1 maps to Glyph index NumGlyphs-1
It is clearly evident that no Unicode is inherently involved in this mapping.
To understand this concretely, consider the following PostScript program tamil.ps, which accesses the t42 font through PostScript's findfont operator.
%!PS-Adobe-3.0
/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def
13 myNoTo
100 600 moveto
% தமிழ் தங்களை வரவேற்கிறது!
<0019001d002a005e00030019004e00120030002200030024001f002f0024005b0012002a0020007a00aa> show
100 550 moveto
% Tamil Welcomes You!
<0155017201aa019801a500030163018801a5017f01b101aa018801c20003016901b101cb00aa00b5> show
showpage
Issue the following Ghostscript command to execute the PostScript program tamil.ps:
gswin64c.exe "D:\cidfonts\NotoSansTamil-Regular.t42" "D:\cidfonts\tamil.ps" (on the Windows platform).
gs ~/cidfonts/NotoSansTamil-Regular.t42 ~/cidfonts/tamil.ps (on the Linux platform).
This will display the two strings தமிழ் தங்களை வரவேற்கிறது! and Tamil Welcomes You! in subsequent rows.
Note that the strings for the show operator are in hexadecimal format, embedded within angle brackets. The show operator extracts 2 bytes at a time and maps each CID (a 16-bit value) to a glyph.
For example, the first 4 hex digits of the first string are 0019, whose decimal equivalent is 25. This maps to the glyph த.
In order to use this t42 font, each string (created from the character set of a ttf) would have to be converted into a hexadecimal string by hand, which is practically impossible, and therefore the font by itself is futile.
Now consider the following C++ code, which generates a PostScript program called myNotoTamil.ps that accesses the same t42 font through PostScript's findfont operator.
#include <cstdio> // for fopen/fprintf; strps, ELang and EMyFont come from the UTF8Map program below
const short lcCharCodeBufSize = 200; // Character code buffer size.
char bufCharCode[lcCharCodeBufSize]; // Character code buffer.
FILE *fps = fopen ("D:\\cidfonts\\myNotoTamil.ps", "w");
fprintf (fps, "%%!PS-Adobe-3.0\n");
fprintf (fps, "/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def\n");
fprintf (fps, "13 myNoTo\n");
fprintf (fps, "100 600 moveto\n");
fprintf (fps, u8"%% தமிழ் தங்களை வரவேற்கிறது!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"தமிழ் தங்களை வரவேற்கிறது!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "100 550 moveto\n"); // needed so the output matches tamil.ps above
fprintf (fps, "%% Tamil Welcomes You!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"Tamil Welcomes You!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "showpage\n");
fclose (fps);
Although the contents of tamil.ps and myNotoTamil.ps are identical, the difference in how those ps files are produced is like the difference between heaven and earth!
Observe that unlike tamil.ps (handmade hexadecimal strings), myNotoTamil.ps is generated by a C++ program that uses UTF-8 encoded strings directly, hiding the hex strings completely. The strps function produces, from UTF-8 encoded strings, hex strings identical to those present in tamil.ps.
The futile t42 font has suddenly become fruitful thanks to the strps function's ability to map UTF-8 to CIDs (every 2 bytes in a hex string maps to one CID)!
The strps function consults a mapping table, aNotoSansTamilMap (implemented as a one-dimensional array constructed with the help of Unicode blocks), in order to map Unicode points (extracted from the UTF-8 encoded string) to character identifiers (CIDs).
The buffer bufCharCode used in the strps function (4th parameter) passes the hex strings corresponding to the UTF-8 encoded strings out to PostScript's show operator.
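For illustration, here is a rough sketch of the kind of work such a function has to do. The UTF-8 decoding below is standard, but the stub mapping is hypothetical: only the த -> CID 25 entry is taken from the tamil.ps example above, while the real aNotoSansTamilMap table in mapunicode.h covers whole Unicode blocks:

#include <stdio.h>
#include <stddef.h>

// Stub mapping table: everything except த falls to .notdef in this sketch.
static unsigned mapToCID(unsigned cp) {
    if (cp == 0x0BA4) return 0x0019;   // த, shown above to map to CID 25
    return 0;                          // unmapped code points: .notdef
}

// Decode well-formed UTF-8 (1- to 3-byte sequences), map each code point
// to a CID, and emit 4 hex digits per CID, ready for PostScript's <...> show.
char *utf8ToCIDHex(const char *s, char *buf, size_t bufSize) {
    size_t n = 0;
    while (*s && n + 5 < bufSize) {
        unsigned char c = (unsigned char)*s++;
        unsigned cp;
        if (c < 0x80) {
            cp = c;                                        // ASCII
        } else if ((c & 0xE0) == 0xC0) {                   // 2-byte sequence
            cp = (unsigned)(c & 0x1F) << 6;
            cp |= (unsigned char)*s++ & 0x3F;
        } else {                                           // 3-byte (Tamil block)
            cp = (unsigned)(c & 0x0F) << 12;
            cp |= (unsigned)((unsigned char)*s++ & 0x3F) << 6;
            cp |= (unsigned char)*s++ & 0x3F;
        }
        n += (size_t)snprintf(buf + n, bufSize - n, "%04X", mapToCID(cp));
    }
    return buf;
}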
In order to benefit others, I have released this UTF8Map program through GitHub for the following platforms.
Windows 10 Platform (Github Public Repository for UTF8Map Program on Windows 10)
Open a command prompt and issue the following clone command to download the source code:
git clone https://github.com/marmayogi/UTF8Map-Win
Or execute the following curl command to download the source-code release as a zip file:
curl -o UTF8Map-Win-2.0.zip -L https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Or execute the following wget command to download the source-code release as a zip file:
wget -O UTF8Map-Win-2.0.zip https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Linux Platform (Github Public Repository for UTF8Map Program on Linux)
Issue the following clone command to download the source code:
git clone https://github.com/marmayogi/UTF8Map-Linux
Or execute the following curl command to download the source-code release as a tar file:
curl -o UTF8Map-Linux-1.0.tar.gz -L https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Or execute the following wget command to download the source-code release as a tar file:
wget -O UTF8Map-Linux-1.0.tar.gz https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Note:
This program uses the t42 file to generate a ps file (a PostScript program) which will display the following on a single page:
A welcome message in Tamil and English.
A list of vowels (12 + 1 glyphs). All of them are associated with Unicode points.
A list of consonants (18 + 6 = 24 glyphs). No association with Unicode points.
A list of combined glyphs (combinations of vowels + consonants) in 24 lines. Each line displays 12 glyphs. Out of 288 glyphs, 24 are associated with Unicode points and the rest are not.
A list of numbers in two lines. All 13 glyphs for Tamil numbers are associated with Unicode points.
A footnote.
The two program files (main.cpp and mapunicode.h) are 100% portable, i.e. their contents are identical across platforms.
The two mapping tables aNotoSansTamilMap and aLathaTamilMap are given in the mapunicode.h file.
A README document in Markdown format has been included with the release.
This software has been tested with t42 fonts converted from the following ttf files:
Google's Noto Tamil ttf
Microsoft's Latha Tamil ttf

Showing glyphs with Unicode values higher than decimal 256

I am looking for help with printing the club symbol from the Arial font in PostScript.
Its Unicode code point is 9827 (2663 hexadecimal).
The ampersand is Unicode 38 (26 hexadecimal).
This PostScript code snippet
%!PS
/ArialMT findfont
12 scalefont setfont
72 72 moveto
<26> show
showpage
produces the ampersand symbol when I run it through Adobe Distiller. It appears that PostScript understands Unicode with UTF-8 encoding by default.
I am unable to do the same with the club symbol.
My research indicates that I have to use character encoding, and this is where I am lost.
Could some kind soul show me (hopefully fairly briefly) how to show the club symbol using character encoding?
Alternatively, if you could point me to a simple tutorial, it would be greatly appreciated.
Reading the reference manual leaves my head spinning.
PostScript does not understand Unicode, not at all, or at least not as standard, though there are ways to deal with it.
Section 5.3 of the PostScript Language Reference Manual contains complete information on character encoding. You really need to read this in detail; what you are asking is a deceptively simple question with no simple answer.
The way this works for PostScript fonts is that the characters in the document have a character code which lies between 0 and 255. When processing text, the interpreter takes the character code and looks it up in the Encoding attached to the font. If you didn't supply an Encoding for the font, it will normally have the pre-defined StandardEncoding.
StandardEncoding has some congruence with UTF-8 for character codes 0x7F and below, but I don't think it's exactly the same.
The Encoding maps the character code to a glyph name. For example, 0x41 in StandardEncoding maps to /A (that's a name in PostScript). Note that this is not UTF-8 or anything else; it's a mapping. It's entirely possible, and common practice for subset fonts, to map the first character used to character code 1, the second to character code 2, and so on.
So if we applied that scheme to 'Hello World' we would use an Encoding which maps
0x01->/H
0x02->/e
0x03->/l
0x04->/o
0x05->/space
0x06->/W
0x07->/r
0x08->/d
and then we might draw the text with:
<0102030304050604070308> show
Which, as you can see, bears no relation to UTF-8 at all.
Anyway, having retrieved the glyph name, the interpreter then looks in the CharStrings dictionary in the font and locates the key associated with that glyph name. So for StandardEncoding we would map 0x41 to /A, and we'd then look in the CharStrings dictionary for the /A key. We then take the value associated with that key, which will be a PostScript glyph program, and run it.
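As a toy model only (the names and "glyph programs" below are illustrative strings, not real Type 1 charstrings), the two-step lookup can be pictured like this:

#include <stdio.h>
#include <string.h>

static const char *encoding[256];           // the font's Encoding array

typedef struct { const char *name; const char *program; } CharString;
static const CharString charStrings[] = {   // the font's CharStrings dict
    { "A",     "<glyph program drawing A>" },
    { "space", "<glyph program drawing nothing>" },
};

// character code -> Encoding -> glyph name -> CharStrings -> glyph program
const char *glyphProgramForCode(unsigned char code) {
    const char *name = encoding[code];            // step 1: Encoding lookup
    if (!name) return NULL;                       // unencoded: .notdef
    for (size_t i = 0; i < sizeof charStrings / sizeof *charStrings; i++)
        if (strcmp(charStrings[i].name, name) == 0)
            return charStrings[i].program;        // step 2: CharStrings lookup
    return NULL;
}

int main(void) {
    encoding[0x41] = "A";                         // StandardEncoding: 0x41 -> /A
    puts(glyphProgramForCode(0x41));              // prints the A "program"
    return 0;
}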
Your problem is that you are trying to use a TrueType font. PostScript does not support TrueType fonts in that way; it does support them when they are defined as Type 42 fonts, because a Type 42 font carries around some additional information which allows the PostScript interpreter to treat them, broadly speaking, the same way as PostScript fonts.
Many modern PostScript interpreters will load a TrueType font and create a Type 42 font from it for you, but this involves guessing at the additional information, and there's no real way to tell in advance how any given interpreter will deal with this. I suspect that Adobe Distiller behaves similarly to Ghostscript and attempts to map the Type 42 to StandardEncoding.
Essentially the Encoding maps the character code to a key in the CharStrings dictionary, and the value associated with that key is the GID. The GID is used to index the glyf table in the TrueType font; the TrueType rasteriser then runs that glyph program.
So in order to create a Type 42 font with an Encoding which will map a character code to a club symbol, you need to know what the GID of the club symbol in the font actually is. This can be derived from one of the cmap subtables in the TrueType font, which is how PostScript interpreters such as Ghostscript build the required Encoding when they load a TrueType font as a Type 42. You would then need to alter the CharStrings dictionary in the Type 42 font so that it maps to the correct GID. You would also need to alter the Encoding: choose a character code that you want to use, and map that character code to the key in the CharStrings dictionary.
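For reference, deriving a GID from a format 4 cmap subtable (the usual Windows/Unicode BMP subtable) is mechanical once the subtable's big-endian arrays have been read out of the font file. A sketch, assuming that parsing has already been done into the struct below:

#include <stdint.h>

typedef struct {
    uint16_t segCount;             // segCountX2 / 2 from the subtable header
    const uint16_t *endCode;       // last code point of each segment
    const uint16_t *startCode;     // first code point of each segment
    const int16_t  *idDelta;       // added to the code point, modulo 65536
    const uint16_t *idRangeOffset; // 0, or byte offset into glyphIdArray
    const uint16_t *glyphIdArray;  // follows idRangeOffset in the font
} CmapFormat4;

uint16_t gidForCodePoint(const CmapFormat4 *t, uint16_t cp) {
    for (uint16_t i = 0; i < t->segCount; i++) {
        if (cp <= t->endCode[i]) {
            if (cp < t->startCode[i]) return 0;            // .notdef
            if (t->idRangeOffset[i] == 0)
                return (uint16_t)(cp + t->idDelta[i]);     // arithmetic map
            // idRangeOffset is a byte offset from its own position into
            // glyphIdArray; rewritten here in uint16_t units:
            uint16_t gid = t->glyphIdArray[t->idRangeOffset[i] / 2
                           - (t->segCount - i) + (cp - t->startCode[i])];
            return gid ? (uint16_t)(gid + t->idDelta[i]) : 0;
        }
    }
    return 0;
}

For U+2663 this walk would end at the segment containing 0x2663 and, for ArialMT, yield the GID the club glyph actually has in that font (389, according to the other answer below).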
You would have to determine what kind of keys the Encoding and CharStrings dictionary use. It might be names, integers, or anything else. You could figure that out by looking at the content of the Encoding array.
In all honesty, unless you know a lot about TrueType fonts, I think it would be hard for you to reverse-engineer the font to retrieve the correct GID and then re-encode the font that gets loaded by the interpreter. You would also need to examine the contents of the font dictionary returned by findfont to see what the existing mapping is. And crucially, you may need to modify the CharStrings dictionary to map the key to the GID. It may be that Distiller returns a dictionary which is defined as no-access, which will prevent you from looking inside it (or at least inside parts of it). You might be able to get away with looking at the Encoding in the font dictionary and modifying that, if the CharStrings dictionary already contains a key for every glyph in the font, which it may well do.
I could probably guide you through doing this with Ghostscript, but I have no idea how Adobe Distiller defines TrueType fonts loaded from disk.
You could use a CIDFont instead. These are defined in section 5.11.1, and it may be that by using something like the pre-defined Identity-H or UCS2 CMaps you could create a CID-keyed instance of ArialMT with TrueType outlines which would work for your Unicode code point.
But that would mean defining the font yourself, so you would need to include the whole TrueType font as part of your PostScript program. Again, this isn't simple.
There is some good information here: Show Unicode characters in PostScript
I also have ArialMT.ttf and made ArialMT.ttf.t42 just to look inside. I found the /club glyph with GID 389, as described by KenS, and tried this as described in the linked post, with good results:
%!
100 300 moveto
/ArialMT.ttf 46 selectfont (ArialMT) show
100 200 moveto /club glyphshow
showpage
Note: I use ArialMT.ttf because the TT font wasn't installed in the Ghostscript Fontmap, just in the current directory, so I used gs -P for that reason. The normal /ArialMT findfont should work when the TT font is installed in the search path. This is my first attempt with these glyphs; I was just using trial and error.
There is a comprehensive Adobe list of glyph names that maps many of the Unicode characters: https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt
If the desired Unicode character is in that list, say club;2663 or clubsuitblack;2663 or clubsuitwhite;2667, all one needs to say is /club glyphshow, and most modern fonts will know what to do. But @KenS says this "can cause problems".
Instead, the preferred scheme that emerges from the recommended references is to:
create a composite font in the preamble, one for each of the fonts you are using;
include the lower 256 characters as Font0;
add whatever Unicode characters you are planning to use, in chunks of 256 characters, as Font1, Font2, etc.;
remap the Unicode of each special character onto a two-character sequence: the sub-font index within the composite font, followed by the byte that is the index of the character within that sub-font.
The following is a complete example of both methods.
I use http://www.acumentraining.com/Acumen_Journal/AcumenJournal_May2002.zip, but with Font1 being a custom remapping of a portion of the same font as Font0, re-using some well-known ASCII character(s).
This is a complete file.eps:
%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 0 0 792 612
%%LanguageLevel: 2
%%EndComments
%%BeginProlog
userdict begin
%%EndProlog
%%BeginSetup
% The following encodes a few useful Unicode glyphs, if only a few are needed.
% Based on https://stackoverflow.com/questions/54840594/show-unicode-characters-in-postscript
% Usage: /Times-Roman /Times-Roman-Uni UniVec new-font-encoding
/new-font-encoding { <<>> begin
/newcodesandnames exch def
/newfontname exch def
/basefontname exch def
/basefontdict basefontname findfont def % Get the font dictionary on which to base the re-encoded version.
/newfont basefontdict maxlength dict def % Create a dictionary to hold the description for the re-encoded font.
basefontdict
{ exch dup /FID ne % Copy all the entries in the base font dictionary to the new dictionary except for the FID field.
{ dup /Encoding eq
{ exch dup length array copy % Make a copy of the Encoding field.
newfont 3 1 roll put }
{ exch newfont 3 1 roll put }
ifelse
}
{ pop pop } % Ignore the FID pair.
ifelse
} forall
newfont /FontName newfontname put % Install the new name.
newcodesandnames aload pop % Modify the encoding vector. First load the new encoding and name pairs onto the operand stack.
newcodesandnames length 2 idiv
{ newfont /Encoding get 3 1 roll put}
repeat % For each pair on the stack, put the new name into the designated position in the encoding vector.
newfontname newfont definefont pop % Now make the re-encoded font description into a POSTSCRIPT font.
% Ignore the modified dictionary returned on the operand stack by the definefont operator.
end} def
/Helvetica /Helvetica-Uni [
16#43 /club % code 16#43 ('C') -> /club
] new-font-encoding
/Helv
<<
/FontType 0
/FontMatrix [ 1 0 0 1 0 0 ]
/FDepVector [
/Helvetica findfont % this is Font0
/Helvetica-Uni findfont % this is Font1
]
/Encoding [ 0 1 ]
/FMapType 3
>> definefont pop
%%EndSetup
%%BeginScript
/Helv 20 selectfont
72 300 moveto
(The club character is \377\001C\377\000 a part of the string.) show
/Helvetica findfont 20 scalefont setfont
263 340 moveto
/club glyphshow
showpage
%%EOF
Which produces the club symbol both inline in the string and via the glyphshow call.
Obviously, this can be extended to more characters, but only 256 per sub-font. I am not aware of a "standard" convention for such re-encoding, although I would imagine a set of Greek letters alpha, beta, gamma... would map pretty obviously onto a, b, c... Perhaps somebody else is aware of such an implementation covering all of the Unicode characters from the Adobe glyph list using multiple custom sub-fonts, and can provide a pointer.
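Writing those \377 escape sequences by hand gets tedious quickly, so they can be generated. A hypothetical C helper (the character-to-sub-font table is made up for this example; only 'C', remapped to /club in Helvetica-Uni above, lives in Font1):

#include <stdio.h>

// Made-up table: which sub-font of the composite font a character needs.
static int subFontFor(char c) {
    return c == 'C' ? 1 : 0;       // extend per your own re-encoding
}

// Emit a PostScript (...) show line, inserting the FMapType 3 escape byte
// (\377) followed by the sub-font index whenever the sub-font changes.
void emitEscapedShow(FILE *out, const char *text) {
    int current = 0;               // composite fonts start in Font0
    fputc('(', out);
    for (const char *p = text; *p; p++) {
        int f = subFontFor(*p);
        if (f != current) {        // switch sub-font: \377 then the index
            fprintf(out, "\\377\\%03o", (unsigned)f);
            current = f;
        }
        fputc(*p, out);
    }
    if (current != 0)              // return to Font0 before closing
        fprintf(out, "\\377\\000");
    fputs(") show\n", out);
}

Called with "The club character is C a part of the string.", this reproduces the escaped string used in the example above.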

iText PdfReader not able to read PDF

I am not able to read a PDF file using the iText PdfReader. The PDF is valid; it opens fine if I try it.
URL of the PDF: http://www.fundslibrary.co.uk/FundsLibrary.DataRetrieval/Documents.aspx?type=fund_class_kiid&id=f096b13b-3d0e-4580-8d3d-87cf4d002650&user=fidelitydocumentreport
The PDF in question is encrypted.
According to the PDF specification,
Encryption applies to all strings and streams in the document's PDF file, with the following exceptions:
The values for the ID entry in the trailer
Any strings in an Encrypt dictionary
Any strings that are inside streams such as content streams and compressed object streams, which themselves are encrypted
Later on there is information about special cases in which the document-level metadata stream is not encrypted either, or in which only attachments are encrypted.
The Cross-Reference Stream Dictionary of the PDF looks like this:
<<
/Root 101 0 R
/Info 63 0 R
/XRef(stream)
/Encrypt 103 0 R
/ID[<D034DE62220E1CBC2642AC517F0FE9C7><D034DE62220E1CBC2642AC517F0FE9C7>]
/Type/XRef
/W[1 3 2]
/Index[0 107]
/Size 107
/Length 642
>>
As you can see, there is a non-encrypted string here, (stream), which is neither the value of the ID entry, nor a string in an Encrypt dictionary, nor inside a stream. Furthermore, the aforementioned special cases do not apply here either.
Thus, this file violates the PDF specification here. Therefore, this file is not a valid PDF.
Furthermore, according to the PDF specification
The last line of the file shall contain only the end-of-file marker, %%EOF.
The file at hand, however, does not end that way: the last line of the file contains something other than the end-of-file marker (which is on the line before), namely a 0x06 and a 0x0c.
The file, therefore, violates the PDF specification here, too.

mPDF outputs the PDF as symbols instead of characters

I want to display the PDF in the browser. I use mPDF to generate the PDF, and it generates correctly. When I use
$mpdf->Output('binders/' . $filename . '.pdf', 'I');
it outputs the following symbols.
When I use F instead of I, I get the PDF saved in the folder with the correct characters.
Can someone help me, please?
%PDF-1.4 %���� 3 0 obj <> /Contents 4 0 R>> endobj 4 0 obj <> stream x��[�o�����j#���d���ٙ��)dYvlĎ�]��ئ-��^�(EpE�,Ăd!J�� �,5��O������۽;^h�ݛ�����y�͜���S��.wN}�&�}��R2�"�vU\;��_��~|���_)q�V���ִ�m]E�-��ŚX�#�mh6:���R�]�tq������#�wk9w��C�)D؞.���#�Su<��K[�)<&�-]lYL�J�]]E<�:|���z��.�e٤9hH#�b&� �]��f�5s���uE�?7]Z9���5k��`6�j�j�!��Zѧ��j���L���Zƪ=�EH�vp�����O�0���g|� �����F��� ���e�R�br�]�H"k⁽�^�#v����}�[{І#���,�8A;�X����u8��E���D�o�3�YqQd�Ls:p�)�����ϸ��U��s���n�3_�ç��G�/�����J��Q|����;]�$���/�+R��j��V�]�X��k��Jj��v�]X��d������su�L�je"3����k��=�����F��;Q���H����\���D�$�4�V���;��ƈ���{nJ��36���NיFl��nt�U���Kݜ>��[Lv�L;�
The string is a binary PDF representation, and its presence means the Content-Type: application/pdf header is not sent correctly, or it is overridden by your code or setup, most likely with text/plain or text/html.
Try to figure out these:
Are you resetting the Content-Type header somewhere in your PHP code after calling the mPDF Output method?
Is your server forcing a different Content-Type somewhere in your setup?
Does your browser support displaying the application/pdf Content-Type directly?
Does the D output mode offer the file for download, or does it display the same characters?
If the D mode displays the same characters, then the Content-Type header is most likely being overridden somewhere in your app or setup after calling the mPDF Output method.
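For reference, when the I mode works, the browser should receive response headers roughly like these (filename illustrative); anything that rewrites the Content-Type to text/html or text/plain before the body is sent will reproduce the symptom above:

HTTP/1.1 200 OK
Content-Type: application/pdf
Content-Disposition: inline; filename="document.pdf"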

When stamping a document - Danish characters disappear and PDF becomes invalid

I have a PDF generated by Oracle BI Publisher. It contains a graph and some text. When I try to stamp the document with an image, the image gets added, but the Danish characters are destroyed.
I run the iText stamping code like this:
static void stampPdf() throws IOException, DocumentException {
    PdfReader reader = new PdfReader(PDF_SOURCE_FILE);
    PdfStamper stamper = new PdfStamper(reader,
            new FileOutputStream(PDF_STAMPED_FILE));
    Image img = Image.getInstance(WATERMARK);
    img.setAbsolutePosition(10, 100);
    PdfContentByte under = stamper.getUnderContent(1);
    under.addImage(img);
    stamper.close();
}
As a result, I get the message "Document invalid". But the document displays, including the added image. The Danish characters have been substituted.
All fonts have been removed from the document properties.
Has anyone seen something like this before? I have done this several times before without problems.
I have taken a look at the PDF, and it's not an iText problem: it's a "Garbage In, Garbage Out" problem. Please open the PDF in Acrobat and analyze it for syntax errors; you'll get an error message.
The content stream of the PDF is wrong in a way that even Acrobat can't analyze it and tell you what is wrong.
So I've looked inside the file, and it looks as if iText can't see the page resources for the page. The page resources refer to the fonts. If iText can't see the page resources, iText can't see the fonts, and they get lost in the process.
If Acrobat allowed me to "Analyze and fix", I could create a fixed PDF and compare what was fixed. But as Acrobat can't fix the file, it would be a lot of work to go through the complete file manually to find out what exactly is wrong with it. Out of curiosity, I opened the document in a text editor, and I found this:
4 0 obj
<<
/ProcSet [ /PDF /Text ]
/Font <<
/F1 7 0 R
/F2 8 0 R
/F3 11 0 R
>>
/Shading <<
/grad0 10 0 R
/grad0#2 15 0 R
/grad1#2 17 0 R
/grad2#2 19 0 R
/grad3#2 21 0 R
/grad4#2 23 0 R
/grad5#2 25 0 R
>>
>>
endobj
The problem is caused by the names /grad0#2, /grad1#2, etc. Those aren't valid names. Let me quote from ISO 32000-1:
When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules:
a) A NUMBER SIGN (23h) (#) in a name shall be written by using its 2-digit hexadecimal code (23), preceded by the NUMBER SIGN.
b) Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.
c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
In your case, you have a NUMBER SIGN (#) followed by a 1-digit number. That doesn't make any sense. The PDF is invalid.
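The rule is mechanical enough to check in a few lines of C. A sketch (assuming the name token has already been isolated, without the leading SOLIDUS) that rejects exactly the kind of escape seen in /grad0#2:

#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

// Decode a PDF name token into out. Returns 0 on success, -1 on an
// invalid escape, i.e. a NUMBER SIGN not followed by two hex digits.
int decodePdfName(const char *in, char *out, size_t outSize) {
    size_t n = 0;
    while (*in && n + 1 < outSize) {
        if (*in == '#') {
            if (!isxdigit((unsigned char)in[1]) ||
                !isxdigit((unsigned char)in[2]))
                return -1;                  // "/grad0#2" fails right here
            char hex[3] = { in[1], in[2], 0 };
            out[n++] = (char)strtol(hex, NULL, 16);
            in += 3;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return 0;
}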
Long story short: contact the producer of the PDF and ask them to fix the problem, or never use their tools again.