CGPDFScannerPopString returning strange result - iPhone

I finally got some sort of PDF scanner to work. It reads into the callback functions without a problem, but when I try to NSLog the result of a CGPDFScannerPopString I get a result like this:
ˆ ˛˝ # ˜˜˜ #˜' ˜˜˜ "˜ '˜˜ " ' ˜˜
No string to be found here...
Any idea what it could be?
This is my callback function:
static void op_Tj(CGPDFScannerRef s, void *info)
{
    CGPDFStringRef string;
    if (!CGPDFScannerPopString(s, &string))
        return;

    // __bridge_transfer hands ownership of the copied string to ARC
    NSLog(@"string: %@", (__bridge_transfer NSString *)CGPDFStringCopyTextString(string));
}
Thanks already!
Edit: Example PDF

You should be aware that a CGPDFStringRef is not an ASCII string or anything similar at all. Cf. http://developer.apple.com/library/mac/documentation/graphicsimaging/Reference/CGPDFString/Reference/reference.html --- it is a "series of bytes—unsigned integer values in the range 0 to 255" which have to be interpreted according to the current PDF reference.
The PDF reference in turn will tell you that the interpretation of the bytes depends on the font used; while ASCII-like interpretations are common for European languages, they are not mandatory, and for Asian languages, where font subset embedding is very common, the interpretation may look completely random.
CGPDFStringCopyTextString tries to interpret those bytes accordingly, but there does not have to be a sensible interpretation as a regular string.
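To see what the scanner actually hands you, it can help to dump the raw bytes instead of the interpreted text. A minimal variant of the callback from the question (the hex-dump logic and the name op_Tj_dump are mine):

static void op_Tj_dump(CGPDFScannerRef s, void *info)
{
    CGPDFStringRef string;
    if (!CGPDFScannerPopString(s, &string))
        return;

    // Raw bytes of the PDF string, before any text interpretation
    const unsigned char *bytes = CGPDFStringGetBytePtr(string);
    size_t length = CGPDFStringGetLength(string);

    NSMutableString *hex = [NSMutableString string];
    for (size_t i = 0; i < length; i++)
        [hex appendFormat:@"%02X ", bytes[i]];
    NSLog(@"raw bytes: %@", hex);
}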
EDIT: Inspection of the sample PDF Ron supplied shows that in this sample the encoding of the font in object 3 0 (which is dominant on most pages of the document) is indeed not a standard encoding but instead:
<</Type/Encoding
/Differences[0/.notdef/C/O/V/E/R/space/slash/H/L/F/underscore/W/B/five/eight/four
/zero/two/six/D/one/period/three/Z/I/N/G/U/S/T/colon/seven/A/M/P/Y
/plus/nine/X/hyphen/i/s/p/a/t/c/h/n/f/o/K/greater/equal/l/m/y/J/Q
/parenleft/parenright/comma/dollar/ampersand/d/r/v/b/e/u/w/k/g/x/bar
/quotesingle/asterisk/q/question/percent]
>>
Looking at the top of the first document page
COVER / HLF_CWEB_58408485 / 58408485 / 26DEC12 10.30.22Z
BRIEFING INCLUDES FOLLOWING FLIGHTS:
26DEC12 OR0337 EHAM0630 MUVR1710 PHOYE VSM+2/8 179
NEXT FLIGHTS OF AIRCRAFT:
26DEC12 OR0338 MUVR1830 MMUN1940 PHOYE VSM+2/8 213
26DEC12 OR0338 MMUN2105 EHAM0655 PHOYE GPT+2/7 263
27DEC12 OR0365 EHAM0900 TNCB1930 PHOYE BAH+1/8 272
27DEC12 OR0366 TNCB2030 TNCC2110 PHOYE BAH+1/8 250
27DEC12 OR0366 TNCC2250 EHAM0835 PHOYE ASD+1/8 199
that encoding seems to have been created by dealing out the next available code, starting at one, to each glyph as it was first needed: byte 1 maps to C, 2 to O, 3 to V, 4 to E, and so on, spelling out the "COVER" at the top of the page. This obviously results in a highly individualistic encoding...
That being said, the font object does include both an /Encoding entry and a /ToUnicode entry. Thus, if CGPDFStringCopyTextString were given a reference to the font here and really tried, it could easily translate those bytes into the corresponding text. That it doesn't achieve anything decent seems to indicate that it simply does not have the information for which font to interpret the bytes --- I don't assume it doesn't try...
For accurate text extraction, therefore, you have to interpret the bytes in the CGPDFStringRef yourself, using the information about the font currently in effect in the content stream. If you don't want to do that from scratch, you might be interested in PDFKitten, a framework for extracting data from PDFs in iOS. While it is not yet perfect (some font structures can baffle it), it is a good starting point.
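If you do decide to roll your own, the first building block is keeping track of which font is current whenever a string-showing operator fires. A minimal sketch (the ScannerState struct, the op_Tf callback, and the registration lines are my illustration, not code from the question):

#include <string.h>

// Sketch: remember the current font so op_Tj knows which encoding applies.
// A pointer to this struct is passed as the scanner's info parameter.
typedef struct {
    char currentFontName[64];   // font resource name, e.g. "F1"
} ScannerState;

static void op_Tf(CGPDFScannerRef s, void *info)
{
    ScannerState *state = info;
    CGPDFReal size;
    const char *name;
    if (!CGPDFScannerPopNumber(s, &size))   // operands pop in reverse order:
        return;                             // size first, then the font name
    if (CGPDFScannerPopName(s, &name))
        strlcpy(state->currentFontName, name, sizeof(state->currentFontName));
}

// Registration, alongside the existing Tj callback:
//   CGPDFOperatorTableRef table = CGPDFOperatorTableCreate();
//   CGPDFOperatorTableSetCallback(table, "Tf", &op_Tf);
//   CGPDFOperatorTableSetCallback(table, "Tj", &op_Tj);

op_Tj can then look up that name in the page's /Resources /Font dictionary and decode the bytes via the font's /Encoding and /ToUnicode entries; that lookup and decoding is exactly the heavy lifting PDFKitten does for you.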

Related

How to get string input from a socket in D?

I'm using this code to listen to a port:
import std.socket;
import std.stdio;

TcpSocket listener;

int start()
{
    ushort port = 61888;
    listener = new TcpSocket();
    assert(listener.isAlive);
    listener.blocking = false;
    listener.bind(new InternetAddress(port));
    listener.listen(10);
    writefln("Listening on port %d.", port);

    enum MAX_CONNECTIONS = 60;
    auto socketSet = new SocketSet(MAX_CONNECTIONS + 1);
    Socket[] reads;

    while (true)
    {
        socketSet.reset();      // Socket.select() modifies the set, so rebuild it
        socketSet.add(listener);
        foreach (sock; reads)
            socketSet.add(sock);
        Socket.select(socketSet, null, null);
    }
    return 0;
}
As far as I know, sockets deal with bytes as they are. I want to find a way to convert these bytes (which are essentially SQL requests) to strings. How can I do that, given that the input is UTF-8, a variable-width encoding?
You seem to have a few questions here.
How do I get chars from bytes?
Cast them with cast(char[]) st. This aliases the bytes, giving you a slice of the exact same data without a new allocation. You are now assuming the bytes are valid UTF-8; nothing has checked this, and autodecoding or other parts of your program might complain if they aren't. You can run the data through std.utf.validate if you want.
Or do basically the same thing with std.string.assumeUTF(st), which at least asserts on invalid UTF in debug builds.
How do I get a string from char[]?
You can unsafely alias the char[] with std.exception.assumeUnique(st), or you can allocate an immutable copy with st.idup or std.utf.toUTF8(st).
What if my fixed buffer of bytes contains invalid UTF-8 -- because it got cut off?
If that's a risk, you can use the low-level std.utf tools (decodeFront plus catching UTFException is one way) to peel off the valid UTF-8 and then see which bytes remain, or to check that the end of the input is valid UTF-8; a sketch follows.
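For instance, a sketch of that approach (takeValidUtf8 is my own name, not a library function):

import std.utf : decodeFront, UTFException;

// Return the longest valid UTF-8 prefix of buf as a string,
// leaving any trailing incomplete/invalid bytes in rest.
string takeValidUtf8(ubyte[] buf, out ubyte[] rest)
{
    auto s = cast(const(char)[]) buf;   // alias, no copy
    size_t consumed = 0;
    while (s.length)
    {
        immutable before = s.length;
        try
        {
            decodeFront(s);             // advances s past one code point
            consumed += before - s.length;
        }
        catch (UTFException)
        {
            break;                      // remainder starts with a bad or cut-off sequence
        }
    }
    rest = buf[consumed .. $];
    return (cast(const(char)[]) buf[0 .. consumed]).idup;
}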
How do I know if I've gotten a complete SQL statement with my fixed buffer socket I/O?
Instead of just passing raw SQL statements over the wire, you can define a network protocol that includes information like the statement size, or 'end of statement' markers that you can scan for; a sketch of length-prefix framing follows.
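For illustration, here is a hypothetical length-prefixed framing; the 4-byte big-endian prefix is one common choice, not something your current code implies:

import std.bitmanip : bigEndianToNative;
import std.exception : enforce;
import std.socket : Socket;

// Read exactly buf.length bytes (assumes a blocking, connected socket).
void readExactly(Socket sock, ubyte[] buf)
{
    size_t got = 0;
    while (got < buf.length)
    {
        immutable n = sock.receive(buf[got .. $]);
        enforce(n > 0, "connection closed");
        got += n;
    }
}

// One statement = 4-byte big-endian length prefix + UTF-8 payload.
string readStatement(Socket sock)
{
    ubyte[4] lenBuf;
    readExactly(sock, lenBuf[]);
    auto payload = new ubyte[](bigEndianToNative!uint(lenBuf));
    readExactly(sock, payload);
    return cast(string) payload;    // freshly allocated, not aliased elsewhere
}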
I have a cheat sheet for string type conversions, which links to a file of more elaborate unit tests.
Try this:
import std.exception : assumeUnique;

string s = assumeUnique(cast(char[]) ubyteArray);

What is the correct behavior of C_Decrypt in pkcs#11?

I am using C_Decrypt with the CKM_AES_CBC_PAD mechanism. I know that my ciphertext, which is 272 bytes long, should actually decrypt to 256 bytes, which means a full block of padding was added.
I know that according to the standard when invoking C_Decrypt with a NULL output buffer the function may return an output length which is somewhat longer than the actual required length, in particular when padding is used this is understandable, as the function can't know how many padding bytes are in the final block without carrying out the actual decryption.
So the question is: if I know that I should get exactly 256 bytes back, as in the scenario above, does it make sense that I still get a CKR_BUFFER_TOO_SMALL error despite passing a 256-byte buffer? (To be clear: I pass 256 as the value of the output buffer length parameter of C_Decrypt.)
I am encountering this behavior with a SafeNet Luna device and am not sure what to make of it. Is it my code's fault for not first querying the length by passing a NULL output buffer, or is this a bug on the HSM/PKCS#11 library side?
One more thing I should perhaps mention: when I provide a 272-byte (256+16) output buffer, the call succeeds and I get back my expected plaintext, but also the padding block, i.e. 16 final bytes with the value 0x10. However, the output length is correctly updated to 256, not 272 - this also proves that I am not accidentally using CKM_AES_CBC instead of CKM_AES_CBC_PAD, which I suspected for a moment as well :)
I have used the CKM.AES_CBC_PAD mechanism with C_Decrypt in the past. You have to make 2 calls to C_Decrypt (1st to get the size of the plain text, 2nd for the actual decryption); see the documentation here, which talks about determining the length of the buffer needed to hold the plain text.
Below is step-by-step code showing this behavior:
// Define the decryption mechanism
CK_MECHANISM mechanism = new CK_MECHANISM(CKM.AES_CBC_PAD);

// Will receive the size of the plain text; initialized to zero
LongRef lRefDec = new LongRef();

// Get ready to decrypt
CryptokiEx.C_DecryptInit(session_1, mechanism, key_handleId_in_hsm);

// 1st call: determine the size of the plain text
CryptokiEx.C_Decrypt(session_1, your_cipher, your_cipher.length, null, lRefDec);

// Allocate a buffer to store the plain text
byte[] clearText = new byte[(int) lRefDec.value];

// 2nd call: actual decryption
CryptokiEx.C_Decrypt(session_1, your_cipher, your_cipher.length, clearText, lRefDec);
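For reference, the same two-call pattern against the raw PKCS#11 C API (a sketch only: error handling is trimmed, and the session, key, IV and ciphertext are assumed to be set up elsewhere):

#include <stdio.h>
#include <stdlib.h>
#include "pkcs11.h"   /* or your vendor's cryptoki header */

/* Sketch: two-call C_Decrypt with CKM_AES_CBC_PAD. */
void decrypt_two_call(CK_SESSION_HANDLE hSession, CK_OBJECT_HANDLE hKey,
                      CK_BYTE *iv, CK_BYTE *cipher, CK_ULONG cipherLen)
{
    CK_MECHANISM mech = { CKM_AES_CBC_PAD, iv, 16 };
    if (C_DecryptInit(hSession, &mech, hKey) != CKR_OK)
        return;

    /* 1st call: NULL output buffer; the library reports a length that is
       guaranteed big enough, possibly an over-estimate (e.g. 272 here). */
    CK_ULONG plainLen = 0;
    if (C_Decrypt(hSession, cipher, cipherLen, NULL_PTR, &plainLen) != CKR_OK)
        return;

    /* 2nd call: actual decryption; plainLen is updated to the exact
       unpadded length (256 in the scenario from the question). */
    CK_BYTE *plain = malloc(plainLen);
    if (C_Decrypt(hSession, cipher, cipherLen, plain, &plainLen) == CKR_OK)
        printf("plaintext length: %lu\n", (unsigned long) plainLen);
    free(plain);
}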
Sometimes decryption fails because the input to the encryption misled the algorithm (the encryption succeeds, but the corresponding decryption fails). So it is important not to send raw bytes directly to the encryption algorithm; encoding the input data with a UTF-8/16 scheme protects it from being misinterpreted as network control bytes.

How to view the actual bytes in a Data variable in Swift

I have a variable of type Data in Swift code, using Xcode 10.1, called data. I can see it in the debugger, but I don't know where the actual values are stored. It should contain a letter (one byte) and three UInt8 values, all 0-255, so it should be 4 bytes. The first _length is shown to be 6, so I don't know what else could have been added (one reason I want to see what is actually in there). But I do not understand where the binary value is. The _rawValue does not seem to be it, because it contains 4.5 bytes. Perhaps it is a pointer, as it says "RawPointer"?
Where are the actual bytes stored?
Edit:
By setting a new variable equal to data[i], I did figure out that the number of bytes is correct (I found the code was putting things in that I didn't know about). My string is, for example, "!C 0 21 255 17", so 6 bytes.
However, I would still love an answer to my question: is there a way, during debugging, to view the elements without creating new variables to inspect?
Create an extension of Data as follows:
extension Data {
    public var bytes: [UInt8] {
        return [UInt8](self)
    }
}
You can view the bytes of data during debugging as:
po data.bytes
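If you prefer not to add an extension, you can also evaluate the conversion directly in the LLDB console. A hypothetical session, assuming data holds the six bytes from the question:

po [UInt8](data)
// prints: [33, 67, 0, 21, 255, 17]
po data.map { String(format: "%02X", $0) }.joined(separator: " ")
// prints: "21 43 00 15 FF 11"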
Just type po data as NSData in the debug console. You will see the hex bytes like <066465666768>

LED Display character set

I am looking for character sets for displaying text on my LED display board.
Normally I have to put all these characters together in an array of booleans, for example H and A:
bool[] H = { 1,0,0,0,0,1,
             1,0,0,0,0,1,
             1,1,1,1,1,1,
             1,0,0,0,0,1,
             1,0,0,0,0,1 };

bool[] A = { 0,0,1,1,0,0,
             0,1,0,0,1,0,
             0,1,1,1,1,0,
             1,0,0,0,0,1,
             1,0,0,0,0,1 };
I think such collections should already be available on the internet, but searching for "character set" I found nothing. So: a list of as many characters as possible, expressed in this bitmap format.
Do you have a tip for me? It would save me a lot of tedious work :)
Thanks you very much for the help. I appreciate it.
Regards,
Chris
Check out this site: character-set-generator. It should be just what you need.
To import fonts into C code for an LED display, you need to convert the font's bitmap into byte arrays so that it can easily be superimposed onto the display's video memory. You will find a number of tools that fit this requirement, but the most useful byte-code format for LED display code is shown below:
{0x07e0, 0x1ff8, 0x3ffc, 0x600e, 0x8001, 0x0000 }, // Byte code for A
{0x8001, 0x4006, 0x7ffc, 0x3ff8, 0x0fe0, 0x0000 }, // Byte code for B
{0x0018, 0x003e, 0x003e, 0x003e, 0x003e, 0x0018, 0x0000 }, // Byte code for W
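To make use of such byte codes, the rendering loop can be as small as this sketch (it assumes each 16-bit value is one column of pixels with bit 0 at the top, and setPixel stands in for whatever your display driver provides):

#include <stdint.h>

extern void setPixel(int x, int y, int on);   /* hypothetical driver call */

/* Draw one glyph stored as 16-bit column masks at position (x0, y0). */
void drawGlyph(const uint16_t *columns, int nColumns, int x0, int y0)
{
    for (int col = 0; col < nColumns; col++)
        for (int row = 0; row < 16; row++)
            setPixel(x0 + col, y0 + row, (columns[col] >> row) & 1);
}

/* Usage: uint16_t A[] = { 0x07e0, 0x1ff8, 0x3ffc, 0x600e, 0x8001, 0x0000 };
          drawGlyph(A, 6, 0, 0); */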
Some time back I built such a custom application on the VC++ platform. It generates byte code for a bitmap in the above form. I can share that app if you really need it.
By using such byte-code formats you can easily build LED display signs like those shown in the following images:
http://www.ledsignsdisplays.com/indoor-led-signs.html
http://photonplay.com/indoor-led-display.html

How to calculate the number of data/error blocks of a QR code at version > 3

I am working on a QR code encoding/decoding project.
I have read through ISO/IEC 18004 (2006) and some tutorials:
http://www.thonky.com/guides/
http://www.matchadesign.com/_blog/Matcha_Design_Blog/post/QR_Code_Demystified_-_Part_1/
http://www.swetake.com/qr/qr1_en.html
The ISO documentation and those very nice tutorials helped me a lot, but there's still one thing I can't understand: how to calculate the number of data/error blocks when creating a QR code at version 3 or higher.
The following example is from ISO/IEC 18004:2006: a version 7-H symbol (H is the error correction level) has 66 data codewords and 130 error correction codewords, and both are split into 5 blocks.
The document says that the number of blocks (n = 5 in this case) can be obtained from Table 9 of ISO 18004 according to the version and error correction level, but somehow I can't arrive at that number. Please show me how to calculate it.
Now I've got it. All the information needed for block splitting actually is in Table 9 of the ISO/IEC 18004 document; I had simply read it carelessly.
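Spelled out for version 7-H: Table 9 lists 5 error correction blocks and 130 error correction codewords out of 196 total codewords, which leaves 66 data codewords. Spreading 66 data codewords over 5 blocks gives 66 = 4 × 13 + 1 × 14, and 130 / 5 = 26 error correction codewords per block, so there are four blocks of 13 + 26 = 39 codewords and one block of 14 + 26 = 40 codewords, 196 codewords in total. (These numbers are my reading of Table 9; double-check them against your copy of the standard.)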