pySerial and reading binary data

When the device I am communicating with sends binary data, I can recover most of it. However, there always seem to be some bytes missing, replaced by non-standard characters. For instance, one individual output looks like this:
\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9
The period and equals sign should be ordinary bytes displayed in hexadecimal format (I confirmed this in another application). Other times I get other odd characters such as ')' or 's'. These characters usually occur in exactly the same spot (which varies with the command I pass to the device).
How can I fix this problem?

Are you displaying the output using something like this?:
print output
If some of your bytes happen to correspond with printable characters, they'll show up as characters. Try this:
print output.encode('hex')
to see hex values for all your bytes.

At first I liked @RichieHindle's answer, but when I tried it the hex bytes were all bunched together.
To get a friendlier output, I use
print ' '.join(map(lambda x:x.encode('hex'),output))
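
If you're working in Python 3 (where bytes objects no longer have an encode method), a rough equivalent of both snippets above, using the sample bytes from the question, would be:

output = b"\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9"  # sample from the question
print(output.hex())                            # all hex digits run together
print(" ".join(f"{b:02x}" for b in output))    # space-separated, one pair per byte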

Implementing MD5: Inconsistent endianness?

So I tried implementing the MD5 algorithm according to RFC 1321 in C# and it works, but there is one thing about the way the padding is performed that I don't understand. Here's an example:
If I want to hash the string "1" (without the quotation marks) this results in the following bit representation: 10001100
The next step is appending a single "1"-Bit, represented by 00000001 (big endian), which is followed by "0"-Bits, followed by a 64-bit representation of the length of the original message (low-order word first).
Since the length of the original message is 8 (Bits) I expected 00000000000000000000000000001000 00000000000000000000000000000000 to be appended (low-order word first). However this does not result in the correct hash value, but appending 00010000000000000000000000000000 00000000000000000000000000000000 does.
This looks as if suddenly the little-endian format is being used, but that does not really seem to make any sense at all, so I guess there must be something else that I am missing?
Yes, for MD5 you have to append the message length in little-endian.
So the message representation for "1" is 49 -> 00110001, followed by a single 1-bit and then zeroes. After that, append the message length with its bytes in reversed order (the least significant byte first).
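A minimal sketch of that padding step in Python (only the padding, not the full digest; md5_pad is just an illustrative name):

import struct

def md5_pad(message):
    # Append a single 1-bit (the 0x80 byte), then 0-bits until the length
    # is 56 bytes mod 64, then the original length in bits as a 64-bit
    # little-endian integer.
    bit_length = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack("<Q", bit_length)
    return padded

# For the one-byte message "1" the block ends in 08 00 00 00 00 00 00 00,
# i.e. the length 8 with the least significant byte first.
print(md5_pad(b"1").hex())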
You could also check permutations step by step on this site: https://cse.unl.edu/~ssamal/crypto/genhash.php.
Or here: https://github.com/MrBlackk/md5_sha256-512_debugger

What's with the dashes in some iOS error messages?

I sometimes see error messages in the log file of my iOS device with at least part of the text containing dashes between each letter, like this:
... ’-t -b-e -c-o-m-p-l-e-t-e-d-. -(-k-C-F-E-r-r-o-r-D-o-m-a-i-n-C-F-N-e-t-w-o-r-k -e-r-r-o-r -2-.-)-" -U-s-e-r-I-n-f-o-=-0-x-1-4-5-f-d-0 -{-k-C-F-G-e-t-A-d-d-r-I-n-f-o-F-a-i-l-u-r-e-K-e-y-=-8-}
I'm just curious as to any meaning this might have (certainly doesn't do much for readability, but I suppose it's easy to spot), or if it's just a random problem. (My device is jailbroken, if that makes a difference).
Update: I was able to format a log message similarly by calling NSLog with a non-ASCII character at the beginning:
NSLog(@"€ Line will be formatted strangely");
I suspect that the message was in UTF-16 (or some other double-byte character set) and was incorrectly interpreted as ASCII when it was converted to an NSString.
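A rough Python illustration of that theory (just the byte pattern, not the CoreFoundation code path): UTF-16 text read back one byte at a time interleaves every ASCII character with a 0x00 byte, which a naive renderer may show as a filler character, giving exactly this dashed look.

text = "kCFErrorDomainCFNetwork error 2."
raw = text.encode("utf-16-le")
print(raw[:16])                                    # b'k\x00C\x00F\x00E\x00r\x00r\x00o\x00r\x00'
print(raw.decode("latin-1").replace("\x00", "-"))  # k-C-F-E-r-r-o-r-D-o-m-a-i-n-...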

How to determine if a file is IBM1047 encoded

I have a bunch of XML files that are declared as encoding="IBM1047" but they don't seem to be:
when converted with iconv from IBM1047 to UTF-8 or ISO8859-1 (Latin 1) they result in indecipherable garbage
file -i <name_of_file> says "unknown 8-bit encoding"
when parsed by an XML parser the parser complains there is text before the prolog but there isn't; this error doesn't happen if I change the encoding in the XML declaration to something else
It would be nice to find out the real encoding of these files. I tried 'file -i' as mentioned above, and also 'enca', but the latter is limited to Slavic languages (the files are in French).
I have little control about how these files are produced; short of finding the actual encoding, if I can prove conclusively that the files are not in fact IBM1047 I may get the producer to do something about it.
How do I prove it?
Some special chars:
'é' is '©'
'à' is 'ë'
'è' is 'Û'
'ê' is 'ª'
The only way to prove that any class of data streams is encoded or not encoded in a particular way is to know, for at least one instance of the class, exactly what characters are supposed to be in the stream. If you have agreement on what characters are (supposed to be) in a particular test case, you can then calculate the bits that should be in the IBM 1047 (or any other) encoding of the test case, and compare those bits to the bits you actually see.
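As a concrete sketch of that test in Python, assuming the files really begin with the XML declaration quoted in the question (the exact spelling below is an assumption) and that a cp1047 codec is available (not every Python build registers one; the third-party ebcdic package adds it), with suspect.xml as a placeholder file name:

expected = '<?xml version="1.0" encoding="IBM1047"?>'.encode("cp1047")

with open("suspect.xml", "rb") as f:   # placeholder name
    actual = f.read(len(expected))

for offset, (got, want) in enumerate(zip(actual, expected)):
    if got != want:
        print("offset %d: file has 0x%02x, IBM 1047 would need 0x%02x" % (offset, got, want))

Any mismatches within the declaration itself are strong evidence that the files are not really IBM 1047.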
One simple way for EBCDIC data to be mangled, of course, is for it to have passed through some EBCDIC/ASCII gateway along the way that used a translate table designed for some other EBCDIC code page. But if you are working with EBCDIC data you presumably already know that.

Command-line arguments as bytes instead of strings in python3

I'm writing a Python 3 program that gets the names of the files to process from command-line arguments. I'm confused about the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it reaches them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling filesystem functions. When I receive a filename argument which is invalid, however, it is handed to me as a Unicode string with strange characters like \udce8. I do not know what these are, and trying to encode the string always fails, whether with UTF-8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode them in UTF-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me ?
When I receive a filename argument which is invalid, however, it is handed to me as a Unicode string with strange characters like \udce8.
Those are surrogate characters; the low 8 bits are the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
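On POSIX systems the arguments are decoded with the surrogateescape error handler, so a short sketch covering questions 2 and 3, using only the standard os and sys modules, would be:

import os
import sys

# "caf\udce8" is what the latin-1 bytes b"caf\xe8" look like after
# surrogateescape decoding under a UTF-8 locale; real arguments come from sys.argv.
arg = sys.argv[1] if len(sys.argv) > 1 else "caf\udce8"

raw = os.fsencode(arg)   # recovers the original bytes, e.g. b'caf\xe8' (question 2)
print(raw)

# Question 3: re-encode lossily instead of letting UnicodeEncodeError propagate.
message = "cannot open %s\n" % arg
sys.stderr.buffer.write(message.encode("utf-8", errors="replace"))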
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use bytes when you should use a string. A bytes object is a sequence of integers; a string is a sequence of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in memory as Unicode; every string is stored the same way. An encoding specifies how Python converts the bytes in a file into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.
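For what it's worth, the relationship described here is easy to inspect; the printed values below depend on your locale, with a UTF-8 locale assumed in the comments:

import os
import sys

print(sys.getfilesystemencoding())   # e.g. 'utf-8'
print(os.fsdecode(b"caf\xc3\xa9"))   # 'café' if the filesystem encoding is UTF-8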

Displaying Unicode Characters

I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.
I have read this very interesting and helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering how one would go about identifying individual glyphs given a buffer of Unicode data.
My questions are:
How would I go about parsing a Unicode string, say UTF-8?
Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?
That is, if I interpreted the method of storage correctly.
This is all related to a text display system I am designing to work with OpenGL.
I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
Having to represent every string as an array of shorts would require a significant amount of storage, considering everything I need to display.
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
How would I go about parsing a Unicode string, say UTF-8?
I'm assuming that by "parsing", you mean converting to code points.
Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
If you do need to convert to code points (UTF-32), then:
1) Check the first byte to see how many bytes are in the character.
2) Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
3) Use bit masking and shifting to convert the bytes to the code point.
4) Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
5) Increment your pointer by the sequence length and repeat for the next character.
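A compact Python sketch of those five steps (modern UTF-8 only, so one- to four-byte sequences rather than the old six-byte forms; utf8_to_codepoints is just an illustrative name):

def utf8_to_codepoints(data):
    i = 0
    while i < len(data):
        lead = data[i]
        # Step 1: the lead byte tells you the sequence length.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0
        elif 0xC0 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Steps 2 and 3: validate trailing bytes, then mask and shift.
        for trailing in data[i + 1:i + length]:
            if not 0x80 <= trailing <= 0xBF:
                raise ValueError("invalid trailing byte near offset %d" % i)
            cp = (cp << 6) | (trailing & 0x3F)
        # Step 4: reject overlong sequences.
        if cp < minimum:
            raise ValueError("overlong sequence at offset %d" % i)
        yield cp
        i += length   # step 5: advance and continue

print(list(utf8_to_codepoints("héllo €".encode("utf-8"))))

(It leaves out the surrogate-range check that a production decoder would also want.)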
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
It's not. Unicode was originally intended to be a fixed-width 16-bit encoding. It was later decided that 65,536 characters wasn't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.
If you want a fixed-width encoding, you need 21 bits. But there aren't many languages that have a 21-bit integer type, so in practice you need 32 bits.
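The arithmetic is easy to check in Python:

print(hex(1114111))                           # 0x10ffff, the largest code point
print((1114111).bit_length())                 # 21 bits for a fixed-width form
print(len("\U0010FFFF".encode("utf-32-le")))  # 4 bytes (32 bits) per code point in UTF-32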
Well, I think this answers it:
http://en.wikipedia.org/wiki/UTF-8
Why it didn't show up the first time I went searching, I have no idea.