I have discovered a huge issue in my code, and I have no idea what is causing it.
When I send requests to my server, I hash a string that's in the request. This is sometimes user input.
My app is multi-language, so I have to support characters like "ä".
With plain English letters, digits and so on, this hashing method works like a dream. But when the string being hashed and compared contains an "ä" or an "ö" (not specifically those; it may be that any character outside the Base64 set causes this), the hashes don't match!
This is a complete disaster, and I had not noticed it until now. I have tried basically everything I can think of, and googled, and I am out of luck so far.
I generate the hash in Swift by passing the string and secretToken into this function and sending the output as an HTTP header:
func hmac(string: String, key: String) -> String {
    var digest = [UInt8](repeating: 0, count: Int(CC_SHA256_DIGEST_LENGTH))
    CCHmac(CCHmacAlgorithm(kCCHmacAlgSHA256), key, key.count, string, string.count, &digest)
    let data = Data(digest)
    return data.map { String(format: "%02hhx", $0) }.joined()
}
How I compare the hash in NodeJS:
if (hashInTheRequest === crypto.createHmac('sha256', secretToken).update(stringToHash).digest('hex')) {
    // Good to go
}
Thanks in advance!
This could be due to a composition issue. You mentioned non-Latin characters, but didn't give a concrete example of where you had problems.
What is composition?
Unicode aims to be able to represent any character used by humanity. However, many characters are similar, such as u, ü, û and ū. The original idea was to just assign a code point to every possible combination. As one might imagine, this is not the most effective way to store things. Instead, the "base" character is used, and then a combining character is added to it.
Let's look at an example: ü
ü can be represented as U+00FC, also known as LATIN SMALL LETTER U WITH DIAERESIS.
ü can also be represented as U+0075 (u) followed by U+0308 (◌̈), also known as LATIN SMALL LETTER U followed by COMBINING DIAERESIS.
Why is this problematic?
Because hash functions don't know what a string is; all they care about is bytes, so a string has to be encoded into a sequence of bytes before hashing. As shown above, there is more than one byte representation of the same logical string, which means two systems can encode the same string to different bytes, and thus arrive at different hashes.
How can I fix this?
You have to explicitly define how the string will be normalized and encoded on both platforms, so that both produce exactly the same bytes.
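A minimal Python sketch of the mismatch and the fix (the question's code is Swift and Node, but the byte-level effect is the same; "secretToken" here is just a stand-in value):

```python
import hashlib
import hmac
import unicodedata

key = b"secretToken"  # stand-in key, not a real secret

def mac(s: str) -> str:
    # HMAC over the UTF-8 bytes of the string, as both platforms do.
    return hmac.new(key, s.encode("utf8"), hashlib.sha256).hexdigest()

composed   = "\u00FC"   # "ü" as a single code point (NFC)
decomposed = "u\u0308"  # "u" + COMBINING DIAERESIS (NFD)

# Logically the same character, but different bytes -> different HMACs.
assert mac(composed) != mac(decomposed)

# Fix: normalize to one form (e.g. NFC) on both sides before hashing.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert mac(nfc(composed)) == mac(nfc(decomposed))
```

In Swift the equivalent normalization is `precomposedStringWithCanonicalMapping`; in Node it is `String.prototype.normalize('NFC')`.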
Related
I have a String which contains some values encoded in some way, like Base64.
The problem is that I don't really know if it's actually Base64 (it contains A-Z, a-z, 0-9, +, /), so it could be some other encoding I'm not familiar with.
Is there a way, or some online site, where I can submit an encoded input and have it tell me which encoding it is?
NOTE:
I'm not asking how to know if my String is UTF-8 or ISO-8859-1 or something like that.
What I need to know is which scheme my string is encoded with.
EDIT:
To be more clear,
I need something to get an input like: 23Nzi4lUE4qlc+Pmc3blWMS1Irmgo3i8UTQHhoL7VyzqpEV/i9bDhoiteZ0a7/TqcVSkrXR89V2Yj7tEFDGJx4gvWEBs= this is the encoded String that I have.
The output should be the type of the encoding and its decoded value, like:
Base64 -> "Big yellow fish is swimming in the tube."
Maybe there is some program which gets an input and tries to decode it with a list of encoding types (Base64, etc.). The output doesn't really matter, because it's the user's decision whether it's good or not.
This site handles base64 de/encoding.
Since Base64 is just one instance of a class of encoding schemes (specifically, encoding a bit stream as a base-n number), you will probably never fare better than testing for a couple of standard encoding schemes.
You either check the well-formedness of the encoding scheme or try to decode without getting an error thrown using a web service or your own code.
In (possibly pathological) cases there will be more than one encoding scheme for which a given octet stream will successfully decode.
Best practice would be to redirect the effort invested in setting up the verification into committing the data provider to one (or a few) encodings up front (won't always be possible, of course).
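A rough sketch of the try-to-decode approach in Python (the candidate list and the function name are illustrative, not exhaustive):

```python
import base64
import binascii

def guess_encodings(s: str) -> dict:
    """Try a few standard schemes; return those that decode cleanly."""
    candidates = {
        "base64": lambda t: base64.b64decode(t, validate=True),
        "base32": lambda t: base64.b32decode(t),
        "hex":    lambda t: bytes.fromhex(t),
    }
    hits = {}
    for name, decode in candidates.items():
        try:
            hits[name] = decode(s)
        except (binascii.Error, ValueError):
            pass  # malformed for this scheme; try the next one
    return hits

print(guess_encodings("aGVsbG8="))    # decodes only as base64
print(guess_encodings("68656c6c6f"))  # decodes only as hex
```

As noted above, some inputs will decode successfully under more than one scheme, so the result is a set of candidates, not a definitive answer.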
I'm working on an application that eventually reads and prints arbitrary and untrustable Unicode characters to the screen.
There are a number of ways to wreak havoc using Unicode strings, and I would like my program to behave correctly for "dangerous" strings. For instance, the RTL override character will make strings look like they're backwards.
Since the audience is mostly programmers, my solution would be to, first, get the canonical composition (NFC) form of the string, and then replace anything that's not a printable character on its own with its Unicode code point in the form \uXXXXXX. (The intent is not to have a perfectly accurate representation of the string, but a mostly good one. The full string data is still available.)
My problem, then, is determining what's an actual printable character and what's a non-printable character. Swift has a Character class, but contrary to, say, Java's Character class, the Swift one doesn't seem to have any method to find out the classification of a character.
How could I carry out that plan? Is there anything else I should consider?
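The question is about Swift, but the classification logic is the same in any language with access to Unicode general categories; here is a sketch in Python (the function name and the exact set of escaped categories are my choices):

```python
import unicodedata

def sanitize(s: str) -> str:
    """NFC-normalize, then escape anything that isn't safely printable."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        cat = unicodedata.category(ch)
        # C* = control/format characters (includes RTL override U+202E);
        # Z* other than a plain space = exotic separators. Escape those.
        if cat.startswith("C") or (cat.startswith("Z") and ch != " "):
            out.append("\\u{:06X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

print(sanitize("abc\u202Edef"))  # the RTL override becomes \u00202E
```

In Swift you can reach the same categories through `Unicode.Scalar.Properties.generalCategory`.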
Yesterday I was confused by output of FM SSFC_PARSE_CERTIFICATE. It serves for decoding fields of X.509 certificate into readable format.
Everything is OK for latin symbols, but cyrillic letters are turned into something like \u041F\u0440\u0438\u0432\u0435\u0442.
Besides, if the original text contains mixed symbols, i.e. Latin, non-Latin, spaces and digits, the task becomes even more complex: Hello! \u041F\u0440\u0438\u0432\u0435\u0442 1234.
I wrote some code myself to scan the string character by character and decode single entities using CL_ABAP_CONV_IN_CE=>UCCP, and it seems to work well, but I'd like to know if there is a standard way to achieve the same result?
Well, it seems like in your input xstring all non-Latin char codes have been escaped instead of being encoded in UTF-8. So if you're not satisfied with your DIY solution, you should work upstream of the call to FM SSFC_PARSE_CERTIFICATE.
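For comparison, the unescaping step itself is a simple substitution; a sketch in Python rather than ABAP (the function name is mine):

```python
import re

def unescape(s: str) -> str:
    # Replace each \uXXXX escape with the corresponding character,
    # leaving the plain-text parts untouched.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

print(unescape(r"Hello! \u041F\u0440\u0438\u0432\u0435\u0442 1234"))
```

This handles the mixed-content case directly, since the regex only touches the escaped entities.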
I have a Unicode string I'm retrieving from a web service in python.
I need to access a URL I've parsed from this string, that includes various diacritics.
However, if I pass the unicode string to urllib2, it produces a unicode encoding error. The exact same string, as a "raw" string r"some string", works properly.
How can I get the raw binary representation of a unicode string in python, without converting it to the system locale?
I've been through the Python docs, and everything seems to come back to the codecs module. However, the documentation for the codecs module is sparse at best, and the whole thing seems to be extremely file-oriented.
I'm on windows, if it's important.
You need to encode the URL from unicode to a bytestring. u'' and r'' produce two different kinds of objects: a unicode string and a bytestring.
You can encode a unicode string to a bytestring with the .encode() method, but you need to know which encoding to use. Usually, for URLs, UTF-8 is great, but you also need to escape the bytes to fit the URL scheme:
import urlparse, urllib
parts = list(urlparse.urlsplit(url))
parts[2] = urllib.quote(parts[2].encode('utf8'))
url = urlparse.urlunsplit(parts)
The above example is based on an educated guess that the problem you are facing is due to non-ASCII characters in the path part of the URL, but without further details from you it has to remain a guess.
For domain names, you need to apply the IDNA RFC3490 encoding:
parts = list(urlparse.urlsplit(url))
parts[1] = parts[1].encode('idna')
parts = [p.encode('utf8') if isinstance(p, unicode) else p for p in parts]
url = urlparse.urlunsplit(parts)
See the Python Unicode HOWTO for more information. I also strongly recommend you read the Joel on Software Unicode article as a good primer on the subject of encodings.
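The snippets above are Python 2. For later readers, a rough Python 3 equivalent under the same assumption (non-ASCII characters in the host and path; the function name is mine):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc.encode("idna").decode("ascii"),  # IDNA for the host
        quote(parts.path.encode("utf8")),             # percent-encode the path
        parts.query,
        parts.fragment,
    ))

print(encode_url("http://exämple.com/päth"))
```

In Python 3 there is no separate unicode type, so the blanket `isinstance(p, unicode)` pass is no longer needed.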
I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. With efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible, as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, stays within 7 bits, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
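A sketch of that hex-plus-compression idea in Python (zlib is an arbitrary choice of compressor; the helper names are mine):

```python
import binascii
import zlib

def to_ascii7(s: str) -> str:
    """Compress, then hex-encode: ASCII-only, case-insensitive, reversible."""
    return binascii.hexlify(zlib.compress(s.encode("utf8"))).decode("ascii")

def from_ascii7(t: str) -> str:
    # Lowercasing first makes the scheme tolerant of case-mangling.
    return zlib.decompress(binascii.unhexlify(t.lower())).decode("utf8")

s = "Héllo wörld \U0010FFFF"
assert from_ascii7(to_ascii7(s).upper()) == s  # survives case changes
```

Hex doubles the size of the compressed payload, which is exactly the compactness tradeoff discussed above.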
If you're talking about non-standard schemes - MECE
URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
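A quick way to see that tradeoff (the exact sizes depend on the input strings, which here are made up):

```python
import base64
import quopri

ascii_heavy = "Hello, wörld!".encode("utf8")  # mostly ASCII
non_ascii   = ("äöü" * 4).encode("utf8")      # mostly non-ASCII

# Quoted-printable wins on mostly-ASCII input...
assert len(quopri.encodestring(ascii_heavy)) < len(base64.b64encode(ascii_heavy))
# ...while Base64 wins once non-ASCII characters dominate.
assert len(quopri.encodestring(non_ascii)) > len(base64.b64encode(non_ascii))
```

Each non-ASCII UTF-8 byte costs three quoted-printable characters (e.g. "ö" becomes "=C3=B6"), whereas Base64 costs a flat 4/3 regardless of content.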
Punycode is used for IDNA, but you can use it outside the restrictions IDNA imposes.
Per se, Punycode doesn't fail your last two requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for IDNA, Python supplies another, homonymous encoding)
Obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case), you should be good to go.
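A sketch of that lowercase-first convention (the helper names are mine, and lowercasing obviously loses the original case, per the caveat above):

```python
def to_ascii(s: str) -> str:
    # Lowercase up front so case-mangling of the ASCII result is harmless.
    return s.lower().encode("punycode").decode("ascii")

def from_ascii(t: str) -> str:
    # Lowercase again before decoding, undoing any mangling in transit.
    return t.lower().encode("ascii").decode("punycode")

s = "grüße aus köln"
assert from_ascii(to_ascii(s).upper()) == s  # round-trips despite .upper()
```

The Punycode output contains only ASCII letters, digits, and the delimiter, so lowercasing the transported string restores exactly what to_ascii produced.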