Determine the encoding from byte array

I was wondering if there's any solution to get the encoding of a byte array? I have an application which accepts one parameter and returns an array of bytes, suppose I have this array of bytes:
ED 3F 3F 3F ED 15 3F 3F
I also know that it's the byte array of the string سلام, but I don't know how that application encoded the string, and I need to know which encoding the application uses to convert the string to a byte array.
Is there any solution?

Reverse-engineering the application that encodes the string to a byte array, to find out the encoding scheme, is the only solution I can think of (if that is an option).
A generic method for doing this is highly unrealistic, in my opinion.
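When a sample plaintext and its bytes are both known, as in the question, one pragmatic check is to encode the known string with a list of candidate encodings and see whether any of them reproduces the bytes. The following is only a sketch under assumptions: the function name `matching_encodings` and the candidate list are made up for illustration, and an empty result just means that none of the listed codecs produced those bytes, which can happen if the application uses a custom or lossy scheme.

```python
# Hypothetical candidate list; extend it with whatever encodings seem plausible.
CANDIDATES = [
    "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",
    "cp1256", "iso8859_6", "cp720", "cp864", "mac_arabic",
]


def matching_encodings(text, data, candidates=CANDIDATES):
    """Return the candidate codec names that encode `text` to exactly `data`."""
    matches = []
    for name in candidates:
        try:
            encoded = text.encode(name)
        except (UnicodeEncodeError, LookupError):
            # The codec cannot represent the text, or is not available here.
            continue
        if encoded == data:
            matches.append(name)
    return matches


known_text = "\u0633\u0644\u0627\u0645"  # the string سلام from the question
observed = bytes.fromhex("ED3F3F3FED153F3F")  # the bytes from the question
print(matching_encodings(known_text, observed))
```

This does not replace reverse-engineering; it only rules standard encodings in or out quickly.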

Related

How to identify encoding from hex values?

I have text on a website that displays like that: o¨ instead of ö
I extracted the text out of the CMS and analysed its hex values:
the ö's that are displayed correctly have c3 b6 - UTF-8
the ö's that are displayed incorrectly have 6f cc 88
I couldn't find out what encoding this is. What's a good way to identify the encoding?
6F is the UTF-8 (ASCII) encoding of "o", nothing spectacular.
CC 88 is the UTF-8 encoding of U+0308, COMBINING DIAERESIS.
You're simply looking at the decomposed form of the o-umlaut. A combining diaeresis character should visually be rendered, well, combined with the previous character. If your system doesn't do that, it means it doesn't treat Unicode correctly, and/or the font you have chosen is somewhat broken. Perhaps you have to normalise your strings into the composed Unicode form instead for your system to handle it correctly.
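As a minimal sketch of that normalisation step, assuming the text can be processed in Python, the standard `unicodedata` module can convert the decomposed sequence into the composed form (NFC). The variable names are illustrative only.

```python
import unicodedata

# The "broken" ö from the question: "o" followed by U+0308 COMBINING DIAERESIS.
decomposed = bytes.fromhex("6fcc88").decode("utf-8")

# NFC composes base character + combining mark into a single code point.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed.encode("utf-8").hex())  # c3b6
```

After normalisation the bytes match the ö that the question reports as displaying correctly.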

What character encoding is c3 82 c2 bf?

I have a source of text data that includes the byte sequence c3 82 c2 bf. In context I think it's supposed to be a capital Greek Phi symbol (Φ).
Anyway I can't figure out what encoding is being used; I'm writing a Python script to process this data into a database that expects Unicode, and it throws an exception on this particular sequence of data.
Any suggestions on how to handle it?
Interpreted as UTF-8, c3 82 is “Â” U+00C2 and c2 bf is “¿” U+00BF, which does not make much sense, but it’s technically valid UTF-8 data, so it should not be reported as a character-level data error. Interpreted as UTF-16, it’s Hangul syllables and possibly a CJK ideograph, depending on endianness, but still formally valid data, though most probably not what was meant.
This sounds like the result of double conversion, but it’s difficult to make educated guesses. If it stands for Φ, then the UTF-16 form is 03 A6 or A6 03 and the UTF-8 form is CE A6, which don’t really resemble the actual data. Information about the origin of the data might help in guessing what transcodings may have happened.
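To make the double-conversion idea concrete, here is a small Python sketch. It decodes the bytes as UTF-8, then undoes one hypothetical Latin-1 to UTF-8 pass. Whether this is what actually happened to the data is an assumption; the sketch only shows that the bytes are consistent with a doubly encoded “¿”, and that they do not match any common encoding of Φ.

```python
data = bytes.fromhex("c382c2bf")

# Step 1: the bytes are valid UTF-8 and decode to two characters.
once = data.decode("utf-8")
print([f"U+{ord(ch):04X}" for ch in once])  # ['U+00C2', 'U+00BF']

# Step 2 (assumption): treat those characters as Latin-1 bytes that were
# wrongly re-encoded, and decode the result as UTF-8 again.
undone = once.encode("latin-1")
print(undone.hex())                         # c2bf
print(f"U+{ord(undone.decode('utf-8')):04X}")  # U+00BF

# For comparison, the encodings of Φ mentioned above.
phi = "\u03a6"
print(phi.encode("utf-8").hex(), phi.encode("utf-16-be").hex())  # cea6 03a6
```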
It's probably a double conversion of the Ñ character.
The Ñ character in UTF-8 is 0xc391.
If you convert the Ñ character from LATIN-1 to UTF-8 when it is already encoded in UTF-8, you'll get 0xc382c2bf.
Why?
0xc382 is the UTF-8 translation of the LATIN-1 0xc3 character à (A with tilde).
0xc2bf is the ¿ character, which is what you get when you can't convert a character from LATIN-1 (0x91 is an invalid character in LATIN-1).
FWIW, I ended up with c3 82 c2 bf from . I did not dig into the transformations because I was able to simply throw that part of the code away. Suffice it to say that was in an html email template that was processed by a wordpress (php) plugin.
I don't know the reason, but here is a possible scenario.
A byte of the binary form 10xxxxxx is converted to 0xC2 followed by 10xxxxxx.
A byte of the binary form 11xxxxxx is converted to 0xC3 followed by 10xxxxxx.
So lots of c2 and c3 bytes are added.
Where does this happen? If you send non-ASCII characters in a URL query string for an AJAX call, the Flask server will do this.
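The rule described above is what you get when each byte is treated as a Latin-1 character and re-encoded as UTF-8. A short Python sketch under that assumption follows; the helper name `latin1_byte_to_utf8` is hypothetical.

```python
def latin1_byte_to_utf8(b):
    """Treat a single byte value as a Latin-1 character and encode it as UTF-8."""
    return bytes([b]).decode("latin-1").encode("utf-8")


# Bytes 0x80-0xBF gain a c2 prefix; bytes 0xC0-0xFF become c3 plus (byte - 0x40).
for b in (0x80, 0xA0, 0xBF, 0xC0, 0xE4, 0xFF):
    print(f"{b:02x} -> {latin1_byte_to_utf8(b).hex()}")
```

Every non-ASCII byte therefore turns into two bytes starting with c2 or c3, which matches the pattern of extra c2 and c3 bytes described in this answer.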
I received the character \xc3\x82 from an external UTF-16 document after converting it to UTF-8 using $str = mb_convert_encoding($content, "UTF-8" , "UTF-16LE"); (PHP)
The original sequence was 0xA0 0x00, and the converter probably converted it to what it was meant to be, an NBSP; it was the thousands separator in a currency amount. NBSP is \xc2\xa0 in UTF-8, so right now I strip the thousands separator with:
$price = str_replace(["\xc2\xa0","\xc3\x82"], '', $price);
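For comparison, here is a rough Python sketch of the same idea. It checks that the UTF-16LE sequence A0 00 is a no-break space, and strips it after decoding rather than matching raw bytes. The sample price string is invented for illustration.

```python
# A0 00 in UTF-16LE is U+00A0 NO-BREAK SPACE, which is c2 a0 in UTF-8.
nbsp = bytes.fromhex("a000").decode("utf-16-le")
print(f"U+{ord(nbsp):04X}", nbsp.encode("utf-8").hex())  # U+00A0 c2a0

# Hypothetical input: a price that uses NBSP as the thousands separator.
raw = "1\u00a0234,50".encode("utf-16-le")
price = raw.decode("utf-16-le").replace("\u00a0", "")
print(price)  # 1234,50
```

Working on decoded text means the replacement targets the character itself, independent of how it happens to be encoded.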

NSString unicode encoding problem

I'm having problems converting the string to something readable. I'm using
NSString *substring = [NSString stringWithUTF8String:[symbol.data cStringUsingEncoding:NSUTF8StringEncoding]];
but I can't convert \U7ab6\U51b1 into '.
It shows as 窶冱, which is not what I want; it should show as an '. Can anyone help me?
it is shown as a ’
That's character U+2019 RIGHT SINGLE QUOTATION MARK.
What has happened is you've had the character sequence ’s submitted to you, in the UTF-8 encoding, which comes out as bytes:
’ → E2 80 99
s → 73
That byte sequence has then, incorrectly, been interpreted as if it were encoded in Windows code page 932 (Japanese; more or less Shift-JIS):
E2 80 → 窶
99 73 → 冱
So in this one particular case, you could recover the ’s string by firstly encoding the characters into cp932 bytes, and then decoding those bytes back to characters using UTF-8.
However, this will not solve your real problem, which is that the strings were read in incorrectly in the first place. You got 窶冱 in this case because the UTF-8 byte sequence resulting from encoding ’s happened also to be a valid Shift-JIS byte sequence. But that won't be the case for all possible UTF-8 byte sequences you might get. Many other characters will be unrecoverably mangled.
You need to find where bytes are being read into the system and decoded as Shift-JIS, and fix that to use UTF-8 instead.
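The round trip described in this answer can be reproduced in Python, which ships a `cp932` codec. This is only an illustration of the mechanism, not a fix for the Objective-C code, and as noted above the recovery works only for byte sequences that happen to be valid in both encodings.

```python
original = "\u2019s"  # ’s

# What went wrong: UTF-8 bytes decoded as if they were code page 932.
mojibake = original.encode("utf-8").decode("cp932")
print([f"U+{ord(ch):04X}" for ch in mojibake])  # ['U+7AB6', 'U+51B1']

# One-off recovery: re-encode as cp932, then decode the bytes as UTF-8.
recovered = mojibake.encode("cp932").decode("utf-8")
print(recovered == original)  # True
```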

Character conversion in SAXParser

I have a very peculiar problem; could you please guide me?
Original message: Kevätsunnuntaisin lentää
The flow of data is HttpConnector -> WSDLConnector -> to the underlying system
The following is the encoding of the first 7 characters
4b 65 76 c3 a4 74 73 75 – In Http Connector – the request XML has UTF-8 encoding
4b 65 76 a3 74 73 75 – in WSDL Connector -
InputSource inputSource = new InputSource(myInputStream);
inputSource.setEncoding("UTF-8");
parser.parse(inputSource);
The original string gets converted to Kev£tsunnuntaisin lent££. Also, there is a loss of a byte.
Could you please guide me where I am going wrong? What must I do to avoid this character conversion?
Thanks for your help!!!
This is very simple: The data in myInputStream is not encoded as UTF-8, hence the decoding fails.
My guess is that you save the output of the HTTP connector as a string and then use that as the input for the WSDL connector. In the string, the data is Unicode, not UTF-8. Use String.getBytes("UTF-8") to get an array of bytes with the correct encoding.
As for all encoding issues: Always tell the computer with which encoding it should work instead of hoping that it will guess correctly. Bytes have no encoding and the computer is not telepathic :) And I hope it never will be ...
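The same principle, stated in Python rather than Java purely for illustration: a string has no byte representation until you choose an encoding, and decoding with a different encoding than the one used for encoding changes the text. The sample below uses the first characters of the message from the question.

```python
text = "Kev\u00e4tsu"  # "Kevätsu", the start of the original message

# Encoding explicitly as UTF-8 gives the bytes seen in the HTTP connector.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex())  # 4b6576c3a4747375

# The same text in Latin-1 is one byte shorter, because ä is a single byte.
print(text.encode("latin-1").hex())  # 4b6576e4747375

# Decoding with the matching encoding round-trips; a mismatch does not.
print(utf8_bytes.decode("utf-8") == text)  # True
print(utf8_bytes.decode("latin-1"))        # KevÃ¤tsu
```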

What is this Base64 Look-alike?

I am new to decoding techniques and have just learnt about base64, sha-1, md5 and a few others yesterday.
I have been trying to figure out what "orkut" worms actually contain.
I was attacked by many orkut spammers and hackers in the past few days, and there is a similarity in the URLs that they send to us.
I don't know what information it contains but I need to figure it out.
The problem lies in the following texts:
Foo+bZGMiDsstRKVgpjhlfxMVpM=
lmKpr4+L6caaXii9iokloJ1A4xQ=
The encoding above appears to be base64 but it is not, because whenever I try to decode it using online base64 decoders, I get raw output and it doesn't decode accurately.
Maybe some other code has been mixed with base64.
Can anyone please help me to decode it?
It's part of an orkut worm. This page has some details. Notice it mentions the JSHDF["Page.signature.raw"] variable you're finding these strings in.
It's a SHA1-hash of the page it was found on. This page shows the decoded form of it.
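A quick sanity check of the SHA-1 claim, as a sketch: decoding each full string (including the part before the `+`, since `+` is itself a Base64 character) gives 20 bytes, which is the SHA-1 digest size. A matching length is consistent with a SHA-1 hash but does not prove it.

```python
import base64
import hashlib

samples = [
    "Foo+bZGMiDsstRKVgpjhlfxMVpM=",
    "lmKpr4+L6caaXii9iokloJ1A4xQ=",
]

decoded_lengths = [len(base64.b64decode(s)) for s in samples]
print(decoded_lengths)               # [20, 20]
print(hashlib.sha1().digest_size)    # 20
```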
"The encoding above appears to be base64 but it is not, because whenever I try to decode it using online base64 decoders, I get raw output and it doesn't decode accurately."
What makes you think that the decoding is incorrect? Typically you'd base64 or hex encode binary content so that it can be transported as text. You wouldn't base64 encode text so it isn't surprising that decoding the strings you've provided above results in ASCII gobbledygook.
Haha, if it was that easy, it would not be worth a hack! You have to try a lot harder than just simply decoding it once.
They could be merely hashes.
If they are hashes, "reversing" them is algorithmically impossible if the original content is over a certain size, because beyond that source data size, hashing becomes a lossy compression function.
Oftentimes Foo+whatever is the result of a salted hash. It is common to store hash results with the salt, and the salt can be stored in the clear. To separate the salt from the actual hash value, a + sign is commonly used.
Base64 is used so that the binary result of the hash can be stored as text. You can tell that the last part of those strings might be valid Base64 because the length of padded Base64 content is always a multiple of 4. It outputs 4 ASCII characters for every 3 bytes of input, and pads the end with "=" signs.
So, for Foo+bZGMiDsstRKVgpjhlfxMVpM=, this may be the result of taking some input, be it a message of some sort, or whatever, and applying the salt "Foo", and then hashing the result. The string value bZGMiDsstRKVgpjhlfxMVpM= likely is the binary result of some hash function. An online Base64 decoder shows that the value, in Hex, instead of Base64, is { 6D 91 8C 88 3B 2C B5 12 95 82 98 E1 95 FC 4C 56 93 }. Yes, this is not ASCII text.
Base64, binary, hexadecimal, decimal, are all ways of representing values. Think of the part after the + as just a number. The above 136-bit number may be the result of a 128-bit hash, and an 8-bit CRC, for example. Who knows? I don't know why you're getting spammed, or why these spam messages have these strings attached to them, but this may be some insight into the nature of the structure of the strings.
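The hex value quoted in this answer can be reproduced with Python's standard `base64` module. The sketch decodes only the part after the `+`, following this answer's reading of the string; that split is the answer's assumption, not an established fact about the format.

```python
import base64

suffix = "bZGMiDsstRKVgpjhlfxMVpM="

raw = base64.b64decode(suffix)
print(raw.hex())         # 6d918c883b2cb512958298e195fc4c5693
print(len(raw) * 8)      # 136 (bits)
print(len(suffix) % 4)   # 0, as expected for padded Base64
```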