I am encoding a string to base64 encoded data.
Edit: removed irrelevant base64 conversion code
Would there be any problem when I try to encode mixed English and Arabic data, given that we are using
base64Data = [string dataUsingEncoding:NSASCIIStringEncoding];
I heard that NSASCIIStringEncoding should not be used with strings that contain non-ASCII (Unicode) characters.
Base64 encodes data (raw bytes) and produces ASCII encoded strings. So your problem is in converting your string into an encoded byte array.
You could use any encoding that covers both Arabic and English characters. But you have to make sure the recipient of the base64 encoded message knows which encoding was used and can decode it.
UTF-8 is a good starting point.
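To make the point concrete, here is a minimal sketch in Python rather than Objective-C (the sample text and variable names are made up for illustration). It shows a strict ASCII conversion failing on mixed English and Arabic text, while UTF-8 followed by base64 round-trips cleanly. In Objective-C the analogous change would presumably be passing NSUTF8StringEncoding instead of NSASCIIStringEncoding.

```python
import base64

# Illustrative sample: mixed English and Arabic text.
text = "Hello مرحبا"

# ASCII cannot represent the Arabic characters, so a strict conversion fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode character, so this step succeeds.
utf8_bytes = text.encode("utf-8")

# Base64 turns the raw bytes into plain ASCII text.
encoded = base64.b64encode(utf8_bytes).decode("ascii")

# The recipient reverses both steps, using the same agreed encoding.
decoded = base64.b64decode(encoded).decode("utf-8")
```

The key design point is that the encoding (UTF-8 here) is part of the contract between sender and recipient; base64 itself carries no information about it.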
Related
I have the following base64 encoded strings with some bizarre pattern.
0069a1/jE0MgjRAQaqv3N9cXpzf01WmZ9OUUJZrqx0Yndme2JXRZSVSEJPSLuCdX5ncXF9TE2PnEhPMHq6sWJ0ZWBrcUdMlI8wLDR6urFidHJmd2NLR5WUX31CSa27fHR5e2zSsKsxabDhvpweRqrRjJfSirjoaUbhubTmRmfRibDUipHosGAxuIjmvnkeiZTUjIISOSgxaZfhvrAeRrYRBhTSi4joaXDhuIHmRmo8O9SMmdKxgDFps+G+gR5GsNGMhdKKkehpfOG5pOZHdtGJhtSKteiwSDG4iOa+Ux6JkxQCEtKwgjFpsOG/jh5GltGNttKKqehpViHhv4YeRpDRjIPSi4DoaXzhuafmRkvRiLDUirDosVgxuarmvl0eiYbUjJ/SsJMxaabhvpLz9B4bGhoc0rCCMWm24b+OHkac0Yye0oq66Glw4bm35kZzEdGMqtKKuuhpcuG5tOZHftGJv9SKp+iwYzG5oOa+Ux6JlBQBAgIo6Glw4bmy5kZb0YmR1IqG6LBQMbmi5r5PHomQFBzSiqroaWPhua3mRlvRiYXUirPosVAxuarmv3YeibbUjLjSsLoxaZThvrLX8zvRjJ/Si4HoaWPhuafmR33Rib7Ui7vosVExuYDmvkoeiagUDBLSsYgxaaDhv4EeRqXRjKHSi4DoaXzhuaHmRk/Ria0U0ouI6GlP4bml5kZM0YmE1Iqf6LBjMbmJ5r9+Homq1I210rCRMWmf4b6yHkaY0Yyh0ouA6Gl84bmT5kZP0YmuSHZ9W02En0hVRUm6u28AAA==t=
0069a1M83GVi8zMwVnco+Bg4gXG2pxe318Y0FaY2GInoWUHwZwYnZ3enBMS3ZPiYKVgxUZa2ptfnp9M3l3fJ6Il5IPFWBrdm0CHjd5d3yeiICUEwdsYHd2bU9BSmB2gIiLiQi2l4zTi4LTvZ/Ti1YtfmW27p/Pi6TTi7fli6otdUIm7vXPl4LTirrlvbTTdWgmfuZ2Hg/Ti6XTvbPTi0rt9Oa276/Pi5LTioLli6fAxyZ+/baWp9OLgdO9gtOLTC1+d7buts+LntOLp+WKuy11dCbu0c+XqtOKuuW9ntN1b+bwdraXpdOLgtO8jdOLai1/RLbujs+LtBPTvIXTi2wtfnG276fPi57Ti6Tli4YtdEIm7tTPlrrTi5jlvZDTdXomfvu2l7TTi5TTvZE+OeLn6Oh4tpel04uE07yN04tgLX5stu6dz4uS04u05Yu+7S1+WLbunc+LkNOLt+WKsy11TSbuw8+XgdOLkuW9ntN1aObzZmYPz4uS04ux5YuWLXVjJu7iz5ey04uQ5b2C03Vs5u627o3Pi4HTi67li5YtdXcm7tfPlrLTi5jlvLvTdUomfty2l53Ti6bTvbEaPsctfm2276bPi4HTi6TlirAtdUwm79/PlrPTi7LlvYfTdVTm/na2lq/Ti5LTvILTi1ktflO276fPi57Ti6Lli4ItdV/mtu+vz4ut04um5YuBLXV2Ju77z5eB04u75byz03VWJn/Rtpe204ut072x04tkLX5Ttu+nz4ue04uQ5YuCLXVcuhIZfGpmfXpnRkp3dpP88g==t=
0069a1kA0HZTAANvPE0U9BQkkkKHVuSE55ZreswMJIXkRVLDVvfUVEf3W6vdXsSUJUQiYqdHVeTX94xY/U315IVlM8Jn90RV4HG8GP1N9eSEFVIDRzf0RFaEq3vMPVQEhKSDuFiJPguIfWS2lwKJbtv6SF3YDQuJfWjkETKAnttYPn3cbQiLHgj78TSxdwtajnv9VFARDguKDWS0VwKIotNSeF3LDQuKHWj3QTKAQAB+e/zoWJuOC4hNZLdHAojO2/toXdqdC4rdaOURMpGO21tefd4tCImeCPvxNLPXC1rycxRYWIuuC4h9ZKe3Aoqu2+hYXdkdC4hxbWSnNwKKztv7CF3LjQuK3WjlITKCXttIPn3efQiYngjp0TSzNwtbrnv8iFiKvguJHWS2edmiInKSlLhYi64LiB1kp7cCig7b+thd2C0Lih1o5CEygdLe2/mYXdgtC4o9aOQRMpEO21jOfd8NCIsuCOlxNLPXC1qCcyVVUQ0Lih1o5HEyg17bWi593R0IiB4I6VE0shcLWsJy+F3ZLQuLLWjlgTKDXttbbn3eTQiYHgjp0TShhwtYrnv++FiILguKPWS0e5nQftv6yF3LnQuLLWjlITKRPttY3n3OzQiYDgjrcTSyRwtZQnP0WFibDguJfWSnRwKJntv5KF3LjQuK3WjlQTKCHttZ4nhdyw0Lie1o5QEygi7bW3593I0Iiy4I6+E0oQcLWW577ihYip4Lio1ktHcCik7b+Shdy40Lit1o5mEygh7bWdeyEqY3VVTn9isLzU1VM8Mw==t=
0094dfBQcJXQfMLwJRREVLTEccEEJZhIJgf0ZdVVdCVEpbFA1YSomIZmxLTEB5Q0haTB4SQ0KSgWZhNH5BSlRCWF0EHkhDiZIeAjB+QUpUQk9bGAxESIiJcVNGTVZASkJERgO9v6QsdJ7PupjlvZznsaq95bfndFvPl7DivZznv43p5f7nv30slqbiuoLlv6Lpse19NicsdLnPurTlvYAnOym95IfndG3PloXivZEKDemx9r2+jyx0nc+6heW9huexuL3lnud0Yc+XoOK8jee/u+nl2ue/VSyWpuK6qOW/pSk/fb2/jSx0ns+7iuW9oOewi73lpud0Sw/Pu4Llvabnsb695I/ndGHPl6PivbDnvo3p5d/nvkUsl4Tiuqblv7DpsfC9v5wsdIjPupYIDygtJydzvb+NLHSYz7uK5b2q57GjveW153Rtz5ez4r2IJ+exl73lted0b8+XsOK8hee/gunlyOe/fiyXjuK6qOW/oik8bW0n53Rtz5e24r2g57+s6eXp579NLJeM4rq05b+mKSG95aXndH7Pl6nivaDnv7jp5dznvk0sl4Tiu43lv4Dpsde9v7UsdLrPurYsCA3nsaK95I7ndH7Pl6PivIbnv4Pp5NTnvkwsl67iurHlv54pMX29vocsdI7Pu4XlvZPnsZy95I/ndGHPl6XivbTnv5ApveSH53RSz5eh4r2357+56eXw579+LJen4ruF5b+c6bDavb+eLHSxz7q25b2u57GcveSP53Rhz5eX4r2057+TdRkSVEKZgmZ7QU1BQFk2PQ==J=
0094df/M8GDV7OBDaovY2DQ0hMQBsAhoBLVHJprK6KnEVURF0BE4uKTUd/eLmAi4BVQ05CGhuQg01KAEq4s5yKV1JUThEai5A2eHJ5r7mJnUNXTkIaG5CyQEtlc7G5goBY5rWuvuZ/Lrye1o5nHHds5r697ebJLna25I6vHERLL76l7bXvvndH5Lyx1kRZL3e2Jj8tvudOLryl1o97HHdbCwzttfW+d0bkvITWRHsvd4fmtby+5lcuvKnWjl4cdkfmvr/t5tkudp3kj78cRGIvvqQtO36+dkTkvIfWRXQvd6HmtI++5m8uvIMW1kV8L3en5rW6vudGLryp1o5dHHd65r+J7ebcLneN5I6dHERsL76x7bXzvnZV5LyR1kRowsUpLCMjcL52ROS8gdZFdC93q+a1p77mfC68pdaOTRx3QibmtZO+5nwuvKfWjk4cdk/mvobt5ssudrbkjpccRGIvvqMtOG5u7i68pdaOSBx3aua+qO3m6i52heSOlRxEfi++py0lvuZsLry21o5XHHdq5r687ebfLneF5I6dHEVHL76B7bXUvnZ85Lyj1kRI5sIM5rWmvudHLry21o5dHHZM5r6H7efXLneE5I63HER7L76fLTV+vndO5LyX1kV7L3eS5rWYvudGLryp1o5bHHd+5r6ULb7nTi68mtaOXxx3fea+ve3m8y52tuSOvhxFTy++ne202b52V+S8qNZESC93r+a1mL7nRi68qdaOaRx3fua+l3EaEZ2LUUp/Yr+zi4pYNzk=J=
I tried cutting the parts that are not valid base64 out and then decoding the rest. I tried decoding with different character sets, and it still returns unreadable text. I'm not sure how the previous developer encoded this data. All I know is that he encoded it with two layers of base64 encoding and that the result should be readable Thai text.
Can anyone see a pattern that could help me identify and decode these strings?
Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
If Titles in Plone are already UTF-8 encoded, the string is a unicode string and the underscore function is handled by i18ndude, I do not see a reason why we specifically need to decode utf-8. Usually I forget to add it and remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?
UTF-8 is a byte representation of Unicode text; it is not Unicode itself, and a UTF-8 encoded byte string is not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and byte representations of Unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python will automatically convert a string to unicode if it needs to by using the ASCII decoder. That will fail if there are non-ASCII characters in the string -- because, for example, it is encoded in UTF-8.
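As a rough illustration of the difference, here is a small sketch written for Python 3, where the byte/text split is explicit (the question above concerns Python 2, where the ASCII decode happens implicitly). The title value is hypothetical.

```python
# Hypothetical title as it might be stored: UTF-8 encoded bytes.
title_bytes = "Café résumé".encode("utf-8")

# Decoding explicitly with the right codec gives a text (unicode) string
# that can be combined safely with other text.
title_text = title_bytes.decode("utf-8")
message = "This document (%s) has already been uploaded." % title_text

# Decoding the same bytes with the ASCII codec, which is what Python 2
# attempts implicitly when mixing byte strings and unicode, fails as soon
# as a non-ASCII byte appears.
try:
    title_bytes.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False
```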
In base64, what happens if the character you want to encode isn't A-Z, a-z, 0-9, + or /?
If I wanted to encode a URL in base64 which has a colon (:) in it, what would happen, since it's not in the base64 index?
You're mixing up the encode and decode sides. Base64 can encode any character. It's only decoding that requires a limited set.
You can encode any byte sequence into base64. The resulting characters will all be in the allowed 64 characters. And of course, when decoding, the encoded text must be valid Base64.
Saying it can encode any character is a bit misleading, since the characters first need to be encoded into bytes. A character and a byte are only equivalent for a few charsets like ASCII.
I think you're mixing things up - Base64 can encode ANYTHING, those limits simply define what the actual encoded string looks like.
So, nothing would happen if you encoded a colon in Base64. If you tried decoding a colon, however, it would most likely throw an error.
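A quick sketch of both directions, using Python's standard base64 module (the URL is a made-up example). Note that Python's b64decode silently discards characters outside the alphabet by default, so the sketch passes validate=True to get the strict behaviour described above.

```python
import base64
import binascii

# Hypothetical URL containing colons.
url = "http://example.com:8080/path"

# Encoding works on the bytes of the URL; the colon is just another byte.
encoded = base64.b64encode(url.encode("utf-8"))

# Every byte of the output comes from the base64 alphabet (plus '=' padding).
BASE64_ALPHABET = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)
only_alphabet = all(byte in BASE64_ALPHABET for byte in encoded)

# Decoding gives the original URL back, colons included.
round_trip = base64.b64decode(encoded).decode("utf-8")

# Trying to *decode* a colon is a different matter: it is not valid base64.
try:
    base64.b64decode(b":", validate=True)
    colon_decodes = True
except binascii.Error:
    colon_decodes = False
```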
I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, for whom I have tremendous respect, has an article (Zakas Article on base64) in which he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
    throw new Error("Can't base64 encode non-ASCII characters.");
}
In a separate part of the article he links to RFC 3548, but I don't see any input requirements there other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
Not sure what "base alphabet" means but perhaps this is what Zakas is referring to. But by saying they must reject the encoding it seems to imply that this is something that has already been encoded as opposed to the input (of course if the input is invalid it will also show up in the encoding so perhaps the point is moot).
A bit confused on what the standard is.
Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.
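The two-step process can be sketched as a pair of helpers (the function names are my own, not from any library). The example deliberately uses UTF-16 rather than UTF-8 to show that the text encoding is a free choice, and that decoding with the wrong one gives the wrong text even though the base64 step itself works fine.

```python
import base64

def text_to_base64(text, encoding="utf-8"):
    """Encode text to bytes with `encoding`, then base64 those bytes."""
    raw = text.encode(encoding)
    return base64.b64encode(raw).decode("ascii")

def base64_to_text(b64, encoding="utf-8"):
    """Undo base64 to get the bytes, then decode them with `encoding`."""
    raw = base64.b64decode(b64)
    return raw.decode(encoding)

original = "สวัสดี"  # sample Thai text, chosen only for illustration

# Sender and recipient agree on UTF-16 (little endian): the text survives.
b64 = text_to_base64(original, "utf-16-le")
same = base64_to_text(b64, "utf-16-le")

# A recipient who assumes UTF-8 still gets bytes back, but not the text.
wrong = base64.b64decode(b64).decode("utf-8", errors="replace")
```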
For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through everything from 0x80 to 0xff, which isn't ASCII; ASCII only covers 0x00 to 0x7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
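To see the gap between the two checks, here is a small Python sketch (both helper names are hypothetical). The first helper mirrors the range used by the article's regular expression, the second is a real ASCII check.

```python
def fits_in_one_byte(text):
    """Mirror of the article's check: every code point is in 0x00-0xFF."""
    return all(ord(ch) <= 0xFF for ch in text)

def is_ascii(text):
    """True ASCII check: every code point is in 0x00-0x7F."""
    return all(ord(ch) <= 0x7F for ch in text)

sample = "café"  # 'é' is U+00E9: below 0x100, but not ASCII

passes_article_check = fits_in_one_byte(sample)
passes_ascii_check = is_ascii(sample)

# Which bytes 'é' becomes still depends on the encoding chosen, which is
# why the text-to-bytes step has to be explicit.
latin1_bytes = "é".encode("latin-1")
utf8_bytes = "é".encode("utf-8")
```

So a string can pass the article's test and still have no single obvious byte representation.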
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)
When I decode a base64 string (using one of the online decoders), the decoded data contains several special chars like square blocks and `"
base64 encodes binary data as visible characters. If you decode it, the string is turned back into the binary data, where some of the bytes won't have a printable ASCII/Unicode representation and will show as squares. This is normal behaviour. You should decode the data in the program you want to use the data in.
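A small sketch of why this happens, using arbitrary made-up bytes rather than any real file: the decoded result is binary, and forcing it to display as text yields replacement characters, which many tools render as squares.

```python
import base64

# Arbitrary non-text bytes, chosen only for illustration.
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

encoded = base64.b64encode(binary).decode("ascii")  # printable text
decoded = base64.b64decode(encoded)                 # the same raw bytes again

# The decoded bytes are not valid UTF-8 text.
try:
    decoded.decode("utf-8")
    is_text = True
except UnicodeDecodeError:
    is_text = False

# Forcing a text view substitutes U+FFFD for the undecodable bytes.
preview = decoded.decode("utf-8", errors="replace")
```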