How to decode character to utf-8 at specific position - centos

I have a Python script that contains a dictionary, and I need to convert that dictionary to JSON.
But whenever the script runs, it gives the error below
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 604: invalid continuation byte
for the line json.dumps(data_dict).
From the link, I understand that the non-UTF-8 character should be decoded. But how do I do that in a script? How can I get the character at that position in the dictionary and decode it?
In the interpreter, it works. Below is the interpreter snippet.
>>> 'ren�'.decode('utf-8')
u'ren\ufffd'

You're attempting to decode a byte sequence that is not valid UTF-8, and bytes that are not valid UTF-8 cannot be decoded as UTF-8. Try passing 'ignore' to .decode if you absolutely must get past the invalid bytes (they are silently dropped), or try the chardet library to detect the actual encoding (.decode converts the bytes into Unicode).
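A minimal sketch of locating and handling the offending byte, written for Python 3 (the question uses Python 2, where the same `.decode` arguments apply to `str`). The sample bytes and the guess that the data is really Latin-1 are assumptions for illustration; byte 0xe9 is "é" in Latin-1, but only the source of the data can confirm its encoding.

```python
import json

# Hypothetical sample: "René D" encoded as Latin-1, so byte 0xe9 is not valid UTF-8.
raw = b"Ren\xe9 D"

try:
    raw.decode("utf-8")
    bad_position, bad_bytes = None, b""
except UnicodeDecodeError as exc:
    # The exception records where decoding failed and which bytes were at fault.
    bad_position = exc.start
    bad_bytes = exc.object[exc.start:exc.end]

# Lossy options: drop the invalid byte, or substitute U+FFFD for it.
dropped = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

# Lossless option, valid only if the data really is Latin-1 (an assumption here).
as_latin1 = raw.decode("latin-1")

# Once the value is text rather than undecodable bytes, json.dumps succeeds.
encoded = json.dumps({"name": as_latin1})
```

Decoding each byte string in the dictionary this way before calling json.dumps avoids the implicit UTF-8 decode that raises the error.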

Related

How to decode 2-layer unknown base64 encoding

I have the following base64 encoded strings with some bizarre pattern.
0069a1/jE0MgjRAQaqv3N9cXpzf01WmZ9OUUJZrqx0Yndme2JXRZSVSEJPSLuCdX5ncXF9TE2PnEhPMHq6sWJ0ZWBrcUdMlI8wLDR6urFidHJmd2NLR5WUX31CSa27fHR5e2zSsKsxabDhvpweRqrRjJfSirjoaUbhubTmRmfRibDUipHosGAxuIjmvnkeiZTUjIISOSgxaZfhvrAeRrYRBhTSi4joaXDhuIHmRmo8O9SMmdKxgDFps+G+gR5GsNGMhdKKkehpfOG5pOZHdtGJhtSKteiwSDG4iOa+Ux6JkxQCEtKwgjFpsOG/jh5GltGNttKKqehpViHhv4YeRpDRjIPSi4DoaXzhuafmRkvRiLDUirDosVgxuarmvl0eiYbUjJ/SsJMxaabhvpLz9B4bGhoc0rCCMWm24b+OHkac0Yye0oq66Glw4bm35kZzEdGMqtKKuuhpcuG5tOZHftGJv9SKp+iwYzG5oOa+Ux6JlBQBAgIo6Glw4bmy5kZb0YmR1IqG6LBQMbmi5r5PHomQFBzSiqroaWPhua3mRlvRiYXUirPosVAxuarmv3YeibbUjLjSsLoxaZThvrLX8zvRjJ/Si4HoaWPhuafmR33Rib7Ui7vosVExuYDmvkoeiagUDBLSsYgxaaDhv4EeRqXRjKHSi4DoaXzhuaHmRk/Ria0U0ouI6GlP4bml5kZM0YmE1Iqf6LBjMbmJ5r9+Homq1I210rCRMWmf4b6yHkaY0Yyh0ouA6Gl84bmT5kZP0YmuSHZ9W02En0hVRUm6u28AAA==t=
0069a1M83GVi8zMwVnco+Bg4gXG2pxe318Y0FaY2GInoWUHwZwYnZ3enBMS3ZPiYKVgxUZa2ptfnp9M3l3fJ6Il5IPFWBrdm0CHjd5d3yeiICUEwdsYHd2bU9BSmB2gIiLiQi2l4zTi4LTvZ/Ti1YtfmW27p/Pi6TTi7fli6otdUIm7vXPl4LTirrlvbTTdWgmfuZ2Hg/Ti6XTvbPTi0rt9Oa276/Pi5LTioLli6fAxyZ+/baWp9OLgdO9gtOLTC1+d7buts+LntOLp+WKuy11dCbu0c+XqtOKuuW9ntN1b+bwdraXpdOLgtO8jdOLai1/RLbujs+LtBPTvIXTi2wtfnG276fPi57Ti6Tli4YtdEIm7tTPlrrTi5jlvZDTdXomfvu2l7TTi5TTvZE+OeLn6Oh4tpel04uE07yN04tgLX5stu6dz4uS04u05Yu+7S1+WLbunc+LkNOLt+WKsy11TSbuw8+XgdOLkuW9ntN1aObzZmYPz4uS04ux5YuWLXVjJu7iz5ey04uQ5b2C03Vs5u627o3Pi4HTi67li5YtdXcm7tfPlrLTi5jlvLvTdUomfty2l53Ti6bTvbEaPsctfm2276bPi4HTi6TlirAtdUwm79/PlrPTi7LlvYfTdVTm/na2lq/Ti5LTvILTi1ktflO276fPi57Ti6Lli4ItdV/mtu+vz4ut04um5YuBLXV2Ju77z5eB04u75byz03VWJn/Rtpe204ut072x04tkLX5Ttu+nz4ue04uQ5YuCLXVcuhIZfGpmfXpnRkp3dpP88g==t=
0069a1kA0HZTAANvPE0U9BQkkkKHVuSE55ZreswMJIXkRVLDVvfUVEf3W6vdXsSUJUQiYqdHVeTX94xY/U315IVlM8Jn90RV4HG8GP1N9eSEFVIDRzf0RFaEq3vMPVQEhKSDuFiJPguIfWS2lwKJbtv6SF3YDQuJfWjkETKAnttYPn3cbQiLHgj78TSxdwtajnv9VFARDguKDWS0VwKIotNSeF3LDQuKHWj3QTKAQAB+e/zoWJuOC4hNZLdHAojO2/toXdqdC4rdaOURMpGO21tefd4tCImeCPvxNLPXC1rycxRYWIuuC4h9ZKe3Aoqu2+hYXdkdC4hxbWSnNwKKztv7CF3LjQuK3WjlITKCXttIPn3efQiYngjp0TSzNwtbrnv8iFiKvguJHWS2edmiInKSlLhYi64LiB1kp7cCig7b+thd2C0Lih1o5CEygdLe2/mYXdgtC4o9aOQRMpEO21jOfd8NCIsuCOlxNLPXC1qCcyVVUQ0Lih1o5HEyg17bWi593R0IiB4I6VE0shcLWsJy+F3ZLQuLLWjlgTKDXttbbn3eTQiYHgjp0TShhwtYrnv++FiILguKPWS0e5nQftv6yF3LnQuLLWjlITKRPttY3n3OzQiYDgjrcTSyRwtZQnP0WFibDguJfWSnRwKJntv5KF3LjQuK3WjlQTKCHttZ4nhdyw0Lie1o5QEygi7bW3593I0Iiy4I6+E0oQcLWW577ihYip4Lio1ktHcCik7b+Shdy40Lit1o5mEygh7bWdeyEqY3VVTn9isLzU1VM8Mw==t=
0094dfBQcJXQfMLwJRREVLTEccEEJZhIJgf0ZdVVdCVEpbFA1YSomIZmxLTEB5Q0haTB4SQ0KSgWZhNH5BSlRCWF0EHkhDiZIeAjB+QUpUQk9bGAxESIiJcVNGTVZASkJERgO9v6QsdJ7PupjlvZznsaq95bfndFvPl7DivZznv43p5f7nv30slqbiuoLlv6Lpse19NicsdLnPurTlvYAnOym95IfndG3PloXivZEKDemx9r2+jyx0nc+6heW9huexuL3lnud0Yc+XoOK8jee/u+nl2ue/VSyWpuK6qOW/pSk/fb2/jSx0ns+7iuW9oOewi73lpud0Sw/Pu4Llvabnsb695I/ndGHPl6PivbDnvo3p5d/nvkUsl4Tiuqblv7DpsfC9v5wsdIjPupYIDygtJydzvb+NLHSYz7uK5b2q57GjveW153Rtz5ez4r2IJ+exl73lted0b8+XsOK8hee/gunlyOe/fiyXjuK6qOW/oik8bW0n53Rtz5e24r2g57+s6eXp579NLJeM4rq05b+mKSG95aXndH7Pl6nivaDnv7jp5dznvk0sl4Tiu43lv4Dpsde9v7UsdLrPurYsCA3nsaK95I7ndH7Pl6PivIbnv4Pp5NTnvkwsl67iurHlv54pMX29vocsdI7Pu4XlvZPnsZy95I/ndGHPl6XivbTnv5ApveSH53RSz5eh4r2357+56eXw579+LJen4ruF5b+c6bDavb+eLHSxz7q25b2u57GcveSP53Rhz5eX4r2057+TdRkSVEKZgmZ7QU1BQFk2PQ==J=
0094df/M8GDV7OBDaovY2DQ0hMQBsAhoBLVHJprK6KnEVURF0BE4uKTUd/eLmAi4BVQ05CGhuQg01KAEq4s5yKV1JUThEai5A2eHJ5r7mJnUNXTkIaG5CyQEtlc7G5goBY5rWuvuZ/Lrye1o5nHHds5r697ebJLna25I6vHERLL76l7bXvvndH5Lyx1kRZL3e2Jj8tvudOLryl1o97HHdbCwzttfW+d0bkvITWRHsvd4fmtby+5lcuvKnWjl4cdkfmvr/t5tkudp3kj78cRGIvvqQtO36+dkTkvIfWRXQvd6HmtI++5m8uvIMW1kV8L3en5rW6vudGLryp1o5dHHd65r+J7ebcLneN5I6dHERsL76x7bXzvnZV5LyR1kRowsUpLCMjcL52ROS8gdZFdC93q+a1p77mfC68pdaOTRx3QibmtZO+5nwuvKfWjk4cdk/mvobt5ssudrbkjpccRGIvvqMtOG5u7i68pdaOSBx3aua+qO3m6i52heSOlRxEfi++py0lvuZsLry21o5XHHdq5r687ebfLneF5I6dHEVHL76B7bXUvnZ85Lyj1kRI5sIM5rWmvudHLry21o5dHHZM5r6H7efXLneE5I63HER7L76fLTV+vndO5LyX1kV7L3eS5rWYvudGLryp1o5bHHd+5r6ULb7nTi68mtaOXxx3fea+ve3m8y52tuSOvhxFTy++ne202b52V+S8qNZESC93r+a1mL7nRi68qdaOaRx3fua+l3EaEZ2LUUp/Yr+zi4pYNzk=J=
I tried cutting out the parts that are not valid base64 and then decoding the rest. I tried decoding the result with different character sets, and it still returns unreadable text. I'm not sure how the previous developer encoded this data. All I know is that he used a 2-layer base64 encoding and that the result should be readable text in Thai.
Can anyone see a pattern that could help me identify and decode these strings?

Plain string encoded in latin1 somehow got messed up when converted to utf8. Now can't reverse back to latin1

There are strings like this VÃ¡zquez MontaÃ±ana that can easily be decoded back using an online decoder. However, for some reason things got messed up and some plaintext ended up like this SofÃƒÆ’Ã‚Â­a GarcÃƒÆ’Ã‚Â©s DurÃƒÆ’Ã‚Â¡. When I try to decode this using Python, it gives the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte. Is it impossible to reverse? Do I have to do some guesswork or manual substitution? I'm not sure what went wrong in the encoding process, but it would help a lot if someone has a clue!
It's a triple mojibake case (example in Python; \u00ad is the invisible soft hyphen that follows Â in the first name):
('SofÃƒÆ’Ã‚Â\u00ada GarcÃƒÆ’Ã‚Â©s DurÃƒÆ’Ã‚Â¡'.
encode('cp1252').decode('utf-8').
encode('cp1252').decode('utf-8').
encode('cp1252').decode('utf-8'))
'Sofía Garcés Durá'
The former example is a simple mojibake:
'VÃ¡zquez MontaÃ±ana'.encode('cp1252').decode('utf-8')
'Vázquez Montañana'
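A minimal sketch of how layered mojibake arises and is undone, assuming every layer came from UTF-8 bytes being misread as cp1252. The helper names garble and repair are hypothetical, and the number of rounds has to be found by trial for real data.

```python
def garble(text, rounds=1):
    """Simulate mojibake: UTF-8 bytes misread as cp1252, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text


def repair(text, rounds=1):
    """Undo the damage by reversing each misreading, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text


original = "Sofía Garcés Durá"
once = garble("Vázquez Montañana")    # simple mojibake
thrice = garble(original, rounds=3)   # triple mojibake

print(repair(once))
print(repair(thrice, rounds=3))
```

Applying repair with too few rounds leaves residual garbage, and applying it to clean text raises UnicodeDecodeError, which is one way to tell when to stop.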

addPortalMessage requires decode('utf-8')

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
Given that titles in Plone are already UTF-8 encoded, that the format string is a unicode string, and that the underscore function is handled by i18ndude, I do not see why we specifically need to decode UTF-8. Usually I forget to add the decode call and only remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?
UTF-8 is a representation of Unicode, not Unicode and not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and representations of unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python will automatically convert a string to unicode if it needs to by using the ASCII decoder. That will fail if there are non-ASCII characters in the string -- because, for example, it is encoded in UTF-8.
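A minimal sketch of the distinction, written for Python 3, where the implicit ASCII conversion described above no longer happens and mixing bytes with text has to be explicit. The sample title is hypothetical and stands in for the UTF-8 byte string that `doc_obj.Title()` returns in the question.

```python
# Hypothetical stand-in for doc_obj.Title(): a UTF-8 encoded byte string.
title_bytes = "Résumé".encode("utf-8")

# Python 2 performs this ASCII decode implicitly when bytes meet unicode,
# and it fails as soon as the bytes contain anything outside ASCII.
try:
    title_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the correct codec yields text that formats cleanly.
title_text = title_bytes.decode("utf-8")
message = "This document (%s) has already been uploaded." % title_text
print(message)
```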

PyQt and Unicode error

I have a problem related to Unicode in this line:
strToCompare = str(self.modelProxy.data(cellIndex, Qt.DisplayRole).toString()).lower()
The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)
This is because the data is retrieved from a database field that may contain Unicode characters. Even though I added the unicode() function to convert to Unicode, the error is still there.
I found a solution: I get the string directly from the model instead of calling data(). This way I don't have to convert the QVariant into a string.

base64 encoding: input character

I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, for whom I have tremendous respect, has an article where he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
    throw new Error("Can't base64 encode non-ASCII characters.");
}
In a separate part of the article he links to RFC 3548, but I don't see any input requirements there other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
I'm not sure what "base alphabet" means, but perhaps this is what Zakas is referring to. However, saying that implementations must reject the encoding seems to imply that this applies to data that has already been encoded, as opposed to the input (of course, if the input is invalid it will also show up in the encoding, so perhaps the point is moot).
I'm a bit confused about what the standard is.
Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.
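A minimal sketch of that two-step round trip using Python's standard base64 module. The sample text and the choice of UTF-8 are assumptions for illustration; any encoding works as long as the decoder uses the same one.

```python
import base64

text = "naïve café"

# Step 1: text -> bytes, using an agreed character encoding.
raw = text.encode("utf-8")

# Step 2: bytes -> base64. The result is bytes holding only ASCII characters.
b64_text = base64.b64encode(raw).decode("ascii")

# Decoding reverses both steps, in the opposite order.
round_trip = base64.b64decode(b64_text).decode("utf-8")

# The same text under a different character encoding gives different base64.
b64_latin1 = base64.b64encode(text.encode("latin-1")).decode("ascii")

print(b64_text, round_trip, b64_latin1)
```

The last line of the sketch is why the character encoding has to be known: base64 only preserves the bytes, not how they map to characters.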
For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through everything from 0x80 to 0xff, which isn't ASCII. ASCII covers only 0x00 to 0x7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
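A minimal sketch of the gap in that check, using a Python port of the article's JavaScript regex. The function names are hypothetical.

```python
import re

# Python port of the article's pattern: matches any character above U+00FF.
article_check = re.compile(r"[^\u0000-\u00ff]")


def passes_article_check(text):
    """True when the article's test would NOT throw for this text."""
    return article_check.search(text) is None


def is_ascii(text):
    """True only when every character is in the real ASCII range 0x00-0x7f."""
    return all(ord(ch) <= 0x7F for ch in text)


sample = "caf\u00e9"  # "é" is U+00E9: above ASCII, below U+0100
print(passes_article_check(sample), is_ascii(sample))
```

The sample passes the article's check even though it is not ASCII, so the check is really a "fits in one byte per character" test rather than an ASCII test.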
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)