How to save non ASCII Characters in Mongo DB - mongodb

This question is repeated, but I can not find answer to problem in my context. I am trying to save Aéropostale as string in mongo DB:
name='Aéropostale'
obj=Mongo_Object()
obj.name=name
obj.save()
When I save the object, I get following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
How to proceed to save the string in original format and retrieve in same format?

As you are using Python 2.7, you need to do a few things:
Specify the file encoding, by adding a string similar to this to the top of your file:
#coding: utf8
Use a unicode string, as your string is not ASCII, and specify the encoding. I am using utf8 here which includes é:
name = unicode('Aéropostale', 'utf8')

Related

IW8ISO8859P8 to utf-8 conversion

My perl script reads values extracted from Oracle DB based on IW8ISO8859P8 codepage into a string. I have also an input file (saved as UTF-8) from which I read also a string.
I am trying to compare both string. printing first string gives me gibberish e.g òøáä -îåñáåú whereas the other string gives me the hebrew letter e.g הסבות. How can I encode the first string to give me the right Hebrew string ?
Thanks
Shimon
I tried using the format $string1 = Encode (uft-8, $string1) but it did not help

Error decoding ciphertext with RSA using pycryptodome

I'm using pycryptodome to encrypt a text in this way:
def encrypt_with_rsa(plain_text):
#First Public Key Encryption
cipher_pub_obj = PKCS1_OAEP.new(RSA.importKey(my_public_key))
#encrypt
_secret_byte_obj = cipher_pub_obj.encrypt(plain_text.encode())
return _secret_byte_obj
After that, I want to divide all the string in 8 parts to get a char of each one. What I receive has this form:
b"gi?\xf4\xa8{\xe8\x1b\xec8\xd5\x96\*,t\xad\xb8D=\rCGq\xc5\xed........"
So I Try to decode that with utf-8, throws error:
** 'utf-8' codec can't decode byte 0x81 in position 5: invalid start byte **
I tried with utf-16 and 32, but no one works. I tried with latin1 too, but it ignores some parts and I don't want that. The text encoded above is changed to:
"gi?ô¨{è\x1bì8Õ\x96\*,t\xad¸D=\rCGqÅíh\x1b\x84\x0eí ó=#ÉîKô4B......"
Some parts are the same, i don´t know what type of encoding it uses, what can I do to use the encrypted text?

Unicode Decode ErrorL Ordinal not in range

so I am trying to make the values in a csv file Unicode so a program I am using is able to read them. Here is my code, in Python 2.7, which I HAVE to use:
TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
encoded_tweet = row["Tweet"].encode('utf-8')
TEST_SENTENCES.append(encoded_tweet)
I continue to receive the same error message, and have not been able to find anything that works. Here is the error message. I am sure someone out there can make a really easy solution.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 127: ordinal not in range(128)
As an update, I changed encode to decode, and the program says it is running the predictions, given it has been a while but it would not have started if it was not encoded in Unicode so let us pray.

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original I'm getting some unexpected results as I need to use UTF-7 during decoding and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I need to first undo the Base64 encoding. In order to do System.Convert can be used (which results in a byte array) and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this would look like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string I'd expect only to need to run the URL decode without any further parameters. But if you do that you end up with a accented a that looks like � in a file or ? on the console. To get the actual original string it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works but I don't really get why it's necessary since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one which has the original text in the same file encoded as UTF-8
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of the UrlDecode of C#. If you run escape('✓ à la mode') in your browser you will end up with the following string %u2713%20%E0%20la%20mode. So we have a Unicode representation of the check mark and a HTML entity for the á. If we use this directly in UrlDecode we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.
Turns out it actually isn't all that strange. It's just for what I want to do it's advantages to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyways, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out What is the proper way to URL encode Unicode characters? would indicate that the %u format wasn't formerly standardized. A newer version to escape characters exists though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The original linked Mozilla article about a Base64 encode for an UTF-8 strings modifies the whole process in a way to allows you to just call the Base64 decode function in order to get the whole string. This is realized by converting the URL encode version of the string to bytes.

How to decode a Base64 string?

I have a normal string in Powershell that is from a text file containing Base64 text; it is stored in $x. I am trying to decode it as such:
$z = [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($x));
This works if $x was a Base64 string created in Powershell (but it's not). And this does not work on the $x Base64 string that came from a file, $z simply ends up as something like 䐲券.
What am I missing? For example, $x could be YmxhaGJsYWg= which is Base64 for blahblah.
In a nutshell, YmxhaGJsYWg= is in a text file then put into a string in this Powershell code and I try to decode it but end up with 䐲券 etc.
Isn't encoding taking the text TO base64 and decoding taking base64 BACK to text? You seem be mixing them up here. When I decode using this online decoder I get:
BASE64: blahblah
UTF8: nVnV
not the other way around. I can't reproduce it completely in PS though. See sample below:
PS > [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String("blahblah"))
nV�nV�
PS > [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes("nVnV"))
blZuVg==
EDIT I believe you're using the wrong encoder for your text. The encoded base64 string is encoded from UTF8(or ASCII) string.
PS > [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String("YmxhaGJsYWg="))
blahblah
PS > [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String("YmxhaGJsYWg="))
汢桡汢桡
PS > [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String("YmxhaGJsYWg="))
blahblah
There are no PowerShell-native commands for Base64 conversion - yet (as of PowerShell [Core] 7.1), but adding dedicated cmdlets has been suggested in GitHub issue #8620.
For now, direct use of .NET is needed.
Important:
Base64 encoding is an encoding of binary data using bytes whose values are constrained to a well-defined 64-character subrange of the ASCII character set representing printable characters, devised at a time when sending arbitrary bytes was problematic, especially with the high bit set (byte values > 0x7f).
Therefore, you must always specify explicitly what character encoding the Base64 bytes do / should represent.
Ergo:
on converting TO Base64, you must first obtain a byte representation of the string you're trying to encode using the character encoding the consumer of the Base64 string expects.
on converting FROM Base64, you must interpret the resultant array of bytes as a string using the same encoding that was used to create the Base64 representation.
Examples:
Note:
The following examples convert to and from UTF-8 encoded strings:
To convert to and from UTF-16LE ("Unicode") instead, substitute [Text.Encoding]::Unicode for [Text.Encoding]::UTF8
Convert TO Base64:
PS> [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes('Motörhead'))
TW90w7ZyaGVhZA==
Convert FROM Base64:
PS> [Text.Encoding]::Utf8.GetString([Convert]::FromBase64String('TW90w7ZyaGVhZA=='))
Motörhead
This page shows up when you google how to convert to base64, so for completeness:
$b = [System.Text.Encoding]::UTF8.GetBytes("blahblah")
[System.Convert]::ToBase64String($b)
Base64 encoding converts three 8-bit bytes (0-255) into four 6-bit bytes (0-63 aka base64). Each of the four bytes indexes an ASCII string which represents the final output as four 8-bit ASCII characters. The indexed string is typically 'A-Za-z0-9+/' with '=' used as padding. This is why encoded data is 4/3 longer.
Base64 decoding is the inverse process. And as one would expect, the decoded data is 3/4 as long.
While base64 encoding can encode plain text, its real benefit is encoding non-printable characters which may be interpreted by transmitting systems as control characters.
I suggest the original poster render $z as bytes with each bit having meaning to the application. Rendering non-printable characters as text typically invokes Unicode which produces glyphs based on your system's localization.
Base64decode("the answer to life the universe and everything") = 00101010
If anyone would like to do it with a pipe in Powershell (like a filter) (e.g. read file contents and decode it), it can be achieved with a one-liner like that:
Get-Content base64.txt | %{[Text.Encoding]::UTF8.GetString([Convert]::FromBase64String($_))}
I had issues with spaces showing in between my output and there was no answer online at all to fix this issue. I literally spend many hours trying to find a solution and found one from playing around with the code to the point that I almost did not even know what I typed in at the time that I got it to work. Here is my fix for the issue: [System.Text.Encoding]::UTF8.GetString(([System.Convert]::FromBase64String($base64string)|?{$_}))
Still not a "built-in", but published to gallery, authored by MS:
https://github.com/powershell/textutility
TextUtility
ConvertFrom-Base64
Return a string decoded from base64.
ConvertTo-Base64
Return a base64 encoded representation of a string.