SQLAlchemy Unicode Problems in Exceptions - postgresql

I'm working on a Flask app with a Postgres/SQLAlchemy/Flask-Admin stack. However, in the Admin interface, any DB error that contains Unicode letters can't be reported, since unicode(exc) raises UnicodeDecodeError.
I was able to trace the problem to sqlalchemy.exc:
class StatementError(SQLAlchemyError):
    ...
    def __unicode__(self):
        return self.__str__()
And reproduce the problem with:
class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    name2 = Column(String, nullable=False)

session = Session()
a = A(name=u"עברית")
session.add(a)
try:
    session.commit()
except Exception as e:
    print(repr(e))
    print("------------------")
    print(unicode(e))
Which returns:
ProgrammingError('(psycopg2.ProgrammingError) column "name" of relation "a" does not exist\nLINE 1: INSERT INTO a (name, name2) VALUES (\'\xd7\xa2\xd7\x91\xd7\xa8\xd7\x99\xd7\xaa\', NULL) RETURNING...\n ^\n',)
------------------
Traceback (most recent call last):
File "test.py", line 27, in <module>
print(unicode(e))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 118: ordinal not in range(128)
I currently solve it by replacing the relevant exceptions with my own classes that decode from UTF-8. However, this is a terrible hack, and I'm looking for a proper solution:
Is there a way to configure SQLAlchemy to automatically decode the received error messages?
Is there a way to configure Postgres to output messages in a Latin encoding (less favorable, but acceptable)?
Is there a way to make unicode try to decode using UTF-8 instead of ASCII/Latin?
Is there any way to resolve this at all?
(The problem is relevant only to Python 2. In Python 3 the code above works; I believe that's because the default encoding is UTF-8.)

I actually think that patching SQLAlchemy from your application is the right, reasonably clean solution. Here's why:
You've identified something that is generally agreed to be a bug in SQLAlchemy.
You can write a patch that behaves the same for all situations that SQLAlchemy currently handles. That is, your patch will not break existing code.
The probability is very high that even once SQLAlchemy is fixed, your patch will be harmless.
Making this change reduces the impact of the SQLAlchemy bug throughout your code, compared to solutions like changing every place where exceptions might be printed.
Changing Postgres to return latin1-encoded messages actually wouldn't help, because Python is implicitly using the ASCII codec, which fails on a latin1 string just as it does on a UTF-8 one. Also, changing Postgres to return latin1 errors would probably involve changing the connection encoding, which would likely create issues for Unicode data.
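A quick way to see this (Python 2): the implicit str-to-unicode conversion always uses the ASCII codec, so latin1 bytes fail just like UTF-8 bytes do:
>>> unicode('caf\xe9')   # '\xe9' is 'é' in latin1; no UTF-8 involved
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)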
Here's a simple program that patches sqlalchemy.exc.StatementError and tests the patch. If you wanted, you could even try generating an exception that includes Unicode, converting it to unicode, and only applying the patch if that raises UnicodeDecodeError (see the sketch after the code below). If you did that, your patch would automatically stop being applied once SQLAlchemy fixes the issue.
# -*- coding: utf-8 -*-
from sqlalchemy.exc import StatementError

def statement_error_unicode(self):
    return unicode(str(self), 'utf-8')

# See <link to sqlalchemy issue>; can be removed once we require a
# version of sqlalchemy with a fix to that issue
StatementError.__unicode__ = statement_error_unicode

message = u'Sqlalchemy unicode 😞'
message_str = message.encode('utf-8')
error = StatementError(message_str, 'select * from users', tuple(), '')
print unicode(error)
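For completeness, here is a minimal sketch of that self-disabling variant, under the same Python 2 assumptions and with the same StatementError constructor arguments as the test program above:

from sqlalchemy.exc import StatementError

def statement_error_unicode(self):
    return unicode(str(self), 'utf-8')

# Build a probe exception whose message contains non-ASCII UTF-8 bytes.
probe = StatementError(u'caf\xe9'.encode('utf-8'), 'select 1', tuple(), '')
try:
    unicode(probe)
except UnicodeDecodeError:
    # Still broken upstream; install the workaround. Once SQLAlchemy
    # ships a fix, unicode(probe) succeeds and the patch is skipped.
    StatementError.__unicode__ = statement_error_unicode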

Related

Unicode to Latin in Teradata Conversion

I have been trying to convert Unicode strings to Latin in Teradata version 16.20.32.23. I have seen many online forums, but I was not able to formulate a solution. Following are some of the strings that I was unable to convert:
hyÃƒÂ¶dyt
lÃƒÂ¶ydÃƒÂ¤t
I have tried the following solution, but the Translate_Chk function does not seem to work.
SELECT CASE
         WHEN Translate_Chk('hyÃƒÂ¶dyt' USING UNICODE_TO_LATIN) <> 0
           THEN ''
         WHEN Translate_Chk('hyÃƒÂ¶dyt' USING UNICODE_TO_LATIN) = 0
           THEN Translate('hyÃƒÂ¶dyt' USING UNICODE_TO_LATIN WITH ERROR)
       END AS translated
The error I receive is SELECT FAILED. 6706: The string contains untranslatable character.
I think I have reached a dead end, could anyone help me here?
I'm not familiar with Teradata, but the strings you have were mis-decoded twice as Windows-1252, which is a variation of ISO-8859-1, a.k.a. latin1. An example fix in Python:
>>> s = 'hyÃƒÂ¶dyt'
>>> s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8')
'hyödyt'
>>> s = 'lÃƒÂ¶ydÃƒÂ¤t'
>>> s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8')
'löydät'
So not a Teradata solution, but should help you figure it out.
Following is the Python code I used; it might help someone. In order to use the code below, you need to follow these instructions:
Download the chilkat package for your Python release:
https://www.chilkatsoft.com/python.asp#winDownloads
Follow the installation guidelines at this URL:
https://www.chilkatsoft.com/installWinPython.asp
Open an IDLE shell and run the following code:
import sys
import chilkat

charset = chilkat.CkCharset()
charset.put_FromCharset("utf-8")
# "ANSI" resolves to the system code page (Windows-1252 here), so a
# single target charset is enough:
charset.put_ToCharset("Windows-1252")

success = charset.ConvertFile("source_file_name.csv", "target_file_name.csv")
if not success:
    print(charset.lastErrorText())
    sys.exit()
print("Success.")

Are you able to use PtrToStringAuto to decrypt a secure string in Powershell 7 on macOS?

I have had no success in getting the following code snippet to output "Hello World!" in PS7
$string = $("Hello World!" | ConvertTo-SecureString -AsPlainText -Force)
[System.Runtime.InteropServices.Marshal]::PtrToStringAuto(
    [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($string))
The above code is an example of decrypting a secure string without specifying a length.
This same code works in PS6 and PS5 to fully decrypt the Secure String, but does not work in PS7. The only way around this I have found is to use PtrToStringBSTR. Then it works as expected across all versions of PS for this use case.
I raised an issue at the Powershell repo on Github, but haven't had any responses. I'm honestly just looking for some confirmation that the behavior is the same for others.
https://github.com/PowerShell/PowerShell/issues/11953
I would think something like this would be a breaking change for a lot of code being ported to PS7.
Here is what I have found so far:
Documentation
https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.marshal.ptrtostringauto?view=netframework-4.8
According to the documentation, when specifying an integer, PtrToStringAuto:
Allocates a managed String and copies the specified number of characters from a string stored in unmanaged memory into it.
Specifying an int of 11 returns "Hello"; this is because every other char returned is null. In this case, you must specify an int of 23 to return the complete string "Hello World!" using this method. I have stored the output in a variable to demonstrate this:
$String = $("Hello World!" | ConvertTo-SecureString -AsPlainText -Force)
$Output = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto(
    [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($String), 23)
$Output[0] returns H
$Output[1] returns NULL
$Output[2] returns e
$Output[3] returns NULL
etc....
If no integer is specified, PtrToStringAuto:
Allocates a managed String and copies all characters up to the first null character from a string stored in unmanaged memory into it.
I believe this suggests that either the Secure String is being stored with NULL values, whereas in PS6 it was not, or that the behavior of the PtrToStringAuto function has changed, and now adheres to the behavior the documentation describes above.
This is only an issue on macOS; however, using PtrToStringBSTR in place of PtrToStringAuto to decrypt the Secure String works as expected across windows and macOS.
This seems related: https://stackoverflow.com/a/11022662/4257163
I also do not see anywhere that a change was made.
Note that [securestring] is not recommended for new code anymore.
While secure strings on Windows offer limited protection (by storing the string encrypted in memory, via the DPAPI, and by shortening the window during which the plain-text representation is held in memory), no encryption at all is used on Unix-like platforms.[1]
The only way around this I have found is to use PtrToStringBSTR.
That is not only a way around the problem, PtrToStringBSTR is the method that should have been used to begin with, given that the input string is a BSTR.[2]
Do note that converting a secure string to and from a regular [string] instance defeats the very purpose of using [securestring] to begin with: you'll end up with a plain-text representation of your sensitive data in your process' memory whose lifetime you cannot control.
If you really want to do this, a simpler, cross-platform-compatible approach is:
[System.Net.NetworkCredential]::new('dummy', $string).Password
[1] This is especially problematic when you save a secure string in a file, via ConvertFrom-SecureString or Export-CliXml - see this answer.
[2] The Auto in PtrToStringAuto() means that the unmanaged input string is assumed to use a platform-appropriate character encoding, whereas a BSTR is a "Unicode" (UTF-16) string on all platforms. On Windows, an unmanaged string is assumed to be UTF-16 (which is why the code works there), whereas on Unix-like platforms it is assumed to be UTF-8 since .NET Core 3.0 (PowerShell [Core] 7.0 is based on .NET Core 3.1). This explains your symptoms: the NUL bytes in the BSTR instance's UTF-16 code units are interpreted as characters in their own right when (mis)read as UTF-8. Note that .NET Core 2.x (which is what PowerShell [Core] 6.x is based on) inappropriately defaulted to UTF-16, which this PR fixed, amounting to a breaking change.

Trying to upload specific characters in Python 3 using Windows Powershell

I'm running this code in Windows PowerShell, and it reads a file called languages.txt, where I'm trying to convert between bytes and strings:
Here is languages.txt:
Afrikaans
አማርኛ
Аҧсшәа
العربية
Aragonés
Arpetan
Azərbaycanca
Bamanankan
বাংলা
Bân-lâm-gú
Беларуская
Български
Boarisch
Bosanski
Буряад
Català
Чӑвашла
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
فارسی
Français
Frysk
Gaelg
Gàidhlig
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Ido
Interlingua
Italiano
עברית
ಕನ್ನಡ
Kapampangan
ქართული
Қазақша
Kreyòl ayisyen
Latgaļu
Latina
Latviešu
Lëtzebuergesch
Lietuvių
Magyar
Македонски
Malti
मराठी
მარგალური
مازِرونی
Bahasa Melayu
Монгол
Nederlands
नेपाल भाषा
日本語
Norsk bokmål
Nouormand
Occitan
Oʻzbekcha/ўзбекча
ਪੰਜਾਬੀ
پنجابی
پښتو
Plattdüütsch
Polski
Português
Română
Romani
Русский
Seeltersk
Shqip
Simple English
Slovenčina
کوردیی ناوەندی
Српски / srpski
Suomi
Svenska
Tagalog
தமிழ்
ภาษาไทย
Taqbaylit
Татарча/tatarça
తెలుగు
Тоҷикӣ
Türkçe
Українська
اردو
Tiếng Việt
Võro
文言
吴语
ייִדיש
中文
Then, here is the code I used:
import sys

script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):
    line = language_file.readline()
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    print(raw_bytes, "<===>", cooked_string)

languages = open("languages.txt", encoding="utf-8")
main(languages, input_encoding, error)
Here's the output (credit: Learn Python 3 the Hard Way by Zed A. Shaw): [screenshot: many of the printed characters appear as question-mark boxes]
I don't know why it doesn't display the characters and shows question blocks instead. Can anyone help me?
The first string which fails is አማርኛ. The first character, አ, is Unicode code point U+12A0 (see here). In UTF-8, that is b'\xe1\x8a\xa0'. So that part is obviously fine; the file really is UTF-8.
Printing did not raise an exception, so your output encoding can handle all of the characters. Everything is fine.
The only remaining reason I see for it to fail is that the font used in the console does not support all of the characters.
If it is just for play, you should not worry about it. Consider it working correctly.
On the other hand, I would suggest changing some things in your code:
You are running main recursively for each line. There is absolutely no need for that, and it would hit the recursion depth limit on a longer file. Use a for loop instead:
for line in language_file:
    print_line(line, encoding, errors)
You are opening the file as UTF-8, so reading from it automatically decodes UTF-8 into Unicode; then you encode it back into raw_bytes, and then decode again into cooked_string, which is the same as next_lang. It would be a better exercise to read the file as raw binary, split it on newlines, and then decode. Then you'd have a clearer picture of what is going on.
with open("languages.txt", 'rb') as f:
raw_file_contents = f.read()
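A minimal sketch of that exercise (assuming the file is newline-delimited UTF-8), with the splitting and decoding made explicit:

with open("languages.txt", 'rb') as f:
    raw_file_contents = f.read()

# Split the raw bytes on newlines, then decode each line explicitly,
# so the bytes -> text step is visible instead of hidden inside open().
for raw_line in raw_file_contents.split(b'\n'):
    raw_bytes = raw_line.strip()
    if raw_bytes:
        cooked_string = raw_bytes.decode('utf-8')
        print(raw_bytes, "<===>", cooked_string)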

column headers are corrupted when querying with pyodbc on Ubuntu

I'm using Postgres on Ubuntu, with unixODBC and pyodbc 4.0.16 to access the data. I seem to have an issue related to Unicode.
When querying the DB, the column headers appear to be corrupted.
Here's an example:
import pyodbc
conn = pyodbc.connect("DSN=local_postgres")
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
#conn.execute('create schema test')
conn.execute('create table test.uni_test(column1 varchar)')
conn.execute("insert into test.uni_test(column1) values ('My value')")
results = conn.execute('select * from test.uni_test')
print results.description
columns = [column[0].decode('latin1') for column in results.description]
print "columns: " + str(columns)
print list(results)
Result:
((u'c\x00\x00\x00o\x00\x00', <type 'str'>, None, 255, 255, 0, True),)
columns: [u'c\x00\x00\x00o\x00\x00']
[(u'My value', )]
I'm not sure what the issue is.
BTW - exactly the same behavior is observed on my mac (el capitan).
Thanks in advance,
Alex
u'c\x00\x00\x00o\x00\x00' is the first 7 bytes of 'column1' in UTF-32LE encoding. (The value was apparently truncated at 7 bytes because 'column1' is 7 characters long.)
pyodbc received a significant upgrade to its Unicode handling in version 4.x, and one of the things the developers discovered is the surprising variety of ways that ODBC drivers can mix and match encodings when returning values. The pyodbc Wiki page for Unicode recommends the following for PostgreSQL ODBC under Python 2.7:
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
but in this case the following was also required:
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')
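Applied to the script from the question, the working setup looks like this (a sketch; the DSN and table names are the asker's):

import pyodbc

conn = pyodbc.connect("DSN=local_postgres")
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
# Column names (metadata) arrive as UTF-32LE from this driver:
conn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')

results = conn.execute('select * from test.uni_test')
print [column[0] for column in results.description]   # [u'column1']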

unicode().decode('utf-8', 'ignore') raising UnicodeEncodeError

Here is the code:
>>> z = u'\u2022'.decode('utf-8', 'ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2022' in position 0: ordinal not in range(256)
Why is UnicodeEncodeError raised when I am using .decode?
Why is any error raised when I am using 'ignore'?
When I first started messing around with Python strings and Unicode, it took me a while to understand the jargon of decode and encode too, so here's an earlier post of mine that may help:
Think of decoding as what you do to go from a regular bytestring to unicode and encoding as what you do to get back from unicode. In other words:
You de-code a str to produce a unicode string (in Python 2)
and en-code a unicode string to produce a str (in Python 2)
So:
unicode_char = u'\xb0'
encodedchar = unicode_char.encode('utf-8')
encodedchar will contain your Unicode character, encoded in the selected encoding (in this case, utf-8).
The same principle applies to Python 3. You de-code a bytes object to produce a str object. And you en-code a str object to produce a bytes object.
From http://wiki.python.org/moin/UnicodeEncodeError
Paradoxically, a UnicodeEncodeError may happen when decoding. The cause of it seems to be the coding-specific decode() functions that normally expect a parameter of type str. It appears that on seeing a unicode parameter, the decode() functions "down-convert" it into str, then decode the result assuming it to be of their own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.
You're trying to decode a unicode string. The implicit encode, done to produce a str that decode() can work on, is what's failing.
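A minimal sketch of the failure and the fix (Python 2):

u = u'\u2022'

# u.decode('utf-8', 'ignore')  # fails: Python must first encode u with
#                              # the default codec to get a str, and that
#                              # implicit encode step is what raises.

raw = u.encode('utf-8')           # unicode -> str (bytes)
print repr(raw)                   # '\xe2\x80\xa2'
print repr(raw.decode('utf-8'))   # str -> unicode: u'\u2022'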