Unicode to Latin in Teradata Conversion - unicode

I have been trying to convert Unicode strings to Latin in Teradata version 16.20.32.23. I have seen many online forums but I was not able to formulate a solution. Following are some of the strings that I was unable to convert:
hyÃƒÂ¶dyt
lÃƒÂ¶ydÃƒÂ¤t
I have tried the following solution, but the Translate_Chk function does not seem to work:
SELECT CASE WHEN Translate_Chk('hyÃƒÂ¶dyt' USING UNICODE_TO_LATIN) <> 0
THEN
''
WHEN Translate_Chk('hyÃƒÂ¶dyt' USING UNICODE_TO_LATIN) = 0
THEN
Translate('hyÃƒÂ¶dyt' USING UNICODE_TO_LATIN WITH ERROR)
END AS translated
The error I receive is SELECT FAILED. 6706: The string contains untranslatable character.
I think I have reached a dead end, could anyone help me here?

I'm not familiar with Teradata, but the strings you have were double-mis-decoded as Windows-1252, which is a variation of ISO-8859-1, a.k.a. latin1. An example fix in Python:
>>> s = 'hyÃƒÂ¶dyt'
>>> s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8')
'hyödyt'
>>> s = 'lÃƒÂ¶ydÃƒÂ¤t'
>>> s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8')
'löydät'
So not a Teradata solution, but should help you figure it out.
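The double decode chain above can be wrapped in a small helper. This is only a sketch, and the mojibake literals below are reconstructions that assume UTF-8 text was mis-decoded as cp1252 twice:

```python
def fix_double_mojibake(s):
    """Reverse UTF-8 text that was mis-decoded as cp1252, possibly twice."""
    for _ in range(2):
        try:
            s = s.encode('cp1252').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # already clean, or not cp1252-style mojibake
    return s

print(fix_double_mojibake('hyÃƒÂ¶dyt'))     # hyödyt
print(fix_double_mojibake('lÃƒÂ¶ydÃƒÂ¤t'))  # löydät
```

Running the cleaned data back through Teradata's UNICODE_TO_LATIN translation should then succeed, since the repaired characters all exist in Latin.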

Following is the Python code I used; it might help someone. To use it, follow these instructions:
Download the chilkat package for your Python release:
https://www.chilkatsoft.com/python.asp#winDownloads
Follow the installation guidelines at:
https://www.chilkatsoft.com/installWinPython.asp
Open an IDLE shell and run the following code:
import sys
import chilkat

charset = chilkat.CkCharset()
charset.put_FromCharset("utf-8")
charset.put_ToCharset("Windows-1252")

success = charset.ConvertFile("source_file_name.csv", "target_file_name.csv")
if success != True:
    print(charset.lastErrorText())
    sys.exit()
print("Success.")
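If chilkat is not available, the same file conversion can be sketched with only the standard library. The file names are placeholders, and errors="replace" substitutes '?' for any character Windows-1252 cannot represent:

```python
def convert_utf8_to_cp1252(src, dst):
    # Read the file as UTF-8 and rewrite it encoded as Windows-1252.
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="cp1252", errors="replace") as fout:
        for line in fin:
            fout.write(line)
```

Call it as, for example, convert_utf8_to_cp1252("source_file_name.csv", "target_file_name.csv").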


What is the '...' in sklearn.feature_extraction.text TfidfVectorizer return output?

When I use TfidfVectorizer to encode a paragraph, I receive output that includes '...'; it looks like the matrix is too large and Python cut something off. This problem happens occasionally, but not always, and it causes errors in other code because '...' has no meaning.
I run it on Conda with Python 3.6.
My source code is like this:
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
# print(data)
X = vectorizer.fit_transform(data).todense()
with open("spare_matrix.txt", "w") as file:
    file.write(str(X))
Please suggest me some ideas, with thanks.
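The '...' is almost certainly NumPy's print summarization: str() of an array with more than 1000 elements elides the middle by default. A minimal sketch, with an array standing in for the TF-IDF matrix:

```python
import sys
import numpy as np

X = np.arange(2000, dtype=float).reshape(40, 50)  # large enough to be summarized
print('...' in str(X))   # True: the middle rows/columns are elided

np.set_printoptions(threshold=sys.maxsize)  # never summarize
print('...' in str(X))   # False: every element is written out
```

Rather than writing str(X) to a file at all, something like np.savetxt("spare_matrix.txt", X) sidesteps the issue entirely.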

Extract the £ (pound) currency symbol and the amount (56) from an html file

Extract the £ (pound) currency symbol and the amount (56) from an HTML file. It is printing the amount as £56 and prints the currency symbol as Â. How can I print only 56, without the symbol? It works fine with a $ sign.
Part of the code:
cost= "£56"
currencySymbol = cost[0]
print (currencySymbol, cost[1:])
The output I am getting:
Â £56
There are many ways you can do it: split, a regex, and the method I show below. Hope it helps you.
import re

cost = "£560,000"
match = re.search(r'(\D+)([\d,]+)', cost)
output = (match.group(1), match.group(2).replace(',', ''))
print(output)
# output --> ('£', '560000')
check here (https://ideone.com/Y053Vb)
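As an aside, the stray Â is a classic sign that the file was read with the wrong encoding: the UTF-8 bytes for £ (0xC2 0xA3) decoded as Latin-1 produce 'Â£'. A quick demonstration:

```python
raw = "£56".encode("utf-8")       # b'\xc2\xa3' followed by b'56'
mojibake = raw.decode("latin-1")  # wrong codec: byte 0xC2 becomes 'Â'
print(mojibake)      # Â£56
print(mojibake[0])   # Â
```

This is why specifying encoding="UTF-8" when opening the file fixes the problem at the source.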
Resolved: I tried to run the code below in a separate file in Eclipse and got an error about UTF-8.
I searched for the error and found the answer: it is Eclipse that changes the encoding. To avoid it I ran the code in the Python IDLE instead; I think the encoding can also be changed in Eclipse.
Thanks to Martijn Pieters.
SyntaxError: Non-UTF-8 code starting with '\x91'

cost = "£56"
currencySymbol = cost[0]
print(currencySymbol, cost[1:])
# resolution: when reading from a file, specify the encoding
# with open('index.html', encoding="UTF-8") as productFile:

column headers are corrupted when querying with pyodbc on Ubuntu

I'm using Postgres on Ubuntu and use unixODBC and pyodbc 4.0.16 to access the data. I seem to have an issue related to Unicode.
When querying the DB, the column headers appear to be corrupted.
Here's an example:
import pyodbc
conn = pyodbc.connect("DSN=local_postgres")
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
#conn.execute('create schema test')
conn.execute('create table test.uni_test(column1 varchar)')
conn.execute("insert into test.uni_test(column1) values ('My value')")
results = conn.execute('select * from test.uni_test')
print results.description
columns = [column[0].decode('latin1') for column in results.description]
print "columns: " + str(columns)
print list(results)
Result:
((u'c\x00\x00\x00o\x00\x00', <type 'str'>, None, 255, 255, 0, True),)
columns: [u'c\x00\x00\x00o\x00\x00']
[(u'My value', )]
I'm not sure what the issue is.
BTW - exactly the same behavior is observed on my Mac (El Capitan).
Thanks in advance,
Alex
u'c\x00\x00\x00o\x00\x00' is the first 7 bytes of 'column1' in UTF-32LE encoding. (The value was apparently truncated at 7 bytes because 'column1' is 7 characters long.)
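That identification can be checked in plain Python:

```python
# The garbled column header from the question, as raw bytes
raw = b'c\x00\x00\x00o\x00\x00'
# 'column1' encoded as UTF-32LE, truncated to the 7 bytes the driver returned
print('column1'.encode('utf-32le')[:7] == raw)  # True
```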
pyodbc received a significant upgrade to its Unicode handling in version 4.x, and one of the things the developers discovered is the surprising variety of ways ODBC drivers mix and match encodings when returning values. The pyodbc wiki page on Unicode recommends the following for PostgreSQL ODBC under Python 2.7:
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
but in this case the following was also required
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-32le')

Force newline instead of execute in ipython

There are a number of questions about how to force EXECUTE instead of newline in ipython. But I need the opposite. Consider:
In [9]: import sqlalchemy sqlalchemy.__version__
File "<ipython-input-9-84bd5002c701>", line 1
import sqlalchemy sqlalchemy.__version__
^
SyntaxError: invalid syntax
We can see that there should be two lines: (1) the import and (2) the __version__ invocation. Whatever I try, I cannot separate the two pieces.
One of the suggestions was Ctrl-v Ctrl-j: that simply did an EXECUTE again.
Another suggestion was to use %edit {tag} (Adding line breaks in ipython). That gave an even more interesting behavior:
In [12]: %edit _i9
Editing... done. Executing edited code...
Out[12]: 'import sqlalchemy \nsqlalchemy.__version__ \n'
So notice: the (vi) editor did its job, but then ipython simply converted the newlines to \n and did not interpret them; instead it concatenated the lines.
So what key combination will get us the newline here?
Pressing Ctrl-q Ctrl-j should do the trick.
Check this answer here.

SQLAlchemy Unicode Problems in Exceptions

I'm working on a Flask app with Postgres/SQLAlchemy/Flask-Admin. However, in the admin interface, any DB error that contains Unicode letters can't be reported, since unicode(exc) raises UnicodeDecodeError.
I was able to trace the problem to sqlalchemy.exc:
class StatementError(SQLAlchemyError):
    ...
    def __unicode__(self):
        return self.__str__()
And reproduced the problem with:
class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    name2 = Column(String, nullable=False)

session = Session()
a = A(name=u"עברית")
session.add(a)
try:
    session.commit()
except Exception as e:
    print(repr(e))
    print("------------------")
    print(unicode(e))
Which returns:
ProgrammingError('(psycopg2.ProgrammingError) column "name" of relation "a" does not exist\nLINE 1: INSERT INTO a (name, name2) VALUES (\'\xd7\xa2\xd7\x91\xd7\xa8\xd7\x99\xd7\xaa\', NULL) RETURNING...\n ^\n',)
------------------
Traceback (most recent call last):
File "test.py", line 27, in <module>
print(unicode(e))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 118: ordinal not in range(128)
And I currently solve it by replacing the relevant exceptions with my own classes that decode from UTF-8. However, this is a terrible hack, and I'm looking for a proper solution:
Is there a way to configure SQLAlchemy to automatically decode the received error messages?
Is there a way to configure Postgres to output messages in a Latin encoding (less favorable, but acceptable)?
Is there a way to make unicode try to decode by UTF-8 instead of ASCII/Latin?
Is there any way to resolve it at all?
(The problem is relevant only to Python 2. In Python 3 the code above works; I believe it's because the default encoding is UTF-8.)
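The failure mode can be reproduced in isolation: decoding the UTF-8 bytes of the Hebrew text with the ASCII codec raises exactly this error, while UTF-8 succeeds. A sketch in Python 3 syntax:

```python
data = 'עברית'.encode('utf-8')  # the bytes embedded in the error message
try:
    data.decode('ascii')        # what unicode() effectively attempts in Python 2
except UnicodeDecodeError as e:
    print(e)                    # 'ascii' codec can't decode byte 0xd7 ...
print(data.decode('utf-8'))     # עברית
```

Note that 0xd7 is the first UTF-8 byte of 'ע', matching the byte named in the traceback above.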
I actually think that patching SQLAlchemy from your application is the right, reasonably clean solution. Here's why:
You've identified something that is generally agreed to be a bug in SQLAlchemy.
You can write a patch that behaves the same in all situations SQLAlchemy currently handles; that is, your patch will not break existing code.
The probability is very high that even if SQLAlchemy is fixed, your patch will be harmless.
Making this change reduces the impact of the SQLAlchemy bug throughout your code, compared with solutions like changing every place where exceptions might be printed.
Changing Postgres to return latin1-encoded messages actually wouldn't help, because Python is using the ASCII codec, which would raise the same error when given a latin1 string. Also, changing Postgres to return latin1 errors would probably involve changing the connection encoding, which would likely create issues for Unicode data.
Here's a simple program that patches sqlalchemy.exc.StatementError and tests the patch. If you wanted, you could even try generating an exception that includes Unicode, convert it to unicode, and only apply the patch if that raises UnicodeDecodeError. If you did that, your patch would automatically stop being applied once sqlalchemy fixes the issue.
# -*- coding: utf-8 -*-
from sqlalchemy.exc import StatementError

def statement_error_unicode(self):
    return unicode(str(self), 'utf-8')

# See <link to sqlalchemy issue>; can be removed once we require a
# version of sqlalchemy with a fix to that issue
StatementError.__unicode__ = statement_error_unicode

message = u'Sqlalchemy unicode 😞'
message_str = message.encode('utf-8')
error = StatementError(message_str, 'select * from users', tuple(), '')
print unicode(error)