UnicodeDecodeError: 'ascii' codec can't decode, with gensim, python3.5 - encoding

I am using Python 3.5 on both Windows and Linux, but I get the same error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)
The error log is the following:
Reloaded modules: lazylinker_ext
Traceback (most recent call last):
File "<ipython-input-2-d60a2349532e>", line 1, in <module>
runfile('C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py', wdir='C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622')
File "C:\Users\YZC\Anaconda3\lib\site- packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\YZC\Anaconda3\lib\site- packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py", line 70, in <module>
model = gensim.models.Word2Vec.load('model_all_no_lemma')
File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 1485, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 248, in load
obj = unpickle(fname)
File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 912, in unpickle
return _pickle.loads(f.read())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)
1. I checked and found that the default encoding is utf-8:
import sys
sys.getdefaultencoding()
Out[2]: 'utf-8'
When reading the file, I also tried adding .decode('utf-8').
2. I added an encoding declaration (# -*- coding: utf-8 -*-) at the beginning of the file.
So I really don't know why Python couldn't read the file. Can anybody help me out?
Here is the code:
# -*- coding: utf-8 -*-
import gensim
import csv
import numpy as np
import math
import string
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word
class SpeechParser(object):
    def __init__(self, filename):
        self.filename = filename
        self.lemmatize = WordNetLemmatizer().lemmatize
        self.cached_stopwords = stopwords.words('english')

    def __iter__(self):
        # Python 3: open in text mode with an explicit encoding ('rb' cannot take encoding=)
        with open(self.filename, 'r', encoding='utf-8', newline='') as csvfile:
            file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
            headers = next(file_reader)  # Python 3: next(reader), not reader.next()
            for row in file_reader:
                parsed_row = self.parse_speech(row[-2])
                yield parsed_row

    def parse_speech(self, row):
        # Python 3: str.translate takes a mapping table, and str has no .decode()
        table = str.maketrans('', '', string.punctuation)
        speech_words = row.replace('\r\n', ' ').strip().lower().translate(table)
        return speech_words.split()
    # -- source: https://github.com/prateekpg2455/U.S-Presidential-Speeches/blob/master/speech.py --
    def pos(self, tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''
if __name__ == '__main__':
    # instantiate object
    sentences = SpeechParser("sample.csv")

    # load an existing model
    model = gensim.models.Word2Vec.load('model_all_no_lemma')
    print('\n-----------------------------------------------------------')
    print('MODEL:\t{0}'.format(model))
    vocab = model.vocab

    # print log-probability of first 10 sentences
    row_count = 0
    print('\n------------- Scores for first 10 documents: -------------')
    for doc in sentences:
        print(sum(model.score(doc)) / len(doc))
        row_count += 1
        if row_count > 10:
            break
    print('\n-----------------------------------------------------------')

It looks like a bug in Gensim when you try to use a Python 2 pickle file that has non-ASCII chars in it with Python 3.
The unpickle is happening when you call:
model = gensim.models.Word2Vec.load('model_all_no_lemma')
In Python 3, the unpickler wants to convert legacy Python 2 byte strings to (Unicode) strings. The default action is to decode them with the 'ASCII' codec in strict mode, which fails on any byte above 0x7F.
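A minimal reproduction of that default, independent of gensim: the bytes below are a protocol-0 pickle of the Python 2 byte string '\xc1', unpickled under Python 3.

import pickle

py2_pickle = b"S'\\xc1'\n."  # Python 2 pickle of the byte string '\xc1'

try:
    pickle.loads(py2_pickle)  # default: decode legacy strings as strict ASCII
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)

print(pickle.loads(py2_pickle, encoding='latin1'))  # 'Á'
print(pickle.loads(py2_pickle, encoding='bytes'))   # b'\xc1'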
The fix depends on the encoding of your original pickle file and requires you to patch the gensim code.
I'm not familiar with gensim, so you will have to try the following two options:
Force UTF-8
Chances are, your non-ASCII data is in UTF-8 format.
Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
Go to line 912 and change it to read:
return _pickle.loads(f.read(), encoding='utf-8')
Byte mode
Gensim in Python3 may happily work with byte strings:
Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
Go to line 912 and change it to read:
return _pickle.loads(f.read(), encoding='bytes')
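If you'd rather not edit the installed package, here is a minimal sketch of the same two options, assuming the model was pickled into a single file (no separate .npy array files saved alongside it): unpickle it yourself and try the encodings in turn.

import pickle

def load_w2v(fname):
    # Hypothetical helper, not part of gensim: mirrors utils.unpickle
    # but tries the likely encodings in order.
    with open(fname, 'rb') as f:
        raw = f.read()
    for enc in ('utf-8', 'latin1', 'bytes'):
        try:
            return pickle.loads(raw, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('could not unpickle %s with any tried encoding' % fname)

model = load_w2v('model_all_no_lemma')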

Related

Getting a UnicodeDecodeError while trying to load a pre-trained model using torch.load(PATH)

I am trying to load a pre-trained ResNet-18 model using torch.load(PATH), but I am getting a UnicodeDecodeError.
Traceback (most recent call last):
File "main.py", line 312, in <module>
main()
File "main.py", line 138, in main
checkpoint = torch.load(args.resume)
File "F:\InsSoft\Anaconda\lib\site-packages\torch\serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "F:\InsSoft\Anaconda\lib\site-packages\torch\serialization.py", line 773, in _legacy_load
result = unpickler.load()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 2: invalid start byte
This error occurs whenever the model was pretrained with torch < 0.4 but torch > 0.4 is used for testing or resuming.
The fix is to pass an explicit encoding: checkpoint = torch.load(args.resume, encoding='latin1')
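A small defensive sketch (torch.load forwards the encoding keyword to the underlying unpickler): try the default first, then fall back to latin1 for old checkpoints.

import torch

def load_checkpoint(path):
    # Checkpoints written by torch < 0.4 may contain Python 2 byte strings;
    # latin1 can decode any byte value, so it is a safe fallback.
    try:
        return torch.load(path)
    except UnicodeDecodeError:
        return torch.load(path, encoding='latin1')

checkpoint = load_checkpoint(args.resume)  # args.resume as in the question's code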

ValueError: a string literal cannot contain NUL (0x00) characters in Odoo

I'm getting this error when I try to import users.
NB: the module itself works fine: I can import users using Odoo 10 with PostgreSQL 9.3 on Ubuntu 14. Here, however, I'm using Odoo 10 with PostgreSQL 9.5 on Ubuntu 16.
File "/home/belazar/Documents/addons_odoo10/hr_biometric_machine/models/biometric_machine.py", line 113, in create_user
'biometric_device': self.id, }
File "/opt/odoo/odoo/models.py", line 3830, in create
record = self.browse(self._create(old_vals))
File "/opt/odoo/odoo/models.py", line 3925, in _create
cr.execute(query, tuple(u[2] for u in updates if len(u) > 2))
File "/opt/odoo/odoo/sql_db.py", line 154, in wrapper
return f(self, *args, **kwargs)
File "/opt/odoo/odoo/sql_db.py", line 231, in execute
res = self._obj.execute(query, params)
ValueError: A string literal cannot contain NUL (0x00) characters.
The problem is solved by commenting out these lines in /opt/odoo/odoo/sql_db.py:
#from tools import parse_version as pv
#if pv(psycopg2.__version__) < pv('2.7'):
# from psycopg2._psycopg import QuotedString
# def adapt_string(adapted):
# """Python implementation of psycopg/psycopg2#459 from v2.7"""
# if '\x00' in adapted:
# raise ValueError("A string literal cannot contain NUL (0x00) characters.")
# return QuotedString(adapted)
#psycopg2.extensions.register_adapter(str, adapt_string)
#psycopg2.extensions.register_adapter(unicode, adapt_string)
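Note that commenting those lines out only disables psycopg2's client-side check; PostgreSQL itself still rejects NUL bytes in text columns, so the insert can fail anyway. A safer sketch, assuming you can modify the code that builds the values (the helper name is illustrative, not an Odoo API), is to strip the NULs before they reach cr.execute():

def strip_nuls(value):
    # Odoo 10 runs on Python 2.7, where text values may be str or unicode
    if isinstance(value, basestring) and '\x00' in value:
        return value.replace('\x00', '')
    return value

# e.g. applied to the vals dict built in biometric_machine.py before create():
# vals = {key: strip_nuls(val) for key, val in vals.items()}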

TensorFlow: the gfile.FastGFile() method can't read a file whose path includes Chinese characters

I want to use gfile.FastGFile(image_path, 'rb').read() to read a picture and use it as input to my project, and I use the directory name as the label of the pictures contained in that directory. When the directory name is in English my code works fine, but when the directory name is in Chinese it throws this error:
Traceback (most recent call last):
File "F:/pythonWS/imageFilter/jpegFileJudge.py", line 27, in <module>
image_data = gfile.FastGFile(image_path, 'rb').read()
File "C:\Program Files\Python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 106, in read
self._preread_check()
File "C:\Program Files\Python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 73, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "C:\Program Files\Python35\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Program Files\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: F:\vsWorkspace\pics\test\三宝鸟\0ff41bd5ad6eddc403fa02d13bdbb6fd526633fe.jpg : ϵͳ\udcd5Ҳ\udcbb\udcb5\udcbdָ\udcb6\udca8\udcb5\udcc4\udcceļ\udcfe\udca1\udca3 (a mojibake rendering of the Windows message "The system cannot find the file specified.")
My test code is:
# -*- coding: utf-8 -*-
import glob
import os
import random
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile

image_folder = 'F:/vsWorkspace/pics/test'
os.chdir(image_folder)
count = 0
for each in os.listdir(image_folder):
    each = os.path.abspath(each)
    os.chdir(each)
    for image_path in os.listdir(each):
        image_path = os.path.abspath(image_path)
        print(image_path)
        image_data = gfile.FastGFile(image_path, 'rb').read()
        count += 1
    os.chdir(image_folder)
My environment is Windows 7 x64, Python 3.5.3, and TensorFlow 1.0. How can I solve this problem?
By the way, I have to use the Chinese directory names as the labels for my pictures.
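One workaround, assuming the images only need to be read from the local filesystem (gfile's extra value is support for schemes like GCS/HDFS), is to bypass gfile and read the bytes with Python's built-in open(), which handles Unicode paths on Windows:

# Sketch: replaces the gfile call inside the loop above
with open(image_path, 'rb') as f:
    image_data = f.read()  # same bytes that gfile.FastGFile(image_path, 'rb').read() returns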

Problems with MySQL encoding

I have a serious problem with my populate script. Characters are not stored correctly. My code:
def _create_Historial(self):
    datos = [self.DB_HOST, self.DB_USER, self.DB_PASS, self.DB_NAME]
    conn = MySQLdb.connect(*datos)
    cursor = conn.cursor()
    cont = 0
    with open('principal/management/commands/Historial_fichajes_jugadores.csv', 'rb') as csvfile:
        historialReader = csv.reader(csvfile, delimiter=',')
        for row in historialReader:
            if cont == 0:
                cont += 1
            else:
                #unicodedata.normalize('NFKD', unicode(row[4], 'latin1')).encode('ASCII', 'ignore'),
                cursor.execute('''INSERT INTO principal_historial(jugador_id, temporada, fecha, ultimoClub, nuevoClub, valor, coste) VALUES (%s,%s,%s,%s,%s,%s,%s)''',
                               (round(float(row[1]))+1, row[2], self.stringToDate(row[3]), unicode(row[4], 'utf-8'), row[5], self.convertValue(row[6]), str(row[7])))
    conn.commit()
    cursor.close()
    conn.close()
The error is the following:
Traceback (most recent call last):
File "/home/tfg/pycharm-2016.3.2/helpers/pycharm/django_manage.py", line 41, in <module>
run_module(manage_file, None, '__main__', True)
File "/usr/lib/python2.7/runpy.py", line 188, in run_module
fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 82, in _run_module_code
mod_name, mod_fname, mod_loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/tfg/TrabajoFinGrado/demoTFG/manage.py", line 10, in <module>
execute_from_command_line(sys.argv)
File "/usr/local/lib/python2.7/dist- packages/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/usr/local/lib/python2.7/dist -packages/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/home/tfg/TrabajoFinGrado/demoTFG/principal/management/commands/populate_db.py", line 230, in handle
self._create_Historial()
File "/home/tfg/TrabajoFinGrado/demoTFG/principal/management/commands/populate_db.py", line 217, in _create_Historial
(round(float(row[1]))+1,row[2], self.stringToDate(row[3]), unicode(row[4],'utf-8'), row[5], self.convertValue(row[6]), str(row[7])))
File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 187, in execute
query = query % tuple([db.literal(item) for item in args])
File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 278, in literal
return self.escape(o, self.encoders)
File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 208, in unicode_literal
return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 6-7: ordinal not in range(256)
The characters were shown as follows: Nicolás Otamendi, Gaël Clichy, ...
When I print the characters in the Python shell, they are shown correctly.
Sorry for my English :(
Ok, I'll keep this brief.
You should convert encoded data (byte strings) to Unicode early in your code. Don't inline .decode()/.encode()/unicode() calls.
When you open a file in Python 2.7, you get byte strings back. You should use io.open(filename, encoding='utf-8'), which reads it as text and decodes it from UTF-8 to Unicode.
The Python 2.7 csv module is not Unicode compatible. You should install https://github.com/ryanhiebert/backports.csv
You need to tell the MySQL driver that you're going to pass Unicodes and use UTF-8 for the connection. This is done by adding the following to your connection string:
charset='utf8',
use_unicode=True
Pass Unicode strings to MySQL. Use the u'' prefix to avoid troublesome implied conversion.
Your CSV data will then already be Unicode strings; there's no need to convert it.
Putting it all together, your code will look like:

from backports import csv
import io

datos = [self.DB_HOST, self.DB_USER, self.DB_PASS, self.DB_NAME]
conn = MySQLdb.connect(*datos, charset='utf8', use_unicode=True)
cursor = conn.cursor()
cont = 0
with io.open('principal/management/commands/Historial_fichajes_jugadores.csv', 'r', encoding='utf-8') as csvfile:
    historialReader = csv.reader(csvfile, delimiter=',')
    for row in historialReader:
        if cont == 0:
            cont += 1
        else:
            cursor.execute(u'''INSERT INTO principal_historial(jugador_id, temporada, fecha, ultimoClub, nuevoClub, valor, coste) VALUES (%s,%s,%s,%s,%s,%s,%s)''',
                           (round(float(row[1]))+1, row[2], self.stringToDate(row[3]), row[4], row[5], self.convertValue(row[6]), row[7]))
conn.commit()
cursor.close()
conn.close()
You may also want to look at https://stackoverflow.com/a/35444608/1554386, which covers what Python 2.7 Unicodes are.

Pandas HDF5 store unicode error on select query

I have unicode data as read from this file:
Mdt,Doccompra,OrgC,Cen,NumP,Criadopor,Dtcriacao,Fornecedor,P,Fun
400,8751215432,2581,,1,MIGRAÇÃO,01.10.2004,75852214,,TD
400,5464282154,9874,,1,MIGRAÇÃO,01.10.2004,78995411,,FO
I have two problems:
When I try to query this unicode data I get a UnicodeDecodeError:
Traceback (most recent call last):
File "<ipython-input-1-4423dceb2b1d>", line 1, in <module>
runfile('C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py', wdir='C:/Users/u5en/Documents/SAP/Programação')
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 48, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py", line 15, in <module>
store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3968, in read
if not self.read_axes(where=where, **kwargs):
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3201, in read_axes
a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2058, in convert
self.data, nan_rep=nan_rep, encoding=encoding)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4359, in _unconvert_string_array
data = f(data)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1700, in __call__
return self._vectorize_call(func=func, args=vargs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1769, in _vectorize_call
outputs = ufunc(*inputs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4358, in <lambda>
f = np.vectorize(lambda x: x.decode(encoding), otypes=[np.object])
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 7: unexpected end of data
How can I store and query my unicode data in hdf5?
The second problem: I have many tables with column names I do not know beforehand and which are not valid pytables identifiers (hence the NaturalNameWarning). I would like users to be able to query those columns, so how can I query them when their names prevent it? I saw this used to have no easy fix; if that is still the case, I will just remove the offending characters from the headings.
import csv
import pandas as pd
dados = pd.read_csv("EKKA - Cópia.csv")
print(dados)
store= pd.HDFStore('teste.h5' , encoding="utf-8")
store.append("EKKA", dados, format="table", data_columns=True)
store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
store.close()
Would I be better off doing this in sqlite?
Environment:
Windows 7 64bit
Pandas 0.15.2
NumPy 1.9.2
So under Python 2.7 on Windows 7 with pandas 0.15.2, everything worked as expected, no encoding necessary. However, on Python 3.4 the following worked for me. Apparently the file's bytes are not valid 'utf-8'; decoding as 'latin1' (which accepts any byte value) usually solves these issues. Note that I had to read the csv with this encoding in the first place.
>>> df = pd.read_csv('../../test.csv',encoding='latin1')
>>> df
Mdt Doccompra OrgC Cen NumP Criadopor Dtcriacao Fornecedor P Fun
0 400 8751215432 2581 NaN 1 MIGRAÇ\xc3O 01.10.2004 75852214 NaN TD
1 400 5464282154 9874 NaN 1 MIGRAÇ\xc3O 01.10.2004 78995411 NaN FO
Further, the encoding must be specified not when opening the store, but on the append/put calls
>>> df.to_hdf('test.h5','df',format='table',mode='w',data_columns=True,encoding='latin1')
>>> pd.read_hdf('test.h5','df')
Mdt Doccompra OrgC Cen NumP Criadopor Dtcriacao Fornecedor P Fun
0 400 8751215432 2581 NaN 1 MIGRAÇ\xc3O 01.10.2004 75852214 NaN TD
1 400 5464282154 9874 NaN 1 MIGRAÇ\xc3O 01.10.2004 78995411 NaN FO
Once it is written encoded, it is not necessary to specify the encoding when reading.
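The same applies to the HDFStore API used in the question. A sketch, assuming pandas 0.15.x, with the encoding moved from the HDFStore constructor (where it has no effect) to the append call:

import pandas as pd

dados = pd.read_csv('EKKA - Cópia.csv', encoding='latin1')
store = pd.HDFStore('teste.h5')  # no encoding argument here
store.append('EKKA', dados, format='table', data_columns=True, encoding='latin1')
print(store.select("EKKA", "columns=['Mdt', 'Fornecedor']"))
store.close()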