Uploading CSV to PostgreSQL using psycopg2

Hi, I am trying to upload a CSV file to a PostgreSQL database using Python.
A table called "userspk" has already been created in the database called "DVD".
Below is the code:
import pandas as pd
import psycopg2 as pg2

conn = pg2.connect(database='DVD', user='xxx', password='xxx')
cur = conn.cursor()

def upload_data():
    with open('/Users/Downloads/DVDlist.csv', 'r') as f:
        next(f)  # skip the header row
        cur.copy_from(f, 'userspk', sep=',')
    conn.commit()

upload_data()
I keep getting the error below. I would have thought it would be fairly straightforward. Is something wrong with the code?
/Users/pk/.conda/envs/Pk/bin/python /Users/pk/PycharmProjects/Pk/SQL_upload_file.py
Traceback (most recent call last):
File "/Users/pk/PycharmProjects/Pk/SQL_upload_file.py", line 44, in <module>
upload_data()
File "/Users/pk/PycharmProjects/Pk/SQL_upload_file.py", line 37, in upload_data
next(f) # Skip the header row.
File "/Users/pk/.conda/envs/Pk/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 5718: invalid start byte

The error is raised by the next(f) call, so it has nothing to do with psycopg2 or PostgreSQL. Your file contains bytes that Python considers invalid as UTF-8.
The file is probably encoded in Latin-1, in which byte 0xa3 is the British pound sterling sign (£).
You might be able to fix it by specifying the encoding when you open the file:
open('/Users/Downloads/DVDlist.csv', 'r', encoding="latin-1")
But the rows after the header might also have some issues.
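Applied to the question's code, a minimal sketch (same table and path as above; the credentials are placeholders, and this assumes the whole file really is Latin-1):

import psycopg2 as pg2

conn = pg2.connect(database='DVD', user='xxx', password='xxx')  # placeholder credentials
cur = conn.cursor()

def upload_data():
    # Latin-1 maps every byte to a character, so the 0xa3 byte decodes cleanly as '£';
    # psycopg2 re-encodes the text in the connection encoding when sending the COPY.
    with open('/Users/Downloads/DVDlist.csv', 'r', encoding='latin-1') as f:
        next(f)  # skip the header row
        cur.copy_from(f, 'userspk', sep=',')
    conn.commit()

upload_data()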

Related

psycopg2 error while writing data from csv file: extra data after last expected column

I am trying to insert data from a CSV file (file.csv) into two columns of a table in Postgres. The data looks like this:
#Feature AC;Feature short label
EBI-517771;p.Leu107Phe
EBI-491052;p.Gly23Val
EBI-490120;p.Pro183His
EBI-517851;p.Gly12Val
EBI-492252;p.Lys49Met
EBI-527190;p.Cys360Ser
EBI-537514;p.Cys107Ser
The code I am running is as follows:
# create table in ebi_mut_db schema
cursor.execute("""
    CREATE TABLE IF NOT EXISTS ebi_mut_db.mutations_affecting_interactions(
        feature_ac TEXT,
        feature_short_label TEXT)
""")

with open('file.csv', 'r') as f:
    # Notice that we don't need the `csv` module.
    next(f)  # Skip the header row.
    cursor.copy_from(f, 'ebi_mut_db.mutations_affecting_interactions', sep=';')
conn.commit()
The table is created, but while writing the data it shows the error below.
Traceback (most recent call last):
File "stdin<>", line 38, in <module>
cursor.copy_from(f, 'ebi_mut_db.mutations_affecting_interactions', sep=';')
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT: COPY mutations_affecting_interactions, line 23: "EBI-878110;"p.[Ala223Pro;Ala226Pro;Ala234Asp]""
There are no extra columns beyond those two. My understanding is that the code is detecting more than two columns.
Thanks
You have not told COPY that you are using CSV format, so it is using the default TEXT format. In that format, quoting does not protect special characters, and since the offending line contains more than one ';', it appears to have more than two columns.
If you want COPY to know that a ';' inside quotes does not count as a separator, you have to tell it to use CSV format. In psycopg2, I think you have to use copy_expert, not copy_from, to accomplish this.
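For example, a minimal sketch (connection setup assumed; table name taken from the question):

import psycopg2

conn = psycopg2.connect(database='mydb')  # hypothetical connection details
cursor = conn.cursor()

with open('file.csv', 'r') as f:
    next(f)  # skip the header row
    # FORMAT csv makes COPY honor quoting, so ';' inside quotes is not a separator.
    cursor.copy_expert(
        "COPY ebi_mut_db.mutations_affecting_interactions "
        "FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
        f)
conn.commit()

Alternatively, add HEADER to the COPY options and let the server skip the first line instead of calling next(f) in Python.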

SQLAlchemy Unicode Problems in Exceptions

I'm working on a Flask app with Postgres/SQLAlchemy/Flask-Admin. However, in the Admin interface, any DB error that contains Unicode letters can't be reported, since unicode(exc) raises UnicodeDecodeError.
I was able to trace the problem to sqlalchemy.exc:
class StatementError(SQLAlchemyError):
    ...
    def __unicode__(self):
        return self.__str__()
And I can reproduce the problem with:
class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    name2 = Column(String, nullable=False)

session = Session()
a = A(name=u"עברית")
session.add(a)
try:
    session.commit()
except Exception as e:
    print(repr(e))
    print("------------------")
    print(unicode(e))
Which returns:
ProgrammingError('(psycopg2.ProgrammingError) column "name" of relation "a" does not exist\nLINE 1: INSERT INTO a (name, name2) VALUES (\'\xd7\xa2\xd7\x91\xd7\xa8\xd7\x99\xd7\xaa\', NULL) RETURNING...\n ^\n',)
------------------
Traceback (most recent call last):
File "test.py", line 27, in <module>
print(unicode(e))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 118: ordinal not in range(128)
I currently solve it by replacing the relevant exception classes with my own classes that decode from UTF-8. However, this is a terrible hack, and I'm looking for a proper solution:
Is there a way to configure SQLAlchemy to automatically decode the received error messages?
Is there a way to configure Postgres to output messages in a Latin encoding (less favorable, but acceptable)?
Is there a way to make unicode try to decode using UTF-8 instead of ASCII/Latin?
Is there any way to resolve this at all?
(The problem is relevant only to Python 2. In Python 3 the code above works; I believe that's because the default encoding is UTF-8.)
I actually think that patching SQLAlchemy from your application is the right, reasonably clean solution. Here's why:
You've identified something that is generally agreed to be a bug in SQLAlchemy.
You can write a patch that behaves the same for all situations SQLAlchemy currently handles. That is, your patch will not break existing code.
The probability is very high that even if SQLAlchemy is fixed, your patch will be harmless.
Making this change reduces the impact of the SQLAlchemy bug throughout your code, compared with solutions like changing every place where exceptions might be printed.
Changing Postgres to return latin1-encoded messages actually wouldn't help, because Python is using the ASCII codec, which would give the same error when handed a latin1 string. Also, changing Postgres to return latin1 errors would probably involve changing the connection encoding, and that likely creates issues for Unicode data.
Here's a simple program that patches sqlalchemy.exc.StatementError and tests the patch. If you wanted, you could even try generating an exception that includes Unicode, converting it to unicode, and only applying the patch if that raises UnicodeDecodeError; see the sketch after the program below. If you did that, your patch would automatically stop being applied once SQLAlchemy fixes the issue.
# -*- coding: utf-8 -*-
from sqlalchemy.exc import StatementError

def statement_error_unicode(self):
    return unicode(str(self), 'utf-8')

# See <link to sqlalchemy issue>; can be removed once we require a
# version of sqlalchemy with a fix to that issue
StatementError.__unicode__ = statement_error_unicode

message = u'Sqlalchemy unicode 😞'
message_str = message.encode('utf-8')
error = StatementError(message_str, 'select * from users', tuple(), '')
print unicode(error)
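A sketch of that conditional patching, reusing the pieces above (untested; the helper name is mine):

def _needs_unicode_patch():
    # Build an error whose message contains UTF-8 bytes and see
    # whether converting it to unicode blows up.
    probe = StatementError(u'\u05e2'.encode('utf-8'), 'select 1', tuple(), '')
    try:
        unicode(probe)
        return False
    except UnicodeDecodeError:
        return True

# Replace the unconditional assignment above with:
if _needs_unicode_patch():
    StatementError.__unicode__ = statement_error_unicode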

Unable to open HTML file with Chinese characters

Everyone, I ran into trouble when trying to open an HTML file containing Chinese characters. Here is the code:
# problem with Chinese characters
import wget

file = wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file, encoding='utf-8') as f:
    html = f.read()
print(html)
However, in the output I get the following error:
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 535: invalid continuation byte
I searched for a while and saw some similar issues, but the solutions seem to use latin-1, which is obviously not the case here. I'm not sure which encoding to use.
Any suggestions? Thanks!
The page you are referring to is not encoded in UTF-8 encoding, but in GBK. You can tell by looking at the header:
<meta charset="GBK">
If you specify encoding='gbk' it'll work.
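That is, keeping your wget download and only changing the encoding passed to open (a minimal sketch):

import wget

file = wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file, encoding='gbk') as f:  # decode the GBK bytes correctly
    html = f.read()
print(html)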
On another note, I would opt for not using wget unless you have to, and instead go with urllib, which comes with the Python standard library. It also saves the disk write, and the code is simpler:
import urllib.request

with urllib.request.urlopen("http://nba.stats.qq.com/player/list.htm") as file:
    html = file.read()
print(html.decode('gbk'))

Python encoding conversion

Here's my problem: I have a variable that is wrongly encoded and that I want to fix. Long story short, I end up with:
myVar = u'\xc3\xa9'
which is wrong, because it's the character 'é' (\u00e9) UTF-8 encoded, not a proper unicode string.
None of the combinations of encode/decode I tried seem to solve the problem. I looked at the bytearray object, but you must provide an encoding, and obviously none of them fits.
Basically I need to reinterpret the byte sequence under the correct encoding. Any ideas on how to do that?
Thanks.
What you should have done.
>>> b='\xc3\xa9'
>>> b
'\xc3\xa9'
>>> b.decode("UTF-8")
u'\xe9'
Since you didn't show the broken code that caused the problem, all we can do is make a complex problem more complex.
This appears to be what you're seeing.
>>> c
u'\xc3\xa9'
>>> c.decode("UTF-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Here's a workaround.
>>> [ chr(ord(x)) for x in c ]
['\xc3', '\xa9']
>>> ''.join(_)
'\xc3\xa9'
>>> _.decode("UTF-8")
u'\xe9'
Fix the code that produced the wrong stuff to begin with.
The hacky solution: pull out the codepoints with ord, then build characters (length-one strings) out of these with chr, then paste the lot back together and decode.
>>> u = u'\xc3\xa9'
>>> s = ''.join(chr(ord(c)) for c in u)
>>> unicode(s, encoding='utf-8')
u'\xe9'
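A shorter equivalent (assuming the mis-decoded string came from a codec like Latin-1 that maps each byte straight to the code point of the same value): encode back to the original bytes, then decode properly.

>>> u = u'\xc3\xa9'
>>> u.encode('latin-1').decode('utf-8')  # undo the wrong decode, redo it as UTF-8
u'\xe9'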

unicode().decode('utf-8', 'ignore') raising UnicodeEncodeError

Here is the code:
>>> z = u'\u2022'.decode('utf-8', 'ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2022' in position 0: ordinal not in range(256)
Why is UnicodeEncodeError raised when I am using .decode?
Why is any error raised when I am using 'ignore'?
When I first started messing around with Python strings and Unicode, it took me a while to understand the jargon of decode and encode too, so here's an earlier post of mine that may help:
Think of decoding as what you do to go from a regular bytestring to unicode and encoding as what you do to get back from unicode. In other words:
You de-code a str to produce a unicode string (in Python 2)
and en-code a unicode string to produce a str (in Python 2)
So:
unicode_char = u'\xb0'
encodedchar = unicode_char.encode('utf-8')
encodedchar will contain your character's bytes in the selected encoding (in this case, UTF-8).
The same principle applies to Python 3. You de-code a bytes object to produce a str object. And you en-code a str object to produce a bytes object.
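For instance, a minimal Python 3 illustration (names are mine):

# Python 3
s = 'café'                       # str: unicode text
b = s.encode('utf-8')            # en-code: str -> bytes, b'caf\xc3\xa9'
assert b.decode('utf-8') == s    # de-code: bytes -> str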
From http://wiki.python.org/moin/UnicodeEncodeError
Paradoxically, a UnicodeEncodeError may happen when
decoding. The cause of it seems to be the
coding-specific decode() functions that normally expect
a parameter of type str. It appears that on seeing a
unicode parameter, the decode() functions "down-convert"
it into str, then decode the result assuming it to be of
their own coding. It also appears that the
"down-conversion" is performed using the ASCII encoder.
Hence an encoding failure inside a decoder.
You're trying to decode a unicode. The implicit encoding to make the decode work is what's failing.
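You can see the hidden step by performing it explicitly (Python 2; the implicit codec is normally ASCII, though your traceback suggests a system default of latin-1, and the mechanism is the same):

>>> u'\u2022'.encode('ascii')    # the implicit step that actually fails
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 0: ordinal not in range(128)
>>> u'\u2022'.encode('utf-8')    # encode, don't decode, a unicode object
'\xe2\x80\xa2'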