SQLAlchemy, Psycopg2 and Postgresql COPY - postgresql

It looks like Psycopg has a custom command for executing a COPY:
psycopg2 COPY using cursor.copy_from() freezes with large inputs
Is there a way to access this functionality from within SQLAlchemy?

The accepted answer is correct, but if you want more than just EoghanM's comment to go on, the following worked for me in COPYing a table out to CSV...
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
eng = create_engine("postgresql://user:pwd@host:5432/db")
ses = sessionmaker(bind=eng)
dbcopy_f = open('/tmp/some_table_copy.csv','wb')
copy_sql = 'COPY some_table TO STDOUT WITH CSV HEADER'
fake_conn = eng.raw_connection()
fake_cur = fake_conn.cursor()
fake_cur.copy_expert(copy_sql, dbcopy_f)
The sessionmaker isn't necessary, but if you're in the habit of creating the engine and the session at the same time, to use raw_connection you'll need to separate them (unless there is some way to access the engine through the session object that I don't know). The SQL string provided to copy_expert is also not the only way to do it; there is a basic copy_to function that you can use with a subset of the parameters that you could pass to a normal COPY TO query. Overall performance of the command seems fast for me, copying out a table of ~20000 rows.
http://initd.org/psycopg/docs/cursor.html#cursor.copy_to
http://docs.sqlalchemy.org/en/latest/core/connections.html#sqlalchemy.engine.Engine.raw_connection
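As a rough sketch of the copy_to variant mentioned above (reusing the eng engine from the snippet; the table name, output path, and separator are just placeholders):
fake_conn = eng.raw_connection()
fake_cur = fake_conn.cursor()
with open('/tmp/some_table_copy.tsv', 'w') as f:
    # copy_to(file, table, sep='\t', null='\\N', columns=None)
    fake_cur.copy_to(f, 'some_table', sep='\t')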

If your engine is configured with a psycopg2 connection string (which is the default, so either "postgresql://..." or "postgresql+psycopg2://..."), you can create a psycopg2 cursor from a SQLAlchemy session using
cursor = session.connection().connection.cursor()
which you can use to execute
cursor.copy_from(...)
The cursor will be active in the same transaction as your session currently is. If a commit or rollback happens, any further use of the cursor will throw a psycopg2.InterfaceError, and you would have to create a new one.
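Put together, a minimal sketch (assuming a placeholder table mytable with columns id and name, tab-separated input, and placeholder connection details) could look like:
import io
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql+psycopg2://user:pwd@localhost:5432/db')
session = sessionmaker(bind=engine)()

data = io.StringIO('1\tfoo\n2\tbar\n')              # tab-separated rows
cursor = session.connection().connection.cursor()   # raw psycopg2 cursor
cursor.copy_from(data, 'mytable', sep='\t', columns=('id', 'name'))
session.commit()   # commits the same transaction the cursor was part of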

You can use:
import io  # the original used Python 2's cStringIO; io.StringIO is the Python 3 equivalent

def to_sql(engine, df, table, if_exists='fail', sep='\t', encoding='utf8'):
    # Create the table from the (empty) DataFrame schema
    df[:0].to_sql(table, engine, if_exists=if_exists)
    # Prepare the data as an in-memory CSV
    output = io.StringIO()
    df.to_csv(output, sep=sep, header=False, encoding=encoding)
    output.seek(0)
    # Insert the data with COPY
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.copy_from(output, table, sep=sep, null='')
    connection.commit()
    cursor.close()
With this I insert 200,000 rows in 5 seconds instead of 4 minutes.
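A hypothetical call (the engine URL, DataFrame, and table name are placeholders) would look like:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:pwd@localhost:5432/db')
df = pd.DataFrame({'id': [1, 2], 'name': ['foo', 'bar']})
to_sql(engine, df, 'my_table', if_exists='replace')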

It doesn't look like it.
You may have to just use psycopg2 to expose this functionality and forego the ORM capabilities. I guess I don't really see the benefit of ORM in such an operation anyway since it's a straight bulk insert and dealing with individual objects a la an ORM would not really make a whole lot of sense.
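For reference, a bare psycopg2 sketch (connection parameters, file path, and table name are placeholders) is short:
import psycopg2

conn = psycopg2.connect('dbname=db user=user password=pwd host=localhost')
cur = conn.cursor()
with open('/tmp/data.tsv') as f:
    # default COPY text format: tab-separated columns, one row per line
    cur.copy_from(f, 'some_table', sep='\t')
conn.commit()
conn.close()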

If you're starting from SQLAlchemy, you need to first get to the connection engine (also known by the property name bind on some SQLAlchemy objects):
engine = create_engine('postgresql+psycopg2://myuser:password@localhost/mydb')
# or
engine = session.get_bind()
# or any other way you know to get to the engine
From the engine you can isolate a psycopg2 connection:
# get a psycopg2 connection
connection = engine.connect().connection
# get a cursor on that connection
cursor = connection.cursor()
Here are some templates for the COPY statement to use with cursor.copy_expert(), a more complete and flexible option than copy_from() or copy_to(), as indicated here: https://www.psycopg.org/docs/cursor.html#cursor.copy_expert.
# to dump to a file
dump_to = """
COPY mytable
TO STDOUT
WITH (
FORMAT CSV,
DELIMITER ',',
HEADER
);
"""
# to copy from a file:
copy_from = """
COPY mytable
FROM STDIN
WITH (
FORMAT CSV,
DELIMITER ',',
HEADER
);
"""
Check out what the options above mean and others that may be of interest to your specific situation https://www.postgresql.org/docs/current/static/sql-copy.html.
IMPORTANT NOTE: The documentation of cursor.copy_expert() indicates using STDOUT to write out to a file and STDIN to copy from a file. But if you look at the syntax in the PostgreSQL manual, you'll notice that you can also specify the file to write to or read from directly in the COPY statement. Don't do that: the filename form is read and written by the PostgreSQL server process on the server's filesystem, so it requires superuser privileges (or the pg_read_server_files / pg_write_server_files roles) and never touches files on your client machine. Just do what's indicated in psycopg2's docs and specify STDIN or STDOUT in your statement with cursor.copy_expert(); it should be fine.
# running the copy statement
with open('/path/to/your/data/file.csv') as f:
    cursor.copy_expert(copy_from, file=f)
# don't forget to commit the changes.
connection.commit()

You don't need to drop down to psycopg2, use raw_connection, or create a cursor.
Just execute the sql as usual, you can even use bind parameters with text():
engine.execute(
    text("copy some_table from :csv delimiter ',' csv")
        .execution_options(autocommit=True),
    csv='/tmp/a.csv')
You can drop the execution_options(autocommit=True) once this PR is accepted.
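On SQLAlchemy 1.4/2.0, where Engine.execute() and the autocommit execution option are gone, a roughly equivalent sketch (same placeholder table and file path) would be:
from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://user:pwd@localhost:5432/db')
with engine.begin() as conn:   # commits automatically on success
    conn.execute(
        text("copy some_table from :csv delimiter ',' csv"),
        {'csv': '/tmp/a.csv'})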

Related

Matlab create new Microsoft Access Database file *.accdb

I have used the following code pattern to access my *.accdb files:
accdb_path='C:\path\to\accdb\file\wbe3.accdb';
accdb_url= [ 'jdbc:odbc:Driver={Microsoft Access Driver (*.mdb, *.accdb)};DSN='''';DBQ=' accdb_path ];
conn = database('','','','sun.jdbc.odbc.JdbcOdbcDriver',accdb_url);
If instead I want to create a new *.accdb file, how would I do that? There is much on the web about how to connect, but I haven't found how to create the *.accdb file itself.
In case it matters, I want to be able to execute SQL 92 syntax. I am using Matlab 2015b. I do not want to use the Matlab GUI for exploring databases.
Actually, what you are attempting to do can be very tricky to achieve. It may require a direct interface to Access through an ActiveX control and I'm not even sure it can be done. It seems that the web is lacking a solid information pool concerning Access interoperability.
One quick workaround I can suggest, although a miserable one, is to manually create an empty ACCDB file that you can use as a template and then duplicate it whenever a new database must be created:
conn = CreateDB('C:\PathB\wbe3.accdb');

function accdb_conn = CreateDB(accdb_path)
    status = copyfile('C:\PathA\template.accdb', accdb_path, 'f');
    if (status)
        accdb_url = ['jdbc:odbc:Driver={Microsoft Access Driver (*.mdb, *.accdb)};DSN='''';DBQ=' accdb_path];
        accdb_conn = database('','','','sun.jdbc.odbc.JdbcOdbcDriver',accdb_url);
    else
        accdb_conn = [];
        error(['Could not duplicate the ACCDB template to the directory "' accdb_path '".']);
    end
end
The following example is based on Tommaso's answer, which provides code for copying an empty *.accdb file and connecting to the copy. Based on an afternoon of trial and error and perusing the web/help, I've expanded on that to create a database table and export a Matlab table to it. I've also embedded comments showing where modifications are needed (presumably due to my older 2015b version of Matlab), error-catching constructs, and caveats in the file copy.
srcPath = [pwd '/emptyFile.accdb']; % Source
tgtPath = [pwd '/new.accdb'];       % Target
cpyStatOk = copyfile( srcPath, tgtPath );
% No warning B4 clobber target file
if cpyStatOk
    accdb_url = [ ...
        'jdbc:odbc:Driver={Microsoft Access Driver (*.mdb, *.accdb)};DSN='''';DBQ=' ...
        tgtPath ];
    conn = database('','','','sun.jdbc.odbc.JdbcOdbcDriver',accdb_url);
else
    error('Couldn''t copy %s to %s',srcPath,tgtPath);
end % if cpyStatOk
try
    % conn.Execute(['CREATE TABLE tstMLtbl2accdb ' ... Not for 2015b
    curs = conn.exec(['CREATE TABLE tstMLtbl2accdb ' ...
        '( NumCol INTEGER, StrCol VARCHAR(255) );']);
    if ~isempty( curs.Message )
        % fprintf(2,'%s: %s\n',mfilename,curs.Message);
        error('%s: %s\n',mfilename,curs.Message);
        % Trigger `catch` & close(conn)
    end %if
    % sqlwrite( conn, 'tstMLtbl2accdb', ... Not supported in 2015b
    datainsert( conn, 'tstMLtbl2accdb', {'NumCol','StrCol'}, ...
        table( floor(10*rand(5,1)), {'abba';'cadabra';'dog';'cat';'mouse'}, ...
        'VariableNames',{'NumCol','StrCol'} ) );
catch xcptn
    close(conn)
    fprintf(2,'Done `catch xcptn`\n');
    rethrow(xcptn);
end % try
%
% Other database manipulations here
%
close(conn)
disp(['Done ' mfilename]);
This has been immensely educational for me, and I hope it is useful for others considering SQL as an alternative to the more code-heavy Matlab equivalents of relational database manipulations. With this amount of overhead, I'd have to say that it is not attractive to perform SQL manipulations on data residing in the Matlab workspace except where one really needs the hyperoptimization of relational database query engines.
To those savvy with interfacing to Access, your comment on the purpose of the field names argument of the datainsert function would be appreciated. It is dubbed colnames in the documentation. From testing, the field names and number of columns must match between the existing target table in Access and the source table in Matlab. So the field names argument doesn't seem to serve any purpose. The help documentation isn't all that helpful.
AFTERNOTE: I've composed a "specification" for the colnames argument based on examples from TMW. TMW has confirmed this explanation:
The colnames argument tells the external database environment the names and order of fields of the data container supplied via the data argument. These field names are used to match the fields of the transferred data with fields in the table tablename residing within the external database environment. Because of this explicit name matching, the order of the fields in data does not have to match the order of the fields in tablename.
If I find any departures of empirical behaviour from the above "specification", I will update this answer.

neo4j import script endless loop because 2 properties with same name

I just managed to freeze my whole environment with a cypher import script. The process was running with 99% CPU uncontrollably until we killed it.
I am not sure, but I think the bug was in the import script, which tried to set two properties with the same name, reading like:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS import FIELDTERMINATOR ';'
... (some WITH / WHERE clauses)
CREATE (:Mylabel {myproperty: import.column1, myproperty: import.column2});
Does anyone have experience with behaviour like that?
EDIT:
I am not allowed to copy-paste the exact code, but I can try to leave it semantically intact:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///db.csv' AS row FIELDTERMINATOR ';'
WITH row
WHERE row.typerow = 'Some_Identifier'
WITH head(collect(row.id)) as aid, row.exclusive AS excl, toInteger(row.alwsel) AS alwsel
CREATE (:Mylabel:Mytype {aid: toInteger(aid), exclusive: toString(excl),
exclusive: CASE WHEN alwsel=1 THEN true ELSE NULL END});
As was inquired below: there is no constraint on the property in question. I am currently not able to do any tests. I will be in a few days.

How to insert similar value into multiple locations of a psycopg2 query statement using dict? [duplicate]

I have a Python script that runs a pgSQL file through SQLAlchemy's connection.execute function. Here's the block of code in Python:
results = pg_conn.execute(sql_cmd, beg_date = datetime.date(2015,4,1), end_date = datetime.date(2015,4,30))
And here's one of the areas where the variable gets inputted in my SQL:
WHERE
( dv.date >= %(beg_date)s AND
dv.date <= %(end_date)s)
When I run this, I get a cryptic python error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) argument formats can't be mixed
…followed by a huge dump of the offending SQL query. I've run this exact code with the same variable convention before. Why isn't it working this time?
I encountered a similar issue as Nikhil. I have a query with LIKE clauses which worked until I modified it to include a bind variable, at which point I received the following error:
DatabaseError: Execution failed on sql '...': argument formats can't be mixed
The solution is not to give up on the LIKE clause. It would be pretty surprising if psycopg2 simply didn't permit LIKE clauses. Rather, we can escape the literal % with %%. For example, the following query:
SELECT *
FROM people
WHERE start_date > %(beg_date)s
AND name LIKE 'John%';
would need to be modified to:
SELECT *
FROM people
WHERE start_date > %(beg_date)s
AND name LIKE 'John%%';
More details in the psycopg2 docs: http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
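Another sketch that sidesteps the escaping entirely is to pass the whole pattern as a bind parameter (the table, columns, and connection details here are just placeholders):
import datetime
import psycopg2

conn = psycopg2.connect('dbname=db user=user')
cur = conn.cursor()
cur.execute(
    "SELECT * FROM people WHERE start_date > %(beg_date)s AND name LIKE %(pattern)s",
    {'beg_date': datetime.date(2015, 4, 1), 'pattern': 'John%'})
Because the % wildcard now lives in the Python value rather than in the SQL text, psycopg2 never has to interpret it as a placeholder.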
As it turned out, I had used a SQL LIKE operator in the new SQL query, and the % wildcard was interfering with the parameter escaping. For instance:
dv.device LIKE 'iPhone%' or
dv.device LIKE '%Phone'
Another answer offered a way to un-escape and re-escape, which I felt would add unnecessary complexity to otherwise simple code. Instead, I used pgSQL's ability to handle regex to modify the SQL query itself. This changed the above portion of the query to:
dv.device ~ E'iPhone.*' or
dv.device ~ E'.*Phone$'
So for others: you may need to change your LIKE operators to regex '~' to get it to work. Just remember that it'll be WAY slower for large queries. (More info here.)
For me, it turned out I had a % in a SQL comment:
/* Any future change in the testing size will not require
a change here... even if we do a 100% test
*/
This works fine:
/* Any future change in the testing size will not require
a change here... even if we do a 100pct test
*/

Is there a way to use User Activity Variables to store SQL in Datastage

I am considering using RCP to run a generic DataStage job, but the initial SQL changes each time it's called. Is there a process in which I can use a User Activity Variable to inject SQL from a text file or something so I can reuse the same DataStage job?
I know this Routine can read a file to look up parameters:
Routine = 'ReadFile'
vFileName = Arg1
vArray = ''
vCounter = 0
OPENSEQ vFileName to vFileHandle
Else Call DSLogFatal("Error opening file list: ":vFileName, Routine)
Loop
While READSEQ vLine FROM vFileHandle
   vCounter = vCounter + 1
   vArray = Fields(vLine, ',', 1)
   vArray = Fields(vLine, ',', 2)
   vArray = Fields(vLine, ',', 3)
Repeat
CLOSESEQ vFileHandle
Ans = vArray
Return Ans
But does that mean I just store the SQL in one single line, even if it's long?
Thanks.
Why not just have the SQL within the routine itself and propagate parameters?
I have multiple queries within a single routine that does just that (one for source and one for AfterSQL statement)
This is an example and apologies I'm answering this on my mobile!
InputCol=Trim(pTableName)
If InputCol='Table1' then column='Day'
If InputCol='Table2' then column='Quarter, Day'
SQLCode = ' Select Year, Month, '
SQLCode := column:", Time, "
SQLCode := " to_date(current_timestamp, 'YYYY-MM-DD HH24:MI:SS'), "
SQLCode := \ "This is example text as output" \
SQLCode := "From DATE_TABLE"
crt SQLCode
I've used multiple quoting styles in the example above; when passing the result out to a parameter, make sure the ' and " characters have either been escaped or are displaying correctly.
Again, apologies for the quality but I hope it gives you some ideas!
You can give this a try
As you mentioned, maintain the SQL in a file (again, if the SQL keeps changing, you need to build a logic to automate populating the new SQL).
In the DataStage Sequencer, use an Execute Command activity to open the SQL file,
e.g.: cat /home/bk/query.sql
In the Job Activity which calls your generic job, you should map the command output of your EC activity to a job parameter,
so if the EC activity name is exec_query, then the job parameter will be
exec_query.$CommandOutput
When you run the sequence, your query will flow from
SQL file --> EC activity --> parameter in Job Activity --> DB stage (query parameterised)
Have you thought about invoking a shell script that connects to the database and executes the SQL script from the sequence job? You could use sqlplus in the shell script to connect, read the file with the SQL, and run it. To execute the shell script from the sequence job, use an ExecCommand stage (sh, ./, ...), depending on the interpreter.
Another way to solve this, depending on how much your SQL changes, is to invoke a BASIC routine that handles the parameters and invokes your parallel job.
The main problem I think you could have is the length limit of the variable where you store the parameter.
Tell me which option you choose and I can help you more.

Psycopg2 SQL Statment is working from the query Builder but not from Python

The statement I want to use is the following:
UPDATE table SET colum1 = 1 WHERE colum2 LIKE '%gasse%';
when I use exactly this statement everything containing gasse gets updated but when I do this in python with Psycopg2:
sqlstring = """UPDATE table SET colum1 = 1 WHERE colum2 LIKE '%gasse%';"""
cur.execute(sqlstring)
colum1 does not get updated. Have I done something wrong? Did I not escape something correctly?
It is necessary to commit the transaction that psycopg2 automatically opens before the first command:
connection.commit()
http://initd.org/psycopg/docs/connection.html
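Put together (connection parameters are placeholders; the table and column names mirror the question), that looks roughly like:
import psycopg2

conn = psycopg2.connect('dbname=db user=user')
cur = conn.cursor()
cur.execute("UPDATE table SET colum1 = 1 WHERE colum2 LIKE '%gasse%';")
conn.commit()   # without this, the UPDATE is rolled back when the connection closes
# alternatively, set conn.autocommit = True before executing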