How to use pyodbc to migrate tables from MS Access to Postgres? - postgresql

I need to migrate tables from MS Access to Postgres. I'd like to use pyodbc to do this as it allows me to connect to the Access database using python and query the data.
The problem I have is that I'm not exactly sure how to programmatically create a table with the same schema, other than just building a SQL statement with string formatting. pyodbc provides the ability to list all of the fields, field types and field lengths, so I can create a long SQL statement with all of the relevant information. However, how can I do this for a bunch of tables? Would I need to build SQL string statements for each table?
import pyodbc

# Connect to the Access database
access_conn_str = (r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)}; '
                   r'DBQ=C:\Users\bob\access_database.accdb;')
access_conn = pyodbc.connect(access_conn_str)
access_cursor = access_conn.cursor()

# Connect to the Postgres database
postgres_conn_str = ("DRIVER={PostgreSQL Unicode};"
                     "DATABASE=access_database;"
                     "UID=user;"
                     "PWD=password;"
                     "SERVER=localhost;"
                     "PORT=5433;")
postgres_conn = pyodbc.connect(postgres_conn_str)
postgres_cursor = postgres_conn.cursor()

# Collect the field name, type and size for each column of each table
table_dict = {}
row_dict = {}
for row in access_cursor.columns(table='table1'):
    row_dict[row.column_name] = [row.type_name, row.column_size]
table_dict['table1'] = row_dict

for table, values in table_dict.items():
    print(f"Creating table for {table}")
    access_cursor.execute(f'SELECT * FROM {table}')
    result = access_cursor.fetchall()
    postgres_cursor.execute(f'''CREATE TABLE {table} (Do I just put a bunch of string formatting in here?);''')
    postgres_cursor.executemany(f'INSERT INTO {table} (Do I just put a bunch of string formatting) VALUES (string formatting?)', result)
    postgres_conn.commit()
As you can see, with pyodbc I'm not exactly sure how to build the SQL statements. I know I could build a long string by hand, but if I were doing a bunch of different tables with different fields, that would not be realistic. Is there a better, easier way to create the tables and insert rows based off of the schema of the Access database?
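For reference, the kind of string building I have in mind would look roughly like this; the type mapping is only a partial guess at how the Access ODBC type names might translate to Postgres types, and it assumes the column order from cursor.columns() matches SELECT *:

# Assumed, incomplete mapping from Access ODBC type names to Postgres types
access_to_pg = {
    'VARCHAR': 'VARCHAR',
    'LONGCHAR': 'TEXT',
    'INTEGER': 'INTEGER',
    'DOUBLE': 'DOUBLE PRECISION',
    'DATETIME': 'TIMESTAMP',
}

for table, columns in table_dict.items():
    col_defs = []
    for col_name, (type_name, col_size) in columns.items():
        pg_type = access_to_pg.get(type_name, 'TEXT')  # fall back to TEXT if unmapped
        if pg_type == 'VARCHAR' and col_size:
            pg_type = f'VARCHAR({col_size})'
        col_defs.append(f'"{col_name}" {pg_type}')
    create_sql = f'CREATE TABLE {table} ({", ".join(col_defs)});'

    # One "?" placeholder per column, relying on the columns coming back in table order
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f'INSERT INTO {table} VALUES ({placeholders})'

    access_cursor.execute(f'SELECT * FROM {table}')
    rows = access_cursor.fetchall()

    postgres_cursor.execute(create_sql)
    postgres_cursor.executemany(insert_sql, rows)
    postgres_conn.commit()

But getting that mapping right for every Access type across a bunch of tables is exactly the part I'd rather not hand-roll.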

I ultimately ended up using a combination of pyodbc and pywin32. pywin32 is "basically a very thin wrapper of python that allows us to interact with COM objects and automate Windows applications with python" (quoted from second link below).
I was able to programmatically interact with Access and export the tables directly to Postgres with DoCmd.TransferDatabase
https://learn.microsoft.com/en-us/office/vba/api/access.docmd.transferdatabase
https://pbpython.com/windows-com.html
import win32com.client
import pyodbc
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)  # so the logging.info calls below are visible

# access_database_location, pg_user, pg_pwd and pg_port are defined elsewhere
conn_str = (r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)}; '
            rf'DBQ={access_database_location};')
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

# Drive Access itself over COM
a = win32com.client.Dispatch("Access.Application")
a.OpenCurrentDatabase(access_database_location)

# Collect the user table names via pyodbc
table_list = []
for table_info in cursor.tables(tableType='TABLE'):
    table_list.append(table_info.table_name)

# Let Access export each table straight to Postgres through the ODBC driver
for table in table_list:
    logging.info(f"Exporting: {table}")
    acExport = 1
    acTable = 0
    db_name = Path(access_database_location).stem.lower()
    a.DoCmd.TransferDatabase(
        acExport, "ODBC Database",
        "ODBC;DRIVER={PostgreSQL Unicode};"
        f"DATABASE={db_name};UID={pg_user};PWD={pg_pwd};"
        f"SERVER=localhost;PORT={pg_port};",
        acTable, table, f"{table.lower()}_export_from_access")
    logging.info(f"Finished Export of Table: {table}")
    logging.info("Creating empty table in EGDB based off of this")
This approach seems to be working for me. I like how the creation of the table/fields as well as insertion of data is all handled automatically (which was the original problem I was having with pyodbc).
If anyone has better approaches I'm open to suggestions.

Related

Tabpy and Postgres

I'm experimenting with Tabpy in Tableau and writing data back to a database via a user's selection in a dashboard.
My data connections work fine. However, I'm having trouble passing a variable back into the SQL query and getting an error saying the "variable name" doesn't exist in the table that I'm trying to update. Here is my code.
The error states that "dlr_status" does not exist on the table. This is the variable I'm trying to pass back to the query. The database is a Postgres database. Any help is greatly appreciated. I've been researching this for several days and can't find anything.
SCRIPT_STR("
import psycopg2
import numpy as np
from datetime import date
con = psycopg2.connect(
dbname='edw',
host='serverinfo',
port='5439',
user='username',
password='userpassword')
dlr_no_change = _arg1[0]
dlr_status = _arg2[0]
update_trigger = _arg3[0]
sql = '''update schema.table set status = dlr_status where dlr_no = dlr_no_change'''
if update_trigger == True:
cur = con.cursor()
cur.execute(sql)
cur.commit",
ATTR([Dlr No]), ATTR([dlr_status]), ATTR([Update_Now]))
Your commit is missing "()", and it needs to be called on the connection (con.commit()), not the cursor. Or add a con.autocommit = True after creating the connection if you don't want to commit each step.
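A minimal sketch of both options, reusing the names from the question:

# Option 1: commit on the connection after the update has run
if update_trigger:
    cur = con.cursor()
    cur.execute(sql)
    con.commit()  # commit() belongs to the connection, not the cursor

# Option 2: enable autocommit right after connecting and skip commit() entirely
con.autocommit = True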

How to list all tables in sqlite.swift

Apparently I'm missing something here.
Given you want to connect to an unknown sqlite database with multiple tables, is there a way to call something like .tables on
let db = try Connection("path/to/db.sqlite3")
to retrieve a list of tables and then create a reference to it.
I tried (amongst other things)
let statement = try db?.prepare("select * from sqlite_master where type='table'")
let tables = try statement!.run()
which returned a list of bindings that I have no idea how to deal with.

PostgreSQL: Import columns into table, matching key/ID

I have a PostgreSQL database. I had to extend an existing, big table with a few more columns.
Now I need to fill those columns. I thought I could create a .csv file (out of Excel/Calc) which contains the IDs / primary keys of existing rows, plus the data for the new, empty fields. Is it possible to do so? If it is, how?
I remember doing exactly this pretty easily using Microsoft SQL Server Management Studio, but for PostgreSQL I am using pgAdmin (though I am of course willing to switch tools if that would be helpful). I tried using the import function of pgAdmin, which uses the COPY command of PostgreSQL, but it seems like COPY isn't suitable, as it can only create whole new rows.
Edit: I guess I could write a script which loads the CSV and iterates over the rows, using UPDATE. But I don't want to reinvent the wheel.
Edit 2: I've found this question here on SO which provides an answer using a temp table. I guess I will use it, although it's more of a workaround than an actual solution.
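For reference, a rough sketch of that temp-table workaround as a small psycopg2 script (the table, column and file names here are made up for illustration):

import psycopg2

con = psycopg2.connect(dbname='mydb', user='user', password='password', host='localhost')
cur = con.cursor()

# Stage the CSV (primary key plus the new columns) into a temporary table
cur.execute("CREATE TEMP TABLE staging (id integer, extra_col1 text, extra_col2 integer)")
with open('new_columns.csv') as f:
    cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv, DELIMITER ';', HEADER true)", f)

# Fill the new columns of the existing table by joining on the primary key
cur.execute("""
    UPDATE my_table t
    SET extra_col1 = s.extra_col1,
        extra_col2 = s.extra_col2
    FROM staging s
    WHERE t.id = s.id
""")
con.commit()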
PostgreSQL can import data directly from CSV files with COPY statements; as you stated, however, this only works for inserting new rows.
Instead of creating a CSV file you could just generate the necessary SQL UPDATE statements.
Suppose this would be the CSV file
PK;ExtraCol1;ExtraCol2
1;"foo",42
4;"bar",21
Then just produce the following
UPDATE my_table SET ExtraCol1 = 'foo', ExtraCol2 = 42 WHERE PK = 1;
UPDATE my_table SET ExtraCol1 = 'bar', ExtraCol2 = 21 WHERE PK = 4;
You seem to work under Windows, so I don't really know how to accomplish this there (probably with PowerShell), but under Unix you could generate the SQL from a CSV easily with tools like awk or sed. An editor with regular expression support would probably suffice too.
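If you are stuck on Windows without awk or sed, a few lines of Python do the same job (sketch only; it assumes the semicolon-separated file shown above and the hypothetical table name my_table, and it does no SQL escaping, so only use it on trusted data):

import csv

# Emit one UPDATE statement per CSV row
with open('extra_columns.csv', newline='') as f:
    for row in csv.DictReader(f, delimiter=';'):
        print(f"UPDATE my_table SET ExtraCol1 = '{row['ExtraCol1']}', "
              f"ExtraCol2 = {row['ExtraCol2']} WHERE PK = {row['PK']};")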

Running cleaning/validation code before committing in sqlalchemy

I'm totally new to PostgreSQL and SQLAlchemy, and I'm trying to figure out how to run validation/cleaning code on an SQLAlchemy model before it is committed to the database. The idea is to ensure data consistency beyond the standard type enforcement that comes built into SQL databases. For example, if I have a User model built on SQLAlchemy's models,
class User(db.Model):
    ...
    email = db.Column(db.String())
    zipCode = db.Column(db.String())
    lat = db.Column(db.Float())
    lng = db.Column(db.Float())
    ...
Before committing this document, I want to:
trim any leading & trailing spaces off the email field
ensure that the zip code is a 5-digit string of numbers (I'll define a regex for this)
automatically look up the corresponding latitude/longitude of the zip code and save those in the lat & lng fields.
other things that a database schema can't enforce
Does SQLAlchemy provide an easy way to provide Python code that is guaranteed to run before committing to do arbitrary tasks like this?
I found the easiest approach is to hook onto the update and insert events: http://docs.sqlalchemy.org/en/latest/orm/events.html
from sqlalchemy import event

def my_before_insert_listener(mapper, connection, target):
    target.email = target.email.strip()  # Python strings use strip(), not trim()
    # All the other stuff

# associate the listener function with User,
# to execute during the "before_insert" hook
event.listen(User, 'before_insert', my_before_insert_listener)
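Since updated rows should be cleaned as well, the same listener can be attached to the "before_update" hook:

# Run the same cleanup when existing rows are updated
event.listen(User, 'before_update', my_before_insert_listener)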
You can also create custom SQLAlchemy types that do this sort of thing.
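For example, a minimal sketch of a custom column type that strips whitespace before a value is bound, using SQLAlchemy's TypeDecorator:

from sqlalchemy import types

class TrimmedString(types.TypeDecorator):
    """A String column type that strips leading/trailing whitespace on the way in."""
    impl = types.String

    def process_bind_param(self, value, dialect):
        # Called for every value about to be sent to the database
        return value.strip() if value is not None else value

# In the model, the column would then be declared as:
# email = db.Column(TrimmedString())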

Using Script Task to create ADO NET (ODBC) Data Flow Source

I need some help with an SSIS Script Task (SQL 2008 R2) that dynamically creates a package. I am refining a package that copies data from a Sage Timberline (now rebranded as Sage 300) Pervasive SQL environment to a SQL Server data warehouse. I can create a package that opens the connection to Timberline and copies the data to a table in SQL Server. The problem is that for each company in Timberline and each table in SQL, I need to create a separate data flow task. Given the three Timberline company folders and the number of tables in each folder, this would take a lot of time to create and be cumbersome to maintain and troubleshoot.
I am trying to create a package that uses a Foreach Loop to build a package with an ADO NET (ODBC) source (Timberline) and an OLE DB destination (SQL Server), dynamically handling the column mapping. I found code here that almost does what I need.
I tested this code and it works great using OLE DB SQL sources and destinations. What makes this script work is that it dynamically handles the column mapping. So, if you placed it into a Foreach Loop over the 100 or so tables, each loop iteration could dynamically create the data flow, map the columns, and then execute the new package.
My problem is that I can only connect to Timberline using ODBC. So, I need to modify the script to create the source connection with ADO NET (ODBC) instead of OLE DB. I'm having a lot of trouble trying to figure this out. Could someone please help me out with this?
Here are the couple of other things I tried first, besides this approach:
Solution: Set up a linked server to Timberline Pervasive SQL
Problem: SQL Server is 64-bit and the Timberline driver is 32-bit. Using a linked server returns an architecture mismatch error. I called Sage and they said they have no plans to release a 64-bit driver.
Solution: Use one of the SQL transfer tasks
Problem: Only works with SQL Server databases, and this source is a Pervasive SQL database
Solution: Use an "INSERT … INTO …" type script
Problem: This requires a linked server. See the problem above
Here’s the section of the original VB .NET code I need help with:
'To Create a package named [Sample Package]
Dim package As New Package()
package.Name = "Sample Package"
package.PackageType = DTSPackageType.DTSDesigner100
package.VersionBuild = 1
'To add Connection Manager to the package
'For source database (OLTP)
Dim OLTP As ConnectionManager = package.Connections.Add("OLEDB")
OLTP.ConnectionString = "Data Source=.;Initial Catalog=OLTP;Provider=SQLNCLI10;Integrated Security=SSPI;Auto Translate=False;"
OLTP.Name = "LocalHost.OLTP"
'To add Load Employee Dim to the package [Data Flow Task]
Dim dataFlowTaskHost As TaskHost = DirectCast(package.Executables.Add("SSIS.Pipeline.2"), TaskHost)
dataFlowTaskHost.Name = "Load Employee Dim"
dataFlowTaskHost.FailPackageOnFailure = True
dataFlowTaskHost.FailParentOnFailure = True
dataFlowTaskHost.DelayValidation = False
dataFlowTaskHost.Description = "Data Flow Task"
'-----------Data Flow Inner component starts----------------
Dim dataFlowTask As MainPipe = TryCast(dataFlowTaskHost.InnerObject, MainPipe)
' Source OLE DB connection manager to the package.
Dim SconMgr As ConnectionManager = package.Connections("LocalHost.OLTP")
' Create and configure an OLE DB source component.
Dim source As IDTSComponentMetaData100 = dataFlowTask.ComponentMetaDataCollection.[New]()
source.ComponentClassID = "DTSAdapter.OLEDBSource.2"
' Create the design-time instance of the source.
Dim srcDesignTime As CManagedComponentWrapper = source.Instantiate()
' The ProvideComponentProperties method creates a default output.
srcDesignTime.ProvideComponentProperties()
source.Name = "Employee Dim from OLTP"
' Assign the connection manager.
source.RuntimeConnectionCollection(0).ConnectionManagerID = SconMgr.ID
source.RuntimeConnectionCollection(0).ConnectionManager = DtsConvert.GetExtendedInterface(SconMgr)
' Set the custom properties of the source.
srcDesignTime.SetComponentProperty("AccessMode", 0)
' Mode 0 : OpenRowset / Table - View
srcDesignTime.SetComponentProperty("OpenRowset", "[dbo].[Employee_Dim]")
' Connect to the data source, and then update the metadata for the source.
srcDesignTime.AcquireConnections(Nothing)
srcDesignTime.ReinitializeMetaData()
srcDesignTime.ReleaseConnections()
Thanks in advance!
The C# code here is what you need if you need a Derived Column transform between the Source and Destination...
http://bifuture.blogspot.com/2011/01/ssis-adding-derived-column-to-ssis.html
To get the Source & Destination connections working, there is some secret sauce here to get things working between COM and .Net...
http://blogs.msdn.com/b/mattm/archive/2008/12/30/api-sample-ado-net-source.aspx
There is a similar page showing what to do for OleDB connections too.
Creating the source tables is easy. The available ODBC metadata collections should be retrieved with GetSchema("MetaDataCollections"). This will return a list of the schema collections available for that particular ODBC driver.
Next, you'll want to see the data types returned from GetSchema("DataTypes"), so you can correctly interpret the data types for each column retrieved from GetSchema("Columns") to make your SQL Server create table script (which I'm assuming you've done).
To at least figure out which tables have primary keys, you'll need to loop over each table returned from GetSchema("Tables") in order to work with GetSchema("Indexes"). There's a bug that requires you to query the indexes one table at a time; it is easy to google this. Create a string restrictions array with the table name as the third element and pass it in: GetSchema("Indexes", restrictionsArray).
What I did was get the Tables and Columns collections into object variables in my parent SSIS package. Because Timberline is so fast (not), it seemed more efficient to pull all the columns down and filter them locally, which I do to create the tables in SQL Server, if necessary.
Once that is done, use the local copy of Tables again to manipulate an SSIS package in a Script Task in "design mode" (change the source and destination target tables, and redo the column mappings), and execute the now-in-memory SSIS package.
For me it took a while to figure out. Both of the above URLs were required. I found and copied the .Net 2.0 Dts.PipelineWrap and Dts.RuntimeWrap .dlls to the Microsoft.Net\FrameworkV2.0xxxxx folder, then referenced these in each script task wanting to use them, before setting up my "using DtsPW = Microsoft.SqlServer.Dts.Pipeline.Wrapper", etc.
Of note, because Timberline is 32-bit ODBC, I think it's necessary to build the SSIS package to use "X86", and target the script tasks to the .Net 2.0 framework.
I used the Derived Column code because I needed to copy multiple Timberline DBs into one SQL Server DB. Derived Column adds a "CompanyID" value to the output pipeline to SQL Server.
In the end, map the Destination's Virtual Input columns to its External Metadata columns, based off of the pipeline the Destination is attached to:
foreach (DtsPW.IDTSVirtualInputColumn100 vColumn in destVirtInput.VirtualInputColumnCollection)
{
    var vCol = destInst.SetUsageType(destInput.ID, destVirtInput, vColumn.LineageID, DtsPW.DTSUsageType.UT_READWRITE);
    destInst.MapInputColumn(destInput.ID, vCol.ID, destInput.ExternalMetadataColumnCollection[vColumn.Name].ID);
}
Anyways, that code will make more sense in the context of the bifuture.blogspot.com page.
The EzApi library could help with this too, but the AdoNet connection source for it is coded as a virtual class, so you'd need to implement specific classes to use it. My C# kungfu is not strong enough for that in the time I have...
Also, CozyRoc sells a toolset with custom SSIS controls (data flow Source and Destination controls...) that looks like it does this on the fly input-to-output column mapping as well.
My package seems to work well enough now... Oh, and one more thing: I did not have luck trying to use DSN-less ODBC connections to Timberline; I had to use just Dsn=dsnname;Uid=user;Pwd=pwd;
SSIS packages running in 64-bit land cannot see 32-bit DSNs on a 64-bit OS, it seems... at least, it didn't work for me (win7-64, 32-bit Text ODBC DSN).