How to check C/C++ code for wrong SQL queries? - postgresql

I have worked on a Python application that queries a PostgreSQL DB using SQLAlchemy. My DB has a large number of tables, many of which get modified every release. Whenever I make a change to a table (say, renaming a column), I run mypy to find the places in the code that reference old column names, and I fix them all. I now have to write some code in C/C++, and I have used libpq. The problem is that I have to manually scan the code to find places where changes are required whenever I make table changes. This is error-prone. Is there a better way to ensure that the C/C++ code is in sync with the DB schema?

Related

Importing many columns from a CSV into Postgres

I am trying to import to pgAdmin a big table with more than 100 columns. Is there any way to import the table without creating those 100 columns in a table within the pgAdmin? That would be a considerably time-consuming task.
You are not importing data into pgAdmin, you are importing it into Postgres, and using pgAdmin to help you in that task. Graphical tools like pgAdmin are, at heart, just convenience wrappers around the actual functionality of the database, and everything they do can be done in other ways.
In the case of a simple task like creating a table, the relevant SQL syntax is well worth learning. It will work in any database tool, even (with some minor changes) on other SQL databases (e.g. MySQL), can be saved in version control, and manipulated with an editor of your choice.
You could even go so far as to write a script in the language of your choice that generates the SQL for you based on some other data (e.g. the headings of the CSV file) - although make sure you don't run that with third-party data without checking the result or taking extreme care with code injection and other security concerns!
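As a rough illustration of the script idea above, here is a minimal Python sketch that builds a CREATE TABLE statement from the header row of a CSV file. The function names, the table name "people", and the choice to make every column text are assumptions made for this example, not anything the answer prescribes. Column names are quoted as identifiers, but the column type is inserted as-is, so it must come from trusted input.

```python
import csv
import io


def quote_ident(name: str) -> str:
    """Quote an SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table: str, csv_file, column_type: str = "text") -> str:
    """Build a CREATE TABLE statement from the first (header) row of a CSV file.

    Every column gets the same type (an assumption of this sketch); adjust the
    types by hand or with ALTER TABLE once the data is loaded.
    """
    header = next(csv.reader(csv_file))
    columns = [name.strip() for name in header]
    if any(not name for name in columns):
        raise ValueError("CSV header contains an empty column name")
    if len(set(columns)) != len(columns):
        raise ValueError("CSV header contains duplicate column names")
    body = ",\n".join(f"    {quote_ident(name)} {column_type}" for name in columns)
    return f"CREATE TABLE {quote_ident(table)} (\n{body}\n);"


# Hypothetical sample data standing in for the real 100-column file.
sample = io.StringIO("id,first name,signup_date\n1,Ann,2020-01-01\n")
ddl = create_table_sql("people", sample)
print(ddl)
```

Review the generated statement before running it. Once the table exists, the data itself can be loaded with, for example, psql's \copy people FROM 'file.csv' WITH (FORMAT csv, HEADER true).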
The Postgres manual has an introduction to tables and creating them which would be a good place to start.

SQLite to PostgreSQL data-only transfer (to maintain alembic functionality)

There are a few questions and answers already on PostgreSQL import (as well as the specific SQLite->PostgreSQL situation). This question is about a specific corner-case.
Background
I have an existing, in-production web-app written in python (pyramid) and using alembic for easy schema migration. Due to the database creaking with unexpectedly high write-load (probably due to the convoluted nature of my own code), I've decided to migrate to PostgreSQL.
Data migration
There are a few recommendations on data migration. The simplest one involved using
sqlite3 my.db .dump > sqlitedumpfile.sql
and then importing it with
psql -d newpostgresdb < sqlitedumpfile.sql
This required a bit of editing of sqlitedumpfile. In particular, removing some incompatible operations, changing values (sqlite represents booleans as 0/1) etc. It ended up being too complicated to do programmatically for my data, and too much work to handle manually (some tables had 20k rows or so).
A good tool for data migration which I eventually settled on was pgloader, which 'worked' immediately. However, as is typical for data migration of this sort, this exposed various data inconsistencies in my database which I had to solve at source before doing the migration (in particular, removing foreign keys to non-unique columns which seemed a good idea at the time for convenient joins and removing orphan rows which relied on rows in other tables which had been deleted). After these were solved, I could just do
pgloader my.db postgresql:///newpostgresdb
And get all my data appropriately.
The problem?
pgloader worked really well for data but not so well for the table structure itself. This resulted in three problems:-
1. I had to create a new alembic revision with a ton of changes (mostly datatype related, but also some related to problem 2).
2. Constraint/index names were unreliable (unique numeric names generated). There's actually an option to disable this; it was a problem because I needed a reliable upgrade path which was replicable in production without me having to manually tweak the alembic code.
3. Sequences/autoincrement just failed for most primary keys. This broke my webapp as I was not able to add new rows for some (not all) databases.
In contrast, re-creating a blank database using alembic to maintain the schema works well without changing any of my webapp's code. However, pgloader defaults to overwriting existing tables, so this would leave me nowhere, as the data is what really needs migrating.
How do I get proper data migration using a schema I've already defined (and which works)?
What eventually worked was, in summary:-
1. Create the appropriate database structure in postgresql:///newpostgresdb (I just used alembic upgrade head for this).
2. Use pgloader to move data over from sqlite to a different database in postgresql. As mentioned in the question, some data inconsistencies need to be solved before this step, but that's not relevant to this question itself.
   createdb tempdb
   pgloader my.db postgresql:///tempdb
3. Dump the data in tempdb using pg_dump:
   pg_dump -a -d tempdb > dumped_postgres_database
4. Edit the resulting dump to accomplish the following:
   - Add SET session_replication_role = replica; because some of my rows refer circularly to other rows in the same table.
   - Delete the alembic_version table, as we're restarting a new branch for alembic.
   - Regenerate any sequences, with the equivalent of SELECT pg_catalog.setval('"table_colname_seq"', (select max(colname) from table));
5. Finally, use psql to load the data into your actual database:
   psql -d newpostgresdb < dumped_postgres_database
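The sequence-regeneration step can be scripted rather than typed by hand. The following is a minimal Python sketch that emits one setval statement per serial column. It assumes the sequences follow the table_colname_seq naming shown above; the list of tables and columns is hypothetical. Unlike the one-liner in the answer, this variant wraps the maximum in COALESCE and passes false as the third argument, so that an empty table restarts its sequence at 1 instead of failing on a NULL value.

```python
def setval_statement(table: str, column: str) -> str:
    """Return SQL that moves a serial column's sequence past the current maximum.

    Assumes the sequence is named <table>_<column>_seq. With is_called set to
    false, the next nextval() call returns exactly the value given here.
    """
    sequence = f"{table}_{column}_seq"
    return (
        f"SELECT pg_catalog.setval('\"{sequence}\"', "
        f"COALESCE((SELECT max(\"{column}\") + 1 FROM \"{table}\"), 1), false);"
    )


# Hypothetical (table, column) pairs; replace with the real serial columns.
serial_columns = [("users", "id"), ("orders", "id")]
statements = [setval_statement(table, column) for table, column in serial_columns]
print("\n".join(statements))
```

The printed statements can be appended to the edited dump, or run through psql after the data has been loaded.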

SSIS or TSQL for SQL/MySQL table comparison

I am new to SSIS and am after some assistance in creating an SSIS package to do a specific task. My data is stored remotely in a MySQL database and is downloaded to a SQL Server 2014 database. I want to create a package where I can enter two dates that are compared against the created/modified date of each record on a number of tables. This would give me a snapshot comparing the MySQL data to the SQL Server data, so that I can see whether any rows are missing from my local SQL Server database or need to be updated. Some tables have no dates, so for those I just want a count of the records that are missing, if any, between the two. If this is better achieved through T-SQL, I am happy to hear other suggestions or about sites where something similar has been done.
In relation to your query, Tab:
"Hi Tab, at the moment our master data is stored in a MySQL database, and the data was downloaded to a SQL Server database as a one-off. I currently have an SSIS package that uses the MAX ID, which can be found on most of the tables, to work out which records are new and just downloads or updates them. What I want to do is run separate checks on the tables to make sure that nothing has been missed during the download and that everything is in sync. In an ideal world I would like to pass a date range, say a calendar week, in to an SSIS package or T-SQL stored procedure, which would then check for any differences between the remote MySQL database tables and the local SQL Server tables. It does not currently have to do anything but identify issues; correcting them may come later, or changes would need to be made to the existing sync package. Hope this makes more sense."
Thanks P
To do this, you need to implement a Type 1 Slowly Changing Dimension type data flow in SSIS. There are a number of ways to do this, including a built in transformation aptly called the Slowly Changing Dimension transformation. Whilst this is easy to set up, it is a pain to maintain and it runs horrendously slowly.
There are numerous ways to set this up using other transformations or even SQL merge statements which are detailed here: https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/
I would recommend that you use Lookup transformations, as they perform better than the Slowly Changing Dimension transformation and offer better diagnostics and error handling than the better-performing SQL merge statement.
Before you do this you will need to add a Checksum or Hashbytes column to your SQL data for ease of comparison with the incoming MySQL data.
In short, calculate some sort of repeatable checksum as the data is downloaded into your SQL Server, then use this in an SSIS Lookup, matching on the row key, to check for changes. Where the checksum value is different for the same row it needs updating and where there is no matching row key in your SQL Data you need to insert the new row.
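The comparison logic described above can be sketched outside SSIS. The following minimal Python example hashes the compared columns of each row and classifies incoming rows as inserts or updates by row key. The column names, sample rows, and function names are hypothetical, and SHA-256 merely stands in for whatever checksum (e.g. HASHBYTES) is stored on the SQL Server side.

```python
import hashlib


def row_checksum(row: dict, columns: list) -> str:
    """Hash the selected column values in a fixed order.

    A unit-separator character between values keeps ("ab", "c") distinct from
    ("a", "bc"); None is mapped to a NUL marker so it differs from "".
    """
    parts = ["\x00" if row[col] is None else str(row[col]) for col in columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def classify(source_rows, target_checksums, key, columns):
    """Split source rows into keys to insert and keys to update."""
    inserts, updates = [], []
    for row in source_rows:
        row_key = row[key]
        if row_key not in target_checksums:
            inserts.append(row_key)      # no matching row key locally
        elif target_checksums[row_key] != row_checksum(row, columns):
            updates.append(row_key)      # same key, different content
    return inserts, updates


columns = ["name", "price"]
# Hypothetical local (SQL Server) rows with their stored checksums.
local_rows = [
    {"id": 1, "name": "bolt", "price": "0.10"},
    {"id": 2, "name": "nut", "price": "0.05"},
]
target_checksums = {r["id"]: row_checksum(r, columns) for r in local_rows}

# Hypothetical incoming (MySQL) rows: id 2 has changed, id 3 is new.
remote_rows = [
    {"id": 1, "name": "bolt", "price": "0.10"},
    {"id": 2, "name": "nut", "price": "0.07"},
    {"id": 3, "name": "washer", "price": "0.02"},
]
inserts, updates = classify(remote_rows, target_checksums, "id", columns)
print("insert:", inserts, "update:", updates)
```

The checksum is only repeatable if both sides format values identically (dates, decimals, trailing spaces), so normalise values before hashing.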

How can I import just data *and indexes* into Oracle 11?

I am importing into an Oracle 11g database using the original imp tool. Usually I do this with the parameter DATA_ONLY=y because I am only concerned with data-related errors.
However, I am now investigating some indexing-related issues, so I want to import indexes as well as data, but no other metadata. I've looked at imp help=y and the documentation linked above, but I can't figure out what combination of options, or what sequence of imp calls, would achieve this.
Any ideas? (Parameters specific to 11g answers are fine as long as they would work on a 10g dmp file too.)
The simplest way to build the indexes from a dump file (from exp; not sure why you aren't using data pump and expdp/impdp if you're on 11g, but hopefully you're moving data from 9i or something) is with the INDEXFILE parameter.
Use that to create a .sql file with all the index DDL. (It has all the table DDL too, but commented out.) You can then make changes if you need to. Then run it as a normal script from SQL*Plus, and it will execute the DDL and build all the indexes. There isn't an imp call to only build the indexes; you need to do it in those two steps.
It won't update or recreate any indexes you already have, so if an index definition has changed then it will have no effect; it will just complain that the index already exists. You can drop existing indexes before running the script if that's the case.
Generally you'd run the INDEXFILE call as a separate step after the DATA_ONLY call, because it's usually faster to build the indexes when all the data is present than it is to import the data with the indexes in place - because of the overhead of updating the indexes for each row of data. So, imp DATA_ONLY=y, then imp INDEXFILE=ind.sql, then sqlplus user/pass < ind.sql.

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I’m using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I’ve attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except that I don’t really understand the union all at the end when I’m trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn’t seem to work; I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(this error is repeated for the first two parameters as well – never came across this using the sql connection as it supports named parameters)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I’m using the wrong tool for the job. Is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for:
http://msdn.microsoft.com/en-us/library/ms141715.aspx
As for the external columns being out of sync: SSIS is case sensitive. I encountered this issue multiple times, and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want. I need to use set based sql.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-database queries, so I solved the problem this way:
1. Data Source from Production DB to a temp Archive DB table
2. Run set-based query between temp table and archive table
3. Truncate temp table
Note that the temp table is not actually a temporary table, but a copy of the archive table schema used to store data temporarily.
Took a while, but I got there in the end.
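The staging-table pattern above can be illustrated with a small, self-contained Python sketch. It uses an in-memory SQLite database purely as a runnable stand-in for Postgres, and the table and column names (archive, staging, id, amount) are hypothetical. The two set-based statements, an UPDATE of rows that already exist followed by an INSERT of rows that do not, are one common way to write such an upsert; the answer does not say which query was actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE archive (id INTEGER PRIMARY KEY, amount INTEGER);
    -- "staging" is a permanent table with the same schema as archive.
    CREATE TABLE staging (id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO archive VALUES (1, 10), (2, 20);
""")


def load_batch(conn, rows):
    """Load rows into staging, merge them into archive, then empty staging."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany("INSERT INTO staging (id, amount) VALUES (?, ?)", rows)
        # Set-based update of rows that already exist in the archive.
        conn.execute("""
            UPDATE archive
            SET amount = (SELECT s.amount FROM staging s WHERE s.id = archive.id)
            WHERE EXISTS (SELECT 1 FROM staging s WHERE s.id = archive.id)
        """)
        # Set-based insert of rows that are new to the archive.
        conn.execute("""
            INSERT INTO archive (id, amount)
            SELECT s.id, s.amount FROM staging s
            WHERE NOT EXISTS (SELECT 1 FROM archive a WHERE a.id = s.id)
        """)
        # SQLite has no TRUNCATE; in Postgres this would be TRUNCATE staging.
        conn.execute("DELETE FROM staging")


load_batch(conn, [(2, 25), (3, 30)])
result = conn.execute("SELECT id, amount FROM archive ORDER BY id").fetchall()
staged = conn.execute("SELECT count(*) FROM staging").fetchone()[0]
print(result, staged)
```

In SSIS the first step would be the data flow into the staging table, and the merge and truncate would be Execute SQL tasks running against the archive database.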
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
What enterprise ETL solution would you suggest?