Can COPY FROM tolerantly consume bad CSV? - postgresql

I am trying to load text data into a PostgreSQL database via COPY FROM. The data is definitely not clean CSV.
The input data isn't always consistent: sometimes there are excess fields (separator is part of a field's content) or there are nulls instead of 0's in integer fields.
The result is that PostgreSQL throws an error and stops loading.
Currently I am trying to massage the data into consistency via Perl.
Is there a better strategy?
Can PostgreSQL be asked to be as tolerant as MySQL or SQLite in that respect?
Thanks

PostgreSQL's COPY FROM is quite strict and isn't designed to handle dodgy data. There's little support for tolerating malformed input.
I thought there was little interest in adding any until I saw this proposed patch posted just a few days ago for possible inclusion in PostgreSQL 9.3. The patch has been resoundingly rejected, but shows that there's some interest in the idea; read the thread.
It's sometimes possible to COPY FROM into a staging TEMPORARY table that has all text fields with no constraints. Then you can massage the data using SQL from there. That'll only work if the CSV is at least well-formed and regular, though, and it doesn't sound like yours is.
If the data isn't clean, you need to pre-process it with a script in a suitable scripting language.
Have that script:
Connect to PostgreSQL and INSERT rows;
Connect to PostgreSQL and use the scripting language's Pg APIs to COPY rows in; or
Write out clean CSV that you can COPY FROM
Python's csv module can be handy for this. You can use any language you like: Perl, Python, PHP, Java, C, whatever.
If you were enthusiastic you could write it in PL/Perlu or PL/Pythonu, inserting the data as you read it and clean it up. I wouldn't bother.
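As a rough illustration of the "write out clean CSV" option, here is a minimal Python sketch. It is not a general solution: it assumes the surplus fields always come from an unquoted separator inside one known free-text column, and that empty integer fields should become 0. The function names and the parameters n_cols, text_col and int_cols are made up for the example.

```python
import csv


def clean_rows(lines, n_cols, text_col, int_cols, delimiter=","):
    """Yield lists of exactly n_cols fields from raw, unquoted input lines.

    Assumptions (hypothetical, adjust to your data): surplus fields are caused
    by an unquoted delimiter inside the column at index text_col, and empty
    values in the columns listed in int_cols should be loaded as 0.
    """
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        extra = len(fields) - n_cols
        if extra < 0:
            raise ValueError(f"line {lineno}: only {len(fields)} fields")
        if extra > 0:
            # Glue the surplus pieces back into the free-text column.
            end = text_col + extra + 1
            fields[text_col:end] = [delimiter.join(fields[text_col:end])]
        for i in int_cols:
            if fields[i].strip() == "":
                fields[i] = "0"
        yield fields


def write_clean_csv(lines, out, **layout):
    """Write the cleaned rows to out as properly quoted CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(clean_rows(lines, **layout))
```

The output is regular, quoted CSV, so it should be loadable with COPY ... FROM ... WITH (FORMAT csv). Rows that are too short raise an error rather than being guessed at.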

Related

Is there a way to show everything that was changed in a PostgreSQL database during a transaction?

I often have to execute complex sql scripts in a single transaction on a large PostgreSQL database and I would like to verify everything that was changed during the transaction.
Verifying each single entry on each table "by hand" would take ages.
Dumping the database before and after the script to plain sql and using diff on the dumps isn't really an option since each dump would be about 50G of data.
Is there a way to show all the data that was added, deleted or modified during a single transaction?
What you are looking for is one of the most searched-for topics when it comes to capturing database changes. It is a kind of version control, you could say.
But as far as I know, sadly, there is no built-in approach available in PostgreSQL or MySQL. You can work around that by adding triggers for the operations you care most about.
You can create some backup schemas and tables to capture the rows that are updated, created, or deleted.
In this way you can achieve what you want. I know this process is fully manual, but it is really effective.
If you need to analyze the script's behaviour only sporadically, then the easiest approach would be to change server configuration parameter log_min_duration_statement to 0 and then back to any value it had before the analysis. Then all of the script activity will be written to the instance log.
This approach is not suitable if your storage is not prepared to accommodate this amount of data, or for systems in which you don't want sensitive client data to be written to a plain-text log file.

Importing many columns from a CSV into Postgres

I am trying to import to pgAdmin a big table with more than 100 columns. Is there any way to import the table without creating those 100 columns in a table within the pgAdmin? That would be a considerably time-consuming task.
You are not importing data into pgAdmin, you are importing it into Postgres, and using pgAdmin to help you in that task. Graphical tools like pgAdmin are, at heart, just convenience wrappers around the actual functionality of the database, and everything they do can be done in other ways.
In the case of a simple task like creating a table, the relevant SQL syntax is well worth learning. It will work in any database tool, even (with some minor changes) on other SQL databases (e.g. MySQL), can be saved in version control, and manipulated with an editor of your choice.
You could even go so far as to write a script in the language of your choice that generates the SQL for you based on some other data (e.g. the headings of the CSV file) - although make sure you don't run that with third-party data without checking the result or taking extreme care with code injection and other security concerns!
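For what it's worth, a script like that can be quite short. The following Python sketch reads only the header row of a CSV and emits a CREATE TABLE statement with every column typed as text. The function names are invented for the example, and typing everything as text is an assumption made to keep it simple; you would cast to proper types afterwards. Identifiers are double-quoted, with embedded quotes doubled, to reduce the injection risk mentioned above, but you should still read the generated SQL before running it.

```python
import csv


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(csv_file, table_name):
    """Build a CREATE TABLE statement from the header row of an open CSV file.

    Every column is declared as text (an assumption for this sketch).
    """
    header = next(csv.reader(csv_file))
    columns = ",\n".join(f"    {quote_ident(h.strip())} text" for h in header)
    return f"CREATE TABLE {quote_ident(table_name)} (\n{columns}\n);"
```

Usage would be something like print(create_table_sql(open("data.csv", newline=""), "staging")), where data.csv and staging are placeholder names.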
The Postgres manual has an introduction to tables and creating them which would be a good place to start.

fixed width data into postgres

Looking for a good way to load fixed-width data into Postgres tables. I do this in SAS and Python, not Postgres. I guess there is not a native method. The files are a few GB. The one way I have seen does not work on my file for some reason (possibly memory issues). There you load as one large column and then parse into tables. I can use psycopg2 but because of memory issues would rather not. Any ideas or tools that work? Does pgloader work well, or are there native methods?
http://www.postgresonline.com/journal/index.php?/archives/157-Import-fixed-width-data-into-PostgreSQL-with-just-PSQL.html
Thanks
There's no convenient built-in method to ingest fixed-width tabular data in PostgreSQL. I suggest using a tool like Pentaho Kettle or Talend Studio to do the data-loading, as they're good at consuming many different file formats. I don't remember if pg_bulkload supports fixed-width, but suspect not.
Alternately, you can generally write a simple script with something like Python and the psycopg2 module, loading the fixed-width data row by row and sending that to PostgreSQL. psycopg2's support for the COPY command via copy_from makes this vastly more efficient. I didn't find a convenient fixed-width file reader for Python in a quick search but I'm sure they're out there. You can use whatever language you like anyway - Perl's DBI and DBD::Pg do just as well, and there are millions of fixed-width file reader modules for Perl.
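To make the row-by-row idea concrete, here is a small Python sketch using only the standard library. It streams the fixed-width file one line at a time and writes CSV, so memory use stays flat regardless of file size. The LAYOUT column names and offsets are hypothetical; replace them with your file's real layout.

```python
import csv

# Hypothetical layout: (column name, start offset, end offset), end exclusive.
LAYOUT = [("id", 0, 5), ("name", 5, 15), ("amount", 15, 23)]


def fixed_width_to_csv(src, dst, layout=LAYOUT):
    """Stream fixed-width lines from src to dst as CSV; return the row count."""
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Slice each field out by position and trim the padding.
        writer.writerow([line[start:end].strip() for _, start, end in layout])
        count += 1
    return count
```

The resulting CSV can then be loaded with COPY ... WITH (FORMAT csv), for example from psql, or from Python via psycopg2's copy_expert, which accepts a file-like object. I haven't benchmarked this against pgloader.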
The Python Pandas library has a function pandas.read_fwf which works great.
The data can be read in using Python, then written to the Postgres database.

How can I see the call tree for SQL stored procedures offline (without actually creating them)

I have a huge SQL script which I need to analyse. It would be really helpful if I could find a way to generate a call tree, i.e. to see which procedures are called from a particular procedure. A Perl-based example is here: http://sqlblog.com/blogs/linchi_shea/archive/2009/10/23/find-the-complete-call-tree-for-a-stored-procedure.aspx
But I need a tool that analyses the text file (.sql file), not the procedures stored in the database. For various reasons I will not be able to create the whole set of procedures in the database and use the above-mentioned tool.
Please respond if you have come across any IDE/tool with this feature.
Probably not very helpful, as it violates your request for an "offline", text-based tool that parses the .sql file, but I wanted to throw this Redgate tool out there, as I have used it with great success in the past: Redgate SQL Dependency Tracker. It works very well and does a good job mapping out your objects and all their dependencies (you can define what you want mapped). But it does require a database with all of the existing objects in place to work properly. :(
If you can't find one out there, I guess you could maybe do some script/macro text parsing if all the procedure calls are easily defined and predictable in the file. AutoHotKey is a great general purpose scripting tool/framework, and there are a few sql based scripts out there...just not one exactly like you are looking for that I have seen.
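If you go the text-parsing route, a rough Python sketch could look like the following. It assumes T-SQL style text where procedures are introduced by CREATE PROC or CREATE PROCEDURE and called with a plain EXEC or EXECUTE followed by the procedure name. It is regex-based, not a real SQL parser, so it will miss dynamic SQL, calls of the form EXEC @rc = proc, and anything inside block comments. The function names are made up for the example.

```python
import re
from collections import defaultdict

CREATE_RE = re.compile(
    r"\bCREATE\s+(?:OR\s+ALTER\s+)?PROC(?:EDURE)?\s+([\w.\[\]]+)", re.IGNORECASE
)
EXEC_RE = re.compile(r"\bEXEC(?:UTE)?\s+([\w.\[\]]+)", re.IGNORECASE)


def build_call_map(sql_text):
    """Map each procedure defined in sql_text to the procedures it calls."""
    calls = defaultdict(set)
    current = None
    for line in sql_text.splitlines():
        code = line.split("--", 1)[0]  # drop single-line comments
        created = CREATE_RE.search(code)
        if created:
            current = created.group(1)
            calls[current]  # register the procedure even if it calls nothing
        if current:
            calls[current].update(EXEC_RE.findall(code))
    return {name: sorted(callees) for name, callees in calls.items()}


def format_tree(calls, root, depth=0, path=()):
    """Return the call tree below root as a list of indented lines."""
    recursive = root in path
    lines = ["  " * depth + root + (" (recursive)" if recursive else "")]
    if not recursive:
        for callee in calls.get(root, []):
            lines.extend(format_tree(calls, callee, depth + 1, path + (root,)))
    return lines
```

Something like print("\n".join(format_tree(build_call_map(open("script.sql").read()), "dbo.some_proc"))) would then print the tree, where script.sql and dbo.some_proc are placeholders.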

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I’m using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I’ve attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx) which are effectively the same (except I don’t really understand the union all at the end when I’m trying to upsert) But I run into the same problem with parameters when doing the update using a OLEDb command – which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx) but that just doesn’t seem to work, I get a validation error –
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(this error is repeated for the first two parameters as well – never came across this using the sql connection as it supports named parameters)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I’m using the wrong tool for the job. Is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for.
http://msdn.microsoft.com/en-us/library/ms141715.aspx
The External Columns Out Of Sync: SSIS is Case Sensitive - I encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want. I need to use set-based SQL.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-database queries, so I solved the problem this way:
Data Source from the Production DB to a temp table in the Archive DB
Run a set-based query between the temp table and the archive table
Truncate the temp table
Note that the temp table is not actually a temporary table, but a copy of the archive table's schema used to store the data temporarily.
Took a while, but I got there in the end.
"This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well."
What enterprise ETL solution would you suggest?