Looking for a good way to load fixed-width data into Postgres tables. I do this in SAS and Python, not Postgres, and I guess there is no native method. The files are a few GB. The one approach I have seen does not work on my files for some reason (possibly memory issues): there you load each line as one large column and then parse it into tables. I can use psycopg2, but because of memory issues I would rather not. Any ideas or tools that work? Does pgloader work well, or are there native methods?
http://www.postgresonline.com/journal/index.php?/archives/157-Import-fixed-width-data-into-PostgreSQL-with-just-PSQL.html
Thanks
There's no convenient built-in method to ingest fixed-width tabular data in PostgreSQL. I suggest using a tool like Pentaho Kettle or Talend Studio to do the data-loading, as they're good at consuming many different file formats. I don't remember if pg_bulkload supports fixed-width, but suspect not.
Alternatively, you can generally write a simple script with something like Python and the psycopg2 module, loading the fixed-width data row by row and sending that to PostgreSQL. psycopg2's support for the COPY command via copy_from makes this vastly more efficient. I didn't find a convenient fixed-width file reader for Python in a quick search, but I'm sure they're out there. You can use whatever language you like anyway - Perl's DBI and DBD::Pg do just as well, and there are millions of fixed-width file reader modules for Perl.
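For example, here's a minimal psycopg2 sketch that streams the file in chunks so a multi-GB file never has to fit in memory. The column layout, table name, file name, and connection string are all made-up assumptions you would replace with your own:

import io
import itertools
import psycopg2

# Hypothetical layout: (column, start, end) offsets for each fixed-width field.
LAYOUT = [("id", 0, 10), ("name", 10, 40), ("amount", 40, 52)]
COLUMNS = [name for name, _, _ in LAYOUT]

def parse(line):
    """Slice one fixed-width line into stripped field values."""
    return [line[start:end].strip() for _, start, end in LAYOUT]

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
with conn, conn.cursor() as cur, open("data.fwf") as src:
    while True:
        chunk = list(itertools.islice(src, 100_000))  # keep memory use bounded
        if not chunk:
            break
        buf = io.StringIO("\n".join("\t".join(parse(line)) for line in chunk) + "\n")
        cur.copy_from(buf, "my_table", columns=COLUMNS)
conn.close()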
The Python pandas library has a function, pandas.read_fwf, which works great.
Data can be read in using Python, then written to a Postgres database.
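A minimal sketch of that route, assuming made-up column widths, names, table name, and connection string; reading in chunks keeps memory use bounded for multi-GB files:

import pandas as pd
from sqlalchemy import create_engine

# Column widths and names are assumptions; adjust them to your file's layout.
widths = [10, 30, 12]
names = ["id", "name", "amount"]

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

# Read and write in chunks so a multi-GB file never sits fully in memory.
for chunk in pd.read_fwf("data.fwf", widths=widths, names=names, chunksize=100_000):
    chunk.to_sql("my_table", engine, if_exists="append", index=False)

For very large files, a COPY-based approach (as in the answer above) will still be faster than to_sql, which issues INSERT statements.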
I am trying to import into pgAdmin a big table with more than 100 columns. Is there any way to import the table without first creating all 100 columns by hand in pgAdmin? That would be a considerably time-consuming task.
You are not importing data into pgAdmin, you are importing it into Postgres, and using pgAdmin to help you in that task. Graphical tools like pgAdmin are, at heart, just convenience wrappers around the actual functionality of the database, and everything they do can be done in other ways.
In the case of a simple task like creating a table, the relevant SQL syntax is well worth learning. It will work in any database tool, even (with some minor changes) on other SQL databases (e.g. MySQL), can be saved in version control, and manipulated with an editor of your choice.
You could even go so far as to write a script in the language of your choice that generates the SQL for you based on some other data (e.g. the headings of the CSV file) - although make sure you don't run that with third-party data without checking the result or taking extreme care with code injection and other security concerns!
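For instance, a rough sketch of such a generator (the file name, table name, and the choice of text as a default column type are assumptions; review its output before running it, for the reasons above):

import csv
import re

def create_table_sql(csv_path, table_name):
    """Generate a CREATE TABLE statement from a CSV file's header row."""
    with open(csv_path, newline="") as f:
        headers = next(csv.reader(f))
    # Sanitize header names so they can be used as plain identifiers.
    cols = [re.sub(r"\W+", "_", h.strip()).lower() for h in headers]
    body = ",\n    ".join(f"{col} text" for col in cols)
    return f"CREATE TABLE {table_name} (\n    {body}\n);"

print(create_table_sql("data.csv", "my_table"))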
The Postgres manual has an introduction to tables and creating them which would be a good place to start.
Is there a convenient, open-source method to generate a SAS XPORT Transport Format (xpt) file from a PostgreSQL database for FDA submission?
I have checked the FDA specifications, available at http://www.fda.gov/downloads/ForIndustry/DataStandards/StudyDataStandards/UCM312964.pdf
These state that 'SAS XPORT transport files can be converted to various other formats using commercially available off the shelf software', but no software packages other than SAS are suggested.
The specifications for a SAS XPORT file are available at http://support.sas.com/techsup/technote/ts140.html
I have checked OpenClinica (which is the EDC software we are using), pgAdmin3, and AM (which can import .xpt files, but I didn't find an export method).
Easy way? Not that I know of. I think one way or another it will take some development work.
My recommendation is to do it as follows:
Write a user-defined function/stored procedure for pulling the data you need for each section.
Write a user-defined function to pull this data from each section and arrange it into an XML file. The XML functions are likely to come in handy for this.
Of course you could also put the XML conversion in an arbitrary front-end. However, in general, you will find that the design above forces you to push everything into set-based logic, which is likely to be more powerful in your case.
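As a rough illustration of that second step (the table name, query, and connection details are assumptions), PostgreSQL's built-in query_to_xml function can do much of the work from a short script:

import psycopg2

conn = psycopg2.connect("dbname=trialdb")  # connection details are assumptions
with conn, conn.cursor() as cur:
    # query_to_xml(query, nulls, tableforest, targetns) is a built-in XML function.
    cur.execute(
        "SELECT query_to_xml('SELECT * FROM demographics', true, false, '')"
    )
    xml_doc = cur.fetchone()[0]

with open("demographics.xml", "w") as out:
    out.write(xml_doc)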
If you don't mind using Python, my XPORT module can write xpt files. https://github.com/selik/xport
If you have trouble using it, write me a note and I'll try to help. https://github.com/selik/xport/issues
In many of my scripts I use SQLite for reporting info, and I first need to upload my big table data (millions of CSV rows). In the past I have found that .import was quicker than inserting line by line (even using transactions).
Nowadays my scripts use a method that makes a system call to sqlite3 db '.import ....'. I wonder if it is possible to call .import from DBD::SQLite, or whether it would be better to keep calling .import via a system call.
P.S.: The reason for wanting to call .import from inside DBD::SQLite is to remove the sqlite3 binary dependency when my software is installed elsewhere.
.import is a SQLite-specific command, so you won't find a DBI method for it which is independent of the database driver; while any given database engine almost certainly has equivalent functionality, each will implement it differently (e.g. SQLite .import vs MySQL LOAD DATA INFILE, &c.)
If you're looking for true engine independence, you'll need to import your data by means of INSERT queries, which can be relied upon in the simplest case to work more or less equivalently everywhere. However, if the difference in execution time is significant enough, it may be worth your while to write an engine-agnostic interface to the import functionality, with a wrapper around each engine's specific import command, and determining from the currently active database driver (or some other method, depending on your code) which wrapper to invoke at runtime.
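The question is about Perl/DBI, but the shape of such a wrapper is the same in any language. Here's a rough Python sketch of the dispatch idea (the function name, paths, and the no-header-row assumption are all made up for illustration):

import csv
import sqlite3
import subprocess

def bulk_import_sqlite(db_path, table, csv_path, use_cli=True):
    """Load a CSV into an SQLite table, preferring the sqlite3 CLI's .import."""
    if use_cli:
        # Fast path: drive the sqlite3 binary, equivalent to shelling out for .import.
        subprocess.run(
            ["sqlite3", db_path],
            input=f".mode csv\n.import {csv_path} {table}\n",
            text=True,
            check=True,
        )
        return
    # Portable fallback: plain INSERTs inside a single transaction.
    conn = sqlite3.connect(db_path)
    with conn, open(csv_path, newline="") as f:
        rows = csv.reader(f)
        first = next(rows)
        placeholders = ", ".join("?" for _ in first)
        sql = f"INSERT INTO {table} VALUES ({placeholders})"
        conn.execute(sql, first)
        conn.executemany(sql, rows)
    conn.close()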
If you are not opposed to "shelling out":
perl -e 'system(qq(sqlite3 foo.db ".import file.dat table")) and die $!'
I am trying to load text data into a postgresql database via COPY FROM. Data is definitely not clean CSV.
The input data isn't always consistent: sometimes there are excess fields (separator is part of a field's content) or there are nulls instead of 0's in integer fields.
The result is that PostgreSQL throws an error and stops loading.
Currently I am trying to massage the data into consistency via Perl.
Is there a better strategy?
Can PostgreSQL be asked to be as tolerant as MySQL or SQLite in that respect?
Thanks
PostgreSQL's COPY FROM isn't designed to handle bodgy data and is quite strict. There's little support for tolerance of dodgy data.
I thought there was little interest in adding any until I saw this proposed patch posted just a few days ago for possible inclusion in PostgreSQL 9.3. The patch has been resoundingly rejected, but shows that there's some interest in the idea; read the thread.
It's sometimes possible to COPY FROM into a staging TEMPORARY table that has all text fields with no constraints. Then you can massage the data using SQL from there. That'll only work if the input is at least well-formed and regular, though, and it doesn't sound like yours is.
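When the input is regular enough for that route, a minimal sketch looks like this (the delimiter, table names, column names, and file path are all assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
with conn, conn.cursor() as cur, open("dirty.txt") as f:
    # Stage everything as unconstrained text first.
    cur.execute("CREATE TEMPORARY TABLE staging (id text, qty text, note text)")
    cur.copy_from(f, "staging", sep="|")
    # Then massage in SQL: empty strings become 0, casts happen here.
    cur.execute("""
        INSERT INTO real_table (id, qty, note)
        SELECT id::int, COALESCE(NULLIF(qty, ''), '0')::int, note
        FROM staging
    """)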
If the data isn't clean, you need to pre-process it with a script in a suitable scripting language.
Have that script:
Connect to PostgreSQL and INSERT rows;
Connect to PostgreSQL and use the scripting language's Pg APIs to COPY rows in; or
Write out clean CSV that you can COPY FROM
Python's csv module can be handy for this. You can use any language you like; Perl, Python, PHP, Java, C, whatever.
If you were enthusiastic you could write it in PL/Perlu or PL/Pythonu, inserting the data as you read it and clean it up. I wouldn't bother.
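For example, here's a rough sketch of the third option above: read the messy rows with Python's csv module, repair them, and write clean CSV that COPY FROM will accept. The delimiter, expected field count, integer-column positions, and file names are all assumptions:

import csv

EXPECTED_FIELDS = 5   # assumed number of columns in the target table
INT_FIELDS = {2, 4}   # assumed positions of integer columns

with open("dirty.txt", newline="") as src, open("clean.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst)
    for row in reader:
        # Fold excess fields (a delimiter inside the data) back into the last column.
        if len(row) > EXPECTED_FIELDS:
            row = row[:EXPECTED_FIELDS - 1] + ["|".join(row[EXPECTED_FIELDS - 1:])]
        # Replace empty integer fields with 0 so COPY doesn't choke.
        row = [("0" if i in INT_FIELDS and v == "" else v) for i, v in enumerate(row)]
        writer.writerow(row)

Then load clean.csv with a plain COPY FROM (or psql's \copy).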
How do I use the COPY statement in PostgreSQL to load data from a text file into a table, where the file uses an escape character as the delimiter?
Is there any other way of loading data from a text file into a PostgreSQL table?
pgloader emulates Oracle's SQL*Loader:
http://pgfoundry.org/projects/pgloader/
pg_bulkload is used to load lots of data into an otherwise offline DB. It's useful for large data warehouses, and it's fast, but somewhat dangerous and quirky:
http://pgbulkload.projects.postgresql.org/
You should use COPY with the DELIMITER 'xx' option. You probably need to play around a little bit to get it right, but the docs give pretty good information about what to do with each option available to the command.
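For example, a minimal psycopg2 sketch, assuming the delimiter is the ASCII ESC character (0x1B) and that the table, file, and connection details are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
with conn, conn.cursor() as cur, open("data.txt") as f:
    # E'\x1b' is the single-byte ESC character used here as the field delimiter.
    cur.copy_expert(
        r"COPY my_table FROM STDIN WITH DELIMITER E'\x1b'",
        f,
    )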