I am importing into an Oracle 11g database using the original imp tool. Usually I do this with the parameter DATA_ONLY=y because I am only concerned with data-related errors.
However, I am now investigating some indexing-related issues, so I want to import indexes as well as data, but no other metadata. I've looked at imp help=y and the documentation linked above, but I can't figure out what combination of options, or what sequence of imp calls, would achieve this.
Any ideas? (Parameters specific to 11g answers are fine as long as they would work on a 10g dmp file too.)
The simplest way to build the indexes from a dump file (from exp; not sure why you aren't using data pump and expdp/impdp if you're on 11g, but hopefully you're moving data from 9i or something) is with the INDEXFILE parameter.
Use that to create a .sql file with all the index DDL. (It has all the table DDL too, but commented out). You can then make changes if you need to. Then run it as a normal script from SQL*Plus, and it will execute the DDL and build all the indexes. There isn't an imp call to only build the indexes, you need to do it in those two steps.
It won't update or recreate any indexes you already have, so if an index definition has changed then it will have no effect - it will just complain that the index already exists. You can drop existing indexes before running the script if that's the case.
Generally you'd run the INDEXFILE call as a separate step after the DATA_ONLY call, because it's usually faster to build the indexes when all the data is present than it is to import the data with the indexes in place - because of the overhead of updating the indexes for each row of data. So, imp DATA_ONLY=y, then imp INDEXFILE=ind.sql, then sqlplus user/pass < ind.sql.
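The three steps above can be sketched as a small shell function. This is only a sketch under assumptions: the connect string and dump file name are placeholders supplied by the caller, FULL=y is assumed as the import mode (FROMUSER/TOUSER or TABLES may suit your dump better), and DATA_ONLY needs the 11g imp client. Review ind.sql before running it, as noted above.

```shell
# Sketch: load rows first, then extract and run the index DDL.
# Arguments (both placeholders): $1 = connect string, $2 = dump file.
import_data_then_indexes() {
    conn=$1
    dump=$2

    # Step 1: import rows only into the existing tables.
    imp "$conn" FILE="$dump" FULL=y DATA_ONLY=y || return 1

    # Step 2: write the index DDL to ind.sql; nothing is imported in this call.
    imp "$conn" FILE="$dump" FULL=y INDEXFILE=ind.sql || return 1

    # Step 3: run the generated script; SQL*Plus exits when stdin reaches EOF.
    sqlplus "$conn" < ind.sql
}

# Intended use (not run here; credentials and file name are hypothetical):
#   import_data_then_indexes user/pass exp.dmp
```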
Related
I have worked on a Python application that queries a PostgreSQL DB using SQLAlchemy. My DB has a large number of tables, many of which get modified every release. Whenever I make a change to a table (say, renaming a column), I run mypy to find places in the code that reference old column names and fix them all. I now have to write some code in C/C++, for which I have used libpq. The problem is that I have to manually scan the code to find places where changes are required whenever I make table changes. This is error prone. Is there a better way to ensure that the C/C++ code is in sync with the DB schema?
We have a system which stores data in a postgres database. In some cases, the size of the database has grown to several GBs.
When this system is upgraded, the data in the said database is backed up, and finally it's restored in the database. Owing to the huge amounts of data, the indexing takes a long time to complete (~30 minutes) during restoration, thereby delaying the upgrade process.
Is there a way where the data copy and indexing can be split into two steps, where the data is copied first to complete the upgrade, followed by indexing which can be done at a later time in the background?
Thanks!
There's no built-in way to do it with pg_dump and pg_restore. But pg_restore's -j option helps a lot.
There is CREATE INDEX CONCURRENTLY. But pg_restore doesn't use it.
It would be quite nice to be able to restore everything except secondary indexes not depended on by FK constraints. Then restore those as a separate phase using CREATE INDEX CONCURRENTLY. But no such support currently exists, you'd have to write it yourself.
You can, however, filter the table-of-contents used by pg_restore, so you could possibly do some hacky scripting to do the needed work.
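A rough sketch of that table-of-contents filtering follows. It assumes a custom-format archive and simply defers every entry of type INDEX; it does not check whether a foreign key depends on a unique index created outside a constraint, so the first phase could fail on such schemas. The file and database names in the comments are hypothetical, and the second phase still issues plain CREATE INDEX rather than CREATE INDEX CONCURRENTLY.

```shell
# Split a pg_restore TOC listing into "everything but indexes" and "indexes only".
# $1 = listing produced by `pg_restore -l`, $2 and $3 = output list files.
split_toc() {
    # TOC entries look like: "dumpId; catalogOid oid TYPE schema name owner".
    # Comment lines start with ";" and are kept in the first list.
    grep -Ev '^[0-9]+; [0-9]+ [0-9]+ INDEX ' "$1" > "$2"
    grep -E  '^[0-9]+; [0-9]+ [0-9]+ INDEX ' "$1" > "$3"
}

# Intended use (not run here; names are hypothetical):
#   pg_restore -l db.dump > all.list
#   split_toc all.list no_index.list index.list
#   pg_restore -j 4 -L no_index.list -d newdb db.dump   # finish the upgrade
#   pg_restore -L index.list -d newdb db.dump           # build indexes later
```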
pg_dump has an option that lets you separate the data from the index creation when taking the dump.
Here pre-data refers to the schema, and post-data refers to indexes and triggers.
From the docs,
--section=sectionname
Only dump the named section. The section name can be pre-data, data, or post-data. This option can be specified more than once to select multiple sections. The default is to dump all sections.
The data section contains actual table data, large-object contents, and sequence values. Post-data items include definitions of indexes, triggers, rules, and constraints other than validated check constraints. Pre-data items include all other data definition items.
Maybe this will help :)
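As a minimal sketch of how --section could be used for the two-step restore asked about, the function below writes two plain-SQL dumps. The output file names are assumptions. Keep in mind that post-data also holds primary keys and foreign keys, so those constraints are missing until the second file has been applied.

```shell
# Sketch: dump in two parts so that index builds can be deferred.
# $1 = name of the database to dump (supplied by the caller).
dump_in_sections() {
    db=$1
    # Schema plus table data: enough to load the tables and their rows.
    pg_dump --section=pre-data --section=data -f "${db}_data.sql" "$db" || return 1
    # Indexes, triggers and the remaining constraints: can be applied later.
    pg_dump --section=post-data -f "${db}_post.sql" "$db"
}

# Intended use (not run here; database names are hypothetical):
#   dump_in_sections mydb
#   psql -d newdb -f mydb_data.sql    # during the upgrade
#   psql -d newdb -f mydb_post.sql    # afterwards, in the background
```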
There are a few questions and answers already on PostgreSQL import (as well as the specific SQLite->PostgreSQL situation). This question is about a specific corner-case.
Background
I have an existing, in-production web-app written in python (pyramid) and using alembic for easy schema migration. Due to the database creaking with unexpectedly high write-load (probably due to the convoluted nature of my own code), I've decided to migrate to PostgreSQL.
Data migration
There are a few recommendations on data migration. The simplest one involved using
sqlite3 my.db .dump > sqlitedumpfile.sql
and then importing it with
psql -d newpostgresdb < sqlitedumpfile.sql
This required a bit of editing of sqlitedumpfile. In particular, removing some incompatible operations, changing values (sqlite represents booleans as 0/1) etc. It ended up being too complicated to do programmatically for my data, and too much work to handle manually (some tables had 20k rows or so).
A good tool for data migration which I eventually settled on was pgloader, which 'worked' immediately. However, as is typical for data migration of this sort, this exposed various data inconsistencies in my database which I had to solve at source before doing the migration (in particular, removing foreign keys to non-unique columns which seemed a good idea at the time for convenient joins and removing orphan rows which relied on rows in other tables which had been deleted). After these were solved, I could just do
pgloader my.db postgresql:///newpostgresdb
And get all my data appropriately.
The problem?
pgloader worked really well for data but not so well for the table structure itself. This resulted in three problems:-
1. I had to create a new alembic revision with a ton of changes (mostly datatype related, but also some related to problem 2).
2. Constraint/index names were unreliable (unique numeric names generated). There's actually an option to disable this, and this was a problem because I needed a reliable upgrade path which was replicable in production without me having to manually tweak the alembic code.
3. Sequences/autoincrement just failed for most primary keys. This broke my webapp as I was not able to add new rows for some (not all) databases.
In contrast, re-creating a blank database using alembic to maintain the schema works well without changing any of my webapp's code. However, pgloader defaults to overriding existing tables, so this would leave me nowhere, as the data is what really needs migrating.
How do I get proper data migration using a schema I've already defined (and which works)?
What eventually worked was, in summary:-
Create the appropriate database structure in postgresql://newpostgresdb (I just used alembic upgrade head for this)
Use pgloader to move data over from sqlite to a different database in postgresql. As mentioned in the question, some data inconsistencies need to be solved before this step, but that's not relevant to this question itself.
createdb tempdb
pgloader my.db postgresql:///tempdb
Dump the data in tempdb using pg_dump
pg_dump -a -d tempdb > dumped_postgres_database
Edit the resulting dump to accomplish the following:-
SET session_replication_role = replica because some of my rows are circular in reference to other rows in the same table
Delete the alembic_version table, as we're restarting a new branch for alembic.
Regenerate any sequences, with the equivalent of SELECT pg_catalog.setval('"table_colname_seq"', (select max(colname) from table));
Finally, psql can be used to load the data to your actual database
psql -d newpostgresdb < dumped_postgres_database
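The manual edits to the dump could be scripted roughly as follows. This is a sketch, not the exact procedure used above: it assumes each listed key column owns its sequence (so pg_get_serial_sequence can find it), that table and column names need no quoting, and that the session has the privileges needed to set session_replication_role. It does not remove the alembic_version data; excluding that table at dump time with pg_dump's -T option is one possible alternative.

```shell
# Wrap a data-only dump so triggers/FK checks are skipped during the load,
# then append one sequence reset per "table:column" pair.
# $1 = data-only dump file; remaining arguments = table:column pairs.
build_load_script() {
    dump=$1
    shift
    echo "SET session_replication_role = replica;"
    cat "$dump"
    echo "SET session_replication_role = origin;"
    for pair in "$@"; do
        table=${pair%%:*}    # text before the first colon
        column=${pair#*:}    # text after the first colon
        printf "SELECT pg_catalog.setval(pg_get_serial_sequence('%s', '%s'), (SELECT max(%s) FROM %s));\n" \
            "$table" "$column" "$column" "$table"
    done
}

# Intended use (not run here; table and column names are hypothetical):
#   build_load_script dumped_postgres_database users:id orders:id > load.sql
#   psql -d newpostgresdb < load.sql
```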
I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database back to a "clean" state at the start of every test. We do this by computing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed and deleting any records that have been added. So far this strategy seems to be the best approach for us because, per test, not a lot of data is changed, and the database is not very big.
Now I'm looking for a way to essentially do the same thing but with PostgreSQL. I'm considering doing the exact same thing with PostgreSQL. But before doing so, I was wondering if anyone else has done something similar and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are particularly not interested in re-inventing the wheel in terms of checking for diffs via your own code, you should try creating a new DB (per run) via templates feature of createdb command (or CREATE DATABASE statement) in PostgreSQL.
For example:
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always exact same DB by design (no custom logic)
Replication is a file-transfer (not DB restore). So far less time taken (i.e. doesn't run SQL again, doesn't recreate indexes / restore tables etc.)
Cons:
Takes 2x the disk space (although template could be on a low performance NFS etc)
For my specific situation, I decided to go back to the original solution, which is to compare the "working" copy of the database with the "clean" copy of the database.
There are 3 types of changes.
For INSERT records - find max(id) from clean table and delete any record on working table that has higher ID
For UPDATE or DELETE records - find all records in clean table EXCEPT records found in working table. Then UPSERT those records into working table.
We have a series of modifications to a Postgres database, which can generally be written all in SQL. So it seems Flyway would be a great fit to automate these.
However, they also include imports from files to tables, such as
COPY mytable FROM '${PWD}/mydata.sql';
And secondarily, we'd like not to rely on Postgres' use of file paths like this, which apparently must reside on the server. It should be possible to run any migration from a remote client -- as in Amazon's RDS documentation (last section).
Are there good approaches to handling this kind of scenario already in Flyway? Or alternate approaches to avoid this issue altogether?
Currently, it looks like it'd work to implement the whole migration in Java and use the Postgres driver's CopyManager to import the data. However, that means most of our migrations have to be done in Java, which seems much clumsier. (As far as I can tell, hybrid Java+SQL migrations are not expected?)
Am new to looking at Flyway so thought I'd ask what other alternatives might exist with Flyway, since I'd expect it's pretty common to import a table during a migration.
Starting with Flyway 3.1, you can use COPY FROM STDIN statements within your migration files to accomplish this. The SQL execution engine will automatically use PostgreSQL's CopyManager to transfer the data.
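As an illustration, a migration that carries its own data might look like the file written by the sketch below, so nothing has to exist on the database server's filesystem. The column names and sample rows are made up, and a file name such as V2__load_mydata.sql is only an example of Flyway's versioned naming; check the exact COPY options against your Flyway and PostgreSQL versions.

```shell
# Write a SQL migration whose rows travel inside the file itself.
# $1 = path of the migration file to create (chosen by the caller).
write_copy_migration() {
    # The quoted 'EOF' keeps the shell from touching the backslash terminator.
    cat > "$1" <<'EOF'
COPY mytable (id, name) FROM STDIN (FORMAT csv);
1,alpha
2,beta
\.
EOF
}

# Intended use (not run here; the path is hypothetical):
#   write_copy_migration sql/V2__load_mydata.sql
```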