How do you manage a major schema change when you are using a NoSQL store like SimpleDB?
I know that I am still thinking in SQL terms, but after working with SimpleDB for a few weeks I need to make a change to a running database. I would like to change one of the object classes to have a unique id rather than a business name, and since it is referenced by another object, I will also need to update the reference value in those objects.
With a SQL database you would run a set of SQL statements as part of the client software deployment process. Obviously this will not work with something like SimpleDB, as there is no equivalent of a SQL UPDATE statement.
Due to the distributed nature of SimpleDB, there is no way of knowing when the changes you have made to the database have 'filtered' out to all the nodes running your client software.
Some solutions I have thought of are
Each domain has a version number. The client software knows which version of the domain it should use. Write some code that copies the data from one domain version to another, making any required changes as you go. You can then install new client software that then accesses the new domain version. This approach will not work unless you can 'freeze' all write access during the update process.
Each item has a version attribute that indicates the format used when it was stored. The client uses this attribute when loading the object into memory. Objects can then be converted to the latest format when they are written back to SimpleDB. The problem with this is that the new software needs to be deployed to all servers before any writes in the new format occur, or clients running the old software will not know how to read the new format.
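For illustration, a rough sketch of the first approach might look like this; it uses the legacy boto SimpleDB bindings, the domain and attribute names are just placeholders, and it assumes writes are frozen while the copy runs:

# Rough sketch of approach 1: copy every item from the old domain version into a
# new one, switching the key from the business name to a generated unique id, and
# then fix up the references held by the other object class.
import uuid
import boto

conn = boto.connect_sdb()                       # credentials come from the environment
old = conn.get_domain('orders_v1')              # hypothetical domain names
new = conn.create_domain('orders_v2')

key_map = {}                                    # business name -> new unique id
for item in old.select('select * from `orders_v1`'):
    new_id = str(uuid.uuid4())
    key_map[item.name] = new_id
    new.put_attributes(new_id, dict(item))      # copy attributes under the new key

# Update the reference attribute in the objects that point at the renamed items.
refs = conn.get_domain('invoices_v1')
for item in list(refs.select('select * from `invoices_v1`')):
    item['order_ref'] = key_map.get(item['order_ref'], item['order_ref'])
    refs.put_attributes(item.name, dict(item), replace=True)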
It all is rather complex and I am wondering if I am missing something?
Thanks
Richard
I use something similar to your second option, but without the version attribute.
First, try to keep your changes to things that are easy to make backward compatible - changing the primary key is the worst case scenario for this.
Removing a field is easy - just stop writing to that field once all servers are running a version that doesn't require it.
Adding a field requires that you never write that object using code that won't save that field. If you can't deploy the new version everywhere at once, use an intermediate version that supports saving the field before you deploy a version that requires it.
Changing a field is just a combination of these two operations.
With this approach changes are applied as needed - write using the new version, but allow reading of the old version with default or derived values for the new field.
You can use the same code to update all records at once, though this may not be appropriate on a large dataset.
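As a rough sketch of that pattern (the field names and the put_attributes callable are hypothetical, and plain dicts stand in for whatever SimpleDB client you use):

CURRENT_FIELDS = ("name", "email", "display_name")   # "display_name" is the newly added field

def load_user(raw):
    # Convert whatever format is stored into the current in-memory format.
    user = dict(raw)
    # Old records won't have the new field yet: default or derive it on read.
    user.setdefault("display_name", user.get("name", ""))
    return user

def save_user(user, put_attributes):
    # Always write the full, current format back.
    put_attributes({field: user[field] for field in CURRENT_FIELDS})

# Upgrading every record at once is just load_user + save_user over the whole
# domain, which may not be appropriate on a large dataset.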
Changing the primary key can be handled the same way, but could get really complex depending on which NoSQL system you are using. You are probably stuck with designing custom migration code in this case.
RavenDB, another NoSQL database, uses migrations to achieve this:
http://ayende.com/blog/66563/ravendb-migrations-rolling-updates
http://ayende.com/blog/66562/ravendb-migrations-when-to-execute
Normally these types of changes are handled by your application, which upgrades the schema on the fly: upon loading a document in version X it converts it to version Y and persists it.
Related
So I am learning FastAPI and want to get more experience with relational databases. I am using SQLAlchemy ORM, Pydantic and Alembic. The database is Postgres. One thing that I am running into, however, is that when I want to add a single column to a table I need to change a model, a schema, and an Alembic migration in order to reflect this change. Isn't this a huge violation of DRY, error prone, and very hard to maintain in the long run?
Check out SQLModel. It attempts to tackle this exact issue. Instead of defining a database model (SQLAlchemy) and a corresponding Pydantic model, you only define one SQLModel that combines both.
It doesn't change the fact that you need to verify that your Alembic migrations work as intended, though.
The project is still in its early stages, but I find it very promising.
In general, I don't see any benefit in defining the schema twice when you are developing an API. There is, however, the obvious downside of making the entire application much more error prone when you have to repeat yourself for every change in the schema.
With SQLModel you will probably see a substantial reduction in the lines of code, unless you have very special requirements (such as exotic types, multiple layers of validation/conversion, highly complex/nested relationships).
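A minimal example of what that looks like (the model and field names are from the SQLModel tutorial, not your project):

from typing import Optional
from sqlmodel import Field, Session, SQLModel, create_engine

class Hero(SQLModel, table=True):
    # One class is both the SQLAlchemy table definition and the Pydantic model,
    # so adding a column means touching this class plus one Alembic migration.
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    secret_name: str

engine = create_engine("sqlite://")      # swap in your Postgres URL
SQLModel.metadata.create_all(engine)     # or manage the schema with Alembic instead

with Session(engine) as session:
    session.add(Hero(name="Deadpond", secret_name="Dive Wilson"))
    session.commit()

The same Hero class can also be used directly as the request/response model in your FastAPI path operations.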
I need to implement a schema migration mechanism for PostgreSQL.
Just to remove ambiguity: by schema migration I mean that I need to upgrade my database structures to the latest version regardless of their current state on a particular server instance.
For example, in version one I created some tables, then in version two I renamed some columns, and in version three I removed one table and created another one. I have multiple servers; on some of them I have version one, on some version three, etc.
My idea:
Generate a hash of the output produced by
pg_dump --schema-only
every time before I change my database schema. This will be a reliable way to identify the database version to which a patch should apply.
Keep a list of patches together with the hashes of the schema versions they should apply to.
When I need to upgrade my database, I will run an application that searches for the hash corresponding to the current database structure (by calculating the hash of the local database and comparing it against the set of hashes I have) and applies the associated patch.
Repeat until no matching hash is found.
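For concreteness, a minimal sketch of this idea (the connection string, hashes, and patch file names are placeholders):

import hashlib
import subprocess

# hash of the current schema dump -> patch that upgrades it to the next version
PATCHES = {
    "3f2a...": "patches/001_rename_columns.sql",
    "9c71...": "patches/002_replace_table.sql",
}

def schema_hash(dsn):
    dump = subprocess.run(
        ["pg_dump", "--schema-only", dsn],
        check=True, capture_output=True, text=True,
    ).stdout
    return hashlib.sha256(dump.encode()).hexdigest()

def upgrade(dsn):
    while True:
        patch = PATCHES.get(schema_hash(dsn))
        if patch is None:                 # no patch registered for this hash: assume up to date
            return
        subprocess.run(["psql", dsn, "-v", "ON_ERROR_STOP=1", "-f", patch], check=True)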
Could you please point any weak sides of this approach?
Have you ever heard of https://pgmodeler.io ? At the company where I work we decided to go for it, since it can perform a schema diff even between a local and a remote database. We are very satisfied with it.
Otherwise, if you prefer a free solution, you could develop a migration tool that applies migrations you store in a single repo. Furthermore, this tool could rely on a migration table kept in a separate schema, so that your database(s) always know which migrations have been applied.
The beauty of this approach is that migrations can cover both schema changes and data changes.
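A sketch of what such a tracking table and apply loop could look like (using psycopg2; the schema, table, and directory names are arbitrary):

import pathlib
import psycopg2

def apply_pending(dsn, migrations_dir="migrations"):
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE SCHEMA IF NOT EXISTS ops;
            CREATE TABLE IF NOT EXISTS ops.schema_migrations (
                filename   text PRIMARY KEY,
                applied_at timestamptz NOT NULL DEFAULT now()
            );
        """)
    # Migration files are applied in name order; each one runs in a transaction
    # together with the row that records it, so a failure leaves the ledger consistent.
    for path in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1 FROM ops.schema_migrations WHERE filename = %s", (path.name,))
            if cur.fetchone():
                continue                                      # already applied on this server
            cur.execute(path.read_text())                     # schema change and/or data change
            cur.execute("INSERT INTO ops.schema_migrations (filename) VALUES (%s)", (path.name,))
    conn.close()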
I hope this can give you some ideas.
I have a small project of mine nearing its release, based on Squeryl, a type-safe relational database framework for Scala (a JVM-based language).
I foresee multiple updates after the initial deployment. The data entered into the database should persist across them. This is impossible without some kind of data migration procedure that upgrades the data to the newer DB schema.
Using old data for testing new code also requires compatibility patches.
Now I use automatic schema generation by the framework. It seems to be able only to create the schema from scratch - no data persists.
Are there methods that allow easy and formalized migration of data to a changed schema, without completely dropping automatic schema generation?
So far I can only see an easy way to add columns: we dump old data, provide default values for new columns, reset schema and restore old data.
How do I delete, rename, change column types or semantics?
If schema generation is not useful for production database migration, what are standard procedures to follow for conventional manual/scripted redeployment?
There have been several discussions about this on the Squeryl list. The consensus tends to be that there is no real best practice that works for everyone. Having an automated process to update your schema based on your model is brittle (can't handle situations like column renames) and can be dangerous in production. Personally, I like the idea of "migrations" where all of your schema changes are written as SQL. There are a few frameworks that help with this and you can find some of them here. Personally, I just use a light wrapper around the psql command line utility to do schema migrations and data loading as it's a lot faster for the latter than feeding in the data over JDBC.
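For example, the "light wrapper around psql" can be as thin as something like the following (the connection string and file names are placeholders; shown in Python purely for illustration):

import subprocess

def psql(dsn, *args):
    subprocess.run(["psql", dsn, "-v", "ON_ERROR_STOP=1", *args], check=True)

# Schema migrations are plain SQL files applied in order...
for migration in ["001_create_tables.sql", "002_rename_user_column.sql"]:
    psql("postgresql://localhost/mydb", "-f", migration)

# ...and bulk data loads go through \copy, which is much faster than feeding rows over JDBC.
psql("postgresql://localhost/mydb", "-c", r"\copy users FROM 'users.csv' WITH (FORMAT csv)")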
The problem is this: my company has a service that cannot stop running for long periods of time, and I have been working on some modifications to the database structure used by this service.
Now that all my modifications are ready and well tested in a test bench environment, I want to export them to the running system. I could do this manually with IBExpert or FlameRobin, but I wanted to know if there is a more automated method for doing this (I feel dumb spending a whole day creating tables, attributes, and so on one by one).
Is there?
You mention IBExpert - it has the Database Comparer tool, which generates the DDL needed to merge database structures.
And as you know you can use IBEBlock to fully automate that process.
PS: Or deploy your own app using IBEScript.dll, which lets you use all the functionality of the IBEBlock scripting language.
Please read: http://ibexpert.net/ibe/index.php?n=Main.IBEScriptDll
Check out the database compare feature of Database Workbench (Windows client). It can compare whatever database objects you select and generate DDL to modify your destination database. Unfortunately you will need the Pro edition, but there is a 30 day trial.
We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I’m using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I’ve attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except I don’t really understand the UNION ALL at the end when I’m trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn’t seem to work; I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(this error is repeated for the first two parameters as well – never came across this using the sql connection as it supports named parameters)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I’m using the wrong tool for the job - is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for:
http://msdn.microsoft.com/en-us/library/ms141715.aspx
Regarding the "external columns are out of sync" error: SSIS is case sensitive. I have encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want. I need to use set-based SQL.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you run cross-database queries, so I solved the problem this way:
Data Source from Production DB to a temp Archive DB table
Run set based query between temp table and archive table
Truncate temp table
Note that the temp table is not actually a temporary table, but a copy of the archive table's schema used to temporarily store data in (the set-based step is sketched below).
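For reference, the set-based step amounts to an update-then-insert pair run against the archive database, roughly like the following (table and column names are made up; shown via psycopg2 purely for illustration, in SSIS it would typically sit in an Execute SQL Task):

import psycopg2

UPSERT = """
UPDATE fact_sales f
SET    quantity = s.quantity,
       amount   = s.amount
FROM   staging_sales s
WHERE  f.sale_id = s.sale_id;

INSERT INTO fact_sales (sale_id, quantity, amount)
SELECT s.sale_id, s.quantity, s.amount
FROM   staging_sales s
LEFT JOIN fact_sales f ON f.sale_id = s.sale_id
WHERE  f.sale_id IS NULL;

TRUNCATE staging_sales;
"""

with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
    cur.execute(UPSERT)        # update matches, insert the rest, then clear the staging table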
Took a while, but I got there in the end.
"SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work."
What enterprise ETL solution would you suggest?