Code & data tracking / deployment - deployment

For a long time now, we've held our data within the project's repository. We just held everything under data/sql, and each table had its own create_tablename.sql and data_tablename.sql files.
We have now just deployed our 2nd project onto Scalr and we've realised it's a bit messy.
The way we deploy:
We have a "packageup" collection of scripts which tear apart the project into 3 archives (data, code, static files) which we then store in 3 separate buckets on S3.
Whenever a role starts up, it downloads one of the files (depending on the role: data, nfs or web) and then an "unpackage" script sets up everything for each role, loads the data into MySQL, sets up the NFS, etc.
We do it like this because we don't want to save server images, we always start from vanilla instances onto which we install everything from scratch using various in-house built scripts. Startup time isn't an issue (we have a ready to use farm in 9 minutes).
The issue is that it's a pain trying to find the right version of the database whenever we set up a new development build (at any point in time, we've got about 4 dev builds for a project). Also, git is starting to choke once we go into production, as the SQL files end up totalling around 500 MB.
The question is:
How is everyone else managing databases? I've been looking for something that makes it easy to take data out of production into dev, and also migrating data from dev into production, but haven't stumbled upon anything.

You should seriously take a look at dbdeploy (dbdeploy.com). It has been ported to many languages, the major ones being Java and PHP. It integrates with build tools such as Ant and Phing, and allows easy sharing of so-called delta files.
A delta file always consists of a deploy section, but can also contain an undo section. When you commit your delta file and another developer checks it out, he can just run dbdeploy and all new changes are automatically applied to his database.
I'm using dbdeploy for my open source blog, so you can take a look at how the delta files are organized: http://site.svn.dasprids.de/trunk/sql/deltas/
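To make the delta-file idea concrete, here is a minimal sketch in Python with SQLite. It is not dbdeploy itself: the `changelog` table name, the `--//@UNDO` marker, and the `apply_pending` helper are assumptions chosen for illustration. Each delta is applied once, in numeric order, and recorded so that a second run is a no-op.

```python
import sqlite3

# Assumed marker separating the deploy section from the optional undo section.
UNDO_MARKER = "--//@UNDO"


def split_delta(text):
    """Return (deploy_sql, undo_sql); undo_sql is '' when there is no undo section."""
    deploy, _, undo = text.partition(UNDO_MARKER)
    return deploy.strip(), undo.strip()


def apply_pending(conn, deltas):
    """Apply every delta whose number is not yet recorded in the changelog table.

    `deltas` maps an integer change number to the text of a delta file.
    Returns the list of change numbers applied in this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog (change_number INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT change_number FROM changelog")}
    newly_applied = []
    for number in sorted(deltas):
        if number in applied:
            continue
        deploy_sql, _undo_sql = split_delta(deltas[number])
        conn.executescript(deploy_sql)  # run only the deploy section
        conn.execute("INSERT INTO changelog (change_number) VALUES (?)", (number,))
        conn.commit()
        newly_applied.append(number)
    return newly_applied
```

A developer who checks out new delta files would call `apply_pending` with all of them; only the ones missing from `changelog` are run.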

As I understand it, your main question is about other people's experience with migrating SQL data from dev into production.
I use Microsoft SQL Server rather than MySQL, so I am not sure you can apply my experience directly. Nevertheless, this approach works very well.
I use Visual Studio 2010 Ultimate edition to compare the data in two databases. The same feature also exists in Visual Studio Team Edition 2008 (or Database Edition). You can read http://msdn.microsoft.com/en-us/library/dd193261.aspx to understand how it works. You can compare two databases (dev and prod) and generate a SQL script for modifying the data. You can easily exclude some tables or columns from the comparison. You can also examine the results and exclude some entries from the generated script. So you can easily and flexibly generate scripts that can be used to deploy changes to the database. Data comparison is separate from structure (schema) comparison. So you can refresh the data in dev with the data from prod, or generate scripts that bring the prod database up to the latest version of the dev database. I recommend you look at these features and at some products from http://www.red-gate.com/ (like http://www.red-gate.com/products/SQL_Compare/index.htm).

Check out Capistrano. It's a tool the Ruby community uses for deployment to different environments, and I find it really useful.
Also, if your deployment is starting to choke, try a tool Twitter built called Murder.

Personally I'd look at Toad
http://www.toadworld.com/
Less than 10k ;) ... will analyse database structures, produce scripts to modify them and also will migrate data.

One part of the solution is to capture the version of each of your code modules and their corresponding data resources in a single location, and compare them to ensure consistency. For example, an increment in the version number of your, say, customer_comments module will require a corresponding SQL delta file to upgrade the relevant DB tables to the equal version number for the data.
For an example, have a look at Magento's core_resource approach as documented by @AlanStorm.
Cheers,
JD
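The consistency check described in this answer could be sketched roughly as follows in Python. This is not Magento's implementation; the `find_version_mismatches` helper and the two dictionaries (module name to version string) are hypothetical stand-ins for wherever the code and data versions are actually recorded.

```python
def find_version_mismatches(code_versions, data_versions):
    """Return {module: (code_version, data_version)} for every module whose
    recorded data version differs from its code version.

    A data_version of None means no data version is recorded for the module.
    """
    mismatches = {}
    for module, code_version in code_versions.items():
        data_version = data_versions.get(module)
        if data_version != code_version:
            mismatches[module] = (code_version, data_version)
    return mismatches
```

Any module reported here would need its SQL delta applied before the build is considered consistent.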

Related

How to deploy/versioning database with Cruise Control Net?

Hi, I have configured the basics of Cruise Control to make releases and run automated NUnit tests using just MSBuild. Now I'm wondering whether it is possible to deploy/version databases with it.
I'm a beginner at CCNet, so if it is possible, I'd appreciate some suggestions or tutorials (if there are any). Also, if someone knows a free tool for database deployment/versioning, let me know. I will be grateful.
Thanks in advance
Hugh
It isn't free, but SQL Source Control from RedGate can do what you're looking for, assuming it's a SQL Server database. It has a command-line interface that you can use in CCNet tasks. The easy approach of just migrating up is... easy: the changes are applied to your database schema / data. There was an issue with v2.x of the tool that they've overcome with v3: if you renamed a table column, it would delete the column and create a new one with the right name. Obviously that's quite a big problem if you've got data you want to keep, so v3 introduces the concept of migrations, which lets you specify alter scripts so that instead of dropping the column you can script the change non-destructively.
As far as I know, at this time, they don't have anything that allows you to roll back your version.
Otherwise you could take a look at database migration tools, there seemed to be some promise for these in .Net at least. There is also this post that has some other tools (again for .net) and then there's this https://stackoverflow.com/search?q=database+migration+tool which is not restricted to any language but is general database migrations
If you're still looking for ways to version and migrate databases, one such tool is dbdeploy.net . I've hosted it on github after forking it and doing some work. Latest version is fully up to date and has some interesting features (done by someone who also uses it and sent a pull request).

Why would I use an SSIS package in SQL Server 2008 as opposed to some other technology?

I'm in a QA department of an internal development group. Our production database programmers have been building an SSIS package to create a load file from various database bits for import into a third-party application (we are testing integration with this).
Once built, it was quickly discovered that it had dependencies on the version of SQL Server and Visual Studio that it was created with, and had quite a few dependencies on the production environment as well (this is not an SSIS problem, just describing the nature of our setup).
Getting this built took several days of solid effort, and then would not run under our QA environment.
After asking that team for the SQL queries that their package was running (it works fine in the production environment), I wrote a python script that performed the same task without any dependencies. It took me a little over two hours (note that I already had a custom library for handling our database interaction), and I was able to write out a UTF-16LE file that I needed.
Now, our production database programmers are not SSIS experts, but they use it a fair bit in their workflows -- I would readily call all of them competent in their positions.
Thus, my question -- given the time it appears to take and the dependencies on the versions of SQL Server and Visual Studio, what advantage or benefits does an SSIS package bring that I may not see with my python code? Or a shell script, or Ruby or code-flavor-of-the-moment?
I am not an expert in SSIS by any means, but an average developer who has worked with SSIS for a little over three years. Like any other software, SSIS has its shortcomings, but so far I have enjoyed working with it. Selection of technology depends on one's requirements and preferences. I am not going to say SSIS is superior to other technologies. Also, I have not worked with Python, Ruby or the other technologies that you have mentioned.
Here are my two cents. Please take this with a grain of salt.
1. From an average developer's point of view, SSIS is easy to use once you understand the nuances of how to handle it. I believe that the same is true for any other technology. SSIS packages are visual workflows rather than a coding tool (of course, SSIS has excellent coding capabilities too). One can easily understand what is going on within a package by looking at the workflows instead of going through hundreds of lines of code.
2. SSIS is built mainly to perform ETL (Extract, Transform, Load) jobs. It is fine-tuned to handle that functionality really well, especially with SQL Server, and it can handle flat files, DB2, Oracle and other data sources as well.
3. You can perform most of the tasks with minimal or no coding. It can load millions of rows from one data source to another within a few minutes. See this example demonstrating a package that loads a million rows from a tab-delimited file into SQL Server within 3 minutes.
4. Logging capabilities capture every action performed by the package and its tasks. This helps to pinpoint errors or track information about the actions performed by the package. This requires no coding. See this example for logging.
5. Checkpoints capture the package execution like a recorder and allow the package to be restarted from the point of failure instead of being run from the beginning.
6. Expressions can be used to determine the package flow depending on a given condition.
7. Package configurations can be set up for different environments using a database, XML-based dtsconfig files, or machine-based environment variables. See this example for environment-variable-based configuration. Points #4 - #7 are out-of-the-box features which require minor configuration and no coding at all.
8. SSIS can leverage the .NET Framework capabilities, and developers can create their own custom components if they can't find a component that meets their requirements. See this example to understand how .NET coding can be best used along with different data sources. This example was created in less than 3 hours.
9. SSIS can use the same data source for multiple transformations without having to re-read the data. See this example to understand what multicasting means. Here is an example of how XML data sources can be handled.
10. SSIS can also integrate with SSRS (Reporting Services) and SSAS (Analysis Services) easily.
I have just listed very basic things that I have used in SSIS, but there are a lot of other nice features. As I mentioned earlier, I am not sure if Python, Ruby or other languages can handle these tasks with such ease.
It all boils down to one's comfort with the technology. When a technology is new, people are very skeptical and unwilling to adopt it.
In my experience, once you understand and embrace SSIS it is really a nice technology to use. It works really well with SQL Server. I don't deny the fact that I faced obstacles during development of my packages but mostly found a way to overcome them.
This may not be the answer that you were expecting but I hope this gives an idea.

Database Versioning - How does branch switching work?

This is a question for those of you developing on a team of devs where all of you have separate databases. You're versioning your database using source control and other tools which will automatically bring dev databases up to date to the latest version of the database (schema, data, SP's, functions, etc.).
OK Great! But wait! What if you are developing on version 4.0 of your software, but now you need to switch branches to the 3.2 branch to fix a bug? The schema could be (almost assuredly is) very different by now...
I suppose if you went through the extra effort to write rollback scripts along with your change scripts, this could work. But that seems like a lot of work - is it really worth it?
Much easier would be to create a new 3.2-branch database and work with that while working on the 3.2-branch code. It doesn't seem reasonable to me to require that each developer has exactly one database to work with.
I'm going out on a limb and assuming that you are versioning the database as a binary? If all your database assets were in the form of constructive code (e.g. SQL scripts and/or text data dumps), the solution would be simple, as suggested by Mark: store these assets as part of the development branch. To work on version 3.2, switch the branch, re-run the create scripts and presto, 3.2 database. Merging would be just as easy as with regular code (or just as painful, depending on your version control system of choice).
Here are some suggestions to work in this mode:
If creating the database instances from text is too slow, make a cache on a shared disk volume, keyed by the contents of all the schema / data files (or the MD5 sum thereof).
Write a pre-commit hook to ensure that the schema and data dumps in the developer's instance are the same as the ones under version control. This prevents people from making changes to their dev database with an interactive tool, and then forgetting to commit them.
You mention change scripts; treat them as a liability. While they may be required by your deployment scenario (eg for customers who want to upgrade in-place), they duplicate information from the version history of the database, and per Murphy's law duplication means desynchronization sooner or later. Try to auto-generate the change scripts from the versioned database assets using "diff"; or if this cannot be achieved, dedicate some serious unit tests to database upgrades.
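As a rough sketch of the cache key from the first suggestion, the following Python function hashes the relative path and contents of every schema / data file under a directory. The function name, the `*.sql` pattern, and the directory layout are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def schema_cache_key(root, pattern="*.sql"):
    """Return an MD5 hex digest covering the relative path and contents
    of every file matching `pattern` under `root` (searched recursively)."""
    digest = hashlib.md5()
    root = Path(root)
    for path in sorted(root.rglob(pattern)):  # sorted for a stable order
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between file name and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

A cached database instance could then be stored under a directory named after this key and reused whenever a branch produces the same key. The same digest could also serve the pre-commit check: dump the developer's instance to a scratch directory and compare its key with the key of the versioned files.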

Sql Server Development Server and Live

I have a database project that goes through iterations (only one so far) and I need to deploy a testing version to a live server. I'm not sure how to go about this.
I can make all the changes in a copy and then remake those changes in the live version. That doesn't make sense.
Is there a way to change a server name to an existing server? What's the best practice for this scenario?
With a Visual Studio Database Project, you should be able to have as many database connections defined as you like. When you go to run your scripts, you can pick a menu option called "Run On...." and then pick which server connection to run those scripts on.
Just make sure the database name is the same for both instances, or make sure that you do not specify USE (database) at the top of all your scripts, if the database names are different from target to target.
In the first place, you should have scripts already written for the changes you made. They should be in source control. No changes ever should be made to database structure without a script and versioning.
Since you don't appear to have what you should have to deploy, you need a tool to check the differences between the databases. Redgate's SQL Compare is the one to buy.
Be careful of simply using the tool without thinking, there may be changes in dev you are not yet ready to promote to prod. Read through the scripts before running them.
Also you may need SQL Data Compare to run against any lookup tables you have to see if new values have been added in dev that need to go to prod. Again these inserts should have been scripted and in source control and then deploying is simple.
Maybe I'm misunderstanding the question, but I don't see how you could just swap the databases. If you make a development version of a database and update the schema, you must surely run some tests and update the data. You can't just make that the development database now because it's full of test data.
What you need to do is run a tool that compares the old schema to the new schema and then apply these changes to the production database. There are tools out there on the market to do this. Failing that, you could dump the old and new schemas, run them through an ordinary file compare to get the differences, and then build an update script out of that.
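The "dump both schemas and run an ordinary file compare" fallback can be sketched with Python's standard difflib module. The `schema_diff` helper and the file labels are hypothetical; the output is only a textual diff, and the ALTER statements still have to be written from it by hand.

```python
import difflib


def schema_diff(old_schema, new_schema):
    """Return unified-diff lines between two schema dumps given as strings.

    An empty list means the dumps are identical.
    """
    lines = difflib.unified_diff(
        old_schema.splitlines(),
        new_schema.splitlines(),
        fromfile="old_schema.sql",
        tofile="new_schema.sql",
        lineterm="",
    )
    return list(lines)
```

Lines starting with "+" or "-" in the result point at the columns or tables that the update script has to add, change, or drop.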
On my present project we use what I think is a terrible practice: We keep a hand-maintained script of schema updates for each version, and every time someone makes a change they're supposed to update this script. Every now and then someone makes a mistake and we have to scramble to figure out what went wrong. For example, we just had a problem deploying to our user acceptance test because someone updated the create statement for a new table to include a foreign key to another new table ... not realizing that the table being referenced was not created until further down in the script. It worked fine in test because the tables were created in an order that made it work.
My conclusion is you're much better off to just make changes to the schema on the fly, then when you're done, run an automated compare to generate the ALTER statements.
By the way, on a project I worked on a few years ago, for a desktop application where each customer had their own copy of the database, we put in what I thought was a very nice feature: Every time the program started up, it compared the schema of the database to what it thought it ought to be, and if they didn't match, it automatically updated it. So when they installed a new version, it just automatically updated the database the first time they ran it.
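A simplified sketch of that start-up upgrade, in Python with SQLite. Note the swap: the program described above compared the actual schema, whereas this sketch only compares a stored version number (SQLite's `PRAGMA user_version`) against a list of upgrade steps. The `MIGRATIONS` list and its statements are made up for illustration.

```python
import sqlite3

# Hypothetical upgrade steps: MIGRATIONS[i] upgrades a database
# from version i to version i + 1.
MIGRATIONS = [
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE customers ADD COLUMN email TEXT",
]


def upgrade_on_startup(conn):
    """Apply any missing upgrade steps and return the resulting version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        # PRAGMA does not accept bound parameters; the value is an int we control.
        conn.execute(f"PRAGMA user_version = {version + 1}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Calling `upgrade_on_startup` every time the program starts means a customer who installs a new version gets the database updated on first run, and later runs do nothing.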

Version-control in a large SSIS ETL project

We're about to do a data transformation from one system to another using SSIS. We are four people who will be working on this continuously for two years, and therefore we need some sort of versioning system. We cannot use Team Foundation. We're currently configuring an SVN server, but digging into it I've seen some big risks.
It seems that a solution is stored in one huge XML file. This must be a huge problem in a combined code/drag-and-drop environment such as SSIS, as it will be impossible for SVN to merge the changes correctly, and whenever we get an error when committing we will have to look inside that huge XML file and correct the mistakes manually.
One way to solve this problem is to create many solution projects in SSIS. However, this is not really the setup we want, as we are creating one big monster which will take 2 days to execute and we want to follow its progress as it executes. If we have to create several solutions, are there ways to link their execution and still have a visual look at what's going on and how well the execution is doing?
Has anyone had similar problems and/or do you have any suggestions as to how to solve them?
Just how many packages are you talking about? If it is hundreds of packages, then what is the specific problem you are trying to avoid? Here are a few things you might be trying to avoid based on your post:
Slow solution and project load time at startup in BIDS. I suppose this could be irritating from time to time. But if you keep BIDS open all day, that seems like a once a day cost.
Slow solution and project load time when you get latest solution definition from your version control system. Again, I suppose this could be irritating from time to time, but how frequently do you need to refresh the whole solution? If you break the solution into separate projects, then you only need to refresh a project. You would only need to refresh the whole solution if you want to get access to a new project within the solution.
What do you mean by "one huge XML file"? The solution file is an XML file that keeps track of the projects. Each project file is an XML file that keeps track of its SSIS packages. So if you have 1,000 SSIS packages evenly distributed across 10 projects in 1 solution, then each file would have no more than 100 objects to track. I can tell you from experience that I've had Reporting Services projects with more RDL files than this and it only took seconds to load the solution properly in BIDS. And as @revelator pointed out, the actual SSIS packages are their own individual XML files. Any version control system should track each of these as separate files and won't combine them into "one huge XML file". If you clarify what you mean by this point, then I think you will get better help on the question.
Whether you are running one package or 1,000 packages, you won't be doing this interactively from BIDS. You will probably deploy the packages to server first and then have the server run the packages. If that's the case, then you will need to call the packages probably with a SQL Server Agent job. Whether you chain the packages by making each package call another package or if you chain the packages by having the job call each package as a separate job step, you can still track where you are in the chain with logging. If you are calling the packages with jobs, then you can track it with job steps too. I run a data warehouse that has scores of packages and I primarily rely on separating processes into jobs that each contain one or more packages. I also chain jobs with start job commands so that I can more easily monitor performance of logical groups of loads. Also, each package shows its execution time in the job history at the step level. Furthermore, I have custom logging in each stored procedure and package that shows how many seconds and rows an individual data load or stored procedure took so that I can troubleshoot performance bottlenecks.
Whatever you do, don't rely on running packages interactively as a way to track performance! You won't get optimal performance running ETL on your machine, let alone running it with a GUI. Run packages in jobs on servers, not desktops. Interactively running packages is just there to help build and troubleshoot individual packages, not to administer daily ETL.
If you are building generic packages that change their targets and sources based on parameters, then you probably need to build a control table in a database that tracks progress. If you are simply moving data from one large system to another as a one-time event, then you are probably going to divide the load into small sets of packages and have separate jobs for each so that you can more easily manage recovering from failures. If you intend to build something that runs regularly to move data, then how could 2 days of constant running for one process even make sense? It sounds like the underlying data will change on you within 2 days...
If you are concerned about which version control system to use for managing SSIS package projects, then I can say that just about any will do. I've used Visual SourceSafe and Perforce at different companies and both have the same basic features of checking in and checking out individual packages. I'm sure just about any version control system that integrates with Visual Studio will do this for you.
Hope you find something useful in the above and good luck with your project.
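The custom logging mentioned above (seconds and rows per load, written to a table that doubles as a progress record) might look roughly like this. It is a Python and SQLite sketch rather than SSIS or SQL Server Agent code, and the `load_log` table and `run_logged` helper are hypothetical names.

```python
import sqlite3
import time


def run_logged(conn, step_name, load_fn):
    """Run load_fn(), which must return the number of rows it loaded, and
    record the step name, row count, and elapsed seconds in load_log."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log ("
        "step TEXT, row_count INTEGER, seconds REAL)"
    )
    started = time.perf_counter()
    row_count = load_fn()
    seconds = time.perf_counter() - started
    conn.execute(
        "INSERT INTO load_log (step, row_count, seconds) VALUES (?, ?, ?)",
        (step_name, row_count, seconds),
    )
    conn.commit()
    return row_count
```

Querying `load_log` while a long chain of loads is running shows which steps have finished and which ones are slow, without running anything interactively.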
Version control makes it possible to have multiple people developing together and working on same project. If I am working on something, a fellow ETL developer will not be able to check it out and make changes to it until I am finished with my changes and check those back in. This addresses the common situation where one developer’s project artifact and code changes clobber that of another developer by accident.
http://blog.sqlauthority.com/2011/08/10/sql-server-who-needs-etl-version-control/
Most ETL projects I work on use SVN as the source control repository. The best method I have found is to break each project or solution down into smaller, distinct (and often independently runnable) packages. So for example, say you had a process called ManufacturingImport; this could be your project. Within this you would have a Master package, which then calls other packages as required. This means that members of the team can work on distinct packages or pieces of work, rather than everyone trying to edit the same package and getting into troublesome situations with merging.