Version control in a large SSIS ETL project

We're about to build a data transformation from one system to another using SSIS. We are four people who will be working on this continuously for two years, and therefore we need some sort of versioning system. We cannot use Team Foundation. We're currently configuring an SVN server, but digging into it I've seen some big risks.
It seems that a solution is stored in one huge XML file. This must be a huge problem in a combined code/drag-and-drop environment like SSIS, as it will be impossible for SVN to merge the changes correctly, and whenever we get an error when committing we will have to look inside that huge XML file and correct the mistakes manually.
One way to solve this problem is to create many solution projects in SSIS. However, this is not really the setup we want, as we are creating one big monster which will take 2 days to execute, and we want to follow its progress as it executes. If we have to create several solutions, are there ways to link their execution and still have a visual overview of what's going on and how well the execution is doing?
Has anyone had similar problems and/or do you have any suggestions as to how to solve them?

Just how many packages are you talking about? If it is hundreds of packages, then what is the specific problem you are trying to avoid? Here are a few things you might be trying to avoid based on your post:
Slow solution and project load time at startup in BIDS. I suppose this could be irritating from time to time. But if you keep BIDS open all day, that seems like a once a day cost.
Slow solution and project load time when you get latest solution definition from your version control system. Again, I suppose this could be irritating from time to time, but how frequently do you need to refresh the whole solution? If you break the solution into separate projects, then you only need to refresh a project. You would only need to refresh the whole solution if you want to get access to a new project within the solution.
What do you mean by "one huge XML file"? The solution file is an XML file that keeps track of the projects. Each project file is an XML file that keeps track of its SSIS packages. So if you have 1,000 SSIS packages evenly distributed across 10 projects in 1 solution, then each file would have no more than 100 objects to track. I can tell you from experience that I've had Reporting Services projects with more RDL files than this and it only took seconds to load the solution properly in BIDS. And as #revelator pointed out, the actual SSIS packages are their own individual XML files. Any version control system should track each of these as separate files and won't combine them into "one huge XML file". If you clarify what you mean by this point, then I think you will get better help on the question.
Whether you are running one package or 1,000 packages, you won't be doing this interactively from BIDS. You will probably deploy the packages to a server first and then have the server run them. If that's the case, then you will probably need to call the packages with a SQL Server Agent job. Whether you chain the packages by making each package call another package, or by having the job call each package as a separate job step, you can still track where you are in the chain with logging. If you are calling the packages with jobs, then you can track it with job steps too.
I run a data warehouse that has scores of packages, and I primarily rely on separating processes into jobs that each contain one or more packages. I also chain jobs with start-job commands so that I can more easily monitor the performance of logical groups of loads. Also, each package shows its execution time in the job history at the step level. Furthermore, I have custom logging in each stored procedure and package that shows how many seconds and rows an individual data load or stored procedure took, so that I can troubleshoot performance bottlenecks.
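To make the chaining and custom logging idea concrete, here is a minimal T-SQL sketch (the job names and the log table are hypothetical, not taken from any real setup): the last step of one SQL Server Agent job closes out its log entry and then starts the next job in the chain.

    -- Hypothetical custom log table for per-load timings and row counts
    CREATE TABLE dbo.EtlLoadLog (
        LoadLogId   INT IDENTITY(1, 1) PRIMARY KEY,
        PackageName SYSNAME  NOT NULL,
        RowsLoaded  INT      NULL,
        StartTime   DATETIME NOT NULL DEFAULT GETDATE(),
        EndTime     DATETIME NULL
    );
    GO

    -- Final step of a hypothetical "DW - Load Staging" job: close out the
    -- log entry for this load, then start the next logical group of loads.
    UPDATE dbo.EtlLoadLog
    SET    EndTime = GETDATE()
    WHERE  LoadLogId = (SELECT MAX(LoadLogId)
                        FROM   dbo.EtlLoadLog
                        WHERE  PackageName = N'LoadStaging');

    EXEC msdb.dbo.sp_start_job @job_name = N'DW - Load Dimensions';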
Whatever you do, don't rely on running packages interactively as a way to track performance! You won't get optimal performance running ETL on your machine, let alone running it with a GUI. Run packages in jobs on servers, not desktops. Interactively running packages is just there to help build and troubleshoot individual packages, not to administer daily ETL.
If you are building generic packages that change their targets and sources based on parameters, then you probably need to build a control table in a database that tracks progress. If you are simply moving data from one large system to another as a one-time event, then you are probably going to divide the load into small sets of packages and have separate jobs for each, so that you can more easily manage recovering from failures. If you intend to build something that runs regularly to move data, then how could 2 days of constant running for one process even make sense? It sounds like the underlying data will change on you within those 2 days...
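If you do go the generic-package route, a control table along these lines is one way to track and restart progress. This is only a rough sketch with made-up names; in a real package the two variables below would be mapped from SSIS package variables in Execute SQL Tasks.

    -- Hypothetical control table that drives generic, parameterized packages
    CREATE TABLE dbo.LoadControl (
        LoadId       INT IDENTITY(1, 1) PRIMARY KEY,
        SourceObject SYSNAME     NOT NULL,
        TargetObject SYSNAME     NOT NULL,
        LoadStatus   VARCHAR(20) NOT NULL DEFAULT 'Pending', -- Pending / Running / Done / Failed
        StartTime    DATETIME    NULL,
        EndTime      DATETIME    NULL,
        RowsLoaded   INT         NULL
    );

    -- Stand-ins for values that would come from SSIS package variables
    DECLARE @LoadId INT, @RowCount INT;

    -- The package picks up its next unit of work...
    SELECT TOP (1) @LoadId = LoadId
    FROM   dbo.LoadControl
    WHERE  LoadStatus = 'Pending'
    ORDER  BY LoadId;

    UPDATE dbo.LoadControl
    SET    LoadStatus = 'Running', StartTime = GETDATE()
    WHERE  LoadId = @LoadId;

    -- ...and marks it complete when its data flow finishes, so a failed run
    -- can be restarted from the first row still marked 'Pending' or 'Failed'.
    UPDATE dbo.LoadControl
    SET    LoadStatus = 'Done', EndTime = GETDATE(), RowsLoaded = @RowCount
    WHERE  LoadId = @LoadId;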
If you are concerned about which version control system to use for managing SSIS package projects, then I can say that just about any will do. I've used Visual SourceSafe and Perforce at different companies, and both have the same basic features of checking in and checking out individual packages. I'm sure just about any version control system that integrates with Visual Studio will do this for you.
Hope you find something useful in the above and good luck with your project.

Version control makes it possible to have multiple people developing together and working on the same project. If I am working on something, a fellow ETL developer will not be able to check it out and make changes to it until I am finished with my changes and check those back in. This addresses the common situation where one developer's project artifacts and code changes clobber those of another developer by accident.
http://blog.sqlauthority.com/2011/08/10/sql-server-who-needs-etl-version-control/

Most ETL projects I work on use SVN as the source control repository. The best method I have found is to break each project or solution down into smaller, distinct (and often independently runnable) packages. So for example, say you had a process called ManufacturingImport; this could be your project. Within this you would have a master package, which then calls other packages as required. This means that members of the team can work on distinct packages or pieces of work, rather than everyone trying to edit the same package and getting into troublesome situations with merging.

Why would I use an SSIS package in SQL Server 2008 as opposed to some other technology?

I'm in a QA department of an internal development group. Our production database programmers have been building an SSIS package to create a load file from various database bits for import into a third-party application (we are testing integration with this).
Once built, it was quickly discovered that it had dependencies on the version of SQL Server and Visual Studio that it was created with, and had quite a few dependencies on the production environment as well (this is not an SSIS problem, just describing the nature of our setup).
Getting this built took several days of solid effort, and then would not run under our QA environment.
After asking that team for the SQL queries that their package was running (it works fine in the production environment), I wrote a python script that performed the same task without any dependencies. It took me a little over two hours (note that I already had a custom library for handling our database interaction), and I was able to write out a UTF-16LE file that I needed.
Now, our production database programmers are not SSIS experts, but they use it a fair bit in their workflows -- I would readily call all of them competent in their positions.
Thus, my question -- given the time it appears to take and the dependencies on the versions of SQL Server and Visual Studio, what advantage or benefits does an SSIS package bring that I may not see with my python code? Or a shell script, or Ruby or code-flavor-of-the-moment?
I am not an expert in SSIS by any means, but an average developer who has been working with SSIS for a little over three years. Like any other software, there are shortcomings with SSIS as well, but so far I have enjoyed working with it. Selection of technology depends on one's requirements and preferences. I am not going to say SSIS is superior to other technologies. Also, I have not worked with Python, Ruby or the other technologies that you have mentioned.
Here are my two cents. Please take this with a grain of salt.
From an average developer's point of view, SSIS is easy to use once you understand the nuances of how to handle it. I believe the same is true for any other technology. SSIS packages are visual workflows rather than a coding tool (of course, SSIS has excellent coding capabilities too). One can easily understand what is going on within a package by looking at the workflows instead of going through hundreds of lines of code.
SSIS is built mainly to perform ETL (Extract, Transform, Load) jobs. It is fine-tuned to handle that functionality really well, especially with SQL Server, and it can handle flat files, DB2, Oracle, and other data sources as well.
You can perform most of the tasks with minimal or no coding. It can load millions of rows from one data source to another within a few minutes. See this example demonstrating a package that loads a million rows from a tab-delimited file into SQL Server within 3 minutes.
Logging capabilities capture every action performed by the package and its tasks. This helps to pinpoint errors or track information about the actions performed by the package, and it requires no coding. See this example for logging (a sample query against the log table is sketched below, after these points).
Checkpoints capture the package execution like a recorder and assist in restarting the package from the point of failure instead of running it from the beginning.
Expressions can be used to determine the package flow depending on a given condition.
Package configurations can be set up for different environments using database or XML-based dtsconfig files or machine-based environment variables. See this example for environment-variable-based configuration. Logging, checkpoints, expressions, and package configurations are out-of-the-box features which require minor configuration and no coding at all.
SSIS can leverage the .NET framework's capabilities, and developers can create their own custom components if they can't find one that meets their requirements. See this example to understand how .NET coding can best be used along with different data sources. That example was created in less than 3 hours.
SSIS can use the same data source for multiple transformations without having to re-read the data. See this example to understand what Multicasting means. Here is an example of how XML data sources can be handled.
SSIS can also integrate with SSRS (Reporting Services) and SSAS (Analysis Services) easily.
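Picking up the logging point above: if you send SSIS logging to the SQL Server log provider, the events land in a table you can query directly. A rough sketch, assuming the default dbo.sysssislog table the 2008 log provider creates (sysdtslog90 on 2005), might be:

    -- Recent errors and warnings captured by the SSIS SQL Server log provider
    SELECT source, event, starttime, endtime, message
    FROM   dbo.sysssislog
    WHERE  event IN ('OnError', 'OnWarning')
    ORDER  BY starttime DESC;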
I have just listed very basic things that I have used in SSIS, but there are a lot of nice features. As I mentioned earlier, I am not sure if Python, Ruby or other languages can handle these tasks with such ease.
It all boils down to one's comfort with the technology. When a technology is new, people are very skeptical and unwilling to adopt it.
In my experience, once you understand and embrace SSIS it is really a nice technology to use. It works really well with SQL Server. I don't deny the fact that I faced obstacles during development of my packages but mostly found a way to overcome them.
This may not be the answer that you were expecting but I hope this gives an idea.

Load CMS core files from one central server onto multiple servers

I'm almost done with our custom CMS system. Now we want to install it for different websites (and more in the future), but every time I change the core files I will need to update each server/website separately.
What I really want is to load the core files from our server, so if I install a CMS I only define the needed config files (on that server) and the rest is loaded from our server. This way I can push changes to the core very simply, and only once.
How can I do this, or is this completely the wrong way? If so, what is the right way? What do I need to look out for? Is it secure (without paying thousands for an HTTPS connection)?
I have no idea how to start or where to begin, and I couldn't find anything helpful (maybe I searched for the wrong things), so anything is helpful!
Thanks in advance!
Note: My application is built using the Zend Framework
You can't load the required files remotely at runtime (or really don't want to ;). This comes down to proper release & configuration management, where you update all of your servers. But this can mostly be done automatically.
Depending on how much time you want to spend on this mechanism, there are some things to be aware of. The general idea is that you have one central server which holds the releases, and all other servers check it for updates, then download and install them. There are lots of possibilities like svn, archives, ..., and the check/update can be done manually at the frontend or by crons in the background. Usually you'll update all changed files except the config files and the database, as they can't simply be replaced but have to be modified in a certain way (this is where update scripts come into play).
This could look like this:
A cron job runs on the server and checks for updates via svn
If there is a new revision, it does an svn update
This is a very easy-to-implement mechanism, but it has some drawbacks, such as not being able to change the config files and the database. Well, in fact that would be possible, but quite difficult to achieve.
Maybe this could be easier with an archive-based solution:
A cron job checks the update server for a new version. This could be done by reading the contents of a file on the update server and comparing it to a local copy
If there is a new version, download the related archive
Unpack the archive and copy the files
With that approach you might be able to include update-scripts into updates to modify configs/databases.
Automatic update distribution is a very complex topic, and those are only two very simple approaches. There are probably many different solutions out there, and selecting the right one is not an easy task (it gets even more complex if you have different versions of a product with dependencies :), and there is no "this is the way it has to be done".

Code & data tracking / deployment

For a long time now, we've held our data within the project's repository. We just held everything under data/sql, and each table had its own create_tablename.sql and data_tablename.sql files.
We have now just deployed our 2nd project onto Scalr and we've realised it's a bit messy.
The way we deploy:
We have a "packageup" collection of scripts which tear apart the project into 3 archives (data, code, static files) which we then store in 3 separate buckets on S3.
Whenever a role starts up, it downloads one of the files (depending on the role: data, nfs or web) and then an "unpackage" script sets up everything for each role, loads the data into mysql, sets up the nfs, etc.
We do it like this because we don't want to save server images, we always start from vanilla instances onto which we install everything from scratch using various in-house built scripts. Startup time isn't an issue (we have a ready to use farm in 9 minutes).
The issue is that it's a pain trying to find the right version of the database whenever we try to setup a new development build (at any point in time, we've got about 4 dev builds for a project). Also, git is starting to choke once we go into production, as the sql files end up totalling around 500mb.
The question is:
How is everyone else managing databases? I've been looking for something that makes it easy to take data out of production into dev, and also migrating data from dev into production, but haven't stumbled upon anything.
You should seriously take a look at dbdeploy (dbdeploy.com). It is ported to many languages, the major ones being Java and PHP. It is integrated into build tools like Ant and Phing, and allows easy sharing of so-called delta files.
A delta file always consists of a deploy section, but can also contain an undo section. When you commit your delta file and another developer checks it out, he can just run dbdeploy and all new changes are automatically applied to his database.
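A delta file is just a plain SQL script with a marker separating the two sections. As a rough sketch (the table is hypothetical, and the exact undo-marker syntax can differ between dbdeploy ports, so check the docs for the one you use), a delta might look like this:

    -- 003_add_moderation_flag.sql  (deploy section: applied on update)
    ALTER TABLE customer_comments
        ADD COLUMN moderated TINYINT NOT NULL DEFAULT 0;

    --//@UNDO
    -- Undo section: lets dbdeploy roll the change back
    ALTER TABLE customer_comments
        DROP COLUMN moderated;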
I'm using dbdeploy for my open source blog, so you can take a look at how delta files are organized: http://site.svn.dasprids.de/trunk/sql/deltas/
As I understand it, your main question is about other people's experience migrating SQL data from dev into production.
I use Microsoft SQL Server instead of MySQL, so I am not sure you can use my experience directly. Nevertheless, this approach works very well.
I use Visual Studio 2010 Ultimate edition to compare data in two databases. The same feature also exists in Visual Studio Team Edition 2008 (or Database edition). You can read http://msdn.microsoft.com/en-us/library/dd193261.aspx to understand how it works. You can compare two databases (dev and prod) and generate an SQL script for modifying the data. You can easily exclude some tables or columns from the comparison, and you can examine the results and exclude some entries from the generated script. So one can easily and flexibly generate scripts which can be used for deploying the changes to the database. You can also compare the data of two databases separately from the structure (schema comparison). So you can refresh data in dev with the data from prod, or generate scripts which bring the prod database up to the latest version of the dev database. I recommend you look at these features and some of the products at http://www.red-gate.com/ (like http://www.red-gate.com/products/SQL_Compare/index.htm).
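To give a feel for what those generated data scripts boil down to, here is a hand-written, purely illustrative equivalent with made-up database and table names (the real tools produce far more careful output): a one-table sync from dev to prod expressed as a MERGE across two databases on the same server.

    -- Illustrative single-table data sync from dev to prod
    MERGE ProdDb.dbo.LookupStatus AS target
    USING DevDb.dbo.LookupStatus  AS source
          ON target.StatusId = source.StatusId
    WHEN MATCHED AND target.StatusName <> source.StatusName THEN
        UPDATE SET StatusName = source.StatusName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (StatusId, StatusName) VALUES (source.StatusId, source.StatusName)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;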
Check out Capistrano. It's a tool the Ruby community uses for deployment to different environments, and I find it really useful.
Also, if your deployment is starting to choke, try a tool Twitter built called Murder.
Personally I'd look at Toad
http://www.toadworld.com/
Less than 10k ;) ... will analyse database structures, produce scripts to modify them and also will migrate data.
One part of the solution is to capture the version of each of your code modules and their corresponding data resources in a single location, and compare them to ensure consistency. For example, an increment in the version number of your, say, customer_comments module will require a corresponding SQL delta file to upgrade the relevant DB tables to the same version number for the data.
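As a loose sketch of that idea (made-up names, roughly in the spirit of the Magento approach referenced below), a single version registry makes it easy to spot modules whose data schema is behind the deployed code:

    -- Hypothetical per-module version registry
    CREATE TABLE module_version (
        module_name  VARCHAR(64) NOT NULL PRIMARY KEY,
        code_version VARCHAR(16) NOT NULL, -- written by the deploy scripts from the code tree
        data_version VARCHAR(16) NOT NULL  -- bumped by each SQL delta as it is applied
    );

    -- Modules that still need an SQL delta applied after a deploy
    SELECT module_name, code_version, data_version
    FROM   module_version
    WHERE  data_version <> code_version;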
For an example, have a look at Magento's core_resource approach as documented by #AlanStorm.
Cheers,
JD

Is there any form of Version Control for LSL?

Is there any form of version control for Linden Scripting Language?
I can't see it being worth putting all the effort into programming something in Second Life if, when a database goes down over there, I lose all of my hard work.
Unfortunately there is no source control in-world. I would agree with giggy. I am currently moving my projects over to a Subversion (SVN) system to get them under control. Really should have done this a while ago.
There are many free & paid SVN services available on the net.
Just two free examples:
http://www.sourceforge.net
http://code.google.com
You also have the option to set one up locally so you have more control over it.
Do a search on here for 'subversion' or 'svn' to learn more about how to set one up.
[edit 5/18/09]
You added in a comment you want to backup entire objects. There are various programs to do that. One I came across in a quick Google search was: Second Inventory
I cannot recommend this or any other program as I have not used them. But that should give you a start.
[/edit]
-cb
You can use the Meerkat viewer to back up complete objects, or use some of the test programs of libopenmetaverse to back up in a text environment. I think you can back up scripts from the inventory with them.
Jon Brouchoud, an architect working in SL, developed an in-world collaborative versioning system called Wikitree. It's a visual SVN without the delta-differencing that occurs in typical source code control systems. He announced that it was being open sourced in http://archvirtual.com/2009/10/28/wiki-tree-goes-open-source/#.VQRqDeEyhzM
Check out the video in the blog post to see how it's used.
Can you save it to a file? If so then you can use just about anything, SVN, Git, VSS...
There is no good source control in game. I keep meticulous version information on the names of my scripts and I have a pile of old versions of things in folders.
I keep my source out of game for the most part and use SVN. LSLEditor is a decent app for working with the scripts, and if you create a solution with objects, it can emulate a lot of the in-game environment (giving objects, reading notecards, etc.).
I personally keep any code snippets that I feel are worth keeping around on github.com (http://github.com/cylence/slscripts).
Git is a very good source code manager for LSL since its commits work line by line, unlike other SCMs such as Subversion or CVS. The reason this is so crucial is that most Second Life scripts live in ONE FILE (since they can't call each other... grrr), so having the comparison done at the file level is not nearly as effective. Comparing line by line is perfect for LSL. With that said, GitHub also (like SourceForge and Google Code) allows you to make your code publicly viewable (if you so choose) and available for download in a compressed file for easier distribution.
Late reply, I know, but some things have changed in SecondLife, and some things, well, have not. Since the Third Party Viewer policy still keeps a hard wall up against saving and loading objects between viewer and system, I was thinking about another possibility so far completely overlooked: Bots!
Scripted agents, AKA bots, have all the usual avatar actions available to them. Although I have never seen one used as an object repository, there is no reason you couldn't create one. Logged in as a separate account, the agent can be wherever you want, automatically or on command, then collect any or all objects you are working on at set intervals or on command, and anything it has collected can be given to you or your collaborators.
I won't say it's easy to script an agent, and couldn't even speak for making an extension to a scripted agent myself, but if you don't want to start from scratch there is an extensive open source framework to build on, Corrade. Other bot services don't seem to list 'object repository' among their abilities either but any that support CasperVend must already provide the ability to receive items on request.
Of course the lo-fi route, just regularly taking a copy and sending the objects to a backup avatar, may still be a simple backup solution for one user, although that does necessitate logging in as the other account, either in parallel or once every 20 or so items, to be sure they are being received and not capped by the server. This process cannot rename the items or sort them automatically like a bot might. Identically named items are listed in inventory with the most recent at the top, but this is a mess when working with multiples of various items.
Finally, there is a Coalesce feature for managing several items as one in inventory. This is currently not supported for sending or receiving objects, but in the absence of a bot it can make it easier to keep track of projects you don't wish to actually link as one item. (Caveat: don't rez 'no-copy' coalesced items near 'no-build' land parcels; any that cannot be rezzed are completely lost.)

Which is the faster way to interact with SourceSafe? Command line or object model?

Our project is held in a SourceSafe database. We have an automated build, which runs every evening on a dedicated build machine. As part of our build process, we get the source and associated data for the installation from SourceSafe. This can take quite some time and makes up the bulk of the build process (which is otherwise dominated by the creation of installation files).
Currently, we use the command line tool, ss.exe, to interact with SourceSafe. The commands we use are for a recursive get of the project source and data, checkout of version files, check-in of updated version files, and labeling. However, I know that SourceSafe also supports an object model.
Does anyone have any experience with this object model?
Does it provide any advantages over using the command line tool that might be useful in our process?
Are there any disadvantages?
Would we gain any performance increase from using the object model over the command line?
I should imagine the command line is implemented internally with the same code as you'd find in the object model, so unless there's a large amount of startup required, it shouldn't make much of a difference.
The cost of rewriting to use the object model is probably more than would be saved in just leaving it go as it is. Unless you have a definite problem with the time taken, I doubt this will be much of a solution for you.
You could investigate shadow directories so the latest version is always available, so you don't have to perform a 'getlatest' every time, and you could ensure that you're talking to a local VSS (as all commands are performed directly on the filesystem, so WAN operations are tremendously expensive).
Otherwise, you're stuck unless you'd like to go with a different SCM (and I recommend SVN - there's an excellent converter available on CodePlex for it, with example code showing how to use the VSS and SVN object models)
VSS uses a mounted file system to share the database. When you get a file from SourceSafe, it works at the file system level, which means that instead of just sending you the file, it sends you all the disk blocks needed to find the file as well as the file itself. This adds up to a lot more transactions and extra data.
When using VSS over a remote or slow connection or with huge projects it can be pretty much unusable.
There is a product which, amongst other things, improves the speed of VSS by ~12 times when used over a network. It does this by implementing a client-server protocol. This can additionally be encrypted, which is useful when using VSS over the internet.
I don't work or have any connection with them I just used it in a previous company.
See SourceOffSite at www.sourcegear.com.
In answer to the only part of your question which seems to have any substance - no, switching to the object model will not be any quicker, as the "slowness" comes from the protocol used for sharing files between VSS and the database - see my other answer.
The product I mentioned works alongside VSS to address the problem you have. You still use VSS and have to have licences to use it... it just speeds things up where you need it.
Not sure why you marked me down?!
We've since upgraded our source control to Team Foundation Server. When we were using VSS, I noticed the same thing in the CruiseControl.Net build logs (caveat: I never researched what CC uses; I'm assuming the command line).
Based on my experience, I would say the problem is VSS. Our TFS is located over 1000 miles away and gets are faster than when the servers were separated by about 6 feet of ethernet cables.
Edit: To put on my business hat, the time spent waiting for builds plus the time spent trying to speed them up may be enough to warrant upgrading, or the VSS add-on mentioned in another post (already +1'd it). I wouldn't spend much of your time building a solution on VSS.
I'm betting running the Object Model will be slower by at least 2 hours.... ;-)
How is the command line tool used? You're not by chance calling the tool once per file?
It doesn't sound like it ('recursive get' pretty much implies you're not), but I thought I'd throw this thought in. Others may have similar problems to yours, and this seems frighteningly common with source control systems.
ClearCase at one client performed like a complete dog because the client's backend scripts did this. Each command line call created a connection, authenticated the user, got a file, and closed the connection. Tens of thousands of times. Oh, the dangers of a command line interface and a little bit of Perl.
With the API, you're very likely to properly hold the session open between actions.