I want to separate binary files (media) from my code repositories. Is it worth it? If so, how can I manage them? - version-control

Our repositories are getting huge because of the amount of media they contain (hundreds of 1 MB JPEGs, hundreds of PDFs, etc.).
Developers who check out these repositories have to wait an abnormally long time on certain repos because of this.
Has anyone else had this dilemma before? Am I going about it the right way by separating code from media? Here are some issues/worries I had:
If I migrate these onto a media server, I'm afraid it might be a pain for the developer to use. Instead of making updates to one server, he/she will now have to update two servers when doing both programming-logic and media updates.
If I migrate these into a media server, I'll still have to revision control the media, no? So the developer would have to commit code updates and commit media updates.
How would the developer test locally? I could make my site use absolute URLs, e.g. src="http://media.domain.com/site/blah/image.gif", but this wouldn't work locally. I assume I'd have to change my site templating to decide whether it's local/development or production and, based on that, change the BASE_URL (something like the sketch after this list).
Is it worth all the trouble to do this? We deal with about 100-150 sites rather than a dozen or so major ones, so we have around 100-150 repositories. We won't have the time or resources to change existing sites; we can only implement this on brand new sites.
I would still have to keep the scripts that generate media (PDF generators) and the generated media in the code repository, right? It would be a huge pain to update all those PDF generators to POST files to external media servers, and an extra pain once caching is taken into account.
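For illustration, the kind of templating switch I mean is something like this rough sketch; the environment flag, function names and media domain are just placeholders, not an actual implementation.

    import os

    # Rough sketch: pick the media base URL from an environment flag.
    # APP_ENV, the local path and media.domain.com are placeholders.
    def media_base_url() -> str:
        env = os.environ.get("APP_ENV", "development")
        if env == "production":
            return "http://media.domain.com/site"
        # Locally, serve media straight out of the checkout.
        return "/static/media"

    def media_url(path: str) -> str:
        return f"{media_base_url()}/{path.lstrip('/')}"

    # In a template context: media_url("blah/image.gif")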
I'd appreciate any insight into the questions I have regarding managing media and code.

First, yes, separating media and generated content (like the generated PDFs) from source control is a good idea.
That is because of:
disk space and checkout time (as you describe in your question)
the lack of VCS features these kinds of files actually use (no diff, no merge; only labels and branches)
That said, any transition of this kind is costly to put in place.
You need to separate the release-management process (generating the right files in the right places) from the development process (getting the right material from one or two referentials to develop/update your projects).
Binaries fall generally into two categories:
non-generated binaries:
They are best kept in an artifact repository (like Nexus, for instance), under a label that matches the label used for the text sources in the VCS (see the sketch after this list)
generated binaries (like your PDFs):
Ideally, they shouldn't be kept in any repository, but only generated during the release-management phase in order to be deployed.
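For the non-generated binaries, a rough sketch of "kept under a matching label" could look like the following, assuming a generic HTTP artifact store; the server URL and repository layout are illustrative, not a description of Nexus's actual API (Nexus and similar tools ship their own clients).

    import subprocess
    import requests  # third-party: pip install requests

    # Illustrative sketch: publish a binary under the same label/tag as the
    # text sources, so sources and media can be matched release by release.
    def publish_artifact(file_path: str, name: str) -> None:
        # Reuse the current VCS tag as the artifact label.
        tag = subprocess.run(
            ["git", "describe", "--tags", "--abbrev=0"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        url = f"https://artifacts.example.com/media/{tag}/{name}"  # made-up layout
        with open(file_path, "rb") as fh:
            requests.put(url, data=fh, timeout=60).raise_for_status()

    # publish_artifact("assets/site-logo.jpg", "site-logo.jpg")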

Related

How to implement continuous migration for a large website?

I am working on a website of 3,000+ pages that is updated on a daily basis. It's already built on an open-source CMS. However, we cannot simply continue to apply hot fixes on a regular basis. We need to replace the entire system, and I anticipate needing to replace it again on a 1-2 year basis. We don't have the staff to work on a replacement system while the other is being worked on, as it results in duplicate effort. We also cannot have a "code freeze" while we work on the new site.
So, this amounts to changing the tire while driving. Or fixing the wings while flying. Or all sorts of analogies.
This brings me to a concept called "continuous migration." I read this article here: https://www.acquia.com/blog/dont-wait-migrate-drupal-continuous-migration
The writer's suggestion is to use a CDN like Fastly. The idea is that a CDN lets you switch between a legacy system and a new system on a per-URL basis. In theory, this sounds like an approach that would work. The article claims you can do this with Varnish, but that Fastly makes the job easier. I don't work much with Varnish, so I can't really verify that claim.
I also don't know whether this is a good idea or whether there are better alternatives. I looked at Fastly's pricing scheme and simply cannot translate it into a specific price point; these cryptic cloud-service pricing plans don't make sense to me. I don't know what kind of bandwidth the website uses, since another agency manages the website's servers.
Can someone help me understand whether using an online CDN would be better than using something like Varnish? Are there free or cheaper solutions? Can someone tell me what this amounts to, approximately, on a monthly or annual basis? Are there other, better ways to roll out a new website on a phased basis for a large site?
Thanks!
I don't think I have exact answers to your questions, but maybe my answer helps a little bit.
I don't think the CDN itself gives you the advantage; the advantage is that you have more than one system.
Changes to the code
In professional environments I'm used to having three different CMS installations. The first is the development system, usually on my PC. That system is used to develop extensions, fix bugs and so on, supported by unit tests. The code is committed to a revision control system (like SVN, CVS or Git). A continuous integration system checks the commits to the RCS. When a feature is implemented (or some bugs are fixed), a named tag is created. This tagged version is then installed on a test system where developers, customers and users can test the implementation. After a successful test, exactly this tagged version is installed on the production system.
At first sight this looks time consuming, but it isn't, because most of the steps can be automated. The biggest advantage is that the customer can test the change on a test system, and it is very unlikely that an error occurs only on your production system. (A precondition is that your systems are built on similar/identical environments.)
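To make the "install exactly this tagged version" step concrete, here is a minimal sketch assuming Git and a plain directory deployment; the tag name and target paths are placeholders, and a real pipeline would add error handling and permissions.

    import subprocess
    import tarfile
    from pathlib import Path

    # Minimal sketch: export one tagged revision and unpack it into a target
    # directory (e.g. the test system's or production system's docroot).
    def deploy_tag(tag: str, target: str) -> None:
        archive = Path(f"/tmp/{tag}.tar.gz")
        subprocess.run(
            ["git", "archive", "--format=tar.gz", "-o", str(archive), tag],
            check=True,
        )
        Path(target).mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive) as tf:
            tf.extractall(target)

    # deploy_tag("release-1.4.2", "/var/www/test-system")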
Changes to the content
If your code changes the way your content is processed, it is an advantage when your CMS has strong workflow support. Then you can easily add a step to your workflow which decides whether the content is old and has to be migrated for the current document.
This way you have a continuous migration of the content.
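As a sketch of such a workflow step (the schema-version field and the individual migration functions are invented for illustration, since the details depend entirely on your CMS):

    # Illustrative sketch of a "migrate if outdated" workflow step.
    # CURRENT_SCHEMA and the document dict layout are assumptions.
    CURRENT_SCHEMA = 3

    def migrate_if_needed(document: dict) -> dict:
        version = document.get("schema_version", 1)
        while version < CURRENT_SCHEMA:
            document = MIGRATIONS[version](document)
            version += 1
            document["schema_version"] = version
        return document

    def _v1_to_v2(doc: dict) -> dict:
        doc["body"] = doc.pop("content", "")  # example: a renamed field
        return doc

    def _v2_to_v3(doc: dict) -> dict:
        doc.setdefault("tags", [])            # example: a new required field
        return doc

    MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}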
HTH
Varnish is a cache rather than a CDN. It intercepts page requests and delivers a cached version if one exists.
A CDN will serve up contents (images, JS, other resources etc) from an off-server location, typically in the cloud.
Pricing for cloud-based solutions is often very cryptic, as it's quite complicated technology.
I would be careful with continuous migration. I've done both methods in the past (continuous and full migrations) and I have to say, continuous is a pain. It means double the admin time for everything, and assumes your requirements are the same at all points in time.
Unfortunately, I would say you're better off with a proper rebuild on a 1-2 year basis than with a continuous migration, but obviously you know your situation best.
I would suggest you maybe also consider a hybrid approach? Build yourself an export tool that keeps all of your content in a transferable format like CSV/XML/JSON so you can simply import it into a new system when ready. This means you can incorporate new build requests when you need them in a new system (what's the point of a new system if it does exactly the same as the old one?) and you get to keep all your content. Plus you don't need to build and maintain two CMSes all the time.
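As a sketch of what such an export tool could look like, assuming the content lives in a relational database (the table and column names are made up; adapt them to whatever your CMS actually stores):

    import json
    import sqlite3

    # Illustrative sketch: dump CMS content into a transferable JSON file.
    # The database path, table and column names are placeholders.
    def export_content(db_path: str, out_path: str) -> None:
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row
        rows = conn.execute("SELECT id, title, body, updated_at FROM pages")
        pages = [dict(row) for row in rows]
        with open(out_path, "w", encoding="utf-8") as fh:
            json.dump(pages, fh, indent=2, ensure_ascii=False)
        conn.close()

    # export_content("cms.sqlite", "content-export.json")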

Automatically store each compile on github?

My code is compiled by multiple people (multiple machines) multiple times per day. I would like to have each compile automatically uploaded to our Github account each time someone completes a compile. Basically, these compiled zips get sent to actual hardware via either flash drive or email or dropbox (any number of ways based on many conditions). Since the zip is always named the same, sometimes the old version is deleted on the device, sometimes stored in an /old directory. I would like to stop losing old versions and retain a central repository of each version stored chronologically. Github seems the perfect place for that.
I could of course ask each user to upload the finished zip that they created to a central location, but I would like for it to be an automatic process if possible. So - does Github offer a feature like that?
Github seems the perfect place for that.
Not really, since putting large binaries in a distributed repo (i.e., a repo that is cloned around in its entirety) is not a good idea.
To get a specific version of your binary, you would have to clone your "artifacts" repo from GitHub, before being able to select the right one to deploy.
And with the multiple deliveries, that repo would get bigger and bigger, making the clone longer.
However, if you have only one place to deploy, it is true that a git fetch would only get the new artifacts (incremental update).
But:
GitHub doesn't offer unlimited space (and again, that repo would grow rapidly with all the deliveries)
cleaning a repo (i.e., deleting old versions of binaries you no longer need) is hard.
So again, using GitHub or any other DVCS (distributed VCS) for delivery purposes doesn't seem adequate.
If you can set up a public artifact repository like Nexus, then you will be able to deliver as many binaries as you want, and you will be able to clean them up (delete them) easily.
GitHub has the concept of files attached to a repository (not actually stored in the repo itself; they're stored on S3), and there's an API for uploading files to it.
You could call that API as part of your build process.
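The upload mechanism has changed over the years; nowadays build artifacts are typically attached to a GitHub Release. A hedged sketch using the Releases asset-upload endpoint (OWNER, REPO and the release id are placeholders, and the token is expected in the environment):

    import os
    import requests  # third-party: pip install requests

    # Hedged sketch: attach a build zip to an existing GitHub Release.
    def upload_build(zip_path: str, release_id: int) -> None:
        token = os.environ["GITHUB_TOKEN"]
        name = os.path.basename(zip_path)
        url = (
            f"https://uploads.github.com/repos/OWNER/REPO/"
            f"releases/{release_id}/assets?name={name}"
        )
        with open(zip_path, "rb") as fh:
            resp = requests.post(
                url,
                headers={
                    "Authorization": f"token {token}",
                    "Content-Type": "application/zip",
                },
                data=fh,
            )
        resp.raise_for_status()

    # upload_build("firmware.zip", release_id=1234567)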
If you have a continuous integration server building your code after every commit, you should be able to get it to store the build products somewhere, but you might have to handle the integration yourself if you want them stored on GitHub (as opposed to on the CI server's hard disk).
While GitHub is perfect for collaborating on the sources, it is not meant for managing the build and its artifacts. You may eventually want to look at companies like CloudBees, which provide hosted build and integration environments that target exactly the workflow steps beyond source management. But those are mostly targeted towards Java development, which may or may not fit your needs.
Besides that, if you really only want a lot of time-stamped zip files from your builds accessible to a lot of people, wouldn't a good old-fashioned FTP server be enough for your needs?

Recommended DVCS mechanism for hosting many independent patches

I have a project just getting started at http://sourceforge.net/projects/iotabuildit/ (more details at http://sourceforge.net/p/iotabuildit/wiki/Home/) that is currently using Mercurial for revision control. And it seems like Mercurial and SourceForge almost have all the right features or elements to put together the collaboration mechanism I have in mind for this project, but I think I'm not quite there yet. I want people to be able to submit, discuss and vote on individual changes from a large number of individuals (more developers than a project would normally have). And I want it to be as easy as possible for users to participate in this. The thought right now is that people can clone the "free4all" fork, which is a clone of the base "code" repository, or they can create their own fork in their own SourceForge user project (SourceForge now provides a workspace for every user to host miscellaneous project-related content). Then they can clone that to their local repository (after downloading TortoiseHg or their preferred Mercurial client). Then they can make modifications, commit them, push them to the fork, and request a merge into the base "code" repository, at which point we can discuss/review the merge request. This all is still far too many steps, and more formal than I'd like.
I see there is such a thing as "shelving" in Mercurial, but I don't see how/if that is supported in the SourceForge repository. And there probably isn't a way to discuss shelved changes the way there is for merge requests.
I'm looking for any suggestions that would make this easier. Ideally, I would like users to be able to:
Specify any version that they would like to play, and have that requested version extracted from source control and hosted for the user to play at SourceForge (because the game can't be played locally due to security restrictions the Chrome browser properly applies to JavaScript code accessing image content in independent files)
Allow the user to download the requested version of the project for local editing (a C# version built from the same source is also playable locally, or Internet Explorer apparently ignores the security restriction, allowing local play in a browser)
Accept submitted modifications in a form that can be merged with any other compatible "branch" or version of the game that has been submitted/posted (ideally this would be very simple -- perhaps the user just uploads the whole set of files back to the server and the compare and patch/diff extraction is performed there)
Other players can see a list of available submitted patches and choose any set to play/test with, then discuss and vote on changes.
Clearly some of these requirements are very specific, and I will probably need to write some server side code if I want to reach the ideal goal. But I want to take the path of least resistance and use the technologies available if much of the functionality I need is already almost there. Or I'd like to see if I can get any closer than the process I outlined earlier without writing any server code. So what pieces will help me do this? Does Mercurial & SourceForge support storing and sharing shelved code in the way I would want? Is there something to this "Patch Queue" (that I see, but can't understand or get to work yet) that might help? Is there a way to extract a patch file from a given set of files compared to a specific revision in a repository (on the server side), without having the user download any Mercurial components?
It sounds like something you could do with Mercurial Queues (mq) patch queues. The patch queue can be its own, separately versioned repository, and people can use 'guards' to apply only the patches they want to try.
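A rough sketch of the guards idea, driven from Python for consistency with the other examples; the patch and guard names are invented, it must run inside the mq-managed working copy, and the mq extension has to be enabled in your hgrc:

    import subprocess

    # Illustrative sketch: use mq guards so a player applies only the
    # submitted patches they want to try.
    def hg(*args: str) -> None:
        subprocess.run(["hg", *args], check=True)

    def apply_patch_set(guard: str) -> None:
        hg("qpop", "--all")    # unapply everything first
        hg("qselect", guard)   # activate one guard, e.g. "physics-tweak"
        hg("qpush", "--all")   # apply only patches guarded with +<guard>

    # Marking a submitted patch with a guard (done once per patch):
    #   hg qguard physics-tweak.patch +physics-tweak
    # apply_patch_set("physics-tweak")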
But really it sounds even easier to use bitbucket or github, both of which have excellent patch-submission, review, and acceptance workflows built into them.

Distributed Version Control - Git & Mercurial... multiple sites

I'm looking for a best-practice scenario for managing multiple "sites" in Mercurial. I'm likely to have multiple sites in a web root, all of which are different but somewhat similar (as they are 'customizations' of a root app).
Should I
A) make a single repository of the wwwroot folder (catching all changes across all sites)
B) make EACH site's folder a different repository
The issue is that each site needs a distinct physical directory, due to vhost pointing for development and a current need to have "some" physical file differences across sites.
What's the best practice here? I'm leaning towards a separate repository for each directory, which will make tracking any branching and merging for that ONE site cleaner...
It depends on how your software is structured and how independent the different sites are. The best situation is when you can use your core code like a library, which lives in its own directory, and there is no need for the different sites to change even a single file of the core. Then you have the free choice of developing the core along with the different sites in a single repo, or separating the core from the sites. When the core and the different sites depend on each other, you very probably have to deal with all of them in a single repo.
Since, in my experience, development works better when the different parts are independent of each other, I strongly recommend turning the core stuff into something that can be included in the sites via directory inclusion.
The next point is how are the different sites developed. If they share lots of code, they can be developed as different branches. But there are two disadvantages of this scheme:
the different sites are normally not visible to the developer, since there is typically only one checked out
The developer has to take great care where changes are made, so that only the intended changes end up in other branches, not something that is specific to a single branch only
You might consider to move common parts of different sites into core if they share lots of common code.
Another situation is when they have nothing in common; then things are much easier. In that case you need to decide whether you want them to live in different repos, or as different directories in a single repo. When these different sites are somehow related to each other (say they all belong to the same company), it might be better to put them into a common repo, as different subdirectories. When they are unrelated to each other (every site belongs to a different customer, and changes on these sites are not made in sync with each other), then one repo per site is better.
When you take the one-repo-per-site approach, it might also be good to first create a template site, which includes the core component and basic configuration, and derive your site repos as clones from this template. Then, when you change something in the core that also affects the sites, you make these changes in the template and merge them afterwards into the site repos (just take care NOT to make the change in one of the site repos first, since when you merge from a site into the template you might pull site-specific stuff into the template, which you don't want there).
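A condensed sketch of that template-and-merge flow, with plain Mercurial commands driven from Python; the repository names and paths are invented:

    import subprocess

    # Illustrative sketch of the template-repo workflow: clone sites from a
    # template, then merge later template (core) changes into each site.
    def hg(repo: str, *args: str) -> None:
        subprocess.run(["hg", "-R", repo, *args], check=True)

    def create_site(template: str, site: str) -> None:
        subprocess.run(["hg", "clone", template, site], check=True)

    def merge_template_into_site(template: str, site: str) -> None:
        hg(site, "pull", template)   # fetch the template/core change
        hg(site, "merge")            # merge it into the site's code
        hg(site, "commit", "-m", "Merge core/template changes")

    # create_site("repos/site-template", "repos/site-acme")
    # ...commit core fixes in repos/site-template only, then:
    # merge_template_into_site("repos/site-template", "repos/site-acme")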
So I suggest
develop core as a single independent product
choose the correct development model for your sites
all in one repo, with branches, when there is a lot of code exchange going on between different sites
but better refactor the sites to not share code, since the branches approach has drawbacks
all in one repo, no branches but different folders, if there is no code exchange between different sites
one repo for each site if they are completely independent.
I think you should try Mercurial Queues with one repo. I.e.:
you store the "base" site in the repository
all site-specific changes are separated into a set of MQ patches (one patch per site, perhaps)
you can create "push-only" repos on the sites, add them to the [paths] section of the "working" repo, and push changes or use an export-copy technique
and after applying the site patch to the codebase you get ready-to-use code for each and every site
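A small sketch of the "push-only paths" part: register a named path once in the working repo's hgrc and push to it by name. The site names and URLs are placeholders.

    import subprocess
    from pathlib import Path

    # Illustrative sketch: add a per-site "push-only" path, then push the
    # shared codebase to it by name.
    def add_path(repo: str, name: str, url: str) -> None:
        hgrc = Path(repo) / ".hg" / "hgrc"
        with hgrc.open("a", encoding="utf-8") as fh:
            fh.write(f"\n[paths]\n{name} = {url}\n")

    def push_site(repo: str, name: str) -> None:
        subprocess.run(["hg", "-R", repo, "push", name], check=True)

    # add_path("work", "site-acme", "ssh://deploy@acme.example.com//srv/www/acme")
    # push_site("work", "site-acme")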

new to mercurial and VCS: shared code multi-server setup

In our small office we're setting up mercurial - our first time using a "real" version control system. We've got three servers - a live server, a staging server and a development server.
We've also got three relatively large web sites - one for visitors, one for users and an intranet site, for the office staff.
The three web sites share some code. (for instance - a php class library, some commonly used code snippets, etc.)
Before version control, we just used symbolic links to link to the shared libraries. For example: each site had a symbolic link to an "ObjectClasses" directory - any changes made to a file in ObjectClasses would be instantly available to all the sites. You'd just upload the changed file to staging and to live, and you were done.
But... Mercurial doesn't follow symbolic links. So I've set up a subrepository for the shared libraries in the three sites on the three servers (actually 'four' servers if you count the fact that there are two programmers with two separate clones of the repository on the development server).
So there are 12 working copies of the shared object library.
So here's the question:
Is there any way to simplify the above set up?
Here's an example of what our workflow will be and it seems too complicated - but maybe this is what it's like using version control and we just need to get used to it:
Programmer A makes a change to Object Foo in the subrepo in Site 1. He wants to make this available everywhere, so he commits it, then pushes it to the staging server. I set up hooks on the staging server to automatically propagate the changes to the three sites on the staging server, and again to the three sites on the live server. That takes care of the 6 working copies on the staging and live servers. So far, so good.
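(For concreteness, the kind of hook I mean is sketched below: a small script wired up as a changegroup hook on the staging repo that pulls and updates each site's copy of the shared library. All paths are placeholders and the details are simplified.)

    #!/usr/bin/env python3
    """Changegroup hook sketch: after a push to the shared-library repo on the
    staging server, pull-and-update every site's working copy of it.

    Wired up in the staging repo's .hg/hgrc:
        [hooks]
        changegroup.propagate = python3 /usr/local/bin/propagate.py
    """
    import subprocess

    # Working copies that embed the shared subrepo (placeholder paths).
    TARGETS = [
        "/var/www/staging/site1/ObjectClasses",
        "/var/www/staging/site2/ObjectClasses",
        "/var/www/staging/site3/ObjectClasses",
    ]

    def main() -> None:
        for target in TARGETS:
            subprocess.run(["hg", "-R", target, "pull", "--update"], check=True)

    if __name__ == "__main__":
        main()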
but what about the development server, where there may be work-in-progress on these files?
Programmer A now needs to manually pull the shared subrepo to Sites 2 and 3 on the development server. He also needs to tell Programmer B to manually pull the shared subrepo on Sites 1, 2 and 3 on his copy of the site on the development server. What if he's editing Object Foo on Site 1 and making different edits to Object Foo on Site 2? He's going to have to resolve two separate conflicts.
We make changes to the objects relatively frequently. This is going to drive us nuts. I really love the idea of version control - but after two weeks of wrestling with trying to find the best setup, the old sloppy way of having one copy of the shared files and calling out "hey - ya working on that file, I wanna make a change" is looking pretty good right now.
Is there really no simpler way to set this up?
Without more information about the specific web platform and technologies you're using (e.g., .NET, LAMP, ColdFusion, etc.), this answer may be inadequate, but let me take a stab nevertheless. First, if I understand you correctly, it's your working paradigm that's the problem. You're having developers make changes to files and then push them to three different sites. I suggest separating the development concerns altogether from the build/deploy concerns.
It sounds like you're using subrepositories in Mercurial to handle shared code--which is smart, by the way--so that's good. That takes care of sharing code across multiple projects. But rather than have each programmer pushing stuff to a given server after he updates it, have the programmers instead be pushing to some other "staging" repository. You could have one for each of your servers if you wish, though I think it probably makes more sense to keep all development in a single staging or "master" repository which is then used to build/deploy to your staging and/or live server.
If you wish to automate this process, there are a number of tools that can do this. I usually prefer NAnt with CruiseControl for build integration, but then my work is mostly .NET which makes it a great fit. If you can provide more specifics I can provide more details if you like, but I think the main problem for you to overcome is the way you're handling the workflow. Use Mercurial to keep multiple developers happy pulling/pushing from a single repository and then worry about deploying to your servers for testing as a separate step.