How to review a commit that moves large chunks of code around?

For refactoring, I'm literally copy-and-pasting to split a file in half.
Conceptually this commit is very simple, but it shows up on source control tools as a big chunk of red (for deletion) and two big chunks of green (for addition). There is no side-by-side view, no indication that the chunks are identical, making it very hard to review.
How to review a commit like this?

Shouldn’t it be obvious at a glance that the big chunks are identical?
What VCS are you using? What are you viewing the diffs with?
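If the project happens to be in Git (an assumption; the question doesn't say which VCS is in use), Git 2.15 and later can detect moved blocks and colour them differently from ordinary additions and deletions, which makes a pure code move obvious at a glance (the file name below is hypothetical):
# Show the commit with moved lines coloured separately from real changes
git show --color-moved <commit>
# The same option works for ordinary diffs
git diff --color-moved HEAD~1 HEAD
# After the split, trace which file each surviving line originally came from
git blame -C -C new_half.c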

Related

Best practice: To clean or not to clean old branches

I've worked on different teams; on one team people tend to delete old branches as soon as they merge them, while on another team branches stay around forever. What is the benefit of deleting or keeping old branches? Does it depend on what source control system we use? (In my case, SVN.)
The answer may depend on the version control system that you use. For example, if you use Git, you should not try to remove any branch, since the branching system and the way commit and push history is handled (which depends on branches) is quite different from SVN.
In general, however, I tend to keep the old branches and not delete them, and in the professional places where I have worked, they tend to keep the branches too. From my point of view, keeping a branch not only provides you with code history, but also:
Failed attempts history. You may later think about doing something that has failed before. If you keep the failed branch, you will be able to understand why it failed in the first place.
Good reusable code may exist in these branches. Sometimes, when the main stable branch ends up discarding a lot of code, good code developed specifically for a branch may end up in the trash too. However, some of this code may prove useful in other situations in later stages of development. So, why reinvent the wheel?
Spinoff projects. In big projects, branches sometimes contain features that did not make it into the final product. Some of these features may hold ideas that could form a standalone project by themselves.
Proof. Let's face it, in companies, especially big ones, there are managerial concerns that need to be taken into account while committing code. For example, while looking at code history, you can immediately see who has committed faulty or good code, and avoid misunderstandings. I know it sounds cynical, but sometimes it saves people a lot of trouble.
In general, it's history. Why delete branches that remind you of the paths development has followed up until now? I doubt it will have a significant impact on disk space (in most cases, at least; where it does, companies should take care of the space problem before it actually becomes a concern). Branches represent thousands of man-hours of work. Deleting them is like throwing that time away.
As for discarding branches, I cannot think of any reason other than saving space.
Simple... you can backtrack as far as you want, if you have them.
In my case it is also SVN. I usually archive branches under different tags and move them to a different folder, so there is always one hot folder (Live) with the parallel dev branches; once a merge is completed, the branch gets archived.
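In SVN, that archiving step can be done entirely on the server side, so the branch history is preserved (the branch name and the archive path below are hypothetical):
# One-time setup of an archive area in the repository
svn mkdir ^/archive -m "Create archive area for merged branches"
# After the merge is completed, move the branch out of the hot folder
svn move ^/branches/feature-login ^/archive/feature-login -m "Archive feature-login after merging into Live"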

How can revisions/version control be implemented for web apps' data

I believe WordPress stores multiple copies of posts as "revisions", but isn't that a terribly inefficient use of space?
Is there a better way? I think gitit is a wiki that uses Git for version control, but how is that done? E.g., my application is in PHP, so must I make it talk to Git to commit and retrieve data?
So, what is a good way of implementing version control for web app data (e.g., in a blog it might be the post content)?
I've recently implemented just such a system, using the concept of superseded records together with previous and current links. I did a considerable amount of research into how best to achieve this; in the end, the model I arrived at is similar to WordPress (and other systems): store each change as a new record and use the most recent one.
Considering all of the options available, space is really the last concern for authored content such as posts - media files take up way more space and these can't be stored as deltas anyway.
In any case, the way Git works is virtually identical, in that it stores the entire content for every revision and only eventually packs it down into deltas (or does so when you ask it to).
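You can watch that happen in any Git repository with a couple of standard commands: freshly committed revisions sit as full "loose" objects until Git repacks them into delta-compressed packfiles:
# Count loose objects and their size on disk
git count-objects -v
# Repacking rewrites loose objects as deltas inside a packfile
git gc
git count-objects -v   # the packed size is typically much smaller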
Going back to 1990, we were using SCCS or RCS, and with sometimes only 30 MB of disk space free we really needed the version control to be space-efficient to avoid running out of storage.
Using deltas to save space is not really worth all of the associated aggravation, given the average amount of storage available on modern systems. You could argue it's wasteful of space; however, I'd argue that it is much more efficient in the long run to store things uncompressed, in their original form:
it's faster
it's easier to search through old versions
it's quicker to view
it's easier to jump into the middle of a set of changes without having to process a lot of deltas.
it's a lot easier to implement because you don't have to write delta generation algorithms.
Also, markup doesn't fare as well with deltas as plain text does, especially when it is edited with a WYSIWYG editor.
Keep one table with the most recent version of, for example, the article.
When a new version is saved, move the current one into an archive table and put a version number on it, while keeping the most recent version in the first table.
The archive table can have the property ROW_FORMAT=COMPRESSED (a MySQL InnoDB example) to take up less space, and it won't be a performance issue since it is rarely accessed. Yes, there is some overhead in not storing only changesets, but if you do the math you can keep a huge number of revisions in almost no space, as your articles are highly compressible text anyway.
As an example, the source code of this entire page is 11 KB compressed. That gives you almost 100 versions per MB. Normal articles are quite a bit smaller and may on average give you 500-1000 articles/versions per MB. You can probably afford that.
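A minimal sketch of that two-table layout, driven through the mysql command-line client (the database, table and column names are made up for illustration; the only detail taken from the answer above is the compressed InnoDB archive table):
mysql blogdb <<'SQL'
CREATE TABLE article (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    body       MEDIUMTEXT   NOT NULL,
    updated_at DATETIME     NOT NULL
) ENGINE=InnoDB;
-- Old versions are copied here on every save; compressed, and rarely read
CREATE TABLE article_archive (
    article_id  INT UNSIGNED NOT NULL,
    version     INT UNSIGNED NOT NULL,
    title       VARCHAR(255) NOT NULL,
    body        MEDIUMTEXT   NOT NULL,
    archived_at DATETIME     NOT NULL,
    PRIMARY KEY (article_id, version)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
SQL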

Is a VCS appropriate for usage by a designer?

I know that a VCS is absolutely critical for a developer to increase productivity and protect the code, no doubt about it. But what about a designer using, say, Photoshop (though this isn't specific to any tool; it's just to make my point clearer)?
VCSs use delta compression to store different versions of files. This works very well for code, but it's a problem for images. Raster image files are binary formats, while vector image files are text (SVG comes to mind) and pose no problem. The problem comes with .psd files (and any other image "source" file): those can get pretty big, and since I'm not familiar with the format, I'll consider them binary files. How would a VCS work under these conditions?
The repository could be pretty darned big if the VCS server isn't able to diff the files efficiently (or, worse, at all), and over time this can become a really big pain when someone needs to check out the repository (or clone it, if using a DVCS).
Have any of you used a VCS for this purpose? How well does it work? I'm mostly interested in Mercurial, though this is a general situation that applies to any VCS.
Designers usually use specialized tools like AlienBrain, Adobe VersionCue or similar, which are essentially version control systems that understand images and other media assets and allow things like diffing two images.
Designers, IMHO, should definitely use VCS systems, at least as a means of versioning and backup - their stuff is just as important as specs, documentation, code, deploy scripts and everything else that makes up a project.
I do not know whether there are bridges between "asset management systems" like the ones mentioned and developer VCSs, though.
Version control systems are useful for ANYONE doing work that they might need an older version of at some later date. That said, I have set up all my creative friends with Subversion (in the past) and now I recommend Git - even those who are doing video editing with hundreds of gigs of video. They can archive off the projects when they get final payment. Drive space is CHEAP, cheaper than ever before; size isn't an issue in any modern VCS. Being able to revert to a previous working state, or to experiment with something without losing data or manually managing multiple "temp" directories, is invaluable if you bill by the hour.
Yes
Don't worry about the size, if you run out of space, just buy a larger hard drive.
Losing information will be far costlier.
In addition to a VCS (any will do, as you won't be needing delta storage), do regular backups.
When doing a checkout, you shouldn't check out the root of the repository but rather the specific branch for your project; that way it won't be slower than a simple copy of that folder.
I definitely recommend using version control for any type of file you care about or can't afford to lose. Disk space is cheap, and as has already been pointed out, it'd be far worse to lose a bunch of important files than to spend a few extra bucks on a new HDD. I recommend Subversion since it has file locking, an important feature when keeping binary files under version control, to prevent ugly or impossible merge conflicts.
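If you do go with Subversion, the locking workflow looks like this (the file path is hypothetical; svn:needs-lock and svn lock are standard Subversion features):
# Make PSD files read-only in working copies until a lock is taken
svn propset svn:needs-lock yes design/homepage-mockup.psd
svn commit -m "Require a lock before editing the mockup"
# Grab the lock before editing so nobody else can commit over you
svn lock design/homepage-mockup.psd -m "Reworking the header"
# Committing the change releases the lock again (use --no-unlock to keep it)
svn commit -m "New header treatment" design/homepage-mockup.psd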
I believe so. Especially if you wish to track changes over time or need to rollback to previous versions. Centralized source control may be the way to go if you're worried about the size.

Keeping lots of localized content in sync

If you localize for a number of languages and have lots of content, how do you keep any changes in sync? For example, consider a bird app that lists various birds (200 of them). If you have chosen to localize for five languages, that means you need 1000 localization files. Not only is it a lot of initial translating, but it is also very time-consuming to keep up if any of the bird entries change.
Is there a better way to sync everything and do the initial translations?
This sounds more like a process management issue than a tools issue. At the heart of the problem is the fact that the strings files are language centric, but the translations you're managing are data centric. Tools can help maintain data integrity, but ultimately it'll be fixed with process.
The process you're looking for is either a mechanism to keep them in sync or a way to mark the other four as out-of-sync. The best process will be simple to implement, easy to understand, will make out-of-date entries obvious, and will lend itself to an automated check. Getting all four -- simple, easy, obvious, automated -- is not an easy task.
There are a few ways you and your team can handle this.
One is to not allow a commit that doesn't change all five strings files. A commit-hook script can enforce this in the repository (a sketch of such a hook appears at the end of this answer). This falls apart when there is no change needed to the other translations, such as when fixing the spelling or grammar in one language.
Another is to file four follow-up bugs whenever one translation is changed and make sure they get done in a timely manner. This falls apart when two or more changes are made to one entry before the first set of four follow-up bugs are closed. Either the translators will get annoyed with extra bugs they have to triage or, worse yet, they'll address the bugs out of order and in the end set the translation to the first bug's version of the entry, not the most recent version.
A third is to only make changes to one language. After the code for the next release is finished, run a diff against the last release (e.g., svn diff -r <last release>) and use that output as a list of translations to complete on Translation Day before cutting the new release. Personally, I don't think this method makes the out-of-date translations sufficiently obvious. It's too easy to cut a release without updating the translations and simply not notice they were forgotten.
A fourth option that will be more obvious is to prefix the translation in the other four with "REDO:" whenever a change is made to one. Before cutting a release, search for and clean out the REDO entries. This method carries two risks: REDO labels may be forgotten on a commit, or a release could be cut with embarrassing REDO labels still in the strings files.
For all of these, the pre-commit peer review should check the chosen process was followed. "Many Eyes Verify," or so they say.
I'm sure there are other ways and there is no clear "right answer" here. It would be good to have a team discussion to determine the best method for the team.
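As a concrete illustration of the first option, here is a minimal pre-commit hook sketch, assuming Git and five hypothetical locale directories (a Subversion pre-commit hook could perform the same check on the transaction's changed paths):
#!/bin/sh
# Hypothetical .git/hooks/pre-commit: if any Localizable.strings file is staged,
# insist that all five language variants are staged in the same commit.
locales="en de es fr ja"
staged=$(git diff --cached --name-only)
echo "$staged" | grep -q 'Localizable\.strings' || exit 0
for loc in $locales; do
    if ! echo "$staged" | grep -q "$loc\.lproj/Localizable\.strings"; then
        echo "error: $loc.lproj/Localizable.strings was not updated in this commit" >&2
        exit 1
    fi
done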
Wouldn't you need only 5 files? What technologies are you using?
Like gmagana said, you should only need 5 files (one Localizeable.strings for each language).
See the Apple guide "International Programming Topics," specifically the section called "Strings Files."

DVCS and data loss?

After almost two years of using DVCS, it seems that one inherent "flaw" is accidental data loss: I have lost code that wasn't pushed, and I know other people who have as well.
I can see a few reasons for this: off-site data duplication (i.e., "commits have to go to a remote host") is not built in, the repository lives in the same directory as the code, and the notion of "hack 'till you've got something to release" is prevalent... But that's beside the point.
I'm curious to know: have you experienced DVCS-related data loss? Or have you been using DVCS without trouble? And, related, apart from "remember to push often", is there anything which can be done to minimize the risk?
I have lost data from a DVCS, both because of removing the tree along with the repository (not remembering it had important information), and because of mistakes in using the DVCS command line (git, in the specific case): some operation that was meant to revert a change that I made actually deleted a number of already-committed revisions from the repository.
I've lost more data from clobbering uncommitted changes in a centralized VCS, and then deciding that I actually wanted them, than from anything I've done with a DVCS. Part of that is that I've been using CVS for almost a decade and git for under a year, so I've had a lot more opportunities to get into trouble with the centralized model, but differences in the properties of the workflow between the two models are also major contributing factors.
Interestingly, most of the reasons for this boil down to "BECAUSE it's easier to discard data, I'm more likely to keep it until I'm sure I don't want it". (The only difference between discarding data and losing it is that you meant to discard it.) The biggest contributing factor is probably a quirk of my workflow habits - my "working copy" when I'm using a DVCS is often several different copies spread out over multiple computers, so corruption or loss in a single repo or even catastrophic data loss on the computer I've been working on is less likely to destroy the only copy of the data. (Being able to do this is a big win of the distributed model over centralized ones - when every commit becomes a permanent part of the repository, the psychological barrier to copying tentative changes around is a lot higher.)
As for minimizing the risks, it's possible to develop habits that reduce them, but you have to actually develop those habits. Two general principles there:
Data doesn't exist until there are multiple copies of it in different places. There are workflow habits that will give you multiple copies for free - f'rexample, if you work in two different places, you'll have a reason to push to a common location at the end of every work session, even if it's not ready to publish.
Don't try to do anything clever, stupid, or beyond your comfort zone with the only reference to a commit you might want to keep. Create a temporary tag that you can revert to, or create a temporary branch to do the operations on. (git's reflog lets you recover old references after the fact; I'd be unsurprised if other DVCSs have similar functionality. So manual tagging may not be necessary, but it's often more convenient anyways.)
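In Git terms, the second principle can be as simple as this (the tag and branch names are hypothetical):
# Keep a throwaway reference to the current commit before doing anything risky
git tag tmp-before-rebase
git rebase -i origin/main          # the risky operation
# If it goes wrong, the old tip is still reachable by name
git reset --hard tmp-before-rebase
# Even without the tag, the reflog remembers where HEAD and branches used to point
git reflog
git reset --hard 'HEAD@{2}'        # pick the entry from before the damage
git tag -d tmp-before-rebase       # clean up once you're happy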
There is an inherent tension between being distributed and making sure everything is "saved" (with the underlying assumption that saved means being backed up somewhere else).
IMO, this is only a real problem if you work on several computers at the same time on the same line of work (or, more exactly, with several repositories: I often need to share changes between several VMs on the same computer, for example). In this case, a "centralized" workflow would be ideal: you would set up a temporary server and, on some given branches, use a centralized workflow. None of the current DVCSs I know of (git/bzr/hg) support this well. That would be a good feature to have, though.
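That said, you can approximate it today with a bare repository in a shared location acting as the "temporary server" (the paths and branch name are hypothetical; Git assumed):
# One-time: create a bare repository on a shared drive or a small server
git init --bare /mnt/shared/work-sync.git
# On each machine or VM that works on this line of development:
git remote add sync /mnt/shared/work-sync.git
git push sync my-feature      # at the end of every work session
git pull sync my-feature      # when picking the work up somewhere else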