Version control of software refactoring - version-control

What is the best way of doing version control of large scale refactoring?
My typical style of programming (actually of writing documents as well) is getting something out as quickly as possible and then refactoring it. Typically, refactoring takes place at the same time as adding other functionality. In addition to standard refactoring of classes and functions, functions may move from one file to another, files get split and merged or just reordered.
For the time being, I am using version control as a lone user, so there is no issue of interaction with other developers at this stage. Still, version control gives me two aspects:
Backup and ability to revert to a good version "in case".
Looking at the history tells me how the project progressed and the flow of ideas.
I am using mercurial on windows using TortoiseHg which enables selections of hunks to commit. The reason I mention this is that I would like advice on the granularity of a commit in refactoring. Should I split refactoring from functionality added always in committing?
I have looked at the answers of Refactoring and Source Control: How To? but it doesn't answer my question. That question focuses on collaboration with a team. This one concentrates on having a history that is understandable in future (assuming I don't rewrite history as some VCS seem to allow).

I suppose there is no one-size-fits-all answer to your question :)
Personally, I prefer to keep the finest sensible granularity in my commits; in your case I would split the work into two independent phases:
refactoring (and following commit)
new functionalities (and commit).
The best thing to do is to add and commit each item on its own: break the refactoring up into localized changes and commit them one by one, and add the functionalities one by one, committing them along the way.
There is a little more overhead, but this way, when you go back looking through the differences, it is clear what was changed for refactoring and what was changed to add new functionality. It is also easier to roll back only a particular problematic addition.
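For example, with Mercurial a session might look roughly like this (a minimal sketch; the file names and messages are invented, and the last command assumes the record extension, which, like TortoiseHg's hunk selection, lets you commit only some of the hunks in your working copy):
# commit the refactoring on its own, naming just the files it touched
hg commit -m "Refactor: extract price calculation into its own module" order.py pricing.py
# then commit the new functionality as a separate changeset
hg commit -m "Add volume-discount support to the price calculator" pricing.py test_pricing.py
# if refactoring and feature edits ended up in the same file, pick hunks interactively
hg record -m "Refactor only: rename helpers, no behaviour change"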

Should I split refactoring from functionality added always in committing?
I tend to check-in frequently; and each check-in is either refactoring, or new functionality. It's a cycle of:
Refactor existing code (without changing its functionality) to ready it to accept the new code
Add new code (which implements additional functionality).

I would recommend separating refactoring from adding functionality, perhaps by alternating check-ins. This comes from my experience after I discovered uncrustify and would reformat source files while also making code changes. It became very difficult to tell a real change from a mere reformat. Now uncrustify gets its own dedicated commits.

Having dealt with untangling the effects/bugs/side effects of VERY complicated refactoring combined with fairly extensive changes, I can very strongly advise you to always separate the two in your version history, as much as possible.
If there are any issues, you can VERY easily re-build the code from the tags/labels/versions pertaining to each stage and verify which of the two introduced the issue.
In addition, try to do refactoring in the smallest possible logically complete chunks, and commit those as separate checkpoints. Again, this simplifies investigating what broke, and why and when.

Every answer thus far has advised you to separate refactoring from adding functionality - and I +1'ed them all. You should do this independently of source control. Martin Fowler wrote a whole book around the concept that you can't refactor simultaneously with changing functionality. You want to know, for any change, whether the code should and does work the same before the change as after. And as @Amardeep points out, it's much harder to see what functional change you have made if it's hidden by formatting or refactoring changes, and thus much harder to track down bugs that functional changes introduced. I don't mean by this to discourage you from refactoring, or to postpone it. Do it, by all means, frequently. But do it separately from functional changes. Micro-commits are the way to go.

Take baby steps. Make the smallest useful change, test it, submit, and repeat.
One kind of change at a time. Don't refactor and change behavior at the same time.
Submit often. Small changes with clear, detailed descriptions are invaluable.
Make sure your automated tests are reliable and useful. If you can trust your tests, you can do the above easily and quickly.
Make sure your tests always pass.
Often I will start working on new functionality or a bug fix or whatever, only to discover that if I refactor things just so, the new functionality will be much easier to add. Usually I will discard (or save elsewhere) my changes so far, refactor/test/submit, then go back to working on the new functionality. Ideally I spend 90% of my time refactoring, and each new feature, bug fix, performance improvement, etc. is a simple, single-line change.
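With Mercurial, the "discard (or save elsewhere) my changes so far" step can be done with the shelve extension, roughly like this (a sketch; it assumes the extension is enabled, and git users would use git stash the same way):
# set the half-finished feature work aside
hg shelve --name new-feature
# do the enabling refactoring on a clean working copy, run the tests, commit
hg commit -m "Refactor: split report generation out of the controller"
# bring the feature work back and continue on top of the refactored code
hg unshelve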

Related

Interface Builder (XIB) or Code when merging in a team environment?

Merging Interface Builder files with others (and even with myself from a different computer) can be a real challenge. The XIB XML is certainly better than NIBs, but even as XML, I've found cases where merging and getting a consistent, valid XIB was harder than just taking the other version and manually redoing the changes.
I'm wondering what other folks are doing who have multiple people who can potentially collide on XIBs.
Was merging a consideration for going all code? Do you use XIBs just for layout and code the rest? Or, have you had any luck merging XIBs and over time you just get better at manually reading?
EDIT: My current approach is using it for strict layout (what it's really good at and painful to code) and setting all the options and data via code. I find code much easier to merge but laying out controls in code is tedious. Thoughts?
Was merging a consideration for going all code?
Yes, No, and "Portions Of". It depends on things like:
The people involved
The complexity of the UI
The quality of the implementation you need
The expected lifetime of the implementation
But yes, it has been, and it often is when the case is just not trivial -- Otherwise, you just fight it by decomposing XIBs into smaller pieces. That can work pretty well (or not), depending on what you are faced with.
Do you use XIBs just for layout and code the rest?
Depends on a lot of things.
XIB-only has its restrictions, and is much like code duplication. I use it at times for prototyping, other times because that's what somebody else favored.
"A little of both" can require a lot of glue. At times, it can be pretty disorganized -- e.g. "where's that action really set?". Of course, this can also be used to achieve what some would consider a good balance of XIB and programmatic separation. The simpler the XIB is, the less often it will need to be adjusted, and less likely it will cause merge conflicts.
Code-only is my preference, but there are people who just prefer WYSIWYG, and people who aren't very familiar with writing UIs programmatically. As well, if quality, reusability, and maintainability are not requirements (e.g. banging out a prototype), then code-only can be overkill.
Or, have you had any luck merging XIBs and over time you just get better at manually reading?
No real luck -- just by breaking them into smaller components. Unfortunately, the "Decompose Interface" option (from IB3) is not available in Xcode 4's editor.
I have found that IB is better for layout, as you mention, but probably that's just me because I was raised this way. Plus, code is way more reusable than layouts.
As far as I'm concerned, at runtime both act the same, though I'm not 100% sure about that. Prototypes are less painful in IB than in code, I know that for sure, and clients will not see any value in you prototyping in code.
What I do is not bother trying to merge: just accept the branch version completely, or your version. It means you have to be a bit disciplined about who gets to change the interface code, and about commits and updates; in practice we haven't found it a problem, but I guess it depends on your environment.
Don't use this as an excuse to stop using Interface Builder; there is nothing worse than trying to dig through someone else's code to find a button so you can work out what happens when a user clicks it.
Without Interface Builder you are not respecting the MVC separation.

How to work collaboratively with Matlab?

For a project we have to write a Matlab simulation and would like to split the work over several persons. As there are some non-professional programmers involved and we are dealing with a short project we want to keep it simple and use Dropbox, so no version management system involved.
What are possibilities to do this? How do we best split the functions? How do you split the program into several files?
Use version control so that you can keep track of who broke what, and commit at regular intervals so that there is a point to version control.
Design the program such that different people can work on it at the same time. Split it into several files which you can independently test for correctness. Have a professional programmer be responsible for the backbone (main function, class definition). Require consistent interfaces and documentation, so it's easy to stick it all together.
Talk to one another frequently. It doesn't have to be large formal meetings in many cases, just turning around and saying "hey, can you look at this?" is often enough. You all need to know who works on what, and where they stand, so that you know who to talk to in case there are questions. It's just so much faster to solve an issue by talking to the person involved rather than by trying to understand their code.
I would use version control - it saves lots of problems in the long run.
Git is good in that there is no central repository - and so everyone owns their own version.
This, in my experience, is liked by 'non-programmers', as they like to fiddle with (and break) their own version.
And git clone http://whatever, as a method of obtaining a distribution, is probably as easy as it gets.
And you will need to know when changes were made. For example: you find a bug and are not sure if you need to rerun the previous simulations or not (when was the bug introduced? - does it affect such and such a simulation?). Without version control finding bugs is a major stress because you cannot be sure of the answers to these questions.
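To make that last point concrete, here is roughly what answering "when was the bug introduced?" looks like with git (a sketch; simulate.m and the v0.3 tag are invented names):
# who changed this file, and when
git log --oneline -- simulate.m

# binary-search for the commit that introduced the bug
git bisect start
git bisect bad          # the current version gives wrong results
git bisect good v0.3    # this older, tagged version was known to be fine
# git now checks out commits in between; rerun the simulation at each step and
# mark each one with "git bisect good" or "git bisect bad" until it reports the
# first bad commit, then:
git bisect reset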

Reuse vs. maintainability and ease of testing

Everyone likes to talk about reusability. Where I work, whenever some new idea is being tossed around or tested out, the question of reusability always comes up. "We want to maximize our investment in this, let's make it reusable." "Reusability will bring higher quality with less work." And so on and so on.
What I've found is that when a reusable component or idea is introduced, everyone is immediately afraid of it and writes it off as a bad idea. Once applications become dependent on it, they say, it won't be maintainable, and any changes will result in the need to do regression testing on everything that uses it. People here point to one component in particular that has been around a long time and has a whole lot of dependents and grouse that it's become impossible to change because we don't know what the changes will break.
My responses to this complaint are:
It's good that change to a component that has many dependents is slow, because it forces the designers to really think through the changes. Time should be taken to get the component right in the first place. Corollary: if you're finding the need to change it all the time, it was never very reusable to begin with, was it?
Software development is hard and requires work. So does testing. You just gotta do it.
Unfortunately, what people hear in these responses are "slow," "time" and "effort."
I would love if there was a magic "make this reusable" switch I could flip on things I build so as to win brownie points from management, but things don't work that way. Making something reusable takes time and effort and you're still not guaranteed to get it right.
How do you deal with the request for "reusability" when delivering on it seems to bring nothing but complaints?
Reusability is only worthwhile if something will actually be reused. Make sure you have some practical reuse cases before you write something reusable.
Even if a reusable library is 10x harder to maintain than an ad-hoc version of itself, you're still saving on maintenance overall if the reusable library is used in place of ad-hoc versions in 10 different places.
Reusability means making code reusable in terms of similar behavior or an "IS-A" relationship. If you just want to reuse blocks of code because you see them appearing again and again, but they have no similar characteristics, you are better off leaving them alone and loosely coupled. That way, you keep more flexibility to modify them later.
One thing that we often do is to use versions and avoid the constant retesting. Just because there is a new version of common code doesn't mean everything has to use the new version right away. When something is getting updated for other reasons, update to the new version of the common code.
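One lightweight way to do that is to tag each release of the common code and let every dependent project pin the version it was actually tested against, upgrading on its own schedule. A sketch with git (the tag names and the submodule path are invented; the same idea works with any VCS that supports tags or labels):
# in the common library's repository: cut a versioned release
git tag -a common-lib-1.4 -m "Release 1.4 of the common library"
git push origin common-lib-1.4

# in a dependent project that pulls the library in as a submodule:
# stay on the release you last tested against...
cd libs/common-lib
git checkout common-lib-1.3
cd ../..
git commit -am "Pin common-lib at release 1.3"

# ...and move to 1.4 only when you are ready to retest
cd libs/common-lib && git fetch && git checkout common-lib-1.4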

Writing my own file versioning program

There is what seems to be a plethora of version control systems. Therefore, to draw a bad conclusion, it must be easy to write one.
What are some issues that must be considered in order to write a simple file versioning system? (What are the minimum necessary functions?)
Is it a feasible task for one person?
A good place to learn about version control is Eric Sink's Weblog. His most recent article is Time and Space Tradeoffs in Version Control Storage, for one example.
Another good example is his series of articles Source Control HOWTO. Yes, it's all about how to use source control, but it has a lot of information about the decisions and tradeoffs developers have to make when designing the system. The best example of this is probably his article on Repositories, where he explains different methods of storing versions. I really learned a lot from this series.
How simple?
You could arguably write a version control system with a single-line shell script, upversion.sh:
cp $WORKING_COPY $REPO/$(date +"%s")
For large binary assets, that is basically all you need! It could be improved quite easily, say by making the version folders read-only, or perhaps by recording metadata with each version (you could have a text file at $REPO/$(date...).meta, for example).
That sounds like a huge simplification, but it's not far off the asset-management systems many film post-production facilities use (for example).
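A slightly less minimal sketch of the same script, with the read-only folders and the metadata file added (the variable names and the .meta layout are just one possible convention, not a standard):
#!/bin/sh
# upversion.sh -- snapshot $WORKING_COPY into $REPO, one folder per version
STAMP=$(date +"%s")
DEST="$REPO/$STAMP"

cp -R "$WORKING_COPY" "$DEST"      # take the snapshot
chmod -R a-w "$DEST"               # make this version read-only

# record who made the version, when, and an optional note passed as $1
printf 'author: %s\ndate: %s\nnote: %s\n' "$USER" "$(date)" "$1" > "$REPO/$STAMP.meta"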
You really need to know what you wish to version, and why.
With large binary assets (video, say), you need to focus on tools to visually compare versions. You also probably need to deal with dependencies ("I need image123.jpg and video321.avi to generate this image").
With code, you need to focus on things like making diff's between any two versions really easy. Also since edits to source-code are usually small (a few characters from a project with many thousands of lines), it would be horribly inefficient to copy the entire project for each version - so you only store the differences between each version (delta encoding).
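At the shell level, the standard diff and patch tools are enough to sketch that idea: keep one full copy plus a chain of deltas, and rebuild any later version by replaying them (the directory names are illustrative):
# store only the differences between consecutive versions
mkdir -p deltas
diff -ruN version1 version2 > deltas/1-to-2.patch
diff -ruN version2 version3 > deltas/2-to-3.patch

# reconstruct version 3 from the full copy of version 1 plus the deltas
cp -R version1 rebuilt
patch -p1 -d rebuilt < deltas/1-to-2.patch
patch -p1 -d rebuilt < deltas/2-to-3.patch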
To version a database, you probably want to store information on the schema, tracking new tables, or columns, or adjustments to existing ones (rather than calculating deltas of the database files, or making copies like the previous two systems)
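In practice this usually means keeping an ordered set of migration scripts under version control and applying them in sequence; a sketch using SQLite (the file names and database are invented):
# migrations/001_create_users.sql, migrations/002_add_email_column.sql, ...
# zero-padded numbers keep the shell glob in application order
for m in migrations/*.sql; do
    echo "applying $m"
    sqlite3 project.db < "$m"
done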
There's no perfect way to version everything; you have to focus on doing one thing well. Git is great for text, but not for binary files. Adobe Version Cue is great with binary files (images), but useless for text.
I suppose the things to consider can be summarised as:
What do you want to version?
Why can I not use (or extend/modify) an existing system?
How will I track differences between versions? (entire files? deltas?)
What other data do I need to attach to versions? (Author? Time-stamp? Dependencies?)
What tasks would a user commonly need to do (diff'ing? reverting specific files?)
Have a look at the question "core concepts" about (D)VCS.
In short, writing a VCS would involve making decisions about each of these core concepts (central vs. distributed, linear vs. DAG, file-centric vs. repository-centric, ...).
Not a "quick" project, I believe ;)
If you're Linus Torvalds, you can write something like Git in a month.
But "a version control system" is such a vague and stretchable concept, that your question is really unanswerable.
I'd consider asking yourself what you want to achieve (learn about VCS, learn a language, ...) and then define some clear goal. It's good to have a project, but it's also good to have a reachable goal in a small amount of time. Small successes are good for your morale.
That IS really a bad conclusion. My personal opinion here is that the problem domain is so wide and generally hard that nobody has gotten it "right" yet, thus people try to solve it over and over again, from different angles and under different assumptions. That of course doesn't mean you shouldn't try. Just be warned that many smart people were there before you, so you should do your homework.
What could give you a good overview in a less technical manner is The Git Parable.
It is a nice abstraction of the principles of git, but it gives a very good understanding of what a VCS should be able to do. Everything beyond this is a rather "low-level" decision.
A good delta algorithm, good compression and network efficiency.
A simple one is doable by one person for a learning opportunity. One issue you might consider is how to efficiently store plain text deltas. A very popular delta format is the one from RCS (used by many version control programs). You might want to study it to get ideas.
To write a proof of concept, you probably could pull it off, implementing or borrowing the tools Alan mentions.
IMHO, the most important aspect of a VCS is ease of use. This sounds like an odd statement, but when you think about it, hard drive space is one of the easiest IT commodities to scale horizontally, so bad compression or even really sloppy deltas are going to be tolerated. The main reason people demand improvement in versioning systems is to do common tasks more intuitively or to support more features that droves of people eventually demand but that weren't obvious before release. And since versioning tools tend to be monolithic and thoroughly integrated at a company, the cost to switch is high, and it may not be possible to support a new feature without breaking an existing repo.
The very minimal necessary prerequisite is an exhaustive and accurate test suite. Nobody (including you) will want to use your new system unless you can demonstrate that it works, reliably and completely error free.

Getting your head around other people's code

I'm occasionally unfortunate enough to have to make alterations to very old, poorly documented, and poorly designed code.
It often takes a long time to make a simple change because there is not much structure to the existing code and I really have to read a lot of code before I have a feel for where things would be.
What I think would help a lot in cases like this is a tool that would allow one to visualise an overview of the code, and then maybe even drill down for more detail. I suspect such a tool would be very hard to get right, given that it is trying to find structure where there is little or none.
I guess this is not really a question, but rather a musing. I should make it into a question - what do others do to assist in getting their heads around other people's code, the good and the bad?
Hmm, this is a hard one, so much to say so little time ...
1) If you can run the code, it makes life soooo much easier; breakpoints (especially conditional ones) are your friend.
2) A purist's approach would be to write a few unit tests for known functionality, then refactor to improve code and understanding, then re-test. If things break, create more unit tests - repeat until bored/old/moved to new project.
3) ReSharper is good at showing where things are being used, what's calling a method for instance, it's static but a good start, and it helps with refactoring.
4) Many .NET events are coded as public, and events can be a pain to debug at the best of times. Recode them to be private and use a property with add/remove. You can then use a breakpoint to see what is listening on an event.
BTW - I'm playing in the .NET space and would love a tool to help do this kind of stuff; like Joel, does anyone out there know of a good dynamic code reviewing tool?
I have been asked to take ownership of some NASTY code in the past - both work and "play".
Most of the amateurs I took over code for had just sort of evolved the code to do what they needed over several iterations. It was always a giant incestuous mess of library A calling B, calling back into A, calling C, calling B, etc. A lot of the time they'd use threads and not a critical section was to be seen.
I found the best/only way to get a handle on the code was start at the OS entry point [main()] and build my own call stack diagram showing the call tree. You don't really need to build a full tree at the outset. Just trace through the section(s) you're working on at each stage and you'll get a good enough handle on things to be able to run with it.
To top it all off, use the biggest slice of dead tree you can find and a pen. Laying it all out in front of you so you don't have to jump back and forward on screens or pages makes life so much simpler.
EDIT: There's a lot of talk about coding standards... they will just make poor code look consistent with good code (and usually harder to spot). Coding standards don't always make maintaining code easier.
I do this on a regular basis. And have developed some tools and tricks.
Try to get a general overview (object diagram or other).
Document your findings.
Test your assumptions (especially for vague code).
The problem with this is that at most companies you are appreciated by results. That's why some programmers write poor code fast and move on to a different project. So you are left with the garbage, and your boss compares your sluggish progress with the quick-and-dirty guy's. (Luckily my current employer is different.)
I generally use UML sequence diagrams of various key ways that the component is used. I don't know of any tools that can generate them automatically, but many UML tools such as BoUML and EA Sparx can create classes/operations from source code which saves some typing.
The definitive text on this situation is Michael Feathers' Working Effectively with Legacy Code. As S. Lott says, get some unit tests in to establish the behaviour of the legacy code. Once you have those in, you can begin to refactor. There seems to be a sample chapter available on the Object Mentor website.
I strongly recommend BOUML. It's a free UML modelling tool, which:
is extremely fast (fastest UML tool ever created, check out benchmarks),
has rock solid C++ import support,
has great SVG export support, which is important, because viewing large graphs in vector format, which scales fast in e.g. Firefox, is very convenient (you can quickly switch between "birds eye" view and class detail view),
is full featured and intensively developed (look at the development history; it's hard to believe that such fast progress is possible).
So: import your code into BOUML and view it there, or export to SVG and view it in Firefox.
See Unit Testing Legacy ASP.NET Webforms Applications for advice on getting a grip on legacy apps via unit testing.
There are many similar questions and answers. Here's the search https://stackoverflow.com/search?q=unit+test+legacy
The point is that getting your head around legacy is probably easiest if you are writing unit tests for that legacy.
I haven't had great luck with tools to automate the review of poorly documented/executed code, because a confusing/badly designed program generally translates to a less than useful model. It's not exciting or immediately rewarding, but I've had the best results with picking a spot and following the program execution line by line, documenting and adding comments as I go, and refactoring where applicable.
A good IDE (Emacs or Eclipse) can help in many cases. Also, on a Unix platform there are some tools for cross-referencing (etags, ctags) and checking (lint), or gcc with many, many warning options turned on.
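For example, on a typical Unix box (a sketch; module.c and parse_config are placeholders):
# build an index of definitions so the editor can jump straight to them
ctags -R .            # tags file for vi-style editors (Exuberant Ctags)
etags *.c *.h         # TAGS file for Emacs, for a single directory

# let the compiler do some of the reading for you
gcc -Wall -Wextra -Wshadow -c module.c

# poor man's cross-reference: where is this function actually used?
grep -rn "parse_config(" .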
First, before trying to comprehend a function/method, I would refactor it a bit to fit your coding conventions (spaces, braces, indentation) and remove most of the comments if they seem to be wrong.
Then I would refactor and comment the parts I understood, and try to find/grep those parts over the whole source tree and refactor them there as well.
Over time, you get nicer code that you like to work with.
I personally do a lot of drawing of diagrams, and figuring out the bones of the structure.
The fad du jour (and possibly quite rightly) has got me writing unit tests to test my assertions and build up a safety net for changes I make to the system.
Once I get to a point where I'm comfortable enough knowing what the system does, I'll take a stab at fixing bugs in the sanest way possible, and hope my safety net has neared completion.
That's just me, however. ;)
I have actually been using the refactoring features of ReSharper to help me get a handle on a bunch of projects that I inherited recently. So, to figure out another programmer's very poorly structured, undocumented code, I actually start by refactoring it.
Cleaning up the code, renaming methods, classes, and namespaces properly, and extracting methods are all structural changes that can shed light on what a piece of code is supposed to do. It might sound counterintuitive to refactor code that you don't "know", but trust me, ReSharper really allows you to do this. Take for example the issue of red-herring dead code. You see a method in a class or perhaps a strangely named variable. You can start by trying to look up usages or, ugh, do a text search, but ReSharper will actually detect dead code and color it gray. As soon as you open a file, you see in gray and with scroll-bar flags what would in the past have been confusing red herrings.
There are dozens of other tricks and probably a number of other tools that can do similar things, but I am a ReSharper junkie.
Cheers.
Get to know the software intimately from a user's point of view. A lot can be learnt about the underlying structure by studying and interacting with the user interface(s).
Printouts
Whiteboards
Lots of notepaper
Lots of Starbucks
Being able to scribble all over the poor thing is the most useful method for me. Usually I turn up a lot of "huh, that's funny..." moments while trying to make basic code structure diagrams, and those turn out to be more useful than the diagrams themselves in the end. Automated tools are probably more helpful than I give them credit for, but the value of finding those funny bits exceeds the value of rapidly generated diagrams for me.
For diagrams, I look for mostly where the data is going. Where does it come in, where does it end up, and what does it go through on the way. Generally what happens to the data seems to give a good impression of the overall layout, and some bones to come back to if I'm rewriting.
When I'm working on legacy code, I don't attempt to understand the entire system. That would result in complexity overload and subsequent brain explosion.
Rather, I take one single feature of the system and try to understand completely how it works, from end to end. I will generally debug into the code, starting from the point in the UI code where I can find the specific functionality (since this is usually the only thing I'll be able to find at first). Then I will perform some action in the GUI, and drill down in the code all the way down into the database and then back up. This usually results in a complete understanding of at least one feature of the system, and sometimes gives insight into other parts of the system as well.
Once I understand what functions are being called and what stored procedures, tables, and views are involved, I then do a search through the code to find out what other parts of the application rely on these same functions/procs. This is how I find out if a change I'm going to make will break anything else in the system.
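That search step can be as simple as grepping the source tree for the names you just traced (a sketch; the function and procedure names are invented):
# which files call this function or stored procedure?
grep -rln "GetCustomerBalance" src/
grep -rln "usp_UpdateOrderStatus" src/ | sort > affected_files.txt

# count hits per file to see where the heaviest dependencies are
grep -rc "GetCustomerBalance" src/ | grep -v ":0$"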
It can also sometimes be useful to attempt to make diagrams of the database and/or code structure, but sometimes it's just so bad or so insanely complex that it's better to ignore the system as a whole and just focus on the part that you need to change.
My big problem is that I (currently) have very large systems to understand in a fairly short space of time (I pity contract developers on this point) and don't have a lot of experience doing this (having previously been fortunate enough to be the one designing from the ground up.)
One method I use is to try to understand the meaning of the naming of variables, methods, classes, etc. This is useful because it (hopefully increasingly) embeds a high-level view of a train of thought from an atomic level.
I say this because typically developers will name their elements (with what they believe are) meaningfully and providing insight into their intended function. This is flawed, admittedly, if the developer has a defective understanding of their program, the terminology or (often the case, imho) is trying to sound clever. How many developers have seen keywords or class names and only then looked up the term in the dictionary, for the first time?
It's all about the standards and coding rules your company is using.
If everyone codes in a different style, then it's hard to maintain another programmer's code, and so on. If you decide which standard you'll use and have some rules, everything will be fine :) Note that you don't have to make a lot of rules, because people should have the possibility to code in a style they like; otherwise you can be very surprised.