Writing my own file versioning program

There seems to be a plethora of version control systems. Therefore, to draw a bad conclusion, it must be easy to write one.
What are some issues that must be considered in order to write a simple file versioning system? (What are the minimum necessary functions?)
Is it a feasible task for one person?

A good place to learn about version control is Eric Sink's Weblog. One example is his article Time and Space Tradeoffs in Version Control Storage.
Another good example is his series of articles Source Control HOWTO. Yes, it's all about how to use source control, but it has a lot of information about the decisions and tradeoffs developers have to make when designing the system. The best example of this is probably his article on Repositories, where he explains different methods of storing versions. I really learned a lot from this series.

How simple?
You could arguably write a version control system with a single-line shell script, upversion.sh:
cp -r "$WORKING_COPY" "$REPO/$(date +"%s")"
For large binary assets, that is basically all you need! It could be improved quite easily, say by making the version folders read-only, and perhaps by recording metadata with each version (you could have a text file at $REPO/$(date...).meta, for example).
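For instance, a slightly elaborated upversion.sh might look like the following; this is only a sketch, still assuming the same $WORKING_COPY and $REPO variables as above:
#!/bin/sh -e
# snapshot the working copy into a timestamped version folder
VERSION="$REPO/$(date +%s)"
cp -r "$WORKING_COPY" "$VERSION"
# make the version read-only so it can't be clobbered later
chmod -R a-w "$VERSION"
# record simple metadata alongside the version
echo "author: $USER" > "$VERSION.meta"
echo "date: $(date)" >> "$VERSION.meta"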
That sounds like a huge simplification, but it's not far off the asset-management systems many film post-production facilities use (for example).
You really need to know what you wish to version, and why.
With large binary assets (video, say), you need to focus on tools to visually compare versions. You also probably need to deal with dependencies ("I need image123.jpg and video321.avi to generate this image").
With code, you need to focus on things like making diffs between any two versions really easy. Also, since edits to source code are usually small (a few characters in a project with many thousands of lines), it would be horribly inefficient to copy the entire project for each version - so you store only the differences between each version (delta encoding).
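A minimal sketch of that delta-storage idea, using the standard diff and patch tools (the file layout here is made up):
# store version 2 as a delta against version 1, instead of a full copy
diff -u repo/v1/main.c working/main.c > repo/v2/main.c.delta
# reconstruct version 2 later: start from version 1 and apply the delta
cp repo/v1/main.c /tmp/main.c
patch /tmp/main.c < repo/v2/main.c.delta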
To version a database, you probably want to store information on the schema, tracking new tables or columns and adjustments to existing ones (rather than calculating deltas of the database files, or making copies as in the previous two systems).
There's no perfect way to version everything; you have to focus on doing one thing well. Git is great for text, but not for binary files. Adobe Version Cue is great with binary files (images), but useless for text.
I suppose the things to consider can be summarised as:
What do you want to version?
Why can I not use (or extend/modify) an existing system?
How will I track differences between versions? (entire files? deltas?)
What other data do I need to attach to versions? (Author? Time-stamp? Dependencies?)
What tasks would a user commonly need to do? (Diffing? Reverting specific files?)

Have a look at the question about the "core concepts" of (D)VCS.
In short, writing a VCS would involve making a decision about each of these core concepts (central vs. distributed, linear vs. DAG, file-centric vs. repository-centric, ...).
Not a "quick" project, I believe ;)

If you're Linus Torvalds, you can write something like Git in a month.
But "a version control system" is such a vague and stretchable concept, that your question is really unanswerable.
I'd consider asking yourself what you want to achieve (learn about VCS, learn a language, ...) and then define some clear goal. It's good to have a project, but it's also good to have a reachable goal in a small amount of time. Small successes are good for your morale.

That IS really a bad conclusion. My personal opinion here is that the problem domain is so wide and generally hard that nobody has gotten it "right" yet; thus people try to solve it over and over again, from different angles and under different assumptions. That of course doesn't mean you shouldn't try. Just be warned that many smart people were there before you, so you should do your homework.

What could give you a good overview in a less technical manner is The Git Parable.
It is a nice abstraction of the principles of git, and it gives a very good understanding of what a VCS should be able to do. Everything beyond this is a rather "low-level" decision.

A good delta algorithm, good compression, and network efficiency.

A simple one is doable by one person as a learning opportunity. One issue you might consider is how to efficiently store plain-text deltas. A very popular delta format is the one from RCS (used by many version control programs). You might want to study it to get ideas.
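As it happens, GNU diff can emit an RCS-style delta directly with the -n (--rcs) option, which makes the format easy to experiment with; the sample output below is illustrative:
# produce an RCS-format delta between two versions of a file
diff -n old/notes.txt new/notes.txt
# sample output: 'd2 1' deletes 1 line starting at line 2,
# 'a3 2' appends the following 2 lines after line 3:
#   d2 1
#   a3 2
#   first appended line
#   second appended line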

You could probably pull off a proof of concept, implementing or borrowing the tools Alan mentions.
IMHO, the most important aspect of a VCS is ease of use. This sounds like an odd statement, but when you think about it, hard drive space is one of the easiest IT commodities to scale horizontally, so bad compression or even really sloppy deltas will be tolerated. The main reason people demand improvement in versioning systems is to do common tasks more intuitively, or to support more features that droves of people eventually demand but that weren't obvious before release. And since versioning tools tend to be monolithic and thoroughly integrated at a company, the cost to switch is high, and it may not be possible to support a new feature without breaking an existing repo.

The very minimal necessary prerequisite is an exhaustive and accurate test suite. Nobody (including you) will want to use your new system unless you can demonstrate that it works, reliably and completely error free.
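For instance, even a tiny end-to-end smoke test along these lines builds confidence (myvcs is a hypothetical command standing in for your system):
#!/bin/sh -e
# smoke test for a hypothetical 'myvcs' tool: round-trip a simple edit
mkdir -p /tmp/vcs-smoke && cd /tmp/vcs-smoke
myvcs init
echo "version one" > file.txt
myvcs commit -m "first version"
echo "version two" > file.txt
myvcs commit -m "second version"
# restore the first version and verify the contents survived the round trip
myvcs checkout 1 file.txt
grep -q "version one" file.txt && echo "PASS" || echo "FAIL"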

Related

Is it better to have 2 simple versions of a piece of software or 1 complicated version?

This is a simple question. Is it better to have 2 versions of code or one? One would be an engineering version with lots of functionality that is not V&V tested; the other would be a stripped-down customer version that is highly regulated and highly tested.
I have a simple problem: my code base is ballooning as new engineering features are demanded. However, the core customer code is extremely simple. The engineering portions sometimes cause unexpected bugs in the customer code. Also, every engineering bug has to be debugged and treated the same as a customer one, which leads to longer lead times for releases.
I see both good and bad in splitting. First, it would allow engineering functions to be quickly made and released with no need to V&V test the software. The second advantage would be fewer and higher-quality releases of customer-facing software. However, it would mean 2 (smaller) codebases to maintain.
What is the common solution to this problem? Is there a limit at which you decide it is time to break into two versions? Or should I just learn even more in-depth organizational techniques? I already try my best to follow the best organizational tips for the software and follow OOP best practices.
As of now, I would say my code base is about 50% customer software and 50% engineering functionality. However, I just got a new (large) engineering/manufacturing project to add to the software.
Any experience would be appreciated.
TL;DR: Split it.
You can use either of the models that you mention. It's about how you see it grow in the future and how it is presently set up.
If you decide to go the one-large-code-base route, I would suggest that you use the MVC architecture, as it effectively sorts one large code base into smaller, more manageable parts. It provides a layer of abstraction for the engineers, where they can easily narrow an issue down to the team that owns it. This allows one team not to have to worry about how another implemented a feature; i.e., the graphics team does not need to worry about how the data is stored on your servers.
If you decide to go with the multiple code bases, you add an additional layer of "difficulty". You now have to maintain "twice" as much code. You will also need to make sure that the two sources play nice with each other. The benefit of doing it this way would be that you completely isolate the engineering version from your production version.
In your case, since you mention that the split is currently 50/50 between the 2 versions, I would recommend that you split the customer and the engineering version. This isolates the engineering code that is not tested from the customer code, which is properly tested. This does give you "more" code to control, but this issue can easily be mitigated by having 2 teams, each responsible for their own version.
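If you do want to keep some shared history, one lightweight alternative to two fully separate codebases is a pair of long-lived branches in your version control system; the sketch below uses git, and the branch names are hypothetical:
# the tested customer code and the untested engineering code
# each live on their own long-lived branch
git branch customer
git branch engineering
# day-to-day engineering work happens without a V&V gate
git checkout engineering
# ...commit freely...
# only deliberately chosen, fully tested changes are promoted
git checkout customer
git cherry-pick <commit-id-of-a-tested-fix>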

How to work collaboratively with Matlab?

For a project we have to write a Matlab simulation and would like to split the work over several persons. As there are some non-professional programmers involved and we are dealing with a short project, we want to keep it simple and use Dropbox, so no version management system is involved.
What are the possibilities for doing this? How do we best split the functions? How do we split the program into several files?
Use version control so that you can keep track of who broke what, and commit at regular intervals so that there is a point to version control.
Design the program such that different people can work on it at the same time. Split it into several files which you can independently test for correctness. Have a professional programmer be responsible for the backbone (main function, class definition). Require consistent interfaces and documentation, so it's easy to stick it all together.
Talk to one another frequently. It doesn't have to be large formal meetings in many cases, just turning around and saying "hey, can you look at this?" is often enough. You all need to know who works on what, and where they stand, so that you know who to talk to in case there are questions. It's just so much faster to solve an issue by talking to the person involved rather than by trying to understand their code.
I would use version control - it saves lots of problems in the long run.
Git is good in that there is no central repository - and so everyone owns their own version.
This, in my experience, is liked by 'non-programmers', as they like to fiddle with (and break) their own version.
And git clone http://whatever, as a method of obtaining a distribution, is probably as easy as it gets.
And you will need to know when changes were made. For example: you find a bug and are not sure if you need to rerun the previous simulations or not (when was the bug introduced? - does it affect such and such a simulation?). Without version control finding bugs is a major stress because you cannot be sure of the answers to these questions.
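To make the suggestion concrete, here is roughly the day-to-day git vocabulary such a team would need (the file names are made up):
git init                       # create the repository, done once
git add simulate.m plot_results.m
git commit -m "initial simulation skeleton"
# ...edit files...
git diff                       # what changed since the last commit?
git commit -am "fix boundary condition in solver"
git log                        # when was which change made, and by whom?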

Best way to organize bioinformatics projects? [closed]

I come from a computer science background, but I am now doing genomics.
My projects include a lot of bioinformatics, typically involving: aligning sequences, comparing overlap between sequences and various genome annotation features, working with different classes of biological samples, time-course data, microarray data, high-throughput sequencing data ("next-generation" sequencing, though it's actually the current generation), this kind of stuff.
The workflow with this kind of analyses is quite different from what I experienced during my computer science studies: no UML and thoughtfully designed objects shining with sublime elegance, no version management, no proper documentation (often no documentation at all), no software engineering at all.
Instead, what everyone does in this field is hacking out one Perl-script or AWK-one-liner after the other, usually for one-time usage.
I think the reason is that the input data and formats change so fast, the questions need to be answered so soon (deadlines!), that there seems to be no time for project organization.
One example to illustrate this: let's say you want to write a raytracer. You would probably put a lot of effort into the software engineering first, then program it, finally in some highly optimized form, because you would use the raytracer countless times with different input data and would make changes to the source code over the years to come. So good software engineering is paramount when coding a serious raytracer from scratch. But imagine you want to write a raytracer where you already know that you will use it to render one single picture, ever, and that picture is of a reflecting sphere over a checkered floor. In this case you would just hack it together somehow. Bioinformatics is only ever like the latter case.
You end up with whole directory trees with the same information in different formats until you have reached the one particular format necessary for the next step, and dozens of files with names like "tmp_SNP_cancer_34521_unique_IDs_not_Chimp.csv" where you don't have the slightest idea one day later why you created this file and what exactly it is.
For a while I was using MySQL which helped, but now the speed in which new data is generated and changes formats is such that it is not possible to do proper database design.
I am aware of one single publication which deals with these issues (Noble, W. S. (2009, July). A quick guide to organizing computational biology projects. PLoS Comput Biol 5(7): e1000424). The author sums the goal up quite nicely:
The core guiding principle is simple:
Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.
Well, that's what I want, too! But I am following the same practices as that author already, and I feel it is absolutely insufficient.
Documenting each and every command you issue in Bash, commenting on why exactly you did it, etc., is just tedious and error-prone. The steps in the workflow are just too fine-grained. Even if you do it, it can still be an extremely tedious task to figure out what each file was for, and at which point a particular workflow was interrupted, and for what reason, and where you continued.
(I am not using the word "workflow" in the sense of Taverna; by workflow I just mean the steps, commands and programs you choose to execute to reach a particular goal).
How do you organize your bioinformatics projects?
I'm a software specialist embedded in a team of research scientists, though in the earth sciences, not the life sciences. A lot of what you write is familiar to me.
One thing to bear in mind is that much of what you have learned in your studies is about engineering software for continued use. As you have observed, a lot of what research scientists do is about one-off use, and the engineered approach is not suitable. If you want to implement some aspects of good software engineering, you are going to have to pick your battles carefully.
Before you start fighting any battles, you are going to have to critically examine your own ideas to ensure that what you learned in school about general-purpose software engineering is valid for your current situation. Don't assume that it is.
In my case the first battle I picked was the implementation of source code control. It wasn't hard to find examples of all the things that go wrong when you don't have version control in place:
some users had dozens of directories each with different versions of the 'same' code, and only the haziest idea of what most of them did that was unique, or why they were there;
some users had lost useful modifications by overwriting them and not being able to remember what they had done;
it was easy to find situations where people were working on what should have been the same program but were in fact developing incompatibly in different directions;
etc etc etc
Once I had gathered the information -- and make sure you keep good notes about who said what and what it cost them -- it became relatively easy to paint a picture of a better world with source code control.
Next, well, next you have to choose your own next battle. But one of the seeds of doubt you have to sow in your scientist-colleagues' minds is 'reproducibility'. Scientific experiments are not valid if they are not reproducible; if their experiments involve software (and they always do), then careful software engineering is essential for reproducibility. A lot of this is about data provenance, but that's a topic for another day.
Part of the issue here is the distinction between documentation for software vs documentation for publication.
For software development (and research plan) design, the important documentation is structural and intentional. Thus, modeling the data, reasons why you are doing something, etc. I strongly recommend using the skills you've learned in CS for documenting your research plan. Having a plan for what you want to do gives you a lot of freedom to multi-task while long analyses are running.
On the other hand, a lot of bioinformatics work is analysis. Here, you need to treat documentation like a lab notebook, and not necessarily a project plan. You want to document what you did, maybe a brief comment on why (e.g. when you are troubleshooting data), and what the outputs and results are.
What I do is fairly simple.
First, I start in a directory and create a git repo. Then, whenever I change some file, I commit it to the repo. As much as possible, I try to name data outputs in a way that lets me drop them into my .gitignore files.
Then, as much as possible, I work in a single terminal session per project at a time, and when I hit a pause point (like when I've sent a set of jobs up to the grid), I run 'history | cut -c 8-' and paste that into a lab notes file. I then edit the file to add comments for what I did and, remember, change the git add/commit lines to git checkout (I have a script that does this based on the commit messages). As long as I start it in the right directory, and my external data doesn't go away, this means that I can recreate the entire process later.
For any even slightly complex processing task, I write a script to do it, so that my notebook, as much as possible, stays clean. To an approximation, a helper script can be viewed as a subroutine in a larger project, and should be documented internally to at least that level.
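A sketch of that pause-point routine, with hypothetical file names, might look like:
# once per project: create the repo and keep bulky outputs out of it
git init
echo "*.tmp.csv" >> .gitignore
# ...analysis work in a single terminal session...
# at a pause point, capture the session's commands into the lab notes
history | cut -c 8- >> lab_notes.txt
git add lab_notes.txt
git commit -m "notes: SNP overlap analysis, jobs submitted to grid"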
Your question is about project management. Bad project management is not unique to bioinformatics. I find it hard to believe that the entire industry of bioinformatics is committed to bad software design.
About the pressure... Again, there are others in this world who have very challenging deadlines, and they are still using good software designs.
In many cases, following a good software design does not hold the project back, and may even speed up its design and maintenance (at least in the long run).
Now to your real question... You can offer your manager to redesign small parts of the code that have no influence on the rest of the code as a proof of concept (POC), but it's really hard to stop a truck from moving, so don't get upset if he feels "we've worked this way for years - we know what we are doing, and we don't need a child to teach us how to do our work". Learn to work like the rest, and when you gain their trust, you can do your own thing once in a while (I hope you will have the time and the devotion to do the right thing).
Good luck.

Lowest level of detail for functional specifications in order to be useful

Where I work, people don't like to write specs. (Boy, does anyone?) So they don't do it unless forced by their bosses. If they are forced to write them, they make them as short as possible. (By the way, that includes me.)
This results in specifications like
This software logs the time between event A and B to the event log
Name and path of parameter X are set in a configuration file in ini format (see the sketch below).
The software is active without a user needing to log on to the computer (implementation as a Windows service)
This example is taken from a very small project, and it worked out pretty well, but I don't think that it will suffice for anything more complex. I did not specify OS/hardware requirements because this is in-house development and we have company or department standards covering those.
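For illustration, the ini file behind the second item might look something like this (the section and key names are invented):
; hypothetical configuration file for the event-logging service
[ParameterX]
Name = EventTimer
Path = C:\ProgramData\EventLogger\parameters.dat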
So my question is:
What do you consider the absolute minimum level of detail in a functional specification for any non-trivial software?
IMHO the important thing about functional specs (and all other formal methods/tools for software development and project planning: Yourdon, SSADM, PRINCE2, UML, etc.) is that they encourage good practice by making you think along common lines. They don't guarantee success, but they encourage success by formalising good practice.
So the fact that FSs are created is a good thing, even if perhaps they could be better. Some planning and preparation is better than none at all - which is what a lot of developers do.
What should ideally go into a FS? As much as is necessary and as little as possible. Just because some functional specs cover X, Y & Z doesn't mean yours should. If you become too prescriptive, you will add unnecessary bureaucracy to simpler projects; correspondingly, for complicated projects, a prescriptive approach might encourage the developer to stop short of the level of detail that they really ought to go to.
Joel on Software wrote a cracking article on specifications.
You can find it here: Specification Discussion.

How do I plan an enterprise level web application?

I'm at a point in my freelance career where I've developed several web applications for small to medium sized businesses that support things such as project management, booking/reservations, and email management.
I like the work but find that eventually my applications get to a point where the overhead for maintenance is very high. I look back at code I wrote 6 months ago and find I have to spend a while just relearning how I originally coded it before I can make a fix or add features. I do try to practice using frameworks (I've used Zend Framework before, and am considering Django for my next project).
What techniques or strategies do you use to plan out an application that is capable of handling a lot of users without breaking and still keeping the code clean enough to maintain easily?
If anyone has any books or articles they could recommend, that would be greatly appreciated as well.
Although there are certainly good articles on that topic, none of them is a substitute for real-world experience.
Maintainability is not something you can plan straight ahead, except on very small projects. It is something you need to take care of during the whole project. In fact, creating loads of classes and infrastructure code in advance can produce code which is even harder to understand than naive spaghetti code.
So my advice is to clean up your existing projects by continuously refactoring them. Look at the parts which were a pain to change, and strive for simpler solutions that are easier to understand and to adjust. If the code is too bad even for that, consider rewriting it from scratch.
Don't start new projects and expect them to succeed just because you read some more articles or used a new framework. Instead, identify the failures of your existing projects and fix their specific problems. Whenever you need to change your code, ask yourself how to restructure it to support similar changes in the future. This is what you need to do anyway, because there will be similar changes in the future.
By doing those refactorings you'll stumble across various specific questions you can ask and read articles about. That way you'll learn more than by just asking general questions and reading general articles about maintenance and frameworks.
Start cleaning up your code today. Don't defer it to your future projects.
(The same is true for documentation. Everyone's first docs were very bad. After several months they turn out to be too verbose and filled with unimportant stuff. So complement the documentation with solutions to the problems you really had, because chances are good that next year you'll be confronted with a similar problem. Those experiences will improve your writing style more than any "how to write good" style guide.)
I'd honestly recommend looking at Martin Fowler's Patterns of Enterprise Application Architecture. It discusses a lot of ways to make your application more organized and maintainable. In addition, I would recommend using unit testing to give you better comprehension of your code. Kent Beck's book on Test-Driven Development is a great resource for learning how to address changes to your code through unit tests.
To improve the maintainability you could:
If you are the sole developer, then adopt a coding style and stick to it. That will give you confidence later, when navigating through your own code, about the things you could possibly have done and the things that you absolutely wouldn't have. Being confident about where to look, what to look for, and what not to look for will save you a lot of time.
Always take time to bring documentation up to date. Include the task in the development plan; include that time in the plan as part of any change or new feature.
Keep documentation balanced: some high-level diagrams, meaningful comments. The best comments tell what cannot be read from the code itself, like business reasons or the "whys" behind certain chunks of code.
Include in the plan the effort to keep the code structure, folder names, namespaces, object, variable and routine names up to date and reflective of what they actually do. This will go a long way toward improving maintainability. Always call a spade a "spade". Avoid large chunks of code; structure them by the means available within your language of choice, and give chunks meaningful names.
Low coupling and high cohesion. Make sure you are up to date with the techniques for achieving these: design by contract, dependency injection, aspects, design patterns, etc.
From a task-management point of view, you should estimate more time and charge a higher rate for non-continuous pieces of work. Do not hesitate to make the customer aware that you need extra time to do small non-continuous changes spread over time, as opposed to bigger continuous projects and ongoing maintenance, since the administration and analysis overhead is greater (you need to manage and analyse each change, including its impact on the existing system, separately). One benefit your customer is going to get is a greater life expectancy of the system. The other is accurate documentation that will preserve their option to seek someone else's help should they decide to do so. Both protect the customer's investment and are strong selling points.
Use source control if you don't do that already
Keep a detailed log of everything done for the customer, plus any important communication (a simple computer- or paper-based CMS). Refresh your memory before each assignment.
Keep a log of open issues, ideas, and suggestions per customer; again, refresh your memory before beginning an assignment.
Plan ahead for how post-implementation support is going to be conducted, and discuss it with the customer. Make your systems easy to maintain. Plan for parameterisation, monitoring tools, and built-in sanity checks. Sell post-implementation support to the customer as part of the initial contract.
Expand by hiring, even if you need someone just to provide that post-implementation support and do the admin bits.
Recommended reading:
"Code Complete" by Steve Mcconnell
Anything on design patterns is also on the list of recommended reading.
The most important advice I can give, having helped grow an old web application into an extremely highly available, high-demand web application, is to encapsulate everything. In particular:
Use good MVC principles and frameworks to separate your view layer from your business logic and data model.
Use a robust persistence layer so as not to couple your business logic to your data model.
Plan for statelessness and asynchronous behaviour.
Here is an excellent article on how eBay tackles these problems
http://www.infoq.com/articles/ebay-scalability-best-practices
Use a framework / MVC system. The more organised and centralized your code is, the better.
Try using Memcache. PHP has a built-in extension for it; it takes about ten minutes to set up and another twenty to put into your application. You can cache whatever you want in it - I cache all my database records in it, for every application. It does wonders.
I would recommend using a source control system such as Subversion if you aren't already.
You should consider maybe using SharePoint. It's an environment that is already designed to do everything you have mentioned, and it has many other features you maybe haven't thought about (but maybe you will need in the future :-) )
Here's some information from the official site.
There are 2 different SharePoint environments you can use: Windows SharePoint Services (WSS) or Microsoft Office SharePoint Server (MOSS). WSS is free and ships with Windows Server 2003, while MOSS isn't free, but has many more features and covers almost all of your enterprise's needs.