I'm developing a script that performs a certain core task, and using versions of that script in two different environments where some settings and steps along the way need to be different. What I am looking for is whether there exists an elegant way to handle the small differences between the two versions of the script. I'm sure developers face similar problems when developing software to be deployed on multiple platforms, but I don't have a specific name to pin on it.
What I do now is to open up the second script and manually replace the lines that need to be different. This is cumbersome, time-consuming, and a bit of a headache whenever I inevitably forget to comment out a line or change a string.
Example
[...]
path_to_something = "this/is/different"
use_something(path_to_something)
[...]
do_thing_A() # Only in environment A.
[...]
do_thing_B() # Only in environment B.
[...]
The omitted [...] parts are identical in both versions, and when I make a change to them, I have to either copy and paste each changed line, or if the changes are significant, copy the whole thing, and manually change the A and B parts.
Some ideas for possible solutions that I've come up with:
Write a script that automates the steps I manually take when moving the code back and forth. This exactly replicates the necessary steps, and it's quick and easy to add or remove steps as necessary.
Is this a use case for gitattributes?
Factor all the code that is identical between versions into separate files, so that the files containing the heterogeneous code don't need to change at all, and thus don't need to be version-controlled, per se (see the sketch after this list).
Some other tool or best practice that I don't know about to handle this type of workflow.
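For the factoring idea, a minimal sketch of what that layout could look like in Python. The file names common_core.py, run_env_a.py, and run_env_b.py, as well as the paths, are made up; use_something, do_thing_A, and do_thing_B stand for the placeholder functions from the example above and are therefore not defined here.

# common_core.py -- identical in both environments; the only file that changes often.
def run(path_to_something, extra_steps):
    use_something(path_to_something)   # placeholder from the example above
    for step in extra_steps:
        step()

# run_env_a.py -- tiny, environment-A-specific, rarely changes.
from common_core import run
run("path/for/environment/A", extra_steps=[do_thing_A])

# run_env_b.py -- tiny, environment-B-specific, rarely changes.
from common_core import run
run("path/for/environment/B", extra_steps=[do_thing_B])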
Looking around, I've found a question with a similar premise of maintaining different versions of code that does the same thing:
Proper way to maintain a project that meets two versions of a platform?
Solutions offered to that question:
Get rid of all the differences, then there is no problem to solve. This may or may not be possible in my specific case, and certainly won't be possible in every case for everyone in the future. So maybe there is a more general solution.
Maintain two different branches of the code, even though they are nearly identical. This is similar to what I do now, but I end up having to do a lot of copying and pasting back and forth between branches. Is that just inherent to software development?
Perform platform detection and wrap the differences in conditionals. This adds a lot of ugly stuff in the code, but if I could successfully detect the environment and implement all the necessary differences conditionally, I would not have to make any changes to the code before sending it to the different environments.
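For illustration, a minimal sketch of that conditional approach in Python, assuming the environment is identified by an environment variable. The SCRIPT_ENV name and the paths are made up; use_something, do_thing_A, and do_thing_B are the placeholders from the example above.

import os

env = os.environ.get("SCRIPT_ENV")     # set once per machine, e.g. "A" or "B"

# All environment-specific settings are gathered in one place.
if env == "A":
    path_to_something = "path/for/environment/A"
elif env == "B":
    path_to_something = "path/for/environment/B"
else:
    raise RuntimeError("SCRIPT_ENV must be 'A' or 'B', got %r" % env)

use_something(path_to_something)

if env == "A":
    do_thing_A()   # Only in environment A.
if env == "B":
    do_thing_B()   # Only in environment B.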
How do developers move code back and forth between similar, but different, parallel branches of a project?
Language- and SCM-agnostic
Use one or more common classes, but different interfaces (one interface per environment)
Move everything environment-specific out of hardcoded values into external configuration, stored separately
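A minimal sketch of that configuration idea in Python, assuming one small JSON file per environment. The file layout, key names, and the placeholder functions from the question's example are assumptions, not a prescribed scheme.

import json
import sys

# The per-environment JSON file is the only thing that differs between
# deployments; run as, e.g.: python script.py env_a.json
with open(sys.argv[1]) as f:
    cfg = json.load(f)

use_something(cfg["path_to_something"])   # common code, identical everywhere

if cfg.get("do_thing_a", False):
    do_thing_A()                          # enabled only in environment A's config
if cfg.get("do_thing_b", False):
    do_thing_B()                          # enabled only in environment B's config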
SCM-agnostic
Separate tasks, i.e.:
Get a clean common core
Get the changes on top of the core for each environment
Move these into $ENVS+1 branches in any SCM (Core + Env1 + ... + EnvN)
The changed workflow becomes:
Commit common changes to Core
Merge Core into each Env branch
Test the Env branches; fix env-specific changes if needed
Private and personal (preferred for me and my habits)
A variation of the branch-tree solution
A pure Mercurial way, because I'm too lazy to maintain env-specific branches
Mercurial: one branch plus an MQ queue with a set of patches (one MQ patch per environment; it could also be a set of queues, one queue per environment, one patch per queue)
Common code is stored in immutable changesets; any changes that convert the plain core into environment-specific products are stored in patches and applied on demand (and edited when/if needed).
Small advantages over the branch approach: a single branch, no merges, a cleaner history.
Disadvantage: it is pure Mercurial, rather than the currently trendy Git (the Git fans are crying).
Related
When configuring regmap, it is possible to include a list of power on defaults for the registers. As I understand it, the purpose is to pre-populate the cache in order to avoid an initial read after power on or after waking up. I'm confused by the fact that there is both a reg_defaults and a reg_defaults_raw field. Only one or the other is ever used. The vast majority of drivers use reg_defaults however there's a small handful that use reg_defaults_raw. I did look through the git history and found the commit that introduced reg_defaults and the later commit that introduced reg_defaults_raw. Unfortunately I wasn't able to divine the reason for that new field.
Does anyone know the difference between those fields?
We have a sort of unique situation where we have thousands (20k+) of individual small HTML files that are unrelated. We make edits to probably anywhere from tens to hundreds of them every day. We have been using Visual SourceSafe, which works well with that model, but have been wanting to move to something a bit more modern for a while now. I just don't know, looking at what is available, what might work best, or if anything will.
Using something like Mercurial, would we want one repo, with one project for each file and all projects in the one repo? Or one repo with one project and all files in that one project? Or will this even work? Or do I know so little about all of this that my question doesn't even make sense (quite possible)?
do I know so little about all of this that my question doesn't even make sense
Yes, sorry...
"Project" is unknown entity for VCS (this is an object of another subject area), VCS deals with "repositories"|"files in repositories"
All (maybe most) modern VCS haven't strict limits on amount of files in repositories|amount of repositories, which single server (when it needed /nor always/)can support - except common sense: 20K repositories can be hard job for manage, 20K files in single repo may lead to speed's degradation in some edge-cases
Thus: you can|have to choice any model of storing your objects in repository (repositories), just
Rate and weigh all advantages, disadvantages and consequences of the using of each model (considering the extreme and intermediate options)
Offhand:
A repo per file means a lot of repositories on the server and a lot of working directories locally, but zero "collateral damage" from any workflow or error.
One giant repo means all the limitations of a "global context" for many actions (a single global revision per repo, and no ability to branch or tag a single file, except in SVN).
A hashed (by any rule) tree of directories, with files inside each container, will require you to create and maintain a "location map" and will bring additional trouble in the case of file renames (just imagine: A/a.html and A/a1.html before, B/b.html and C/c.html after, moved and properly registered in the VCS of choice; not a big headache, but stepping on that rake is a quite possible future).
In most examples I see, they tend to have one GitHub/VCS repository per bounded context, and this does seem to be the best thing to do.
My question pertains specifically to user interfaces: do they live in a separate repository which holds just UIs, or is each interface included within the repository of the BC itself?
What about interfaces which compose data from multiple BCs?
Just to make it explicit: here I am trying to work out how to physically organise code in a DDD project.
Considering that a tag applies to the full Git repo, it is best to have two sets of files (like a UI and a BC) in two separate repos if:
you can make changes (and apply new tags) to one without touching the other
the number of files involved is large enough (if the UI is just one or two files, it might not be worth the trouble to create a dedicated repo for it)
As the OP Sudarshan summarizes in the comments:
If a UI is dedicated to a BC, then it could live within the same repo as the BC itself or in a separate one, depending on whether it will evolve on its own or not.
However, for UIs that span multiple BCs, it is better to put them in a repo of their own and use submodules to reference the right BC repos.
For one of my apps I was thinking about implementing branching/merging. I don't understand how to merge without conflicts in some scenarios. Let's take this as an example.
Root writes some code. A, B, and C pull from him and add features. C is done, so A and B pull/merge from it. I believe this works by comparing their own code against C's, using Root as the base. Now A and B write more features and finish.
Now what happens if I pull from A and then pull from B? Their base is Root, and they both pulled from C, so the same lines have been edited. How does it know whether it is a conflict or not? What if I edit a line C wrote and then pull from B? I guess that would be a conflict. My last question is: what happens if A and B shuffle the location of a function after pulling from C? I guess it comes down to how good the diff recognition is, but I'm unsure how one can pull from both A and B without conflicts.
Sometimes you will get conflicts, if for example A and B edit the same lines (differently). You would then need to manually merge the changes by inspecting them (and maybe talking to A and B!), rather than relying on the DVCS to merge for you.
You might also get logical conflicts (A and B change different sections of the file, so there is no apparent conflict and the DVCS can handle the merge, but those sections break one another's assumptions, so a bug is introduced). Version control can't fix this; only communication between developers, and unit testing, can.
How does it know if it is a conflict or not?
either because the DVCS will tell you: it will trigger a manual resolution of the merge if identical lines are edited
or because you:
know the code well enough to spot semantic conflicts
have an extended battery of unit tests which will flush out semantic conflicts.
But as mentioned in the comments of "Still not drinking the DVCS Kool-Aid", committing and merging regularly (at a much higher pace than with a CVCS, a Centralized Version Control System) is key to avoiding logical conflicts, or at least keeping them as small as possible.
For more on semantic conflicts, see this Martin Fowler article.
We develop a data processing tool to extract scientific results out of a given set of raw data. In data science it is very important that you can re-obtain your results and repeat the calculations that led to a result set.
Since the tool is evolving, we need a way to find out which revision/build of our tool generated a given result set, and how to find the corresponding source from which the tool was built.
The tool is written in C++ and Python, gluing together the C++ parts using Boost::Python. We use CMake as a build system, generating Makefiles for Linux. Currently the project is stored in a Subversion repo, but some of us already use Git or Mercurial, and we are planning to migrate the whole project to one of them in the very near future.
What are the best practices in a scenario like this to get a unique mapping between source code, binary and result set?
Ideas we are already discussing:
Somehow injecting the global revision number
Using a build number generator
Storing the whole source code inside the executable itself
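A minimal sketch of the first two ideas for the Python side, assuming Git and its git describe command; in a CMake-driven build the same string could instead be generated at configure time and compiled into the C++ part. The function name is hypothetical.

import subprocess

def tool_revision():
    # Ask the VCS for an identifier of the exact source state;
    # "--dirty" marks builds made from uncommitted changes.
    try:
        out = subprocess.check_output(
            ["git", "describe", "--tags", "--always", "--dirty"])
        return out.decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

# Record this identifier in every result set the tool writes.
TOOL_REVISION = tool_revision()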
This is a problem I spend a fair amount of time working on. To what @VonC has already written, let me add a few thoughts.
I think that the topic of software configuration management is well understood and often carefully practiced in commercial environments. However, this general approach is often lacking in scientific data processing environments, many of which either remain in, or have grown out of, academia. That said, if you are in such a working environment, there are readily available sources of information and advice, and lots of tools to help. I won't expand on this further.
I don't think that your suggestion of including the whole source code in an executable is, even if feasible, necessary. Indeed, if you get SCM right, then one of the essential tests that you have done so, and continue to do so, is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of the sources was used in each executable and version. These ought to make including the source code in an executable unnecessary.
The topic of tying result sets in to computations is also, as you say, essential. Here are some of the components of the solution that we are building:
We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs towards structured files; in our case we're looking at HDF5 and XML, in which both the data of interest and the meta-data are stored (a minimal sketch follows this list). The meta-data includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters, and a bunch of other stuff.
We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.
We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store the original data sets captured from our field sensors. Unfortunately, some of our results take thousands of CPU-hours to produce, so it is infeasible to reproduce them ab initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.
We are also making it much harder (I'd like to think impossible, but I am not sure we are there yet) for users to edit result sets directly. Once someone does that, all the provenance information in the world is wrong and useless.
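As a minimal sketch of the structured-output idea from the first point above, using h5py; the attribute names, file names, and values here are assumptions, not the authors' actual schema.

import getpass
import time

import h5py
import numpy as np

TOOL_REVISION = "v1.4.2-7-g3adf1c9"        # hypothetical; in practice taken from the VCS
results = np.zeros((100, 3))               # stand-in for the computed result set

with h5py.File("result_set.h5", "w") as f:
    f.create_dataset("results", data=results)
    # Provenance meta-data stored alongside the data of interest.
    f.attrs["tool_revision"] = TOOL_REVISION
    f.attrs["input_data_sets"] = "raw_sensor_2013_06.h5"       # hypothetical input identifier
    f.attrs["job_parameters"] = '{"threshold": 0.5}'           # hypothetical, JSON-encoded
    f.attrs["created_by"] = getpass.getuser()
    f.attrs["created_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())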
Finally, if you want to read more about the topic, try Googling for 'scientific workflow', 'data provenance', and similar topics.
EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.
You need to consider Git submodules or Hg subrepos.
The best practice in this scenario is to have a parent repo which will reference:
the sources of the tool
the result set generated from that tool
ideally the C++ compiler (it won't evolve every day)
ideally the Python distribution (it won't evolve every day)
Each of those is a component, that is, an independent repository (Git or Mercurial).
One precise revision of each component will be referenced by the parent repository.
The whole process is representative of a component-based approach, and is key to using SCM (here, Software Configuration Management) to its fullest.