I have a more general requirement to track changes in asset files that are committed into source control and shipped inside the binaries, but for now I am implementing it in a unit-testing context and facing a potential problem for the future. Before asking the TL;DR question, I will give some context.
Scenario
Some application assets are loaded from CSV files committed into a Git repository via ClasspathResource[1], and they may change from time to time. The change occurs across commits, but for a running application it appears across different versions of the application.
My test solution
I have implemented the following mechanism to alert me about changes in the resource:
@Before
public void setUp() throws Exception
{
    assertEquals("Resource file has changed. Make sure the test reflects the changes in the file and update the checksum",
        MD5_OF_FILE,
        DigestUtils.md5Hex(new ClassPathResource("META-INF/resources/assets.csv").getInputStream()));
}
Basically, I want my unit tests to fail until I explicitly hardcode the checksum of the file. When I run md5sum assets.csv, I hardcode the result into the code so the tests know they are working with a fixed version of the file.
Problem
I ran the tests on my own Windows box and they worked like a charm. Switching to Linux, I found that they failed. I immediately realized that it might be due to line endings, which I had totally forgotten about.
In this specific case, Git is configured to commit files with LF line endings but check them out (on Windows) with CRLF. This configuration is reasonable for working with source code.
So I need to check whether the asset file has changed in a smart way that tolerates a box changing/reinterpreting the line endings. This is especially true for the runtime application, which will store the file hash and compare it against the actual assets file (which may have changed), performing corrective action on differences ==> reloading the assets.
TL;DR
Given a textual file of which I can extract and store any hash (not just cryptographic; I used MD5), how can I tell whether it has changed, regardless of the environment the file is processed in, which may modify the line endings?
Note
I have a requirement not to embed a version marker in the asset itself (e.g. an incremental version number in the first row), since developers will fail to update it correctly.
[1] Spring framework tool wrapping Class.getResourceAsStream
A solution could be to normalize the file to a chosen line ending, i.e. always CRLF or always LF, then compute the cryptographic hash over the normalized content.
E.g. compute dos2unix < file | md5sum on the command line, and in code use a Stream that normalizes the file on the fly.
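In the unit-test context above, the normalization-then-hash step can be sketched like this (a minimal example of my own; the class and method names are not part of commons-codec, and it assumes Java 9+ for readAllBytes and an asset small enough to buffer in memory):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class NormalizedMd5 {

    // Read the whole stream, normalize CRLF and lone CR to LF, then hash.
    // Assumes UTF-8 encoded text content.
    public static String md5HexNormalized(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

The assertion in setUp would then compare MD5_OF_FILE against md5HexNormalized(...) instead of DigestUtils.md5Hex(...), so a CRLF checkout on Windows and an LF checkout on Linux produce the same hash.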
Related
I have an issue with the clearmake command in IBM ClearCase.
I use the clearmake command to run my own makefile so I can build my program from the C source code.
I want to put a command in the makefile, like shell cleartool -some-command, to ignore all checkouts and all private files.
The disadvantage is that in the config spec I must include the rule element * CHECKEDOUT.
But in my use case I want to keep working on my files and at the same time be able to compile/build with the old files, so I can work faster and don't have to change views or edit config specs.
What I am wondering is whether I can ignore the checked-out files with a command, without losing them.
Could you give me a solution?
I want to keep working on my files and at the same time be able to compile/build with the old files,
It would be easier to use two different snapshot views loaded on the disk at two different places.
In one (where no checkout has ever been done), you can set all the files writable (through Windows, not ClearCase): all the files become hijacked, but modifiable, for compilation/testing purposes.
In the other view, you keep your checked-out files and your work in progress (but do not run your clearmake there).
We are using Mercurial as an SCM to handle the source script files of a program. Each project we manage has ~5000 files, each containing a section with some product-specific information about the file itself (version list, date, time, etc.). Due to the way it is structured, in 80% of merges this section is the only one that has conflicts. They are easily resolved, but when merging around 300 files, it gets tiresome.
The problem is: I have no control over the way this section is written and I cannot change the format of the section itself, as it would make the file unusable by the program.
My question: is there a way in mercurial (hooks?), that allows me to
pre-process the file with a script
let mercurial do the merge
if the merge succeeded: post-process the file with a script; otherwise: resolve conflicts as usual.
You could probably get away with it by creating a custom merge tool:
https://www.mercurial-scm.org/wiki/MergeToolConfiguration
A simple script that invokes 'diff' after removing the ever-changing sections might be enough.
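A minimal sketch of wiring such a tool into .hgrc (the tool name stripmerge and the wrapper-script path are hypothetical, not stock Mercurial; only the merge-tools/merge-patterns sections and the $base/$local/$other/$output placeholders are standard):

```ini
[merge-tools]
; wrapper script strips the volatile section from all three inputs,
; merges the stripped versions, then re-inserts the section
stripmerge.executable = /usr/local/bin/strip-section-merge.sh
stripmerge.args = $base $local $other $output
stripmerge.priority = 1

[merge-patterns]
** = stripmerge
```

If the wrapper exits non-zero, Mercurial marks the file unresolved, so you fall back to resolving conflicts as usual.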
It sounds like those sections are the sort of thing that the (disrecommended) KeywordsExtension is built to handle, but I gather you don't have a lot of flexibility around them.
I found what appears to be an old source repository for some source code that I need to resurrect. But I have no idea what source control tools were used to generate and manage this repository. In the directory, all of the files have an "s." prefixed to the file name. Without knowing the format of these files, I cannot manually extract the source code with any degree of accuracy. And even if I could, manually extracting the source code would be very time-consuming and error-prone.
What source/version control system prefixes its source files with "s." when it stores the source file in its repository directory?
How can I effectively extract the latest source code from this repository directory?
The s. prefix is characteristic of SCCS, the Source Code Control System. The code for that is probably still proprietary, but GNU has the CSSC project, which can manipulate SCCS files. SCCS tracks changes per file in revisions known as 'deltas'.
SCCS is the official revision control system for POSIX; you can find the commands documented on the Open Group site (but the file format is not specified there, AFAICT):
admin
delta
get
prs
rmdel
sact
unget
val
what
The file format is not specified by POSIX. The manual page for get says:
The SCCS files shall be files of an unspecified format.
The original SCCS command set included some extras not recorded by POSIX:
cdc — change delta commentary (for changing the checkin comments for a delta)
comb — combine, effectively for merging deltas
help — no prefix; there wasn't any other help program at the time. Commands generated error codes such as cm3, and help interpreted them.
sccsdiff — difference between two deltas of a file
Most systems now have a single command, sccs, which takes the operation name and then options. Often, the files were placed into an ./SCCS/ subdirectory and extracted from that as required, and the sccs front-end would handle name expansion, adding s. or SCCS/s. to the start of the file names.
For extracting the latest version of the source code, use get:
get s.*
or, via the front-end:
sccs get s.*
These will get the default version of each file, and the default default is the latest version of the file.
If you need to make changes, use:
get -e s.filename.c
...make changes...
delta -y'Why you made the changes' s.filename.c
get s.filename.c
Note that the files 'lose' the s. prefix for the working file names, rather as RCS (Revision Control System) files lose the ,v suffix for the working file names. If you've not come across that, accept that things were different when SCCS and RCS were created, back in the late '70s and early '80s.
SCCS uses an s. prefix. But it might not be the only one!
I never knew this knowledge would come in useful some day!
I have trouble with 4 files in my CVS project. Each time I commit one of those files, CVS keeps adding the same line of code at the end of it. This line of code is a repeated line from the current file (but not its last line).
I've tried several things: update; deleting lines and committing; deleting all lines and committing; adding lines and committing; adding a header and committing. But I always get the same line of code added to the end of my file. I could delete the files and recreate them, but I would lose all my history.
I find it awkward that CVS modifies my file when I commit. Isn't it counterproductive, as it may introduce errors into compliant code?
I should add that my file is a .strings file (text, Unicode). I'm working on a branch, but recently merged it into the trunk.
More Details:
I'm using TortoiseSVN on a virtual Windows machine, which accesses my Documents folder on Mac OS X via a network drive between the two.
It turns out that my colleague, who has the same project but in a real Windows folder, could commit without any problem.
And now that he has done that, the problem is solved for me too.
But I have no idea what happened. My only clue would be a hidden character in Mac OS X that breaks TortoiseSVN. Is that possible?
I haven't experienced this issue with CVS, but I note that you mention the file you are editing is Unicode text (you don't say whether that means UTF-8 or UTF-16, but either can cause issues).
Depending on how your CVS server was built, and how (and on what platform) it is being run, it is quite possible that the server is not Unicode-aware. This can cause a whole range of issues, including expanding RCS-style $ tags in places where the second (or a later) byte of a Unicode character happens to equal ASCII '$'.
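To see how a byte can collide with '$': in UTF-16BE, a character such as U+0924 encodes as the bytes 0x09 0x24, and 0x24 is ASCII '$', so a byte-oriented keyword scanner can latch onto it. A quick illustration (my own example, not CVS code):

```java
import java.nio.charset.StandardCharsets;

public class DollarByte {
    public static void main(String[] args) {
        // U+0924 (DEVANAGARI LETTER TA) in UTF-16BE is 0x09 0x24 -- the
        // second byte equals ASCII '$' (0x24), which a non-Unicode-aware
        // server scanning bytes for keyword markers may misinterpret.
        byte[] utf16 = "\u0924".getBytes(StandardCharsets.UTF_16BE);
        for (byte b : utf16) {
            System.out.printf("%02x ", b);
        }
        System.out.println();  // prints: 09 24
    }
}
```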
The workaround for this is to mark Unicode source files as binary objects. From the command line, this can be done using
cvs add -kb file-name
when adding a new file, or
cvs admin -kb file-name
for an existing file (replace file-name with the name of your file).
In the latter case, I'd recommend removing the (local copy of the) file and running 'cvs update' to get it back after changing the type.
Note that doing this is unlikely to help with changes you're already seeing in the file, so make sure to check the file, and fix any existing problem after making this change.
I'm wondering whether I will ever get a different result when producing a checksum of an .exe file before, and then while or after, running that file. I'm more concerned with common practice (such as producing a SHA hash of a popular app like firefox.exe) than with boundary cases, but both are interesting. Thanks.
The hash of a file should be constant for as long as the file is identical (i.e. contains only the same bytes, in the same order). It's very rare to find applications that rewrite their on-disk representation at runtime, so the hash should be constant. There are self-modifying programs, but they tend to operate on the in-memory loaded copy of their code, rather than the disk copy.
Edit: We should consider "self-updating" applications, but these tend to launch a little helper program to download and update the core application. It's difficult (especially on Windows) to update an executable whilst it's running. UNIX systems tend to use copy-on-write semantics, so it's possible that a software update might change your executable under your feet, but again, this is a corner case.
The hash will only change if the exe changes. That will only happen if the app modifies itself, which isn't going to happen on Windows without the app restarting. Firefox might update itself (including a restart), but apart from such cases the hash will remain the same.
The hash will change if the file changes.
EXE files rarely change on their own. firefox.exe would change if the user updates to a new version.
You can check the "date modified" attribute of an EXE file (like firefox.exe) after running it to see whether it has changed, but you'll probably find it hasn't.
If you mean the update of the last-access time: don't worry, it's stored at the filesystem level, not within the file, so the hash will remain the same.