File Last Modified - hash

Is it safe to use the File Last Modified timestamp (e.g. on NTFS) to detect whether a file has changed? If not, do file backup applications always hash the whole file to check for changes? If so, what hash algorithm is suited for this check?

It depends on the requirements of the application. Can it tolerate false positives? False negatives?
A File Last Modified date is not reliable. For example, FTP may change the modified date without changing the file, or a file could be downloaded twice, once over itself, changing the modified date without changing the file. On the other hand, there are a few utilities that will change a file but keep the same File Last Modified date.
If action absolutely must be taken on a file when it has been changed, the reliable way is to use a good hash or fingerprint. This does take time. One way to improve the odds without taking so much time would be to compare the modified date along with the file size, but again this is not foolproof.
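For illustration, here is a minimal sketch of that two-level check in Python (assuming SHA-256 via hashlib; the algorithm choice and function names are just examples): size and modified time are compared first as a cheap filter, and a content hash is the reliable fallback.

import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    # Stream the file in chunks so large files need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def probably_changed(path, last_size, last_mtime):
    # Cheap check: any difference in size or mtime is treated as a change.
    st = os.stat(path)
    return st.st_size != last_size or st.st_mtime != last_mtime

def definitely_changed(path, last_digest):
    # Expensive but reliable check: compare content hashes.
    return file_digest(path) != last_digest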

I wouldn't trust the last modified time too much, since even opening a file and adding a single character changes its modification time. Hashing has the problem of collisions, so I would suggest reading about Rabin's fingerprinting algorithm.
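To give a flavour of the idea, here is a sketch of a Rabin-Karp style rolling hash over fixed-size byte windows (this is not Rabin's full irreducible-polynomial construction, and the base and modulus below are arbitrary choices):

BASE = 257
MOD = (1 << 61) - 1  # a large Mersenne prime keeps collisions rare

def rolling_fingerprints(data: bytes, window: int):
    # Slide a fixed-size window over the data, updating the hash in O(1)
    # per step instead of rehashing the whole window each time.
    h = 0
    power = pow(BASE, window - 1, MOD)
    fingerprints = []
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * power) % MOD
        h = (h * BASE + b) % MOD
        if i >= window - 1:
            fingerprints.append(h)
    return fingerprints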

Get used to setting up an effective, routinely monitored hash check. Last Modified is not as safe as many like to think. Stick with checking the hash, and use good software that does it regularly.
Trust me, once you get used to taking the safest route instead of the easiest one, you'll develop good habits that will carry over to other security measures.

Related

Compare two versions of a document

How would you implement a version control system for a single file?
The point of this system would be to highlight what changed between two versions of the same file (pretty much what git does).
Instead of storing the whole document each time, it's usually better to store the first version of the file and have every "push" store only the modifications. However, how can we efficiently spot an insertion, a modification, a deletion, or a mix of these?
There are version control systems which handle single files. One can use many modern version control systems, such as Git, and simply store a single file, but one tool which works on independent files is RCS.
Most version control systems adopt either a series-of-snapshots approach, like Git, or a changeset approach, like Arch and RCS. Notably, RCS uses reverse deltas; that is, only the latest version of a file is stored in full, and each older revision is stored as a change against its subsequent revision.
In either case, the way to detect changes is a diff algorithm. There's the standard Myers approach, plus modifications like the patience and histogram algorithms. They are all based around finding the longest common subsequence (possibly with some modifications) and then representing the non-common parts as insertions, removals, or, in some cases, modifications.
The idea of a "modification" in a diff is hard to quantify: whether a single-line change counts as the insertion of one line and the removal of another, or as a modification of that line, depends on whether the human reading it considers it a substantive change. Because gauging human opinion is difficult for software, some diff formats, like the unified diff, always express changes as additions and removals, while others, like the context diff, mark such lines as modifications.
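As a concrete illustration, Python's standard difflib module implements a variant of this longest-common-subsequence idea; in the unified format a changed line shows up as a removal plus an addition (the file labels below are made up):

import difflib

old = ["line one\n", "line two\n", "line three\n"]
new = ["line one\n", "line 2\n", "line three\n", "line four\n"]

# "line two" appears as a removal plus an addition; "line four" as a
# pure addition at the end.
for line in difflib.unified_diff(old, new, fromfile="v1", tofile="v2"):
    print(line, end="")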

Will the depth of a file within a filesystem change the time taken to copy it?

I am trying to figure out whether the depth of a file in a filesystem will change the amount of time it takes to execute a "cp" bash command on that file.
By depth I mean how many parent directories it is contained in.
I tried running a few tests, but my results are pretty inconclusive, and when I try to logically answer, I can think of reasons of why it would be either way.
What is the purpose of this?
Provided nothing is cached, the deeper the directory tree, the more data has to be read from storage to get to the file - you have to look up the name of the second directory, then the third within the second, and so on. On the other hand, if the file is big, the time needed to do this can be negligible in comparison.
Also, the mere startup of a command like cp is not without its cost.
If you are interested in how file systems work read this free book: http://www.nobius.org/~dbg/practical-file-system-design.pdf
Performance is a complicated subject, especially when physical storage media are involved. Without a proper understanding of how this works, and of statistics, you can't perform a correct test.
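If you want to measure it anyway, here is a rough sketch (paths, file layout and repetition count are made up, and caching will dominate unless you clear caches between runs):

import shutil
import time

def time_copy(src, dst, repeats=100):
    # Wall-clock timing of repeated copies; averaging smooths scheduler
    # noise a little but does nothing about filesystem caching.
    start = time.perf_counter()
    for _ in range(repeats):
        shutil.copy(src, dst)
    return (time.perf_counter() - start) / repeats

# Hypothetical layout: the same file stored at depth 1 and at depth 20.
shallow = "d1/file.bin"
deep = "d1/" + "/".join("d%d" % i for i in range(2, 21)) + "/file.bin"
print(time_copy(shallow, "out.bin"), time_copy(deep, "out.bin"))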

Sharing a file on an overloaded machine

I have a computer that is running Windows XP that I am using to process a great deal of data, update monitors, and bank data. Generally it is pretty loaded with work.
One particular file that has real time data is useful to a number of users. We have two programs that need this file, one that displays the numerical data and one that plots the numerical data. Any user can run an instance of either program on their machine. These programs search for the real time data file which is updated every second. They are both written in Perl and I was asked not to change either program.
Because of the large load on the computer, I am currently running the program that does calculations and creates the real time data file on a separate computer. This program simply writes the real time file onto the overloaded computer. Because Windows doesn't have an atomic write, I created a method that writes to a different extension, deletes the old real time file, and then moves the new one to the correct name. Unfortunately, as the user load on the computer increases, the writes take longer (which isn't ideal but is live-able) but more annoyingly, the time between deleting the old real time file and moving the new file to the correct name increases a great deal, causing errors with the Perl programs. Both programs check to see if the file modify time has changed (neither check for file locks). If the file goes missing they get angry and output error messages.
I imagine a first course of action would be to move this whole process away from the overloaded computer. My other thought was to create a number of copies of the files on different machines and have different users read the file from different places (this would be a real hack though).
I am new to the world of networking and file sharing, but I know there is a better way to do this. Frankly, this whole method is a bit of a hack, but that's how it was when I came here.
Lastly, it's worth mentioning that this same process runs on a UNIX machine and has none of these problems. For this reason I feel the blame falls on the lack of an atomic write on Windows. I have been searching the internet for any workaround to this problem and have tried a number of different methods (e.g. my current extension-switching method).
Can anyone point me in the right direction so I can solve this problem?
My code is written in Python.
The documentation for os.rename() says:
os.rename(src, dst)
Rename the file or directory src to dst. If dst is a directory,
OSError will be raised. On Unix, if dst exists and is a file,
it will be replaced silently if the user has permission. The
operation may fail on some Unix flavors if src and dst are on
different filesystems. If successful, the renaming will be an
atomic operation (this is a POSIX requirement). On Windows, if
dst already exists, OSError will be raised even if it is a file;
there may be no way to implement an atomic rename when dst names
an existing file.
Given that on Windows you are forced to delete the old file before renaming the new one to it, and you are prohibited from modifying the reading scripts to tolerate the missing file for a configurable timeout (the correct solution) or do proper resource locking with the producer (another correct solution), your only workaround may be to play with the process scheduler to make the {delete, rename} operation appear atomic. Write a C program that does nothing but look for the new file, delete the old, and rename the new. Run that "pseudo-atomic rename" process at high priority and pray that it doesn't get task-switched between the delete and the rename.
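If the producer side can be changed, here is a sketch of the write-then-swap pattern in Python; note that on Python 3.3+ os.replace() overwrites an existing destination (on Windows it is implemented with MoveFileEx), which may close the delete/rename gap entirely - worth verifying on your setup:

import os
import tempfile

def publish(path, data: bytes):
    # Write the new contents to a temporary file in the same directory,
    # flush it to disk, then swap it into place under the final name.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace() overwrites an existing destination file.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise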

Using combination of File Size and Hash value of only first 20KB of a file to detect duplicates?

A project I'm working on requires detection of duplicate files. Under normal circumstances I would simply compare the file bytes in blocks, or a hash of the entire file contents. However, the system does not have access to the entire file - only the first 50KB or so. It also knows the total file size of the original file.
I was thinking of implementing the following: each time a file is added, I would look for possible duplicates using both the total file size and a hash calculation of (file-size)+(first-20KB-of-file). The hash algorithm itself is not the issue at this stage, but will likely be MurmurHash2.
Another option is to also store, say, bytes 9000 through 9020 and use that as a third condition when looking up a duplicate copy, or alternatively to compare byte-by-byte when the aforementioned lookup returns possible duplicates, as a last attempt to discard false positives.
How naive is my proposed implementation? Is there a reliable way to predict the amount of false positives? What other considerations should I be aware of?
Edit: I forgot to mention that the files are generally going to be compressed archives (ZIP, RAR) and on occasion JPG images.
You can use file size, hashes and partial-contents to quickly detect when two files are different, but you can only determine if they are exactly the same by comparing their contents in full.
It's up to you to decide whether the failure rate of your partial-file check will be low enough to be acceptable in your specific circumstances. Bear in mind that even an "exceedingly unlikely" event will happen frequently if you have enough volume. But if you know the type of data that the files will contain, you can judge the chances of two near-identical files (identical in the first 50kB) cropping up.
I would think that if a partial-file-match is acceptable to you, then a hash of those partial file contents is probably going to be pretty acceptable too.
If you have access to 50kB then I'd use all 50kB rather than only the first 20kB in my hash.
Picking an arbitrary 20 bytes probably won't help much: your file contents will either be very different, in which case hash+size clashes will be unlikely, or they will be very similar, in which case the chances of a randomly chosen 20 bytes being different will be quite low.
In your circumstances I would check the size, then a hash of the available data (50kB), and then, if this suggests a file match, a brute-force comparison of the available data just to minimise the risk, provided you don't expect to be adding so many duplicates that this would bog the system down.
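A minimal sketch of that order of checks, assuming the first 50 kB is what is available (the index structure and names are illustrative):

import hashlib

PREFIX = 50 * 1024  # bytes of each file available to us

def candidate_key(size: int, prefix: bytes):
    # Cheap lookup key: total file size plus a hash of the available prefix.
    return (size, hashlib.sha1(prefix[:PREFIX]).hexdigest())

def is_probable_duplicate(size: int, prefix: bytes, index: dict):
    # index maps candidate_key -> the stored prefix bytes of a known file.
    stored = index.get(candidate_key(size, prefix))
    if stored is None:
        return False
    # Brute-force compare the available data to weed out hash collisions.
    return stored == prefix[:PREFIX]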
It depends on the file types, but in most cases false positives will be pretty rare.
You probably won't have any in Office and graphics files, and executables are supposed to have a checksum in the header.
I'd say that the most likely false positive you may encounter is in source code files. They change often, and it may happen that a programmer replaces a few symbols somewhere after the first 20K.
Other than that I'd say they are pretty unlikely.
Why not use a hash of the first 50 KB and then store the size on the side? That would give you the most security with what you have to work with (with that said, there could be totally different content in the files after the first 50 KB without you knowing, so it's not a really secure system).
I find this problematic. It's likely that you would catch most duplicates with this method, but the possibility of false positives is huge. What about two versions of a 5MB XML document whose last chapter is modified?

How can I write a program that can recover files in FAT32

How can I write a program that can recover files in FAT32?
This is pretty complex, but FAT32 is very well documented. I once wrote a tool for direct FAT32 access using only these resources:
http://en.wikipedia.org/wiki/File_Allocation_Table
http://support.microsoft.com/kb/154997/
http://www.microsoft.com/whdc/system/platform/firmware/fatgen.mspx
But I've never actually tried to recover files. Whether you will successfully recover a file depends on several factors (see the sketch after this list):
The file must still "exist" physically on the hard disk
You must know where the file starts
You must know what you are looking for (Headers..)
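As a small taste of what direct FAT32 access involves, here is a sketch that reads a few BPB (BIOS Parameter Block) fields from the first sector of a FAT32 volume; the offsets follow the Microsoft FAT specification linked above, and the device path is only an example (raw access needs administrator rights):

import struct

def read_fat32_bpb(device="\\\\.\\E:"):
    with open(device, "rb") as f:
        boot = f.read(512)  # the boot sector holds the BPB
    bytes_per_sector, = struct.unpack_from("<H", boot, 11)
    sectors_per_cluster = boot[13]
    reserved_sectors, = struct.unpack_from("<H", boot, 14)
    num_fats = boot[16]
    fat_size_sectors, = struct.unpack_from("<I", boot, 36)
    root_cluster, = struct.unpack_from("<I", boot, 44)
    # The data region starts after the reserved sectors and the FATs;
    # cluster 2 is the first cluster of that region.
    first_data_sector = reserved_sectors + num_fats * fat_size_sectors
    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "root_cluster": root_cluster,
        "first_data_sector": first_data_sector,
    }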
It depends on what happened to the files you're trying to recover. The data may still be on the partition, or it could be overwritten by now. There are a lot of pre-written solutions; a simple Google search should give you a plethora of software that can try to recover the data, but none of it is 100% sure to get the files back. If you really want to recover them yourself, you'll need to write something that reads the raw partition and ignores the deleted-file markers.
Here is a program (written by Thomas Tempelman - this guy is great) that might help you out. You can make a copy of the partition, ignoring corrupt bits, then operate on the copy so you don't mess anything up, and you may also be able to recover the data directly with it.
I think you are referring to data carving, that is, reading the physical device and reconstructing previously unlinked files based on some knowledge (e.g. when you find the two letters PK, it's highly probable that a ZIP archive follows; the same goes for JFIF and JPEG).
In this case, I suggest you study the source code of PhotoRec, a great (in my opinion, the best) open-source tool for data carving.
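To make the idea concrete, here is a toy carving sketch that scans a raw image for two well-known signatures (PK\x03\x04 for ZIP local file headers, FF D8 FF for JPEG); real tools like PhotoRec also validate the structure that follows and work out where each file ends:

SIGNATURES = {
    b"PK\x03\x04": "zip",     # ZIP local file header
    b"\xff\xd8\xff": "jpeg",  # JPEG start-of-image marker
}

def carve_offsets(image_path):
    # Report every offset in the raw image where a known header occurs.
    with open(image_path, "rb") as f:
        data = f.read()  # fine for a sketch; real tools stream in chunks
    hits = []
    for magic, kind in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, kind))
            pos = data.find(magic, pos + 1)
    return sorted(hits)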