What can go wrong if I use MD5 instead of SHA256 for file names?

I'm developing a cache system. This system creates cache files for each API call.
Each API has a URL, of course, and I'm hashing that URL and using the digest as the name of the file.
The reason I can't use the URL directly as the file name is that many URLs contain Unicode characters, and once they are URL-encoded they become very long, so the OS won't let me save them.
Now, I just tested some hashing algorithms online, and MD5 generates shorter digests than SHA256: 32 hex characters versus 64.
So I want to use MD5. But do I lose anything? I mean, does it have any shortcomings that I might not see now?
Is there a greater chance of it generating duplicate digests for different URLs?
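For reference, a minimal sketch of the digest-length difference in Perl (the URL is made up; Digest::MD5 and Digest::SHA both ship with modern Perl):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Digest::SHA qw(sha256_hex);

my $url = 'http://example.com/api/items?id=42';   # hypothetical API URL

my $md5    = md5_hex($url);      # 32 hex characters (128 bits)
my $sha256 = sha256_hex($url);   # 64 hex characters (256 bits)

print "MD5:     $md5\n";
print "SHA256:  $sha256\n";

Either digest is a fixed length no matter how long the URL is, so both solve the file-name length problem.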

Related

System.IO - Does BinaryReader/Writer read/write exactly what a file contains? (abstract concept)

I'm relatively new to C# and am attempting to adapt a text encryption algorithm I designed in wxMaxima into a binary encryption program in C# using Visual Studio forms. Because I am new to reading and writing binary files, I don't know much about what happens when I try to read from or write to a file stream.
For example, instead of encrypting a text file as I've done in the past, say I want to encrypt an executable or any other form of binary file.
Here are a few things I don't understand:
When I open a file stream and use BinaryReader, will it read in an absolute duplicate of absolutely everything in the file? I want to be able to, for example, read in an entire file, delete the original file, then create a new file with the old name and write the entire binary stream back. Will this reproduce the original file exactly, or will there be some sort of corruption that must otherwise be accounted for?
Because it's an encryption program, I was hoping to add a feature that would low-level "format" the original file before deleting it, so it would be theoretically inaccessible by combing the physical data of a hard disk. If I use BinaryWriter to overwrite parts of the original file with gibberish, will it be put on the same spot on the hard disk, or will the file become fragmented and actually just redirect via the FAT to some other portion of the hard disk? Obviously there's no point in overwriting the original file with gibberish if it's not overwriting the original cluster on the hard disk.
For your first question: A BinaryReader is not what you want. The name is a bit misleading: it "Reads primitive data types as binary values in a specific encoding." You probably want a FileStream.
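The round-trip idea itself is language-neutral: read the raw bytes, write the same bytes back, and the file is reproduced exactly. A minimal sketch of that round trip in Perl (the file name is made up), using the :raw layer so no encoding translation happens:

use strict;
use warnings;

my $path = 'original.bin';   # hypothetical file to round-trip

open my $in, '<:raw', $path or die "open: $!";
my $bytes = do { local $/; <$in> };   # slurp every byte, untranslated
close $in;

unlink $path or die "unlink: $!";

open my $out, '>:raw', $path or die "open: $!";
print {$out} $bytes;                  # write the identical bytes back
close $out or die "close: $!";

As long as nothing applies a text-encoding layer, the rewritten file is byte-for-byte identical to the original.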
Regarding the second question: That will not be easy: please see the "How SDelete Works" section of SDelete for an explanation. Brief extract in case that link breaks in the future:
"Securely deleting a file that has no special attributes is relatively straight-forward: the secure delete program simply overwrites the file with the secure delete pattern. What is more tricky is securely deleting Windows NT/2K compressed, encrypted and sparse files, and securely cleansing disk free spaces.
Compressed, encrypted and sparse files are managed by NTFS in 16-cluster blocks. If a program writes to an existing portion of such a file NTFS allocates new space on the disk to store the new data and after the new data has been written, deallocates the clusters previously occupied by the file."

How to read a file from the disk if less than X days old, if older, refetch the html file

I wish to read an HTML file off the internet and cache it. Then, when I go back to it while debugging, I don't want to hammer the servers with the numerous requests I'll need. I don't want to get my IP banned for slamming the server over and over again just because I'm debugging. So my code needs to look something like:
if (!-e $file or -M $file > $days_old) {
    # fetch html file from internet
    # save file to disk
} else {
    # read it from the disk
}
Because there will be multiple files, I'll need to include a variable name in the file name so the file is unique and I can easily look it up again.
I just learned Perl this semester, and we only learned the basics and a bit of regex; once I get this working I should be mostly fine.
Thanks!
Use an existing module:
Cache::Cache
HTTP::Cache::Transparent
If you really want to implement your own, you'll want to look at the If-Modified-Since and ETag HTTP headers to determine when to re-fetch a file, rather than an arbitrary days_old number plucked out of thin air. You will also have to generate a unique filename, preferably with a hash function, while retaining the original URL to cater for hash collisions.
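A minimal sketch of that approach (the URL is made up): LWP::Simple's mirror() already implements the If-Modified-Since handshake, and Digest::MD5 gives the unique filename:

use strict;
use warnings;
use LWP::Simple qw(mirror);
use HTTP::Status qw(is_success);
use Digest::MD5 qw(md5_hex);

my $url  = 'http://example.com/page.html';      # hypothetical URL to cache
my $file = 'cache-' . md5_hex($url) . '.html';  # unique, filesystem-safe name

# mirror() sends If-Modified-Since based on the cache file's mtime and only
# downloads when the server has a newer copy; 304 means the cache is still fresh.
my $rc = mirror($url, $file);
die "Fetch failed with status $rc\n" unless is_success($rc) || $rc == 304;

# $file now holds a current copy; read it from disk as usual.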

Why do some sites have a md5 string on each file?

On some sites, in their download section, each file has an MD5. MD5 of what? I can't understand the purpose.
on phpBB.com for example:
Download phpBB 3.0.6 (zip)
Size: 2.30 MiB
MD5: 63e2bde5bd03d8ed504fe181a70ec97a
It is the MD5 hash of the file. The idea is that you can run MD5 against the downloaded file, then compare it against that value to make sure you did not end up with a corrupted download.
This is a checksum, for verifying that the file as-downloaded is intact, without transmission errors. If the checksum listing is on a different server than the download, it also may give a little peace of mind that the download server hasn't been hacked (with the presumption that two servers are harder to hack than one).
It's a hash of the file. Used to ensure file integrity once you download said file. You'd use an md5 checksum tool to verify the file state.
Sites will post checksums so that you can make sure the file downloaded is the same as the file they're offering. This lets you ensure that file has not been corrupted or tampered with.
On most unix operating systems you can run md5 or md5sum on a file to get the hash for it. If the hash you get matches the hash from the website, you can be reasonably certain that the file is intact. A quick Google search will get you md5sum utilities for Windows.
You might also see an SHA-1 hash sometimes. It's the same concept, but a different and more secure algorithm.
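For example, verifying the phpBB download from the question in Perl (the local file name is assumed; Digest::MD5 ships with Perl):

use strict;
use warnings;
use Digest::MD5;

my $path = 'phpBB-3.0.6.zip';   # assumed name of the downloaded file
open my $fh, '<:raw', $path or die "open: $!";
my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
close $fh;

print $digest eq '63e2bde5bd03d8ed504fe181a70ec97a'
    ? "OK: matches the published MD5\n"
    : "WARNING: mismatch; the file is corrupted or tampered with\n";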
This is an md5 hash of the entire binary contents of the file. The point is that if two files have different md5 hashes, they are different. This helps you determine whether a local file on your computer is the same as the file on the website, without having to download it again. For instance:
You downloaded your local copy somewhere else and think there might be a virus inside.
Your connection is lossy and you fear the file might be corrupted by the download.
You have changed the local file name and want to know which version you have.

How can I limit file types in CGI file uploads in Perl?

I am using CGI to allow the user to upload some files. I just want the user to be able to upload .txt or .csv files. If the user uploads a file in any other format, I want to put out an error message.
I saw that this can be done by javascript: http://www.codestore.net/store.nsf/unid/DOMM-4Q8H9E
But is there a better way to achieve this? Is there is some functionality in Perl that allows this?
The disclaimer on the site you link to is important:
Note: This is not entirely foolproof as people can easily change the extension of a file before uploading it, or do some other trickery, as in the case of the "LoveBug" virus.
If you really want to do this right, let the user upload the file, and then use something like File::MimeInfo::Magic (or file(1), the UNIX utility) to guess the actual file type. If you don't like the file type, delete the file and give the user an error message.
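A sketch of that check (the upload path is hypothetical, and File::MimeInfo::Magic is a CPAN module you would need to install):

use strict;
use warnings;
use File::MimeInfo::Magic qw(mimetype);

my $path = '/tmp/upload.dat';        # wherever you stored the upload
my $type = mimetype($path) // '';    # guessed from the file's contents, not its name

unless ($type eq 'text/plain' || $type eq 'text/csv') {
    unlink $path;
    die "Only .txt and .csv files may be uploaded (got '$type')\n";
}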
I just want the user to be able to upload .txt or .csv files.
Sounds easy, doesn't it? It's not. And then some.
The simple approach is just to test that the file ends in ‘.txt’ or ‘.csv’ before storing it on the filesystem. This should be part of a much more in-depth validation of what the filename is allowed to contain before you let a user-submitted filename anywhere near the filesystem.
Because the rules about what can go in a filename are complex on some platforms (especially Windows) it's usually best to create your own filename independently with a known-good name and extension.
In any case there is no guarantee that the browser will send you a file with a usable name at all, and even if it does there is no guarantee that name will have ‘.txt’ or ‘.csv’ at the end, even if it is a text or CSV file. (Some platforms simply do not use extensions for file typing.)
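For instance, a sketch combining the simple extension test with a self-generated name (the incoming name is made up; only the whitelisted extension survives):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $client_filename = 'report.CSV';   # hypothetical name supplied by the browser

# Whitelist the extension; discard everything else about the client's name
my ($ext) = $client_filename =~ /\.(txt|csv)\z/i
    or die "Only .txt and .csv files may be uploaded\n";

# A name we generated ourselves is safe to put on the filesystem
my $stored_name = md5_hex($client_filename . time) . '.' . lc $ext;
print "$stored_name\n";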
Whilst you can try to sniff the contents of the file to see what type it might be, this is highly unreliable. For example:
<html>,<body>,</body>,</html>
could be plain text, CSV, HTML, XML, or a variety of other formats. Better to give the user an explicit control to say what file type they're uploading (or use one file upload field per type).
Now here's where it gets really nasty. Say you've accepted the upload and stored it as /data/mygoodfilename.txt, and the web server is correctly serving it as the Content-Type ‘text/plain’. What do you think the browser interprets it as? Plain text? You should be so lucky.
The problem is that browsers (primarily IE) don't trust your Content-Type header, and instead sniff the contents of the file to see if it looks like something else. Serve the above snippet as plain text, and IE will happily treat it as HTML. This can be a huge problem, because HTML can include client-side scripts that will take over the user's access to the site (a cross-site-scripting attack).
At this point you might be tempted to sniff the file on the server-side, for example using the ‘file’ command, to check it doesn't contain ‘<html>’. But this is doomed to failure. The ‘file’ command does not sniff for all the same HTML tags as IE does, and other browsers sniff differently anyway. It's quite easy to prepare a file that ‘file’ will claim is not HTML, but that IE will nevertheless treat as if it is (with security-disaster implications).
Content-sniffing approaches such as ‘file’ will give you only a false sense of security. This is a convenience tool for loose guessing of filetypes and not an effective security measure.
At this point your last desperate possibilities are things like:
serving all user-uploaded files from a separate hostname, so that a script injection attack can't purloin the credentials of your main site;
serving all user-uploaded files through a CGI wrapper, adding the header ‘Content-Disposition: attachment’ so that browsers won't attempt to display them directly (a bare-bones sketch follows this list);
only accepting uploads from trusted users.
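A bare-bones version of that CGI wrapper (the stored path and download name are illustrative only):

#!/usr/bin/perl
use strict;
use warnings;

my $path = '/data/mygoodfilename.txt';   # the stored upload from the example above

open my $fh, '<:raw', $path or die "open: $!";

# 'attachment' tells the browser to save the file rather than render it,
# which sidesteps the content-sniffing problem described above.
print "Content-Type: application/octet-stream\r\n";
print "Content-Disposition: attachment; filename=\"download.txt\"\r\n";
print "\r\n";
print while <$fh>;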
On Unix the easiest way is to do as JRockway suggested. If you're not on Unix, your options are limited: you can examine the file extension and you can examine the contents to verify. I'm assuming for your specific case that you only want "*-separated value" text files, so one of the Text::CSV::* modules may be useful in verifying the file is the type you asked for.
Security for this operation is a whole other ball of wax.
try this:
my $file_name = "file.txt";
my $file_type = `file "$file_name"`;   # ask file(1) what the contents look like
return 0 unless $file_type =~ /(ASCII|text)/i;

Why are downloads sometimes tagged with MD5, SHA1 and other hash indicators?

I've seen this all over the place:
Download here! SHA1 =
8e1ed2ce9e7e473d38a9dc7824a384a9ac34d7d0
What does it mean? How does a hash come into play with downloads? What use can I make of it? Is this a legacy item from when you used to have to verify a checksum after you downloaded the whole file?
It's a security measure. It allows you to verify that the file you just downloaded is the one that the author posted to the site. Note that using hashes from the same website you're getting the files from is not especially secure. Often a good place to get them from is a mailing list announcement where a PGP-signed email contains the link to the file and the hash.
Since this answer has been ranked so highly compared to the others for some reason, I'm editing it to add the other major reason mentioned first by the other authors below, which is to verify the integrity of the file after transferring it over the network.
So:
Security - verify that the file that you downloaded was the one the author originally published
Integrity - verify that the file wasn't damaged during transmission over the network.
When downloading larger files, it's often useful to perform a checksum to ensure your download was successful and not mangled in transit. There are tons of freeware apps that can generate the checksum for you to validate your download. To me this is an interesting mainstreaming of procedures that popular mp3 and warez sites used back in the day when distributing files.
SHA1 and MD5 hashes are used to verify the integrity of files you've downloaded. They aren't necessarily a legacy technology, and can be used by tools like those in OpenSSL to verify whether or not a file has been corrupted or changed from its original.
It's to ensure that you downloaded the file correctly. If you hash the downloaded file and it matches the hash on the page, all is well.
A cryptographic hash (such as SHA-1 or MD5) allows you to verify that the file you have has been downloaded correctly and has not been tampered with.
To go along with what everyone here is saying I use HashTab when I need to generate/compare MD5 and SHA1 hashes on Windows. It adds a new tab to the file properties window and will calculate the hashes.
With a hash (MD5, SHA-1), one input maps to exactly one output, so if you download the file and calculate the hash again you should obtain the same output.
If the output is different, the file is corrupt.
use Digest::MD5 qw(md5_hex);
# placeholder variables: compare the digest you compute to the one on the page
my $valid_file = ( md5_hex($file_contents) eq $hash_from_page );