I'm searching for a way to recognise files by their content, so I need a method that is independent of the file's location, name, attributes, and so on.
Normally I would use a hash function like SHA-1 or MD5. The problem is the size of the files I want to identify: they are usually between 5 and 15 GB.
My approach with SHA-1 hashes is not a good solution. Hashing such big files takes several minutes... I need something much faster which makes it possible to identify a previously scanned file within seconds.
Is there another way than hashing files for such a demand?
My current Java code produces the same result as openssl sha1 <path> on my Mac:
MessageDigest md = MessageDigest.getInstance("SHA-1");
try (FileInputStream fis = new FileInputStream(f.getPath())) {
    byte[] dataBytes = new byte[8192];
    int nread;
    while ((nread = fis.read(dataBytes)) != -1) {
        md.update(dataBytes, 0, nread);
    }
}
byte[] mdbytes = md.digest();
// convert the digest bytes to hex format
StringBuilder sb = new StringBuilder();
for (int i = 0; i < mdbytes.length; i++) {
    sb.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
return sb.toString();
But actually I'm looking for something other than such hashing algorithms. Do you have an idea? :-)
BR
m4xy
Depending on what kind of files you are dealing with, it might suffice to use only parts of the file for the hash. E.g. if this is compressed image data, chances are very high that you will get distinct hashes for your files if you only hash the first few kilobytes (and maybe the last few kilobytes).
This might not work for uncompressed database dumps that always start identical.
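For illustration, here is a minimal Java sketch of that idea; the 64 KB sample size and the helper name partialSha1 are arbitrary assumptions, not recommendations:

import java.io.File;
import java.io.RandomAccessFile;
import java.security.MessageDigest;

static String partialSha1(File f) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
        byte[] buf = new byte[64 * 1024];
        // hash the first 64 KB
        int n = raf.read(buf);
        if (n > 0) md.update(buf, 0, n);
        // hash the last 64 KB (overlaps the first block for small files)
        raf.seek(Math.max(0, raf.length() - buf.length));
        n = raf.read(buf);
        if (n > 0) md.update(buf, 0, n);
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) sb.append(String.format("%02x", b));
    return sb.toString();
}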
As a first early-out pass, you can simply compare file sizes.
Once you've hashed a file, you can store the hash with the file's ctime. As long as the ctime hasn't changed, there's no need to rehash. (You could use mtime instead, but you'd need to rely on the programs which modify the files not manually setting mtime to what it was.)
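A sketch of that caching idea in Java follows; note that java.nio portably exposes mtime rather than ctime, so this uses mtime subject to the caveat above, and the in-memory map is a hypothetical stand-in for whatever persistent store you use:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;

// path -> (mtime at hashing time, hash); a stand-in for a persistent store
static final Map<String, Map.Entry<FileTime, String>> CACHE = new HashMap<>();

static String cachedHash(Path p) throws Exception {
    FileTime mtime = Files.getLastModifiedTime(p);
    Map.Entry<FileTime, String> e = CACHE.get(p.toString());
    if (e != null && e.getKey().equals(mtime)) {
        return e.getValue(); // timestamp unchanged, skip rehashing
    }
    String hash = partialSha1(p.toFile()); // or any full/partial hash, e.g. the sketch above
    CACHE.put(p.toString(), new SimpleEntry<>(mtime, hash));
    return hash;
}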
I'm using this code to listen to a port:
import std.socket;
import std.stdio;

TcpSocket listener; // assumed to be declared at module or class level

int start() {
    ushort port = 61888;
    listener = new TcpSocket();
    assert(listener.isAlive);
    listener.blocking = false;
    listener.bind(new InternetAddress(port));
    listener.listen(10);
    writefln("Listening on port %d.", port);

    enum MAX_CONNECTIONS = 60;
    auto socketSet = new SocketSet(MAX_CONNECTIONS + 1);
    Socket[] reads;

    while (true) {
        socketSet.add(listener);
        foreach (sock; reads)
            socketSet.add(sock);
        Socket.select(socketSet, null, null);
    }
    return 0;
}
As far as I know, sockets hand you the bytes as they are. I want to find a way to convert these bytes (which are essentially SQL requests) to strings. How can I do so, given that the input is UTF-8, a variable-width encoding?
You seem to have a few questions here.
How do I get chars from bytes?
Cast them with cast(char[]) st. This aliases the bytes, giving you a slice of the exact same data without a new allocation. You are not yet asserting that the bytes are valid UTF-8, and autodecoding or other parts of your program might complain if they aren't. You can run them through std.utf.validate if you want.
Or do basically the same thing with std.string.assumeUTF(st), which additionally asserts that the data is valid UTF, in debug builds only.
How do I get a string from char[]?
You can unsafely alias the char[] with std.exception.assumeUnique(st), or you can allocate an immutable copy with st.idup or std.utf.toUTF8(st).
What if my fixed buffer of bytes contains invalid UTF-8 -- because it got cut off?
If that's a risk, you can use the low-level std.utf tools (decodeFront combined with catching UTFException is one way) to peel off the valid UTF-8 and then check whether you have bytes remaining, or to check that the end of the input is valid UTF-8.
How do I know if I've gotten a complete SQL statement with my fixed buffer socket I/O?
Instead of just passing the raw SQL statement over the line, you can define a network protocol that includes information like the statement size, or that has 'end of statement' markers you can scan for; a sketch follows below.
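The framing idea is language-agnostic; as an illustration only, here is what a length-prefixed read could look like, sketched in Java (the 4-byte big-endian size prefix is an assumed convention, not part of any existing protocol):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// reads one length-prefixed statement: a 4-byte size, then exactly that many bytes
static String readStatement(InputStream in) throws IOException {
    DataInputStream din = new DataInputStream(in);
    int size = din.readInt();        // statement size written by the peer
    byte[] payload = new byte[size];
    din.readFully(payload);          // blocks until the whole statement has arrived
    // the payload is now a complete statement, so UTF-8 decoding cannot cut a code point in half
    return new String(payload, StandardCharsets.UTF_8);
}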
I have a cheat sheet for string type conversions, which links to a file of more elaborate unittests.
Try this:
import std.exception: assumeUnique;
string s = assumeUnique(cast(char[]) ubyteArray);
This question already has answers here:
What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?
In my app, I download videos from the Amazon S3 cloud to the sandbox. In order to make sure that the downloaded files are not corrupt, I compare the ETag of the object (delivered by Amazon) with the MD5 hash of the downloaded object in the local file system. For small videos (< 5 MB) my algorithm works fine: ETag and MD5 hash are identical.
For bigger files, the two values no longer match. As far as I know, Amazon generates the ETag differently for files > 5 MB: the ETag then has a trailing hyphen followed by a number (maybe it's the number of chunks?):
8c18c4ed68bc9db377cb2d3225c0ee31-4
On the Internet, I could find no solution or code snippet that calculates the matching hash for bigger files.
To calculate the MD5 hash, I tried both
localData.md5().toHexString() // CryptoSwift
and
var md5: String? {
    let hash = localData.withUnsafeBytes { (bytes: UnsafePointer<UInt8>) -> [UInt8] in
        // note: UnsafePointer<UInt8>, not UnsafePointer<Data>
        var hash = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
        CC_MD5(bytes, CC_LONG(localData.count), &hash)
        return hash
    }
    return hash.map { String(format: "%02x", $0) }.joined()
}
Does anyone have an idea how to resolve this?
Maybe I should focus on another approach, for example checking whether the downloaded video can be opened?
I think a more viable strategy would be to store a pre-calculated hash in your structured response (you most likely have a JSON, XML, <insert your favourite wire format here> that references the S3 URL, don't you?).
{
    "url": "https://.../myfile.mpeg",
    "sha256": "9e7bf344f14a1fd2f98abbd736fa3c777ef6088e9b964858bbb524e88322a938"
}
Relying on S3's ETag generation algorithm will break whenever they decide to change the implementation. Plus, CDNs usually handle ETags poorly, and ETags tend to differ from mirror to mirror (I worked at a company that rolled a private CDN where that was the case). So if you decide to move away from S3, your logic may break as well.
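For illustration, verifying a download against such a stored hash is then straightforward; a minimal sketch in Java (the method name verifyDownload and the use of SHA-256 simply mirror the example response above):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

// returns true if the downloaded file matches the hash from the structured response
static boolean verifyDownload(Path file, String expectedSha256) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
        in.transferTo(OutputStream.nullOutputStream()); // stream the whole file through the digest
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) sb.append(String.format("%02x", b));
    return sb.toString().equalsIgnoreCase(expectedSha256);
}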
Problem: limit the download rate for binary files.
def test = {
  Logger.info("Call test action")
  val file = new File("/home/vidok/1.jpg")
  val fileIn = new FileInputStream(file)
  response.setHeader("Content-Type", "application/force-download")
  response.setHeader("Content-Disposition", "attachment; filename=\"1.jpg\"")
  response.setHeader("Content-Length", file.length + "")
  val bufferSize = 1024 * 1024
  val bb = new Array[Byte](bufferSize)
  val bis = new java.io.BufferedInputStream(fileIn)
  var bytesRead = bis.read(bb, 0, bufferSize)
  while (bytesRead > 0) {
    // sleep(1000)?
    response.writeChunk(bb.take(bytesRead)) // write the bytes actually read, not the count
    bytesRead = bis.read(bb, 0, bufferSize)
  }
  bis.close()
}
But it's working only for text files. How can I make it work with binary files?
You've got the basic idea right: each time you've read a certain number of bytes (which are stored in your buffer) you need to:
evaluate how fast you've been reading (= X B/ms)
calculate the difference between X and how fast you should have been reading (= Y ms)
sleep(Y) on the downloading thread if needed to slow the download down (see the sketch below)
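As a rough Java sketch of that loop (the maxBytesPerSecond parameter and the helper name throttledCopy are hypothetical, not from any library):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// copies in to out, sleeping as needed to stay under maxBytesPerSecond
static void throttledCopy(InputStream in, OutputStream out, long maxBytesPerSecond)
        throws IOException, InterruptedException {
    byte[] buf = new byte[8192];
    long start = System.nanoTime();
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
        total += n;
        // how long the transfer *should* have taken so far at the target rate
        long expectedMillis = total * 1000 / maxBytesPerSecond;
        long actualMillis = (System.nanoTime() - start) / 1_000_000;
        if (expectedMillis > actualMillis) {
            Thread.sleep(expectedMillis - actualMillis); // ahead of schedule, slow down
        }
    }
}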
There's already a great question about this right here that should have everything you need. I think especially the ThrottledInputStream solution (which is not the accepted answer) is rather elegant.
A couple of points to keep in mind:
Downloading using one thread for everything is the simplest way, but it's also the least efficient if you want to keep serving requests.
Usually, you'll want to at least offload the actual downloading of a file to its own separate thread.
To speed things up: consider downloading files in chunks (using HTTP Content-Range) and Java NIO. However, keep in mind that this will make things a lot more complex (a minimal Range-request sketch follows below).
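For what it's worth, fetching a single chunk with a Range header could look roughly like this (the URL and byte range are placeholders; the 206 Partial Content behaviour is standard HTTP):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// fetches bytes [from, to] of a resource using an HTTP Range request
static byte[] fetchRange(String url, long from, long to) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Range", "bytes=" + from + "-" + to);
    try (InputStream in = conn.getInputStream()) {
        return in.readAllBytes(); // server replies 206 Partial Content with just this range
    } finally {
        conn.disconnect();
    }
}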
I wouldn't implement something that any good webserver should be able to do for me. In enterprise systems, this kind of thing is normally handled by a web entry server or firewall. But if you have to do this, then the answer by tmbrggmn looks good to me. NIO is a good tip.
I'm looking for some sample code to show me how to add metadata to the wav files we create.
Anyone?
One option is to add your own chunk with a unique id. Most WAV players will ignore it.
Another idea would be to use a labl chunk, associated with a cue point set at the beginning or end of the file. You'd also need a cue chunk. See here for a reference.
How to write the data is simple:
Write "RIFF".
Save the file position.
Write 4 bytes of 0's as a placeholder.
Write all the existing chunks. Keep count of bytes written.
Add your chunk. Be sure to get the chunk size right. Keep count of bytes written.
Rewind to the saved position. Write the new size (as a 32-bit number).
Close the file.
It's slightly more complicated if you are adding things to an existing list chunk, but the same principle applies.
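A minimal Java sketch of that size-patching procedure (the chunk contents are placeholders, and the little-endian size field follows the RIFF convention):

import java.io.RandomAccessFile;

// writes a WAV file from existing chunk data plus a custom chunk, patching the RIFF size
static void writeWavWithChunk(String path, byte[] existingChunks, byte[] myChunk) throws Exception {
    try (RandomAccessFile out = new RandomAccessFile(path, "rw")) {
        out.writeBytes("RIFF");
        long sizePos = out.getFilePointer(); // save position of the size field
        out.writeInt(0);                     // 4 bytes of 0's, patched below
        long count = 0;
        out.writeBytes("WAVE");              // RIFF form type
        count += 4;
        out.write(existingChunks);           // all the existing chunks
        count += existingChunks.length;
        out.write(myChunk);                  // your chunk, with its own size field filled in
        count += myChunk.length;
        out.seek(sizePos);                   // rewind to the saved position
        out.writeInt(Integer.reverseBytes((int) count)); // RIFF sizes are little-endian
    }
}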
Maybe the NIST file format will give you what you want:
NIST
Here is a lib that could help, but I'm afraid it looks old: NIST Lib
I can't find more useful information right now on how exactly to use it, and I'm afraid the information papers from my company must stay there. :/
Try the code below:
private void WaveTag()
{
    string fileName = "in.wav";
    WaveReadWriter wrw = new WaveReadWriter(File.Open(fileName, FileMode.Open, FileAccess.ReadWrite));
    // removes existing INFO tags from the audio stream
    wrw.WriteInfoTag(null);
    // writes new INFO tags into the audio stream
    Dictionary<WaveInfo, string> tag = new Dictionary<WaveInfo, string>();
    tag[WaveInfo.Comments] = "Comments...";
    wrw.WriteInfoTag(tag);
    wrw.Close();
    // reads the INFO tags back from the audio stream
    WaveReader wr = new WaveReader(File.OpenRead(fileName));
    Dictionary<WaveInfo, string> dir = wr.ReadInfoTag();
    wr.Close();
    if (dir.Count > 0)
    {
        foreach (string val in dir.Values)
        {
            Console.WriteLine(val);
        }
    }
}
from http://alvas.net/alvas.audio,articles.aspx#id3-tags-for-wave-files
If you examine the WAV file spec, you'll see that there does not seem to be room for annotations of any kind. An option would be to wrap the WAV file in your own format that includes custom information, but you would in effect be creating a whole new format that would not be readable by users who do not have your app. You might be OK with that, though.
I have some code that downloads a plist from a web server and stores it in the documents directory of the phone. My concern is that if the file becomes corrupt, it will affect the stability and user experience of the app.
I am coding defensively in the data-reading parts of the app, but I wondered what advice is out there for checking the integrity of the file in the first place, before the old one is overwritten. I am thinking of implementing some sort of computed value which is also stored as a key in the plist, for example.
Any thoughts about making this as robust as possible would be greatly appreciated.
Best regards
Dave
Have a look at CommonCrypto/CommonDigest.h.
The CC_MD5(const void *data, CC_LONG len, unsigned char *md); function computes an MD5 hash.
@implementation NSData (MD5)

- (NSString *)md5
{
    unsigned char digest[CC_MD5_DIGEST_LENGTH];
    CC_MD5([self bytes], (CC_LONG)[self length], digest);
    NSString *s = [NSString stringWithFormat:@"%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
                   digest[0], digest[1],
                   digest[2], digest[3],
                   digest[4], digest[5],
                   digest[6], digest[7],
                   digest[8], digest[9],
                   digest[10], digest[11],
                   digest[12], digest[13],
                   digest[14], digest[15]];
    return s;
}

@end
As part of deploying the files on the server, you can use OpenSSL to compute the hashes. The openssl md5 filename command computes an MD5 hash for a file and can be integrated into a script.
Then after your application has downloaded a file, it computes the hash of what's been downloaded and compares it to the hash stored on the server.
Obviously, if you want to ensure the integrity of a plist file, this plist cannot contain its own hash.