Getting different checksum values - numbers

I have uploaded a PDF document to a database as a BLOB along with its checksum. When I retrieve that BLOB back into a PDF file and recalculate the checksum, I get a different value than the original one. I would like to maintain document integrity using the checksum, but because of this issue I am not able to do so.
All of the above operations are done in Java.
Any help will be highly appreciated.
Thanks,
Kuldeep
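For what it's worth, a common cause of this is that the bytes hashed on the two sides are not the same byte stream (for example because the PDF is round-tripped through a String or a Writer). One way to check is to hash the exact bytes written to the BLOB and the exact bytes read back, with the same algorithm. A minimal sketch, assuming MD5 is the checksum in use (the class and method names below are made up for illustration):

import java.math.BigInteger;
import java.security.MessageDigest;

public class ChecksumUtil {
    // Hex MD5 digest over a byte array. Hash the bytes that go into the BLOB
    // and the bytes that come back out, not a re-written file on disk, so that
    // no encoding or line-ending conversion can change the input to the hash.
    static String md5Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(data);
        return String.format("%032x", new BigInteger(1, digest));
    }
}

If md5Hex(bytesWrittenToBlob) and md5Hex(bytesReadFromBlob) already differ, the data is being altered on the way into or out of the database rather than by the checksum code itself.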

Related

How to zip/unzip data bytea on Postgresql

As we know, Oracle provides the two functions UTL_COMPRESS.LZ_COMPRESS/LZ_UNCOMPRESS to compress raw input. But after moving to AWS PostgreSQL, there are no similar functions.
My question is: how can we compress and uncompress raw data on PostgreSQL? I'm willing to listen to any solutions. Thank you in advance.
PostgreSQL does that automatically for you through a feature called TOAST. Values in a table row exceeding a size of 2000 bytes will get compressed and, if the row still exceeds 2000 bytes after that, sliced up and stored out of line. All that happens transparently.
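If you do want explicit, application-side compression on top of that (for example to keep a format you control, closer to what UTL_COMPRESS gave you on Oracle), a minimal Java sketch using java.util.zip could look like the following; it assumes the payload is already available as a byte array before it is written to a bytea column:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ByteaCompression {
    // DEFLATE-compress a byte array before storing it in a bytea column.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(); throws if the stored bytes are not valid DEFLATE data.
    static byte[] decompress(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            out.write(buffer, 0, inflater.inflate(buffer));
        }
        inflater.end();
        return out.toByteArray();
    }
}

Note that data compressed this way is opaque to the server, so it cannot be queried or manipulated inside PostgreSQL; if transparent storage compression is all you need, the TOAST behaviour described above already covers it.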

Understanding mongodb md5 key in GridFS files collection

Please explain the usage of md5 key in GridFS files collection.
In the MongoDB GridFS reference it says: "An MD5 hash returned from the filemd5 API. This value has the String type." What is the need for this hash?
I tried to understand that too a few weeks ago, and I still have some doubts, but I'll report what I read here:
A kind of safe mode is built into the GridFS specification. When you save a file, an MD5 hash is created on the server. If you save the file in safe mode, an MD5 will be created on the client for comparison with the server version. If the two hashes don't match, an exception will be raised.
I guess it's a kind of check to see whether the file was updated correctly(?)
Edit: I also found this short note on the official MongoDB site, give it a look:
http://docs.mongodb.org/manual/reference/command/filemd5/
I use the md5 field to make sure I update the file only if it was changed without having to fetch the whole file from the DB and compare it.
I do db.col_name.find({id: myid}, {md5: 1}) so I fetch only the md5 field, I calculate the md5 of the new file, and update only if needed.
Fetching the whole file and doing a full data comparison could be very slow and expensive in terms of traffic.
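As a sketch of the same idea with the legacy MongoDB Java driver (the driver used elsewhere on this page), assuming the file is looked up by filename and the new content is available as a byte array:

import java.math.BigInteger;
import java.security.MessageDigest;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;

public class Md5Check {
    // Compare the md5 stored in the files collection with the md5 of the new
    // content. Only the files-collection document is read, never the chunks.
    static boolean needsUpdate(GridFS gridFs, String fileName, byte[] newContent) throws Exception {
        GridFSDBFile stored = gridFs.findOne(fileName);
        if (stored == null) {
            return true; // nothing stored yet, so an insert is needed
        }
        MessageDigest md = MessageDigest.getInstance("MD5");
        String newMd5 = String.format("%032x", new BigInteger(1, md.digest(newContent)));
        return !newMd5.equals(stored.getMD5()); // getMD5() returns the stored md5 field
    }
}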

Performance in MongoDB and GridFS

I am developing a plugin that uses MongoDB. The plugin has to store some .dcm files (DICOM files) in the database as binary files. After that, the plugin has to store the metadata of each file and be able to run queries on that metadata only.
Naturally, I chose GridFS to answer my problem, because I can use the same file to store the binary data in the chunks collection and the metadata in the metadata field of the files collection (and bypass the 16 MB document size limit of MongoDB).
But another problem comes up. This solution would be great, but I am not storing the binary data and the metadata at the same time. Let me explain: first I store the binary file, and after that I retrieve the file, read the metadata from it, and store the metadata into the same file. This is an obligation for me for external reasons. So I lose a lot of time retrieving the file and storing it again. To update the metadata of a file that is already stored, I am using this code:
GridFSDBFile file = saveFs.findOne(uri.getFileName());
if (file == null) {
    return false;
} else {
    file.setMetaData(new BasicDBObject());  // replace the metadata on the stored file
    file.save();                            // rewrites the files-collection document
    return true;
}
The main problem is that I have to find the file before modifying it and then store it AGAIN!
So my first question is: Is there a better way to retrieve a file from the database than findOne(String fileName)? Is the method findOne(ObjectId id) faster? (I don't think so, because I think fileName is already indexed by default, isn't it?)
I have tried another way to do it. To bypass this problem, I decided to store 2 different files, one for the binary data and one for the metadata. In this case, I don't lose time retrieving the file from the database. But I end up with twice as many files... I'm almost sure there is a better way to do it!
So my second question: Do you think I should use 2 different collections? One that uses GridFS to store the binary data, and another one that uses classic MongoDB storage (or GridFS) to store only the metadata?
Thank you a lot for reading and for your answers :).
For your first question, both the _id and filename fields are indexed by default. While the _id field is unique, filename is not. So if you have files with the same filename, getting a file by filename will be relatively slower than getting it by the _id field.
For your second question, you can always attach metadata to any GridFS file you insert, so you don't need anything besides GridFS. Use GridFS to insert the data, but just before inserting, assign your metadata to the file you are about to insert. That way you can query files using the metadata. If the metadata fields are fixed for all documents, you can have those fields indexed too, and queryable of course.
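A rough sketch of that approach with the legacy MongoDB Java driver, where the metadata field names (patientId, modality) are only placeholders for whatever DICOM attributes you extract:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.gridfs.GridFSInputFile;

public class DicomStore {
    // Store the binary data and its metadata in a single pass.
    static void storeWithMetadata(DB db, byte[] dicomBytes, String fileName) {
        GridFS gridFs = new GridFS(db);
        GridFSInputFile file = gridFs.createFile(dicomBytes);
        file.setFilename(fileName);
        file.setMetaData(new BasicDBObject("patientId", "12345").append("modality", "CT"));
        file.save(); // writes the chunks and the files document, metadata included
    }

    // Query by metadata without touching the chunks collection.
    static GridFSDBFile findByPatient(DB db, String patientId) {
        GridFS gridFs = new GridFS(db);
        return gridFs.findOne(new BasicDBObject("metadata.patientId", patientId));
    }
}

This avoids the store-retrieve-store round trip entirely when the metadata is known up front; if it is only known later, updating just the files document (as in the question's snippet) is still far cheaper than re-storing the binary data.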

Can i store sqlite db as zip file in iphone application

My SQLite file has a size of 7 MB and I want to reduce its size. How can I do that? When I simply compress it, it comes down to only about 1.2 MB. Can I compress my mydb.sqlite into a zip file? If that is not possible, is there any other way to reduce the size of my SQLite file?
It is possible to compress it beforehand, but it is largely redundant. You will compress your binary before distribution, Apple distributes your app through the store compressed, and compressing an already-compressed file is fruitless. Thus, any work you do to compress beforehand should not have much of an effect on the resulting size of your application.
Without details of what you are storing in the DB it's hard to give specific advice. The usual generalities of DB design apply: normalise your database. For example:
reduce/remove repeating data. If you have text/data that is repeated, store it once and use a key to reference it.
If you are storing large chunks of data, you might be able to zip and unzip these in and out of the database in your app code rather than trying to zip the DB.

How do I extract Word documents from data recovered from USB device?

I have been able to copy the raw data from an otherwise inaccessible USB drive into a monolithic file of about 250MB. Somewhere in that blob of bytes are about 40 Word documents.
Where do I find documentation about the internal structure of Word documents such that I can parse the byte-stream, recognise where a Word doc starts and finishes and extract a copy?
Are there any libraries in any programming language specific to this task?
Can anyone suggest an already existing software solution to this issue?
Two approaches:
You can mount files as volumes in Linux. Provided your binary blob isn't too corrupted, you'll probably be able to walk the filesystem structures to find out where your files are located. Is (was) it a FAT partition or NTFS?
If that doesn't work, I'd look for this string of bytes:
D0 CF 11 E0 A1 B1 1A E1
These are the "magic bytes" at the start of legacy Office document files. They might occur randomly in other data, but it's a start. You're going to run into MAJOR issues if the files are fragmented.
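As a hedged illustration of that signature search (the input path and output format are arbitrary), a small Java scan over the recovered image could look like this; every hit is only a candidate offset, since the eight bytes can occur by chance and fragmented files will not follow the header contiguously:

import java.nio.file.Files;
import java.nio.file.Paths;

public class SignatureScan {
    // The OLE2 / legacy .doc header signature mentioned above.
    private static final byte[] SIG = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    public static void main(String[] args) throws Exception {
        byte[] blob = Files.readAllBytes(Paths.get(args[0])); // the ~250MB image fits in memory
        for (int i = 0; i + SIG.length <= blob.length; i++) {
            boolean hit = true;
            for (int j = 0; j < SIG.length; j++) {
                if (blob[i + j] != SIG[j]) { hit = false; break; }
            }
            if (hit) {
                System.out.printf("possible .doc header at offset %d (0x%X)%n", i, i);
            }
        }
    }
}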
Also, try to recreate pieces of the document(s) in Word as-is, save them to a file and extract chunks to search for in the blob (using a binary grep or similar). Provided you have information from all parts of the file, you should be able to work out WHERE in the blob the pieces are. Piecing them back into a working DOC binary seems far-fetched, but recovering the rest of the text shouldn't be impossible.
The Apache POI project has a library for reading and writing all kinds of MS Office docs. If the files are in the newer XML-based OOXML format, you'll be looking for the start of a ZIP file instead, as the XML is compressed.