Understanding mongodb md5 key in GridFS files collection

Please explain the usage of the md5 key in the GridFS files collection.
The MongoDB GridFS reference says: "An MD5 hash returned from the filemd5 API. This value has the String type." What is the need for this hash?

I tried to understand this a few weeks ago too, and I still have some doubts, but here is what I read:
A kind of safe mode is built into the GridFS specification. When you save a file, an MD5 hash is created on the server. If you save the file in safe mode, an MD5 hash will also be created on the client for comparison with the server version. If the two hashes don't match, an exception will be raised.
I guess it's a kind of check to see whether the file was uploaded correctly.
Edit: I also found this short description on the official MongoDB site; take a look:
http://docs.mongodb.org/manual/reference/command/filemd5/
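For illustration, here is roughly how you could call that command from the legacy Java driver (just a sketch; fileId is a placeholder for the _id of the files-collection document):

import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DB;

// Ask the server to (re)compute the file's MD5 from its chunks
CommandResult res = db.command(new BasicDBObject("filemd5", fileId)
        .append("root", "fs"));            // "fs" is the default GridFS prefix
String serverMd5 = res.getString("md5");   // compare against a locally computed hash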

I use the md5 field to make sure I update a file only if it has changed, without having to fetch the whole file from the DB and compare it.
I run db.col_name.find({_id: myId}, {md5: 1}) so I fetch only the md5 field, calculate the MD5 of the new file, and update only if the two differ.
Fetching the whole file and doing a full data comparison would be slow and expensive in traffic.
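A minimal sketch of that flow with the legacy Java driver (collection and variable names are illustrative only):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBObject;
import java.math.BigInteger;
import java.security.MessageDigest;

boolean fileChanged(DB db, Object myId, byte[] newContent) throws Exception {
    // The projection fetches only the md5 field, never the chunks
    DBObject doc = db.getCollection("fs.files").findOne(
            new BasicDBObject("_id", myId),
            new BasicDBObject("md5", 1));
    String storedMd5 = (String) doc.get("md5");

    // MD5 of the new content, hex-encoded to match the stored format
    byte[] digest = MessageDigest.getInstance("MD5").digest(newContent);
    String newMd5 = String.format("%032x", new BigInteger(1, digest));

    return !newMd5.equals(storedMd5);
}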

Related

PostgreSQL: how to read WAL files in a human-readable way

Due to a bad foreign key, we had some inserts that were never written to the database, but I have the WAL files generated by these actions (just a few), so I'm wondering whether it is possible to decode them and read the original query values. I can't replay anything, because the database owner can only give me those WAL files.
Thanks in advance.
I have seen pg_waldump, but no example of how to use it, and I wanted to know if there is something that simply makes these binary files readable.
I would like output like "insert blabla into blabla".
In the end, with the hex editor plugin for VS Code, I have been able to read SOME values inside each WAL file. The problem is that the data is not ordered: I have a huge list of emails, then a huge list of messages, so I can't be sure which message corresponds to which email, and there are also boolean values that I can't find next to the messages. But yes, the data is there, in bad shape but somewhat readable.
No, that is not possible, since WAL contains low-level binary information that cannot readily be translated into SQL statements. Moreover, it is unclear what "non-written inserts" are – if they were not written, they were probably not logged in the WAL either.
Without a base backup, WAL files are pretty useless.
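For completeness: pg_waldump (called pg_xlogdump before PostgreSQL 10) will at least print each record in a human-readable summary form, showing the resource manager, transaction id, and a short description, though never reconstructed SQL. Assuming the segment files sit in /path/to/wal, an invocation looks roughly like this:

pg_waldump -p /path/to/wal 000000010000000000000001
pg_waldump -p /path/to/wal --rmgr=Heap 000000010000000000000001

The --rmgr=Heap filter restricts the output to heap records (the INSERT/UPDATE/DELETE traffic), which is usually what you want when hunting for lost rows.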

Reason behind MD5 Hashing?

I have sometimes seen, and have been recommended, to store strings and associative array keys as MD5 hash values. I have learnt about hashing from MIT OCW 6.046j, and it seems more like a scheme to store data in an efficient format for fast searching, and to prevent people from recovering the original.
But don't languages that support associative arrays / dictionaries do this internally? What additional advantage does the MD5 hash give?
Most languages do support this internally; for example, see Java's hashCode(), which is used when storing keys in a HashMap:
Returns a hash code value for the object. This method is supported for the benefit of hash tables such as those provided by HashMap.
But there are scenarios where you want to do it yourself.
Scenario 1 - using a hash as a key in a database:
Let's suppose you have a big NoSQL-ish database of letters and the metadata of those letters. You want to be able to find a letter's metadata quickly, without searching. What would your index be?
One option is a running index that's unrelated to the letter's content, but then you have to search the database before you can find a document's metadata. Another option is to create a signature for the document from its prefix (just one example of many), but some documents may share that prefix ("Dear John,").
So how about taking the entire document into account? That's where you can use MD5 as the row key for your documents.
In this scenario you are relying on having no collisions, and the argument in favour of this assumption usually mentions that your chances of running into a demented gorilla are (usually) greater. The Secure Hash Algorithm (SHA) family produces even fewer collisions.
I mention this since databases normally do not do this out of the box (frameworks may...).
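A minimal sketch of such a content-derived row key (plain JDK; names are illustrative):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Deterministic, fixed-length key derived from the full document text
static String rowKey(String document) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
            .digest(document.getBytes(StandardCharsets.UTF_8));
    return String.format("%032x", new BigInteger(1, digest)); // 32 hex chars
}

Two letters that share a prefix ("Dear John,") still get different keys as long as they differ anywhere in their content.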
Scenario 2 - One-way hash for password storage:
Note: this may no longer apply to md5, but it does for the SHA-family variants.
In this scenario, you want to store passwords in your database, but storing plain-text passwords has drawbacks if the database is compromised (users often share passwords across sites, which may lead to accounts on other sites being compromised as well). The idea is to store the hashed password, and when a user attempts to log in, compare only the hashes, never the password itself. This way you don't need the password stored locally, and it is much harder to crack.
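A sketch of that store-and-compare flow with SHA-256 (a SHA-family variant rather than md5, per the note above; a real deployment would also add a per-user salt and a deliberately slow hash, both omitted here for brevity):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

static byte[] hashPassword(String password) throws Exception {
    return MessageDigest.getInstance("SHA-256")
            .digest(password.getBytes(StandardCharsets.UTF_8));
}

// At log-in time only the hashes are compared; the plaintext is never stored
static boolean checkLogin(byte[] storedHash, String attempt) throws Exception {
    return Arrays.equals(storedHash, hashPassword(attempt));
}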

Performance in MongoDB and GridFS

I am developing a plugin that uses MongoDB. The plugin has to store some .dcm files (DICOM files) in the database as binary files. After that, the plugin has to store the metadata of each file and be able to query only this metadata.
Naturally, I chose GridFS for my problem, because I can use the same file to store the binary data in the chunks collection and the metadata in the metadata field of the files collection (and bypass MongoDB's document size limit).
But another problem has come up. This solution would be great, but I cannot store the binary data and the metadata at the same time. Let me explain: first I store the binary file, and after that I retrieve the file, read the metadata from it, and store the metadata in the same file. This is an obligation for me for some external reasons. So I lose a lot of time retrieving the file and storing it again. To update the metadata of a file that is already stored, I use this code:
import com.mongodb.BasicDBObject;
import com.mongodb.gridfs.GridFSDBFile;

// Look up the stored file by filename (indexed by default, but not unique)
GridFSDBFile file = saveFs.findOne(uri.getFileName());
if (file == null) {
    return false;
} else {
    // Replace the metadata and persist the files-collection document again
    file.setMetaData(new BasicDBObject());
    file.save();
    return true;
}
The main problem is that I have to find the file before modifying it, and then store it AGAIN!
So my first question is: is there a better way to retrieve a file from the database than findOne(String fileName)? Is findOne(ObjectId id) faster? (I don't think so, because I believe fileName is already indexed by default, isn't it?)
I have tried another approach. To bypass this problem, I decided to store 2 different files, one for the binary data and one for the metadata. In this case I don't lose time retrieving the file from the database, but I end up with twice as many files... I am almost sure there is a better way to do it!
So my second question: do you think I should use 2 different collections? One that uses GridFS to store the binary data, and one that uses classic Mongo storage (or GridFS) to store only the metadata?
Thank you a lot for reading and for your answers :).
For your first question: both the _id and filename fields are indexed by default. While _id is unique, filename is not, so if you have several files with the same filename, fetching by filename will be relatively slower than fetching by _id.
For your second question: you can attach metadata to any GridFS file you insert, so you don't need more than GridFS. Use GridFS to insert the data, but just before inserting, assign your metadata to the file. That way you can query files by their metadata. If the metadata fields are fixed for all documents, you can index those fields too, and of course query on them.
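To make that concrete, a sketch with the legacy Java driver (field names such as patientId are illustrative only):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSInputFile;
import java.io.InputStream;

void storeDicom(DB db, InputStream dicom, String name, String patientId) {
    GridFS fs = new GridFS(db);
    GridFSInputFile file = fs.createFile(dicom, name);
    // Attach the metadata BEFORE saving, so a single insert does everything
    file.setMetaData(new BasicDBObject("patientId", patientId));
    file.save();
}

// Later, query by metadata without ever touching the chunks collection:
// List<GridFSDBFile> matches = fs.find(new BasicDBObject("metadata.patientId", id));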

How to use MongoDB or other document database to keep video files, with options of adding to existing binary files and parallel read/write

I'm working on a video server, and I want to use a database to keep video files.
Since I only need to store simple video files with metadata I tried to use MongoDB in Java, via its GridFS mechanism to store the video files and their metadata.
However, there are two major features I need, and that I couldn't manage using MongoDB:
I want to be able to append to a previously saved video, since saving a video might be performed in chunks. I don't want to delete the binary I have so far, just append bytes to the end of the item.
I want to be able to read from a video item while it is being written. "Thread A" will update the video item, adding more and more bytes, while "Thread B" will read from the item, receiving all the bytes written by "Thread A" as soon as they are written/flushed.
I tried writing the straightforward code to do this, but it failed. It seems MongoDB doesn't allow multi-threaded access to the binary (even if only one thread does all the writing), nor could I find a way to append to a binary file: the Java GridFS API only gives an InputStream from an already existing GridFSDBFile; I cannot get an OutputStream to write to it.
Is this possible via MongoDB, and if so how?
If not, do you know of any other DB that might allow this (preferably nothing too complex such as a full relational DB)?
Would I be better off using MongoDB to keep only the metadata of the video files, and manually handle reading and writing the binary data from the filesystem, so I can implement the above requirements on my own?
Thanks,
Al
I've used Mongo GridFS for storing media files for a messaging system we built on Mongo, so I can share what we ran into.
Before I get into this: for your use-case scenario I would recommend not using GridFS, and instead using something like Amazon S3 (which has excellent REST APIs for multipart uploads) while storing the metadata in Mongo. This is the approach we settled on in our project after first implementing it with GridFS. It's not that GridFS isn't great; it's just not well suited to chunking/appending and rewriting small portions of files. For more info, here's a quick rundown of what GridFS is and isn't good for:
http://www.mongodb.org/display/DOCS/When+to+use+GridFS
Now, if you are bent on using GridFS, you need to understand how the driver and read/write concurrency work.
In Mongo (2.2) there is one write lock per database. This means that while you are writing, other threads are essentially locked out of performing operations. In real-life usage this is still very fast, because the lock yields while a chunk (256k) is being written, so your reader threads can get some data back. Please look at this concurrency video/presentation for more details:
http://www.10gen.com/presentations/concurrency-internals-mongodb-2-2
So if you look at my two links, question 2 is essentially answered. You should also understand a little about how Mongo writes large data sets and how page faults give reader threads a chance to get information.
Now let's tackle your first question. The Mongo driver does not provide a way to append data to an existing GridFS file; a write is meant to be a fire-and-forget, atomic-style operation. However, if you understand how the data is stored in chunks and how the checksum is calculated, you can do it manually by working with the fs.files and fs.chunks collections, as this poster explains here:
Append data to existing gridfs file
Going through those, you can see that it is possible to do what you want, but my general recommendation is to use a service (such as Amazon S3) that is designed for this type of interaction, instead of doing extra work to make Mongo fit your needs. Of course, you could also go to the filesystem directly, which would be the poor man's choice, but then you lose the redundancy, sharding, replication, etc. that you get with GridFS or S3.
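For the curious, a rough sketch of that manual append with the legacy Java driver (it assumes the file's current length is an exact multiple of chunkSize, and it leaves the stored md5 stale, so filemd5 would need to be re-run afterwards):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import org.bson.types.ObjectId;

void appendChunk(DB db, ObjectId filesId, byte[] data) {
    DBCollection files = db.getCollection("fs.files");
    DBCollection chunks = db.getCollection("fs.chunks");

    DBObject fileDoc = files.findOne(new BasicDBObject("_id", filesId));
    long length = ((Number) fileDoc.get("length")).longValue();
    long chunkSize = ((Number) fileDoc.get("chunkSize")).longValue();

    // Index of the next chunk, assuming the last chunk is exactly full
    int n = (int) (length / chunkSize);
    chunks.insert(new BasicDBObject("files_id", filesId)
            .append("n", n)
            .append("data", data));

    // Bump the recorded length so readers see the appended bytes
    files.update(new BasicDBObject("_id", filesId),
            new BasicDBObject("$set", new BasicDBObject("length", length + data.length)));
}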
Hope that helps.
-Prasith

How to search a value when the value is stored encrypted

In my database I store the student information in encrypted form.
Now I want to perform a search that lists all students whose name starts with "something" or contains "something".
Does anybody have an idea how to perform this type of query?
Please suggest.
Any decent encryption algorithm has as one of its core features the fact that it's impossible to deduce anything about the plaintext just by looking at the encrypted text. If you were able to tell, just by looking at the encrypted text, that the plaintext contained the string william, any attackers would be able to get that information just as easily, and you may as well not be encrypting at all.
The only way to perform this kind of operation on the data is to have access to the decrypted data. Using the model you've described - where the database only ever sees the encrypted data - it's not possible for the database to do this work, as the database has no access to the data it needs.
You need to have the data you're wanting to search on decrypted. The only complete way to do this is to have the application pull all the data out of the database, decrypt it, then do the filtering/sorting/whatever in your application. Obviously this is not going to scale well - but that's surely something you took into consideration when you decided to encrypt the data before putting it in the database.
Another option would be to store fragments of the data unencrypted. For example, if you have a first_name field and you want to be able to retrieve all records where first_name begins with a, have a first_name_first_letter field. Obviously this isn't going to scale well either - if you want to search for all records where first_name contains ill, you're going to have to store the complete first_name unencrypted.
There's a more serious problem with this solution though: by storing unencrypted data, you're leaking information about the encrypted data. The more unencrypted data you store, the more you leak. The more you leak, the more clues you're leaving for an attacker to defeat your encryption - plus, if you've stored the bit they were interested in unencrypted, they've already won.
Another question points to SQLCipher - it's an implementation of SQLite that does the encryption in the database. It seems to be targeted at your use case - it's even already used in a couple of iPhone apps.
However, it has the database doing the encryption, not the application. This lets the database also handle the decryption, and hence the database is able to inspect the contents of the fields and do the searching you're looking for.
If you're still insisting on not doing the encryption in the database, this won't work for you.
If all you want is the equivalent of "starts with" and "contains", you might be able to do something with a bit field and the bitwise logical operators.
Not sure on the exact syntax you'd use (I'm a bit rusty on SQL), but the idea would be to create an additional field for each record with a bit set for each letter that occurs in the name, then do something like:
SELECT * FROM someTable WHERE (searchValue & bitField) > 0
You then need to iterate over those records, decrypt them, and determine whether they actually meet the criteria you really wanted to search on (since you'd get a superset of the desired records back from the search).
You'd obviously be leaking some information about the contents of the field by doing this, but you can probably reduce that by encrypting the bitfields as well, or by turning on a few extra bits in each bitfield, so you can't tell "bob" from "bobby", for example.
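A small sketch of that bit-field trick in Java (this version tests full letter coverage, (rowMask & searchMask) == searchMask, which narrows the superset compared with the > 0 test above):

static int letterMask(String s) {
    int mask = 0;
    for (char c : s.toLowerCase().toCharArray()) {
        if (c >= 'a' && c <= 'z') {
            mask |= 1 << (c - 'a');   // one bit per distinct letter
        }
    }
    return mask;
}

// True if the row COULD contain the search term; decrypt and verify afterwards
static boolean maybeContains(int rowMask, String search) {
    int searchMask = letterMask(search);
    return (rowMask & searchMask) == searchMask;
}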
I'm curious about what sort of security goal you're trying to meet with this encryption, though. If you describe the model a bit more, you might get better answers.