I'm trying to store versioned content in MongoDB with GridFS. To do so, I add a version field to the metadata of the file I'm storing. This all works well. Now I want to get the latest version without knowing the version. Here: Find the latest version of a document stored in MongoDB - GridFs, someone mentions that findOne always returns the youngest (latest) file matching the query, which is what I want. But when I try this, I always get the first (oldest) file from findOne(). I'm using spring-data-mongodb version 1.5.0.RELEASE.
Here is my current code:
public void storeFileToGridFs(ContentReference contentReference, InputStream content) {
    Integer nextVersion = findLatestVersion(contentReference) + 1;
    DBObject metadata = new BasicDBObject();
    metadata.put("version", nextVersion);
    metadata.put("definitionId", contentReference.getContentDefinitionId());
    gridOperations.store(content, contentReference.getContentId().getValue(), metadata);
}
and to find the latest version:
private Integer findLatestVersion(ContentReference contentReference) {
    Query query = new Query(GridFsCriteria.whereFilename().is(contentReference.getContentId().getValue()));
    GridFSDBFile latestVersionRecord = gridOperations.findOne(query);
    if (latestVersionRecord != null) {
        Integer version = (Integer) latestVersionRecord.getMetaData().get("version");
        return version;
    }
    return 0;
}
But, as already mentioned, findLatestVersion() always returns 1 (except for the first time, when it returns 0).
Once I have this running, is there a way to retrieve only the metadata of the document? In findLatestVersion() it's not necessary to load the file itself.
findOne returns exactly one result, more specifically the first one in the collection matching the query.
I am not too sure whether the latest version is returned when using findOne. Please try find instead.
A more manual approach would be to query for the file name and filter the result set for the highest value of version.
In general, the version field only shows how often a document was changed. It is used for something called optimistic locking, which works by checking the current version of a document against the version carried by the changed document. If the version in the database is higher than the one in the document to be saved, another process has made changes to the document and an exception is raised.
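For readers on Spring Data, a minimal sketch of how such optimistic locking is typically declared (this assumes the @Version annotation from org.springframework.data.annotation is available in your Spring Data version; the entity name is made up):
import org.springframework.data.annotation.Id;
import org.springframework.data.annotation.Version;

public class VersionedDocument {

    @Id
    private String id;

    // Spring Data increments this field on every save and raises an
    // OptimisticLockingFailureException when a stale copy is saved.
    @Version
    private Long version;
}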
For storing versioned documents, git (via egit for example) might be a solution.
EDIT: After some quick research, here is how it works: file versioning should be done using the upload date that GridFS sets automatically on every stored file. Query for it, sort descending, and use the first result (a sketch follows below). You do not need to set the version manually any more.
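A minimal sketch of that approach, assuming the question's setup (Spring Data MongoDB's GridFsTemplate behind gridOperations, the automatically set uploadDate field, and a filename placeholder); find() is used because it honours the sort, which findOne() in this version does not (see the answer below):
Query query = new Query(GridFsCriteria.whereFilename().is(filename))
        .with(new Sort(Sort.Direction.DESC, "uploadDate"));
// find() returns the matches ordered by upload date, newest first
List<GridFSDBFile> candidates = gridOperations.find(query);
GridFSDBFile latest = candidates.isEmpty() ? null : candidates.get(0);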
I know it's been a while since this question was asked and I don't know whether the code was the same back then, but I think this information may help future readers:
Looking at the source code shows that findOne completely ignores the sorting part defined in the query, while find actually makes use of it.
So you need to make a normal query with find and then select the first object found (refer to Markus W Mahlberg's answer for more information).
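A minimal sketch of that find-then-pick-first approach, reusing the question's names (gridOperations, and the custom version field, which lives under metadata in fs.files, hence the metadata.version path):
Query query = new Query(GridFsCriteria.whereFilename().is(filename))
        .with(new Sort(Sort.Direction.DESC, "metadata.version"));
// find() applies the sort defined on the Query, so the first element is the latest version
List<GridFSDBFile> matches = gridOperations.find(query);
GridFSDBFile latestVersionRecord = matches.isEmpty() ? null : matches.get(0);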
Try adding sorting to the query, like this:
GridFSDBFile latestVersionRecord = template.findOne(
        new Query(GridFsCriteria.whereFilename().is(filename))
                .with(new Sort(Sort.Direction.DESC, "metadata.version")));
Once you have the GridFSDBFile, you can easily retrieve the metadata without loading the whole file:
DBObject metadata = latestVersionRecord.getMetaData();
Hope it helps!
Related
We would like to apply some auditing in our current project. For that we created a scenario, but I don't see how to make points 1 and 2 atomic.
Scenario
Every document has to have a timestamp that will serve as a version. When saving a document we will:
1. Verify the document was not changed - first compare the timestamp of the latest stored document docLatest with that of the document we would like to store docUpdated. The timestamps must be equal. If not, the save request is refused; if they match, go to the next point.
2. Update the document.
3. Create a diff with the previous doc - the latest document must be our last document. We will create a diff and store it.
I stumbled upon this idea once. My idea utilizes the long-polling technique. I am not going to tell you how to architect your data, but you can convert the date to a numeric value and compare by that.
For points 1 and 2, you can convert the Date format to a number; the schema will look something like this:
var document = {
    updatedAt: { type: Number, default: Date.parse(new Date()) }
}
Then, for every document submitted by the client, just check:
if (latestDocument.updatedAt - prevDocument.updatedAt > 0) {
    // if the latest timestamp is bigger than the previous one, store it in MongoDB
} else {
    // if the latest document is the same or even older, just ignore it
}
For number 3, I found myself asking: if the document has changed, do you even need to diff it? I decided to follow React/Flux's method: if the document has changed, just replace the whole document.
I started using MongoDB with GridFS some time ago, as I need to store documents that are bigger than 16 MB. Saving and loading documents has worked fine so far.
Next, I wondered how to specify queries for my files stored in GridFS. Let's assume I have instances of my class, which looks as follows:
public class Test {
    private String id;
    private int test;
}
If I want to query for a file which has a certain value for "test", how can I do that in GridFS? I know that I can save my file and store additional metadata.
Hence, I could store the test value in the metadata of the file.
To retrieve the value test of a file, I would do something like this:
String id = "...";
GridFS fs = new GridFS(mongoDB, "TESTFS");
BasicDBObject dbobj = new BasicDBObject();
dbobj.put("filename", id);
GridFSDBFile fsFile = fs.findOne(dbobj);
BasicDBObject metadata = (BasicDBObject) fsFile.get("metadata");
System.out.println("Test: " + metadata.get("test"));
Using metadata, is there an easier way to extract the "test" value of a certain file (without loading the complete file, deserializing the JSON string, etc.)?
The disadvantage of this approach is that I have to store the metadata explicitly. If I want to query for other information, I need to introduce this into the metadata for all my data. Is this right?
Is there an alternative to storing such information in the metadata? Or how can I query for specific information in GridFS?
This is obviously a very simple example. The same questions arise when trying to perform more complex queries.
As @wdberkeley noted, you can query db['fs'].files explicitly without fetching the whole file each time (fetching it each time is indeed not the best idea). You can do it using find on the above collection (the query is written as you would form it in the MongoDB console - please translate it to your language):
db['fs'].files.find({'metadata.test': '#### - the value you want to query for - $$$'})
The cursor that is returned in the MongoDB console and via most drivers is going to allow you to read metadata without ever retrieving the contents of the file.
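A hedged sketch of that with the Java driver used in the question (the TESTFS bucket from above is assumed, and 42 is just an illustrative value); only the files collection is touched, so the chunks holding the actual content are never loaded:
DBCollection files = mongoDB.getCollection("TESTFS.files");
// query on the metadata field and project only the metadata back
DBObject query = new BasicDBObject("metadata.test", 42);
DBObject fields = new BasicDBObject("metadata", 1);
DBObject fileDoc = files.findOne(query, fields);
if (fileDoc != null) {
    DBObject metadata = (DBObject) fileDoc.get("metadata");
    System.out.println("Test: " + metadata.get("test"));
}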
I need to create a document in mongodb and then immediately want it available in my application. The normal way to do this would be (in Python code):
doc_id = collection.insert({'name':'mike', 'email':'mike#gmail.com'})
doc = collection.find_one({'_id':doc_id})
There are two problems with this:
two requests to the server
not atomic
So, I tried using the find_and_modify operation to effectively do a "create and return" with the help of upserts like this:
doc = collection.find_and_modify(
    # so that no doc can be found
    query={'__no_field__': '__no_value__'},
    # If the <update> argument contains only field and value pairs,
    # and no $set or $unset, the method REPLACES the existing document
    # with the document in the <update> argument,
    # except for the _id field
    update={'name': 'mike', 'email': 'mike#gmail.com'},
    # since the document does not exist, this will create it
    upsert=True,
    # this will return the updated (in our case, newly created) document
    new=True
)
This indeed works as expected. My question is: is this the right way to accomplish a "create and return", or is there a gotcha that I am missing?
What exactly are you missing from a plain old regular insert call?
If it is not knowing what the _id will be, you could just create the _id yourself first and insert the document. Then you know exactly what it will look like. None of the other fields will be different from what you sent to the database.
If you are worried about guarantees that the insert will have succeeded you can check the return code, and also set a write concern that provides enough assurances (such as that it has been flushed to disk or replicated to enough nodes).
I have the following code in a collection:
class Author(Agent):
    def foo(self, ids):
        self.find_another_document_and_update_it(ids)
        self.processed = True
        self.save()

    def find_another_document_and_update_it(self, ids):
        for id in ids:
            documentA = Authors.objects(id=id)
            documentA.update(inc__mentions=1)
Inside find_another_document_and_update_it() I query the database and retrieve a document A, and then I increment a counter in A. Then in foo(), after calling find_another_document_and_update_it(), I also save the current document, let's say B. The problem is that, although I can see that the counter in A is actually increased, when self.save() is called document A is reset to its old value. I guess the problem has to do with a concurrency issue and how MongoDB deals with it. I appreciate your help.
In MongoEngine 0.5, save only updates fields that have changed - prior to that it saved the whole document, which would have meant the previous update in find_another_document_and_update_it would have been overwritten. In general, and as with all things Python, it's better to be explicit - so you might want to use update to update a document.
You should be able to update all mentions with a single update:
Authors.objects(id__in=ids).update(inc__mentions=1)
Regardless, the best way to update would be to call the global updates after self.save(). That way the mentions are only incremented after you've processed and saved any changes.
I'm using MongoDB to hold a collection of documents.
Each document has an _id (version) which is an ObjectId. Each document has a documentId which is shared across the different versions. This too is an ObjectId, assigned when the first document was created.
What's the most efficient way of finding the most up-to-date version of a document given the documentId?
I.e. I want to get the record where _id = max(_id) and documentId = x
Do I need to use MapReduce?
Thanks in advance,
Sam
Add an index containing both fields (documentId, _id) and don't use max (what for?). Query with documentId = x, order DESC by _id, and limit(1) the results to get the latest. Remember to use the proper sorting order for the index (DESC as well).
Something like this:
db.collection.find({documentId : "x"}).sort({_id : -1}).limit(1)
Another approach (more denormalized) would be to use another collection with documents like:
{
    documentId : "x",
    latestVersionId : ...
}
Using atomic operations would allow you to safely update this collection (a sketch follows below). Adding a proper index would make queries lightning fast.
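A minimal sketch of that idea with the classic Java driver (DBCollection is assumed; latestVersions, documentId and newVersionId are made-up names); a single upsert is atomic on the pointer document, so concurrent writers cannot leave it half-updated:
DBCollection latestVersions = db.getCollection("latestVersions");
DBObject query = new BasicDBObject("documentId", documentId);
DBObject update = new BasicDBObject("$set",
        new BasicDBObject("latestVersionId", newVersionId));
// upsert = true creates the pointer document on the first write, multi = false
latestVersions.update(query, update, true, false);
Reads then only need a single indexed lookup on documentId to find the id of the latest version.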
There is one thing to take into account - I'm not sure whether an ObjectID can always be safely used to order by for the latest version. Using a timestamp may be a more reliable approach.
I was typing the same as Daimon's first answer, using sort and limit. However, that is probably not recommended because of the way the _id is generated, especially with some drivers (which use random numbers instead of increments for the least significant portion). The ObjectId has second resolution [as opposed to something smaller, like millisecond] in its most significant portion, but the last part can be a random number. So if you had a user save twice in a second (probably not likely, but worth noting), you might end up with a slightly out-of-order latest document.
See http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification for more details on the structure of the ObjectID.
I would recommend adding an explicit versionNumber field to your documents, so you can query in a similar fashion using that field, like so:
db.coll.find({documentId: <id>}).sort({versionNum: -1}).limit(1);
Edit to answer the question in the comments:
You can store a regular DateTime directly in MongoDB, but it will only be stored with millisecond precision, in MongoDB's "DateTime" format. If that's good enough, it's the simpler thing to do.
BsonDocument doc = new BsonDocument("dt", DateTime.UtcNow);
coll.Insert (doc);
doc = coll.FindOne();
// see it doesn't have precision...
Console.WriteLine(doc.GetValue("dt").AsUniversalTime.Ticks);
If you want .NET DateTime (ticks)/Timestamp precision, you can do a bunch of casts to get it to work, like:
BsonDocument doc = new BsonDocument("dt", new BsonTimestamp(DateTime.UtcNow.Ticks));
coll.Insert (doc);
doc = coll.FindOne();
// see it does have precision
Console.WriteLine(new DateTime(doc.GetValue("dt").AsBsonTimestamp.Value).Ticks);
update again!
Looks like the real use for BsonTimestamp is to generate unique timestamps within a second's resolution. So you're not really supposed to abuse them as I have in the last few lines of code, and doing so will probably screw up the ordering of results. If you need to store the DateTime at a tick (100 nanosecond) resolution, you should probably just store the 64-bit int "ticks", which will be sortable in MongoDB, and then wrap it in a DateTime after you pull it out of the database again, like so:
BsonDocument doc = new BsonDocument("dt", DateTime.UtcNow.Ticks);
coll.Insert (doc);
doc = coll.FindOne();
DateTime dt = new DateTime(doc.GetValue("dt").AsInt64);
// see it does have precision
Console.WriteLine(dt.Ticks);