Is it possible to process objects in a Google Cloud Storage bucket in FIFO order? - google-cloud-storage

In my web app, I need to pull objects from gcs one by one and process them.
So the question is,
"How do I send a request to gcs to get the next unprocessed object?"
What I’d like to do is to simply rely on the sort order provided by gcs and then just process the objects in this sorted list one by one.
That way, I only need to keep track of the last processed item in my app.
I’d like to rely on the sort order provided by the timeCreated timestamp on each individual object in the bucket.
When I query my bucket via the JSON API, I notice that the objects are returned sorted by timeCreated from oldest to newest.
For example, this query ...
returns this list ...
{
"items": [
{
"name": "cars_train/00001.jpg",
"timeCreated": "2016-03-23T19:19:47.506Z"
},
{
"name": "cars_train/00002.jpg",
"timeCreated": "2016-03-23T19:19:49.320Z"
},
{
"name": "cars_train/00003.jpg",
"timeCreated": "2016-03-23T19:19:50.228Z"
},
{
"name": "cars_train/00004.jpg",
"timeCreated": "2016-03-23T19:19:51.377Z"
},
{
"name": "cars_train/00005.jpg",
"timeCreated": "2016-03-23T19:19:51.778Z"
},
{
"name": "cars_train/00006.jpg",
"timeCreated": "2016-03-23T19:19:52.817Z"
},
{
"name": "cars_train/00007.jpg",
"timeCreated": "2016-03-23T19:19:53.868Z"
},
{
"name": "cars_train/00008.jpg",
"timeCreated": "2016-03-23T19:19:54.925Z"
},
{
"name": "cars_train/00009.jpg",
"timeCreated": "2016-03-23T19:19:58.426Z"
},
{
"name": "cars_train/00010.jpg",
"timeCreated": "2016-03-23T19:19:59.323Z"
}
]
}
This sort order by timeCreated is exactly what I need, though I’m not certain whether I can rely on it always being true.
So, I could code my app to process this list by simply searching for the first timeCreated value greater than that of the last object that was processed.
The problem is this list can be very large and searching through a huge list every single time the user presses the NEXT button is too computationally expensive.
I would like to be able to specify in my query to gcs to filter the list so that I return only the single item that I need.
The API does allow me to set the maxResults returned to a value of 1.
However, I do not see an option that would allow me to return only objects whose timeCreated value is greater than the value I specified.
I think what I am trying to do is probably fairly common, so I’m guessing that a solution may exist for this problem.
One workaround for this problem is to physically move an object to another bucket once it has been processed.
That way the first item in the list would always be the next unprocessed one and I could simply send the request with maxResults=1.
But this adds complexity because it forces me to have 2 separate buckets for every project instead of 1.
Is there a way to filter this list of objects to only include ones whose timeCreated date is above a specified value?
In MySQL, it might be something like ...
SELECT name
FROM bucket
WHERE timeCreated > X
ORDER BY timeCreated
LIMIT 1

You can configure object change notifications on the bucket, and get a notification each time a new object arrives. That would allow you to process new objects without scanning a long listing each time. It also avoids the problem that listing a bucket is only eventually consistent (so, recently uploaded objects may not show up immediately when you list objects; I don't know if that's a problem for your app).
Details about object change notification are documented at https://cloud.google.com/storage/docs/object-change-notification.
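As a rough illustration (my sketch, not part of the original answer), the receiving end could be a small webhook like the one below. It assumes the notification body is the object resource JSON and uses the X-Goog-Resource-State header described in the documentation linked above; the endpoint path and the pending list are placeholders for whatever queueing your app actually uses.

import flask

app = flask.Flask(__name__)
pending = []  # stand-in for your app's queue of unprocessed objects

@app.route("/gcs-notifications", methods=["POST"])
def handle_notification():
    # "exists" indicates a new or updated object (see the linked docs).
    state = flask.request.headers.get("X-Goog-Resource-State")
    if state == "exists":
        resource = flask.request.get_json(silent=True) or {}
        # The body should carry the object resource, including name and timeCreated.
        pending.append((resource.get("timeCreated"), resource.get("name")))
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)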

Object listing in GCS is not sorted by timeCreated. Object listing results are always in alphabetical order by object name. In your example, the two orderings merely happen to coincide.
If you want to get a list of objects in the order they were uploaded, you must ensure that each object has a name alphabetically later than the name of any object uploaded before it. Even then, however, you must take care, as object listing is eventually consistent, which means that objects you upload may not immediately show up in a listing.
If some ordering of objects is critically important, it would be a good idea to maintain a separate index of the objects and their timestamps in a separate data structure, perhaps populated via object change notifications as Mike suggested.
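For example, if object names are assigned in upload order (as with the zero-padded cars_train/00001.jpg names above), a polling loop can resume from the last processed name, because listing is lexicographic. A minimal sketch with the current google-cloud-storage Python client, where the bucket name, prefix, and process() callback are placeholders:

from google.cloud import storage

def process_new_objects(bucket_name, last_processed_name, process):
    client = storage.Client()
    # Listing is alphabetical by name, so anything after the last processed
    # name is new, provided names are assigned in upload order.
    for blob in client.list_blobs(bucket_name, prefix="cars_train/"):
        if blob.name > last_processed_name:
            process(blob)
            last_processed_name = blob.name
    return last_processed_name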

Related

How to search values in real time on a badly designed database?

I have a collection named Company which has the following structure:
{
"_id" : ObjectId("57336ea1a7454c0100d889e4"),
"currentMonth" : 62,
"variables1": { ... },
...
"variables61": { ... },
"variables62" : {
"name" : "Test",
"email": "email#test.com",
...
},
"country" : "US",
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
Do you have any suggestions that would help me solve this problem? Any suggestion on the steps needed to set up the above system?
Usually with MongoDB you can add new fields to documents and existing applications would simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
Create a task that is regularly executed which goes through all documents in your collection, figures out the name for each document from its fields, then writes the name into a top-level field.
Add an index on that field.
In your search code, look up by the values of that field.
Compare the calculated name to the source-of-truth name. If different, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name and step 4 is not needed.
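A minimal sketch of steps 1-3 with PyMongo, under my own assumptions: the database name, the new companyName field, and deriving the name from the variables{currentMonth} sub-document (as in the example document above) are all placeholders to adapt.

from pymongo import MongoClient, ASCENDING

client = MongoClient()  # assumes a local mongod; adjust the URI for your deployment
companies = client["mydb"]["Company"]

# Step 1: backfill a top-level companyName field from the nested data.
for doc in companies.find({"companyName": {"$exists": False}}):
    month = doc.get("currentMonth")
    name = (doc.get("variables%s" % month) or {}).get("name")
    if name:
        companies.update_one({"_id": doc["_id"]}, {"$set": {"companyName": name}})

# Step 2: index the new field so lookups are fast.
companies.create_index([("companyName", ASCENDING)])

# Step 3: search against the indexed field.
results = companies.find({"companyName": "Test"})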
Using the change detection pattern with monstache, I was able to synchronise MongoDB with Elasticsearch in real time, applying a filter based on the current month and then mapping the variables to be indexed 🎊

MongoDB - how to reference children/nested document _id's within parent on insert

I am very new to MongoDb but the project I was just brought in on uses it to store message threads like this:
{
"_id": ObjectId("messageThreadId"),
"messages": [
{
"_id": ObjectId("messageId"),
"body": "Lorem ipsum..."
}, etc...]
"users": [
{
"_id": ObjectId("userId"),
"unreadMessages": ['messageId', 'messageId', etc...]
}
]
}
I need to use pymongo to insert brand new messageThreads which should (initially) contain a single message. However, I am not clear on how to construct the users.unreadMessages lists of messageIds (which should contain just the newly-created initial message). Is there a way of referencing the initial message's _id before/as it's created, from within the same document? Also worth noting that unreadMessages is a list of strings, not ObjectId()s.
Do I need to create the messageThread with the unreadMessages list empty, then go back and retrieve the initial message's _id that was just created, then update every unreadMessages in the list of users? It feels wrong to require multiple transactions for an insert, but this whole schema feels wrong to me.
As DaveStSomeWhere said, I ended up pre-generating the ObjectId and then using it in the document before insertion. This is what PyMongo does anyway when it inserts a document; see the relevant code in pymongo.collection.insert_one(). Thanks Dave.
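A minimal sketch of that approach with PyMongo; the database/collection names and the user _id are placeholders, and unreadMessages stores the stringified message _id to match the schema above.

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
threads = client["mydb"]["messageThreads"]

# Pre-generate the initial message's _id so it can be referenced elsewhere
# in the same document before insertion.
message_id = ObjectId()
existing_user_id = ObjectId()  # placeholder; in practice the real user's _id

thread = {
    "messages": [{"_id": message_id, "body": "Lorem ipsum..."}],
    "users": [
        {"_id": existing_user_id, "unreadMessages": [str(message_id)]},
    ],
}
threads.insert_one(thread)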

Inserting multiple key value pair data under single _id in cloudant db at various timings?

My requirement is to store JSON pairs coming from an MQTT subscriber at different times under a single _id in Cloudant, but I'm running into a problem when trying to insert a new JSON pair into an existing _id: it simply replaces the old one. I need at least 10 JSON pairs under one _id, inserted at different times.
First, you should think carefully about the architectural decision to update a particular document multiple times. In general this is discouraged, though it depends on your application. Instead, you could insert each new piece of information as a separate document and then use a map-reduce view to reflect the state of your application.
For example (I'm going to assume that you have multiple "devices", each with some kind of unique identifier, that need to add data to a cloudant DB)
PUT
{
"info_a":"data a",
"device_id":123
}
{
"info_b":"data b",
"device_id":123
}
{
"info_a":"message a"
"device_id":1234
}
Then you'll need a map function like
_design/device/_view/state
function (doc) {
    emit(doc.device_id, 1);
}
Then you can GET the results of that view to see all of the "info_X" data that is associated with the particular device.
GET account.cloudant.com/databasename/_design/device/_view/state
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1},
{"id":"eaa710a5fa1ff4ba6156c997ddf6099b","key":1234,"value":1}
]}
Then you can use the query parameters to control the output, for example
GET account.cloudant.com/databasename/_design/device/_view/state?key=123&include_docs=true
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1,"doc":
{"_id":"28324b34907981ba972937f53113ac3f",
"_rev":"1-bac5dd92a502cb984ea4db65eb41feec",
"info_b":"data b",
"device_id":123}
},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1,"doc":
{"_id":"d50553d206d722b960fb176f11841974",
"_rev":"1-a2a6fea8704dfc0a0d26c3a7500ccc10",
"info_a":"data a",
"device_id":123}}
]}
And now you have the complete state for device_id:123.
Timing
Another issue is the rate at which you're updating your documents.
The bottom-line recommendation is that if you are only updating the document about once per minute or less frequently, then it could be reasonable for your application to update a single document. That is, you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database. You must make sure that you are providing the most recent _rev of that document, and you should also check for conflicts that could occur if the document is being updated by multiple devices.
If you are acquiring new data for a particular device at a high rate, you'll likely run into conflicts very frequently -- because cloudant is a distributed document store. In this case, you should follow something like the example I gave above.
Example flow for the second approach outlined by @gadamcox for use cases where document updates are not required very frequently:
[...] you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database.
Your application first fetches the existing document by id: (https://docs.cloudant.com/document.html#read)
GET /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1"]
}
Then your application updates the document in memory
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1","2"]
}
and saves it in the database by specifying the _id and _rev (https://docs.cloudant.com/document.html#update)
PUT /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No":["1","2"]
}
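For completeness, here is a rough sketch of that GET / modify / PUT cycle in Python with the requests library. The account, database name, document id, and field names are placeholders taken from the example above, and real code should also handle a 409 Conflict response (stale _rev) by re-fetching and retrying.

import requests

BASE = "https://account.cloudant.com/databasename"  # placeholder account and database

def append_value(doc_id, new_value, auth):
    # Fetch the current document, including its _rev.
    doc = requests.get(f"{BASE}/{doc_id}", auth=auth).json()
    # Update it in memory.
    doc.setdefault("No", []).append(new_value)
    # Write it back; Cloudant rejects the PUT with 409 if the _rev is stale.
    resp = requests.put(f"{BASE}/{doc_id}", json=doc, auth=auth)
    resp.raise_for_status()
    return resp.json()

append_value("100", "3", auth=("apikey", "password"))  # placeholder credentials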

MongoDB selection with existing value

I'm using PyMongo to fetch data from MongoDB. All documents in the collection look like the structure below:
{
"_id" : ObjectId("50755d055a953d6e7b1699b6"),
"actor":
{
"languages": ["nl"]
},
"language":
{
"value": "nl"
}
}
I'm trying to fetch all the conversations where the property language.value is inside the property actor.languages.
At the moment I know how to look for all conversations with a constant value inside actor.languages (eg. all conversations with en inside actor.languages).
But I'm stuck on how to do the same comparison with a variable value (language.value) inside the current document.
Any help is welcome, thanks in advance!
db.testcoll.find({$where:"this.actor.languages.indexOf(this.language.value) >= 0"})
You could use $where provided your result set is small, but at any real size you could start seeing problems, especially since this query seems like one that needs to run in real time on a page, and the JS engine is single threaded, among other problems.
I would actually consider the better way in this case to be on the client side. It is quite straightforward: pull out records based on one of the values, then iterate and test the other (i.e. pull out documents whose language.value is nl and check whether actor.languages contains that value).
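A minimal PyMongo sketch of that client-side check (my example, with placeholder database and collection names):

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
conversations = client["mydb"]["conversations"]

# Pull candidates on the simple value, then verify the array membership in Python.
cursor = conversations.find({"language.value": "nl"})
matches = [
    doc for doc in cursor
    if doc["language"]["value"] in doc.get("actor", {}).get("languages", [])
]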
I would imagine you might be able to do this with the aggregation framework; however, at the moment you cannot use computed fields within $match. I would imagine it would look like this:
{$project:
    {language_value: "$language.value", languages: "$actor.languages"}
},
{$match: {language_value: {$in: "$languages"}}}
That is, if only $match allowed it. But there might be a way.

Logging file access with MongoDB

I am designing my first MongoDB (and first NoSQL) database and would like to store information about files in a collection. As part of each file document, I would like to store a log of file accesses (both reads and writes).
I was considering creating an array of log messages as part of the document:
{
"filename": "some_file_name",
"logs" : [
{ "timestamp": "2012-08-27 11:40:45", "user": "joe", "access": "read" },
{ "timestamp": "2012-08-27 11:41:01", "user": "mary", "access": "write" },
{ "timestamp": "2012-08-27 11:43:23", "user": "joe", "access": "read" }
]
}
Each log message will contain a timestamp, the type of access, and the username of the person accessing the file. I figured that this would allow very quick access to the logs for a particular file, probably the most common operation that will be performed with the logs.
I know that MongoDB has a 16Mbyte document size limit. I imagine that files that are accessed very frequently could push against this limit.
Is there a better way to design the NoSQL schema for this type of logging?
Let's first try to calculate the average size of one log record:
timestamp field name = 18 bytes, timestamp value = 8, user field name = 8, user value = 20 (assuming ~10 characters as a maximum or average), access field name = 12, access value = 10. So the total is about 76 bytes, which means roughly 220,000 log records fit under the 16 MB limit.
About half of that physical space is used by field names. If you rename them to timestamp = t, user = u, access = a, you will be able to store ~440,000 log items.
So I think that is enough for most systems. In my projects I always try to embed rather than create a separate collection, because it is a way to achieve good performance with MongoDB.
In the future you can move your log records into a separate collection. Also, for performance, you can keep something like the last 30 log records (simply denormalize them) in the file document for fast retrieval, in addition to the logs collection.
Also, if you go with one collection, make sure that you are not loading the logs when you don't need them (you can include/exclude fields in MongoDB). Also use $slice to do paging.
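For instance (my sketch, not from the original answer), paging through the embedded logs with a $slice projection in PyMongo could look like this, with placeholder database and collection names:

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
files = client["mydb"]["files"]

page, page_size = 2, 50  # fetch the third page of 50 log entries
doc = files.find_one(
    {"filename": "some_file_name"},
    {"filename": 1, "logs": {"$slice": [page * page_size, page_size]}},
)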
And one last thing: Enjoy mongo!
If you think the document size limit will become an issue, there are a few alternatives.
The obvious one is to simply create a new document for each log entry.
So you would have a collection "logs" with this schema:
{
"filename": "some_file_name",
"timestamp": "2012-08-27 11:40:45",
"user": "joe",
"access": "read"
}
A query to find which files "joe" read would be something like the following:
db.logs.find({user: "joe", access: "read"})
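As a usage sketch of that schema in PyMongo (the index on user and access is my suggestion, not part of the original answer, and the names are placeholders):

from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient()  # assumes a local mongod
logs = client["mydb"]["logs"]

# One document per access event.
logs.insert_one({
    "filename": "some_file_name",
    "timestamp": datetime.now(timezone.utc),
    "user": "joe",
    "access": "read",
})

# An index on (user, access) keeps the example query fast as the collection grows.
logs.create_index([("user", ASCENDING), ("access", ASCENDING)])

joes_reads = logs.find({"user": "joe", "access": "read"})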