Accessing files in MongoDB

I am using the sacred package in Python, which lets me keep track of the computational experiments I'm running. sacred allows adding an observer (MongoDB) which stores all sorts of information about the experiment (configuration, source files, etc.).
sacred also allows adding artifacts to the DB by using sacred.Experiment.add_artifact(PATH_TO_FILE).
This command essentially adds the file to the DB.
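For context, a minimal sketch of how an artifact ends up in the DB this way (the MongoDB URL, db name and file name here are assumptions, not my actual setup):

from sacred import Experiment
from sacred.observers import MongoObserver

ex = Experiment("demo")
# assumed connection details; older sacred versions use MongoObserver.create(...)
ex.observers.append(MongoObserver(url="mongodb://localhost:27017", db_name="sacred"))

@ex.automain
def run():
    with open("results.txt", "w") as f:
        f.write("some output")
    ex.add_artifact("results.txt")  # stores the file in MongoDB via GridFS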
I'm using MongoDB Compass; I can access the experiment information and see that an artifact has been added. It contains two fields:
'name' and 'file_id', which contains an ObjectId (see image).
I am attempting to access the stored file itself. I have noticed that under my DB there is an additional collection called fs.files. In it I can filter to find my ObjectId, but it does not seem to let me access the content of the file itself.

Code example for GridFS (import gridfs and pymongo).
If you already have the ObjectId:
artifact = gridfs.GridFS(pymongo.MongoClient().sacred).get(objectid)
To find the ObjectId for an artifact named filename, with result being one entry of db.runs.find():
objectid = next(a['file_id'] for a in result['artifacts'] if a['name'] == filename)
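Putting it together, a minimal sketch that writes the artifact back to disk (the db name sacred and the artifact name results.txt are assumptions; adjust to your setup):

import gridfs
import pymongo

db = pymongo.MongoClient().sacred                  # assumed db name
run = db.runs.find_one({'_id': 1})                 # the run you are interested in
file_id = next(a['file_id'] for a in run['artifacts'] if a['name'] == 'results.txt')
grid_out = gridfs.GridFS(db).get(file_id)          # GridOut object
with open('results.txt', 'wb') as f:
    f.write(grid_out.read())                       # raw file content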

MongoDB file storage is handled by GridFS, which basically splits files into chunks and stores them in dedicated collections (fs.files for the file metadata, fs.chunks for the data).
Tutorial to access: http://api.mongodb.com/python/current/examples/gridfs.html

I wrote a small library called incense to access data stored in MongoDB via sacred. It is available on GitHub at https://github.com/JarnoRFB/incense and via pip. With it you can load experiments as Python objects. The artifacts will be available as objects that you can either save to disk or display in a Jupyter notebook:
from incense import ExperimentLoader
loader = ExperimentLoader(db_name="my_db")
exp = loader.find_by_id(1)
print(exp.artifacts)
exp.artifacts["my_artifact"].save() # Save artifact on disk.
exp.artifacts["my_artifact"].render() # Display artifact in notebook.

Related

How to access the latest uploaded object in a Google Cloud Storage bucket using Python in a TensorFlow model

I am working on a TensorFlow model where I want to make use of the latest uploaded object, in order to get output from that uploaded object. Is there a way to access the latest object uploaded to a Google Cloud Storage bucket using Python?
The below is what I use for grabbing the latest updated object.
Instantiate your client
from google.cloud import storage
# first establish your client
storage_client = storage.Client()
Define bucket_name and any additional paths via prefix
# get your blobs
bucket_name = 'your-glorious-bucket-name'
prefix = 'special-directory/within/your/bucket' # optional
Iterate the blobs returned by the client
Storing these as tuple records is quick and efficient.
blobs = [(blob, blob.updated) for blob in storage_client.list_blobs(
    bucket_name,
    prefix=prefix,
)]
Sort the list on the second tuple value
# sort and grab the latest value, based on the updated key
latest = sorted(blobs, key=lambda tup: tup[1])[-1][0]
string_data = latest.download_as_string()
See the metadata key docs and the Google Cloud Storage Python client docs.
One-liner
# assumes storage_client as above
# latest is a string formatted response of the blob's data
latest = sorted([(blob, blob.updated) for blob in storage_client.list_blobs(bucket_name, prefix=prefix)], key=lambda tup: tup[1])[-1][0].download_as_string()
There is no direct way to get the latest uploaded object from Google Cloud Storage. However, there is a workaround using the object's metadata.
Every object that is uploaded to Google Cloud Storage has its own metadata. For more information you can visit the Cloud Storage > Object Metadata documentation. One of the metadata fields is "updated": a timestamp of the last time the object was updated, which can change on only three occasions:
A) The object was uploaded for the first time.
B) The object was uploaded and replaced because it already existed.
C) The object's metadata changed.
If you are not updating the metadata of the object, then you can use this workaround (a sketch follows the list):
Set a variable to a very old datetime (1900-01-01 00:00:00.000000). No object will have this as its updated timestamp.
Set a variable to store the latest blob's name and initialize it to None.
List all the blobs in the bucket (see the Google Cloud Storage documentation).
For each blob, load its updated metadata and convert it to a datetime object.
If the blob's updated timestamp is greater than the one you have stored, update the stored timestamp and save the current blob's name.
This process continues until you have checked all the blobs, and only the latest one will be left in the variables.
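A minimal sketch of that loop, reusing the client setup and placeholder bucket name from the first answer:

from datetime import datetime, timezone

from google.cloud import storage

storage_client = storage.Client()
latest_time = datetime(1900, 1, 1, tzinfo=timezone.utc)  # impossibly old timestamp
latest_name = None

for blob in storage_client.list_blobs('your-glorious-bucket-name'):
    if blob.updated > latest_time:   # blob.updated is already a timezone-aware datetime
        latest_time = blob.updated
        latest_name = blob.name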
I have done a bit of coding myself and this is my GitHub code example that worked for me. Take the logic and modify it based on your needs. I would also suggest testing it locally before using it in your code.
BUT, in case you update the blob's metadata manually, there is another workaround:
If you update any of the blob's metadata (see the Viewing and Editing Object Metadata documentation), then the "updated" timestamp of that blob also changes, so running the above method will NOT give you the last uploaded object but the last modified one, which is not the same thing. Therefore you can add a piece of custom metadata to the object every time you upload it, and that custom metadata will be the timestamp at upload time. No matter what happens to the other metadata later, the custom metadata will always keep the time the object was uploaded. Then use the same method as above, but read blob.metadata instead of blob.updated and apply the same logic.
Additional notes:
To use custom metadata you need to use the x-goog-meta- prefix, as stated in the "Editing object metadata" section of the Viewing and Editing Object Metadata documentation.
So [CUSTOM_METADATA_KEY] should be something like x-goog-meta-uploaded and [CUSTOM_METADATA_VALUE] should be [CURRENT_TIMESTAMP_DURING_UPLOAD].
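A sketch of the custom-metadata variant with the Python client (the metadata key uploaded, the bucket and the file names are assumptions; when you go through blob.metadata the client takes care of the header prefix for you):

from datetime import datetime, timezone

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('your-glorious-bucket-name')

# At upload time: stamp the object with its own "uploaded" timestamp.
blob = bucket.blob('reports/output.csv')
blob.metadata = {'uploaded': datetime.now(timezone.utc).isoformat()}
blob.upload_from_filename('output.csv')

# Later: pick the blob with the newest custom timestamp instead of blob.updated.
# If the listing does not return custom metadata in your environment, call b.reload() first.
blobs = [(b, b.metadata.get('uploaded')) for b in storage_client.list_blobs(bucket)
         if b.metadata and b.metadata.get('uploaded')]
latest = sorted(blobs, key=lambda tup: tup[1])[-1][0]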

Different S3 behavior using different endpoints?

I'm currently writing code to use Amazon's S3 REST API and I notice different behavior where the only difference seems to be the Amazon endpoint URI that I use, e.g., https://s3.amazonaws.com vs. https://s3-us-west-2.amazonaws.com.
Examples of different behavior for the GET Bucket (List Objects) call:
Using one endpoint, it includes the "folder" in the results, e.g.:
/path/subfolder/
/path/subfolder/file1.txt
/path/subfolder/file2.txt
and, using the other endpoint, it does not include the "folder" in the results:
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Using one endpoint, it represents "folders" using a trailing / as shown above and, using the other endpoint, it uses a trailing _$folder$:
/path/subfolder_$folder$
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Why the differences? How can I make it return results in a consistent manner regardless of endpoint?
Note that I get these same odd results even if I use Amazon's own command-line AWS S3 client, so it's not my code.
And the contents of the buckets should be irrelevant anyway.
Your assertion notwithstanding, your issue is exactly about the content of the buckets, and not something S3 is doing -- the S3 API has no concept of folders. None. The S3 console can display folders, but this is for convenience -- the folders are not really there -- or if there are folder-like entities, they're irrelevant and not needed.
In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
So why are you seeing this?
Either you've been using EMR/Hadoop, or some other code written by someone who took a bad example and ran with it... or is doing something differently than it should have been done for quite some time.
Amazon EMR is a web service that uses a managed Hadoop framework to process, distribute, and interact with data in AWS data stores, including Amazon S3. Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the <directoryname>_$folder$ suffix.
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
This may have been something the S3 console did many years ago, and apparently (since you don't report seeing them in the console) it still supports displaying such objects as folders in the console... but the S3 console no longer creates them this way, if it ever did.
I've mirrored the bucket "folder" layout exactly
If you create a folder in the console, an empty object with the key "foldername/" is created. This in turn is used to display a folder that you can navigate into, and upload objects with keys beginning with that folder name as a prefix.
The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
If you just create objects using the API, then "my/object.txt" appears in the console as "object.txt" inside folder "my" even though there is no "my/" object created... so if the objects are created with the API, you'd see neither style of "folder" in the object listing.
That is probably a bug in the API endpoint which includes the "folder" - S3 internally doesn't actually have a folder structure, but instead is just a set of keys associated with files, where keys (for convenience) can contain slash-separated paths which then show up as "folders" in the web interface. There is the option in the API to specify a prefix, which I believe can be any part of the key up to and including part of the filename.
EMR's s3 client is not the apache one, so I can't speak accurately about it.
In ASF hadoop releases (and HDP, CDH)
The older s3n:// client uses $folder$ as its folder delimiter.
The newer s3a:// client uses / as its folder marker, but will handle $folder$ if there. At least it used to; I can't see where in the code it does now.
The S3A clients strip out all folder markers when you list things; S3A uses them to simulate empty dirs and deletes all parent markers when you create child file/dir entries.
Whatever you have that processes the GET results should just ignore entries ending in "/" or "_$folder$" (a small sketch follows).
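For example, a small sketch with boto3 that drops both marker styles from a listing (bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='path/subfolder/')

keys = [obj['Key'] for obj in resp.get('Contents', [])
        if not obj['Key'].endswith('/') and not obj['Key'].endswith('_$folder$')]
# keys now contains only "real" objects, regardless of which folder markers exist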
As to why they are different, the local EMRFS is a different codepath, using dynamo for implementing consistency. At a guess, it doesn't need to mock empty dirs, as the DDB tables will host all directory entries.

How do I do bulk file storage with IBM Object Storage?

I'm using IBM Object Storage to store huge amounts of very small files,
say more than 1500 small files in one hour. (The total size of the 1500 files is about 5 MB.)
I'm using the Object Storage API to post the files, one file at a time.
The problem is that storing 1500 small files takes about 15 minutes in total, including setting up and closing the connection with the object store.
Is there a way to do a sort of bulk post, to send more than one file in one post?
Regards,
Look at the archive-auto-extract feature available within Openstack Swift (Bluemix Object Storage). I assume that you are familiar with obtaining the X-Auth-Token and Storage_URL from Bluemix object storage. If not, my post about large file manifests explains the process. From the doc, the constraints include:
You must use the tar utility to create the tar archive file.
You can upload regular files but you cannot upload other items (for example, empty directories or symbolic links).
You must UTF-8-encode the member names.
Basic steps would be:
Confirm that IBM Bluemix supports this feature by viewing the info details for the service at https://dal.objectstorage.open.softlayer.com/info (a sketch follows the JSON below). You'll see a JSON section within the response similar to:
"bulk_upload": {
"max_failed_extractions": 1000,
"max_containers_per_extraction": 10000
}
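A quick way to check, sketched with the requests library (this assumes the /info endpoint above is reachable without authentication):

import requests

info = requests.get('https://dal.objectstorage.open.softlayer.com/info').json()
print(info.get('bulk_upload'))  # e.g. {'max_failed_extractions': 1000, ...}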
Create a tar archive of your desired file set. tar gzip is most common.
Upload this tar archive to object storage with a special parameter that tells Swift to auto-extract the contents into the container for you:
PUT /v1/AUTH_myaccount/my_backups/?extract-archive=tar.gz
From the docs: To upload an archive file, make a PUT request. Add the extract-archive=format query parameter to indicate that you are uploading a tar archive file instead of normal content. Include within the request body the contents of the local file backup.tar.gz.
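Sketched with the Python requests library (the token, storage URL and container name my_backups follow the placeholders above; substitute your own values):

import requests

storage_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_myaccount'  # from your credentials
token = 'X_AUTH_TOKEN_VALUE'                                                    # obtained as described above

with open('backup.tar.gz', 'rb') as archive:
    resp = requests.put(
        storage_url + '/my_backups?extract-archive=tar.gz',
        headers={'X-Auth-Token': token},
        data=archive,
    )
print(resp.status_code, resp.text)  # the response body reports the extraction results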
The extracted objects end up with names something like:
AUTH_myaccount/my_backups/etc/config1.conf
AUTH_myaccount/my_backups/etc/cool.jpg
AUTH_myaccount/my_backups/home/john/bluemix.conf
...
Inspect the results. Any top-level directory in the archive should create a new container in your Swift object-storage account.
Voila! Bulk upload. Hope this helps.

Import "normal" MongoDB collections into DerbyJS 0.6

Same situation like this question, but with current DerbyJS (version 0.6):
Using imported docs from MongoDB in DerbyJS
I have a MongoDB collection with data that was not saved through my
Derby app. I want to query against that and pull it into my Derby app.
Is this still possible?
The accepted answer there points to a dead link. The newest working link would be this: https://github.com/derbyjs/racer/blob/0.3/lib/descriptor/query/README.md
Which refers to the 0.3 branch for Racer (current master version is 0.6).
What I tried
Searching the internets
The naïve way:
var query = model.query('projects-legacy', { public: true });
model.fetch(query, function() {
  query.ref('_page.projects');
});
(doesn't work)
A utility was written for this purpose: https://github.com/share/igor
You may need to modify it to only run against a single collection instead of the whole database, but it essentially goes through every document in the database and modifies it with the necessary livedb metadata and creates a default operation for it as well.
In livedb every collection has a corresponding operations collection, for example profiles will have a profiles_ops collection which holds all the operations for the profiles.
You will have to convert the collection to use it with Racer/livedb because of the metadata on the document itself.
An alternative, if you don't want to convert, is to use traditional AJAX/REST to get the data from your Mongo database and then just put it in your local model. This will not be real-time or synced to the server, but it will allow you to drive your templates from data that you don't want to convert for some reason.

Comparing geospatial paths with MongoDB

I'm working on a mobile app that tracks a user's location at regular intervals to allow him to plot the path of a journey on a map. We'd like to add an optional feature that will tell him which other users of the app have made similar journeys in the timeframe he's looking at, be it today's commute or the last month of travel. We're referring to this as "path-matching".
The data is currently logged into files within the app's private storage directories on iOS and Android in a binary format that is easily and quickly scanned through to read locations. Each file contains the locations for one day, and generally runs to about 80KB.
To be able to implement the path-matching feature we'll obviously need to start uploading these location logs to our server (with the user's permission, of course), on which we're running PHP. Someone suggested MongoDB for its geospatial prowess, but I've a few questions that maybe folks could help me with:
It seems like we could change our location logging to use BSON instead. The first field would be a device or user ID, followed by a list of locations for a particular day. The file could then be uploaded to our server and pushed into the MongoDB store. The online documentation, however, only seems to refer to importing BSON files created by mongodump. Is the format stable enough that any app could write BSON files readable directly by MongoDB?
Is MongoDB able to run geospatial queries on documents containing multiple locations, or on locations forming a path across multiple documents? Or does this strike you as something that would require excessive logic outside the database, on the PHP side?
The format is totally stable, but there isn't much tooling to do what you describe. Generally, you'd upload it to the backend and it would end up in, say $_POST['locations'] or something that would be an array of associative arrays. Sanitize it and just save it to the database, something like:
$locs = sanitize($_POST['locations']);
$doc = array('path' => array('type' => 'LineString', 'coordinates' => $locs), 'user' => $userId);
$collection->insert($doc);
In the above example, I'm using some of the latest geo stuff (http://docs.mongodb.org/manual/release-notes/2.4/#new-geospatial-indexes-with-geojson-and-improved-spherical-geometry), you'll need a nightly build to get this but it should be in the stable build in about a month. If you need it before then, you can use the older geo API: http://docs.mongodb.org/manual/core/geospatial-indexes/.
MongoDB doesn't read BSON files, but you could use mongorestore to manually load them. I would highly recommend letting the driver do the low-level stuff for you, though!
You can have a document containing a line (in the new geo stuff) and an array of points (in the old geo stuff). I'm not sure what you mean by "a path across multiple documents."
Edited to add: based on your comment, you might want to try {path : {$near : {$geometry : userPath}}} to find "nearby" paths. You could also try making a polygon around the user's path and querying for docs $within the polygon.
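Not PHP, but a small pymongo sketch of the polygon approach, using the 2.4+ spelling $geoWithin for $within (the db/collection names and coordinates are made up for illustration):

from pymongo import MongoClient, GEOSPHERE

journeys = MongoClient().tracker.journeys
journeys.create_index([('path', GEOSPHERE)])  # 2dsphere index on the GeoJSON LineString

# Rough bounding polygon around the user's path (closed ring, lon/lat order).
area = {
    'type': 'Polygon',
    'coordinates': [[
        [-0.15, 51.49], [-0.05, 51.49], [-0.05, 51.53], [-0.15, 51.53], [-0.15, 51.49],
    ]],
}

# Find journeys whose path lies entirely inside the polygon.
for doc in journeys.find({'path': {'$geoWithin': {'$geometry': area}}}):
    print(doc['user'])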