dsx writing to blue-mix object storage - ibm-cloud

Will Bluemix Object Storage ever support folders inside a container, like Amazon S3 does? I am not sure about other folks, but when writing from DSX the container quickly becomes a mess. It's like a computer that cannot create folders under the C:\ drive.
Since it is DSX's primary storage, is the DSX team pushing for this capability? [screenshot: Bluemix Object Storage with no folder capability]
Here's the S3 container and how beautifully you can organize everything: [screenshot: S3 container]

I believe what you are looking for is something like subcontainers to organize your files.
I think the Object Storage service is based on OpenStack Object Storage (Swift), and according to the OpenStack docs it is not possible to create nested directories.
https://docs.openstack.org/user-guide/cli-swift-pseudo-hierarchical-folders-directories.html
You can instead put the path in the object name to simulate subdirectories, separating the parts with /. When writing or reading a file you can use something like 'swift://containername.' + name + '/foldername/filename.csv'
So anything you write as /foldername/filename.csv will be organized under foldername.
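To illustrate how a flat container can still be browsed as if it had folders, here is a minimal sketch in plain Python. It does not call the Swift API; it only shows the idea of grouping object names by a prefix and a delimiter. The function name `list_folder` and the object names are hypothetical.

```python
def list_folder(object_names, prefix="", delimiter="/"):
    """Split a flat list of object names into the files and
    subfolders that sit directly under `prefix`."""
    files, subfolders = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so `head` acts as a folder.
            subfolders.add(prefix + head + delimiter)
        elif rest:
            files.append(name)
    return sorted(files), sorted(subfolders)


# Hypothetical object names stored in one flat container.
names = [
    "readme.txt",
    "raw/2017/jan.csv",
    "raw/2017/feb.csv",
    "models/churn.pkl",
]

root_files, root_folders = list_folder(names)
raw_files, raw_folders = list_folder(names, prefix="raw/2017/")
```

The container itself stays flat; the "folders" exist only because the names share a prefix.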
Thanks,
Charles.

Related

Reading Json file from Azure datalake as a file using Json.load in Azure databricks /Synapse notebooks

I am trying to parse JSON data with multiple levels of nesting. My approach is to pass the file name to open(File-name) and load the data. When I provide the data lake path, it throws an error saying the file path was not found. I am able to read the data into DataFrames, but how can I read the file from the data lake as a plain file and open it, without going through a DataFrame?
Current code approach on local machine which is working:
f = open(File_Name.Json)
data = json.load(f)
Failing scenario when providing the data lake path:
f = open(Datalake path/File_Name.Json)
data = json.load(f)
You need to mount the data lake folder to a location in dbfs (in Databricks), although mounting is a security risk. Anyone with access to Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
The open function works only with local files; out of the box it does not understand cloud file paths. You can of course try to mount the cloud storage, but as @ARCrow mentioned, that would be a security risk (unless you create a so-called passthrough mount, which controls access at the cloud storage level).
But if you're able to read file into dataframe, then it means that cluster has all necessary settings for accessing the cloud storage - in this case you can just use dbutils.fs.cp command to copy file from the cloud storage to local disk, and then open it with open function. Something like this:
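import json
# Copy the file from cloud storage to the driver's local disk, then open it as a normal file
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)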
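The copy-then-open pattern can be wrapped in a small helper. This is only a sketch: the name `load_json_via_local_copy` is hypothetical, and the copy function is passed in as a parameter so the helper does not depend on `dbutils` being defined. In a Databricks notebook you would pass `dbutils.fs.cp`.

```python
import json
import os
import tempfile


def load_json_via_local_copy(remote_path, copy_fn):
    """Copy `remote_path` to a local temp directory with `copy_fn`,
    then parse it with the ordinary open() and json.load().

    `copy_fn(source, destination)` is an assumption: it receives the
    destination as a "file://" URI, the form dbutils.fs.cp expects
    for the driver's local disk.
    """
    local_dir = tempfile.mkdtemp()
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    copy_fn(remote_path, "file://" + local_path)
    with open(local_path, "r") as f:
        return json.load(f)
```

In a notebook the call would look like `data = load_json_via_local_copy("Datalake path/File_Name.Json", dbutils.fs.cp)`, assuming the cluster is already configured to reach the storage account.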

How to download multiple objects from IBM Cloud Object Storage?

I am trying to use IBM Cloud Object Storage to store images uploaded to my site by users. I have this functionality working just fine.
However, based on the documentation here (link) it appears as though only one object can be downloaded from a bucket at a time.
Is there any way a list of objects could all be downloaded from the bucket? Is there a different approach to requesting multiple objects from a COS bucket?
Via the REST API, no, you can only download a single object at a time. But most tools (like the AWS CLI, or Minio Client) allow downloading all objects that share a prefix (eg foo/bar and foo/bas). The IBM forks of the S3 libraries also are now integrated with Aspera, and can transfer large directories all at once. What are you trying to do?
According to the S3 spec (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html), you can only download one object at a time.
There are various tools that can help you download multiple objects at a time from COS. I used the AWS CLI tool to download and upload objects from/to COS.
So install the aws-cli tool and configure it by supplying your access_key_id and secret_access_key here.
Recursively copying S3 objects to a local directory: When passed with the parameter --recursive, the following cp command recursively copies all objects under a specified prefix and bucket to a specified directory.
C:\Users\Shashank>aws s3 cp s3://yourBucketName . --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp s3://yourBucketName D:\s3\ --recursive
In my case the endpoint is in the us-east region, and I am copying objects into the D:\s3 directory.
Recursively copying local files to S3: When passed with the parameter --recursive, the following cp command recursively copies all files under a specified directory to a specified bucket.
C:\Users\Shashank>aws s3 cp myDir s3://yourBucketName/ --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp D:\s3 s3://yourBucketName/ --recursive
Here I am copying objects from the D:\s3 directory to COS.
For more reference, you can see the link here.
I hope it works for you.
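If you would rather do the same thing from code than from the CLI, the recursive copy boils down to listing the keys under a prefix and issuing one GET per object. Below is a rough sketch that assumes an S3-compatible client such as one created with boto3 (it uses the client's `get_paginator("list_objects_v2")` and `download_file` calls). The function name `download_prefix` is hypothetical, and the client is passed in so the sketch itself needs no credentials.

```python
import os


def download_prefix(client, bucket, prefix, dest_dir):
    """Download every object whose key starts with `prefix`.

    Each object is still fetched with its own request; this only
    automates the loop. Returns the list of downloaded keys.
    """
    downloaded = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects
            local_path = os.path.join(dest_dir, *key.split("/"))
            parent = os.path.dirname(local_path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            client.download_file(bucket, key, local_path)
            downloaded.append(key)
    return downloaded
```

With boto3 the client could be created along the lines of `boto3.client("s3", endpoint_url=..., aws_access_key_id=..., aws_secret_access_key=...)`, pointing `endpoint_url` at your COS endpoint as in the CLI examples above; I have not verified this against every COS region.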

How to upload/download file from GCS to/from ftp server with Airflow FTPHook

I am currently trying to use the FTPHook in Airflow in order to upload and download file to/from a remote ftp. But I'm not sure if I can use the gs:// path as part of the source/destination path.
I currently don't want to use local folder within the AF pod since the file size might get big, so I would rather use gcs path directly or gcs file stream.
conn = FTPHook(ftp_conn_id='ftp_default')
conn.store_file('in', 'gs://bucket_name/file_name.txt')
link to the FTPHook code:
here
Thanks for any help!
I found a simple streaming solution to upload/download between GCS and an FTP server using pysftp, which I'd like to share with you.
First, I found this solution, which worked great, but its only issue was that it didn't support uploading a file from GCS to FTP. So I kept looking for something else.
Then I looked into a different approach and found this Google document, which basically lets you stream to/from a blob file, which was exactly what I was looking for.
params = BaseHook.get_connection(self.ftp_conn_id)
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None  # disables host key checking
ftp = pysftp.Connection(host=params.host, username=params.login, password=params.password,
                        port=params.port,
                        cnopts=cnopts)
# This will download a file from the FTP server to the GCS location
with ftp.open(self.ftp_folder + '/' + file_to_load, 'r+') as remote_file:
    blob = bucket.blob(self.gcs_prefix + file_to_load)
    blob.upload_from_file(remote_file)
# This will upload a file from GCS to the FTP server
with ftp.open(self.ftp_folder + '/' + file_name, 'w+') as remote_file:
    blob = bucket.blob(fileObject['name'])
    blob.download_to_file(remote_file)
GCS does not implement FTP support, so this won't work.
It looks like FTP hook only knows how to deal with a local file path or buffer, not one of the GCS APIs.
You might be able to find (or write) some code that reads from FTP and writes to GCS.
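As a rough idea of what such "read from one side, write to the other" code could look like for plain FTP (rather than SFTP), here is a sketch built on the standard library's `ftplib` calls `storbinary` and `retrbinary`. It assumes the blob object offers a file-like `open("rb")` / `open("wb")`, as blobs in recent versions of the google-cloud-storage client do. The function names are hypothetical, and both the FTP connection and the blob are passed in.

```python
def blob_to_ftp(blob, ftp, remote_name):
    """Stream a blob to the FTP server without a local temp file."""
    with blob.open("rb") as src:
        # storbinary reads from the file-like object in blocks.
        ftp.storbinary("STOR " + remote_name, src)


def ftp_to_blob(ftp, remote_name, blob):
    """Stream a file from the FTP server into a blob."""
    with blob.open("wb") as dst:
        # retrbinary calls dst.write once per received block.
        ftp.retrbinary("RETR " + remote_name, dst.write)
```

Here `ftp` would be an `ftplib.FTP` instance (or anything exposing the same two methods), so nothing is written to the pod's local disk.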

Mount Bucket on Google Storage

I want to mount a Google bucket to a local server. However, when I run the line, the directory I point it to is empty. Any ideas?
gcsfuse mssng_vcf_files ./mountbucket/
It reports:
File system has been successfully mounted.
but the directory mountbucket/ is empty.
gcsfuse will not show any directory that is only implied by an object with a slash in its name. So if your bucket contains files/index.txt, it will not show up until you create an object named files/. I am assuming here that your bucket contains directories and then files; if that is the case, this may be your problem.
gcsfuse supports a flag called --implicit-dirs that changes the behaviour. When this flag is enabled, name lookup requests from the kernel use the GCS API's Objects.list operation to search for objects that would implicitly define the existence of a directory with the name in question. So, in the example above, there would appear to be a directory named "files".
There are some drawbacks which are defined here -
https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#implicit-directories
So you have 2 options:
1. Create the directories in your bucket, which will make your files appear.
2. Look at the --implicit-dirs flag to get them to always appear.
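To make the difference between the two options concrete, here is a deliberately simplified model in plain Python of what shows up at the top level of a mount. It is not gcsfuse's actual implementation, and the function name `visible_top_level` is hypothetical. It only encodes the rule described above: a directory is visible if a placeholder object ending in a slash exists, or if implicit directories are turned on.

```python
def visible_top_level(object_names, implicit_dirs=False):
    """Return the top-level entries a mount would show, under the
    simplified rule described above."""
    entries = set()
    for name in object_names:
        head, sep, rest = name.partition("/")
        if not sep:
            entries.add(head)        # plain object at the bucket root
        elif rest == "" or implicit_dirs:
            # Either an explicit "files/" placeholder object, or a
            # directory inferred from a longer object name.
            entries.add(head + "/")
    return sorted(entries)


only_nested = ["files/index.txt"]
with_placeholder = ["files/", "files/index.txt"]
```

With `only_nested`, the default call returns an empty list, which matches the empty mountbucket/ directory in the question; passing `implicit_dirs=True` or adding the placeholder object makes `files/` appear.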
Hope this helps.

Move Cloud Storage file to different bucket with Java API

How can I move a file from one bucket to another with the Cloud Storage Java API? I can find examples of file creation but not copying or deletion - and I imagine I'd have to copy the file and delete it in order to execute a move from one bucket to another.
You're correct. Do the copy and then delete the original after. There are some examples on GitHub. Here's the gist of it:
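// Copy the blob to the target bucket, then delete the original to complete the move
CopyWriter copyWriter = originalBlob.copyTo(BlobId.of(bucketName, blobName));
Blob copiedBlob = copyWriter.getResult();
boolean deleted = originalBlob.delete();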