How to upload/download files between GCS and an FTP server with Airflow FTPHook - google-cloud-storage

I am currently trying to use the FTPHook in Airflow to upload and download files to/from a remote FTP server, but I'm not sure whether I can use a gs:// path as part of the source/destination path.
I don't want to use a local folder within the Airflow pod since the files might get big, so I would rather use a GCS path directly or a GCS file stream.
conn = FTPHook(ftp_conn_id='ftp_default')
conn.store_file('in', 'gs://bucket_name/file_name.txt')
Link to the FTPHook code: here
Thanks for any help!

I found a simple streaming solution to upload/download files between GCS and an FTP server using pysftp, which I'd like to share with you.
First, I found this solution, which was working great, but the only issue was that it didn't support uploading files from GCS to FTP, so I kept looking for something else.
Then I looked into a different approach and found this Google documentation, which basically lets you stream to/from a blob file, which was exactly what I was looking for.
import pysftp
from airflow.hooks.base_hook import BaseHook
from google.cloud import storage

# self.ftp_conn_id, self.ftp_folder, self.gcs_prefix, file_to_load,
# file_name and fileObject are attributes/variables of the surrounding operator.
bucket = storage.Client().bucket('bucket_name')

params = BaseHook.get_connection(self.ftp_conn_id)
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None
sftp = pysftp.Connection(host=params.host, username=params.login,
                         password=params.password, port=params.port,
                         cnopts=cnopts)

# This will download a file from the FTP server and stream it to a GCS blob
with sftp.open(self.ftp_folder + '/' + file_to_load, 'r+') as remote_file:
    blob = bucket.blob(self.gcs_prefix + file_to_load)
    blob.upload_from_file(remote_file)

# This will upload a file from GCS by streaming the blob to the FTP server
with sftp.open(self.ftp_folder + '/' + file_name, 'w+') as remote_file:
    blob = bucket.blob(fileObject['name'])
    blob.download_to_file(remote_file)

GCS does not implement FTP support, so this won't work.
It looks like the FTP hook only knows how to deal with a local file path or a buffer, not the GCS APIs.
You might be able to find (or write) some code that reads from FTP and writes to GCS.
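For the original question, one way to glue the two together without touching the pod's local disk is to have the FTPHook read into an in-memory buffer and then hand that buffer to the GCS client. Here is a minimal sketch, assuming FTPHook.retrieve_file accepts a file-like buffer (it does in recent Airflow versions) and that the connection ID, bucket name and file paths below are placeholders:

import io
from airflow.contrib.hooks.ftp_hook import FTPHook  # airflow.providers.ftp.hooks.ftp in Airflow 2
from google.cloud import storage

ftp_hook = FTPHook(ftp_conn_id='ftp_default')
buffer = io.BytesIO()

# Read the remote FTP file into memory instead of onto the pod's disk
ftp_hook.retrieve_file('in/file_name.txt', buffer)
buffer.seek(0)

# Stream the buffer straight into a GCS object
bucket = storage.Client().bucket('bucket_name')
bucket.blob('file_name.txt').upload_from_file(buffer)

The reverse direction works the same way: download_to_file into a BytesIO, rewind it with seek(0), and pass it to store_file. Note that the whole file is held in memory with this approach, so it avoids disk usage on the pod but not memory usage.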

Related

How to make Snakemake recognize Globus remote files using Globus CLI?

I am working in a high performance computing grid environment, where large-scale data transfers are done via Globus. I would like to use Snakemake to pull data from a Globus path, process the data, and then push the processed data to a different Globus path. Globus has a command-line interface.
Pulling the data is no problem, for I'd just create a rule that would run globus transfer to create the requisite local file. But for pushing the data back to Globus, I think I'll need a rule that can "see" that the file is missing at the remote location, and then work backwards to determine what needs to happen to create the file.
I could create local "proxy" files that represent the remote files. For example I could make a rule for creating 'processed_data_1234.tar.gz' output files in a directory. These files would just be created using touch (thus empty), and the same rule will run globus transfer to push the files remotely. But then there's the overhead of making sure that the proxy files don't get out of sync with the real Globus-hosted files.
Is there a more elegant way to do this akin to the Remote File capability? Is it difficult to add Globus CLI support to Snakemake? Thanks in advance for any advice!
Would it help to create a utility function that generates a list of all desired files and compares it against the list of files available on Globus? Something like this (pseudocode):
def return_needed_files():
    list_needed_files = []  # either hard-coded or specified with some logic
    list_available = []     # as appropriate, e.g. using globus ls
    return [i for i in list_needed_files if i not in list_available]

# include all the needed files in the all rule
rule all:
    input: return_needed_files
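A slightly fuller sketch of the same idea shells out to the Globus CLI to populate list_available. The endpoint UUID, remote path and file names below are placeholders, and it assumes the globus CLI is installed and authenticated:

import subprocess

GLOBUS_ENDPOINT = "your-endpoint-uuid"   # placeholder Globus endpoint UUID
REMOTE_DIR = "/processed/"               # placeholder remote directory

def list_available_on_globus():
    # `globus ls ENDPOINT:PATH` prints one entry per line
    result = subprocess.run(
        ["globus", "ls", GLOBUS_ENDPOINT + ":" + REMOTE_DIR],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.splitlines())

def return_needed_files():
    list_needed_files = ["processed_data_1234.tar.gz"]  # hard-coded or generated
    list_available = list_available_on_globus()
    return [f for f in list_needed_files if f not in list_available]

rule all:
    input: return_needed_files

You would still need a rule whose output matches the missing names and which calls globus transfer to push them, but this keeps the "what is missing remotely" check in one place without proxy files.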

Reading a JSON file from Azure Data Lake as a file using json.load in Azure Databricks/Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to pass a filename to open(file_name) and load the data. When I provide the data lake path, it throws an error that the file path is not found. I am able to read the data into dataframes, but how can I read a file from the data lake without converting it to a dataframe, treating it as a file and opening it?
Current code approach on a local machine, which works:
import json

f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the data lake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to DBFS: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
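As a rough illustration, a mount in a Databricks notebook looks something like the sketch below; the storage account, container, mount point and the contents of extra_configs are placeholders, and the exact config keys depend on your authentication method (see the linked docs):

# All names below are placeholders; fill in the auth settings from the linked docs.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    # ... remaining OAuth / service principal settings ...
}

dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# Once mounted, the file is reachable through the local /dbfs FUSE path
import json
with open("/dbfs/mnt/datalake/File_Name.Json") as f:
    data = json.load(f)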
The open function works only with local files; it doesn't understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as mentioned by @ARCrow, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a dataframe, it means that the cluster has all the necessary settings for accessing the cloud storage - in this case you can just use the dbutils.fs.cp command to copy the file from cloud storage to local disk, and then open it with the open function. Something like this:
import json

dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)

Aspera Node API /files/{id}/files endpoint not returning up to date data

I am working on a webapp for transferring files with Aspera. We are using AoC for the transfer server and an S3 bucket for storage.
When I upload a file to my S3 bucket using Aspera Connect, everything appears to be successful: I see it in the bucket, and I see the new file in the directory when I run /files/browse on the parent folder.
I am refactoring my code to use the /files/{id}/files endpoint to list the directory because the documentation says it is faster compared to the /files/browse endpoint. After the upload is complete, when I run the /files/{id}/files GET request, the new file does not show up in the returned data right away. It only becomes available after a few minutes.
Is there some caching mechanism in place? I can't find anything about this in the documentation. When I make a transfer in the AoC dashboard everything updates right away.
Thanks,
Tim
Yes, the file-id based system uses an in-memory cache (Redis).
This cache is updated when a new file is uploaded using Aspera, but for files moved directly on the storage there is a daemon that periodically scans and finds new files.
If you want to bypass the cache and have the API read the storage directly, you can add this header to the request:
X-Aspera-Cache-Control: no-cache
Another possibility is to trigger a scan by reading /files/{id} for the folder id.
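For reference, a quick sketch of the listing call with the cache-bypass header using Python requests; the node URL, folder id and bearer token (or whatever auth scheme your node uses) are placeholders:

import requests

NODE_URL = "https://node.example.com:9092"   # placeholder Node API base URL
FOLDER_ID = "12345"                          # placeholder folder file-id
TOKEN = "..."                                # placeholder bearer token

resp = requests.get(
    NODE_URL + "/files/" + FOLDER_ID + "/files",
    headers={
        "Authorization": "Bearer " + TOKEN,
        # Ask the node to read the storage instead of the in-memory cache
        "X-Aspera-Cache-Control": "no-cache",
    },
)
resp.raise_for_status()
print(resp.json())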

Move Cloud Storage file to different bucket with Java API

How can I move a file from one bucket to another with the Cloud Storage Java API? I can find examples of file creation but not copying or deletion - and I imagine I'd have to copy the file and delete it in order to execute a move from one bucket to another.
You're correct. Do the copy and then delete the original after. There are some examples on GitHub. Here's the gist of it:
CopyWriter copyWriter = originalBlob.copyTo(BlobId.of(bucketName, blobName));
Blob copiedBlob = copyWriter.getResult();
// Then delete the original to complete the "move"
originalBlob.delete();

DSX writing to Bluemix object storage

Will Bluemix object storage ever have folder capability inside a container, like Amazon S3? I am not sure about other folks, but when writing from DSX it very quickly becomes a mess inside a container. It's like a computer with no ability to create folders under the C:\ drive. It's a complete mess.
Since it's DSX's primary storage, is DSX pushing for this capability? (Screenshot: Bluemix object storage with no folder capability.)
Here's the S3 console and how beautifully you can organize everything. (Screenshot: S3 container.)
I believe what you are looking for is something like subcontainers to organize your files.
I think the Object Storage service is based on OpenStack Object Storage, and according to the OpenStack docs it is not possible to create nested directories.
https://docs.openstack.org/user-guide/cli-swift-pseudo-hierarchical-folders-directories.html
You can use the path in the filename to simulate subdirectories by separating with /. When writing/reading a file you can use something like this: 'swift://containername.' + name + '/foldername/filename.csv'
So anything you write with /foldername/filename.csv will be organized under foldername.
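For example, a small sketch of what this looks like when writing from a Spark notebook in DSX; the container name and the configured credentials name are placeholders, and df is assumed to be an existing Spark DataFrame:

# Placeholder container and configured credentials name; df is an existing DataFrame.
name = "spark"
path = "swift://mycontainer." + name + "/foldername/filename.csv"

# The object keys all start with foldername/, so the container UI
# groups them under that pseudo-folder.
df.write.csv(path)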
Thanks,
Charles.