Unable to read a GZIP file from AWS s3 bucket in pentaho PDI - pentaho-spoon

My requirement is to read a gzip file from an S3 bucket in Pentaho. I am able to do it with the virtual file system in this way:
s3://aCcEsSkEy:SecrEttAccceESSKeeey#s3/your-s3-bucket/your_file.gzip, but the issue is that my secret key contains a slash (/), so I am unable to read the file from that particular bucket and path.
Could anyone help me in this regard?
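One workaround sometimes suggested for credentials that contain a slash is to percent-encode the secret key before embedding it in the VFS URL (so / becomes %2F); whether Pentaho's VFS parser accepts the encoded form is an assumption to verify. A minimal sketch of building such a URL, using the same URL shape as in the question:
# Hedged sketch: percent-encode the secret key so the embedded "/" does not break the s3 VFS URL.
# The credentials below are placeholders, not real keys.
from urllib.parse import quote

access_key = "aCcEsSkEy"
secret_key = "SecrEtt/AccceESSKeeey"          # secret containing a slash
encoded_secret = quote(secret_key, safe="")   # "/" becomes "%2F"

vfs_url = f"s3://{access_key}:{encoded_secret}#s3/your-s3-bucket/your_file.gzip"
print(vfs_url)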

Related

How to process a large file from a s3 bucket using spring batch

Hello, I am trying to execute the example posted in a comment of the following post.
I'm accessing the bucket and reading a list of files, but when I execute the reader I receive the following error message: "Caused by: java.lang.IllegalStateException: Input resource must exist (reader is in 'strict' mode): ServletContext resource [/s3://bkt-csv-files/files/23-12-2022/arquivo_jan_2022_pt_00]". How can I resolve this error, or is there another way to read the files on S3 using Spring Batch?
I did not try the example you shared, but I would do it differently.
The FlatFileItemReader works with any implementation of the Resource interface. So if you manage to get an accessible resource in S3, you can use it with your item reader.
For example, you can use a UrlResource that points to your file in S3 and set it on the item reader.
This might help as well: Spring Batch - Read files from Aws S3

Unzipping a zip file through ADF is giving a special character

We have a zip file on an SFTP server and we are trying to unzip it through ADF. While unzipping, it produces a special character in the file, as shown below.
Actual data
|"QLD Mackay"|
After unzipping through ADF
|"QLD |"56ay"|
But when we unzip it manually, we do not get this issue.
Can someone help with this issue, please?
Make sure your data does not have any unknown characters in it. I have repro'd with sample data and was able to unzip the file without any issues.
Example:
I am using a binary dataset for both source and sink to unzip the zip file with the Azure Data Factory Copy data activity.
In the source dataset, select ZipDeflate as the compression type.
Connect the sink to a sink dataset with compression type None. In the sink settings, select Preserve hierarchy as the copy behavior to preserve the source filename.

Dataset format for copying csv files from a sftp server to blob storage

Hi, I want to copy CSV files from an SFTP server to blob storage using the Copy activity of ADF without processing the content. Is there a difference between using a binary dataset and a CSV dataset for the source and sink?
If I understand the ask correctly, you want to copy the CSV files from SFTP to blob storage as-is. If that is the case, you can use a binary dataset on both the source and sink of your copy activity. When using a Binary dataset, the service does not parse the file content but treats it as-is. Whereas if you use the CSV file format, the service will parse the file content and you will have to configure the file format specs in the connection settings of your dataset.
Please note that when using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.

Reading Json file from Azure datalake as a file using Json.load in Azure databricks /Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to give the file name and use open(file_name) to load the data. When I provide the data lake path, it throws an error that the file path is not found. I am able to read the data into dataframes, but how can I read the file from the data lake without converting it to a dataframe, i.e., open and read it as a plain file?
Current code approach on the local machine, which is working:
f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the data lake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to DBFS: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
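A rough illustration of the mount approach in Python (every value in angle brackets is a placeholder, and the config keys assume an ADLS Gen2 account accessed through a service principal, which may not match your setup):
import json

# Hedged sketch: mount an ADLS Gen2 container into DBFS using a service principal.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# Once mounted, the file is reachable through the local FUSE path under /dbfs
with open("/dbfs/mnt/datalake/File_Name.Json", "r") as f:
    data = json.load(f)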
The open function works only with local files; it does not understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as mentioned by @ARCrow, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a dataframe, it means the cluster has all the necessary settings for accessing the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from cloud storage to the local disk and then open it with the open function. Something like this:
import json

dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)

How to upload/download file from GCS to/from ftp server with Airflow FTPHook

I am currently trying to use the FTPHook in Airflow to upload and download files to/from a remote FTP server, but I'm not sure if I can use a gs:// path as part of the source/destination path.
I currently don't want to use a local folder within the Airflow pod since the file size might get big, so I would rather use the GCS path directly or a GCS file stream.
conn = FTPHook(ftp_conn_id='ftp_default')
conn.store_file('in', 'gs://bucket_name/file_name.txt')
link to the FTPHook code:
here
Thanks for any help!
I found a simple streaming solution to upload/download files between GCS and an FTP server using pysftp, which I'd like to share with you.
First, I found this solution, which was working great, but the only issue was that it didn't support uploading a file from GCS to the FTP server. So I was looking for something else.
I then looked into a different approach and found this Google document, which basically allows you to stream to/from a blob file, which was exactly what I was looking for. (In the snippet below, file_to_load, file_name, and fileObject come from the surrounding task's context, and the bucket name is a placeholder.)
import pysftp
from airflow.hooks.base import BaseHook
from google.cloud import storage

params = BaseHook.get_connection(self.ftp_conn_id)
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None  # skip host key verification (convenient, but not safe for production)
sftp = pysftp.Connection(host=params.host, username=params.login,
                         password=params.password, port=params.port,
                         cnopts=cnopts)

bucket = storage.Client().bucket("<bucket-name>")  # placeholder bucket name

# This will download a file from the SFTP server and stream it into GCS
with sftp.open(self.ftp_folder + '/' + file_to_load, 'r+') as remote_file:
    blob = bucket.blob(self.gcs_prefix + file_to_load)
    blob.upload_from_file(remote_file)

# This will upload a file from GCS by streaming it to the SFTP server
with sftp.open(self.ftp_folder + '/' + file_name, 'w+') as remote_file:
    blob = bucket.blob(fileObject['name'])
    blob.download_to_file(remote_file)
GCS does not implement FTP support, so this won't work.
It looks like the FTPHook only knows how to deal with a local file path or buffer, not one of the GCS APIs.
You might be able to find (or write) some code that reads from FTP and writes to GCS.
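For completeness, a minimal sketch of that idea (the connection ID, bucket name, and file paths are placeholders, and the import path assumes Airflow 2's FTP provider): retrieve the file from the FTP server into an in-memory buffer with the FTPHook, then upload that buffer to GCS with the Cloud Storage client.
import io

from airflow.providers.ftp.hooks.ftp import FTPHook
from google.cloud import storage

# Pull the file from the FTP server into an in-memory buffer
ftp_hook = FTPHook(ftp_conn_id="ftp_default")
buffer = io.BytesIO()
ftp_hook.retrieve_file("in/file_name.txt", buffer)  # retrieve_file also accepts a file-like buffer
buffer.seek(0)

# Push the buffer to GCS
bucket = storage.Client().bucket("bucket_name")
bucket.blob("file_name.txt").upload_from_file(buffer)
For the opposite direction, downloading the blob into a buffer with blob.download_to_file(buffer) and then passing that buffer to ftp_hook.store_file(...) should work the same way.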