How to send data from an S3 bucket to an FTP server with the help of Apache NiFi? - postgresql

I have file names and file paths in SQL tables; there can be multiple files for each row, and the files themselves are stored in an S3 bucket. I want to send all the files whose names appear in the SQL rows to an FTP server. How can we achieve this with Apache NiFi?
I tried GetFile and ListS3 but was not able to come to a working solution.

You need the components below to build your data ingestion pipeline:
Processors
QueryDatabaseTableRecord: queries the database and emits the result set
SplitText: splits the query result into one FlowFile per line
ExtractText: extracts the content of each FlowFile into attributes
UpdateAttribute: derives the S3 object key, bucket, FTP filename, etc. attributes
FetchS3Object: retrieves the contents of an S3 object and writes it to the content of a FlowFile
PutFTP: sends FlowFiles to an FTP server
Controller Services:
DBCPConnectionPool: configures the database connection, required by QueryDatabaseTableRecord
CSVRecordSetWriter: required by QueryDatabaseTableRecord to write the result set as CSV
AWSCredentialsProviderControllerService: configures AWS credentials, required by FetchS3Object
Flow Design:
Derive the S3 and FTP attributes in one go in UpdateAttribute:
QueryDatabaseTableRecord -> SplitText -> ExtractText -> UpdateAttribute -> FetchS3Object -> PutFTP
All of these processors are fairly self-explanatory; refer to the official documentation for property configuration. A rough Java equivalent of the flow's logic is sketched below.
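Here is that sketch: a standalone program using JDBC, the AWS SDK (v1), and Apache Commons Net that mirrors the flow step by step. The table and column names (files, file_name, file_path), the bucket name, and all connection details are placeholder assumptions, not anything NiFi itself generates.

import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class S3ToFtpSketch {
    public static void main(String[] args) throws Exception {
        // Credentials come from the default provider chain, playing the role
        // of AWSCredentialsProviderControllerService
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");      // placeholder host
        ftp.login("ftpuser", "ftppass");     // placeholder credentials
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        ftp.enterLocalPassiveMode();

        // QueryDatabaseTableRecord equivalent: read the file list from the DB
        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost:5432/mydb", "dbuser", "dbpass");
             Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT file_name, file_path FROM files")) {  // assumed schema

            while (rs.next()) {
                // UpdateAttribute equivalent: derive the S3 key and FTP filename
                String fileName = rs.getString("file_name");
                String key = rs.getString("file_path") + "/" + fileName;

                // FetchS3Object + PutFTP equivalent: stream the object to FTP
                try (InputStream body =
                             s3.getObject("my-bucket", key).getObjectContent()) {
                    ftp.storeFile(fileName, body);
                }
            }
        }

        ftp.logout();
        ftp.disconnect();
    }
}

In NiFi the same per-file fan-out happens via FlowFiles instead of a loop, but the attribute derivation and the fetch-then-put ordering are identical.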

Related

Azure data factory rest api x-ms-file-rename-source issues

I have a pipeline in Azure Data Factory that uses a web activity to rename a file on a file share in one of our Azure storage accounts via the REST API.
The process almost works and creates a copy of the file with the new name, but the new file is empty. I've tried this with both an xlsx file and a standard txt file. These are the headers I'm using:
x-ms-date: <generating in ADF>
x-ms-version: 2021-08-06
x-ms-rename-source: <path to original file>
x-ms-type: file
x-ms-content-length: <?>
I put <?> for the content length because I think this is the issue, and I'm not sure what value I should use here. I tried leaving x-ms-content-length out in order to preserve the file attributes, but I get an error that the header is required. Any thoughts on why the file is empty/being resized?
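For reference, below is a minimal sketch of how such a rename request could be assembled with Java's built-in HttpClient (Java 11+). The account, share, and path values are placeholders, authentication (SAS token or SharedKey signature) is omitted entirely, and the comp=rename query parameter plus the x-ms-file-rename-source header follow the Rename File REST operation named in the question title.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class RenameRequestSketch {
    public static void main(String[] args) throws Exception {
        // x-ms-date must be an RFC 1123 timestamp in GMT
        String date = DateTimeFormatter.RFC_1123_DATE_TIME
                .format(ZonedDateTime.now(ZoneOffset.UTC));

        // The destination (new name) goes in the URL, the source (old name)
        // in the header; all account/share/path values are placeholders
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                        "https://myaccount.file.core.windows.net/myshare/dir/newname.xlsx?comp=rename"))
                .method("PUT", HttpRequest.BodyPublishers.noBody())
                .header("x-ms-date", date)
                .header("x-ms-version", "2021-08-06")
                .header("x-ms-file-rename-source",
                        "https://myaccount.file.core.windows.net/myshare/dir/oldname.xlsx")
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}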

Dataset format for copying CSV files from an SFTP server to blob storage

Hi, I want to copy CSV files from an SFTP server to blob storage using an ADF copy activity, without processing the content. Is there a difference between using a binary dataset and a CSV dataset for the source and sink?
If I understand the ask correctly, you want to copy the CSV files from SFTP to blob storage as-is. If that is the case, you can use a Binary dataset on both the source and the sink of your copy activity. When using a Binary dataset, the service does not parse the file content but treats it as-is, whereas if you use the CSV file format, the service will parse the file content and you will have to configure the file specs in the connection settings of your dataset.
Please note that when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to another Binary dataset.

Extra Blob Created after Sink in Data Flow

I'm importing from Snowflake to Azure blob storage using a data flow activity in Azure Data Factory.
I noticed that whenever I create a blob through the sink (placed inside the provider/Inbound/ folder), I get an extra empty blob file outside Inbound.
Does this happen for all data flow sinks to blob?
I created a data flow and loaded data from Snowflake to blob, and I don't see any additional blob file generated outside my sink folder.
Make sure the sink connection points to the correct folder, and also double-check whether any process other than this data flow is running that could create an extra file outside the sink folder.
[Screenshots: Snowflake source; sink; output file path for the output file; sink setting adding the date as the filename; output folder; output file generated after executing the data flow.]

How to use file name prefix in Data Factory when importing data into Azure data lake from SAP BW Open Hub?

I have an SAP BW Open Hub source and an Azure Data Lake Gen2 sink in Data Factory, and I am using a copy activity to move the data.
I am attempting to transfer the data to the lake and split it into numerous files, with 200000 rows per file. I would also like to be able to prefix all of the filenames, e.g. 'cust_', so the files would be something along the lines of cust_1, cust_2, cust_3, etc.
This only seems to be an issue when using SAP BW Open Hub as the source (it works fine when using SQL Server as the source). Please see the warning message below. After checking with our internal SAP BW team, they assure me that the data is in a tabular format and no explicit partitioning is enabled, so there shouldn't be an issue.
When executing the copy activity, the files are transferred to the lake, but the file name prefix setting is ignored and the filenames are instead set automatically, as below (the name seems to be made up of the SAP BW Open Hub table and the request ID):
Here is the source config:
[Screenshot: SAP BW Open Hub source configuration.]
All other properties on the other tabs are left at their defaults and have not been changed.
QUESTION: without using a data flow, is there any way to split the files when pulling from SAP BW Open Hub and also dictate the filenames in the lake?
I tried to reproduce the issue, and there is a workaround that works fine. Instead of splitting the data while copying from SAP BW to Azure Data Lake Storage, simply copy the entire dataset (without partitioning) into an Azure SQL Database first. Follow "Copy data from SAP Business Warehouse by using Azure Data Factory" (making sure to use Azure SQL Database as the sink).
Now that the data is in your Azure SQL Database, you can simply use a copy activity to copy the data to Azure Data Lake Storage.
In the source configuration, keep "Partition option" set to None.
[Screenshots: source config; sink config; output.]

When using the Spring Cloud Data Flow SFTP source starter app, the file_name header is not found

The Spring Cloud Data Flow SFTP source starter app states that the file name should be in the headers (mode=contents). However, when I connect this source to a log sink, I see a few headers (like Content-Type) but not the file_name header. I want to use this header to upload the file to S3 with the same name.
Spring server: Spring Cloud Data Flow Local Server (v1.2.3.RELEASE)
My apps are all imported from here.
Stream definition:
stream create --definition "sftp --remote-dir=/incoming --username=myuser --password=mypwd --host=myftp.company.io --mode=contents --filename-pattern=preloaded_file_2017_ --allow-unknown-keys=true | log" --name test_sftp_log
Configuring the log application with --expression=#root --level=debug doesn't make any difference. Also, when I write my own sink that tries to access the file_name header, I get an error message that no such header exists.
Log snippets from the source and sink are in this gist.
Please follow the link below. You need to code your own source and populate such a header manually downstream, after the FileReadingMessageSource, and only then send the message with the content and the appropriate header to the target destination. A minimal sketch follows the link.
https://github.com/spring-cloud-stream-app-starters/file/issues/9
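To make that concrete, here is a minimal Spring Integration sketch of the suggested approach, assuming a custom source in which a transformer sits after the FileReadingMessageSource, reads the file contents, and sets the file_name header explicitly before the message reaches the output binding. The channel names (filesIn, output) are illustrative, not part of any starter app.

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.Transformer;
import org.springframework.integration.file.FileHeaders;
import org.springframework.integration.support.MessageBuilder;
import org.springframework.integration.transformer.GenericTransformer;
import org.springframework.messaging.Message;

public class FileNameHeaderConfig {

    // Turns the Message<File> produced by a FileReadingMessageSource into a
    // byte[] payload and carries the file name explicitly as "file_name"
    // (the FileHeaders.FILENAME constant)
    @Bean
    @Transformer(inputChannel = "filesIn", outputChannel = "output")
    public GenericTransformer<Message<File>, Message<byte[]>> addFileNameHeader() {
        return message -> {
            File file = message.getPayload();
            try {
                return MessageBuilder
                        .withPayload(Files.readAllBytes(file.toPath()))
                        .copyHeaders(message.getHeaders())
                        .setHeader(FileHeaders.FILENAME, file.getName())
                        .build();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }
}

Note that on Spring Cloud Stream 1.x with the Kafka binder, custom headers must also be listed in spring.cloud.stream.kafka.binder.headers, or they will be dropped in transit between the source and the sink.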