When using the Spring Cloud Data Flow SFTP source starter app, the file_name header is not found

The Spring Cloud Data Flow SFTP source starter app documentation states that the file name should be in the headers (mode=contents). However, when I connect this source to a log sink, I see a few headers (like Content-Type) but not the file_name header. I want to use this header to upload the file to S3 with the same name.
Spring server: Spring Cloud Data Flow Local Server (v1.2.3.RELEASE)
My apps are all imported from here.
Stream definition:
stream create --definition "sftp --remote-dir=/incoming --username=myuser --password=mypwd --host=myftp.company.io --mode=contents --filename-pattern=preloaded_file_2017_ --allow-unknown-keys=true | log" --name test_sftp_log
Configuring the log application with --expression=#root --level=debug doesn't make any difference. Also, when I write my own sink that tries to access the file_name header, I get an error message saying that no such header exists.
Log snippets from the source and sink are in this gist.

Please follow the link below. You need to code your own Source and populate such a header manually, downstream of the FileReadingMessageSource, and only then send the message with the content and the appropriate header to the target destination.
https://github.com/spring-cloud-stream-app-starters/file/issues/9
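For illustration, here is a minimal sketch of such a custom source, assuming the annotation-based Spring Cloud Stream programming model and the Spring Integration Java DSL. The directory, poll interval, and class name are placeholders; in the remote case the same header-enrichment step would sit after the (S)FTP inbound adapter instead of the local file adapter.

import java.io.File;

import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.dsl.Files;
import org.springframework.integration.file.transformer.FileToByteArrayTransformer;

@EnableBinding(Source.class)
public class FileNameHeaderSourceConfig {

    @Bean
    public IntegrationFlow fileSourceFlow() {
        return IntegrationFlows
                // Poll a local directory for new files (placeholder path and interval).
                .from(Files.inboundAdapter(new File("/incoming")),
                        e -> e.poller(Pollers.fixedDelay(5000)))
                // Capture the file name into a "file_name" header while the payload
                // is still a java.io.File, before it is replaced by the contents.
                .enrichHeaders(h -> h.headerFunction("file_name",
                        m -> ((File) m.getPayload()).getName()))
                // Replace the File payload with its byte[] contents (mode=contents behaviour).
                .transform(new FileToByteArrayTransformer())
                // Send to the binder's output destination.
                .channel(Source.OUTPUT)
                .get();
    }
}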

Related

How to send data from an S3 bucket to an FTP server with the help of Apache NiFi?

I have file names and file paths in SQL tables; there can be multiple files for each row, and these files are stored in an S3 bucket. I now want to send via FTP all the files whose names appear in the SQL rows. How can we achieve this with Apache NiFi?
I tried GetFile and ListS3 but was not able to come to a conclusion.
You need the components below to develop your data ingestion pipeline:
Processors
QueryDatabaseTableRecord: query the database and fetch the result data
SplitText: split the query result line by line
ExtractText: extract the content of each FlowFile into attributes
UpdateAttribute: derive the S3 object key, bucket, FTP filename, etc. attributes
FetchS3Object: retrieves the contents of an S3 Object and writes it to the content of a FlowFile
PutFTP: sends FlowFiles to an FTP Server
Controller Services:
DBCPConnectionPool: configure database instance connection, required by QueryDatabaseTableRecord
CSVRecordSetWriter: required by QueryDatabaseTableRecord to parse result dataset
AWSCredentialsProviderControllerService: configure AWS credentials, required for FetchS3Object
Flow Design:
Derive the S3 and FTP attributes in one go in UpdateAttribute:
QueryDatabaseTableRecord -> SplitText -> ExtractText -> UpdateAttribute -> FetchS3Object -> PutFTP
All the processors are self-explanatory, so refer to the official documentation for their property configuration.

How to process a large file from an S3 bucket using Spring Batch

Hello, I am trying to execute the example posted in a comment of the following post.
I'm accessing the bucket and reading a list of files, but when I execute the reader I receive the following error message: "Caused by: java.lang.IllegalStateException: Input resource must exist (reader is in 'strict' mode): ServletContext resource [/s3://bkt-csv-files/files/23-12-2022/arquivo_jan_2022_pt_00]". How can I resolve this error, or is there another way to read files from S3 using Spring Batch?
I did not try the example you shared, but I would do it differently.
The FlatFileItemReader works with any implementation of the Resource interface. So if you manage to get an accessible resource in S3, you can use it with your item reader.
For example, you can use a UrlResource that points to your file in S3 and set it on the item reader.
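Here is a rough sketch of wiring a UrlResource into the reader. It assumes the object is reachable over HTTP(S), for instance a public object or a presigned URL generated with the AWS SDK; the bucket, key, reader name, and line mapping are placeholders.

import java.net.MalformedURLException;

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.core.io.UrlResource;

public class S3ItemReaderConfig {

    public FlatFileItemReader<String> s3FileReader() throws MalformedURLException {
        // Placeholder URL: point it at your object, or generate a presigned URL.
        UrlResource resource = new UrlResource(
                "https://bkt-csv-files.s3.amazonaws.com/files/23-12-2022/arquivo_jan_2022_pt_00");

        return new FlatFileItemReaderBuilder<String>()
                .name("s3FileReader")
                .resource(resource)                      // any Resource implementation works here
                .lineMapper((line, lineNumber) -> line)  // keep each raw line as the item
                .build();
    }
}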
This might help as well: Spring Batch - Read files from Aws S3

Aspera Node API /files/{id}/files endpoint not returning up to date data

I am working on a webapp for transferring files with Aspera. We are using AoC for the transfer server and an S3 bucket for storage.
When I upload a file to my S3 bucket using Aspera Connect, everything appears to be successful; I see it in the bucket, and I see the new file in the directory when I run /files/browse on the parent folder.
I am refactoring my code to use the /files/{id}/files endpoint to list the directory because the documentation says it is faster compared to the /files/browse endpoint. After the upload is complete, when I run the /files/{id}/files GET request, the new file does not show up in the returned data right away. It only becomes available after a few minutes.
Is there some caching mechanism in place? I can't find anything about this in the documentation. When I make a transfer in the AoC dashboard everything updates right away.
Thanks,
Tim
Yes, the file-id-based system uses an in-memory cache (Redis).
This cache is updated when a new file is uploaded using Aspera. But for file movements made directly on the storage, there is a daemon that periodically scans the storage and finds new files.
If you want to bypass the cache and have the API read the storage directly, you can add this header to the request:
X-Aspera-Cache-Control: no-cache
Another possibility is to trigger a scan by reading:
/files/{id}
for the folder id
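For example, here is a minimal sketch of a listing call with the cache-bypass header, using Java's built-in HTTP client; the node URL, file id, and bearer token are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NodeApiNoCacheExample {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Placeholder node URL, file id, and access token.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://node.example.com/files/12345/files"))
                .header("Authorization", "Bearer <access-token>")
                // Ask the API to read the storage instead of the in-memory cache.
                .header("X-Aspera-Cache-Control", "no-cache")
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}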

Why does Kafka Connect Sftp source directory need to be writeable?

using the connector: https://docs.confluent.io/current/connect/kafka-connect-sftp/source-connector/index.html
When I configure the connector and check the status, I get the exception below...
org.apache.kafka.connect.errors.ConnectException: Directory for 'input.path' '/FOO' it not writable.\n\tat io.confluent.connect.sftp.source.SftpDirectoryPermission.directoryWritable
This makes no sense from a source standpoint, especially if you are connecting to a 3rd-party source you do NOT control.
You need write permissions because the connector will move the files that it has read into the configurable finished.path. This movement into finished.path is explained in the link you provided:
Once a file has been read, it is placed into the configured finished.path directory.
The documentation for the input.path configuration also states that you need write access to it:
input.path - The directory where Kafka Connect reads files that are processed. This directory must exist and be writable by the user running Connect.

spring cloud stream app starter File Source to Spring Batch Cloud Task

I have a Spring Batch Boot app which takes a flat file as input. I converted the app into a cloud task and deployed it in the Spring Cloud Data Flow local server. Next, I created a stream starting with File Source -> tasklaunchrequest-transform -> task-launcher-local, which starts my batch cloud task app.
It looks like the file does not make it into the batch app. I do not see anything in the logs to indicate that it does.
I checked the docs at https://github.com/spring-cloud-stream-app-starters/tasklaunchrequest-transform/tree/master/spring-cloud-starter-stream-processor-tasklaunchrequest-transform
It says
Any input type. (payload and header are discarded)
My question is: how do I pass the file as a payload from the File Source to the batch app? This seems like a very basic feature.
Any help is very much appreciated.
You'll need to write your own transformer that takes the data from the source and packages it up so your task can consume it.
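For example, here is a rough sketch of such a processor. It assumes the File Source runs in ref mode so the payload is a java.io.File, and it uses the TaskLaunchRequest type from spring-cloud-task's launcher module; the task artifact URI and argument name are placeholders, and the constructor arguments may differ slightly between versions.

import java.io.File;
import java.util.Collections;
import java.util.List;

import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.cloud.task.launcher.TaskLaunchRequest;
import org.springframework.integration.annotation.Transformer;
import org.springframework.messaging.Message;

@EnableBinding(Processor.class)
public class FileToTaskLaunchRequestTransformer {

    @Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    public TaskLaunchRequest transform(Message<File> message) {
        File file = message.getPayload();

        // Hand the file location to the batch task as a command-line argument;
        // the batch job can then read it as a job parameter.
        List<String> args = Collections.singletonList(
                "localFilePath=" + file.getAbsolutePath());

        return new TaskLaunchRequest(
                "maven://com.example:my-batch-task:1.0.0", // placeholder task artifact URI
                args,
                null,   // environment properties
                null,   // deployment properties
                null);  // application name
    }
}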