How to process a large file from a s3 bucket using spring batch

How to process a large file from a s3 bucket using spring batch - spring-batch

Hellow, I am trying execute the example posted in comment of the follow post.
I`m accessing the bucket and reading a list of file, but when I go execute the reader I receive the follow error message: "Caused by: java.lang.IllegalStateException: Input resource must exist (reader is in 'strict' mode): ServletContext resource [/s3://bkt-csv-files/files/23-12-2022/arquivo_jan_2022_pt_00]". How can I resolved this error, or, there is another way to read the files on s3 using spring-batch?

I did not try the example you shared, but I would do it differently.
The FlatFileItemReader works with any implementation of the Resource interface. So if you manage to get an accessible resource in S3, you can use it with your item reader.
For example, you can use a URLResource that points to your file in S3 and set it on the item reader.
This might help as well: Spring Batch - Read files from Aws S3

Related

Aspera Node API /files/{id}/files endpoint not returning up to date data

I am working on a webapp for transferring files with Aspera. We are using AoC for the transfer server and an S3 bucket for storage.
When I upload a file to my s3 bucket using aspera connect everything appears to be successful, I see it in the bucket, and I see the new file in the directory when I run /files/browse on the parent folder.
I am refactoring my code to use the /files/{id}/files endpoint to list the directory because the documentation says it is faster compared to the /files/browse endpoint. After the upload is complete, when I run the /files/{id}/files GET request, the new file does not show up in the returned data right away. It only becomes available after a few minutes.
Is there some caching mechanism in place? I can't find anything about this in the documentation. When I make a transfer in the AoC dashboard everything updates right away.
Thanks,
Tim

Yes, the file-id base system uses an in-memory cache (redis).
This cache is updated when a new file is uploaded using Aspera. But for files movement directly on the storage, there is a daemon that will periodically scan and find new files.
If you want to bypass the cache, and have the API read the storage, you can add this header in the request:
X-Aspera-Cache-Control: no-cache
Another possibility is to trigger a scan by reading:
/files/{id}
for the folder id

Azure Data Factory copy data - how do I save an http downloaded zip to blob store?

I have a simple copy data activity, with an HTTP connector source, and Azure Blob Storage as the sink. The file is a zip file so I am using a binary dataset for source and a binary for sink.
The data is properly fetched (I believe - looking at bytes transferred). However, I cannot save it to the Blob Store. In this scenario, you do not get to set the filename, only the path (container/directory). The filename used is the name of the file that I fetched.
However, the filename used in the sink step is prefixed with a backslash. It does not exist in the source, and I can find no way to remove it, and with a filename like that, I get a failure:
Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path extract/coEDW\XXXX_Data_etc.zip.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified resource does not exist. RequestId:bfe4e2f6-501e-002e-6a21-eaf10e000000 Time:2021-01-14T02:59:24.3300081Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,'
(filename masked by me)
I am sure the fix is simple, but I cannot figure this out. Can anyone help?
Thanks.

You will have to add a dynamic file name for your Blob sink.
You can use the below example to see how to dynamically add a file name using variables:
In this example, the file name is having date and time fields to mark each file with their date and time.
Let me know if that works.

Creating and using a custom kafka connect configuration provider

I have installed and tested kafka connect in distributed mode, it works now and it connects to the configured sink and reads from the configured source.
That being the case, I moved to enhance my installation. The one area I think needs immediate attention is the fact that to create a connector, the only available mean is through REST calls, this means I need to send my information through the wire, unprotected.
In order to secure this, kafka introduced the new ConfigProvider seen here.
This is helpful as it allows to set properties in the server and then reference them in the rest call, like so:
{
.
.
"property":"${file:/path/to/file:nameOfThePropertyInFile}"
.
.
}
This works really well, just by adding the property file on the server and adding the following config on the distributed.properties file:
config.providers=file # multiple comma-separated provider types can be specified here
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider
While this solution works, it really does not help to easy my concerns regarding security, as the information now passed from being sent over the wire, to now be seating on a repository, with text on plain sight for everyone to see.
The kafka team foresaw this issue and allowed clients to produce their own configuration providers implementing the interface ConfigProvider.
I have created my own implementation and packaged in a jar, givin it the sugested final name:
META-INF/services/org.apache.kafka.common.config.ConfigProvider
and added the following entry in the distributed file:
config.providers=cust
config.providers.cust.class=com.somename.configproviders.CustConfigProvider
However I am getting an error from connect, stating that a class implementing ConfigProvider, with the name:
com.somename.configproviders.CustConfigProvider
could not be found.
I am at a loss now, because the documentation on their site is not explicit about how to configure custom config providers very well.
Has someone worked on a similar issue and could provide some insight into this? Any help would be appreciated.

I just went through these to setup a custom ConfigProvider recently. The official doc is ambiguous and confusing.
I have created my own implementation and packaged in a jar, givin it the sugested final name:
META-INF/services/org.apache.kafka.common.config.ConfigProvider
You could name the final name of jar whatever you like, but needs to pack to jar format which has .jar suffix.
Here is the complete step by step. Suppose your custom ConfigProvider fully-qualified name is com.my.CustomConfigProvider.MyClass.
1. create a file under directory: META-INF/services/org.apache.kafka.common.config.ConfigProvider. File content is full qualified class name:
com.my.CustomConfigProvider.MyClass
Include your source code, and above META-INF folder to generate a Jar package. If you are using Maven, file structure looks like this
put your final Jar file, say custom-config-provider-1.0.jar, under the Kafka worker plugin folder. Default is /usr/share/java. PLUGIN_PATH in Kafka worker config file.
Upload all the dependency jars to PLUGIN_PATH as well. Use the META-INFO/MANIFEST.MF file inside your Jar file to configure the 'ClassPath' of dependent jars that your code will use.
In kafka worker config file, create two additional properties:
CONNECT_CONFIG_PROVIDERS: 'mycustom', // Alias name of your ConfigProvider
CONNECT_CONFIG_PROVIDERS_MYCUSTOM_CLASS:'com.my.CustomConfigProvider.MyClass',
Restart workers
Update your connector config file by curling POST to Kafka Restful API. In Connector config file, you could reference the value inside ConfigData returned from ConfigProvider:get(path, keys) by using the syntax like:
database.password=${mycustom:/path/pass/to/get/method:password}
ConfigData is a HashMap which contains {password: 123}
If you still seeing ClassNotFound exception, probably your ClassPath is not setup correctly.
Note:
• If you are using AWS ECS/EC2, you need to set the worker config file by setting the environment variable.
• worker config and connector config file are different.

when using the spring cloud data flow sftp source starter app file_name header is not found

spring cloud dataflow sftp source starter app states that file name should be in the headers (mode=contents). However, when I connect this source to a log sink, I see a few headers (like Content-Type) but not the file_name header. I want to use this header to upload the file to S3 with the same name.
spring server: Spring Cloud Data Flow Local Server (v1.2.3.RELEASE)
my apps are all imported from here
stream definition:
stream create --definition "sftp --remote-dir=/incoming --username=myuser --password=mypwd --host=myftp.company.io --mode=contents --filename-pattern=preloaded_file_2017_ --allow-unknown-keys=true | log" --name test_sftp_log
configuring the log application to --expression=#root --level=debug doesn't make any difference. Also, writing my own sink that tries to access the file_name header I get an error message that such header does not exist
logs snippets from the source and sink are in this gist

Please follow this link bellow, You need to code your own Source and populate such a header manually downstream already after FileReadingMessageSource. And only after that send the message with content and appropriate header to the target destination.
https://github.com/spring-cloud-stream-app-starters/file/issues/9

IBM Integration Bus: The PIF data could not be found for the specified application

I'm using IBM Integration Bus v10 (previously called IBM Message Broker) to expose COBOL routines as SOAP Web Services.
COBOL routines are integrated into IIB through MQ queues.
We have imported some COBOL copybooks as DFDL schemas in IIB, and the mapping between SOAP messages and DFDL messages is working fine.
However, when the message reaches a node where a serialization of the message tree has to take place (for example, a FileOutput or a MQ request), it fails with the following error:
"The PIF data could not be found for the specified application"
This is the last part of the stack trace of the exception:
RecoverableException
File:CHARACTER:F:\build\slot1\S000_P\src\DataFlowEngine\TemplateNodes\ImbOutputTemplateNode.cpp
Line:INTEGER:303
Function:CHARACTER:ImbOutputTemplateNode::processMessageAssemblyToFailure
Type:CHARACTER:ComIbmFileOutputNode
Name:CHARACTER:MyCustomFlow#FCMComposite_1_5
Label:CHARACTER:MyCustomFlow.File Output
Catalog:CHARACTER:BIPmsgs
Severity:INTEGER:3
Number:INTEGER:2230
Text:CHARACTER:Caught exception and rethrowing
Insert
Type:INTEGER:14
Text:CHARACTER:Kcilmw20Flow.File Output
ParserException
File:CHARACTER:F:\build\slot1\S000_P\src\MTI\MTIforBroker\DfdlParser\ImbDFDLWriter.cpp
Line:INTEGER:315
Function:CHARACTER:ImbDFDLWriter::getDFDLSerializer
Type:CHARACTER:ComIbmSOAPInputNode
Name:CHARACTER:MyCustomFlow#FCMComposite_1_7
Label:CHARACTER:MyCustomFlow.SOAP Input
Catalog:CHARACTER:BIPmsgs
Severity:INTEGER:3
Number:INTEGER:5828
Text:CHARACTER:The PIF data could not be found for the specified application
Insert
Type:INTEGER:5
Text:CHARACTER:MyCustomProject
It seems like something is missing in my deployable BAR file. It's important to say that my application has the message flow and it depends on a shared library that has all the .xsd files (DFDLs).
I suppose that the schemas are OK, as I've generated them using the Toolkit wizard, and the message parsing works well. The problem is only with serialization.
Does anybody know what may be missing here?

OutputRoot.Properties.MessageType must contain the name of the message in the DFDL schema. Additionally when the DFDL schema is in a shared library, OutputRoot.Properties.MessageSet must contain the name of the library.

Sounds as if OutputRoot.Properties is not pointing at the shared library. I cannot remember which subfield does that job - it is either OutputRoot.Properties.MessageType or OutputRoot.Properties.MessageSet.
You can easily check - just check the contents of InputRoot.Properties after an input node that has used the same shared libary.

Faced a similar problem. In my case, a message flow with an HttpRequest node using a DFDL domain parser / format to parse an HTTP response from the remote system threw this error (PIF data could not be found for the specified application). "Re-selecting" the same parser domain & message type on the node followed by build / redeploy solved the problem. Seemed to be a project reference related issue within the IIB toolkit.

you need to create static libraries and refer to application.
in compute node ur coding is based on dfdl body

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse