How to get data transfer completion status in NiFi for SFTP transfer via REST

I have created a flow in NiFi to transfer data from one Linux machine to another Linux machine.
Flow is like this:
GetSFTP-->UpdateAttribute-->PutSFTP
I am managing everything through the NiFi REST API, i.e. creating the flow, updating attributes, and starting the flow.
How can I get the completion status of the data transfer, so that I can stop the flow?
Thanks.

The concept of being "complete" is something that NiFi can't really know here. How would NiFi know that another file isn't going to be added to the directory that GetSFTP is watching?
From NiFi's perspective the dataflow is running until someone says to stop it. It is not really meant to be a job system where you submit a job that starts and completes; it is a running dataflow that handles an infinite stream of data.
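That said, since everything is already driven through the REST API, one workaround is to define your own idle condition: poll the process group's status and stop the group once nothing is queued. A minimal sketch, assuming an unsecured single-node NiFi at localhost:8080 and a known process group id (the JSON field names follow my reading of the status entity, so verify them against your NiFi version):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.ObjectMapper;

public class StopFlowWhenIdle {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8080/nifi-api";   // assumed unsecured NiFi instance
        String pgId = args[0];                            // id of the process group holding the flow
        HttpClient http = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        while (true) {
            // Ask NiFi for the process group status and read the aggregate queued-FlowFile count.
            HttpRequest statusReq = HttpRequest.newBuilder(
                    URI.create(base + "/flow/process-groups/" + pgId + "/status")).GET().build();
            String body = http.send(statusReq, HttpResponse.BodyHandlers.ofString()).body();
            int queued = mapper.readTree(body)
                    .path("processGroupStatus").path("aggregateSnapshot")
                    .path("flowFilesQueued").asInt();

            if (queued == 0) {
                // Schedule every component in the group STOPPED.
                HttpRequest stopReq = HttpRequest.newBuilder(URI.create(base + "/flow/process-groups/" + pgId))
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString(
                                "{\"id\":\"" + pgId + "\",\"state\":\"STOPPED\"}"))
                        .build();
                http.send(stopReq, HttpResponse.BodyHandlers.ofString());
                break;
            }
            Thread.sleep(10_000);   // poll every 10 seconds
        }
    }
}

Note that an empty queue only means nothing is in flight right now; as said above, NiFi cannot know whether another file will appear on the source side, so "done" here is a convention you define, not something NiFi reports.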

Related

How To Design a Distributed Logging System in Kubernetes?

I'm designing a distributed application, comprised of several Spring microservices that will be deployed with Kubernetes. It is a batch processing app, and a typical request could take several minutes of processing, with the processing getting distributed across the services, using Kafka as a message broker.
A requirement of the project is that each request will generate a log file, which will need to be stored on the application file store for retrieval. The current design is, all the processing services write log messages (with the associated unique request ID) to Kafka, and there is a dedicated logging microservice that reads these messages down, does some formatting and should persist them to the log file associated with the given request ID.
I'm very unfamiliar with how files should be stored in web applications. Should I be storing these log files on the local file system? If so, wouldn't that mean this "logging service" couldn't be scaled? For example, if I scaled the log service to 2 instances, then each instance would only have access to half of the log files in theory. And if a user makes a request to retrieve a log file, there is no guarantee that the requested log file will be at whichever log service instance the Kubernetes load balancer routed them to.
What is the currently accepted "best practice" for having a file system in a distributed application? Or should I just accept that the logging service can never be scaled up?
A possible solution I can think of would be to store the text log files in our MySQL database as TEXT rows, making the logging service effectively stateless. If someone could point out any potential issues with this, it would be much appreciated.
deployed with Kubernetes
each request will generate a log file, which will need to be stored on the application file store
Don't do this. Use a Fluentd / Filebeat / promtail / Splunk forwarder sidecar that gathers stdout from the container processes.
Or have your services write to a Kafka logs topic rather than create files (see the sketch at the end of this answer).
With either option, ship the logs to a backend such as Elasticsearch, Grafana Loki, or Splunk.
https://kubernetes.io/docs/concepts/cluster-administration/logging/#sidecar-container-with-a-logging-agent
wouldn't that mean this "logging service" couldn't be scaled?
No, each of these services is designed to be scaled.
possible solution I can think of would just store the text log files in our MySQL database as TEXT rows,
Sure, but Elasticsearch and Solr are purpose-built for gathering and searching plain text; MySQL is not.
Don't treat logs as something application-specific. In other words, your solution shouldn't be unique to Spring.
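If you go the Kafka route, the producer side can stay framework-agnostic. A minimal sketch, assuming a topic named application-logs and using the request ID as the record key so that all lines for one request stay ordered within a partition (the topic name and key choice are my assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RequestLogProducer {
    private final KafkaProducer<String, String> producer;

    public RequestLogProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Publish one log line for a request; keying by request ID keeps per-request ordering. */
    public void log(String requestId, String line) {
        producer.send(new ProducerRecord<>("application-logs", requestId, line));
    }

    public void close() {
        producer.close();
    }
}

Any number of consumers (Logstash, a Kafka Connect Elasticsearch sink connector, or your own service) can then index the topic into Elasticsearch, and scaling them becomes a matter of partitions and consumer instances rather than shared files.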

Write Log Files to Slow Disk or Send Tomcat Access Logs to Elasticsearch?

My service (tomcat/java) is running on a kubernetes cluster (AKS).
I would like to write the log files (tomcat access logs, application logs with logback) to an AzureFile volume.
I do not want to write the access logs to the stdout, because I do not want to mix the access logs with the application logs.
Question
I expect that all logging is done asynchronously, so that writing to the slow AzureFile volume should not affect the performance.
Is this correct?
Update
In the end I want to collect the logfiles so that I can send all logs to ElasticSearch.
Especially I need a way to collect the access logs.
If you want to send your access logs to Elasticsearch, you just need to extend AbstractAccessLogValve and implement the log method.
AbstractAccessLogValve already contains the logic to format the messages, so you only need to add the logic to send the formatted message.
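A rough sketch of what that could look like (the Elasticsearch URL, index name, and the fire-and-forget HTTP call are my assumptions, not part of Tomcat):

import java.io.CharArrayWriter;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.catalina.valves.AbstractAccessLogValve;

public class ElasticsearchAccessLogValve extends AbstractAccessLogValve {

    private final HttpClient http = HttpClient.newHttpClient();

    @Override
    protected void log(CharArrayWriter message) {
        // "message" already contains the line formatted according to the valve's pattern.
        String json = "{\"message\":" + toJsonString(message.toString()) + "}";
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://elasticsearch:9200/tomcat-access/_doc"))   // assumed ES endpoint/index
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        // Send asynchronously so request threads are not blocked by Elasticsearch.
        http.sendAsync(request, HttpResponse.BodyHandlers.discarding());
    }

    private static String toJsonString(String s) {
        return '"' + s.replace("\\", "\\\\").replace("\"", "\\\"") + '"';
    }
}

In practice you would batch the documents (for example via the _bulk API) instead of posting one request per access-log line.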
Yes, you are right, but it still depends on how you are writing the logs. Even when writing asynchronously, the writes take longer when the file system is slow, and if it is NFS there is also the chance of network latency, etc.
I have seen performance issues when attaching NFS and bucket volumes directly to multiple pods.
If the writes are slow, the async thread may take longer to complete its job and consume more resources, although this also depends on the code and how it is written.
Typically, people store logs in Elasticsearch for fast retrieval and easy management.
People use different stacks depending on their requirements, but most of them are backed by Elasticsearch, for example Graylog and ELK.
For sending logs to these stacks, people use UDP; I personally prefer GELF over UDP, throwing the logs at Graylog and forgetting about them.
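For what it's worth, GELF over UDP is just a small JSON datagram, so the fire-and-forget part needs very little code. A minimal sketch, assuming an uncompressed payload that fits in a single datagram and a Graylog GELF UDP input listening on graylog:12201 (host and port are my assumptions):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class GelfUdpSender {
    public static void main(String[] args) throws Exception {
        // Minimal GELF 1.1 payload: version, host and short_message are the required fields.
        String gelf = "{\"version\":\"1.1\",\"host\":\"my-service\","
                + "\"short_message\":\"request 42 finished\",\"level\":6}";
        byte[] payload = gelf.getBytes(StandardCharsets.UTF_8);

        try (DatagramSocket socket = new DatagramSocket()) {
            // Send and forget: no acknowledgement, no retry, which is the trade-off of UDP.
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("graylog"), 12201));
        }
    }
}

Larger or high-volume messages would normally go through a GELF library that handles chunking and compression for you.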

Get source/upstream connection's processor name in NiFi

I want to monitor the FlowFiles in NiFi from a business perspective.
So I have added an ExecuteScript processor with a Python script that creates a message and pushes it to Elasticsearch after each processor.
I want the name or id of the processor upstream of this ExecuteScript processor, so that I can keep appending it to the FlowFile. That will let me know which stages/processors the FlowFile has passed through, and I can monitor it in ELK.
I think the best way to monitor FlowFiles is to use the Provenance logs. You can also export those logs to ELK using another NiFi instance and S2S.
Anyway, if you want to get the name of a connection's source or destination processor using the REST API, you can get it when you browse a process group's connections.
Example:
/nifi-api/process-groups/{processGroupId}/connections/
You'll get an array of connections. In the connection object, you will get the name of the source in the path component/source/name. The same goes for destination.
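For example, something along these lines (a sketch assuming anonymous access and Jackson for the JSON parsing):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ListConnectionEndpoints {
    public static void main(String[] args) throws Exception {
        String processGroupId = args[0];
        String url = "http://localhost:8080/nifi-api/process-groups/" + processGroupId + "/connections";

        HttpClient http = HttpClient.newHttpClient();
        String body = http.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();

        // The response holds an array of connections; each one names its source and destination.
        JsonNode connections = new ObjectMapper().readTree(body).path("connections");
        for (JsonNode connection : connections) {
            JsonNode component = connection.path("component");
            System.out.println(component.path("source").path("name").asText()
                    + " -> " + component.path("destination").path("name").asText());
        }
    }
}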
EDIT:
To use provenance logs you need to do as follows:
1. Send the provenance logs to another NiFi instance (this is restricted to NiFi since it uses S2S).
2. Parse the logs in this NiFi instance.
3. Send the logs to Elasticsearch using the PutElasticSearch5 processor.
This works well and will help you monitor the FlowFiles :)

DataStage Automation

I am currently working on a project where we are using the CDC Transaction stage, which automatically captures change data from the source. This data is used to create XML, which is in turn passed to MQ, and MDM picks it up from MQ.
Now I have a requirement to automate the deployment process in DataStage, and I am currently looking for an architecture to achieve this. As part of the automated deployment, there should be a mechanism to stop the CDC transaction process, stop the MQ process, and deploy the jobs. Once the deployment is successful, I need to restart the CDC transaction process and the MQ process from the point where they stopped. I also need a rollback mechanism in case the deployment goes wrong. Please let me know your thoughts on this, so that I can create a good solution.

Apache Ignite Failover functionality

I have set up Apache Ignite on a cluster of nodes and sent a job to a server node to run. When the connection to that server node is lost, I need to somehow store the result of that node locally (either in a binary file or in some other way). Then, when the connection with that node is established again, I need to push the stored results back to a database server.
I'm working under .Net platform.
I can use
EventType.EVT_CLIENT_NODE_DISCONNECTED
EventType.EVT_CLIENT_NODE_RECONNECTED
these events and implement the 'store locally' and 'push to the DB server' functionality inside their handlers, but I wanted to find a ready-made solution.
Is there any ready-made tool with the functionality I mentioned that I can just take and use?
You can take a look at Checkpointing. I'm not sure this is exactly what you described (mainly because it will save the intermediate state on the server side), but I think it can be quite helpful.
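If you do try checkpointing, the idea is that a compute job periodically saves its partial state through the task session and reloads it after a failover. A rough Java sketch of the concept (the question is about .NET, so treat this purely as an illustration; the checkpoint key and the int[] state are made up for the example):

import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeTaskSession;
import org.apache.ignite.resources.TaskSessionResource;

public class ResumableJob extends ComputeJobAdapter {

    // Injected by Ignite; checkpoints also require the enclosing ComputeTask to be
    // annotated with @ComputeTaskSessionFullSupport.
    @TaskSessionResource
    private ComputeTaskSession session;

    @Override
    public Object execute() {
        // Resume from the last saved state if this job was failed over from another node.
        int[] saved = session.loadCheckpoint("progress");
        int start  = saved == null ? 0 : saved[0];
        int result = saved == null ? 0 : saved[1];

        for (int i = start; i < 1_000; i++) {
            result += i;                                                        // stand-in for real work
            if (i % 100 == 0) {
                session.saveCheckpoint("progress", new int[] {i + 1, result}); // persist partial state
            }
        }
        session.removeCheckpoint("progress");                                   // clean up once done
        return result;
    }
}

Where the checkpoints are actually stored is governed by the CheckpointSpi configured on the cluster (for example the shared-filesystem or cache-based implementations), so the saved state can outlive the node that produced it.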