FlowFiles stuck in the queue in a NiFi cluster

I am currently running NiFi 1.9.2 in a clustered environment with 3 nodes. Recently I have noticed that the flow seems to get stuck. The queue shows that there are items in it, but nothing reaches the downstream processor. When I list the items in the queue, I get "The queue has no FlowFiles".
The queue in this case is set to load-balance with round robin. If I stop the downstream processor, change the queue's configuration to not load-balance, and then switch it back to round robin, the queue items distribute to the other two nodes, and I can see the FlowFiles when I list the items in the queue. However, items show up on only two of the three nodes. When I restart the downstream processor, two-thirds of the items get processed, leaving the third that sits on the node whose queue items I cannot see. This behavior persists even after restarting the cluster service.
If I change the queue to not load-balance, then everything seems to land on a good node and the queue gets emptied. So it looks like something is wrong on my first node.
Any suggestions on what to try?
Thanks,
-tj

You should check disk usage. If the usage of the disk where NiFi is located is equal to or higher than the "nifi.content.repository.archive.max.usage.percentage" setting in the nifi.properties file, you may see this kind of strange behavior from NiFi. If you are in that situation, you can try deleting old NiFi log files to free up space.
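As a quick sanity check, you can compare the disk usage of the content repository's volume against that threshold. A minimal Python sketch; the repository path and the threshold are placeholders you would take from your own nifi.properties:

```python
import shutil

# Placeholder values: point CONTENT_REPO_PATH at your content repository's
# volume and set MAX_USAGE_PCT to the value of
# nifi.content.repository.archive.max.usage.percentage from nifi.properties.
CONTENT_REPO_PATH = "/"
MAX_USAGE_PCT = 50.0

def disk_usage_pct(path):
    """Return the percentage of the disk at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0

if disk_usage_pct(CONTENT_REPO_PATH) >= MAX_USAGE_PCT:
    print("Disk usage is at or above the archive threshold; "
          "NiFi may stop archiving and queues can appear stuck.")
```

Run this on each node; if only the first node is over the threshold, that would match the behavior you describe.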

Related

How to handle out of order Zookeeper notifications?

I have multiple processes operating on an in-memory queue. That queue is a manifestation of sequential znodes created/deleted at Zookeeper.
When a znode is added, an equivalent item is added to the queue at all the involved processes. And also when a znode is removed, the equivalent item is removed from the queue at every involved process.
The addition and removal signals are expected to be balanced because every added item should eventually be removed.
I faced a situation where a znode was added and removed very quickly, and the removal notification was received at one of the processes before the addition notification. An attempt to remove that item occurred but failed because it wasn't actually there; then the addition signal was received, which added the item, but it was never removed.
A simple solution would be to assert the existence of the equivalent znode after adding the item to the queue. That's good enough for me for now, but it doesn't seem as efficient as it could be.
My question is whether there is a way to handle this scenario more efficiently, or in a more "ZooKeeper" way?
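One common pattern for tolerating this kind of reordering (my suggestion, not a ZooKeeper API) is to keep a "tombstone" set: a removal that arrives before its matching addition is remembered, and the late addition cancels against it instead of being enqueued. A minimal sketch:

```python
class ReorderTolerantQueue:
    """In-memory mirror of sequential znodes that tolerates a removal
    notification arriving before its matching addition notification.
    This is a generic tombstone pattern, not ZooKeeper-specific code."""

    def __init__(self):
        self.items = set()        # items currently believed to exist
        self.tombstones = set()   # removals seen before their addition

    def on_added(self, item):
        if item in self.tombstones:
            # The removal already arrived; the two events cancel out.
            self.tombstones.discard(item)
        else:
            self.items.add(item)

    def on_removed(self, item):
        if item in self.items:
            self.items.discard(item)
        else:
            # Addition not seen yet; remember it so on_added cancels it.
            self.tombstones.add(item)

q = ReorderTolerantQueue()
q.on_removed("job-1")   # removal notification arrives first
q.on_added("job-1")     # late addition is cancelled, never enqueued
print(sorted(q.items))  # -> []
```

Since every addition is eventually matched by a removal (as you note, the signals are balanced), tombstones are always cleaned up and no existence check against ZooKeeper is needed.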
You're trying to use ZooKeeper as a message queue, which it is not designed for. ZooKeeper provides neither ordering nor delivery guarantees for watcher notifications.
Instead, you should use a messaging system like Kafka or RabbitMQ for this use case.

Initializing a new service when the number of jobs in a GCP Pub/Sub queue is too high

I am working on a project on GCP and I need to create a system that works like a load balancer, but the load is the number of items in a pub/sub queue.
Here is more detail:
I have a message queue which is based on pub/sub.
A lot of messages are posted to this queue, and I have one service which consumes them.
It takes some hours to process each item in the queue.
I want to start a new service (a Docker image) when the number of items in the queue becomes very big (say, start a new service when the queue grows past 10 items, start another one past 20, and so on), and shut services down when the number of items in the queue drops (so, for example, if the queue falls below 20 items, shut down the extra services so only 2 services remain live).
Now my questions are:
How can I do this?
Is Kubernetes a good solution? If so, where can I find more information about it?
Can I do it with Pub/Sub? If so, where can I find information?
I have a similar requirement (that I am struggling with). I would suggest you look at a Horizontal Pod Autoscaler based on an external Stackdriver monitoring metric to see if it will meet your needs. This process is discussed here:
https://cloudplatform.googleblog.com/2018/05/Beyond-CPU-horizontal-pod-autoscaling-comes-to-Google-Kubernetes-Engine.html
and here:
https://cloud.google.com/kubernetes-engine/docs/tutorials/external-metrics-autoscaling
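As a rough sketch of where that second tutorial ends up, the HPA targets the subscription's undelivered-message count as an external metric. The Deployment name, subscription ID, and target value below are placeholders, not values from your setup:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: consumer-autoscaler        # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pubsub-consumer          # placeholder: the Deployment running your worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: pubsub.googleapis.com|subscription|num_undelivered_messages
      metricSelector:
        matchLabels:
          resource.labels.subscription_id: my-subscription   # placeholder
      targetAverageValue: "10"     # aim for ~10 undelivered messages per pod
```

With a target average of 10 undelivered messages per pod, this approximates your "one more service per 10 queued items" policy, and the HPA scales back down as the backlog shrinks.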

Why doesn't my Azure Function scale up?

For a test, I created a new function app. I added two functions: an HTTP trigger that, when invoked, pushed 500 messages to a queue, and a queue trigger to read the messages. The queue-trigger function code was set up to read a message and sleep randomly from 1 to 30 seconds. This was intended to simulate longer-running tasks.
I invoked the HTTP trigger to create the messages, then watched the queue fill up (messages were processed by the other trigger). I also wired up App Insights to this function app, but I did not see it scale beyond 1 server.
Do Azure Functions scale up solely on the number of messages in the queue?
Also, I implemented these functions in PowerShell.
If you're running in the Azure Functions consumption plan, we monitor both the length and the throughput of your queue to determine whether additional VM resources are needed.
Note that a single function app instance can process multiple queue messages concurrently without needing to scale across multiple VMs. So if all 500 messages can be consumed relatively quickly (again, in the consumption plan), then it's possible that you won't scale at all.
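If you want to limit that per-instance concurrency for an experiment like this, the queue trigger's batching is configurable in host.json. A sketch using the Functions v2 schema; treat the values as illustrative, not as a scaling recommendation:

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 1,
      "newBatchThreshold": 0
    }
  }
}
```

With batchSize 1 and newBatchThreshold 0, each instance processes one queue message at a time, which makes the backlog grow faster and gives the scale controller more reason to add instances.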
The exact algorithm for scaling isn't published (it's subject to lots of tweaking), but generally speaking you can expect the system to automatically scale you out if messages are getting added to the queue faster than your functions can process them. Your app will also scale out if the latency of the first message in the queue is continuously increasing (meaning, messages are sitting idle and not getting processed). The time between VMs getting added is usually in the tens of seconds.
There are some thresholds based on queue count as well. For example, the system tries to ensure that there is at least 1 VM for every 1K queue messages, but usually the scale decisions are based on message throughput as I described earlier.
I think @Chris Gillum put it well; it's hard for us to push the limits of the server to the point that things will start to scale.
Some other options available are:
Use Durable Functions and scale with threading:
https://learn.microsoft.com/en-us/azure/azure-functions/durable-functions-cloud-backup
Another method could be to use Event Hubs, which are designed for massive scale. Instead of queues, have Function #1 publish an event, and have Function #2 subscribe to it with an Event Hub trigger. Adding Stream Analytics could also be an option to expand capabilities further if needed.

Using Celery with multiple workers in different pods

What I'm trying to do is use Celery with Kubernetes. I'm using Redis as the message broker in a separate pod, and I have one pod for each Celery queue.
Imagine I have 3 queues; I would then have 3 different pods (i.e. workers) that can accept and handle the requests.
Everything is working fine so far, but my question is: what would happen if I clone the pod of one of the queues, so that two pods serve a single queue?
I think the client (i.e. Django) creates a new message via Redis to send to the worker and start the job, but it's not clear to me what would happen with two pods listening to the same queue. Does the first pod accept the request, start the job, and prevent the other pod from accepting it?
(I tried to search a bit on the documentation of Celery to see if I can find any clues but I couldn't. That's why I'm asking this question)
I guess you are using the basic task type, which employs a 'direct' queue type rather than 'fanout' or 'topic' queues; the latter two behave quite differently and will not be discussed here.
When using Redis as the broker transport, celery/kombu use a Redis list object as the storage for the queue (source), with the LPUSH command to publish messages and BRPOP to consume them.
In short, BRPOP (doc) blocks the connection when there are no elements to pop from the given lists; if a list is not empty, an element is popped from its tail. The operation is guaranteed to be atomic: no two connections can get the same element.
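The contract can be illustrated with a small stdlib simulation. This is not a Redis client; queue.Queue stands in for the Redis list because it gives the same guarantee as BRPOP, a blocking pop where no two consumers can receive the same item:

```python
import queue
import threading

# Stand-in for the Redis list backing the Celery queue.
tasks = queue.Queue()
received = {}  # worker name -> messages it consumed
lock = threading.Lock()

def worker(name):
    """Simulates one Celery worker pod blocked on the shared queue."""
    while True:
        msg = tasks.get()          # blocks, like BRPOP
        if msg is None:            # sentinel: shut down
            break
        with lock:
            received.setdefault(name, []).append(msg)

# Two "pods" listening on the same queue.
threads = [threading.Thread(target=worker, args=(f"pod-{i}",)) for i in range(2)]
for t in threads:
    t.start()

for i in range(100):               # the LPUSH side: publish 100 task messages
    tasks.put(f"task-{i}")
for _ in threads:                  # one shutdown sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()

total = sum(len(msgs) for msgs in received.values())
print(total)  # -> 100: every message went to exactly one worker, none twice
```

The work is split between the two consumers in some interleaving, but the total is always 100 with no duplicates, which is exactly why cloning the pod is safe.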
Celery leverages this feature to guarantee at-least-once message delivery; the use of acknowledgments doesn't affect this guarantee.
In your case, there are multiple Celery workers across multiple pods, but all of them connect to the same Redis server, all of them block on the same key, and all of them try to pop an element from the same list object. When a new message arrives, one and only one worker will get it.
A task message is not removed from the queue until that message has been acknowledged by a worker. A worker can reserve many messages in advance and even if the worker is killed – by power failure or some other reason – the message will be redelivered to another worker.
More: http://docs.celeryproject.org/en/latest/userguide/tasks.html
The two workers (pods) will receive tasks and complete them independently. It's like having a single pod, but processing tasks at twice the speed.

Communication protocol

I'm developing a distributed system that consists of master and worker servers. There should be 2 kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with an appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All OK, but throw away c - we don't need it anymore"
The participants exchange these messages at interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communication for both message types (for the heartbeat, we would include one more message from the master to start the exchange). But let's assume it is expensive to do all the heartbeat housekeeping on the master side for N workers, and that we don't want to waste resources keeping several TCP connections to worker servers, so we have just one.
Is there any solution under these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that their accountabilibuddy is not responding anymore, it can alert the master.
Same thing applies to the list of jobs currently running; who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what things their accountabilibuddy is supposed to be doing) only works so far; if a and b crash, and a was the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the Raft consensus algorithm. It's not too hard to understand, but it still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never properly work with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
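To make the O(n) claim concrete, the master-side bookkeeping is essentially one timestamp per worker plus a periodic sweep. A minimal sketch (worker names and the timeout are illustrative, not from the thread):

```python
import time

class HeartbeatMonitor:
    """Master-side liveness bookkeeping: O(n) state, one timestamp
    per worker, and an O(n) sweep to find workers that went silent."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence before a worker is dead
        self.last_seen = {}       # worker id -> time of last heartbeat

    def heartbeat(self, worker_id, now=None):
        """Record a heartbeat; `now` is injectable for testing."""
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def dead_workers(self, now=None):
        """Return workers whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return [w for w, t in self.last_seen.items() if now - t > self.timeout]

mon = HeartbeatMonitor(timeout=1.0)
mon.heartbeat("worker-a", now=0.0)
mon.heartbeat("worker-b", now=0.5)
print(mon.dead_workers(now=1.2))  # -> ['worker-a']
print(mon.dead_workers(now=2.0))  # -> ['worker-a', 'worker-b']
```

Whether this dictionary lives on the master or is sharded across workers, someone has to hold those n entries, which is the point made above.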
As Filip Haglund said, read the Raft paper; you should also implement it yourself. In a nutshell, here is what you need to extract from it with regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft does its heartbeat sending on the master's end. It is not very expensive network-wise, and you don't need to keep the connections open. Every 200 ms to a second you send a heartbeat; if a worker doesn't reply, the master tells the others to remove member x from the list.
As for what to do if the master dies: basically, you need preset candidate nodes. If a candidate hasn't received a heartbeat within its timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
If you want to join an existing cluster, it's basically the same as above; if the node you contact is not the leader, it responds "not leader" along with the leader's address.
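The two rules above, a randomized election timeout and a strict-majority vote count, can be sketched in a few lines (Raft's actual RPCs also carry terms and log indices, which are omitted here; the constants follow the paper's suggested 150-300 ms range):

```python
import random

def majority(votes_for, cluster_size):
    """A candidate wins with the 'slightest majority': strictly more
    than half the cluster, counting its own vote."""
    return votes_for > cluster_size // 2

def election_timeout(base_ms=150, spread_ms=150):
    """Randomized timeout so that two candidates rarely time out
    together and split the vote."""
    return base_ms + random.uniform(0, spread_ms)

# 5-node cluster: 3 votes (self + 2 peers) wins, 2 votes does not.
print(majority(3, 5))  # -> True
print(majority(2, 5))  # -> False
```

A node that wins the vote starts sending heartbeats itself; a node that loses simply waits for the next heartbeat or its next timeout.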