Remote NiFi input ports are not exposed - dataflow

I'm learning Apache NiFi and working on a simple site-to-site dataflow. On one side I have a single-node NiFi instance, and on the other side a two-node NiFi cluster. The issue I'm facing is that when I connect a GetFile processor to a Remote Process Group (pointing at the two-node cluster) on the single-node instance, the connection dialog asks me to select an Input Port name from the remote cluster. However, my remote cluster's input port name is not shown in the drop-down list.
I have given the correct URL of the remote NiFi cluster. The single-node instance is supposed to talk to the remote cluster to get the port details and names, right? Then why is it not showing my input port?

In a secure setup, there are two policies that need to be created. One is a global policy that allows the Remote Process Group to ask the other NiFi for information about the node(s); this is called "retrieve site-to-site details". The other is a policy on each port that allows data to be sent to it; this is called "receive data via site-to-site".
This blog post explains how to configure secure site-to-site in more detail:
http://bryanbende.com/development/2016/08/30/apache-nifi-1.0.0-secure-site-to-site
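It is also worth confirming that an Input Port exists at the root level of the remote flow and that site-to-site input is enabled in nifi.properties on each node of the remote cluster; without that, the remote instance has nothing to advertise. A minimal sketch of the properties usually involved (the hostname and port are placeholders, not values from your setup):
# nifi.properties on each node of the remote cluster
# Hostname/port advertised for RAW (socket) site-to-site; placeholders, adjust per node
nifi.remote.input.host=node1.example.com
nifi.remote.input.socket.port=10443
# Set to match whether the cluster is secured with TLS
nifi.remote.input.secure=false
# Optionally allow site-to-site over HTTP(S) as well
nifi.remote.input.http.enabled=true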

Related

Artemis k8s cluster client connection handling

I am using the Artemis Cloud operator to deploy ActiveMQ Artemis in a k8s cluster, with the number of replicas configured as 3. Queues are expected to be created by client applications. The operator creates a headless service plus a service for each pod in the cluster setup.
Whenever a client connects to a pod, it creates a queue in that broker pod. So if clients connect to the 3 brokers at random times, three queues get created, one in each pod. When a producer then sends a message to the queue, the message only reaches the pod it is connected to; it is not present in the other broker pods (no replication of messages).
My question is: what service name should client applications use to connect to the Artemis pods while also maintaining session affinity? In other words, what should be done so that a client connects to the same broker every time it connects (and duplicate queue creation is avoided)?
What I currently use is a Kubernetes ClusterIP service I created that spreads traffic across the pods, and the queues are created via a STOMP producer.
I'm not sure which Helm chart is being used in the background, but inside the Kubernetes cluster your application should connect using the service name rather than the ClusterIP.
The service name may start with the Helm chart's name; I was referring to this chart as a reference, though yours could be different: https://github.com/deviceinsight/activemq-artemis-helm
That chart appears to create the service with ClusterIP: None (a headless service), so connecting through the existing service may not work as expected; it likely returns the IPs of all the pods.
Option 2, if the above does not work: create a new service of type ClusterIP with a different name, keeping everything else (ports and so on) the same.
In the service you will notice the AMQP, MQTT and other ports, so your app can connect to the new service on whichever port it needs, for example active-mq-service:61613.
If everything is already working and you are only looking for session affinity, add this setting to your service spec and it will start handling session affinity (see the sketch after the quoted documentation below):
sessionAffinity: ClientIP
If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select the session affinity based on the client's IP addresses by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"). You can also set the maximum session sticky time by setting service.spec.sessionAffinityConfig.clientIP.timeoutSeconds appropriately (the default value is 10800, which works out to be 3 hours).
Ref doc : https://kubernetes.io/docs/concepts/services-networking/service/
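As a concrete sketch (the service name and selector labels are placeholders and must match the labels the operator puts on your broker pods), a ClusterIP service with client-IP session affinity for the STOMP port could look like this:
apiVersion: v1
kind: Service
metadata:
  name: active-mq-service
spec:
  type: ClusterIP
  selector:
    application: artemis-broker   # placeholder: must match the broker pod labels
  ports:
    - name: stomp
      port: 61613
      targetPort: 61613
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800       # default sticky time, 3 hours
Clients would then connect to active-mq-service:61613 and, within the affinity timeout, keep being routed to the same broker pod.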

Cannot connect to kafka connect cluster running on AWS from outside EC2

I have an ECS cluster with 3 EC2 instances all sitting in private subnets. I created a task definition to run the kafka-connect image provided by Confluent with the following environment variables:
CONNECT_CONFIG_STORAGE_TOPIC=quickstart-config
CONNECT_GROUP_ID=quickstart
CONNECT_INTERNAL_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_INTERNAL_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_OFFSET_STORAGE_TOPIC=quickstart-offsets
CONNECT_PLUGIN_PATH=/usr/share/java
CONNECT_REST_ADVERTISED_HOST_NAME=localhost
CONNECT_REST_ADVERTISED_PORT=8083
CONNECT_SECURITY_PROTOCOL=SSL
CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM=
CONNECT_STATUS_STORAGE_TOPIC=quickstart-status
CONNECT_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
I have an application load balancer in front of this cluster with a listener on port 8083, and I have set up the target group to include the EC2 instances running kafka-connect, so the load balancer should forward requests to the cluster. It does, but I always get back a 502 Bad Gateway response. I can ssh into the EC2 instances and curl localhost:8083 and get a response back from kafka-connect, but from outside the EC2 instances I get no response.
To rule out networking issues between the load balancer and the cluster, I created a separate task definition running Nginx on port 80, and I'm able to hit it successfully from outside the EC2 instances through the load balancer.
I have a feeling that I have not set CONNECT_REST_ADVERTISED_HOST_NAME to the correct value. It's my understanding that this is the host clients should connect to. However, because my EC2 instances are in a private subnet, I have no idea what to set this to, which is why I've set it to localhost. I tried setting it to the load balancer's DNS name, but that doesn't work.
You need to set CONNECT_REST_ADVERTISED_HOST_NAME to the host or IP that the other Kafka Connect workers can resolve and connect to.
It's used for the internal communication between workers. If it is set to localhost and your REST request (via your load balancer) hits a worker that is not the current leader of the cluster, that worker will try to forward the request to the leader using the CONNECT_REST_ADVERTISED_HOST_NAME. But if CONNECT_REST_ADVERTISED_HOST_NAME is localhost, the worker will simply forward the request to itself, and hence things won't work.
For more details see https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/
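As a sketch only (the hostname below is a placeholder, not something taken from your environment), each worker would advertise an address that the other workers can actually resolve and reach, for example the EC2 instance's private DNS name or IP rather than localhost:
# placeholder: use each instance's/task's own resolvable private hostname or IP
CONNECT_REST_ADVERTISED_HOST_NAME=ip-10-0-1-23.ec2.internal
CONNECT_REST_ADVERTISED_PORT=8083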

How to get znode ip

I have many services connected to ZooKeeper, and I want service A to be able to get service B's IP once service B has connected to ZooKeeper. Is there any API that can do that, or do I have to use a separate config file to write down all the services' IPs?
Take a look if this solves your problem:
http://curator.apache.org/curator-x-discovery/
Zookeeper doesn't provide service discovery out of the box, but it is easy to implement it yourself.
You won't be able to get the IP addresses of other connected clients (services, in your case) straight from the ZooKeeper API. For services to be discoverable, each service has to create an ephemeral znode under a specific path, e.g. /services, and set the necessary addressing and naming information as the znode's data (IP, port, etc.). This way, you can list that path to discover the active services, or watch the /services path for any changes in your service configuration.
Since services are creating ephemeral nodes, they will automatically be removed once they are disconnected and their session expires. Of course, once you start doing something like this, you will see that there are many small details and design decisions you have to make, ergo the already mentioned Curator recipe.
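For illustration only (the paths and the host:port payload are made up), the bare-bones version of this with the ZooKeeper CLI (zkCli.sh) looks roughly like:
# one-time: create the parent path
create /services ""
# service B registers itself; the ephemeral node disappears when its session expires
create -e /services/serviceB 10.0.0.12:8080
# service A discovers and resolves its peers
ls /services
get /services/serviceB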

Getting ZooKeeper to run on Google's Compute Engine using external IPs

I have been trying to setup a ZooKeeper cluster on the Google Compute Engine and have run into some issues when using the external IPs of the machines. My cluster consists of 3 nodes on their own separate instances on GCE.
Now, when I configure each node to use the external IP of the instance they seem to be unable to communicate with each other.
zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=externalIp1:2888:3888
server.2=externalIp2:2888:3888
server.3=externalIp3:2888:3888
If I configure them with their internal IPs, however, everything works perfectly fine. My guess is that when ZooKeeper starts up, it binds itself to the internal IP of the instance regardless of the configuration. Because of this, when each node tries to look for the other two using the external IPs they were configured with, it is unable to find them.
So my question is, is there any way to make it so that ZooKeeper uses the external IP of the machine instead of the internal one? I'm relatively new to the Google Cloud Platform and to setting up hardware in general, so I'm not really sure if something like ip forwarding, firewall rules, or something else would achieve what I'm trying to do (assuming it's even possible).
According to the Zookeeper 3.4.5 docs, you need to specify the following option:
clientPortAddress
New in 3.3.0: the address (ipv4, ipv6 or hostname) to listen for client connections; that is, the address that clients attempt to connect to. This is optional, by default we bind in such a way that any connection to the clientPort for any address/interface/nic on the server will be accepted.
Although it appears that by default, it will bind to all available IPs on the server, so theoretically, it should have worked as you have set it up.
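One workaround commonly used when the servers sit behind NAT (which is effectively the case with GCE external IPs, an assumption about your setup rather than something from the docs quoted above) is for each server to replace its own entry with 0.0.0.0, so it binds locally while still pointing at the peers' external IPs. For example, on server 1:
# zoo.cfg on server 1; servers 2 and 3 change their own line in the same way
server.1=0.0.0.0:2888:3888
server.2=externalIp2:2888:3888
server.3=externalIp3:2888:3888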
Important note: if Zookeeper instances talk to each other using external IPs rather than internal IPs, you will be charged for data egress whereas if all communication is over internal network (using internal IPs) within the same zone, you won't.

Windows Failover Cluster not online during creation of SQL Always On Availability Group

I've been following this tutorial to create an Azure SQL AlwaysOn Availability Group using Powershell:
Tutorial: AlwaysOn Availability Groups in Windows Azure (PowerShell)
When I get to the command that invokes the CreateAzureFailoverCluster powershell script, I check the state of the failover cluster. In Failover Cluster Manager, it is shown as "the cluster network name is not online"
When I look at the Cluster Events, I see this:
Cluster network name resource 'Cluster Name' cannot be brought online. Ensure that the network adapters for dependent IP address resources have access to at least one DNS server. Alternatively, enable NetBIOS for dependent IP addresses.
Each of the 3 servers in the cluster has access to the DC via ping. All of the preceding setup steps execute correctly. The servers are all on the 10.10.2.x/24 IP range except the DC, which is on 10.10.0.0/16 (with IP of 10.10.0.4)
All of the settings have been validated by prior execution of the tutorial on a different Azure subscription to create a failover cluster that works fine.
Cluster validation reveals this warning:
The "Cluster Group" does not contain an Cluster IP Address resource. This is a required resource for the group. It may be difficult to manage the cluster with this resource missing
(sic)
How do I add a Cluster IP Address resource?
There was nothing wrong with the configuration or the steps taken.
It took the cluster 3 hours to come online.
Attempting to bring the cluster online manually failed for at least 20 minutes after creating the cluster.
The Windows Event logs on all 4 servers showed nothing to say when the cluster moved into the online state.
It seems the correct solution was to work on something else until the cluster started under its retry policy.
Did you set up a fixed IP address in the cluster, using the cluster manager? There's a bug where DHCP will assign the cluster the IP address of one of the SQL Server instances. I just assigned a high enough number (x.x.x.171, I think), and the problem went away.
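If you do end up needing to add the missing Cluster IP Address resource yourself, a rough PowerShell sketch (the group name, resource names, and the 10.10.2.171 address are assumptions based on the subnets above, not values from the tutorial) would be:
# Add a static IP Address resource to the core cluster group and make the cluster name depend on it
Add-ClusterResource -Name "Cluster IP Address" -Group "Cluster Group" -ResourceType "IP Address"
Get-ClusterResource "Cluster IP Address" |
    Set-ClusterParameter -Multiple @{ Address = "10.10.2.171"; SubnetMask = "255.255.255.0"; EnableDhcp = 0 }
Add-ClusterResourceDependency -Resource "Cluster Name" -Provider "Cluster IP Address"
Start-ClusterResource -Name "Cluster Name"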