Kafka connector task statuses differ when queried against different Kafka Connect nodes in a clustered environment - apache-kafka

We have a 3-node Kafka Connect cluster running version 5.5.4 in distributed mode, and we are observing a strange issue with a connector's task status.
The REST calls to node 1 and node 2 return different results.
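The status is queried from each worker's REST API with a call along these lines (host and connector name are placeholders for the actual values):
curl http://<node-host>:8083/connectors/<connector-name>/status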
The first node returned this result:
{
  "connector": {
    "state": "RUNNING",
    "worker_id": "x.com:8083"
  },
  "name": "connector",
  "type": "source",
  "tasks": []
}
Yes, the tasks list is empty, whereas the other node returned this result:
{
  "connector": {
    "state": "RUNNING",
    "worker_id": "x.com:8083"
  },
  "name": "connector...",
  "type": "source",
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "x.com:8083"
    }
  ]
}
As mentioned in this doc https://docs.confluent.io/home/connect/userguide.html#kconnect-internal-topics, I have configured group.id, config.storage.topic, offset.storage.topic and status.storage.topic with identical values in all 3 nodes.
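For reference, the relevant part of the distributed worker configuration looks roughly like this on every node (a sketch; the group id and topic names here are illustrative rather than our actual values):
group.id=connect-cluster
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-statuses
bootstrap.servers=<broker-list>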
I went through the connect-statuses-0 data directory, and the file sizes for the log, index and time index files are all identical on node 1 and node 2. I don't know what the .snapshot file is, but I see only one of them (owned by root user/group) on the first node, whereas I see two on the second node: one owned by root user/group and the other owned by our custom-created user. I'm not sure whether this has anything to do with the problem.
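For completeness, the contents of the status topic can also be inspected directly with the console consumer to compare what each worker should be seeing (a sketch; the topic name is assumed to match the status.storage.topic setting and the broker address is a placeholder):
kafka-console-consumer --bootstrap-server <broker>:9092 --topic connect-statuses --from-beginning --property print.key=true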
Please guide me in identifying the root cause of this problem. If I need to check any other configuration, please let me know.

Related

Service Fabric - 'System.Replicator' reported Warning for property 'RemoteReplicatorConnectionStatus'. Replica xyz cannot be reached

The error is
'System.Replicator' reported Warning for property 'RemoteReplicatorConnectionStatus'.
Replica 132295460844367404 cannot be reached to start the copy process. Error Code: CannotConnect, Target listen address: localhost:62352/5298ce62-a8b6-4c10-944c-ce861fb5abd9-132295460844367404;70bcec58-3f57-4a23-b787-7353d53e631d:fdd277399fb82af80e7f8a0f097d244d. Verify that ReplicatorAddress config is valid.
There are 3 replicas, and 2 of them are stuck InBuild. The error is reported as coming from the primary replica, and the replicaId it complains about belongs to one of the secondary replicas that is stuck InBuild.
Everything I find on this error relates to standalone clusters, but my cluster is Azure-generated. What are some causes of this error? It only happens for my stateful service when I deploy multiple replicas.
In the Primary replica's events it shows the following error for each of the other 2 replicas:
"Description": "The api IReplicator.BuildReplica(132295460844367404) on node _default_4 is stuck. Start Time (UTC): 2020-03-24 17:55:24.215.",
If I set the replica count to 1 the error doesn't appear, until I try to upgrade the application, at which point it creates an idle replica to swap in, gets stuck on this same error, and causes the upgrade to hang indefinitely.
The same application can be deployed to my local 5 node cluster with no errors.
I started commenting out code to see if I could get to a state where it was working, and eventually narrowed it down to the way I was overriding the replicator settings.
I was doing this:
public MyStateFulService(StatefulServiceContext context)
    : base(context, new ReliableStateManager(context, new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
    {
        MaxReplicationMessageSize = 1073741824
    })))
{ }
and changing it to this section in the service's Settings.xml:
<Section Name="ReplicatorConfig">
  <Parameter Name="ReplicatorEndpoint" Value="ReplicatorEndpoint" />
  <Parameter Name="MaxReplicationMessageSize" Value="524288000" />
  <Parameter Name="MinLogSizeInMB" Value="4096" />
</Section>
resolved the issue. I assume I was overriding the default replicator endpoint by creating a new ReliableStateManagerReplicatorSettings object.
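For completeness, when the replicator is configured through a ReplicatorConfig section like this, the ReplicatorEndpoint it references also has to exist as an endpoint resource in ServiceManifest.xml (a sketch, assuming the default endpoint name):
<Resources>
  <Endpoints>
    <Endpoint Name="ReplicatorEndpoint" />
  </Endpoints>
</Resources>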

Too many empty chk-* directories with Flink checkpointing using RocksDb as state backend

Too many empty chk-* directories exist in the location where I have set up RocksDB as the state backend.
I am using FlinkKafkaConsumer to get data from a Kafka topic, with RocksDB as the state backend, and I am just printing the messages received from Kafka.
This is how I set up checkpointing and the state backend:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(100); // checkpoint interval, in milliseconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(50); // milliseconds
env.getCheckpointConfig().setCheckpointTimeout(60); // milliseconds
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
StateBackend rdb = new RocksDBStateBackend("file:///Users/user/Documents/telemetry/flinkbackends10", true); // incremental checkpoints enabled
env.setStateBackend(rdb);
env.execute("Flink kafka");
In flink-conf.yaml I have also set this property:
state.checkpoints.num-retained: 3
I am using a simple single-node Flink cluster (started with ./start-cluster.sh). I started the job, kept it running for 1 hour, and I see too many chk-* directories created under the /Users/user/Documents/telemetry/flinkbackends10 location:
chk-10 chk-12667 chk-18263 chk-20998 chk-25790 chk-26348 chk-26408 chk-3 chk-3333 chk-38650 chk-4588 chk-8 chk-96
chk-10397 chk-13 chk-18472 chk-21754 chk-25861 chk-26351 chk-26409 chk-30592 chk-34872 chk-39405 chk-5 chk-8127 chk-97
chk-10649 chk-13172 chk-18479 chk-22259 chk-26216 chk-26357 chk-26411 chk-31097 chk-35123 chk-39656 chk-5093 chk-8379 chk-98
chk-1087 chk-14183 chk-18548 chk-22512 chk-26307 chk-26360 chk-27055 chk-31601 chk-35627 chk-4 chk-5348 chk-8883 chk-9892
chk-10902 chk-15444 chk-18576 chk-22764 chk-26315 chk-26377 chk-28064 chk-31853 chk-36382 chk-40412 chk-5687 chk-9 chk-99
chk-11153 chk-15696 chk-18978 chk-23016 chk-26317 chk-26380 chk-28491 chk-32356 chk-36885 chk-41168 chk-6 chk-9135 shared
chk-11658 chk-16201 chk-19736 chk-23521 chk-26320 chk-26396 chk-28571 chk-32607 chk-37389 chk-41666 chk-6611 chk-9388 taskowned
chk-11910 chk-17210 chk-2 chk-24277 chk-26325 chk-26405 chk-29076 chk-32859 chk-37642 chk-41667 chk-7 chk-94
chk-12162 chk-17462 chk-20746 chk-25538 chk-26337 chk-26407 chk-29581 chk-33111 chk-38398 chk-41668 chk-7116 chk-95
Out of these, only chk-41668, chk-41667 and chk-41666 have data; the rest of the directories are empty.
Is this expected behavior? How can I delete those empty directories? Is there some configuration for deleting empty directories?
Answering my own question here:
In the UI I was seeing a 'checkpoint expired before completing' error in the checkpointing section, and I found out that to resolve the error the checkpoint timeout needs to be increased.
I increased the timeout from 60 to 500 and Flink started deleting the empty chk-* directories.
env.getCheckpointConfig().setCheckpointTimeout(500);
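For context, both enableCheckpointing and setCheckpointTimeout take values in milliseconds, so the original timeout of 60 meant 60 ms, which a RocksDB checkpoint can easily exceed and which is presumably why so many checkpoints expired. A minimal sketch with less aggressive, purely illustrative values:
env.enableCheckpointing(60_000);                          // checkpoint every 60 seconds
env.getCheckpointConfig().setCheckpointTimeout(600_000);  // allow each checkpoint up to 10 minutes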

Red Hat OSP10 deploy fails on node profile tag, even though it is configured

I am trying to deploy RHOSP 10, and when I get to the "openstack overcloud deploy" phase, I get these errors:
Error: only 0 of 1 requested ironic nodes are tagged to profile control (for flavor control)
Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:control,boot_option:local
Error: only 0 of 5 requested ironic nodes are tagged to profile compute (for flavor compute)
Recommendation: tag more nodes using ironic node-update <NODE ID> replace properties/capabilities=profile:compute,boot_option:local
Not enough nodes - available: 0, requested: 6
Configuration has 3 errors, fix them before proceeding. Ignoring these errors is likely to lead to a failed deploy.
However, I configured 1 node to use the control profile and 5 to use the compute profile. For example:
[stack@rhosp-1-director ~]$ openstack baremetal node show 4e153e0a-4c7b-4ee9-afb5-9036e263949b|grep prop
| properties | {u'cpu_arch': u'x86_64', u'root_device': {u'serial': u'600508b1001c7b0731bc32edbb3a8369'}, u'cpus': u'48', u'capabilities': u'profile:control,boot_option:local', u'memory_mb': u'131072', u'local_gb': u'744'} |
[stack@rhosp-1-director ~]$ openstack baremetal node show 4989038d-de10-4365-8051-44fd42fd0ec7|grep prop
| properties | {u'cpu_arch': u'x86_64', u'root_device': {u'serial': u'600508b1001c73b9fa55f385cd1a4008'}, u'cpus': u'48', u'capabilities': u'profile:compute,boot_option:local', u'memory_mb': u'131072', u'local_gb': u'744'} |
Another thing is that the following command yields no output:
openstack overcloud profiles list
I am following the manual at https://access.redhat.com/documentation/en/red-hat-openstack-platform/10-beta/single/director-installation-and-usage/#sect-Registering_Nodes_for_the_Overcloud step by step, so I don't know what I'm doing wrong.
The problem ended up being Ironic automated cleaning: introspection never completed successfully. I'm not sure why, but disabling auto cleaning in ironic.conf right after the undercloud install, followed by a reboot (so that all Ironic services pick up this property) and then the next steps, was successful (including introspection).
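For reference, disabling automated cleaning comes down to a single option in the undercloud's ironic.conf (a sketch; verify the section and option name against your installed Ironic release):
[conductor]
automated_clean = False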

[ejabberd w/ smack]: how to successfully create a leaf node inside a pubsub collection node

A registered user created a collection node on my ejabberd server using the Smack library and the following config:
PubSubManager psMgr = new PubSubManager(conn, "pubsub.mydomain");
ConfigureForm CForm = new ConfigureForm(DataForm.Type.submit);
CForm.setAccessModel(AccessModel.open);            // anyone can access
CForm.setDeliverPayloads(true);                    // allow payloads with notifications
CForm.setNotifyDelete(true);                       // notify subscribers when the node is deleted
CForm.setPersistentItems(true);                    // save published items in storage on the server
CForm.setPresenceBasedDelivery(false);             // notify subscribers even when offline
CForm.setPublishModel(PublishModel.open);          // anyone may publish to this node
CForm.setNodeType(NodeType.collection);
CForm.setChildrenAssociationPolicy(ChildrenAssociationPolicy.all);
CForm.setChildrenMax(65536);
psMgr.createNode("/collection_node", CForm);
This collection node is created fine. Note that the children association policy is 'all'.
Now, if a different user, registered on the same server, tries to create a leaf node inside this collection node, the server returns a 'forbidden - auth' error:
ConfigureForm form = new ConfigureForm(DataForm.Type.submit);
form.setNodeType(NodeType.leaf);
form.setCollection("/collection_node");
psMgr.createNode("/collection_node/leaf_node", form);
I have these plugins enabled in my ejabberd server for the pubsub module: ["collections", "dag", "flat", "hometree", "pep"].
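For reference, in ejabberd.yml that plugin set would typically be configured roughly like this (a sketch; the exact layout depends on the ejabberd version):
modules:
  mod_pubsub:
    plugins:
      - "collections"
      - "dag"
      - "flat"
      - "hometree"
      - "pep"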
Can anyone please suggest why the leaf node creation should fail even though the collection node has granted 'all' for associating child nodes with itself?
Smack version is: 4.1.2
ejabberd version: (for some weird reason it shows) 0.0. [However, the server was installed from the source code available at https://github.com/processone/ejabberd/archive/master.zip in Nov 2015, with Erlang OTP 17.1 installed at the same time, so it should be pretty much the latest unless I screwed up something during installation.]

Zookeeper - three nodes and nothing but errors

I have three ZooKeeper nodes. All ports are open and the IP addresses are correct. Below is my config file. All nodes were booted by Chef and all have the same install and config file.
# The number of milliseconds of each tick
tickTime=3000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# Place the dataLogDir to a separate physical disc for better performance
# dataLogDir=/disk2/zookeeper
# the port at which the clients will connect
clientPort=2181
server.1=111.111.111:2888:3888
server.2=111.111.112:2888:3888
server.3=111.111.113:2888:3888
Here is the error for one of the nodes. I am rather confused about how I could get an error, since the config is rather vanilla and all three nodes are doing the same thing.
2012-07-16 05:16:57,558 - INFO [main:QuorumPeerConfig@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2012-07-16 05:16:57,567 - INFO [main:QuorumPeerConfig@310] - Defaulting to majority quorums
2012-07-16 05:16:57,572 - FATAL [main:QuorumPeerMain@83] - Invalid config, exiting abnormally
org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /etc/zookeeper/conf/zoo.cfg
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:110)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:99)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:76)
Caused by: java.lang.IllegalArgumentException: serverid replace this text with the cluster-unique zookeeper's instance id (1-255) is not a number
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:333)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:106)
... 2 more
You need to create a file named myid and put it into the ZooKeeper data directory (dataDir), one for each server. It consists of a single line containing only the text of that machine's id, so the myid file of server 1 would contain the text 1 and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
see more at http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup
server.1=111.111.111:2888:3888
server.2=111.111.112:2888:3888
server.3=111.111.113:2888:3888
are your servers and IPs. Then create a myid file on each of the nodes, with the value 1 on 111.111.111, 2 on 111.111.112 and 3 on 111.111.113, under the data directory (dataDir=/var/lib/zookeeper).
If you put the value in the myid file with quotes ("1") you will get a NumberFormatException, and you will get "Invalid config, exiting abnormally" if the myid file is created with any extension.
Therefore just create the myid file without any extension and place the integer values 1, 2, 3 on the corresponding servers, without double quotes.
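For example, a minimal way to create the files (assuming dataDir=/var/lib/zookeeper as in the config above):
echo 1 > /var/lib/zookeeper/myid   # on server 1 (111.111.111)
echo 2 > /var/lib/zookeeper/myid   # on server 2 (111.111.112)
echo 3 > /var/lib/zookeeper/myid   # on server 3 (111.111.113)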