Celery worker not reading pyrogram session file - celery

I'm trying to execute a pyrogram function via a celery task (scheduling etc.).
The function works when run from the shell:
from app_name.users.tasks import establish_session
establish_session()
Sending it to celery via establish_session.delay() is where the problem arises.
The exact same function, when executed via celery, fails to read the required session file.
I've confirmed that the session file is visible in both cases and passes the os.R_OK, os.W_OK and os.F_OK checks.
users.tasks
from celery import shared_task

@shared_task
def establish_session():
    from utils.telegram import get_new_session
    user_bot = get_new_session()
    print(user_bot)
utils.telegram
import os

from pyrogram import Client

def get_new_session():
    cwd = os.getcwd()
    print(cwd)
    print(os.access('user.session', os.R_OK))  # Check for read access
    print(os.access('user.session', os.W_OK))  # Check for write access
    print(os.access('user.session', os.X_OK))  # Check for execution access
    print(os.access('user.session', os.F_OK))  # Check for existence of file
    # ID and HASH are the Telegram API credentials, defined elsewhere
    user_bot = Client("user", api_id=ID, api_hash=HASH)
    user_bot.start()
    user_bot.stop()
    return user_bot
Difference in outputs:
establish_session()
INFO 2021-01-23 18:07:21,379 connection Connecting...
INFO 2021-01-23 18:07:21,382 connection Connected! Production DC5 - IPv4 - TCPAbridgedO
INFO 2021-01-23 18:07:21,383 session NetworkTask started
INFO 2021-01-23 18:07:21,435 msg_id Time synced: 2021-01-23 10:07:21.439058 UTC
INFO 2021-01-23 18:07:21,439 session NextSaltTask started
INFO 2021-01-23 18:07:21,439 session Next salt in 33m 13s (at 2021-01-23 18:40:35)
INFO 2021-01-23 18:07:21,524 session Session initialized: Layer 122
INFO 2021-01-23 18:07:21,524 session Device: CPython 3.8.6 - Pyrogram 1.1.10
INFO 2021-01-23 18:07:21,524 session System: Linux 5.8.0-33-generic (EN)
INFO 2021-01-23 18:07:21,524 session Session started
INFO 2021-01-23 18:07:21,540 session PingTask started
INFO 2021-01-23 18:07:21,620 dispatcher Started 6 HandlerTasks
INFO 2021-01-23 18:07:21,632 syncer Synced "user" in 11.2832 ms
INFO 2021-01-23 18:07:21,639 syncer Synced "user" in 7.18236 ms
INFO 2021-01-23 18:07:21,640 dispatcher Stopped 6 HandlerTasks
INFO 2021-01-23 18:07:21,640 session PingTask stopped
INFO 2021-01-23 18:07:21,640 session NextSaltTask stopped
INFO 2021-01-23 18:07:21,640 connection Disconnected
INFO 2021-01-23 18:07:21,641 session NetworkTask stopped
INFO 2021-01-23 18:07:21,641 session Session stopped
vs
establish_session.delay()
[2021-01-23 18:07:35,832: INFO/ForkPoolWorker-2] Start creating a new auth key on DC2
[2021-01-23 18:07:35,832: INFO/ForkPoolWorker-2] Connecting...
[2021-01-23 18:07:36,105: INFO/ForkPoolWorker-2] Connected! Production DC2 - IPv4 - TCPAbridgedO
[2021-01-23 18:07:37,592: INFO/ForkPoolWorker-2] Done auth key exchange:
[2021-01-23 18:07:37,592: INFO/ForkPoolWorker-2] Disconnected
[2021-01-23 18:07:37,605: WARNING/ForkPoolWorker-2] Pyrogram v1.1.10, Copyright (C) 2017-2021 Dan <https://github.com/delivrance>
[2021-01-23 18:07:37,605: WARNING/ForkPoolWorker-2] Licensed under the terms of the GNU Lesser General Public License v3 or later (LGPLv3+)
[2021-01-23 18:07:37,605: INFO/ForkPoolWorker-2] Connecting...
[2021-01-23 18:07:37,875: INFO/ForkPoolWorker-2] Connected! Production DC2 - IPv4 - TCPAbridgedO
[2021-01-23 18:07:37,875: INFO/ForkPoolWorker-2] NetworkTask started
[2021-01-23 18:07:38,459: INFO/ForkPoolWorker-2] Time synced: 2021-01-23 10:07:38.353224 UTC
[2021-01-23 18:07:38,732: INFO/ForkPoolWorker-2] NextSaltTask started
[2021-01-23 18:07:38,732: INFO/ForkPoolWorker-2] Next salt in 44m 58s (at 2021-01-23 18:52:37)
[2021-01-23 18:07:39,096: INFO/ForkPoolWorker-2] Session initialized: Layer 122
[2021-01-23 18:07:39,096: INFO/ForkPoolWorker-2] Device: CPython 3.8.6 - Pyrogram 1.1.10
[2021-01-23 18:07:39,096: INFO/ForkPoolWorker-2] System: Linux 5.8.0-33-generic (EN)
[2021-01-23 18:07:39,096: INFO/ForkPoolWorker-2] Session started
[2021-01-23 18:07:39,099: WARNING/ForkPoolWorker-2] Enter phone number or bot token:
[2021-01-23 18:07:39,099: INFO/ForkPoolWorker-2] PingTask started
[2021-01-23 18:07:39,100: INFO/ForkPoolWorker-2] PingTask stopped
[2021-01-23 18:07:39,100: INFO/ForkPoolWorker-2] NextSaltTask stopped
[2021-01-23 18:07:39,100: INFO/ForkPoolWorker-2] Disconnected
[2021-01-23 18:07:39,101: INFO/ForkPoolWorker-2] NetworkTask stopped
[2021-01-23 18:07:39,101: INFO/ForkPoolWorker-2] Session stopped
Any assistance is greatly appreciated!

I did a lot of work to get pyrogram working under celery. It's not ideal, but it works for my case. Maybe it can help you too.
I'm using the latest versions of pyrogram (1.3.5) and celery (5.2.3).
import threading

from celery import Celery
from pyrogram import Client, idle

# First create a client; ":memory:" keeps the session in memory instead of a file
tg_client = Client(":memory:", api_id=123, api_hash="abc")

# Create the celery app (BROKER is your broker URL, defined elsewhere)
app = Celery('tasks', broker=BROKER)

@app.task
def some_task():
    print(tg_client.get_me())

# Define celery startup
def run_celery():
    # pool must be threads
    argv = [
        "-A", "tasks", 'worker', '--loglevel=info',
        "--pool=threads"]
    app.worker_main(argv)

if __name__ == '__main__':
    tg_client.start()  # <-- I think you can also put it in the first line of `run_celery`
    threading.Thread(target=run_celery, daemon=True).start()
    idle()
    tg_client.stop()
Key points are:
The celery worker must be started in a thread other than the main thread: pyrogram is an async library that relies on the main thread, and the celery worker would otherwise block it.
The celery pool must be threads or solo.
Besides that, you can also use with inside a task:
@app.task
def some_task():
    with tg_client:
        print(tg_client.get_me())
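Separately from the threading setup, here is a small diagnostic sketch for the symptom in the question (a new auth key being created under celery). It rests on an assumption that is worth verifying for your version: as far as I recall, Pyrogram 1.x derives its default working directory from the launching script (sys.argv[0]) rather than from os.getcwd(), so a worker started through the celery executable may look for user.session in a different folder than the shell does. The helper name session_candidates and its parameters are made up for illustration and use only the standard library.

```python
import os
import sys
from pathlib import Path


def session_candidates(name="user", script=None, cwd=None):
    """Report where '<name>.session' would be found for two lookup bases.

    'cwd' is the current working directory; 'script_dir' is the folder of
    the launching script (assumed here to be Pyrogram's default workdir).
    """
    script = Path(script if script is not None else sys.argv[0])
    cwd = Path(cwd if cwd is not None else os.getcwd())
    report = {}
    for label, folder in (("cwd", cwd), ("script_dir", script.parent)):
        path = (folder / f"{name}.session").resolve()
        report[label] = {
            "path": str(path),
            "exists": path.is_file(),
            "readable": os.access(path, os.R_OK),
            "writable": os.access(path, os.W_OK),
        }
    return report


if __name__ == "__main__":
    # Run this both from the shell and from inside the celery task and compare.
    for label, info in session_candidates().items():
        print(label, info)
```

If the two locations differ between the shell and the worker, passing an absolute directory through the Client's workdir parameter, for example Client("user", api_id=ID, api_hash=HASH, workdir="/absolute/path"), should make both runs use the same session file.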
Some references:
https://github.com/pyrogram/pyrogram/issues/480
https://github.com/tgbot-collection/ytdlbot/blob/master/ytdlbot/tasks.py

Related

Creating topics in SASL/GSSAPI (Kerberos) based Kafka Cluster

We have a SASL/GSSAPI (Kerberos) based authentication scheme in our Kafka cluster. Brokers are configured to authenticate with Zookeeper and each other. We added a principal to the "Super Users" list on all the brokers so that we can create topics using that principal, however, topic creation is failing, seemingly because of lack of privileges:
[2019-09-11 02:16:30,905] INFO Starting ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[2019-09-11 02:16:30,912] INFO Waiting for keeper state SaslAuthenticated (org.I0Itec.zkclient.ZkClient)
[2019-09-11 02:16:31,157] INFO Client successfully logged in. (org.apache.zookeeper.Login)
[2019-09-11 02:16:31,161] INFO Client will use GSSAPI as SASL mechanism. (org.apache.zookeeper.client.ZooKeeperSaslClient)
[2019-09-11 02:16:31,164] INFO Opening socket connection to server broker101.prod/13.14.15.16:2181. Will attempt to SASL-authenticate using Login Context section 'Client' (org.apache.zookeeper.ClientCnxn)
[2019-09-11 02:16:31,177] INFO Socket connection established to broker101.prod/13.14.15.16:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2019-09-11 02:16:31,179] INFO TGT refresh thread started. (org.apache.zookeeper.Login)
[2019-09-11 02:16:31,193] INFO TGT valid starting at: Tue Aug 20 02:16:31 UTC 2019 (org.apache.zookeeper.Login)
[2019-09-11 02:16:31,194] INFO TGT expires: Wed Aug 21 02:16:31 UTC 2019 (org.apache.zookeeper.Login)
[2019-09-11 02:16:31,194] INFO TGT refresh sleeping until: Tue Aug 20 21:34:57 UTC 2019 (org.apache.zookeeper.Login)
[2019-09-11 02:16:31,203] INFO Session establishment complete on server broker101.prod/13.14.15.16:2181, sessionid = 0x16c60b863b00035, negotiated timeout = 30000 (org.apache.zookeeper.ClientCnxn)
[2019-09-11 02:16:31,204] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2019-09-11 02:16:31,214] ERROR An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. (org.apache.zookeeper.client.ZooKeeperSaslClient)
[2019-09-11 02:16:31,214] ERROR SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
[2019-09-11 02:16:31,215] INFO zookeeper state changed (AuthFailed) (org.I0Itec.zkclient.ZkClient)
[2019-09-11 02:16:31,215] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
Exception in thread "main" org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:157)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:131)
at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:103)
at kafka.utils.ZkUtils$.apply(ZkUtils.scala:85)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:58)
at kafka.admin.TopicCommand.main(TopicCommand.scala)
Is it even possible to create topics with a principal other than the one the brokers use to authenticate with Zookeeper? If yes, how?
We can successfully create topics using the principal that the brokers use to authenticate with Zookeeper. We assumed that a Super User can do anything on the cluster, including creating new topics. Is that assumption incorrect?

JobManager doesn't automatically redirect all requests to the remaining / running TaskManager

Problem Description
Two computers (203 and 204).
I created a standalone-mode HA Flink v1.6.1 cluster.
Each computer runs both a JobManager and a TaskManager (2 task slots).
After I start a job (the example SocketWindowWordCount.jar: ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000) on the JobManager node, I kill the working TaskManager instance.
In the Web Dashboard I can see the job being cancelled and then failing. (Web Dashboard image)
flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab
full logs
log-standalonesession-203
log-taskexecutor-203
log-standalonesession-204
exception
After killing the working TM, I get an exception like this:
2018-12-28 11:04:27,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.
submit job
Running ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 produces JM logs like this:
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124
kill TM
The body is limited to 30000 characters, so please read these JM logs from when the TM was killed.
The logs indicate that your RestartStrategy has depleted its restart attempts or that no RestartStrategy has been configured. Please check whether you specified a RestartStrategy in your program via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) or in flink-conf.yaml via restart-strategy: fixed-delay. If you want to learn more about Flink's restart strategies check out the documentation.
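As a sketch of the flink-conf.yaml variant, assuming the fixed-delay strategy is the one you want: the attempt count and delay below are illustrative values, not recommendations, and should be adapted to your cluster.

```yaml
# flink-conf.yaml: cluster-wide default restart strategy (illustrative values)
restart-strategy: fixed-delay
# Number of restart attempts before the job is declared failed
restart-strategy.fixed-delay.attempts: 10
# Pause between two consecutive restart attempts
restart-strategy.fixed-delay.delay: 10 s
```

A strategy set in the program via env.setRestartStrategy takes precedence over this cluster-wide default.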

Confluent-Kafka "Consumer instance not found" error even-if consumer instance is not timedout

I'm seeing a "Consumer instance not found" error at consumer registration time, even though the consumer instance has not timed out. I'm using the Confluent APIs.
These are the steps I followed for this negative test:
Run a script that registers consumers.
Kafka topology: 3 ZK instances (2 of which are dummy values) and a 1-node cluster (a single instance of rest-proxy and one broker node).
While the registration script is in progress, cancel it and re-run it.
For the last registered consumer, an "Instance not found" error is returned. But a few ms later in the logs, 200 OK is listed for that consumer's registration request [consumer name in the logs shared below: CGStress_TEST111111111111111_6].
[2018-08-28 09:05:48,411] INFO http://localhost:8082/v1/consumer/CGStress_TEST111111111111111_5 Service: TransacationId:5 EntityId:ed_1 Authorized:true Allowed:true (io.confluent.kafkarest.resources.SecurityRestrictions)
{"X-Nssvc-serviceid":null,"Type":"API","X-Nssvc-customerid":null,"Client-IP":"127.0.0.1","Severity":"INFO","X-Cws-Transactionid":"5","message":{"request":{"content-length":81,"method":"POST","time":"2018-08-28 09:05:48.409","uri":"mr/v1/consumer/CGStress_TEST111111111111111_5","entity-id":"ed_1","user-agent":"python-requests/2.11.1"},"response":{"status_code":200,"time":"2018-08-28 09:05:48.412"}}}
[2018-08-28 09:05:48,412] INFO 127.0.0.1 - - [28/Aug/2018:09:05:48 +0000] "POST /mr/v1/consumer/CGStress_TEST111111111111111_4 HTTP/1.1" 200 205 19 (io.confluent.rest-utils.requests)
[2018-08-28 09:05:48,420] INFO http://localhost:8082/mr/v1/consumer/CGStress_TEST111111111111111_6 Service: TransacationId:6 EntityId:ed_1 Authorized:true Allowed:true (io.confluent.kafkarest.resources.SecurityRestrictions)
{"X-Nssvc-serviceid":null,"Type":"API","X-Nssvc-customerid":null,"Client-IP":"127.0.0.1","Severity":"INFO","X-Cws-Transactionid":"6","message":{"request":{"content-length":81,"method":"POST","time":"2018-08-28 09:05:48.419","uri":"mr/v1/consumer/CGStress_TEST111111111111111_6","entity-id":"ed_1","user-agent":"python-requests/2.11.1"},"response":{"status_code":404,"error_response":{"message":"Consumer instance not found.","error":40403},"time":"2018-08-28 09:05:48.421"}}}
[2018-08-28 09:05:48,423] INFO 127.0.0.1 - - [28/Aug/2018:09:05:48 +0000] "POST /mr/v1/consumer/CGStress_TEST111111111111111_5 HTTP/1.1" 200 205 15 (io.confluent.rest-utils.requests)
[2018-08-28 09:05:48,431] INFO 127.0.0.1 - - [28/Aug/2018:09:05:48 +0000] "POST /mr/v1/consumer/CGStress_TEST111111111111111_6 HTTP/1.1" 404 61 13 (io.confluent.rest-utils.requests)
[2018-08-28 09:05:49,299] WARN Client session timed out, have not heard from server in 1501ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2018-08-28 09:05:49,300] INFO Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-08-28 09:05:49,400] INFO Opening socket connection to server localhost/0:0:0:0:0:0:0:1:32181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-08-28 09:05:49,400] INFO Socket connection established to localhost/0:0:0:0:0:0:0:1:32181, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-08-28 09:05:49,403] INFO Session establishment complete on server localhost/0:0:0:0:0:0:0:1:32181, sessionid = 0x1657f54045b00f3, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-08-28 09:05:49,403] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2018-08-28 09:05:49,404] INFO [CGStress_TEST111111111111111_6_UbuntuNTP-1535447146187-7b4d0350], starting auto committer every 60000 ms (kafka.consumer.ZookeeperConsumerConnector)
{"X-Nssvc-serviceid":null,"Type":"API","X-Nssvc-customerid":null,"Client-IP":"127.0.0.1","Severity":"INFO","X-Cws-Transactionid":"6","message":{"request":{"content-length":81,"method":"POST","time":"2018-08-28 09:05:46.172","uri":"mr/v1/consumer/CGStress_TEST111111111111111_6","entity-id":"ed_1","user-agent":"python-requests/2.11.1"},"response":{"status_code":200,"time":"2018-08-28 09:05:49.405"}}}
[2018-08-28 09:05:49,409] INFO 127.0.0.1 - - [28/Aug/2018:09:05:46 +0000] "POST /mr/v1/consumer/CGStress_TEST111111111111111_6 HTTP/1.1" 200 124 3239 (io.confluent.rest-utils.requests)
root@UbuntuNTP:~/CloudServices/MsgRelay#
Is this related to the dummy ZK instances that were added?

ipyparallel displaying "registration: purging stalled registration"

I am trying to use the ipyparallel library to run an ipcontroller and ipengine on different machines.
My setup is as follows:
Remote machine:
Windows Server 2012 R2 x64, running an ipcontroller, listening on port 5900 and ip=0.0.0.0.
Local machine:
Windows 10 x64, running an ipengine, listening on the remote machine's ip and port 5900.
Controller start command:
ipcontroller --ip=0.0.0.0 --port=5900 --reuse --log-to-file=True
Engine start command:
ipengine --file=/c/Users/User/ipcontroller-engine.json --timeout=10 --log-to-file=True
I've changed the interface field in ipcontroller-engine.json from "tcp://127.0.0.1" to "tcp://" for ipengine.
On startup, here is a snapshot of the ipcontroller log:
2016-10-10 01:14:00.651 [IPControllerApp] Hub listening on tcp://0.0.0.0:5900 for registration.
2016-10-10 01:14:00.677 [IPControllerApp] Hub using DB backend: 'DictDB'
2016-10-10 01:14:00.956 [IPControllerApp] hub::created hub
2016-10-10 01:14:00.957 [IPControllerApp] task::using Python leastload Task scheduler
2016-10-10 01:14:00.959 [IPControllerApp] Heartmonitor started
2016-10-10 01:14:00.967 [IPControllerApp] Creating pid file: C:\Users\Administrator\.ipython\profile_default\pid\ipcontroller.pid
2016-10-10 01:14:02.102 [IPControllerApp] client::client b'\x00\x80\x00\x00)' requested 'connection_request'
2016-10-10 01:14:02.102 [IPControllerApp] client::client [b'\x00\x80\x00\x00)'] connected
2016-10-10 01:14:47.895 [IPControllerApp] client::client b'82f5efed-52eb-46f2-8c92-e713aee8a363' requested 'registration_request'
2016-10-10 01:15:05.437 [IPControllerApp] client::client b'efe6919d-98ac-4544-a6b8-9d748f28697d' requested 'registration_request'
2016-10-10 01:15:17.899 [IPControllerApp] registration::purging stalled registration: 1
And the ipengine log:
2016-10-10 13:44:21.037 [IPEngineApp] Registering with controller at tcp://172.17.3.14:5900
2016-10-10 13:44:21.508 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2016-10-10 13:44:21.522 [IPEngineApp] Completed registration with id 1
2016-10-10 13:44:27.529 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2016-10-10 13:44:30.539 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
...
2016-10-10 13:46:52.009 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (49 time(s) in a row).
2016-10-10 13:46:55.028 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (50 time(s) in a row).
2016-10-10 13:46:55.028 [IPEngineApp] CRITICAL | Maximum number of heartbeats misses reached (50 times 3010 ms), shutting down.
(There is a 12.5 hour time difference between the local machine and the remote VM)
Any idea why this may happen?
If you are using --reuse, make sure to remove the files if you change settings. It's possible that it doesn't behave well when --reuse is given and you change things like --ip, as the connection file may be overriding your command-line arguments.
When setting --ip=0.0.0.0, it may be useful to also set --location=a.b.c.d where a.b.c.d is an ip address of the controller that you know is accessible to the engines. Changing the
If registration works and subsequent connections don't, this may be due to a firewall only opening one port, e.g. 5900. The machine running the controller needs to have all the ports listed in the connection file open. You can specify these to be a port-range by manually entering port numbers in the connection files.
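To check the firewall hypothesis from the engine's machine, a rough sketch along these lines may help. It assumes that every integer-valued entry in the connection file is a port number (I have not confirmed this against the file format), and the function name check_ports is made up for illustration. A plain TCP connect only shows that a port is reachable, not that the controller behind it is healthy.

```python
import json
import socket


def check_ports(connection_file, host, timeout=3.0):
    """Try a TCP connect to every integer-valued entry in the connection file.

    Returns a dict mapping the entry name to True (reachable) or False.
    """
    with open(connection_file) as f:
        info = json.load(f)
    results = {}
    for key, value in info.items():
        # Assumption: integer values in the file are port numbers.
        is_port = isinstance(value, int) and not isinstance(value, bool)
        if is_port and 0 < value < 65536:
            try:
                with socket.create_connection((host, value), timeout=timeout):
                    results[key] = True
            except OSError:
                results[key] = False
    return results
```

Called as check_ports("ipcontroller-engine.json", "a.b.c.d") with the controller's address, any entry reported as False points at a port that the firewall or the network is not letting through.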

OrientDB & .Net driver: Unable to read data from the transport connection

I'm getting an error while reading the network stream from a successful socket connection. Please see the debug log from OrientDB:
2016-04-08 18:08:51:590 WARNI Not enough physical memory available for DISKCACHE: 1,977MB (heap=494MB). Set lower Maximum Heap (-Xmx setting on JVM) and restart OrientDB. Now running with DISKCACHE=256MB [orientechnologies]
2016-04-08 18:08:51:606 INFO OrientDB config DISKCACHE=-566MB (heap=494MB os=1,977MB disk=16,656MB) [orientechnologies]
2016-04-08 18:08:51:809 INFO Loading configuration from: C:/inetpub/wwwroot/orientdb-2.1.5/config/orientdb-server-config.xml... [OServerConfigurationLoaderXml]
2016-04-08 18:08:52:292 INFO OrientDB Server v2.1.5 (build 2.1.x#r${buildNumber}; 2015-10-29 16:54:25+0000) is starting up... [OServer]
2016-04-08 18:08:52:370 INFO Databases directory: C:\inetpub\wwwroot\orientdb-2.1.5\databases [OServer]
2016-04-08 18:08:52:495 INFO Listening binary connections on 127.0.0.1:2424 (protocol v.32, socket=default) [OServerNetworkListener]
2016-04-08 18:08:52:511 INFO Listening http connections on 127.0.0.1:2480 (protocol v.10, socket=default) [OServerNetworkListener]
2016-04-08 18:08:52:573 INFO Installing dynamic plugin 'studio-2.1.zip'... [OServerPluginManager]
2016-04-08 18:08:52:838 INFO Installing GREMLIN language v.2.6.0 - graph.pool.max=50 [OGraphServerHandler]
2016-04-08 18:08:52:838 INFO [OVariableParser.resolveVariables] Error on resolving property: distributed [orientechnologies]
2016-04-08 18:08:52:854 INFO Installing Script interpreter. WARN: authenticated clients can execute any kind of code into the server by using the following allowed languages: [sql] [OServerSideScriptInterpreter]
2016-04-08 18:08:52:854 INFO OrientDB Server v2.1.5 (build 2.1.x#r${buildNumber}; 2015-10-29 16:54:25+0000) is active. [OServer]
2016-04-08 18:08:57:986 INFO /127.0.0.1:49243 - Connected [OChannelBinaryServer]
2016-04-08 18:08:58:002 INFO /127.0.0.1:49243 - Writing short (2 bytes): 32 [OChannelBinaryServer]
2016-04-08 18:08:58:002 INFO /127.0.0.1:49243 - Flush [OChannelBinaryServer]
2016-04-08 18:08:58:002 INFO /127.0.0.1:49243 - Reading byte (1 byte)... [OChannelBinaryServer]
I'm using the OrientDB .NET binary (C# driver) on Windows Vista. This was working fine until recently, and I'm not sure what broke it.
Resetting TCP/IP using the NetShell utility did not help.
Any help is highly appreciated.
The problem was the AVG anti-virus program, which was blocking the socket. Adding an exception for localhost in the program fixed the problem.