FIX 5.0 (SP2) very occasional timeout on logins - fix-protocol

Sometimes I get very weird login issues: I keep getting stuck in a disconnect/login loop. It could be due to server load, since I don't have a dedicated server for the DB or FIX and all services run on one machine. But when FIX doesn't work, it is always a login issue, and I suspect there is a timeout: if the client cannot log in to the FIX server within a certain time (a fraction of a second), I get the dreaded LOGOUT, FIX attempts to log in again, and the loop goes on forever until I actually reboot the computer, or stop all services and applications and run my FIX client first. Here is the log:
2022-07-08 18:00:55.055 +08:00 [INF] OMS QuickFix Service Started
2022-07-08 18:00:55.079 +08:00 [INF] Server Started - Accepting connections [0.0.0.0:7000]
2022-07-08 18:00:55.082 +08:00 [INF] OMS Fix Router Service Started: Version [1.08]
2022-07-08 18:00:55.083 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:00:55.139 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:00:57.071 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:00:57.076 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:00:59.080 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:00:59.085 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:01:01.091 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:01:01.096 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:01:03.103 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:01:03.107 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:01:05.126 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:01:05.130 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:01:07.135 +08:00 [INF] FIX Connection Succeeded.
2022-07-08 18:01:07.139 +08:00 [INF] Logout - FIXT.1.1:CLIENT1->EXECUTOR
2022-07-08 18:01:08.138 +08:00 [INF] OMS QuickFix Service Stopped
Once I see this log, I stop and restart the application with many services shut down, and then it works. Why is that? Or is there a way for me to increase the connection timeout so it doesn't time out on me; is there an option or configuration in the FIX config (.cfg) file that I can use to change this behavior? I am using QuickFIX/n, a C# .NET FIX library.
I really need to get this resolved. It just happens randomly and I don't know what the real problem is; what I stated above is theoretical.
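For reference, the reconnect/logon timing can be tuned in the QuickFIX session .cfg. The setting names below are from the standard QuickFIX configuration reference; the values, host, and port are placeholders (though ReconnectInterval=2 would match the two-second retry cadence visible in the log above, and the CompIDs are the ones the log reports):

[DEFAULT]
ConnectionType=initiator
# seconds between reconnect attempts (placeholder value)
ReconnectInterval=2
# seconds to wait for a Logon response before disconnecting
LogonTimeout=10
# seconds to wait for a Logout response
LogoutTimeout=2

[SESSION]
BeginString=FIXT.1.1
DefaultApplVerID=FIX.5.0SP2
SenderCompID=CLIENT1
TargetCompID=EXECUTOR
SocketConnectHost=127.0.0.1
SocketConnectPort=5001
HeartBtInt=30
StartTime=00:00:00
EndTime=00:00:00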

OK, sorry, I think I figured it out. I have a WebApi service running on port 5000 (the default), and once this service runs, it somehow (I have no idea why or how) also opens port 5001.
QuickFIX by default runs on port 5001, so if my WebApi service runs first, my FIX application will not run. But if I run my FIX app first and then the WebApi, everything works.
I just don't know why or how the WebApi opens 5001 as well when it's only supposed to run on port 5000. Strange.
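This is most likely ASP.NET Core's default Kestrel binding: out of the box it listens on both http://localhost:5000 and https://localhost:5001, which would explain the clash. A sketch of pinning the WebApi to a single port so it no longer grabs 5001 (port values are just examples):

rem Option 1: pass the URLs on the command line
dotnet run --urls "http://localhost:5000"

rem Option 2: set the environment variable before launching (Windows)
set ASPNETCORE_URLS=http://localhost:5000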

Related

Unable to start Drill in distributed mode

I am trying to get Drill v1.18 running and am facing the error below.
The drill-override.conf points to the ZooKeeper instance, which runs on port 12181. When starting in distributed mode, Drill fails with the log output below, but embedded mode has no issues.
It looks like a permission issue, but ZooKeeper, Drill, and the ZooKeeper data-dir are all running under the same user.
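For reference, the relevant drill-override.conf entries look like this (a sketch; the cluster-id and ZooKeeper address match what the log below reports):

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:12181"
}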
2020-05-10 16:23:01,160 [main] DEBUG o.apache.drill.exec.server.Drillbit - Construction started.
2020-05-10 16:23:01,448 [main] DEBUG o.a.d.e.c.zk.ZKClusterCoordinator - Connect localhost:12181, zkRoot drill, clusterId: drillbits1
2020-05-10 16:23:01,531 [main] INFO o.a.d.e.s.s.PersistentStoreRegistry - Using the configured PStoreProvider class: 'org.apache.drill.exec.store.sys.store.provider.ZookeeperPersistentStoreProvider'.
2020-05-10 16:23:01,718 [main] DEBUG o.a.drill.exec.ssl.SSLConfigServer - Using Hadoop configuration for SSL
2020-05-10 16:23:01,718 [main] DEBUG o.a.drill.exec.ssl.SSLConfigServer - Hadoop SSL configuration file: ssl-server.xml
2020-05-10 16:23:01,731 [main] DEBUG org.apache.drill.exec.ssl.SSLConfig - Initialized SSL context.
2020-05-10 16:23:01,731 [main] INFO o.a.drill.exec.rpc.user.UserServer - Rpc server configured to use TLS protocol 'TLSv1.2'
2020-05-10 16:23:01,738 [main] INFO o.apache.drill.exec.server.Drillbit - Construction completed (577 ms).
2020-05-10 16:23:01,738 [main] DEBUG o.apache.drill.exec.server.Drillbit - Startup begun.
2020-05-10 16:23:01,738 [main] DEBUG o.a.d.e.c.zk.ZKClusterCoordinator - Starting ZKClusterCoordination.
2020-05-10 16:23:03,775 [main] ERROR o.apache.drill.exec.server.Drillbit - Failure during initial startup of Drillbit.
org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /drill
at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1538)
at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:351)
at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:230)
at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:224)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:221)
at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206)
at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:265)
at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:69)
at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:596)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.rebuild(PathChildrenCache.java:327)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.start(PathChildrenCache.java:304)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.start(PathChildrenCache.java:252)
at org.apache.curator.x.discovery.details.ServiceCacheImpl.start(ServiceCacheImpl.java:99)
at org.apache.drill.exec.coord.zk.ZKClusterCoordinator.start(ZKClusterCoordinator.java:145)
at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:220)
at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:584)
at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:554)
at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:550)
Version 1.17 has no issues starting in distributed mode.
The issue here is the ZooKeeper version. You are probably using a 3.4.x version, but the current version of Drill requires 3.5.x. As a workaround, you may replace the ZooKeeper jars jars/ext/zookeeper-3.5.7.jar and jars/ext/zookeeper-jute-3.5.7.jar with the jars that correspond to your ZooKeeper version.
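A sketch of that jar swap (paths and the 3.4.x version number are examples; adjust to your install):

cd $DRILL_HOME/jars/ext
# park the bundled 3.5.7 jars
mv zookeeper-3.5.7.jar zookeeper-3.5.7.jar.bak
mv zookeeper-jute-3.5.7.jar zookeeper-jute-3.5.7.jar.bak
# drop in the jar matching your ZooKeeper server (3.4.x has no separate jute jar)
cp /path/to/zookeeper-3.4.14.jar .
# then restart the drillbits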
In addition to the answer of Vova Vysotskyi, you may find more information about this issue in the Drill documentation:
https://drill.apache.org/docs/distributed-mode-prerequisites/
Starting in Drill 1.18 the bundled ZooKeeper libraries are upgraded to version 3.5.7, preventing connections to older (< 3.5) ZooKeeper clusters. In order to connect to a ZooKeeper < 3.5 cluster, replace the ZooKeeper library JARs in ${DRILL_HOME}/jars/ext with zookeeper-3.4.x.jar then restart the cluster.

Error restoring Rancher: This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready

I am trying to back up and restore a Rancher server (single-node install), as described here.
After the backup, I turned off the Rancher server node and ran a new Rancher container on a new node (on the same network, but with another IP address), then restored using the backup file.
After restoring, I logged in to the Rancher UI and it showed the error below.
So I checked the Rancher server logs, which showed the following:
2019-10-05 16:41:32.197641 I | http: TLS handshake error from 127.0.0.1:38388: EOF
2019-10-05 16:41:32.202442 I | http: TLS handshake error from 127.0.0.1:38380: EOF
2019-10-05 16:41:32.210378 I | http: TLS handshake error from 127.0.0.1:38376: EOF
2019-10-05 16:41:32.211106 I | http: TLS handshake error from 127.0.0.1:38386: EOF
2019/10/05 16:42:26 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:44:34 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:48:50 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019-10-05 16:50:19.114475 I | mvcc: store.index: compact 75951
2019-10-05 16:50:19.137825 I | mvcc: finished scheduled compaction at 75951 (took 22.527694ms)
2019-10-05 16:55:19.120803 I | mvcc: store.index: compact 76282
2019-10-05 16:55:19.124813 I | mvcc: finished scheduled compaction at 76282 (took 2.746382ms)
After that, I checked the logs of the master nodes and found that the Rancher agent still tries to connect to the old Rancher server (the old IP address) instead of the new one, which makes the cluster unavailable.
How can I fix this?
You need to re-register the node in Rancher using the following steps.
Update the server-url in Rancher by going to Global -> Settings -> server-url.
This should be the full URL, including https://.
Then use this script to re-register the node in Rancher: https://github.com/mattmattox/cluster-agent-tool
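Alternatively, the downstream agents read the server URL from the CATTLE_SERVER environment variable, so, assuming you can still reach the downstream cluster with kubectl, you can point the cluster agent at the new address by hand (the namespace is Rancher's standard one; the URL is an example):

# run against the downstream cluster; the URL is an example
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_SERVER=https://rancher.example.com
kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent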

pgjdbc-ng throws mysterious ServiceLoader error

I'm getting a very weird error in my logs. This just started happening randomly and was not triggered by a version upgrade of anything.
2019-08-23 14:49:41.150 ccleves-mac-mini.local com.zaxxer.hikari.HikariDataSource 7177 INFO HikariPool-2 - Starting...
2019-08-23 14:49:41.150 ccleves-mac-mini.local com.zaxxer.hikari.HikariDataSource 7177 INFO HikariPool-1 - Starting...
2019-08-23 14:49:41.150 ccleves-mac-mini.local com.zaxxer.hikari.HikariDataSource 7177 INFO HikariPool-3 - Starting...
2019-08-23 14:49:41.676 ccleves-mac-mini.local com.zaxxer.hikari.pool.HikariPool 7703 ERROR HikariPool-3 - Error thrown while acquiring connection from data source
java.util.NoSuchElementException: null
at sun.misc.CompoundEnumeration.nextElement(CompoundEnumeration.java:59)
at java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:357)
at java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
at java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
at com.impossibl.postgres.system.procs.Procs.loadDecoderProc(Procs.java:107)
at com.impossibl.postgres.system.procs.Procs.loadNamedBinaryCodec(Procs.java:83)
at com.impossibl.postgres.types.BaseType.<init>(BaseType.java:46)
at com.impossibl.postgres.types.BaseType.<init>(BaseType.java:50)
at com.impossibl.postgres.types.SharedRegistry.<init>(SharedRegistry.java:123)
at com.impossibl.postgres.jdbc.PGDriver.lambda$connect$0(PGDriver.java:106)
at java.util.HashMap.computeIfAbsent(HashMap.java:1127)
at com.impossibl.postgres.jdbc.PGDriver.lambda$connect$1(PGDriver.java:106)
at com.impossibl.postgres.system.BasicContext.init(BasicContext.java:303)
at com.impossibl.postgres.jdbc.PGDirectConnection.init(PGDirectConnection.java:276)
at com.impossibl.postgres.jdbc.ConnectionUtil.createConnection(ConnectionUtil.java:205)
at com.impossibl.postgres.jdbc.ConnectionUtil.createConnection(ConnectionUtil.java:165)
at com.impossibl.postgres.jdbc.PGDriver.connect(PGDriver.java:113)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:121)
at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:353)
at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:201)
at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:473)
at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:562)
at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:115)
at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:81)
Even weirder: it happens when I am developing the app in Eclipse. If I stop the app and restart it from Eclipse, it happens. If I close Eclipse entirely and reopen it, the problem goes away and all connections open fine.
If I look at the connections on the Postgres server, I don't see anything out of place. When it works, 10 connections get opened; when it fails, just one.
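For context, the top of the trace shows the driver discovering its binary codecs via java.util.ServiceLoader, which iterates provider names listed under META-INF/services on the classpath. A minimal sketch of that pattern (the Codec interface here is hypothetical, standing in for pgjdbc-ng's internal SPI):

import java.util.ServiceLoader;

public class ServiceLoaderSketch {

    // Hypothetical service interface; pgjdbc-ng's real SPI differs.
    public interface Codec {
        String name();
    }

    public static void main(String[] args) {
        // ServiceLoader scans META-INF/services/<interface name> on the
        // classpath. A stale or half-closed classloader (e.g. after a hot
        // restart inside an IDE) can make this enumeration fail mid-iteration,
        // which would match the NoSuchElementException in the trace above.
        ServiceLoader<Codec> loader = ServiceLoader.load(Codec.class);
        for (Codec codec : loader) {
            System.out.println("Loaded codec: " + codec.name());
        }
    }
}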

eclipse console suddenly showing INFO

I'm using Apache HttpClient and I've started seeing some INFO output on the Eclipse console:
0 [main] INFO org.apache.commons.httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
3 [main] INFO org.apache.commons.httpclient.HttpMethodDirector - Retrying request
3861 [pool-1-thread-25] INFO org.apache.commons.httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
3861 [pool-1-thread-25] INFO org.apache.commons.httpclient.HttpMethodDirector - Retrying request
3913 [pool-1-thread-16] INFO org.apache.commons.httpclient.HttpMethodBase - Response content length is not known
To my knowledge, nothing has changed. How can I get rid of it?
It's probably your logging library. HttpClient depends on commons-logging, which automatically picks up a logging implementation from your classpath (either java.util.logging or log4j) and by default writes to the console.
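If log4j is what got picked up, raising the relevant logger levels should quiet it; a sketch for log4j.properties (WARN is a choice, not a requirement):

# silence HttpClient's INFO chatter (example thresholds)
log4j.logger.org.apache.commons.httpclient=WARN
log4j.logger.httpclient.wire=WARN

With java.util.logging, the equivalent would be a line like org.apache.commons.httpclient.level = WARNING in logging.properties.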

Perl intermittently fails to connect to IIS 7.5 (Cygwin on Windows Server 2012)

We upgraded Perl on our Windows Server 2012 machine to the latest stable version. Ever since, we have been getting intermittent "Cannot connect to the server" errors that result in 500 responses.
But it is so intermittent that we cannot pin down the problem. Here is the debug log for some idea:
DEBUG: .../IO/Socket/SSL.pm:763: done Net::SSLeay::connect -> -1
DEBUG: .../IO/Socket/SSL.pm:773: ssl handshake in progress
DEBUG: .../IO/Socket/SSL.pm:783: waiting for fd to become ready: SSL wants a read first
DEBUG: .../IO/Socket/SSL.pm:803: socket ready, retrying connect
DEBUG: .../IO/Socket/SSL.pm:759: call Net::SSLeay::connect
DEBUG: .../IO/Socket/SSL.pm:763: done Net::SSLeay::connect -> -1
DEBUG: .../IO/Socket/SSL.pm:766: local error: SSL connect attempt failed
The Windows server is running IIS 7.5, and we have a valid certificate issued by COMODO.
Any insight would be much appreciated. Please let me know if you need any further information.
Updating Windows Server 2012 (my OS), which had not been updated in over a year, fixed the problem. There must have been a Windows patch adding newer TLS support.
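A quick way to confirm this kind of TLS-version mismatch is to probe the endpoint with specific protocol versions, e.g. with openssl s_client from the Cygwin shell (the hostname is a placeholder):

# succeeds only if the server accepts TLS 1.2
openssl s_client -connect myserver.example.com:443 -tls1_2 < /dev/null
# compare against an older protocol version
openssl s_client -connect myserver.example.com:443 -tls1 < /dev/null

If one version handshakes and the other fails, the client and server disagree on supported TLS versions, which fits the "SSL connect attempt failed" lines in the debug log above.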