Trying to move from Flink 1.3.2 to 1.5. We have a cluster deployed with Kubernetes. Everything works fine with 1.3.2, but I cannot submit a job with 1.5. When I try to do that, I just see the spinner spin around indefinitely; the same happens via the REST API. I can't even submit the WordCount example job.
It seems my TaskManagers cannot connect to the JobManager. I can see them in the Flink UI, but in the logs I see:
level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123
level=WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123
But I can telnet from the TaskManager to the JobManager.
Moreover, everything works on my local machine if I start Flink in cluster mode (JobManager + TaskManager).
In the 1.5 documentation I found the mode option, which switches between the new FLIP-6 mode and the legacy mode (FLIP-6 being the default), but if I set mode: legacy I don't see my TaskManagers registered at all.
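For reference, the relevant flink-conf.yaml entries would look roughly like this (a sketch only; the JobManager address is taken from the logs above, and the legacy switch is the option mentioned in the docs):
jobmanager.rpc.address: flink-jobmanager-nonprod-2.rpds.svc.cluster.local
jobmanager.rpc.port: 6123
# default in 1.5 is the new FLIP-6 mode; this flips back to the pre-1.5 runtime
mode: legacy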
Is there something specific about a Kubernetes deployment and 1.5 that I need to do? I checked the 1.5 Kubernetes config and it looks much the same as ours, but we use a customized Docker image for Flink (security, HA, checkpointing).
Thank you.
The issue is with JobManager connectivity: the JobManager Docker image cannot connect to the "flink-jobmanager" (${JOB_MANAGER_RPC_ADDRESS}) address.
Just use the afilichkin/flink-k8s Docker image instead of flink:latest.
I fixed it by adding a new host entry to the JobManager Docker image. You can see it in my GitHub project:
https://github.com/Aleksandr-Filichkin/flink-k8s/tree/master
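The change amounts to something like the following in the image's entrypoint (a sketch only; the hostname fallback and exact entrypoint layout are assumptions, the actual fix is in the repository above):
# docker-entrypoint.sh (sketch): make the JobManager's own RPC address resolvable inside its container
echo "127.0.0.1 ${JOB_MANAGER_RPC_ADDRESS:-flink-jobmanager}" >> /etc/hosts
exec "$@"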
I am running into this error after Cadence Canary starts on my cluster nodes.
After the "error starting cron workflow...." message, Cadence Canary does nothing and just hangs there.
Any thoughts/suggestions?
UPDATE: I have turned on debug level logging and I am getting hammered with the following (note: it's a fresh cluster):
This error message says that cadence-canary was not able to call the cadence-frontend service. This might indicate that cadence-frontend is not running or is not reachable. Check whether cadence-frontend is running and whether your cadence-canary config points to the correct cadence-frontend address.
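A quick reachability check from the canary host can rule out the network (a sketch; the hostname is an assumption and 7933 is the conventional cadence-frontend port):
nc -vz cadence-frontend 7933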
I set up a development environment in a Docker Swarm cluster, which consists of 2 nodes, a few networks and a few microservices. The following gives an overview of how it looks in the cluster.
Service              Network             Node    Image Version
nginx reverse proxy  backend, frontend   node 1  latest stable-alpine
Service A            backend, database   node 2  8-jre-alpine
Service B            backend, database   node 2  8-jre-alpine
Service C            backend, database   node 1  8-jre-alpine
Database postgresql  database            node 1  latest alpine
The services are Spring Boot 2.1.7 applications with spring-boot-starter-data-jpa. All of the services above hold a database connection to the PostgreSQL instance. For the database I configured only the following properties in application.properties:
spring.datasource.url
spring.datasource.username
spring.datasource.password
spring.datasource.driver-class-name
spring.jpa.hibernate.ddl-auto=
spring.jpa.properties.hibernate.jdbc.lob.non_contextual_creation=true
After some time I see that the connection limit in PostgreSQL is exceeded, which prevents any new connection from being created.
2019-09-21 13:01:07.031 1 --- [onnection adder] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Cannot acquire connection from data source org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
A similar error also appears when I try to connect to the database over SSH:
psql: FATAL: sorry, too many clients already
What I have tried so far:
spring.datasource.hikari.leak-detection-threshold=20000
which didn't help.
I found several answers to this problem, such as:
increase the connection limit in PostgreSQL
No, I don't want to do this. It is just a temporary workaround; the connections would pile up again, only a bit later.
add an idle timeout to the HikariCP configuration
HikariCP already has a default idle timeout of 10 minutes, which doesn't help.
add a max lifetime to the HikariCP configuration
HikariCP already has a default max lifetime of 30 minutes, which doesn't help.
reduce the number of idle connections in the HikariCP configuration
HikariCP already has a default value of 10 here, which doesn't help.
set min idle in the HikariCP configuration
The default is 10 and I am fine with it.
I expect around 30 connections from the services, but I find nearly 100. Restarting or stopping the services does not close the idle connections either. What are your suggestions? Is it a Docker-specific problem? Has anyone experienced the same problem?
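For reference, explicit pool caps per service in application.properties would look like this (a sketch; these are the standard spring.datasource.hikari.* keys and the numbers are assumptions):
spring.datasource.hikari.maximum-pool-size=5
spring.datasource.hikari.minimum-idle=1
spring.datasource.hikari.idle-timeout=60000
With each JPA service capped like this, the total across the swarm stays well below the PostgreSQL connection limit.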
When I run 3 mesos-masters with QUORUM=2, they fail one minute after being elected leader, with errors:
E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]
E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]
They keep electing one another in a loop, consistently failing and re-electing.
If I set QUORUM=1, everything works well. What could be the reason for this?
One problem was that the AWS firewall was blocking access to the servers' public IPs, and ZooKeeper was broadcasting the public IP (set in advertise_ip), so nobody was able to connect to each other. Slaves also couldn't connect to the masters, with the same error.
When I set the local IP in advertise_ip (so that ZooKeeper broadcast local IPs), the masters could communicate and QUORUM=2 worked. When I removed the firewall rule, the slaves could connect to the masters as well.
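Setting the advertised address looks roughly like this (a sketch; the IP is a placeholder and the file-per-flag path follows the Mesosphere packaging convention):
echo "10.0.1.15" > /etc/mesos-master/advertise_ip
# or, when starting the master by hand:
mesos-master --quorum=2 --zk=zk://10.0.1.15:2181/mesos --work_dir=/var/lib/mesos --advertise_ip=10.0.1.15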
We had a similar problem yesterday: Marathon was acting a little weird because some applications were not being deployed. The strange part was that the application would come up, but the health check never turned green, so nixy wasn't updating nginx.
After a lot of investigation we came to this very same error:
E0718 18:51:05.836688 5049 socket.hpp:107] Shutdown failed on fd=46: Transport endpoint is not connected [107]
In the end we discovered that the problem was in the election: even though our QUORUM=1 (we have 2 masters), it somehow got lost and one master wasn't communicating with the other.
To solve this we triggered a new election using the Marathon API's /v2/leader DELETE method, and everything worked fine after that.
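The call itself is just (host and port are placeholders):
curl -X DELETE http://marathon.example.com:8080/v2/leader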
We had the same problem, with the mesos-master log flooding with messages like:
mesos-master[27499]: E0616 14:29:39.310302 27523 socket.hpp:174] Shutdown failed on fd=67: Transport endpoint is not connected [107]
It turned out to be the load balancer's health check against /stats.json.
I have a Spark cluster with 10 nodes, and I'm getting this exception after using the SparkContext for the first time:
14/11/20 11:15:13 ERROR UserGroupInformation: PriviledgedActionException as:iuberdata (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
... 4 more
This guy had a similar problem, but I've already tried his solution and it didn't work.
The same exception also happens here, but the problem isn't the same, as I'm using Spark version 1.1.0 on both master and slaves and in the client.
I've tried to increase the timeout to 120s but it still doesn't solve the problem.
I'm deploying the environment through scripts, and I'm using context.addJar to include my code in the classpath.
This problem is intermittent, and I have no idea how to track down why it is happening. Has anybody faced this issue when configuring a Spark cluster and knows how to solve it?
We had a similar problem which was quite hard to debug and isolate. Long story short: Spark uses Akka, which is very picky about FQDN hostnames resolving to IP addresses. Even if you specify the IP address in all places, it is not enough. The answer here helped us isolate the problem.
A useful test is to run netcat -l <port> on the master and nc -vz <host> <port> on the worker to check the connectivity. Run the test with the IP address and with the FQDN; a concrete sketch of the commands follows the log snippet below. You can get the name Spark is using from the WARN message in that snippet; for us it was host032s4.staging.companynameremoved.info. The IP address test passed for us, and the FQDN test failed, as our DNS was not set up correctly.
INFO 2015-07-24 10:33:45 Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.40.246.168:35455]
INFO 2015-07-24 10:33:45 Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@10.40.246.168:35455]
INFO 2015-07-24 10:33:45 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 35455.
WARN 2015-07-24 10:33:45 Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@host032s4.staging.companynameremoved.info:50855]. Address is now gated for 60000 ms, all messages to this address will be delivered to dead letters.
ERROR 2015-07-24 10:34:15 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:skumar cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
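Concretely, the connectivity check looks like this (the port and addresses are taken from the log above; substitute your own):
# on the master, listen on a test port
netcat -l 35455
# on the worker, test both the IP address and the FQDN
nc -vz 10.40.246.168 35455
nc -vz host032s4.staging.companynameremoved.info 35455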
Another thing we had to do was specify the spark.driver.host and spark.driver.port properties in the spark-submit script. This was because we had machines with two IP addresses and the FQDN resolved to the wrong one.
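In the submit script that amounts to something like this (a sketch; the address, port, class and jar are placeholders):
spark-submit \
  --conf spark.driver.host=10.40.246.168 \
  --conf spark.driver.port=51000 \
  --class com.example.Main \
  app.jar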
Make sure your network and DNS entries are correct!!
The firewall was misconfigured and, in some instances, it didn't allow the slaves to connect to the cluster.
This generated the timeout issue, as the slaves couldn't connect to the server.
If you are facing this timeout, check your firewall configs.
I had a similar problem and managed to get around it by using cluster deploy mode when submitting the application to Spark.
(Even allowing all incoming traffic to both my master and the single slave didn't let me use client deploy mode. Before changing them, I had the default security group (AWS firewall) settings set up by the Spark EC2 scripts from Spark 1.2.0.)
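For reference, that just means submitting with the cluster deploy mode (a sketch; the master URL, class and jar are placeholders):
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.Main \
  app.jar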
I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:
2014-03-03 18:44:40,995 [myid:1] - INFO [main:QuorumPeer#429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
and:
2014-03-03 18:44:41,233 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager#190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager#190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection#774] - Notification time out: 400
Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading zookeeper.
Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.
It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper server's logs (the ones still in the quorum) when trying to connect to the first server. What I'm not sure of is if the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: Is there a way to resolve both issues without doing a rolling restart?
Thanks for reading and for your help!
This is a ZooKeeper bug: Server is unable to join quorum after connection broken to other peers.
Restarting the leader solves this issue.
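To find which node is currently the leader before restarting it, the four-letter-word command is handy (a sketch; the host and service name are assumptions):
echo srvr | nc zk-host 2181 | grep Mode    # prints "Mode: leader" on the leader
sudo systemctl restart zookeeper           # run only on that node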
Everyone hits this problem when pods or hosts rejoin the cluster with different IPs while keeping the same id. Your host's IP can change because your config perhaps specifies 0.0.0.0 or a DNS name. So follow these instructions:
1. Stop all servers, and in the config use
server.1=10.x.x.x:1234:5678
server.2=10.x.x.y:1234:5678
server.3=10.x.x.z:1234:5678
not DNS names. Use your LAN IP as the identifier.
2. Start your servers again; it should work.
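For completeness, each id in the server.N lines above has to match the contents of that node's myid file (a sketch; the dataDir path is an assumption):
echo 1 > /var/lib/zookeeper/myid    # on 10.x.x.x
echo 2 > /var/lib/zookeeper/myid    # on 10.x.x.y
echo 3 > /var/lib/zookeeper/myid    # on 10.x.x.z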