Intermittent Timeout Exception using Spark - scala

I have a Spark cluster with 10 nodes, and I'm getting this exception after using the Spark context for the first time:
14/11/20 11:15:13 ERROR UserGroupInformation: PriviledgedActionException as:iuberdata (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
... 4 more
This guy had a similar problem, but I've already tried his solution and it didn't work.
The same exception also happens here, but the problem isn't the same in my case, as I'm using Spark version 1.1.0 on both the master and the slaves, as well as in the client.
I've tried increasing the timeout to 120s, but that still doesn't solve the problem.
I'm deploying the environment through scripts, and I'm using context.addJar to include my code in the classpath.
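For reference, a rough sketch of how I set the timeout and add my jar (the master URL and jar path are placeholders, and I'm assuming spark.akka.timeout is the property behind the 120-second future timeout on 1.1.0):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")      // placeholder master URL
  .setAppName("MyApp")                   // placeholder app name
  .set("spark.akka.timeout", "120")      // seconds; raised from the default
val sc = new SparkContext(conf)
sc.addJar("/path/to/my-assembly.jar")    // ships my code to the executors' classpath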
This problem is intermittent, and I have no idea how to track down why it is happening. Has anybody faced this issue when configuring a Spark cluster and knows how to solve it?

We had a similar problem which was quite hard to debug and isolate. Long story short: Spark uses Akka, which is very picky about FQDN hostnames resolving to IP addresses. Even if you specify the IP address everywhere, it is not enough. The answer here helped us isolate the problem.
A useful test is to run netcat -l <port> on the master and nc -vz <host> <port> on the worker to check connectivity. Run the test both with the IP address and with the FQDN. You can get the name Spark is using from the WARN message in the log snippet below; for us it was host032s4.staging.companynameremoved.info. The IP address test passed and the FQDN test failed, as our DNS was not set up correctly.
INFO 2015-07-24 10:33:45 Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#10.40.246.168:35455]
INFO 2015-07-24 10:33:45 Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher#10.40.246.168:35455]
INFO 2015-07-24 10:33:45 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 35455.
WARN 2015-07-24 10:33:45 Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver#host032s4.staging.companynameremoved.info:50855]. Address is now gated for 60000 ms, all messages to this address will be delivered to dead letters.
ERROR 2015-07-24 10:34:15 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:skumar cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Another thing we had to do was specify the spark.driver.host and spark.driver.port properties in the spark-submit script. This was because we had machines with two IP addresses and the FQDN resolved to the wrong one.
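As a rough illustration (the address and port are placeholders for your environment), the same properties can be passed as --conf options to spark-submit or set programmatically:

import org.apache.spark.SparkConf

// Equivalent to passing --conf spark.driver.host=... --conf spark.driver.port=... to spark-submit
val conf = new SparkConf()
  .set("spark.driver.host", "10.40.246.168") // an address the executors can actually reach
  .set("spark.driver.port", "51000")         // a fixed port that is open between the hosts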
Make sure your network and DNS entries are correct!!

The firewall was misconfigured and, in some instances, it didn't allow the slaves to connect to the cluster.
This caused the timeout, as the slaves couldn't reach the master.
If you are facing this timeout, check your firewall configs.

I had a similar problem and managed to get around it by using cluster deploy mode when submitting the application to Spark.
(Even allowing all incoming traffic to both my master and the single slave didn't let me use client deploy mode. Before changing them, I had the default security group (AWS firewall) settings set up by the Spark EC2 scripts from Spark 1.2.0.)

Related

Can't submit job with Flink 1.5 cluster

We're trying to move from Flink 1.3.2 to 1.5. We have a cluster deployed with Kubernetes. Everything works fine with 1.3.2, but I cannot submit a job with 1.5. When I try, I just see the spinner spin around indefinitely, and the same happens via the REST API. I can't even submit the WordCount example job.
It seems my taskmanagers cannot connect to the jobmanager. I can see them in the Flink UI, but in the logs I see:
level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123
level=WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123
But I can telnet from the taskmanager to the jobmanager.
Moreover, everything works locally if I start Flink in cluster mode (jobmanager + taskmanager).
In the 1.5 documentation I found the mode option, which switches between flip6 and legacy (default is flip6), but if I set mode: legacy I don't see my taskmanagers registered at all.
Is there something specific to a k8s deployment of 1.5 that I need to do? I checked the 1.5 k8s config and it looks pretty much the same as ours, but we use a customized Docker image for Flink (security, HA, checkpointing).
Thank you.
The issue is with jobmanager connectivity: the jobmanager Docker image cannot connect to the "flink-jobmanager" (${JOB_MANAGER_RPC_ADDRESS}) address.
Just use the afilichkin/flink-k8s Docker image instead of flink:latest.
I fixed it by adding a new host entry to the jobmanager Docker image. You can see it in my GitHub project:
https://github.com/Aleksandr-Filichkin/flink-k8s/tree/master

Mesos-master: Shutdown failed on fd=25: Transport endpoint is not connected [107]

When I run 3 mesos-master with QUORUM=2, they fail 1 minute after being elected as the leader, giving errors:
E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]
E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]
They keep electing one another in a loop, consistently failing and re-electing.
If I set QUORUM=1, everything works well. What could be the reason for this?
One problem was that the AWS firewall was blocking access to the servers' public IPs while ZooKeeper was broadcasting the public IP (set in advertise_ip), so nobody was able to connect to each other. The slaves also couldn't connect to the masters, with the same error.
When I set the local IP in advertise_ip (so that ZooKeeper broadcast local IPs), the masters could communicate and QUORUM=2 worked. When I removed the firewall rule, the slaves could connect to the master.
We had a similar problem yesterday: Marathon was acting a little weird because some applications were not being deployed. The strange thing was that the application would come up but the health check never turned green, so nixy wasn't updating nginx.
After a lot of investigation we came across this very same error:
E0718 18:51:05.836688 5049 socket.hpp:107] Shutdown failed on fd=46: Transport endpoint is not connected [107]
In the end we discovered that the problem was in the election: even though our QUORUM=1 (we have 2 masters), it somehow got lost and one master wasn't communicating with the other.
To solve this we triggered a new election using the Marathon API's DELETE /v2/leader method, and everything worked fine after that.
We had the same problem, with the mesos-master log flooded with messages like:
mesos-master[27499]: E0616 14:29:39.310302 27523 socket.hpp:174] Shutdown failed on fd=67: Transport endpoint is not connected [107]
It turned out to be the load balancer's health check against /stats.json.

Connect to spark through a SOCKS proxy

TL;DR: How can I connect a local driver to a Spark cluster through a SOCKS proxy?
We have an onsite spark cluster that is behind a firewall that blocks most ports. We have ssh access, so I can create a SOCKS proxy with ssh -D 7777 ....
It works fine for browsing the web UIs when my browser uses the proxy, but I do not know how to make a local driver use it.
So far I have this, which obviously is not configuring any proxies:
import org.apache.spark.{SparkConf, SparkContext}

// Configuration pointing at the remote standalone master
val sconf = new SparkConf()
  .setMaster("spark://masterserver:7077")
  .setAppName("MySpark")
val sc = new SparkContext(sconf)
This logs the following messages 16 times before throwing an exception:
15/01/20 14:43:34 INFO Remoting: Starting remoting
15/01/20 14:43:34 ERROR NettyTransport: failed to bind to server-name/ip.ip.ip.ip:0, shutting down Netty transport
15/01/20 14:43:34 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/20 14:43:34 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
15/01/20 14:43:34 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/01/20 14:43:34 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Your best shot may be to forward a local port to the remote port 7077 and then setMaster("spark://localhost:nnnn"), where nnnn is the local port you have forwarded.
To do this, use ssh -L (instead of -D).
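A minimal, untested sketch of that setup, assuming a tunnel such as ssh -L 7077:masterserver:7077 user@gateway has been opened first (port numbers and host names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sconf = new SparkConf()
  .setMaster("spark://localhost:7077")   // points at the locally forwarded port
  .setAppName("MySpark")
val sc = new SparkContext(sconf)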
I cannot guarantee that this will work, or that if it works it will continue to work, but at least it spares you from using an actual proxy for this one port. The things most likely to break it are secondary connections that the initial connection might trigger. I haven't tested this yet, but unless there are such secondary connections, in principle it should work.
Also, this doesn't answer the TL;DR version of your question, but since you have SSH access, it's more likely to work.

Why can't my Zookeeper server rejoin the Quorum?

I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:
2014-03-03 18:44:40,995 [myid:1] - INFO [main:QuorumPeer#429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
and:
2014-03-03 18:44:41,233 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager#190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager#190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection#774] - Notification time out: 400
Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading zookeeper.
Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.
It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper server's logs (the ones still in the quorum) when trying to connect to the first server. What I'm not sure of is if the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: Is there a way to resolve both issues without doing a rolling restart?
Thanks for reading and for your help!
This is a ZooKeeper bug: Server is unable to join quorum after connection broken to other peers.
Restarting the leader solves this issue.
This happens whenever pods or hosts rejoin the cluster with different IPs while keeping the same ID. Your host's IP can change if your config specifies, for example, 0.0.0.0 or a DNS name. So follow these steps:
1. Stop all servers, and in the config use IP addresses instead of DNS names:
server.1=10.x.x.x:1234:5678
server.2=10.x.x.y:1234:5678
server.3=10.x.x.z:1234:5678
2. Use your LAN IP as the identifier.
3. Start your servers; it should work.

jboss clustering GMS, join

I have jboss 5.1.0.
We have configured JBoss with clustering somehow, but in fact we do not use clustering while developing or testing. However, in order to launch the project I have to type the following:
./run.sh -c all -g uniqueclustername -b 0.0.0.0 -Djboss.messaging.ServerPeerID=1 -Djboss.service.binding.set=ports-01
But while JBoss is starting, I see something like this in the console:
17:24:45,149 WARN [GMS] join(172.24.224.7:60519) sent to 172.24.224.2:61247 timed out (after 3000 ms), retrying
17:24:48,170 WARN [GMS] join(172.24.224.7:60519) sent to 172.24.224.2:61247 timed out (after 3000 ms), retrying
17:24:51,172 WARN [GMS] join(172.24.224.7:60519)
Here 172.24.224.7 is my local IP, while 172.24.224.2 is the IP of another developer in our room (and JBoss is stopped there).
So it tries to join the other node or something (I'm not very familiar with how JBoss acts in clusters), and as a result the application does not start.
What may the problem be? How do I avoid this joining?
You can probably fix this by specifying
-Djgroups.udp.ip_ttl=0
in your startup. This sets the IP time-to-live on the JGroups packets to zero, so they never get anywhere and the cluster will never form. We use this in dev here to stop the various developer machines from forming a cluster. There's no need to specify a unique cluster name.
I'm assuming you need to do clustering in production, is that right? Could you just use the default configuration instead of all? This would remove the clustering stuff altogether.
While setting up the server, using host name = localhost and --host=localhost instead of an IP address will solve the problem. That makes the server start in non-clustered mode.