getting timeout when submitting fat jar to spark-jobserver (akka.pattern.AskTimeoutException) - scala

I have built my job jar using sbt assembly so that all dependencies end up in one jar. When I try to submit my binary to spark-jobserver I get an akka.pattern.AskTimeoutException.
I modified my configuration to be able to submit large jars (I added parsing.max-content-length = 300m to my configuration). I also increased some of the timeouts in the configuration, but nothing helped.
After I run:
curl --data-binary @matching-ml-assembly-1.0.jar localhost:8090/jars/matching-ml
I am getting:
{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka://JobServer/user/binary-manager#1785133213]] after [3000 ms]. Sender[null] sent message of type \"spark.jobserver.StoreBinary\".",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": ["akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)", "akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)", "scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)", "scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)", "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)", "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)", "akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)", "akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)", "akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)", "java.lang.Thread.run(Thread.java:745)"]
  }
}
My configuration:
# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
#
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = "local[4]"
  # master = "mesos://vm28-hulk-pub:5050"
  # master = "yarn-client"

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    context-per-jvm = false
    # Note: JobFileDAO is deprecated from v0.7.0 because of issues in
    # production and will be removed in future, now defaults to H2 file.
    jobdao = spark.jobserver.io.JobSqlDAO

    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }

    datadao {
      # storage directory for files that are uploaded to the server
      # via POST/data commands
      rootdir = /tmp/spark-jobserver/upload
    }

    sqldao {
      # Slick database driver, full classpath
      slick-driver = slick.driver.H2Driver

      # JDBC driver, full classpath
      jdbc-driver = org.h2.Driver

      # Directory where default H2 driver stores its data. Only needed for H2.
      rootdir = /tmp/spark-jobserver/sqldao/data

      # Full JDBC URL / init string, along with username and password. Sorry, needs to match above.
      # Substitutions may be used to launch job-server, but leave it out here in the default or tests won't pass
      jdbc {
        url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
        user = ""
        password = ""
      }

      # DB connection pool settings
      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }

    # When using chunked transfer encoding with scala Stream job results, this is the size of each chunk
    result-chunk-size = 1m
  }

  # Predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1       # Number of cores to allocate. Required.
  #     memory-per-node = 512m  # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }

  # Universal context configuration. These settings can be overridden, see README.md
  context-settings {
    num-cpu-cores = 2        # Number of cores to allocate. Required.
    memory-per-node = 2G     # Executor memory per node, -Xmx style eg 512m, 1G, etc.

    # In case spark distribution should be accessed from HDFS (as opposed to being installed on every Mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"

    # URIs of Jars to be loaded into the classpath for this context.
    # Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]

    # Add settings you wish to pass directly to the sparkConf as-is such as Hadoop connection
    # settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }

  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  # home = "/home/spark/spark"
}

# Note that you can use this file to define settings not only for job server,
# but for your Spark jobs as well. Spark job configuration merges with this configuration file as defaults.

spray.can.server {
  # uncomment the next lines for making this an HTTPS example
  # ssl-encryption = on
  # path to keystore
  #keystore = "/some/path/sjs.jks"
  #keystorePW = "changeit"

  # see http://docs.oracle.com/javase/7/docs/technotes/guides/security/StandardNames.html#SSLContext for more examples
  # typical are either SSL or TLS
  encryptionType = "SSL"
  keystoreType = "JKS"
  # key manager factory provider
  provider = "SunX509"
  # ssl engine provider protocols
  enabledProtocols = ["SSLv3", "TLSv1"]

  idle-timeout = 60 s
  request-timeout = 20 s
  connecting-timeout = 5s
  pipelining-limit = 2 # for maximum performance (prevents StopReading / ResumeReading messages to the IOBridge)
  # Needed for HTTP/1.0 requests with missing Host headers
  default-host-header = "spray.io:8765"

  # Increase this in order to upload bigger job jars
  parsing.max-content-length = 300m
}

akka {
  remote.netty.tcp {
    # This controls the maximum message size, including job results, that can be sent
    # maximum-frame-size = 10 MiB
  }
}

I ran into a similar issue. The way to solve it is a bit tricky. First you need to add spark.jobserver.short-timeout to your configuration. Modify your configuration like this:
jobserver {
  port = 8090
  context-per-jvm = false
  short-timeout = 60s
  ...
}
The second (tricky) part is that you can't fix it without modifying the spark-jobserver code itself. The attribute that causes the timeout is in the class BinaryManager:
implicit val daoAskTimeout = Timeout(3 seconds)
The default is set to 3 seconds, which apparently is not enough for a big jar. You can increase it to, for example, 60 seconds, which solved the problem for me.
implicit val daoAskTimeout = Timeout(60 seconds)
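To illustrate why raising that value helps: spark-jobserver stores the uploaded binary by sending an ask to an actor, and the implicit Timeout in scope bounds how long that ask may take before it fails with AskTimeoutException. The snippet below is a minimal, self-contained sketch of that pattern, not spark-jobserver code; the actor and message are made up for illustration.
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout

import scala.concurrent.Await
import scala.concurrent.duration._

// A deliberately slow actor standing in for a DAO that persists a large jar.
class SlowStore extends Actor {
  def receive: Receive = {
    case bytes: Array[Byte] =>
      Thread.sleep(5000) // pretend writing the binary takes 5 seconds
      sender() ! s"stored ${bytes.length} bytes"
  }
}

object AskTimeoutSketch extends App {
  val system = ActorSystem("sketch")
  val store  = system.actorOf(Props[SlowStore], "store")

  // With 3 seconds (the BinaryManager default) the ask below fails with
  // AskTimeoutException; with 60 seconds it succeeds.
  implicit val daoAskTimeout: Timeout = Timeout(60.seconds)

  try {
    val reply = Await.result(store ? Array.ofDim[Byte](1024), 70.seconds)
    println(reply)
  } finally {
    system.terminate()
  }
}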

Actually, you can bring down the size of the jar quite easily: some of the dependent jars can be passed via dependent-jar-uris instead of being assembled into one big fat jar.
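For example, since the Spark libraries are already on the job server's classpath, they can be excluded from the assembly by marking them "provided" in sbt. A minimal build.sbt sketch of this idea; the artifact names and versions are illustrative, not taken from the question:
// build.sbt -- sketch of keeping the assembly small (illustrative coordinates).
// Assumes sbt-assembly 0.14.x is enabled in project/plugins.sbt, e.g.
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
name := "matching-ml"
version := "1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Provided by the Spark / job-server runtime, so excluded from the fat jar.
  "org.apache.spark" %% "spark-core"  % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.2.0" % "provided"
)

// Also leave the Scala library out of the assembly, since the cluster ships it.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
Jars that still have to travel with the job can then be listed under dependent-jar-uris in the context configuration shown above instead of being baked into the assembly.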

Related

mock outputs in Terragrunt dependency

I want to use Terragrunt to deploy this example: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/examples/complete-kubernetes-addons/main.tf
So far I was able to create the VPC/EKS resources without a problem; I separated each module into a different module directory, and everything worked as expected.
When I tried to do the same for the kubernetes-addons module, I hit an issue with the data source that calls the cluster: it fails because the cluster hasn't been created at that point.
Here's the terragrunt.hcl I'm trying to execute for this specific module:
...
terraform {
  source = "git::git@github.com:aws-ia/terraform-aws-eks-blueprints.git//modules/kubernetes-addons?ref=v4.6.1"
}

locals {
  # Extract needed variables for reuse
  cluster_version = "${include.envcommon.locals.cluster_version}"
  name            = "${include.envcommon.locals.name}"
}

dependency "eks" {
  config_path = "../eks"
  mock_outputs = {
    eks_cluster_endpoint = "https://000000000000.gr7.eu-west-3.eks.amazonaws.com"
    eks_oidc_provider    = "something"
    eks_cluster_id       = "something"
  }
}

inputs = {
  eks_cluster_id       = dependency.eks.outputs.cluster_id
  eks_cluster_endpoint = dependency.eks.outputs.eks_cluster_endpoint
  eks_oidc_provider    = dependency.eks.outputs.eks_oidc_provider
  eks_cluster_version  = local.cluster_version
  ...
}
The error that I'm getting here:
INFO[0035]
Error: error reading EKS Cluster (something): couldn't find resource

  with data.aws_eks_cluster.eks_cluster,
  on data.tf line 7, in data "aws_eks_cluster" "eks_cluster":
   7: data "aws_eks_cluster" "eks_cluster" {
The kubernetes-addons module deploys addons into an existing Kubernetes cluster. If you don't have a cluster running (and apparently you don't, since you're mocking the cluster_id output), the aws_eks_cluster data source has nothing to read, which is exactly the error you get.
You need to create the EKS cluster first; only then can you start deploying the addons.

How to reinitialize hashicorp vault

I'm working on automating a HashiCorp Vault process, and I need to repeatedly run the vault operator init command because of trial-and-error testing. I tried uninstalling Vault and installing it again, but that doesn't seem to remove the previously generated unseal keys and root token. How can I do this?
I read somewhere that I needed to delete my storage "file" path, which I already did, but it's not working (my /opt/vault/data/ directory is actually empty). Here is my vault.hcl file:
# Full configuration options can be found at
# https://www.vaultproject.io/docs/configuration

ui = true

#mlock = true
#disable_mlock = true

storage "file" {
  path = "/opt/vault/data"
}

#storage "consul" {
#  address = "127.0.0.1:8500"
#  path    = "vault"
#}

# HTTP listener
#listener "tcp" {
#  address     = "127.0.0.1:8200"
#  tls_disable = 1
#}

# HTTPS listener
listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/opt/vault/tls/tls.crt"
  tls_key_file  = "/opt/vault/tls/tls.key"
}

# Enterprise license_path
# This will be required for enterprise as of v1.8
#license_path = "/etc/vault.d/vault.hclic"

# Example AWS KMS auto unseal
#seal "awskms" {
#  region     = "us-east-1"
#  kms_key_id = "REPLACE-ME"
#}

# Example HSM auto unseal
#seal "pkcs11" {
#  lib            = "/usr/vault/lib/libCryptoki2_64.so"
#  slot           = "0"
#  pin            = "AAAA-BBBB-CCCC-DDDD"
#  key_label      = "vault-hsm-key"
#  hmac_key_label = "vault-hsm-hmac-key"
#}
Best practice for this type of setup is actually Terraform, Chef, or any other stateful transformer. That way you can bring the environment to an ideal state (terraform apply) and easily tear it down again (terraform destroy).
To re-init Vault, bring it down and delete the data folder ("/opt/vault/data" in your case), then bring up another instance.
Delete /opt/vault/data.
Reboot your computer.
(You may also need to delete the file located at ~/.vault-token.)
If you only want to do testing, why don't you run Vault in dev mode?

How to integrate PostgreSQL with Corda 3.0 or with Corda 4.0?

I tried to configure PostgreSQL as the node's database in Corda 3.0 and in Corda 4.0.
I have added the following to the build.gradle file. (Testdb1 is the database name; I have also tried with postgres.)
node {
    ...
    // this part i have added
    extraConfig = [
        jarDirs: ['path'],
        'dataSourceProperties': [
            'dataSourceClassName': 'org.postgresql.ds.PGSimpleDataSource',
            '"dataSource.url"'     : 'jdbc:postgresql://127.0.0.1:5432/Testdb1',
            '"dataSource.user"'    : 'postgres',
            '"dataSource.password"': 'admin#123'
        ],
        'database': [
            'transactionIsolationLevel': 'READ_COMMITTED'
        ]
    ]
    // till here
}
This part is in the reference.conf file:
dataSourceProperties = {
    dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
    dataSource.url = "jdbc:postgresql://127.0.0.1:5432/Testdb1"
    dataSource.user = postgres
    dataSource.password = "admin#123"
}
database = {
    transactionIsolationLevel = "READ_COMMITTED"
}
jarDirs = ["path"]
I got the following error while deploying the nodes:
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':java-source:deployNodes'.
The node-info-gen.log file showed a CAPSULE EXCEPTION. Then I updated my JDK to 8u191, but I still got the same error.
I have gone through the following to get things done; one can use these as a reference:
https://docs.corda.net/node-database.html ,
https://github.com/corda/corda/issues/4037 ,
How can the Corda node be extended to work with databases other than H2?
You need to add those properties to node.conf in each of your Corda nodes, after you run "deployNodes".
After you add these properties to the node.conf file, just run the Corda jar; it will start automatically. But before that you need to create the tables (the migration to other DBs is already described in the Corda documentation).
I have added the following to the .conf file of each node and to one reference.conf file. I have given the user postgres all the privileges mentioned in the Corda documentation:
https://docs.corda.r3.com/node-database.html
(Previously I used the postgresql-42.2.5.jar file, but that didn't work, so I used a downgraded version of it, postgresql-42.1.4.jar.
The jar files can be downloaded from https://jdbc.postgresql.org/download.html.)
After the nodes are deployed successfully, add the following:
dataSourceProperties = {
    dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
    dataSource.url = "jdbc:postgresql://127.0.0.1:5432/Testdb1"
    dataSource.user = postgres
    dataSource.password = "admin#123"
}
database = {
    transactionIsolationLevel = "READ_COMMITTED"
}
jarDirs = ["path"]
(path = the jar file's location.) After adding this configuration, run the file called runnodes.bat.

Connecting to DSE Graph running on a Docker Container results into No host found

I am running DSE Graph inside a Docker container with this command:
docker run -e DS_LICENSE=accept -p 9042:9042 --name my-dse -d datastax/dse-server -g -k -s
In my Scala code I reference it as follows:
object GrDbConnection {
  val dseCluster = DseCluster.builder()
    .addContactPoints("127.0.0.1").withPort(9042)
    .build()
  val graphName = "graphName"
  val graphOptions = new GraphOptions()
    .setGraphName(graphName)

  var graph: ScalaGraph = null
  try {
    val session = dseCluster.connect()
    // The following uses the DSE graph schema API, which is currently only supported by the string-based
    // execution interface. Eventually there will be a programmatic API for making schema changes, but until
    // then this needs to be used.
    // Create graph
    session.executeGraph("system.graph(name).ifNotExists().create()", ImmutableMap.of("name", graphName))
    // Clear the schema to drop any existing data and schema
    session.executeGraph(new SimpleGraphStatement("schema.clear()").setGraphName(graphName))
    // Note: typically you would not want to use development mode and allow scans, but it is good for convenience
    // and experimentation during development.
    // Enable development mode and allow scans
    session.executeGraph(new SimpleGraphStatement("schema.config().option('graph.schema_mode').set('development')")
      .setGraphName(graphName))
    session.executeGraph(new SimpleGraphStatement("schema.config().option('graph.allow_scan').set('true')")
      .setGraphName(graphName))
    // Create a ScalaGraph from a remote Traversal Source using withRemote
    // See: http://tinkerpop.apache.org/docs/current/reference/#connecting-via-remotegraph for more details
    val connection = DseRemoteConnection.builder(session)
      .withGraphOptions(graphOptions)
      .build()
    graph = EmptyGraph.instance().asScala
      .configure(_.withRemote(connection))
  } finally {
    dseCluster.close()
  }
}
Then in one of my controllers I use this to run a query against the DSE Graph:
def test = Action {
  val r = GrDbConnection.graph.V().count()
  print(r.iterate())
  Ok
}
This returns an error
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[NoHostAvailableException: All host(s) tried for query failed (no host was tried)]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:251)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:178)
at play.core.server.AkkaHttpServer$$anonfun$1.applyOrElse(AkkaHttpServer.scala:363)
at play.core.server.AkkaHttpServer$$anonfun$1.applyOrElse(AkkaHttpServer.scala:361)
at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:413)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:221)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:41)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:292)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:109)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:89)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:124)
at com.datastax.driver.dse.DefaultDseSession.executeGraphAsync(DefaultDseSession.java:123)
at com.datastax.dse.graph.internal.DseRemoteConnection.submitAsync(DseRemoteConnection.java:74)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.promise(RemoteStep.java:89)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:65)
It turns out all I needed to do was to remove:
finally {
dseCluster.close()
}
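That makes sense: the remote traversal source is evaluated lazily, so by the time the controller actually runs graph.V().count(), the finally block has already closed the DseCluster and the driver has no hosts left to try. Below is a minimal sketch of the same object with the premature close removed, based on the code in the question (schema-setup statements omitted; the import paths are my best guess, since the question doesn't show them). The cluster is only closed on application shutdown.
import com.datastax.driver.dse.DseCluster
import com.datastax.driver.dse.graph.GraphOptions
import com.datastax.dse.graph.internal.DseRemoteConnection
import gremlin.scala._
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph

object GrDbConnection {
  // Cluster and session stay open for the lifetime of the application.
  val dseCluster: DseCluster = DseCluster.builder()
    .addContactPoints("127.0.0.1").withPort(9042)
    .build()
  val session = dseCluster.connect()

  private val graphOptions = new GraphOptions().setGraphName("graphName")

  // Remote traversal source; note there is no try/finally around it.
  val graph: ScalaGraph = EmptyGraph.instance().asScala
    .configure(_.withRemote(
      DseRemoteConnection.builder(session)
        .withGraphOptions(graphOptions)
        .build()))

  // Call this once from an application shutdown hook, not right after
  // building the traversal source.
  def close(): Unit = dseCluster.close()
}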

Typesafe Config - reference.conf different behavior?

It seems to me that application.conf and reference.conf behave differently. I understand that reference.conf is intended as a "safe fallback" configuration that always works, while application.conf is application-specific. However, I would expect configuration loaded from either of them to be parsed in exactly the same way.
What I am facing is that when the configuration is in application.conf it works fine, and when the same file is renamed to reference.conf it doesn't work.
2015-03-30 11:35:54,603 [DEBUG] [BackEndServices-akka.actor.default-dispatcher-15] [com.ss.rg.service.ad.AdImporterServiceActor]akka.tcp://BackEndServices@127.0.0.1:2551/user/AdImporterService - Snapshot saved successfully - removing messages and snapshots up to 0 and timestamp: 1427708154584
2015-03-30 11:35:55,037 [DEBUG] [BackEndServices-akka.actor.default-dispatcher-4] [spray.can.server.HttpListener]akka.tcp://BackEndServices@127.0.0.1:2551/user/IO-HTTP/listener-0 - Binding to /0.0.0.0:8080
2015-03-30 11:35:55,054 [DEBUG] [BackEndServices-akka.actor.default-dispatcher-15] [akka.io.TcpListener]akka.tcp://BackEndServices@127.0.0.1:2551/system/IO-TCP/selectors/$a/0 - Successfully bound to /0:0:0:0:0:0:0:0:8080
2015-03-30 11:35:55,056 [INFO ] [BackEndServices-akka.actor.default-dispatcher-4] [spray.can.server.HttpListener]akka.tcp://BackEndServices@127.0.0.1:2551/user/IO-HTTP/listener-0 - Bound to /0.0.0.0:8080
Compared to:
2015-03-30 11:48:34,053 [INFO ] [BackEndServices-akka.actor.default-dispatcher-3] [Cluster(akka://BackEndServices)]Cluster(akka://BackEndServices) - Cluster Node [akka.tcp://BackEndServices@127.0.0.1:2551] - Leader is moving node [akka.tcp://BackEndServices@127.0.0.1:2551] to [Up]
2015-03-30 11:48:36,413 [DEBUG] [BackEndServices-akka.actor.default-dispatcher-15] [spray.can.server.HttpListener]akka.tcp://BackEndServices@127.0.0.1:2551/user/IO-HTTP/listener-0 - Binding to "0.0.0.0":8080
2015-03-30 11:48:36,446 [DEBUG] [BackEndServices-akka.actor.default-dispatcher-3] [akka.io.TcpListener]akka.tcp://BackEndServices@127.0.0.1:2551/system/IO-TCP/selectors/$a/0 - Bind failed for TCP channel on endpoint ["0.0.0.0":8080]: java.net.SocketException: Unresolved address
2015-03-30 11:48:36,446 [WARN ] [BackEndServices-akka.actor.default-dispatcher-15] [spray.can.server.HttpListener]akka.tcp://BackEndServices@127.0.0.1:2551/user/IO-HTTP/listener-0 - Bind to "0.0.0.0":8080 failed
The subtle difference is the double quotes. And my configuration is specified as follows:
akka {
... standard akka configuration ...
}
webserver.port = 8080
webserver.bindaddress = "0.0.0.0"
The configuration settings are loaded in code as follows:
val webserver_port_key = "webserver.port"
val webserver_bindaddress_key = "webserver.bindaddress"
protected val webserver_bindaddress = ConfigFactory.load().getString(webserver_bindaddress_key)
protected val webserver_port = ConfigFactory.load().getInt(webserver_port_key)
Did I miss something? I double-checked that port 8080 is free when reference.conf fails to bind.
Thanks for any hints.
UPDATE:
Start with log-config-on-start = on:
- When it is in application.conf:
# application.conf: 60-61
"webserver" : {
  # application.conf: 61
  "bindaddress" : "0.0.0.0",
  # application.conf: 60
  "port" : 8080
}
- When it is in reference.conf:
# reference.conf: 60-61
"webserver" : {
  # reference.conf: 61
  "bindaddress" : "0.0.0.0",
  # reference.conf: 60
  "port" : 8080
}
Issue found:
# application.properties
"webserver" : {
  # application.properties
  "bindaddress" : "\"0.0.0.0\"",
  # application.properties
  "port" : "8080"
}
It seems that the bindaddress value ends up as a different string, because it shows up differently in the logs.
In either case, enable Akka's full config printing on start with this setting in your config:
log-config-on-start = on
Then compare both configurations to see where they mismatch. They should work the same way if they are the same. I suspect that the way you define bindaddress ends up different, i.e. a plain String vs. some other value.
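The "Issue found" block above points at the culprit: the value was coming from an application.properties file, where a line is taken verbatim, so the surrounding quotes become part of the string, whereas in HOCON they are just syntax. A small sketch that reproduces the quoting difference (the keys here just mirror the question, the snippets are illustrative):
import java.util.Properties

import com.typesafe.config.ConfigFactory

object QuotingCheck extends App {
  // HOCON (.conf): the quotes are syntax, the parsed value is 0.0.0.0
  val fromConf = ConfigFactory.parseString("""webserver { bindaddress = "0.0.0.0" }""")
  println(fromConf.getString("webserver.bindaddress")) // 0.0.0.0

  // .properties: the value is taken literally, so the quotes stay in the string
  val props = new Properties()
  props.setProperty("webserver.bindaddress", "\"0.0.0.0\"")
  val fromProps = ConfigFactory.parseProperties(props)
  println(fromProps.getString("webserver.bindaddress")) // "0.0.0.0"  <- unresolvable bind address
}
Since application.conf overrides application.properties, but application.properties in turn overrides reference.conf, this would also explain why only the reference.conf variant picked up the quoted value and failed to bind.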