Kerberos Exception launching Spark locally - scala

I am trying to set up a Spark TestNG unit test:

@Test
def testStuff(): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  ...
}
The code fails with: IllegalArgumentException: Can't get Kerberos realm
What am I missing?

The error suggests that your JVM is unable to locate the Kerberos configuration (the krb5.conf file).
Depending on your company's environment/infrastructure, you have a few options:
Check whether your company has a standard library for setting up Kerberos authentication.
Alternatively, try one of the following:
Set the JVM property -Djava.security.krb5.conf=/file-path/for/krb5.conf
Put the krb5.conf file into the <jdk-home>/jre/lib/security folder
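For example, a minimal sketch of the first option (the file path is the placeholder from above), setting the property programmatically before the SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}
import org.testng.annotations.Test

class SparkKerberosTest {
  @Test
  def testStuff(): Unit = {
    // Point the JVM at krb5.conf before Hadoop's security code tries to resolve the realm.
    System.setProperty("java.security.krb5.conf", "/file-path/for/krb5.conf")
    val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
    // ... run assertions against sc ...
    sc.stop()
  }
}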

Related

How to query flink's queryable state

I am using flink 1.8.0 and I am trying to query my job state.
val descriptor = new ValueStateDescriptor("myState", Types.CASE_CLASS[Foo])
descriptor.setQueryable("my-queryable-State")
I used port 9067, which is the default port according to this. My client:
val client = new QueryableStateClient("127.0.0.1", 9067)
val jobId = JobID.fromHexString("d48a6c980d1a147e0622565700158d9e")
val execConfig = new ExecutionConfig
val descriptor = new ValueStateDescriptor("my-queryable-State", Types.CASE_CLASS[Foo])
val res: Future[ValueState[Foo]] =
  client.getKvState(jobId, "my-queryable-State", "a", BasicTypeInfo.STRING_TYPE_INFO, descriptor)
res.map(_.toString).pipeTo(sender)
but I am getting :
[ERROR] [06/25/2019 20:37:05.499] [bvAkkaHttpServer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(bvAkkaHttpServer)] Error during processing of request: 'org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9067'. Completing with 500 Internal Server Error response. To change default exception handling behavior, provide a custom ExceptionHandler.
java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9067
What am I doing wrong?
How and where should I define QueryableStateOptions?
If you want to use queryable state, you need to add the proper jar to your Flink installation. The jar is flink-queryable-state-runtime; it can be found in the opt folder of your Flink distribution, and you should move it to the lib folder.
As for the second question, QueryableStateOptions is just a class used to create static ConfigOption definitions. Those definitions are then used to read the configuration from the flink-conf.yaml file, so currently the only way to configure queryable state is through the flink-conf.yaml file in the Flink distribution.
EDIT: Also, try reading this; it provides more info on how queryable state works. You shouldn't connect directly to the server port; instead you should use the proxy port, which by default is 9069.
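A minimal sketch of the client side, assuming the state proxy runs on the same machine and uses the default proxy port:

import org.apache.flink.queryablestate.client.QueryableStateClient

// Connect to the queryable state proxy (default port 9069), not the state server port 9067.
val client = new QueryableStateClient("127.0.0.1", 9069)
// ...then reuse the same jobId, descriptor and getKvState call from the question.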

How to ignore spark_home when unit testing with pyspark

I am trying to write unit tests with pyspark. Tests pass with the following configuration when SPARK_HOME is NOT set. There are multiple installations of Spark on our machines, and if SPARK_HOME is set to one of them, tests fail on that machine.
@pytest.fixture(scope="session")
def spark_session(request):
    session = SparkSession\
        .builder\
        .master("local[2]")\
        .appName("pytest-pyspark-local-testing")\
        .getOrCreate()
    request.addfinalizer(lambda: session.stop())
    quiet_py4j()
    return session
I have tried os.environ["SPARK_HOME"] = "", which fails with FileNotFoundError: [Errno 2] No such file or directory: './bin/spark-submit'.
I have also tried os.unsetenv('SPARK_HOME'), which fails with Exception: Java gateway process exited before sending its port number. When I don't try to unset the env var, I get this same error as well.
How can I make sure that my tests will work on any machine, simply ignoring any environment variables?

Multiple SSH connections not allowed in an sbt session

In a multi-module Scala project I'm running several integration tests where I use scala-ssh (v. 0.8) to connect to a remote machine via SSH and transfer a file from there.
If I run an integration test once in an sbt session, everything works as expected - I can connect to the machine and download any file. The related bits of Scala code are:
private lazy val fileInventory: AnsibleYamlFileInventory = {
  val inventory = SSH(ansibleHost, HostResourceConfig()) { client =>
    client.fileTransfer { scp =>
      val tmpLocalFile = Files.createTempFile("inventory", ".yaml")
      scp.download(remoteYamlInventoryFile, tmpLocalFile.toAbsolutePath.toString)
      new AnsibleYamlFileInventory(tmpLocalFile)
    }
  }
  inventory.fold(s => throw new RuntimeException(s), identity)
}
The problem occurs if I try to run the same test (or another integration test) within the same sbt session. I get the same error message as mentioned here:
14:32:11.751 [reader] ERROR net.schmizz.sshj.transport.TransportImpl - Dying because - {}
net.schmizz.sshj.common.SSHRuntimeException: null
at net.schmizz.sshj.common.Buffer.readPublicKey(Buffer.java:432)
at net.schmizz.sshj.transport.kex.AbstractDHG.next(AbstractDHG.java:75)
at net.schmizz.sshj.transport.KeyExchanger.handle(KeyExchanger.java:367)
at net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:509)
at net.schmizz.sshj.transport.Decoder.decode(Decoder.java:107)
at net.schmizz.sshj.transport.Decoder.received(Decoder.java:175)
at net.schmizz.sshj.transport.Reader.run(Reader.java:60)
Caused by: java.security.GeneralSecurityException: java.security.spec.InvalidKeySpecException: key spec not recognised
at net.schmizz.sshj.common.KeyType$3.readPubKeyFromBuffer(KeyType.java:156)
at net.schmizz.sshj.common.Buffer.readPublicKey(Buffer.java:430)
... 6 common frames omitted
Caused by: java.security.spec.InvalidKeySpecException: key spec not recognised
at org.bouncycastle.jcajce.provider.asymmetric.util.BaseKeyFactorySpi.engineGeneratePublic(Unknown Source)
at org.bouncycastle.jcajce.provider.asymmetric.ec.KeyFactorySpi.engineGeneratePublic(Unknown Source)
at java.security.KeyFactory.generatePublic(KeyFactory.java:334)
at net.schmizz.sshj.common.KeyType$3.readPubKeyFromBuffer(KeyType.java:154)
... 7 common frames omitted
If I kill that sbt session and relaunch another one, I can again run only a single integration test before the problem reoccurs.
I have already installed the JCE 8 files as suggested. So, I'm wondering what I need to fix to get multiple tests running successfully where one after another they can ssh into that remote machine.
After some debugging I found out that the problem was due to BouncyCastle, which remains registered as a JCE provider in a follow-up test and causes problems. This shows up in the stack trace as:
INFO net.schmizz.sshj.common.SecurityUtils - BouncyCastle already registered as a JCE provider
I decided to add the security provider dynamically and remove it after the tests are done.
def doTests(): Unit = {
  import org.bouncycastle.jce.provider.BouncyCastleProvider
  import java.security.Security

  Security.addProvider(new BouncyCastleProvider)

  "Some test" should {
    "be BLABLA" in {
      assert(...) // some test
    }
  }

  "Some other test" should {
    "be BLABLABLA" in {
      assert(...) // some other test
    }
  }

  Security.removeProvider(BouncyCastleProvider.PROVIDER_NAME)
}
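An alternative sketch, assuming ScalaTest's WordSpec with BeforeAndAfterAll (class and test names are made up): registering the provider in beforeAll and removing it in afterAll means every suite in the same sbt session starts from a clean provider state.

import java.security.Security
import org.bouncycastle.jce.provider.BouncyCastleProvider
import org.scalatest.{BeforeAndAfterAll, WordSpec}

class SshIntegrationSpec extends WordSpec with BeforeAndAfterAll {

  override def beforeAll(): Unit = {
    // Register BouncyCastle only if it is not already present.
    if (Security.getProvider(BouncyCastleProvider.PROVIDER_NAME) == null) {
      Security.addProvider(new BouncyCastleProvider)
    }
  }

  override def afterAll(): Unit = {
    // Deregister it so the next suite run in the same sbt session starts clean.
    Security.removeProvider(BouncyCastleProvider.PROVIDER_NAME)
  }

  "Some test" should {
    "be BLABLA" in {
      assert(1 + 1 == 2) // placeholder assertion
    }
  }
}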

spark scala on windows machine

I am learning from a class. I have run the code as shown in the class and I get the errors below. Any idea what I should do?
I have Spark 1.6.1 and Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74).
val datadir = "C:/Personal/V2Maestros/Courses/Big Data Analytics with Spark/Scala"

//............................................................................
//// Building and saving the model
//............................................................................
val tweetData = sc.textFile(datadir + "/movietweets.csv")
tweetData.collect()

def convertToRDD(inStr: String): (Double, String) = {
  val attList = inStr.split(",")
  val sentiment = attList(0).contains("positive") match {
    case true  => 0.0
    case false => 1.0
  }
  return (sentiment, attList(1))
}

val tweetText = tweetData.map(convertToRDD)
tweetText.collect()

//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
var ttDF = sqlContext.createDataFrame(tweetText).toDF("label", "text")
ttDF.show()
The error is:
scala> ttDF.show()
[Stage 2:> (0 + 2) / 2]16/03/30 11:40:25 ERROR ExecutorClassLoader: Failed to check existence of class org.apache.spark.sql.catalyst.expressio
REPL class server at http://192.168.56.1:54595
java.net.ConnectException: Connection timed out: connect
at java.net.TwoStacksPlainSocketImpl.socketConnect(Native Method)
I'm no expert, but the connection IP in the error message looks like a private node or even your router/modem local address.
As stated in the comment, it could be that you're running the context with a wrong configuration that tries to spread the work to a cluster that isn't there, instead of running it in your local JVM process.
For further information you can read here and experiment with something like
import org.apache.spark.SparkContext
val sc = new SparkContext(master = "local[4]", appName = "tweetsClass", conf = new SparkConf)
Update
Since you're using the interactive shell and the SparkContext provided there, I guess you should pass the equivalent parameters to the shell command, as in
<your-spark-path>/bin/spark-shell --master local[4]
which instructs the driver to use a local Spark master running on 4 threads.
I think the problem comes from connectivity and not from within the code.
Check whether you can actually connect to this address and port (54595).
Probably your Spark master is not accessible at the specified port. Use local[*] to validate with a smaller dataset and a local master, as sketched below. Then check whether the port is accessible, or change it based on the Spark port configuration (http://spark.apache.org/docs/latest/configuration.html).
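A minimal sketch, assuming you build your own context rather than using the shell-provided sc: keep everything in the local JVM and bind the driver to localhost (spark.driver.host is only needed if the default bind address, e.g. 192.168.56.1, is not reachable):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("tweetsClass")
  .setMaster("local[*]")                  // run on all local cores, no external cluster involved
  .set("spark.driver.host", "127.0.0.1")  // avoid binding to an unreachable private address
val sc = new SparkContext(conf)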

How to handle a 'Configuration error[Cannot connect to database [...]]'

I am implementing a web service with the Play Framework that uses multiple databases. All databases are configured in conf/application.conf by specifying the db.database1..., db.database2... properties.
At startup, Play will try to establish connections to all configured databases, and if one connection fails, the service will not start.
In my case, not all databases are necessary to start the web service, but the web service can still run with limited functionality, if some databases are not available. Since not all databases are under my control, it is crucial for my web service to handle a connection error.
Therefore my question:
Is there a way to either
handle the connection error by overriding some 'onError' method or inserting a try-catch at the right place, or
manually create the DataSources at runtime so that I can handle the error when they are created?
I would prefer solution 2.
I am using play version 2.4.2 with scala version 2.11.7.
Since the whole exceptions fills multiple pages, I only insert the first lines here:
CreationException: Unable to create injector, see the following errors:
1) Error in custom provider, Configuration error: Configuration error[Cannot connect to database [foo]]
while locating play.api.db.DBApiProvider
while locating play.api.db.DBApi
for field at play.api.db.NamedDatabaseProvider.dbApi(DBModule.scala:80)
while locating play.api.db.NamedDatabaseProvider
at com.google.inject.util.Providers$GuicifiedProviderWithDependencies.initialize(Providers.java:149)
at play.api.db.DBModule$$anonfun$namedDatabaseBindings$1.apply(DBModule.scala:34):
Binding(interface play.api.db.Database qualified with QualifierInstance(#play.db.NamedDatabase(value=appstate)) to ProviderTarget(play.api.db.NamedDatabaseProvider#1a7884c6)) (via modules: com.google.inject.util.Modules$OverrideModule -> play.api.inject.guice.GuiceableModuleConversions$$anon$1)
Caused by: Configuration error: Configuration error[Cannot connect to database [foo]]
at play.api.Configuration$.configError(Configuration.scala:178)
at play.api.Configuration.reportError(Configuration.scala:829)
at play.api.db.DefaultDBApi$$anonfun$connect$1.apply(DefaultDBApi.scala:48)
at play.api.db.DefaultDBApi$$anonfun$connect$1.apply(DefaultDBApi.scala:42)
at scala.collection.immutable.List.foreach(List.scala:381)
at play.api.db.DefaultDBApi.connect(DefaultDBApi.scala:42)
at play.api.db.DBApiProvider.get$lzycompute(DBModule.scala:72)
I remember that there is a GlobalSettings configuration file for catching errors when the application starts.
Take a look here: https://www.playframework.com/documentation/2.0/ScalaGlobal I know you are using a more recent Play version, but it will give you a general idea of how it works.
In Play 2.4.x this file was removed in favour of dependency injection (https://www.playframework.com/documentation/2.4.x/GlobalSettings).
To solve my problem, I ended up writing my own wrapper for BoneCP. This way the initialization of the connection pools was in my hands and I could handle connection errors.
Instead of using the db prefix, I use the prefix database (so that Play will not automatically parse its content) in a new config file, database.conf.
import java.io.File
import java.sql.Connection

import com.jolbox.bonecp.{BoneCP, BoneCPConfig}
import com.typesafe.config.{ConfigFactory, ConfigObject}

object ConnectionPool {

  private var connectionPools = Map.empty[String, BoneCP]

  val config = ConfigFactory.parseFile(new File("conf/database.conf"))

  private def dbConfig(dbId: String): ConfigObject = {
    config.getObject("database." + dbId)
  }

  def createConnectionPool(dbId: String): BoneCP = {
    val dbConf = dbConfig(dbId)
    val cpConfig: BoneCPConfig = new BoneCPConfig()
    cpConfig.setJdbcUrl(dbConf.get("url").unwrapped().toString)
    cpConfig.setUsername(dbConf.get("user").unwrapped().toString)
    cpConfig.setPassword(dbConf.get("password").unwrapped().toString)
    new BoneCP(cpConfig)
  }

  def getConnectionPool(dbId: String): BoneCP = {
    if (!connectionPools.contains(dbId)) {
      val cp = createConnectionPool(dbId)
      connectionPools = connectionPools + (dbId -> cp)
    }
    connectionPools(dbId)
  }

  def getConnection(dbId: String): Connection = {
    getConnectionPool(dbId).getConnection()
  }

  def withConnection[T](dbId: String)(fun: Connection => T): T = {
    val conn = getConnection(dbId)
    try fun(conn)
    finally conn.close() // close the connection even if fun throws
  }
}
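A hypothetical usage sketch (database id and query are made up): because pools are created lazily on first use, an unreachable database surfaces as an exception you can catch, instead of preventing the whole service from starting.

import scala.util.{Failure, Success, Try}

// database.conf is expected to contain entries such as:
//   database.database1.url      = "jdbc:postgresql://host/db"
//   database.database1.user     = "user"
//   database.database1.password = "secret"
Try(ConnectionPool.withConnection("database1") { conn =>
  val rs = conn.createStatement().executeQuery("SELECT 1")
  rs.next()
  rs.getInt(1)
}) match {
  case Success(value) => println(s"database1 reachable, got $value")
  case Failure(error) => println(s"database1 unavailable, running with limited functionality: $error")
}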