How does Guice inject Typesafe Config? (Scala)

Recently I have been reading some Scala code which uses Guice to inject the Typesafe Config. It seems kind of magical to me how this works. My question is: how should I interpret this code? Does Guice automatically inject into Typesafe Config all the configuration values merged by sbt-assembly?
Scala code:
class FooImpl @Inject() (
  config: Config
) extends Foo {
  private val myConfig = "section.foo"
  override val batchSize = config.getInt(s"$myConfig.batchSize")
  .....
}
In Settings.scala
object Settings {
  ...
  assemblyMergeStrategy in assembly := {
    case "prod.conf" => MergeStrategy.concat
    case x =>
      val oldStrategy = (assemblyMergeStrategy in assembly).value
      oldStrategy(x)
  }
  ...
In prod.conf
section {
  foo {
    batchSize = 10000
    ...
  }
}

I think you're mixing three different mechanisms here :)
@Inject is indeed Guice, and it's the final step in the process. Simply put, Guice has a "dependency injection container" that knows where to look for instances of certain types. One of the types it knows about is Config. How it knows this depends on the framework you're using (or on how you instantiate your Guice container if you're not using one); a minimal sketch of such a binding is shown after the third point below.
Typesafe Config has rules on where to look for configuration. The readme sums it up pretty well, but in short: it finds application.conf in the resources folder (or, actually, anywhere on the classpath), and then merges in all the other files that application.conf explicitly includes (using include "other_conf.conf"). I'm assuming in your case there's an include "prod.conf" somewhere in application.conf.
Assembly just puts all the resources from all the dependencies into one giant resource folder, specifying rules on what to do if there are multiple files with the same name. In your case, it tells it to concatenate them.
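For the first point, here is a minimal sketch of how such a binding might be set up (the ConfigModule name and the explicit ConfigFactory.load() call are assumptions; frameworks like Play provide an equivalent binding for you out of the box):

import com.google.inject.{AbstractModule, Guice, Provides, Singleton}
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical module: makes a Config instance available to the injector,
// so constructor parameters of type Config (as in FooImpl) can be satisfied.
class ConfigModule extends AbstractModule {
  override def configure(): Unit = ()

  @Provides @Singleton
  def provideConfig(): Config = ConfigFactory.load() // reads application.conf (and its includes) from the classpath
}

An injector built with this module can then construct FooImpl, e.g. Guice.createInjector(new ConfigModule).getInstance(classOf[FooImpl]).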


Apache Spark - Is it possible to use a Dependency Injection Mechanism

Is there any possibility using a framework for enabling / using Dependency Injection in a Spark Application?
Is it possible to use Guice, for instance?
If so, is there any documentation, or samples of how to do it?
I am using Scala as the implementation language, Spark 2.2, and SBT as the build tool.
At the moment, my team and I are using the Cake Pattern - it has, however, become quite verbose, and we would prefer Guice, as it is more intuitive and already known by other team members.
I've been struggling with the same problem recently. Most of my findings are that you'll face issues with serialization.
I found a nice solution with Guice here:
https://www.slideshare.net/databricks/dependency-injection-in-apache-spark-applications
Spring Boot offers integration with various systems including Spark, Hadoop, YARN, Kafka, and JDBC databases.
For example, I have this application.properties
spring.main.web-environment=false
appName=spring-spark
sparkHome=/Users/username/Applications/spark-2.2.1-bin-hadoop2.7
masterUri=local
This is the ApplicationConfig class:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.context.support.PropertySourcesPlaceholderConfigurer;
import org.springframework.core.env.Environment;
@Configuration
@PropertySource("classpath:application.properties")
public class ApplicationConfig {
    @Autowired
    private Environment env;

    @Value("${appName:Spark Example}")
    private String appName;

    @Value("${sparkHome}")
    private String sparkHome;

    @Value("${masterUri:local}")
    private String masterUri;

    @Bean
    public SparkConf sparkConf() {
        return new SparkConf()
                .setAppName(appName)
                .setSparkHome(sparkHome)
                .setMaster(masterUri);
    }

    @Bean
    public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(sparkConf());
    }

    @Bean
    public SparkSession sparkSession() {
        SparkSession.Builder sparkBuilder = SparkSession.builder()
                .appName(appName)
                .master(masterUri)
                .sparkContext(javaSparkContext().sc());
        return sparkBuilder.getOrCreate();
    }

    @Bean
    public static PropertySourcesPlaceholderConfigurer propertySourcesPlaceholderConfigurer() {
        return new PropertySourcesPlaceholderConfigurer();
    }
}
taskContext.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
<!--List out all tasks here-->
<bean id="exampleSparkTask" class="com.example.spark.task.SampleSparkTask">
<constructor-arg ref="sparkSession" />
</bean>
</beans>
App
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.ImportResource;

@SpringBootApplication
@ImportResource("classpath:taskContext.xml")
public class App {
    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }
}
And the Scala code that actually runs Spark:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.springframework.boot.{ApplicationArguments, ApplicationRunner}
import org.springframework.core.annotation.Order

@Order(1)
class SampleSparkTask(sparkSession: SparkSession) extends ApplicationRunner with Serializable {
  // for spark streaming
  @transient val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(3))

  import sparkSession.implicits._

  @throws[Exception]
  override def run(args: ApplicationArguments): Unit = {
    // spark code here
  }
}
From there, you can define some @Autowired things.
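For instance, a rough sketch of a Spring-managed Scala component that gets the SparkSession bean injected (the WordCounter class and its logic are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.stereotype.Component

// Hypothetical component: Spring injects the SparkSession bean defined in ApplicationConfig.
@Component
class WordCounter @Autowired()(spark: SparkSession) {
  // count the lines of a text file using the injected session
  def lineCount(path: String): Long = spark.read.textFile(path).count()
}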
Of course you can! At Qwant.com we use Spark 1.6 with Google Guice 4, and run Java programs on Hadoop YARN with the spark-submit binary.
Guice is already present if you run Spark on Hadoop (via the HDP assembly jar), so pay attention to the version you compile against and the version you actually run.
org.apache.spark:spark-yarn_2.10:1.6.3
| +--- .....
| +--- org.apache.hadoop:hadoop-yarn-server-web-proxy:2.2.0
| | +--- .....
| | +--- com.google.inject:guice:3.0 -> 4.2.2 (*)
Spark 1.6 brings Google Guice 3.0.
If you want to "force" the version of Google Guice, you must use something like this (with Gradle):
shadowJar {
relocate 'com.google.inject', 'shadow.com.google.inject'
}
https://imperceptiblethoughts.com/shadow/configuration/relocation/
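If you build with sbt-assembly instead of Gradle, the equivalent relocation can presumably be done with a shade rule, e.g. (a sketch, reusing the same shadow.com.google.inject target package):

// in build.sbt, with sbt-assembly
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.inject.**" -> "shadow.com.google.inject.@1").inAll
)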
The neutrino framework is designed exactly for this requirement.
Disclaimer: I am the author of the neutrino framework.
What is the neutrino framework
It is a Guice-based dependency injection framework for Apache Spark and is designed to relieve developers of the serialization work. More specifically, it automatically handles the serialization/deserialization of DI-generated objects during object transmission and checkpoint recovery.
Example:
Here is a simple example (just filtering an event stream based on Redis data):
trait EventFilter[T] {
  def filter(t: T): Boolean
}

// The RedisEventFilter class depends on JedisCommands directly,
// and doesn't extend the `java.io.Serializable` interface.
class RedisEventFilter @Inject()(jedis: JedisCommands)
  extends EventFilter[ClickEvent] {
  override def filter(e: ClickEvent): Boolean = {
    // filter logic based on redis
  }
}
/* create injector */
val injector = ...
val eventFilter = injector.instance[EventFilter[ClickEvent]]
val eventStream: DStream[ClickEvent] = ...
eventStream.filter(e => eventFilter.filter(e))
Here is how to configure the bindings:
class FilterModule(redisConfig: RedisConfig) extends SparkModule {
  override def configure(): Unit = {
    // the magic is here
    // The method `withSerializableProxy` will generate a proxy
    // extending the `EventFilter` and `java.io.Serializable` interfaces with a Scala macro.
    // The module must extend `SparkModule` or `SparkPrivateModule` to get it.
    bind[EventFilter[ClickEvent]].withSerializableProxy
      .to[RedisEventFilter].in[SingletonScope]
  }
}
With neutrino, RedisEventFilter doesn't even need to care about the serialization problem. Everything just works as it would in a single JVM.
How does it handle the serialization problem internally
As we know, to adopt a DI framework, we first need to build a dependency graph, which describes the dependency relationships between the various types. Guice uses its Module API to build the graph, while the Spring framework uses XML files or annotations.
Neutrino is built on top of the Guice framework and, of course, builds the dependency graph with the Guice module API. It not only keeps the graph in the driver, but also has the same graph running on every executor.
In the dependency graph, some nodes may generate objects which are passed to the executors, and the neutrino framework assigns unique ids to these nodes. Since every JVM has the same graph, the graph on each JVM has the same set of node ids.
In the example above, neutrino generates a proxy class which extends EventFilter. The proxy instance holds the node id of the binding in the graph; the id is passed to the executors, which use it to find the node in their copy of the graph and recreate the instance and all its dependencies accordingly.
Other features
Scopes
Since we have a graph on every executor, the object lifetime/scope on executors can be controlled with neutrino, which is impossible with the classic DI approach.
Neutrino also provides some utility scopes, such as singleton per JVM and a StreamingBatch scope.
Key object injection
Some key Spark objects, such as SparkContext and StreamingContext, are also injectable.
For details, please refer to the neutrino readme file.

Scala Play 2.5 tests are not running since I used dependency injection

I moved my unit tests from:
class UserSpec extends PlaySpec with OneAppPerTest with BeforeAndAfter with AsyncAssertions
{
to:
class UserSpec @Inject() (implicit exec: ExecutionContext, db: DBConnectionPool)
  extends PlaySpec with OneAppPerTest with BeforeAndAfter with AsyncAssertions
{
Everything was OK with the first version, but now, when I launch the tests, I get the following result:
[info] No tests were executed.
[success] Total time: 4 s, completed Dec 5, 2016 8:35:24 PM
Note that I don't really want my tests to run with the same dependencies injected in tests as in production. Thanks!
EDIT
Code available on github
You don't use constructor injection when writing Play tests with ScalaTest.
Instead, you have access to the injector directly via the app.injector field when mixing in a server or app trait (such as your OneAppPerTest). This way you can inject a field into your test suite if you need anything from the DI graph:
val example = app.injector.instanceOf[Example]
So your initial code is the correct approach, mixed with using the injector directly. It could look similar to this:
class UserSpec extends PlaySpec with OneAppPerSuite
with BeforeAndAfter with AsyncAssertions {
implicit val exec : ExecutionContext = app.injector.instanceOf[ExecutionContext]
val db : DBConnectionPool = app.injector.instanceOf[DBConnectionPool]
// ...
}
As far as customizing your DI bindings for tests goes, you can override them by customizing your app instance via the GuiceApplicationBuilder, see Creating Application Instances for Testing and Testing with Guice.
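A rough sketch of what overriding a binding for tests could look like (MockDBConnectionPool is a hypothetical test double):

import play.api.Application
import play.api.inject.bind
import play.api.inject.guice.GuiceApplicationBuilder

// Build an application whose injector uses a test implementation of the pool.
val testApp: Application = new GuiceApplicationBuilder()
  .overrides(bind[DBConnectionPool].to[MockDBConnectionPool]) // MockDBConnectionPool is hypothetical
  .build()

val db = testApp.injector.instanceOf[DBConnectionPool]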
When you test a class which needs dependency injection, that class needs an application which injects objects into it. In tests, you have to create this application manually. Do it by adding the following lines at the beginning of your test suite:
import play.api.inject.guice.GuiceApplicationBuilder
class UserSpec extends PlaySpec with OneAppPerTest with BeforeAndAfter with AsyncAssertions {
override def newAppForTest(td: TestData) = new GuiceApplicationBuilder().build()
[...]
Note that you can modify this application with a special configuration for tests. See the Play documentation for more details.
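For instance, a sketch of passing test-only configuration this way (the db.default.url key and value are just assumed examples; adjust them to your own conf):

override def newAppForTest(td: TestData) = new GuiceApplicationBuilder()
  .configure("db.default.url" -> "jdbc:h2:mem:test") // assumed key/value for illustration
  .build()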
The application I created for this case is tiny and open source. See it for more details on how I implemented this: https://github.com/gbersac/electricity_manager

Scala slf4j dynamic file name

I just successfully added the Grizzled-SLF4J logger to my project using this link: http://alvinalexander.com/scala/how-to-log-output-file-grizzled-slf4j-scala-simplelogger.properties
But with these properties there is no option to create a dynamic file name:
org.slf4j.simpleLogger.logFile = /tmp/myapp.log
org.slf4j.simpleLogger.defaultLogLevel = info
org.slf4j.simpleLogger.showDateTime = true
org.slf4j.simpleLogger.dateTimeFormat = yyyy'/'MM'/'dd' 'HH':'mm':'ss'-'S
org.slf4j.simpleLogger.showThreadName = true
org.slf4j.simpleLogger.showLogName = true
org.slf4j.simpleLogger.showShortLogName= false
org.slf4j.simpleLogger.levelInBrackets = true
Is there any other logger for Scala projects that allows me to add a dynamic file name, or how can I do this using this library (I see it is just a wrapper for slf4j)?
The slf4j library is really an interface to some underlying logging implementation. You would have log4j, logback or some other logging implementation do the heavy lifting, with an adapter jar, as explained in the slf4j documentation.
You would then provide the details in the properties file for log4j for instance, where you can bind in dynamically constructed file names.
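One way this can work (a sketch, assuming log4j 1.x as the backing implementation and a file appender line in log4j.properties such as log4j.appender.FILE.File=${logfile.name}): set the property programmatically before the first logger is created, so the appender picks up the dynamically built name.

import java.time.LocalDate
import org.slf4j.LoggerFactory

object AppLogging {
  // Must run before any logger is created; log4j substitutes ${logfile.name}
  // in log4j.properties with this system property value.
  System.setProperty("logfile.name", s"/tmp/myapp-${LocalDate.now}.log")

  val logger = LoggerFactory.getLogger("MyApp")
}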

How to suppress Spark logging in unit tests?

So thanks to easily googleable blogs I tried:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.specs2.mutable.Specification

class SparkEngineSpecs extends Specification {
  sequential

  def setLogLevels(level: Level, loggers: Seq[String]): Map[String, Level] = loggers.map(loggerName => {
    val logger = Logger.getLogger(loggerName)
    val prevLevel = logger.getLevel
    logger.setLevel(level)
    loggerName -> prevLevel
  }).toMap

  setLogLevels(Level.WARN, Seq("spark", "org.eclipse.jetty", "akka"))

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Test Spark Engine"))

  // ... my unit tests
Unfortunately it doesn't work; I still get a lot of Spark output, e.g.:
14/12/02 12:01:56 INFO MemoryStore: Block broadcast_4 of size 4184 dropped from memory (free 583461216)
14/12/02 12:01:56 INFO ContextCleaner: Cleaned broadcast 4
14/12/02 12:01:56 INFO ContextCleaner: Cleaned shuffle 4
14/12/02 12:01:56 INFO ShuffleBlockManager: Deleted all files for shuffle 4
Add the following to the log4j.properties file inside the src/test/resources dir; create the file/dir if they do not exist:
# Change this to set Spark log level
log4j.logger.org.apache.spark=WARN
# Silence akka remoting
log4j.logger.Remoting=WARN
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
When I run my unit tests (I'm using JUnit and Maven), I only receive WARN level logs; in other words, no more clutter from INFO level logs (though they can be useful at times for debugging).
I hope this helps.
In my case, one of my own libraries brought logback-classic into the mix. This materialized as a warning at startup:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/alex/.ivy2/cache/ch.qos.logback/logback-classic/jars/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/alex/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
I solved this by excluding it from the dependency:
"com.mystuff" % "mylib" % "1.0.0" exclude("ch.qos.logback", "logback-classic")
Now I could add a log4j.properties file in test/resources, which gets picked up by Spark.
After struggling with Spark log output for some time as well, I found a blog post with a solution I particularly liked.
If you use slf4j, you can simply exchange the underlying log implementation. A good candidate for the test scope is slf4j-nop, which carefully takes the log output and puts it where the sun never shines.
When using Maven you can add the following to the top of your dependencies list:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.12</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-nop</artifactId>
<version>1.7.12</version>
<scope>test</scope>
</dependency>
Note that it might be important to have it at the beginning of the dependencies list to make sure that these implementations are used instead of those that might come with other packages (and which you might consider excluding in order to keep your classpath tidy and avoid unexpected conflicts).
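If your build is sbt rather than Maven (as in many Scala projects), the equivalent would presumably be:

// in build.sbt
libraryDependencies ++= Seq(
  "org.slf4j" % "slf4j-api" % "1.7.12" % "provided",
  "org.slf4j" % "slf4j-nop" % "1.7.12" % "test"
)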
You can use a separate Logback config for tests. Depending on your environment it's possible that you just need to create conf/logback-test.xml with something that hides the logs. I think this should do that:
<configuration>
<root level="debug">
</root>
</configuration>
As I understand it, this captures all logs (level debug and higher) and attaches no appender to them, so they get discarded. A better option is to configure a file appender for them, so you can still access the logs if you want to.
See http://logback.qos.ch/manual/configuration.html for the detailed documentation.
A little late to the party, but I found this in the Spark example code:
import org.apache.log4j.{Level, Logger}

// logInfo comes from Spark's Logging trait
def setStreamingLogLevels() {
  val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
  if (!log4jInitialized) {
    // We first log something to initialize Spark's default logging, then we override the
    // logging level.
    logInfo("Setting log level to [WARN] for streaming example." +
      " To override add a custom log4j.properties to the classpath.")
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}
I also found that with your code, calling setLogLevels like below cut out a lot of the output for me.
setLogLevels(Level.WARN, Seq("spark", "org", "akka"))
The easiest solution working for me is:
cp $SPARK_HOME/conf/log4j.properties.template $YOUR_PROJECT/src/test/resources/log4j.properties
sed -i -e 's/log4j.rootCategory=INFO/log4j.rootCategory=WARN/g' $YOUR_PROJECT/src/test/resources/log4j.properties

Using spark to access HDFS failed

I am using Cloudera 4.2.0 and Spark.
I just want to try out some examples given by Spark.
// HdfsTest.scala
package spark.examples
import spark._
object HdfsTest {
def main(args: Array[String]) {
val sc = new SparkContext(args(0), "HdfsTest",
System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
val file = sc.textFile("hdfs://n1.example.com/user/cloudera/data/navi_test.csv")
val mapped = file.map(s => s.length).cache()
for (iter <- 1 to 10) {
val start = System.currentTimeMillis()
for (x <- mapped) { x + 2 }
// println("Processing: " + x)
val end = System.currentTimeMillis()
println("Iteration " + iter + " took " + (end-start) + " ms")
}
System.exit(0)
}
}
It compiles fine, but there are always some runtime problems:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.fs.DelegationTokenRenewer.<init>(Ljava/lang/Class;)V from class org.apache.hadoop.hdfs.HftpFileSystem
at java.util.ServiceLoader.fail(ServiceLoader.java:224)
at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2229)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2240)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2257)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:86)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2296)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
at spark.SparkContext.hadoopFile(SparkContext.scala:263)
at spark.SparkContext.textFile(SparkContext.scala:235)
at spark.examples.HdfsTest$.main(HdfsTest.scala:9)
at spark.examples.HdfsTest.main(HdfsTest.scala)
Caused by: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.fs.DelegationTokenRenewer.<init>(Ljava/lang/Class;)V from class org.apache.hadoop.hdfs.HftpFileSystem
at org.apache.hadoop.hdfs.HftpFileSystem.<clinit>(HftpFileSystem.java:84)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at java.lang.Class.newInstance0(Class.java:374)
at java.lang.Class.newInstance(Class.java:327)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
... 16 more
I have searched on Google, but have found nothing about this kind of exception for Spark and HDFS.
val file = sc.textFile("hdfs://n1.example.com/user/cloudera/data/navi_test.csv") is where the problem occurs.
13/04/04 12:20:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
And I got this warning. Maybe I should add some Hadoop paths to the CLASSPATH.
Feel free to give any clue. =)
Thank you all.
REN Hao
(This question was also asked / answered on the spark-users mailing list).
You need to compile Spark against the particular version of Hadoop/HDFS running on your cluster. From the Spark documentation:
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the HDFS protocol has changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs. You can change the version by setting the HADOOP_VERSION variable at the top of project/SparkBuild.scala, then rebuilding Spark (sbt/sbt clean compile).
The spark-users mailing list archives contain several questions about compiling against specific Hadoop versions, so I would search there if you run into any problems when building Spark.
You can set Cloudera's Hadoop version with an environment variable when building Spark; look up your exact artifact version in Cloudera's Maven repo. It should be this:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 sbt/sbt assembly publish-local
Make sure you run whatever you run with the same Java engine you use to build Spark. Also, there are pre-built Spark packages for different Cloudera Hadoop distributions, like http://spark-project.org/download/spark-0.8.0-incubating-bin-cdh4.tgz
This might be a problem related to the Java installed on your system. Hadoop requires (Sun) Java 1.6+.
Make sure you have:
JAVA_HOME="/usr/lib/jvm/java-6-sun"