Is there any way to use a framework for enabling/using dependency injection in a Spark application?
Is it possible to use Guice, for instance?
If so, is there any documentation, or samples of how to do it?
I am using Scala as the implementation language, Spark 2.2, and SBT as the build tool.
At the moment, my team and I are using the Cake Pattern - it has, however, become quite verbose, and we would prefer Guice, which is more intuitive and already known to other team members.
I've been struggling with the same problem recently. Most of my findings are that you'll face issues with serialization.
I found a nice solution with Guice here:
https://www.slideshare.net/databricks/dependency-injection-in-apache-spark-applications
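For reference, here is a minimal sketch of what driver-side wiring with Guice can look like in Scala (the names SparkModule and Main are made up for illustration; the slides above cover the executor-side serialization pitfalls):
import com.google.inject.{AbstractModule, Guice, Provides, Singleton}
import org.apache.spark.sql.SparkSession

// Illustrative module: binds a singleton SparkSession so other services can @Inject it.
class SparkModule extends AbstractModule {
  override def configure(): Unit = ()

  @Provides @Singleton
  def sparkSession: SparkSession =
    SparkSession.builder()
      .appName("guice-spark-example")
      .master("local[*]")
      .getOrCreate()
}

object Main {
  def main(args: Array[String]): Unit = {
    val injector = Guice.createInjector(new SparkModule)
    val spark = injector.getInstance(classOf[SparkSession])
    // Keep injected objects on the driver; anything captured by closures that
    // run on executors must itself be serializable.
    println(spark.range(10).count())
    spark.stop()
  }
}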
Spring Boot offers integration with various systems including Spark, Hadoop, YARN, Kafka, and JDBC databases.
For example, I have this application.properties:
spring.main.web-environment=false
appName=spring-spark
sparkHome=/Users/username/Applications/spark-2.2.1-bin-hadoop2.7
masterUri=local
And this is the configuration class:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.context.support.PropertySourcesPlaceholderConfigurer;
import org.springframework.core.env.Environment;
@Configuration
@PropertySource("classpath:application.properties")
public class ApplicationConfig {

    @Autowired
    private Environment env;

    @Value("${appName:Spark Example}")
    private String appName;

    @Value("${sparkHome}")
    private String sparkHome;

    @Value("${masterUri:local}")
    private String masterUri;

    @Bean
    public SparkConf sparkConf() {
        return new SparkConf()
                .setAppName(appName)
                .setSparkHome(sparkHome)
                .setMaster(masterUri);
    }

    @Bean
    public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(sparkConf());
    }

    @Bean
    public SparkSession sparkSession() {
        SparkSession.Builder sparkBuilder = SparkSession.builder()
                .appName(appName)
                .master(masterUri)
                .sparkContext(javaSparkContext().sc());
        return sparkBuilder.getOrCreate();
    }

    @Bean
    public static PropertySourcesPlaceholderConfigurer propertySourcesPlaceholderConfigurer() {
        return new PropertySourcesPlaceholderConfigurer();
    }
}
taskContext.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- List out all tasks here -->
    <bean id="exampleSparkTask" class="com.example.spark.task.SampleSparkTask">
        <constructor-arg ref="sparkSession" />
    </bean>
</beans>
App
@SpringBootApplication
@ImportResource("classpath:taskContext.xml")
public class App {
    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }
}
And here is the Scala code that actually runs Spark:
@Order(1)
class SampleSparkTask(sparkSession: SparkSession) extends ApplicationRunner with Serializable {

  // for Spark Streaming
  @transient val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(3))

  import sparkSession.implicits._

  @throws[Exception]
  override def run(args: ApplicationArguments): Unit = {
    // Spark code here
  }
}
From there, you can define some @Autowired things.
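For example, a service like the following (WordCounter is a made-up name, and it assumes component scanning picks it up) can get the SparkSession bean through constructor injection:
import org.apache.spark.sql.SparkSession
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.stereotype.Component

// Hypothetical component: the SparkSession bean defined above is injected here.
@Component
class WordCounter @Autowired()(sparkSession: SparkSession) {
  def countLines(path: String): Long =
    sparkSession.read.textFile(path).count()
}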
Of course you can! At Qwant.com we use Spark 1.6 with Google Guice 4 and run Java programs on Hadoop YARN with the spark-submit binary.
Guice is already there if you run Spark on Hadoop (via the HDP assembly jar), so pay attention to the version you compile against and the one you actually run:
org.apache.spark:spark-yarn_2.10:1.6.3
| +--- .....
| +--- org.apache.hadoop:hadoop-yarn-server-web-proxy:2.2.0
| | +--- .....
| | +--- com.google.inject:guice:3.0 -> 4.2.2 (*)
Spark 1.6 brings Google Guice 3.0.
If you want to "force" the version of Google Guice, you must use something like this (with Gradle):
shadowJar {
relocate 'com.google.inject', 'shadow.com.google.inject'
}
https://imperceptiblethoughts.com/shadow/configuration/relocation/
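Since the question mentions SBT, the rough equivalent with the sbt-assembly plugin's shade rules would look something like this (untested sketch for build.sbt):
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.inject.**" -> "shadow.com.google.inject.@1").inAll
)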
The neutrino framework is designed exactly for this requirement.
Disclaimer: I am the author of the neutrino framework.
What is the neutrino framework
It is a Guice-based dependency injection framework for Apache Spark, designed to relieve developers of the serialization work. More specifically, it automatically handles the serialization/deserialization of DI-generated objects during object transmission and checkpoint recovery.
Example:
Here is a simple example (it just filters an event stream based on Redis data):
trait EventFilter[T] {
  def filter(t: T): Boolean
}

// The RedisEventFilter class depends on JedisCommands directly,
// and doesn't extend the `java.io.Serializable` interface.
class RedisEventFilter @Inject()(jedis: JedisCommands)
  extends EventFilter[ClickEvent] {
  override def filter(e: ClickEvent): Boolean = {
    // filter logic based on redis
  }
}

/* create injector */
val injector = ...

val eventFilter = injector.instance[EventFilter[ClickEvent]]
val eventStream: DStream[ClickEvent] = ...
eventStream.filter(e => eventFilter.filter(e))
Here is how to config the bindings:
class FilterModule(redisConfig: RedisConfig) extends SparkModule {
  override def configure(): Unit = {
    // the magic is here:
    // the method `withSerializableProxy` will generate a proxy
    // extending `EventFilter` and `java.io.Serializable` with a Scala macro.
    // The module must extend `SparkModule` or `SparkPrivateModule` to get it.
    bind[EventFilter[ClickEvent]].withSerializableProxy
      .to[RedisEventFilter].in[SingletonScope]
  }
}
With neutrino, RedisEventFilter doesn't even need to care about the serialization problem. Everything just works as if it were in a single JVM.
How does it handle the serialization problem internally
As we know, to adopt a DI framework, we first need to build a dependency graph which describes the dependency relationships between the various types. Guice uses its Module API to build the graph, while the Spring framework uses XML files or annotations.
Neutrino is built on the Guice framework and, of course, builds the dependency graph with the Guice module API. It not only keeps the graph in the driver, but also has the same graph running on every executor.
In the dependency graph, some nodes may generate objects that are passed to the executors, and the neutrino framework assigns unique ids to these nodes. Since every JVM has the same graph, the graph on each JVM has the same set of node ids.
In the example above, neutrino generates a proxy class which extends EventFilter. The proxy instance holds the node id of the graph, and it is the proxy that is passed to the executors, where the node id is used to find the corresponding node in the local graph and recreate the instance and all of its dependencies accordingly.
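Roughly, the idea can be pictured with the following sketch. Every name here is made up for explanation; it is not neutrino's actual API.
// Only the node id crosses the wire; the real object (and its dependency
// subtree) is re-resolved from the graph that already lives on each JVM.
object GraphRegistry {
  // Imagine this is populated identically on every JVM when the modules are loaded.
  private val factories = scala.collection.concurrent.TrieMap.empty[String, () => AnyRef]
  def register(nodeId: String, factory: () => AnyRef): Unit = factories.update(nodeId, factory)
  def recreate[T](nodeId: String): T = factories(nodeId)().asInstanceOf[T]
}

class SerializableProxy[T <: AnyRef](nodeId: String) extends Serializable {
  // Rebuilt lazily after deserialization on the executor.
  @transient private lazy val underlying: T = GraphRegistry.recreate[T](nodeId)
  def get: T = underlying
}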
Other features
Scopes
Since we have a graph on every executor, the object lifetime/scope on executors can be controlled with neutrino, which is impossible with the classic DI approach.
Neutrino also provides some utility scopes, such as singleton-per-JVM and a StreamingBatch scope.
Key object injection
Some key Spark objects, such as SparkContext and StreamingContext, are also injectable.
For details, please refer to the neutrino readme file.
Related
I have a Spark job that had been working well until, a few days ago, I needed to enable Kryo serialization:
spark.kryo.registrationRequired true
spark.kryo.referenceTracking true
spark.kryo.registrator org.mycompany.serialization.MyKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Now it has started to complain that it cannot find registered classes. I registered them like this:
def registerByName(kryo: Kryo, name: String) = kryo.register(Class.forName(name))
registerByName(kryo, "org.apache.spark.util.collection.BitSet")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashSet")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashSet$Hasher")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashMap")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashMap$mcJ$sp")
After this it complains with
com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.util.collection.OpenHashMap$mcJ$sp$$Lambda$1429/0x0000000800cd3840
Note: To register this class use: kryo.register(org.apache.spark.util.collection.OpenHashMap$mcJ$sp$$Lambda$1429/0x0000000800cd3840.class
But if I try to register
registerByName(kryo, "org.apache.spark.util.collection.OpenHashMap$mcJ$sp$$Lambda$1429/0x0000000800cd3840")
it throws java.lang.ClassNotFoundException
The OpenHashMap class is a private[spark] Scala generic that is used somewhere in the depths of Spark, and it seems that once Kryo is enabled, Spark offloads all serialization of its internals to Kryo. If it were my class, I would write a custom serializer, but I have no idea what I can do in my situation.
The problematic class definition https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
Recently I have been reading some Scala code which uses Guice to inject the Typesafe Config. It seems kind of magical to me how this works. My question is: how should I interpret this code? Does Guice automatically inject all the configuration values read via sbt-assembly into the Typesafe Config?
Scala code:
class FooImpl @Inject() (
  config: Config
) extends Foo {
  private val myConfig = "section.foo"
  override val batchSize = config.getInt(s"$myConfig.batchSize")
  .....
}
In Setting.scala
object Settings {
  ...
  assemblyMergeStrategy in assembly := {
    case "prod.conf" => MergeStrategy.concat
    case x =>
      val oldStrategy = (assemblyMergeStrategy in assembly).value
      oldStrategy(x)
  }
  ...
In prod.conf
section {
foo {
batchSize = 10000
...
I think you're mixing three different mechanisms here :)
@Inject is indeed Guice, and it's the final step in the process. Simply put, Guice has a "dependency injection container" that knows where to look for instances of certain types. One of the types it knows about is Config. How it knows this depends on the framework you're using (or on how you instantiate your Guice container if you're not using one);
Typesafe Config has rules on where to look for configuration. The readme sums it up pretty well, but in short: it finds application.conf in the resources folder (or, actually, anywhere on the classpath) and then pulls in all the other files that application.conf explicitly includes (using include "other_conf.conf"). I'm assuming that in your case there's an include "prod.conf" somewhere in the application.conf;
Assembly just puts all the resources from all the dependencies into one giant resource folder, with rules specifying what to do when there are multiple files with the same name. In your case, it is told to just concatenate them.
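To make the first point a bit more concrete, the binding for Config usually boils down to something like this sketch (frameworks such as Play ship an equivalent binding out of the box, so you normally don't write it yourself):
import com.google.inject.{AbstractModule, Provides, Singleton}
import com.typesafe.config.{Config, ConfigFactory}

class ConfigModule extends AbstractModule {
  override def configure(): Unit = ()

  // Loads application.conf (and whatever it includes) from the classpath
  // and exposes it as a singleton Config in the injector.
  @Provides @Singleton
  def config: Config = ConfigFactory.load()
}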
I have a Spring Boot application with Kafka and JPA in it. I am using H2 as my in-memory database. I don't want Kafka to come up for every test class execution. I have two test classes: KafkaConsumerTest and JPATest. KafkaConsumerTest is annotated with @SpringBootTest, and it loads the entire application perfectly and passes all the tests. However, for JPATest, I don't want to bring up the entire application, just the few parts of the context needed to test JPA-related changes. When I do that, it throws the following exception.
Caused by: java.lang.IllegalArgumentException: dataSource or dataSourceClassName or jdbcUrl is required.
at com.zaxxer.hikari.HikariConfig.validate(HikariConfig.java:958)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:109)
at org.eclipse.persistence.sessions.JNDIConnector.connect(JNDIConnector.java:138)
at org.eclipse.persistence.sessions.DatasourceLogin.connectToDatasource(DatasourceLogin.java:172)
at org.eclipse.persistence.internal.sessions.DatabaseSessionImpl.setOrDetectDatasource(DatabaseSessionImpl.java:233)
at org.eclipse.persistence.internal.sessions.DatabaseSessionImpl.loginAndDetectDatasource(DatabaseSessionImpl.java:815)
at org.eclipse.persistence.internal.jpa.EntityManagerFactoryProvider.login(EntityManagerFactoryProvider.java:256)
at org.eclipse.persistence.internal.jpa.EntityManagerSetupImpl.deploy(EntityManagerSetupImpl.java:769)
I am passing the datasource with jdbcUrl in my application.yml file
src/test/resources/application.yml
spring:
  datasource:
    jdbcUrl: jdbc:h2:mem:mydb
    url: jdbc:h2:mem:mydb
    driverClassName: org.h2.Driver
    username: sa
  kafka:
    bootstrap-servers: ${spring.embedded.kafka.brokers}
KafkaConsumerTest.java
@RunWith(SpringRunner.class)
@SpringBootTest(classes = Application.class)
@DirtiesContext
@EmbeddedKafka(partitions = 1,
        topics = {"${kafka.topic}"})
public class KafkaConsumerTest {
JpaTest.java
@RunWith(SpringRunner.class)
@ContextConfiguration(initializers = ConfigFileApplicationContextInitializer.class, classes = {JPAConfiguration.class})
public class NotificationServiceTest {
I tried setting the loader to AnnotationConfigContextLoader.class, but it gave me the same error. I also tried specifying application.yml explicitly using @TestPropertySource, but still got the same error.
@TestPropertySource(locations = {"classpath:application.yml"})
I think I am not loading the context properly here, and the application.yml file is not being picked up or parsed.
Any suggestions on how to resolve this?
I was able to solve the issue. The reason for it was that the Spring context was not getting loaded properly for the other tests, as I was not using @SpringBootTest. The way I bypassed the error, while also loading the Spring Boot context only once, was to create a base class like this:
@RunWith(SpringRunner.class)
@SpringBootTest(classes = Application.class)
@DirtiesContext
@EmbeddedKafka(partitions = 1,
        topics = {"${kafka.topic}"})
public abstract class AbstractSpringBootTest {
}
Now every test class has to extend this class, as in the following code. This way the Spring test context will be loaded only once, provided it doesn't get changed during the test run.
public class MyTest extends AbstractSpringBootTest {
Posting the solution which worked for me for other people's reference.
I moved my unit tests from :
class UserSpec extends PlaySpec with OneAppPerTest with BeforeAndAfter with AsyncAssertions
{
To :
class UserSpec @Inject() (implicit exec: ExecutionContext, db: DBConnectionPool)
extends PlaySpec with OneAppPerTest with BeforeAndAfter with AsyncAssertions
{
Everything was OK with the first version, but now, when I launch the tests, I get the following result:
[info] No tests were executed.
[success] Total time: 4 s, completed Dec 5, 2016 8:35:24 PM
Note that I don't really want my tests to run with the same dependencies injected in both tests and production. Thanks!
EDIT
Code available on github
You don't use constructor injection when writing Play tests with scalatest.
Instead you have access to the injector directly within the app.injector field when mixing in a server or app trait (such as your OneAppPerTest). This way you can inject a field into your test suite if you need anything from the DI graph:
val example = app.injector.instanceOf[Example]
So your initial code is the correct approach, mixed with using the injector directly. It could look similar to this:
class UserSpec extends PlaySpec with OneAppPerSuite
with BeforeAndAfter with AsyncAssertions {
implicit val exec : ExecutionContext = app.injector.instanceOf[ExecutionContext]
val db : DBConnectionPool = app.injector.instanceOf[DBConnectionPool]
// ...
}
As far as customizing your DI bindings for tests goes, you can override them by customizing your app instance via the GuiceApplicationBuilder, see Creating Application Instances for Testing and Testing with Guice.
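For instance, a sketch that swaps the production DBConnectionPool for a test double could look like this (FakeDBConnectionPool is a made-up test implementation, not something from your project):
import play.api.Application
import play.api.inject.bind
import play.api.inject.guice.GuiceApplicationBuilder

class UserSpec extends PlaySpec with OneAppPerSuite with BeforeAndAfter with AsyncAssertions {

  implicit override lazy val app: Application = new GuiceApplicationBuilder()
    .overrides(bind[DBConnectionPool].to[FakeDBConnectionPool])
    .build()

  implicit val exec: ExecutionContext = app.injector.instanceOf[ExecutionContext]
  val db: DBConnectionPool = app.injector.instanceOf[DBConnectionPool]

  // ...
}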
When you test a class which needs dependency injection, that class needs an application which injects objects into it. In tests, you have to create this application manually. Do it by adding the following line at the beginning of your test suite:
import play.api.inject.guice.GuiceApplicationBuilder
class UserSpec extends PlaySpec with OneAppPerTest with BeforeAndAfter with AsyncAssertions {
override def newAppForTest(td: TestData) = new GuiceApplicationBuilder().build()
[...]
Note that you can modify this application with a test-specific configuration. See the Play documentation for more details.
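For example, something like this (the configuration keys below are placeholders, not taken from the question's project):
override def newAppForTest(td: TestData) = new GuiceApplicationBuilder()
  .configure(
    "db.default.url"    -> "jdbc:h2:mem:test",
    "db.default.driver" -> "org.h2.Driver"
  )
  .build()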
The application I created for this case is tiny and open source. See it for more details on how I implemented this: https://github.com/gbersac/electricity_manager
I am new to Infinispan and JBoss Cache, and am trying to learn these concepts using the Infinispan documentation, but I was not successful in setting up a custom XML configuration for the cache. Can you please help me out?
I have the following Java class (the Infinispan jar files are added to the build path).
CustomCacheBean.java
package com.jboss.cache;
import java.io.IOException;
import org.infinispan.Cache;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.eviction.EvictionStrategy;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;
public class CustomCacheBean {

    public static void main(String[] args) {
        EmbeddedCacheManager manager = new DefaultCacheManager();
        manager.defineConfiguration("custom-cache", new ConfigurationBuilder().build());
        Cache<Object, Object> c = manager.getCache("custom-cache");

        try {
            c = new DefaultCacheManager("infinispan.xml").getCache("xml-configured-cache");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
And the following is my XML:
infinispan.xml (placed under the web_Content folder)
<infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="urn:infinispan:config:8.0.1 http://www.infinispan.org/schemas/infinispan-config-8.0.1.xsd"
    xmlns="urn:infinispan:config:8.0.1">
    <namedCache name="xml-configured-cache">
        <eviction strategy="LIRS" maxEntries="10" />
    </namedCache>
</infinispan>
When I try to execute the CustomCacheBean class, I get the following error.
Console :
log4j:WARN No appenders could be found for logger (infinispan.org.jboss.logging).
log4j:WARN Please initialize the log4j system properly.
Exception in thread "main" org.infinispan.commons.CacheConfigurationException: ISPN000327: Cannot find a parser for element 'infinispan' in namespace
'urn:infinispan:config:8.0.1'. Check that your configuration is up-to date for this version of Infinispan.
at org.infinispan.configuration.parsing.ParserRegistry.parseElement(ParserRegistry.java:147)
at org.infinispan.configuration.parsing.ParserRegistry.parse(ParserRegistry.java:131)
at org.infinispan.configuration.parsing.ParserRegistry.parse(ParserRegistry.java:118)
at org.infinispan.configuration.parsing.ParserRegistry.parse(ParserRegistry.java:105)
at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:271)
at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:244)
at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:231)
at com.jboss.cache.CustomCacheBean.main(CustomCacheBean.java:19)
I would recommend using Java-based configuration instead of XML. You may take a look at the tutorials:
Tutorials page
Distributed Cache (which is probably what you will need)
Please note there is a Github button at the bottom of the page (which will navigate you to the Github repository).
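For reference, the programmatic equivalent of the XML above is roughly the following (a sketch shown in Scala to stay consistent with the rest of this post; the same builder calls are available from Java):
import org.infinispan.configuration.cache.ConfigurationBuilder
import org.infinispan.eviction.EvictionStrategy
import org.infinispan.manager.DefaultCacheManager

object ProgrammaticCacheConfig {
  def main(args: Array[String]): Unit = {
    val manager = new DefaultCacheManager()
    // Define the cache programmatically instead of via infinispan.xml.
    manager.defineConfiguration(
      "xml-configured-cache",
      new ConfigurationBuilder()
        .eviction().strategy(EvictionStrategy.LIRS).maxEntries(10)
        .build()
    )
    val cache = manager.getCache[String, String]("xml-configured-cache")
    cache.put("key", "value")
    manager.stop()
  }
}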
You're probably using the wrong namespace in the XML config - namespaces don't use the micro version, so use:
xsi:schemaLocation="urn:infinispan:config:8.0 http://www.infinispan.org/schemas/infinispan-config-8.0.xsd" xmlns="urn:infinispan:config:8.0"
instead of
xsi:schemaLocation="urn:infinispan:config:8.0.1 http://www.infinispan.org/schemas/infinispan-config-8.0.1.xsd" xmlns="urn:infinispan:config:8.0.1"
Please, make sure that your IDE validates your configuration against the XSD; this can save you a lot of fuss (not only with Infinispan).