Spark Kryo Serialization issue - scala

I have a Spark job that had been working well until, a few days ago, I needed to enable Kryo serialization:
spark.kryo.registrationRequired true
spark.kryo.referenceTracking true
spark.kryo.registrator org.mycompany.serialization.MyKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
Now it complains that it cannot find registered classes. I registered them like this:
def registerByName(kryo: Kryo, name: String) = kryo.register(Class.forName(name))
registerByName(kryo, "org.apache.spark.util.collection.BitSet")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashSet")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashSet$Hasher")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashMap")
registerByName(kryo, "org.apache.spark.util.collection.OpenHashMap$mcJ$sp")
After this it fails with:
com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.util.collection.OpenHashMap$mcJ$sp$$Lambda$1429/0x0000000800cd3840
Note: To register this class use: kryo.register(org.apache.spark.util.collection.OpenHashMap$mcJ$sp$$Lambda$1429/0x0000000800cd3840.class
But if I try to register
registerByName(kryo, "org.apache.spark.util.collection.OpenHashMap$mcJ$sp$$Lambda$1429/0x0000000800cd3840")
it throws a java.lang.ClassNotFoundException.
OpenHashMap is a private[spark], generic class used somewhere deep inside Spark, and it seems that once Kryo is enabled, Spark offloads all serialization of its internals to Kryo as well. If it were my own class I would write a custom serializer, but I have no idea what I can do in this situation.
The problematic class definition: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
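A possible way forward (a sketch, not a confirmed fix): the $$Lambda$1429/0x... name is generated at runtime for a synthetic lambda class, so it can never be registered by name. With spark.kryo.registrationRequired enabled, a common workaround is to register Kryo's closure-support classes in the registrator instead; the class names below are assumptions based on the Kryo bundled with recent Spark releases and may need adjusting.

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.ClosureSerializer
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    def registerByName(name: String): Unit = kryo.register(Class.forName(name))

    registerByName("org.apache.spark.util.collection.BitSet")
    registerByName("org.apache.spark.util.collection.OpenHashSet")
    registerByName("org.apache.spark.util.collection.OpenHashSet$Hasher")
    registerByName("org.apache.spark.util.collection.OpenHashMap")
    registerByName("org.apache.spark.util.collection.OpenHashMap$mcJ$sp")

    // Synthetic lambda classes ($$Lambda$NNNN/0x...) get runtime-generated names,
    // so they cannot be registered individually. Registering the closure-related
    // classes below is the usual Kryo-level workaround (assumed class names).
    kryo.register(classOf[java.lang.invoke.SerializedLambda])
    kryo.register(classOf[ClosureSerializer.Closure], new ClosureSerializer())
  }
}

If that still isn't enough for other runtime-generated internals, the usual fallback is to relax spark.kryo.registrationRequired to false, which keeps the Kryo serializer but drops the requirement that every class be registered up front.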

Related

SpringBootTest + JPA + Kafka - context is not loading properly during testing

I have a Spring Boot application with Kafka and JPA in it, and I am using H2 as my in-memory database. I don't want Kafka to come up for every test class execution. I have two test classes: KafkaConsumerTest and JPATest. KafkaConsumerTest is annotated with @SpringBootTest and it loads the entire application and passes all the tests. However, for JPATest I don't want to bring up the entire application, just the few beans needed to test the JPA-related changes. When I do that, it throws the following exception:
Caused by: java.lang.IllegalArgumentException: dataSource or dataSourceClassName or jdbcUrl is required.
at com.zaxxer.hikari.HikariConfig.validate(HikariConfig.java:958)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:109)
at org.eclipse.persistence.sessions.JNDIConnector.connect(JNDIConnector.java:138)
at org.eclipse.persistence.sessions.DatasourceLogin.connectToDatasource(DatasourceLogin.java:172)
at org.eclipse.persistence.internal.sessions.DatabaseSessionImpl.setOrDetectDatasource(DatabaseSessionImpl.java:233)
at org.eclipse.persistence.internal.sessions.DatabaseSessionImpl.loginAndDetectDatasource(DatabaseSessionImpl.java:815)
at org.eclipse.persistence.internal.jpa.EntityManagerFactoryProvider.login(EntityManagerFactoryProvider.java:256)
at org.eclipse.persistence.internal.jpa.EntityManagerSetupImpl.deploy(EntityManagerSetupImpl.java:769)
I am passing the datasource jdbcUrl in my application.yml file:
src/test/resources/application.yml
spring:
  datasource:
    jdbcUrl: jdbc:h2:mem:mydb
    url: jdbc:h2:mem:mydb
    driverClassName: org.h2.Driver
    username: sa
  kafka:
    bootstrap-servers: ${spring.embedded.kafka.brokers}
KafkaConsumerTest.java
@RunWith(SpringRunner.class)
@SpringBootTest(classes = Application.class)
@DirtiesContext
@EmbeddedKafka(partitions = 1,
               topics = {"${kafka.topic}"})
public class KafkaConsumerTest {
JpaTest.java
@RunWith(SpringRunner.class)
@ContextConfiguration(initializers = ConfigFileApplicationContextInitializer.class, classes = {JPAConfiguration.class})
public class NotificationServiceTest {
I tried setting the loader to AnnotationConfigContextLoader.class, but it gave me the same error. I also tried specifying application.yml explicitly using @TestPropertySource, but still the same error.
@TestPropertySource(locations = {"classpath:application.yml"})
I think the context is not being loaded properly here, and the values from application.yml are not being picked up or parsed.
Any suggestions on how to resolve this?
I was able to solve the issue. The cause was that the Spring context was not being loaded properly for the other tests because I was not using @SpringBootTest. To get around the error while still loading the Spring Boot context only once, I created a base class like this:
@RunWith(SpringRunner.class)
@SpringBootTest(classes = Application.class)
@DirtiesContext
@EmbeddedKafka(partitions = 1,
               topics = {"${kafka.topic}"})
public abstract class AbstractSpringBootTest {
}
Now every test class has to extend this base class, as in the following code. This way the Spring context is loaded only once, provided it doesn't get dirtied during the test run.
public class MyTest extends AbstractSpringBootTest {
Posting the solution that worked for me for other people's reference.

Apache Spark - Is it possible to use a Dependency Injection Mechanism

Is there any possibility of using a framework for enabling/using Dependency Injection in a Spark application?
Is it possible to use Guice, for instance?
If so, is there any documentation, or samples of how to do it?
I am using Scala as the implementation language, Spark 2.2, and SBT as the build tool.
At the moment, my team and I are using the Cake Pattern; however, it has become quite verbose, and we would prefer Guice, which is more intuitive and already known to the other team members.
I've been struggling with the same problem recently. My main finding is that you'll face serialization issues.
I found a nice solution with Guice here:
https://www.slideshare.net/databricks/dependency-injection-in-apache-spark-applications
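To give a rough idea of the shape this takes (a minimal sketch, not taken from the slides; Greeter, SimpleGreeter, AppModule and Job are made-up names): the injector lives on the driver, and only small, serializable collaborators end up inside Spark closures.

import com.google.inject.{AbstractModule, Guice, Inject}
import org.apache.spark.sql.SparkSession

// Hypothetical service wired by Guice on the driver.
trait Greeter extends Serializable { def greet(name: String): String }
class SimpleGreeter extends Greeter { def greet(name: String): String = s"Hello, $name" }

class AppModule extends AbstractModule {
  override def configure(): Unit =
    bind(classOf[Greeter]).to(classOf[SimpleGreeter])
}

class Job @Inject()(greeter: Greeter) {
  def run(spark: SparkSession): Unit = {
    import spark.implicits._
    // Copy the dependency into a local val so the closure captures only the
    // Serializable Greeter, not the whole (non-serializable) Job instance.
    val g = greeter
    Seq("Alice", "Bob").toDS().map(g.greet).show()
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("guice-spark-sketch")
      .master("local[*]")
      .getOrCreate()
    val injector = Guice.createInjector(new AppModule)
    injector.getInstance(classOf[Job]).run(spark)
    spark.stop()
  }
}

The serialization issues show up as soon as an injected object that isn't Serializable gets captured by a closure; that is exactly the part the slides (and frameworks like neutrino, below) focus on.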
Spring Boot offers integration with various systems, including Spark, Hadoop, YARN, Kafka, and JDBC databases.
For example, I have this application.properties:
spring.main.web-environment=false
appName=spring-spark
sparkHome=/Users/username/Applications/spark-2.2.1-bin-hadoop2.7
masterUri=local
This is the application configuration class:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
import org.springframework.context.support.PropertySourcesPlaceholderConfigurer;
import org.springframework.core.env.Environment;
@Configuration
@PropertySource("classpath:application.properties")
public class ApplicationConfig {

    @Autowired
    private Environment env;

    @Value("${appName:Spark Example}")
    private String appName;

    @Value("${sparkHome}")
    private String sparkHome;

    @Value("${masterUri:local}")
    private String masterUri;

    @Bean
    public SparkConf sparkConf() {
        return new SparkConf()
                .setAppName(appName)
                .setSparkHome(sparkHome)
                .setMaster(masterUri);
    }

    @Bean
    public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(sparkConf());
    }

    @Bean
    public SparkSession sparkSession() {
        SparkSession.Builder sparkBuilder = SparkSession.builder()
                .appName(appName)
                .master(masterUri)
                .sparkContext(javaSparkContext().sc());
        return sparkBuilder.getOrCreate();
    }

    @Bean
    public static PropertySourcesPlaceholderConfigurer propertySourcesPlaceholderConfigurer() {
        return new PropertySourcesPlaceholderConfigurer();
    }
}
taskContext.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- List out all tasks here -->
    <bean id="exampleSparkTask" class="com.example.spark.task.SampleSparkTask">
        <constructor-arg ref="sparkSession" />
    </bean>
</beans>
App
@SpringBootApplication
@ImportResource("classpath:taskContext.xml")
public class App {
    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }
}
And here is the Scala code that actually runs the Spark job:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.springframework.boot.{ApplicationArguments, ApplicationRunner}
import org.springframework.core.annotation.Order

@Order(1)
class SampleSparkTask(sparkSession: SparkSession) extends ApplicationRunner with Serializable {
  // for spark streaming
  @transient val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(3))

  import sparkSession.implicits._

  @throws[Exception]
  override def run(args: ApplicationArguments): Unit = {
    // spark code here
  }
}
From there, you can define whatever @Autowired beans you need.
Of course you can! At Qwant.com we use Spark 1.6 with Google Guice 4 and run Java programs on Hadoop YARN via the spark-submit binary.
Guice is already present if you run Spark on Hadoop (it comes in via the HDP assembly jar), so pay attention to the version you compile against versus the one you actually run.
org.apache.spark:spark-yarn_2.10:1.6.3
| +--- .....
| +--- org.apache.hadoop:hadoop-yarn-server-web-proxy:2.2.0
| | +--- .....
| | +--- com.google.inject:guice:3.0 -> 4.2.2 (*)
Spark 1.6 brings Google Guice 3.0.
If you want to "force" the version of Google Guice, you must use something like this (with Gradle):
shadowJar {
    relocate 'com.google.inject', 'shadow.com.google.inject'
}
https://imperceptiblethoughts.com/shadow/configuration/relocation/
The neutrino framework is designed exactly for this requirement.
Disclaimer: I am the author of the neutrino framework.
What is the neutrino framework
It is a Guice-based dependency injection framework for Apache Spark, designed to take the serialization work off the developer. More specifically, it automatically handles the serialization/deserialization of DI-generated objects during object transmission and checkpoint recovery.
Example:
Here is a simple example (it just filters an event stream based on Redis data):
trait EventFilter[T] {
  def filter(t: T): Boolean
}

// The RedisEventFilter class depends on JedisCommands directly,
// and doesn't extend `java.io.Serializable` interface.
class RedisEventFilter @Inject()(jedis: JedisCommands)
  extends EventFilter[ClickEvent] {
  override def filter(e: ClickEvent): Boolean = {
    // filter logic based on redis
  }
}
/* create injector */
val injector = ...
val eventFilter = injector.instance[EventFilter[ClickEvent]]
val eventStream: DStream[ClickEvent] = ...
eventStream.filter(e => eventFilter.filter(e))
Here is how to configure the bindings:
class FilterModule(redisConfig: RedisConfig) extends SparkModule {
  override def configure(): Unit = {
    // the magic is here
    // The method `withSerializableProxy` will generate a proxy
    // extending `EventFilter` and `java.io.Serializable` interfaces with Scala macro.
    // The module must extend `SparkModule` or `SparkPrivateModule` to get it
    bind[EventFilter[ClickEvent]].withSerializableProxy
      .to[RedisEventFilter].in[SingletonScope]
  }
}
With neutrino, RedisEventFilter doesn't even have to care about the serialization problem. Everything just works as if it were running in a single JVM.
How does it handle the serialization problem internally
As we know, to adopt a DI framework we first need to build a dependency graph, which describes the dependency relationships between the various types. Guice uses its Module API to build the graph, while the Spring framework uses XML files or annotations.
Neutrino is built on top of the Guice framework and, of course, builds the dependency graph with the Guice module API. It doesn't just keep the graph on the driver; the same graph also runs on every executor.
In the dependency graph, some nodes may produce objects that get passed to the executors, and the neutrino framework assigns unique IDs to these nodes. Since every JVM has the same graph, each graph has the same set of node IDs.
In the example above, neutrino generates a proxy class that extends EventFilter. The proxy instance holds the node ID of the binding in the graph; it is passed to the executors, which use that ID to find the node in their own copy of the graph and recreate the instance and all its dependencies accordingly.
Other features
Scopes
Since there is a graph on every executor, object lifetime/scope on the executors can be controlled with neutrino, which is impossible with the classic DI approach.
Neutrino also provides some utility scopes, such as singleton-per-JVM and a StreamingBatch scope.
Key object injection
Some key Spark objects, such as SparkContext and StreamingContext, are also injectable.
For details, please refer to the neutrino readme file.

No Runnable Methods and Test Class Should Have Exactly One Public Constructor Exception

I can't seem to shake this error. The Eclipse runner appears to find the @Test annotations; however, BlockJUnit4ClassRunner's validateConstructor and validateInstanceMethods can't see them.
I have read that, in this case, Test.class has been loaded by the system ClassLoader (i.e. the one that loaded JUnitCore), so technically none of the test methods will appear to be annotated with that annotation.
The suggested solution is to load JUnitCore in the same ClassLoader as the tests themselves.
I've tried loading JUnitCore in the same ClassLoader but couldn't work out the exact configuration. I have also tried an external runner along with a try/catch to swallow the error. Nothing appears to work.
How would I:
get the JUnit runner to identify the test class,
or
possibly ignore this error in a try/catch,
or
implement the correct structure for the ClassLoader solution?
I've attached the code in the git repository referenced below; bhhc.java throws the exception upon execution.
Supporting data:
SRC: https://github.com/agolem2/MavenJunit/blob/master/src/Test/bhhc.java
ERROR:
Google
Completed Test Case bhhc Date: 2017-09-26 08_57_17.318
Test run was NOT successful:
No runnable methods - Trace: java.lang.Exception: No runnable methods
org.junit.runners.BlockJUnit4ClassRunner.validateInstanceMethods(BlockJUnit4ClassRunner.java:191)
org.junit.runners.BlockJUnit4ClassRunner.collectInitializationErrors(BlockJUnit4ClassRunner.java:128)
Test class should have exactly one public constructor - Trace:
java.lang.Exception: Test class should have exactly one public constructor
org.junit.runners.BlockJUnit4ClassRunner.validateOnlyOneConstructor(BlockJUnit4ClassRunner.java:158)
org.junit.runners.BlockJUnit4ClassRunner.validateConstructor(BlockJUnit4ClassRunner.java:147)

How do I configure Guice in Play Framework to require an @Inject annotation for empty constructors?

I'd like to prevent Guice in Play Framework (2.5.x) from injecting classes that don't have an @Inject() annotation on the constructor. Guice provides the requireAtInjectOnConstructors configuration option on Binder, which achieves this.
The Play documentation describes two ways to customize Guice:
Programmatic Bindings
Extending the GuiceApplicationLoader
I first tried customizing the Binder in my Module's configure method:
import com.google.inject.AbstractModule

class Module extends AbstractModule {
  override def configure() = {
    binder().requireAtInjectOnConstructors()
  }
}
When I then run the application I see the following errors:
! @7115jkk8o - Internal server error, for (GET) [/] ->
play.api.UnexpectedException: Unexpected exception[CreationException: Unable to create injector, see the following errors:
1) Explicit @Inject annotations are required on constructors, but play.api.inject.DefaultApplicationLifecycle has no constructors annotated with @Inject.
at play.api.inject.BuiltinModule.bindings(BuiltinModule.scala:42):
Binding(class play.api.inject.DefaultApplicationLifecycle to self) (via modules: com.google.inject.util.Modules$OverrideModule -> play.api.inject.guice.GuiceableModuleConversions$$anon$1)
2) Explicit @Inject annotations are required on constructors, but play.api.http.NoHttpFilters has no constructors annotated with @Inject.
at play.utils.Reflect$.bindingsFromConfiguration(Reflect.scala:53):
Binding(interface play.api.http.HttpFilters to ConstructionTarget(class play.api.http.NoHttpFilters)) (via modules: com.google.inject.util.Modules$OverrideModule -> play.api.inject.guice.GuiceableModuleConversions$$anon$1)
When I extend the GuiceApplicationLoader to set this configuration on the builder:
import play.api.ApplicationLoader
import play.api.inject.guice.{ GuiceApplicationBuilder, GuiceApplicationLoader }

class CustomApplicationLoader extends GuiceApplicationLoader {
  override def builder(context: ApplicationLoader.Context): GuiceApplicationBuilder = {
    initialBuilder.requireAtInjectOnConstructors()
  }
}
I see the same errors with this approach.
The error messages indicate to me that the Play Framework itself relies on Guice injecting classes with empty constructors that are not annotated with @Inject().
Is there a way I can configure Guice to require the @Inject() annotation for classes in my own application without affecting the Play Framework's Guice configuration?
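For context, this is what requireAtInjectOnConstructors enforces in plain Guice, outside Play (a minimal sketch with made-up class names; it reproduces the same kind of CreationException as above, not the Play-specific wiring):

import com.google.inject.{AbstractModule, Guice, Inject}

class Dep
class NoAnnotation                    // empty constructor, not annotated with @Inject
class Annotated @Inject()(dep: Dep)   // constructor explicitly annotated

class StrictModule extends AbstractModule {
  override def configure(): Unit = {
    binder().requireAtInjectOnConstructors()
    bind(classOf[Dep]).toInstance(new Dep) // instance bindings need no @Inject
    bind(classOf[Annotated])               // fine: constructor is annotated
    bind(classOf[NoAnnotation])            // fails: no @Inject on the constructor
  }
}

object Demo extends App {
  // Throws com.google.inject.CreationException with the same kind of message
  // as above: "Explicit @Inject annotations are required on constructors,
  // but NoAnnotation has no constructors annotated with @Inject."
  Guice.createInjector(new StrictModule)
}

Play's own built-in bindings (DefaultApplicationLifecycle, NoHttpFilters, ...) end up in the same injector as the application's bindings, which is why they trip the check in the errors shown above.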

Serialization exception in HazelCast 3.5 with Scala

I am using Hazelcast 3.5 with Scala.
I have a case class Abc, and I am trying to store an object of this class in Hazelcast from my client, but it gives me a serialization exception.
Here is my class:
@SerialVersionUID(1)
case class Abc(id: Int, name: String, subjectCode: MutableList[Int]) extends Serializable
When I run the client code, it gives me the following exception:
18:33:43.274 [hz._hzInstance_1_dev.partition-operation.thread-1] ERROR c.h.map.impl.operation.PutOperation - [192.168.15.20]:5701 [dev] [3.5] java.lang.ClassNotFoundException: scala.collection.mutable.MutableList
com.hazelcast.nio.serialization.HazelcastSerializationException: java.lang.ClassNotFoundException: scala.collection.mutable.MutableList
at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:201) ~[hazelcast-3.5.jar:3.5]
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:41) ~[hazelcast-3.5.jar:3.5]
at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:276) ~[hazelcast-3.5.jar:3.5]
at com.hazelcast.map.impl.mapstore.AbstractMapDataStore.toObject(AbstractMapDataStore.java:78) ~[hazelcast-3.5.jar:3.5]
Your cluster members must run with the same codebase (in this case, including the Scala library jar) as your clients.