Use implicit value from one module in another in Scala/Spark

I'm trying to get the SQLContext instance from one module in another module. The first module instantiates it as an implicit value sqlContext, and I had (erroneously) thought that I could then use an implicit parameter in the second module, but the compiler informs me that:
could not find implicit value for parameter sqlCtxt: org.apache.spark.sql.SQLContext
Here's the skeletal setup I have (I have elided imports and details):
-----
// Application.scala
-----
package apps
object Application extends App {
val env = new SparkEnvironment("My app", ...)
try {
// Call methods from various packages that internally use code from DFExtensions.scala
}
}
-----
// SparkEnvironment.scala
-----
package common
class SparkEnvironment(val app: String, ...) {
@transient lazy val conf: SparkConf = new SparkConf().setAppName(app)
@transient implicit lazy val sc: SparkContext = new SparkContext(conf)
@transient implicit lazy val sqlContext: SQLContext = new SQLContext(sc)
...
}
-----
// DFExtensions.scala
-----
package util
object DFExtensions {
private def myFun(...)(implicit sqlCtxt: SQLContext) = { ... }
implicit final class DFExt(val df: DataFrame) extends AnyVal {
// Extension methods for DataFrame where myFun is supposed to be used -- causes the compilation error!
}
}
Since it's a multi-project sbt setup I don't want to pass around the instance env to all related objects because the stuff in util is really a shared library. Each sub-project (i.e. app) has its own instance created in the main method.
Because myFun is only called from the implicit class DFExt, I thought about creating an implicit just before each call, à la implicit val sqlCtxt = df.sqlContext. That compiles, but it's kind of ugly, and I would not need the implicit in SparkEnvironment any longer.
According to this discussion the implicit sqlContext instance is not in scope, hence compilation fails. I'm not sure a package object would work because the implicit value and parameter are in different packages.
Is what I'm trying to achieve even possible? Is there a better alternative?
The idea is to have several sub-projects that use the same libraries and core functions and live in the same project. They are typically updated together, so it's nice to have them in a single place. Most of the library functions work directly on data frames and other structures in Spark, but occasionally I need to do something that requires an instance of SparkContext or SQLContext, for instance write a query with sqlContext.sql, as some syntax is not yet natively supported (e.g. flattening with outer lateral views).
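For illustration, such a library function might look roughly like this (a sketch: the table name and query are made up, and with Spark 1.4 the lateral-view syntax may require a HiveContext, which is itself a SQLContext):
import org.apache.spark.sql.{DataFrame, SQLContext}
// Hypothetical shared-library helper that needs an SQLContext in scope.
def flattenItems(df: DataFrame)(implicit sqlCtxt: SQLContext): DataFrame = {
  df.registerTempTable("events")
  sqlCtxt.sql("SELECT id, item FROM events LATERAL VIEW OUTER explode(items) t AS item")
}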
Each sub-project has its own main method that creates an implicit instance. Obviously the libraries do not 'know' about this as they are in different packages and I don't pass around the instances. I had thought that somehow implicits are looked for at runtime, so that when an application runs there is an instance of SQLContext defined as an implicit. It's possible that a) it's not in scope because it's in a different package or b) what I'm trying to do is just a bad idea.
Currently there is only one main method because I first have to split the application in multiple components, which I have not done yet.
Just in case it helps:
Spark 1.4.1
Scala 2.10
sbt 0.13.8

Because myFun is only called from the implicit class DFExt, I thought about creating an implicit just before each call, à la implicit val sqlCtxt = df.sqlContext. That compiles, but it's kind of ugly, and I would not need the implicit in SparkEnvironment any longer.
Just put the implicit and myFun inside DFExt:
implicit final class DFExt(val df: DataFrame) extends AnyVal {
private implicit def sqlCtxt: SQLContext = df.sqlContext
// no need to take an implicit parameter, as sqlCtxt is already in scope
private def myFun(...) = ...
// The extension methods can now use sqlCtxt and/or myFun freely
}
You could also make sqlCtxt a val, but then: 1) DFExt can't extend AnyVal anymore; 2) it needs to be initialized even if the extension method you call doesn't need it; 3) any calls to sqlCtxt are likely to be inlined, so you are just accessing a val from df instead of this anyway. If they aren't, this means you are using it far too little to matter.
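For comparison, the val variant weighed above would look roughly like this (a sketch; note the class can no longer be a value class):
implicit final class DFExt(val df: DataFrame) {
  // No longer extends AnyVal: the eager val below forces a wrapper instance
  // and is initialized for every DFExt that is created.
  private implicit val sqlCtxt: SQLContext = df.sqlContext
  // myFun and the extension methods stay exactly as above
}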

Related

Write method to a class dynamically at runtime in scala and create a jar

I would like to understand whether there is a way to add a method to an existing class at runtime and to create a jar dynamically in Scala.
So far I have tried to create a class dynamically and am able to run it through reflection; however, the class is a dynamic class which isn't generated.
val mirror = runtimeMirror(getClass.getClassLoader)
val tb = ToolBox(mirror).mkToolBox()
val function = q"def function(x: Int): Int = x + 2"
val functionWrapper = "object FunctionWrapper { " + function + "}"
data.map(x => tb.eval(q"$functionSymbol.function($x)"))
I got this from another source; however, the class is available only for this run and will not be generated.
I would like to add a function to the existing class at runtime, compile it, and create a jar for it.
Kindly suggest a way to do this.
Thanks in advance.
I guess the code snippet you provided should actually look like
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val mirror = runtimeMirror(getClass.getClassLoader)
val tb = ToolBox(mirror).mkToolBox()
val function: Tree = q"def function(x: Int): Int = x + 2"
val functionWrapper: Symbol = tb.define(q"object FunctionWrapper { $function }".asInstanceOf[ImplDef])
val data: List[Tree] = List(q"1", q"2")
data.map(x => tb.eval(q"$functionWrapper.function($x)")) // List(3, 4)
... however the class is dynamic class which isnt generated.
... however the class is available only for this run and will not be generated.
How did you check that the class is not generated? (Which class, FunctionWrapper?)
is there a way to write a method to existing class at runtime and to create a jar dynamically in scala.
i would like to add a function to the existing class at runtime and able to compile it and create a jar for it.
What is "existing class"? Do you have access to its sources? Then you can modify the sources, compile them etc.
Does the class exist as a .class file? You can modify its byte code with Byte Buddy, ASM, Javassist, cglib, etc., or instrument the byte code with aspects (a Javassist sketch follows below).
Is it a dynamic class (like FunctionWrapper above)? How did you create it? (For FunctionWrapper you have access to its Symbol, so you can use it in further sources.)
Is the class already loaded? Then you'll have to play with class loaders (unload, modify, load modified).
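For the .class-file case, a rough sketch with Javassist (the class name, method body, and output paths are hypothetical):
import javassist.{ClassPool, CtNewMethod}
// Hypothetical: add a method to an already compiled class and write it back out.
val pool = ClassPool.getDefault
val cc = pool.get("com.example.Existing")   // the class must be on the classpath
cc.addMethod(CtNewMethod.make("public int function(int x) { return x + 2; }", cc))
cc.writeFile("patched-classes")             // writes patched-classes/com/example/Existing.class
// then package it, e.g.: jar cf patched.jar -C patched-classes .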
Can a Java class add a method to itself at runtime?
In Java, given an object, is it possible to override one of the methods?

flink parsing JSON in map: InvalidProgramException: Task not serializable

I am working on a Flink project and would like to parse the source JSON string data to a JSON object. I am using jackson-module-scala for the JSON parsing. However, I encountered some issues with using the JSON parser within Flink APIs (map, for example).
Here are some examples of the code, and I cannot understand the reason under the hood why it is behaving like this.
Situation 1:
In this case, I am doing what jackson-module-scala's official example code told me to do:
create a new ObjectMapper
register the DefaultScalaModule
DefaultScalaModule is a Scala object that includes support for all currently supported Scala data types.
call the readValue in order to parse the JSON to Map
The error I got is: org.apache.flink.api.common.InvalidProgramException: Task not serializable.
object JsonProcessing {
def main(args: Array[String]) {
// set up the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// get input data
val text = env.readTextFile("xxx")
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
val counts = text.map(mapper.readValue(_, classOf[Map[String, String]]))
// execute and print result
counts.print()
env.execute("JsonProcessing")
}
}
Situation 2:
Then I did some Google, and came up with the following solution, where registerModule is moved into the map function.
val mapper = new ObjectMapper
val counts = text.map(l => {
mapper.registerModule(DefaultScalaModule)
mapper.readValue(l, classOf[Map[String, String]])
})
However, what I am not able to understand is: why does this work when it calls a method on the mapper object defined outside the map function? Is it because the ObjectMapper itself is Serializable, as stated here: ObjectMapper.java#L114?
Now the JSON parsing works fine, but every time I have to call mapper.registerModule(DefaultScalaModule), which I think could cause a performance issue (does it?). I also tried another solution, as follows.
Situation 3:
I created a new case class Jsen and use it as the corresponding parsing class, registering the Scala module; it is also working fine.
However, this is not so flexible if your input JSON varies often; maintaining the class Jsen is not practical.
case class Jsen(
@JsonProperty("a") a: String,
@JsonProperty("c") c: String,
@JsonProperty("e") e: String
)
object JsonProcessing {
def main(args: Array[String]) {
...
val mapper = new ObjectMapper
val counts = text.map(mapper.readValue(_, classOf[Jsen]))
...
}
}
Additionally, I also tried using JsonNode without calling registerModule as follows:
...
val mapper = new ObjectMapper
val counts = text.map(mapper.readValue(_, classOf[JsonNode]))
...
It is working fine as well.
My main question is: what is actually causing the problem of Task not serializable under the hood of registerModule(DefaultScalaModule)?
How can I identify whether my code could potentially cause this serializability problem while coding?
The thing is that Apache Flink is designed to be distributed. It means that it needs to be able to run your code remotely, so all your processing functions must be serializable. In the current implementation this is ensured early on, when you build your streaming job, even if you will not run it in any distributed mode. This is a trade-off with an obvious benefit: it gives you feedback down to the very line that breaks this contract (via the exception stack trace).
So when you write
val counts = text.map(mapper.readValue(_, classOf[Map[String, String]]))
what you actually write is something like
val counts = text.map(new Function1[String, Map[String, String]] {
val capturedMapper = mapper
override def apply(param: String) = capturedMapper.readValue(param, classOf[Map[String, String]])
})
The important thing here is that you capture the mapper from the outside context and store it as part of your Function1 object, which has to be serializable. And this means that the mapper has to be serializable. The designers of the Jackson library recognized that kind of need, and since there is nothing fundamentally non-serializable in a mapper, they made their ObjectMapper and the default modules serializable. Unfortunately for you, the designers of the Scala Jackson module missed that and made their DefaultScalaModule deeply non-serializable by making ScalaTypeModifier and all its sub-classes non-serializable. This is why your second code works while the first one doesn't: a "raw" ObjectMapper is serializable, while an ObjectMapper with a pre-registered DefaultScalaModule is not.
There are a few possible workarounds. Probably the easiest one is to wrap the ObjectMapper:
object MapperWrapper extends java.io.Serializable {
// this lazy is the important trick here
// @transient adds some safety in current Scala (see also the Update section)
@transient lazy val mapper = {
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
mapper
}
def readValue[T](content: String, valueType: Class[T]): T = mapper.readValue(content, valueType)
}
and then use it as
val counts = text.map(MapperWrapper.readValue(_, classOf[Map[String, String]]))
This lazy trick works because although an instance of DefaultScalaModule is not serializable, the function to create an instance of DefaultScalaModule is.
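A quick way to convince yourself of this is to serialize the wrapper before the mapper has ever been used; a minimal sketch:
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
// Sketch: this succeeds because the lazy mapper has not been initialized yet,
// so no DefaultScalaModule instance is part of the serialized state.
val out = new ObjectOutputStream(new ByteArrayOutputStream())
out.writeObject(MapperWrapper)
out.close()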
Update: what about @transient?
what are the differences here, if I add lazy val vs. @transient lazy val?
This is actually a tricky question. What the lazy val is compiled to is actually something like this:
object MapperWrapper extends java.io.Serializable {
// @transient is set or not set for both fields depending on its presence at "lazy val"
[@transient] private var mapperValue: ObjectMapper = null
[@transient] @volatile private var mapperInitialized = false
def mapper: ObjectMapper = {
if (!mapperInitialized) {
this.synchronized {
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
mapperValue = mapper
mapperInitialized = true
}
}
mapperValue
}
def readValue[T](content: String, valueType: Class[T]): T = mapper.readValue(content, valueType)
}
where @transient on the lazy val affects both backing fields. So now you can see why the lazy val trick works:
locally it works because it delays initialization of the mapperValue field until the first access to the mapper method, so the field is safely null when the serialization check is performed
remotely it works because MapperWrapper is fully serializable and the logic of how the lazy val should be initialized is put into a method of the same class (see def mapper).
Note, however, that AFAIK this behavior of how lazy val is compiled is an implementation detail of the current Scala compiler rather than part of the Scala specification. If at some later point a class similar to .NET's Lazy is added to the Java standard library, the Scala compiler might start generating different code. This is important because it provides a kind of trade-off for @transient. The benefit of adding @transient now is that it ensures that code like this works as well:
val someJson:String = "..."
val something:Something = MapperWrapper.readValue(someJson:String, ...)
val counts = text.map(MapperWrapper.readValue(_, classOf[Map[String, String]]))
Without @transient the code above will fail because we forced initialization of the lazy backing field and now it contains a non-serializable value. With @transient this is not an issue, as that field will not be serialized at all.
A potential drawback of @transient is that if Scala changes the way code for lazy val is generated and the field is marked as @transient, it might actually not be de-serialized in the remote-work scenario.
Also, there is a trick with object: for objects the Scala compiler generates custom de-serialization logic (it overrides readResolve) to return the same singleton object. It means that the object, including the lazy val, is not really de-serialized and the value from the object itself is used, so a @transient lazy val inside an object is much more future-proof than inside a class in the remote scenario.
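The mechanism is the standard readResolve hook; a generic sketch of the idea (not the literal compiler output) looks like this:
// Sketch: readResolve replaces whatever was de-serialized with the existing
// singleton, so any @transient lazy val is simply re-initialized on first use.
object Singleton extends Serializable {
  private def readResolve(): AnyRef = Singleton
}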

Importing generic implicits from class instances

I'm trying to make a generic implicit provider which can create an implicit value for a given type, something in the lines of:
trait Evidence[T]
class ImplicitProvider[T] {
class Implementation extends Evidence[T]
implicit val evidence: Evidence[T] = new Implementation
}
To use this implicit, I create a val provider = new ImplicitProvider[T] instance where necessary and import from it with import provider._. This works fine as long as there is just one instance. However, sometimes implicits for several types are needed in one place:
case class A()
case class B()
class Test extends App {
val aProvider = new ImplicitProvider[A]
val bProvider = new ImplicitProvider[B]
import aProvider._
import bProvider._
val a = implicitly[Evidence[A]]
val b = implicitly[Evidence[B]]
}
And this fails to compile with could not find implicit value for parameter and not enough arguments for method implicitly errors.
If I use implicit vals from providers directly, everything starts to work again.
implicit val aEvidence = aProvider.evidence
implicit val bEvidence = bProvider.evidence
However I'm trying to avoid importing individual values, as there are actually several implicits inside each provider and the goal is to abstract them if possible.
Can this be achieved somehow or do I want too much from the compiler?
The issue is that when you import from both objects, you're bringing in two entities that have colliding names: evidence in aProvider and evidence in bProvider. The compiler cannot disambiguate those, both because of how it's implemented and because it would be a bad idea for implicits, which can already be arcane, to be able to do things that cannot be done explicitly (disambiguating between clashing names).
What I don't understand is what the point of ImplicitProvider is. You can pull the Implementation class out to the top level and have an object somewhere that holds the implicit vals.
class Implementation[T] extends Evidence[T]
object Evidence {
implicit val aEvidence: Evidence[A] = new Implementation[A]
implicit val bEvidence: Evidence[B] = new Implementation[B]
}
// Usage:
import Evidence._
implicitly[Evidence[A]]
implicitly[Evidence[B]]
Now, there is no name clash.
If you need to have an actual ImplicitProvider, you can instead do this:
class ImplicitProvider[T] { ... }
object ImplicitProviders {
implicit val aProvider = new ImplicitProvider[A]
implicit val bProvider = new ImplicitProvider[B]
implicit def ImplicitProvider2Evidence[T: ImplicitProvider]: Evidence[T]
= implicitly[ImplicitProvider[T]].evidence
}
// Usage
import ImplicitProviders._
// ...
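With the case classes A and B from the question, usage would then look roughly like this (sketch):
import ImplicitProviders._
// Both resolve through ImplicitProvider2Evidence plus the providers in scope.
val a = implicitly[Evidence[A]]
val b = implicitly[Evidence[B]]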

snakeyaml and spark results in an inability to construct objects

The following code executes fine in a Scala shell with SnakeYAML version 1.17:
import org.yaml.snakeyaml.Yaml
import org.yaml.snakeyaml.constructor.Constructor
import scala.collection.mutable.ListBuffer
import scala.beans.BeanProperty
class EmailAccount {
@scala.beans.BeanProperty var accountName: String = null
override def toString: String = {
return s"acct ($accountName)"
}
}
val text = """accountName: Ymail Account"""
val yaml = new Yaml(new Constructor(classOf[EmailAccount]))
val e = yaml.load(text).asInstanceOf[EmailAccount]
println(e)
However, when running in Spark (2.0.0 in this case) the resulting error is:
org.yaml.snakeyaml.constructor.ConstructorException: Can't construct a java object for tag:yaml.org,2002:EmailAccount; exception=java.lang.NoSuchMethodException: EmailAccount.<init>()
in 'string', line 1, column 1:
accountName: Ymail Account
^
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:350)
at org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
at org.yaml.snakeyaml.constructor.BaseConstructor.constructDocument(BaseConstructor.java:141)
at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
at org.yaml.snakeyaml.Yaml.load(Yaml.java:369)
... 48 elided
Caused by: org.yaml.snakeyaml.error.YAMLException: java.lang.NoSuchMethodException: EmailAccount.<init>()
at org.yaml.snakeyaml.constructor.Constructor$ConstructMapping.createEmptyJavaBean(Constructor.java:220)
at org.yaml.snakeyaml.constructor.Constructor$ConstructMapping.construct(Constructor.java:190)
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:346)
... 53 more
Caused by: java.lang.NoSuchMethodException: EmailAccount.<init>()
at java.lang.Class.getConstructor0(Class.java:2810)
at java.lang.Class.getDeclaredConstructor(Class.java:2053)
at org.yaml.snakeyaml.constructor.Constructor$ConstructMapping.createEmptyJavaBean(Constructor.java:216)
... 55 more
I launched the scala shell with
scala -classpath "/home/placey/snakeyaml-1.17.jar"
I launched the spark shell with
/home/placey/Downloads/spark-2.0.0-bin-hadoop2.7/bin/spark-shell --master local --jars /home/placey/snakeyaml-1.17.jar
Solution
Create a self-contained application and run it using spark-submit instead of using spark-shell.
I've created a minimal project for you as a gist here. All you need to do is put both files (build.sbt and Main.scala) in some directory, then run:
sbt package
in order to create a JAR. The JAR will be in target/scala-2.11/sparksnakeyamltest_2.11-1.0.jar or a similar location. You can get SBT from here if you haven't used it yet. Finally, you can run the project:
/home/placey/Downloads/spark-2.0.0-bin-hadoop2.7/bin/spark-submit --class "Main" --master local --jars /home/placey/snakeyaml-1.17.jar target/scala-2.11/sparksnakeyamltest_2.11-1.0.jar
The output should be:
[many lines of Spark's log]
acct (Ymail Account)
[more lines of Spark's log]
Explanation
Spark's shell (REPL) transforms all classes you define in it by adding a $iw parameter to your constructors. I've explained it here. SnakeYAML expects a zero-parameter constructor for JavaBean-like classes, but there isn't one, so it fails.
You can try this yourself:
scala> class Foo() {}
defined class Foo
scala> classOf[Foo].getConstructors()
res0: Array[java.lang.reflect.Constructor[_]] = Array(public Foo($iw))
scala> classOf[Foo].getConstructors()(0).getParameterCount
res1: Int = 1
As you can see, Spark transforms the constructor by adding a parameter of type $iw.
Alternative solutions
Define your own Constructor
If you really need to get it working in the shell, you could define your own class extending org.yaml.snakeyaml.constructor.BaseConstructor and make sure that $iw gets passed to constructors, but this is a lot of work (I actually wrote my own Constructor in Scala for security reasons some time ago, so I have some experience with this).
You could also define a custom Constructor hard-coded to instantiate a specific class (EmailAccount in your case) similar to the DiceConstructor shown in SnakeYAML's documentation. This is much easier, but requires writing code for each class you want to support.
Example:
case class EmailAccount(accountName: String)
class EmailAccountConstructor extends org.yaml.snakeyaml.constructor.Constructor {
val emailAccountTag = new org.yaml.snakeyaml.nodes.Tag("!emailAccount")
this.rootTag = emailAccountTag
this.yamlConstructors.put(emailAccountTag, new ConstructEmailAccount)
private class ConstructEmailAccount extends org.yaml.snakeyaml.constructor.AbstractConstruct {
def construct(node: org.yaml.snakeyaml.nodes.Node): Object = {
// TODO: This is fine for quick prototyping in a REPL, but in a real
// application you should probably add type checks.
val mnode = node.asInstanceOf[org.yaml.snakeyaml.nodes.MappingNode]
val mapping = constructMapping(mnode)
val name = mapping.get("accountName").asInstanceOf[String]
new EmailAccount(name)
}
}
}
You can save this as a file and load it in the REPL using :load filename.scala.
A bonus advantage of this solution is that it can create immutable case class instances directly. Unfortunately, the Scala REPL seems to have issues with imports, so I've used fully qualified names.
Don't use JavaBeans
You can also just parse YAML documents as simple Java maps:
scala> val yaml2 = new Yaml()
yaml2: org.yaml.snakeyaml.Yaml = Yaml:1141996301
scala> val e2 = yaml2.load(text)
e2: Object = {accountName=Ymail Account}
scala> val map = e2.asInstanceOf[java.util.Map[String, Any]]
map: java.util.Map[String,Any] = {accountName=Ymail Account}
scala> map.get("accountName")
res4: Any = Ymail Account
This way SnakeYAML won't need to use reflection.
However, since you're using Scala, I recommend trying MoultingYAML, which is a Scala wrapper for SnakeYAML. It parses YAML documents to simple Java types and then maps them to Scala types (even your own types like EmailAccount).
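A rough sketch of what that might look like with MoultingYAML (the protocol import and the yamlFormat1 helper below are assumptions based on its spray-json-style API; check the library's documentation):
import net.jcazevedo.moultingyaml._
import net.jcazevedo.moultingyaml.DefaultYamlProtocol._
case class EmailAccount(accountName: String)
// Assumed API: derive a format for the one-field case class, spray-json style.
implicit val emailAccountFormat = yamlFormat1(EmailAccount)
val account = "accountName: Ymail Account".parseYaml.convertTo[EmailAccount]
// account: EmailAccount = EmailAccount(Ymail Account)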

Get a class from a type in Scala

In Scala, I want to be able to say:
val user = Node.create[User](...) // return User object
So here's what I have so far:
def create[T : TypeTag](map: Map[String, Any]) {
val tpe = typeOf[T] // ("type" is a reserved word, so use another name)
// create class from type here???
}
I've been digging around how to create classes from generic types and found out that ClassManifest seems to be deprecated. Instead, type tags are here, so I'm able to do something like typeOf[T] and actually get the type, but then I'm lost. If I could get the class, then I could use something like newInstance and manually set the fields from there.
Question is: given a type, can I get a class instance of the given type?
The easiest way in fact is to use ClassTag:
import scala.reflect.{classTag, ClassTag}
def create[T : ClassTag](map: Map[String, Any]): T = {
val clazz: Class[_] = classTag[T].runtimeClass
clazz.getDeclaredConstructor(<constructor parameter classes>).newInstance(<constructor arguments here>).asInstanceOf[T]
}
ClassTag is a thin wrapper around a Java Class, primarily used for array instantiation.
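For example, array instantiation is where a ClassTag is typically required:
import scala.reflect.ClassTag
// new Array[T] needs the runtime element class, which the ClassTag provides.
def makeArray[T: ClassTag](n: Int): Array[T] = new Array[T](n)
makeArray[String](3)  // Array(null, null, null)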
The TypeTag facility is more powerful. First, you can use it to invoke Java reflection:
import scala.reflect.runtime.universe._
def create[T: TypeTag](map: Map[String, Any]): T = {
val mirror = runtimeMirror(getClass.getClassLoader) // current class classloader
val clazz: Class[_] = mirror.runtimeClass(typeOf[T].typeSymbol.asClass)
clazz.getDeclaredConstructor(<constructor parameter classes>).newInstance(<constructor arguments here>).asInstanceOf[T]
}
However, Scala reflection allows you to instantiate classes without dropping back to Java reflection:
def create[T: TypeTag](map: Map[String, Any]): T = {
// obtain type symbol for the class, it is like Class but for Scala types
val typeSym = typeOf[T].typeSymbol.asClass
// obtain class mirror using runtime mirror for the given classloader
val mirror = runtimeMirror(getClass.getClassLoader) // current class classloader
val cm = mirror.reflectClass(typeSym)
// resolve class constructor using class mirror and
// a constructor declaration on the type symbol
val ctor = typeSym.decl(termNames.CONSTRUCTOR).asMethod
val ctorm = cm.reflectConstructor(ctor)
// invoke the constructor
ctorm(<constructor arguments here>).asInstanceOf[T]
}
If you want to instantiate a class with overloaded constructors, it may require more work though: you'll have to select the correct constructor from the declarations list (see the sketch below), but the basic idea is the same. You can read more on Scala reflection here.
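For instance, continuing inside create above (replacing the two constructor lines), one way to select among overloaded constructors could look like this; choosing by arity against the map size is only an example predicate:
// Sketch: pick one of the overloaded constructors, reusing typeSym and cm from above.
val ctors = typeSym.decl(termNames.CONSTRUCTOR).asTerm.alternatives.map(_.asMethod)
val ctor = ctors
  .find(_.paramLists.flatten.size == map.size)
  .getOrElse(sys.error(s"no matching constructor for ${typeOf[T]}"))
val ctorm = cm.reflectConstructor(ctor)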
There is a way to do it with reflection: either runtime reflection or a macro. Regarding the runtime-reflection way, you can have a look at my blog post where I tried to do something like what you are trying to do now. Using compile-time reflection with macros might be a better option, depending on your needs.