EDIT#2: This might be memory related. Logs are showing out-of-heap errors.
Yes, it is definitely memory related. docker logs reports all the out-of-heap errors from the Java process, but the Jupyter web notebook does not pass them on to the user. Instead the user gets kernel failures and occasional weird behavior, like code not compiling correctly.
Spark 1.6, specifically docker run -d .... jupyter/all-spark-notebook
I would like to count accounts in a file of ~1 million transactions.
This is simple enough and could be done without Spark, but I've hit an odd error trying it with Spark Scala.
Input data is of type RDD[etherTrans], where etherTrans is a custom type wrapping a single transaction: a timestamp, the from and to accounts, and the value transacted in ether.
class etherTrans(ts_in:Long, afrom_in:String, ato_in:String, ether_in: Float)
extends Serializable {
var ts: Long = ts_in
var afrom: String = afrom_in
var ato: String = ato_in
var ether: Float = ether_in
override def toString():String = ts.toString+","+afrom+","+ato+","+ether.toString
}
data:RDD[etherTrans] looks ok:
data.take(10).foreach(println)
etherTrans(1438918233,0xa1e4380a3b1f749673e270229993ee55f35663b4,0x5df9b87991262f6ba471f09758cde1c0fc1de734,3.1337E-14)
etherTrans(1438918613,0xbd08e0cddec097db7901ea819a3d1fd9de8951a2,0x5c12a8e43faf884521c2454f39560e6c265a68c8,19.9)
etherTrans(1438918630,0x63ac545c991243fa18aec41d4f6f598e555015dc,0xc93f2250589a6563f5359051c1ea25746549f0d8,599.9895)
etherTrans(1438918983,0x037dd056e7fdbd641db5b6bea2a8780a83fae180,0x7e7ec15a5944e978257ddae0008c2f2ece0a6090,100.0)
etherTrans(1438919175,0x3f2f381491797cc5c0d48296c14fd0cd00cdfa2d,0x4bd5f0ee173c81d42765154865ee69361b6ad189,803.9895)
etherTrans(1438919394,0xa1e4380a3b1f749673e270229993ee55f35663b4,0xc9d4035f4a9226d50f79b73aafb5d874a1b6537e,3.1337E-14)
etherTrans(1438919451,0xc8ebccc5f5689fa8659d83713341e5ad19349448,0xc8ebccc5f5689fa8659d83713341e5ad19349448,0.0)
etherTrans(1438919461,0xa1e4380a3b1f749673e270229993ee55f35663b4,0x5df9b87991262f6ba471f09758cde1c0fc1de734,3.1337E-14)
etherTrans(1438919491,0xf0cf0af5bd7d8a3a1cad12a30b097265d49f255d,0xb608771949021d2f2f1c9c5afb980ad8bcda3985,100.0)
etherTrans(1438919571,0x1c68a66138783a63c98cc675a9ec77af4598d35e,0xc8ebccc5f5689fa8659d83713341e5ad19349448,50.0)
This next function compiles ok, and is written this way because earlier attempts complained of a type mismatch between Array[String] or List[String] and TraversableOnce[?]:
def arrow(e:etherTrans):TraversableOnce[String] = Array(e.afrom,e.ato)
But then using this function with flatMap to get an RDD[String] of all accounts fails.
val accts:RDD[String] = data.flatMap(arrow)
Name: Compile Error
Message: :38: error: type mismatch;
found : etherTrans(in class $iwC)(in class $iwC)(in class $iwC)(in class $iwC) => TraversableOnce[String]
required: etherTrans(in class $iwC)(in class $iwC)(in class $iwC)(in class $iwC) => TraversableOnce[String]
val accts:RDD[String] = data.flatMap(arrow)
^
StackTrace:
Make sure you scroll right to see it complain that TraversableOnce[String]
doesn't match TraversableOnce[String]
This must be a fairly common problem: a more blatant type mismatch comes up in "Generate List of Pairs", and, while there isn't enough context, it is hinted at in "I have a Scala List, how can I get a TraversableOnce?".
What's going on here?
EDIT: The issue reported above doesn't appear, and the code works fine, in an older spark-shell (Spark 1.3.1) running standalone in a docker container. The errors are generated when running in the Spark 1.6 Scala Jupyter environment with the jupyter/all-spark-notebook docker container.
Also @zero323 says that this toy example:
val rdd = sc.parallelize(Seq((1L, "foo", "bar", 1))).map{ case (ts, fr, to, et) => new etherTrans(ts, fr, to, et)}
rdd.flatMap(arrow).collect
worked for him in a terminal spark-shell (Spark 1.6.0 / Scala 2.10.5), and that Scala 2.11.7 with Spark 1.5.2 works as well.
I think you should switch to case classes, and it should work fine. Using "regular" classes might cause weird issues when serializing them, and it looks like all you need are value objects, so case classes look like a better fit for your use case.
An example:
case class EtherTrans(ts: Long, afrom: String, ato: String, ether: Float)
val source = sc.parallelize(Array(
(1L, "from1", "to1", 1.234F),
(2L, "from2", "to2", 3.456F)
))
val data = source.map { l => EtherTrans(l._1, l._2, l._3, l._4) }
def arrow(e: EtherTrans) = Array(e.afrom, e.ato)
data.map(arrow).take(5)
/*
res3: Array[Array[String]] = Array(Array(from1, to1), Array(from2, to2))
*/
If you need to, you can just create some method / object to generate your case classes.
If you don't really need the "toString" method for your logic, but just for presentation, keep it out of the case class: you can always apply it with a map operation before storing or showing the data, as sketched below.
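For instance, a minimal sketch (the helper names are hypothetical) of keeping construction and presentation outside the case class itself:
object EtherTrans {
  // hypothetical factory: build the case class from a raw tuple
  def fromTuple(t: (Long, String, String, Float)): EtherTrans =
    EtherTrans(t._1, t._2, t._3, t._4)

  // presentation-only formatting, applied with a map before storing or showing
  def render(e: EtherTrans): String =
    s"${e.ts},${e.afrom},${e.ato},${e.ether}"
}

// usage: data.map(EtherTrans.render).take(10).foreach(println)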
Also, if you are on Spark 1.6.0 or higher, you could try using the Dataset API instead, which would look more or less like this:
val data = sqlContext.read.text("your_file").as[EtherTrans]
https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
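Since read.text gives you a single string column, you would still need to parse each line into the case class first; a rough sketch, assuming the comma-separated layout from the printed sample above:
import sqlContext.implicits._

val data = sqlContext.read.text("your_file").as[String]
  .map { line =>
    // assumes ts,from,to,ether per line, as in the sample output
    val Array(ts, from, to, ether) = line.split(",")
    EtherTrans(ts.toLong, from, to, ether.toFloat)
  }
data.show(5)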
Related
I am just experimenting with the use of Scala type classes within Flink. I have defined the following type class interface:
trait LikeEvent[T] {
def timestamp(payload: T): Int
}
Now, I want to consider a DataSet of LikeEvent[_] like this:
// existing classes that need to be adapted/normalized (without touching them)
case class Log(ts: Int, severity: Int, message: String)
case class Metric(ts: Int, name: String, value: Double)
// create instances for the raw events
object EventInstance {
implicit val logEvent = new LikeEvent[Log] {
def timestamp(log: Log): Int = log.ts
}
implicit val metricEvent = new LikeEvent[Metric] {
def timestamp(metric: Metric): Int = metric.ts
}
}
// add ops to the raw event classes (regular class)
object EventSyntax {
implicit class Event[T: LikeEvent](val payload: T) {
val le = implicitly[LikeEvent[T]]
def timestamp: Int = le.timestamp(payload)
}
}
The following app runs just fine:
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// underlying (raw) events
val events: DataSet[Event[_]] = env.fromElements(
Metric(1586736000, "cpu_usage", 0.2),
Log(1586736005, 1, "invalid login"),
Log(1586736010, 1, "invalid login"),
Log(1586736015, 1, "invalid login"),
Log(1586736030, 2, "valid login"),
Metric(1586736060, "cpu_usage", 0.8),
Log(1586736120, 0, "end of world"),
)
// count events per hour
val eventsPerHour = events
.map(new GetMinuteEventTuple())
.groupBy(0).reduceGroup { g =>
val gl = g.toList
val (hour, count) = (gl.head._1, gl.size)
(hour, count)
}
eventsPerHour.print()
This prints the expected output:
(0,5)
(1,1)
(2,1)
However, if I modify the syntax object like this:
// couldn't make it work with Flink!
// add ops to the raw event classes (case class)
object EventSyntax2 {
case class Event[T: LikeEvent](payload: T) {
val le = implicitly[LikeEvent[T]]
def timestamp: Int = le.timestamp(payload)
}
implicit def fromPayload[T: LikeEvent](payload: T): Event[T] = Event(payload)
}
I get the following error:
type mismatch;
found : org.apache.flink.api.scala.DataSet[Product with Serializable]
required: org.apache.flink.api.scala.DataSet[com.salvalcantara.fp.EventSyntax2.Event[_]]
So, guided by the message, I do the following change:
val events: DataSet[Event[_]] = env.fromElements[Event[_]](...)
After that, the error changes to:
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax2.Event[_]]
I cannot understand why EventSyntax2 results in these errors, whereas EventSyntax compiles and runs well. Why is using a case class wrapper in EventSyntax2 more problematic than using a regular class as in EventSyntax?
Anyway, my question is twofold:
How can I solve my problem with EventSyntax2?
What would be the simplest way to achieve my goals? Here, I am just experimenting with the type class pattern for the sake of learning, but admittedly a more object-oriented approach (based on subtyping) looks simpler to me. Something like this:
// Define trait
trait Event {
def timestamp: Int
def payload: Product with Serializable // Any case class
}
// Metric adapter (similar for Log)
object MetricAdapter {
implicit class MetricEvent(val payload: Metric) extends Event {
def timestamp: Int = payload.ts
}
}
And then simply use val events: DataSet[Event] = env.fromElements(...) in the main.
Note that "List of classes implementing a certain typeclass" poses a similar question, but it considers a plain Scala List instead of a Flink DataSet (or DataStream). The focus of my question is on using the type class pattern within Flink to handle heterogeneous streams/datasets, and whether it really makes sense, or whether one should simply favour a regular trait in this case and inherit from it as outlined above.
BTW, you can find the code here: https://github.com/salvalcantara/flink-events-and-polymorphism.
Short answer: Flink cannot derive TypeInformation in scala for wildcard types
Long answer:
Both of your questions are really asking: what is TypeInformation, how is it used, and how is it derived?
TypeInformation is Flink's internal type system, which it uses to serialize data when it is shuffled across the network and stored in a state backend (when using the DataStream API).
Serialization is a major performance concern in data processing, so Flink contains specialized serializers for common data types and patterns. Out of the box, in its Java stack, it supports all JVM primitives, POJOs, Flink tuples, some common collection types, and Avro. The type of your class is determined using reflection, and if it does not match a known type it falls back to Kryo.
In the Scala API, type information is derived using implicits. All methods on the Scala DataSet and DataStream APIs have their generic parameters annotated with a context bound, requiring the implicit as a type class:
def map[T: TypeInformation]
This TypeInformation can be provided manually, like any type class, or derived using a macro that is imported from Flink:
import org.apache.flink.api.scala._
This macro decorates the Java type stack with support for Scala tuples, Scala case classes, and some common Scala standard library types. I say decorates because it can and will fall back to the Java stack if your class is not one of those types.
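For illustration, a minimal sketch (the Click case class is just a hypothetical example) of summoning the derived evidence explicitly:
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._ // brings the derivation macro into scope

case class Click(userId: String, ts: Long) // hypothetical example type

// For a case class, the macro derives a dedicated CaseClassTypeInfo;
// for a class it cannot analyse, it falls back to a Kryo-backed GenericType.
val clickInfo: TypeInformation[Click] = implicitly[TypeInformation[Click]]
println(clickInfo)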
So why does version 1 work?
Because it is an ordinary class that the type stack cannot match, so it is resolved to a generic type backed by a Kryo-based serializer. You can test this from the console and see that it returns a generic type:
scala> implicitly[TypeInformation[EventSyntax.Event[_]]]
res2: org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax.Event[_]] = GenericType<com.salvalcantara.fp.EventSyntax.Event>
Version 2 does not work because the macro recognizes the type as a case class and then tries to recursively derive TypeInformation instances for each of its members. This is not possible for wildcard types, which are different from Any, and so the derivation fails.
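You can check this from the console too (the exact wording may differ slightly from the error in the question); for the case-class wrapper the derivation fails instead of falling back:
scala> implicitly[TypeInformation[EventSyntax2.Event[_]]]
// does not compile: no TypeInformation for EventSyntax2.Event[_] can be found or derived,
// matching the "could not find implicit value for evidence parameter" error above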
In general, you should not use Flink with heterogeneous types because it will not be able to derive efficient serializers for your workload.
I had a user defined function as follows:
def myFunc(df: DataFrame, column_name: String, func:(String)=>String = myFunc) = {
val my_udf = udf{ (columnValue: String) => func(columnValue) }
df.withColumn("newColumn", my_udf(df(column_name)))
}
Now, while writing test cases, I wanted to mock myFunc. I'm using Mockito for testing.
So I created a mock function with
var fmock = mock[(String)=>String]
when(fmock(any[String])).thenReturn("default_string_write_value")
However, when I try to pass fmock, I get the following exception: java.io.NotSerializableException.
So I tried to make fmock serializable by declaring it as follows:
var fmock = mock[(String)=>String](Mockito.withSettings().serializable())
Now when I call a collect on the dataframe returned by myFunc, I get the following exception:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
This error is obtained when I am trying to perform the collect operation.
Also, a weird thing is happening. The test case passes in IntelliJ (that's the IDE I'm using), but fails when I run it with Maven.
The test case is as follows (assume I already have a dataframe TESTDF in place with columns COL1, COL2, COL3):
val fmock = mock[(String)=>String](Mockito.withSettings().serializable())
when(fmock(any[String])).thenReturn("default_string_write_value")
val ansDf = myFunc(TESTDF, COL1, fmock)
var rowValueArray = ansDf.select("newColumn").collect.map(x => x(0).asInstanceOf[String])
assert(rowValueArray(0)=="default_string_write_value")
Note that I'm using Spark 2.3.3 and Scala 2.11.8. The application runs fine; the issue only comes up in the test cases, and only when they are run with Maven.
I'm using Slick 3.1.0 and Slick-pg 0.10.0. I have an enum as:
object UserProviders extends Enumeration {
type Provider = Value
val Google, Facebook = Value
}
Following the test case, it works fine with the column mapper after simply adding the following implicit mapper to my customized driver:
implicit val userProviderMapper = createEnumJdbcType("UserProvider", UserProviders, quoteName = true)
However, when using plain SQL, I encountered the following compilation error:
could not find implicit value for parameter e: slick.jdbc.SetParameter[Option[models.UserProviders.Provider]]
I could not find any documentation about this. How can I write plain SQL with an enum in Slick? Thanks.
You need to have an implicit of type SetParameter[T] in scope, which tells Slick how to set parameters from some custom type T that it doesn't already know about. For example:
implicit val setInstant: SetParameter[Instant] = SetParameter { (instant, pp) =>
pp.setTimestamp(new Timestamp(instant.toEpochMilli))
}
The type of pp is PositionedParameters.
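Applied to the enum from the question, a rough sketch (assuming the value is bound as its string name; depending on your schema you may also need an explicit cast to the Postgres enum type in the SQL itself):
implicit val setProvider: SetParameter[UserProviders.Provider] =
  SetParameter { (p, pp) => pp.setString(p.toString) }

implicit val setProviderOpt: SetParameter[Option[UserProviders.Provider]] =
  SetParameter { (p, pp) => pp.setStringOption(p.map(_.toString)) }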
You might also come across the need to tell Slick how to extract a query result into some custom type T that it doesn't already know about. For this, you need an implicit GetResult[T] in scope. For example:
implicit def getInstant(implicit get: GetResult[Long]): GetResult[Instant] =
get andThen (Instant.ofEpochMilli(_))
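And the corresponding extraction for the enum in the question could look like this (again assuming the column comes back as the value's string name):
implicit val getProvider: GetResult[UserProviders.Provider] =
  GetResult(r => UserProviders.withName(r.nextString()))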
Suppose I have:
class X
{
val listPrimitive: List[Int] = null
val listX: List[X] = null
}
and I print out the return types of each method in Scala as follows:
classOf[X].getMethods().foreach { m => println(s"${m.getName}: ${m.getGenericReturnType()}") }
listPrimitive: scala.collection.immutable.List<Object>
listX: scala.collection.immutable.List<X>
So... I can determine that the listX's element type is X, but is there any way to determine via reflection that listPrimitive's element type is actually java.lang.Integer? ...
val list:List[Int] = List[Int](123);
val listErased:List[_] = list;
println(s"${listErased(0).getClass()}") // java.lang.Integer
NB. This does not seem to be a plain JVM type-erasure issue, since I can find the type parameter of List. It looks like the Scala compiler throws away this type information only when the type parameter is one of the java.lang number types.
UPDATE:
I suspect this type information is available, due to the following experiment. Suppose I define:
class TestX{
def f(x:X):Unit = {
val floats:List[Float] = x.listPrimitive() // type mismatch error
}
}
and X.class is imported via a jar. The full type information must be available in X.class in order that this case correctly fails to compile.
UPDATE2:
Imagine you're writing a scala extension to a Java serialization library. You need to implement a:
def getSerializer(clz:Class[_]):Serializer
function that needs to do different things depending on whether:
clz==List[Int] (or equivalently: List[java.lang.Integer])
clz==List[Float] (or equivalently: List[java.lang.Float])
clz==List[MyClass]
My problem is that I will only ever see:
clz==List[Object]
clz==List[Object]
clz==List[MyClass]
because clz is provided to this function as clz.getMethods()(i).getGenericReturnType().
Starting with clz:Class[_] how can I recover the element type information that was lost?
It's not clear to me that TypeTag will help me, because its usage:
typeTag[T]
requires that I provide T (i.e. at compile time).
So, one path to a solution... given some clz: Class[_], can I determine the TypeTags of its methods' return types? Clearly this should be possible, as this information must be contained (somewhere) in the .class file for the Scala compiler to correctly generate the type mismatch error above.
At the Java bytecode level, Ints have to be represented as something else (here Object) because a List can only contain objects, not primitives. So that's all Java-level reflection can tell you. But the Scala type information is, as you infer, still present (at the bytecode level it's stored in an annotation, IIRC), so you should be able to inspect it with Scala reflection:
import scala.reflect.runtime.universe._
val list:List[Int] = List[Int](123)
def printTypeOf[A: TypeTag](a: A) = println(typeOf[A])
printTypeOf(list)
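// expected to print: List[Int]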
Response to UPDATE2: you should use Scala reflection to obtain a mirror, rather than working with the Class[_] object directly. You can go via the class name if need be:
import scala.reflect.runtime.universe._
val rm = runtimeMirror(getClass.getClassLoader)
val someClass: Class[_] = ...
val scalaMirrorOfClass = rm.staticClass(someClass.getName)
// note: rm.reflectClass expects a ClassSymbol rather than a java.lang.Class, hence going via the name
val someObject: Any = ...
val scalaMirrorOfObject = rm.reflect(someObject)
I guess if you really only have the class, you could create a classloader that only loads that class? I can't imagine a use case where you wouldn't have the class, or even a value, though.
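For the UPDATE2 scenario, a rough sketch (Scala 2.11+ reflection API; the helper name is hypothetical) of recovering the Scala-level return types of a class's getters starting from a plain Class[_]:
import scala.reflect.runtime.universe._

def scalaReturnTypes(clz: Class[_]): Unit = {
  val rm = runtimeMirror(clz.getClassLoader)
  val classSym = rm.staticClass(clz.getName)
  // walk the declared getters and print their (un-erased) Scala return types
  classSym.toType.decls
    .collect { case m: MethodSymbol if m.isGetter => m }
    .foreach(m => println(s"${m.name}: ${m.returnType}"))
}

scalaReturnTypes(classOf[X])
// expected to print something like:
// listPrimitive: List[Int]
// listX: List[X]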
I'm trying to port my application to Scala 2.10.0-M2. I'm seeing some nice improvements, with better warnings from the compiler. But I also got a bunch of errors, all related to mapping over Enumeration.values.
I'll give you a simple example. I'd like to have an enumeration, pre-create a bunch of objects, and build a map that uses the enumeration values as keys and some matching objects as values. For example:
object Phrase extends Enumeration {
type Phrase = Value
val PHRASE1 = Value("My phrase 1")
val PHRASE2 = Value("My phrase 2")
}
class Entity(text:String)
object Test {
val myMapWithPhrases = Phrase.values.map(p => (p -> new Entity(p.toString))).toMap
}
Now this used to work just fine on Scala 2.8 and 2.9, but 2.10.0-M2 gives me the following error:
[ERROR] common/Test.scala:21: error: diverging implicit expansion for type scala.collection.generic.CanBuildFrom[common.Phrase.ValueSet,(common.Phrase.Value, common.Entity),That]
[INFO] starting with method newCanBuildFrom in object SortedSet
[INFO] val myMapWithPhrases = Phrase.values.map(p => (p -> new Entity(p.toString))).toMap
^
What's causing this and how do you fix it?
It's basically a type mismatch error. You can work around it by first converting it to a list:
scala> Phrase.values.toList.map(p => (p, new Entity(p.toString))).toMap
res15: scala.collection.immutable.Map[Phrase.Value,Entity] = Map(My phrase 1 -> Entity#d0e999, My phrase 2 -> Entity#1987acd)
For more information, see the answers to What's a “diverging implicit expansion” scalac message mean? and What is a diverging implicit expansion error?
As you can see from your error, the ValueSet that holds the enums became a SortedSet at some point. It wants to produce a SortedSet on map, but can't sort on your Entity.
Something like this works with case class Entity:
implicit object orderingOfEntity extends Ordering[Entity] {
def compare(e1: Entity, e2: Entity) = e1.text compare e2.text
}
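Putting it together, a rough sketch (assuming Entity is made a case class so that text is accessible):
case class Entity(text: String)

implicit object orderingOfEntity extends Ordering[Entity] {
  def compare(e1: Entity, e2: Entity) = e1.text compare e2.text
}

// with an Ordering[Entity] in scope, the tuple ordering can be derived, the
// SortedSet produced by map no longer diverges, and toMap works as before:
val myMapWithPhrases: Map[Phrase.Value, Entity] =
  Phrase.values.map(p => (p, Entity(p.toString))).toMap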