Scala Reflection exception during creation of DataSet in Spark - scala

I want to run a Spark job on Spark Jobserver.
During execution, I got the following exception.
Stack trace:
java.lang.RuntimeException: scala.ScalaReflectionException: class com.some.example.instrument.data.SQLMapping in JavaMirror with org.apache.spark.util.MutableURLClassLoader#55b699ef of type class org.apache.spark.util.MutableURLClassLoader with classpath [file:/app/spark-job-server.jar] and parent being sun.misc.Launcher$AppClassLoader#2e817b38 of type class sun.misc.Launcher$AppClassLoader with classpath [.../classpath jars/] not found.
  at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
  at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
  at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1$$typecreator15$1.apply(DataRetriever.scala:136)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
  at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
  at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
  at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
  at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1.apply(DataRetriever.scala:136)
  at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1.apply(DataRetriever.scala:135)
  at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
  at scala.util.Try$.apply(Try.scala:192)
  at scala.util.Success.map(Try.scala:237)
  at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
  at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
  at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
In DataRetriever I convert a simple case class to a Dataset.
The case class definitions:
case class SQLMapping(id: String,
                      it: InstrumentPrivateKey,
                      cc: Option[String],
                      ri: Option[SourceInstrumentId],
                      p: Option[SourceInstrumentId],
                      m: Option[SourceInstrumentId])

case class SourceInstrumentId(instrumentId: Long,
                              providerId: String)

case class InstrumentPrivateKey(instrumentId: Long,
                                providerId: String,
                                clientId: String)
The code that causes the problem:
import session.implicits._

def someFunc(future: Future[ID]): Dataset[SQLMapping] = {
  future.map { f =>
    val seq: Seq[SQLMapping] = getFromEndpoint(f)
    val ds: Dataset[SQLMapping] = seq.toDS()
    ...
  }
}
The job sometimes works, but if I re-run the job, it throws the exception.
Update 28.03.2018
I forgot to mention one detail that turns out to be important: the Dataset was constructed inside a Future.

Calling toDS() inside the future causes the ScalaReflectionException, so I decided to construct the Dataset outside future.map.
You can verify that the Dataset can't be constructed inside future.map with this example job:
package com.example.sparkapplications

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import scala.concurrent.Await
import scala.concurrent.Future
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import spark.jobserver.SparkJob
import spark.jobserver.SparkJobValid
import spark.jobserver.SparkJobValidation

object FutureJob extends SparkJob {
  override def runJob(sc: SparkContext,
                      jobConfig: Config): Any = {
    val session = SparkSession.builder().config(sc.getConf).getOrCreate()
    import session.implicits._

    val f = Future {
      val seq = Seq(
        Dummy("1", 1),
        Dummy("2", 2),
        Dummy("3", 3),
        Dummy("4", 4),
        Dummy("5", 5)
      )
      val ds = seq.toDS
      ds.collect()
    }

    Await.result(f, 10 seconds)
  }

  case class Dummy(id: String, value: Long)

  override def validate(sc: SparkContext,
                        config: Config): SparkJobValidation = SparkJobValid
}
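Roughly, the workaround I ended up with looks like the sketch below. It reuses the ID type, getFromEndpoint and the implicit session from the snippets above; the 10-second timeout is an arbitrary choice for illustration.

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.Dataset

// Sketch only: resolve the raw data inside the Future, but call toDS()
// on the calling thread, outside future.map.
def someFuncWorkaround(future: Future[ID])(implicit ec: ExecutionContext): Dataset[SQLMapping] = {
  import session.implicits._
  val seq: Seq[SQLMapping] = Await.result(future.map(f => getFromEndpoint(f)), 10.seconds)
  seq.toDS() // the encoder is derived here, not on a Future thread
}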
Later I will report whether the problem persists with Spark 2.3.0, and when the jar is passed directly via spark-submit.

Related

How to call batchWriteItemAsync to load DynamoDB in scala/spark

I am trying to run batchWriteItemAsync to load data into DynamoDB from Scala (Spark), but I get the error below when it runs. How do I call batchWriteItemAsync correctly in Scala?
Code:
import com.amazonaws.services.dynamodbv2.model._
import com.amazonaws.services.dynamodbv2.{AmazonDynamoDBAsyncClientBuilder, AmazonDynamoDBAsync, AmazonDynamoDBClientBuilder, AmazonDynamoDB}
import java.util.concurrent.Executors
import scala.collection.JavaConverters._ // needed for .asJava
import scala.concurrent.{ExecutionContext, Future}

val POOL_SIZE = 3

val client: AmazonDynamoDBAsync = AmazonDynamoDBAsyncClientBuilder.standard().withRegion(REGION).build()

// init thread pool
val jobExecutorPool = Executors.newFixedThreadPool(POOL_SIZE)
// create the implicit ExecutionContext based on our thread pool
implicit val xc: ExecutionContext = ExecutionContext.fromExecutorService(jobExecutorPool)

var contentSeq = ...
val batchWriteItems: java.util.List[WriteRequest] = contentSeq.asJava

def batchWriteFuture(tableName: String, batchWriteItems: java.util.List[WriteRequest])(implicit xc: ExecutionContext): Future[BatchWriteItemResult] = {
  client.batchWriteItemAsync(
    (new BatchWriteItemRequest()).withRequestItems(Map(tableName -> batchWriteItems).asJava)
  )
}

batchWriteFuture(tableName, batchWriteItems)
Error:
error: type mismatch;
found : java.util.concurrent.Future[com.amazonaws.services.dynamodbv2.model.BatchWriteItemResult]
required: scala.concurrent.Future[com.amazonaws.services.dynamodbv2.model.BatchWriteItemResult]
client.batchWriteItemAsync(
client.batchWriteItemAsync is a Java API, so it returns a java.util.concurrent.Future, while your method batchWriteFuture declares a return type of scala.concurrent.Future.
Try converting the result of your DynamoDB call to a Scala future, like this:
def batchWriteFuture(tableName: String, batchWriteItems: java.util.List[WriteRequest])(implicit xc: ExecutionContext): Future[BatchWriteItemResult] = {
  client.batchWriteItemAsync(
    (new BatchWriteItemRequest()).withRequestItems(Map(tableName -> batchWriteItems).asJava)
  ).asScala // <---------- converting to a Scala future
}
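If an asScala conversion for java.util.concurrent.Future isn't available in your setup, an alternative sketch (assuming the AWS SDK for Java v1, and reusing client, WriteRequest and the other imports from the question) is to use the AsyncHandler overload of batchWriteItemAsync and complete a Scala Promise from its callbacks:

import com.amazonaws.handlers.AsyncHandler
import scala.collection.JavaConverters._
import scala.concurrent.{Future, Promise}

// Sketch: bridge the SDK's callback-based async call to a scala.concurrent.Future.
def batchWriteScalaFuture(tableName: String,
                          batchWriteItems: java.util.List[WriteRequest]): Future[BatchWriteItemResult] = {
  val promise = Promise[BatchWriteItemResult]()
  client.batchWriteItemAsync(
    new BatchWriteItemRequest().withRequestItems(Map(tableName -> batchWriteItems).asJava),
    new AsyncHandler[BatchWriteItemRequest, BatchWriteItemResult] {
      override def onError(e: Exception): Unit = promise.failure(e)
      override def onSuccess(req: BatchWriteItemRequest, res: BatchWriteItemResult): Unit =
        promise.success(res)
    }
  )
  promise.future // a scala.concurrent.Future, no blocking on the caller's side
}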

Akka HTTP - max-open-requests and substreams?

I'm writing an app using Scala 2.13 with Akka HTTP 10.2.4 and Akka Stream 2.6.15. I'm trying to query a web service in a parallel manner, like so:
package com.example
import akka.actor.typed.scaladsl.ActorContext
import akka.http.scaladsl.Http
import akka.http.scaladsl.client.RequestBuilding.Get
import akka.http.scaladsl.model.HttpResponse
import akka.http.scaladsl.unmarshalling.Unmarshal
import akka.stream.scaladsl.{Flow, JsonFraming, Sink, Source}
import spray.json.DefaultJsonProtocol
import spray.json.DefaultJsonProtocol.jsonFormat2
import scala.util.Try
case class ClientStockPortfolio(id: Long, symbol: String)
case class StockTicker(symbol: String, price: Double)

trait SprayFormat extends DefaultJsonProtocol {
  implicit val stockTickerFormat = jsonFormat2(StockTicker)
}

class StockTrader(context: ActorContext[_]) extends SprayFormat {
  implicit val system = context.system.classicSystem

  val httpPool = Http().superPool()[Seq[ClientStockPortfolio]]

  def collectPrices() = {
    val src = Source(Seq(
      ClientStockPortfolio(1, "GOOG"),
      ClientStockPortfolio(2, "AMZN"),
      ClientStockPortfolio(3, "MSFT")
    ))

    val graph = src
      .groupBy(8, _.id % 8)
      .via(createPost)
      .via(httpPool)
      .via(decodeTicker)
      .mergeSubstreamsWithParallelism(8)
      .to(
        Sink.fold(0.0) { (totalPrice, ticker) =>
          insertIntoDatabase(ticker)
          totalPrice + ticker.price
        }
      )

    graph.run()
  }

  def createPost = Flow[ClientStockPortfolio]
    .grouped(10)
    .map { port =>
      (
        Get(uri = s"http://wherever/?symbols=${port.map(_.symbol).mkString(",")}"),
        port
      )
    }

  def decodeTicker = Flow[(Try[HttpResponse], Seq[ClientStockPortfolio])]
    .flatMapConcat { x =>
      x._1.get.entity.dataBytes
        .via(JsonFraming.objectScanner(Int.MaxValue))
        .mapAsync(4)(bytes => Unmarshal(bytes).to[StockTicker])
        .mapConcat { ticker =>
          lookupPreviousPrices(ticker)
        }
    }

  def lookupPreviousPrices(ticker: StockTicker): List[StockTicker] = ???

  def insertIntoDatabase(ticker: StockTicker) = ???
}
I have two questions. First, will the groupBy call that splits the stream into substreams run them in parallel like I want? And second, when I call this code, I run into the max-open-requests error, since I haven't increased the setting from the default. But even if I am running in parallel, I'm only running 8 threads - how is the Http().superPool() getting backed up with 32 requests?

Mongo codec for value classes with Macro

Is it possible to auto-derive codecs for value classes in the Mongo Scala driver?
Using the existing macros produces a StackOverflowError.
package test

import org.bson.codecs.configuration.CodecRegistries.{fromCodecs, fromProviders, fromRegistries}
import org.mongodb.scala.MongoClient.DEFAULT_CODEC_REGISTRY
import org.mongodb.scala.bson.codecs.Macros._
import org.mongodb.scala.{MongoClient, MongoCollection}

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

// Models
case class Name(value: String) extends AnyVal
case class Person(name: Name)

object TestValueClassCodecs extends App {
  private[this] val codecRegistry =
    fromRegistries(
      fromProviders(
        classOf[Person],
        classOf[Name],
      ),
      DEFAULT_CODEC_REGISTRY
    )

  protected val collection: MongoCollection[Person] =
    MongoClient(s"mongodb://localhost:27017")
      .getDatabase("TestDB")
      .withCodecRegistry(codecRegistry)
      .getCollection[Person]("test_repo_values_classes")

  val res = Await.result(
    collection.insertOne(Person(Name("Jesus"))).toFuture(),
    10.seconds
  )
}
Output:
Caused by: java.lang.StackOverflowError
at scala.collection.LinearSeqOptimized.length(LinearSeqOptimized.scala:54)
at scala.collection.LinearSeqOptimized.length$(LinearSeqOptimized.scala:51)
at scala.collection.immutable.List.length(List.scala:91)
at scala.collection.SeqLike.size(SeqLike.scala:108)
at scala.collection.SeqLike.size$(SeqLike.scala:108)
at scala.collection.AbstractSeq.size(Seq.scala:45)
at scala.collection.convert.Wrappers$IterableWrapperTrait.size(Wrappers.scala:25)
at scala.collection.convert.Wrappers$IterableWrapperTrait.size$(Wrappers.scala:25)
at scala.collection.convert.Wrappers$SeqWrapper.size(Wrappers.scala:66)
at java.util.AbstractCollection.toArray(AbstractCollection.java:136)
at java.util.ArrayList.<init>(ArrayList.java:178)
at org.bson.internal.ProvidersCodecRegistry.<init>(ProvidersCodecRegistry.java:34)
at org.bson.codecs.configuration.CodecRegistries.fromRegistries(CodecRegistries.java:126)
at org.mongodb.scala.bson.codecs.macrocodecs.MacroCodec.$init$(MacroCodec.scala:86)
at test.TestValueClassCodecs$$anon$1$$anon$3$NameMacroCodec$1.<init>(TestValueClassCodecs.scala:51)
at test.TestValueClassCodecs$$anon$1$$anon$3$NameMacroCodec$2$.apply(TestValueClassCodecs.scala:51)
at test.TestValueClassCodecs$$anon$1$$anon$3.get(TestValueClassCodecs.scala:51)
at org.bson.internal.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:45)
at org.bson.internal.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:45)
at org.bson.internal.ChildCodecRegistry.get(ChildCodecRegistry.java:58)
at org.bson.internal.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:45)
at org.bson.internal.ChildCodecRegistry.get(ChildCodecRegistry.java:58)
at org.bson.internal.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:45)
at org.bson.internal.ChildCodecRegistry.get(ChildCodecRegistry.java:58)
at org.bson.internal.ProvidersCodecRegistry.get(ProvidersCodecRegistry.java:45)
I am using version org.mongodb.scala:mongo-scala-bson_2.12:4.1.1.
If Name is not a value class, everything works just fine.

Apache Flink: ProcessWindowFunction implementation

I am trying to use a ProcessWindowFunction in my Apache Flink project using Scala. Unfortunately, I already fail at implementing a basic ProcessWindowFunction as it is used in the Apache Flink documentation.
This is my code:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.fiware.cosmos.orion.flink.connector.{NgsiEvent, OrionSource}
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.util.Collector
import scala.collection.TraversableOnce

object StreamingJob {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val eventStream = env.addSource(new OrionSource(9001))

    val processedDataStream = eventStream.flatMap(event => event.entities)
      .map(entity => (entity.id, entity.attrs("temperature").value.asInstanceOf[String]))
      .keyBy(_._1)
      .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
      .process(new MyProcessWindowFunction())

    env.execute("Socket Window NgsiEvent")
  }
}

private class MyProcessWindowFunction extends ProcessWindowFunction[(String, String), String, String, TimeWindow] {
  def process(key: String, context: Context, input: Iterable[(String, String)], out: Collector[String]): Unit = {
    var count: Int = 0
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}
From IntelliJ I get the following hints:
1) This is shown where the new class object is created:
Type mismatch, expected: ProcessWindowFunction[(String, String), NotInferedR, String, TimeWindow], actual: MyProcessWindowFunction
2) This is shown directly at the class:
Class 'MyProcessWindowFunction' must either be declared abstract or implement abstract member 'process(key:KEY, context:ProcessWindowFunction.Context, iterable:Iterable<IN>, collector:Collector<OUT>):void' in 'org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction'
Building the code shows me the following error:
Error:(51, 16) type mismatch;
found : org.apache.flink.MyProcessWindowFunction
required:
org.apache.flink.streaming.api.scala.function.ProcessWindowFunction[(String, String),?,String,org.apache.flink.streaming.api.windowing.windows.TimeWindow]
.process(new MyProcessWindowFunction())
I am grateful for any help.
After spending some time debugging with two more people, we finally managed to find the problem.
In my code I used the following import:
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction
But the correct import when using Scala seems to be:
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
// The Scala API's ProcessWindowFunction lives in:
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
// With this import, .process(new MyProcessWindowFunction()) resolves the
// type parameters [(String, String), String, String, TimeWindow] as expected.
// The official documentation doesn't make this clear; it may be a documentation bug.

Code working in Spark-Shell not in eclipse

I have a small piece of Scala code that works properly in the Spark shell but not in Eclipse with the Scala plugin. I can access HDFS using the plugin; I tried writing another file and it worked.
FirstSpark.scala
package bigdata.spark

import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object FirstSpark {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    val m = new Methods()
    val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
  }
}
Methods.scala
package bigdata.spark

import java.util.function.ToDoubleFunction

class Methods {
  def isHeader(s: String): Boolean = {
    s.contains("id_1")
  }

  def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }
}

case class MatchData(id1: Int, id2: Int,
                     scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this?
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map functions it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this, it needs to serialize m, since this object is referenced inside the function (it is in the closure of the two anonymous methods) - but it cannot do this since Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and is already Serializable).
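A minimal sketch of that suggestion, with the method bodies copied from Methods.scala above (only the declaration and the call sites change):

package bigdata.spark

// Methods is now an object: calls like Methods.isHeader resolve to the
// singleton at runtime on each executor, so the closures no longer have to
// capture and serialize an instance of Methods.
object Methods {
  def isHeader(s: String): Boolean = s.contains("id_1")

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble
}

// In FirstSpark.main, drop `val m = new Methods()` and call the object directly:
// val q = textFile.filter(x => !Methods.isHeader(x)).map(x => Methods.parse(x))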