Writing a Tuple2Serializer for json4s - scala

I am having trouble turning my data into nested JSON objects. My current data looks like this:
Map(23 -> {"errorCode":null,"runStatusId":null,"lakeHdfsPath":"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat","fieldIndex":23,"datasetFieldName":"TERM_MM","datasetFieldSum":0.0,"datasetFieldMin":0.0,"datasetFieldMax":0.0,"datasetFieldMean":0.0,"datasetFieldSigma":0.0,"datasetFieldNullCount":170544.0,"datasetFieldObsCount":0.0,"datasetFieldKurtosis":0.0,"datasetFieldSkewness":0.0,"frequencyDistribution":null,"id":null,"fieldType":"NUMBER"}, 32 -> {"errorCode":null,"runStatusId":null,"lakeHdfsPath":"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat","fieldIndex":32,"datasetFieldName":"ACCT_NBR","datasetFieldSum":0.0,"datasetFieldMin":0.0,"datasetFieldMax":0.0,"datasetFieldMean":0.0,"datasetFieldSigma":0.0,"datasetFieldNullCount":0.0,"datasetFieldObsCount":0.0,"datasetFieldKurtosis":0.0,"datasetFieldSkewness":0.0,"frequencyDistribution":"(6393050780099594,56810)","id":null,"fieldType":"STRING"} etc. etc.
When I run it through:
import org.json4s.jackson.Serialization

def jsonClob(json: scala.collection.mutable.Map[Int, String]): String = {
  implicit val formats = org.json4s.DefaultFormats
  Serialization.write(json)
}
I get the following error:
Exception in thread "main" scala.MatchError: (23,{"errorCode":null,"fieldIndex":23,"datasetFieldObsCount":0.0,"datasetFieldKurtosis":0.0,"datasetFieldSkewness":0.0,"frequencyDistribution":null,"runStatusId":null,"lakeHdfsPath":"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat","datasetFieldName":"TERM_MM","datasetFieldSum":0.0,"datasetFieldMin":0.0,"datasetFieldMax":0.0,"datasetFieldMean":0.0,"datasetFieldSigma":0.0,"datasetFieldNullCount":170544.0,"id":null,"fieldType":"NUMBER"}) (of class scala.Tuple2)
at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:132)
at org.json4s.Extraction$.decomposeWithBuilder(Extraction.scala:67)
at org.json4s.Extraction$.decompose(Extraction.scala:194)
at org.json4s.jackson.Serialization$.write(Serialization.scala:22)
at com.capitalone.dts.toolset.jsonWrite$.jsonClob(jsonWrite.scala:17)
at com.capitalone.dts.dq.profiling.DQProfilingEngine.profile(DQProfilingEngine.scala:264)
at com.capitalone.dts.dq.profiling.Profiler$.main(DQProfilingEngine.scala:64)
at com.capitalone.dts.dq.profiling.Profiler.main(DQProfilingEngine.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am following advice from another post I created, but am having no luck with a custom serializer. So far my code looks like this, but I'm completely lost on it:
class Tuple2Serializer extends CustomSerializer[(Int, String)](format => (
  {
    case JObject(JField(JInt(k), v)) => (k, v)
  },
  {
    case (t: Int, s: String) => (t -> s)
  }
))
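For comparison, here is a sketch of a variant that at least lines up with json4s's AST shapes (assuming each tuple is represented as a one-field object; note that JObject wraps a List of fields, and field names are plain Strings rather than JInts):

import org.json4s._

class Tuple2Serializer extends CustomSerializer[(Int, String)](format => (
  {
    // Deserialize: match a one-field object like {"23": "..."}
    case JObject(List(JField(k, JString(v)))) => (k.toInt, v)
  },
  {
    // Serialize: emit the pair as a one-field object
    case (k: Int, v: String) => JObject(List(JField(k.toString, JString(v))))
  }
))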
Edit:
I have it working now thanks to the comment, but the output contains escaped quotes (\"), and I'm not sure why, or how to remove them without ruining the JSON.
Example:
\"errorCode\":null,\"runStatusId\":null,\"lakeHdfsPath\":\"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat\",\"fieldIndex\":45,\"datasetFieldName\":\"PRESENTABLE_FLAG\"

Related

How to use Using (with Source.fromFile) and handle error

This is a follow up question to https://stackoverflow.com/a/55440851/2691976
I have the following code:
import scala.io.Source
import scala.util.Using

object Problem {
  def main(args: Array[String]): Unit = {
    Using(Source.fromFile("thisfileexists.txt")) { source =>
      println(1 / 1)
      println(1 / 0)
    }
  }
}
Running it with scala3, it prints just a single line and no error:
scala3 test.scala
1
I am expecting an error like the following,
Exception in thread "main" java.lang.ArithmeticException: / by zero
at Problem$.main(test.scala:10)
at Problem.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at dotty.tools.scripting.ScriptingDriver.compileAndRun(ScriptingDriver.scala:42)
at dotty.tools.scripting.Main$.main(Main.scala:43)
at dotty.tools.MainGenericRunner$.run$1(MainGenericRunner.scala:230)
at dotty.tools.MainGenericRunner$.main(MainGenericRunner.scala:239)
at dotty.tools.MainGenericRunner.main(MainGenericRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at coursier.bootstrap.launcher.a.a(Unknown Source)
at coursier.bootstrap.launcher.Launcher.main(Unknown Source)
So why does it not print an error when I use Using (which I suspect is causing the problem here)?
And what is the solution, so I can use both Using and Source.fromFile and still surface a potential error?
I have read the Using Scala 2 doc and Scala 3 doc, but they don't say anything about errors.
In case this is important, I am on a Mac:
scala3 --version
Scala code runner version 3.1.2-RC1-bin-20211213-8e1054e-NIGHTLY-git-8e1054e -- Copyright 2002-2021, LAMP/EPFL
That's because Using returns a Try, as you can see here:
https://www.scala-lang.org/api/2.13.x/scala/util/Using$.html#apply[R,A](resource:=%3ER)(f:R=%3EA)(implicitevidence$1:scala.util.Using.Releasable[R]):scala.util.Try[A]
You can use .fold, .get, pattern matching, etc. For example:
import scala.io.Source
import scala.util.Using

object Problem {
  def main(args: Array[String]): Unit = {
    Using(Source.fromFile("thisfileexists.txt")) { source =>
      println(1 / 1)
      println(1 / 0)
    }.get
  }
}
Or as follows:
import scala.io.Source
import scala.util.{Failure, Success, Using}

object Problem {
  def main(args: Array[String]): Unit = {
    Using(Source.fromFile("thisfileexists.txt")) { source =>
      println(1 / 1)
      println(1 / 0)
    } match {
      case Success(res) => println("Do something on success")
      case Failure(ex)  => println(s"Failed with ex: ${ex.getMessage}")
    }
  }
}
You can read more about Try in the Scala docs:
https://www.scala-lang.org/api/2.13.x/scala/util/Try.html
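And since .fold was mentioned above but not shown, here is a minimal sketch of that option under the same assumptions (same file name):

import scala.io.Source
import scala.util.Using

object Problem {
  def main(args: Array[String]): Unit = {
    // fold forces both branches to be handled: the first function maps
    // the failure, the second maps the success.
    val message = Using(Source.fromFile("thisfileexists.txt")) { source =>
      source.getLines().size
    }.fold(ex => s"Failed with ex: ${ex.getMessage}", n => s"Read $n lines")
    println(message)
  }
}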

Trying an alternative to Spark broadcast variable because of NullPointerException ONLY in Local mode

I was running my Spark app in cluster mode and everything went well. Now I need to do some tests in my local installation (on my laptop), and I get a NullPointerException on the following line:
val brdVar = spark.sparkContext.broadcast(rdd.collectAsMap())
EDIT: This is the full stacktrace:
Exception in thread "main" java.lang.NullPointerException
at learner.LearnCh$.learn(LearnCh.scala:81)
at learner.Learner.runLearningStage(Learner.scala:166)
at learner.Learner.run(Learner.scala:29)
at Driver$.runTask(Driver.scala:26)
at Driver$.main(Driver.scala:19)
at Driver.main(Driver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have been reading a lot, but I couldn't find the answer to my problem (EDIT: I'm using def main(args: Array[String]): Unit = ...). The use case for this brdVar is to get a numerical id value from a string one:
val newRdd: RDD[(Long, Map[Byte, Int])] = origRdd
  .mapPartitions { partition => partition.map(r => (r.idString, r)) }
  .aggregateByKey // this line doesn't affect my problem ....
  .mapPartitions { partition => partition.map { case (idString, listIndexes) => (brdVar.value(idString), .....) } }
So, in order to move on and not get stuck on broadcast in local mode, I changed my approach: I wanted to simulate brdVar by saving its data in a file, then reading it and looking up the key through a function, replacing brdVar.value(idString) with getNumericalID(idString). To do so, I've written this function:
def getNumericalID(strID: String): Long = {
  val pathToRead = ....
  val file = spark.sparkContext.textFile(pathToRead)
  val process = file.map { line =>
    val l = line.split(",")
    (l(0), l(1))
  }.filter(e => e._1 == strID).collect()
  process(0)._2.toLong
}
But I'm still getting a NullPointerException, this time on the val file = .... line. I've checked, and the file has content. I think maybe I'm misunderstanding something; any ideas?
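For what it's worth, if getNumericalID is being called from inside an RDD transformation, it runs on the executors, where the SparkContext is not available. A common pattern (a sketch under that assumption, not necessarily the fix here) is to read the lookup file once on the driver into a plain local Map:

// Sketch: read the id file once on the driver and keep it as a plain
// Map, so no SparkContext call happens inside a task. pathToRead and
// the comma-separated format are the same as in the function above.
val idMap: Map[String, Long] =
  spark.sparkContext.textFile(pathToRead)
    .map { line =>
      val l = line.split(",")
      (l(0), l(1).toLong)
    }
    .collect()
    .toMap

def getNumericalID(strID: String): Long = idMap(strID)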

How to use flatmap in a Play framework controller with SparkContext?

I have a web app using Play 2.6, Scala 2.11 and Spark 2.2.0.
I am getting the exception org.apache.spark.SparkException: Task not serializable when I execute a flatMap transformation on some variable. I know that I have to make some class implement Serializable, but I don't know the best practice for doing that.
The exception happens on the line var namesRdd = names.flatMap(parseNames). If I mark MyController with Serializable, I get another error: class invalid for deserialization. So I suppose that is not the solution.
Does anyone know how to serialize a controller so it can use the SparkContext and flatMap?
class SparkMarvelController @Inject()(cc: ControllerComponents) extends AbstractController(cc) with I18nSupport {

  def mostPopularSuperHero() = Action { implicit request: Request[AnyContent] =>
    val sparkContext = SparkCommons.sparkSession.sparkContext // got sparkContext
    var names = sparkContext
      .textFile("resource/marvel/Marvel-names.txt") // build up a hero ID - name RDD
    var namesRdd = names.flatMap(parseNames)
    val mostPopularHero = sparkContext
      .textFile("resource/marvel/Marvel-graph.txt") // build up superhero co-appearance data
      .map(countCoOccurrences) // convert to (hero ID, number of connections) RDD
      .reduceByKey((x, y) => x + y) // combine entries that span more than one line
      .map(x => (x._2, x._1)) // flip it to (number of connections, hero ID)
      .max // find the max connections
    // Look up the name (lookup returns an array of results, so we need to access the first result with (0))
    val mostPopularHeroName = namesRdd.lookup(mostPopularHero._2)(0)
    Ok(s"The most popular superhero is [$mostPopularHeroName] with [${mostPopularHero._1}] co-appearances.")
  }

  // Function to extract the hero ID and number of connections from each line
  def countCoOccurrences(line: String) = {
    // regex to split on any kind of whitespace in the line
    val elements = line.split("\\s+")
    (elements(0).toInt, elements.length - 1)
  }

  // function to extract hero ID -> hero name tuples (or None in case of failure)
  def parseNames(line: String): Option[(Int, String)] = {
    var fields = line.split('\"')
    if (fields.length > 1) return Some(fields(0).trim.toInt, fields(1))
    else return None
  }
}
error:
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[ClassNotFoundException: controllers.SparkMarvelController$$anonfun$mostPopularSuperHero$1$$anonfun$2]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:255)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:180)
at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:311)
at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:309)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
Caused by: java.lang.ClassNotFoundException: controllers.SparkMarvelController$$anonfun$mostPopularSuperHero$1$$anonfun$2
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:429)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
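One common workaround for Task not serializable in this situation (a sketch, not from the original post) is to move the functions referenced inside RDD transformations out of the controller and into a standalone serializable object, so the closures no longer capture the controller instance:

// Hypothetical helper object: holding the parsing functions outside the
// controller means Spark only serializes this small object, not the
// Play controller that references the SparkContext.
object MarvelParsers extends Serializable {
  // hero ID -> hero name (or None if the line has no quoted name)
  def parseNames(line: String): Option[(Int, String)] = {
    val fields = line.split('\"')
    if (fields.length > 1) Some((fields(0).trim.toInt, fields(1))) else None
  }

  // (hero ID, number of connections) from one co-appearance line
  def countCoOccurrences(line: String): (Int, Int) = {
    val elements = line.split("\\s+")
    (elements(0).toInt, elements.length - 1)
  }
}

// Inside the controller action, reference the object instead:
//   val namesRdd = names.flatMap(MarvelParsers.parseNames)
//   ... .map(MarvelParsers.countCoOccurrences) ...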

Scala Case Class serialization

I've been trying to binary-serialize a composite case class object, and it kept throwing a weird exception. I don't really understand what is wrong with this example, which throws the following exception. I used to get that exception for circular references, which is not the case here. Any hints, please?
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field com.Table.rows of type scala.collection.immutable.List in instance of com.Table
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at com.TestSeri$.serializeBinDeserialise(TestSeri.scala:37)
at com.TestSeri$.main(TestSeri.scala:22)
at com.TestSeri.main(TestSeri.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Here is the code:
import java.io._

case class Row(name: String)
case class Table(rows: List[Row])
case class Cont(docs: Map[String, Table])

case object TestSeri {
  def main(args: Array[String]) {
    val cc = Cont(docs = List(
      "1" -> Table(rows = List(Row("r1"), Row("r2"))),
      "2" -> Table(rows = List(Row("r301"), Row("r31"), Row("r32")))
    ).toMap)
    val tt = Table(rows = List(Row("r1"), Row("r2")))
    val ttdes = serializeBinDeserialize(tt)
    println(ttdes == tt)
    val ccdes = serializeBinDeserialize(cc)
    println(ccdes == cc)
  }

  def serializeBinDeserialize[T](payload: T): T = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(payload)
    val bis = new ByteArrayInputStream(bos.toByteArray)
    val in = new ObjectInputStream(bis)
    in.readObject().asInstanceOf[T]
  }
}
Replacing List with Array fixed the problem.
In my original problem I had a Map, which I replaced with a TreeMap.
I think it is likely related to the serialization-proxy pattern used by the generic immutable List and Map, mentioned here:
https://issues.scala-lang.org/browse/SI-9237
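For illustration, the workaround applied to the example above (a sketch; note that Array uses reference equality, so the == comparisons on the case classes behave differently after this change):

import scala.collection.immutable.TreeMap

// Same shape as before, with Array in place of List and TreeMap in
// place of Map, which avoids the List/Map serialization proxies.
case class Row(name: String)
case class Table(rows: Array[Row])
case class Cont(docs: TreeMap[String, Table])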
Can't believe I wasted a full day on this.

Akka-stream UnsupportedOperationException by creating a Source from Graph

I am trying to connect a stream to n subflows. To do so, I build a Source from the outlet of a broadcast, but it throws an UnsupportedOperationException: cannot replace the shape of the EmptyModule. I tried to google this exception, but I wasn't able to find anything similar.
Here is my code:
val aggFlow = Flow.fromGraph(GraphDSL.create() { implicit builder =>
  val broadcast = builder.add(Broadcast[MonitoringMetricEvent](2))
  val bc = builder.add(Broadcast[Long](1))
  val zip = builder.add(ZipWith[StreamMeasurement, Long, (StreamMeasurement, Long)]((value, ewma) => (value, ewma)))
  val merge = builder.add(Merge[Seq[StreamMeasurement]](1))

  broadcast.out(1) ~> identityFlow ~> maxFlow ~> bc

  val source = Source.fromGraph(GraphDSL.create() { implicit bl =>
    SourceShape(bc.out(0))
  })

  broadcast.out(0) ~> identityFlow ~> topicFlow.groupBy(MAX_SUB_STREAMS, _._1)
    .map(_._2)
    .zip[Long](source)
    .takeWhile(deciderFunction)
    .map(_._1)
    .fold[Seq[StreamMeasurement]](Seq.empty[StreamMeasurement])((seq, sm) => seq :+ sm)
    .mergeSubstreams ~> merge

  FlowShape(broadcast.in, merge.out)
})
And here is the exception I get:
Exception in thread "main" java.lang.ExceptionInInitializerError
at xxx$.main(Processor.scala:80)
at xxx.Processor.main(Processor.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.UnsupportedOperationException: cannot replace the shape of the EmptyModule
at akka.stream.impl.StreamLayout$EmptyModule$.replaceShape(StreamLayout.scala:322)
at akka.stream.scaladsl.GraphApply$class.create(GraphApply.scala:18)
at akka.stream.scaladsl.GraphDSL$.create(Graph.scala:801)
at xxx.logic$$anonfun$22.apply(logic.scala:156)
at xxx.logic$$anonfun$22.apply(logic.scala:146)
at akka.stream.scaladsl.GraphApply$class.create(GraphApply.scala:17)
at akka.stream.scaladsl.GraphDSL$.create(Graph.scala:801)
at xxx.logic$.<init>(logic.scala:146)
at xxx.logic$.<clinit>(logic.scala)
... 7 more
The key problem can be found here: akka-stream Zipping Flows with SubFlows
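As for the exception itself: the inner GraphDSL.create never adds any stage through its own builder (bl is unused), so it produces an empty module, and SourceShape(bc.out(0)) points at a port owned by the outer builder. When create tries to assign that shape to the empty module, Akka throws cannot replace the shape of the EmptyModule. An outlet of one graph cannot be wrapped as a standalone Source this way; it has to be wired inside the graph that owns it.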