Akka Streams UnsupportedOperationException when creating a Source from a Graph - Scala

I am trying to connect a stream with n subflows. To do so, I build a Source from the outlet of a Broadcast, but it throws an UnsupportedOperationException: cannot replace the shape of the EmptyModule. I tried to google this exception, but I wasn't able to find anything similar.
Here is my code:
val aggFlow = Flow.fromGraph(GraphDSL.create() { implicit builder =>
  val broadcast = builder.add(Broadcast[MonitoringMetricEvent](2))
  val bc = builder.add(Broadcast[Long](1))
  val zip = builder.add(ZipWith[StreamMeasurement, Long, (StreamMeasurement, Long)]((value, ewma) => (value, ewma)))
  val merge = builder.add(Merge[Seq[StreamMeasurement]](1))

  broadcast.out(1) ~> identityFlow ~> maxFlow ~> bc

  val source = Source.fromGraph(GraphDSL.create() { implicit bl =>
    SourceShape(bc.out(0))
  })

  broadcast.out(0) ~> identityFlow ~> topicFlow.groupBy(MAX_SUB_STREAMS, _._1)
    .map(_._2)
    .zip[Long](source)
    .takeWhile(deciderFunction)
    .map(_._1)
    .fold[Seq[StreamMeasurement]](Seq.empty[StreamMeasurement])((seq, sm) => seq :+ sm)
    .mergeSubstreams ~> merge

  FlowShape(broadcast.in, merge.out)
})
and here is the exception I get:
Exception in thread "main" java.lang.ExceptionInInitializerError
at xxx$.main(Processor.scala:80)
at xxx.Processor.main(Processor.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.UnsupportedOperationException: cannot replace the shape of the EmptyModule
at akka.stream.impl.StreamLayout$EmptyModule$.replaceShape(StreamLayout.scala:322)
at akka.stream.scaladsl.GraphApply$class.create(GraphApply.scala:18)
at akka.stream.scaladsl.GraphDSL$.create(Graph.scala:801)
at xxx.logic$$anonfun$22.apply(logic.scala:156)
at xxx.logic$$anonfun$22.apply(logic.scala:146)
at akka.stream.scaladsl.GraphApply$class.create(GraphApply.scala:17)
at akka.stream.scaladsl.GraphDSL$.create(Graph.scala:801)
at xxx.logic$.<init>(logic.scala:146)
at xxx.logic$.<clinit>(logic.scala)
... 7 more
The key to the problem can be found here: akka-stream Zipping Flows with SubFlows
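For context, a minimal sketch of why the EmptyModule exception is thrown: GraphDSL.create fails when the shape returned from a builder block refers only to ports of stages that were never added through that builder, which is the case for bc.out(0) above (it belongs to the outer builder). The second source below is only an illustration of a well-formed Source.fromGraph; the Source(1L to 100L) stage is a placeholder, not part of the original code.

// Throws "cannot replace the shape of the EmptyModule": nothing was added via `bl`,
// so the inner graph is empty and its shape cannot be replaced.
val broken = Source.fromGraph(GraphDSL.create() { implicit bl =>
  SourceShape(bc.out(0)) // outlet owned by a different builder
})

// A well-formed Source.fromGraph only references stages added through its own builder:
val ok = Source.fromGraph(GraphDSL.create() { implicit bl =>
  val nums = bl.add(Source(1L to 100L)) // placeholder stage
  SourceShape(nums.out)
})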

Related

Trying an alternative to Spark broadcast variable because of NullPointerException ONLY in Local mode

I was running my Spark app in cluster mode and everything went well. Now I need to do some tests on my local installation (on my laptop), and I get a NullPointerException on the following line:
val brdVar = spark.sparkContext.broadcast(rdd.collectAsMap())
EDIT: This is the full stacktrace:
Exception in thread "main" java.lang.NullPointerException
at learner.LearnCh$.learn(LearnCh.scala:81)
at learner.Learner.runLearningStage(Learner.scala:166)
at learner.Learner.run(Learner.scala:29)
at Driver$.runTask(Driver.scala:26)
at Driver$.main(Driver.scala:19)
at Driver.main(Driver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have been reading a lot, and I couldn't find an answer to my problem (EDIT: I'm using def main(args: Array[String]): Unit = ...). The use case for this brdVar is to get a numerical id value from a string one:
val newRdd: RDD[(Long, Map[Byte, Int])] = origRdd.mapPartitions { partition => partition.map(r => (r.idString, r)) }
  .aggregateByKey // this line doesn't affect my problem ....
  .mapPartitions { partition => partition.map { case (idString, listIndexes) => (brdVar.value(idString), .....) } }
So, in order to move on and not get stuck with broadcast in local mode, I changed my approach: I wanted to simulate brdVar by saving its data in a file, then reading it and looking up the key through a function, i.e. replacing brdVar.value(idString) with getNumericalID(id). To do so, I've written this function:
def getNumericalID(strID: String): Long = {
  val pathToRead = ....
  val file = spark.sparkContext.textFile(pathToRead)
  val process = file.map { line =>
    val l = line.split(",")
    (l(0), l(1))
  }.filter(e => e._1 == strID).collect()
  process(0)._2.toLong
}
But I'm still getting a NullPointerException, this time on the val file = ... line. I've checked, and the file has content. I think maybe I'm misunderstanding something; any ideas?
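For what it's worth, here is a rough sketch of how the lookup could be built once on the driver as a plain Map, assuming the id table is small enough to collect; whether this addresses the NullPointerException depends on where getNumericalID is actually being called from, which is not shown above. Names such as pathToRead and origRdd are taken from the question, the rest is illustrative.

// Sketch: collect the (stringId, numericId) pairs once on the driver,
// then use the resulting Map inside transformations as a plain value.
val idMap: Map[String, Long] =
  spark.sparkContext.textFile(pathToRead)
    .map { line =>
      val l = line.split(",")
      (l(0), l(1).toLong)
    }
    .collectAsMap()
    .toMap

val newRdd = origRdd.map(r => (idMap(r.idString), r)) // illustrative use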

How to use flatmap in a Play framework controller with SparkContext?

I have a web app using Play 2.6, Scala 2.11 and Spark 2.2.0.
I am getting the exception org.apache.spark.SparkException: Task not serializable when I execute a flatMap transformation on a variable. I know that I have to implement Serializable in some class, but I don't know the best practice for doing that.
The exception happens on the line var namesRdd = names.flatMap(parseNames). If I make MyController extend Serializable I get another error: class invalid for deserialization. So I suppose that this is not the solution.
Does anyone know how to serialize a controller so it can use the SparkContext and flatMap?
class SparkMarvelController @Inject()(cc: ControllerComponents) extends AbstractController(cc) with I18nSupport {

  def mostPopularSuperHero() = Action { implicit request: Request[AnyContent] =>
    val sparkContext = SparkCommons.sparkSession.sparkContext // got sparkContext

    var names = sparkContext
      .textFile("resource/marvel/Marvel-names.txt") // build up a hero ID - name RDD
    var namesRdd = names.flatMap(parseNames)

    val mostPopularHero = sparkContext
      .textFile("resource/marvel/Marvel-graph.txt") // build up superhero co-appearance data
      .map(countCoOccurrences)      // convert to (hero ID, number of connections) RDD
      .reduceByKey((x, y) => x + y) // combine entries that span more than one line
      .map(x => (x._2, x._1))       // flip it to (number of connections, hero ID)
      .max                          // find the max connections

    // Look up the name (lookup returns an array of results, so we need to access the first result with (0))
    val mostPopularHeroName = namesRdd.lookup(mostPopularHero._2)(0)

    Ok(s"The most popular superhero is [$mostPopularHeroName] with [${mostPopularHero._1}] co-appearances.")
  }

  // Function to extract the hero ID and number of connections from each line
  def countCoOccurrences(line: String) = {
    // regex to split on any kind of whitespace in the line
    val elements = line.split("\\s+")
    (elements(0).toInt, elements.length - 1)
  }

  // Function to extract hero ID -> hero name tuples (or None in case of failure)
  def parseNames(line: String): Option[(Int, String)] = {
    var fields = line.split('\"')
    if (fields.length > 1) Some(fields(0).trim.toInt, fields(1))
    else None
  }
}
error:
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[ClassNotFoundException: controllers.SparkMarvelController$$anonfun$mostPopularSuperHero$1$$anonfun$2]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:255)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:180)
at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:311)
at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:309)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
Caused by: java.lang.ClassNotFoundException: controllers.SparkMarvelController$$anonfun$mostPopularSuperHero$1$$anonfun$2
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:429)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
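As a general note (a sketch only, and not necessarily the cause of the ClassNotFoundException above): one commonly suggested way to avoid Task not serializable with closures defined on a Play controller is to move the functions passed to Spark into a standalone object, so the RDD closures do not capture the controller instance. The object name here is made up.

// Sketch: parsing helpers in a top-level object; flatMap/map then reference
// MarvelParsers instead of capturing the (non-serializable) controller.
object MarvelParsers extends Serializable {

  def countCoOccurrences(line: String): (Int, Int) = {
    val elements = line.split("\\s+")
    (elements(0).toInt, elements.length - 1)
  }

  def parseNames(line: String): Option[(Int, String)] = {
    val fields = line.split('\"')
    if (fields.length > 1) Some((fields(0).trim.toInt, fields(1)))
    else None
  }
}

// In the controller action:
// val namesRdd = names.flatMap(MarvelParsers.parseNames)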

Scala Case Class serialization

I've been trying to binary-serialize a composite case class object, and it kept throwing a weird exception. I don't really understand what is wrong with this example, which throws the exception below. I used to get that exception for circular references, which is not the case here. Any hints, please?
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field com.Table.rows of type scala.collection.immutable.List in instance of com.Table
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at com.TestSeri$.serializeBinDeserialise(TestSeri.scala:37)
at com.TestSeri$.main(TestSeri.scala:22)
at com.TestSeri.main(TestSeri.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Here is the code:
import java.io._
import scalax.file.Path

case class Row(name: String)
case class Table(rows: List[Row])
case class Cont(docs: Map[String, Table])

case object TestSeri {
  def main(args: Array[String]) {
    val cc = Cont(docs = List(
      "1" -> Table(rows = List(Row("r1"), Row("r2"))),
      "2" -> Table(rows = List(Row("r301"), Row("r31"), Row("r32")))
    ).toMap)

    val tt = Table(rows = List(Row("r1"), Row("r2")))

    val ttdes = serializeBinDeserialize(tt)
    println(ttdes == tt)

    val ccdes = serializeBinDeserialize(cc)
    println(ccdes == cc)
  }

  def serializeBinDeserialize[T](payload: T): T = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(payload)
    val bis = new ByteArrayInputStream(bos.toByteArray)
    val in = new ObjectInputStream(bis)
    in.readObject().asInstanceOf[T]
  }
}
Replacing List with Array fixed the problem.
In my original problem I had a Map, which I replaced with a TreeMap.
I think this is likely related to the serialization-proxy pattern used by the generic immutable List and Map, mentioned here:
https://issues.scala-lang.org/browse/SI-9237.
Can't believe I wasted a full day on this.
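For illustration, a sketch of the workaround described above, swapping the collection types exactly as the answer suggests (Array for List, TreeMap for Map):

import scala.collection.immutable.TreeMap

case class Row(name: String)
case class Table(rows: Array[Row])            // Array instead of List
case class Cont(docs: TreeMap[String, Table]) // TreeMap instead of Map

// Note: Array uses reference equality, so `ttdes == tt` will no longer be true;
// compare the contents with ttdes.rows.sameElements(tt.rows) instead.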

Spark Hadoop Failed to get broadcast

I'm running a spark-submit job and receiving a "Failed to get broadcast_58_piece0..." error. I'm really not sure what I'm doing wrong. Am I overusing UDFs? Too complicated a function?
As a summary of my objective, I am parsing text from pdfs, which are stored as base64 encoded strings in JSON objects. I'm using Apache Tika to get the text, and trying to make copious use of data frames to make things easier.
I had written a piece of code that ran the text extraction through Tika as a function outside of "main", on the data as an RDD, and that worked flawlessly. When I try to bring the extraction into main as a UDF on data frames, though, it breaks in various ways. Before I got here I was actually trying to write the final data frame as:
valid.toJSON.saveAsTextFile(hdfs_dir)
This was giving me all sorts of "File/Path already exists" headaches.
Current code:
object Driver {
  def main(args: Array[String]): Unit = {
    val hdfs_dir = args(0)
    val spark_conf = new SparkConf().setAppName("Spark Tika HDFS")
    val sc = new SparkContext(spark_conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // load json data into dataframe
    val df = sqlContext.read.json("hdfs://hadoophost.com:8888/user/spark/data/in/*")

    val extractInfo: (Array[Byte] => String) = (fp: Array[Byte]) => {
      val parser: Parser = new AutoDetectParser()
      val handler: BodyContentHandler = new BodyContentHandler(Integer.MAX_VALUE)
      val config: TesseractOCRConfig = new TesseractOCRConfig()
      val pdfConfig: PDFParserConfig = new PDFParserConfig()
      val inputstream: InputStream = new ByteArrayInputStream(fp)
      val metadata: Metadata = new Metadata()
      val parseContext: ParseContext = new ParseContext()
      parseContext.set(classOf[TesseractOCRConfig], config)
      parseContext.set(classOf[PDFParserConfig], pdfConfig)
      parseContext.set(classOf[Parser], parser)
      parser.parse(inputstream, handler, metadata, parseContext)
      handler.toString
    }

    val extract_udf = udf(extractInfo)

    val df2 = df.withColumn("unbased_media", unbase64($"media_file")).drop("media_file")
    val dfRenamed = df2.withColumn("media_corpus", extract_udf(col("unbased_media"))).drop("unbased_media")

    val depuncter: (String => String) = (corpus: String) => {
      val r = corpus.replaceAll("""[\p{Punct}]""", "")
      val s = r.replaceAll("""[0-9]""", "")
      s
    }

    val depuncter_udf = udf(depuncter)

    val withoutPunct = dfRenamed.withColumn("sentence", depuncter_udf(col("media_corpus")))

    val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("hdfs://hadoophost.com:8888/user/spark/hawkeye-nb-ml-v2.0").first()

    val with_predictions = model.transform(withoutPunct)

    val fullNameChecker: ((String, String, String, String, String) => String) = (fname: String, mname: String, lname: String, sfx: String, text: String) => {
      val newtext = text.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_fname = fname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_mname = mname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_lname = lname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_sfx = sfx.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val name_full = new_fname.concat(new_mname).concat(new_lname).concat(new_sfx)
      val c = name_full.r.findAllIn(newtext).length
      c match {
        case 0 => "N"
        case _ => "Y"
      }
    }

    val fullNameChecker_udf = udf(fullNameChecker)

    val stringChecker: ((String, String) => String) = (term: String, text: String) => {
      val termLower = term.replaceAll("""[\p{Punct}]""", "").toLowerCase
      val textLower = text.replaceAll("""[\p{Punct}]""", "").toLowerCase
      val c = termLower.r.findAllIn(textLower).length
      c match {
        case 0 => "N"
        case _ => "Y"
      }
    }

    val stringChecker_udf = udf(stringChecker)

    val stringChecker2: ((String, String) => String) = (term: String, text: String) => {
      val termLower = term takeRight 4
      val textLower = text
      val c = termLower.r.findAllIn(textLower).length
      c match {
        case 0 => "N"
        case _ => "Y"
      }
    }

    val stringChecker2_udf = udf(stringChecker)

    val valids = with_predictions.withColumn("fname_valid", stringChecker_udf(col("first_name"), col("media_corpus")))
      .withColumn("lname_valid", stringChecker_udf(col("last_name"), col("media_corpus")))
      .withColumn("fname2_valid", stringChecker_udf(col("first_name_2"), col("media_corpus")))
      .withColumn("lname2_valid", stringChecker_udf(col("last_name_2"), col("media_corpus")))
      .withColumn("camt_valid", stringChecker_udf(col("chargeoff_amount"), col("media_corpus")))
      .withColumn("ocan_valid", stringChecker2_udf(col("original_creditor_account_nbr"), col("media_corpus")))
      .withColumn("dpan_valid", stringChecker2_udf(col("debt_provider_account_nbr"), col("media_corpus")))
      .withColumn("full_name_valid", fullNameChecker_udf(col("first_name"), col("middle_name"), col("last_name"), col("suffix"), col("media_corpus")))
      .withColumn("full_name_2_valid", fullNameChecker_udf(col("first_name_2"), col("middle_name_2"), col("last_name_2"), col("suffix_2"), col("media_corpus")))

    valids.write.mode(SaveMode.Overwrite).format("json").save(hdfs_dir)
  }
}
Full stack trace starting with error:
16/06/14 15:02:01 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 53, hdpd11n05.squaretwofinancial.com): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_58_piece0 of broadcast_58
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9$$anonfun$apply$7.apply(CountVectorizer.scala:222)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9$$anonfun$apply$7.apply(CountVectorizer.scala:221)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9.apply(CountVectorizer.scala:221)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9.apply(CountVectorizer.scala:218)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr43$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
... 8 more
Caused by: org.apache.spark.SparkException: Failed to get broadcast_58_piece0 of broadcast_58
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 25 more
I encountered a similar error.
It turned out to be caused by the broadcast usage in CountVectorizerModel. Here is the detailed cause in my case:
When model.transform() is called, the vocabulary is broadcast and implicitly saved in the model as an attribute, broadcastDic. Therefore, if the CountVectorizerModel is saved after calling model.transform(), the private var broadcastDic is saved as well. Unfortunately, in Spark a broadcast object is context-sensitive, meaning it is tied to the SparkContext it was created in. If that CountVectorizerModel is loaded in a different SparkContext, it will fail to find the previously saved broadcastDic.
So the solution is either to avoid calling model.transform() before saving the model, or to clone the model with model.copy().
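A rough sketch of the second option (cloning before persisting); trainedModel and the output path are placeholders, and ParamMap.empty simply means "copy with no parameter changes":

import org.apache.spark.ml.param.ParamMap

// Sketch: clone the trained model before saving it, so the persisted copy
// does not carry the broadcast state picked up by an earlier transform() call.
val cleanModel = trainedModel.copy(ParamMap.empty)
cleanModel.write.overwrite().save("hdfs://.../models/my-model") // placeholder path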
For anyone coming across this: it turns out the model I was loading was malformed. I found this out by using spark-shell in yarn-client mode and stepping through the code. Loading the model was fine, but running it against the dataframe (model.transform) threw errors about not finding a metadata directory.
I went back and found a good model, ran against that, and it worked fine. The code above is actually sound.

Writing a Tuple2Serializer for json4s

I am having trouble turning my data into nested JSON objects. My current data looks like this:
Map(23 -> {"errorCode":null,"runStatusId":null,"lakeHdfsPath":"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat","fieldIndex":23,"datasetFieldName":"TERM_MM","datasetFieldSum":0.0,"datasetFieldMin":0.0,"datasetFieldMax":0.0,"datasetFieldMean":0.0,"datasetFieldSigma":0.0,"datasetFieldNullCount":170544.0,"datasetFieldObsCount":0.0,"datasetFieldKurtosis":0.0,"datasetFieldSkewness":0.0,"frequencyDistribution":null,"id":null,"fieldType":"NUMBER"}, 32 -> {"errorCode":null,"runStatusId":null,"lakeHdfsPath":"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat","fieldIndex":32,"datasetFieldName":"ACCT_NBR","datasetFieldSum":0.0,"datasetFieldMin":0.0,"datasetFieldMax":0.0,"datasetFieldMean":0.0,"datasetFieldSigma":0.0,"datasetFieldNullCount":0.0,"datasetFieldObsCount":0.0,"datasetFieldKurtosis":0.0,"datasetFieldSkewness":0.0,"frequencyDistribution":"(6393050780099594,56810)","id":null,"fieldType":"STRING"} etc. etc.
When I run it through:
def jsonClob(json: scala.collection.mutable.Map[Int, String]): String = {
  implicit val formats = org.json4s.DefaultFormats
  val A = Serialization.write(json)
  A
}
I get the following error:
Exception in thread "main" scala.MatchError: (23,{"errorCode":null,"fieldIndex":23,"datasetFieldObsCount":0.0,"datasetFieldKurtosis":0.0,"datasetFieldSkewness":0.0,"frequencyDistribution":null,"runStatusId":null,"lakeHdfsPath":"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat","datasetFieldName":"TERM_MM","datasetFieldSum":0.0,"datasetFieldMin":0.0,"datasetFieldMax":0.0,"datasetFieldMean":0.0,"datasetFieldSigma":0.0,"datasetFieldNullCount":170544.0,"id":null,"fieldType":"NUMBER"}) (of class scala.Tuple2)
at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:132)
at org.json4s.Extraction$.decomposeWithBuilder(Extraction.scala:67)
at org.json4s.Extraction$.decompose(Extraction.scala:194)
at org.json4s.jackson.Serialization$.write(Serialization.scala:22)
at com.capitalone.dts.toolset.jsonWrite$.jsonClob(jsonWrite.scala:17)
at com.capitalone.dts.dq.profiling.DQProfilingEngine.profile(DQProfilingEngine.scala:264)
at com.capitalone.dts.dq.profiling.Profiler$.main(DQProfilingEngine.scala:64)
at com.capitalone.dts.dq.profiling.Profiler.main(DQProfilingEngine.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am taking advice from another post I created, but I'm having no luck with a custom serializer. So far my code looks like this, but I'm completely lost on it:
class Tuple2Serializer extends CustomSerializer[(Int, String)](format => (
  {
    case JObject(JField(JInt(k), v)) => (k, v)
  },
  {
    case (t: Int, s: String) => (t -> s)
  }
))
Edit:
I have it working now thanks to the comment, but the output contains these \ escapes. I'm not sure why, or how to remove them without ruining the JSON.
Example:
\"errorCode\":null,\"runStatusId\":null,\"lakeHdfsPath\":\"/user/jmblnvr/20140817_011500_zoot_kohls_offer_init.dat\",\"fieldIndex\":45,\"datasetFieldName\":\"PRESENTABLE_FLAG\"