Scala Pickling: macro implementation not found - scala

I'm new to Scala (2.10) and currently working on a POC to store some data in HBase. To store the data I'm trying to use the scala-pickling library to serialise my case classes into a binary format:
"org.scala-lang.modules" %% "scala-pickling" % "0.10.1"
I have these two simple classes:
case class Location(source: Source,
                    country: String,
                    region: String,
                    metro: String,
                    postalcode: String)
And
case class Source(name: String,
                  trust: Float,
                  created: String) {

  /** compares this Source with the other source and returns the difference in their trust levels */
  def compare(other: Source): Float = {
    trust - other.trust
  }

  /** returns whether you should prefer this source (true) or the other source (false) */
  def prefer(other: Source): Boolean = {
    trust >= other.trust
  }
}

object Source {
  def apply(name: String, trust: Float) = new Source(name, trust, DateTime.now.toString)

  def apply(row: Row) = {
    val name = row.getAs[String](0)
    val trust = row.getAs[Float](1)
    val created = row.getAs[String](2)
    new Source(name, trust, created)
  }
}
And I'm testing out the serialisation using a ScalaTest class:
import scala.pickling._
import binary._

class DebuggingSpec extends UnitSpec {

  "debugging" should "Allow the serialisation and deserialisation of a Link class" in {
    val loc = new Location(Source("Source1", 1), "UK", "Wales", "Cardiff", "CF99 1PP")
    val bytes = loc.pickle
    bytes.value.length should not be(0)
  }

  it should "Allow the serialisation and deserialisation of a Location class" in {
    val link = Link(Source("Source1", 1), "MyId1", 3)
    val bytes = link.pickle
    bytes.value.length should not be(0)
  }
}
But when I compile this inside IntelliJ or on the command line via sbt package, I get the following error message:
Error:(12, 9) macro implementation not found: pickle (the most common reason for that is that you cannot use macro implementations in the same compilation run that defines them)
val bytes = loc.pickle
^
EDIT: I've run this code successfully in the spark-shell (1.3.1) and it will happily pickle and unpickle these classes... but identical code and imports produce an error when compiling
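For reference, the 0.10.x releases of scala-pickling import their picklers slightly differently from earlier versions, so it may be worth double-checking the imports against the library's README. A minimal sketch of that import style (whether it resolves the macro error depends on the build setup):

import scala.pickling.Defaults._
import scala.pickling.binary._

case class Person(name: String, age: Int)
val bytes = Person("foo", 20).pickle.value // Array[Byte]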

Related

How to infer StructType schema for Spark Scala at run time given a Fully Qualified Name of a case class

For a few days I have been wondering whether it is possible to infer a schema for Spark in Scala for a given case class that is unknown at compile time.
The only input is a string containing the FQN of the class (which could be used, for example, to create an instance of the case class at runtime via reflection).
I was wondering whether it is possible to do something like this:
package com.my.namespace

case class MyCaseClass(name: String, num: Int)

// Somewhere else in the codebase
// coming from an external configuration file, so unknown at compile time
val fqn = "com.my.namespace.MyCaseClass"
val schema = Encoders.product[ getXYZ( fqn ) ].schema
Of course, any other technique that does not use Encoders is fine (building a StructType by analysing an instance of the case class? Is that even possible?).
What is the best approach?
Is it feasible at all?
You can use the reflective toolbox:
package com.my.namespace

import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox

case class MyCaseClass(name: String, num: Int)

object Main extends App {
  val fqn = "com.my.namespace.MyCaseClass"
  val runtimeMirror = runtime.currentMirror
  val toolbox = runtimeMirror.mkToolBox()
  val res = toolbox.eval(toolbox.parse(
    s"""
      import org.apache.spark.sql.Encoders
      Encoders.product[$fqn].schema
    """)).asInstanceOf[StructType]
  println(res) // StructType(StructField(name,StringType,true), StructField(num,IntegerType,false))
}
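A schema obtained this way can then be used like any other StructType, for example when reading untyped data (assuming a SparkSession named spark; the path is hypothetical):

// apply the runtime-derived schema instead of letting Spark infer one
val df = spark.read.schema(res).json("/path/to/data.json")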

Spark unable to find encoder(case class) although providing it

Trying to figure out why I'm getting an error about encoders; any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also provided an encoder as a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
  case class SolrNewsDocument(
    title: String,
    body: String,
    publication: String,
    date: String,
    byline: String,
    length: String
  )
  import spark.implicits._
  val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
    val solrDoc = new SolrInputDocument
    solrDoc.setField("title", doc.title.toString)
    solrDoc.setField("body", doc.body)
    solrDoc.setField("publication", doc.publication)
    solrDoc.setField("date", doc.date)
    solrDoc.setField("byline", doc.byline)
    solrDoc.setField("length", doc.length)
    solrDoc
  }
  // can be used for stream SolrSupport.
  SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd)
  val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
  solrServer.setDefaultCollection("collection")
  solrServer.commit(false, false)
}
Check this one: move the case class declaration before (outside of) the function declaration. The encoder can only be derived once the case class definition has been processed by the compiler; only then can the compiler use that encoder inside the function declaration.
import spark.implicits._

case class SolrNewsDocument(title: String, body: String, publication: String,
                            date: String, byline: String, length: String)

def ingestDocsToSolr(newsItemDF: DataFrame) = {
  val solrDocs = newsItemDF.as[SolrNewsDocument]
}
I got this error trying to iterate over a text file, and in my case, as of Spark 2.4.x, the issue was that I had to convert it to an RDD first (that conversion used to be implicit):
textFile
  .rdd
  .flatMap(line => line.split(" "))
Migrating our Scala codebase to Spark 2
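A minimal self-contained version of the .rdd workaround above (the input path and the SparkSession name are assumptions):

// spark.read.textFile returns a Dataset[String]; converting it to an RDD first
// avoids the need for an implicit Encoder on the result of the transformation
val textFile = spark.read.textFile("/path/to/input.txt")
val words = textFile.rdd.flatMap(line => line.split(" "))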

How to unit-test a class is serializable for spark?

I just found a bug in a class's serialization in Spark.
Now I want to write a unit test for it, but I don't see how.
Notes:
the failure happens in a (de)serialized object which has been broadcast.
I want to test exactly what Spark will do, to assert it will work once deployed
the class to serialize is a standard class (not case class) which extends Serializer
Looking into the Spark broadcast code, I found a way. It uses private Spark code, so it may become invalid if Spark changes internally, but it works.
Add a test class in a package starting with org.apache.spark, such as:
package org.apache.spark.my_company_tests

// [imports]

/**
 * test data that needs to be broadcast in spark (using kryo)
 */
class BroadcastSerializationTests extends FlatSpec with Matchers {

  it should "serialize a transient val, which should be lazy" in {
    val data = new MyClass(42) // data to test
    val conf = new SparkConf()

    // Serialization
    // code found in TorrentBroadcast.(un)blockifyObject, which is used by TorrentBroadcastFactory
    val blockSize = 4 * 1024 * 1024 // 4 MB
    val out = new ChunkedByteBufferOutputStream(blockSize, ByteBuffer.allocate)
    val ser = new KryoSerializer(conf).newInstance() // Here I test using KryoSerializer; you can use JavaSerializer too
    val serOut = ser.serializeStream(out)
    Utils.tryWithSafeFinally { serOut.writeObject(data) } { serOut.close() }

    // Deserialization
    val blocks = out.toChunkedByteBuffer.getChunks()
    val in = new SequenceInputStream(blocks.iterator.map(new ByteBufferInputStream(_)).asJavaEnumeration)
    val serIn = ser.deserializeStream(in)
    val data2 = Utils.tryWithSafeFinally { serIn.readObject[MyClass]() } { serIn.close() }

    // run test on data2
    data2.yo shouldBe data.yo
  }
}

class MyClass(i: Int) extends Serializable {
  @transient val yo = 1 to i // add lazy to make the test pass: non-lazy transient vals are not recomputed after deserialization
}
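If you only need a quick check that a class is serializable at all, without reproducing Spark's broadcast path, a plain Java serialization round-trip is a lighter alternative. This is only a sketch and exercises JavaSerializer-style semantics, not Kryo:

import java.io._

// Serialize and deserialize through plain Java serialization.
// Catches java.io.NotSerializableException early, but does not cover Kryo.
def roundTrip[T <: Serializable](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  try out.writeObject(obj) finally out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

// usage: val copy = roundTrip(someInstance); then assert on copy's fields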

How to unit test HBase in Spark streaming scala

I am trying to unit test doSomethingRdd, which needs to read some reference data from HBase inside an RDD transformation.
def doSomethingRdd(in: DStream[String]): DStream[String] = {
  in.map { i =>
    val cell = HBaseUtil.getCell("myTable", "myRowKey", "myFamily", "myColumn")
    i + cell.getOrElse("")
  }
}

object HBaseUtil {
  def getCell(tableName: String, rowKey: String, columnFamily: String, column: String): Option[String] = {
    val hbaseConn = ConnectionPool.getConnection()
    // the rest of the code will use hbaseConn
    // to get an HBase cell and convert it to a String
  }
}
I read this Cloudera article, but I have some problems with their recommended methods.
The first thing I tried was using ScalaMock to mock the HBaseUtil.getCell method so I could bypass the HBase connection. I also applied the workaround for mocking a singleton object suggested by this article, and updated my code a bit as shown below. However, doSomethingRdd failed because the mocked hbaseUtil is not serializable, which is also explained by Paul Butcher in his reply.
def doSomethingRdd(in: DStream[String], hbaseUtil: HBaseUtilBody): DStream[String] = {
  in.map { i =>
    val cell = hbaseUtil.getCell("myTable", "myRowKey", "myFamily", "myColumn")
    i + cell.getOrElse("")
  }
}

trait HBaseUtilBody {
  def getCell(tableName: String, rowKey: String, columnFamily: String, column: String): Option[String] = {
    val hbaseConn = ConnectionPool.getConnection()
    // the rest of the code will use hbaseConn
    // to get an HBase cell and convert it to a String
  }
}

object HBaseUtil extends HBaseUtilBody
I think getting data from HBase in an RDD transformation must be a very common pattern, but I'm not sure how to unit test it without connecting to a real HBase instance.
In 2020, with HBase 2.x, we use hbase-testing-util. Simply add it to your sbt build file:
// https://mvnrepository.com/artifact/org.apache.hbase/hbase-testing-util
libraryDependencies += "org.apache.hbase" % "hbase-testing-util" % "2.2.3" % Test
And then establish a connection like this:
import org.apache.hadoop.hbase.HBaseTestingUtility

val utility = new HBaseTestingUtility
utility.startMiniCluster() // defaults to 1 master, 1 region server and 1 data node
val connection = utility.getConnection()
Starting the MiniCluster actually starts
MiniDFSCluster
MiniZKCluster, and
MiniHBaseCluster
In case you need to add some specific configuration (e.g. security settings) you can add hbase-site.xml to your resources.
For more information refer to section Integration Testing with an HBase Mini-Cluster in HBase Reference Guide.
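To make that concrete, here is a rough sketch of a test that seeds the mini-cluster with the cell the question reads back (the table, family and column names are taken from the question; the value and the assertion style are illustrative):

import org.apache.hadoop.hbase.{HBaseTestingUtility, TableName}
import org.apache.hadoop.hbase.client.{Get, Put}
import org.apache.hadoop.hbase.util.Bytes

val utility = new HBaseTestingUtility
utility.startMiniCluster()

// create the table the code under test expects and seed the reference cell
val table = utility.createTable(TableName.valueOf("myTable"), Bytes.toBytes("myFamily"))
table.put(new Put(Bytes.toBytes("myRowKey"))
  .addColumn(Bytes.toBytes("myFamily"), Bytes.toBytes("myColumn"), Bytes.toBytes("refValue")))

// read it back (the code under test would do this through HBaseUtil.getCell)
val result = table.get(new Get(Bytes.toBytes("myRowKey")))
val value = Bytes.toString(result.getValue(Bytes.toBytes("myFamily"), Bytes.toBytes("myColumn")))
// value == "refValue"

utility.shutdownMiniCluster()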

How do I provide basic configuration for a Scala application?

I am working on a small GUI application written in Scala. There are a few settings that the user will set in the GUI, and I want them to persist between program executions. Basically I want a scala.collection.mutable.Map that automatically persists to a file when modified.
This seems like it must be a common problem, but I have been unable to find a lightweight solution. How is this problem typically solved?
I do a lot of this, and I use .properties files (it's idiomatic in Java-land). I keep my config pretty straightforward by design, though. If you have nested config constructs you might want a different format like YAML (if humans are the main authors) or JSON or XML (if machines are the authors).
Here's some example code for loading props, manipulating them as a Scala Map, then saving them as .properties again:
import java.io._
import java.util._
import scala.collection.JavaConverters._
val f = new File("test.properties")
// test.properties:
// foo=bar
// baz=123
val props = new Properties
// Note: in real code make sure all these streams are
// closed carefully in try/finally
val fis = new InputStreamReader(new FileInputStream(f), "UTF-8")
props.load(fis)
fis.close()
println(props) // {baz=123, foo=bar}
val map = props.asScala // Get to Scala Map via JavaConverters
map("foo") = "42"
map("quux") = "newvalue"
println(map) // Map(baz -> 123, quux -> newvalue, foo -> 42)
println(props) // {baz=123, quux=newvalue, foo=42}
val fos = new OutputStreamWriter(new FileOutputStream(f), "UTF-8")
props.store(fos, "")
fos.close()
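As the comment in the snippet says, real code should close the streams carefully. A small helper for the load step might look like this:

// load a .properties file with the reader closed in try/finally
def loadProps(file: File): Properties = {
  val props = new Properties
  val reader = new InputStreamReader(new FileInputStream(file), "UTF-8")
  try props.load(reader) finally reader.close()
  props
}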
Here's an example of using XML and a case class for reading a config. A real class can be nicer than a map. (You could also do what sbt and at least one other project do: take the config as Scala source and compile it in; saving it is less automatic. Or treat it as a REPL script. I haven't googled, but someone must have done that.)
Here's the simpler code.
This version uses a case class:
import scala.xml.Node

case class PluginDescription(name: String, classname: String) {
  def toXML: Node = {
    <plugin>
      <name>{name}</name>
      <classname>{classname}</classname>
    </plugin>
  }
}

object PluginDescription {
  def fromXML(xml: Node): PluginDescription = {
    // extract one field
    def getField(field: String): Option[String] = {
      val text = (xml \\ field).text.trim
      if (text == "") None else Some(text)
    }
    def extracted = {
      val name  = "name"
      val claas = "classname"
      val vs = Map(name -> getField(name), claas -> getField(claas))
      if (vs.values exists (_.isEmpty)) fail()
      else PluginDescription(name = vs(name).get, classname = vs(claas).get)
    }
    def fail() = throw new RuntimeException("Bad plugin descriptor.")
    // check the top-level tag
    xml match {
      case <plugin>{_*}</plugin> => extracted
      case _                     => fail()
    }
  }
}
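Hypothetical usage, round-tripping a descriptor through XML with the two methods above:

val pd = PluginDescription("my-plugin", "com.example.MyPlugin")
val xml = pd.toXML
val back = PluginDescription.fromXML(xml) // == pd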
This code reflectively calls the apply method of a case class. The use case is that fields missing from the config can be supplied by default arguments. There are no type conversions here; e.g., case class Config(foo: String = "bar").
// isn't it easier to write a quick loop to reflect the field names?
import scala.reflect.runtime.{currentMirror => cm, universe => ru}
import ru._

def fromXML(xml: Node): Option[PluginDescription] = {
  def extract[A]()(implicit tt: TypeTag[A]): Option[A] = {
    // extract one field
    def getField(field: String): Option[String] = {
      val text = (xml \\ field).text.trim
      if (text == "") None else Some(text)
    }
    val apply  = ru.newTermName("apply")
    val module = ru.typeOf[A].typeSymbol.companionSymbol.asModule
    val ts     = module.moduleClass.typeSignature
    val m      = (ts member apply).asMethod
    val im     = cm reflect (cm reflectModule module).instance
    val mm     = im reflectMethod m
    def getDefault(i: Int): Option[Any] = {
      val n = ru.newTermName("apply$default$" + (i + 1))
      val m = ts member n
      if (m == NoSymbol) None
      else Some((im reflectMethod m.asMethod)())
    }
    def extractArgs(pss: List[List[Symbol]]): List[Option[Any]] =
      pss.flatten.zipWithIndex map (p => getField(p._1.name.encoded) orElse getDefault(p._2))
    val args = extractArgs(m.paramss)
    if (args exists (!_.isDefined)) None
    else Some(mm(args.flatten: _*).asInstanceOf[A])
  }
  // check the top-level tag
  xml match {
    case <plugin>{_*}</plugin> => extract[PluginDescription]()
    case _                     => None
  }
}
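Hypothetical usage with the PluginDescription above (which declares no default arguments, so a missing field yields None):

val full = fromXML(<plugin><name>p</name><classname>com.example.P</classname></plugin>)
// full == Some(PluginDescription("p", "com.example.P"))
val partial = fromXML(<plugin><name>p</name></plugin>)
// partial == None: classname is missing and has no default argument to fall back on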
scala.xml.XML has loadFile and save; it's too bad there seems to be no one-liner for Properties.
$ scala
Welcome to Scala version 2.10.0-RC5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_06).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import reflect.io._
import reflect.io._
scala> import java.util._
import java.util._
scala> import java.io.{StringReader, File=>JFile}
import java.io.{StringReader, File=>JFile}
scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._
scala> val p = new Properties
p: java.util.Properties = {}
scala> p load new StringReader(
| (new File(new JFile("t.properties"))).slurp)
scala> p.asScala
res2: scala.collection.mutable.Map[String,String] = Map(foo -> bar)
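Going the other way is nearly as short if you stage the output through a StringWriter; a sketch in the same spirit (writeAll comes from reflect.io.File):

scala> val sw = new java.io.StringWriter
scala> p.store(sw, "")
scala> new File(new JFile("t.properties")).writeAll(sw.toString)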
As it all boils down to serializing a map / object to a file, your choices are:
classic serialization to Bytecode
serialization to XML
serialization to JSON (easy using Jackson, or Lift-JSON)
use of a properties file (ugly, no utf-8 support)
serialization to a proprietary format (ugly, why reinvent the wheel)
I suggest converting the Map to Properties and vice versa. "*.properties" files are the standard way of storing configuration in the Java world, so why not use them for Scala?
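A minimal sketch of that Map/Properties conversion (the helper names are illustrative):

import java.util.Properties
import scala.collection.JavaConverters._

// convert between an immutable Scala Map and java.util.Properties
def toProperties(m: Map[String, String]): Properties = {
  val p = new Properties
  m.foreach { case (k, v) => p.setProperty(k, v) }
  p
}

def toMap(p: Properties): Map[String, String] = p.asScala.toMap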
The common ways are *.properties and *.xml; since Scala supports XML natively, using an XML config is easier than in Java.