How can I create a custom CellEncoder in kantan.csv (Scala)?

I have code that converts a list of case classes into a CSV string. I'm using kantan.csv, and when I try to build the encoder I get this error:
could not find implicit value for evidence parameter of type kantan.csv.CellEncoder[Option[javax.xml.datatype.XMLGregorianCalendar]]
Original date: 2020-08-13T21:52:27.000Z
This is my code:
import kantan.csv._
import kantan.csv.ops._
import kantan.csv.java8._
import kantan.csv.CellEncoder
val itemsList: List[ItemData] = getItems.getOrElse(Seq.empty[ItemData]).toList

implicit val itemEncoder: HeaderEncoder[ItemData] = HeaderEncoder.caseEncoder(
  "absolutePath", "creationDate", "displayName", "fileName", "lastModified",
  "lastModifier", "owner", "parentAbsolutePath", "typeValue"
)(ItemData.unapply _)

val csvItems: String = itemsList.asCsv(rfc.withHeader)
The case class:
case class ItemData(
  absolutePath: Option[String] = None,
  creationDate: Option[javax.xml.datatype.XMLGregorianCalendar] = None,
  displayName: Option[String] = None,
  fileName: Option[String] = None,
  lastModified: Option[javax.xml.datatype.XMLGregorianCalendar] = None,
  lastModifier: Option[String] = None,
  owner: Option[String] = None,
  parentAbsolutePath: Option[String] = None,
  typeValue: Option[String] = None
)
Dependencies:
lazy val `kantan-csv` = "com.nrinaudo" %% "kantan.csv" % Version.kantan
lazy val `kantan-csv-commons` = "com.nrinaudo" %% "kantan.csv-commons" % Version.kantan
lazy val `kantan-csv-generic` = "com.nrinaudo" %% "kantan.csv-generic" % Version.kantan
lazy val `kantan-csv-java8` = "com.nrinaudo" %% "kantan.csv-java8" % Version.kantan

I don't actually know much about javax.xml.datatype.XMLGregorianCalendar, so I'm not sure how you'd represent it as a string. This answer assumes it's done by calling toString (which, for XMLGregorianCalendar, returns the XML Schema lexical representation, e.g. 2020-08-13T21:52:27.000Z), but change that to whatever the correct way of doing so is.
You need to provide a CellEncoder[javax.xml.datatype.XMLGregorianCalendar]. This is documented, and fairly straightforward:
implicit val xmlCalendarEncoder: CellEncoder[javax.xml.datatype.XMLGregorianCalendar] = CellEncoder.from(_.toString)
kantan.csv should be able to work out the rest for you.
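For completeness, here's a minimal, self-contained sketch of how the pieces fit together (the two-field Item case class is made up just to keep the example short; the point is that once the single-value encoder is in scope, kantan.csv derives the CellEncoder[Option[XMLGregorianCalendar]] it was complaining about):

import javax.xml.datatype.{DatatypeFactory, XMLGregorianCalendar}
import kantan.csv._
import kantan.csv.ops._

// Hypothetical two-field case class, standing in for ItemData.
case class Item(name: Option[String], created: Option[XMLGregorianCalendar])

// The missing encoder: once this exists, kantan.csv derives the Option variant.
implicit val xmlCalendarEncoder: CellEncoder[XMLGregorianCalendar] =
  CellEncoder.from(_.toString)

implicit val itemEncoder: HeaderEncoder[Item] =
  HeaderEncoder.caseEncoder("name", "created")(Item.unapply _)

val calendar = DatatypeFactory.newInstance()
  .newXMLGregorianCalendar("2020-08-13T21:52:27.000Z")

val csv: String = List(Item(Some("a"), Some(calendar))).asCsv(rfc.withHeader)
// Should produce roughly:
// name,created
// a,2020-08-13T21:52:27.000Z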

Related

How to input and output a Seq of an object to a function in Scala

I want to parse a column to get split values using a Seq of an object.
case class RawData(rawId: String, rawData: String)
case class SplitData(
  rawId: String,
  rawData: String,
  split1: Option[Int],
  split2: Option[String],
  split3: Option[String],
  split4: Option[String]
)
def rawDataParser(unparsedRawData: Seq[RawData]): Seq[SplitData] = {
  unparsedRawData.map { rawData =>
    val split = rawData.rawData.split(", ")
    SplitData(
      rawId = rawData.rawId,
      rawData = rawData.rawData,
      split1 = Some(split(0).toInt),
      split2 = Some(split(1)),
      split3 = Some(split(2)),
      split4 = Some(split(3))
    )
  }
}
val rawDataDF = Seq[(String, String)](
  ("001", "Split1, Split2, Split3, Split4"),
  ("002", "Split1, Split2, Split3, Split4")
).toDF("rawId", "rawData")

val rawDataDS: Dataset[RawData] = rawDataDF.as[RawData]
I need to use the rawDataParser function to parse my raw data. However, the parameter to the function is of type Seq, and I am not sure how to pass rawDataDS as an input to the function. Some guidance on solving this would be appreciated.
Each Dataset is divided into partitions. You can use mapPartitions with a mapping Iterator[T] => Iterator[U] to convert a Dataset[T] into a Dataset[U].
So you can use your parser (addressParser in the snippet below) as the argument to mapPartitions; an adaptation to the question's RawData/SplitData types follows after it.
val rawAddressDataDS =
  spark.read
    .option("header", "true")
    .csv(csvFilePath)
    .as[AddressRawData]

val addressDataDS =
  rawAddressDataDS
    .map { rad =>
      AddressData(
        addressId = rad.addressId,
        address = rad.address,
        number = None,
        road = None,
        city = None,
        country = None
      )
    }
    .mapPartitions { unparsedAddresses =>
      addressParser(unparsedAddresses.toSeq).toIterator
    }
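Adapted to the types from the question, a sketch would look like the following (RawData, SplitData and rawDataParser are the definitions above; the SparkSession setup and sample rows are made up, and the first token is numeric because split1 is an Option[Int]):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val rawDataDS: Dataset[RawData] = Seq(
  RawData("001", "1, Split2, Split3, Split4"),
  RawData("002", "2, Split2, Split3, Split4")
).toDS()

// mapPartitions hands each partition to the Seq-based parser and
// collects the parsed rows back into a Dataset[SplitData].
val splitDataDS: Dataset[SplitData] =
  rawDataDS.mapPartitions { rows =>
    rawDataParser(rows.toSeq).iterator
  }

splitDataDS.show(false)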

Extending DefaultParamsReadable and DefaultParamsWritable not allowing reading of custom model

Good day,
I have been struggling for a few days to save a custom transformer that is part of a large pipeline of stages. I have a transformer that is completely defined by its params. I have an estimator which, in its fit method, will generate a matrix and then set the transformer parameters accordingly, so that I can use DefaultParamsReadable and DefaultParamsWritable to take advantage of the serialisation/deserialisation already present in util.ReadWrite.scala.
My summarised code is as follows (includes important aspects):
...
import org.apache.spark.ml.util._
...
// trait to implement in Estimator and Transformer for params
trait NBParams extends Params {
  final val featuresCol = new Param[String](this, "featuresCol", "The input column")
  setDefault(featuresCol, "_tfIdfOut")

  final val labelCol = new Param[String](this, "labelCol", "The labels column")
  setDefault(labelCol, "P_Root_Code_Index")

  final val predictionsCol = new Param[String](this, "predictionsCol", "The output column")
  setDefault(predictionsCol, "NBOutput")

  final val ratioMatrix = new Param[DenseMatrix](this, "ratioMatrix", "The transformation matrix")

  def getfeaturesCol: String = $(featuresCol)
  def getlabelCol: String = $(labelCol)
  def getPredictionCol: String = $(predictionsCol)
  def getRatioMatrix: DenseMatrix = $(ratioMatrix)
}
// Estimator
class CustomNaiveBayes(override val uid: String, val alpha: Double)
  extends Estimator[CustomNaiveBayesModel] with NBParams with DefaultParamsWritable {

  def copy(extra: ParamMap): CustomNaiveBayes = defaultCopy(extra)

  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
  def setLabelCol(value: String): this.type = set(labelCol, value)
  def setPredictionCol(value: String): this.type = set(predictionsCol, value)
  def setRatioMatrix(value: DenseMatrix): this.type = set(ratioMatrix, value)

  override def transformSchema(schema: StructType): StructType = {...}

  override def fit(ds: Dataset[_]): CustomNaiveBayesModel = {
    ...
    val model = new CustomNaiveBayesModel(uid)
    model
      .setRatioMatrix(ratioMatrix)
      .setFeaturesCol($(featuresCol))
      .setLabelCol($(labelCol))
      .setPredictionCol($(predictionsCol))
  }
}
// companion object for Estimator
object CustomNaiveBayes extends DefaultParamsReadable[CustomNaiveBayes]{
override def load(path: String): CustomNaiveBayes = super.load(path)
}
// Transformer
class CustomNaiveBayesModel(override val uid: String)
  extends Model[CustomNaiveBayesModel] with NBParams with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("customnaivebayes"))

  def copy(extra: ParamMap): CustomNaiveBayesModel = defaultCopy(extra)

  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
  def setLabelCol(value: String): this.type = set(labelCol, value)
  def setPredictionCol(value: String): this.type = set(predictionsCol, value)
  def setRatioMatrix(value: DenseMatrix): this.type = set(ratioMatrix, value)

  override def transformSchema(schema: StructType): StructType = {...}

  def transform(dataset: Dataset[_]): DataFrame = {...}
}
// companion object for Transformer
object CustomNaiveBayesModel extends DefaultParamsReadable[CustomNaiveBayesModel]
When I add this model as part of a pipeline and fit the pipeline, everything runs OK. When I save the pipeline, there are no errors. However, when I attempt to load the pipeline back in, I get the following error:
NoSuchMethodException: $line3b380bcad77e4e84ae25a6bfb1f3ec0d45.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$6fa979eb27fa6bf89c6b6d1b271932c$$$$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$CustomNaiveBayesModel.read()
To save the pipeline, which includes a number of other transformers related to NLP pre-processing, I run
fittedModelRootCode.write.save("path")
and to then load it (where the failure occurs) I run
import org.apache.spark.ml.PipelineModel
val fittedModelRootCode = PipelineModel.load("path")
The model itself appears to be working well, but I cannot afford to retrain it on a dataset every time I wish to use it. Does anyone have any idea why, even with the companion object, the read() method appears to be unavailable?
Notes:
I am running on Databricks Runtime 8.3 (Spark 3.1.1, Scala 2.12)
My model is in a separate package so is external to Spark
I have reproduced this based on a number of existing examples, all of which appear to work fine, so I am unsure why my code is failing
I am aware there is a Naive Bayes model available in Spark ML; however, I have been tasked with making a large number of customizations, so it is not worth modifying the existing version (plus I would like to learn how to get this right)
Any help would be greatly appreciated.
Since your CustomNaiveBayesModel companion object extends DefaultParamsReadable, I think you should use that companion object, CustomNaiveBayesModel, for loading the model. Here is some code I wrote for saving and loading models, and it works properly:
import org.apache.spark.SparkConf
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SparkSession
import path.to.CustomNaiveBayesModel

object SavingModelApp extends App {
  val spark: SparkSession = SparkSession.builder().config(
    new SparkConf()
      .setMaster("local[*]")
      .setAppName("Test app")
      .set("spark.driver.host", "localhost")
      .set("spark.ui.enabled", "false")
  ).getOrCreate()

  val training = spark.createDataFrame(Seq(
    (0L, "a b c d e spark", 1.0),
    (1L, "b d", 0.0),
    (2L, "spark f g h", 1.0),
    (3L, "hadoop mapreduce", 0.0)
  )).toDF("id", "text", "label")

  val fittedModelRootCode: PipelineModel =
    new Pipeline().setStages(Array(new CustomNaiveBayesModel())).fit(training)

  fittedModelRootCode.write.save("path/to/model")

  val mod = PipelineModel.load("path/to/model")
}
I think your mistake is using PipelineModel.load for loading the concrete model.
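If you do need to load the custom stage on its own rather than as part of a pipeline, a minimal sketch (the path is hypothetical, pointing at wherever that single stage was written) is to go through its DefaultParamsReadable companion object:

import path.to.CustomNaiveBayesModel

// DefaultParamsReadable gives the companion object a load(path) method,
// so the concrete transformer is read back directly, not via PipelineModel.load.
val model: CustomNaiveBayesModel =
  CustomNaiveBayesModel.load("path/to/custom-naive-bayes-model")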
My environment:
scalaVersion := "2.12.6"
scalacOptions := Seq(
"-encoding", "UTF-8", "-target:jvm-1.8", "-deprecation",
"-feature", "-unchecked", "-language:implicitConversions", "-language:postfixOps")
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.1.1"

Scala: how to parameterize a case class, and pass the case class as a variable to [T <: Product: TypeTag]

// class definition of RsGoods schema
case class RsGoods(add_time: Int)
// my operation
originRDD.toDF[Schemas.RsGoods]()
// and the function definition
def toDF[T <: Product: TypeTag](): DataFrame = mongoSpark.toDF[T]()
Now I have defined too many schemas (RsGoods1, RsGoods2, RsGoods3), and more will be added in the future.
So the question is: how do I pass a case class as a variable to structure the code?
Attached sbt dependencies:
"org.apache.spark" % "spark-core_2.11" % "2.3.0",
"org.apache.spark" %% "spark-sql" % "2.3.0",
"org.mongodb.spark" %% "mongo-spark-connector" % "2.3.1",
Attached is the key code snippet:
var originRDD = MongoSpark.load(sc, readConfig)

val df = table match {
  case "rs_goods_multi"          => originRDD.toDF[Schemas.RsGoodsMulti]()
  case "rs_goods"                => originRDD.toDF[Schemas.RsGoods]()
  case "ma_item_price"           => originRDD.toDF[Schemas.MaItemPrice]()
  case "ma_siteuid"              => originRDD.toDF[Schemas.MaSiteuid]()
  case "pi_attribute"            => originRDD.toDF[Schemas.PiAttribute]()
  case "pi_attribute_name"       => originRDD.toDF[Schemas.PiAttributeName]()
  case "pi_attribute_value"      => originRDD.toDF[Schemas.PiAttributeValue]()
  case "pi_attribute_value_name" => originRDD.toDF[Schemas.PiAttributeValueName]()
}
From what I have understood of your requirement, I think the following should be a decent starting point.
def readDataset[A: Encoder](
  spark: SparkSession,
  mongoUrl: String,
  collectionName: String,
  clazz: Class[A]
): Dataset[A] = {
  val config = ReadConfig(
    Map("uri" -> s"$mongoUrl.$collectionName")
  )
  val df = MongoSpark.load(spark, config)
  val fieldNames = clazz.getDeclaredFields.map(f => f.getName).dropRight(1).toList
  val dfWithMatchingFieldNames = df.toDF(fieldNames: _*)
  dfWithMatchingFieldNames.as[A]
}
You can use it like this,
case class RsGoods(add_time: Int)

val spark: SparkSession = ...
import spark.implicits._

val rsGoodsDS = readDataset[RsGoods](
  spark,
  "mongodb://example.com/database",
  "rs_goods",
  classOf[RsGoods]
)
Also, the following two lines,
val fieldNames = clazz.getDeclaredFields.map(f => f.getName).dropRight(1).toList
val dfWithMatchingFieldNames = df.toDF(fieldNames: _*)
are only required because Spark normally reads DataFrames with column names like value1, value2, .... So we want to change the column names to match what we have in our case class.
I am not sure what these "default" column names will be, because MongoSpark is involved.
You should first check the column names in the df created as follows:
val config = ReadConfig(
  Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
If MongoSpark avoids the problem of these "default" column names and picks up the column names from your collection, then those two lines will not be required and your method becomes just this:
def readDataset[A: Encoder](
  spark: SparkSession,
  mongoUrl: String,
  collectionName: String
): Dataset[A] = {
  val config = ReadConfig(
    Map("uri" -> s"$mongoUrl.$collectionName")
  )
  val df = MongoSpark.load(spark, config)
  df.as[A]
}
And,
val rsGoodsDS = readDataset[RsGoods](
  spark,
  "mongodb://example.com/database",
  "rs_goods"
)
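With that helper in place, the original table match collapses into one call per collection. A sketch only: it assumes the Schemas.* case classes from the question, a mongoUrl variable holding the connection string, and spark.implicits._ in scope so the Encoder context bound is satisfied.

import spark.implicits._

// Each branch just names the case class; readDataset does the rest.
val df = table match {
  case "rs_goods"      => readDataset[Schemas.RsGoods](spark, mongoUrl, "rs_goods").toDF()
  case "ma_item_price" => readDataset[Schemas.MaItemPrice](spark, mongoUrl, "ma_item_price").toDF()
  case "ma_siteuid"    => readDataset[Schemas.MaSiteuid](spark, mongoUrl, "ma_siteuid").toDF()
  // ... the remaining collections follow the same pattern
}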

How to perform a simple json post with spray-json in spray?

I'm trying to perform a simple JSON post with spray, but it seems that I can't get an HTTP entity for a JSON object that can be marshalled.
Here is my error:
[error] ...../IdeaProjects/PoolpartyConnector/src/main/scala/org/iadb/poolpartyconnector/thesaurusoperation/ThesaurusCacheService.scala:172:
could not find implicit value for evidence parameter of type spray.httpx.marshalling.Marshaller[spray.json.JsValue]
[error] val request = Post(s"$thesaurusapiEndpoint/$coreProjectId/suggestFreeConcept?", suggestionJsonBody)
And the code that goes with it:
override def createSuggestedFreeConcept(suggestedPrefLabel: String, lang: String, scheme: String, b: Boolean): String = {
  import system.dispatcher
  import spray.json._

  val pipeline = addCredentials(BasicHttpCredentials("superadmin", "poolparty")) ~> sendReceive

  val label = LanguageLiteral(suggestedPrefLabel, lang)
  val suggestion = SuggestFreeConcept(List(label), b, Some(List(scheme)), None, None, None, None)
  val suggestionJsonBody = suggestion.toJson

  val request = Post(s"$thesaurusapiEndpoint/$coreProjectId/suggestFreeConcept?", suggestionJsonBody)
  val res = pipeline(request)

  getSuggestedFromFutureHttpResponse(res) match {
    case None    => ""
    case Some(e) => e
  }
}
Does anyone have an idea of what is going on with the implicit marshaller? I thought spray-json would come with an implicit marshaller.
I assume you already have a custom JSON protocol somewhere, so that suggestion.toJson works correctly?
Try the following:
import spray.http.HttpEntity
import spray.http.MediaTypes.`application/json`

val body = HttpEntity(`application/json`, suggestionJsonBody.prettyPrint)
val request = Post(s"$thesaurusapiEndpoint/$coreProjectId/suggestFreeConcept?", body)
You could also use compactPrint rather than prettyPrint; in either case, it turns the JsValue into a string containing the JSON.
Here is how I solved it:
override def createSuggestedFreeConcepts(suggestedPrefLabels: List[LanguageLiteral], scheme: String, checkDuplicates: Boolean): List[String] = {
  import system.dispatcher
  import spray.httpx.marshalling._
  import spray.httpx.SprayJsonSupport._

  val pipeline = addCredentials(BasicHttpCredentials("superadmin", "poolparty")) ~> sendReceive

  suggestedPrefLabels map { suggestedPrefLabel =>
    val suggestion = SuggestFreeConcept(List(suggestedPrefLabel), checkDuplicates, Some(List(Uri(scheme))), None, None, None, None)
    val request = Post(s"$thesaurusapiEndpoint/$coreProjectId/suggestFreeConcept", marshal(suggestion))
    val res = pipeline(request)

    getSuggestedFromFutureHttpResponse(res) match {
      case None    => ""
      case Some(e) => e
    }
  }
}
The key is:
import spray.httpx.marshalling._
import spray.httpx.SprayJsonSupport._
and
val request = Post(s"$thesaurusapiEndpoint/$coreProjectId/suggestFreeConcept", marshal(suggestion))
I marshal suggestion. The explanation is not entirely straightforward, but it is covered in the documentation.
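A further simplification (a sketch, not from the original answer): once SprayJsonSupport and your JSON protocol are in scope, the case class can be passed to Post directly and spray derives the marshaller, so the explicit marshal(...) call is not strictly needed. The protocol object name below is hypothetical.

import spray.httpx.SprayJsonSupport._
import MySuggestionJsonProtocol._ // hypothetical protocol providing RootJsonFormat[SuggestFreeConcept]

// SprayJsonSupport derives a Marshaller[SuggestFreeConcept] from the RootJsonFormat,
// so the case class itself is a valid request entity.
val request = Post(s"$thesaurusapiEndpoint/$coreProjectId/suggestFreeConcept", suggestion)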

How do I use Scala Hashmaps and Options together correctly?

My code snippets are below
import scala.collection.mutable.HashMap
val crossingMap = new HashMap[String, Option[Long]]
val crossingData: String = ...
val time: Long = crossingMap.get(crossingData).getOrElse(0)
I get the following compile error
error: type mismatch;
found : Any
required: Long
val time: Long = crossingMap.get(crossingData).getOrElse(0)
The problem is that crossingMap.get(crossingData) returns an Option[Option[Long]], so getOrElse(0) has to find a common supertype of Option[Long] and Int, which is Any. You might want crossingMap to contain String -> Long pairs instead. Then you can do the following:
val crossingMap = new HashMap[String, Long]
val crossingData: String = ""
val time: Long = crossingMap.getOrElse(crossingData, 0)
If you really do want the crossingMap values to have type Option[Long], then you'll have to do something like this:
val crossingMap = new HashMap[String, Option[Long]]
val crossingData: String = ""
val time: Long = crossingMap.getOrElse(crossingData, None).getOrElse(0)
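A small self-contained check of both variants (a sketch; the sample keys and values are made up):

import scala.collection.mutable.HashMap

object CrossingMapDemo extends App {
  // Variant 1: values are plain Longs, so a single getOrElse suffices.
  val simpleMap = new HashMap[String, Long]
  simpleMap += ("gate-1" -> 42L)
  val t1: Long = simpleMap.getOrElse("gate-1", 0L) // 42
  val t2: Long = simpleMap.getOrElse("gate-9", 0L) // 0 (missing key)

  // Variant 2: values are Option[Long], so two levels of "missing" must be handled.
  val optionMap = new HashMap[String, Option[Long]]
  optionMap += ("gate-1" -> Some(42L))
  optionMap += ("gate-2" -> None)
  val t3: Long = optionMap.getOrElse("gate-1", None).getOrElse(0L) // 42
  val t4: Long = optionMap.getOrElse("gate-2", None).getOrElse(0L) // 0 (key present, value None)

  println(Seq(t1, t2, t3, t4))
}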