Too Many Parameters - scala

I have an application with a single entry point; it's a library that automates some data engineering tasks.
case class DeltaContextConfig(
  primaryKey: List[String],
  columnToOrder: String,
  filesCountFirstBatch: Int,
  destinationPath: String,
  sparkDf: DataFrame,
  sparkContext: SparkSession,
  operationType: String,
  partitionColumn: Option[String] = None,
  tableName: String,
  databaseName: String,
  autoCompaction: Option[Boolean] = Option(true),
  idealFileSize: Option[Int] = Option(128),
  deduplicationColumn: Option[String] = None,
  compactionIntervalTime: Option[Int] = Option(180),
  updateCondition: Option[String] = None,
  setExpression: Option[String] = None
)
This is my case class, my single entry point.
All these parameters are then passed to other objects: I have objects to write to the data lake, to compact files, and so on, and each of them uses only some of these parameters. For example, I have a DeltaWriterConfig object:
DeltaWriterConfig(
  sparkDf = deltaContextConfig.sparkDf,
  columnToOrder = deltaContextConfig.columnToOrder,
  destinationPath = deltaContextConfig.destinationPath,
  primaryKey = deltaContextConfig.primaryKey,
  filesCountFirstBatch = deltaContextConfig.filesCountFirstBatch,
  sparkContext = deltaContextConfig.sparkContext,
  operationType = deltaContextConfig.operationType,
  partitionColumn = deltaContextConfig.partitionColumn,
  updateCondition = deltaContextConfig.updateCondition,
  setExpression = deltaContextConfig.setExpression
)
I use DeltaWriterConfig to pass these parameters to my DeltaWriter class. I was creating all these config objects in the application's main, but I don't think that is good: I have three config objects to populate, so I end up with three big constructor calls in main.
Is there any pattern to solve this?

I think it would at least be better to move the creation of the derived config from the first one into the companion object of DeltaWriterConfig:
case class DeltaWriterConfig(
  sparkDf: DataFrame,
  columnToOrder: String,
  destinationPath: String,
  primaryKey: List[String],
  filesCountFirstBatch: Int,
  sparkContext: SparkSession,
  operationType: String,
  partitionColumn: Option[String] = None,
  updateCondition: Option[String] = None,
  setExpression: Option[String] = None
)

object DeltaWriterConfig {
  def from(deltaContextConfig: DeltaContextConfig): DeltaWriterConfig =
    DeltaWriterConfig(
      sparkDf = deltaContextConfig.sparkDf,
      columnToOrder = deltaContextConfig.columnToOrder,
      destinationPath = deltaContextConfig.destinationPath,
      primaryKey = deltaContextConfig.primaryKey,
      filesCountFirstBatch = deltaContextConfig.filesCountFirstBatch,
      sparkContext = deltaContextConfig.sparkContext,
      operationType = deltaContextConfig.operationType,
      partitionColumn = deltaContextConfig.partitionColumn,
      updateCondition = deltaContextConfig.updateCondition,
      setExpression = deltaContextConfig.setExpression
    )
}
This gives us the opportunity to create the new config in a single line:
val deltaContextConfig: DeltaContextConfig = ???
val deltaWriterConfig = DeltaWriterConfig.from(deltaContextConfig)
But the better solution is to keep only the configs that are unique. For example, if we have duplicated fields in DeltaContextConfig and DeltaWriterConfig, why not compose the configs instead of duplicating these fields:
// instead of this DeltaContextConfig declaration
case class DeltaContextConfig(
  tableName: String,
  databaseName: String,
  autoCompaction: Option[Boolean] = Option(true),
  idealFileSize: Option[Int] = Option(128),
  deduplicationColumn: Option[String] = None,
  compactionIntervalTime: Option[Int] = Option(180),
  sparkDf: DataFrame,
  columnToOrder: String,
  destinationPath: String,
  primaryKey: List[String],
  filesCountFirstBatch: Int,
  sparkContext: SparkSession,
  operationType: String,
  partitionColumn: Option[String] = None,
  updateCondition: Option[String] = None,
  setExpression: Option[String] = None
)

case class DeltaWriterConfig(
  sparkDf: DataFrame,
  columnToOrder: String,
  destinationPath: String,
  primaryKey: List[String],
  filesCountFirstBatch: Int,
  sparkContext: SparkSession,
  operationType: String,
  partitionColumn: Option[String] = None,
  updateCondition: Option[String] = None,
  setExpression: Option[String] = None
)
we can use a config structure like this:
case class DeltaContextConfig(
  tableName: String,
  databaseName: String,
  autoCompaction: Option[Boolean] = Option(true),
  idealFileSize: Option[Int] = Option(128),
  deduplicationColumn: Option[String] = None,
  compactionIntervalTime: Option[Int] = Option(180),
  deltaWriterConfig: DeltaWriterConfig
)

case class DeltaWriterConfig(
  sparkDf: DataFrame,
  columnToOrder: String,
  destinationPath: String,
  primaryKey: List[String],
  filesCountFirstBatch: Int,
  sparkContext: SparkSession,
  operationType: String,
  partitionColumn: Option[String] = None,
  updateCondition: Option[String] = None,
  setExpression: Option[String] = None
)
But remember that you should mirror the same nested structure in your configuration file.
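To make the payoff concrete, here is a minimal sketch of what the application's main could look like with the composed config. The values, the df and spark references, and the DeltaWriter constructor are placeholders standing in for the question's own classes, not part of the original code:

// Hypothetical wiring in main: build the nested config once,
// then hand each component only the part it needs.
val writerConfig = DeltaWriterConfig(
  sparkDf = df,                              // some DataFrame prepared earlier
  columnToOrder = "updated_at",
  destinationPath = "/mnt/datalake/my_table",
  primaryKey = List("id"),
  filesCountFirstBatch = 10,
  sparkContext = spark,                      // the SparkSession
  operationType = "upsert"
)

val contextConfig = DeltaContextConfig(
  tableName = "my_table",
  databaseName = "my_db",
  deltaWriterConfig = writerConfig
)

// Each component receives only its own config, not the whole context.
val writer = new DeltaWriter(contextConfig.deltaWriterConfig)

This way main only assembles one nested structure, and the other two config objects can be composed in the same fashion.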

Scala Akka (un)marshalling nested Seq collections with Spray JSON

I'm trying to use Spray JSON to marshal the Seq collection below into a BidRequest entity with the parameters as defined.
The collection is mostly nested, so some of the Seq parameter fields have collection types of their own that also need to be marshalled.
Then, after a computation, the aim is to unmarshal the results into a BidResponse entity.
What's the best approach to do this?
I'm using Akka HTTP, Akka Streams, and Akka Actor.
Seq collection:
val activeUsers = Seq(
  Campaign(
    id = 1,
    country = "UK",
    targeting = Targeting(
      targetedSiteIds = Seq("0006a522ce0f4bbbbaa6b3c38cafaa0f")
    ),
    banners = List(
      Banner(
        id = 1,
        src = "https://business.URLTV.com/wp-content/uploads/2020/06/openGraph.jpeg",
        width = 300,
        height = 250
      )
    ),
    bid = 5d
  )
)
BidRequest case class:
case class BidRequest(id: String, imp: Option[List[Impression]], site:Site, user: Option[User], device: Option[Device])
BidResponse case class:
case class BidResponse(id: String, bidRequestId: String, price: Double, adid:Option[String], banner: Option[Banner])
The other case classes:
case class Campaign(id: Int, country: String, targeting: Targeting, banners: List[Banner], bid: Double)
case class Targeting(targetedSiteIds: Seq[String])
case class Banner(id: Int, src: String, width: Int, height: Int)
case class Impression(id: String, wmin: Option[Int], wmax: Option[Int], w: Option[Int], hmin: Option[Int], hmax: Option[Int], h: Option[Int], bidFloor: Option[Double])
case class Site(id: Int, domain: String)
case class User(id: String, geo: Option[Geo])
case class Device(id: String, geo: Option[Geo])
case class Geo(country: Option[String])
I've so far tried using the code below but keep getting type mismatch errors:
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import spray.json.DefaultJsonProtocol._
implicit val resFormat = jsonFormat2(BidResponse)
implicit val bidFormat = jsonFormat1(BidRequest)
implicit val cFormat = jsonFormat1(Campaign)
implicit val tFormat = jsonFormat1(Targeting)
implicit val bFormat = jsonFormat1(Banner)
implicit val iFormat = jsonFormat1(Impression)
implicit val sFormat = jsonFormat1(Site)
implicit val uFormat = jsonFormat1(User)
implicit val dFormat = jsonFormat1(Device)
implicit val gFormat = jsonFormat1(Geo)
The reason you are getting type errors with Spray JSON is that you need to use the jsonFormatN method whose N matches the number of parameters of the case class.
In your case:
implicit val resFormat = jsonFormat5(BidResponse)
implicit val bidFormat = jsonFormat5(BidRequest)
implicit val cFormat = jsonFormat5(Campaign)
implicit val tFormat = jsonFormat1(Targeting)
implicit val bFormat = jsonFormat4(Banner)
...
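For reference, a complete set of formats is sketched below, with the leaf formats declared first so that each jsonFormatN call already has the formats it depends on in implicit scope. The field counts come from the case classes in the question:

import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import spray.json.DefaultJsonProtocol._

// Leaf formats first, so they are already in implicit scope
// when the formats of the enclosing case classes are derived.
implicit val gFormat   = jsonFormat1(Geo)         // 1 field
implicit val bFormat   = jsonFormat4(Banner)      // 4 fields
implicit val tFormat   = jsonFormat1(Targeting)   // 1 field
implicit val iFormat   = jsonFormat8(Impression)  // 8 fields
implicit val sFormat   = jsonFormat2(Site)        // 2 fields
implicit val uFormat   = jsonFormat2(User)        // 2 fields
implicit val dFormat   = jsonFormat2(Device)      // 2 fields
implicit val cFormat   = jsonFormat5(Campaign)    // 5 fields
implicit val bidFormat = jsonFormat5(BidRequest)  // 5 fields
implicit val resFormat = jsonFormat5(BidResponse) // 5 fields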

Scala Spark Dataset Error on Nested Object

I am trying to test DataFrame (Dataset) code with strongly typed nested case classes, building a DataFrame from them to then pass into my functions. The serialization/creation of the DataFrame keeps failing, and I do not have enough experience with Scala or Spark to know what is going on.
I think I am specifying a schema while Spark also infers one, and because they do not match it errors out?
Models:
package io.swagger.client.model

import java.sql.Date
import scala.Enumeration

case class Member(
  memberId: String,
  memberIdSuffix: String,
  memberSubscriberId: String,
  memberEmpi: Option[Long] = None,
  memberFirstName: String,
  memberLastName: String,
  memberMiddleInitial: Option[String] = None,
  memberGender: String,
  memberBirthDate: Date,
  memberSocialSecurityNumber: Option[String] = None,
  memeberPhoneNumbers: List[Telecom],
  memberEmailAddresses: Option[List[Email]] = None,
  memberAddresses: List[Address],
  memberEligibilities: List[MemberEligibility]
)

case class Email(
  address: String,
  effectiveDate: Option[Date] = None,
  terminationDate: Option[Date] = None,
  isCurrent: Option[Boolean] = None,
  isActive: Option[Boolean] = None
)

case class Address(
  lineOne: String,
  lineTwo: String,
  cityName: String,
  stateCode: String,
  zipCode: String,
  effectiveDate: Option[Date] = None,
  terminationDate: Option[Date] = None,
  isCurrent: Option[Boolean] = None,
  isActive: Option[Boolean] = None
)

case class MemberEligibility(
  productId: String,
  productCategoryCode: String,
  classId: String,
  planId: String,
  groupId: String,
  maxCopayAmount: Option[Float] = None,
  voidIndicator: Boolean,
  healthplanEntryDate: Date,
  memberStatusDescription: Option[String] = None,
  eligibilityExplanation: Option[String] = None,
  eligibilitySelectionLevelDescription: Option[String] = None,
  eligibilityReason: Option[String] = None,
  effectiveDate: Option[Date] = None,
  terminationDate: Option[Date] = None,
  isCurrent: Option[Boolean] = None,
  isActive: Option[Boolean] = None
)

case class Telecom(
  phoneNumber: String,
  effectiveDate: Option[Date] = None,
  terminationDate: Option[Date] = None,
  isCurrent: Option[Boolean] = None,
  isActive: Option[Boolean] = None,
  telecomType: String
)

object Genders extends Enumeration {
  val male, female, unknown, other = Value
}

object Gender extends Enumeration {
  val home, work, fax = Value
}
Test code:
import scala.util.{Try, Success, Failure}
import io.swagger.client.model._
import org.apache.spark.sql.{SparkSession, DataFrame, Dataset}
import org.apache.spark.SparkContext
import org.scalatest._

trait SparkContextSetup {
  def withSparkContext(testMethod: (SparkSession, SparkContext) => Any) {
    val spark = org.apache.spark.sql.SparkSession.builder
      .master("local")
      .appName("Spark test")
      .getOrCreate()
    val sparkContext = spark.sparkContext
    try {
      testMethod(spark, sparkContext)
    } finally sparkContext.stop()
  }
}

class HelloSpec extends WordSpec with Matchers with SparkContextSetup {
  "My analytics" should {
    "calculate the right thing" in withSparkContext { (spark, sparkContext) =>
      MockMemberData(spark)
    }
  }

  private def MockMemberData(spark: SparkSession) = {
    import spark.implicits._
    import java.sql.{Date}
    import java.text.SimpleDateFormat
    import org.apache.spark.sql.types._

    var testDate = Try(new SimpleDateFormat("dd/MM/yyyy").parse("01/01/2018"))
      .map(d => new java.sql.Date(d.getTime()))
      .get

    val mockData = spark.sparkContext
      .parallelize(
        Seq(
          Member(
            memberId = "12345",
            memberIdSuffix = "Mr.",
            memberSubscriberId = "000000011",
            memberEmpi = None,
            memberFirstName = "firstname",
            memberLastName = "lastname",
            Some("w"),
            Genders.male.toString,
            testDate,
            Some("123456789"),
            List(
              Telecom("12345678910", None, None, Some(true), Some(true), "")
            ),
            Option(
              List(
                Email(
                  "test#gmail.com",
                  None,
                  Some(testDate),
                  isCurrent = Some(true),
                  isActive = Some(true)
                )
              )
            ),
            List(
              Address(
                "10 Awesome Dr",
                "",
                "St. Louis",
                "MO",
                "63000",
                None,
                None,
                None,
                None
              )
            ),
            List(
              MemberEligibility(
                "productid",
                "productCategoryCode",
                "classId",
                "planId",
                "groupId",
                None,
                false,
                testDate,
                None,
                None,
                None,
                None,
                None,
                None,
                None
              )
            )
          )
        )
      )
      .toDF()
    mockData.show()
  }
}
I expected to receive a DataFrame's schema (or a Dataset's in this case). What I did receive was:
[info] HelloSpec:
[info] My analytics
[info] - should calculate the right thing *** FAILED ***
[info] org.apache.spark.sql.AnalysisException: cannot resolve 'wrapoption(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue10, MapObjects_loopIsNull11, StructField(address,StringType,true), StructField(effectiveDate,DateType,true), StructField(terminationDate,DateType,true), StructField(isCurrent,BooleanType,true), StructField(isActive,BooleanType,true), if (isnull(lambdavariable(MapObjects_loopValue10, MapObjects_loopIsNull11, StructField(address,StringType,true), StructField(effectiveDate,DateType,true), StructField(terminationDate,DateType,true), StructField(isCurrent,BooleanType,true), StructField(isActive,BooleanType,true)))) null else newInstance(class io.swagger.client.model.Email), cast(memberEmailAddresses as array<struct<address:string,effectiveDate:date,terminationDate:date,isCurrent:boolean,isActive:boolean>>)).array, true), ObjectType(class scala.collection.immutable.List))' due to data type mismatch: argument 1 requires scala.collection.immutable.List type, however, 'staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue10, MapObjects_loopIsNull11, StructField(address,StringType,true), StructField(effectiveDate,DateType,true), StructField(terminationDate,DateType,true), StructField(isCurrent,BooleanType,true), StructField(isActive,BooleanType,true), if (isnull(lambdavariable(MapObjects_loopValue10, MapObjects_loopIsNull11, StructField(address,StringType,true), StructField(effectiveDate,DateType,true), StructField(terminationDate,DateType,true), StructField(isCurrent,BooleanType,true), StructField(isActive,BooleanType,true)))) null else newInstance(class io.swagger.client.model.Email), cast(memberEmailAddresses as array<struct<address:string,effectiveDate:date,terminationDate:date,isCurrent:boolean,isActive:boolean>>)).array, true)' is of scala.collection.Seq type.;
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:360)
[info] ...
UPDATE
So instead of
val mockData = spark.sparkContext
  .parallelize(
    Seq(
or
val mockData = spark.sparkContext
  .parallelize(
    List(
using Array works?
val mockData = spark.sparkContext
  .parallelize(
    Array(
Why does Array work but Seq and List do not work?

How to create generated objects in shapeless

Suppose I have a normalized database model for a generic type that comes in like this:
case class BaseModel(id: String,
                     createdBy: String,
                     attr1: Option[String] = None,
                     attr2: Option[Int] = None,
                     attr3: Option[LocalDate] = None)
Given a sequence of BaseModel, if none of the values of a certain Option attribute are populated, can Shapeless create a reduced model for me?
For example, suppose that all the attr1 fields are empty. Without me having to specify the object beforehand, can Shapeless create a generic object that looks like this?
case class BaseModel(id: String,
                     createdBy: String,
                     attr2: Option[Int] = None,
                     attr3: Option[LocalDate] = None)
What Shapeless can do is, given two case classes, create an object of one of them from an object of another.
import java.time.LocalDate
import shapeless.LabelledGeneric
import shapeless.record._

case class BaseModel(id: String,
                     createdBy: String,
                     attr1: Option[String] = None,
                     attr2: Option[Int] = None,
                     attr3: Option[LocalDate] = None)

case class BaseModel1(id: String,
                      createdBy: String,
                      attr2: Option[Int] = None,
                      attr3: Option[LocalDate] = None)

val bm = BaseModel(
  id = "cff4545gvgf",
  createdBy = "John Doe",
  attr2 = Some(42),
  attr3 = Some(LocalDate.parse("2018-11-03"))
) // BaseModel(cff4545gvgf,John Doe,None,Some(42),Some(2018-11-03))

val hlist = LabelledGeneric[BaseModel].to(bm)
val hlist1 = hlist - 'attr1
val bm1 = LabelledGeneric[BaseModel1].from(hlist1)
// BaseModel1(cff4545gvgf,John Doe,Some(42),Some(2018-11-03))
But Shapeless can't create a new case class. If you need a new case class to be created automatically you can write a macro.

playframework scala slick how to update one single attribute

Hi, I want to update only one value of a model and store it in the database.
I tried this:
def updateProcessTemplateApproveProcessId(processTemplate: ProcessTemplatesModel, approveProcessInstanceId: Int): Future[Int] = {
  val action = for {
    processTemplatesUpdate <- processTemplates if processTemplatesUpdate.id === processTemplate.id // WHERE statement
  } yield processTemplatesUpdate.approveProcessInstance = Some(approveProcessInstanceId) // SELECT statement
  db.run(action.update(Some(true)))
}
But I got the error reassignment to val is not allowed ... that's correct ;) so I changed the attribute in the model to a var.
case class ProcessTemplatesModel(
  id: Option[Int] = None,
  title: String,
  version: String,
  createdat: Option[String],
  updatedat: Option[String],
  deadline: Option[Date],
  status: Option[String],
  comment: Option[String],
  checked: Option[Boolean],
  checkedat: Option[Date],
  approved: Option[Boolean],
  approvedat: Option[Date],
  deleted: Boolean,
  approveprocess: Int,
  trainingsprocess: Option[Int],
  previousVersion: Option[Int],
  originTemplate: Option[Int],
  client: Int,
  var approveProcessInstance: Option[Int],
What's my mistake in this case?
Thanks in advance
Don't change your case class. Assuming that processTemplates is a TableQuery, do the following instead:
val query = for {
  processTemplatesUpdate <- processTemplates if processTemplatesUpdate.id === processTemplate.id
} yield processTemplatesUpdate.approveProcessInstance

val action = query.update(Some(approveProcessInstanceId))
db.run(action)
If you want a one-liner (again assuming that processTemplates is a TableQuery object):
db.run(processTemplates.filter(_.id === processTemplate.id).map(_.approveProcessInstance).update(Some(approveProcessInstanceId)))
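Putting that together, here is a sketch of the original method rewritten with this pattern (assuming the same imports and context as the question's method, with processTemplates as the TableQuery and db as the Slick database handle):

def updateProcessTemplateApproveProcessId(processTemplate: ProcessTemplatesModel,
                                          approveProcessInstanceId: Int): Future[Int] = {
  // Select only the column to change, then update it in place.
  val query = processTemplates
    .filter(_.id === processTemplate.id)
    .map(_.approveProcessInstance)
  db.run(query.update(Some(approveProcessInstanceId)))
}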
Hope this helps.

Scala Implementation

I have a case class:
case class EvaluateAddress(addressFormat: String,
                           screeningAddressType: String,
                           value: Option[String])
This was working fine until I got a new use case where the "value" parameter can be a class object instead of a String.
My initial implementation to handle this use case:
case class EvaluateAddress(addressFormat: String,
                           screeningAddressType: String,
                           addressId: Option[String],
                           addressValue: Option[MailingAddress]) {

  def this(addressFormat: String, screeningAddressType: String, addressId: String) = {
    this(addressFormat, screeningAddressType, Option(addressId), None)
  }

  def this(addressFormat: String, screeningAddressType: String, address: MailingAddress) = {
    this(addressFormat, screeningAddressType, None, Option(address))
  }
}
But because of some problem, I cannot have four parameters in any constructor.
Is there a way I can create a class containing the three parameters addressFormat, screeningAddressType, and value, and handle both use cases?
Your code works fine; to use the other constructors you just need the new keyword:
case class MailingAddress(i: Int)

case class EvaluateAddress(addressFormat: String, screeningAddressType: String, addressId: Option[String], addressValue: Option[MailingAddress]) {

  def this(addressFormat: String, screeningAddressType: String, addressId: String) = {
    this(addressFormat, screeningAddressType, Option(addressId), None)
  }

  def this(addressFormat: String, screeningAddressType: String, address: MailingAddress) = {
    this(addressFormat, screeningAddressType, None, Option(address))
  }
}

val e1 = EvaluateAddress("a", "b", None, None)
val e2 = new EvaluateAddress("a", "b", "c")
val e3 = new EvaluateAddress("a", "b", MailingAddress(0))
You can create an auxiliary ADT to wrap the different types of values. Inside EvaluateAddress you can check which alternative was provided with a match:
case class EvaluateAddress(addressFormat: String,
                           screeningAddressType: String,
                           value: Option[EvaluateAddress.Value]) {
  import EvaluateAddress._

  def doEvaluation() = value match {
    case Some(Value.AsId(id)) =>
    case Some(Value.AsAddress(mailingAddress)) =>
    case None =>
  }
}

object EvaluateAddress {
  sealed trait Value
  object Value {
    case class AsId(id: String) extends Value
    case class AsAddress(address: MailingAddress) extends Value
  }
}
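For comparison, constructing the ADT-based version directly is a bit more verbose. A small usage sketch, where MailingAddress(0) just stands in for whatever constructor your MailingAddress actually has:

// Explicitly wrapping the value in the ADT before passing it in.
val byId      = EvaluateAddress("a", "b", Some(EvaluateAddress.Value.AsId("some-id")))
val byAddress = EvaluateAddress("a", "b", Some(EvaluateAddress.Value.AsAddress(MailingAddress(0))))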
It's then possible to also define some implicit conversions to automatically convert Strings and MailingAddresses into Values:
object EvaluateAddress {
  sealed trait Value
  object Value {
    case class AsId(id: String) extends Value
    case class AsAddress(address: MailingAddress) extends Value

    implicit def idAsValue(id: String): Value = AsId(id)
    implicit def addressAsValue(address: MailingAddress): Value = AsAddress(address)
  }

  def withRawValue[T](addressFormat: String,
                      screeningAddressType: String,
                      rawValue: Option[T])(implicit asValue: T => Value): EvaluateAddress = {
    EvaluateAddress(addressFormat, screeningAddressType, rawValue.map(asValue))
  }
}
Some examples of using those implicit conversions:
scala> EvaluateAddress("a", "b", Some("c"))
res1: EvaluateAddress = EvaluateAddress(a,b,Some(AsId(c)))
scala> EvaluateAddress("a", "b", Some(MailingAddress("d")))
res2: EvaluateAddress = EvaluateAddress(a,b,Some(AsAddress(MailingAddress(d))))
scala> val id: Option[String] = Some("id")
id: Option[String] = Some(id)
scala> EvaluateAddress.withRawValue("a", "b", id)
res3: EvaluateAddress = EvaluateAddress(a,b,Some(AsId(id)))