I am writing a Scala/Spark program to find the maximum salary of an employee. The employee data is in a CSV file, and the salary column uses a comma as a thousands separator and has a $ prefix, e.g. $74,628.00.
To handle the comma and dollar sign, I have written a parser function in Scala which splits each line on "," and then maps each column to an individual variable to be assigned to a case class.
My parser looks like the code below. To eliminate the comma and dollar sign I use the replace function to replace them with an empty string, and then finally typecast the result to Int.
def ParseEmployee(line: String): Classes.Employee = {
  val fields = line.split(",")
  val Name = fields(0)
  val JOBTITLE = fields(2)
  val DEPARTMENT = fields(3)
  val temp = fields(4)
  temp.replace(",", "") // To eliminate the ,
  temp.replace("$", "") // To remove the $
  val EMPLOYEEANNUALSALARY = temp.toInt // Typecast the string to Int
  Classes.Employee(Name, JOBTITLE, DEPARTMENT, EMPLOYEEANNUALSALARY)
}
My case class looks like this:
case class Employee(
  Name: String,
  JOBTITLE: String,
  DEPARTMENT: String,
  EMPLOYEEANNUALSALARY: Number
)
My Spark DataFrame SQL query looks like this:
val empMaxSalaryValue = sc.sqlContext.sql("Select Max(EMPLOYEEANNUALSALARY) From EMP")
empMaxSalaryValue.show
When I run this program I get the exception below:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Number
- field (class: "java.lang.Number", name: "EMPLOYEEANNUALSALARY")
- root class: "Classes.Employee"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:282)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:272)
at CalculateMaximumSalary$.main(CalculateMaximumSalary.scala:27)
at CalculateMaximumSalary.main(CalculateMaximumSalary.scala)
Any idea why I am getting this error? What mistake am I making here, and why can't it typecast to Number?
Is there a better approach to this problem of getting the maximum salary of an employee?
Spark SQL provides only a limited number of Encoders, which target concrete classes. Abstract classes like Number are not supported (they can be used only with the limited binary Encoders).
Since you convert to Int anyway, just redefine the class:
case class Employee(
  Name: String,
  JOBTITLE: String,
  DEPARTMENT: String,
  EMPLOYEEANNUALSALARY: Int
)
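Note also that String.replace returns a new string rather than modifying temp in place, so the cleaned value never actually reaches toInt, and a value such as $74,628.00 still carries a decimal part. A minimal sketch of a corrected parser and query under those assumptions (it assumes fields(4) arrives as the whole salary token, i.e. the quoting around the salary in the CSV is handled before the split; spark and csvPath are illustrative names, not from the question):
def parseEmployee(line: String): Classes.Employee = {
  val fields = line.split(",")
  // replace returns a new String, so its result must be captured
  val salaryText = fields(4).replace("$", "").replace(",", "")
  // "$74,628.00" still has ".00", so go through Double before truncating to Int
  val salary = salaryText.toDouble.toInt
  Classes.Employee(fields(0), fields(2), fields(3), salary)
}

import spark.implicits._
// header-row handling is omitted here
val employees = spark.sparkContext.textFile(csvPath).map(parseEmployee).toDF()
employees.createOrReplaceTempView("EMP")
spark.sql("SELECT MAX(EMPLOYEEANNUALSALARY) FROM EMP").show()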
I'm currently challenging myself to skill up in Scala and FP, and today I came up with an issue that might interest you, devil prog masters ;)
Let's say I have the following case class in Scala 3:
type EmailAddress = String // I defined them like that to show I'm interested in
type PhoneNumber = String // ... attributes via their names, not via their types.
case class Person(name: String, emails: List[EmailAddress], phones: List[PhoneNumber])
I would like to have a method that automatically transforms (almost) all fields.
For example, I would like to order emails with the default given instance of Ordering[String] and phones with a specified one.
Ideally I should be able to exclude the name field.
So I would get something like:
/* Below, I represented the kind of parametrization I would like to be able to do
* as parameters of the method orderValues,
* but it could be annotations or meta-programming instead.
*
* An `orderedPerson` can be directly an instance of Person
* or something else like an OrderedEntity[Person], I don't care so far.
*/
val orderedPerson =
person.orderValues(
excluded = Set("name"),
explicitRules = Map(
// Phones would have a special ordering (reverse is just a dummy value)
"phones" -> Ordering.String.reverse
)
)
// -----
// So we would get:
Person(
name = "Xiao",
emails = List("a@a.a", "a@a.b", "a@b.a"),
phones = List("+86 100 9000 1000", "+86 100 2000 1000")
)
I haven't used reflection for a long time and I'm not yet familiar with meta-programming, but I'm open to any solution that can help me achieve that.
It's a good opportunity for learning!
[Edit]
My original intent was to have a library that could be used to easily anonymize any data.
The type keyword in Scala just introduces a type alias. You should use a newtype library like https://github.com/estatico/scala-newtype (or an opaque type in Scala 3) and derive the implicit Ordering instances from the ones for String.
Example with estatico/scala-newtype:
import io.estatico.newtype.macros.newtype
import io.estatico.newtype.ops._
@newtype case class Email(string: String)
object Email {
implicit val ordering: Ordering[Email] = deriving
}
@newtype case class PhoneNumber(string: String)
object PhoneNumber {
implicit val ordering: Ordering[PhoneNumber] = deriving[Ordering].reverse
}
case class Person(name: String, emails: List[Email], phones: List[PhoneNumber]) {
lazy val orderValues: Person = this.copy(emails = emails.sorted, phones = phones.sorted)
}
Person(
"Xiao",
List(Email("a@a.a"), Email("a@a.b"), Email("a@b.a")),
List(PhoneNumber("+86 100 9000 1000"), PhoneNumber("+86 100 2000 1000"))
).orderValues
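Since the question targets Scala 3, a rough equivalent using opaque types (the alternative mentioned above) could look like the sketch below; the Types object and the given names are illustrative and not part of any library:
object Types:
  opaque type Email = String
  def Email(s: String): Email = s
  given emailOrdering: Ordering[Email] = Ordering.String

  opaque type PhoneNumber = String
  def PhoneNumber(s: String): PhoneNumber = s
  // phones get the reversed ordering, mirroring the newtype example above
  given phoneOrdering: Ordering[PhoneNumber] = Ordering.String.reverse

import Types.*
import Types.given

case class Person(name: String, emails: List[Email], phones: List[PhoneNumber]):
  lazy val orderValues: Person = copy(emails = emails.sorted, phones = phones.sorted)

val orderedPerson = Person(
  "Xiao",
  List(Email("a@a.a"), Email("a@a.b"), Email("a@b.a")),
  List(PhoneNumber("+86 100 9000 1000"), PhoneNumber("+86 100 2000 1000"))
).orderValues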
I want to search for one value across 3 columns of a MySQL table, i.e. a JPA query equivalent to:
SELECT * FROM table WHERE 123 IN (col1, col2, col3);
The problem is that col2 and col3 are nullable (they may hold null), so the type of these columns is Option[String] in the entity class.
So I tried something like this in a JpaRepository:
@Repository
trait ContactRepo extends JpaRepository[Contact, Long] {
def findFirstByCol1OrCol2OrCol3(c1: String, c2: scala.Option[String], c3: scala.Option[String]): Contact
}
and while calling this:
val optionData: scala.Option[String] = scala.Option("1234567890")
val someData: scala.Some[String] = scala.Some("12345678980")
val simpleStringData: String = "1234567890"
val contact = contactRepo.findFirstByCol1OrCol2OrCol3(simpleStringData, optionData, optionData)
I tried all the variables (optionData, someData and simpleStringData) but I keep getting this error:
Parameter value [1234567890] did not match expected type [scala.Option (n/a)]
I also tried OnIsNull but it is still not working.
I don't know what I'm missing; I think it's some small mistake that I don't understand.
Problem: how can I write a JPA query for SELECT * FROM table WHERE 123 IN (col1, col2, col3)?
Note: the type of col2 and col3 is scala.Option[String].
I know this is an old question, but I faced the same problem today and solved it by checking the exact type the method expects. When you create a JPA query method for an Option-typed field, the mapped variable in the entity class is itself an Option, so when passing values to the query method you need to pass them as:
Option[Option[String]]
So your method should be:
trait ContactRepo extends JpaRepository[Contact, Long] {
  def findFirstByCol1OrCol2OrCol3(c1: String, c2: Option[Option[String]], c3: Option[Option[String]]): Contact
}
and when calling it:
val optionData = Some(Some("1234567890"))
val someData = Some(Some("1234567890"))
val simpleStringData: String = "1234567890"
val contact = contactRepo.findFirstByCol1OrCol2OrCol3(simpleStringData, optionData, someData)
I've created two REST endpoints in Akka HTTP which take a string as input, parse it using Json4s, and then process it. My case class looks like this:
final case class A(id: String, name: String, address: String)
The 1st endpoint receives only id while the other receives all three fields, and I want to use the same case class A for both. So I used default values for the name and address fields, like this:
final case class A(id: String, name: String = "", address: String = "")
This works fine for me. But now, if I don't send the address or name fields (or both) to the second endpoint, it does not throw an exception saying that name (or address) was not found.
So, my question is: can I create one endpoint where id is mandatory while the other fields don't matter, and another endpoint where every field is mandatory, using the same case class?
The code to parse the string into the case class is:
parse(jsonStr).extract[A]
I hope you're getting my point.
Any suggestions?
There are two ways you can achieve what you want to do.
Option + Validations
name and address are optional so you need to handle them.
case class A(id: String, name: Option[String], address: Option[String])
val json = """{ "id":"1" }"""
// 1st endpoint
val r = parse(json).extract[A]
r.name.getOrElse("foo")
r.address.getOrElse("bar")
// 2nd endpoint
val r2 = parse(json).extract[A]
r2.name.getOrElse(throw new IllegalArgumentException("name is required")) // boom! validation is now manual
Default JObject + Merge
Or you can merge an alternative JObject into your input to provide default values.
case class A(id: String, name: String, address: String)
val json = """{ "id":"1" }"""
val defaultValues = JObject(("name", JString("foo")), ("address", JString("bar")))
// 1st endpoint
val r = defaultValues.merge(parse(json)).extract[A]
// 2nd endpoint
val r2 = parse(json).extract[A] // boom! again
No. Your case class formally defines what you expect as input; it doesn't represent ambiguity. You could use Option fields and add checks, but that just defeats the purpose of the extractor.
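Following that reasoning, here is a sketch of the two-model approach (IdOnlyRequest and FullRequest are illustrative names; DefaultFormats is assumed to be enough for these fields):
import org.json4s._
import org.json4s.native.JsonMethods.parse // or org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// One request model per endpoint, so each one states exactly what is mandatory
final case class IdOnlyRequest(id: String)                              // 1st endpoint
final case class FullRequest(id: String, name: String, address: String) // 2nd endpoint

val r1 = parse("""{ "id": "1" }""").extract[IdOnlyRequest] // ok
val r2 = parse("""{ "id": "1" }""").extract[FullRequest]   // throws a MappingException (no usable value for name)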
I have the following case class:
case class OrderDetails(OrderID: String, ProductID: String, UnitPrice: Double,
                        Qty: Int, Discount: Double)
I am trying to read this CSV: https://github.com/xsankar/fdps-v3/blob/master/data/NW-Order-Details.csv
This is my code:
val spark = SparkSession.builder.master(sparkMaster).appName(sparkAppName).getOrCreate()
import spark.implicits._
val orderDetails = spark.read.option("header","true").csv( inputFiles + "NW-Order-Details.csv").as[OrderDetails]
And the error is:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast `UnitPrice` from string to double as it may truncate
The type path of the target object is:
- field (class: "scala.Double", name: "UnitPrice")
- root class: "es.own3dh2so4.OrderDetails"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Why can't it be converted if all the field values are doubles? What am I not understanding?
Spark version 2.1.0, Scala version 2.11.7
You just need to explicitly cast your field to a Double:
import org.apache.spark.sql.types.DoubleType

val orderDetails = spark.read
  .option("header", "true")
  .csv(inputFiles + "NW-Order-Details.csv")
  .withColumn("unitPrice", 'UnitPrice.cast(DoubleType))
  .as[OrderDetails]
On a side note, by Scala (and Java) convention, your case class constructor parameters should be lower camel case:
case class OrderDetails(orderID: String,
productID: String,
unitPrice: Double,
qty: Int,
discount: Double)
If we want to change the datatype for multiple columns, using withColumn for each one gets ugly.
A better way is to apply a schema to the data while reading it.
Get the case class schema using Encoders, as shown below:
val caseClassschema = Encoders.product[CaseClass].schema
Apply this schema while reading data
val data = spark.read.schema(caseClassschema)
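Putting both steps together for the question's file, a sketch might look like this (reusing the OrderDetails class and the inputFiles path from the question):
import org.apache.spark.sql.Encoders
import spark.implicits._ // already imported in the question

val caseClassSchema = Encoders.product[OrderDetails].schema

val orderDetails = spark.read
  .option("header", "true")
  .schema(caseClassSchema)
  .csv(inputFiles + "NW-Order-Details.csv")
  .as[OrderDetails]

orderDetails.printSchema() // UnitPrice and Discount are now double, Qty is integer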
I defined a class to map rows of a Cassandra table:
case class Log(
val time: Long,
val date: String,
val appId: String,
val instanceId: String,
val appName: String,
val channel: String,
val originCode: String,
val message: String)
I created an RDD to hold all my rows:
val logEntries = sc.cassandraTable[Log]("keyspace", "log")
To check that everything works, I printed this:
println(logEntries.count()) -> works, prints the number of rows retrieved.
println(logEntries.first()) -> exception on this line:
java.lang.AssertionError: assertion failed: Missing columns needed by
com.model.Log: app_name, app_id, origin_code, instance_id
The columns of my log table in Cassandra are:
time bigint, date text, appid text, instanceid text, appname text, channel text, origincode text, message text
What's wrong?
As mentioned in the spark-cassandra-connector docs, the column name mapper has its own logic for converting case class parameters to column names:
For multi-word column identifiers, separate each word by an underscore in Cassandra, and use the camel case convention on the Scala side.
So if you use case class Log(appId: String, instanceId: String) with camel-cased parameters, they will be automatically mapped to the underscore-separated notation: app_id text, instance_id text. They cannot be automatically mapped to appid text, instanceid text: the underscores are missing from the table's column names.
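So the simplest fix is to make the case class field names match the existing column names exactly (or rename the Cassandra columns to app_id, instance_id, and so on). A sketch of the first option:
import com.datastax.spark.connector._

// Field names spelled exactly like the Cassandra columns, so the default mapper finds them
case class Log(
  time: Long,
  date: String,
  appid: String,
  instanceid: String,
  appname: String,
  channel: String,
  origincode: String,
  message: String)

val logEntries = sc.cassandraTable[Log]("keyspace", "log")
println(logEntries.first())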