Dealing with missing elements in Scala/Spark class

I have the following file in Hadoop
val dataset=sc.textFile("/user/hue/mycompanies1.csv")
It looks like this
CS,84,Jimmys Bistro, Jimmys
CS,90,Pauls Fish
CS,100, Happy Hardware
My scala/Spark code looks like:
case class Company (
record_type: String,
company_num: Integer,
company_name: String,
nickname: String
)
val company = dataset.map(k=>k.split(",")).map(
k => Company(k(0).trim, k(1).toInt, k(2).trim, k(3).trim))
company.toDF().registerTempTable("company_table4")
When I try to access the company RDD afterwards, I get a NullPointerException because of the missing nickname value in the data. How do I deal with this gracefully?

Since the nickname is optional, I would change the case class to reflect that, then use one of the various ways to optionally obtain the index-3 element, e.g.:
case class Company (
record_type: String,
company_num: Integer,
company_name: String,
nickname: Option[String]
)
val company = dataset.map(k=>k.split(",")).map(
k => Company(k(0).trim, k(1).toInt, k(2).trim, k.drop(3).headOption.map(_.trim)))
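Another way to read the optional element is `lift`, which returns None when the index is out of range. The following is a minimal sketch on a plain Seq rather than an RDD, so it can be tried without a cluster. `CompanyParsing` and `parseCompany` are hypothetical names, and Int is used in place of Integer.

```scala
object CompanyParsing {
  case class Company(
    record_type: String,
    company_num: Int,
    company_name: String,
    nickname: Option[String]
  )

  // Hypothetical helper: parse one CSV line; the fourth field may be absent.
  def parseCompany(line: String): Company = {
    val k = line.split(",")
    // lift(3) yields None instead of throwing when there is no index 3
    Company(k(0).trim, k(1).trim.toInt, k(2).trim, k.lift(3).map(_.trim))
  }

  // Sample lines copied from the question
  val lines = Seq("CS,84,Jimmys Bistro, Jimmys", "CS,90,Pauls Fish", "CS,100, Happy Hardware")
  val companies: Seq[Company] = lines.map(parseCompany)
}
```

In the Spark job, the same function could presumably be passed to dataset.map in place of the two chained map calls.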

Related

Looking for help in Nested groupBy with scalikejdbc?

I am using Scala 2.12 and have required libraries downloaded via build.sbt.
I have my DB output in below format.
Basically, for each valuation date and book there can be data for multiple currencies.
I want to group mainly by book, so that each book has a list of PnL data, one entry per currency.
Just the rough representation:
{ Bookid: 1234,
BookName: EQUITY,
PnlBreakdown: [currency: cad, actual_pnl_local: 100, actual_pnl_cde: 100], [currency: usd, actual_pnl_local: 100, actual_pnl_cde: 130]
}
Basically, the key will be the book and the value will be the list of PnL data.
I have a case class defined as below:
case class PnlData(valuation_date: Option[String], currency: Option[String],pnl_status: Option[String],actual_pnl_local: Option[String] ,actual_pnl_cde: Option[String], actual_pnl_local_me_adj: Option[String] ,actual_pnl_cde_me_adj: Option[String] ) {
override def toString():String= {
s"valuation_date=$valuation_date,currency=$currency,pnl_status=$pnl_status,actual_pnl_local=$actual_pnl_local,actual_pnl_cde=$actual_pnl_cde,actual_pnl_local_me_adj=$actual_pnl_local_me_adj,actual_pnl_cde_me_adj=$actual_pnl_cde_me_adj"
}
}
case class BookLevelDaily(book_id: Option[String], book: Option[String], pnlBreakdown: List[PnlData]){
override def toString():String= {
s"book_id=$book_id,book=$book,pnl=$pnlBreakdown"
}
}
Basically, my final object is of type BookLevelDaily.
How do I translate the DB output (above) to my BookLevelDaily object?
I can convert the entire result to the list, but further how should I do groupBy?
val list: List[BookLevelDaily] =
sql"""
|SELECT QUERY TO GET ABOVE RESULTSET
""".stripMargin.map(rs =>
BookLevelDaily(
valuation_date = rs.stringOpt("valuation_date"),
book_id = rs.stringOpt("book_id"),
book = rs.stringOpt("book"),
currency= rs.stringOpt("currency"),
pnl_status= rs.stringOpt("pnl_status"),
actual_pnl_local= rs.stringOpt("actual_pnl_local"),
actual_pnl_cde= rs.stringOpt("actual_pnl_cde"),
actual_pnl_local_me_adj= rs.stringOpt("actual_pnl_local_me_adj"),
actual_pnl_cde_me_adj= rs.stringOpt("actual_pnl_cde_me_adj")
)
).list().apply()
Firstly, the above is not of type BookLevelDaily. So how do I iterate or group by to separate the PnL-level data and map it to the key (book)?
If I understand it correctly, it seems to be a one-to-many relationship (one: book_level_daily, many: pnl_breakdown). If so, check the following documentation.
http://scalikejdbc.org/documentation/one-to-x.html
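If the query result is first loaded as a flat list, the grouping itself can also be done with plain collections. The sketch below assumes a hypothetical intermediate type `FlatRow` (one value per result-set row) and a trimmed-down PnlData; it is not taken from the scalikejdbc documentation.

```scala
object PnlGrouping {
  // Trimmed-down versions of the case classes from the question
  case class PnlData(currency: Option[String], actual_pnl_local: Option[String], actual_pnl_cde: Option[String])
  case class BookLevelDaily(book_id: Option[String], book: Option[String], pnlBreakdown: List[PnlData])

  // Hypothetical flat row: book columns plus the PnL columns of that row
  case class FlatRow(book_id: Option[String], book: Option[String], pnl: PnlData)

  // Group rows by book and collect the PnL entries of each group
  def toBookLevel(rows: List[FlatRow]): List[BookLevelDaily] =
    rows
      .groupBy(r => (r.book_id, r.book))
      .map { case ((id, name), group) => BookLevelDaily(id, name, group.map(_.pnl)) }
      .toList

  val rows = List(
    FlatRow(Some("1234"), Some("EQUITY"), PnlData(Some("cad"), Some("100"), Some("100"))),
    FlatRow(Some("1234"), Some("EQUITY"), PnlData(Some("usd"), Some("100"), Some("130")))
  )
  val books: List[BookLevelDaily] = toBookLevel(rows)
}
```

With scalikejdbc, the FlatRow values would be built inside the map(rs => ...) call, using rs.stringOpt for each column.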

How to create Anorm query to skip updating None values in DB (Scala)

I am using Anorm (2.5.1) in my Play+Scala application (2.5.x, 2.11.11). I keep facing the issue quite often where if the case class argument value is None, I don't want that parameter value to be inserted/updated in SQL DB. For example:
case class EditableUser(
user_pk: String,
country: Option[String],
country_phone_code: Option[Int],
phonenumber: Option[String],
emailid: Option[String],
format_all: Option[String]
)
....
val eUser: EditableUser = EditableUser("PK0001", None, None, None, Some("xyz#email.com"), Some("yes"))
...
SQL"""
update #$USR SET
COUNTRY=${eUser.country},
COUNTRY_PHONE_CODE=${eUser.country_phone_code},
PHONENUMBER=${eUser.phonenumber},
EMAILID=${eUser.emailid},
FORMAT_ALL=${eUser.format_all}
where (lower(USER_PK)=lower(${eUser.user_pk}))
""".execute()
Here when the value is None, Anorm will insert 'null' into corresponding column in SQL DB. Instead I want to write the query in such a way that Anorm skips updating those values which are None i.e. does not overwrite.
You should use bound/prepared statements and, when setting values for the query, not set values for the columns that are None.
For example
SQL(
"""
select * from Country c
join CountryLanguage l on l.CountryCode = c.Code
where c.code = {countryCode};
"""
).on("countryCode" -> "FRA")
Or in your case:
import play.api.db.DB
import anorm._
val stat = DB.withConnection(implicit c =>
SQL("SELECT name, email FROM user WHERE id={id}").on("id" -> 42)
)
While building your query, check whether the value you are going to pass in on(x -> something) is None; if it is None, leave it out, so that the columns whose values are None are not updated.
Without the ability (or library) to access the attribute names themselves, it would still be possible, if slightly clunky in some circles, to build the update statement dynamically depending on the values that are present in the case class:
case class Foo(name:String, age:Option[Int], heightCm:Option[Int])
...
def phrase(k:String,v:Option[Int]):String=if (v.isDefined) s", $k={$k}" else ""
def update(foo:Foo) : Int = DB.withConnection { implicit c =>
def stmt(foo:Foo) = "update foo set "+
//-- non option fields
"name={name}" +
//-- option fields
phrase("age", foo.age) +
phrase("heightCm", foo.heightCm)
SQL(stmt(foo))
.on('name -> foo.name, 'age -> foo.age, 'heightCm -> foo.heightCm)
.executeUpdate() //-- returns the number of rows updated
}
The symbols that are not present in the actual submitted SQL can still be specified in the on. Other data types would also need to be catered for.
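The same idea can be sketched without any database library: build the SET clause and the parameter list together, keeping only the Option fields that are defined. `DynamicUpdate` and `buildUpdate` are hypothetical names, and the {name} placeholder style simply mirrors the example above. A real statement would also need a WHERE clause.

```scala
object DynamicUpdate {
  case class Foo(name: String, age: Option[Int], heightCm: Option[Int])

  // Returns the SQL text and only the parameters that are actually present.
  def buildUpdate(foo: Foo): (String, Seq[(String, Any)]) = {
    // Keep an optional column only when its value is defined
    val optional: Seq[(String, Any)] =
      Seq("age" -> foo.age, "heightCm" -> foo.heightCm).collect {
        case (col, Some(v)) => col -> v
      }
    // The non-optional column is always part of the update
    val params: Seq[(String, Any)] = ("name" -> foo.name) +: optional
    val setClause = params.map { case (col, _) => s"$col={$col}" }.mkString(", ")
    (s"update foo set $setClause", params)
  }
}
```

Because the statement text and the parameters come from the same list, a column that is None appears in neither.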

Add value with groupByKey

I have some troubles with groupByKey in scala and Spark.
I have 2 case classes :
case class Employee(id_employee: Long, name_emp: String, salary: String)
For the moment I use this 2nd case class:
case class Company(id_company: Long, employee:Seq[Employee])
However, I want to replace it with this new one:
case class Company(id_company: Long, name_comp: String, employee:Seq[Employee])
There is a parent DataSet (df1) that I use with groupByKey to create Company objects :
val companies = df1.groupByKey(v => v.id_company)
.mapGroups(
{
case(k,iter) => Company(k, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
}
).collect()
This code works, it returns objects like this one :
Company(1234,List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))
But I can't find a way to add the company name_comp to those objects (this field exists in df1), in order to retrieve objects like this (using the new case class):
Company(1234, NYTimes, List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))
Since you want both the company id and name, what you can do is to use a tuple as the key when you group your data. This will make both values easily available when constructing the Company class:
df1.groupByKey(v => (v.id_company, v.name_comp))
.mapGroups{ case((id, name), iter) =>
Company(id, name, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)}
.collect()
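If the name is fully determined by the id, an alternative is to keep the id as the only key and read the name from the first row of each group. The sketch below shows this on plain collections; `Record` is a hypothetical stand-in for a row of df1. With mapGroups, the iterator would have to be materialised first (for example with iter.toSeq), since it can be traversed only once.

```scala
object GroupByIdSketch {
  case class Employee(id_employee: Long, name_emp: String, salary: String)
  case class Company(id_company: Long, name_comp: String, employee: Seq[Employee])

  // Hypothetical flat record standing in for a row of df1
  case class Record(id_company: Long, name_comp: String, id_employee: Long, name_emp: String, salary: String)

  def toCompanies(records: Seq[Record]): Seq[Company] =
    records
      .groupBy(_.id_company)
      .map { case (id, rs) =>
        // groupBy never produces an empty group, so head is safe here
        Company(id, rs.head.name_comp, rs.map(r => Employee(r.id_employee, r.name_emp, r.salary)))
      }
      .toSeq
}
```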

Converting one case class to another that is similar with additional parameter in Scala

So, the problem is in the title, but here are the details.
I have two case classes:
case class JourneyGroup(id: Option[Int] = None,
key: UUID,
name: String,
data: Option[JsValue],
accountId: Int,
createdAt: DateTime = DateTime.now,
createdById: Int,
updatedAt: Option[DateTime] = None,
updatedById: Option[Int] = None,
deletedAt: Option[DateTime] = None,
deletedById: Option[Int] = None)
and
case class JourneyGroupApi(id: Option[Int] = None,
key: UUID,
name: String,
data: Option[JsValue],
accountId: Int,
createdAt: DateTime = DateTime.now,
createdById: Int,
updatedAt: Option[DateTime] = None,
updatedById: Option[Int] = None,
deletedAt: Option[DateTime] = None,
deletedById: Option[Int] = None,
parties: Seq[Party] = Seq.empty[Party])
Background: the reason for having these two separate classes is the fact that slick does not support collections, and I do need collections of related objects that I build manually. Bottom line, I could not make it work with a single class.
What I need is an easy way to convert from one to another.
At this point, to unblock myself, I created a manual conversion:
def toJourneyGroupApi(parties: Seq[Party]): JourneyGroupApi = JourneyGroupApi(
id = id,
key = key,
name = name,
data = data,
accountId = accountId,
createdAt = createdAt,
createdById = createdById,
updatedAt = updatedAt,
updatedById = updatedById,
deletedAt = deletedAt,
deletedById = deletedById,
parties = parties
)
Which is working, but extremely ugly and requires a lot of maintenance.
One thing that I tried doing is:
convert the source object to tuple
Add an element to that tuple using shapeless
and build a target object from resulting tuple
import shapeless._
import syntax.std.tuple._
val groupApi = (JourneyGroup.unapply(group).get :+ Seq.empty[Party])(JourneyGroupApi.tupled)
But, this thing is claiming, that the result of :+ is not tuple, even though in console:
Party.unapply(p).get :+ Seq.empty[Participant]
res0: (Option[Int], model.Parties.Type.Value, Int, Int, org.joda.time.DateTime, Int, Option[org.joda.time.DateTime], Option[Int], Option[org.joda.time.DateTime], Option[Int], Seq[model.Participant]) = (None,client,123,234,2016-11-12T03:55:24.006-08:00,987,None,None,None,None,List())
What am I doing wrong? Maybe there is another way of achieving this.
Could you consider Composition?
case class JourneyGroup(
...
)
case class JourneyGroupApi(
journeyGroup: JourneyGroup=JourneyGroup(),
parties: Seq[Party] = Seq()
)
Converting a journeyGroup would just be something like JourneyGroupApi(journeyGroup,parties) and "converting" a journeyGroupApi would be a matter of accessing journeyGroupApi.journeyGroup. You could perhaps come up with names that worked better for this case. Not sure if this approach would fit the rest of your code. In particular referencing journeyGroup attributes in a journeyGroupApi will be one extra level, e.g. journeyGroupApi.journeyGroup.accountId. (This could potentially be mitigated by "shortcut" definitions on journeyGroupApi like lazy val accountId = journeyGroup.accountId.)
Inheritance might also be an approach to consider with a base case class of JourneyGroup then a normal class (not case class) that extends it with parties as the extra attribute. This option is discussed further in this SO thread.
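A minimal sketch of the composition approach, with the field list cut down to three attributes and a hypothetical one-field `Party`; the real classes would keep all of their fields.

```scala
object CompositionSketch {
  // Hypothetical stand-in for the real Party class
  case class Party(name: String)

  // Reduced version of JourneyGroup, for illustration only
  case class JourneyGroup(id: Option[Int] = None, name: String, accountId: Int)

  case class JourneyGroupApi(journeyGroup: JourneyGroup, parties: Seq[Party] = Seq.empty) {
    // "Shortcut" accessor so callers can skip the extra level
    def accountId: Int = journeyGroup.accountId
  }

  val group = JourneyGroup(name = "g1", accountId = 7)
  val api = JourneyGroupApi(group, Seq(Party("p1")))
}
```

Converting back is then just api.journeyGroup, and adding a field to JourneyGroup needs no change in the conversion code.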

Case Classes with optional fields in Scala

For example, I have this case class:
case class Student (firstName : String, lastName : String)
If I use this case class, is it possible that supplying data to the fields inside the case class are optional? For example, I'll do this:
val student = new Student(firstName = "Foo")
Thanks!
If you just want to omit the second parameter without supplying a default value, I suggest you use an Option.
case class Student(firstName: String, lastName: Option[String] = None)
Now you might create instances this way:
Student("Foo")
Student("Foo", None) // equal to the one above
Student("Foo", Some("Bar")) // necessary to add a lastName
To make it usable as you wanted it, I will add an implicit:
object Student {
implicit def string2Option(s: String) = Some(s)
}
Now you are able to call it in these ways:
import Student._
Student("Foo")
Student("Foo", None)
Student("Foo", Some("Bar"))
Student("Foo", "Bar")
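If the implicit conversion feels too permissive, the Option field can also be used directly. In this small sketch, `fullName` is a hypothetical helper added only to show how the optional value is read.

```scala
object StudentOption {
  case class Student(firstName: String, lastName: Option[String] = None) {
    // Falls back to the first name alone when no last name was given
    def fullName: String = lastName.fold(firstName)(ln => s"$firstName $ln")
  }

  val onlyFirst = Student("Foo")
  val both = Student("Foo", Some("Bar"))
}
```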
You were close:
case class Student (firstName : String = "John", lastName : String = "Doe")
val student = Student(firstName = "Foo")
Another possibility is partially applied function:
case class Student (firstName : String, lastName : String)
val someJohn = Student("John", _: String)
//someJohn: String => Student = <function1>
val johnDoe = someJohn("Doe")
//johnDoe: Student = Student(John,Doe)
And to be complete, you can create some default object and then change some field:
val johnDeere = johnDoe.copy(lastName="Deere")
//johnDeere: Student = Student(John,Deere)
I would see two ways this is normally done.
1. default parameters
case class Student (firstName : String, lastName : String = "")
Student("jeypijeypi") // Student(jeypijeypi,)
2. alternative constructors
case class Student (firstName : String, lastName : String)
object Student {
def apply(firstName: String) = new Student(firstName,"")
}
Student("jeypijeypi") // Student(jeypijeypi,)
Which one is better depends slightly on the circumstances. The latter gives you more freedom: you can make any parameter(s) optional, or even change their order (not recommended). Default parameters are easiest to use at the end of the parameter list; if one comes earlier, callers have to pass the later arguments by name. You can also combine these two ways.
Note: within the alternative constructors you need new to point the compiler to the actual constructor. Normally new is not used with case classes.