Convert spark scala dataset of one type to another - scala

I have a dataset with the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: String
)
I want to convert it to:
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int], // i.e. it is optional
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
Using a parser function:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
  unparsedAddress.map(address => {
    val split = address.address.split(", ")
    address.copy(
      number = Some(split(0).toInt),
      road = Some(split(1)),
      city = Some(split(2)),
      country = Some(split(3))
    )
  })
}
I am new to Scala and Spark. Could anyone please let me know how this can be done?

You were on the right path! There are of course multiple ways of doing this, but since you have already defined your case classes and started writing a parsing function, an elegant solution is to use the Dataset's map function. From the docs, its signature is:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Here T is the starting type (AddressRawData in your case) and U is the type you want to get to (AddressData in your case). So the input of this map function is a function that transforms an AddressRawData into an AddressData. That could perfectly well be the addressParser you've started writing!
Now, your current addressParser has the following signature:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData]
In order to feed it to that map function, we need to change it to this signature:
def newAddressParser(unparsedAddress: AddressRawData): AddressData
Knowing all of this, we can work further! An example would be the following:
import spark.implicits._
import scala.util.Try

// Your case classes
case class AddressRawData(addressId: String, customerId: String, address: String)

case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Your addressParser function, adapted to be able to feed into the
// Dataset.map function
def addressParser(rawAddress: AddressRawData): AddressData = {
  val addressArray = rawAddress.address.split(", ")
  AddressData(
    rawAddress.addressId,
    rawAddress.customerId,
    rawAddress.address,
    Try(addressArray(0).toInt).toOption,
    Try(addressArray(1)).toOption,
    Try(addressArray(2)).toOption,
    Try(addressArray(3)).toOption
  )
}

// Creating a sample dataset
val rawDS = Seq(
  AddressRawData("1", "1", "20, my super road, beautifulCity, someCountry"),
  AddressRawData("1", "1", "badFormat, some road, cityButNoCountry")
).toDS

val parsedDS = rawDS.map(addressParser)
parsedDS.show
+---------+----------+--------------------+------+-------------+----------------+-----------+
|addressId|customerId| address|number| road| city| country|
+---------+----------+--------------------+------+-------------+----------------+-----------+
| 1| 1|20, my super road...| 20|my super road| beautifulCity|someCountry|
| 1| 1|badFormat, some r...| null| some road|cityButNoCountry| null|
+---------+----------+--------------------+------+-------------+----------------+-----------+
As you can see, since you had already foreseen that parsing can go wrong, it was easy to use scala.util.Try to attempt to extract the pieces of that raw address and add some robustness (the second row contains null values where the address string could not be parsed).
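For illustration, here is how Try behaves on its own (a minimal sketch, independent of Spark):

import scala.util.Try

Try("20".toInt).toOption        // Some(20)
Try("badFormat".toInt).toOption // None
val parts = "a, b".split(", ")
Try(parts(3)).toOption          // None: the out-of-bounds access is caught too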
Hope this helps!

Related

How to convert List[String] to List[Object] in Scala

I need to convert a List[String] to a List[MyObject] in Scala.
For example, the JSON input is like below:
employee: {
  name: "test",
  employeeBranch: ["CSE", "IT", "ECE"]
}
Output should be like this,
Employee: {
  Name: "test",
  EmployeeBranch: [{"branch": "CSE"}, {"branch": "IT"}, {"branch": "ECE"}]
}
}
Input case class:
case class Office(
  name: Option[String],
  employeeBranch: Option[List[String]])
Output case class:
case class Output(
  Name: Option[String],
  EmployeeBranch: Option[List[Branch]])
case class Branch(
  branch: Option[String])
This is the requirement.
It is hard to answer without knowing details of the particular JSON library, but an Object is probably represented as a Map. So to convert a List[String] to a List[Map[String, String]] you can do this:
val list = List("CSE", "IT", "ECE")
val map = list.map(x => Map("branch" -> x))
This gives
List(Map(branch -> CSE), Map(branch -> IT), Map(branch -> ECE))
which should convert to the JSON you want.
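If you would rather stay with the case classes from the question, the same map idea works directly on them; a minimal sketch, assuming the classes are declared as case classes and using a hypothetical helper toOutput:

case class Branch(branch: Option[String])
case class Office(name: Option[String], employeeBranch: Option[List[String]])
case class Output(Name: Option[String], EmployeeBranch: Option[List[Branch]])

def toOutput(office: Office): Output =
  Output(
    Name = office.name,
    EmployeeBranch = office.employeeBranch.map(_.map(b => Branch(Some(b))))
  )

// toOutput(Office(Some("test"), Some(List("CSE", "IT", "ECE"))))
// => Output(Some(test),Some(List(Branch(Some(CSE)), Branch(Some(IT)), Branch(Some(ECE)))))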

Scala Option and Some mismatch

I want to parse province into a case class, but it throws a mismatch:
scala.MatchError: Some(USA) (of class scala.Some)
val result = EntityUtils.toString(entity,"UTF-8")
val address = JsonParser.parse(result).extract[Address]
val value.province = Option(address.province)
val value.city = Option(address.city)
case class Access(
  device: String,
  deviceType: String,
  os: String,
  event: String,
  net: String,
  channel: String,
  uid: String,
  nu: Int,
  ip: String,
  time: Long,
  version: String,
  province: Option[String],
  city: Option[String],
  product: Option[Product]
)
This:
val value.province = Option(address.province)
val value.city = Option(address.city)
doesn't do what you think it does. It tries to treat value.province and value.city as extractors (which don't match the type, hence the scala.MatchError exception). It doesn't mutate value as I believe you intended (because value apparently doesn't have such setters).
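For context, a val definition whose left-hand side is not a plain identifier is parsed as a pattern match; a minimal illustration, unrelated to your classes:

val some: Option[String] = Some("USA")
val Some(p) = some   // pattern match: binds p = "USA"
// "val value.province = ..." is likewise treated as a pattern; when the
// right-hand side does not match it, Scala throws scala.MatchError at runtime.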
Since value is (apparently) an instance of the Access case class, it is immutable and you can only obtain an updated copy:
val value2 = value.copy(
  province = Option(address.province),
  city = Option(address.city)
)
Assuming the starting point:
val province: Option[String] = ???
You can get the string with simple pattern matching:
province match {
  case Some(stringValue) => JsonParser.parse(stringValue).extract[Province] // use parser to go from string to a case class
  case None => .. // maybe provide a default value, depends on your context
}
Note: without knowing what extract[T] returns, it's hard to recommend a follow-up.

Technique to write multiple columns into a single function in Scala

Below are two methods in Spark Scala where I am trying to find whether a column contains a string and then sum the number of occurrences (1 or 0). Is there a better way to write this as a single function, so that we can avoid writing a new method each time a new condition gets added? Thanks in advance.
def sumFunctDays1cols(columnName: String, dayid: String, processday: String, fieldString: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday).and(lower(col(columnName)).contains(fieldString)), 1).otherwise(0)).alias(newColName)
}

def sumFunctDays2cols(columnName: String, dayid: String, processday: String, fieldString1: String, fieldString2: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday).and(lower(col(columnName)).contains(fieldString1) || lower(col(columnName)).contains(fieldString2)), 1).otherwise(0)).alias(newColName)
}
Below is where I am calling the functions.
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "cust_count")
sumFunctDays2cols("columnName", "2019-01-01", "2019-01-10", "mac", "lenovo", "prod_count")
You could do something like the below (not tested yet):
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, newColName: String, fields: Column*): Column = {
  sum(
    when(
      ('visit_start_time > dayid)
        .and('visit_start_time <= processday)
        .and(fields.map(lower(col(columnName)).contains(_)).reduce(_ || _)),
      1
    ).otherwise(0)
  ).alias(newColName)
}
And you can use it as:
sumFunctDays2cols(
  "columnName",
  "2019-01-01",
  "2019-01-10",
  "prod_count",
  lit("mac"), lit("lenovo")
)
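If you prefer callers to pass plain strings, a variant of the same idea could take String* instead, since Column.contains accepts a literal value. A sketch (the function name sumFunctDays is just for illustration, not tested against your data):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lower, sum, when}

def sumFunctDays(columnName: String, dayid: String, processday: String, newColName: String, fields: String*): Column = {
  sum(
    when(
      (col("visit_start_time") > dayid)
        .and(col("visit_start_time") <= processday)
        .and(fields.map(f => lower(col(columnName)).contains(f)).reduce(_ || _)),
      1
    ).otherwise(0)
  ).alias(newColName)
}

// Callers then pass the search terms directly:
// sumFunctDays("columnName", "2019-01-01", "2019-01-10", "prod_count", "mac", "lenovo")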
Hope this helps!
Make the parameter to your function a list: instead of string1, string2, ..., make the parameter a list of strings.
I have implemented a small example for you:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(
  (1, "mac"),
  (2, "lenovo"),
  (3, "hp"),
  (4, "dell")
).toDF("id", "brand")

// dictionary Set of words to check
val dict = Set("mac", "leno", "noname")

val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }

df.withColumn("brand_check", checkerUdf($"brand")).show()
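For the sample rows above, brand_check should come out true for "mac" and "lenovo" (each contains an entry from dict) and false for "hp" and "dell".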
I hope this solves your issue. But if you still need help, upload the entire code snippet, and I will help you with it.

Implementing my HbaseConnector

I would like to implement an HBaseConnector.
I'm currently reading the guide, but there is a part that I don't understand and I can't find any information about it.
In part 2 of the guide we can see the following code:
case class HBaseRecord(col0: String, col1: Boolean,col2: Double, col3: Float,col4: Int, col5: Long, col6: Short, col7: String, col8: Byte)
object HBaseRecord {
  def apply(i: Int, t: String): HBaseRecord = {
    val s = s"""row${"%03d".format(i)}"""
    HBaseRecord(s, i % 2 == 0, i.toDouble, i.toFloat, i, i.toLong, i.toShort, s"String$i: $t", i.toByte)
  }
}
val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
I do understand that they store the future columns in the HBaseRecord case class, but I don't understand the specific use of this line:
val s = s"""row${"%03d".format(i)}"""
Could someone care to explain?
It is used to generate row ids like row001, row002, etc., which will populate column0 of your table. Try out a simpler version of it as a standalone function:
def generate(i: Int): String = { s"""row${"%03d".format(i)}"""}
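Calling it makes the zero-padding visible:

generate(1)    // "row001"
generate(42)   // "row042"
generate(255)  // "row255"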

How to pass join key as variable in Spark Data Frame using Scala

I am trying to keep the join keys of the two dataframes in two variables, which I then want to pass into a join. Here my variable contains one key. Can I also pass more than one key?
Ex:
1st key :
scala> val primary_key_col = scd_table_keys_df.first().getString(2)
primary_key_col: String = acct_nbr
2nd key :
scala> val delta_primary_key_col = "delta_"+primary_key_col
delta_primary_key_col: String = delta_acct_nbr
My Python code, which is working:
cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(col(delta_primary_key_col) == col(primary_key_col)) ,'left_outer' ).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
I want to achieve the same in Scala. Please suggest.
I tried multiple ways:
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df({primary_key_col.mkstring(",")}) == hist_tgt_tbl_Y_df({primary_key_col.mkstring(",")}),"left_outer" )).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
:121: error: value mkstring is not a member of String
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df(delta_primary_key_col.map(c => col(c))) == hist_tgt_tbl_Y_df(primary_key_col.map(c => col(c))),"left_outer" ))
:123: error: type mismatch;
found : Array[org.apache.spark.sql.Column]
required: String
Not able to solve. Please suggest.
I was missing the variable substitution. This is working for me.
scala> val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df, delta_src_rename_df(s"$delta_primary_key_col") === hist_tgt_tbl_Y_df(s"$primary_key_col"), "left_outer")
cdc_new_acct_df: org.apache.spark.sql.DataFrame = [delta_acct_nbr: string, delta_primary_state: string, delta_zip_code: string, delta_load_tm: string, delta_load_date: string, hash_key_col: string, delta_hash_key: int, delta_eff_start_date: string, acct_nbr: string, account_sk_id: bigint, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, load_tm: string, hash_key: string, eff_flag: string]
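To answer the follow-up question about more than one key: one common approach, sketched here with hypothetical key lists, is to build the join condition by reducing a sequence of per-key comparisons:

// Hypothetical: parallel lists of key column names on each side
val leftKeys  = Seq("delta_acct_nbr", "delta_zip_code")
val rightKeys = Seq("acct_nbr", "zip_code")

val joinCondition = leftKeys.zip(rightKeys)
  .map { case (l, r) => delta_src_rename_df(l) === hist_tgt_tbl_Y_df(r) }
  .reduce(_ && _)

val cdc_new_acct_df = delta_src_rename_df
  .join(hist_tgt_tbl_Y_df, joinCondition, "left_outer")
  .where(hist_tgt_tbl_Y_df(rightKeys.head).isNull)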