How to pass join key as variable in Spark Data Frame using Scala - scala

I am trying to keep the join keys of two DataFrames in two variables, and then pass those variables into a join. Here my variable contains one key. Can I also pass more than one key?
Ex:
1st key :
scala> val primary_key_col = scd_table_keys_df.first().getString(2)
primary_key_col: String = acct_nbr
2nd key :
scala> val delta_primary_key_col = "delta_"+primary_key_col
delta_primary_key_col: String = delta_acct_nbr
My Python code, which works:
cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(col(delta_primary_key_col) == col(primary_key_col)) ,'left_outer' ).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
I want to achieve the same in Scala. Please suggest.
I have tried multiple ways:
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df({primary_key_col.mkstring(",")}) == hist_tgt_tbl_Y_df({primary_key_col.mkstring(",")}),"left_outer" )).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
:121: error: value mkstring is not a member of String
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df(delta_primary_key_col.map(c => col(c))) == hist_tgt_tbl_Y_df(primary_key_col.map(c => col(c))),"left_outer" ))
:123: error: type mismatch;
found : Array[org.apache.spark.sql.Column]
required: String
I am not able to solve this. Please suggest.

I was missing the variable substitution. This works for me:
scala> val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df, delta_src_rename_df(s"$delta_primary_key_col") === hist_tgt_tbl_Y_df(s"$primary_key_col"), "left_outer")
cdc_new_acct_df: org.apache.spark.sql.DataFrame = [delta_acct_nbr: string, delta_primary_state: string, delta_zip_code: string, delta_load_tm: string, delta_load_date: string, hash_key_col: string, delta_hash_key: int, delta_eff_start_date: string, acct_nbr: string, account_sk_id: bigint, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, load_tm: string, hash_key: string, eff_flag: string]
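To answer the second part of the question (more than one key): a hedged sketch, assuming the key column names are kept in a Seq[String] and the delta columns follow the same delta_ prefix convention, is to build one equality Column per key and combine them with &&:
// Hypothetical list of key column names; adapt to however scd_table_keys_df stores them
val primary_key_cols = Seq("acct_nbr", "zip_code")
// One equality condition per key, reduced into a single join condition
val joinCondition = primary_key_cols
  .map(k => delta_src_rename_df(s"delta_$k") === hist_tgt_tbl_Y_df(k))
  .reduce(_ && _)
val cdc_new_acct_df = delta_src_rename_df
  .join(hist_tgt_tbl_Y_df, joinCondition, "left_outer")
  .where(hist_tgt_tbl_Y_df(primary_key_cols.head).isNull)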

Related

Convert spark scala dataset of one type to another

I have a Dataset with the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: String
)
I want to convert it to:
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int], // i.e. it is optional
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
Using a parser function:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
  unparsedAddress.map(address => {
    val split = address.address.split(", ")
    address.copy(
      number = Some(split(0).toInt),
      road = Some(split(1)),
      city = Some(split(2)),
      country = Some(split(3))
    )
  })
}
I am new to Scala and Spark. Could anyone please let me know how this can be done?
You were on the right path! There are of course multiple ways of doing this, but since you have already made the case classes and started writing a parsing function, an elegant solution is to use the Dataset's map function. From the docs, that map function has the following signature:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Where T is the starting type (AddressRawData in your case) and U is the type you want to get to (AddressData in your case). So the input of this map function is a function that transforms an AddressRawData into an AddressData. That could perfectly well be the addressParser you have started making!
Now, your current addressParser has the following signature:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData]
In order to be able to feed it to that map function, we need to change its signature to:
def newAddressParser(unparsedAddress: AddressRawData): AddressData
Knowing all of this, we can work further! An example would be the following:
import spark.implicits._
import scala.util.Try
// Your case classes
case class AddressRawData(addressId: String, customerId: String, address: String)
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
// Your addressParser function, adapted to be able to feed into the Dataset.map
// function
def addressParser(rawAddress: AddressRawData): AddressData = {
  val addressArray = rawAddress.address.split(", ")
  AddressData(
    rawAddress.addressId,
    rawAddress.customerId,
    rawAddress.address,
    Try(addressArray(0).toInt).toOption,
    Try(addressArray(1)).toOption,
    Try(addressArray(2)).toOption,
    Try(addressArray(3)).toOption
  )
}
// Creating a sample dataset
val rawDS = Seq(
  AddressRawData("1", "1", "20, my super road, beautifulCity, someCountry"),
  AddressRawData("1", "1", "badFormat, some road, cityButNoCountry")
).toDS
val parsedDS = rawDS.map(addressParser)
parsedDS.show
+---------+----------+--------------------+------+-------------+----------------+-----------+
|addressId|customerId| address|number| road| city| country|
+---------+----------+--------------------+------+-------------+----------------+-----------+
| 1| 1|20, my super road...| 20|my super road| beautifulCity|someCountry|
| 1| 1|badFormat, some r...| null| some road|cityButNoCountry| null|
+---------+----------+--------------------+------+-------------+----------------+-----------+
As you can see, thanks to the fact that you had already foreseen that parsing can go wrong, it was easy to use scala.util.Try to attempt to extract the pieces of that raw address and add some robustness (the second row contains null values where the address string could not be parsed).
Hope this helps!

Spark Scala: join a Dataset with a Seq[case class]

I have been trying to create a new Dataset that combines an existing Dataset and a Seq[case class], on the condition that cid is present in both of them.
There may be a unique cid with multiple tid values; I am trying to find the total transaction amount (money) per individual cid.
case class Data1(
  cid: String,
  fname: String,
  add: String
)
case class Data2(
  cid: String,
  tid: String,
  money: Long
)
I have read the CSVs into data1DF and data2DF, then created data1DS and data2DS with the case classes:
data1DS: Dataset[Data1] = data1DF.as[Data1]
data2DS: Dataset[Data2] = data2DF.as[Data2]
I tried making a Seq[Data2] and joining it with data1DS, but it throws an error:
val dd : Seq[Data2] = data2DS.collect().toSeq
val ansDS = data1DS.join(dd, data1DS("cid") === dd("cid"), "leftouter")
<console>:36: error: type mismatch;
found : String("cid")
required: Int
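A hedged sketch of one way around this: join expects another Dataset or DataFrame, not a Seq (dd("cid") calls Seq.apply, which takes an Int index, hence the "required: Int" error). Assuming the goal is the total money per cid, the two Datasets can be joined directly and aggregated:
import org.apache.spark.sql.functions.sum
// Join the two Datasets directly instead of collecting one into a Seq
val ansDF = data1DS
  .join(data2DS, Seq("cid"), "left_outer")
  .groupBy("cid", "fname")
  .agg(sum("money").alias("total_money"))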

Scala Option and Some mismatch

I want to parse province into a case class, but it throws a mismatch:
scala.MatchError: Some(USA) (of class scala.Some)
val result = EntityUtils.toString(entity,"UTF-8")
val address = JsonParser.parse(result).extract[Address]
val value.province = Option(address.province)
val value.city = Option(address.city)
case class Access(
  device: String,
  deviceType: String,
  os: String,
  event: String,
  net: String,
  channel: String,
  uid: String,
  nu: Int,
  ip: String,
  time: Long,
  version: String,
  province: Option[String],
  city: Option[String],
  product: Option[Product]
)
This:
val value.province = Option(address.province)
val value.city = Option(address.city)
doesn't do what you think it does. It tries to treat value.province and value.city as extractor patterns (which don't match the type, hence the scala.MatchError exception). It doesn't mutate value as I believe you intended (because value apparently doesn't have such setters).
Since value is (apparently) Access case class, it is immutable and you can only obtain an updated copy:
val value2 = value.copy(
  province = Option(address.province),
  city = Option(address.city)
)
Assuming the starting point:
val province: Option[String] = ???
You can get the string with simple pattern matching:
province match {
  case Some(stringValue) => JsonParser.parse(stringValue).extract[Province] // use the parser to go from String to a case class
  case None => ... // maybe provide a default value, depends on your context
}
Note: without knowing what extract[T] returns, it's hard to recommend a follow-up.
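As a small hedged alternative (plain Scala Option API), the same thing can be written with map and getOrElse; defaultProvince here is a hypothetical placeholder for whatever default fits your context:
val province: Option[String] = ???
val parsed = province
  .map(stringValue => JsonParser.parse(stringValue).extract[Province]) // parse when a value is present
  .getOrElse(defaultProvince) // hypothetical default, depends on your context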

Technique to write multiple columns into a single function in Scala

Below are two methods using Spark Scala where I am trying to find whether a column contains a string and then sum the number of occurrences (1 or 0). Is there a better way to write this as a single function, so that we can avoid writing a new method each time a new condition gets added? Thanks in advance.
def sumFunctDays1cols(columnName: String, dayid: String, processday: String, fieldString: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday).and(lower(col(columnName)).contains(fieldString)), 1).otherwise(0)).alias(newColName)
}
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, fieldString1: String, fieldString2: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday).and(lower(col(columnName)).contains(fieldString1) || lower(col(columnName)).contains(fieldString2)), 1).otherwise(0)).alias(newColName)
}
Below is where I am calling the function.
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "cust_count")
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "lenovo","prod_count")
You could do something like the below (not tested yet):
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, newColName: String, fields: Column*): Column = {
sum(
when(
('visit_start_time > dayid)
.and('visit_start_time <= processday)
.and(fields.map(lower(col(columnName)).contains(_)).reduce( _ || _)),
1
).otherwise(0)).alias(newColName)
}
And you can use it as
sumFunctDays2cols(
  "columnName",
  "2019-01-01",
  "2019-01-10",
  "prod_count",
  lit("mac"), lit("lenovo")
)
Hope this helps!
Make the parameter to your function a collection: instead of passing String1, String2, ..., pass a list of strings.
I have implemented a small example for you:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // needed for toDF
val df = Seq(
  (1, "mac"),
  (2, "lenovo"),
  (3, "hp"),
  (4, "dell")
).toDF("id", "brand")
// dictionary Set of words to check
val dict = Set("mac","leno","noname")
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_) )}
df.withColumn("brand_check", checkerUdf($"brand")).show()
I hope this solves your issue. But if you still need help, upload the entire code snippet and I will help you with it.

TypedPipe can't coerce strings to DateTime even when given implicit function

I've got a Scalding data flow that starts with a bunch of pipe-separated value files. The first column is a DateTime in a slightly non-standard format. I want to use the strongly typed TypedPipe API, so I've specified a tuple type and a case class to contain the data:
type Input = (DateTime, String, Double, Double, String)
case class LatLonRecord(date : DateTime, msidn : String, lat : Double, lon : Double, cellname : String)
However, Scalding doesn't know how to coerce a String into a DateTime, so I tried adding an implicit function to do the dirty work:
implicit def stringToDateTime(dateStr: String): DateTime =
DateTime.parse(dateStr, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S"))
However, I still get a ClassCastException:
val lines: TypedPipe[Input] = TypedPipe.from(TypedPsv[Input](args("input")))
lines.map(x => x._1).dump
//cascading.flow.FlowException: local step failed at java.lang.Thread.run(Thread.java:745)
//Caused by: cascading.pipe.OperatorException: [FixedPathTypedDelimite...][com.twitter.scalding.RichPipe.eachTo(RichPipe.scala:509)] operator Each failed executing operation
//Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.joda.time.DateTime
What do I need to do to get Scalding to call my conversion function?
So I ended up doing this:
case class LatLonRecord(date : DateTime, msisdn : String, lat : Double, lon : Double, cellname : String)
object dateparser {
  val format = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S")
  def parse(s: String): DateTime = DateTime.parse(s, format)
}
//changed first column to a String, yuck
type Input = (String, String, Double, Double, String)
val lines: TypedPipe[Input] = TypedPipe.from( TypedPsv[Input]( args("input")) )
val recs = lines.map(v => LatLonRecord(dateparser.parse(v._1), v._2, v._3,v._4, v._5))
But I feel like it's a sub-optimal solution. I welcome better answers from people who have been using Scala for more than, say, 1 week, like me.
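A hedged polish of the same approach (plain Scala, no change on the Scalding side): destructuring the tuple in the map makes the intent a little clearer than positional _1.._5 access:
// Same mapping as above, but with the tuple fields named via pattern matching
val recs: TypedPipe[LatLonRecord] = lines.map {
  case (date, msisdn, lat, lon, cellname) =>
    LatLonRecord(dateparser.parse(date), msisdn, lat, lon, cellname)
}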