Technique to write multiple columns into a single function in Scala

Below are two methods written in Spark Scala that check whether a column contains a string and then sum the number of occurrences (1 or 0). Is there a better way to write this as a single function, so we can avoid adding a new method each time a new condition is added? Thanks in advance.
def sumFunctDays1cols(columnName: String, dayid: String, processday: String, fieldString: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid)
    .and('visit_start_time <= processday)
    .and(lower(col(columnName)).contains(fieldString)), 1)
    .otherwise(0)).alias(newColName)
}

def sumFunctDays2cols(columnName: String, dayid: String, processday: String, fieldString1: String, fieldString2: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid)
    .and('visit_start_time <= processday)
    .and(lower(col(columnName)).contains(fieldString1) || lower(col(columnName)).contains(fieldString2)), 1)
    .otherwise(0)).alias(newColName)
}
Below is where I am calling the functions.
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "cust_count")
sumFunctDays2cols("columnName", "2019-01-01", "2019-01-10", "mac", "lenovo", "prod_count")

You could do something like the following (not tested yet):
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, newColName: String, fields: Column*): Column = {
  sum(
    when(
      ('visit_start_time > dayid)
        .and('visit_start_time <= processday)
        .and(fields.map(lower(col(columnName)).contains(_)).reduce(_ || _)),
      1
    ).otherwise(0)
  ).alias(newColName)
}
And you can use it as below, passing the search terms as literal columns with lit from org.apache.spark.sql.functions rather than as column references:
sumFunctDays2cols(
  "columnName",
  "2019-01-01",
  "2019-01-10",
  "prod_count",
  lit("mac"), lit("lenovo")
)
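If you would rather pass plain strings at the call site, a variation of the same idea is to take fields: String* and wrap each value with lit inside the function. This is an untested sketch (the name sumFunctDays is just a placeholder; visit_start_time is the column from the question):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, lower, sum, when}

// One function covers both the single-term and the multi-term case.
// Assumes at least one search term is passed (reduce on an empty Seq would fail).
def sumFunctDays(columnName: String, dayid: String, processday: String, newColName: String, fields: String*): Column = {
  val matchesAny = fields.map(f => lower(col(columnName)).contains(lit(f))).reduce(_ || _)
  sum(
    when((col("visit_start_time") > dayid)
      .and(col("visit_start_time") <= processday)
      .and(matchesAny), 1)
      .otherwise(0)
  ).alias(newColName)
}

sumFunctDays("columnName", "2019-01-01", "2019-01-10", "cust_count", "mac")
sumFunctDays("columnName", "2019-01-01", "2019-01-10", "prod_count", "mac", "lenovo")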
Hope this helps!

Make the parameter to your function a list: instead of String1, String2, ..., take a list of strings.
I have implemented a small example for you:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // needed for toDF and the $"..." column syntax

val df = Seq(
  (1, "mac"),
  (2, "lenovo"),
  (3, "hp"),
  (4, "dell")).toDF("id", "brand")

// dictionary: the set of words to check for
val dict = Set("mac", "leno", "noname")

val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }

df.withColumn("brand_check", checkerUdf($"brand")).show()
I hope this solves your issue. But if you still need help, upload the entire code snippet, and I will help you with it.
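If you want the word list itself to be a parameter rather than a fixed dict, one option (an untested sketch; containsAnyUdf is just an illustrative name) is to build the UDF from whatever set you pass in:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Builds a checker UDF for an arbitrary set of search words (null-safe).
def containsAnyUdf(words: Set[String]): UserDefinedFunction =
  udf { (s: String) => s != null && words.exists(s.contains(_)) }

df.withColumn("brand_check", containsAnyUdf(Set("mac", "leno", "noname"))($"brand")).show()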

Related

Is there any way to specify type in scala dynamically

I'm new to Spark and Scala, so sorry for the stupid question. I have a number of tables:
table_a, table_b, ...
and a number of corresponding types for these tables:
case class classA(...), case class classB(...), ...
Then I need to write methods that read data from these tables and create a Dataset:
def getDataFromSource: Dataset[classA] = {
  val df: DataFrame = spark.sql("SELECT * FROM table_a")
  df.as[classA]
}
The same goes for the other tables and types. Is there any way to avoid this routine code, I mean writing an individual function for each table, and get by with one? For example:
def getDataFromSource[T: Encoder](table_name: String): Dataset[T] = {
  val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
  df.as[T]
}
Then create a list of pairs (table_name, type_name):
val tableTypePairs = List(("table_a", classA), ("table_b", classB), ...)
Then call it using foreach:
tableTypePairs.foreach(tupl => getDataFromSource[what should I put here?](tupl._1))
Thanks in advance!
Something like this should work:
import spark.implicits._ // provides the Encoders for the case classes

def getDataFromSource[T](table_name: String, encoder: Encoder[T]): Dataset[T] =
  spark.sql(s"SELECT * FROM $table_name").as(encoder)

val tableTypePairs = List(
  "table_a" -> implicitly[Encoder[classA]],
  "table_b" -> implicitly[Encoder[classB]]
)

tableTypePairs.foreach {
  case (table, enc) =>
    getDataFromSource(table, enc)
}
Note that this is a case of discarding a value, which is a bit of a code smell. Since Encoder is invariant, tableTypePairs isn't going to have that useful of a type, and neither would something like
tableTypePairs.map {
  case (table, enc) =>
    getDataFromSource(table, enc)
}
One option is to pass the Class to the method; this way the generic type T will be inferred:
def getDataFromSource[T: Encoder](table_name: String, clazz: Class[T]): Dataset[T] = {
  val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
  df.as[T]
}

tableTypePairs.foreach { case (tableName, clazz) => getDataFromSource(tableName, clazz) }
But then I'm not sure how you'll be able to exploit this list of Datasets without .asInstanceOf.
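If the set of tables is known at compile time, the simplest way to keep each Dataset usefully typed is to call the original context-bound version with an explicit type parameter per table, instead of iterating over a heterogeneous list. A minimal sketch, assuming spark is your SparkSession and getDataFromSource[T: Encoder] is the version from the question:

import org.apache.spark.sql.Dataset
import spark.implicits._ // supplies Encoder[classA], Encoder[classB], ...

val dsA: Dataset[classA] = getDataFromSource[classA]("table_a")
val dsB: Dataset[classB] = getDataFromSource[classB]("table_b")

You lose the loop over the list, but each value keeps its precise type, so no .asInstanceOf is needed later.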

How to aggregate data in Spark using Scala?

I have a data set test1.txt. It contains data like the below:
2::1::3
1::1::2
1::2::2
2::1::5
2::1::4
3::1::2
3::1::1
3::2::2
I have created a DataFrame using the below code.
case class Test(userId: Int, movieId: Int, rating: Float)
def pRating(str: String): Test = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Test(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
}
val ratings = spark.read.textFile("C:/Users/test/Desktop/test1.txt").map(pRating).toDF()
It contains the following rows:
2,1,3
1,1,2
1,2,2
2,1,5
2,1,4
3,1,2
3,1,1
3,2,2
But I want to print output like the below, i.e. with duplicate (userId, movieId) combinations removed and, instead of the individual field(2) values, their sum (e.g. 1,1,2.0):
1,1,2.0
1,2,2.0
2,1,12.0
3,1,3.0
3,2,2.0
Please help me with this: how can I achieve it?
To drop duplicates, use df.distinct. To aggregate, you first groupBy and then agg. Putting this all together:
import org.apache.spark.sql.functions.sum
import spark.implicits._ // Encoders plus the 'symbol column syntax

case class Rating(userId: Int, movieId: Int, rating: Float)

def pRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
}

val ratings = spark.read.textFile("C:/Users/test/Desktop/test1.txt").map(pRating)

val totals = ratings.distinct
  .groupBy('userId, 'movieId)
  .agg(sum('rating).cast("float").as("rating")) // sum returns a double; cast back to float so .as[Rating] works
  .as[Rating]
I am not sure you'd want the final result as a Dataset[Rating], or whether the distinct and sum logic is exactly what you want, as the example in the question is not very clear; but hopefully this will give you what you need.
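With the sample data above, totals.show() should print roughly the output you asked for (row order may differ):

totals.show()
// +------+-------+------+
// |userId|movieId|rating|
// +------+-------+------+
// |     1|      1|   2.0|
// |     1|      2|   2.0|
// |     2|      1|  12.0|
// |     3|      1|   3.0|
// |     3|      2|   2.0|
// +------+-------+------+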
Alternatively, a plain aggregation without the typed Dataset:
ratings.groupBy("userId", "movieId").sum("rating")

Implementing my HbaseConnector

I would like to implement an HBaseConnector.
I'm currently reading the guide, but there is a part that I don't understand and I can't find any information about it.
In part 2 of the guide we can see the following code:
case class HBaseRecord(col0: String, col1: Boolean, col2: Double, col3: Float, col4: Int, col5: Long, col6: Short, col7: String, col8: Byte)

object HBaseRecord {
  def apply(i: Int, t: String): HBaseRecord = {
    val s = s"""row${"%03d".format(i)}"""
    HBaseRecord(s, i % 2 == 0, i.toDouble, i.toFloat, i, i.toLong, i.toShort, s"String$i: $t", i.toByte)
  }
}

val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
I understand that they store the future columns in the HBaseRecord case class, but I don't understand the specific use of this line:
val s = s"""row${"%03d".format(i)}"""
Could someone care to explain?
It is used to generate row ids like row001, row002, etc., which will populate col0 of your table. Try it out in a simpler way with a function:
def generate(i: Int): String = s"""row${"%03d".format(i)}"""
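A quick check shows what the format string does: "%03d" zero-pads the integer to three digits, which gives exactly the row001, row002, ... ids described above.

generate(1)   // "row001"
generate(42)  // "row042"
generate(255) // "row255"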

How to pass join key as variable in Spark Data Frame using Scala

I am trying to keep the join keys of the two DataFrames in two variables, which I then want to pass into a join. Here my variable contains one key. Can I also pass more than one key?
Ex:
1st key:
scala> val primary_key_col = scd_table_keys_df.first().getString(2)
primary_key_col: String = acct_nbr
2nd key:
scala> val delta_primary_key_col = "delta_"+primary_key_col
delta_primary_key_col: String = delta_acct_nbr
My Python code, which is working:
cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(col(delta_primary_key_col) == col(primary_key_col)) ,'left_outer' ).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
I want to achieve the same in Scala. Please suggest.
I tried multiple ways:
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df({primary_key_col.mkstring(",")}) == hist_tgt_tbl_Y_df({primary_key_col.mkstring(",")}),"left_outer" )).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
:121: error: value mkstring is not a member of String
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df(delta_primary_key_col.map(c => col(c))) == hist_tgt_tbl_Y_df(primary_key_col.map(c => col(c))),"left_outer" ))
:123: error: type mismatch;
found : Array[org.apache.spark.sql.Column]
required: String
I am not able to solve this. Please suggest.
I was missing the variable substitution. This is working for me.
scala> val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df, delta_src_rename_df(s"$delta_primary_key_col") === hist_tgt_tbl_Y_df(s"$primary_key_col"), "left_outer")
cdc_new_acct_df: org.apache.spark.sql.DataFrame = [delta_acct_nbr: string, delta_primary_state: string, delta_zip_code: string, delta_load_tm: string, delta_load_date: string, hash_key_col: string, delta_hash_key: int, delta_eff_start_date: string, acct_nbr: string, account_sk_id: bigint, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, load_tm: string, hash_key: string, eff_flag: string]
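As for passing more than one key: one way (a sketch, not from the original post) is to keep the key names in a Seq, build one equality condition per key, and reduce them with &&. The key names below are taken from the schema above as an illustration; adjust them to your actual keys:

import org.apache.spark.sql.Column

val keyCols = Seq("acct_nbr", "primary_state") // illustrative choice of join keys

// One equality condition per key, combined with AND.
val joinCond: Column = keyCols
  .map(k => delta_src_rename_df(s"delta_$k") === hist_tgt_tbl_Y_df(k))
  .reduce(_ && _)

val cdc_new_acct_multi_df = delta_src_rename_df
  .join(hist_tgt_tbl_Y_df, joinCond, "left_outer")
  .where(hist_tgt_tbl_Y_df(keyCols.head).isNull)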

Modify a position in a (String, String) variable in Scala

I have a tuple, separated by a comma, that looks like this:
("TRN_KEY", "88.330000;1;2")
I would like to add some more info to the second position.
For example:
I would like to add ;99;99 to the 88.330000;1;2 so that at the end it would look like:
(TRN_KEY, 88.330000;1;2;99;99)
One way is to decompose your tuple and concatenate the additional string to the second element:
object MyObject {
  val (first, second) = ("TRN_KEY", "88.330000;1;2")
  (first, second + ";3;4")
}
Which yields:
res0: (String, String) = (TRN_KEY,88.330000;1;2;3;4)
Another way to go is to copy the tuple with the new value using Tuple2.copy, as tuples are immutable by design.
You can not modify the data in place as Tuple2 is immutable.
An option would be to have a var and then use the copy method.
In Scala, due to structural sharing, this is a rather cheap and fast operation.
scala> var tup = ("TRN_KEY","88.330000;1;2")
tup: (String, String) = (TRN_KEY,88.330000;1;2)
scala> tup = tup.copy(_2 = tup._2 + "data")
tup: (String, String) = (TRN_KEY,88.330000;1;2data)
Here is a simple function that gets the job done. It takes a tuple and appends a string to the second element of the tuple.
def appendTup(tup: (String, String))(append: String): (String, String) = {
  (tup._1, tup._2 + append)
}
Here is some code using it
val tup = ("TRN_KEY", "88.330000;1;2")
val tup2 = appendTup(tup)(";99;99")
println(tup2)
Here is my output
(TRN_KEY,88.330000;1;2;99;99)
If you really want to make it mutable you could use a case class such as:
case class newTup(col1: String, var col2: String)
val rec1 = newTup("TRN_KEY", "88.330000;1;2")
rec1.col2 = rec1.col2 + ";99;99"
rec1
res3: newTup = newTup(TRN_KEY,88.330000;1;2;99;99)
But, as mentioned above, it would be better to use .copy.