Implementing my HbaseConnector - scala

I would like to implement an HbaseConnector.
I'm currently reading the guide, but there is a part that I don't understand and I can't find any information about it.
In part 2 of the guide we can see the following code:
case class HBaseRecord(col0: String, col1: Boolean, col2: Double, col3: Float, col4: Int, col5: Long, col6: Short, col7: String, col8: Byte)

object HBaseRecord {
  def apply(i: Int, t: String): HBaseRecord = {
    val s = s"""row${"%03d".format(i)}"""
    HBaseRecord(s, i % 2 == 0, i.toDouble, i.toFloat, i, i.toLong, i.toShort, s"String$i: $t", i.toByte)
  }
}

val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
I understand that they store the future columns in the HBaseRecord case class, but I don't understand the specific use of this line:
val s = s"""row${"%03d".format(i)}"""
Could someone explain?

It is used to generate row ids like row001, row002, etc., which will populate col0 of your table. You can try out a simpler version as a standalone function:
def generate(i: Int): String = { s"""row${"%03d".format(i)}"""}
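For example (values chosen just for illustration), the "%03d" pattern left-pads the integer with zeros to three digits:
generate(0)   // "row000"
generate(42)  // "row042"
generate(255) // "row255"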

Related

Convert spark scala dataset of one type to another

I have a dataset with the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: String
)
I want to convert it to:
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int], // i.e. it is optional
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
Using a parser function:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
  unparsedAddress.map(address => {
    val split = address.address.split(", ")
    address.copy(
      number = Some(split(0).toInt),
      road = Some(split(1)),
      city = Some(split(2)),
      country = Some(split(3))
    )
  })
}
I am new to Scala and Spark. Could anyone please let me know how this can be done?
You were on the right path! There are multiple ways of doing this, of course. But since you're already on the way by making some case classes and you've started writing a parsing function, an elegant solution is to use the Dataset's map function. From the docs, the map function has the following signature:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Where T is the starting type (AddressRawData in your case) and U is the type you want to get to (AddressData in your case). So the input of this map function is a function that transforms an AddressRawData into an AddressData. That could perfectly be the addressParser you've started making!
Now, your current addressParser has the following signature:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData]
In order to be able to feed it to that map function, we need to change its signature to:
def newAddressParser(unparsedAddress: AddressRawData): AddressData
Knowing all of this, we can work further! An example would be the following:
import spark.implicits._
import scala.util.Try
// Your case classes
case class AddressRawData(addressId: String, customerId: String, address: String)
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
// Your addressParser function, adapted to be able to feed into the Dataset.map
// function
def addressParser(rawAddress: AddressRawData): AddressData = {
  val addressArray = rawAddress.address.split(", ")
  AddressData(
    rawAddress.addressId,
    rawAddress.customerId,
    rawAddress.address,
    Try(addressArray(0).toInt).toOption,
    Try(addressArray(1)).toOption,
    Try(addressArray(2)).toOption,
    Try(addressArray(3)).toOption
  )
}
// Creating a sample dataset
val rawDS = Seq(
  AddressRawData("1", "1", "20, my super road, beautifulCity, someCountry"),
  AddressRawData("1", "1", "badFormat, some road, cityButNoCountry")
).toDS
val parsedDS = rawDS.map(addressParser)
parsedDS.show
+---------+----------+--------------------+------+-------------+----------------+-----------+
|addressId|customerId| address|number| road| city| country|
+---------+----------+--------------------+------+-------------+----------------+-----------+
| 1| 1|20, my super road...| 20|my super road| beautifulCity|someCountry|
| 1| 1|badFormat, some r...| null| some road|cityButNoCountry| null|
+---------+----------+--------------------+------+-------------+----------------+-----------+
As you can see, thanks to the fact that you had already foreseen that parsing can go wrong, it was easy to use scala.util.Try to attempt to extract the pieces of that raw address and add some robustness (the second row contains some null values where the address string could not be parsed).
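For reference, this is how scala.util.Try behaves on the pieces that can fail here (illustrative values, not from the original post):
Try("20".toInt).toOption          // Some(20)
Try("badFormat".toInt).toOption   // None
Try(Array("a", "b")(3)).toOption  // None (the out-of-bounds access is caught)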
Hope this helps!

Technique to write multiple columns into a single function in Scala

Below are two methods written in Spark Scala where I am checking whether a column contains a string and then summing the number of occurrences (1 or 0). Is there a better way to write this as a single function, so we can avoid adding a new method each time a new condition is added? Thanks in advance.
def sumFunctDays1cols(columnName: String, dayid: String, processday: String, fieldString: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday)
    .and(lower(col(columnName)).contains(fieldString)), 1).otherwise(0)).alias(newColName)
}
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, fieldString1: String, fieldString2: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday)
    .and(lower(col(columnName)).contains(fieldString1) || lower(col(columnName)).contains(fieldString2)), 1).otherwise(0)).alias(newColName)
}
Below is where I am calling the function.
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "cust_count")
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "lenovo","prod_count")
You could do something like below (Not tested yet)
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, newColName: String, fields: Column*): Column = {
  sum(
    when(
      ('visit_start_time > dayid)
        .and('visit_start_time <= processday)
        .and(fields.map(lower(col(columnName)).contains(_)).reduce(_ || _)),
      1
    ).otherwise(0)).alias(newColName)
}
And you can use it as
sumFunctDays2cols(
  "columnName",
  "2019-01-01",
  "2019-01-10",
  "prod_count",
  lit("mac"), lit("lenovo")
)
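To illustrate where the resulting Column would typically be used, here is a minimal sketch of an aggregation call; the DataFrame df and its grouping column device_id are assumptions for illustration, not from the original post (lit comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.lit

df.groupBy($"device_id")
  .agg(
    sumFunctDays2cols("columnName", "2019-01-01", "2019-01-10", "cust_count", lit("mac")),
    sumFunctDays2cols("columnName", "2019-01-01", "2019-01-10", "prod_count", lit("mac"), lit("lenovo"))
  )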
Hope this helps!
Make the parameter to your function a collection: instead of String1, String2, ..., pass a list (or set) of strings.
I have implemented a small example for you:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // needed for .toDF on a local Seq

val df = Seq(
  (1, "mac"),
  (2, "lenovo"),
  (3, "hp"),
  (4, "dell")).toDF("id", "brand")
// dictionary Set of words to check
val dict = Set("mac","leno","noname")
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_) )}
df.withColumn("brand_check", checkerUdf($"brand")).show()
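With the sample data above, the result should look roughly like this:
+---+------+-----------+
| id| brand|brand_check|
+---+------+-----------+
|  1|   mac|       true|
|  2|lenovo|       true|
|  3|    hp|      false|
|  4|  dell|      false|
+---+------+-----------+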
I hope this solves your issue. But if you still need help, upload the entire code snippet and I will help you with it.

In Scala, is there a way to map over a collection while passing a value along fold-style?

In Scala, is there a way to map over a collection while passing a value along fold-style? Something like:
case class TxRecord(name: String, amount: Int)
case class TxSummary(name: String, amount: Int, balance: Int)
val txRecords: Seq[TxRecord] = txRecordService.getSortedTxRecordsOfUser("userId")
val txSummarys: Seq[TxSummary] = txRecords.foldMap(0)((sum, txRecord) =>
(sum + txRecord.amount, TxSummary(txRecord.name, txRecord.amount, sum + txRecord.amount)))
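The standard collections don't ship a foldMap with that exact shape, but as a minimal sketch of the same idea (assuming the case classes above; everything else here is illustrative, not from the original post), plain foldLeft can thread the running balance:
val txSummarys: Seq[TxSummary] =
  txRecords.foldLeft((0, Vector.empty[TxSummary])) { case ((sum, summaries), tx) =>
    // carry the updated balance forward while also emitting a summary row
    val newSum = sum + tx.amount
    (newSum, summaries :+ TxSummary(tx.name, tx.amount, newSum))
  }._2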

TypedPipe can't coerce strings to DateTime even when given implicit function

I've got a Scalding data flow that starts with a bunch of pipe-separated-value files. The first column is a DateTime in a slightly non-standard format. I want to use the strongly typed TypedPipe API, so I've specified a tuple type and a case class to contain the data:
type Input = (DateTime, String, Double, Double, String)
case class LatLonRecord(date : DateTime, msidn : String, lat : Double, lon : Double, cellname : String)
however, Scalding doesn't know how to coerce a String into a DateTime, so I tried adding an implicit function to do the dirty work:
implicit def stringToDateTime(dateStr: String): DateTime =
DateTime.parse(dateStr, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S"))
However, I still get a ClassCastException:
val lines: TypedPipe[Input] = TypedPipe.from(TypedPsv[Input](args("input")))
lines.map(x => x._1).dump
//cascading.flow.FlowException: local step failed at java.lang.Thread.run(Thread.java:745)
//Caused by: cascading.pipe.OperatorException: [FixedPathTypedDelimite...][com.twitter.scalding.RichPipe.eachTo(RichPipe.scala:509)] operator Each failed executing operation
//Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.joda.time.DateTime
What do I need to do to get Scalding to call my conversion function?
So I ended up doing this:
case class LatLonRecord(date : DateTime, msisdn : String, lat : Double, lon : Double, cellname : String)
object dateparser {
  val format = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S")
  def parse(s: String): DateTime = DateTime.parse(s, format)
}
//changed first column to a String, yuck
type Input = (String, String, Double, Double, String)
val lines: TypedPipe[Input] = TypedPipe.from( TypedPsv[Input]( args("input")) )
val recs = lines.map(v => LatLonRecord(dateparser.parse(v._1), v._2, v._3,v._4, v._5))
But I feel like it's a sub-optimal solution. I welcome better answers from people who have been using Scala for more than, say, 1 week, like me.

Scala immutable container class extended with mixins

I'd like a container class that I can extend with some number of traits to contain groups of default vals that can later be changed in an immutable way. The traits will hold certain simple pieces of data that go together so that creating the class with a couple of traits will create an object with several collections of default values.
Then I'd like to be able to modify any of the vals immutably by copying the object while changing one new value at a time.
The class might have something like the following:
class Defaults(val string: String = "string", val int: Int = 1)
Then other traits like this:
trait MoreDefaults {
  val long: Long = 1L
}
Then I'd like to mix them in when instantiating, to build the particular set of defaults needed:
var d = new Defaults with MoreDefaults
and later to something like:
if (someFlag) d = d.copy( long = 1412341234l )
You can do something like this with a single case class, but I run out of params at 22. And I'll have a bunch of groupings of defaults I'd like to mix in depending on the need, then allow changes to any of them (class-defined or trait-defined) in an immutable way.
I can stick a copy method in the Defaults class like this:
def copy(
    string: String = string,
    int: Int = int): Defaults = {
  new Defaults(string, int)
}
then do something like
var d = new Defaults
if (someFlag) d = d.copy(int = 234234)
Question ====> This works for values in the base class, but I can't figure out how to extend this to the mixin traits. Ideally d.copy would work on all vals defined by the class plus all mixed-in traits. Overloading is trouble too, since the vals are mainly Strings; but all of the val names will be unique in any mix of class and traits, or it is an error.
Using only classes, I can get some of this functionality by having a base Defaults class and then extending it with another class that has its own non-overloaded copyMoreDefaults function. This is really ugly and I hope a Scala expert will see it and have a good laugh before setting me straight--it does work though.
class Defaults(
    val string: String = "one",
    val boolean: Boolean = true,
    val int: Int = 1,
    val double: Double = 1.0d,
    val long: Long = 1l) {
  def copy(
      string: String = string,
      boolean: Boolean = boolean,
      int: Int = int,
      double: Double = double,
      long: Long = long): Defaults = {
    new Defaults(string, boolean, int, double, long)
  }
}

class MoreDefaults(
    string: String = "one",
    boolean: Boolean = true,
    int: Int = 1,
    double: Double = 1.0d,
    long: Long = 1l,
    val string2: String = "string2") extends Defaults(
    string,
    boolean,
    int,
    double,
    long) {
  def copyMoreDefaults(
      string: String = string,
      boolean: Boolean = boolean,
      int: Int = int,
      double: Double = double,
      long: Long = long,
      string2: String = string2): MoreDefaults = {
    new MoreDefaults(string, boolean, int, double, long, string2)
  }
}
Then the following works:
var d = new MoreDefaults
if (someFlag) d = d.copyMoreDefaults(string2 = "new string2")
This method will be a mess if Defaults gets its parameters changed! All the derived classes will have to be updated--ugh. There must be a better way.
I don't think I'm strictly speaking answering your question, rather suggesting an alternative solution. So you're having problems with large case classes, e.g.
case class Fred(a: Int = 1, b: Int = 2, ... too many params ... )
What I would do is organize the params into more case classes:
case class Bar(a: Int = 1, b: Int = 2)
case class Foo(c: Int = 99, d: Int = 200)
// etc
case class Fred(bar: Bar = Bar(), foo: Foo = Foo(), ... etc)
Then when you want to do a copy and change, say, one of the values of Foo, you do:
val myFred: Fred = Fred()
val fredCopy: Fred = myFred.copy(foo = myFred.foo.copy(d = 300))
and you need not even define the copy functions, you get them for free.