Add multiple columns with map logic without using UDF - scala

I want to parse the address column from the given table structure using addressParser function to get number, street, city and country.
Sample Input:
addressId | address
ADD001    | "384, East Avenue Street, New York, USA"
ADD002    | "123, Maccolm Street, Copenhagen, Denmark"
The sample code is attached for reference:
import org.apache.spark.sql.{Dataset, SparkSession}

object ParseAddress extends App {
  val spark = SparkSession.builder().master("local[*]").appName("ParseAddress").getOrCreate()

  import spark.implicits._

  case class AddressRawData(addressId: String, address: String)

  case class AddressData(
    addressId: String,
    address: String,
    number: Option[Int],
    road: Option[String],
    city: Option[String],
    country: Option[String]
  )

  def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
    unparsedAddress.map(address => {
      val split = address.address.split(", ")
      address.copy(
        number = Some(split(0).toInt),
        road = Some(split(1)),
        city = Some(split(2)),
        country = Some(split(3))
      )
    })
  }

  val addressDS: Dataset[AddressRawData] = addressDF.as[AddressRawData]
}
Expected output:
addressId | address                                    | number | road               | city       | country
ADD001    | "384, East Avenue Street, New York, USA"   | 384    | East Avenue Street | New York   | USA
ADD002    | "123, Maccolm Street, Copenhagen, Denmark" | 123    | Maccolm Street     | Copenhagen | Denmark
I am not sure how I should pass addressDS as an input to the function to parse the column data. Any help with solving this problem is very much appreciated.

I think one important thing is that it's better (in your scenario) to design your function to take a single input and return a single output. If you have a collection or a Dataset of rows, you can then map each row through this function; that's what all these higher-order functions (map, flatMap, fold, ...) are made for, right? So you can implement a method which receives an AddressRawData and returns an AddressData:
def singleAddressParser(unparsedAddress: AddressRawData): AddressData = {
  val split = unparsedAddress.address.split(", ")
  AddressData(
    addressId = unparsedAddress.addressId,
    address = unparsedAddress.address,
    number = Some(split(0).toInt),
    road = Some(split(1)),
    city = Some(split(2)),
    country = Some(split(3))
  )
}
And then map each raw data to this function:
import org.apache.spark.sql.Dataset
val addressDS: Dataset[AddressData] =
  addressDF.as[AddressRawData].map(singleAddressParser)
And this is the output, as expected:
scala> addressDS.show(false)
+---------+-----------------------------------------+------+------------------+----------+-------+
|addressId|address |number|road |city |country|
+---------+-----------------------------------------+------+------------------+----------+-------+
|ADD001 |384, East Avenue Street, New York, USA |384 |East Avenue Street|New York |USA |
|ADD002 |123, Malccolm Street, Copenhagen, Denmark|123 |Malccolm Street |Copenhagen|Denmark|
+---------+-----------------------------------------+------+------------------+----------+-------+

Related

Convert spark scala dataset of one type to another

I have a dataset with the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: String
)
I want to convert it to:
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int], // i.e. it is optional
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
Using a parser function:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
  unparsedAddress.map(address => {
    val split = address.address.split(", ")
    address.copy(
      number = Some(split(0).toInt),
      road = Some(split(1)),
      city = Some(split(2)),
      country = Some(split(3))
    )
  })
}
I am new to Scala and Spark. Could anyone please let me know how this can be done?
You were on the right path! There are multiple ways of doing this, of course. But since you're already on your way, having made some case classes and started writing a parsing function, an elegant solution is to use the Dataset's map function. From the docs, the map function has the following signature:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Here T is the starting type (AddressRawData in your case) and U is the type you want to get to (AddressData in your case). So the input of this map function is a function that transforms an AddressRawData into an AddressData. That could perfectly well be the addressParser you've started writing!
Now, your current addressParser has the following signature:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData]
In order to be able to feed it into that map function, we need to give it this signature:
def newAddressParser(unparsedAddress: AddressRawData): AddressData
Knowing all of this, we can work further! An example would be the following:
import spark.implicits._
import scala.util.Try

// Your case classes
case class AddressRawData(addressId: String, customerId: String, address: String)

case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Your addressParser function, adapted to be able to feed into the Dataset.map function
def addressParser(rawAddress: AddressRawData): AddressData = {
  val addressArray = rawAddress.address.split(", ")
  AddressData(
    rawAddress.addressId,
    rawAddress.customerId,
    rawAddress.address,
    Try(addressArray(0).toInt).toOption,
    Try(addressArray(1)).toOption,
    Try(addressArray(2)).toOption,
    Try(addressArray(3)).toOption
  )
}

// Creating a sample dataset
val rawDS = Seq(
  AddressRawData("1", "1", "20, my super road, beautifulCity, someCountry"),
  AddressRawData("1", "1", "badFormat, some road, cityButNoCountry")
).toDS

val parsedDS = rawDS.map(addressParser)
parsedDS.show
+---------+----------+--------------------+------+-------------+----------------+-----------+
|addressId|customerId| address|number| road| city| country|
+---------+----------+--------------------+------+-------------+----------------+-----------+
| 1| 1|20, my super road...| 20|my super road| beautifulCity|someCountry|
| 1| 1|badFormat, some r...| null| some road|cityButNoCountry| null|
+---------+----------+--------------------+------+-------------+----------------+-----------+
As you can see, thanks to the fact that you had already foreseen that parsing can go wrong, it was easy to use scala.util.Try to attempt to extract the pieces of that raw address and add some robustness (the second row contains null values where the address string could not be parsed).
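If you are not familiar with it, here is a minimal sketch (with made-up values) of how Try(...).toOption turns failures into None, which is exactly what produces those nulls in the show output:

import scala.util.Try

// A well-formed number parses to Some(...); a malformed one becomes None instead of throwing
val ok: Option[Int]  = Try("20".toInt).toOption        // Some(20)
val bad: Option[Int] = Try("badFormat".toInt).toOption // None

// Out-of-bounds access is caught the same way, which is what covers the missing country field
val fields = "badFormat, some road, cityButNoCountry".split(", ")
val country: Option[String] = Try(fields(3)).toOption  // None, since there are only 3 fields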
Hope this helps!

Parse a log file with scala

I am trying to parse a text file. My input file looks like this:
ID: 12343-7888
Name: Mary, Bob, Jason, Jeff, Suzy
Harry, Steve
Larry, George
City: New York, Portland, Dallas, Kansas City
Tampa, Bend
Expected output would be:
“12343-7888”
“Mary, Bob, Jason, Jeff, Suzy, Harry, Steve, Larry, George”
“New York, Portland, Dallas, Kansas City, Tampa, Bend”
Note the “Name” and "City" fields have newlines or returns in them. I have the code below, but it is not working. The second line of code places each character on its own line. Plus, I am having trouble grabbing only the data from the field, i.e. returning just the actual names, so that the "Name: " label is not part of the results. Also, I am looking to put quotes around each returned field.
Can you help fix up my problems?
val lines = Source.fromFile("/filesdata/logfile.text").getLines().toList
val record = lines.dropWhile(line => !line.startsWith("Name: ")).takeWhile(line => !line.startsWith("Address: ")).flatMap(_.split(",")).map(_.trim()).filter(_.nonEmpty).mkString(", ")
val finalResults = record.map(s => "\"" + s + "\"").mkString(",\n")
How can I get the results that I am looking for?
SHORT ANSWER
A two-liner that produces a string that looks exactly as you specified:
println(lines.map{line => if(line.trim.matches("[a-zA-Z]+:.*"))
("\"\n\"" + line.split(":")(1).trim) else (", " + line.trim)}.mkString.drop(2) + "\"")
LONG ANSWER
Why try to solve something in one line, if you can achieve the same thing in 94?
(That's the exact opposite of the usual slogan when working with Scala collections, but the input was sufficiently messy that I found it worthwhile to actually write out some of the intermediate steps. Maybe that's just because I've bought a nice new keyboard recently...)
val input = """ID: 12343-7888
Name: Mary, Bob, Jason, Jeff, Suzy
Harry, Steve
Larry, George
City: New York, Portland, Dallas, Kansas City
Tampa, Bend
ID: 567865-676
Name: Alex, Bob
Chris, Dave
Evan, Frank
Gary
City: Los Angeles, St. Petersburg
Washington D.C., Phoenix
"""
case class Entry(id: String, names: List[String], cities: List[String])

def parseMessyInput(input: String): List[Entry] = {
  // just a first rough approximation of the structure of the input
  sealed trait MessyInputLine { def content: String }
  case class IdLine(content: String) extends MessyInputLine
  case class NameLine(content: String) extends MessyInputLine
  case class UnlabeledLine(content: String) extends MessyInputLine
  case class CityLine(content: String) extends MessyInputLine

  val lines = input.split("\n").toList

  // a helper function for checking whether a line starts with a label
  def tryParseLabeledLine
    (label: String, line: String)
    (cons: String => MessyInputLine)
  : Option[MessyInputLine] = {
    if (line.startsWith(label + ":")) {
      Some(cons(line.drop(label.size + 1)))
    } else {
      None
    }
  }

  val messyLines: List[MessyInputLine] = for (line <- lines) yield {
    (
      tryParseLabeledLine("Name", line){NameLine(_)} orElse
      tryParseLabeledLine("City", line){CityLine(_)} orElse
      tryParseLabeledLine("ID", line){IdLine(_)}
    ).getOrElse(UnlabeledLine(line))
  }

  /** Combines the content of the first line with the content
    * of all unlabeled lines, until the next labeled line or
    * the end of the list is hit. Returns the content of
    * the first few lines and the list of the remaining lines.
    */
  def readUntilNextLabel(messyLines: List[MessyInputLine])
  : (List[String], List[MessyInputLine]) = {
    messyLines match {
      case Nil => (Nil, Nil)
      case h :: t => {
        val (unlabeled, rest) = t.span {
          case UnlabeledLine(_) => true
          case _ => false
        }
        (h.content :: unlabeled.map(_.content), rest)
      }
    }
  }

  /** Glues multiple lines to entries */
  def combineToEntries(messyLines: List[MessyInputLine]): List[Entry] = {
    if (messyLines.isEmpty) Nil
    else {
      val (idContent, namesCitiesRest) = readUntilNextLabel(messyLines)
      val (namesContent, citiesRest) = readUntilNextLabel(namesCitiesRest)
      val (citiesContent, rest) = readUntilNextLabel(citiesRest)
      val id = idContent.head.trim
      val names = namesContent.map(_.split(",").map(_.trim).toList).flatten
      val cities = citiesContent.map(_.split(",").map(_.trim).toList).flatten
      Entry(id, names, cities) :: combineToEntries(rest)
    }
  }

  // invoke recursive function on the entire input
  combineToEntries(messyLines)
}

// how to use
val entries = parseMessyInput(input)

// output
for (Entry(id, names, cities) <- entries) {
  println(id)
  println(names.mkString(", "))
  println(cities.mkString(", "))
}
Output:
12343-7888
Mary, Bob, Jason, Jeff, Suzy, Harry, Steve, Larry, George
New York, Portland, Dallas, Kansas City, Tampa, Bend
567865-676
Alex, Bob, Chris, Dave, Evan, Frank, Gary
Los Angeles, St. Petersburg, Washington D.C., Phoenix
You probably could write it down in one line, sooner or later. But if you write dumb code consisting of many simple intermediate steps, you don't have to think that hard, and there are no obstacles large enough to get stuck on.

Scala specs2 matchers with "aka"

I would like to check if a string contains another while providing a label using "aka". For instance:
"31 west 23rd street, NY" aka "address" must contain("11065")
This fails with
address '31 west 23rd street, NY' doesn't contain '11065'.
However, I would like to specify that 11065 is a zip code. I.e.:
"31 west 23rd street, NY" aka "address" must contain("11065") aka "zip code"
Which doesn't work.
Any idea how to achieve that?
The required result I expect is:
address '31 west 23rd street, NY' doesn't contain zip code '11065'.
Below is a possible solution, but I dislike it since it's not specs2-native and only supports strings:
def contain(needle: String, aka: String) = new Matcher[String] {
  def apply[S <: String](b: Expectable[S]) = {
    result(needle != null && b.value != null && b.value.contains(needle),
      s"${b.description} contains $aka '$needle'",
      s"${b.description} doesn't contain $aka '$needle'", b)
  }
}
I don't think there is a solution applicable to all matchers. In this case you can reuse the aka machinery:
def contain(expected: Expectable[String]): Matcher[String] = new Matcher[String] {
  def apply[S <: String](e: Expectable[S]): MatchResult[S] =
    result(e.value.contains(expected.value),
      s"${e.value} contains ${expected.description} ${expected.value}",
      s"${e.value} does not contain ${expected.description}",
      e)
}
"31 west 23rd street, NY" aka "address" must contain("11065" aka "the zip code")
This displays
31 west 23rd street, NY does not contain the zip code '11065'
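For context, here is a minimal sketch of how the custom matcher and the expectation could sit together inside a mutable specification (the class name AddressSpec and the example descriptions are made up for illustration):

import org.specs2.matcher.{Expectable, MatchResult, Matcher}
import org.specs2.mutable.Specification

// Hypothetical spec class wrapping the matcher from the answer above
class AddressSpec extends Specification {

  def contain(expected: Expectable[String]): Matcher[String] = new Matcher[String] {
    def apply[S <: String](e: Expectable[S]): MatchResult[S] =
      result(e.value.contains(expected.value),
        s"${e.value} contains ${expected.description} ${expected.value}",
        s"${e.value} does not contain ${expected.description}",
        e)
  }

  "the address" should {
    "contain the zip code" in {
      "31 west 23rd street, NY" aka "address" must contain("11065" aka "the zip code")
    }
  }
}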

Scala populate a case class for each list item

I have a CSV file of countries and a CountryData case class.
Example data from file:
Denmark, Europe, 1.23, 7.89
Australia, Australia, 8.88, 9.99
Brazil, South America, 7.77,3.33
case class CountryData(country: String, region: String, population: Double, economy: Double)
I can read in the file and split it, etc., to get:
List(Denmark, Europe, 1.23, 7.89)
List(Australia, Australia, 8.88, 9.99)
List(Brazil, South America, 7.77, 3.33)
How can I now populate a CountryData case class for each list item?
I've tried:
for (line <- Source.getLines.drop(1)) {
  val splitInput = line.split(",", -1).map(_.trim).toList
  val country = splitInput(0)
  val region = splitInput(1)
  val population = splitInput(2)
  val economy = splitInput(3)
  val dataList: List[CountryData] = List(CountryData(country, region, population, economy))
}
But that doesn't work, because it's not reading the val; it sees it as the string 'country' or 'region'.
It is not clear where exactly your issue is. Is it about Double vs String, or about the List being inside the loop? Still, something like this will probably work:
import scala.io.Source

case class CountryData(country: String, region: String, population: Double, economy: Double)

object CountryDataMain extends App {
  val src = "\nDenmark, Europe, 1.23, 7.89\nAustralia, Australia, 8.88, 9.99\nBrazil, South America, 7.77,3.33"

  val list = Source.fromString(src).getLines.drop(1).map(line => {
    val splitInput = line.split(",", -1).map(_.trim).toList
    val country = splitInput(0)
    val region = splitInput(1)
    val population = splitInput(2)
    val economy = splitInput(3)
    CountryData(country, region, population.toDouble, economy.toDouble)
  }).toList

  println(list)
}
I would use Scala pattern matching, i.e.:
import scala.util.Try

// Note: this assumes a variant of CountryData where population and economy are Option[Double]
def doubleOrNone(str: String): Option[Double] =
  Try(str.toDouble).toOption

def parseCountryLine(line: String): Option[CountryData] = {
  line.split(",").map(_.trim).toVector match {
    case Vector(countryName, region, population, economy) =>
      // I would suggest having them as options because you may not have data for all countries
      Some(CountryData(countryName, region, doubleOrNone(population), doubleOrNone(economy)))
    case s =>
      println(s"Error parsing line:\n$s")
      None
  }
}
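A short usage sketch, under the assumption above (Option-typed fields) and with a made-up file path, showing how flatMap over the lines keeps only the successfully parsed rows:

import scala.io.Source

// Placeholder path; reuse whatever Source you already have
val lines = Source.fromFile("/data/countries.csv").getLines().toList

// flatMap drops the Nones (unparseable lines) and keeps the parsed CountryData values
val countries: List[CountryData] = lines.flatMap(parseCountryLine)

println(countries)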

For a Scala map of key to a tuple, how to select (project?) a specific item from the tuple?

Say I have a map with multi-attribute values, of which I want to select a specific attribute.
For instance, a map representing a table of people with name, gender, age, description.
In SQL, I would write "select age from people where name='whomever'"
How would I get that effect in Scala?
val people = Map(
  "Walter White" -> ("male", 52, "bad boy"),
  "Skyler White" -> ("female", 42, "morally challenged mom")
)
// equivalent of select * from people. This works.
for ((name,(gender,age,desc)) <- people) println(s"$name is a $age year old $gender and is a $desc")
// what should be the syntax to get "the age of Walter White is 52"?
// in SQL, it would be "'The age of Walter White is ' || (select age from people where name='Walter White')"
// what would it be in Scala?
println("The age of Walter White is " + people("Walter White")(1)) // not this!
You could create a Person case class and create a new map with Persons instead of tuples.
case class Person(name: String, gender: String, age: Int, description: String)
val persons = people map { case (name, (gender, age, descr)) =>
  name -> Person(name, gender, age, descr)
}
This way you could write:
persons("Walter White").age // Int = 52
persons.get("Skyler White").map(_.age) // Option[Int] = Some(42)
You could also access the age using your original people map:
people("Walter White")._2 // Int = 52