I would like to check if a string contains another while providing a label using "aka". For instance:
"31 west 23rd street, NY" aka "address" must contain("11065")
This fails with
address '31 west 23rd street, NY' doesn't contain '11065'.
However, I would like to specify that 11065 is a zip code, i.e.:
"31 west 23rd street, NY" aka "address" must contain("11065") aka "zip code"
Which doesn't work.
Any idea how to achieve that?
The required result I expect is:
address '31 west 23rd street, NY' doesn't contain zip code '11065'.
Below is a possible solution, but I dislike it since it is not native to specs2 and only supports strings:
def contain(needle: String, aka: String) = new Matcher[String] {
def apply[S <: String](b: Expectable[S]) = {
result(needle != null && b.value != null && b.value.contains(needle),
s"${b.description} contains $aka '$needle'",
s"${b.description} doesn't contain $aka '$needle'", b)
}
}
I don't think there is a solution applicable to all matchers. In this case you can reuse the aka machinery:
def contain(expected: Expectable[String]): Matcher[String] = new Matcher[String] {
def apply[S <: String](e: Expectable[S]): MatchResult[S] =
result(e.value.contains(expected.value),
s" ${e.value} contains ${expected.description}",
s" ${e.value} does not contain ${expected.description}",
e)
}
"31 west 23rd street, NY" aka "address" must contain("11065" aka "the zip code")
This displays
31 west 23rd street, NY does not contain the zip code '11065'
I want to parse the address column of the table below with the addressParser function to extract the number, street, city and country.
Sample Input:
addressId | address
ADD001 | "384, East Avenue Street, New York, USA"
ADD002 | "123, Maccolm Street, Copenhagen, Denmark"
The sample code is attached for reference:
import org.apache.spark.sql.{Dataset, SparkSession}

object ParseAddress extends App {
val spark = SparkSession.builder().master("local[*]").appName("ParseAddress ").getOrCreate()
import spark.implicits._
case class AddressRawData(addressId: String, address: String)
case class AddressData(
addressId: String,
address: String,
number: Option[Int],
road: Option[String],
city: Option[String],
country: Option[String]
)
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
unparsedAddress.map(address => {
val split = address.address.split(", ")
address.copy(
number = Some(split(0).toInt),
road = Some(split(1)),
city = Some(split(2)),
country = Some(split(3))
)
}
)
}
val addressDS: Dataset[AddressRawData] = addressDF.as[AddressRawData]
}
Expected output:
addressId | address | number | road | city | country
ADD001 | "384, East Avenue Street, New York, USA" | 384 | East Avenue Street | New York | USA
ADD002 | "123, Maccolm Street, Copenhagen, Denmark" | 123 | Maccolm Street | Copenhagen | Denmark
I am not sure how to pass addressDS as input to the function so that it parses the column data. Any help with this problem is much appreciated.
I think the important point is that it is better to design your function to take a single input and return a single output (in your scenario). If you have a collection or a Dataset of rows, you can map each row through this function; that is what higher-order functions such as map, flatMap and fold are for. So you can implement a method that receives an AddressRawData and returns an AddressData:
def singleAddressParser(unparsedAddress: AddressRawData): AddressData = {
val split = unparsedAddress.address.split(", ")
AddressData(
addressId = unparsedAddress.addressId,
address = unparsedAddress.address,
number = Some(split(0).toInt),
road = Some(split(1)),
city = Some(split(2)),
country = Some(split(3))
)
}
And then map each raw data to this function:
import org.apache.spark.sql.Dataset
val addressDS: Dataset[AddressData] =
addressDF.as[AddressRawData].map(singleAddressParser)
And this is the output, as expected:
scala> addressDS.show(false)
+---------+-----------------------------------------+------+------------------+----------+-------+
|addressId|address |number|road |city |country|
+---------+-----------------------------------------+------+------------------+----------+-------+
|ADD001 |384, East Avenue Street, New York, USA |384 |East Avenue Street|New York |USA |
|ADD002 |123, Malccolm Street, Copenhagen, Denmark|123 |Malccolm Street |Copenhagen|Denmark|
+---------+-----------------------------------------+------+------------------+----------+-------+
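Note that singleAddressParser throws if an address has fewer than four parts or a non-numeric house number. A minimal sketch of a more forgiving variant follows, shown on a plain Seq so that it runs without Spark. The name safeAddressParser and the sample rows are assumptions made for illustration; they are not part of the original answer.

```scala
import scala.util.Try

case class AddressRawData(addressId: String, address: String)

case class AddressData(
  addressId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Hypothetical variant of singleAddressParser: missing or malformed parts
// become None instead of raising an exception.
def safeAddressParser(raw: AddressRawData): AddressData = {
  val parts = raw.address.split(",").map(_.trim).toVector
  AddressData(
    addressId = raw.addressId,
    address = raw.address,
    number = parts.lift(0).flatMap(s => Try(s.toInt).toOption),
    road = parts.lift(1),
    city = parts.lift(2),
    country = parts.lift(3)
  )
}

// Illustrative input: one well-formed row and one malformed row.
val rawAddresses = Seq(
  AddressRawData("ADD001", "384, East Avenue Street, New York, USA"),
  AddressRawData("ADD002", "Unknown")
)

val parsedAddresses = rawAddresses.map(safeAddressParser)
```

Since the function still maps one AddressRawData to one AddressData, it should be usable with Dataset.map in the same way as singleAddressParser, assuming the usual spark.implicits._ encoders are in scope.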
I am trying to parse a text file. My input file looks like this:
ID: 12343-7888
Name: Mary, Bob, Jason, Jeff, Suzy
Harry, Steve
Larry, George
City: New York, Portland, Dallas, Kansas City
Tampa, Bend
Expected output would be:
"12343-7888"
"Mary, Bob, Jason, Jeff, Suzy, Harry, Steve, Larry, George"
"New York, Portland, Dallas, Kansas City, Tampa, Bend"
Note that the "Name" and "City" fields contain line breaks. I have the code below, but it is not working: the second line of code places each character on its own line. I am also having trouble extracting only the field data, e.g. returning just the names without the "Name: " prefix. Finally, I would like to put quotes around each returned field.
Can you help fix up my problems?
val lines = Source.fromFile("/filesdata/logfile.text").getLines().toList
val record = lines.dropWhile(line => !line.startsWith("Name: ")).takeWhile(line => !line.startsWith("Address: ")).flatMap(_.split(",")).map(_.trim()).filter(_.nonEmpty).mkString(", ")
val finalResults = record.map(s => "\"" + s + "\"").mkString(",\n")
How can I get my results that I am looking for?
SHORT ANSWER
A two-liner that produces a string that looks exactly as you specified:
println(lines.map{line => if(line.trim.matches("[a-zA-Z]+:.*"))
("\"\n\"" + line.split(":")(1).trim) else (", " + line.trim)}.mkString.drop(2) + "\"")
LONG ANSWER
Why try to solve something in one line, if you can achieve the same thing in 94?
(That's the exact opposite of the usual slogan when working with Scala collections, but the input was sufficiently messy that I found it worthwhile to actually write out some of the intermediate steps. Maybe that's just because I've bought a nice new keyboard recently...)
val input = """ID: 12343-7888
Name: Mary, Bob, Jason, Jeff, Suzy
Harry, Steve
Larry, George
City: New York, Portland, Dallas, Kansas City
Tampa, Bend
ID: 567865-676
Name: Alex, Bob
Chris, Dave
Evan, Frank
Gary
City: Los Angeles, St. Petersburg
Washington D.C., Phoenix
"""
case class Entry(id: String, names: List[String], cities: List[String])
def parseMessyInput(input: String): List[Entry] = {
// just a first rough approximation of the structure of the input
sealed trait MessyInputLine { def content: String }
case class IdLine(content: String) extends MessyInputLine
case class NameLine(content: String) extends MessyInputLine
case class UnlabeledLine(content: String) extends MessyInputLine
case class CityLine(content: String) extends MessyInputLine
val lines = input.split("\n").toList
// a helper function for checking whether a line starts with a label
def tryParseLabeledLine
(label: String, line: String)
(cons: String => MessyInputLine)
: Option[MessyInputLine] = {
if (line.startsWith(label + ":")) {
Some(cons(line.drop(label.size + 1)))
} else {
None
}
}
val messyLines: List[MessyInputLine] = for (line <- lines) yield {
(
tryParseLabeledLine("Name", line){NameLine(_)} orElse
tryParseLabeledLine("City", line){CityLine(_)} orElse
tryParseLabeledLine("ID", line){IdLine(_)}
).getOrElse(UnlabeledLine(line))
}
/** Combines the content of the first line with the content
* of all unlabeled lines, until the next labeled line or
* the end of the list is hit. Returns the content of
* the first few lines and the list of the remaining lines.
*/
def readUntilNextLabel(messyLines: List[MessyInputLine])
: (List[String], List[MessyInputLine]) = {
messyLines match {
case Nil => (Nil, Nil)
case h :: t => {
val (unlabeled, rest) = t.span {
case UnlabeledLine(_) => true
case _ => false
}
(h.content :: unlabeled.map(_.content), rest)
}
}
}
/** Glues multiple lines to entries */
def combineToEntries(messyLines: List[MessyInputLine]): List[Entry] = {
if (messyLines.isEmpty) Nil
else {
val (idContent, namesCitiesRest) = readUntilNextLabel(messyLines)
val (namesContent, citiesRest) = readUntilNextLabel(namesCitiesRest)
val (citiesContent, rest) = readUntilNextLabel(citiesRest)
val id = idContent.head.trim
val names = namesContent.map(_.split(",").map(_.trim).toList).flatten
val cities = citiesContent.map(_.split(",").map(_.trim).toList).flatten
Entry(id, names, cities) :: combineToEntries(rest)
}
}
// invoke recursive function on the entire input
combineToEntries(messyLines)
}
// how to use
val entries = parseMessyInput(input)
// output
for (Entry(id, names, cities) <- entries) {
println(id)
println(names.mkString(", "))
println(cities.mkString(", "))
}
Output:
12343-7888
Mary, Bob, Jason, Jeff, Suzy, Harry, Steve, Larry, George
New York, Portland, Dallas, Kansas City, Tampa, Bend
567865-676
Alex, Bob, Chris, Dave, Evan, Frank, Gary
Los Angeles, St. Petersburg, Washington D.C., Phoenix
You could probably write it down in one line, sooner or later. But if you write dumb code consisting of many simple intermediate steps, you don't have to think that hard, and there are no obstacles large enough to get stuck on.
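For comparison, here is a middle ground between the two-liner and the long version: a single foldLeft that attaches every unlabeled line to the most recent labeled field. This is only a sketch. The helper name parseFields is made up for illustration, and it assumes that a label consists of letters only and that values never contain a "Label:" prefix of their own.

```scala
// Hypothetical helper: groups lines into (label, values) fields, treating any
// line without a "Label:" prefix as a continuation of the previous field.
def parseFields(lines: List[String]): List[(String, List[String])] = {
  val Labeled = """^\s*([A-Za-z]+):(.*)$""".r

  // Splits a comma-separated chunk into trimmed, non-empty values.
  def values(s: String): List[String] =
    s.split(",").map(_.trim).filter(_.nonEmpty).toList

  lines
    .filter(_.trim.nonEmpty)
    .foldLeft(List.empty[(String, List[String])]) {
      // A labeled line starts a new field.
      case (acc, Labeled(label, rest)) => (label, values(rest)) :: acc
      // An unlabeled line extends the field that was started last.
      case ((label, vs) :: tail, line) => (label, vs ++ values(line)) :: tail
      // An unlabeled line before any label is ignored.
      case (Nil, _) => Nil
    }
    .reverse
}

val sampleLines = List(
  "ID: 12343-7888",
  "Name: Mary, Bob, Jason, Jeff, Suzy",
  "  Harry, Steve",
  "  Larry, George",
  "City: New York, Portland, Dallas, Kansas City",
  "  Tampa, Bend"
)

// One quoted string per field, as requested in the question.
val quotedFields = parseFields(sampleLines).map { case (_, vs) =>
  "\"" + vs.mkString(", ") + "\""
}
```

The accumulator is built in reverse (newest field first) so that the current field is always at the head of the list, which keeps each step cheap; the final reverse restores the input order.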
I have a CSV file of countries and a CountryData case class.
Example data from file:
Denmark, Europe, 1.23, 7.89
Australia, Australia, 8.88, 9.99
Brazil, South America, 7.77,3.33
case class CountryData(country: String, region: String, population: Double, economy: Double)
I can read in the file, split each line, etc. to get:
List(Denmark, Europe, 1.23, 7.89)
List(Australia, Australia, 8.88, 9.99)
List(Brazil, South America, 7.77, 3.33)
How can I now populate a CountryData case class for each list item?
I've tried:
for (line <- Source.getLines.drop(1)) {
val splitInput = line.split(",", -1).map(_.trim).toList
val country = splitInput(0)
val region = splitInput(1)
val population = splitInput(2)
val economy = splitInput(3)
val dataList: List[CountryData]=List(CountryData(country,region,population,economy))
}
But that doesn't work because it's not reading the val, it sees it as a string 'country' or 'region'.
It is not clear where exactly your issue is: is it about Double vs String, or about the List being created inside the loop? Either way, something like this will probably work:
import scala.io.Source

case class CountryData(country: String, region: String, population: Double, economy: Double)
object CountryDataMain extends App {
val src = "\nDenmark, Europe, 1.23, 7.89\nAustralia, Australia, 8.88, 9.99\nBrazil, South America, 7.77,3.33"
val list = Source.fromString(src).getLines().drop(1).map(line => {
val splitInput = line.split(",", -1).map(_.trim).toList
val country = splitInput(0)
val region = splitInput(1)
val population = splitInput(2)
val economy = splitInput(3)
CountryData(country, region, population.toDouble, economy.toDouble)
}).toList
println(list)
}
I would use Scala pattern matching, e.g.:
import scala.util.Try

// Assumes the numeric fields of CountryData are changed to Option[Double],
// because you may not have data for all countries.
case class CountryData(country: String, region: String, population: Option[Double], economy: Option[Double])

// Returns None instead of throwing when the string is not a valid number.
def doubleOrNone(str: String): Option[Double] =
  Try(str.trim.toDouble).toOption

def parseCountryLine(line: String): Option[CountryData] =
  line.split(",").map(_.trim).toVector match {
    case Vector(countryName, region, population, economy) =>
      Some(CountryData(countryName, region, doubleOrNone(population), doubleOrNone(economy)))
    case _ =>
      println(s"Error parsing line:\n$line")
      None
  }
I have a code like this
val pop = sc.textFile("population.csv")
.filter(line => !line.startsWith("Country"))
.map(line => line.split(","))
.map { case Array(CountryName, CountryCode, Year, Value) => (CountryName, CountryCode, Year, Value) }
The file looks like this.
Country Name,Country Code,Year,Value
Arab World,ARB,1960,93485943
Arab World,ARB,1961,96058179
Arab World,ARB,1962,98728995
Arab World,ARB,1963,101496308
Arab World,ARB,1964,104359772
Arab World,ARB,1965,107318159
Arab World,ARB,1966,110379639
Arab World,ARB,1967,113543760
Arab World,ARB,1968,116787194
Up until the .map { case ... } step, I can print the result with pop.take(10),
and I get Array[Array[String]].
But once the case is added, I get
error: not found: value (all columns)
where "all columns" means four separate errors, one each for CountryName, CountryCode, Year and Value.
I am not sure what I am doing wrong.
The data is clean.
You need to use lowercase variable names in pattern matching. I.e:
.map { case Array(countryName, countryCode, year, value) => (countryName, countryCode, year, value) }
In Scala's pattern matching, identifiers that start with an uppercase letter, as well as identifiers enclosed in backticks (`), are looked up in the enclosing scope and matched as constants instead of binding new variables. Here is an example to illustrate this:
Array("a") match {
case Array(a) => a
}
will match an array containing any single string, while:
val A = "a"
Array("a") match {
case Array(A) =>
}
will match only literal "a". Or, equivalent:
val a = "a"
Array("a") match {
case Array(`a`) =>
}
will also match only literal "a".
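Applied to the data from the question, the lowercase pattern could be sketched as follows, using a plain List in place of the RDD so that it runs without Spark. The "malformed,row" line, the use of collect instead of map, and the numeric conversions are additions for illustration: collect keeps only the rows that match the four-element pattern, so a short row is skipped instead of raising a MatchError.

```scala
val csvRows = List(
  "Country Name,Country Code,Year,Value",
  "Arab World,ARB,1960,93485943",
  "Arab World,ARB,1961,96058179",
  "malformed,row" // illustrative bad row, not in the original file
)

// Lowercase names in the pattern bind fresh variables to the array elements.
val population: List[(String, String, Int, Long)] =
  csvRows
    .filter(line => !line.startsWith("Country"))
    .map(line => line.split(","))
    .collect { case Array(countryName, countryCode, year, value) =>
      (countryName, countryCode, year.toInt, value.toLong)
    }
```

Spark's RDD also offers a collect overload that takes a partial function, so the same shape should carry over, but that is untested here.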
I am experimenting with path dependent types. In my simple example I am using a Currency object to ensure that Money calculations can only be performed on Money of the same currency.
// Simple currency class
case class Currency(code: String, name: String, symbol: String) {
// Money amounts in this currency
// Operations are only supported on money of the same currency
case class Money(amount: BigDecimal) {
override def toString: String = s"$code $amount"
val currency: Currency.this.type = Currency.this
def +(rhs: Money): Money = Money(amount + rhs.amount)
def -(rhs: Money): Money = Money(amount - rhs.amount)
}
}
Using the above class, simple calculations in the REPL are straightforward.
val e1 = Euro.Money(5)
val e2 = Euro.Money(9)
e1 + e2 // Compiles fine
val d1 = Dollar.Money(6)
d1 + e2 // Doesn't compile as expected
These are simple because the compiler can easily prove that e1 and e2 share a common currency. However proving that money instances share a common currency is much harder when I read a list of money amounts from a file or database. For instance I cannot see how to implement the collate function below.
trait CurrencyAndMonies {
val currency: Currency
val monies: List[currency.Money]
}
// Take a list of money in different currencies and group them by currency
// so their shared Currency type is available to static type checking
// in further calculations
def collate(monies: Seq[Currency#Money]): List[CurrencyAndMonies] = ???
Is it possible to sort monies based on currency and reestablish the link between them? And if so how? I don't mind changing the signature or the way Money amounts are read from the database.
You can downcast:
new CurrencyAndMonies {
val currency = foo
val monies = bars.map(_.asInstanceOf[currency.Money])
}
Group by Money#currency.
The downcast is not runtime-checked, so you'll have to make sure the monetary value has the right currency (which you already do by grouping by currency), but it will compile.
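Putting the two hints together (group by currency, then downcast), collate might look like the sketch below. It repeats trimmed versions of Currency and CurrencyAndMonies from the question so that it is self-contained. The total method and the parameter name all are additions for illustration and not part of the original code.

```scala
case class Currency(code: String, name: String, symbol: String) {
  case class Money(amount: BigDecimal) {
    val currency: Currency.this.type = Currency.this
    def +(rhs: Money): Money = Money(amount + rhs.amount)
  }
}

trait CurrencyAndMonies {
  val currency: Currency
  val monies: List[currency.Money]
  // Illustrative addition: summing type-checks because all monies share one currency.
  def total: currency.Money = monies.reduce(_ + _)
}

def collate(all: Seq[Currency#Money]): List[CurrencyAndMonies] =
  all.groupBy(_.currency).toList.map { case (c, group) =>
    new CurrencyAndMonies {
      val currency: c.type = c
      // Unchecked downcast: safe only because every element of `group`
      // was grouped under the key `c`, i.e. its currency equals `c`.
      val monies: List[currency.Money] =
        group.toList.map(_.asInstanceOf[currency.Money])
    }
  }

val Euro = Currency("EUR", "Euro", "EUR")
val USD = Currency("USD", "Dollar", "$")

val groups = collate(List(USD.Money(10), Euro.Money(20), Euro.Money(5)))
```

Declaring currency with the singleton type c.type is what lets the compiler treat currency.Money and c.Money as the same type inside the anonymous class.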
In your example, the type signature
def collate(monies: Seq[Currency#Money]): List[CurrencyAndMonies]
does not require all the money to be in the same currency; the sequence can hold Money of any Currency:
val Euro = Currency("EUR", "Euro", "EUR")
val USD = Currency("USD", "Dollar", "$")
def collateOld(s: List[Currency#Money]): CurrencyAndMonies = ???
// Compiles successfully, although mixing currencies should be an error
collateOld(List(USD.Money(10), Euro.Money(20)))
Typically you'll have to pass an instance of the currency as well as the list of Money. For example, you can do it this way:
abstract class CurrencyAndMonies(val currency: Currency) {
type Money
def monies: List[Money]
}
def collate(c: Currency)
(m: List[c.Money]): CurrencyAndMonies { type Money = c.Money } =
new CurrencyAndMonies(c) {
type Money = c.Money
val monies = m
}
collate(Euro)(List(Euro.Money(10), Euro.Money(20)))
It is odd that Money has to be redefined as a type member inside CurrencyAndMonies; I can't work out why plain currency.Money doesn't work. If you make the constructor private, so that collate is the only way to create an instance of the class, you should get guaranteed type safety.