Pattern matching and RDDs - scala

I have a very simple (n00b) question but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid I do not want the extractor logic to fail but simply write an entry in a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
val pattern = "<some pattern>".r
val logs = sc.wholeTextFiles(path, numPartitions)
val entries = logs.map(fileContent => {
val file = fileContent._1
val content = fileContent._2
content.split("\\r?\\n").map(line => line match {
case pattern(dt, ev, seq) => Some(LogEntry(<...>))
case _ => logger.error(s"Cannot parse $file: $line"); None
})
})
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntrys? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.

To get rid of the Array - simply replace the map command with flatMap - flatMap will treat a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option - collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivelant to a map and a filter combined) is very different from collect() (which sends all data to the driver).
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
val pattern = "<some pattern>".r
val logs = sc.wholeTextFiles(path, numPartitions)
val entries = logs.flatMap(fileContent => {
val file = fileContent._1
val content = fileContent._2
content.split("\\r?\\n").map(line => line match {
case pattern(dt, ev, seq) => Some(LogEntry(<...>))
case _ => logger.error(s"Cannot parse $file: $line"); None
})
})
entries.collect { case Some(entry) => entry }
}

Related

Parse CSV and add only matching rows to List functionally in Scala

I am reading csv scala.
Person is a case class
Case class Person(name, address)
def getData(path:String,existingName) : List[Person] = {
Source.fromFile(“my_file_path”).getLines.drop(1).map(l => {
val data = l.split("|", -1).map(_.trim).toList
val personName = data(0)
if(personName.equalsIgnoreCase(existingName)) {
val address=data(1)
Person(personName,address)
//here I want to add to list
}
else
Nil
///here return empty list of zero length
}).toList()
}
I want to achieve this functionally in scala.
Here's the basic approach to what I think you're trying to do.
case class Person(name:String, address:String)
def getData(path:String, existingName:String) :List[Person] = {
val recordPattern = raw"\s*(?i)($existingName)\s*\|\s*(.*)".r.unanchored
io.Source.fromFile(path).getLines.drop(1).collect {
case recordPattern(name,addr) => Person(name, addr.trim)
}.toList
}
This doesn't close the file reader or report the error if the file can't be opened, which you really should do, but we'll leave that for a different day.
update: added file closing and error handling via Using (Scala 2.13)
import scala.util.{Using, Try}
case class Person(name:String, address:String)
def getData(path:String, existingName:String) :Try[List[Person]] =
Using(io.Source.fromFile(path)){ file =>
val recordPattern = raw"\s*(?i)($existingName)\s*\|\s*([^|]*)".r
file.getLines.drop(1).collect {
case recordPattern(name,addr) => Person(name, addr.trim)
}.toList
}
updated update
OK. Here's a version that:
reports the error if the file can't be opened
closes the file after it's been opened and read
ignores unwanted spaces and quote marks
is pre-2.13 compiler friendly
import scala.util.Try
case class Person(name:String, address:String)
def getData(path:String, existingName:String) :List[Person] = {
val recordPattern =
raw"""[\s"]*(?i)($existingName)["\s]*\|[\s"]*([^"|]*)*.""".r
val file = Try(io.Source.fromFile(path))
val res = file.fold(
err => {println(err); List.empty[Person]},
_.getLines.drop(1).collect {
case recordPattern(name,addr) => Person(name, addr.trim)
}.toList)
file.map(_.close)
res
}
And here's how the regex works:
[\s"]* there might be spaces or quote marks
(?i) matching is case-insensitive
($existingName) match and capture this string (1st capture group)
["\s]* there might be spaces or quote marks
\| there will be a bar character
[\s"]* there might be spaces or quote marks
([^"|]*) match and capture everything that isn't quote or bar
.* ignore anything that might come thereafter
you were not very clear on what was the problem on your approach, but this should do the trick (very close to what you have)
def getData(path:String, existingName: String) : List[Person] = {
val source = Source.fromFile("my_file_path")
val lst = source.getLines.drop(1).flatMap(l => {
val data = l.split("|", -1).map(_.trim).toList
val personName = data.head
if (personName.equalsIgnoreCase(existingName)) {
val address = data(1)
Option(Person(personName, address))
}
else
Option.empty
}).toList
source.close()
lst
}
we read the file line per line, for each line we extract the personName from the first csv field, and if it's the one we are looking for we return an (Option) Person, otherwise none (Option.empty). By doing flatmap we discard the empty options (just to avoid using nils)

Handling and throwing Exceptions in Scala

I have the following implementation:
val dateFormats = Seq("dd/MM/yyyy", "dd.MM.yyyy")
implicit def dateTimeCSVConverter: CsvFieldReader[DateTime] = (s: String) => Try {
val elem = dateFormats.map {
format =>
try {
Some(DateTimeFormat.forPattern(format).parseDateTime(s))
} catch {
case _: IllegalArgumentException =>
None
}
}.collectFirst {
case e if e.isDefined => e.get
}
if (elem.isDefined)
elem.get
else
throw new IllegalArgumentException(s"Unable to parse DateTime $s")
}
So basically what I'm doing is that, I'm running over my Seq and trying to parse the DateTime with different formats. I then collect the first one that succeeds and if not I throw the Exception back.
I'm not completely satisfied with the code. Is there a better way to make it simpler? I need the exception message passed on to the caller.
The one problem with your code is it tries all patterns no matter if date was already parsed. You could use lazy collection, like Stream to solve this problem:
def dateTimeCSVConverter(s: String) = Stream("dd/MM/yyyy", "dd.MM.yyyy")
.map(f => Try(DateTimeFormat.forPattern(format).parseDateTime(s))
.dropWhile(_.isFailure)
.headOption
Even better is the solution proposed by jwvh with find (you don't have to call headOption):
def dateTimeCSVConverter(s: String) = Stream("dd/MM/yyyy", "dd.MM.yyyy")
.map(f => Try(DateTimeFormat.forPattern(format).parseDateTime(s))
.find(_.isSuccess)
It returns None if none of patterns matched. If you want to throw exception on that case, you can uwrap option with getOrElse:
...
.dropWhile(_.isFailure)
.headOption
.getOrElse(throw new IllegalArgumentException(s"Unable to parse DateTime $s"))
The important thing is, that when any validation succeedes, it won't go further but will return parsed date right away.
This is a possible solution that iterates through all the options
val dateFormats = Seq("dd/MM/yyyy", "dd.MM.yyyy")
val dates = Vector("01/01/2019", "01.01.2019", "01-01-2019")
dates.foreach(s => {
val d: Option[Try[DateTime]] = dateFormats
.map(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)))
.filter(_.isSuccess)
.headOption
d match {
case Some(d) => println(d.toString)
case _ => throw new IllegalArgumentException("foo")
}
})
This is an alternative solution that returns the first successful conversion, if any
val dateFormats = Seq("dd/MM/yyyy", "dd.MM.yyyy")
val dates = Vector("01/01/2019", "01.01.2019", "01-01-2019")
dates.foreach(s => {
dateFormats.find(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)).isSuccess) match {
case Some(format) => println(DateTimeFormat.forPattern(format).parseDateTime(s))
case _ => throw new IllegalArgumentException("foo")
}
})
I made it sweet like this now! I like this a lot better! Use this if you want to collect all the successes and all the failures. Note that, this might be a bit in-efficient when you need to break out of the loop as soon as you find one success!
implicit def dateTimeCSVConverter: CsvFieldReader[DateTime] = (s: String) => Try {
val (successes, failures) = dateFormats.map {
case format => Try(DateTimeFormat.forPattern(format).parseDateTime(s))
}.partition(_.isSuccess)
if (successes.nonEmpty)
successes.head.get
else
failures.head.get
}

How to skip entries that don't match a pattern when constructing an RDD

I'm doing some pattern matching on an RDD and would like to select only thoese rows/records matching a pattern. Here is what I currently have,
val idPattern = """Id="([^"]*)""".r.unanchored
val typePattern = """PostTypeId="([^"]*)""".r.unanchored
val datePattern = """CreationDate="([^"]*)""".r.unanchored
val tagPattern = """Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.map {line => {
val id = line match {case idPattern(x) => x}
val typeId = line match {case typePattern(x) => x}
val date = line match {case datePattern(x) => x}
val tags = line match {case tagPattern(x) => x}
Post(Some(id),Some(typeId),Some(date),Some(tags))
}
}
case class Post(Id: Option[String], Type: Option[String], CreationDate: Option[String], Tags: Option[String])
I'm only interested in row/record which match to all patterns (that is, records that have all those four fields). How can I skip those rows/records not satisfy my requirement?
You can use RDD.collect(scala.PartialFunction f) to do the filtering and mapping in one step.
For example, if you know the order of these fields in your input, you can merge the regular expressions into one and use a single case:
val pattern = """Id="([^"]*).*PostTypeId="([^"]*).*CreationDate="([^"]*).*Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.collect {
case pattern(id, typeId, date, tags) => Post(Some(id), Some(typeId), Some(date), Some(tags))
}
The returned RDD[Post] will only contain records for which this case matched. Notice that this collect(PartialFunction) has nothing to do with collect() - it does not collect entire data set into driver memory.
Use the filter API.
val projectedPostAnswers = postsAnswers.filter(line => f(line)).map{....
Create a function 'f' that does your data cleansing for you. Make sure the function returns true or false, as thats what filter uses to decide whether or not to pass the record.

Scala: parsing an API parameter

My API currently take an optional parameter named gamedate. It is passed in as a string at which time I later parse it to a Date object using some utility code. The code looks like this:
val gdate:Option[String] = params.get("gamedate")
val res = gdate match {
case Some(s) => {
val date:Option[DateTime] = gdate map { MyDateTime.parseDate _ }
val dateOrDefault:DateTime = date.getOrElse((new DateTime).withTime(0, 0, 0, 0))
NBAScoreboard.findByDate(dateOrDefault)
}
case None => NBAScoreboard.getToday
}
This works just fine. Now what I'm trying to solve is I'm allowing multiple gamedates get passed in via a comma delimited list. Originally you can pass a parameter like this:
gamedate=20131211
now I want to allow that OR:
gamedate=20131211,20131212
That requires modifying the code above to try to split the comma delimited string and parse each value into a Date and change the interface to findByDate to accept a Seq[DateTime] vs just DateTime. I tried running something like this, but apparently it's not the way to go about it:
val res = gdates match {
case Some(s) => {
val dates:Option[Seq[DateTime]] = gdates map { _.split(",").distinct.map(MyDateTime.parseDate _ )}
val datesOrDefault:Seq[DateTime] = dates map { _.getOrElse((new DateTime).withTime(0, 0, 0, 0))}
NBAScoreboard.findByDates(datesOrDefault)
}
case None => NBAScoreboard.getToday
}
What's the best way to convert my first set of code to handle this use case? I'm probably fairly close in the second code example I provided, but I'm just not hitting it right.
You mixed up the containers. The map you call on dates unpackes the Option so the getOrElse is applied to a list.
val res = gdates match {
case Some(s) =>
val dates = gdates.map(_.split(",").distinct.map(MyDateTime.parseDate _ ))
val datesOrDefault = dates.getOrElse(Array((new DateTime).withTime(0, 0, 0, 0)))
NBAScoreboard.findByDates(datesOrDefault)
case _ =>
NBAScoreboard.getToday
}
This should work.

Scala: how to traverse stream/iterator collecting results into several different collections

I'm going through log file that is too big to fit into memory and collecting 2 type of expressions, what is better functional alternative to my iterative snippet below?
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)]={
val lines : Iterator[String] = io.Source.fromFile(file).getLines()
val logins: mutable.Map[String, String] = new mutable.HashMap[String, String]()
val errors: mutable.ListBuffer[(String, String)] = mutable.ListBuffer.empty
for (line <- lines){
line match {
case errorPat(date,ip)=> errors.append((ip,date))
case loginPat(date,user,ip,id) =>logins.put(ip, id)
case _ => ""
}
}
errors.toList.map(line => (logins.getOrElse(line._1,"none") + " " + line._1,line._2))
}
Here is a possible solution:
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String,String)] = {
val lines = Source.fromFile(file).getLines
val (err, log) = lines.collect {
case errorPat(inf, ip) => (Some((ip, inf)), None)
case loginPat(_, _, ip, id) => (None, Some((ip, id)))
}.toList.unzip
val ip2id = log.flatten.toMap
err.collect{ case Some((ip,inf)) => (ip2id.getOrElse(ip,"none") + "" + ip, inf) }
}
Corrections:
1) removed unnecessary types declarations
2) tuple deconstruction instead of ulgy ._1
3) left fold instead of mutable accumulators
4) used more convenient operator-like methods :+ and +
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
val lines = io.Source.fromFile(file).getLines()
val (logins, errors) =
((Map.empty[String, String], Seq.empty[(String, String)]) /: lines) {
case ((loginsAcc, errorsAcc), next) =>
next match {
case errorPat(date, ip) => (loginsAcc, errorsAcc :+ (ip -> date))
case loginPat(date, user, ip, id) => (loginsAcc + (ip -> id) , errorsAcc)
case _ => (loginsAcc, errorsAcc)
}
}
// more concise equivalent for
// errors.toList.map { case (ip, date) => (logins.getOrElse(ip, "none") + " " + ip) -> date }
for ((ip, date) <- errors.toList)
yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
I have a few suggestions:
Instead of a pair/tuple, it's often better to use your own class. It gives meaningful names to both the type and its fields, which makes the code much more readable.
Split the code into small parts. In particular, try to decouple pieces of code that don't need to be tied together. This makes your code easier to understand, more robust, less prone to errors and easier to test. In your case it'd be good to separate producing your input (lines of a log file) and consuming it to produce a result. For example, you'd be able to make automatic tests for your function without having to store sample data in a file.
As an example and exercise, I tried to make a solution based on Scalaz iteratees. It's a bit longer (includes some auxiliary code for IteratorEnumerator) and perhaps it's a bit overkill for the task, but perhaps someone will find it helpful.
import java.io._;
import scala.util.matching.Regex
import scalaz._
import scalaz.IterV._
object MyApp extends App {
// A type for the result. Having names keeps things
// clearer and shorter.
type LogResult = List[(String,String)]
// Represents a state of our computation. Not only it
// gives a name to the data, we can also put here
// functions that modify the state. This nicely
// separates what we're computing and how.
sealed case class State(
logins: Map[String,String],
errors: Seq[(String,String)]
) {
def this() = {
this(Map.empty[String,String], Seq.empty[(String,String)])
}
def addError(date: String, ip: String): State =
State(logins, errors :+ (ip -> date));
def addLogin(ip: String, id: String): State =
State(logins + (ip -> id), errors);
// Produce the final result from accumulated data.
def result: LogResult =
for ((ip, date) <- errors.toList)
yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
// An iteratee that consumes lines of our input. Based
// on the given regular expressions, it produces an
// iteratee that parses the input and uses State to
// compute the result.
def logIteratee(errorPat: Regex, loginPat: Regex):
IterV[String,List[(String,String)]] = {
// Consumes a signle line.
def consume(line: String, state: State): State =
line match {
case errorPat(date, ip) => state.addError(date, ip);
case loginPat(date, user, ip, id) => state.addLogin(ip, id);
case _ => state
}
// The core of the iteratee. Every time we consume a
// line, we update our state. When done, compute the
// final result.
def step(state: State)(s: Input[String]): IterV[String, LogResult] =
s(el = line => Cont(step(consume(line, state))),
empty = Cont(step(state)),
eof = Done(state.result, EOF[String]))
// Return the iterate waiting for its first input.
Cont(step(new State()));
}
// Converts an iterator into an enumerator. This
// should be more likely moved to Scalaz.
// Adapted from scalaz.ExampleIteratee
implicit val IteratorEnumerator = new Enumerator[Iterator] {
#annotation.tailrec def apply[E, A](e: Iterator[E], i: IterV[E, A]): IterV[E, A] = {
val next: Option[(Iterator[E], IterV[E, A])] =
if (e.hasNext) {
val x = e.next();
i.fold(done = (_, _) => None, cont = k => Some((e, k(El(x)))))
} else
None;
next match {
case None => i
case Some((es, is)) => apply(es, is)
}
}
}
// main ---------------------------------------------------
{
// Read a file as an iterator of lines:
// val lines: Iterator[String] =
// io.Source.fromFile("test.log").getLines();
// Create our testing iterator:
val lines: Iterator[String] = Seq(
"Error: 2012/03 1.2.3.4",
"Login: 2012/03 user 1.2.3.4 Joe",
"Error: 2012/03 1.2.3.5",
"Error: 2012/04 1.2.3.4"
).iterator;
// Create an iteratee.
val iter = logIteratee("Error: (\\S+) (\\S+)".r,
"Login: (\\S+) (\\S+) (\\S+) (\\S+)".r);
// Run the the iteratee against the input
// (the enumerator is implicit)
println(iter(lines).run);
}
}