Filtering inside `for` with pattern matching - scala

I am reading a TSV file and using something like this:
case class Entry(entryType: Int, value: Int)

def filterEntries(): Iterator[Entry] = {
  for {
    line <- scala.io.Source.fromFile("filename").getLines()
  } yield new Entry(line.split("\t").map(x => x.toInt))
}
Now I am interested both in filtering out entries whose entryType is set to 0 and in ignoring lines whose column count is anything other than 2 (i.e. lines that do not match the constructor). I was wondering if there's an idiomatic way to achieve this, maybe using pattern matching and an unapply method in a companion object. The only thing I can think of is using .filter on the resulting iterator.
I will also accept a solution that does not involve a for comprehension but returns Iterator[Entry]. The solution must be tolerant of malformed input.
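For illustration, here is a minimal sketch of such an extractor-based approach (the ParsedEntry name and the use of flatMap are my own choices, not from the question): a dedicated extractor returns Option[Entry] only for well-formed lines with a non-zero entryType, and flatMap drops everything else.

import scala.util.Try

case class Entry(entryType: Int, value: Int)

// A separate extractor object, so it does not clash with the
// auto-generated Entry.unapply of the case class.
object ParsedEntry {
  def unapply(line: String): Option[Entry] =
    line.split("\t") match {
      case Array(t, v) => Try(Entry(t.toInt, v.toInt)).toOption.filter(_.entryType != 0)
      case _           => None // wrong column count
    }
}

def filterEntries(): Iterator[Entry] =
  scala.io.Source.fromFile("filename").getLines().flatMap {
    case ParsedEntry(e) => Some(e) // exactly two numeric columns, entryType != 0
    case _              => None    // malformed line or entryType == 0
  }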

This is more state-of-arty:
package object liner {
  implicit class R(val sc: StringContext) {
    object r {
      def unapplySeq(s: String): Option[Seq[String]] = sc.parts.mkString.r unapplySeq s
    }
  }
}

package liner {
  case class Entry(entryType: Int, value: Int)

  object I {
    def unapply(s: String): Option[Int] = util.Try(s.toInt).toOption
  }

  object Test extends App {
    def lines = List("1 2", "3", "", " 4 5 ", "junk", "0, 100000", "6 7 8")
    def entries = lines flatMap {
      case r"""\s*${I(i)}(\d+)\s+${I(j)}(\d+)\s*""" if i != 0 => Some(Entry(i, j))
      case __________________________________________________ => None
    }
    Console println entries
  }
}
Hopefully, the regex interpolator will make it into the standard distro soon, but this shows how easy it is to rig up. Also hopefully, a scanf-style interpolator will allow easy extraction with case f"$i%d".
I just started using the "elongated wildcard" in patterns to align the arrows.
There is a pupal or maybe larval regex macro:
https://github.com/som-snytt/regextractor

You can create variables in the head of the for-comprehension and then use a guard:
edit: ensure length of array
for {
  line <- scala.io.Source.fromFile("filename").getLines()
  arr = line.split("\t").map(x => x.toInt)
  if arr.size == 2 && arr(0) != 0
} yield new Entry(arr(0), arr(1))

I have solved it using the following code:
import scala.util.{Try, Success}

val lines = List(
  "1\t2",
  "1\t",
  "2",
  "hello",
  "1\t3"
)

case class Entry(val entryType: Int, val value: Int)

object Entry {
  def unapply(line: String) = {
    line.split("\t").map(x => Try(x.toInt)) match {
      case Array(Success(entryType: Int), Success(value: Int)) => Some(Entry(entryType, value))
      case _ =>
        println("Malformed line: " + line)
        None
    }
  }
}

for {
  line <- lines
  entryOption = Entry.unapply(line)
  if entryOption.isDefined
} yield entryOption.get

The left hand side of a <- or = in a for-loop may be a fully-fledged pattern. So you may write this:
def filterEntries(): Iterator[Entry] = for {
  line <- scala.io.Source.fromFile("filename").getLines()
  arr = line.split("\t").map(x => x.toInt)
  if arr.size == 2
  // now you may use pattern matching to extract the array
  Array(entryType, value) = arr
  if entryType != 0
} yield Entry(entryType, value)
Note that this solution will throw a NumberFormatException if a field is not convertible to an Int. If you do not want that, you'll have to encapsulate x.toInt with a Try and pattern match again.
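As a rough sketch of that last suggestion (my own variant, not part of the answer above, and assuming the Entry case class from the question), the conversions can be wrapped in Try and matched together with the column count:

import scala.util.{Try, Success}

def filterEntries(): Iterator[Entry] =
  scala.io.Source.fromFile("filename").getLines()
    .map(_.split("\t").map(x => Try(x.toInt)))
    .collect {
      // keep only rows with exactly two numeric columns and a non-zero entryType
      case Array(Success(entryType), Success(value)) if entryType != 0 =>
        Entry(entryType, value)
    }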


Scala conditional accumulation

I'm trying to implement a function that extracts from a given string the "placeholders" delimited by the $ character.
Processing the string:
val stringToParse = "ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored"
the result should be:
Seq("aaa", "bbb")
What would be an idiomatic Scala alternative to the following implementation, which uses a var for toggling accumulation?
import fiddle.Fiddle, Fiddle.println
import scalajs.js
import scala.collection.mutable.ListBuffer

@js.annotation.JSExportTopLevel("ScalaFiddle")
object ScalaFiddle {
  // $FiddleStart
  val stringToParse = "ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored"

  class StringAccumulator {
    val accumulator: ListBuffer[String] = new ListBuffer[String]
    val sb: StringBuilder = new StringBuilder("")
    var open: Boolean = false

    def next(): Unit = {
      if (open) {
        accumulator.append(sb.toString)
        sb.clear
        open = false
      } else {
        open = true
      }
    }

    def accumulateIfOpen(charToAccumulate: Char): Unit = {
      if (open) sb.append(charToAccumulate)
    }

    def get(): Seq[String] = accumulator.toList
  }

  def getPlaceHolders(str: String): Seq[String] = {
    val sac = new StringAccumulator
    str.foreach(chr => {
      if (chr == '$') {
        sac.next()
      } else {
        sac.accumulateIfOpen(chr)
      }
    })
    sac.get
  }

  println(getPlaceHolders(stringToParse))
  // $FiddleEnd
}
I'll present two solutions to you. The first is the most direct translation of what you've done. In Scala, if you hear the word accumulate it usually translates to a variant of fold or reduce.
import scala.collection.mutable.ListBuffer

def extractValues(s: String) = {
  // We can combine the functionality of your boolean and StringBuilder by using an Option
  s.foldLeft[(ListBuffer[String], Option[StringBuilder])]((new ListBuffer[String], Option.empty)) {
    // As we fold through, we have the accumulated list, possibly a partially built String and the current letter
    case ((accumulator, sbOption), char) =>
      char match {
        // This logic pretty much matches what you had, adjusted to work with the Option
        case '$' =>
          sbOption match {
            case Some(sb) =>
              accumulator.append(sb.mkString)
              (accumulator, None)
            case None =>
              (accumulator, Some(new StringBuilder))
          }
        case _ =>
          sbOption.foreach(_.append(char))
          (accumulator, sbOption)
      }
  }._1.map(_.mkString).toList
}
However, that seems pretty complicated, for what sounds like it should be a simple task. We can use regexes, but those are scary so let's avoid them. In fact, with a little bit of thought this problem actually becomes quite simple.
def extractValuesSimple(s: String) =
  s.split('$')        // Split the string on the $ character
    .dropRight(1)     // Drop the rightmost item, to handle the case with an odd number of $
    .zipWithIndex
    .filter { case (str, index) => index % 2 == 1 } // Filter out the even-indexed items, which always lie outside the matching $
    .map { case (str, index) => str }               // Remove the indexes from the output
    .toList
Is this solution enough?
scala> val stringToParse = "ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored"
stringToParse: String = ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored
scala> val P = """\$([^\$]+)\$""".r
P: scala.util.matching.Regex = \$([^\$]+)\$
scala> P.findAllIn(stringToParse).map{case P(s) => s}.toSeq
res1: Seq[String] = List(aaa, bbb)

Nested Scala case classes to/from CSV

There are many nice libraries for writing/reading Scala case classes to/from CSV files. I'm looking for something that goes beyond that, which can handle nested cases classes. For example, here a Match has two Players:
case class Player(name: String, ranking: Int)
case class Match(place: String, winner: Player, loser: Player)

val matches = List(
  Match("London", Player("Jane", 7), Player("Fred", 23)),
  Match("Rome", Player("Marco", 19), Player("Giulia", 3)),
  Match("Paris", Player("Isabelle", 2), Player("Julien", 5))
)
I'd like to effortlessly (no boilerplate!) write/read matches to/from this CSV:
place,winner.name,winner.ranking,loser.name,loser.ranking
London,Jane,7,Fred,23
Rome,Marco,19,Giulia,3
Paris,Isabelle,2,Julien,5
Note the automated header line using the dot "." to form the column name for a nested field, e.g. winner.ranking. I'd be delighted if someone could demonstrate a simple way to do this (say, using reflection or Shapeless).
[Motivation. During data analysis it's convenient to have a flat CSV to play around with, for sorting, filtering, etc., even when case classes are nested. And it would be nice if you could load nested case classes back from such files.]
Since a case-class is a Product, getting the values of the various fields is relatively easy. Getting the names of the fields/columns does require using Java reflection.
The following function takes a list of case-class instances and returns a list of rows, each of which is a list of strings. It uses recursion to get the values and headers of nested case-class instances.
def toCsv(p: List[Product]): List[List[String]] = {
  def header(c: Class[_], prefix: String = ""): List[String] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      if (classOf[Product].isAssignableFrom(field.getType)) header(field.getType, name + ".")
      else List(name)
    }
  }

  def flatten(p: Product): List[String] =
    p.productIterator.flatMap {
      case p: Product => flatten(p)
      case v: Any     => List(v.toString)
    }.toList

  header(classOf[Match]) :: p.map(flatten)
}
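To turn those rows into actual CSV text, they can simply be joined (a small sketch of my own, using the matches list from the question):

val csvText = toCsv(matches)
  .map(_.mkString(","))
  .mkString("\n")

// place,winner.name,winner.ranking,loser.name,loser.ranking
// London,Jane,7,Fred,23
// Rome,Marco,19,Giulia,3
// Paris,Isabelle,2,Julien,5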
However, constructing case classes from CSV is far more involved: it requires reflection to get the types of the various fields, to create the values from the CSV strings, and to construct the case-class instances.
For simplicity (not saying the code is simple, just so it won't be further complicated), I assume that the order of columns in the CSV is the same as if the file was produced by the toCsv(...) function above.
The following function starts by creating a list of "instructions" for how to process a single CSV row (the instructions are also used to verify that the column headers in the CSV match the case-class properties). The instructions are then used to recursively process the CSV rows, one row at a time.
import scala.reflect.ClassTag

def fromCsv[T <: Product](csv: List[List[String]])(implicit tag: ClassTag[T]): List[T] = {
  trait Instruction {
    val name: String
    val header = true
  }
  case class BeginCaseClassField(name: String, clazz: Class[_]) extends Instruction {
    override val header = false
  }
  case class EndCaseClassField(name: String) extends Instruction {
    override val header = false
  }
  case class IntField(name: String) extends Instruction
  case class StringField(name: String) extends Instruction
  case class DoubleField(name: String) extends Instruction

  def scan(c: Class[_], prefix: String = ""): List[Instruction] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      val fType = field.getType
      if (fType == classOf[Int]) List(IntField(name))
      else if (fType == classOf[Double]) List(DoubleField(name))
      else if (fType == classOf[String]) List(StringField(name))
      else if (classOf[Product].isAssignableFrom(fType)) BeginCaseClassField(name, fType) :: scan(fType, name + ".")
      else throw new IllegalArgumentException(s"Unsupported field type: $fType")
    } :+ EndCaseClassField(prefix)
  }

  def produce(instructions: List[Instruction], row: List[String], argAccumulator: List[Any]): (List[Instruction], List[String], List[Any]) =
    instructions match {
      case IntField(_) :: tail    => produce(tail, row.drop(1), argAccumulator :+ row.head.toString.toInt)
      case StringField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString)
      case DoubleField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString.toDouble)
      case BeginCaseClassField(_, clazz) :: tail =>
        val (instructionRemaining, rowRemaining, constructorArgs) = produce(tail, row, List.empty)
        val newCaseClass = clazz.getConstructors.head.newInstance(constructorArgs.map(_.asInstanceOf[AnyRef]): _*)
        produce(instructionRemaining, rowRemaining, argAccumulator :+ newCaseClass)
      case EndCaseClassField(_) :: tail => (tail, row, argAccumulator)
      case Nil if row.isEmpty => (Nil, Nil, argAccumulator)
      case Nil => throw new IllegalArgumentException("Not all values from CSV row were used")
    }

  val instructions = BeginCaseClassField(".", tag.runtimeClass) :: scan(tag.runtimeClass)
  assert(csv.head == instructions.filter(_.header).map(_.name), "CSV header doesn't match target case-class fields")

  csv.drop(1).map(row => produce(instructions, row, List.empty)._3.head.asInstanceOf[T])
}
I've tested this using:
case class Player(name: String, ranking: Int, price: Double)
case class Match(place: String, winner: Player, loser: Player)

val matches = List(
  Match("London", Player("Jane", 7, 12.5), Player("Fred", 23, 11.1)),
  Match("Rome", Player("Marco", 19, 13.54), Player("Giulia", 3, 41.8)),
  Match("Paris", Player("Isabelle", 2, 31.7), Player("Julien", 5, 16.8))
)

val csv = toCsv(matches)
val matchesFromCsv = fromCsv[Match](csv)

assert(matches == matchesFromCsv)
Obviously this should be optimized and hardened if you ever want to use this for production...

Scala - how to print case classes like (pretty printed) tree

I'm making a parser with Scala Combinators. It is awesome. What I end up with is a long list of entangled case classes, like: ClassDecl(Complex,List(VarDecl(Real,float), VarDecl(Imag,float))), just 100x longer. I was wondering if there is a good way to print case classes like these in a tree-like fashion so that it's easier to read? (or some other form of pretty print)
ClassDecl
  name = Complex
  fields =
    - VarDecl
      name = Real
      type = float
    - VarDecl
      name = Imag
      type = float
^ I want to end up with something like this
edit: Bonus question
Is there also a way to show the name of the parameter? Like: ClassDecl(name=Complex, fields=List( ... ))?
Check out a small extensions library named sext. It exports these two functions exactly for purposes like that.
Here's how it can be used for your example:
object Demo extends App {
  import sext._

  case class ClassDecl(kind: Kind, list: List[VarDecl])
  sealed trait Kind
  case object Complex extends Kind
  case class VarDecl(a: Int, b: String)

  val data = ClassDecl(Complex, List(VarDecl(1, "abcd"), VarDecl(2, "efgh")))

  println("treeString output:\n")
  println(data.treeString)
  println()
  println("valueTreeString output:\n")
  println(data.valueTreeString)
}
Following is the output of this program:
treeString output:
ClassDecl:
- Complex
- List:
| - VarDecl:
| | - 1
| | - abcd
| - VarDecl:
| | - 2
| | - efgh
valueTreeString output:
- kind:
- list:
| - - a:
| | | 1
| | - b:
| | | abcd
| - - a:
| | | 2
| | - b:
| | | efgh
Starting Scala 2.13, case classes (which are an implementation of Product) are now provided with a productElementNames method which returns an iterator over their field's names.
Combined with Product::productIterator which provides the values of a case class, we have a simple way to pretty print case classes without requiring reflection:
def pprint(obj: Any, depth: Int = 0, paramName: Option[String] = None): Unit = {
  val indent = "  " * depth
  val prettyName = paramName.fold("")(x => s"$x: ")
  val ptype = obj match {
    case _: Iterable[Any] => ""
    case obj: Product     => obj.productPrefix
    case _                => obj.toString
  }

  println(s"$indent$prettyName$ptype")

  obj match {
    case seq: Iterable[Any] =>
      seq.foreach(pprint(_, depth + 1))
    case obj: Product =>
      (obj.productIterator zip obj.productElementNames)
        .foreach { case (subObj, paramName) => pprint(subObj, depth + 1, Some(paramName)) }
    case _ =>
  }
}
which for your specific scenario:
// sealed trait Kind
// case object Complex extends Kind
// case class VarDecl(a: Int, b: String)
// case class ClassDecl(kind: Kind, decls: List[VarDecl])
val data = ClassDecl(Complex, List(VarDecl(1, "abcd"), VarDecl(2, "efgh")))
pprint(data)
produces:
ClassDecl
  kind: Complex
  decls:
    VarDecl
      a: 1
      b: abcd
    VarDecl
      a: 2
      b: efgh
Use the com.lihaoyi.pprint library.
libraryDependencies += "com.lihaoyi" %% "pprint" % "0.4.1"
val data = ...
val str = pprint.tokenize(data).mkString
println(str)
you can also configure width, height, indent and colors:
pprint.tokenize(data, width = 80).mkString
Docs: https://github.com/com-lihaoyi/PPrint
Here's my solution, which greatly improves how http://www.lihaoyi.com/PPrint/ handles case classes (see https://github.com/lihaoyi/PPrint/issues/4).
For example, it prints a tree with named fields for a usage like this:
pprint2 = pprint.copy(additionalHandlers = pprintAdditionalHandlers)
case class Author(firstName: String, lastName: String)
case class Book(isbn: String, author: Author)
val b = Book("978-0486282114", Author("first", "last"))
pprint2.pprintln(b)
code:
import pprint.{PPrinter, Tree, Util}

object PPrintUtils {
  // in scala 2.13 this would be even simpler/cleaner due to the added product.productElementNames
  protected def caseClassToMap(cc: Product): Map[String, Any] = {
    val fieldValues = cc.productIterator.toSet
    val fields = cc.getClass.getDeclaredFields.toSeq
      .filterNot(f => f.isSynthetic || java.lang.reflect.Modifier.isStatic(f.getModifiers))
    fields.map { f =>
      f.setAccessible(true)
      f.getName -> f.get(cc)
    }.filter { case (k, v) => fieldValues.contains(v) }
      .toMap
  }

  var pprint2: PPrinter = _

  protected def pprintAdditionalHandlers: PartialFunction[Any, Tree] = {
    case x: Product =>
      val className = x.getClass.getName
      // see source code for pprint.treeify()
      val shouldNotPrettifyCaseClass =
        x.productArity == 0 ||
          (x.productArity == 2 && Util.isOperator(x.productPrefix)) ||
          className.startsWith(pprint.tuplePrefix) ||
          className == "scala.Some"
      if (shouldNotPrettifyCaseClass)
        pprint.treeify(x)
      else {
        val fieldMap = caseClassToMap(x)
        pprint.Tree.Apply(
          x.productPrefix,
          fieldMap.iterator.flatMap { case (k, v) =>
            val prettyValue: Tree = pprintAdditionalHandlers.lift(v).getOrElse(pprint2.treeify(v))
            Seq(pprint.Tree.Infix(Tree.Literal(k), "=", prettyValue))
          }
        )
      }
  }

  pprint2 = pprint.copy(additionalHandlers = pprintAdditionalHandlers)
}
// usage
pprint2.pprintln(SomeFancyObjectWithNestedCaseClasses(...))
import java.lang.reflect.Field
...
/**
 * Pretty prints case classes with field names.
 * Handles sequences and arrays of such values.
 * Ideally, one could take the output and paste it into source code and have it compile.
 */
def prettyPrint(a: Any): String = {
  // Recursively get all the fields; this will grab vals declared in parents of case classes.
  def getFields(cls: Class[_]): List[Field] =
    Option(cls.getSuperclass).map(getFields).getOrElse(Nil) ++
      cls.getDeclaredFields.toList.filterNot(f =>
        f.isSynthetic || java.lang.reflect.Modifier.isStatic(f.getModifiers))

  a match {
    // Make Strings look similar to their literal form.
    case s: String =>
      '"' + Seq("\n" -> "\\n", "\r" -> "\\r", "\t" -> "\\t", "\"" -> "\\\"", "\\" -> "\\\\").foldLeft(s) {
        case (acc, (c, r)) => acc.replace(c, r)
      } + '"'
    case xs: Seq[_] =>
      xs.map(prettyPrint).toString
    case xs: Array[_] =>
      s"Array(${xs.map(prettyPrint) mkString ", "})"
    // This covers case classes.
    case p: Product =>
      s"${p.productPrefix}(${
        (getFields(p.getClass) map { f =>
          f setAccessible true
          s"${f.getName} = ${prettyPrint(f.get(p))}"
        }) mkString ", "
      })"
    // General objects and primitives end up here.
    case q =>
      Option(q).map(_.toString).getOrElse("¡null!")
  }
}
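A quick usage sketch with the classes from the question (my own example; the field order comes from reflection, which in practice follows declaration order):

case class VarDecl(name: String, `type`: String)
case class ClassDecl(name: String, fields: List[VarDecl])

prettyPrint(ClassDecl("Complex", List(VarDecl("Real", "float"), VarDecl("Imag", "float"))))
// roughly: ClassDecl(name = "Complex", fields = List(VarDecl(name = "Real", type = "float"), VarDecl(name = "Imag", type = "float")))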
Just like parser combinators, Scala already contains pretty printer combinators in the standard library. (Note: this library is deprecated as of Scala 2.11. A similar pretty printing library is part of the kiama open source project.)
Your question does not say plainly whether you need a solution that uses "reflection" or whether you'd like to build the printer explicitly (though your "bonus question" hints that you probably want the "reflective" solution).
Anyway, in case you'd like to develop a simple pretty printer using only the standard library, here it is. The following code can be pasted straight into the REPL.
case class VarDecl(name: String, `type`: String)
case class ClassDecl(name: String, fields: List[VarDecl])

import scala.text._
import Document._

def varDoc(x: VarDecl) =
  nest(4, text("- VarDecl") :/:
    group("name = " :: text(x.name)) :/:
    group("type = " :: text(x.`type`))
  )

def classDoc(x: ClassDecl) = {
  val docs = ((empty: Document) /: x.fields) { (d, f) => varDoc(f) :/: d }
  nest(2, text("ClassDecl") :/:
    group("name = " :: text(x.name)) :/:
    group("fields =" :/: docs))
}

def prettyPrint(d: Document) = {
  val writer = new java.io.StringWriter
  d.format(1, writer)
  writer.toString
}

prettyPrint(classDoc(
  ClassDecl("Complex", VarDecl("Real", "float") :: VarDecl("Imag", "float") :: Nil)
))
Bonus question: wrap the printers into type classes for even greater composability.
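As a rough sketch of that type-class idea (my own illustration, reusing the varDoc, classDoc and prettyPrint definitions above):

// A type class for things that can be rendered as a Document.
trait Pretty[A] {
  def doc(a: A): Document
}

object Pretty {
  implicit val varDeclPretty: Pretty[VarDecl] =
    new Pretty[VarDecl] { def doc(x: VarDecl) = varDoc(x) }
  implicit val classDeclPretty: Pretty[ClassDecl] =
    new Pretty[ClassDecl] { def doc(x: ClassDecl) = classDoc(x) }
}

// One entry point that works for anything with a Pretty instance.
def render[A](a: A)(implicit p: Pretty[A]): String =
  prettyPrint(p.doc(a))

// render(ClassDecl("Complex", VarDecl("Real", "float") :: Nil))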
The nicest, most concise out-of-the-box experience I've found is with the Kiama pretty printing library. It doesn't print member names without using additional combinators, but with only import org.kiama.output.PrettyPrinter._; pretty(any(data)) you have a great start:
case class ClassDecl(kind: Kind, list: List[VarDecl])
sealed trait Kind
case object Complex extends Kind
case class VarDecl(a: Int, b: String)

val data = ClassDecl(Complex, List(VarDecl(1, "abcd"), VarDecl(2, "efgh")))

import org.kiama.output.PrettyPrinter._

// `w` is the wrapping width. `1` forces wrapping all components.
pretty(any(data), w = 1)
Produces:
ClassDecl (
    Complex (),
    List (
        VarDecl (
            1,
            "abcd"),
        VarDecl (
            2,
            "efgh")))
Note that this is just the most basic example. Kiama PrettyPrinter is an extremely powerful library with a rich set of combinators specifically designed for intelligent spacing, line wrapping, nesting, and grouping. It's very easy to tweak to suit your needs. As of this posting, it's available in SBT with:
libraryDependencies += "com.googlecode.kiama" %% "kiama" % "1.8.0"
Using reflection
import scala.reflect.ClassTag
import scala.reflect.runtime.universe._

object CaseClassBeautifier {

  def getCaseAccessors[T: TypeTag] = typeOf[T].members.collect {
    case m: MethodSymbol if m.isCaseAccessor => m
  }.toList

  def nice[T: TypeTag](x: T)(implicit classTag: ClassTag[T]): String = {
    val instance = x.asInstanceOf[T]
    val mirror = runtimeMirror(instance.getClass.getClassLoader)
    val accessors = getCaseAccessors[T]
    var res = List.empty[String]
    accessors.foreach { z =>
      val instanceMirror = mirror.reflect(instance)
      val fieldMirror = instanceMirror.reflectField(z.asTerm)
      val s = s"${z.name} = ${fieldMirror.get}"
      res = s :: res
    }
    val beautified = x.getClass.getSimpleName + "(" + res.mkString(", ") + ")"
    beautified
  }
}
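A usage sketch (my own example; note that the ordering of the fields in the output is not guaranteed by this approach):

case class Person(name: String, age: Int)

CaseClassBeautifier.nice(Person("Ada", 36))
// something like: Person(name = Ada, age = 36)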
This is a shameless copy-paste of @F. P Freely's answer, but:
I've added an indentation feature
slight modifications so that the output will be of correct Scala style (and will compile for all primative types)
Fixed string literal bug
Added support for java.sql.Timestamp (as I use this with Spark a lot)
Tada!
import java.lang.reflect.Field
import java.sql.Timestamp

// Recursively get all the fields; this will grab vals declared in parents of case classes.
def getFields(cls: Class[_]): List[Field] =
  Option(cls.getSuperclass).map(getFields).getOrElse(Nil) ++
    cls.getDeclaredFields.toList.filterNot(f =>
      f.isSynthetic || java.lang.reflect.Modifier.isStatic(f.getModifiers))

// FIXME fix bug where indent seems to increase too much
def prettyfy(a: Any, indentSize: Int = 0): String = {
  val indent = List.fill(indentSize)(" ").mkString
  val newIndentSize = indentSize + 2
  (a match {
    // Make Strings look similar to their literal form.
    case string: String =>
      val conversionMap = Map('\n' -> "\\n", '\r' -> "\\r", '\t' -> "\\t", '\"' -> "\\\"", '\\' -> "\\\\")
      string.map(c => conversionMap.getOrElse(c, c)).mkString("\"", "", "\"")
    case xs: Seq[_] =>
      xs.map(prettyfy(_, newIndentSize)).toString
    case xs: Array[_] =>
      s"Array(${xs.map(prettyfy(_, newIndentSize)).mkString(", ")})"
    case map: Map[_, _] =>
      s"Map(\n" + map.map {
        case (key, value) => "  " + prettyfy(key, newIndentSize) + " -> " + prettyfy(value, newIndentSize)
      }.mkString(",\n") + "\n)"
    case None => "None"
    case Some(x) => "Some(" + prettyfy(x, newIndentSize) + ")"
    case timestamp: Timestamp => "new Timestamp(" + timestamp.getTime + "L)"
    case p: Product =>
      s"${p.productPrefix}(\n${
        getFields(p.getClass)
          .map { f =>
            f.setAccessible(true)
            s"  ${f.getName} = ${prettyfy(f.get(p), newIndentSize)}"
          }
          .mkString(",\n")
      }\n)"
    // General objects and primitives end up here.
    case q =>
      Option(q).map(_.toString).getOrElse("null")
  }).split("\n", -1).mkString("\n" + indent)
}
E.g.
case class Foo(bar: String, bob: Int)
case class Alice(foo: Foo, opt: Option[String], opt2: Option[String])

scala> prettyfy(Alice(Foo("hello world", 10), Some("asdf"), None))
res6: String =
Alice(
  foo = Foo(
    bar = "hello world",
    bob = 10
  ),
  opt = Some("asdf"),
  opt2 = None
)
If you use Apache Spark, you can use the following method to print your case classes:
def prettyPrint[T <: Product : scala.reflect.runtime.universe.TypeTag](c: T) = {
  import play.api.libs.json.Json
  println(Json.prettyPrint(Json.parse(Seq(c).toDS().toJSON.head)))
}
This gives a nicely formatted JSON representation of your case class instance. Make sure sparkSession.implicits._ is imported.
example:
case class Adress(country:String,city:String,zip:Int,street:String)
case class Person(name:String,age:Int,adress:Adress)
val person = Person("Peter",36,Adress("Switzerland","Zürich",9876,"Bahnhofstrasse 69"))
prettyPrint(person)
gives:
{
  "name" : "Peter",
  "age" : 36,
  "adress" : {
    "country" : "Switzerland",
    "city" : "Zürich",
    "zip" : 9876,
    "street" : "Bahnhofstrasse 69"
  }
}
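For completeness, a minimal sketch of the setup this assumes (the app name and master are my own choices):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pretty-print-example")
  .master("local[*]")
  .getOrCreate()

// brings the toDS() syntax used by prettyPrint into scope
import spark.implicits._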
I would suggest using the same printer that is used in the assertEquals of your test framework of choice. I was using Scalameta's MUnit, and munit.Assertions.munitPrint(clue: => Any): String does the trick. I can pass nested case classes to it and see the whole tree with proper indentation.
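A rough usage sketch, assuming munitPrint is available inside an MUnit suite as described (the case classes are my own example):

import munit.FunSuite

case class VarDecl(name: String, tpe: String)
case class ClassDecl(name: String, fields: List[VarDecl])

class PrintSuite extends FunSuite {
  test("pretty print a nested case class") {
    val decl = ClassDecl("Complex", List(VarDecl("Real", "float"), VarDecl("Imag", "float")))
    // munitPrint comes from munit.Assertions, which FunSuite mixes in
    println(munitPrint(decl))
  }
}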

Scala: how to traverse stream/iterator collecting results into several different collections

I'm going through a log file that is too big to fit into memory, collecting two types of expressions. What is a better functional alternative to my iterative snippet below?
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines: Iterator[String] = io.Source.fromFile(file).getLines()
  val logins: mutable.Map[String, String] = new mutable.HashMap[String, String]()
  val errors: mutable.ListBuffer[(String, String)] = mutable.ListBuffer.empty

  for (line <- lines) {
    line match {
      case errorPat(date, ip)           => errors.append((ip, date))
      case loginPat(date, user, ip, id) => logins.put(ip, id)
      case _                            => ""
    }
  }

  errors.toList.map(line => (logins.getOrElse(line._1, "none") + " " + line._1, line._2))
}
Here is a possible solution:
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines = Source.fromFile(file).getLines
  val (err, log) = lines.collect {
    case errorPat(inf, ip)      => (Some((ip, inf)), None)
    case loginPat(_, _, ip, id) => (None, Some((ip, id)))
  }.toList.unzip
  val ip2id = log.flatten.toMap
  err.collect { case Some((ip, inf)) => (ip2id.getOrElse(ip, "none") + " " + ip, inf) }
}
Corrections:
1) removed unnecessary types declarations
2) tuple deconstruction instead of ugly ._1
3) left fold instead of mutable accumulators
4) used more convenient operator-like methods :+ and +
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines = io.Source.fromFile(file).getLines()

  val (logins, errors) =
    ((Map.empty[String, String], Seq.empty[(String, String)]) /: lines) {
      case ((loginsAcc, errorsAcc), next) =>
        next match {
          case errorPat(date, ip)           => (loginsAcc, errorsAcc :+ (ip -> date))
          case loginPat(date, user, ip, id) => (loginsAcc + (ip -> id), errorsAcc)
          case _                            => (loginsAcc, errorsAcc)
        }
    }

  // more concise equivalent for
  // errors.toList.map { case (ip, date) => (logins.getOrElse(ip, "none") + " " + ip) -> date }
  for ((ip, date) <- errors.toList)
    yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
I have a few suggestions:
Instead of a pair/tuple, it's often better to use your own class. It gives meaningful names to both the type and its fields, which makes the code much more readable.
Split the code into small parts. In particular, try to decouple pieces of code that don't need to be tied together. This makes your code easier to understand, more robust, less prone to errors and easier to test. In your case it'd be good to separate producing your input (lines of a log file) and consuming it to produce a result. For example, you'd be able to make automatic tests for your function without having to store sample data in a file.
As an example and exercise, I tried to make a solution based on Scalaz iteratees. It's a bit longer (includes some auxiliary code for IteratorEnumerator) and perhaps it's a bit overkill for the task, but perhaps someone will find it helpful.
import java.io._
import scala.util.matching.Regex

import scalaz._
import scalaz.IterV._

object MyApp extends App {

  // A type for the result. Having names keeps things
  // clearer and shorter.
  type LogResult = List[(String, String)]

  // Represents a state of our computation. Not only does it
  // give a name to the data, we can also put here
  // functions that modify the state. This nicely
  // separates what we're computing and how.
  sealed case class State(
    logins: Map[String, String],
    errors: Seq[(String, String)]
  ) {
    def this() = {
      this(Map.empty[String, String], Seq.empty[(String, String)])
    }

    def addError(date: String, ip: String): State =
      State(logins, errors :+ (ip -> date))

    def addLogin(ip: String, id: String): State =
      State(logins + (ip -> id), errors)

    // Produce the final result from accumulated data.
    def result: LogResult =
      for ((ip, date) <- errors.toList)
        yield (logins.getOrElse(ip, "none") + " " + ip) -> date
  }

  // An iteratee that consumes lines of our input. Based
  // on the given regular expressions, it produces an
  // iteratee that parses the input and uses State to
  // compute the result.
  def logIteratee(errorPat: Regex, loginPat: Regex): IterV[String, List[(String, String)]] = {

    // Consumes a single line.
    def consume(line: String, state: State): State =
      line match {
        case errorPat(date, ip)           => state.addError(date, ip)
        case loginPat(date, user, ip, id) => state.addLogin(ip, id)
        case _                            => state
      }

    // The core of the iteratee. Every time we consume a
    // line, we update our state. When done, compute the
    // final result.
    def step(state: State)(s: Input[String]): IterV[String, LogResult] =
      s(el = line => Cont(step(consume(line, state))),
        empty = Cont(step(state)),
        eof = Done(state.result, EOF[String]))

    // Return the iteratee waiting for its first input.
    Cont(step(new State()))
  }

  // Converts an iterator into an enumerator. This
  // should probably be moved to Scalaz.
  // Adapted from scalaz.ExampleIteratee
  implicit val IteratorEnumerator = new Enumerator[Iterator] {
    @annotation.tailrec def apply[E, A](e: Iterator[E], i: IterV[E, A]): IterV[E, A] = {
      val next: Option[(Iterator[E], IterV[E, A])] =
        if (e.hasNext) {
          val x = e.next()
          i.fold(done = (_, _) => None, cont = k => Some((e, k(El(x)))))
        } else
          None
      next match {
        case None           => i
        case Some((es, is)) => apply(es, is)
      }
    }
  }

  // main ---------------------------------------------------
  {
    // Read a file as an iterator of lines:
    // val lines: Iterator[String] =
    //   io.Source.fromFile("test.log").getLines()

    // Create our testing iterator:
    val lines: Iterator[String] = Seq(
      "Error: 2012/03 1.2.3.4",
      "Login: 2012/03 user 1.2.3.4 Joe",
      "Error: 2012/03 1.2.3.5",
      "Error: 2012/04 1.2.3.4"
    ).iterator

    // Create an iteratee.
    val iter = logIteratee("Error: (\\S+) (\\S+)".r,
                           "Login: (\\S+) (\\S+) (\\S+) (\\S+)".r)

    // Run the iteratee against the input
    // (the enumerator is implicit).
    println(iter(lines).run)
  }
}

Why is this Option transformed to a String? [Scala]

I'm still a Scala noob, and this confuses me:
import java.util.regex._

object NumberMatcher {
  def apply(x: String): Boolean = {
    val pat = Pattern.compile("\\d+")
    val matcher = pat.matcher(x)
    return matcher.find
  }

  def unapply(x: String): Option[String] = {
    val pat = Pattern.compile("\\d+")
    val matcher = pat.matcher(x)
    if (matcher.find) {
      return Some(matcher.group())
    }
    None
  }
}

object x {
  def main(args: Array[String]): Unit = {
    val strings = List("geo12", "neo493", "leo")
    for (string <- strings) {
      string match {
        case NumberMatcher(group) => println(group)
        case _ => println("no")
      }
    }
  }
}
I wanted to add pattern matching for strings containing digits (so I can learn more about pattern matching), and in unapply I decided to return an Option[String]. However, in the println in the NumberMatcher case, group is seen as a String and not as an Option. Can you shed some light? The output produced when this is run is:
12,493,no
Take a look at this example.
The unapply method returns a Some value if it succeeds in extracting one, otherwise None. So internally the
case NumberMatcher(group) => println(group)
invokes unapply and checks whether it returned a Some. If it did, the match succeeds and no Option type remains: the pattern match extracts the value wrapped inside the Option and binds it to group.
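Roughly, the case clause desugars into a call to unapply followed by a check of the returned Option (a simplified sketch, not the exact compiler output):

// case NumberMatcher(group) => println(group)
// behaves roughly like:
NumberMatcher.unapply(string) match {
  case Some(matched) => println(matched) // `group` is bound to the value inside the Some
  case None          => println("no")    // falls through to the next case, `case _`
}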