How to exclude a string from parsed text file using Scala?

A sample text file looks like this:
Date: Nov 12, 2004
Support_Addresses: Support@microsoft.com, suport@yahoo.com,
google@gmail.com,
support@comcast.net
Notes: Need to renew support contracts for software and services.
Expected output is:
Nov 12, 2004
Support@microsoft.com, suport@yahoo.com, google@gmail.com, support@comcast.net
Need to renew support contracts for software and services.
Basically, I need to exclude the field titles from the lines, so that things like "Date:", "Support_Addresses:" and "Notes:" are removed from the lines before they are saved to a CSV file. I have this code from other projects:
val support_agreements = lines
.dropWhile(line => !line.startsWith("Support_Addresses: "))
.takeWhile(line => !line.startsWith("Notes: "))
.flatMap(_.split(","))
.map(_.trim())
.filter(_.nonEmpty)
.mkString(", ")
But it does not remove the field titles/names. I am using startsWith, but it includes the field name. How can I exclude the field name from the line?

This should do it:
// linesIterator avoids the clash with java.lang.String.lines on JDK 11+
text.linesIterator.map { line =>
line.indexOf(':') match {
case x if x > 0 =>
line.substring(x + 1).trim
case _ => line.trim
}
}.mkString("\n")
It iterates through the lines; if a line contains a colon, it keeps only the trimmed text after the first colon (via substring), otherwise it keeps the trimmed line as it is.
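The expected output in the question also joins the continuation lines of Support_Addresses onto one line. A minimal sketch of that, assuming a field label is a run of letters or underscores followed by a colon at the start of a line (the names StripLabels, Labelled and stripLabels are made up for illustration):

```scala
object StripLabels {
  // Sample input modeled on the question, one element per line of the file.
  val lines: List[String] = List(
    "Date: Nov 12, 2004",
    "Support_Addresses: Support@microsoft.com, suport@yahoo.com,",
    "google@gmail.com,",
    "support@comcast.net",
    "Notes: Need to renew support contracts for software and services."
  )

  // Assumption: a label is letters/underscores followed by ':' at line start.
  private val Labelled = """^([A-Za-z_]+):\s*(.*)$""".r

  // Fold the lines into one value per field. A line without a label is
  // treated as a continuation of the previous field.
  def stripLabels(in: List[String]): List[String] =
    in.foldLeft(List.empty[String]) {
      case (acc, Labelled(_, value)) => value.trim :: acc
      case (prev :: rest, line)      => (prev + " " + line.trim) :: rest
      case (Nil, line)               => line.trim :: Nil
    }.reverse
}
```

StripLabels.stripLabels(StripLabels.lines).mkString("\n") then gives one line per field with the labels removed.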

Here's what I came up with. It builds a map of data m which could be manipulated usefully. This is then printed in the form you wanted.
def processValue(s: String): List[String] =
s.split(",").toList.map(_.trim).filterNot(_.isEmpty)
val retros = lines.foldLeft(List.empty[(String, List[String])]) {
case (acc, l) =>
l.indexOf(':') match {
case -1 =>
acc match {
case Nil => acc // ???
case h :: t => (h._1, h._2 ++ processValue(l)) :: t
}
case n =>
val key = l.substring(0, n).trim
val value = processValue(l.substring(n+1))
(key, value) :: acc
}
}
val m = retros.reverse.toMap
m.values.map(_.mkString(", ")).foreach(println)

Related

how to extract part of string that did not match pattern

I want to extract the part of a string that did not match a pattern.
My pattern matching condition is that the string should be of length 5 and should contain only N or Y.
Ex:
NYYYY => valid
NY => Invalid , length is invalid
NYYSY => Invalid. character at position 3 is invalid
If the string is invalid, I want to find out which particular character did not match. Ex: in NYYSY the 4th character did not match.
I tried pattern matching in Scala:
val Pattern = "([NY]{5})".r
paramList match {
case Pattern(c) => true
case _ => false
}
Returns a String indicating validation status.
def validate(str :String, len :Int, cs :Seq[Char]) :String = {
val checkC = cs.toSet
val errs = str.zipAll(Range(0,len), 1.toChar, -1).flatMap{ case (c,x) =>
if (x < 0) Some("too long")
else if (checkC(c)) None
else if (c == 1) Some("too short")
else Some(s"'$c' at index $x")
}
str + ": " + (if (errs.isEmpty) "valid" else errs.distinct.mkString(", "))
}
testing:
validate("NTYYYNN", 4, "NY") //res0: String = NTYYYNN: 'T' at index 1, too long
validate("NYC", 7, "NY") //res1: String = NYC: 'C' at index 2, too short
validate("YNYNY", 5, "NY") //res2: String = YNYNY: valid
Here's one approach that returns a list of (Char, Int) tuples of invalid characters and their corresponding positions in a given string:
def checkString(validChars: List[Char], validLength: Int, s: String) = {
val Pattern = s"([${validChars.mkString}]{$validLength})".r
s match {
case Pattern(_) => Vector.empty[(Char, Int)]
case s =>
val invalidList = s.zipWithIndex.filter{case (c, _) => !validChars.contains(c)}
if (invalidList.nonEmpty) invalidList else Vector(('\u0000', -1))
}
}
List("NYYYY", "NY", "NNSYYTN").map(checkString(List('N', 'Y'), 5, _))
// res1: List(Vector(), Vector((?,-1)), Vector((S,2), (T,5)))
As shown above, an empty list represents a valid string and a list of (null-char, -1) means the string has valid characters but invalid length.
Here is one suggestion which might suit your needs:
"NYYSY".split("(?<=[^NY])|(?=[^NY])").foreach(println)
NYY
S
Y
This solution splits the input string at any point when either the preceding or following character is not a Y or a N. This places each island of valid and invalid characters as separate rows in the output.
You can use additional regular expressions to detect the specific issue:
val Pattern = "([NY]{5})".r
val TooLong = "([NY]{5})(.+)".r
val WrongChar = "([NY]*)([^NY].*)".r
paramList match {
case Pattern(c) => // Good
case TooLong(head, rest) => // Extra character(s) in sequence
case WrongChar(head, rest) => // Wrong character in sequence
case _ => // Too short
}
You can work out the index of the error using head.length and the failing character is rest.head.
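As a rough sketch of how those branches could be filled in (the object name, the diagnose function and the message wording are my own, not part of the answer above):

```scala
object Diagnose {
  private val Valid     = "([NY]{5})".r
  private val TooLong   = "([NY]{5})(.+)".r
  private val WrongChar = "([NY]*)([^NY].*)".r

  // Returns None for a valid string, otherwise a description of the first problem.
  // head.length is the index of the offending character, rest.head is the character.
  def diagnose(s: String): Option[String] = s match {
    case Valid(_)              => None
    case TooLong(head, rest)   => Some(s"too long: unexpected '${rest.head}' at index ${head.length}")
    case WrongChar(head, rest) => Some(s"invalid character '${rest.head}' at index ${head.length}")
    case _                     => Some(s"too short: length ${s.length}")
  }
}
```

Note that the order of the cases matters: a string with five valid characters followed by anything else is reported as too long, even if the extra character is also not N or Y.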
You can achieve this by pattern matching on each character of the string, without using any sort of regex or complex string manipulation.
def check(value: String): Unit = {
if(value.length!=5) println(s"$value length is invalid.")
else value.foldLeft((0, Seq[String]())){
case (r, char) =>
char match {
case 'Y' | 'N' => r._1+1 -> r._2
case c @ _ => r._1+1 -> {r._2 ++ List(s"Invalid character `$c` in position ${r._1}")}
}
}._2 match {
case Nil => println(s"$value is valid.")
case errors: List[String] => println(s"$value is invalid - [${errors.mkString(", ")}]")
}
}
check("NYCNBNY")
NYCNBNY length is invalid.
check("NYCNB")
NYCNB is invalid - [Invalid character `C` in position 2, Invalid character `B` in position 4]
check("NYNNY")
NYNNY is valid.

How can I emit periodic results over an iteration?

I might have something like this:
val found = source.toCharArray.foreach{ c =>
// Process char c
// Sometimes (e.g. on newline) I want to emit a result to be
// captured in 'found'. There may be 0 or more captured results.
}
This shows my intent. I want to iterate over some collection of things. Whenever the need arises I want to "emit" a result to be captured in found. It's not a direct 1-for-1 like map. collect() is a "pull", applying a partial function over the collection. I want a "push" behavior, where I visit everything but push out something when needed.
Is there a pattern or collection method I'm missing that does this?
Apparently, you have a Collection[Thing], and you want to obtain a new Collection[Event] by emitting a Collection[Event] for each Thing. That is, you want a function
(Collection[Thing], Thing => Collection[Event]) => Collection[Event]
That's exactly what flatMap does.
You can write it down with nested fors where the second generator defines what "events" have to be "emitted" for each input from the source. For example:
val input = "a2ba4b"
val result = (for {
c <- input
emitted <- {
if (c == 'a') List('A')
else if (c.isDigit) List.fill(c.toString.toInt)('|')
else Nil
}
} yield emitted).mkString
println(result)
prints
A||A||||
because each 'a' emits an 'A', each digit emits the right amount of tally marks, and all other symbols are ignored.
There are several other ways to express the same thing, for example, the above expression could also be rewritten with an explicit flatMap and with a pattern match instead of if-else:
println(input.flatMap{
case 'a' => "A"
case d if d.isDigit => "|" * (d.toString.toInt)
case _ => ""
})
I think you are looking for a way to build a Stream for your condition. Streams are lazy and are computed only when required.
val sourceString = "sdfdsdsfssd\ndfgdfgd\nsdfsfsggdfg\ndsgsfgdfgdfg\nsdfsffdg\nersdff\n"
val sourceStream = sourceString.toCharArray.toStream
def foundStreamCreator( source: Stream[Char], emmitBoundaryFunction: Char => Boolean): Stream[String] = {
def loop(sourceStream: Stream[Char], collector: List[Char]): Stream[String] =
sourceStream.isEmpty match {
case true => collector.mkString.reverse #:: Stream.empty[String]
case false => {
val char = sourceStream.head
emmitBoundaryFunction(char) match {
case true =>
collector.mkString.reverse #:: loop(sourceStream.tail, List.empty[Char])
case false =>
loop(sourceStream.tail, char :: collector)
}
}
}
loop(source, List.empty[Char])
}
val foundStream = foundStreamCreator(sourceStream, c => c == '\n')
val foundIterator = foundStream.toIterator
foundIterator.next()
// res0: String = sdfdsdsfssd
foundIterator.next()
// res1: String = dfgdfgd
foundIterator.next()
// res2: String = sdfsfsggdfg
It looks like foldLeft to me:
val found = ((List.empty[String], "") /: source.toCharArray) {case ((agg, tmp), char) =>
if (char == '\n') (tmp :: agg, "") // <- emit
else (agg, tmp + char)
}._1
Where you keep collecting items in a temporary location and then emit it when you run into a character signifying something. Since I used List you'll have to reverse at the end if you want it in order.
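A slightly fuller sketch of the same fold, with the final reverse applied and with whatever follows the last newline emitted as well (that last part is my assumption about the desired behavior; the names are made up):

```scala
object EmitOnNewline {
  // `done` holds the emitted results (newest first),
  // `pending` holds the characters seen since the last emit.
  def emitLines(source: String): List[String] = {
    val (done, pending) = source.foldLeft((List.empty[String], "")) {
      case ((acc, tmp), '\n') => (tmp :: acc, "") // emit
      case ((acc, tmp), c)    => (acc, tmp + c)   // keep collecting
    }
    // Emit what is left after the last newline, then restore input order.
    val all = if (pending.isEmpty) done else pending :: done
    all.reverse
  }
}
```

For example, EmitOnNewline.emitLines("ab\ncd\nef") yields List("ab", "cd", "ef").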

Scala - Automatic Iterator inside pattern match

I have array data like this: [("Bob",5),("Andy",10),("Jim",7),...(x,y)].
How can I do pattern matching in Scala so that the cases are matched automatically based on the array data I have provided (instead of defining each case one by one)?
I mean, not like this (pseudocode):
val x = y.match {
case "Bob" => get and print Bob's Score
case "Andy" => get and print Andy's Score
..
}
but
val x = y.match {
case automatically defined by given Array => print each'score
}
Any ideas? Thanks in advance.
If printing and storing results in an array is your main concern, then the following will work well:
val ls = Array(("Bob",5),("Andy",10),("Jim",7))
ls.map({case (x,y) => println(y); y}) // print and store the score in an array
I'm a bit confused about the question. However, if you just wish to print all the data in the array, I would go about it like this:
val list = Array(("Foo",3),("Tom",3))
list.foreach{
case (name,score) =>
println(s"$name scored $score")
}
//output:
//Foo scored 3
//Tom scored 3
Consider
val xs = Array( ("Bob",5),("Andy",10),("Jim",7) )
for ( (name,n) <- xs ) println(s"$name scores $n")
and also
xs.foreach { t => println(s"${t._1} scores ${t._2}") }
xs.foreach { t => println(t._1 + " scores " + t._2) }
xs.foreach(println)
A simple way to print the contents of xs,
println( xs.mkString(",") )
where mkString creates a string out of xs and separates each item by a comma.
Miscellaneous notes
To illustrate pattern matching on Scala Array, consider
val x = xs match {
case Array( t @ ("Bob", _), _*) => println("xs starts with " + t._1)
case Array() => println("xs is empty")
case _ => println("xs does not start with Bob")
}
In the first case we extract the first tuple and ignore the rest. In the first tuple we match against the string "Bob" and ignore the second item. Moreover, we bind the first tuple to the name t, which is used in the printing where we refer to its first item.
The second case matches an empty array, and the last case covers everything not matched above.
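If the goal is to look up one person's score without writing a case per name, one possible sketch is collectFirst with a stable identifier pattern (the backticks make the pattern compare against the name parameter). The ScoreLookup object and scoreOf function are hypothetical names, not something from the question:

```scala
object ScoreLookup {
  val scores: Array[(String, Int)] = Array(("Bob", 5), ("Andy", 10), ("Jim", 7))

  // Find the score for `name`; None if the name is not in the array.
  def scoreOf(name: String): Option[Int] =
    scores.collectFirst { case (`name`, score) => score }
}
```

ScoreLookup.scoreOf("Andy") gives Some(10), and an unknown name gives None. For many lookups, scores.toMap would probably be the more natural choice.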

Split a list into a target element, and the rest of the list?

Let's say I have something like the following:
case class Thing(num: Int)
val xs = List(Thing(1), Thing(2), Thing(3))
What I'd like to do is separate the list into one particular value, and the rest of the list. The target value can be at any position in the list, or may not be present at all. The single value needs to be handled separately, after the other values are handled, so I can't simply use pattern matching.
What I have so far is this:
val (targetList, rest) = xs.partition(_.num == 2)
val targetEl = targetList match {
case x :: Nil => x
case _ => null
}
Is it possible to combine the two steps? Like
val (targetEl, rest) = xs.<some_method>
A note on handling order:
The reason that the target element must be handled last is that this is for use in a HTML template (Play framework). The other elements are looped through, and a HTML element is rendered for each. After that group of elements, another HTML element is created for the target element.
You can do it with pattern-matching in map, you just need multiple cases:
xs map {
case t @ Thing(1) => // do something with thing 1
case t => // do something with the other things
}
To handle the OP's extra requirements:
xs map {
case t @ Thing(num) if(num != 1) => // do something with things that are not "1"
case t => // do something with thing 1
}
The following produces a tuple of two lists, split by some condition.
case class Thing(num: Int)
val xs = List(Thing(1), Thing(2), Thing(3))
val partitioned = xs.foldLeft((List.empty[Thing], List.empty[Thing]))((x, y) => y match {
case t @ Thing(1) => (x._1, t :: x._2)
case t => (t :: x._1, x._2)
})
//(List(Thing(3), Thing(2)),List(Thing(1)))
Try this:
val (targetEl, rest) = (xs.head, xs.tail)
It works for a non-empty list; the Nil case must be handled separately.
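One way to cover the Nil case is to pattern match and return the head as an Option. This is only a sketch, and headAndRest is a made-up name:

```scala
object HeadAndRest {
  // Split a list into its first element (if any) and the remainder,
  // returning None for the head instead of throwing on an empty list.
  def headAndRest[A](xs: List[A]): (Option[A], List[A]) = xs match {
    case head :: tail => (Some(head), tail)
    case Nil          => (None, Nil)
  }
}
```

Note that this still only picks the first element, not a target at an arbitrary position.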
After some experimentation, I've come up with the following, which is almost what I'm looking for:
var (maybeTargetEl, rest) = xs
.foldLeft((Option.empty[Thing], List[Thing]())) { case ((opt, ls), x) =>
if (x.num == 1)
(Some(x), ls)
else
(opt, x :: ls)
}
The target value is still wrapped in a container, but at least it guarantees a single value.
After that I can do
rest map <some_method>
maybeTargetEl map <some_other_method>
If the order of the original list is important:
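import scala.collection.mutable.ListBuffer
var (maybeTargetEl, rest) = xs.
foldLeft((Option.empty[Thing], ListBuffer[Thing]())){ case ((opt, lb), x) =>
if (x.num == 1)
(Some(x), lb) // target found: keep the buffer unchanged
else
(opt, lb += x) // append, so the original order is preserved
} match {
case (opt, lb) => (opt, lb.toList)
}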
@evanjdooner Your solution with fold works if the target element is present only once. If you want to extract only one occurrence of the target element:
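def find[T](xs: List[T], target: T, prefix: List[T]): (T, List[T]) = xs match {
// backticks compare against the `target` parameter instead of binding a new variable
case `target` :: tail => (target, prefix.reverse ::: tail) // prefix was accumulated in reverse
case other :: tail => find(tail, target, other :: prefix)
case Nil => throw new Exception("Not found")
}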
val (el, rest) = find(xs, target, Nil)
Sorry, I can't add it as a comment.
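A non-throwing variant of the same idea can be sketched with span, which splits the list at the first element satisfying a predicate. The extract function and its curried signature are assumptions for illustration:

```scala
object ExtractFirst {
  // Pull out the first element matching `p`, keeping the remaining
  // elements in their original order. Returns None if nothing matches.
  def extract[A](xs: List[A])(p: A => Boolean): (Option[A], List[A]) = {
    val (before, from) = xs.span(x => !p(x))
    (from.headOption, before ++ from.drop(1))
  }
}
```

With the Thing example, ExtractFirst.extract(xs)(_.num == 2) would return the target wrapped in Some together with the other elements.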

Rewrite string modifications more functional

I'm reading lines from a file
for (line <- Source.fromFile("test.txt").getLines) {
....
}
I basically want to get a list of paragraphs in the end. If a line is empty, that starts as a new paragraph, and I might want to parse some keyword - value pairs in the future.
The text file contains a list of entries like this (or something similar, like an Ini file)
User=Hans
Project=Blow up the moon
The slugs are going to eat the mustard. // multiline possible!
They are sneaky bastards, those slugs.

User=....
And I basically want to have a List[Project] where Project looks something like
class Project (val User: String, val Name:String, val Desc: String) {}
And the Description is that big chunk of text that doesn't start with a <keyword>=, but can stretch over any number of lines.
I know how to do this in an iterative style. Just do a list of checks for the keywords, and populate an instance of a class, and add it to a list to return later.
But I think it should be possible to do this in proper functional style, possibly with match case, yield and recursion, resulting in a list of objects that have the fields User, Project and so on. The class used is known, as are all the keywords, and the file format is not set in stone either. I'm mostly trying to learn better functional style.
You're obviously parsing something, so it might be the time to use... a parser!
Since your language seems to treat line breaks as significant, you will need to refer to this question to tell the parser so.
Apart from that, a rather simple implementation would be
import scala.util.parsing.combinator.RegexParsers
case class Project(user: String, name: String, description: String)
object ProjectParser extends RegexParsers {
override val whiteSpace = """[ \t]+""".r
def eol : Parser[String] = """\r?\n""".r
def user: Parser[String] = "User=" ~> """[^\n]*""".r <~ eol
def name: Parser[String] = "Project=" ~> """[^\n]*""".r <~ eol
def description: Parser[String] = repsep("""[^\n]+""".r, eol) ^^ { case l => l.mkString("\n") }
def project: Parser[Project] = user ~ name ~ description ^^ { case a ~ b ~ c => Project(a, b, c) }
def projects: Parser[List[Project]] = repsep(project,eol ~ eol)
}
And how to use it:
val sample = """User=foo1
Project=bar1
desc1
desc2
desc3

User=foo
Project=bar
desc4 desc5 desc6
desc7 desc8 desc9"""
import scala.util.parsing.input._
val reader = new CharSequenceReader(sample)
val res = ProjectParser.parseAll(ProjectParser.projects, reader)
if(res.successful) {
print("Found projects: " + res.get)
} else {
print(res)
}
Another possible implementation (since this parser is rather simple), using recursion:
import scala.io.Source
case class Project(user: String, name: String, desc: String)
@scala.annotation.tailrec
def parse(source: Iterator[String], list: List[Project] = Nil): List[Project] = {
val emptyProject = Project("", "", "")
@scala.annotation.tailrec
def parseProject(project: Option[Project] = None): Option[Project] = {
if(source.hasNext) {
val line = source.next
if(!line.isEmpty) {
val splitted = line.span(_ != '=')
parseProject(splitted match {
case (h, t) if h == "User" => project.orElse(Some(emptyProject)).map(_.copy(user = t.drop(1)))
case (h, t) if h == "Project" => project.orElse(Some(emptyProject)).map(_.copy(name = t.drop(1)))
case _ => project.orElse(Some(emptyProject)).map(project => project.copy(desc = (if(project.desc.isEmpty) "" else project.desc ++ "\n") ++ line))
})
} else project
} else project
}
if(source.hasNext) {
parse(source, parseProject().map(_ :: list).getOrElse(list))
} else list.reverse
}
And the test:
object Test {
def source = Source.fromString("""User=Hans
Project=Blow up the moon
The slugs are going to eat the mustard. // multiline possible!
They are sneaky bastards, those slugs.

User=Plop
Project=SO
Some desc""")
def test = println(parse(source.getLines))
}
Which gives:
List(Project(Hans,Blow up the moon,The slugs are going to eat the mustard. // multiline possible!
They are sneaky bastards, those slugs.), Project(Plop,SO,Some desc))
To answer your question without also tackling keyword parsing, fold over the lines and aggregate lines unless it's an empty one, in which case you start a new empty paragraph.
lines.foldLeft(List("")) { (l, x) =>
if (x.isEmpty) "" :: l else (l.head + "\n" + x) :: l.tail
}.reverse
You'll notice this has some wrinkles in how it handles zero lines, and multiple and trailing empty lines. Adapt to your needs. Also, if you are concerned about the cost of string concatenation, you can collect the lines in a nested list and flatten at the end (using .map(_.mkString)); this is just to showcase the basic technique of folding a sequence not to a scalar but to a new sequence.
This builds a list in reverse order because list prepend (::) is more efficient than appending to l in each step.
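Here is a rough sketch of the nested-list variant, with one possible treatment of the wrinkles (empty paragraphs caused by repeated or trailing empty lines are simply dropped, which is my own choice and may not suit every case):

```scala
object Paragraphs {
  // Group lines into paragraphs, starting a new paragraph at every empty line.
  // Lines are prepended to inner lists, so no strings are concatenated until the end.
  def paragraphs(lines: Iterator[String]): List[String] =
    lines
      .foldLeft(List(List.empty[String])) { (acc, line) =>
        if (line.isEmpty) Nil :: acc
        else (line :: acc.head) :: acc.tail
      }
      .filter(_.nonEmpty)            // drop paragraphs produced by repeated or trailing empty lines
      .map(_.reverse.mkString("\n")) // restore line order inside each paragraph
      .reverse                       // restore paragraph order
}
```

Both levels are built by prepending, so both the inner lists and the outer list have to be reversed at the end.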
You're obviously building something, so you might want to try... a builder!
Like Jürgen, my first thought was to fold, where you're accumulating a result.
A mutable.Builder does the accumulation mutably, with a collection.generic.CanBuildFrom to indicate the builder to use to make a target collection from a source collection. You keep the mutable thing around just long enough to get a result. So that's my plug for localized mutability. Lest one assume that the path from List[String] to List[Project] is immutable.
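To make the "localized mutability" point concrete, here is a small sketch that only splits lines into paragraphs, using List.newBuilder and a StringBuilder that never escape the method. It is an illustration of the idea under my own naming, not the code this answer goes on to show:

```scala
object ParagraphBuilder {
  // The builder and the buffer are mutable, but callers only ever see
  // the immutable List[String] that comes out.
  def paragraphs(lines: Iterator[String]): List[String] = {
    val result  = List.newBuilder[String]
    val current = new StringBuilder
    // Move the buffered paragraph (if any) into the result.
    def flush(): Unit =
      if (current.nonEmpty) {
        result += current.toString
        current.clear()
      }
    for (line <- lines) {
      if (line.isEmpty) flush()
      else {
        if (current.nonEmpty) current += '\n'
        current ++= line
      }
    }
    flush()
    result.result()
  }
}
```

From the outside the function is pure: the same lines always give the same list, and nothing mutable leaks out.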
To the other fine answers (the ones with non-negative appreciation ratings), I would add that functional style means functional decomposition, and usually small functions.
If you're not using regex parsers, don't neglect regexes in your pattern matches.
And try to spare the dots. In fact, I believe that tomorrow is a Spare the Dots Day, and people with sensitivity to dots are advised to remain indoors.
case class Project(user: String, name: String, description: String)
trait Sample {
val sample = """
|User=Hans
|Project=Blow up the moon
|The slugs are going to eat the mustard. // multiline possible!
|They are sneaky bastards, those slugs.
|
|User=Bob
|I haven't thought up a project name yet.
|
|User=Greta
|Project=Burn the witch
|It's necessary to escape from the witch before
|we blow up the moon. I hope Hans sees it my way.
|Once we burn the bitch, I mean witch, we can
|wreak whatever havoc pleases us.
|""".stripMargin
}
object Test extends App with Sample {
val kv = "(.*?)=(.*)".r
def nonnully(s: String) = if (s == null) "" else s + " "
val empty = Project(null, null, null)
val (res, dummy) = ((List.empty[Project], empty) /: sample.linesIterator) { (acc, line) =>
val (sofar, cur) = acc
line match {
case kv("User", u) => (sofar, cur copy (user = u))
case kv("Project", n) => (sofar, cur copy (name = n))
case kv(k, _) => sys error s"Bad keyword $k"
case x if x.nonEmpty => (sofar, cur copy (description = s"${nonnully(cur.description)}$x"))
case _ if cur != empty => (cur :: sofar, empty)
case _ => (sofar, empty)
}
}
val ps = if (dummy == empty) res.reverse else (dummy :: res).reverse
Console println ps
}
The match can be mashed this way, too:
val (res, dummy) = ((List.empty[Project], empty) /: sample.linesIterator) {
case ((sofar, cur), kv("User", u)) => (sofar, cur copy (user = u))
case ((sofar, cur), kv("Project", n)) => (sofar, cur copy (name = n))
case ((sofar, cur), kv(k, _)) => sys error s"Bad keyword $k"
case ((sofar, cur), x) if x.nonEmpty => (sofar, cur copy (description = s"${nonnully(cur.description)}$x"))
case ((sofar, cur), _) if cur != empty => (cur :: sofar, empty)
case ((sofar, cur), _) => (sofar, empty)
}
Before the fold, it seemed simpler to do paragraphs first. Is that imperative thinking?
object Test0 extends App with Sample {
def grafs(ss: Iterator[String]): List[List[String]] = {
val (g, rest) = ss dropWhile (_.isEmpty) span (_.nonEmpty)
val others = if (rest.nonEmpty) grafs(rest) else Nil
g.toList :: others
}
def toProject(ss: List[String]): Project = {
var p = Project("", "", "")
for (line <- ss; parts = line split '=') parts match {
case Array("User", u) => p = p.copy(user = u)
case Array("Project", n) => p = p.copy(name = n)
case Array(k, _) => sys error s"Bad keyword $k"
case Array(text) => p = p.copy(description = s"${p.description} $text")
}
p
}
val ps = grafs(sample.linesIterator) map toProject
Console println ps
}
class Project (val User: String, val Name:String, val Desc: String) {}
object Project {
def apply(str: String): Project = {
val user = somehowFetchUserName(str)
val name = somehowFetchProjectName(str)
val desc = somehowFetchDescription(str)
new Project(user, name, desc)
}
}
val contents: Array[String] = Source.fromFile("test.txt").mkString.split("\\n\\n")
val list = contents map(Project(_))
This will end up with the list of projects. The somehowFetch... helpers are placeholders that still need to be implemented.