Scala: Transforming List of Strings containing long descriptions to list of strings containing only last sentences

I have a List[String], for example:
val test=List("this is, an extremely long sentence. Check; But. I want this sentence.",
"Another. extremely. long. (for eg. description). But I want this sentence.",
..)
I want the result to be like:
List("I want this sentence", "But I want this sentence"..)
I tried a few approaches, but they didn't work:
test.map(x=>x.split(".").reverse.head)
test.map(x=>x.split(".").last)

Try this (note that it takes the last sentence of the last string in the list only):
test.reverse.head.split("\\.").last
To handle any exception (this needs import scala.util.Try):
Try(List[String]().reverse.head.split("\\.").last).getOrElse("YOUR_DEFAULT_STRING")

You can map over your List, split each String and then take the last element. Try the code below.
val list = List("this is, an extremely long sentence. Check; But. I want this sentence.",
"Another. extremely. long. (for eg. description). But I want this sentence.")
list.map(_.split("\\.").last.trim)
It will give you
List(I want this sentence, But I want this sentence)

test.map (_.split("\\.").last)
split takes a regular expression, in which a dot matches any character, so you have to escape it.
Maybe you want to include question marks and bangs:
test.map (_.split("[!?.]").last)
and trim surrounding whitespace:
test.map (_.split("[!?.]").last.trim)
reverse.head would have been a good idea if there weren't already last:
scala> test.map (_.split("[!?.]").reverse.head.trim)
res138: List[String] = List(I want this sentence, But I want this sentence)
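To see why the attempts in the question fail, here is a small sketch (the object and value names are my own, not from the question). It compares the unescaped split with three ways of splitting on a literal full stop.

```scala
import java.util.regex.Pattern

object SplitOnDot {
  val text = "Check; But. I want this sentence."

  // As a regex, "." matches every character, so every piece is empty.
  // Java's split drops trailing empty strings, which leaves an empty array.
  val unescaped: Array[String] = text.split(".")

  // Three ways to split on a literal full stop instead.
  val escaped: Array[String] = text.split("\\.")
  val quoted: Array[String]  = text.split(Pattern.quote("."))
  val byChar: Array[String]  = text.split('.') // the Char overload is not treated as a regex
}
```

Because unescaped is empty, calling .last or .reverse.head on it throws instead of returning a sentence.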

You can do this a number of ways:
For each string in your original list: split by ., reverse the list, take the first value
test.map(_.split('.').reverse.headOption)
// List(Some( I want this sentence), Some( But I want this sentence))
.headOption results in Some("string") or None, and you can do something like a .getOrElse("no valid string found") on it. You can trim the unwanted whitespace if you want.
Regex match
test.map { sentence =>
val regex = ".*\\.\\s*([^.]*)\\.$".r
val regex(value) = sentence
value
}
This will fetch the text at the end of a longer string that is preceded by a full stop plus optional whitespace and followed by a final full stop. You can modify the regex to change the exact rules, and I recommend playing around with regex101.com if you fancy learning more regex. It's very good.
This solution is better for more complicated examples and requirements, but it's worth keeping in mind. If you are worried that the regex might not match, you can do something like checking if the regex matches before extracting it:
test.map { sentence =>
val regexString = ".*\\.\\s*([^.]*)\\.$"
val regex = regexString.r
if(sentence.matches(regexString)) {
val regex(value) = sentence
value
} else ""
}
Take the last after splitting the string by .
test.map(_.split('.').map(_.trim).lastOption)
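If you go the regex route, a pattern match with a fallback case avoids matching twice. This is only a sketch: the object and method names are hypothetical, and the pattern is the one used above, so it needs at least two full stops with the last one at the very end of the string.

```scala
object LastSentenceRegex {
  // Same pattern as above: capture the text between the last two full stops.
  private val LastSentence = ".*\\.\\s*([^.]*)\\.$".r

  // Returns None instead of throwing a MatchError when the pattern does not match.
  def lastSentence(sentence: String): Option[String] = sentence match {
    case LastSentence(value) => Some(value.trim)
    case _                   => None
  }
}
```

You could then use test.flatMap(LastSentenceRegex.lastSentence) to drop the strings that do not match, or test.map(s => LastSentenceRegex.lastSentence(s).getOrElse("")) to keep a placeholder for them.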

Related

Match capitalise and lower case

I'm trying to check whether a Scala sequence of strings contains "pear" with .contains("pear"). I can match pear, but is there a way to match regardless of case (for example "Pear") other than calling toLowerCase first or using a regex? This is what I have so far:
val fruits = Seq("apple", "PEAR")
fruits.map(_.toLowerCase).contains("pear")
Boolean = true
As sinanspd said
fruits.exists(_.equalsIgnoreCase("pear"))
This is better since rather than converting every element of fruits to lowercase, it only converts as many characters as needed to reject or match an element.
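A minimal sketch of that check wrapped in a helper (the object and method names are hypothetical):

```scala
object FruitLookup {
  // exists stops at the first element that matches, and equalsIgnoreCase
  // compares without building lowercased copies of the strings.
  def containsIgnoreCase(items: Seq[String], target: String): Boolean =
    items.exists(_.equalsIgnoreCase(target))
}
```

For example, FruitLookup.containsIgnoreCase(Seq("apple", "PEAR"), "pear") is true.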

How to determine if a string contains any element of a set

I have a sentence, and I want to determine if it contains any elements of a set.
val sentence = "Hello, today is a fine day to learn scala"
val mySet = Set("day", "scala")
What about:
mySet.exists(word => sentence.contains(word))
It will return true if at least one word from the set is present in the string.
Here's a solution that...
is case-insensitive ("scala" does match "Scala")
ignores sub-strings ("rat" does not match "rats")
ignores punctuation (!?,-) unless specifically specified in mySet
mySet.mkString("(?i)\\b(", "|", ")\\b")
.r.unanchored
.matches(sentence)
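A variation on the regex approach, as a sketch with hypothetical names. It quotes each word so that regex metacharacters in the set are taken literally, and it uses findFirstIn rather than Regex.matches, which I believe is only available from Scala 2.13.

```scala
import java.util.regex.Pattern

object WordMatcher {
  // Case-insensitive, whole-word match of any word in the set.
  def containsAnyWord(sentence: String, words: Set[String]): Boolean =
    if (words.isEmpty) false
    else {
      // Pattern.quote escapes regex metacharacters inside each word.
      val alternatives = words.map(w => Pattern.quote(w))
      val pattern = alternatives.mkString("(?i)\\b(?:", "|", ")\\b").r
      pattern.findFirstIn(sentence).isDefined
    }
}
```

Note that \b only helps for words that start and end with a word character; a set entry such as "c++" would need different boundary handling.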

Scala: How to split words using multiple delimiters

Suppose I have a text file like this:
Apple#mango&banana#grapes
The data needs to be split on multiple delimiters before performing the word count.
How to do that?
Use split method:
scala> "Apple#mango&banana#grapes".split("[#&#]")
res0: Array[String] = Array(Apple, mango, banana, grapes)
If you just want to count words, you don't need to split. Something like this will do:
val numWords = """\b\w""".r.findAllIn(string).length
This regex matches the start of each word: \b is a (zero-length) word boundary and \w is any "word" character (letter, digit or underscore). findAllIn returns all the matches in your string, and length counts them.
If you want to count each word separately, and do it across multiple lines, then split is probably the better option:
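source
.getLines
.flatMap(_.split("\\W+"))
.filterNot(_.isEmpty)
.toList // an Iterator has no groupBy, so materialize it first
.groupBy(identity)
.map { case (word, occurrences) => word -> occurrences.size }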
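Putting the two ideas together for the data in the question, a word count that splits only on # and & could look roughly like this. The object and method names are made up for illustration, and the lines could come from something like scala.io.Source.fromFile(path).getLines().

```scala
object WordCount {
  // Splits each line on '#' or '&' and counts how often each word occurs.
  def countWords(lines: Iterator[String]): Map[String, Int] =
    lines
      .flatMap(_.split("[#&]").toList)
      .filter(_.nonEmpty) // drop empty pieces, e.g. from "a##b"
      .toList             // materialize, since Iterator has no groupBy
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The count is case-sensitive here, so "Apple" and "apple" would be counted separately unless you normalize the words first.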

How to strip everything except digits from a string in Scala (quick one liners)

This is driving me nuts... there must be a way to strip out all non-digit characters (or perform other simple filtering) in a String.
Example: I want to turn a phone number ("+72 (93) 2342-7772" or "+1 310-777-2341") into a simple numeric String (not an Int), such as "729323427772" or "13107772341".
I tried "[\\d]+".r.findAllIn(phoneNumber) which returns an iterator and then I would have to recombine the matches into a String somehow... seems horribly wasteful.
I also came up with: phoneNumber.filter("0123456789".contains(_)) but that becomes tedious for other situations. For instance, removing all punctuation... I'm really after something that works with a regular expression so it has wider application than just filtering out digits.
Anyone have a fancy Scala one-liner for this that is more direct?
You can use filter, treating the string as a character sequence and testing the character with isDigit:
"+72 (93) 2342-7772".filter(_.isDigit) // res0: String = 729323427772
You can use replaceAll and Regex.
"+72 (93) 2342-7772".replaceAll("[^0-9]", "") // res1: String = 729323427772
Another approach, define the collection of valid characters, in this case
val d = '0' to '9'
and so for val a = "+72 (93) 2342-7772", filter on collection inclusion for instance with either of these,
for (c <- a if d.contains(c)) yield c
a.filter(d.contains)
a.collect{ case c if d.contains(c) => c }
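Since the question asks for something regex-based that generalizes beyond digits, here is a small sketch built on replaceAll with predefined character classes. The object and method names are hypothetical.

```scala
object Strip {
  // \D matches any non-digit character, so removing those keeps only the digits.
  def digitsOnly(s: String): String = s.replaceAll("\\D", "")

  // \p{Punct} is the POSIX punctuation class in java.util.regex.
  def withoutPunctuation(s: String): String = s.replaceAll("\\p{Punct}", "")
}
```

One difference to be aware of: as far as I know, _.isDigit also accepts non-ASCII Unicode digits, while \D in a Java regex is ASCII-based by default.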

Getting IndexOutOfBoundsException while searching for a substring

I have a string like
var word = "banana"
and sentences like
var sent = "the monkey is holding a banana which is yellow"
var sent1 = "banana!!"
I want to search banana in sent and then write to a file in the following way:
the monkey is holding a
banana
which is yellow
I'm doing it in the following way:
var before = sent.substring(0, sent.indexOf(word))
var after = sent.substring(sent.indexOf(word) + word.length)
println(before)
println(after)
This works fine but when I do the same for sent1, then it gives me IndexOutOfBoundsException. I think it is because there is nothing before banana in sent1. How to deal with this?
You can split based on the word and you will get an array with everything before and after the word.
val search = sent.split(word)
search: Array[String] = Array("the monkey is holding a ", " which is yellow")
This works in the "banana!!" case:
"banana!!".split(word)
res5: Array[String] = Array("", !!)
Now you can write the three lines to a file in your favorite way:
println(search(0))
println(word)
println(search(1))
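One case where substring does throw is when the word is not in the sentence at all: indexOf then returns -1 and substring(0, -1) is out of bounds. Indexing into the split result can also fail, because split drops trailing empty strings, so "a banana".split(word) has only one element. A sketch that guards against both (the object and method names are hypothetical):

```scala
object SplitAround {
  // Returns (before, word, after) for the first occurrence of word,
  // or None if the word does not occur in the sentence.
  def around(sentence: String, word: String): Option[(String, String, String)] = {
    val i = sentence.indexOf(word)
    if (i < 0) None
    else Some((sentence.substring(0, i), word, sentence.substring(i + word.length)))
  }
}
```

You can then pattern match on the result and write the three parts only when it is a Some.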
What if you had more than one occurrence of the word? .split understands regular expressions, so you could improve the previous solution with something like this:
string
.split("\\s+(?=banana)|(?<=banana)\\s+")
.foreach(println)
\\s means a whitespace character
(?=<word>) means "followed by <word>"
(?<=<word>) means "preceded by <word>"
So this splits your string into pieces at any whitespace that is either followed or preceded by "banana", not at the word itself. The actual word ends up in the result, just like the other parts of the string, so you don't need to print it out explicitly.
This regex trick is called "positive look-around" ( ?= is look-ahead, ?<= is look-behind) in case you are wondering.
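Here is the look-around split as a reusable sketch, with the word passed in as a parameter and quoted so it is matched literally. The object and method names are hypothetical.

```scala
import java.util.regex.Pattern

object LookaroundSplit {
  // Splits at whitespace that is directly followed or preceded by the word,
  // so the word itself is kept as its own piece.
  def pieces(sentence: String, word: String): List[String] = {
    val w = Pattern.quote(word) // treat the word literally inside the regex
    sentence.split(s"\\s+(?=$w)|(?<=$w)\\s+").toList
  }
}
```

Keep in mind that this only splits at whitespace, so in "banana!!" the word is not separated from the punctuation that follows it.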