I want to apply a list of regexes to a string. My current approach is not very functional.
My current code:
val stopWords = List[String](
"the",
"restaurant",
"bar",
"[^a-zA-Z -]"
)
def CanonicalName(name: String): String = {
var nameM = name
for (reg <- stopWords) {
nameM = nameM.replaceAll(reg, "")
}
nameM = nameM.replaceAll(" +", " ").trim
return nameM
}
I think this does what you're looking for.
def CanonicalName(name: String): String = {
val stopWords = List("the", "restaurant", "bar", "[^a-zA-Z -]")
stopWords.foldLeft(name)(_.replaceAll(_, "")).replaceAll(" +"," ").trim
}
'replaceAll' can replace part of a word. For example, "the thermal & barbecue restaurant" becomes "rmal becue". If what you want is "thermal barbecue", you can split the name first and then apply your stop-word rules word by word:
def isStopWord(word: String): Boolean = stopWords.exists(word.matches)
def CanonicalName(name: String): String =
name.replaceAll(" +", " ").trim.split(" ").filterNot(isStopWord).mkString(" ")
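As an alternative sketch, the partial-word problem can also be avoided by keeping `replaceAll` and anchoring the stop words with `\b` word boundaries. The object and method names below are illustrative and not from the original posts, and the stop words are assumed to be plain words without regex metacharacters.

```scala
// Illustrative sketch; StopWordsSketch and canonicalName are hypothetical names.
object StopWordsSketch {
  private val stopWords = List("the", "restaurant", "bar")

  // Builds \b(?:the|restaurant|bar)\b so that "the" no longer matches inside "thermal".
  private val stopWordPattern = stopWords.mkString("\\b(?:", "|", ")\\b")

  def canonicalName(name: String): String =
    name
      .replaceAll(stopWordPattern, "") // drop whole stop words only
      .replaceAll("[^a-zA-Z -]", "")   // drop characters outside letters, space and hyphen
      .replaceAll(" +", " ")           // collapse repeated spaces
      .trim
}
```

With this sketch, `StopWordsSketch.canonicalName("the thermal & barbecue restaurant")` gives "thermal barbecue".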
I have a function which should take in a long string and separate it into a list of strings where each list element is a sentence of the article. I am going to achieve this by splitting on space and then grouping the elements from that split according to the tokens which end with a dot:
def getSentences(article: String): List[String] = {
val separatedBySpace = article
.map((c: Char) => if (c == '\n') ' ' else c)
.split(" ")
val splitAt: List[Int] = Range(0, separatedBySpace.size)
.filter(i => endsWithDot(separatedBySpace(i))).toList
// TODO
}
I have separated the string on space, and I've found each index that I want to group the list on. But how do I now turn separatedBySpace into a list of sentences based on splitAt?
Example of how it should work:
article = "I like donuts. I like cats."
result = List("I like donuts.", "I like cats.")
PS: Yes, I know that my algorithm for splitting the article into sentences has flaws. I just want a quick, naive method to get the job done.
I ended up solving this by using recursion:
def getSentenceTokens(article: String): List[List[String]] = {
val separatedBySpace: List[String] = article
.replace('\n', ' ')
.replaceAll(" +", " ") // regex
.split(" ")
.toList
val splitAt: List[Int] = separatedBySpace.indices
.filter(i => ( i > 0 && endsWithDot(separatedBySpace(i - 1)) ) || i == 0)
.toList
groupBySentenceTokens(separatedBySpace, splitAt, List())
}
def groupBySentenceTokens(tokens: List[String], splitAt: List[Int], sentences: List[List[String]]): List[List[String]] = {
if (splitAt.size <= 1) {
if (splitAt.size == 1) {
sentences :+ tokens.slice(splitAt.head, tokens.size)
} else {
sentences
}
}
else groupBySentenceTokens(tokens, splitAt.tail, sentences :+ tokens.slice(splitAt.head, splitAt.tail.head))
}
val s: String = """I like donuts. I like cats
This is amazing"""
s.split("\\.|\n").map(_.trim).toList
//result: List[String] = List("I like donuts", "I like cats", "This is amazing")
To include the dots in the sentences:
val (a, b, _) = s.replace("\n", " ").split(" ")
.foldLeft((List.empty[String], List.empty[String], "")){
case ((temp, result, finalStr), word) =>
if (word.endsWith(".")) {
(List.empty[String], result ++ List(s"$finalStr${(temp ++ List(word)).mkString(" ")}"), "")
} else {
(temp ++ List(word), result, finalStr)
}
}
val result = b ++ List(a.mkString(" ").trim)
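// For the s above (no dot after "cats"): result = List("I like donuts.", "I like cats This is amazing")
// With a dot after "cats": result = List("I like donuts.", "I like cats.", "This is amazing")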
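For comparison, here is a minimal sketch that keeps the dots by splitting on a regex lookbehind instead of folding. It follows the same naive assumption as the question (every dot ends a sentence), and the object name is illustrative.

```scala
// Illustrative sketch; SentenceSplitSketch is a hypothetical name.
object SentenceSplitSketch {
  // "(?<=\\.) " matches a space that is preceded by a dot, without consuming the dot,
  // so each sentence keeps its trailing '.'.
  def getSentences(article: String): List[String] =
    article
      .replaceAll("\\s+", " ") // newlines and repeated blanks become a single space
      .trim
      .split("(?<=\\.) ")
      .toList
      .filter(_.nonEmpty)      // an empty article yields an empty list
}
```

Text without a terminating dot stays together as one trailing element, just as in the fold above.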
I want to remove the substring after the 'Y' character. The final output should be EBAY in all three cases below, but the output comes out as EBAYSK.
object AndreaTest extends SparkSessionWrapper {
def main(args: Array[String]): Unit = {
var string1= "EBAY-SK"
var string2= "EBAY SK"
var string3= "EBAY- SK"
val finalString1 = string1.replaceAll("[-]/' '", "")
val finalString2 = string2.replaceAll("[-]/' '", "")
val finalString3 = string3.replaceAll("[-]/' '", "")
println(finalString1)
println(finalString2)
println(finalString3)
}
}
Try this regex:
val strings = List("EBAY-SK", "EBAY SK", "EBAY- SK", "EBAY", "EBAYEBAY")
val pattern = """([^.]*?Y).*""".r
strings.foreach(a => pattern.findAllIn(a).matchData foreach {
m => println(a + " -> " + m.group(1))
})
output:
EBAY-SK -> EBAY
EBAY SK -> EBAY
EBAY- SK -> EBAY
EBAY -> EBAY
EBAYEBAY -> EBAY
Either use a working pattern, i.e. one that replaces the first - or \s (whitespace) plus everything following it:
val finalString1 = string1.replaceAll("[-\\s].*", "")
val finalString2 = string2.replaceAll("[-\\s].*", "")
val finalString3 = string3.replaceAll("[-\\s].*", "")
Or just use substring instead of replaceAll:
val finalString1 = string1.substring(0, 4)
val finalString2 = string2.substring(0, 4)
val finalString3 = string3.substring(0, 4)
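A third option, sketched here under the assumption that the wanted prefix always ends at the first '-' or whitespace character, is `takeWhile`. Unlike `substring(0, 4)` it does not depend on the prefix being exactly four characters long. The names are illustrative.

```scala
// Illustrative sketch; PrefixSketch and beforeSeparator are hypothetical names.
object PrefixSketch {
  // Keeps the leading characters up to (not including) the first '-' or whitespace.
  def beforeSeparator(s: String): String =
    s.takeWhile(c => c != '-' && !c.isWhitespace)
}
```

A string without any separator is returned unchanged.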
I have a requirement to concatenate two potentially empty address lines into one (with a space in between the two lines), but I need it to return a None if both address lines are None (this field is going into an Option[String] variable). The following command gets me what I want in terms of the concatenation:
Seq(myobj.address1, myobj.address2).flatten.mkString(" ")
But that gives me an empty string instead of a None in case address1 and address2 are both None.
This converts a single string to Option, converting it to None if it's either null or an empty-trimmed string:
(kudos to @Miroslav Machura for this simpler version)
Option(x).filter(_.trim.nonEmpty)
Alternative version, using collect:
Option(x).collect { case x if x.trim.nonEmpty => x }
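Applied to the two address lines from the question, the same idea might look like the sketch below. The parameters are hypothetical stand-ins for myobj.address1 and myobj.address2, and trimming each line first is an added assumption so that blank lines do not leave stray spaces.

```scala
// Illustrative sketch; AddressSketch and mergeAddressLines are hypothetical names.
object AddressSketch {
  def mergeAddressLines(address1: Option[String], address2: Option[String]): Option[String] = {
    // Drop None values and blank lines, then join what is left with a space.
    val joined = Seq(address1, address2).flatten.map(_.trim).filter(_.nonEmpty).mkString(" ")
    // An empty result becomes None, anything else is wrapped in Some.
    Option(joined).filter(_.nonEmpty)
  }
}
```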
Assuming:
val list1 = List(Some("aaaa"), Some("bbbb"))
val list2 = List(None, None)
Using plain Scala:
scala> Option(list1).map(_.flatten).filter(_.nonEmpty).map(_.mkString(" "))
res38: Option[String] = Some(aaaa bbbb)
scala> Option(list2).map(_.flatten).filter(_.nonEmpty).map(_.mkString(" "))
res39: Option[String] = None
Or using scalaz:
import scalaz._; import Scalaz._
scala> list1.flatten.toNel.map(_.toList.mkString(" "))
res35: Option[String] = Some(aaaa bbbb)
scala> list2.flatten.toNel.map(_.toList.mkString(" "))
res36: Option[String] = None
In Scala there is the Option[ T ] type, which is intended to eliminate various run-time problems caused by nulls.
Here is how you use Options. An Option[ T ] can hold one of two kinds of values: Some[ T ] or None.
// A nice string
var niceStr = "I am a nice String"
// A nice String option
var niceStrOption: Option[ String ] = Some( niceStr )
// A None option
var noneStrOption: Option[ String ] = None
Now coming to your part of problem:
// Let's say myobj.address1 and myobj.address2 were plain Strings. Then you would not need to flatten them, and this would work:
// var yourString = Seq(myobj.address1, myobj.address2).mkString(" ")
// But since both of them are Option[ String ], you have to flatten the Seq[ Option[ String ] ] into a Seq[ String ]:
var yourString = Seq(myobj.address1, myobj.address2).flatten.mkString(" ")
//So... what really happens when you flatten a Sequence[ Option[ String ] ] ?
// Lets say we have Sequence[ Option [ String ] ], like this
var seqOfStringOptions = Seq( Some( "dsf" ), None, Some( "sdf" ) )
print( seqOfStringOptions )
// List( Some(dsf), None, Some(sdf))
//Now... lets flatten it out...
var flatSeqOfStrings = seqOfStringOptions.flatten
print( flatSeqOfStrings )
// List( dsf, sdf )
// So... basically all those Option[ String ] which were None are ignored and only Some[ String ] are converted to Strings.
// So... that means if both address1 and address2 were None... your flattened list would be empty.
// Now what happens when we create a String out of an empty list of Strings...
var emptyStringList: List[ String ] = List()
var stringFromEmptyList = emptyStringList.mkString( " " )
print( stringFromEmptyList )
// ""
// So... you get an empty String
// Which means we are sure that yourString will always be a String... though it can be empty (ie - "").
// Now that we are sure that yourString will always be a String, we can use pattern matching to get our Option[ String ].
// Getting an appropriate Option for yourString
var yourRequiredOption: Option[ String ] = yourString match {
// In case yourString is "" give None.
case "" => None
// In case yourString is not "" give Some( yourString )
case someStringVal => Some( someStringVal )
}
You might also use the reduce method here:
val mySequenceOfOptions = Seq(myAddress1, myAddress2, ...)
mySequenceOfOptions.reduce[Option[String]] {
case(Some(soFar), Some(next)) => Some(soFar + " " + next)
case(None, next) => next
case(soFar, None) => soFar
}
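One caveat with reduce is that it throws an UnsupportedOperationException on an empty sequence. A minimal sketch of the same case analysis with foldLeft, which starts from None and therefore also handles the empty case, could look like this. The names are illustrative.

```scala
// Illustrative sketch; ReduceSketch, combine and joinAll are hypothetical names.
object ReduceSketch {
  // Same three cases as in the reduce version.
  private def combine(soFar: Option[String], next: Option[String]): Option[String] =
    (soFar, next) match {
      case (Some(a), Some(b)) => Some(a + " " + b)
      case (None, b)          => b
      case (a, None)          => a
    }

  // foldLeft starts from None, so an empty sequence yields None instead of throwing.
  def joinAll(lines: Seq[Option[String]]): Option[String] =
    lines.foldLeft(Option.empty[String])(combine)
}
```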
Here's a function that should solve the original problem.
def mergeAddresses(addr1: Option[String],
addr2: Option[String]): Option[String] = {
val str = s"${addr1.getOrElse("")} ${addr2.getOrElse("")}".trim // trim so a missing line leaves no stray space
if (str.isEmpty) None else Some(str)
}
The answer from @dk14 is incomplete: if list2 contains a Some(""), the result is Some("") rather than None, because flatten leaves List(""), which is non-empty and therefore passes the filter (ScalaFiddle link):
val list2 = List(None, None, Some(""))
// this yields Some()
println(Option(list2).map(_.flatten).filter(_.nonEmpty).map(_.mkString(" ")))
But it's close. You just need to ensure that an empty string is converted to None, so we combine it with @juanmirocks' answer (ScalaFiddle link):
val list1 = List(Some("aaaa"), Some("bbbb"))
val list2 = List(None, None, Some(""))
// yields Some(aaaa bbbb)
println(Option(list1.map(_.collect { case x if x.trim.nonEmpty => x }))
.map(_.flatten).filter(_.nonEmpty).map(_.mkString(" ")))
// yields None
println(Option(list2.map(_.collect { case x if x.trim.nonEmpty => x }))
.map(_.flatten).filter(_.nonEmpty).map(_.mkString(" ")))
I was looking for a helper function like the one below in the standard library but have not found one yet, so in the meantime I defined my own:
def string_to_Option(x: String): Option[String] = {
if (x.nonEmpty)
Some(x)
else
None
}
With the help of the above you can then write:
import scala.util.chaining.scalaUtilChainingOps
object TEST123 {
def main(args: Array[String]): Unit = {
val address1 = ""
val address2 = ""
val result =
Seq(
address1 pipe string_to_Option,
address2 pipe string_to_Option
).flatten.mkString(" ") pipe string_to_Option
println(s"The result is «${result}»")
// prints out: The result is «None»
}
}
With Scala 2.13:
Option.unless(address.isEmpty)(address)
For example:
val address = "foo"
Option.unless(address.isEmpty)(address) // Some("foo")
val address = ""
Option.unless(address.isEmpty)(address) // None
implicit class EmptyToNone(s: String):
  def toOption: Option[String] = if (s.isEmpty) None else Some(s)
Example:
scala> "".toOption
val res0: Option[String] = None
scala> "foo".toOption
val res1: Option[String] = Some(foo)
(tested with Scala 3.2.2)
The code below attempts to count the number of times "Apple" appears in an HTML file.
object Question extends App {
def validWords(fileSentancesPart: List[String], wordList: List[String]): List[Option[String]] =
fileSentancesPart.map(sentancePart => {
if (isWordContained(wordList, sentancePart)) {
Some(sentancePart)
} else {
None
}
})
def isWordContained(wordList: List[String], sentancePart: String): Boolean = {
for (word <- wordList) {
if (sentancePart.contains(word)) {
return true;
}
}
false
}
lazy val lines = scala.io.Source.fromFile("c:\\data\\myfile.txt" , "latin1").getLines.toList.map(m => m.toUpperCase.split(" ")).flatten
val vw = validWords(lines, List("APPLE")) .flatten.size
println("size is "+vw)
}
The count is 79 according to the Scala code, but when I open the file in a text editor, it finds 81 words containing "Apple". The search is case-insensitive. Can you spot where the bug is? (I'm assuming the bug is in my code and not the text editor!)
I've written a couple of tests, but the code seems to behave as expected in these simple use cases:
import scala.collection.mutable.Stack;
import org.scalatest.FlatSpec;
import org.scalatest._;
class ConvertTes extends FlatSpec {
"Valid words" should "be returned" in {
val fileWords = List("this" , "is" , "apple" , "applehere")
val wordList = List("apple") // renamed so it no longer shadows the validWords method
lazy val lines = scala.io.Source.fromFile("c:\\data\\myfile.txt" , "latin1").getLines.toList.map(m => m.toUpperCase.split(" ")).flatten
val l : List[String] = Question.validWords(fileWords, wordList).flatten
l.foreach(println)
}
"Entire line " should "be returned for matched word" in {
val fileWords = List("this" , "is" , "this apple is an" , "applehere")
val wordList = List("apple") // renamed so it no longer shadows the validWords method
val l : List[String] = Question.validWords(fileWords, wordList).flatten
l.foreach(println)
}
}
The HTML file being parsed (referred to as "c:\data\myfile.txt" in the code above):
https://drive.google.com/file/d/0B1TIppVWd0LSVG9Edl9OYzh4Q1U/view?usp=sharing
Any suggestions on alternatives to the code above are welcome.
I think my issue is as described in @Jack Leow's comment. For this code:
val fileWords = List("this", "is", "this appleisapple an", "applehere")
val wordList = List("apple") // renamed so it no longer shadows the validWords method
val l: List[String] = validWords(fileWords, wordList).flatten
println("size : " + l.size)
The size printed is 2, when it should be 3.
I think you should do the following:
def validWords(
fileSentancesPart: List[String],
wordList: List[String]): List[Option[String]] =
fileSentancesPart /* add flatMap */ .flatMap(_.tails)
.map(sentancePart => {
if (isWordContained(wordList, sentancePart)) {
Some(sentancePart)
} else {
None
}
})
def isWordContained(
wordList: List[String],
sentancePart: String): Boolean = {
for (word <- wordList) {
//if (sentancePart.contains(word)) {
if (sentancePart.startsWith(word)) { // use startsWith
return true;
}
}
false
}
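To see why tails together with startsWith counts every occurrence rather than every matching element, here is a small sketch on a single string. The names are illustrative, and the word is assumed to be non-empty.

```scala
// Illustrative sketch; TailsSketch and countOccurrences are hypothetical names.
object TailsSketch {
  // tails yields every suffix of the text: "abc", "bc", "c", "".
  // Each suffix that starts with the word marks one occurrence of it.
  def countOccurrences(text: String, word: String): Int =
    text.tails.count(_.startsWith(word))
}
```

For the sample list from the question, summing `countOccurrences(_, "apple")` over the four elements gives 3, because "this appleisapple an" contributes 2.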
You could use regular expressions with a Source iterator:
import scala.io.Source
val regex = "(?i)apple".r // (?i) makes the match case-insensitive
val count = Source.fromFile("/test.txt").getLines().map(regex.findAllIn(_).length).sum
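The same regex idea can be tried without a file. The sketch below takes the lines as an iterator, assumes the search should be case-insensitive as stated in the question, and quotes the word so that any regex metacharacters in it are matched literally. The names are illustrative.

```scala
import java.util.regex.Pattern

// Illustrative sketch; RegexCountSketch and countOccurrences are hypothetical names.
object RegexCountSketch {
  def countOccurrences(lines: Iterator[String], word: String): Int = {
    // (?i) turns on case-insensitive matching; Pattern.quote escapes the word.
    val regex = ("(?i)" + Pattern.quote(word)).r
    lines.map(line => regex.findAllIn(line).length).sum
  }
}
```

Note that regex matches do not overlap, which makes no difference for a word like "apple".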
How can I unit test console input in Scala with ScalaTest?
Code under test:
import java.io.InputStream
import scala.io.{BufferedSource, Codec}
object ConsoleAction {
def readInput(in: InputStream): List[String] = {
val bs = new BufferedSource(in)(Codec.default)
val l = bs.getLines()
l.takeWhile(_!="").toList
}
def main(args: Array[String]): Unit = {
val l = ConsoleAction.readInput(System.in)
println("--> "+l)
}
}
I'd like to test the readInput method.
A one-line input can be tested like this:
"Result list" should "has 1 element" in {
val input = "Hello\\n"
val is = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8))
assert(ConsoleAction.readInput(is).size===1)
}
... but how do I test multi-line input?
input line 1
input line 2
thx
Your problem lies with how you're escaping your newline. You're doing "\\n" rather than "\n". This test should pass.
"Result list" should "has 2 elements" in {
val input = "Hello\nWorld\n"
val is = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8))
assert(ConsoleAction.readInput(is).size===2)
}
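To make multi-line inputs harder to get wrong, the stream can be built from separate lines. The sketch below is self-contained: readInput mirrors the method from the question (using Source.fromInputStream instead of constructing a BufferedSource directly), and streamOf is a hypothetical test helper that is not part of the original code.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Illustrative sketch; ConsoleInputSketch and streamOf are hypothetical names.
object ConsoleInputSketch {
  // Same behaviour as readInput in the question: read lines until the first empty one.
  def readInput(in: InputStream): List[String] =
    Source.fromInputStream(in, "UTF-8").getLines().takeWhile(_ != "").toList

  // Joins the given lines with real newline characters and ends with a final newline.
  def streamOf(lines: String*): InputStream =
    new ByteArrayInputStream(lines.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8))
}
```

An empty string among the lines acts like the user pressing Enter on a blank line, so `readInput(streamOf("a", "b", "", "c"))` stops after "b".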