scala regex multiple integers - scala

I have the following string that I would like to match on: 1-10 employees.
Here is my regex: val regex = ("\\d+").r
The problem is that I'm trying to find a way to extract the matched numbers and determine which of the returned values is bigger.
Here is what I'm doing to process it:
def setMinAndMaxValue(currentCompany: CurrentCompany, matchIterator: Iterator[Regex.Match]): CurrentCompany = {
var max = 0
println(s"matchIterator - $matchIterator")
matchIterator.collect {
case regex(s: String) => println("found string")
case regex(IntConv(x)) =>
println("regex case")
if (x > max) max = x
}
val (minVal, maxVal) = rangesForMaxValue(max)
val newDetails = currentCompany.details.copy(minSize = Some(minVal), maxSize = Some(maxVal))
currentCompany.copy(details = newDetails)
}
object IntConv {
def unapply(s : String) : Option[Int] = Try {
Some(s.toInt)
}.toOption.flatten
}

I thought I was confused by your original question, then you clarified it with code and now I have no idea what you're trying to do.
To extract numbers from a string, try this.
val re = """(\d+)""".r
val nums = re.findAllIn(string_with_numbers).map(_.toInt).toList
Then you can just call nums.min and nums.max, and do whatever other number processing you need.
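Applied to the string from the question, the approach above could look like the following sketch. The object name SizeRange and the method minAndMax are made up for illustration, and it assumes every run of digits fits in an Int (toInt throws otherwise).

```scala
object SizeRange {
  // One or more consecutive digits; no capture group is needed for findAllIn
  private val Digits = """\d+""".r

  // Returns (smallest, largest) of the integers found in `text`,
  // or None when the text contains no digits at all.
  def minAndMax(text: String): Option[(Int, Int)] = {
    val nums = Digits.findAllIn(text).map(_.toInt).toList
    if (nums.isEmpty) None else Some((nums.min, nums.max))
  }
}
```

Returning an Option avoids calling min and max on an empty list, which would throw.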

Related

Cleaner way to find all indices of same value in Scala

I have a textfile like so
NameOne,2,3,3
NameTwo,1,0,2
I want to find the indices of the max value in each line in Scala. So the output of this would be
NameOne,1,2
NameTwo,2
I'm currently using the function below to do this but I can't seem to find a simple way to do this without a for loop and I'm wondering if there is a better method out there.
def findIndices(movieRatings: String): (String) = {
val tokens = movieRatings.split(",", -1)
val movie = tokens(0)
val ratings = tokens.slice(1, tokens.size)
val max = ratings.max
var indices = ArrayBuffer[Int]()
for (i<-0 until ratings.length) {
if (ratings(i) == max) {
indices += (i+1)
}
}
return movie + "," + indices.mkString(",")
}
This function is called as so:
val output = textFile.map(findIndices).saveAsTextFile(args(1))
Just starting to learn Scala so any advice would help!
You can zipWithIndex and use filter:
ratings.zipWithIndex
.filter { case (value, _) => value == max }
.map { case (_, index) => index }
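As a small self-contained sketch of the zipWithIndex idea: zipWithIndex produces (value, index) pairs, in that order, and collect can do the filter and map in one step. The names MaxIndices and indicesOfMax are hypothetical, and the ratings are assumed to be already parsed to Int.

```scala
object MaxIndices {
  // Returns the 0-based positions of every occurrence of the maximum value.
  def indicesOfMax(ratings: Seq[Int]): Seq[Int] =
    if (ratings.isEmpty) Seq.empty
    else {
      val max = ratings.max
      // zipWithIndex yields (value, index); keep the index where the value is the max
      ratings.zipWithIndex.collect { case (value, index) if value == max => index }
    }
}
```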
I noticed that your code doesn't actually produce the expected result from your example input. I'm going to assume that the example given is the correct result.
def findIndices(movieRatings :String) :String = {
val Array(movie, ratings @ _*) = movieRatings.split(",", -1)
val mx = ratings.maxOption //Scala 2.13.x
ratings.indices
.filter(x => mx.contains(ratings(x)))
.mkString(s"$movie,",",","")
}
Note that this doesn't address some of the shortcomings of your algorithm:
No comma allowed in movie name.
Only works for ratings from 0 to 9. No spaces allowed.
testing:
List("AA"
,"BB,"
,"CC,5"
,"DD,2,5"
,"EE,2,5, 9,11,5"
,"a,b,2,7").map(findIndices)
//res0: List[String] = List(AA, <-no ratings
// , BB,0 <-comma, no ratings
// , CC,0 <-one rating
// , DD,1 <-two ratings
// , EE,1,4 <-" 9" and "11" under valued
// , a,0 <-comma in name error
// )
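If ratings can have more than one digit or surrounding spaces, one possible variation is to compare them as numbers instead of as strings. This is only a sketch: the names RatingIndices and findIndicesNumeric are invented here, it assumes Scala 2.13 (toIntOption, maxOption), and it still assumes the movie name contains no comma. Tokens that are not integers are kept in place but never count as the maximum.

```scala
object RatingIndices {
  def findIndicesNumeric(line: String): String = {
    val tokens = line.split(",", -1)
    val movie = tokens.head
    // Keep one entry per token so that indices stay aligned with the input
    val ratings: Seq[Option[Int]] = tokens.tail.toSeq.map(_.trim.toIntOption)
    val max: Option[Int] = ratings.flatten.maxOption
    // 0-based indices of the ratings equal to the numeric maximum
    val indices = ratings.indices.filter(i => max.nonEmpty && ratings(i) == max)
    (movie +: indices.map(_.toString)).mkString(",")
  }
}
```

With this version the line "EE,2,5, 9,11,5" yields EE,3, because 11 is now the largest rating.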

Scala conditional accumulation

I'm trying to implement a function that extracts "placeholders" delimited by the $ character from a given string.
Processing the string:
val stringToParse = "ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored"
the result should be:
Seq("aaa", "bbb")
What would be an idiomatic Scala alternative to the following implementation, which uses a var to toggle accumulation?
import fiddle.Fiddle, Fiddle.println
import scalajs.js
import scala.collection.mutable.ListBuffer
@js.annotation.JSExportTopLevel("ScalaFiddle")
object ScalaFiddle {
// $FiddleStart
val stringToParse = "ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored"
class StringAccumulator {
val accumulator: ListBuffer[String] = new ListBuffer[String]
val sb: StringBuilder = new StringBuilder("")
var open:Boolean = false
def next():Unit = {
if (open) {
accumulator.append(sb.toString)
sb.clear
open = false
} else {
open = true
}
}
def accumulateIfOpen(charToAccumulate: Char):Unit = {
if (open) sb.append(charToAccumulate)
}
def get(): Seq[String] = accumulator.toList
}
def getPlaceHolders(str: String): Seq[String] = {
val sac = new StringAccumulator
str.foreach(chr => {
if (chr == '$') {
sac.next()
} else {
sac.accumulateIfOpen(chr)
}
})
sac.get
}
println(getPlaceHolders(stringToParse))
// $FiddleEnd
}
I'll present two solutions to you. The first is the most direct translation of what you've done. In Scala, if you hear the word accumulate it usually translates to a variant of fold or reduce.
def extractValues(s: String) =
{
// We can combine the functionality of your boolean and StringBuilder by using an Option
s.foldLeft[(ListBuffer[String],Option[StringBuilder])]((new ListBuffer[String], Option.empty))
{
//As we fold through, we have the accumulated list, possibly a partially built String and the current letter
case ((accumulator,sbOption),char) =>
{
char match
{
//This logic pretty much matches what you had, adjusted to work with the Option
case '$' =>
{
sbOption match
{
case Some(sb) =>
{
accumulator.append(sb.mkString)
(accumulator,None)
}
case None =>
{
(accumulator,Some(new StringBuilder))
}
}
}
case _ =>
{
sbOption.foreach(_.append(char))
(accumulator,sbOption)
}
}
}
}._1.map(_.mkString).toList
}
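For comparison, the same fold can be sketched with immutable state only, which removes the ListBuffer and StringBuilder. This is an alternative formulation, not the code above; the names FoldPlaceholders and extract are invented for the example.

```scala
object FoldPlaceholders {
  // State: (finished placeholders, Some(text collected so far) while inside a $...$ pair)
  def extract(s: String): List[String] =
    s.foldLeft((Vector.empty[String], Option.empty[String])) {
      case ((done, Some(current)), '$') => (done :+ current, None) // closing $
      case ((done, None), '$')          => (done, Some(""))        // opening $
      case ((done, open), c)            => (done, open.map(_ + c)) // append only when open
    }._1.toList
}
```

An unclosed trailing placeholder is simply dropped, because only the finished ones are returned.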
However, that seems pretty complicated, for what sounds like it should be a simple task. We can use regexes, but those are scary so let's avoid them. In fact, with a little bit of thought this problem actually becomes quite simple.
def extractValuesSimple(s: String) =
{
s.split("\\$", -1). //Split the string on the $ character; the -1 limit keeps trailing empty strings, so input ending in $ still works
dropRight(1). //Drops the rightmost item, to handle the case with an odd number of $
zipWithIndex.filter{case (str, index) => index % 2 == 1}. //Filter out all of the even indexed items, which will always be outside of the matching $
map{case (str, index) => str}.toList //Remove the indexes from the output
}
Is this solution enough?
scala> val stringToParse = "ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored"
stringToParse: String = ignore/me/$aaa$/once-again/ignore/me/$bbb$/still-to-be/ignored
scala> val P = """\$([^\$]+)\$""".r
P: scala.util.matching.Regex = \$([^\$]+)\$
scala> P.findAllIn(stringToParse).map{case P(s) => s}.toSeq
res1: Seq[String] = List(aaa, bbb)

Iterate and trim string based on condition in spark Scala

I have dataframe 'regexDf' like below
id,regex
1,(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
If the length of the regex exceeds some maximum length, for example 50, I want to remove the last text token from each '|'-separated part of the regex string for that id. In the data frame above, the regex for id 1 is longer than 50, so the last tokens 'text4(.*)' and 'text6(.*)' should be removed from its two parts. After that removal the regex for id 1 is still longer than 50, so the last tokens 'text3(.*)' and 'text5(.*)' should be removed as well. The final dataframe will be
id,regex
1,(.*)text1(.*)text2(.*)|(.*)text2(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
I am able to trim the last tokens using the following code
val reducedStr = regex.split("\\|").foldLeft(List[String]()) {
(regexStr,eachRegex) => {
regexStr :+ eachRegex.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
}
}.mkString("|")
I tried using a while loop to check the length and trim the text tokens on each iteration, but it is not working. I also want to avoid using var and a while loop. Is it possible to achieve this without a while loop?
val optimizeRegexString = udf((regex: String) => {
if(regex.length >= 50) {
var len = regex.length;
var resultStr: String = ""
while(len >= maxLength) {
val reducedStr = regex.split("|").foldLeft(List[String]()) {
(regexStr,eachRegex) => {
regexStr :+ eachRegex
.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
}
}.mkString("|")
len = reducedStr.length
resultStr = reducedStr
}
resultStr
} else {
regex
}
})
regexDf.withColumn("optimizedRegex", optimizeRegexString(col("regex")))
As per the suggestions from SathiyanS and Pasha, I changed the recursive method into a function.
def optimizeRegex(regexDf: DataFrame): DataFrame = {
val shrinkString= (s: String) => {
if (s.length > 50) {
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
extractedString
}
else s
}
def shrinkUdf = udf((regex: String) => shrinkString(regex))
regexDf.withColumn("regexString", shrinkUdf(col("regex")))
}
Now I am getting the exception "recursive value shrinkString needs type"
Error:(145, 39) recursive value shrinkString needs type
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"));
Recursion:
def shrink(s: String): String = {
if (s.length > 50)
shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
else s
}
It looks like the issue is with how the function is called, so here is some additional info.
Can be called as static function:
object ShrinkContainer {
def shrink(s: String): String = {
if (s.length > 50)
shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
else s
}
}
Link with dataframe:
def shrinkUdf = udf((regex: String) => ShrinkContainer.shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
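Drawbacks: only a basic example of the approach is provided. Some edge cases (the regexp does not contain "text", too many parts separated by "|", for example 100, etc.) have to be handled by the author of the question to avoid runaway recursion or a failure: when a part contains no "text", lastIndexOf returns -1 and substring(0, -1) throws a StringIndexOutOfBoundsException.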
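One way those edge cases might be guarded is sketched below. This is an assumption about what "resolved" could look like, not part of the original answer: the object name SafeShrink and the maxLength parameter are invented, parts without "text" are left untouched, and the recursion stops as soon as a pass no longer changes the string.

```scala
object SafeShrink {
  @scala.annotation.tailrec
  def shrink(s: String, maxLength: Int = 50): String =
    if (s.length <= maxLength) s
    else {
      val shorter = s.split("\\|").map { part =>
        val i = part.lastIndexOf("text")
        // Leave the part alone when there is no "text" token left to cut
        if (i >= 0) part.substring(0, i) else part
      }.mkString("|")
      // Nothing left to remove: return what we have instead of recursing forever
      if (shorter == s) s else shrink(shorter, maxLength)
    }
}
```

The result can still be longer than maxLength when no part has any token left, so callers may want to check the length afterwards.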
This is how I would do it.
First, a function for removing the last token from a regex:
def deleteLastToken(s: String): String =
s.replaceFirst("""[^)]+\(\.\*\)$""", "")
Then, a function that shortens the entire regex string by deleting the last token from all the |-separated fields:
def shorten(r: String) = {
val items = r.split("[|]").toSeq
val shortenedItems = items.map(deleteLastToken)
shortenedItems.mkString("|")
}
Then, for a given input regex string, create the stream of all the shortened strings you get by applying the shorten function repeatedly. This is an infinite stream, but it's lazily evaluated, so only as few elements as required will be actually computed:
val regex = "(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"
val allShortened = Stream.iterate(regex)(shorten)
Finally, you can treat allShortened as any other sequence. For solving our problem, you can drop all elements while they don't satisfy the length requirement, and then keep only the first one of the remaining ones:
val result = allShortened.dropWhile(_.length > 50).head
You can see all the intermediate values by printing some elements of allShortened:
allShortened.take(10).foreach(println)
// Prints:
// (.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
// (.*)text1(.*)text2(.*)text3(.*)|(.*)text2(.*)text5(.*)
// (.*)text1(.*)text2(.*)|(.*)text2(.*)
// (.*)text1(.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
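As the printed values show, the sequence eventually stops changing, so dropWhile would never finish if the limit were smaller than that final value. The sketch below is one possible way to guard against this; it uses LazyList, which replaces the deprecated Stream in Scala 2.13, and the names ShortenRegex and shortenTo are invented for the example.

```scala
object ShortenRegex {
  def deleteLastToken(s: String): String =
    s.replaceFirst("""[^)]+\(\.\*\)$""", "")

  def shorten(r: String): String =
    r.split("[|]").map(deleteLastToken).mkString("|")

  // First value that is short enough, or the fixed point if the limit is never reached
  def shortenTo(regex: String, maxLength: Int): String = {
    val steps = LazyList.iterate(regex)(shorten)
    steps.zip(steps.tail).collectFirst {
      case (current, next) if current.length <= maxLength || next == current => current
    }.get // safe: every pass either shortens the string or leaves it unchanged
  }
}
```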
Just to add to @pasha701's answer, here is the solution that works in Spark.
val df = sc.parallelize(Seq((1,"(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"),(2,"(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)"))).toDF("ID", "regex")
df.show()
//prints
+---+------------------------------------------------------------------------+
|ID |regex |
+---+------------------------------------------------------------------------+
|1 |(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)|
|2 |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*) |
+---+------------------------------------------------------------------------+
Now you can use @pasha701's shrink function in a udf
val shrink: String => String = (s: String) => if (s.length > 50) shrink(s.split("\\|").map(s => s.substring(0,s.lastIndexOf("text"))).mkString("|")) else s
def shrinkUdf = udf((regex: String) => shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
//prints
+---+---------------------------------------------+
|ID |regex |
+---+---------------------------------------------+
|1 |(.*)text1(.*)text2(.*)|(.*)text2(.*) |
|2 |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)|
+---+---------------------------------------------+

Modifying generic maps in Scala

I'm new to the Scala landscape after spending the last 10 years in Java and the last ~year in Groovy. Hi Scala!
For the life of me I can't seem to get the following code snippet to compile, and it's just complicated enough that the Google Gods aren't helping me.
I have a map that will contain Strings for keys and Lists of Tuples for values. The tuples will be a String-Long pair. In Groovy this would look like:
Map<String,List<Tuple2<String,Long>>> data = [:]
I need to be able to add and modify keys and values for this map. Specifically, I need to:
Add to the List of Tuples for existing keys
If a key doesn't exist, instantiate a new List of Tuples, and then add the key and list as a map entry
In Groovy this would look like:
Map<String,List<Tuple2<String,Long>>> data = [:]
def addData(String key, String message) {
Long currTime = System.currentTimeMillis()
Tuple2<String,Long> tuple = new Tuple2<String,Long>(message, currTime)
if(data.containsKey(key)) {
data[key] << tuple
} else {
data[key] = new ArrayList<Tuple2<String,Long>>()
data[key] << tuple
}
}
I'm trying to do this in Scala, albeit unsuccessfully.
My best attempt thus far:
object MapUtils {
// var data : Map[String,ListBuffer[(String,Long)]] = Map()
val data = collection.mutable.Map[String, ListBuffer[(String, Long)]]()
def addData(key : String, message : String) : Unit = {
val newTuple = (message, System.currentTimeMillis())
val optionalOldValue = data.get(key)
optionalOldValue match {
case Some(olderBufferList) => olderBufferList += newTuple
case None => data
.put(key, ListBuffer[(String, Long)](newTuple))
}
}
}
Complains with this compiler error on the case Some(olderBufferList) => olderBufferList += newTuple line:
value += is not a member of Any
Any ideas what I can do to get this compiling & working?
You are missing an import for ListBuffer. The following code works perfectly fine in 2.9.1 (tested on TryScala), 2.11.7 (tested on IDEOne) and 2.11.8. Note the only addition is the first line adding the import:
import collection.mutable.ListBuffer
object MapUtils {
// var data : Map[String,ListBuffer[(String,Long)]] = Map()
val data = collection.mutable.Map[String, ListBuffer[(String, Long)]]()
def addData(key : String, message : String) : Unit = {
val newTuple = (message, System.currentTimeMillis())
val optionalOldValue = data.get(key)
optionalOldValue match {
case Some(olderBufferList) => olderBufferList += newTuple
case None => data
.put(key, ListBuffer[(String, Long)](newTuple))
}
}
}
MapUtils.addData("123", "message 1")
MapUtils.addData("456", "message 2")
MapUtils.data
//=> Map(456 -> ListBuffer((message 2,1472925061065)), 123 -> ListBuffer((message 1,1472925060926)))
The short version for your needs will be:
val map = mutable.Map[String, ListBuffer[(String, Long)]]()
map.put(key, map.getOrElse(key, ListBuffer[(String, Long)]()) += ((message, System.currentTimeMillis())))
You have some syntax issues in your code. If I were to change addData, it would look like this:
def addData(key : String, message : String) : Unit = {
val newTuple = (message, System.currentTimeMillis())
val optionalOldValue = map.get(key)
optionalOldValue match {
case Some(olderBufferList) => olderBufferList += newTuple
case None => map.put(key, ListBuffer[(String, Long)](newTuple))
}
}
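Another option, shown here only as a sketch, is getOrElseUpdate on the mutable map, which creates and stores the buffer when the key is missing and returns the existing one otherwise. The object name MessageLog is invented for the example.

```scala
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

object MessageLog {
  val data = mutable.Map.empty[String, ListBuffer[(String, Long)]]

  def addData(key: String, message: String): Unit = {
    // Looks up the buffer for `key`, inserting an empty one first if needed
    val buffer = data.getOrElseUpdate(key, ListBuffer.empty[(String, Long)])
    buffer += ((message, System.currentTimeMillis()))
  }
}
```

This replaces the explicit match on Option with a single lookup-or-insert call.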

How to deal with decimal number in scala

I have a file like this:
1 4.146846
2 3.201141
3 3.016736
4 2.729412
I want to use toDouble, but it's not working as expected:
val rows = textFile.map { line =>
val fields = line.split("[^\\d]+")
((fields(0),fields(1).toDouble))
}
val Num = rows.sortBy(- _._2).map{case (user , num) => num}.collect.mkString("::")
println(Num)
The result print out is 4.0::3.0::3.0::2.0.
What I expect is 4.146846::3.201141::3.016736::2.729412
How do I do this?
Your regular expression is stopping at the decimal point in 4.146846.
Try line.split("[^\\d.]+")
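To see the difference without Spark, here is a plain-collections sketch. The original pattern splits "1 4.146846" into 1, 4 and 146846, so fields(1) is "4"; splitting on whitespace keeps the decimal intact. The names DecimalRows, parse and sortedValues are invented, and toDoubleOption assumes Scala 2.13.

```scala
object DecimalRows {
  // Splits "id value" on whitespace; returns None for malformed lines
  def parse(line: String): Option[(String, Double)] =
    line.trim.split("\\s+") match {
      case Array(id, value) => value.toDoubleOption.map(d => (id, d))
      case _                => None
    }

  // Values sorted in descending order and joined with "::"
  def sortedValues(lines: Seq[String]): String =
    lines.flatMap(parse).map(_._2).sortBy(-_).mkString("::")
}
```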
What about splitting the lines on a variable number of whitespace characters? The regular expression would be '[\s]+'. This results in two parts per line: one integer string and one double string.
My whole program looks like:
object Application {
def parseDouble(s: String) =
try {
Some(s.toDouble)
} catch {
case _: NumberFormatException => None
}
def main(args: Array[String]): Unit = {
val linesIt = "1 3.201141\n2 4.146846\n3 3.016736\n4 2.729412".linesIterator
var doubles: List[Double] = List.empty
for (singleLine <- linesIt) {
val oneDouble = parseDouble(singleLine.split("[\\s]+")(1))
doubles = if (oneDouble != None)
oneDouble.get::doubles
else
doubles
}
val doublesArr = doubles.toArray
println("before sorting: " + doublesArr.mkString("::"))
scala.util.Sorting.quickSort(doublesArr)
println("after sorting: " + doublesArr.mkString("::"))
}
}