Cleaner way to find all indices of the same value in Scala

I have a text file like so:
NameOne,2,3,3
NameTwo,1,0,2
I want to find the indices of the max value in each line in Scala, so the output for this would be:
NameOne,1,2
NameTwo,2
I'm currently using the function below, but I can't seem to find a simple way to do this without a for loop, and I'm wondering if there is a better method out there.
import scala.collection.mutable.ArrayBuffer

def findIndices(movieRatings: String): String = {
  val tokens = movieRatings.split(",", -1)
  val movie = tokens(0)
  val ratings = tokens.slice(1, tokens.size)
  val max = ratings.max
  var indices = ArrayBuffer[Int]()
  for (i <- 0 until ratings.length) {
    if (ratings(i) == max) {
      indices += (i + 1)
    }
  }
  return movie + "," + indices.mkString(",")
}
This function is called like so:
val output = textFile.map(findIndices).saveAsTextFile(args(1))
Just starting to learn Scala so any advice would help!

You can use zipWithIndex and then filter. Note that zipWithIndex pairs each element with its index as (value, index):
ratings.zipWithIndex
  .filter { case (value, _) => value == max }
  .map { case (_, index) => index }
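Equivalently, collect does the filter and the map in one pass (a minor variant, not part of the original answer):
ratings.zipWithIndex
  .collect { case (value, index) if value == max => index }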

I noticed that your code doesn't actually produce the expected result from your example input. I'm going to assume that the example given is the correct result.
def findIndices(movieRatings: String): String = {
  val Array(movie, ratings @ _*) = movieRatings.split(",", -1)
  val mx = ratings.maxOption  // Scala 2.13.x
  ratings.indices
    .filter(x => mx.contains(ratings(x)))
    .mkString(s"$movie,", ",", "")
}
Note that this doesn't address some of the shortcomings of your algorithm:
No comma allowed in movie name.
Only works for ratings from 0 to 9. No spaces allowed.
Testing:
List("AA"
    ,"BB,"
    ,"CC,5"
    ,"DD,2,5"
    ,"EE,2,5, 9,11,5"
    ,"a,b,2,7").map(findIndices)
//res0: List[String] = List(AA,     <- no ratings
//                        , BB,0    <- comma, no ratings
//                        , CC,0    <- one rating
//                        , DD,1    <- two ratings
//                        , EE,1,4  <- " 9" and "11" undervalued
//                        , a,0     <- comma in name error
//                        )
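If you need ratings outside 0 to 9 (and tolerance for spaces), here is a sketch of mine, not from the answer above, that parses the ratings as Ints and drops malformed tokens:
import scala.util.Try

def findIndicesNumeric(movieRatings: String): String = {
  val tokens  = movieRatings.split(",", -1)
  val movie   = tokens.head
  // Parse each rating numerically; tokens that aren't valid Ints are dropped.
  val ratings = tokens.tail.flatMap(s => Try(s.trim.toInt).toOption)
  val mx      = ratings.maxOption  // Scala 2.13.x, as above
  val indices = ratings.indices.filter(i => mx.contains(ratings(i)))
  (movie +: indices.map(_.toString)).mkString(",")
}

findIndicesNumeric("EE,2,5, 9,11,5")  // "EE,3" -- 11 is now the max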

Related

Iterate and trim string based on condition in spark Scala

I have a dataframe 'regexDf' like below:
id,regex
1,(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
If the length of a regex exceeds some maximum, for example 50, then I want to remove the last text token from each part of the regex string (the parts separated by '|') for the exceeding id. In the above dataframe, the regex for id 1 is longer than 50, so the last tokens 'text4(.*)' and 'text6(.*)' should be removed from each part. Even after removing them, the regex string for id 1 is still longer than 50, so the now-last tokens 'text3(.*)' and 'text5(.*)' should be removed as well. The final dataframe will be:
id,regex
1,(.*)text1(.*)text2(.*)|(.*)text2(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
I am able to trim the last tokens once using the following code:
val reducedStr = regex.split("\\|").foldLeft(List[String]()) {
  (regexStr, eachRegex) => {
    regexStr :+ eachRegex.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
  }
}.mkString("|")
I tried using a while loop to check the length and trim the text tokens iteratively, but it is not working. Also, I want to avoid using var and while loops. Is it possible to achieve this without a while loop?
val optimizeRegexString = udf((regex: String) => {
  if (regex.length >= 50) {
    var len = regex.length
    var resultStr: String = ""
    while (len >= maxLength) {
      val reducedStr = regex.split("\\|").foldLeft(List[String]()) {
        (regexStr, eachRegex) => {
          regexStr :+ eachRegex
            .replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
        }
      }.mkString("|")
      len = reducedStr.length
      resultStr = reducedStr
    }
    resultStr
  } else {
    regex
  }
})
regexDf.withColumn("optimizedRegex", optimizeRegexString(col("regex")))
As per SathiyanS's and Pasha's suggestions, I changed the recursive method to a function.
def optimizeRegex(regexDf: DataFrame): DataFrame = {
  val shrinkString = (s: String) => {
    if (s.length > 50) {
      val extractedString: String = shrinkString(s.split("\\|")
        .map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
      extractedString
    }
    else s
  }
  def shrinkUdf = udf((regex: String) => shrinkString(regex))
  regexDf.withColumn("regexString", shrinkUdf(col("regex")))
}
Now I am getting a compile error: "recursive value shrinkString needs type".
Error:(145, 39) recursive value shrinkString needs type
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"));
Recursion:
def shrink(s: String): String = {
  if (s.length > 50)
    shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
  else s
}
It looks like the issue is with how the function is declared and called; here is some additional info.
It can be called as a static function:
object ShrinkContainer {
  def shrink(s: String): String = {
    if (s.length > 50)
      shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
    else s
  }
}
Link with dataframe:
def shrinkUdf = udf((regex: String) => ShrinkContainer.shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
Drawbacks: this is just a basic example of the approach. Some edge cases (a regex that does not contain "text", too many parts separated by "|", e.g. 100, etc.) have to be resolved by the author of the question to avoid an infinite recursion loop.
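One possible guard, as a sketch of mine and not part of this answer: stop when shrinking no longer changes the string, which also covers parts that don't contain "text":
def shrinkSafe(s: String): String =
  if (s.length <= 50) s
  else {
    val shrunk = s.split("\\|")
      .map(p => if (p.contains("text")) p.substring(0, p.lastIndexOf("text")) else p)
      .mkString("|")
    if (shrunk == s) s else shrinkSafe(shrunk)  // fixed point reached: stop recursing
  }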
This is how I would do it.
First, a function for removing the last token from a regex:
def deleteLastToken(s: String): String =
s.replaceFirst("""[^)]+\(\.\*\)$""", "")
Then, a function that shortens the entire regex string by deleting the last token from all the |-separated fields:
def shorten(r: String) = {
val items = r.split("[|]").toSeq
val shortenedItems = items.map(deleteLastToken)
shortenedItems.mkString("|")
}
Then, for a given input regex string, create the stream of all the shortened strings you get by applying the shorten function repeatedly. This is an infinite stream, but it's lazily evaluated, so only as many elements as required will actually be computed:
val regex = "(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"
val allShortened = Stream.iterate(regex)(shorten)
Finally, you can treat allShortened like any other sequence. To solve our problem, drop elements while they don't satisfy the length requirement, then take the first of the remaining ones:
val result = allShortened.dropWhile(_.length > 50).head
You can see all the intermediate values by printing some elements of allShortened:
allShortened.take(10).foreach(println)
// Prints:
// (.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
// (.*)text1(.*)text2(.*)text3(.*)|(.*)text2(.*)text5(.*)
// (.*)text1(.*)text2(.*)|(.*)text2(.*)
// (.*)text1(.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
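As a side note, in Scala 2.13 and later Stream is deprecated in favour of LazyList; the same approach should carry over unchanged:
val allShortened = LazyList.iterate(regex)(shorten)
val result = allShortened.dropWhile(_.length > 50).head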
Just to add to @pasha701's answer. Here is the solution that works in Spark.
val df = sc.parallelize(Seq((1,"(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"),(2,"(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)"))).toDF("ID", "regex")
df.show()
//prints
+---+------------------------------------------------------------------------+
|ID |regex |
+---+------------------------------------------------------------------------+
|1 |(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)|
|2 |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*) |
+---+------------------------------------------------------------------------+
Now you can use @pasha701's shrink function via a udf:
val shrink: String => String = (s: String) => if (s.length > 50) shrink(s.split("\\|").map(s => s.substring(0,s.lastIndexOf("text"))).mkString("|")) else s
def shrinkUdf = udf((regex: String) => shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
//prints
+---+---------------------------------------------+
|ID |regex |
+---+---------------------------------------------+
|1 |(.*)text1(.*)text2(.*)|(.*)text2(.*) |
|2 |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)|
+---+---------------------------------------------+

New to Scala, better/idiomatic way than reduceLeft to find max (key,val) in collection?

New-ish to Scala ... I am trying to find the best match from a collection of (key,value) pairs, where best match is defined as highest frequency. The method reduceLeft would be ideal, but the collection size may be smaller than 2 (1 or 0), so well-defined behavior for small collections is good.
Is there a more idiomatic scala approach to finding the max?
Other sources explained reduceLeft, which makes sense and reads well, but other approaches suggest different methods.
Is there a better way to extract the lone item from a collection of size=1?
Assume I have a map with some unknown number of values,
m:Map[String,Int]
val vm = m.filterNot { case (k, v) => k.equals("ignore") }
val size = vm.size
val best = if (size > 1) {
  val list = vm.map { case (k, v) => KeyCount(k, v) }
  list.reduceLeft(maxKey)
} else if (size == 1) {
  vm.toList(0)
  // another source has suggested vm.head as an alternative
} else {
  KeyCount("default", 0)
}
Where KeyCount and maxKey are declared as,
case class KeyCount(key: String, count: Long) {
  def max(a: KeyCount, z: KeyCount) = { if (a.count > z.count) a else z }
  def min(a: KeyCount, z: KeyCount) = { if (a.count < z.count) a else z }
}
val maxKey = (x: KeyCount, y: KeyCount) => if (x.count > y.count) x else y
reduceLeft works fine with lists of size 1. If count is always greater than 0 you can use foldLeft with the default case:
val list = vm.map{ case (k,v) => KeyCount(k,v) }
val best = list.foldLeft(KeyCount("default",0))(maxKey)
Otherwise simply use a condition with maxBy or reduceLeft:
val best = if (size > 0) {
  val list = vm.map { case (k, v) => KeyCount(k, v) }
  list.maxBy(_.count)
} else {
  KeyCount("default", 0)
}
Note that you can use maxBy on the original Map[String, Int], there is no need to convert the elements to KeyCount.
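For example (a sketch; maxByOption requires Scala 2.13+), working directly on the filtered map from the question:
val best: (String, Int) =
  vm.maxByOption { case (k, v) => v }.getOrElse("default" -> 0)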

scala regex multiple integers

I have the following string that I would like to match on: 1-10 employees.
Here is my regex statement:
val regex = ("\\d+").r
The problem is that I'm trying to find a way to extract the matched data and determine which returned value is bigger.
Here is what I'm doing to process it:
def setMinAndMaxValue(currentCompany: CurrentCompany, matchIterator: Iterator[Regex.Match]): CurrentCompany = {
  var max = 0
  println(s"matchIterator - $matchIterator")
  matchIterator.collect {
    case regex(s: String) => println("found string")
    case regex(IntConv(x)) =>
      println("regex case")
      if (x > max) max = x
  }
  val (minVal, maxVal) = rangesForMaxValue(max)
  val newDetails = currentCompany.details.copy(minSize = Some(minVal), maxSize = Some(maxVal))
  currentCompany.copy(details = newDetails)
}
object IntConv {
  def unapply(s: String): Option[Int] = Try {
    Some(s.toInt)
  }.toOption.flatten
}
I thought I was confused by your original question, then you clarified it with code and now I have no idea what you're trying to do.
To extract numbers from a string, try this.
val re = """(\d+)""".r
val nums = re.findAllIn(string_with_numbers).map(_.toInt).toList
Then you can just use nums.min, nums.max, and whatever number processing you need.
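For instance, for the "1-10 employees" input from the question (a quick sketch):
val nums = re.findAllIn("1-10 employees").map(_.toInt).toList
// nums: List[Int] = List(1, 10)
val (lower, upper) = (nums.min, nums.max)  // (1, 10)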

Apache Spark: dealing with Option/Some/None in RDDs

I'm mapping over an HBase table, generating one RDD element per HBase row. However, sometimes the row has bad data (throwing a NullPointerException in the parsing code), in which case I just want to skip it.
I have my initial mapper return an Option to indicate that it returns 0 or 1 elements, then filter for Some, then get the contained value:
// myRDD is RDD[(ImmutableBytesWritable, Result)]
val output = myRDD.
  map( tuple => getData(tuple._2) ).
  filter( { case Some(y) => true; case None => false } ).
  map( _.get )
  // ... more RDD operations with the good data
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    Some( ( id, ( List(x),
      // more stuff ...
    ) ) )
  } catch {
    case e: NullPointerException => {
      logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e)
      None
    }
  }
}
Is there a more idiomatic way to do this that's shorter? I feel like this looks pretty messy, both in getData() and in the map.filter.map dance I'm doing.
Perhaps a flatMap could work (generate 0 or 1 items in a Seq), but I don't want it to flatten the tuples I'm creating in the map function, just eliminate empties.
An alternative, and often overlooked, way would be to use collect(PartialFunction pf), which is meant to 'select' or 'collect' the specific elements of the RDD on which the partial function is defined.
The code would look like this:
import scala.util.{Success, Try}

val output = myRDD.
  map( tuple => getData(tuple._2) ).
  collect { case Success(tuple) => tuple }

def getData(r: Result): Try[(String, List[Long])] = Try {
  val key = r.getRow
  val id = Bytes.toString(key, 0, 11)
  val x = Long.MaxValue - Bytes.toLong(key, 11)
  (id, List(x))
}
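Note that RDD.collect(pf) with a partial function is a transformation that returns a new RDD; it is distinct from the parameterless collect(), which gathers the entire RDD back to the driver.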
If you change your getData to return a scala.util.Try then you can simplify your transformations considerably. Something like this could work:
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  val tr = util.Try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    ( id, ( List(x)
      // more stuff ...
    ) )
  }
  tr.failed.foreach(e => logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e))
  tr
}
Then your transform could start like so:
myRDD.
  flatMap(tuple => getData(tuple._2).toOption)
If your Try is a Failure, it will be turned into a None via toOption and then removed by the flatMap. From that point on, the transform only works with the successful cases, i.e. with the underlying type returned from getData, without any Option wrapping.
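Here is the same Try-to-Option pattern in plain Scala, outside Spark (a minimal sketch with made-up data):
import scala.util.Try
val raws = List("1", "x", "3")
val parsed = raws.flatMap(s => Try(s.toInt).toOption)
// parsed: List[Int] = List(1, 3) -- the unparseable "x" is silently dropped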
If you are ok with dropping the data then you can just use mapPartitions. Here is a sample:
import scala.util._
val mixedData = sc.parallelize(List(1,2,3,4,0))
mixedData.mapPartitions(x => {
  val foo = for (y <- x) yield {
    Try(1 / y)
  }
  for {
    goodVals <- foo.partition(_.isSuccess)._1
  } yield goodVals.get
})
If you want to see the bad values, then you can use an accumulator or just log as you have been.
Your code would look something like this:
val output = myRDD.
  mapPartitions( tupleIter => getCleanData(tupleIter) )
  // ... more RDD operations with the good data

def getCleanData(iter: Iterator[???]) = {
  val triedData = getDataInTry(iter)
  for {
    goodVals <- triedData.partition(_.isSuccess)._1
  } yield goodVals.get
}

def getDataInTry(iter: Iterator[???]) = {
  for (r <- iter) yield {
    Try {
      val key = r._2.getRow
      var id = "(unk)"
      var x = -1L
      id = Bytes.toString(key, 0, 11)
      x = Long.MaxValue - Bytes.toLong(key, 11)
      // ... more code that might throw exceptions
    }
  }
}

Tune Nested Loop in Scala

I was wondering if I can tune the following Scala code :
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] = {
  var listNoDuplicates: List[(Class1, Class2)] = Nil
  for (outerIndex <- 0 until listOfTuple.size) {
    if (outerIndex != listOfTuple.size - 1)
      for (innerIndex <- outerIndex + 1 until listOfTuple.size) {
        if (listOfTuple(outerIndex)._1.flag.equals(listOfTuple(innerIndex)._1.flag))
          listNoDuplicates = listOfTuple(outerIndex) :: listNoDuplicates
      }
  }
  listNoDuplicates
}
Usually, something looking like:
var accumulator: A = new A
for (b <- collection) {
  accumulator = update(accumulator, b)
}
val result = accumulator
can be converted into something like:
val result = collection.foldLeft( new A ){ (acc, b) => update(acc, b) }
So here we can first use a map to enforce the uniqueness of flags. Supposing the flag has a type F:
val result = listOfTuples.foldLeft( Map[F, (ClassA, ClassB)]() ) {
  ( map, tuple ) => map + ( tuple._1.flag -> tuple )
}
Then the remaining tuples can be extracted from the map and converted to a list:
val uniqList = map.values.toList
It will keep the last tuple encountered; if you want to keep the first one, replace foldLeft with foldRight and swap the arguments of the lambda.
Example:
case class ClassA( flag: Int )
case class ClassB( value: Int )
val listOfTuples =
List( (ClassA(1),ClassB(2)), (ClassA(3),ClassB(4)), (ClassA(1),ClassB(-1)) )
val result = listOfTuples.foldRight( Map[Int,(ClassA,ClassB)]() ) {
( tuple, map ) => map + ( tuple._1.flag -> tuple )
}
val uniqList = result.values.toList
//uniqList: List((ClassA(1),ClassB(2)), (ClassA(3),ClassB(4)))
Edit: If you need to retain the order of the initial list, use instead:
val uniqList = listOfTuples.filter( result.values.toSet )
This compiles, but as I can't test it it's hard to say if it does "The Right Thing" (tm):
def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] =
  (for {outerIndex <- 0 until listOfTuple.size
        if outerIndex != listOfTuple.size - 1
        innerIndex <- outerIndex + 1 until listOfTuple.size
        if listOfTuple(outerIndex)._1.flag == listOfTuple(innerIndex)._1.flag
  } yield listOfTuple(outerIndex)).reverse.toList
Note that you can use == instead of equals (use eq if you need reference equality).
BTW: https://codereview.stackexchange.com/ is better suited for this type of question.
Do not use indexed access with lists (like listOfTuple(i)); indexing into a List has very poor performance. So, here are some alternatives...
The easiest:
import scala.collection.immutable.SortedSet

def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] =
  SortedSet(listOfTuple: _*)(Ordering by (_._1.flag)).toList
This will preserve the last element of the list. If you want it to preserve the first element, pass listOfTuple.reverse instead. Because of the sorting, performance is, at best, O(nlogn). So, here's a faster way, using a mutable HashSet:
def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] = {
  // Produce a hash set to track the duplicates
  import scala.collection.mutable.HashSet
  val seen = HashSet[Flag]()
  // now fold
  listOfTuple.foldLeft(Nil: List[(Class1,Class2)]) {
    case (acc, el) =>
      val result = if (seen(el._1.flag)) acc else el :: acc
      seen += el._1.flag
      result
  }.reverse
}
One can avoid using a mutable HashSet in two ways:
Make seen a var, so that it can be updated.
Pass the set along with the list being created in the fold. The case then becomes:
case ((seen, acc), el) =>
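Completing that second option, a sketch (my completion, not part of the original answer; Flag stands for whatever type the flag field has):
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  listOfTuple.foldLeft((Set.empty[Flag], List.empty[(Class1, Class2)])) {
    case ((seen, acc), el) =>
      if (seen(el._1.flag)) (seen, acc)          // flag already seen: skip
      else (seen + el._1.flag, el :: acc)        // record flag, keep element
  }._2.reverse                                   // restore original order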