Iterating over a Scala list and treating the last iteration differently - scala

I have a list of objects that I am calling toString on, and would like to treat the last object differently as follows:
o1 = Array(...)
o2 = Array(...) // same length as o1
sb = new StringBuilder()
for (i <- 0 until o1.length)
  sb.append(o1(i).toString + " & " + o2(i).toString)
  // if not last iteration then append ", "
is there a simple way to write this in scala rather than checking value of i etc?

jwvh's answer is good.
Just to give another, pattern-matching version:
o1.zip(o2).map({case (item1, item2) => s"$item1 & $item2"}).mkString(", ")

Give this a try.
o1.zip(o2).map(t => s"${t._1} & ${t._2}").mkString(", ")
Zip the arrays together, turn each pair into the desired string, let mkString() insert the commas.
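Putting the zip/mkString approach together as a runnable sketch (the array contents here are made up for illustration):

```scala
// Sample arrays standing in for o1 and o2
val o1 = Array("a1", "a2", "a3")
val o2 = Array("b1", "b2", "b3")

// zip pairs elements positionally; mkString inserts the separator
// only between elements, so the last pair needs no special-casing
val result = o1.zip(o2).map { case (x, y) => s"$x & $y" }.mkString(", ")
println(result) // a1 & b1, a2 & b2, a3 & b3
```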

Scala: Convert a string to a string array with and without split, given that all special characters except "(" and ")" are allowed

I have an array
val a = "((x1,x2),(y1,y2),(z1,z2))"
I want to parse this into a scala array
val arr = Array(("x1","x2"),("y1","y2"),("z1","z2"))
Is there a way of doing this directly with an expr() equivalent?
If not, how would one do this using split?
Note: x1, x2, x3 etc. are strings and can contain special characters, so the key is to use the () delimiters to parse the data.
Code I munged from Dici and Bogdan Vakulenko:
val x2 = a.getString(1).trim.split("[()]").grouped(2).map(x => x(0).trim).toArray
val x3 = x2.drop(1) // the first group is always empty, don't know why
var jmap = new java.util.HashMap[String, String]()
for (i <- x3) {
  val index = i.lastIndexOf(",")
  val fv = i.slice(0, index)
  val lv = i.substring(index + 1).trim
  jmap.put(fv, lv)
}
This is still susceptible to "," appearing in the second string.
Actually, I think a regex is the most convenient way to solve this.
val a = "((x1,x2),(y1,y2),(z1,z2))"
val regex = "(\\((\\w+),(\\w+)\\))".r
println(
  regex.findAllMatchIn(a)
    .map(matcher => (matcher.group(2), matcher.group(3)))
    .toList
)
Note that I made some assumptions about the format:
no whitespaces in the string (the regex could easily be updated to fix this if needed)
always tuples of two elements, never more
empty string not valid as a tuple element
only alphanumeric characters allowed (this also would be easy to fix)
val a = "((x1,x2),(y1,y2),(z1,z2))"
a.replaceAll("[\\(\\) ]", "")
  .split(",")
  .grouped(2) // grouped, not sliding: sliding would produce overlapping pairs
  .map(x => (x(0), x(1)))
  .toArray
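A quick self-contained check of the split-and-group approach on the sample input (using grouped rather than sliding, since sliding would produce overlapping pairs):

```scala
val a = "((x1,x2),(y1,y2),(z1,z2))"

// strip parens and spaces, split on commas, then pair up
// consecutive elements with grouped
val arr = a.replaceAll("[() ]", "")
  .split(",")
  .grouped(2)
  .map(x => (x(0), x(1)))
  .toArray

println(arr.mkString(", ")) // (x1,x2), (y1,y2), (z1,z2)
```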

Efficiently counting occurrences of each character in a file - scala

I am new to Scala. I want the fastest way to get a map of the count of occurrences for each character in a text file. How can I do that? (I used groupBy, but I believe it is too slow.)
I think that groupBy() is probably pretty efficient, but it simply collects the elements, which means that counting them requires a 2nd traversal.
To count all Chars in a single traversal you'd probably need something like this.
val tally = Array.ofDim[Long](128) // assumes ASCII input; wider chars would be out of bounds
io.Source.fromFile("someFile.txt").foreach(tally(_) += 1)
An Array was used for its fast indexing. The index is the character that was counted.
tally('e') //res0: Long = 74
tally('x') //res1: Long = 1
You can do the following:
Read the file first:
val lines = scala.io.Source.fromFile("/Users/Al/.bash_profile").getLines.toSeq
You can then write a method that takes the lines read and counts the occurrences of a given character:
def getCharCount(c: Char, lines: Seq[String]) = {
  lines.foldLeft(0) { (acc, elem) =>
    elem.toSeq.count(_ == c) + acc
  }
}
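For input that is not guaranteed to be ASCII, a single-traversal count into an immutable Map also works; this sketch uses foldLeft on an in-memory string standing in for the file's contents:

```scala
val text = "hello world" // stand-in for the file's contents

// one pass over the characters, accumulating counts in a Map;
// works for any Char, not just ASCII indices
val counts = text.foldLeft(Map.empty[Char, Int].withDefaultValue(0)) {
  (acc, c) => acc.updated(c, acc(c) + 1)
}

println(counts('l')) // 3
println(counts('o')) // 2
```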

Scala: convert a String to a tuple and insert it into a list

code:
import scala.collection.mutable.ListBuffer

var tup = ""
var l1 = new ListBuffer[String]()
tup = ""
for (element1 <- tds) {
  tup += element1.text + "|"
}
l1 += tup
l1
Output:
ListBuffer(STANDINGS|CONFERENCE|OVERALL|, ACC|W-L|GB|PCT|W-L|PCT|STRK|, North Carolina|14-2|--|.875|29-5|.853|L1|, Duke|13-3|1|.813|27-6|.818|L1|)
Now this is a list of string. I want it to be a list of tuple.
You can't. The thing you're looking for (assuming you want to split on |) is not well-typed. You would get
ListBuffer(("Standings", "Conference", "Overall"), ("ACC", "W-L", "GB", ...), ...)
The first element would be Tuple3[String, String, String]. The second would be Tuple7[String, ... String], and ListBuffer, like all collections, can't have heterogeneous types. You can get a ListBuffer of arrays, though.
l1.map(_.split("\\|")) // split takes a regex, so the | must be escaped
I used List[List[String]] instead, and now I am able to refer to each element.
I am adding lists to a list like this:
(1::2::Nil)::(5::7::Nil)::Nil
Now my output is like this
List(List(STANDINGS, CONFERENCE, OVERALL), List(ACC, W-L, GB, PCT, W-L, PCT, STRK))

How to create a bigram from a text file with frequency count in Spark/Scala?

I want to take a text file and create bigrams of all words not separated by a dot ".", removing any special characters. I'm trying to do this using Spark and Scala.
This text:
Hello my Friend. How are
you today? bye my friend.
Should produce the following:
hello my, 1
my friend, 2
how are, 1
you today, 1
today bye, 1
bye my, 1
For each of the lines in the RDD, start by splitting based on '.'. Then tokenize each of the resulting substrings by splitting on ' '. Once tokenized, remove special characters with replaceAll and convert to lowercase. Each of these sublists can be converted with sliding to an iterator of string arrays containing bigrams.
Then, after flattening and converting the bigram arrays to strings with mkString as requested, get a count for each one with groupBy and mapValues.
Finally flatten, reduce, and collect the (bigram, count) tuples from the RDD.
val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))
rdd.map {
  // Split each line into substrings by periods
  _.split('.').map { substrings =>
    // Trim substrings and then tokenize on spaces
    substrings.trim.split(' ').
      // Remove non-alphanumeric characters, using Shyamendra's
      // clean replacement technique, and convert to lowercase
      map { _.replaceAll("""\W""", "").toLowerCase() }.
      // Find bigrams
      sliding(2)
  }.
  // Flatten, and map the bigrams to concatenated strings
  flatMap { identity }.map { _.mkString(" ") }.
  // Group the bigrams and count their frequency
  groupBy { identity }.mapValues { _.size }
}.
// Reduce to get a global count, then collect
flatMap { identity }.reduceByKey(_ + _).collect.
// Format and print
foreach { x => println(x._1 + ", " + x._2) }
you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1
In order to separate entire words from any punctuation marks, consider for instance
val words = text.split("\\W+")
which delivers in this case
Array[String] = Array(Hello, my, Friend, How, are, you, today, bye, my, friend)
Pairing words into tuples is more in line with the concept of a bigram, so consider for instance
for (Array(a, b, _*) <- words.sliding(2).toArray)
  yield (a.toLowerCase(), b.toLowerCase())
which yields
Array((hello,my), (my,friend), (friend,how), (how,are),
      (are,you), (you,today), (today,bye), (bye,my), (my,friend))
The answer by ohruunuruus conveys otherwise a concise approach.
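To also tally the pairs in plain Scala, one could group identical tuples; note that this word-level split does not respect sentence boundaries, so (friend,how) is counted too:

```scala
val text = "Hello my Friend. How are you today? bye my friend."
val words = text.split("\\W+")

// pair consecutive words, lowercased
val pairs = for (Array(a, b, _*) <- words.sliding(2).toArray)
  yield (a.toLowerCase, b.toLowerCase)

// count identical pairs by grouping them
val counts = pairs.groupBy(identity).map { case (p, ps) => (p, ps.length) }

println(counts(("my", "friend"))) // 2
println(counts(("hello", "my")))  // 1
```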
This should work in Spark:
def bigramsInString(s: String): Array[((String, String), Int)] = {
  s.split("""\.""")                     // split on .
    .map(_.split(" ")                   // split on space
      .filter(_.nonEmpty)               // remove empty string
      .map(_.replaceAll("""\W""", "")   // remove special chars
        .toLowerCase)
      .filter(_.nonEmpty)
      .sliding(2)                       // take continuous pairs
      .filter(_.size == 2)              // sliding can return partial
      .map { case Array(a, b) => ((a, b), 1) })
    .flatMap(x => x)
}

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))
rdd.map(bigramsInString)
  .flatMap(x => x)
  .countByKey // get result in driver memory as Map
  .foreach { case ((x, y), z) => println(s"${x} ${y}, ${z}") }
// my friend, 2
// how are, 1
// today bye, 1
// bye my, 1
// you today, 1
// hello my, 1

Functional Way of handling corner case in Folds

I have a list of nodes (String) that I want to convert into the following:
create X ({name:"A"}),({name:"B"}),({name:"B"}),({name:"C"}),({name:"D"}),({name:"F"})
Using a fold, I get everything with an extra "," at the end. I can remove that with a substring on the final String, but I was wondering if there is a better/more functional way of doing this in Scala.
val nodes = List("A", "B", "B", "C", "D", "F")
val str = nodes.map( x => "({name:\"" + x + "\"}),").foldLeft("create X ")( (acc, curr) => acc + curr )
println(str)
//create X ({name:"A"}),({name:"B"}),({name:"B"}),({name:"C"}),({name:"D"}),({name:"F"}),
Solution 1
You could use the mkString function, which won't append the separator at the end.
In this case you first map each element to the corresponding String, and then use mkString to put the ',' in between.
Since the "create X " prefix is static, you can just prepend it to the result.
val str = "create X " + nodes.map("({name:\"" + _ + "\"})").mkString(",")
Solution 2
Another way to see this: Since you append exactly one ',' too much, you could just remove it.
val str = nodes.foldLeft("create X ")((acc, x) => acc + "({name:\"" + x + "\"}),").init
init takes all elements of a collection except the last (a String is treated here as a collection of Chars).
So when nodes is non-empty, you remove the trailing ','. When it is empty, you only have "create X " and therefore remove the trailing white-space, which might not be needed anyway.
Note that Solutions 1 and 2 are not equivalent when nodes is empty: Solution 1 keeps the white-space.
Joining a bunch of things, splicing something "in between" each of the things, isn't a map-shaped problem. So adding the comma in the map call doesn't really "fit".
I generally do this sort of thing by inserting the comma before each item during the fold; the fold can test whether the accumulator is "empty" and not insert a comma.
For this particular case (string joining) it's so common that there's already a library function for it: mkString.
Move the "," from map (which applies to every element) into the fold/reduce:
val str = "create X " + nodes.map(x => "({name:\"" + x + "\"})").reduceLeftOption(_ + "," + _).getOrElse("")
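Both the mkString and the reduceLeftOption variants behave sensibly on an empty list; a quick sketch comparing them (the helper names here are made up):

```scala
// mkString inserts the separator only between elements
def render(nodes: List[String]): String =
  "create X " + nodes.map(n => s"""({name:"$n"})""").mkString(",")

// reduceLeftOption degrades gracefully to None on an empty list
def renderReduce(nodes: List[String]): String =
  "create X " + nodes.map(n => s"""({name:"$n"})""")
    .reduceLeftOption(_ + "," + _)
    .getOrElse("")

println(render(List("A", "B"))) // create X ({name:"A"}),({name:"B"})
println(renderReduce(Nil))      // create X
```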