I have to analyze an email corpus to see how many individual sentences are dominated by leet speak (e.g. lol, brb, etc.).
For each sentence I am doing the following:
val words = sentence.split(" ")
for (word <- words) {
if (validWords.contains(word)) {
score += 1
} else if (leetWords.contains(word)) {
score -= 1
}
}
Is there a better way to calculate the scores using Fold?
Not a great deal different, but another option.
val words = List("one", "two", "three")
val valid = List("one", "two")
val leet = List("three")
def check(valid: List[String], invalid: List[String])(words:List[String]): Int = words.foldLeft(0){
case (x, word) if valid.contains(word) => x + 1
case (x, word) if invalid.contains(word) => x - 1
case (x, _ ) => x
}
val checkValidOrLeet = check(valid, leet)(_)
val count = checkValidOrLeet(words)
If not limited to fold, using sum would be more concise.
sentence.split(" ")
.iterator
.map(word =>
if (validWords.contains(word)) 1
else if (leetWords.contains(word)) -1
else 0
).sum
Here's a way to do it with fold and partial application. Could still be more elegant, I'll continue to think on it.
val sentence = // ...your data....
val validWords = // ... your valid words...
val leetWords = // ... your leet words...
def checkWord(goodList: List[String], badList: List[String])(c: Int, w: String): Int = {
if (goodList.contains(w)) c + 1
else if (badList.contains(w)) c - 1
else c
}
val count = sentence.split(" ").foldLeft(0)(checkWord(validWords, leetWords))
print(count)
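The answers above all call contains on a List, which scans the list for every word. A minimal sketch of the same fold with Set lookups, assuming the word lists can be held as Set[String]; the object name LeetScore and the sample words are hypothetical placeholders, not part of the original question:

```scala
object LeetScore {
  // Hypothetical word lists; replace them with your own data.
  val validWords: Set[String] = Set("hello", "see", "you", "later")
  val leetWords: Set[String] = Set("lol", "brb")

  // +1 for each valid word, -1 for each leet word, 0 for anything else.
  def score(sentence: String): Int =
    sentence.split("\\s+").foldLeft(0) { (acc, word) =>
      if (validWords(word)) acc + 1
      else if (leetWords(word)) acc - 1
      else acc
    }
}
```

A sentence could then be treated as dominated by leet speak when LeetScore.score(sentence) is negative, although that threshold is an assumption too.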
I'm new to the Scala programming language. In this bubble sort I need to generate 10 random integers instead of writing them down by hand as in the code below.
Any suggestions?
object BubbleSort {
def bubbleSort(array: Array[Int]) = {
def bubbleSortRecursive(array: Array[Int], current: Int, to: Int): Array[Int] = {
println(array.mkString(",") + " current -> " + current + ", to -> " + to)
to match {
case 0 => array
case _ if(to == current) => bubbleSortRecursive(array, 0, to - 1)
case _ =>
if (array(current) > array(current + 1)) {
var temp = array(current + 1)
array(current + 1) = array(current)
array(current) = temp
}
bubbleSortRecursive(array, current + 1, to)
}
}
bubbleSortRecursive(array, 0, array.size - 1)
}
def main(args: Array[String]): Unit = {
val sortedArray = bubbleSort(Array(10,9,11,5,2))
println("Sorted Array -> " + sortedArray.mkString(","))
}
}
Try this:
import scala.util.Random
val sortedArray = (1 to 10).map(_ => Random.nextInt).toArray
You can use scala.util.Random for generation. The nextInt method takes an upper bound as its argument, so the code sample below generates a sequence of 10 Int values from 0 (inclusive) to 100 (exclusive).
val r = scala.util.Random
for (i <- 1 to 10) yield r.nextInt(100)
You can use it this way.
val solv1 = Random.shuffle( (1 to 100).toList).take(10)
val solv2 = Array.fill(10)(Random.nextInt)
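If the same input should come back on every run (handy when debugging the sort), a seeded generator can be used. This is only a sketch: the object name RandomInput, the helper randomInts and the seed value are made up for illustration.

```scala
import scala.util.Random

object RandomInput {
  // A seeded generator returns the same numbers on every run (the seed is arbitrary).
  def randomInts(count: Int, bound: Int, seed: Long): Array[Int] = {
    val rng = new Random(seed)
    Array.fill(count)(rng.nextInt(bound)) // each value is in [0, bound)
  }
}
```

The result can be passed straight to the sort from the question, e.g. bubbleSort(RandomInput.randomInts(10, 100, 42L)).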
Here is an example of countWords (Scala).
[origin]
def countWords(text: String): mutable.Map[String, Int] = {
val counts = mutable.Map.empty[String, Int]
for (rawWord <- text.split("[ ,!.]+")) {
val word = rawWord.toLowerCase
val oldCount =
if (counts.contains(word)) counts(word)
else 0
counts += (word -> (oldCount + 1))
}
return counts
}
[my code]
here is my code.
def countWords2(text: String):mutable.Map[String, Int] = {
val counts = mutable.Map.empty[String, Int]s
text.split("[ ,!.]").foreach(word =>
val lowWord = word.toLowerCase()
val oldCount = if (counts.contains(lowWord)) counts(lowWord) else 0
counts += (lowWord -> (oldCount + 1))
)
return counts
}
I tried to convert the "for()" loop to "foreach", but I got a "cannot resolve symbol" error message.
How do I use foreach in this case?
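One likely cause, assuming the snippet is otherwise as posted: a function literal passed in parentheses may contain only a single expression, so the val definitions inside foreach( ... ) do not parse. Braces allow a multi-statement body. The sketch below also assumes that the stray s after Map.empty[String, Int] is a typo and that scala.collection.mutable is imported. getOrElse is swapped in for the contains check, and the wrapper object WordCount is hypothetical.

```scala
import scala.collection.mutable

object WordCount {
  def countWords2(text: String): mutable.Map[String, Int] = {
    val counts = mutable.Map.empty[String, Int]
    // Braces (not parentheses) let the lambda body hold several statements.
    text.split("[ ,!.]+").foreach { word =>
      val lowWord = word.toLowerCase
      val oldCount = counts.getOrElse(lowWord, 0)
      counts += (lowWord -> (oldCount + 1))
    }
    counts
  }
}
```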
I have a function which should take in a long string and separate it into a list of strings where each list element is a sentence of the article. I am going to achieve this by splitting on space and then grouping the elements from that split according to the tokens which end with a dot:
def getSentences(article: String): List[String] = {
val separatedBySpace = article
.map((c: Char) => if (c == '\n') ' ' else c)
.split(" ")
val splitAt: List[Int] = Range(0, separatedBySpace.size)
.filter(i => endsWithDot(separatedBySpace(i))).toList
// TODO
}
I have separated the string on space, and I've found each index that I want to group the list on. But how do I now turn separatedBySpace into a list of sentences based on splitAt?
Example of how it should work:
article = "I like donuts. I like cats."
result = List("I like donuts.", "I like cats.")
PS: Yes, I know that my algorithm for splitting the article into sentences has flaws; I just want a quick, naive method to get the job done.
I ended up solving this by using recursion:
def getSentenceTokens(article: String): List[List[String]] = {
val separatedBySpace: List[String] = article
.replace('\n', ' ')
.replaceAll(" +", " ") // regex
.split(" ")
.toList
val splitAt: List[Int] = separatedBySpace.indices
.filter(i => ( i > 0 && endsWithDot(separatedBySpace(i - 1)) ) || i == 0)
.toList
groupBySentenceTokens(separatedBySpace, splitAt, List())
}
def groupBySentenceTokens(tokens: List[String], splitAt: List[Int], sentences: List[List[String]]): List[List[String]] = {
if (splitAt.size <= 1) {
if (splitAt.size == 1) {
sentences :+ tokens.slice(splitAt.head, tokens.size)
} else {
sentences
}
}
else groupBySentenceTokens(tokens, splitAt.tail, sentences :+ tokens.slice(splitAt.head, splitAt.tail.head))
}
val s: String = """I like donuts. I like cats
This is amazing"""
s.split("\\.|\n").map(_.trim).toList
//result: List[String] = List("I like donuts", "I like cats", "This is amazing")
To include the dots in the sentences:
val (a, b, _) = s.replace("\n", " ").split(" ")
.foldLeft((List.empty[String], List.empty[String], "")){
case ((temp, result, finalStr), word) =>
if (word.endsWith(".")) {
(List.empty[String], result ++ List(s"$finalStr${(temp ++ List(word)).mkString(" ")}"), "")
} else {
(temp ++ List(word), result, finalStr)
}
}
val result = b ++ List(a.mkString(" ").trim)
//result = List("I like donuts.", "I like cats.", "This is amazing")
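For the naive case in the question, a regex lookbehind is another option: split on whitespace that follows a dot, so the dot stays attached to its sentence. This is a different technique from the answers above, sketched under the question's own assumption that newlines are treated as spaces; the wrapper object SentenceSplit is hypothetical.

```scala
object SentenceSplit {
  def getSentences(article: String): List[String] =
    article
      .replace('\n', ' ')
      .trim
      .split("(?<=\\.)\\s+") // split on whitespace preceded by a dot; the dot is kept
      .toList
}
```

Like the original approach, this breaks on abbreviations such as "e.g. this", so it is only suitable as a quick first pass.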
I have the following code:
var res: GenMap[Point, GenSeq[Point]] = points.par.groupBy(point => findClosest(point, means))
means.par.foreach(mean => if(!res.contains(mean)) {
println("Map doesn't contain mean: " + mean)
res += mean -> GenSeq.empty[Point]
println("Map contains?: " + res.contains(mean))
})
That uses this case class:
case class Point(val x: Double, val y: Double, val z: Double)
Basically, the code groups the Point elements in points around the Point elements in means. The algorithm itself is not very important though.
My problem is that I am getting the following output:
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map contains?: true
Map contains?: true
Map contains?: false
Map contains?: true
Why would I ever get this?
Map contains?: false
I am checking if a key is in the res map. If it is not, then I'm adding it.
So how can it not be present in the map?
Is there an issue with parallelism?
Your code has a race condition on this line:
res += mean -> GenSeq.empty[Point]
More than one thread reassigns res concurrently, so some entries can be lost.
This code solves the problem:
val closest = points.par.groupBy(point => findClosest(point, means))
val res = means.foldLeft(closest) {
case (map, mean) =>
if(map.contains(mean))
map
else
map + (mean -> GenSeq.empty[Point])
}
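The same idea can be written without a fold: build the missing entries first, then merge them in one step, so nothing is reassigned from several threads. The sketch below uses plain immutable Map and Seq rather than GenMap and GenSeq to stay self-contained; the names FillMissingKeys and withEmptyGroups are hypothetical.

```scala
object FillMissingKeys {
  // Adds an empty group for every key that has no entry yet, in a single step.
  def withEmptyGroups[K, V](grouped: Map[K, Seq[V]], keys: Seq[K]): Map[K, Seq[V]] = {
    val missing = keys.filterNot(grouped.contains).map(k => k -> Seq.empty[V])
    grouped ++ missing
  }
}
```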
Processing a point changes means and the result is sensitive to processing order, so the algorithm doesn't lend itself to parallel execution. If parallel execution is important enough to allow a change of algorithm, then it might be possible to find an algorithm that can be applied in parallel.
Using a known set of grouping points, such as grid square centres, means that the points can be allocated to their grouping points in parallel and grouped by their grouping points in parallel:
import scala.annotation.tailrec
import scala.collection.parallel.ParMap
import scala.collection.{GenMap, GenSeq, Map}
import scala.math._
import scala.util.Random
class ParallelPoint {
val rng = new Random(0)
val groups: Map[Point, Point] = (for {
i <- 0 to 100
j <- 0 to 100
k <- 0 to 100
}
yield {
val p = Point(10.0 * i, 10.0 * j, 10.0 * k)
p -> p
}
).toMap
val points: Array[Point] = (1 to 10000000).map(aaa => Point(rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0)).toArray
def findClosest(point: Point, groups: GenMap[Point, Point]): (Point, Point) = {
val x: Double = rint(point.x / 10.0) * 10.0
val y: Double = rint(point.y / 10.0) * 10.0
val z: Double = rint(point.z / 10.0) * 10.0
val mean: Point = groups(Point(x, y, z)) //.getOrElse(throw new Exception(s"$point out of range of mean ($x, $y, $z).") )
(mean, point)
}
@tailrec
private def total(points: GenSeq[Point]): Option[Point] = {
points.size match {
case 0 => None
case 1 => Some(points(0))
case _ => total((points(0) + points(1)) +: points.drop(2))
}
}
def mean(points: GenSeq[Point]): Option[Point] = {
total(points) match {
case None => None
case Some(p) => Some(p / points.size)
}
}
val startTime = System.currentTimeMillis()
println("starting test ...")
val res: ParMap[Point, GenSeq[Point]] = points.par.map(p => findClosest(p, groups)).groupBy(pp => pp._1).map(kv => kv._1 -> kv._2.map(v => v._2))
val groupTime = System.currentTimeMillis()
println(s"... grouped result after ${groupTime - startTime}ms ...")
points.par.foreach(p => if (! res(findClosest(p, groups)._1).exists(_ == p)) println(s"point $p not found"))
val checkTime = System.currentTimeMillis()
println(s"... checked grouped result after ${checkTime - startTime}ms ...")
val means: ParMap[Point, GenSeq[Point]] = res.map{ kv => mean(kv._2).get -> kv._2 }
val meansTime = System.currentTimeMillis()
println(s"... means calculated after ${meansTime - startTime}ms.")
}
object ParallelPoint {
def main(args: Array[String]): Unit = new ParallelPoint()
}
case class Point(x: Double, y: Double, z: Double) {
def +(that: Point): Point = {
Point(this.x + that.x, this.y + that.y, this.z + that.z)
}
def /(scale: Double): Point = Point(x/ scale, y / scale, z / scale)
}
The last step replaces the grouping point with the calculated mean of the grouped points as the map key. This processes 10 million points in about 30 seconds on my 2011 MBP.
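The mean of a group can also be computed with reduceOption instead of a hand-written recursive total. This is a sketch only: it uses a plain Seq rather than GenSeq, and wraps everything in a hypothetical object MeanSketch with its own copy of Point so that the snippet stands alone.

```scala
object MeanSketch {
  final case class Point(x: Double, y: Double, z: Double) {
    def +(that: Point): Point = Point(x + that.x, y + that.y, z + that.z)
    def /(scale: Double): Point = Point(x / scale, y / scale, z / scale)
  }

  // None for an empty group, otherwise the component-wise average.
  def mean(points: Seq[Point]): Option[Point] =
    points.reduceOption(_ + _).map(_ / points.size)
}
```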
We are trying to generate column-wise statistics for our dataset in Spark. In addition to using the summary function from the statistics library, we use the following procedure:
We determine the columns with string values
Generate key-value pairs for the whole dataset, using the column number as the key and the column's value as the value
Generate a new map of the format
(K,V) ->((K,V),1)
Then we use reduceByKey to count the occurrences of each unique value in every column. We cache this output to reduce further computation time.
In the next step we cycle through the columns using a for loop to find the statistics for each column.
We are trying to replace the for loop with another map-reduce step, but we have not found a way to do it. Doing so would let us generate the statistics for all columns in one execution. The for loop runs sequentially, which makes it very slow.
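To make the counting steps concrete, here is a sketch on plain Scala collections rather than Spark RDDs. groupBy stands in for reduceByKey, and the names ColumnCounts and countByColumn are hypothetical.

```scala
object ColumnCounts {
  // Splits each CSV row into (columnIndex, value) pairs, then counts each distinct pair.
  def countByColumn(rows: Seq[String]): Map[(Int, String), Int] = {
    val kvPairs: Seq[(Int, String)] =
      rows.flatMap(_.split(",").toSeq.zipWithIndex.map { case (value, col) => (col, value) })
    kvPairs.groupBy(identity).map { case (kv, occurrences) => kv -> occurrences.size }
  }
}
```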
Code:
//drops the header
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) lines.drop(1) // skip the header row, which is in the first partition
else lines
})
}
def retAtrTuple(x: String) = {
val newX = x.split(",")
for (h <- 0 until newX.length)
yield (h,newX(h))
}
val line = sc.textFile("hdfs://.../myfile.csv")
val withoutHeader: RDD[String] = dropHeader(line)
val kvPairs = withoutHeader.flatMap(retAtrTuple) //generates a key-value pair where key is the column number and value is column's value
var bool_numeric_col = kvPairs.map{case (x,y) => (x,isNumeric(y))}.reduceByKey(_&&_).sortByKey() //this contains column indexes as key and boolean as value (true for numeric and false for string type)
var str_cols = bool_numeric_col.filter{case (x,y) => y == false}.map{case (x,y) => x}
var num_cols = bool_numeric_col.filter{case (x,y) => y == true}.map{case (x,y) => x}
var str_col = str_cols.toArray //array consisting the string col
var num_col = num_cols.toArray //array consisting numeric col
val colCount = kvPairs.map((_,1)).reduceByKey(_+_)
val e1 = colCount.map{case ((x,y),z) => (x,(y,z))}
var numPairs = e1.filter{case (x,(y,z)) => str_col.contains(x) }
//running for loops which needs to be parallelized/optimized as it sequentially operates on each column. Idea is to find the top10, bottom10 and number of distinct elements column wise
for(i <- str_col){
var total = numPairs.filter{case (x,(y,z)) => x==i}.sortBy(_._2._2)
var leastOnes = total.take(10)
println("leastOnes for Col" + i)
leastOnes.foreach(println)
var maxOnes = total.sortBy(-_._2._2).take(10)
println("maxOnes for Col" + i)
maxOnes.foreach(println)
println("distinct for Col" + i + " is " + total.count)
}
Let me simplify your question a bit. (A lot actually.) We have an RDD[(Int, String)] and we want to find the top 10 most common Strings for each Int (which are all in the 0–100 range).
Instead of sorting, as in your example, it is more efficient to use the Spark built-in RDD.top(n) method. Its run-time is linear in the size of the data, and requires moving much less data around than a sort.
Consider the implementation of top in RDD.scala. You want to do the same, but with one priority queue (heap) per Int key. The code becomes fairly complex:
import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private.
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[[(Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for (((k, v), count) <- items) {
heaps(k) += count -> v
}
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
heaps(k) += cv
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case(count, value) => value })
}
This is efficient, but made painful because we need to work with multiple columns at the same time. It would be way easier to have one RDD per column and just call rdd.top(10) on each.
Unfortunately the naive way to split up the RDD into N smaller RDDs does N passes:
def split(together: RDD[(Int, String)], columns: Int): Seq[RDD[String]] = {
together.cache // We will make N passes over this RDD.
(0 until columns).map {
i => together.filter { case (key, value) => key == i }.values
}
}
A more efficient solution could be to write out the data into separate files by key, then load it back into separate RDDs. This is discussed in Write to multiple outputs by key Spark - one Spark job.
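To show what the per-key result should look like, here is a sketch on plain Scala collections with no Spark involved. It sorts each column's counts instead of using a bounded heap, so it illustrates the expected output rather than the efficient distributed technique; the names TopPerKey and top are hypothetical.

```scala
object TopPerKey {
  def top(n: Int, pairs: Seq[(Int, String)]): Map[Int, Seq[String]] = {
    // Count how often each (column, value) pair occurs.
    val counts: Seq[(Int, String, Int)] =
      pairs.groupBy(identity).toSeq.map { case ((col, value), occ) => (col, value, occ.size) }
    // Per column, keep the n most frequent values (ties broken alphabetically).
    counts.groupBy(_._1).map { case (col, entries) =>
      col -> entries.sortBy { case (_, value, count) => (-count, value) }.take(n).map(_._2)
    }
  }
}
```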
Thanks for @Daniel Darabos's answer, but it contains some mistakes:
Mixed use of Map and collection.mutable.Map.
withDefault((i: Int) => new Heap(n)) does not store a new Heap in the map when you call heaps(k) += count -> v.
Mismatched parentheses and brackets.
Here is the modified code:
//import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private. copy to your own folder and import it
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object BoundedPriorityQueueTest {
// https://stackoverflow.com/questions/28166190/spark-column-wise-word-count
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[((Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[collection.mutable.Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap]() // .withDefault((i: Int) => new Heap(n))
for (((k, v), count) <- items) {
println("\n---")
println("before add " + ((k, v), count) + ", the map is: ")
println(heaps)
if (!heaps.contains(k)) {
println("not contains key " + k)
heaps(k) = new Heap(n)
println(heaps)
}
heaps(k) += count -> v
println("after add " + ((k, v), count) + ", the map is: ")
println(heaps)
}
println(heaps)
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: collection.mutable.Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap]() //.withDefault((i: Int) => new Heap(n))
println(heaps)
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
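heaps.getOrElseUpdate(k, new Heap(n)) += cv // create the heap on first use; heaps(k) alone throws for a missing key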
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case (count, value) => value }).toMap
}
def main(args: Array[String]): Unit = {
Logger.getRootLogger().setLevel(Level.FATAL) //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val conf = new SparkConf().setAppName("word count").setMaster("local[1]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN") //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val words = sc.parallelize(List((1, "s11"), (1, "s11"), (1, "s12"), (1, "s13"), (2, "s21"), (2, "s22"), (2, "s22"), (2, "s23")))
println("# words:" + words.count())
val result = top(1, words)
println("\n--result:")
println(result)
sc.stop()
print("DONE")
}
}