I tried this:
rdd1 = sc.parallelize(["Let's have some fun.",
                       "To have fun you don't need any plans."])
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: (lists, len(lists)))
output.foreach(print)
output:
(["Let's", 'have', 'some', 'fun.'], 4)
(['To', 'have', 'fun', 'you', "don't", 'need', 'any', 'plans.'], 8)
This gave me the total number of words per line, but I wanted the count of each word per line.
You can try this:
from collections import Counter
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: dict(Counter(lists)))
I'll give a small Python example:
from collections import Counter
example_1 = "Let's have some fun."
Counter(example_1.split(" "))
# [{"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1}
example_2 = "To have fun you don't need any plans."
Counter(example_2.split(" "))
# Counter({'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1})
Based on your input and what I understand, please find the code below. It needs only minor changes to yours:
output = rdd1.flatMap(lambda t: t.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
You used map to split the data; use flatMap instead, which flattens each line into individual words. Output below:
output.collect()
[('have', 2), ("Let's", 1), ('To', 1), ('you', 1), ('need', 1), ('fun', 1), ("don't", 1), ('any', 1), ('some', 1), ('fun.', 1), ('plans.', 1)]
I have the following array of arrays, each representing a cycle in a graph, that I want to print in the format below.
scala> result.collect
Array[Array[Long]] = Array(Array(0, 1, 4, 0), Array(1, 5, 2, 1), Array(1, 4, 0, 1), Array(2, 3, 5, 2), Array(2, 1, 5, 2), Array(3, 5, 2, 3), Array(4, 0, 1, 4), Array(5, 2, 3, 5), Array(5, 2, 1, 5))
0:0->1->4;
1:1->5->2;1->4->0;
2:2->3->5;2->1->5;
3:3->5->2;
4:4->0->1;
5:5->2->3;5->2->1;
How can I do this? I tried a for loop with if statements as in other languages, but in Scala an if inside a for comprehension acts as a filter, so I can't use if/else to handle two different cases.
Example Python code:
for i in range(len(result) - 1):
    if result[i][0] == result[i + 1][0]:
        pass  # print one thing
    else:
        pass  # print the other thing
I also tried result.groupBy to make printing easier, but then the arrays print as unreadable object references.
Array[(Long, Iterable[Array[Long]])] = Array((4,CompactBuffer([J#3677a08a)), (0,CompactBuffer([J#695fd7e)), (1,CompactBuffer([J#50b0f441, [J#142efc4d)), (3,CompactBuffer([J#1fd66db2)), (5,CompactBuffer([J#36811d3b, [J#61c4f556)), (2,CompactBuffer([J#2eba1b7, [J#2efcf7a5)))
Is there a way to nicely print the output needed in Scala?
This should do it:
result
  .collect            // bring the cycles to the driver; everything below is local
  .groupBy(_.head)
  .toArray
  .sortBy(_._1)
  .map { case (node, cycles) =>
    val paths = cycles.map { cycle =>
      cycle
        .init // drop the last node, which repeats the first
        .mkString("->")
    }
    s"$node:${paths.mkString(";")}"
  }
  .mkString(";\n")
This is the output for the sample input you provided:
0:0->1->4;
1:1->5->2;1->4->0;
2:2->3->5;2->1->5;
3:3->5->2;
4:4->0->1;
5:5->2->3;5->2->1
I have a word-frequency array like this:
[("hello", 1), ("world", 5), ("globle", 1)]
I have to reverse it so that I get a frequency-to-wordCount map like this:
[(1, 2), (5, 1)]
Notice that since two words ("hello" and "globle") have frequency 1, the value in the reversed mapping is 2. However, since only one word has frequency 5, the value of that entry is 1. How can I do this in Scala?
Update:
I happened to figure this out as well:
arr.groupBy(_._2).map(x => (x._1,x._2.toList.length))
You can first group by the count and then just take the size of each group:
val frequencies = List(("hello", 1), ("world", 5), ("globle", 1))
val reversed = frequencies.groupBy(_._2).mapValues(_.size).toList
res0: List[(Int, Int)] = List((5,1), (1,2))
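If you want an actual Map rather than a List of pairs (the question asks for a map), you can simply drop the final toList; a one-line variant using the same frequencies value:
val reversedMap = frequencies.groupBy(_._2).mapValues(_.size).toMap
// e.g. Map(5 -> 1, 1 -> 2)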
Say I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to sort it:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want each element to be the next element minus the current one.
But how do I implement this with an RDD in Spark?
I know that for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
there will be several partitions, each of them ordered. Can I do a similar operation on each partition and then combine the results from all partitions?
The most efficient solution is to use the sliding method:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)
  .sliding(2)
  .map { case Array(x, y) => y - x }
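Note that sliding from mllib's RDDFunctions takes care of windows that cross partition boundaries, which is exactly the per-partition concern raised in the question. As a quick check (assuming the same sc as above):
rdd.collect()
// Array(3, 1, 1, 1, 1, 3)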
Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's first create a collection with the index as the key, as Ton Torres suggested:
val original = seq.zipWithIndex.map(_.swap)
Now we can build the collection shifted by one element:
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next we calculate the needed differences, ordered by descending index:
val diffs = original.join(shifted)
  .sortBy(_._1, ascending = false)
  .map { case (idx, (v1, v2)) => v2 - v1 }
So
println(diffs.collect.toSeq)
shows
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for a local collection this can be computed much more simply:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But for an RDD, the zip method requires that both collections contain the same number of elements in each partition. So if you implemented tail like this:
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
then tail.zip(seq) would not work, since both collections need the same element count per partition, and here one element in each partition would have to travel to the previous partition.
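To make the failure concrete, here is a small sketch; the names and the partition count of 3 are illustrative assumptions:
// With, say, 3 partitions, dropping the first element leaves the first
// partition one element short, so the partitions no longer line up.
val nums = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6), 3).sortBy(identity)
val shiftedTail = nums.zipWithIndex().filter(_._2 > 0).map(_._1)
// shiftedTail.zip(nums).collect()
// => throws SparkException: the RDDs differ in elements per partition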
I have user hobby data (RDD[Map[String, Int]]) like:
("food" -> 3, "music" -> 1),
("food" -> 2),
("game" -> 5, "twitch" -> 3, "food" -> 3)
I want to calculate stats over them and represent the stats as Map[String, Array[Int]], where the array size is 5, like:
("food" -> Array(0, 1, 2, 0, 0),
"music" -> Array(1, 0, 0, 0, 0),
"game" -> Array(0, 0, 0, 0, 1),
"twitch" -> Array(0, 0, 1, 0 ,0))
foldLeft seems like the right tool, but RDDs don't have it, and the data is too big to convert to a List/Array just to use foldLeft. How can I do this?
The trick is to replace the Array in your example with a class that holds the statistic you want for some part of the data, and that can be combined with another instance of the same statistic (covering another part of the data) to produce the statistic for the whole data.
For instance, if you have a statistic that covers the data 3, 3, 2 and 5, I gather it would look something like (0, 1, 2, 0, 1), and if you have another instance covering the data 3, 4, 4, it would look like (0, 0, 1, 2, 0). Now all you have to do is define a + operation that lets you combine them: (0, 1, 2, 0, 1) + (0, 0, 1, 2, 0) = (0, 1, 3, 2, 1), covering the data 3, 3, 2, 5 and 3, 4, 4.
Let's just do that, and call the class StatMonoid:
case class StatMonoid(flags: Seq[Int] = Seq(0, 0, 0, 0, 0)) {
  def +(other: StatMonoid) =
    new StatMonoid((0 to 4).map { idx => flags(idx) + other.flags(idx) })
}
This class holds the sequence of 5 counters and defines a + operation that lets it be combined with other instances.
We also need a convenience method to build it; this could be a constructor in StatMonoid, a factory in the companion object, or just a plain method, as you prefer:
def stat(value: Int): StatMonoid = value match {
  case 1 => new StatMonoid(Seq(1, 0, 0, 0, 0))
  case 2 => new StatMonoid(Seq(0, 1, 0, 0, 0))
  case 3 => new StatMonoid(Seq(0, 0, 1, 0, 0))
  case 4 => new StatMonoid(Seq(0, 0, 0, 1, 0))
  case 5 => new StatMonoid(Seq(0, 0, 0, 0, 1))
  case _ => throw new RuntimeException(s"illegal init value: $value")
}
This allows us to easily compute an instance of the statistic covering a single piece of data, for example:
scala> stat(4)
res25: StatMonoid = StatMonoid(List(0, 0, 0, 1, 0))
And it also allows us to combine them together simply by adding them:
scala> stat(1) + stat(2) + stat(2) + stat(5) + stat(5) + stat(5)
res18: StatMonoid = StatMonoid(Vector(1, 2, 0, 0, 3))
Now to apply this to your example, let's assume we have the data you mention as an RDD of Map:
val rdd = sc.parallelize(List(Map("food" -> 3, "music" -> 1), Map("food" -> 2), Map("game" -> 5, "twitch" -> 3, "food" -> 3)))
All we need to do to find the stat for each hobby is to flatten the data into (hobby -> score) tuples, transform each score into an instance of StatMonoid, and finally combine them all together for each hobby:
import org.apache.spark.rdd.PairRDDFunctions
rdd.flatMap(_.toList).mapValues(stat).reduceByKey(_ + _).collect
Which yields:
res24: Array[(String, StatMonoid)] = Array((game,StatMonoid(List(0, 0, 0, 0, 1))), (twitch,StatMonoid(List(0, 0, 1, 0, 0))), (music,StatMonoid(List(1, 0, 0, 0, 0))), (food,StatMonoid(Vector(0, 1, 2, 0, 0))))
Now, for the side story: if you wonder why I call the class StatMonoid, it's simply because... it is a monoid :D, and a very common and handy one, called the product monoid. In short, monoids are just things that can be combined with each other in an associative fashion; they are super common when developing in Spark, since they naturally define operations that can be executed in parallel on the distributed executors and then gathered together into a final result.
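As a quick aside, the monoid laws are easy to check directly in the REPL; associativity is what lets reduceByKey combine partial results in any grouping. A small sketch using the stat method and StatMonoid defined above:
val zero = StatMonoid()                    // the all-zero default is the identity
val (a, b, c) = (stat(1), stat(3), stat(5))
assert(((a + b) + c) == (a + (b + c)))     // associativity
assert((zero + a) == a && (a + zero) == a) // identity laws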
How do I transform a Scala Stream of integers into a new Stream where each element is the sum of that element and the previous one?
For example, if the input stream is 1, 2, 3, 4, ... then the output stream is 1, 3, 5, 7.
A second question: how would you make each sum use the previous sum from the output stream, so that the output would be 1, (2+(1)), (3+(2+1)), (4+(3+(2+1)))?
Just zip your stream with a shifted version of itself and sum the two elements.
val s1 = Stream.from(0) // 0, 1, 2, 3, ...
val s2 = Stream.from(1) // 1, 2, 3, 4, ...
val sumOfTwo = s1.zip(s2).map{ case (a,b) => a+b } // 1, 3, 5, 7, ...
To compute the running total, just use the scan function, which acts like a fold but emits the intermediate result at each step:
val totalSum = s2.scan(0)((ctr, el) => ctr + el) // 0, 1, 3, 6, 10, ...
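Since scan emits its seed as the first element, drop it if the output should start at 1 as in the question:
val runningTotals = totalSum.tail // 1, 3, 6, 10, ...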
This answer computes the cumulative sum by threading the accumulated result through a recursive helper instead of using scan(). Example program:
import scala.collection.immutable.Stream

object Main extends App {
  // 1, 2, 3, ...
  val naturals = Stream.from(1)

  // cumulative sum (see https://stackoverflow.com/a/8567134/1071311)
  def sumUp(s: Stream[Int], acc: Int = 0): Stream[Int] =
    Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))

  val firstFive = sumUp(naturals, 0).take(5)
  firstFive.foreach(println _)
}
Output:
1
3
6
10
15
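For comparison, a minimal sketch showing that the scan approach from the previous answer yields the same five values once the leading seed is dropped:
Stream.from(1).scan(0)(_ + _).tail.take(5).toList
// List(1, 3, 6, 10, 15)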