Per-Document Word Count in Spark - scala

I am learning Spark (in Scala) and have been trying to figure out how to count all the words on each line of a file.
I am working with a dataset where each line contains a tab-separated document_id and the full text of the document:
doc_1 <full-text>
doc_2 <full-text>
etc.
Here is a toy example I have in a file called docs.txt:
doc_1 new york city new york state
doc_2 rain rain go away
I think what I need to do is transform the data into tuples of the form
((doc_id, word), 1)
and then call reduceByKey() to sum the 1's. I wrote the following:
val file = sc.textFile("docs.txt")
val tuples = file.map(_.split("\t"))
                 .map(x => x(1).split("\\s+").map(y => ((x(0), y), 1)))
Which does give me the intermediate representation I think I need:
tuples.collect
res0: Array[Array[((String, String), Int)]] = Array(Array(((doc_1,new),1), ((doc_1,york),1), ((doc_1,city),1), ((doc_1,new),1), ((doc_1,york),1), ((doc_1,state),1)), Array(((doc_2,rain),1), ((doc_2,rain),1), ((doc_2,go),1), ((doc_2,away),1)))
But if I call reduceByKey on tuples it produces an error:
tuples.reduceByKey(_ + _)
<console>:21: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[Array[((String, String), Int)]]
tuples.reduceByKey(_ + _)
I can't seem to get my head around how to do this. I think I need to do a reduce on an array inside an array. I have tried many different things, but, like the above, they keep producing errors and I am making no progress.
Any guidance / advice on this would be much appreciated.
Note: I know that there is a word count example on https://spark.apache.org/examples.html showing how to get counts for all of the words in a file. But that is for an entire input file. I am talking about getting counts per-document where each document is on a different line.

reduceByKey expects an RDD[(K, V)], but the moment you perform the split in the first map you end up with an RDD[Array[...]], which is not the type signature it needs. You can rework your current solution as below, but it probably will not be as performant (read past the code for a rework using flatMap):
//Dummy data load
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Split the data on tabs to get an array of (key, line) tuples
val firstPass = file.map(_.split("\t"))
//Split the line inside each tuple so you now have an array of (key, Array(...))
//Where the inner array is full of (word, 1) tuples
val secondPass = firstPass.map(x=>(x(0), x(1).split("\\s+").map(y=>(y,1))))
//Now group the words and re-map so that the inner tuple is the wordcount
val finalPass = secondPass.map(x=>(x._1, x._2.groupBy(_._1).map(y=>(y._1,y._2.size))))
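For reference, collecting finalPass on the dummy data above gives something like the following (the ordering inside each Map may differ):
//finalPass.collect
//Array((doc_1, Map(new -> 1, york -> 1, city -> 1)),
//      (doc_2, Map(rain -> 2, go -> 1, away -> 1)))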
Probably the better solution:
If you want to keep your current structure, then you need to switch to a Tuple2 from the start and then use a flatMap:
//Load your data
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Turn the data into a key-value RDD (I suggest caching the split, kept 1 line for SO)
val firstPass = file.map(x=>(x.split("\t")(0), x.split("\t")(1)))
//Change your key to be a Tuple2[String,String] and the value is the count
val tuples = firstPass.flatMap(x=>x._2.split("\\s+").map(y=>((x._1, y), 1)))
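From here the RDD has the type RDD[((String, String), Int)], so reduceByKey works as you intended. A minimal sketch of the final step on the dummy data above (output order may vary):
val counts = tuples.reduceByKey(_ + _)
counts.collect.foreach(println)
//((doc_1,new),1)
//((doc_1,york),1)
//((doc_1,city),1)
//((doc_2,rain),2)
//((doc_2,go),1)
//((doc_2,away),1)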

Here is a quick demo on a VERY small dataset.
scala> val file = sc.textFile("../README.md")
15/02/02 00:32:38 INFO MemoryStore: ensureFreeSpace(32792) called with curMem=45512, maxMem=278302556
15/02/02 00:32:38 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 32.0 KB, free 265.3 MB)
file: org.apache.spark.rdd.RDD[String] = ../README.md MappedRDD[7] at textFile at <console>:12
scala> val splitLines = file.map{ line => line.split(" ") }
splitLines: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[9] at map at <console>:14
scala> splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
res19: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[10] at map at <console>:17
scala> val result = splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
result: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[11] at map at <console>:16
scala> result.take(10).foreach(println)
Map(# -> 1, Spark -> 1, Apache -> 1)
Map( -> 1)
Map(for -> 1, is -> 1, Data. -> 1, system -> 1, a -> 1, provides -> 1, computing -> 1, cluster -> 1, general -> 1, Spark -> 1, It -> 1, fast -> 1, Big -> 1, and -> 1)
Map(in -> 1, Scala, -> 1, optimized -> 1, APIs -> 1, that -> 1, Java, -> 1, high-level -> 1, an -> 1, Python, -> 1, and -> 2, engine -> 1)
Map(for -> 1, data -> 1, a -> 1, also -> 1, general -> 1, supports -> 2, It -> 1, graphs -> 1, analysis. -> 1, computation -> 1)
Map(for -> 1, set -> 1, tools -> 1, rich -> 1, Spark -> 1, structured -> 1, including -> 1, of -> 1, and -> 1, higher-level -> 1, SQL -> 2)
Map(GraphX -> 1, for -> 2, processing, -> 2, data -> 1, MLlib -> 1, learning, -> 1, machine -> 1, graph -> 1)
Map(for -> 1, Streaming -> 1, processing. -> 1, stream -> 1, Spark -> 1, and -> 1)
Map( -> 1)
Map(<http://spark.apache.org/> -> 1)
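To keep the per-document key from your tab-separated format, the same groupBy(identity) idea can be applied to just the text part of each line. A minimal sketch, assuming the docs.txt layout from the question (doc_id, a tab, then the text):
val perDoc = sc.textFile("docs.txt").map { line =>
  val Array(id, text) = line.split("\t", 2)
  (id, text.split("\\s+").toList.groupBy(identity).map(x => (x._1, x._2.size)))
}
//perDoc: RDD[(String, Map[String, Int])], one word-count map per document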

Input file content:
a b c d
a b e f
h i j l
m h i l
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    JavaRDD<String> textFile = javaSparkContext.textFile("C:\\Users\\arun7.gupta\\Desktop\\Spark\\word.txt");
    /** Split each line on spaces to get the words */
    textFile = textFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    /** Pair each word with a count of 1 */
    JavaPairRDD<String, Integer> mapToPair = textFile.mapToPair(w -> new Tuple2<>(w, 1));
    /** Reduce by key and add up the counts */
    JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey((wc1, wc2) -> wc1 + wc2);
    System.out.println(reduceByKey.collectAsMap());
    javaSparkContext.close();
}
Output:
{e=1, h=2, b=2, j=1, m=1, d=1, a=2, i=2, c=1, l=2, f=1}

Related

How to create RDD[Map(Int,Int)] using Spark and Scala?

I have the following simple code in Java. This code creates and fills the Map with 0 values.
Map<Integer,Integer> myMap = new HashMap<Integer,Integer>();
for (int i=0; i<=20; i++) { myMap.put(i, 0); }
I want to create a similar RDD using Spark and Scala. I tried this approach, but it returns me RDD[(Any) => (Any,Int)] instead of RDD[Map(Int,Int)]. What am I doing wrong?
val data = (0 to 20).map(_ => (_,0))
val myMapRDD = sparkContext.parallelize(data)
What you are creating are tuples. Instead, you need to create Maps and parallelize them, as below:
val data = (0 to 20).map(x => Map(x -> 0))
//data: scala.collection.immutable.IndexedSeq[scala.collection.immutable.Map[Int,Int]] = Vector(Map(0 -> 0), Map(1 -> 0), Map(2 -> 0), Map(3 -> 0), Map(4 -> 0), Map(5 -> 0), Map(6 -> 0), Map(7 -> 0), Map(8 -> 0), Map(9 -> 0), Map(10 -> 0), Map(11 -> 0), Map(12 -> 0), Map(13 -> 0), Map(14 -> 0), Map(15 -> 0), Map(16 -> 0), Map(17 -> 0), Map(18 -> 0), Map(19 -> 0), Map(20 -> 0))
val myMapRDD = sparkContext.parallelize(data)
//myMapRDD: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = ParallelCollectionRDD[0] at parallelize at test.sc:19
In Scala, (0 to 20).map(_ => (_, 0)) would not compile, as it has invalid placeholder syntax. I believe you might be looking for something like below instead:
val data = (0 to 20).map( _->0 )
which would generate a list of key-value pairs, and is really just a placeholder shorthand for:
val data = (0 to 20).map( n => n->0 )
// data: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector(
// (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0), (9,0), (10,0),
// (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0), (19,0), (20,0)
// )
An RDD is an immutable collection (e.g. Seq, Array) of data. To create an RDD of Map[Int,Int], you would expand data inside a Map, which in turn gets placed inside a Seq collection:
val rdd = sc.parallelize(Seq(Map(data: _*)))
rdd.collect
// res1: Array[scala.collection.immutable.Map[Int,Int]] = Array(
// Map(0 -> 0, 5 -> 0, 10 -> 0, 14 -> 0, 20 -> 0, 1 -> 0, 6 -> 0, 9 -> 0, 13 -> 0, 2 -> 0, 17 -> 0,
// 12 -> 0, 7 -> 0, 3 -> 0, 18 -> 0, 16 -> 0, 11 -> 0, 8 -> 0, 19 -> 0, 4 -> 0, 15 -> 0)
// )
Note that, as is, this RDD consists of only a single Map, but you can certainly assemble as many Maps as you wish in an RDD:
val rdd2 = sc.parallelize(Seq(
Map((0 to 4).map( _->0 ): _*),
Map((5 to 9).map( _->0 ): _*),
Map((10 to 14).map( _->0 ): _*),
Map((15 to 19).map( _->0 ): _*)
))
You can't parallelize a Map, as parallelize takes a Seq. What you can achieve is creating an RDD[(Int, Int)], which however does not enforce the uniqueness of keys. To perform operations by key, you can leverage PairRDDFunctions, which, despite this limitation, can end up being useful for your use case.
Let's try at least to get an RDD[(Int, Int)].
You used a slightly "wrong" syntax when applying the map to your range.
The _ placeholder can have different meanings depending on the context. The two meanings that got mixed up in your snippet of code are:
a placeholder for an anonymous function parameter that is not going to be used (as in (_ => 42), a function which ignores its input and always returns 42)
a positional placeholder for arguments in anonymous functions (as in (_, 42), a function that takes one argument and returns a tuple where the first element is the input and the second is the number 42)
The above examples are simplified and do not account for type inference as they only wish to point out two of the meanings of the _ placeholder that got mixed up in your snippet of code.
The first step is to use one of the two following functions to create the pairs that are going to be part of the map, either
a => (a, 0)
or
(_, 0)
and after parallelizing it you can get the RDD[(Int, Int)], as follows:
val pairRdd = sc.parallelize((0 to 20).map((_, 0)))
I believe it's worth noting here that mapping over the local collection is executed eagerly and bound to your driver's resources, whereas you can obtain the same final result by parallelizing the collection first and then mapping the pair-creating function over the RDD.
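A minimal sketch of that alternative (same result, but the pair construction happens on the executors rather than on the driver):
val pairRdd = sc.parallelize(0 to 20).map((_, 0))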
Now, as mentioned, you don't have a distributed map, but rather a collection of key-value pairs where key uniqueness is not enforced. But you can work pretty seamlessly with those values using PairRDDFunctions, which you obtain automatically by importing org.apache.spark.rdd.RDD.rddToPairRDDFunctions (or without having to do anything in the spark-shell, as the import has already been done for you); this decorates your RDD via Scala's implicit conversions.
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
pairRdd.mapValues(_ + 1).foreach(println)
will print the following
(10,1)
(11,1)
(12,1)
(13,1)
(14,1)
(15,1)
(16,1)
(17,1)
(18,1)
(19,1)
(20,1)
(0,1)
(1,1)
(2,1)
(3,1)
(4,1)
(5,1)
(6,1)
(7,1)
(8,1)
(9,1)
You can learn more about working with key-value pairs with the RDD API on the official documentation.

Scala - reducing a map based on changed key

Let's say we have the following Map data:
val testMap: Map[String, Int] = Map(
  "AAA_abc" -> 1,
  "AAA_anghesh" -> 2,
  "BBB_wfejw" -> 3,
  "BBB_qgqwe" -> 4,
  "C_fkee" -> 5)
Now I want to reduce the map by key.split("_").head and add up the values for all keys that become equal. So for this example the Map should result in:
Map(AAA -> 3, BBB -> 7, C -> 5)
What would be the correct way to do so in Scala?
I tried constructions with groupBy and reduceLeft but could not find a solution.
Here's a way to do it:
testMap.groupBy(_._1.split("_").head).mapValues(_.values.sum)
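To see what each step produces on testMap (a sketch; Map ordering may differ):
//testMap.groupBy(_._1.split("_").head)
//  Map(AAA -> Map(AAA_abc -> 1, AAA_anghesh -> 2),
//      BBB -> Map(BBB_wfejw -> 3, BBB_qgqwe -> 4),
//      C   -> Map(C_fkee -> 5))
//.mapValues(_.values.sum)
//  Map(AAA -> 3, BBB -> 7, C -> 5)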
A variation in one pass:
testMap.foldLeft(Map[String, Int]()) { (map, kv) =>
  val key = kv._1.split("_").head
  val previous = map.getOrElse(key, 0)
  map.updated(key, previous + kv._2)
}

Compare a key in a map with every string in a list and if they match then store it inside a new map with its corresponding value from the old map

For example:
The map is given below:
output = Map(execution -> 1, () -> 2, for -> 1, in -> 1, brutalized -> 1, everyone -> 1, felt -> 1, christ -> 1, all -> 1, saw -> 1, crying -> 1, it -> 1, two -> 1, a -> 2, man -> 1, i -> 3, to -> 1, cried -> 1, you -> 1, tuned -> 1, around -> 1, was -> 1, there -> 1, hours -> 1, how -> 1, televised -> 1, me -> 1, not -> 1, could -> 1, were -> 1, passion -> 1, we -> 1, sat -> 1, when -> 1, like -> 1, of -> 2, and -> 1, watched -> 1, proceedings -> 1, the -> 3) of type [Any, Int]
The list pos is created from a file which contains a huge number of words (it can be found here, in the positive-words section). In order to create that list I have used the code below:
val words = """([A-Za-z])+""".r
val pos = scala.io.Source.fromFile("Positive_words.txt").getLines.drop(35).flatMap(words.findAllIn).toList
I notice that when the list is matched with the map using either:
val result = output.filterKeys(pos.contains)
or
output.foreach { x => pos.foreach { aa => if(x._1.toString() == aa) <create a new hashmap>}}
results in something unexpected: some words which do not exist in pos are shown in the output.
Output snippet below:
val result = output.filterKeys(pos.contains)
result: scala.collection.immutable.Map[Any,Int] = Map(in -> 9, all -> 1, a -> 2, to -> 1, around -> 1, passion -> 1, like -> 1, of -> 2, the -> 3)
This prints some irrelevant items which are not even part of the pos list; words like in, all, and to are not expected.
PS: If a simple list(pos) is created WITHOUT using the above mentioned code, the output is just fine.
You can use map's filterKeys to keep only the keys that exist in the pos list, using pos.contains as the predicate:
val result = output.filterKeys(pos.contains)
result.foreach(println)
I don't know if this will help you, but in Java only primitives (boolean, int, any type whose name starts with a lower-case letter) should normally be compared with ==; all other objects, including Strings, should be compared with .equals(). (Note that in Scala, == on objects delegates to equals.)
Maybe just keep this in mind.
The use of flatMap while creating the list was causing the problem.
It was resolved by using the code below instead of what I had written earlier:
val pos = scala.io.Source.fromFile("Positive_words.txt").getLines.drop(35).toList
Thanks.

Scala Map Word Count?

This problem appears in the "Maps and Tuples" chapter of Scala for the Impatient:
Write a program that reads words from a file. Use a mutable map to count how often each word appears.
My attempt is
// source file: https://www.gutenberg.org/cache/epub/35709/pg35709.txt
scala> val words = scala.io.Source.fromFile("pg35709.txt").mkString.split("\\s+")
words: Array[String] = Array(The, Project, Gutenberg, EBook, of, Making, Your, Camera, Pay,, by, Frederick, C., Davis,
This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever.,
You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included,
with, this, eBook, or, online, at, www.gutenberg.net, Title:, Making, Your, Camera, Pay, Author:, Frederick, C., Davis,
Release, Date:, March, 29,, 2011, [EBook, #35709], Language:, English, ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK,
MAKING, YOUR, CAMERA, PAY, ***, Produced, by, The, Online, Distributed, Proofreading, Team, at, http://www.pgdp.net,
(This, file, was, produced, from, images, generously, made, available, by, The, In...
scala> val wordCount = scala.collection.mutable.HashMap[String, Int]()
wordCount: scala.collection.mutable.HashMap[String,Int] = Map()
scala> for (word <- words) {
| val count = wordCount.getOrElse(word, 0)
| wordCount(word) = count + 1
| }
scala> word
wordCount words
scala> wordCount
res1: scala.collection.mutable.HashMap[String,Int] = Map(arts -> 1, follow -> 3, request, -> 1, Lines. -> 1,
demand -> 7, 1.E.4. -> 1, PRODUCT -> 2, 470 -> 1, Chicago, -> 3, scenic -> 1, J2 -> 1, untrimmed -> 1,
photographs--not -> 1, basis. -> 1, "prints -> 1, instances. -> 1, Onion-Planter -> 1, trick -> 1,
illustrating -> 3, prefer. -> 1, detected -> 1, non-exclusive. -> 1, famous -> 1, Competition -> 2,
expense -> 1, created -> 2, renamed. -> 1, maggot -> 1, calendar-photographs, -> 1, widely-read -> 1,
Publisher, -> 1, producers -> 1, Shapes -> 1, ARTICLES -> 2, yearly -> 2, retoucher -> 1, satisfy -> 2,
agrees: -> 1, Gentleman_, -> 1, intellectual -> 2, hard -> 2, Porch. -> 1, sold.) -> 1, START -> 1, House -> 2,
welcome -> 1, Dealers' -> 1, ... -> 2, pasted -> 1, _Cosmopolitan_ -...
While I know that this works, I wanted to know if there is a Scalaesque way of achieving the same.
You can do this:
val wordCount = words.groupBy(w => w).mapValues(_.size)
The groupBy method returns a map from each result of the given function to the collection of elements that produce that result; in this case, a Map[String, Array[String]]. Then mapValues maps each Array[String] to its length.
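A quick illustration on a tiny, hypothetical input:
val words = Array("the", "project", "the")
words.groupBy(w => w)    // groups to Map(the -> Array(the, the), project -> Array(project))
     .mapValues(_.size)  // Map(the -> 2, project -> 1)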
If by Scalaesque way of achieving the same, you mean with using a mutable Map, here is a version:
scala> val data = Array("The", "Project", "Gutenberg", "EBook", "of", "Making", "Your", "The")
data: Array[String] = Array(The, Project, Gutenberg, EBook, of, Making, Your, The)
scala> val wordCount = scala.collection.mutable.HashMap[String, Int]().withDefaultValue(0)
wordCount: scala.collection.mutable.Map[String,Int] = Map()
scala> data.foreach(word => wordCount(word) += 1 )
scala> wordCount
res6: scala.collection.mutable.Map[String,Int] = Map(Making -> 1, of -> 1, Your -> 1, Project -> 1, Gutenberg -> 1, EBook -> 1, The -> 2)
The author wanted it done the mutable way in this chapter, so here is my mutable solution (done the Scala way it would be less verbose):
import java.util.Scanner

val wordsMap = new scala.collection.mutable.HashMap[String, Int]
val in = new Scanner(new java.io.File("/home/artemvlasenko/Desktop/myfile.txt"))
while (in.hasNext) {
  val word = in.next()
  val count = wordsMap.getOrElse(word, 0)
  wordsMap(word) = count + 1
}
println(wordsMap.mkString(", "))

How to convert an Iterator[Int] to a map with frequency per bin in Scala

I just learned how to convert a list of integers to a map with frequency per bin in scala.
How to convert list of integers to a map with frequency per bin in scala
However, I am working with a 22 GB file, so I am streaming through the file:
Source.fromFile("test.txt").getLines.filter(x => x.charAt(0) != '#').map(x => x.split("\t")(1)).map(x => x.toInt)
The groupBy function only works on a list, not on an iterator, I guess because it needs all the values in memory. I can't convert the iterator to a list because of the file size.
So an example would be
List(1,2,3,101,330,302).iterator
And how can I go from there to
res1: scala.collection.immutable.Map[Int,Int] = Map(100 -> 1, 300 -> 2, 0 -> 3)
You may use foldLeft:
val iter = List(1, 2, 3, 101, 330, 302).iterator
iter.foldLeft(Map[Int, Int]()) { (accum, a) =>
  val key = a / 100 * 100
  accum + (key -> (accum.getOrElse(key, 0) + 1))
}
// scala.collection.immutable.Map[Int,Int] = Map(0 -> 3, 100 -> 1, 300 -> 2)
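The same foldLeft applies directly to the streamed getLines iterator, so the whole file never has to be held in memory. A sketch, assuming the same tab-separated layout as the snippet in the question (the integer in the second column):
val binCounts = scala.io.Source.fromFile("test.txt").getLines
  .filter(x => x.charAt(0) != '#')
  .map(x => x.split("\t")(1).toInt)
  .foldLeft(Map[Int, Int]()) { (accum, a) =>
    val key = a / 100 * 100
    accum + (key -> (accum.getOrElse(key, 0) + 1))
  }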