For example:
The map is given below:
output = Map(execution -> 1, () -> 2, for -> 1, in -> 1, brutalized -> 1, everyone -> 1, felt -> 1, christ -> 1, all -> 1, saw -> 1, crying -> 1, it -> 1, two -> 1, a -> 2, man -> 1, i -> 3, to -> 1, cried -> 1, you -> 1, tuned -> 1, around -> 1, was -> 1, there -> 1, hours -> 1, how -> 1, televised -> 1, me -> 1, not -> 1, could -> 1, were -> 1, passion -> 1, we -> 1, sat -> 1, when -> 1, like -> 1, of -> 2, and -> 1, watched -> 1, proceedings -> 1, the -> 3) of type [Any, Int]
The list pos is created from a file that contains a huge number of words. It can be found here in the positive words section. To create that list I used the code below:
val words = """([A-Za-z])+""".r
val pos = scala.io.Source.fromFile("Positive_words.txt").getLines.drop(35).flatMap(words.findAllIn).toList
I notice that when the list is matched with the map using either:
val result = output.filterKeys(pos.contains)
or
output.foreach { x => pos.foreach { aa => if(x._1.toString() == aa) <create a new hashmap>}}
the result is unexpected: some words appear in the output even though they do not exist in pos.
Output snippet below:
val result = output.filterKeys(pos.contains)
result: scala.collection.immutable.Map[Any,Int] = Map(in -> 9, all -> 1, a -> 2, to -> 1, around -> 1, passion -> 1, like -> 1, of -> 2, the -> 3)
This prints some irrelevant items that are not even part of the list pos.
Words like in, all, and to are not expected.
PS: If a simple list(pos) is created WITHOUT using the above mentioned code, the output is just fine.
You can use map's filterKeys to keep only the keys that exist in the pos list, using pos.contains as the predicate:
val result = output.filterKeys(pos.contains)
result.foreach(println)
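As a minimal sketch of the same idea, the lookup can be done against a Set instead of a List, which avoids scanning the whole list for every key. The sample values of output and pos below are hypothetical stand-ins for the data in the question, and the keys are assumed to be plain Strings.

```scala
// Hypothetical sample data standing in for the question's `output` and `pos`.
val output: Map[String, Int] = Map("passion" -> 1, "the" -> 3, "like" -> 1, "in" -> 1)
val pos: List[String] = List("passion", "like", "love")

// A Set offers fast membership checks, and a Set[String] can be applied like a predicate.
val posSet: Set[String] = pos.toSet

// Keep only the entries whose key is one of the positive words.
val result: Map[String, Int] = output.filter { case (word, _) => posSet(word) }

result.foreach(println)
```

Using filter with a pattern match builds a strict Map, whereas filterKeys may hand back a lazy view depending on the Scala version.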
I don't know if this helps, but in Java only primitives (boolean, int, every type whose name begins with a lower-case letter) should be compared with "==". All other objects, including String, should be compared with .equals().
In Scala, however, == on objects delegates to equals, so comparing strings with == behaves as expected. Maybe just keep the difference in mind.
Using flatMap while creating the list caused the problem.
I resolved it by using the code below instead of what I had written earlier:
val pos = scala.io.Source.fromFile("Positive_words.txt").getLines.drop(35).toList
Thanks.
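A possible explanation, sketched under the assumption that the word file contains hyphenated entries (the sample lines below are hypothetical, not taken from the file): the regex matches runs of letters only, so flatMap with findAllIn breaks such entries into fragments like all, to, and around, which then match keys in the map.

```scala
val words = """([A-Za-z])+""".r

// Hypothetical lines, assumed to resemble entries in the word list.
val lines = List("all-around", "well-to-do", "passion")

// findAllIn returns every run of letters, so each hyphen splits an entry.
val viaRegex: List[String] = lines.flatMap(line => words.findAllIn(line).toList)

// Keeping each line whole, as in the fix, preserves the entries.
val viaLines: List[String] = lines

println(viaRegex)
println(viaLines)
```

With whole lines, a lookup for "to" or "all" fails as intended, because only the complete entry is stored.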
This problem appears in the "Maps and Tuples" chapter of Scala for the Impatient:
Write a program that reads words from a file. Use a mutable map to count how often each word appears.
My attempt is
// source file: https://www.gutenberg.org/cache/epub/35709/pg35709.txt
scala> val words = scala.io.Source.fromFile("pg35709.txt").mkString.split("\\s+")
words: Array[String] = Array(The, Project, Gutenberg, EBook, of, Making, Your, Camera, Pay,, by, Frederick, C., Davis,
This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever.,
You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included,
with, this, eBook, or, online, at, www.gutenberg.net, Title:, Making, Your, Camera, Pay, Author:, Frederick, C., Davis,
Release, Date:, March, 29,, 2011, [EBook, #35709], Language:, English, ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK,
MAKING, YOUR, CAMERA, PAY, ***, Produced, by, The, Online, Distributed, Proofreading, Team, at, http://www.pgdp.net,
(This, file, was, produced, from, images, generously, made, available, by, The, In...
scala> val wordCount = scala.collection.mutable.HashMap[String, Int]()
wordCount: scala.collection.mutable.HashMap[String,Int] = Map()
scala> for (word <- words) {
| val count = wordCount.getOrElse(word, 0)
| wordCount(word) = count + 1
| }
scala> wordCount
res1: scala.collection.mutable.HashMap[String,Int] = Map(arts -> 1, follow -> 3, request, -> 1, Lines. -> 1,
demand -> 7, 1.E.4. -> 1, PRODUCT -> 2, 470 -> 1, Chicago, -> 3, scenic -> 1, J2 -> 1, untrimmed -> 1,
photographs--not -> 1, basis. -> 1, "prints -> 1, instances. -> 1, Onion-Planter -> 1, trick -> 1,
illustrating -> 3, prefer. -> 1, detected -> 1, non-exclusive. -> 1, famous -> 1, Competition -> 2,
expense -> 1, created -> 2, renamed. -> 1, maggot -> 1, calendar-photographs, -> 1, widely-read -> 1,
Publisher, -> 1, producers -> 1, Shapes -> 1, ARTICLES -> 2, yearly -> 2, retoucher -> 1, satisfy -> 2,
agrees: -> 1, Gentleman_, -> 1, intellectual -> 2, hard -> 2, Porch. -> 1, sold.) -> 1, START -> 1, House -> 2,
welcome -> 1, Dealers' -> 1, ... -> 2, pasted -> 1, _Cosmopolitan_ -...
While I know that this works, I wanted to know if there is a more Scala-esque way of achieving the same result.
You can do this:
val wordCount = words.groupBy(w => w).mapValues(_.size)
The groupBy method returns a map from each result of the given function to the collection of elements that produce that result. In this case that is a Map[String, Array[String]]. Then mapValues replaces each Array[String] with its length.
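The two steps can be traced on a small input. This is only a sketch: the array below is hypothetical sample data rather than the Gutenberg file, and map with a pattern match is used in place of mapValues, since mapValues produces a lazily evaluated view and is deprecated on Map in Scala 2.13.

```scala
// Small hypothetical input instead of the Gutenberg file.
val words = Array("the", "camera", "the", "pay", "camera", "the")

// Step 1: groupBy(identity) collects equal words under one key.
val grouped: Map[String, Array[String]] = words.groupBy(identity)

// Step 2: replace each group with its size to get the counts.
val wordCount: Map[String, Int] =
  grouped.map { case (word, occurrences) => word -> occurrences.length }

println(wordCount)
```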
If by a Scala-esque way of achieving the same you still mean using a mutable Map, here is a version:
scala> val data = Array("The", "Project", "Gutenberg", "EBook", "of", "Making", "Your", "The")
data: Array[String] = Array(The, Project, Gutenberg, EBook, of, Making, Your, The)
scala> val wordCount = scala.collection.mutable.HashMap[String, Int]().withDefaultValue(0)
wordCount: scala.collection.mutable.Map[String,Int] = Map()
scala> data.foreach(word => wordCount(word) += 1 )
scala> wordCount
res6: scala.collection.mutable.Map[String,Int] = Map(Making -> 1, of -> 1, Your -> 1, Project -> 1, Gutenberg -> 1, EBook -> 1, The -> 2)
The author wanted it done the mutable way in this chapter, so here is my mutable solution (the idiomatic Scala way should be less verbose):
import java.util.Scanner

val wordsMap = new scala.collection.mutable.HashMap[String, Int]
val in = new Scanner(new java.io.File("/home/artemvlasenko/Desktop/myfile.txt"))
while (in.hasNext) {
  // read exactly one word per iteration so that no word is skipped
  val word = in.next()
  // getOrElse supplies 0 the first time a word is seen
  val count = wordsMap.getOrElse(word, 0)
  wordsMap(word) = count + 1
}
in.close()
println(wordsMap.mkString(", "))
I am having an issue with appending pairs to an existing Map. Once I reach the fifth pair of the Map, the Map reorders itself. The order is correct with 4 pairs, but as soon as the 5th is added it shifts itself. See example below (assuming I built the 4 pair Map one pair at a time.):
scala> var a = Map("a1" -> 1, "a2" -> 1, "a3" -> 1, "a4" -> 1)
a: scala.collection.immutable.Map[String,Int] = Map(a1 -> 1, a2 -> 1, a3 -> 1, a4 -> 1)
scala> a += ("a5" -> 1)
scala> a
res26: scala.collection.immutable.Map[String,Int] = Map(a5 -> 1, a4 -> 1, a3 -> 1, a1 -> 1, a2 -> 1)
The added fifth element jumped to the front of the Map and shifted the others around. Is there a way to keep the elements in order (1, 2, 3, 4, 5)?
Thanks
By default, Scala's immutable.Map uses a HashMap once it holds more than four entries.
From http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html:
This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time
So a map is not really a table that stores "a1" -> 1 in the position where you inserted it; each entry is placed according to hash("a1"). Iteration order therefore follows the hashes of the keys rather than the order in which you added them.
As was recommended in the comments, use LinkedHashMap or ListMap:
Scala Map implementation keeping entries in insertion order?
PS: You might be interested in reading this article: http://howtodoinjava.com/2012/10/09/how-hashmap-works-in-java/
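A minimal sketch of the two suggested alternatives, reusing the keys from the question. Both scala.collection.immutable.ListMap and scala.collection.mutable.LinkedHashMap iterate in insertion order; the variable names are chosen here for illustration only.

```scala
import scala.collection.immutable.ListMap
import scala.collection.mutable.LinkedHashMap

// Immutable option: ListMap iterates in insertion order.
val a = ListMap("a1" -> 1, "a2" -> 1, "a3" -> 1, "a4" -> 1)
val a5 = a + ("a5" -> 1)

// Mutable option: LinkedHashMap also remembers insertion order.
val b = LinkedHashMap("a1" -> 1, "a2" -> 1, "a3" -> 1, "a4" -> 1)
b += ("a5" -> 1)

println(a5.keys.mkString(", ")) // a1, a2, a3, a4, a5
println(b.keys.mkString(", "))  // a1, a2, a3, a4, a5
```

ListMap lookups are linear in the number of entries, so for larger maps that are updated in place LinkedHashMap is usually the more practical choice.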
I am learning Spark (in Scala) and have been trying to figure out how to count all the words on each line of a file.
I am working with a dataset where each line contains a tab-separated document_id and the full text of the document
doc_1 <full-text>
doc_2 <full-text>
etc..
Here is a toy example I have in a file called docs.txt
doc_1 new york city new york state
doc_2 rain rain go away
I think what I need to do is transform the lines into tuples containing
((doc_id, word), 1)
and then call reduceByKey() to sum the 1's. I wrote the following:
val file = sc.textFile("docs.txt")
val tuples = file.map(_.split("\t"))
.map( x => (x(1).split("\\s+")
.map(y => ((x(0), y), 1 )) ) )
Which does give me the intermediate representation I think I need:
tuples.collect
res0: Array[Array[((String, String), Int)]] = Array(Array(((doc_1,new),1), ((doc_1,york),1), ((doc_1,city),1), ((doc_1,new),1), ((doc_1,york),1), ((doc_1,state),1)), Array(((doc_2,rain),1), ((doc_2,rain),1), ((doc_2,go),1), ((doc_2,away),1)))
But if I call reduceByKey on tuples it produces an error:
tuples.reduceByKey(_ + _)
<console>:21: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[Array[((String, String), Int)]]
tuples.reduceByKey(_ + _)
I can't seem to get my head around how to do this. I think I need to do a reduce on an array inside an array. I have tried many different things but like the above keep getting errors and making no progress.
Any guidance / advice on this would be much appreciated.
Note: I know that there is a word count example on https://spark.apache.org/examples.html showing how to get counts for all of the words in a file. But that is for an entire input file. I am talking about getting counts per-document where each document is on a different line.
reduceByKey expects type RDD[(K,V)] whereas the instant you perform the split in the first map, you end up with an RDD[Array[...]], which is not the type signature that is needed. You can rework your current solution as below...but it probably will not be as performant (read after the code for a rework using flatMap):
//Dummy data load
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Split the data on tabs to get an array of (key, line) tuples
val firstPass = file.map(_.split("\t"))
//Split the line inside each tuple so you now have an array of (key, Array(...))
//Where the inner array is full of (word, 1) tuples
val secondPass = firstPass.map(x=>(x(0), x(1).split("\\s+").map(y=>(y,1))))
//Now group the words and re-map so that the inner tuple is the wordcount
val finalPass = secondPass.map(x=>(x._1, x._2.groupBy(_._1).map(y=>(y._1,y._2.size))))
Probably the better solution:
If you want to keep your current structure, then you need to change to using a Tuple2 from the start and then using a flatMap after:
//Load your data
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Turn the data into a key-value RDD (I suggest caching the split, kept 1 line for SO)
val firstPass = file.map(x=>(x.split("\t")(0), x.split("\t")(1)))
//Change your key to be a Tuple2[String,String] and the value is the count
val tuples = firstPass.flatMap(x=>x._2.split("\\s+").map(y=>((x._1, y), 1)))
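The shape of this transformation can be checked without a cluster. The sketch below uses plain Scala collections rather than Spark, with groupBy plus a sum standing in for reduceByKey(_ + _); on an RDD of ((docId, word), 1) pairs the final step would be the reduceByKey call itself. The input reuses the toy data from the question.

```scala
// Same toy data as in the question, with a tab between id and text.
val lines = List("doc_1\tnew york city new york state", "doc_2\train rain go away")

// flatMap flattens the per-line results into one collection of ((docId, word), 1) pairs.
val tuples: List[((String, String), Int)] = lines.flatMap { line =>
  val parts = line.split("\t", 2)
  parts(1).split("\\s+").map(word => ((parts(0), word), 1)).toList
}

// Stand-in for reduceByKey(_ + _): group by the (docId, word) key and sum the ones.
val counts: Map[(String, String), Int] =
  tuples.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2).sum }

counts.foreach(println)
```

Because flatMap yields one flat collection of key-value pairs instead of a collection of arrays, the result has the pair shape that a per-key reduction needs.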
Here is a quick demo on a VERY small dataset.
scala> val file = sc.textFile("../README.md")
15/02/02 00:32:38 INFO MemoryStore: ensureFreeSpace(32792) called with curMem=45512, maxMem=278302556
15/02/02 00:32:38 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 32.0 KB, free 265.3 MB)
file: org.apache.spark.rdd.RDD[String] = ../README.md MappedRDD[7] at textFile at <console>:12
scala> val splitLines = file.map{ line => line.split(" ") }
splitLines: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[9] at map at <console>:14
scala> splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
res19: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[10] at map at <console>:17
scala> val result = splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
result: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[11] at map at <console>:16
scala> result.take(10).foreach(println)
Map(# -> 1, Spark -> 1, Apache -> 1)
Map( -> 1)
Map(for -> 1, is -> 1, Data. -> 1, system -> 1, a -> 1, provides -> 1, computing -> 1, cluster -> 1, general -> 1, Spark -> 1, It -> 1, fast -> 1, Big -> 1, and -> 1)
Map(in -> 1, Scala, -> 1, optimized -> 1, APIs -> 1, that -> 1, Java, -> 1, high-level -> 1, an -> 1, Python, -> 1, and -> 2, engine -> 1)
Map(for -> 1, data -> 1, a -> 1, also -> 1, general -> 1, supports -> 2, It -> 1, graphs -> 1, analysis. -> 1, computation -> 1)
Map(for -> 1, set -> 1, tools -> 1, rich -> 1, Spark -> 1, structured -> 1, including -> 1, of -> 1, and -> 1, higher-level -> 1, SQL -> 2)
Map(GraphX -> 1, for -> 2, processing, -> 2, data -> 1, MLlib -> 1, learning, -> 1, machine -> 1, graph -> 1)
Map(for -> 1, Streaming -> 1, processing. -> 1, stream -> 1, Spark -> 1, and -> 1)
Map( -> 1)
Map(<http://spark.apache.org/> -> 1)
Input file content:
a b c d
a b e f
h i j l
m h i l
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> textFile = javaSparkContext.textFile("C:\\Users\\arun7.gupta\\Desktop\\Spark\\word.txt");
/**Splitting the word with space*/
textFile = textFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
/**Pair the word with count*/
JavaPairRDD<String, Integer> mapToPair = textFile.mapToPair(w -> new Tuple2<>(w, 1));
/**Reduce the pair with key and add count*/
JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey((wc1, wc2) -> wc1 + wc2);
System.out.println(reduceByKey.collectAsMap());
javaSparkContext.close();
}
Output:
{e=1, h=2, b=2, j=1, m=1, d=1, a=2, i=2, c=1, l=2, f=1}
scala> var test2 : Map[String , String] = Map("a"->"b","c"->"d")
test2: Map[String,String] = Map(a -> b, c -> d)
test2 = test2 + ("e"->"f" , "g"->"h")
test2: Map[String,String] = Map(a -> b, c -> d, e -> f, g -> h)
And so on. As far as I know, Map is not supposed to preserve insertion order (for that purpose we have LinkedHashMap), so why do the results show the order being preserved? Is this a mere coincidence, or is there more than meets the eye?
Thanks in advance!
It's a coincidence that holds only for the first 4 items.
val m = Map('a' -> 1, 'b' -> 2, 'c' -> 3, 'd' -> 4)
// m: immutable.Map[Char,Int] = Map(a -> 1, b -> 2, c -> 3, d -> 4)
m + ('e' -> 5)
// immutable.Map[Char,Int] = Map(e -> 5, a -> 1, b -> 2, c -> 3, d -> 4)
The reason is that there are special optimized implementations for small maps, which do preserve insertion order (e.g. when appending to a Map of one pair), but once you cross this boundary the order is no longer preserved.
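The switch between implementations can be observed by printing the runtime class of the map. This is a sketch that relies on implementation details (the dedicated small-map classes), which are not part of the documented contract and may differ between Scala versions.

```scala
val small = Map("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4)
val larger = small + ("e" -> 5)

// Implementation detail: up to four entries are held in dedicated small-map classes.
println(small.getClass.getName)
println(larger.getClass.getName)

// The four-entry map iterates in insertion order...
println(small.keys.mkString(", "))
// ...but no order is promised once the hash-based implementation takes over.
println(larger.keys.mkString(", "))
```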
Coincidental. The ordering may be preserved, but preservation of ordering is not guaranteed and should not be relied upon. In fact, one should not consider maps to be ordered at all.