This problem appeared in the "Maps and Tuples" chapter of Scala for the Impatient:
Write a program that reads words from a file. Use a mutable map to count how often each word appears.
My attempt is
// source file: https://www.gutenberg.org/cache/epub/35709/pg35709.txt
scala> val words = scala.io.Source.fromFile("pg35709.txt").mkString.split("\\s+")
words: Array[String] = Array(The, Project, Gutenberg, EBook, of, Making, Your, Camera, Pay,, by, Frederick, C., Davis,
This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever.,
You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included,
with, this, eBook, or, online, at, www.gutenberg.net, Title:, Making, Your, Camera, Pay, Author:, Frederick, C., Davis,
Release, Date:, March, 29,, 2011, [EBook, #35709], Language:, English, ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK,
MAKING, YOUR, CAMERA, PAY, ***, Produced, by, The, Online, Distributed, Proofreading, Team, at, http://www.pgdp.net,
(This, file, was, produced, from, images, generously, made, available, by, The, In...
scala> val wordCount = scala.collection.mutable.HashMap[String, Int]()
wordCount: scala.collection.mutable.HashMap[String,Int] = Map()
scala> for (word <- words) {
| val count = wordCount.getOrElse(word, 0)
| wordCount(word) = count + 1
| }
scala> wordCount
res1: scala.collection.mutable.HashMap[String,Int] = Map(arts -> 1, follow -> 3, request, -> 1, Lines. -> 1,
demand -> 7, 1.E.4. -> 1, PRODUCT -> 2, 470 -> 1, Chicago, -> 3, scenic -> 1, J2 -> 1, untrimmed -> 1,
photographs--not -> 1, basis. -> 1, "prints -> 1, instances. -> 1, Onion-Planter -> 1, trick -> 1,
illustrating -> 3, prefer. -> 1, detected -> 1, non-exclusive. -> 1, famous -> 1, Competition -> 2,
expense -> 1, created -> 2, renamed. -> 1, maggot -> 1, calendar-photographs, -> 1, widely-read -> 1,
Publisher, -> 1, producers -> 1, Shapes -> 1, ARTICLES -> 2, yearly -> 2, retoucher -> 1, satisfy -> 2,
agrees: -> 1, Gentleman_, -> 1, intellectual -> 2, hard -> 2, Porch. -> 1, sold.) -> 1, START -> 1, House -> 2,
welcome -> 1, Dealers' -> 1, ... -> 2, pasted -> 1, _Cosmopolitan_ -...
While I know that this works, I wanted to know if there is a more Scala-esque way of achieving the same thing.
You can do this:
val wordCount = words.groupBy(w => w).mapValues(_.size)
The groupBy method returns a map from each result of the given function to the collection of elements that produce that result; in this case, a Map[String, Array[String]]. Then mapValues maps each Array[String] to its length.
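To make the intermediate shape concrete, here is a rough sketch with a tiny hand-made array (illustrative values, not taken from the book's file):
val sample = Array("the", "cat", "the")
val grouped = sample.groupBy(w => w)     // Map[String, Array[String]]: "the" -> Array("the", "the"), "cat" -> Array("cat")
val counts  = grouped.mapValues(_.size)  // Map[String, Int]: "the" -> 2, "cat" -> 1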
If by a Scala-esque way of achieving the same you mean one still using a mutable Map, here is a version:
scala> val data = Array("The", "Project", "Gutenberg", "EBook", "of", "Making", "Your", "The")
data: Array[String] = Array(The, Project, Gutenberg, EBook, of, Making, Your, The)
scala> val wordCount = scala.collection.mutable.HashMap[String, Int]().withDefaultValue(0)
wordCount: scala.collection.mutable.Map[String,Int] = Map()
scala> data.foreach(word => wordCount(word) += 1 )
scala> wordCount
res6: scala.collection.mutable.Map[String,Int] = Map(Making -> 1, of -> 1, Your -> 1, Project -> 1, Gutenberg -> 1, EBook -> 1, The -> 2)
The author wanted it done with a mutable map in this chapter, so here is my mutable solution (a more idiomatic Scala version would be less verbose; a sketch follows the code).
import java.util.Scanner

var wordsMap = new scala.collection.mutable.HashMap[String, Int]
val in = new Scanner(new java.io.File("/home/artemvlasenko/Desktop/myfile.txt"))
while (in.hasNext) {
  val word = in.next()
  val count = wordsMap.getOrElse(word, 0)
  wordsMap(word) = count + 1
}
println(wordsMap.mkString(", "))
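For comparison, a less verbose take on the same mutable approach, reusing scala.io.Source from the question and the withDefaultValue trick shown above (the file name myfile.txt is assumed to match the path used earlier):
import scala.collection.mutable
import scala.io.Source

val wordsMap = mutable.HashMap[String, Int]().withDefaultValue(0)
for (word <- Source.fromFile("myfile.txt").mkString.split("\\s+"))
  wordsMap(word) += 1  // the default of 0 removes the need for getOrElse
println(wordsMap.mkString(", "))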
Let's say I have the following Map data:
val testMap: Map[String, Int] = Map("AAA_abc" -> 1,
"AAA_anghesh" -> 2,
"BBB_wfejw" -> 3,
"BBB_qgqwe" -> 4,
"C_fkee" -> 5)
Now I want to reduce the map by key.split("_").head and add up the values for the keys that become equal. So for this example the Map should result in:
Map(AAA -> 3, BBB -> 7, C -> 5)
What would be the correct way to do so in Scala?
I tried constructions with groupBy and reduceLeft but could not find a solution.
Here's a way to do it:
testMap.groupBy(_._1.split("_").head).mapValues(_.values.sum)
A variation in one pass:
testMap.foldLeft(Map[String, Int]()) { (map, kv) =>
  val key = kv._1.split("_").head
  val previous = map.getOrElse(key, 0)
  map.updated(key, previous + kv._2)
}
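Both versions produce Map(AAA -> 3, BBB -> 7, C -> 5) for the example above. On Scala 2.13 or later you could also combine the grouping and summing in one call with groupMapReduce (a sketch, same result):
testMap.groupMapReduce(_._1.split("_").head)(_._2)(_ + _)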
For example, the map is given below:
output = Map(execution -> 1, () -> 2, for -> 1, in -> 1, brutalized -> 1, everyone -> 1, felt -> 1, christ -> 1, all -> 1, saw -> 1, crying -> 1, it -> 1, two -> 1, a -> 2, man -> 1, i -> 3, to -> 1, cried -> 1, you -> 1, tuned -> 1, around -> 1, was -> 1, there -> 1, hours -> 1, how -> 1, televised -> 1, me -> 1, not -> 1, could -> 1, were -> 1, passion -> 1, we -> 1, sat -> 1, when -> 1, like -> 1, of -> 2, and -> 1, watched -> 1, proceedings -> 1, the -> 3) of type [Any, Int]
The List(pos) is created from a file which contains a huge number of words. It can be found here in the positive words section. In order to create that list I have used the code below:
val words = """([A-Za-z])+""".r
val pos = scala.io.Source.fromFile("Positive_words.txt").getLines.drop(35).flatMap(words.findAllIn).toList
I notice that matching the list against the map using either:
val result = output.filterKeys(pos.contains)
or
output.foreach { x => pos.foreach { aa => if(x._1.toString() == aa) <create a new hashmap>}}
produces something unexpected. Some words that do not exist in pos are shown in the output.
Output snippet below:
val result = output.filterKeys(pos.contains)
result: scala.collection.immutable.Map[Any,Int] = Map(in -> 9, all -> 1, a -> 2, to -> 1, around -> 1, passion -> 1, like -> 1, of -> 2, the -> 3)
This is printing some irrelevant items which are not even part of the pos list; words like in, all, and to are not expected.
PS: If a simple list (pos) is created WITHOUT using the above-mentioned code, the output is just fine.
You can use map's filterKeys to keep only the keys that exist in the pos list, using pos.contains as the predicate:
val result = output.filterKeys(pos.contains)
result.foreach(println)
I don't know if this might help you, but in Java normally only primitives (boolean, int, every type beginning with a lower-case letter) are compared with "==", while all other objects, including String, are compared with .equals(). In Scala, however, == on references delegates to equals, so comparing strings with == is not the problem here.
Maybe just keep this in mind.
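For what it's worth, a small illustrative snippet of the Scala comparison semantics (the values here are made up):
val a = "scala"
val b = new String("scala")  // a distinct String object with the same contents
a == b  // true: in Scala, == on references delegates to equals
a eq b  // false: eq is reference identity, the analogue of Java's ==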
The use of flatMap while creating the list was causing the problem.
It was resolved by using the code below instead of what I had written earlier:
val pos = scala.io.Source.fromFile("Positive_words.txt").getLines.drop(35).toList
Thanks.
I'm trying to find a cleaner way to update nested immutable structures in Scala. I think I'm looking for something similar to assoc-in in Clojure. I'm not sure how much types factor into this.
For example, in Clojure, to update the "city" attribute of a nested map I'd do:
> (def person {:name "john", :dob "1990-01-01", :home-address {:city "norfolk", :state "VA"}})
#'user/person
> (assoc-in person [:home-address :city] "richmond")
{:name "john", :dob "1990-01-01", :home-address {:state "VA", :city "richmond"}}
What are my options in Scala?
val person = Map("name" -> "john", "dob" -> "1990-01-01",
"home-address" -> Map("city" -> "norfolk", "state" -> "VA"))
As indicated in the other answer, you can leverage case classes to get cleaner, typed data objects. But in case what you need is simply to update a map:
val m = Map("A" -> 1, "B" -> 2)
val m2 = m + ("A" -> 3)
The result (in a worksheet):
m: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2)
m2: scala.collection.immutable.Map[String,Int] = Map(A -> 3, B -> 2)
The + operator on a Map adds the new key-value pair, overwriting the value if the key already exists. Note, though, that because the Map is immutable, + returns a new map, so you assign the result to a new val instead of changing the original.
Because, in your example, you're rewriting a nested value, doing this manually becomes somewhat more onerous:
val m = Map("A" -> 1, "B" -> Map("X" -> 2, "Y" -> 4))
val m2 = m + ("B" -> Map("X" -> 3))
This yields some loss-of-data (the nested Y value disappears):
m: scala.collection.immutable.Map[String,Any] = Map(A -> 1, B -> Map(X -> 2, Y -> 4))
m2: scala.collection.immutable.Map[String,Any] = Map(A -> 1, B -> Map(X -> 3)) // Note that 'Y' has gone away.
This forces you to copy the original nested value, update it, and then put it back:
val m = Map("A" -> 1, "B" -> Map("X" -> 2, "Y" -> 4))
val b = m.get("B") match {
case Some(b: Map[String, Any]) => b + ("X" -> 3) // Will update `X` while keeping other key-value pairs
case None => Map("X" -> 3)
}
val m2 = m + ("B" -> b)
This yields the 'expected' result, but is obviously a lot of code:
m: scala.collection.immutable.Map[String,Any] = Map(A -> 1, B -> Map(X -> 2, Y -> 4))
b: scala.collection.immutable.Map[String,Any] = Map(X -> 3, Y -> 4)
m2: scala.collection.immutable.Map[String,Any] = Map(A -> 1, B -> Map(X -> 3, Y -> 4))
In short, with any immutable data structure, 'updating' it really means copying the pieces you want to keep and swapping in updated values where appropriate. If the structure is complicated, this can get onerous; hence the recommendation that #0___ gave with, say, Monocle.
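Since the question asks specifically for an assoc-in equivalent over plain maps, here is a minimal hand-rolled sketch of such a helper (the name assocIn and the Map[String, Any] shape are assumptions for illustration, not a library API):
def assocIn(m: Map[String, Any], path: List[String], value: Any): Map[String, Any] =
  path match {
    case Nil         => m
    case key :: Nil  => m.updated(key, value)
    case key :: rest =>
      // Descend into the nested map if present, otherwise start a fresh one
      val child = m.getOrElse(key, Map.empty[String, Any]) match {
        case nested: Map[_, _] => nested.asInstanceOf[Map[String, Any]]
        case _                 => Map.empty[String, Any]
      }
      m.updated(key, assocIn(child, rest, value))
  }

// Mirrors the Clojure example from the question:
// assocIn(person, List("home-address", "city"), "richmond")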
Scala is a statically typed language, so you may first want to increase the safety of your code by moving away from any-string-to-any-string.
case class Address(city: String, state: String)
case class Person(name: String, dob: java.util.Date, homeAddress: Address)
(Yes, there are better alternatives for java.util.Date).
Then you create an update like this:
val person = Person(name = "john", dob = new java.util.Date(90, 0, 1),
homeAddress = Address(city = "norfolk", state = "VA"))
person.copy(homeAddress = person.homeAddress.copy(city = "richmond"))
To avoid this nested copy, you would use a lens library, like Monocle or Quicklens (there are many others).
import com.softwaremill.quicklens._
person.modify(_.homeAddress.city).setTo("richmond")
The other two answers nicely sum up the importance of correctly modelling your problem so that you don't end up dealing with a Map[String, Any]-style collection.
Just adding my two cents here with a brute-force solution using Scala's quite powerful function pipelining and higher-order functions. The ugly asInstanceOf cast is needed because the Map values are of different types, so Scala infers the map's type as Map[String, Any].
val person: Map[String,Any] = Map("name" -> "john", "dob" -> "1990-01-01", "home-address" -> Map("city" -> "norfolk", "state" -> "VA"))
val newperson = person.map { case (k, v) =>
  if (k == "home-address") k -> v.asInstanceOf[Map[String, String]].updated("city", "Virginia")
  else k -> v
}
I'm new to learning Scala and would appreciate any thoughts on an idiomatic way to do the following. I want to count occurrences of sequential letter pairs in a word.
For example, for the word "home", the output might be Map("ho"->1,"om"->1,"me"->1). And for 'lulu', the result would be Map("lu"->2, "ul"->1 ).
So performing a simple single-letter count might be done as
"abracadabra".map(s => s).groupBy(identity).mapValues(_.length)
But I'm stumped as to how to add in the two-letter component of this problem. Thanks for your thoughts.
You can use .sliding:
scala> "abracadabra".sliding(2).toList.groupBy(identity).mapValues(_.length)
res3: scala.collection.immutable.Map[String,Int] =
Map(br -> 2, ca -> 1, ab -> 2, ra -> 2, ac -> 1, da -> 1, ad -> 1)
scala> "lulu".sliding(2).toList.groupBy(identity).mapValues(_.length)
res4: scala.collection.immutable.Map[String,Int] = Map(ul -> 1, lu -> 2)
From the docs:
sliding: Groups elements in fixed-size blocks by passing a "sliding window" over them.
You should use sliding(2):
"abracadabra".sliding(2).toVector.map(s => s).groupBy(identity).mapValues(_.length)
// Map(br -> 2, ca -> 1, ab -> 2, ra -> 2, ac -> 1, da -> 1, ad -> 1)
I am learning Spark (in Scala) and have been trying to figure out how to count all the words on each line of a file.
I am working with a dataset where each line contains a tab-separated document_id and the full text of the document:
doc_1 <full-text>
doc_2 <full-text>
etc..
Here is a toy example I have in a file called doc.txt
doc_1 new york city new york state
doc_2 rain rain go away
I think what I need to do is transform the data into tuples containing
((doc_id, word), 1)
and then call reduceByKey() to sum the 1's. I wrote the following:
val file = sc.textFile("docs.txt")
val tuples = file.map(_.split("\t"))
.map( x => (x(1).split("\\s+")
.map(y => ((x(0), y), 1 )) ) )
Which does give me the intermediate representation I think I need:
tuples.collect
res0: Array[Array[((String, String), Int)]] = Array(Array(((doc_1,new),1), ((doc_1,york),1), ((doc_1,city),1), ((doc_1,new),1), ((doc_1,york),1), ((doc_1,state),1)), Array(((doc_2,rain),1), ((doc_2,rain),1), ((doc_2,go),1), ((doc_2,away),1)))
But if call reduceByKey on tuples it produces an error
tuples.reduceByKey(_ + _)
<console>:21: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[Array[((String, String), Int)]]
tuples.reduceByKey(_ + _)
I can't seem to get my head around how to do this. I think I need to do a reduce on an array inside an array. I have tried many different things but, like the above, keep getting errors and making no progress.
Any guidance / advice on this would be much appreciated.
Note: I know that there is a word count example on https://spark.apache.org/examples.html showing how to get counts for all of the words in a file. But that is for an entire input file. I am talking about getting counts per-document where each document is on a different line.
reduceByKey expects type RDD[(K,V)] whereas the instant you perform the split in the first map, you end up with an RDD[Array[...]], which is not the type signature that is needed. You can rework your current solution as below...but it probably will not be as performant (read after the code for a rework using flatMap):
//Dummy data load
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Split the data on tabs to get an array of (key, line) tuples
val firstPass = file.map(_.split("\t"))
//Split the line inside each tuple so you now have an array of (key, Array(...))
//Where the inner array is full of (word, 1) tuples
val secondPass = firstPass.map(x=>(x(0), x(1).split("\\s+").map(y=>(y,1))))
//Now group the words and re-map so that the inner tuple is the wordcount
val finalPass = secondPass.map(x=>(x._1, x._2.groupBy(_._1).map(y=>(y._1,y._2.size))))
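With the dummy data above, finalPass.collect should produce something like (ordering may vary):
Array((doc_1,Map(new -> 1, york -> 1, city -> 1)), (doc_2,Map(rain -> 2, go -> 1, away -> 1)))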
Probably the better solution:
If you want to keep your current structure, then you need to switch to a Tuple2 from the start and then use a flatMap afterwards:
//Load your data
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Turn the data into a key-value RDD (I suggest caching the split, kept 1 line for SO)
val firstPass = file.map(x=>(x.split("\t")(0), x.split("\t")(1)))
//Change your key to be a Tuple2[String,String] and the value is the count
val tuples = firstPass.flatMap(x=>x._2.split("\\s+").map(y=>((x._1, y), 1)))
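From here the reduceByKey the question was aiming for should type-check, since the RDD now has the shape RDD[((String, String), Int)] (a quick sketch):
val counts = tuples.reduceByKey(_ + _)
// counts.collect should contain entries such as ((doc_1,new),1) and ((doc_2,rain),2)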
Here is a quick demo on a VERY small dataset.
scala> val file = sc.textFile("../README.md")
15/02/02 00:32:38 INFO MemoryStore: ensureFreeSpace(32792) called with curMem=45512, maxMem=278302556
15/02/02 00:32:38 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 32.0 KB, free 265.3 MB)
file: org.apache.spark.rdd.RDD[String] = ../README.md MappedRDD[7] at textFile at <console>:12
scala> val splitLines = file.map{ line => line.split(" ") }
splitLines: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[9] at map at <console>:14
scala> splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
res19: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[10] at map at <console>:17
scala> val result = splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
result: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[11] at map at <console>:16
scala> result.take(10).foreach(println)
Map(# -> 1, Spark -> 1, Apache -> 1)
Map( -> 1)
Map(for -> 1, is -> 1, Data. -> 1, system -> 1, a -> 1, provides -> 1, computing -> 1, cluster -> 1, general -> 1, Spark -> 1, It -> 1, fast -> 1, Big -> 1, and -> 1)
Map(in -> 1, Scala, -> 1, optimized -> 1, APIs -> 1, that -> 1, Java, -> 1, high-level -> 1, an -> 1, Python, -> 1, and -> 2, engine -> 1)
Map(for -> 1, data -> 1, a -> 1, also -> 1, general -> 1, supports -> 2, It -> 1, graphs -> 1, analysis. -> 1, computation -> 1)
Map(for -> 1, set -> 1, tools -> 1, rich -> 1, Spark -> 1, structured -> 1, including -> 1, of -> 1, and -> 1, higher-level -> 1, SQL -> 2)
Map(GraphX -> 1, for -> 2, processing, -> 2, data -> 1, MLlib -> 1, learning, -> 1, machine -> 1, graph -> 1)
Map(for -> 1, Streaming -> 1, processing. -> 1, stream -> 1, Spark -> 1, and -> 1)
Map( -> 1)
Map(<http://spark.apache.org/> -> 1)
Input file content:
a b c d
a b e f
h i j l
m h i l
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount { // enclosing class name is arbitrary
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        JavaRDD<String> textFile = javaSparkContext.textFile("C:\\Users\\arun7.gupta\\Desktop\\Spark\\word.txt");
        /** Split each line into words on spaces */
        textFile = textFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        /** Pair each word with a count of 1 */
        JavaPairRDD<String, Integer> mapToPair = textFile.mapToPair(w -> new Tuple2<>(w, 1));
        /** Reduce by key, adding up the counts */
        JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey((wc1, wc2) -> wc1 + wc2);
        System.out.println(reduceByKey.collectAsMap());
        javaSparkContext.close();
    }
}
Output:
{e=1, h=2, b=2, j=1, m=1, d=1, a=2, i=2, c=1, l=2, f=1}