Storing the contents of a file in an immutable Map in Scala

I am trying to implement a simple wordcount in scala using an immutable map(this is intentional) and the way I am trying to accomplish it is as follows:
Create an empty immutable map
Create a scanner that reads through the file.
While the scanner.hasNext() is true:
Check if the Map contains the word, if it doesn't contain the word, initialize the count to zero
Create a new entry with the key=word and the value=count+1
Update the map
At the end of the iteration, the map is populated with all the values.
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
val token = input.next()
val currentCount = wordMap.getOrElse(token,0) + 1
val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration.
Whenever I try to run this snippet, I get the following exception
recursive value wordMap needs type.
Can somebody point out why I am getting this exception and what can I do to remedy it?
Thanks

val wordMap = wordMap + (token,currentCount)
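This line declares a new wordMap with val inside the loop body. The new declaration shadows the outer one, so the wordMap on the right-hand side refers to the very value being defined, which is why the compiler reports a recursive value. If you want to update the outer map, you need to define wordMap with var and then just use (note token -> currentCount, which makes the pair explicit):
wordMap = wordMap + (token -> currentCount)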
Though how about this instead?:
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
.toStream // make a Stream so we can groupBy
.groupBy(_._1).mapValues(_.map(_._2).sum) // combine all the per-line counts
.toList
Note that the per-line pre-aggregation is used to try and reduce the memory required. Counting across the entire file at once might be too big.
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)
Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to the array of its word tokens:
val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))
And finally, we want to count up the number of tokens for each type:
val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
a b c b
b c d d d
e f c
Then we get:
val countsByLine =
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))
So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.
val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList) // List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))
And there you have it.
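One caveat worth keeping in mind: groupBy has to traverse the whole Stream before it can return, so the laziness only helps up to that point. A possible alternative, sketched here under the assumption that an in-memory Iterator stands in for the lines of textfile.txt, is to keep the per-line pre-aggregation but fold each line's counts into a running total:

```scala
// Stand-in for io.Source.fromFile("textfile.txt").getLines (assumption: same
// three lines as the example file above)
val lines = Iterator("a b c b", "b c d d d", "e f c")

val totals: Map[String, Int] =
  lines
    // count each token within a single line
    .map { line =>
      line.split("\\s+").filter(_.nonEmpty)
        .groupBy(identity)
        .map { case (word, occurrences) => word -> occurrences.length }
    }
    // merge every per-line Map into the accumulated totals
    .foldLeft(Map.empty[String, Int]) { (acc, lineCounts) =>
      lineCounts.foldLeft(acc) { case (m, (word, n)) =>
        m + (word -> (m.getOrElse(word, 0) + n))
      }
    }
```

Only the running totals and the counts for the current line are held at any one time.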

You have a few mistakes: you've defined wordMap twice (val is to declare a value). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).
Try this:
var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
val token = input.next()
wordMap += token -> (wordMap(token) + 1)
}
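If the goal is to avoid var as well as a mutable map, the loop can be written as a fold, where each step returns a new immutable Map. This is only a sketch: wordCount is a hypothetical helper, and it takes the text as a String rather than reading it through a Scanner.

```scala
// Hypothetical helper: counts whitespace-separated tokens without any var.
def wordCount(text: String): Map[String, Int] =
  text.split("\\s+")
    .filter(_.nonEmpty) // split can yield an empty first token on leading whitespace
    .foldLeft(Map.empty[String, Int]) { (counts, token) =>
      // each step builds a new immutable Map with the updated count
      counts + (token -> (counts.getOrElse(token, 0) + 1))
    }

val counts = wordCount("a b c b")
```

Here counts holds a -> 1, b -> 2 and c -> 1.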

Related

Reading one file as Map(K,V) and pass V as keys while reading the second file as Map

I have two files. One is a text file and another one is CSV. I want to read the text file as Map(keys, values) and pass these values from the first file as key in Map when I read the second file (CSV file).
I am able to read the first file and get Map(key, value). From this Map, I have extracted the values and passed these values as keys in the second file but didn't get the desired result.
first file - text file
sdp:field(0)
meterNumber:field(1)
date:field(2)
time:field(3)
value:field(4),field(5),field(6),field(7),field(8),field(9),
field(10),field(11),field(12),field(13),field(14),
field(15),field(16),field(17)
second file - csv file
SDP,METERNO,READINGDATE,TIME,Reset Count.,Kilowatt-Hour Last Reset .,Kilowatt-Hour Rate A Last Reset.,Kilowatt-Hour Rate B Last Reset.,Kilowatt-Hour Rate C Last Reset.,Max Kilowatt Rate A Last Reset.,Max Kilowatt Rate B Last Reset.,Max Kilowatt Rate C Last Reset.,Accumulate Kilowatt Rate A Current.,Accumulate Kilowatt Rate B Current.,Accumulate Kilowatt Rate C Current.,Total Kilovar-Hour Last Reset.,Max Kilovar Last Reset.,Accumulate Kilovar Last Reset.
9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333,24.31933333,16.31933333,6,24,15,10,9,6,48.958,41,40
This is what I have done to read the first file:
val lines = scala.io.Source.fromFile("D:\\JSON_READER\\dailymapping.txt", "UTF8")
.getLines
.map(line=>line.split(":"))
.map(fields => (fields(0),fields(1))).toMap;
val sdp = lines.get("sdp").get;
val meterNumber = lines.get("meterNumber").get;
val date = lines.get("date").get;
val time = lines.get("time").get;
val values = lines.get("value").get;
Now I can see that sdp has field(0), meterNumber has field(1), date has field(2), time has field(3) and values has field(4) .. to field(17).
I am reading the second file using the code below:
val keyValuePairs = scala.io.Source.fromFile("D:\\JSON_READER\\Daily.csv")
.getLines.drop(1).map(_.stripLineEnd.split(",", -1))
.map{field => ((field(0),field(1),field(2),field(3)) -> (field(4),field(5)))}.toList
val map = Map(keyValuePairs : _*)
System.out.println(map);
The above code gives me the following output, which is the desired output.
Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
But I want to replace field(0), field(1), field(2), field(3) with sdp, meterNumber, date, time in the above code. So, I don't have to mention keys when I read the second file, keys will come from the first file.
I tried to replace but I got below output which is not desired output.
Map((field(0),field(1),field(2),field(3)) -> (,))
Can somebody please guide me on how can I achieve the desired output.
This might get you close to what you're after. The first Map is used to lookup the correct index into the CSV data.
val fieldRE = raw"field\((\d+)\)".r
val idx = io.Source
.fromFile(<txt_file>, "UTF8")
.getLines
.map(_.split(":"))
.flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
.split(",")
.map(fields(0) -> _.toInt))
.toMap
val resMap = io.Source
.fromFile(<csv_file>)
.getLines
.drop(1)
.map(_.stripLineEnd.split(",", -1))
.map{ fld =>
(fld(idx("sdp")),fld(idx("meterNumber")),fld(idx("date")),fld(idx("time"))) ->
(fld(4),fld(5)) }
.toMap
//resMap: Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
UPDATE
Changing the Map of (String identifiers -> Int index values) into a Map of (String identifiers -> collection of Int index values) can be done. I'm not sure what that buys you, but it's doable.
val fieldRE = raw"field\((\d+)\)".r
val idx = io.Source
.fromFile(<txt_file>, "UTF8")
.getLines
.map(_.split(":"))
.flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
.split(",")
.map(fields(0) -> _.toInt))
.foldLeft(Map[String,Seq[Int]]()){ case (m,(k,v)) =>
m + (k -> (m.getOrElse(k,Seq()) :+ v))
}
val resMap = io.Source
.fromFile(<csv_file>)
.getLines
.drop(1)
.map(_.stripLineEnd.split(",", -1))
.map{fld => (fld(idx("sdp").head)
,fld(idx("meterNumber").head)
,fld(idx("date").head)
,fld(idx("time").head)) -> (fld(4),fld(5))}
.toMap
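One thing the Map of index collections might buy you is that the value columns no longer need to be hard-coded as fld(4) and fld(5). A rough sketch, assuming in-memory stand-ins for the two files (mapping and csvRow are made up for illustration, and the value entry is shortened to two fields):

```scala
val fieldRE = raw"field\((\d+)\)".r

// Hypothetical stand-ins for the mapping file and one CSV data row
val mapping = List(
  "sdp:field(0)", "meterNumber:field(1)", "date:field(2)",
  "time:field(3)", "value:field(4),field(5)")
val csvRow = "9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333"

// name -> every index mentioned after the colon
val idx: Map[String, Seq[Int]] = mapping.map { line =>
  val Array(name, spec) = line.split(":", 2)
  name -> fieldRE.findAllMatchIn(spec).map(_.group(1).toInt).toList
}.toMap

val fld = csvRow.split(",", -1)
// look up the key columns and the value columns through the mapping
val key = Seq("sdp", "meterNumber", "date", "time").flatMap(name => idx(name)).map(i => fld(i))
val values = idx("value").map(i => fld(i))
```

With this input, key contains the first four columns of the row and values contains "2" and "48.958".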

Scala - conditional product/join of two arrays with default values using for comprehensions

I have two Sequences, say:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
How do I get a product such that each element of the first array is joined with every element of the second array that startsWith it, while also yielding a default result when no element in the second array meets the condition?
Effectively to get an Output:
expectedProductArray = Array("B-B25", "B-B80", "B-B50", "L-Default", "T-T70")
I tried doing,
val myProductArray: Array[String] = for {
f <- first
s <- second if s.startsWith(f)
} yield s"""$f-$s"""
and I get:
myProductArray = Array("B-B25", "B-B80", "B-B50", "T-T70")
Is there an Idiomatic way of adding a default value for values in first sequence not having a corresponding value in the second sequence with the given criteria? Appreciate your thoughts.
Here's one approach by making array second a Map and looking up the Map for elements in array first with getOrElse:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
val m = second.groupBy(_(0).toString)
// m: scala.collection.immutable.Map[String,Array[String]] =
// Map(M -> Array(M100), A -> Array(A50), B -> Array(B25, B80, B50), T -> Array(T70))
first.flatMap(x => m.getOrElse(x, Array("Default")).map(x + "-" + _))
// res1: Array[String] = Array(B-B25, B-B80, B-B50, L-Default, T-T70)
In case you prefer using for-comprehension:
for {
x <- first
y <- m.getOrElse(x, Array("Default"))
} yield s"$x-$y"
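Note that grouping by the first character only works while every prefix in first is a single character. If longer prefixes are possible, a sketch that keeps the original startsWith test (at the cost of scanning second once per prefix) could look like this:

```scala
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")

val joined: Array[String] = first.flatMap { f =>
  val matches = second.filter(_.startsWith(f))
  // fall back to a single "Default" entry when nothing matches
  (if (matches.isEmpty) Array("Default") else matches).map(s => s"$f-$s")
}
```

For the arrays above this gives B-B25, B-B80, B-B50, L-Default, T-T70.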

How to create a map from a RDD[String] using scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows & 5 columns(0,1,2,3,4)
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be the type of [Map[Int,Set[String]]]
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
Split each row into an array of strings, then change this into an array of sets of the single string value for each column. This can all be done with one map, and distributed.
Now reduce this using an operation that merges the set for each column in turn. This can also be distributed.
Turn the single row that results into a Map.
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
Zip the list with the index, since that's what we need as the key of the map.
The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
Then we have a list of (key, value) pairs which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
(This can be converted into a Map as before)
An earlier oneliner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row)
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
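For completeness, here is a sketch of the transpose version with the conversion to Map[Int, Set[String]] included. It uses plain Scala collections only (no Spark) and splits on "," rather than "\\W+", which is an assumption about the input format:

```scala
val data = List(
  "sunny,hot,high,FALSE,no",
  "sunny,hot,high,TRUE,no",
  "overcast,hot,high,FALSE,yes",
  "rainy,mild,high,FALSE,yes",
  "rainy,cool,normal,FALSE,yes",
  "rainy,cool,normal,TRUE,no",
  "overcast,cool,normal,TRUE,yes")

val byColumn: Map[Int, Set[String]] =
  data.map(_.split(",").toList) // rows of words
    .transpose                  // columns of words
    .map(_.toSet)               // distinct values per column
    .zipWithIndex               // (set, columnIndex)
    .map(_.swap)                // (columnIndex, set)
    .toMap
```

transpose expects every row to have the same number of columns, so ragged input would need to be handled first.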
Maybe this does the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
// Int keys, since the question asks for Map[Int, Set[String]]
val b = new Array[Map[Int, Set[String]]](5)
for (i <- 0 to 4)
  b(i) = Map(i -> (Set() ++ (for (s <- a) yield s.split(",")(i))) )
println(b.mkString("\n"))

Function to return List of Map while iterating over String, kmer count

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC)
I've parsed the input and created one huge string containing the entire sequence. The newlines have to go because they throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps List[Map[String, Int]] with which it should be easy to use scala's groupBy function to get the count of the common k-mers
import scala.io.Source
object Main {
def main(args: Array[String]) {
// Get all of the lines from the input file
val input = Source.fromFile("input.txt").getLines.toArray
// Create one huge string which contains all the lines but the first
val lines = input.tail.mkString.replace("\n","")
val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)
}
def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
for (i <- 0 until seq.length - k) {
Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
}
}
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!
You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work:
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // "to" is inclusive, so the last k-mer (starting at seq.length - k) is not skipped
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
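Going one step further, the grouped k-mers can be turned into counts directly. The following is a minimal sketch; kmerCounts is a hypothetical name, and the length filter is there because sliding returns one shorter window when the sequence is shorter than k:

```scala
// Hypothetical helper: frequency of every k-mer in seq
def kmerCounts(k: Int, seq: String): Map[String, Int] =
  seq.sliding(k)
    .filter(_.length == k) // drop the partial window produced for short input
    .toList
    .groupBy(identity)
    .map { case (kmer, occurrences) => kmer -> occurrences.size }

val fiveMers = kmerCounts(5, "AGCTTTC")
```

For "AGCTTTC" each of AGCTT, GCTTT and CTTTC occurs once.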

Merge the intersection of two CSV files with Scala

From input 1:
fruit, apple, cider
animal, beef, burger
and input 2:
animal, beef, 5kg
fruit, apple, 2liter
fish, tuna, 1kg
I need to produce:
fruit, apple, cider, 2liter
animal, beef, burger, 5kg
The closest example I could get is:
object FileMerger {
def main(args : Array[String]) {
import scala.io._
val f1 = (Source fromFile "file1.csv" getLines) map (_.split(", *")(1))
val f2 = Source fromFile "file2.csv" getLines
val out = new java.io.FileWriter("output.csv")
f1 zip f2 foreach { x => out.write(x._1 + ", " + x._2 + "\n") }
out.close
}
}
The problem is that the example assumes that the two CSV files contain the same number of elements and in the same order. My merged result must only contain elements that are in the first and the second file. I am new to Scala, and any help will be greatly appreciated.
You need an intersection of the two files: the lines from file1 and file2 which share some criteria. Consider this through a set theory perspective: you have two sets with some elements in common, and you need a new set with those elements. Well, there's more to it than that, because the lines aren't really equal...
So, let's say you read file1, and that's of type List[Input1]. We could code it like this, without getting into any details of what Input1 is:
case class Input1(line: String)
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map Input1).toList
We can do the same thing for file2 and List[Input2]:
case class Input2(line: String)
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map Input2).toList
You might be wondering why I created two different classes if they have the exact same definition. Well, if you were reading structured data, you would have two different types, so let's see how to handle that more complex case.
Ok, so how do we match them, since Input1 and Input2 are different types? Well, the lines are matched by keys, which, according to your code, are the first column in each. So let's create a class Key, and conversions Input1 => Key and Input2 => Key:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.line split "," head) // using regex would be better
def Input2IsKey(input: Input2): Key = Key(input.line split "," head)
Ok, now that we can produce a common Key from Input1 and Input2, let's get the intersection of them:
val intersection = (f1 map Input1IsKey).toSet intersect (f2 map Input2IsKey).toSet
So we can build the intersection of lines we want, but we don't have the lines! The problem is that, for each key, we need to know from which line it came. Consider that we have a set of keys, and for each key we want to keep track of a value -- that's exactly what a Map is! So we can build this:
val m1 = (f1 map (input => Input1IsKey(input) -> input)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input)).toMap
So the output can be produced like this:
val output = intersection map (key => m1(key).line + ", " + m2(key).line)
All you have to do now is output that.
Let's consider some improvements on this code. First, note that the output produced above repeats the key -- that's exactly what your code does, but not what you want in the example. Let's change, then, Input1 and Input2 to split the key from the rest of the args:
case class Input1(key: String, rest: String)
case class Input2(key: String, rest: String)
It's now a bit harder to initialize f1 and f2. Instead of using split, which will break up the whole line unnecessarily (and at great cost to performance), we'll divide the line right at the first comma: everything before is key, everything after is rest. The method span does that:
def breakLine(line: String): (String, String) = line span (',' !=)
Play a bit with the span method on REPL to get a better understanding of it. As for (',' !=), that's just an abbreviated form of saying (x => ',' != x).
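As a quick illustration of what span returns for one of the sample lines (written with an explicit lambda instead of the abbreviated form):

```scala
// span splits at the first character that fails the predicate
val (key, rest) = "fruit, apple, cider".span(_ != ',')
// key is "fruit"; rest is ", apple, cider" (the comma stays with rest)
```

If the line contains no comma at all, the whole line ends up in key and rest is empty.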
Next, we need a way to create Input1 and Input2 from a tuple (the result of breakLine):
def TupleIsInput1(tuple: (String, String)) = Input1(tuple._1, tuple._2)
def TupleIsInput2(tuple: (String, String)) = Input2(tuple._1, tuple._2)
We can now read the files:
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map breakLine map TupleIsInput1).toList
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map breakLine map TupleIsInput2).toList
Another thing we can simplify is intersection. When we create a Map, its keys are sets, so we can create the maps first, and then use their keys to compute the intersection:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.key)
def Input2IsKey(input: Input2): Key = Key(input.key)
// We now only keep the "rest" as the map value
val m1 = (f1 map (input => Input1IsKey(input) -> input.rest)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input.rest)).toMap
val intersection = m1.keySet intersect m2.keySet
And the output is computed like this:
val output = intersection map (key => key.key + m1(key) + m2(key)) // key.key: the String inside the Key wrapper
Note that I don't append comma anymore -- the rest of both f1 and f2 start with a comma already.
It's tough to infer a requirement from one example. Maybe something like this would serve your needs:
Create a map from key to line for the second file f2 (so from "animal, beef" -> "5kg")
For each line in the first file f1, get the key to look up in the map
Look up value, if found write output
That translates to
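import scala.io.Source
val f1 = Source.fromFile("file1.csv").getLines()
val f2 = Source.fromFile("file2.csv").getLines()
// key = every column but the last, value = the last column
val map = f2.map(_.split(", *")).map(arr => arr.init.mkString(", ") -> arr.last).toMap
val out = new java.io.FileWriter("output.csv")
for {
  line <- f1
  key = line.split(", *").init.mkString(", ")
  value <- map.get(key) // lines without a match in file2 are skipped
} {
  out.write(line + ", " + value + "\n")
}
out.close()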
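To try the idea without touching the file system, here is a sketch of the same lookup on in-memory lists. file1 and file2 hold the sample rows from the question, and keyOf is a hypothetical helper that treats every column but the last as the key:

```scala
val file1 = List("fruit, apple, cider", "animal, beef, burger")
val file2 = List("animal, beef, 5kg", "fruit, apple, 2liter", "fish, tuna, 1kg")

// Hypothetical helper: every column except the last one forms the key
def keyOf(line: String): String = line.split(", *").init.mkString(", ")

// key -> last column of the second file
val lookup: Map[String, String] =
  file2.map(line => keyOf(line) -> line.split(", *").last).toMap

// keep only the lines of file1 whose key also appears in file2
val merged: List[String] = for {
  line <- file1
  value <- lookup.get(keyOf(line))
} yield s"$line, $value"
```

The result keeps the order of file1, and the fish row is dropped because it has no counterpart there.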