Scala File lines to Map - scala

I'm currently opening my file and using .getLines to retrieve each line. Every line holds a word and its phonetic pronunciation, separated by two spaces. I'm fairly new to Scala and I'm confused about how to map each word to its pronunciation.
I previously thought of using split to separate the words from their sounds, but I'm lost.
So far I have:
def words(filename: String, word: String): Unit = {
val file = Source.fromFile(filename).getLines().drop(56)
for(x <- file){
}
}
EX:
ARTI  AA1 R T IY2
AASE  AA1 S
ABAIR  AH0 B EH1 R
AB  AE1 B
Result:
Map("AARTI" -> "AA1 R T IY2", "AASE" -> "AA1 S", "ABAIR" -> " AH0 B EH1 R")

iterate each line
split each line on the two-space separator "  "
create a tuple of (a -> b)
convert Array[Tuple[A, B]] => Map[A, B]
example,
val data =
"""
ARTI  AA1 R T IY2
AASE  AA1 S
ABAIR  AH0 B EH1 R
AB  AE1 B
""".stripMargin
// drop the blank first and last lines produced by the triple-quoted string
val lines: Array[String] = data.split("\n").filter(_.trim.nonEmpty)
// if you are reading from file
// val lines = Source.fromFile("src/test/resources/my_filename.txt").getLines().toArray
val res: Array[Tuple2[String, String]] = lines.map { line =>
  // split on the first two-space separator only, so the pronunciation stays in one piece
  line.split("  ", 2) match { case Array(a, b) => a -> b }
}
println(res.toMap)
output:
Map(ARTI -> AA1 R T IY2, AASE -> AA1 S, ABAIR -> AH0 B EH1 R, AB -> AE1 B)
Running example - https://scastie.scala-lang.org/prayagupd/jBCnEhUPQJCMPKP9TXlgWA
How to read entire file in Scala?

If your lines are in file then this will create a Map from the first word to the rest of the string:
val res: Map[String, String] = file.map(_.span(_.isLetter))(collection.breakOut)
The values in the Map will contain leading space characters so you may want to call trim on them before using them.
The map call processes each line in turn.
The span method splits the line into a tuple where the first value is your word and the second is the rest of the line.
Using collection.breakOut tells map to put the results directly into a Map rather than going through an intermediate array or list.
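A minimal sketch of the same span-based idea that avoids collection.breakOut, which was removed in Scala 2.13. The object name Pronunciations and the method name toMap are made up for illustration. It assumes each word consists only of letters, since span(_.isLetter) stops at the first non-letter character, and it trims the pronunciation so the leading spaces mentioned above are gone.

```scala
object Pronunciations {
  // Hypothetical helper: builds word -> pronunciation from raw lines.
  def toMap(lines: Iterator[String]): Map[String, String] =
    lines
      .filter(_.trim.nonEmpty)             // skip blank lines
      .map { line =>
        // word = leading run of letters, rest = everything after it
        val (word, rest) = line.span(_.isLetter)
        word -> rest.trim                  // trim removes the separator spaces
      }
      .toMap
}
```

With the question's code this could be called as Pronunciations.toMap(Source.fromFile(filename).getLines().drop(56)).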

Related

How to Read and Change position of lines in a file using scala

I have a file, let's say file1:
A
B
C
D
E
I have to read this file and move the 1st and 2nd lines to the 3rd and 4th positions, like this:
C
D
A
B
E
The getLines function can read the lines, and I can probably print them. But how do I change the position of lines in a file using Scala?
Let's say you can't, or just don't want to, read the entire file into memory. Or, on the other hand, what if the file has fewer than 4 lines? Can the swapping still be done safely?
import java.nio.file.{Files, Paths}
util.Using.Manager { use => // Scala 2.13
  val input = use(io.Source.fromFile("inFile.txt"))
  val output = use(Files.newBufferedWriter(Paths.get("outFile.txt")))
  val itr = input.getLines()
  val linesAB = Seq.fill(2)(util.Try(itr.next()))
  val linesCD = Seq.fill(2)(util.Try(itr.next()))
  linesCD.foreach(_.foreach(s => output.write(s + "\n")))
  linesAB.foreach(_.foreach(s => output.write(s + "\n")))
  while (itr.hasNext) output.write(itr.next() + "\n")
}.fold(println, identity) // report failure
result:
~> head *File.txt # a 3-line file
==> inFile.txt <==
A
B
C
==> outFile.txt <==
C
A
B
Here is a pragmatic solution:
val newList = Source.fromFile("file.txt").getLines().toList match {
  case a :: b :: c :: d :: rest => c :: d :: a :: b :: rest // reorganize your list
  case other => other // don't do anything if the List has fewer than 4 elements
}
// persist newList
Use the Pattern Matching of the List type.
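The "persist newList" step is left open above. One possible way to fill it in is sketched here, under the assumption that the whole file fits in memory. The names LineSwap, swapFirstPairs, and rewrite are invented for this example, and writing UTF-8 with a trailing newline per line is a choice, not something the answer specifies.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.io.Source

object LineSwap {
  // Same pattern match as above: swap the first two lines with the next two.
  def swapFirstPairs(lines: List[String]): List[String] = lines match {
    case a :: b :: c :: d :: rest => c :: d :: a :: b :: rest
    case other                    => other // fewer than 4 lines: leave unchanged
  }

  // Hypothetical helper: read inPath, reorder, write the result to outPath.
  def rewrite(inPath: String, outPath: String): Unit = {
    val source = Source.fromFile(inPath)
    val lines  = try source.getLines().toList finally source.close()
    val text   = swapFirstPairs(lines).map(_ + "\n").mkString
    Files.write(Paths.get(outPath), text.getBytes(StandardCharsets.UTF_8))
    ()
  }
}
```

Keeping the reordering in a pure function makes it easy to check without touching the file system.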

Read input from input file given in specified format in scala

I am given an input file in the following format:
(a,[b,c,d])
(b,[d,a])
How can I format this input to get values in the form
key => List()
The following code splits each line on spaces:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
How can I store this kind of formatted input?
To tackle this I started with multiple data elements with and without whitespace separation.
%> cat junk.txt
(a,[b,c,d,e]) (w,[x,y,z])
(q,[wert,cv])(xx,[aa])
Then I opened the file and split the input on every leading ( paren without consuming the character.
val input = io.Source.fromFile("junk.txt")
.getLines()
.flatMap(_.split("(?=\\()"))
I also need a way to recognize the pattern I'm looking for.
val dataRE = "\\(([^,]+),\\[([^\\]]+)]".r.unanchored
Now to transform the data from Strings to Maps.
input.collect{case dataRE(k,v) => k -> v.split(",").toList}.toMap
Result: Map[String,List[String]]
Map(a -> List(b, c, d, e), w -> List(x, y, z), q -> List(wert, cv), xx -> List(aa))
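The steps above can be collected into one function that works on any iterator of lines, which makes the parsing easy to try without a file. This is only a sketch: BracketParser and parse are illustrative names, and the regular expressions are the ones from the answer.

```scala
object BracketParser {
  // Matches "(key,[v1,v2,...]" anywhere in a chunk; group 1 = key, group 2 = value list.
  private val entry = "\\(([^,]+),\\[([^\\]]+)\\]".r.unanchored

  def parse(lines: Iterator[String]): Map[String, List[String]] =
    lines
      .flatMap(_.split("(?=\\()").toList)   // cut before every "(" without consuming it
      .collect { case entry(k, v) => k -> v.split(",").toList }
      .toMap
}
```

For the file above, BracketParser.parse(io.Source.fromFile("junk.txt").getLines()) should give the same Map as shown.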

Scala - List of Strings to Square Cypher String

I am following an exercise in Scala to build a square cypher. Here's an overview of the problem:
List("hello", "world", "fille", "r") is written taking the first letter from each String in the List and concatenating to the final string. Essentially, if you write them in square cypher form, you get:
hwfr
eoi
lrl
lll
ode
Which if you read from top to bottom, left to right, is the message. My expected output needs to be a List[String] that becomes List("hwfr", "eoi", ...). I don't know what methods or where to start in order to manipulate the original List in order to adhere to the form that I need. I can't map zip since zip only takes two arguments and I have an indeterminate amount of Strings. I'm not exactly sure how I might iterate over this List to get the result I need and would appreciate any suggestions or tips.
scala> val list = List("hello", "world", "fille", "rtext")
list: List[String] = List(hello, world, fille, rtext)
scala> list.transpose
res6: List[List[Char]] = List(List(h, w, f, r), List(e, o, i, t), List(l, r, l, e), List(l, l, l, x), List(o, d, e, t))
list.transpose does the trick as long as every string has the same length (see the API documentation for transpose).
Here is a version which does not care about equal word length. There should be more efficient versions, but I wanted to keep it relatively short.
Basic idea: Find out how long the longest word is (max). Once you know that, you start with index i = 0, take the character at position i from each string, and form a string from those characters, until you reach i = max - 1 (which is the position of the last character of the longest word). When the words are not of equal length, you have to make sure that you don't access a character which is not there.
Example: i = 1, then you get e from hello, o from world, i from fille, but accessing character 1 on r would result in an exception. That is why we check for size of the string beforehand and in that case append the empty string. if(i < elem.size) elem(i) else ""
val list = List("hello", "world", "fille", "r")
val max = list.maxBy(_.size).size //gives you the size of the longest word
val result: List[String] = (0 until max).map(i => list.foldLeft("")
((s, elem) => s + (if(i < elem.size) elem(i) else "")))(collection.breakOut)
println(result) //List(hwfr, eoi, lrl, lll, ode)
Edit:
If you still want it to be readable from left-right/top-bottom (if they are not ordered by length and you don't want to order them), you can introduce spaces. Change if(i < elem.size) elem(i) else "" to if(i < elem.size) elem(i) else " ".
List("hello", "world", "fille", "r") would become List(hwfr, eoi , lrl , lll , ode ) and List("hello", "world", "r", "fille") would become List(hwrf, eo i, lr l, ll l, od e)
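collection.breakOut no longer exists in Scala 2.13, so here is a sketch of the same index-by-index idea without it. SquareCypher, columns, and the pad parameter are names chosen for this example. Passing pad = Some(' ') gives the space-filled variant from the edit above.

```scala
object SquareCypher {
  // Builds one output string per character position across all words.
  // pad = None skips missing characters; pad = Some(c) fills them with c.
  def columns(words: List[String], pad: Option[Char] = None): List[String] = {
    val width = if (words.isEmpty) 0 else words.map(_.length).max
    (0 until width).toList.map { i =>
      words.map { w =>
        if (i < w.length) w(i).toString else pad.fold("")(_.toString)
      }.mkString
    }
  }
}
```

SquareCypher.columns(List("hello", "world", "fille", "r")) yields List(hwfr, eoi, lrl, lll, ode), matching the result above.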

Storing the contents of a file in an immutable Map in scala

I am trying to implement a simple word count in Scala using an immutable map (this is intentional), and the way I am trying to accomplish it is as follows:
Create an empty immutable map
Create a scanner that reads through the file.
While the scanner.hasNext() is true:
Check if the Map contains the word, if it doesn't contain the word, initialize the count to zero
Create a new entry with the key=word and the value=count+1
Update the map
At the end of the iteration, the map is populated with all the values.
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
val token = input.next()
val currentCount = wordMap.getOrElse(token,0) + 1
val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration...
Whenever I try to run this snippet, I get the following exception
recursive value wordMap needs type.
Can somebody point out why I am getting this exception and what can I do to remedy it?
Thanks
val wordMap = wordMap + (token,currentCount)
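This line declares a second wordMap inside the loop instead of updating the outer one. Because the new val refers to itself on its right-hand side, the compiler reports a recursive value. If you want to update the map, you need to define wordMap with var and then just use
wordMap = wordMap + (token -> currentCount)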
Though how about this instead?:
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
.toStream // make a Stream so we can groupBy
.groupBy(_._1).mapValues(_.map(_._2).sum) // combine all the per-line counts
.toList
Note that the per-line pre-aggregation is used to try and reduce the memory required. Counting across the entire file at once might be too big.
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)
Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to its word tokens:
val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))
And finally, we want to count up the number of tokens for each type:
val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
a b c b
b c d d d
e f c
Then we get:
val countsByLine =
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))
So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.
val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList)
List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))
And there you have it.
You have a few mistakes: you've defined wordMap twice (val is to declare a value). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).
Try this:
var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
val token = input.next()
wordMap += token -> (wordMap(token) + 1)
}
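If the goal is to stay with an immutable Map and also avoid var, a fold is one option. The sketch below is not from the answers above. WordCount and countWords are illustrative names, and splitting on whitespace is the same tokenization assumption used earlier in this thread.

```scala
object WordCount {
  // Threads an immutable Map through the tokens; every step returns a new Map.
  def countWords(lines: Iterator[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+").toList)
      .filter(_.nonEmpty)                 // split can yield "" for lines with leading spaces
      .foldLeft(Map.empty[String, Int]) { (counts, token) =>
        counts.updated(token, counts.getOrElse(token, 0) + 1)
      }
}
```

It could be called as WordCount.countWords(io.Source.fromFile("textfile.txt").getLines()).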

Merge the intersection of two CSV files with Scala

From input 1:
fruit, apple, cider
animal, beef, burger
and input 2:
animal, beef, 5kg
fruit, apple, 2liter
fish, tuna, 1kg
I need to produce:
fruit, apple, cider, 2liter
animal, beef, burger, 5kg
The closest example I could get is:
object FileMerger {
def main(args : Array[String]) {
import scala.io._
val f1 = (Source fromFile "file1.csv" getLines) map (_.split(", *")(1))
val f2 = Source fromFile "file2.csv" getLines
val out = new java.io.FileWriter("output.csv")
f1 zip f2 foreach { x => out.write(x._1 + ", " + x._2 + "\n") }
out.close
}
}
The problem is that the example assumes that the two CSV files contain the same number of elements and in the same order. My merged result must only contain elements that are in the first and the second file. I am new to Scala, and any help will be greatly appreciated.
You need an intersection of the two files: the lines from file1 and file2 which share some criteria. Consider this through a set theory perspective: you have two sets with some elements in common, and you need a new set with those elements. Well, there's more to it than that, because the lines aren't really equal...
So, let's say you read file1, and that's of type List[Input1]. We could code it like this, without getting into any details of what Input1 is:
case class Input1(line: String)
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map Input1).toList
We can do the same thing for file2 and List[Input2]:
case class Input2(line: String)
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map Input2).toList
You might be wondering why I created two different classes if they have the exact same definition. Well, if you were reading structured data, you would have two different types, so let's see how to handle that more complex case.
Ok, so how do we match them, since Input1 and Input2 are different types? Well, the lines are matched by keys, which, according to your code, are the first column in each. So let's create a class Key, and conversions Input1 => Key and Input2 => Key:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.line split "," head) // using regex would be better
def Input2IsKey(input: Input2): Key = Key(input.line split "," head)
Ok, now that we can produce a common Key from Input1 and Input2, let's get the intersection of them:
val intersection = (f1 map Input1IsKey).toSet intersect (f2 map Input2IsKey).toSet
So we can build the intersection of lines we want, but we don't have the lines! The problem is that, for each key, we need to know from which line it came. Consider that we have a set of keys, and for each key we want to keep track of a value -- that's exactly what a Map is! So we can build this:
val m1 = (f1 map (input => Input1IsKey(input) -> input)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input)).toMap
So the output can be produced like this:
val output = intersection map (key => m1(key).line + ", " + m2(key).line)
All you have to do now is output that.
Let's consider some improvements on this code. First, note that the output produced above repeats the key -- that's exactly what your code does, but not what you want in the example. Let's change, then, Input1 and Input2 to split the key from the rest of the args:
case class Input1(key: String, rest: String)
case class Input2(key: String, rest: String)
It's now a bit harder to initialize f1 and f2. Instead of using split, which breaks up the whole line unnecessarily (and at great cost to performance), we'll divide the line right at the first comma: everything before is key, everything after is rest. The method span does that:
def breakLine(line: String): (String, String) = line span (',' !=)
Play a bit with the span method on REPL to get a better understanding of it. As for (',' !=), that's just an abbreviated form of saying (x => ',' != x).
Next, we need a way to create Input1 and Input2 from a tuple (the result of breakLine):
def TupleIsInput1(tuple: (String, String)) = Input1(tuple._1, tuple._2)
def TupleIsInput2(tuple: (String, String)) = Input2(tuple._1, tuple._2)
We can now read the files:
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map breakLine map TupleIsInput1).toList
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map breakLine map TupleIsInput2).toList
Another thing we can simplify is intersection. When we create a Map, its keys form a set, so we can create the maps first, and then use their key sets to compute the intersection:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.key)
def Input2IsKey(input: Input2): Key = Key(input.key)
// We now only keep the "rest" as the map value
val m1 = (f1 map (input => Input1IsKey(input) -> input.rest)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input.rest)).toMap
val intersection = m1.keySet intersect m2.keySet
And the output is computed like this:
val output = intersection map (key => key.key + m1(key) + m2(key)) // key.key: the raw string, not "Key(...)"
Note that I don't append comma anymore -- the rest of both f1 and f2 start with a comma already.
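Put together, the span and keySet steps might look like the sketch below. It works on in-memory lines, uses plain String keys instead of the Key wrapper to stay short, and follows file1's order. KeyedIntersection and merge are names invented here. Since the key is only the first column, as in this answer, the second column of file2 is repeated in the output.

```scala
object KeyedIntersection {
  // Divide a line at its first comma: (key, rest); rest keeps its leading comma.
  def breakLine(line: String): (String, String) = line.span(_ != ',')

  def merge(file1: Seq[String], file2: Seq[String]): Seq[String] = {
    val m1 = file1.map(breakLine).toMap
    val m2 = file2.map(breakLine).toMap
    val common = m1.keySet intersect m2.keySet
    // Walk file1 so the output keeps its line order.
    file1.map(breakLine).collect {
      case (key, rest) if common(key) => key + rest + m2(key)
    }
  }
}
```

For the sample inputs this produces "fruit, apple, cider, apple, 2liter" and "animal, beef, burger, beef, 5kg".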
It's tough to infer a requirement from one example. Maybe something like this would serve your needs:
Create a map from key to value for the second file f2 (so "animal, beef" -> "5kg")
For each line in the first file f1, get the key to look up in the map
Look up the value; if found, write the output
That translates to
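import scala.io.Source
val f1 = Source.fromFile("file1.csv").getLines()
val f2 = Source.fromFile("file2.csv").getLines()
val out = new java.io.FileWriter("output.csv")
// key = every column except the last, value = the last column of file2
val map = f2.map(_.split(", *")).map(arr => arr.init.mkString(", ") -> arr.last).toMap
for {
  line <- f1
  key = line.split(", *").init.mkString(", ")
  value <- map.get(key) // lines whose key is missing from file2 are skipped
} {
  out.write(line + ", " + value + "\n")
}
out.close()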
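The same lookup can be written as a pure function over in-memory lines, which makes it simple to check against the sample data from the question. This is a sketch under the same assumption as above, namely that the key is every column except the last. CsvMerge, splitKey, and merge are illustrative names.

```scala
object CsvMerge {
  // Split a line into (all columns but the last, last column).
  private def splitKey(line: String): (String, String) = {
    val cols = line.split(", *")
    cols.init.mkString(", ") -> cols.last
  }

  def merge(file1: Seq[String], file2: Seq[String]): Seq[String] = {
    val lookup = file2.map(splitKey).toMap
    for {
      line  <- file1
      value <- lookup.get(splitKey(line)._1) // None drops lines that are not in file2
    } yield s"$line, $value"
  }
}
```

With the two inputs from the question, merge returns "fruit, apple, cider, 2liter" and "animal, beef, burger, 5kg".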