Merge the intersection of two CSV files with Scala - scala

From input 1:
fruit, apple, cider
animal, beef, burger
and input 2:
animal, beef, 5kg
fruit, apple, 2liter
fish, tuna, 1kg
I need to produce:
fruit, apple, cider, 2liter
animal, beef, burger, 5kg
The closest example I could get is:
object FileMerger {
def main(args : Array[String]) {
import scala.io._
val f1 = (Source fromFile "file1.csv" getLines) map (_.split(", *")(1))
val f2 = Source fromFile "file2.csv" getLines
val out = new java.io.FileWriter("output.csv")
f1 zip f2 foreach { x => out.write(x._1 + ", " + x._2 + "\n") }
out.close
}
}
The problem is that the example assumes that the two CSV files contain the same number of elements and in the same order. My merged result must only contain elements that are in the first and the second file. I am new to Scala, and any help will be greatly appreciated.

You need an intersection of the two files: the lines from file1 and file2 which share some criteria. Consider this through a set theory perspective: you have two sets with some elements in common, and you need a new set with those elements. Well, there's more to it than that, because the lines aren't really equal...
So, let's say you read file1, and that's of type List[Input1]. We could code it like this, without getting into any details of what Input1 is:
case class Input1(line: String)
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map Input1).toList
We can do the same thing for file2 and List[Input2]:
case class Input2(line: String)
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map Input2).toList
You might be wondering why I created two different classes if they have the exact same definition. Well, if you were reading structured data, you would have two different types, so let's see how to handle that more complex case.
Ok, so how do we match them, since Input1 and Input2 are different types? Well, the lines are matched by keys, which, according to your code, are the first column in each. So let's create a class Key, and conversions Input1 => Key and Input2 => Key:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.line split "," head) // using regex would be better
def Input2IsKey(input: Input2): Key = Key(input.line split "," head)
Ok, now that we can produce a common Key from Input1 and Input2, let's get the intersection of them:
val intersection = (f1 map Input1IsKey).toSet intersect (f2 map Input2IsKey).toSet
So we can build the intersection of lines we want, but we don't have the lines! The problem is that, for each key, we need to know from which line it came. Consider that we have a set of keys, and for each key we want to keep track of a value -- that's exactly what a Map is! So we can build this:
val m1 = (f1 map (input => Input1IsKey(input) -> input)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input)).toMap
So the output can be produced like this:
val output = intersection map (key => m1(key).line + ", " + m2(key).line)
All you have to do now is output that.
Let's consider some improvements on this code. First, note that the output produced above repeats the key -- that's exactly what your code does, but not what you want in the example. Let's change, then, Input1 and Input2 to split the key from the rest of the args:
case class Input1(key: String, rest: String)
case class Input2(key: String, rest: String)
It's now a bit harder to initialize f1 and f2. Instead of using split, which will break all the line unnecessarily (and at great cost to performance), we'll divide the line right the at the first comma: everything before is key, everything after is rest. The method span does that:
def breakLine(line: String): (String, String) = line span (',' !=)
Play a bit with the span method on REPL to get a better understanding of it. As for (',' !=), that's just an abbreviated form of saying (x => ',' != x).
Next, we need a way to create Input1 and Input2 from a tuple (the result of breakLine):
def TupleIsInput1(tuple: (String, String)) = Input1(tuple._1, tuple._2)
def TupleIsInput2(tuple: (String, String)) = Input2(tuple._1, tuple._2)
We can now read the files:
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map breakLine map TupleIsInput1).toList
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map breakLine map TupleIsInput2).toList
Another thing we can simplify is intersection. When we create a Map, its keys are sets, so we can create the maps first, and then use their keys to compute the intersection:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.key)
def Input2IsKey(input: Input2): Key = Key(input.key)
// We now only keep the "rest" as the map value
val m1 = (f1 map (input => Input1IsKey(input) -> input.rest)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input.rest)).toMap
val intersection = m1.keySet intersect m2.keySet
And the output is computed like this:
val output = intersection map (key => key + m1(key) + m2(key))
Note that I don't append comma anymore -- the rest of both f1 and f2 start with a comma already.

It's tough to infer a requirement from one example. May be something like this is would serve your needs:
Create a map from key to line for the second file f2 (so from "animal, beef" -> "5kg")
For each lines in the first file f1, get the key to look up in the map
Look up value, if found write output
That translates to
val f1 = Source fromFile "file1.csv" getLines
val f2 = Source fromFile "file2.csv" getLines
val map = f2.map(_.split(", *")).map(arr => arr.init.mkString(", ") -> arr.last}.toMap
for {
line <- f1
key = line.split(", *").init.mkString(", ")
value <- map.get(key)
} {
out.write(line + ", " + value + "\n")
}

Related

Scala File lines to Map

I'm Currently opening my files and utilizing .getLines to retrieve each lines from the file with a word and its phonetic pronunciation separated by two white spaces, i'm confused as to how would i go about Mapping the word and its pronunciation in Scala as i'm fairly new to the language.
i've previously though to utilize split and separate the words and their sounds into different lines,but, i'm lost
Currently i Started with
def words(filename: String, word: String): Unit = {
val file = Source.fromFile(filename).getLines().drop(56)
for(x <- file){
}
}
EX:
ARTI AA1 R T IY2
AASE AA1 S
ABAIR AH0 B EH1 R
AB AE1 B
Result:
Map("AARTI -> "AA1 R T IY2","AASE" -> "AA1 S", "ABAIR" -> " AH0 B EH1 R")
iterate each line
split by 2 white spaces " "
create a tuple of (a -> b)
convert Array[Tuple[A, B]] => Map[A, B]
example,
val data =
"""
ARTI AA1 R T IY2
AASE AA1 S
ABAIR AH0 B EH1 R
AB AE1 B
""".stripMargin
val lines: Array[String] = data.split("\n").filter(_.trim.nonEmpty)
// if you are reading from file
// val lines = Source.fromFile("src/test/resources/my_filename.txt").getLines()
val res: Array[Tuple2[String, String]] = lines.map { line =>
line.split(" ") match { case Array(a, b) => a -> b }
}
println(res.toMap)
output:
Map(ARTI -> AA1 R T IY2, AASE -> AA1 S, ABAIR -> AH0 B EH1 R, AB -> AE1 B)
Running example - https://scastie.scala-lang.org/prayagupd/jBCnEhUPQJCMPKP9TXlgWA
How to read entire file in Scala?
If your lines are in file then this will create a Map from the first word to the rest of the string:
val res: Map[String, String] = file.map(_.span(_.isLetter))(collection.breakOut)
The values in the Map will contain leading space characters so you may want to call trim on them before using them.
The map call processes each line in turn.
The span method splits the line into a tuple where the first value is your word and the second is the rest of the line.
Using collection.breakOut tells map to put the results directly into a Map rather than going through an intermediate array or list.

Function to return List of Map while iterating over String, kmer count

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC)
I've parsed through the input and created a huge string which is the entire sequence, the new lines throw off the k-mer counting as the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps List[Map[String, Int]] with which it should be easy to use scala's groupBy function to get the count of the common k-mers
import scala.io.Source
object Main {
def main(args: Array[String]) {
// Get all of the lines from the input file
val input = Source.fromFile("input.txt").getLines.toArray
// Create one huge string which contains all the lines but the first
val lines = input.tail.mkString.replace("\n","")
val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)
}
def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
for (i <- 0 until seq.length - k) {
Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
}
}
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!
You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work:
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
for (i <- 0 until seq.length - k) yield {
Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
}
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.

Scala: how to make a Hash(Trie)Map from a Map (via Anorm in Play)

Having read this quote on HashTrieMaps on docs.scala-lang.org:
For instance, to find a given key in a map, one first takes the hash code of the key. Then, the lowest 5 bits of the hash code are used to select the first subtree, followed by the next 5 bits and so on. The selection stops once all elements stored in a node have hash codes that differ from each other in the bits that are selected up to this level.
I figured that be a great (read: fast!) collection to store my Map[String, Long] in.
In my Play Framework (using Scala) I have this piece of code using Anorm that loads in around 18k of elements. It takes a few seconds to load (no big deal, but any tips?). I'd like to have it 'in memory' for fast look ups for string to long translation.
val data = DB.withConnection { implicit c ⇒
SQL( "SELECT stringType, longType FROM table ORDER BY stringType;" )
.as( get[String]( "stringType" ) ~ get[Long]( "longType " )
map { case ( s ~ l ) ⇒ s -> l }* ).toMap.withDefaultValue( -1L )
}
This code makes data of type class scala.collection.immutable.Map$WithDefault. I'd like this to be of type HashTrieMap (or HashMap, as I understand the linked quote all Scala HashMaps are of HashTrieMap?). Weirdly enough I found no way on how to convert it to a HashTrieMap. (I'm new to Scala, Play and Anorm.)
// Test for the repl (2.9.1.final). Map[String, Long]:
val data = Map( "Hello" -> 1L, "world" -> 2L ).withDefaultValue ( -1L )
data: scala.collection.immutable.Map[java.lang.String,Long] =
Map(Hello -> 1, world -> 2)
// Google showed me this, but it is still a Map[String, Long].
val hm = scala.collection.immutable.HashMap( data.toArray: _* ).withDefaultValue( -1L )
// This generates an error.
val htm = scala.collection.immutable.HashTrieMap( data.toArray: _* ).withDefaultValue( -1L )
So my question is how to convert the MapWithDefault to HashTrieMap (or HashMap if that shares the implementation of HashTrieMap)?
Any feedback welcome.
As the documentation that you pointed to explains, immutable maps already are implemented under the hood as HashTrieMaps. You can easily verify this in the REPL:
scala> println( Map(1->"one", 2->"two", 3->"three", 4->"four", 5->"five").getClass )
class scala.collection.immutable.HashMap$HashTrieMap
So you have nothing special to do, your code already is using HashMap.HashTrieMap without you even realizing.
More precisely, the default implementation of immutable.Map is immutable.HashMap, which is further refined (extended) by immutable.HashMap.HashTrieMap.
Note though that small immutable maps are not instances of immutable.HashMap.HashTrieMap, but are implemented as special cases (this is an optimization). There is a certain size threshold where they start being impelmented as immutable.HashMap.HashTrieMap.
As an example, entering the following in the REPL:
val m0 = HashMap[Int,String]()
val m1 = m0 + (1 -> "one")
val m2 = m1 + (2 -> "two")
val m3 = m2 + (3 -> "three")
println(s"m0: ${m0.getClass.getSimpleName}, m1: ${m1.getClass.getSimpleName}, m2: ${m2.getClass.getSimpleName}, m3: ${m3.getClass.getSimpleName}")
will print this:
m0: EmptyHashMap$, m1: HashMap1, m2: HashTrieMap, m3: HashTrieMap
So here the empty map is an instance of EmptyHashMap$. Adding an element to that gives a HashMap1, and adding yet another element finally gives a HashTrieMap.
Finally, the use of withDefaultValue does not change anything, as withDefaultValue will just return an instance Map.WithDefault wich wraps the initial map (which will still be a HashMap.HashTrieMap).

Scala: groupBy (identity) of List Elements

I develop an application that builds pairs of words in (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, it's OK as it'll be evened out later in the algorithm).
When I use
elements groupBy()
I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x
/**
* Maps a collection of words to a map where key is a pair of words and the
* value is number of
* times this pair
* occurs in the passed array
*/
def producePairs(words: Array[String]): Map[(String,String), Double] = {
var table = List[(String, String)]()
words.foreach(w1 =>
words.foreach(w2 =>
table = table ::: List((w1, w2))))
val grouppedPairs = table.groupBy(self)
val size = int2double(grouppedPairs.size)
return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little a came out with a:
grouppedPairs = table groupBy (x => x)
This way it produced what I want. However, I still feel that I clearly miss something and there should be easier way of doing it. Any ideas at all, dear all?
Also, if you'd help me to improve the pairs extraction part, it'll also help a lot – it looks very imperative, C++ - ish right now. Many thanks in advance!
I'd suggest this:
def producePairs(words: Array[String]): Map[(String,String), Double] = {
val table = for(w1 <- words; w2 <- words) yield (w1,w2)
val grouppedPairs = table.groupBy(identity)
val size = grouppedPairs.size.toDouble
grouppedPairs.mapValues(_.length / size)
}
The for comprehension is much easier to read, and there is already a predifined function identity, with is a generalized version of your self.
you are creating a list of pairs of all words against all words by iterating over words twice, where i guess you just want the neighbouring pairs. the easiest is to use a sliding view instead.
def producePairs(words: Array[String]): Map[(String, String), Int] = {
val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
val grouped = pairs.groupBy(t => t)
grouped.mapValues(_.size)
}
another approach would be to fold the list of pairs by summing them up. not sure though that this is more efficient:
def producePairs(words: Array[String]): Map[(String, String), Int] = {
val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
m + (p -> (m.getOrElse(p, 0) + 1))
}
}
i see you are return a relative number (Double). for simplicity i have just counted the occurances, so you need to do the final division. i think you want to divide by the number of total pairs (words.size - 1) and not by the number of unique pairs (grouped.size)..., so the relative frequencies sum up to 1.0
Alternative approach which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
def producePairs[T <% Traversable[String]](words: T): Map[(String,String), Double] = {
val counts = words.groupBy(identity).map{case (w, ws) => (w -> ws.size)}
val size = (counts.size * counts.size).toDouble
for(w1 <- counts; w2 <- counts) yield {
((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
}
}

Scala: How do I use fold* with Map?

I have a Map[String, String] and want to concatenate the values to a single string.
I can see how to do this using a List...
scala> val l = List("te", "st", "ing", "123")
l: List[java.lang.String] = List(te, st, ing, 123)
scala> l.reduceLeft[String](_+_)
res8: String = testing123
fold* or reduce* seem to be the right approach I just can't get the syntax right for a Map.
Folds on a map work the same way they would on a list of pairs. You can't use reduce because then the result type would have to be the same as the element type (i.e. a pair), but you want a string. So you use foldLeft with the empty string as the neutral element. You also can't just use _+_ because then you'd try to add a pair to a string. You have to instead use a function that adds the accumulated string, the first value of the pair and the second value of the pair. So you get this:
scala> val m = Map("la" -> "la", "foo" -> "bar")
m: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(la -> la, foo -> bar)
scala> m.foldLeft("")( (acc, kv) => acc + kv._1 + kv._2)
res14: java.lang.String = lalafoobar
Explanation of the first argument to fold:
As you know the function (acc, kv) => acc + kv._1 + kv._2 gets two arguments: the second is the key-value pair currently being processed. The first is the result accumulated so far. However what is the value of acc when the first pair is processed (and no result has been accumulated yet)? When you use reduce the first value of acc will be the first pair in the list (and the first value of kv will be the second pair in the list). However this does not work if you want the type of the result to be different than the element types. So instead of reduce we use fold where we pass the first value of acc as the first argument to foldLeft.
In short: the first argument to foldLeft says what the starting value of acc should be.
As Tom pointed out, you should keep in mind that maps don't necessarily maintain insertion order (Map2 and co. do, but hashmaps do not), so the string may list the elements in a different order than the one in which you inserted them.
The question has been answered already, but I'd like to point out that there are easier ways to produce those strings, if that's all you want. Like this:
scala> val l = List("te", "st", "ing", "123")
l: List[java.lang.String] = List(te, st, ing, 123)
scala> l.mkString
res0: String = testing123
scala> val m = Map(1 -> "abc", 2 -> "def", 3 -> "ghi")
m: scala.collection.immutable.Map[Int,java.lang.String] = Map((1,abc), (2,def), (3,ghi))
scala> m.values.mkString
res1: String = abcdefghi