Function to return List of Map while iterating over String, kmer count - scala

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC).
I've parsed the input and created one huge string that holds the entire sequence, because the newlines throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps, List[Map[String, Int]], with which it should be easy to use Scala's groupBy function to get the count of the common k-mers.
import scala.io.Source
object Main {
  def main(args: Array[String]) {
    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray
    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")
    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)
  }
  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!

You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work:
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` is inclusive, so the last k-mer (starting at seq.length - k) is not dropped
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
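To get from the grouped k-mers to actual counts, each group only needs to be replaced by its size. Here is a minimal sketch of that idea, assuming the whole sequence fits in memory as one String; the names KmerCount and kmerCounts are made up for illustration:

```scala
object KmerCount {
  // Count every k-mer (substring of length k) in seq.
  def kmerCounts(seq: String, k: Int): Map[String, Int] =
    if (k <= 0 || seq.length < k) Map.empty // no full window fits
    else
      seq.sliding(k)       // iterator over all windows of length k
        .toList
        .groupBy(identity) // k-mer -> list of its occurrences
        .map { case (kmer, hits) => kmer -> hits.size }
}
```

For example, KmerCount.kmerCounts("AAAA", 2) gives Map("AA" -> 3). The length guard is there because sliding on a string shorter than k would otherwise yield that short string as a single window.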

Related

Passing parameters to foldLeft scala

I am trying to use Scala's foldLeft function to compare a Stream with a List of Lists.
This is a snippet of the code I have
def freqParse(pairs: (Pair,Int,String), record: String): (Pair,Int,String) ={
val m: Pair = ("","")
val t: FreqPairs = Map((m,0))
(pairs._1,pairs._2,pairs._3)
}
val freqItems = items.map(v => (v._1)).toList
val cross = freqItems.flatMap(x => freqItems.map(y => (x, y))) // cross product to get pair of frequent items
val freq = lines.foldLeft(pairs,0,delim)(freqParse)
cross is basically a list where each element is a pair of strings: List(List(a,b), List(a,c)...).
lines is an input stream with a record per line (25902 lines in total).
I want to count how many times each pair (or each element in cross) occurs in the entirety of the stream. Essentially comparing all elements in cross to all elements in lines.
I decided on foldLeft because that way I can take each line in the stream, split it at a delimiter, and then check whether both elements of a pair in cross appear in it.
I am able to split each record in lines but I don't know how to pass the cross variable to the function to begin the comparison.

How to count the number of iterations in a for comprehension in Scala?

I am using a for comprehension on a stream and I would like to know how many iterations it took to get to the final results.
In code:
var count = 0
for {
xs <- xs_generator
x <- xs
count = count + 1 //doesn't work!!
if (x prop)
yield x
}
Is there a way to achieve this?
Edit: If you don't want to return only the first item, but the entire stream of solutions, take a look at the second part.
Edit-2: Shorter version with zipWithIndex appended.
It's not entirely clear what you are attempting to do. To me it seems as if you are trying to find something in a stream of lists, and additionally save the number of checked elements.
If this is what you want, consider doing something like this:
/** Returns `x` that satisfies predicate `prop`
  * as well as the total number of tested `x`s
  */
def findTheX(): (Int, Int) = {
  val xs_generator = Stream.from(1).map(a => (1 to a).toList).take(1000)
  var count = 0
  def prop(x: Int): Boolean = x % 317 == 0
  for (xs <- xs_generator; x <- xs) {
    count += 1
    if (prop(x)) {
      return (x, count)
    }
  }
  throw new Exception("No solution exists")
}
println(findTheX())
// prints:
// (317,50403)
Several important points:
Scala's for-comprehensions have nothing to do with Python's "yield". Just in case you thought they did: re-read the documentation on for-comprehensions.
There is no built-in syntax for breaking out of for-comprehensions. It's better to wrap it into a function, and then call return. There is also breakable though, but it works with Exceptions.
The function returns the found item and the total count of checked items, therefore the return type is (Int, Int).
The exception thrown at the end, after the for-comprehension, ensures that the last expression has type Nothing <: (Int, Int) instead of Unit, which is not a subtype of (Int, Int).
Think twice when you want to use Stream for such purposes in this way: after generating the first few elements, the Stream holds them in memory. This might lead to "GC-overhead limit exceeded"-errors if the Stream isn't used properly.
Just to emphasize it again: the yield in Scala for-comprehensions is unrelated to Python's yield. Scala has no built-in support for coroutines and generators. You don't need them as often as you might think, but it requires some readjustment.
EDIT
I've re-read your question. If you want an entire stream of solutions together with a counter of how many different xs have been checked, you might use something like this instead:
val xs_generator = Stream.from(1).map(a => (1 to a).toList)
var count = 0
def prop(x: Int): Boolean = x % 317 == 0
val xsWithCounter = for {
  xs <- xs_generator
  x <- xs
  _ = { count = count + 1 }
  if (prop(x))
} yield (x, count)
println(xsWithCounter.take(10).toList)
// prints:
// List(
// (317,50403), (317,50721), (317,51040), (317,51360), (317,51681),
// (317,52003), (317,52326), (317,52650), (317,52975), (317,53301)
// )
Note the _ = { ... } part. There is a limited number of things that can occur in a for-comprehension:
generators (the x <- things)
filters/guards (if-s)
value definitions
Here, we sort-of abuse the value-definition syntax to update the counter. We use the block { count = count + 1 } as the right hand side of the assignment. It returns Unit. Since we don't need the result of the block, we use _ as the left hand side of the assignment. In this way, this block is executed once for every x.
EDIT-2
If mutating the counter is not your main goal, you can of course use zipWithIndex directly:
val xsWithCounter =
xs_generator.flatten.zipWithIndex.filter{x => prop(x._1)}
It gives almost the same result as the previous version, but each pair carries the zero-based index of x in the flattened stream rather than the number of tried x-s, so the first pair is (317,50402) instead of (317,50403).
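If the one-based count of tested elements is what you need, the index can be shifted while mapping. A rough sketch of that idea, using an Iterator rather than a Stream so that already-visited elements are not kept in memory; the names CountedSearch and firstMatchWithCount are invented here:

```scala
object CountedSearch {
  def prop(x: Int): Boolean = x % 317 == 0

  // Lazily flattens 1, 1 2, 1 2 3, ... and returns the first x that
  // satisfies prop together with the number of elements tested so far.
  def firstMatchWithCount(): Option[(Int, Int)] =
    Iterator.from(1)
      .flatMap(a => (1 to a).iterator)
      .zipWithIndex                                          // index is zero-based
      .collectFirst { case (x, i) if prop(x) => (x, i + 1) } // shift to a count
}
```

Calling CountedSearch.firstMatchWithCount() returns Some((317,50403)), the same pair that findTheX() prints, without a mutable counter or an early return.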

Getting an error trying to map through a list in Scala

I'm trying to print out all the factors of every number in a list.
Here is my code:
def main(args: Array[String])
{
  val list_of_numbers = List(1,4,6)
  def get_factors(list_of_numbers:List[Int]) : Int =
  {
    return list_of_numbers.foreach{(1 to _).filter {divisor => _ % divisor == 0}}
  }
  println(get_factors(list_of_numbers));
}
I want the end result to contain a single list that will hold all the numbers which are factors of any of the numbers in the list. So the final result should be (1,2,3,4,6). Right now, I get the following error:
error: missing parameter type for expanded function ((x$1) => 1.to(x$1))
return list_of_numbers.foreach{(1 to _).filter {divisor => _ % divisor == 0}}
How can I fix this?
You can only use _ shorthand once in a function (except for some special cases), and even then not always.
Try spelling it out instead:
list_of_numbers.foreach { n =>
(1 to n).filter { divisor => n % divisor == 0 }
}
This will compile.
There are other problems with your code though.
For example, foreach returns Unit, but your method is declared to return an Int.
Perhaps you wanted .map rather than .foreach, but that would still give you a List, not an Int.
A few things are wrong here.
First, foreach takes a function A => Unit as an argument, meaning that it's really just for causing side effects.
Second, your use of _: you can use _ only when the function uses each argument exactly once.
Lastly, your expected output seems to be getting rid of duplicates (1 is a factor of all 3 inputs, but it only appears once).
list_of_numbers.flatMap { i => (1 to i).filter { i % _ == 0 } }.distinct
will do what you are looking for.
flatMap takes a function A => List[B] and produces a flat List[B] as output; distinct gets rid of the duplicates.
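Wrapped in a method with the return type spelled out, the same approach might look like the sketch below. It assumes positive inputs; the names Factors and factorsOfAny are made up, and the final sorted is added only because the expected output in the question is in ascending order:

```scala
object Factors {
  // All distinct factors of any number in ns, in ascending order.
  def factorsOfAny(ns: List[Int]): List[Int] =
    ns.flatMap(n => (1 to n).filter(n % _ == 0)) // factors of each n, flattened
      .distinct                                  // drop repeated factors
      .sorted
}
```

Factors.factorsOfAny(List(1, 4, 6)) returns List(1, 2, 3, 4, 6). Without sorted, the result keeps first-seen order, List(1, 2, 4, 3, 6).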
Actually, there are several problems with your code.
First, foreach is a method which yields Unit (like void in Java). You want to yield something so you should use a for comprehension.
Second, in your divisor-test function, you've specified both the unnamed parameter ("_") and the named parameter (divisor).
The third problem is that you expect the result to be Int (in the code) but List[Int] in your description.
The following code will do what you want (although it will repeat factors, so you might want to pass it through distinct before using the result):
def main(args: Array[String]) {
  val list_of_numbers = List(1, 4, 6)
  def get_factors(list_of_numbers: List[Int]) =
    for (n <- list_of_numbers; r = 1 to n; f <- r.filter(n%_ == 0)) yield f
  println(get_factors(list_of_numbers))
}
Note that you need two generators ("<-") in the for comprehension in order that you end up with simply a List. If you instead implemented the filter part in the yield expression, you would get a List[List[Int]].
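The difference between the two shapes is easiest to see side by side. A small illustrative sketch; ForShapes, flat and nested are hypothetical names:

```scala
object ForShapes {
  val numbers = List(1, 4, 6)

  // Two generators: desugars to flatMap + withFilter + map, so the result is flat.
  val flat: List[Int] =
    for (n <- numbers; f <- (1 to n).toList if n % f == 0) yield f

  // One generator with the filter in the yield: one inner list per input number.
  val nested: List[List[Int]] =
    for (n <- numbers) yield (1 to n).toList.filter(n % _ == 0)
}
```

Here flat is List(1, 1, 2, 4, 1, 2, 3, 6), while nested is List(List(1), List(1, 2, 4), List(1, 2, 3, 6)).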

Scala split argument over several lines and parse to Int

This is probably going to end up being very simple, but I ask more to help me learn better Scala idioms (Python guy by trade looking to learn some scala tricks.)
I'm doing some HackerRank problems, and the required input method is reading lines from stdin. The spec is quoted below:
The first line contains the number of test cases T. T test cases
follow. Each case contains two integers N and M.
So in the input passed to the script looks something like this:
4
2 2
3 2
2 3
4 4
I'm wondering what would be the proper, idiomatic way to do this. I've thought of a few:
Use io.Source.stdin.getLines.zipWithIndex, then from within a foreach, if the index is greater than 0, split on whitespace and map to (_.toInt)
Use the same getLines function to get the input and then pattern match against the index.
Split on whitespace and newlines to make a single list of digits, map toInt, pop the first element (problem size) and then modulo 2 to make tuples of arguments for my problem function.
I'm wondering what more experienced scala programmers would consider the best way to parse these args, where the 2 element lines would be args to a function and the first, single digit line is just the number of problems to solve.
Maybe you're looking for something like this?
def f(x: Int, y: Int) = { f"do something with $x and $y" }
io.Source.stdin.getLines() // Source has getLines, not readLines
  .map(_.trim.split("\\s+").map(_.toInt)) // split and convert to ints
  .collect { case Array(a, b) => f(a, b) } // pass to f if there are two arguments
  .foreach(println) // print the result of each function call
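The same pipeline can be tried without stdin by feeding it any iterator of lines. A small sketch, assuming every line holds only whitespace-separated integers; InputParsing and parseCases are hypothetical names:

```scala
object InputParsing {
  // Keep only the lines that hold exactly two integers. The leading
  // test-case count has a single field, so collect skips it.
  def parseCases(lines: Iterator[String]): List[(Int, Int)] =
    lines
      .map(_.trim.split("\\s+").map(_.toInt))
      .collect { case Array(n, m) => (n, m) }
      .toList
}
```

With the sample input from the question, InputParsing.parseCases(Iterator("4", "2 2", "3 2", "2 3", "4 4")) yields List((2,2), (3,2), (2,3), (4,4)). A blank or non-numeric line would make toInt throw, so real input may need extra filtering.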
Another way to read the input for HackerRank problems is with scala.io.StdIn:
import scala.io.StdIn
import scala.collection.mutable.ArrayBuffer
object Solution {
  def main(args: Array[String]): Unit = {
    val q = StdIn.readInt() // number of test cases
    val lines = ArrayBuffer[Array[Int]]()
    (1 to q).foreach(_ => lines += StdIn.readLine.split(" ").map(_.toInt))
    for (a <- lines) {
      val n = a(0)
      val m = a(1)
      val ans = n * m
      println(ans)
    }
  }
}
I have tested it on Hacker Rank platform today and the output is:
4
6
6
16

Merge the intersection of two CSV files with Scala

From input 1:
fruit, apple, cider
animal, beef, burger
and input 2:
animal, beef, 5kg
fruit, apple, 2liter
fish, tuna, 1kg
I need to produce:
fruit, apple, cider, 2liter
animal, beef, burger, 5kg
The closest example I could get is:
object FileMerger {
  def main(args : Array[String]) {
    import scala.io._
    val f1 = (Source fromFile "file1.csv" getLines) map (_.split(", *")(1))
    val f2 = Source fromFile "file2.csv" getLines
    val out = new java.io.FileWriter("output.csv")
    f1 zip f2 foreach { x => out.write(x._1 + ", " + x._2 + "\n") }
    out.close
  }
}
The problem is that the example assumes that the two CSV files contain the same number of elements and in the same order. My merged result must only contain elements that are in the first and the second file. I am new to Scala, and any help will be greatly appreciated.
You need an intersection of the two files: the lines from file1 and file2 which share some criteria. Consider this through a set theory perspective: you have two sets with some elements in common, and you need a new set with those elements. Well, there's more to it than that, because the lines aren't really equal...
So, let's say you read file1, and that's of type List[Input1]. We could code it like this, without getting into any details of what Input1 is:
case class Input1(line: String)
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map Input1).toList
We can do the same thing for file2 and List[Input2]:
case class Input2(line: String)
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map Input2).toList
You might be wondering why I created two different classes if they have the exact same definition. Well, if you were reading structured data, you would have two different types, so let's see how to handle that more complex case.
Ok, so how do we match them, since Input1 and Input2 are different types? Well, the lines are matched by keys, which, according to your code, are the first column in each. So let's create a class Key, and conversions Input1 => Key and Input2 => Key:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.line split "," head) // using regex would be better
def Input2IsKey(input: Input2): Key = Key(input.line split "," head)
Ok, now that we can produce a common Key from Input1 and Input2, let's get the intersection of them:
val intersection = (f1 map Input1IsKey).toSet intersect (f2 map Input2IsKey).toSet
So we can build the intersection of lines we want, but we don't have the lines! The problem is that, for each key, we need to know from which line it came. Consider that we have a set of keys, and for each key we want to keep track of a value -- that's exactly what a Map is! So we can build this:
val m1 = (f1 map (input => Input1IsKey(input) -> input)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input)).toMap
So the output can be produced like this:
val output = intersection map (key => m1(key).line + ", " + m2(key).line)
All you have to do now is output that.
Let's consider some improvements on this code. First, note that the output produced above repeats the key -- that's exactly what your code does, but not what you want in the example. Let's change, then, Input1 and Input2 to split the key from the rest of the args:
case class Input1(key: String, rest: String)
case class Input2(key: String, rest: String)
It's now a bit harder to initialize f1 and f2. Instead of using split, which will break up the whole line unnecessarily (and at great cost to performance), we'll divide the line right at the first comma: everything before is key, everything after is rest. The method span does that:
def breakLine(line: String): (String, String) = line span (',' !=)
Play a bit with the span method on REPL to get a better understanding of it. As for (',' !=), that's just an abbreviated form of saying (x => ',' != x).
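For a quick look at what span returns here, the following sketch wraps the same split in an object (SpanDemo is a made-up name) and spells the predicate out as a lambda:

```scala
object SpanDemo {
  // span splits at the first character that fails the predicate,
  // so the comma stays at the front of the second half.
  def breakLine(line: String): (String, String) = line.span(_ != ',')
}
```

SpanDemo.breakLine("fruit, apple, cider") gives ("fruit", ", apple, cider"), and a line without any comma comes back whole, paired with an empty string.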
Next, we need a way to create Input1 and Input2 from a tuple (the result of breakLine):
def TupleIsInput1(tuple: (String, String)) = Input1(tuple._1, tuple._2)
def TupleIsInput2(tuple: (String, String)) = Input2(tuple._1, tuple._2)
We can now read the files:
val f1: List[Input1] = (Source fromFile "file1.csv" getLines () map breakLine map TupleIsInput1).toList
val f2: List[Input2] = (Source fromFile "file2.csv" getLines () map breakLine map TupleIsInput2).toList
Another thing we can simplify is intersection. When we create a Map, its keys form a set, so we can create the maps first, and then use their keys to compute the intersection:
case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.key)
def Input2IsKey(input: Input2): Key = Key(input.key)
// We now only keep the "rest" as the map value
val m1 = (f1 map (input => Input1IsKey(input) -> input.rest)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input.rest)).toMap
val intersection = m1.keySet intersect m2.keySet
And the output is computed like this:
val output = intersection map (key => key.key + m1(key) + m2(key)) // key.key is the raw string inside Key
Note that I don't append a comma anymore -- the rest of both f1 and f2 start with a comma already.
It's tough to infer a requirement from one example. Maybe something like this would serve your needs:
Create a map from key to line for the second file f2 (so from "animal, beef" -> "5kg")
For each line in the first file f1, get the key to look up in the map
Look up the value; if found, write the output
That translates to
val f1 = Source.fromFile("file1.csv").getLines()
val f2 = Source.fromFile("file2.csv").getLines()
val out = new java.io.FileWriter("output.csv") // same writer as in the question
val map = f2.map(_.split(", *")).map(arr => arr.init.mkString(", ") -> arr.last).toMap
for {
  line <- f1
  key = line.split(", *").init.mkString(", ")
  value <- map.get(key)
} {
  out.write(line + ", " + value + "\n")
}
out.close()
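The three steps above can also be sketched as a pure function over in-memory lines, which makes them easy to try without any files. This assumes, as in the example data, that the last field is the value and everything before it is the key; CsvMerge, keyAndLast and merge are hypothetical names:

```scala
object CsvMerge {
  // Everything before the last field is the key, the last field is the value.
  private def keyAndLast(line: String): (String, String) = {
    val fields = line.split(", *")
    (fields.init.mkString(", "), fields.last)
  }

  // Keeps the order of `first` and drops lines whose key is missing in `second`.
  def merge(first: Seq[String], second: Seq[String]): Seq[String] = {
    val lookup = second.map(keyAndLast).toMap
    for {
      line  <- first
      value <- lookup.get(keyAndLast(line)._1) // None simply skips the line
    } yield line + ", " + value
  }
}
```

With the two inputs from the question, merge returns "fruit, apple, cider, 2liter" and "animal, beef, burger, 5kg"; the "fish, tuna" line is dropped because its key does not occur in the first file.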