Scala Sub string combinations with delimiter - scala

My input string is
element1-element2-element3-element4a|element4b-element5-element6a|element6b
All elements (substrings) are separated by - (dash), and some elements have alternatives separated by | (pipe).
A valid output string contains the elements separated by - (dash) only, with exactly one alternative chosen wherever alternatives are separated by |.
The list of all valid output strings has to be returned.
Output:
element1-element2-element3-element4a-element5-element6a
element1-element2-element3-element4b-element5-element6a
element1-element2-element3-element4a-element5-element6b
element1-element2-element3-element4b-element5-element6b
This can be done with a while loop and string functions, but that gets complicated.
(I'm a traditional Java programmer.)
Can this be implemented using Scala features in a way that is more efficient?
Note: the input can contain any number of elements and pipes.

This seems to fit the bill.
def getCombinations(input: String) = {
  val group = """(\w|\|)+""".r // Match groups of letters and pipes
  val word = """\w+""".r // Match groups of letters in between pipes
  val groups = group.findAllIn(input).map(word.findAllIn(_).toVector).toList
  // Use fold to construct a 'tree' of vectors, appending each possible entry in a
  // pipe-separated group to each previous prefix. We're using vectors because
  // the append time is effectively O(1) rather than O(n).
  val tree = groups match {
    case (x :: tail) => {
      val head = x.map(Vector(_)) // split each element in the head into its own node
      tail.foldLeft(head) { (acc, elems) =>
        for (elem <- elems; xs <- acc) yield (xs :+ elem)
      }
    }
    case _ => Nil // Handle the case of 0 inputs
  }
  tree.map(_.mkString("-")) // Combine each of our trees back into a dash-separated string
}
I haven't tested this with extensive input, but the runtime complexity shouldn't be too bad. Each group of pipe-separated alternatives multiplies the number of output strings, but that's due to the nature of the problem.
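The same prefix-extension idea can also be sketched with plain split calls instead of regular expressions. This is only an illustration, assuming that elements never contain - or | themselves; the names Combos and combinations are made up for this example.

```scala
object Combos {
  // Hypothetical helper: expands every '|' alternative into its own output string.
  def combinations(input: String): List[String] =
    if (input.isEmpty) Nil
    else {
      // "a-b|c" becomes List(List("a"), List("b", "c"))
      val groups = input.split('-').toList.map(_.split('|').toList)
      // Start from one empty prefix and extend every prefix with every alternative.
      val expanded = groups.foldLeft(List(Vector.empty[String])) { (prefixes, alternatives) =>
        for (alt <- alternatives; prefix <- prefixes) yield prefix :+ alt
      }
      expanded.map(_.mkString("-"))
    }
}
```

For example, Combos.combinations("a-b|c-d") gives List("a-b-d", "a-c-d"). The number of results is the product of the group sizes, so the output grows multiplicatively with each group of alternatives.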

Related

Passing parameters to foldLeft scala

I am trying to use Scala's foldLeft function to compare a Stream with a List of Lists.
This is a snippet of the code I have
def freqParse(pairs: (Pair,Int,String), record: String): (Pair,Int,String) ={
val m: Pair = ("","")
val t: FreqPairs = Map((m,0))
(pairs._1,pairs._2,pairs._3)
}
val freqItems = items.map(v => (v._1)).toList
val cross = freqItems.flatMap(x => freqItems.map(y => (x, y))) // cross product to get pair of frequent items
val freq = lines.foldLeft(pairs,0,delim)(freqParse)
cross is basically a list where each element is a pair of strings: List((a,b), (a,c), ...).
lines is an input stream with a record per line (25902 lines in total).
I want to count how many times each pair (or each element in cross) occurs in the entirety of the stream. Essentially comparing all elements in cross to all elements in lines.
I decided on foldLeft because that way I can take each line in the stream, split it at a delimiter, and then check whether both elements of a pair in cross appear in it.
I am able to split each record in lines, but I don't know how to pass the cross variable to the function to begin the comparison.
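One possible approach, which is not from the original post, is to avoid passing cross as part of the accumulator at all: a function literal given to foldLeft can simply refer to cross from the enclosing scope. The sketch below assumes each record is a delimiter-separated list of items; PairCounts and countPairs are hypothetical names.

```scala
object PairCounts {
  type Pair = (String, String)

  // Hypothetical helper: for each candidate pair, counts the records that contain both items.
  def countPairs(lines: Iterable[String], cross: List[Pair], delim: String): Map[Pair, Int] = {
    // Every candidate pair starts with a count of zero.
    val zero: Map[Pair, Int] = cross.map(_ -> 0).toMap
    lines.foldLeft(zero) { (counts, record) =>
      // Note: String.split interprets delim as a regular expression.
      val items = record.split(delim).toSet
      // cross is captured from the enclosing scope, so it needs no extra parameter.
      cross.foldLeft(counts) { (acc, pair) =>
        if (items(pair._1) && items(pair._2)) acc.updated(pair, acc(pair) + 1) else acc
      }
    }
  }
}
```

Only the running counts travel through the fold as the accumulator; the fixed inputs (cross and delim) stay outside it.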

How to access previous element when using yield in for loop chisel3

This is a mixed Chisel/Scala question.
Background: I need to sum up a lot of numbers (the number of input signals is configurable). Due to timing constraints I had to split them into groups of 4 and pipeline (register) each group's sum, which is then fed into the next stage (each stage is 4 times smaller, until I reach one).
This is my code:
// log4 Aux function //
def log4(n : Int): Int = math.ceil(math.log10(n.toDouble) / math.log10(4.0)).toInt
// stage //
def Adder4PipeStage(len: Int, in: Vec[SInt]): Vec[SInt] = {
  require(in.length % 4 == 0) // will not work if the length is not a multiple of 4
  val pipe = RegInit(VecInit(Seq.fill(len/4)(0.S(in(0).getWidth.W))))
  pipe.zipWithIndex.foreach { case (p, j) => p := in.slice(j*4, (j+1)*4).reduce(_ +& _) }
  pipe
}
// the pipeline
val adderPiped = for(j <- 1 to log4(len)) yield Adder4PipeStage(len/j,if(j==1) io.in else <what here ?>)
How do I access the previous stage? I am also open to hearing about other ways to implement the above.
There are several things you could do here:
You could just use a var for the "previous" value:
var prev: Vec[SInt] = io.in
val adderPiped = for(j <- 1 to log4(len)) yield {
prev = Adder4PipeStage(len/j, prev)
prev
}
It is a little weird using a var with a for yield (since the former is fundamentally mutable while the latter tends to be used with immutable-style code).
You could alternatively use a fold building up a List:
// Build up backwards and reverse (typical in functional programming)
val adderPiped = (1 to log4(len)).foldLeft(io.in :: Nil) {
case (pipes, j) => Adder4PipeStage(len/j, pipes.head) :: pipes
}.reverse
.tail // Tail drops "io.in" which was 1st element in the result List
If you don't like the backwards construction of the previous fold, you could use a fold with a Vector (better for appending than a List):
val adderPiped = (1 to log4(len)).foldLeft(Vector(io.in)) {
case (pipes, j) => pipes :+ Adder4PipeStage(len/j, pipes.last)
}.tail // Tail drops "io.in" which was 1st element in the result Vector
Finally, if you don't like these immutable ways of doing it, you could always just embrace mutability and write something similar to what one would in Java or Python, with a for loop and a mutable collection:
import scala.collection.mutable

val pipes = new mutable.ArrayBuffer[Vec[SInt]]
for (j <- 1 to log4(len)) {
  pipes += Adder4PipeStage(len/j, if (j == 1) io.in else pipes.last)
}
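Another option in the same spirit is scanLeft, which threads the previous result into each step and keeps every intermediate value. The sketch below is plain Scala rather than Chisel: Vector[Int] stands in for Vec[SInt], and sumGroupsOf4 is a made-up, register-free stand-in for Adder4PipeStage, so it only illustrates the collection pattern.

```scala
object StageChain {
  // Stand-in for Adder4PipeStage: sums each group of four values (no registers, no Chisel).
  def sumGroupsOf4(in: Vector[Int]): Vector[Int] =
    in.grouped(4).map(_.sum).toVector

  // Each stage consumes the previous stage's output; `tail` drops the initial input.
  def stages(input: Vector[Int], numStages: Int): Vector[Vector[Int]] =
    (1 to numStages).scanLeft(input)((prev, _) => sumGroupsOf4(prev)).tail.toVector
}
```

With sixteen ones as input and two stages, this produces a first stage of four 4s and a second stage holding the single value 16. Presumably the same shape would carry over to the Chisel code by starting the scan from io.in and calling Adder4PipeStage in the function body, but that has not been checked against a Chisel build.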

representation of values (x,y) vs x._1,y._1

I am new to Spark with Scala and am confused by the notation (x, y) in some scenarios and x._1, y._1 in others, especially when one is used in place of the other in Spark transformations.
Could someone explain whether there is a rule of thumb for when to use each of these syntaxes?
Basically there are two ways to access a tuple parameter in an anonymous function. They're functionally equivalent, so use whichever you prefer:
Through the attributes _1, _2, ...
Through pattern matching into variables with meaningful names
val tuples = Array((1, 2), (2, 3), (3, 4))
// Attributes
tuples.foreach { t =>
  println(s"${t._1} ${t._2}")
}
// Pattern matching
tuples.foreach { t =>
  t match {
    case (first, second) =>
      println(s"$first $second")
  }
}
// Pattern matching can also be written as
tuples.foreach { case (first, second) =>
  println(s"$first $second")
}
The notation (x, y) is a tuple of 2 elements, x and y. There are different ways to get access to the individual values in a tuple. You can use the ._1, ._2 notation to get at the elements:
val tup = (3, "Hello") // A tuple with two elements
val number = tup._1 // Gets the first element (3) from the tuple
val text = tup._2 // Gets the second element ("Hello") from the tuple
You can also use pattern matching. One way to extract the two values is like this:
val (number, text) = tup
Unlike a collection (for example, a List) a tuple has a fixed number of values (it's not always exactly two values) and the values can have different types (such as an Int and a String in the example above).
There are many tutorials about Scala tuples, for example: Scala tuple examples and syntax.
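To make the comparison concrete in a key-value setting similar to what Spark pair transformations work with, here is a small sketch using ordinary Scala collections (no Spark involved). The sales data and the names TupleAccess, totalsByAccessor and totalsByPattern are invented for this example.

```scala
object TupleAccess {
  val sales = List(("apples", 3), ("pears", 2), ("apples", 4))

  // Positional accessors: compact, but _1 and _2 say nothing about what they hold.
  val totalsByAccessor: Map[String, Int] =
    sales.groupBy(_._1).map(kv => kv._1 -> kv._2.map(_._2).sum)

  // Pattern matching: the same computation with every part of the tuple named.
  val totalsByPattern: Map[String, Int] =
    sales.groupBy { case (fruit, _) => fruit }
      .map { case (fruit, rows) => fruit -> rows.map { case (_, qty) => qty }.sum }
}
```

Both values are the same map. The accessor form tends to be fine for one-off projections, while the pattern-matching form usually reads better once tuples are nested or used more than once.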

Scala: Split array into chunks by some logic

Is there some predefined function in Scala to split list into several lists by some logic? I found grouped method, but it doesn't fit my needs.
For example, I have List of strings: List("questions", "tags", "users", "badges", "unanswered").
And I want to split this list by max length of strings (for example 12). In other words in each resulting chunk sum of the length of all strings should not be more than 12:
List("questions"), List("tags", "users"), List("badges"), List("unanswered")
EDIT: I don't necessarily need to find the optimal way of combining strings into chunks, just a linear loop that checks the next string in the list; if it doesn't fit within the limit (12), the current chunk is returned and that string starts the next chunk.
There is no built-in mechanism for that, as far as I know, but you could achieve something like it with a foldLeft and a bit of coding:
val test = List("questions", "tags", "users", "badges", "unanswered")
test.foldLeft(List.empty[List[String]]) {
  // <= because a chunk may be exactly 12 characters long, just not more
  case ((head :: tail), word) if head.map(_.length).sum + word.length <= 12 =>
    (word :: head) :: tail
  case (result, word) =>
    List(word) :: result
}
-> res0: List[List[String]] = List(List(unanswered), List(badges), List(users, tags), List(questions))
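Because the fold prepends, both the chunks and the words inside each chunk come out in reverse order. A possible variant that restores the original order is sketched below; chunkByLength and its maxLen parameter are hypothetical names, and the limit is treated as inclusive.

```scala
object Chunker {
  // Hypothetical helper: greedily packs words into chunks whose total length is <= maxLen.
  def chunkByLength(words: List[String], maxLen: Int): List[List[String]] =
    words.foldLeft(List.empty[List[String]]) {
      // The newest chunk is kept at the head, with its words in reverse order.
      case (current :: done, word) if current.map(_.length).sum + word.length <= maxLen =>
        (word :: current) :: done
      case (chunks, word) =>
        List(word) :: chunks
    }.map(_.reverse).reverse // undo the reversal inside each chunk and across chunks
}
```

Called with the list from the question and a limit of 12, this yields List(List("questions"), List("tags", "users"), List("badges"), List("unanswered")). A word longer than the limit still gets a chunk of its own.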
If you can make reasonable assumptions about the max length of a string (e.g. 10 characters), you could then use sliding which is much faster for a long list:
val elementsPerChunk = ??? // do some maths like
// CHUNK_SIZE / MAX_ELEM_LENGTH
val test = List("questions", "tags", "users", "badges", "unanswered")
test.sliding(elementsPerChunk, elementsPerChunk).toList

Function to return List of Map while iterating over String, kmer count

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC)
I've parsed the input and created one huge string containing the entire sequence, because the newlines throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps List[Map[String, Int]] with which it should be easy to use scala's groupBy function to get the count of the common k-mers
import scala.io.Source
object Main {
  def main(args: Array[String]) {
    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray
    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")
    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)
  }
  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!
You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work:
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` is inclusive, so the last k-mer (starting at seq.length - k) is not skipped
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
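Going one step further from groupBy(identity) to actual frequencies could look roughly like the sketch below. The names KmerCount and kmerCounts are made up, and k is assumed to be positive (sliding rejects a non-positive window size).

```scala
object KmerCount {
  // Hypothetical helper: frequency of every k-mer in seq (k must be positive).
  def kmerCounts(seq: String, k: Int): Map[String, Int] =
    seq.sliding(k)
      .filter(_.length == k) // sliding yields one short window when seq is shorter than k
      .toList
      .groupBy(identity)
      .map { case (kmer, occurrences) => kmer -> occurrences.size }
}
```

For "AGCTTTC" and k = 5 every k-mer maps to 1, while KmerCount.kmerCounts("AAAA", 2) gives Map("AA" -> 3). The result is a single Map[String, Int], which may be more convenient than the List[Map[String, Int]] the question started with.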