How to split a Sequence in scala - scala

I'm using the Play framework with Scala and I want to split a Seq[String] into subsequences.
A SQL query returns a Seq[String] containing colours and seasons; it looks like this:
spring; summer; autumn; winter, red; green; blu
The seasons and colours are separated by a comma, and I want to split that sequence into 2 subsequences, one with the seasons and the other with the colours.
I've tried with:
val subsequence=sequence.split(",")
But it doesn't work and returns this error: value split is not a member of Seq[String]
So what can I do?

Assuming your sequence contains a single string:
val sequence = Seq("spring; summer; autumn; winter, red; green; blu")
val split = sequence.flatMap(_.split(","))
// => split: Seq[String] = List(spring; summer; autumn; winter, " red; green; blu")

Try grouping,
val xs = Seq("spring; summer; autumn; winter, red; green; blu")
val groups = xs.head.split(",|;").map(_.trim).grouped(4)
This delivers an iterator of arrays of up to 4 items. The last array contains only 3, the colours.
To see the contents in the iterator,
groups.toArray
Array(Array(spring, summer, autumn, winter),
Array(red, green, blu))

In addition to lloydmeta's answer:
sequence.flatMap(_.split(",")).map(_.split(";"))
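A minimal sketch of that chain with trimming added, assuming the whole query result arrives as a single string inside the Seq. The names parts, seasons and colours are illustrative and not from the question:

```scala
// Assumed input: one string holding both groups, as in the question.
val sequence = Seq("spring; summer; autumn; winter, red; green; blu")

// Split on the comma first, then on semicolons, trimming each item.
val parts: Seq[List[String]] =
  sequence.flatMap(_.split(",")).map(_.split(";").map(_.trim).toList)

// Destructure only when exactly two groups came back.
val (seasons, colours) = parts match {
  case Seq(s, c) => (s, c)
  case other     => sys.error(s"expected 2 groups, got ${other.size}")
}
```

Converting each inner Array to a List is a choice made here so the groups compare and print by value; the original one-liner leaves them as arrays.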

Based on a single element in the sequence, this should give you what you seem to want. It also gives you a way to handle the case where the string returned by the SQL query doesn't contain the expected data. You may need to trim the strings if that is a requirement.
val xs = Seq("spring; summer; autumn; winter, red; green; blu")
val ys = xs.head.split(",") match {
  case Array(seasons, colours) => Array(seasons.split(";"), colours.split(";"))
  case _ => ??? // unexpected case - handle appropriately
}
println(ys.toList.map(_.toList))
// List(List(spring,  summer,  autumn,  winter), List( red,  green,  blu))
// (untrimmed items keep their leading space)

Related

Use two lists of variables to add scala columns

I have two Seqs that I want to use to add columns to a dataframe.
Seq one is something like:
Seq("red", "blue", "green", "yellow", "violet")
and Seq two is something like:
Seq("child", "teen", "adult", "senior")
I also have a column that is a string that is in the format of: s"$color+$age-score=$score", containing every combination of the colors and ages, with a resulting score, so 20 different color-age scores.
Currently, I am doing something like
finalDF.withColumn("red_child", getScore("red", "child"))
  .withColumn("red_teen", getScore("red", "teen"))
  .withColumn("red_adult", getScore("red", "adult"))
and so on, for all 20 possible combinations, with getScore being a helper function that takes care of the regex.
Since I am using withColumn 20 times, the code is very hard to read. Is there a way to make it cleaner, using the two Seqs for color and age to loop and add the columns to the dataframe?
Thanks.
You can simply select additional columns derived from the Tuple list generated using for-comprehension, as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val colors = Seq("red", "blue", "green", "yellow", "violet")
val ageGroups = Seq("child", "teen", "adult", "senior")
val colPairs = for { c <- colors; a <- ageGroups } yield (c, a)
def getScore(c: String, a: String): Column = ??? // the asker's regex helper
df.select(
  df.columns.map(col) ++ colPairs.map { case (c, a) =>
    getScore(c, a).as(c + "_" + a)
  }: _*
)
Alternatively, use foldLeft to traverse the colPairs list to add columns via withColumn:
colPairs.foldLeft(df) { case (accDF, (c, a)) =>
  accDF.withColumn(c + "_" + a, getScore(c, a))
}
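The pairing and folding logic can be checked without Spark on the classpath. This is only a sketch: a Vector of column names stands in for the DataFrame (a hypothetical stand-in), and colNames and added are names made up for the illustration:

```scala
val colors    = Seq("red", "blue", "green", "yellow", "violet")
val ageGroups = Seq("child", "teen", "adult", "senior")

// Every (color, age) combination, with colors varying slowest.
val colPairs = for { c <- colors; a <- ageGroups } yield (c, a)

// Column names in the same c + "_" + a form used above.
val colNames = colPairs.map { case (c, a) => c + "_" + a }

// foldLeft threads an accumulator through the pairs; here the accumulator
// is a Vector of names instead of a DataFrame (stand-in, no Spark needed).
val added = colPairs.foldLeft(Vector.empty[String]) { case (acc, (c, a)) =>
  acc :+ (c + "_" + a)
}
```

The fold visits the pairs in the same order as the for-comprehension produced them, so the columns would be added as red_child, red_teen, and so on.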

How to split and remove empty spaces before get the result

I have this String:
val str = "9617 / 20634"
So after the split I want to parse only the 2 values, 9617 and 20634, in order to calculate a percentage.
Instead of trimming after the split, can I do it before?
It is easier to remove spaces after the split than before. Unless you are required to trim first, here is a simpler way of doing it.
val Array(x, y) = "9617 / 20634".split("/").map(_.trim.toFloat)
val p = x / y * 100
Values are converted to Float to prevent integer division leading to 0.
The val Array(x, y) = ... statement is a pattern match; it is just another way of calling unapplySeq of the Array object.
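To make the pattern match explicit, here is a rough sketch of what the one-liner amounts to, plus a variant that returns None instead of throwing when the input does not have exactly two numeric parts. The function name percentage is hypothetical:

```scala
val parts: Array[Float] = "9617 / 20634".split("/").map(_.trim.toFloat)

// val Array(x, y) = parts behaves roughly like this match:
val (x, y) = parts match {
  case Array(a, b) => (a, b)
  case other       => throw new MatchError(other)
}

// A variant that avoids MatchError and NumberFormatException.
def percentage(s: String): Option[Float] =
  s.split("/").map(_.trim) match {
    case Array(a, b) => scala.util.Try(a.toFloat / b.toFloat * 100).toOption
    case _           => None
  }
```

The Option version is worth considering when the string comes from outside the program, since the destructuring val fails at runtime on any other shape.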
For a more scalable solution, one might use regexps:
scala> val r = """(\d+) */ *(\d+)""".r
r: scala.util.matching.Regex = (\d+) */ *(\d+)
scala> "111 / 222" match {
  case r(a, b) => println(s"Got $a, $b")
  case _ =>
}
Got 111, 222
This is useful if you need different usage patterns.
Here you define a correspondence between the values in your input (111, 222) and the match groups ((\d+), (\d+)).
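As a sketch of one such usage pattern, the same regex can be wrapped in a function that returns the two numbers or None. Ratio and parseRatio are hypothetical names; note that a Regex used as an extractor must match the whole string:

```scala
// Same pattern as above: digits, optional spaces, a slash, optional spaces, digits.
val Ratio = """(\d+) */ *(\d+)""".r

// Returns both numbers when the entire string matches, None otherwise.
// toLong can still throw for digit runs beyond the Long range.
def parseRatio(s: String): Option[(Long, Long)] = s match {
  case Ratio(a, b) => Some((a.toLong, b.toLong))
  case _           => None
}
```

Because the match is anchored to the full string, input with leading or trailing spaces is rejected unless it is trimmed first.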

Function to return List of Map while iterating over String, kmer count

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC)
I've parsed the input and created one huge string holding the entire sequence; the newlines throw off the k-mer counting, since the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps, List[Map[String, Int]], with which it should be easy to use Scala's groupBy function to get the count of the common k-mers.
import scala.io.Source
object Main {
  def main(args: Array[String]) {
    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray
    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")
    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)
  }
  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!
You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work:
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` is inclusive, so the last k-mer (starting at seq.length - k) is kept
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
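Note that groupBy(identity) gives a Map[String, List[String]], so one more step is needed to turn each group into a count. A minimal sketch, where kmerCounts is a hypothetical name and the guard for short input is an addition of this example:

```scala
// Count every k-mer in seq; sliding(k) yields each window of length k.
// The guard matters: sliding on a string shorter than k would yield
// one short window, and k <= 0 would throw.
def kmerCounts(k: Int, seq: String): Map[String, Int] =
  if (k <= 0 || seq.length < k) Map.empty
  else
    seq.sliding(k).toList
      .groupBy(identity)
      .map { case (kmer, hits) => kmer -> hits.size }
```

For a very large genome string it may be preferable to fold over the sliding iterator into a mutable map rather than materialise the full list first; that trade-off is not explored here.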

Succinct way of reading data from file into an immutable 2 dimensional array in Scala

What I am looking for is a succinct way of ending up with an immutable two dimensional array X and one dimensional array Y without first scanning the file to find out the dimensions of the data.
The data, which consists of a header line followed by columnar double values, is in the following format
X0, X1, X2, ...., Y
0.1, 1.2, -0.2, ..., 1.1
0.2, 0.5, 0.4, ..., -0.3
-0.5, 0.3, 0.3, ..., 0.1
I have the following code (so far) for getting lines from a file and tokenizing each comma-delimited line in order to get the samples. It currently neither fills in the X and Y arrays nor assigns num and dimx.
val X = new Array[Array[Double]](num,dimx)
val Y = new Array[Double](num)
def readDataFromFile(filename: String) {
  var firstTime = true
  val lines = fromFile(filename).getLines
  lines.foreach(line => {
    val tokens = line split(",")
    if(firstTime) {
      tokens.foreach(token => // get header titles and set dimx)
      firstTime = false
    } else {
      println("data")
      tokens.foreach(token => //blah, blah, blah...)
    }
  })
}
Obviously this is an issue because, while I can detect and use dimx on-the-fly, I don't know num a priori. Also, the repeated tokens.foreach is not very elegant. I could first scan the file and determine the dimensions, but this seems like a nasty way to go. Is there a better way? Thanks in advance
There isn't anything built in that's going to tell you the size of your data. Why not have the method return your arrays instead of you declaring them outside? That way you can also handle error conditions better.
case class Hxy(headers: Array[String], x: Array[Array[Double]], y: Array[Double]) {}
def readDataFromFile(name: String): Option[Hxy] = {
  val lines = io.Source.fromFile(name).getLines
  if (!lines.hasNext) None
  else {
    val header = lines.next.split(",").map(_.trim)
    try {
      val xy = lines.map(_.split(",").map(_.trim.toDouble)).toArray
      if (xy.exists(_.length != header.length)) None
      else Some( Hxy(header, xy.map(_.init), xy.map(_.last)) )
    }
    catch { case nfe: NumberFormatException => None }
  }
}
Here, only if we have well-formed data do we get back the relevant arrays (helpfully packaged into a case class); otherwise, we get back None so we know that something went wrong.
(If you want to know why it didn't work, replace Option[Hxy] with something like Either[String,Hxy] and return Right(...) instead of Some(...) on success, Left(message) instead of None on failure.)
Edit: If you want the values (not just the array sizes) to be immutable, then you'd need to map everything to Vector somewhere along the way. I'd probably do it at the last step when you're placing the data into Hxy.
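Putting the two suggestions together (Either for the failure reason, Vector for immutable values), a sketch might look like the following. It works on already-read lines rather than a file so it can be tried in isolation; Table and parseLines are hypothetical names and the error messages are made up:

```scala
import scala.util.Try

// Vector-based result, so the values as well as the sizes are immutable.
case class Table(headers: Vector[String], x: Vector[Vector[Double]], y: Vector[Double])

def parseLines(lines: Seq[String]): Either[String, Table] =
  lines match {
    case Seq() => Left("empty input")
    case headerLine +: rest =>
      val header = headerLine.split(",").map(_.trim).toVector
      // Try captures the NumberFormatException thrown by toDouble.
      val parsed = Try(rest.map(_.split(",").map(_.trim.toDouble).toVector).toVector)
      parsed.toEither match {
        case Left(e) => Left(s"bad number: ${e.getMessage}")
        case Right(rows) if header.isEmpty || rows.exists(_.length != header.length) =>
          Left("row width does not match header")
        case Right(rows) =>
          // Last column is Y, everything before it is X.
          Right(Table(header, rows.map(_.init), rows.map(_.last)))
      }
  }
```

Vectors also compare by value, which makes the result easier to test than nested arrays.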
Array, as in Java, is mutable, so you can't have an immutable array; you need to choose between Array and immutability. One way to achieve your goal without foreach calls and vars is similar to the following:
// simulate the lines for this example
val lines = List("X,Y,Z,","1,2,3","2,5.0,3.4")
val res = lines.map(_.split(",")).toArray
Use Array.newBuilder. I assume that the header has already been extracted.
val b = Array.newBuilder[Array[Double]]
lines.foreach { b += _.split(",").map(_.toDouble) }
val data = b.result
If you want to be immutable, take some immutable implementation of IndexedSeq (e.g. Vector) instead of Array; builders work on all collections.
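A short sketch of the same builder approach producing nested Vectors instead of arrays. The sample lines are assumed data, and the header is taken to be extracted already:

```scala
// Assumed sample rows; not taken from a real file.
val lines = List("0.1, 1.2, 1.1", "0.2, 0.5, -0.3")

// Same builder idea as with Array, but the result is an immutable Vector.
val b = Vector.newBuilder[Vector[Double]]
lines.foreach { line => b += line.split(",").map(_.trim.toDouble).toVector }
val data: Vector[Vector[Double]] = b.result()
```

The trim call is an addition here so that values written with a space after the comma parse without relying on toDouble's own whitespace handling.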

How would I map a list of strings with a known format to a list of tuples?

I have an array of strings. Each string has 2 parts and is separated by white space. Looks like:
x <white space> y
I want to turn it into an array of Tuples where each tuple has (x, y)
How can I write this in scala? I know it will need something similar to:
val results = listOfStrings.collect { str => (str.left, str.right) }
I'm not sure how I can break up each str into the left and right sides needed...
You could take advantage of the fact that in Scala, regular expressions are also "extractors".
scala> val PairWithSpaces = "(\\w+)\\s+(\\w+)".r
PairWithSpaces: scala.util.matching.Regex = (\w+)\s+(\w+)
scala> val PairWithSpaces(l, r) = "1 17"
l: String = 1
r: String = 17
Now you can build your extractor into a natural looking "map":
scala> Array("a b", "1 3", "Z x").map{case PairWithSpaces(x,y) => (x, y) }
res10: Array[(String, String)] = Array((a,b), (1,3), (Z,x))
Perhaps overkill for you, but can really help readability if your regex gets fancy. I also like how this approach will fail fast if an illegal string is given.
Warning, not sure if the regex matches exactly what you need...
You could (assuming that you want to drop without complaint any string that doesn't fit the pattern):
val results = listOfStrings.map(_.split("\\s+")).collect { case Array(l,r) => (l,r) }
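For illustration, here is that one-liner run against some assumed sample data, including two strings that do not fit the pattern. The trim before the split is an addition of this sketch, so that leading whitespace does not produce an empty first part:

```scala
// Assumed sample input: two well-formed pairs and two that should be dropped.
val listOfStrings = List("a b", "1   3", "malformed", "too many parts")

// collect keeps only the arrays with exactly two parts.
val results: List[(String, String)] =
  listOfStrings.map(_.trim.split("\\s+")).collect { case Array(l, r) => (l, r) }
```

Unlike the regex extractor inside map, which throws a MatchError on bad input, this version silently skips anything that does not split into exactly two parts.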