textfile parser: calculate start position with Scala

I am a beginner in Scala and I am trying to implement the following algorithm.
I have the following input:
11 DFI1-MONT_TT_13 9(18) 14 IntegerType
11 SERI1-SENS_13 X(01) 06 StringType
11 DDRI1-MONT_TT_14 9(18) 12 IntegerType
11 SQRI1-SENS_14 X(01) 14 StringType
11 XCRI1-MONT_TT_15 9(18) 10 IntegerType
11 QSRI1-SENS_15 X(01) 08 StringType
11 WQRI1-DEVISE X(03) 07 StringType
and I want to calculate the start position for each field, so my output should look like:
11 DFI1-MONT_TT_13 9(18) 0 14 IntegerType
11 SERI1-SENS_13 X(01) 14 06 StringType
11 DDRI1-MONT_TT_14 9(18) 20 12 IntegerType
11 SQRI1-SENS_14 X(01) 32 14 StringType
11 XCRI1-MONT_TT_15 9(18) 46 10 IntegerType
11 QSRI1-SENS_15 X(01) 56 08 StringType
11 WQRI1-DEVISE X(03) 64 07 StringType
The start position can be calculated as follows:
startposition_line_n = startposition_line_(n-1) + length_line_(n-1)
We assume that the start position of the first line is 0, so line 2 starts at 0 + 14 = 14, line 3 at 14 + 06 = 20, and so on.
I already know that I can use scanLeft or foldLeft, but as I am a beginner I don't know how to do this recursively. The above is just a sample from the dataset; the actual input includes many more lines.

Here's a tail-recursive method that takes a List[String] as input and produces a new, modified, List[String] as output.
@annotation.tailrec
def setPos(input: List[String],
           pos: Int = 0,
           acc: List[String] = List()): List[String] =
  if (input.isEmpty) acc.reverse
  else {
    val line = input.head.split("\\s+")
    setPos(input.tail,
           pos + line(3).toInt, // advance by this line's length
           line.patch(3, Seq(pos.toString), 0).mkString(" ") :: acc) // insert the start position before the length
  }
This assumes that the field at index 3 (the 4th space-delimited field) is always the length integer; the start position is inserted just before it. It will throw an exception if that's not the case.
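For example, applied to the first three lines from the question, this gives (a quick check):

val sample = List(
  "11 DFI1-MONT_TT_13 9(18) 14 IntegerType",
  "11 SERI1-SENS_13 X(01) 06 StringType",
  "11 DDRI1-MONT_TT_14 9(18) 12 IntegerType")

setPos(sample).foreach(println)
// 11 DFI1-MONT_TT_13 9(18) 0 14 IntegerType
// 11 SERI1-SENS_13 X(01) 14 06 StringType
// 11 DDRI1-MONT_TT_14 9(18) 20 12 IntegerType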

Something like this; drop a comment if you need clarification.
val input = List(
  "10 DFI1-MONT_TT_13 9(18) 14 IntegerType",
  "10 SERI1-SENS_13 X(01) 06 StringType",
  "10 DDRI1-MONT_TT_14 9(18) 12 IntegerType",
  "10 SQRI1-SENS_14 X(01) 14 StringType",
  "10 XCRI1-MONT_TT_15 9(18) 10 IntegerType",
  "10 QSRI1-SENS_15 X(01) 08 StringType",
  "10 WQRI1-DEVISE X(03) 07 StringType")
input.foldLeft((0, List[String]())) {
  case ((sum, acc), line) =>
    val sp = line.split(" ")
    val si = 3 // index of the length field
    (sum + sp(si).toInt,
     acc :+ ((sp.take(si) :+ sum.toString) ++ sp.takeRight(sp.size - si)).mkString(" "))
}._2
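Since the question also mentions scanLeft, here is a sketch of that approach (same assumption: the length is the field at index 3). scanLeft(0)(_ + _) produces the running start positions, with one extra trailing element that zip simply drops:

val starts = input.map(_.split("\\s+")(3).toInt).scanLeft(0)(_ + _) // 0, 14, 20, ...
(input zip starts).map { case (line, start) =>
  line.split("\\s+").patch(3, Seq(start.toString), 0).mkString(" ")
}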

Related

Scala/Spark : How to do outer join based on common columns

I have 2 dataframes:
The first dataframe contains temperature information.
The second dataframe contains precipitation information.
I read those files and created dataframes like this:
val dataRecordsTemp = sc.textFile(tempFile).map { rec =>
  val splittedRec = rec.split("\\s+")
  Temparature(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4))
}.map { x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForTemp = Seq("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP")
val schemaTemp = StructType(headerFieldsForTemp.map { f => StructField(f, StringType, nullable = true) })
val dfTemp = session.createDataFrame(dataRecordsTemp, schemaTemp)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing temparature data ...............................")
dfTemp.select("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP").take(10).foreach(println)

val dataRecordsPrecip = sc.textFile(precipFile).map { rec =>
  val splittedRec = rec.split("\\s+")
  Precipitation(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4), splittedRec(5))
}.map { x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForPrecipitation = Seq("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER")
val schemaPrecip = StructType(headerFieldsForPrecipitation.map { f => StructField(f, StringType, nullable = true) })
val dfPrecip = session.createDataFrame(dataRecordsPrecip, schemaPrecip)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing precipitation data ...............................")
dfPrecip.select("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER").take(10).foreach(println)
I have to join the 2 RDDs based on the common columns (year, month, day). The input files have a header, and the output file will have the header as well. The 1st file has information on temperature, for example:
year month day min-temp max-temp
2017 12 13 13 25
2017 12 16 25 32
2017 12 25 34 56
The 2nd file has precipitation information, for example:
year month day precipitation snow snow-cover
2018 7 6 0.00 0.0 0
2017 12 13 0.04 0.0 0
2017 12 16 0.4 0.04 1
My expected output should be (ordered by date ascending; if no value is found, then blank):
year month day min-temp max-temp precipitation snow snow-cover
2017 12 13 13 25 0.04 0.0 0
2017 12 16 25 32 0.4 0.04 1
2017 12 25 34 56
2018 7 6 0.00 0.0 0
May I get help on how to do that in Scala?
You need to outer join these two datasets and then order the result, like this:
import org.apache.spark.sql.functions._

dfTemp
  .join(dfPrecip, Seq("year", "month", "day"), "outer")
  .orderBy(desc("year"), desc("month"), desc("day"))
  .na.fill("")
If you don't need blank values and are fine with null, you can skip the .na.fill("").
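Note that the expected output in the question is ordered ascending, while the snippet above sorts descending; if you want the ascending order, this sketch swaps desc for asc:

dfTemp
  .join(dfPrecip, Seq("year", "month", "day"), "outer")
  .orderBy(asc("year"), asc("month"), asc("day"))
  .na.fill("")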
Hope it helps!

Interleaving iterators

I wrote the following code, expecting the last print method to show the elements of both iterators combined. Instead it only shows the elements of perfectSquares. Can someone explain this to me?
object Fuge {

  def main(args: Array[String]): Unit = {
    perfectSquares.takeWhile(_ < 100).foreach(square => print(square + " "))
    println()
    triangles.takeWhile(_ < 100).foreach(triangle => print(triangle + " "))
    println()
    (perfectSquares ++ triangles).takeWhile(_ < 100).foreach(combine => print(combine + " "))
  }

  def perfectSquares: Iterator[Int] = Iterator.from(1).map(x => x * x)

  def triangles: Iterator[Int] = Iterator.from(1).map(n => n * (n + 1) / 2)
}
OUTPUT:
1 4 9 16 25 36 49 64 81
1 3 6 10 15 21 28 36 45 55 66 78 91
1 4 9 16 25 36 49 64 81
From the documentation on takeWhile:
/** Takes longest prefix of values produced by this iterator that satisfy a predicate.
 *
 *  @param p The predicate used to test elements.
 *  @return  An iterator returning the values produced by this iterator, until
 *           this iterator produces a value that does not satisfy
 *           the predicate `p`.
 *  @note    Reuse: $consumesAndProducesIterator
 */
What this means is that the iterator stops at that point. What you've created is an iterator that goes far past 100 and only then, at some point, starts over at 1 again. But takeWhile never gets that far, because it has already produced a value that fails the predicate.
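You can see what takeWhile is looking at by peeking at the concatenated iterator (a quick check):

(perfectSquares ++ triangles).take(12).mkString(" ")
// 1 4 9 16 25 36 49 64 81 100 121 144  (still squares, already at and past 100)

One fix is to interleave the two sequences instead. See: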
object Fuge {

  def main(args: Array[String]): Unit = {
    perfectSquares.takeWhile(_ < 100).foreach(square => print(square + " "))
    println()
    triangles.takeWhile(_ < 100).foreach(triangle => print(triangle + " "))
    println()

    def interleave(a: Iterator[Int], b: Iterator[Int]): Stream[Int] =
      if (a.isEmpty || b.isEmpty) Stream.empty
      else a.next() #:: b.next() #:: interleave(a, b)

    lazy val interleaved = interleave(perfectSquares, triangles)
    interleaved.takeWhile(_ < 100).foreach(combine => print(combine + " "))
  }

  def perfectSquares: Iterator[Int] = Iterator.from(1).map(x => x * x)

  def triangles: Iterator[Int] = Iterator.from(1).map(n => n * (n + 1) / 2)
}
Here I'm using a stream to lazily evaluate the sequence of integers. In this way we can ensure interleaving. Note that this is just interleaved, not sorted.
This yields:
1 4 9 16 25 36 49 64 81
1 3 6 10 15 21 28 36 45 55 66 78 91
1 1 4 3 9 6 16 10 25 15 36 21 49 28 64 36 81 45
To sort while interleaving, you need a BufferedIterator, and the interleave function changes a bit. This is because calling next() advances the iterator and you can't go back, and you also can't know in advance how many items you need from iterator a before you need an item from iterator b, and vice versa. But BufferedIterator allows you to call head, which peeks at the next element without advancing the iterator. Now the code becomes:
object Fuge {

  def main(args: Array[String]): Unit = {
    perfectSquares.takeWhile(_ < 100).foreach(square => print(square + " "))
    println()
    triangles.takeWhile(_ < 100).foreach(triangle => print(triangle + " "))
    println()

    def interleave(a: BufferedIterator[Int], b: BufferedIterator[Int]): Stream[Int] =
      if (a.isEmpty || b.isEmpty) Stream.empty
      else if (a.head <= b.head) a.next() #:: interleave(a, b)
      else b.next() #:: interleave(a, b)

    lazy val interleaved = interleave(perfectSquares.buffered, triangles.buffered)
    interleaved.takeWhile(_ < 100).foreach(combine => print(combine + " "))
  }

  def perfectSquares: Iterator[Int] = Iterator.from(1).map(x => x * x)

  def triangles: Iterator[Int] = Iterator.from(1).map(n => n * (n + 1) / 2)
}
And the output is:
1 4 9 16 25 36 49 64 81
1 3 6 10 15 21 28 36 45 55 66 78 91
1 1 3 4 6 9 10 15 16 21 25 28 36 36 45 49 55 64 66 78 81 91
The problem with using Streams here is that they cache all previously produced elements. I'd rather interleave the iterators as they are, without involving streams.
Something like this:
class InterleavingIterator[X, X1 <: X, X2 <: X](
    iterator1: Iterator[X1],
    iterator2: Iterator[X2]) extends Iterator[X] {

  // The pair is swapped on every call to next, alternating the source.
  private var i2: (Iterator[X], Iterator[X]) = (iterator1, iterator2)

  def hasNext: Boolean = iterator1.hasNext || iterator2.hasNext

  def next: X = {
    i2 = i2.swap
    if (i2._1.hasNext) i2._1.next else i2._2.next
  }
}
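A quick usage sketch, reusing the generators from the question (note that because next swaps the pair before reading, iterator2 produces the first element):

def perfectSquares: Iterator[Int] = Iterator.from(1).map(x => x * x)
def triangles: Iterator[Int] = Iterator.from(1).map(n => n * (n + 1) / 2)

val mixed = new InterleavingIterator[Int, Int, Int](perfectSquares, triangles)
println(mixed.takeWhile(_ < 100).mkString(" "))
// 1 1 3 4 6 9 10 16 ... (the triangles lead because of the initial swap)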

Scala for loop yield

I'm new to Scala, so I'm messing around with an example from Programming in Scala: A Comprehensive Step-by-Step Guide, 2nd Edition:
// Returns a row as a sequence
def makeRowSeq(row: Int) =
  for (col <- 1 to 10) yield {
    val prod = (row * col).toString
    val padding = " " * (4 - prod.length)
    padding + prod
  }

// Returns a row as a string
def makeRow(row: Int) = makeRowSeq(row).mkString

// Returns table as a string with one row per line
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row)
  tableSeq.mkString("\n")
}
When calling multiTable() the above code outputs:
   1   2   3   4   5   6   7   8   9  10
   2   4   6   8  10  12  14  16  18  20
   3   6   9  12  15  18  21  24  27  30
   4   8  12  16  20  24  28  32  36  40
   5  10  15  20  25  30  35  40  45  50
   6  12  18  24  30  36  42  48  54  60
   7  14  21  28  35  42  49  56  63  70
   8  16  24  32  40  48  56  64  72  80
   9  18  27  36  45  54  63  72  81  90
  10  20  30  40  50  60  70  80  90 100
This makes sense, but if I try to change the code in multiTable() to something like:
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row) {
        2
      }
  tableSeq.mkString("\n")
}
then the 2 changes the output, but I'm not sure where it's being used to manipulate it, and I can't seem to find a similar example searching around here or on Google. Any input would be appreciated!
makeRow(row) {2}
and
makeRow(row)(2)
and
makeRow(row).apply(2)
are all equivalent.
makeRow(row) is of type String, each String representing one row. So effectively, you are picking the character at index 2 from each row string. That is why you are seeing 9 spaces and one 1 in your output.
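A quick REPL check makes this concrete (reusing makeRow from above):

makeRow(1)     // "   1   2   3   4   5   6   7   8   9  10"
makeRow(1)(2)  // ' ' (index 2 is a padding space for rows 1 through 9)
makeRow(10)(2) // '1' (row 10 starts with "  10", so index 2 is '1')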
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row) {2}
  tableSeq.mkString("\n")
}
is equivalent to applying a map on each row, like this:
def multiTable() = {
  val tableSeq = // a sequence of row strings
    for (row <- 1 to 10)
      yield makeRow(row)
  tableSeq.map(_(2)).mkString("\n")
}

Scala how can I count the number of occurrences in a list

val list = List(1,2,4,2,4,7,3,2,4)
I want to implement it like this: list.count(2) (returns 3).
A somewhat cleaner version of one of the other answers is:
val s = Seq("apple", "oranges", "apple", "banana", "apple", "oranges", "oranges")
s.groupBy(identity).mapValues(_.size)
giving a Map with a count for each item in the original sequence:
Map(banana -> 1, oranges -> 3, apple -> 3)
The question asks how to find the count of a specific item. With this approach, the solution would require mapping the desired element to its count value as follows:
s.groupBy(identity).mapValues(_.size)("apple")
Scala collections do have count: list.count(_ == 2)
I had the same problem as Sharath Prabhal, and I came up with another (to me, clearer) solution:
val s = Seq("apple", "oranges", "apple", "banana", "apple", "oranges", "oranges")
s.groupBy(l => l).map(t => (t._1, t._2.length))
With as result :
Map(banana -> 1, oranges -> 3, apple -> 3)
list.groupBy(i=>i).mapValues(_.size)
gives
Map[Int, Int] = Map(1 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 4 -> 3)
Note that you can replace (i=>i) with built in identity function:
list.groupBy(identity).mapValues(_.size)
Starting with Scala 2.13, the groupMapReduce method does that in one pass through the list:
// val seq = Seq("apple", "oranges", "apple", "banana", "apple", "oranges", "oranges")
seq.groupMapReduce(identity)(_ => 1)(_ + _)
// immutable.Map[String,Int] = Map(banana -> 1, oranges -> 3, apple -> 3)
seq.groupMapReduce(identity)(_ => 1)(_ + _)("apple")
// Int = 3
This:
groups list elements (group part of groupMapReduce)
maps each grouped value occurrence to 1 (map part of groupMapReduce)
reduces values within a group of values (_ + _) by summing them (reduce part of groupMapReduce).
This is a one-pass version of what could otherwise be written as:
seq.groupBy(identity).mapValues(_.map(_ => 1).reduce(_ + _))
val l = List(1, 2, 4, 2, 4, 7, 3, 2, 4)
// Using the built-in count method, this yields the number of occurrences of each value in the list:
l map (x => l.count(_ == x))
// List[Int] = List(1, 3, 3, 3, 3, 1, 1, 3, 3)
// This yields a list of pairs where the first number is the value from the original list
// and the second is how often that value occurs in the list:
l map (x => (x, l.count(_ == x)))
// outputs => List[(Int, Int)] = List((1,1), (2,3), (4,3), (2,3), (4,3), (7,1), (3,1), (2,3), (4,3))
I ran into the same problem but wanted to count multiple items in one go:
val s = Seq("apple", "oranges", "apple", "banana", "apple", "oranges", "oranges")
s.foldLeft(Map.empty[String, Int]) { (m, x) => m + ((x, m.getOrElse(x, 0) + 1)) }
res1: scala.collection.immutable.Map[String,Int] = Map(apple -> 3, oranges -> 3, banana -> 1)
https://gist.github.com/sharathprabhal/6890475
If you want to use it like list.count(2) you have to implement it using an Implicit Class.
implicit class Count[T](list: List[T]) {
  def count(n: T): Int = list.count(_ == n)
}
List(1,2,4,2,4,7,3,2,4).count(2) // returns 3
List(1,2,4,2,4,7,3,2,4).count(5) // returns 0
It is interesting to note that the map with a default value of 0, seemingly designed exactly for this case, demonstrates the worst performance (and is not as concise as groupBy):
type Word = String
type Sentence = Seq[Word]
type Occurrences = scala.collection.Map[Char, Int]

def woGrouped(w: Word): Occurrences =
  w.groupBy(c => c).map { case (c, list) => c -> list.length }

def woGetElse0Map(w: Word): Occurrences = {
  val map = Map[Char, Int]()
  w.foldLeft(map)((m, c) => m + (c -> (m.getOrElse(c, 0) + 1)))
}

def woDeflt0Map(w: Word): Occurrences = {
  val map = Map[Char, Int]().withDefaultValue(0)
  w.foldLeft(map)((m, c) => m + (c -> (m(c) + 1)))
}

def dfltHashMap(w: Word): Occurrences = {
  val map = scala.collection.immutable.HashMap[Char, Int]().withDefaultValue(0)
  w.foldLeft(map)((m, c) => m + (c -> (m(c) + 1)))
}

def mmDef(w: Word): Occurrences = {
  val map = scala.collection.mutable.Map[Char, Int]().withDefaultValue(0)
  w.foldLeft(map)((m, c) => m += (c -> (m(c) + 1)))
}

val functions = List("grp" -> woGrouped _, "mtbl" -> mmDef _, "else" -> woGetElse0Map _,
                     "dfl0" -> woDeflt0Map _, "hash" -> dfltHashMap _)

val len = 100 * 1000

def test(len: Int) {
  val data: String = scala.util.Random.alphanumeric.take(len).toList.mkString
  val firstResult = functions.head._2(data)

  def run(f: Word => Occurrences): Int = {
    val time1 = System.currentTimeMillis()
    val result = f(data)
    val time2 = (System.currentTimeMillis() - time1)
    assert(result.toSet == firstResult.toSet)
    time2.toInt
  }

  def log(results: Seq[Int]) =
    (functions zip results) map { case ((title, _), r) => title + " " + r } mkString " , "

  var groupResults = List.fill(functions.length)(1)
  val integrals = for (i <- (1 to 10)) yield {
    val results = functions map (f => (1 to 33).foldLeft(0)((acc, _) => run(f._2)))
    println(log(results))
    groupResults = (results zip groupResults) map { case (r, gr) => r + gr }
    log(groupResults).toUpperCase
  }
  integrals foreach println
}

test(len)
test(len * 2)
// GRP 14 , mtbl 11 , else 31 , dfl0 36 , hash 34
// GRP 91 , MTBL 111
println("Done")
produces
grp 5 , mtbl 5 , else 13 , dfl0 17 , hash 17
grp 3 , mtbl 6 , else 14 , dfl0 16 , hash 16
grp 3 , mtbl 6 , else 13 , dfl0 17 , hash 15
grp 4 , mtbl 5 , else 13 , dfl0 15 , hash 16
grp 23 , mtbl 6 , else 14 , dfl0 15 , hash 16
grp 5 , mtbl 5 , else 13 , dfl0 16 , hash 17
grp 4 , mtbl 6 , else 13 , dfl0 16 , hash 16
grp 4 , mtbl 6 , else 13 , dfl0 17 , hash 15
grp 3 , mtbl 5 , else 14 , dfl0 16 , hash 16
grp 3 , mtbl 6 , else 14 , dfl0 16 , hash 16
GRP 5 , MTBL 5 , ELSE 13 , DFL0 17 , HASH 17
GRP 8 , MTBL 11 , ELSE 27 , DFL0 33 , HASH 33
GRP 11 , MTBL 17 , ELSE 40 , DFL0 50 , HASH 48
GRP 15 , MTBL 22 , ELSE 53 , DFL0 65 , HASH 64
GRP 38 , MTBL 28 , ELSE 67 , DFL0 80 , HASH 80
GRP 43 , MTBL 33 , ELSE 80 , DFL0 96 , HASH 97
GRP 47 , MTBL 39 , ELSE 93 , DFL0 112 , HASH 113
GRP 51 , MTBL 45 , ELSE 106 , DFL0 129 , HASH 128
GRP 54 , MTBL 50 , ELSE 120 , DFL0 145 , HASH 144
GRP 57 , MTBL 56 , ELSE 134 , DFL0 161 , HASH 160
grp 7 , mtbl 11 , else 28 , dfl0 31 , hash 31
grp 7 , mtbl 10 , else 28 , dfl0 32 , hash 31
grp 7 , mtbl 11 , else 28 , dfl0 31 , hash 32
grp 7 , mtbl 11 , else 28 , dfl0 31 , hash 33
grp 7 , mtbl 11 , else 28 , dfl0 32 , hash 31
grp 8 , mtbl 11 , else 28 , dfl0 31 , hash 33
grp 8 , mtbl 11 , else 29 , dfl0 38 , hash 35
grp 7 , mtbl 11 , else 28 , dfl0 32 , hash 33
grp 8 , mtbl 11 , else 32 , dfl0 35 , hash 41
grp 7 , mtbl 13 , else 28 , dfl0 33 , hash 35
GRP 7 , MTBL 11 , ELSE 28 , DFL0 31 , HASH 31
GRP 14 , MTBL 21 , ELSE 56 , DFL0 63 , HASH 62
GRP 21 , MTBL 32 , ELSE 84 , DFL0 94 , HASH 94
GRP 28 , MTBL 43 , ELSE 112 , DFL0 125 , HASH 127
GRP 35 , MTBL 54 , ELSE 140 , DFL0 157 , HASH 158
GRP 43 , MTBL 65 , ELSE 168 , DFL0 188 , HASH 191
GRP 51 , MTBL 76 , ELSE 197 , DFL0 226 , HASH 226
GRP 58 , MTBL 87 , ELSE 225 , DFL0 258 , HASH 259
GRP 66 , MTBL 98 , ELSE 257 , DFL0 293 , HASH 300
GRP 73 , MTBL 111 , ELSE 285 , DFL0 326 , HASH 335
Done
It is curious that the most concise groupBy is faster than even the mutable map!
Short answer:
import scalaz._, Scalaz._
xs.foldMap(x => Map(x -> 1))
Long answer:
Using Scalaz, given:
import scalaz._, Scalaz._
val xs = List('a, 'b, 'c, 'c, 'a, 'a, 'b, 'd)
then all of these (ordered from least to most simplified)
xs.map(x => Map(x -> 1)).foldMap(identity)
xs.map(x => Map(x -> 1)).foldMap()
xs.map(x => Map(x -> 1)).suml
xs.map(_ -> 1).foldMap(Map(_))
xs.foldMap(x => Map(x -> 1))
yield
Map('b -> 2, 'a -> 3, 'c -> 2, 'd -> 1)
Try this; it should work.
val list = List(1,2,4,2,4,7,3,2,4)
list.count(_==2)
It will return 3
I did not get the size of the list using length, but rather size, as one of the answers above suggested, because of the issue reported here.
val list = List("apple", "oranges", "apple", "banana", "apple", "oranges", "oranges")
list.groupBy(x=>x).map(t => (t._1, t._2.size))
using cats
import cats.implicits._
"Alphabet".toLowerCase().map(c => Map(c -> 1)).toList.combineAll
"Alphabet".toLowerCase().map(c => Map(c -> 1)).toList.foldMap(identity)
Here is another option:
scala> val list = List(1,2,4,2,4,7,3,2,4)
list: List[Int] = List(1, 2, 4, 2, 4, 7, 3, 2, 4)
scala> list.groupBy(x => x) map { case (k,v) => k-> v.length }
res74: scala.collection.immutable.Map[Int,Int] = Map(1 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 4 -> 3)
scala> val list = List(1,2,4,2,4,7,3,2,4)
list: List[Int] = List(1, 2, 4, 2, 4, 7, 3, 2, 4)
scala> println(list.filter(_ == 2).size)
3
Here is a pretty easy way to do it.
val data = List("it", "was", "the", "best", "of", "times", "it", "was",
"the", "worst", "of", "times")
data.foldLeft(Map[String, Int]().withDefaultValue(0)) {
  case (acc, letter) =>
    acc + (letter -> (1 + acc(letter)))
}
// => Map(worst -> 1, best -> 1, it -> 2, was -> 2, times -> 2, of -> 2, the -> 2)
// => Map(worst -> 1, best -> 1, it -> 2, was -> 2, times -> 2, of -> 2, the -> 2)
val words = Array("Mary", "had", "a", "little", "lamb", "its"
, "fleece", "was", "white", "as", "snow", "and", "everywhere"
, "that", "Mary", "went", "the", "lamb", "was", "sure", "to", "go")
words.groupBy(_.length)
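Note that this groups the words by their length, giving a Map[Int, Array[String]]. To count the occurrences of each distinct word instead, group by the word itself, as in the earlier answers:

words.groupBy(identity).mapValues(_.size)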

find unique matrices from a larger matrix

I'm fairly new to functional programming, so I'm going through some practice exercises. I want to write a function that, given a matrix of unique naturals, let's say 5x5, returns a collection of unique matrices of a smaller size, say 3x3, where the matrices must be intact, i.e. created from values that are adjacent in the original.
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Simple. Just slide across, then down, one by one in groups of 3, to get something that looks like:
01 02 03 | 02 03 04 | 03 04 05 | 06 07 08
06 07 08 | 07 08 09 | 08 09 10 | 11 12 13
11 12 13 | 12 13 14 | 13 14 15 | 16 17 18
or, in Scala,
List(List(1, 2, 3), List(6, 7, 8), List(11, 12, 13))
List(List(2, 3, 4), List(7, 8, 9), List(12, 13, 14))
List(List(3, 4, 5), List(8, 9, 10), List(13, 14, 15))
List(List(6, 7, 8), List(11, 12, 13), List(16, 17, 18))
and so on and so on...
So I venture out with Scala (my language of choice because it allows me to evolve from imperative to functional; I've spent the last few years in Java).
val array2D = "01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25".grouped(3).map(_.trim.toInt).grouped(5)
val sliced = array2D.map(row => row.sliding(3, 1).toList).sliding(3, 1).toList
Now I have a data structure I can work with, but I don't see a functional way forward. Sure, I could traverse each piece of sliced, create a var matrix = new ListBuffer[Seq[Int]]() and imperatively build a bag of those, and be done.
I want to find a functional, ideally point-free approach using Scala, but I'm stumped. There's got to be a way to zip with 3 or something like that... I've searched the ScalaDocs and can't seem to figure it out.
You got halfway there. In fact, I was having trouble figuring out how to do what you had done already. I broke up your code a bit to make it easier to follow. Also, I made array2D a List, so I could play with the code more easily. :-)
val input = "01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25"
val intArray = (input split " " map (_.toInt) toList)
val array2D = (intArray grouped 5 toList)
val sliced = array2D.map(row => row.sliding(3, 1).toList).sliding(3, 1).toList
Ok, so you have a bunch of lists, each one a bit like this:
List(List(List( 1, 2, 3), List( 2, 3, 4), List( 3, 4, 5)),
List(List( 6, 7, 8), List( 7, 8, 9), List( 8, 9, 10)),
List(List(11, 12, 13), List(12, 13, 14), List(13, 14, 15)))
And you want them like this:
List(List(List(1, 2, 3), List(6, 7, 8), List(11, 12, 13)),
List(List(2, 3, 4), List(7, 8, 9), List(12, 13, 14)),
List(List(3, 4, 5), List(8, 9, 10), List(13, 14, 15)))
Does that feel right to you? Each of the three sublists is a matrix on its own:
List(List(1, 2, 3), List(6, 7, 8), List(11, 12, 13))
is
01 02 03
06 07 08
11 12 13
So, basically, you want to transpose them. The next step, then, is:
val subMatrices = sliced map (_.transpose)
The type of that thing is List[List[List[Seq[Int]]]]. Let's consider that a bit... A 2D matrix is represented by a sequence of sequences, so List[Seq[Int]] corresponds to a matrix. Let's say:
type Matrix = Seq[Seq[Int]]
val subMatrices: List[List[Matrix]] = sliced map (_.transpose)
But you want a single list of matrices, so you can flatten that:
type Matrix = Seq[Seq[Int]]
val subMatrices: List[Matrix] = (sliced map (_.transpose) flatten)
But, alas, a map plus a flatten is a flatMap:
type Matrix = Seq[Seq[Int]]
val subMatrices: List[Matrix] = sliced flatMap (_.transpose)
Now, you want the unique submatrices. That's simple enough: it's a set.
val uniqueSubMatrices = subMatrices.toSet
Or, if you wish to keep the result as a sequence,
val uniqueSubMatrices = subMatrices.distinct
And that's it. Full code just to illustrate:
type Matrix = Seq[Seq[Int]]
val input = "01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25"
val intArray = (input split " " map (_.toInt) toList)
val array2D: Matrix = (intArray grouped 5 toList)
val sliced: List[List[Matrix]] = (array2D map (row => row sliding 3 toList) sliding 3 toList)
val subMatrices: List[Matrix] = sliced flatMap (_.transpose)
val uniqueSubMatrices: Set[Matrix] = subMatrices.toSet
It could be written as a single expression, but unless you break it up into functions, it's going to be horrible to read. And you'd either have to use the forward pipe (|>, not in the standard library), or add these functions implicitly to the types they act on, or it will be difficult to read anyway.
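As a quick sanity check of the code above: a 5x5 matrix has 3 * 3 = 9 positions for a 3x3 window, and since all values in the original are unique, every submatrix is distinct:

uniqueSubMatrices.size // 9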
Edit: Okay, I think I finally understand what you want. I'm going to show a way that works, not a way that is high-performance. (That's generally the mutable Java-like solution, but you already know how to do that.)
First, you really, really ought to do this with your own collections that work in 2D sensibly. Using a bunch of 1D collections to emulate 2D collections is going to lead to unnecessary confusion and complication. Don't do it. Really. It's a bad idea.
But, okay, let's do it anyway.
val big = (1 to 25).grouped(5).map(_.toList).toList
This is the whole matrix that you want. Next,
val smaller = (for (r <- big.sliding(3)) yield r.toList).toList
are the groups of rows that you want. Now, you should have been using a 2D data structure, because you want to do something that doesn't map well onto 1D operations. But:
val small = smaller.map(xss =>
  Iterator.iterate(xss.map(_.sliding(3)))(identity).
    takeWhile(_.forall(_.hasNext)).
    map(_.map(_.next)).
    toList
).toList
If you carefully pull this apart, you see that you're creating a bunch of iterators (xss.map(_.sliding(3))) and then iterating through them all in lock step by keeping hold of those same iterators step after step, stopping when at least one of them is empty, and mapping them onto their next values (which is how you walk forward with them).
Now that you've got the matrices you can store them however you want. Personally, I'd flatten the list:
val flattened = small.flatten
You wrote a structure that has the matrices side by side, which you can also do with some effort (again, because creating 2D operations out of 1D operations is not always straightforward):
val sidebyside = flattened.reduceRight((l,r) => (l,r).zipped.map(_ ::: _))
(Note the use of reduceRight to make this an O(n) operation instead of O(n^2): appending to the end of a long accumulating list is a bad idea. Note also that with too many matrices this will probably overflow the stack.)