Slow IO with large data - scala

I'm trying to find a better way to do this, as it could take years to compute! I need to compute a map which is too large to fit in memory, so I am trying to make use of IO as follows.
I have a file that contains a list of Ints, about 1 million of them. I have another file that contains data about my (500,000) document collection. I need to calculate a function of the count, for every Int in the first file, of how many documents (lines in the second) it appears in. Let me give an example:
File1:
-1
1
2
etc...
file2:
E01JY3-615, CR93E-177 , [-1 -> 2,1 -> 1,2 -> 2,3 -> 2,4 -> 2,8 -> 2,... // truncated for brevity]
E01JY3-615, CR93E-177 , [1 -> 2,2 -> 2,4 -> 2,5 -> 2,8 -> 2,... // truncated for brevity]
etc...
Here is what I have tried so far
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
val p = new java.io.PrintWriter(new BufferedWriter((new FileWriter(f))))
try {
op(p)
} finally {
p.close()
}
}
def binarySearch(array: Array[String], word: Int):Boolean = array match {
case Array() => false
case xs => if (array(array.size/2).split("->")(0).trim().toInt == word) {
return true
} else if (array(array.size/2).split("->")(0).trim().toInt > word){
return binarySearch(array.take(array.size/2), word)
} else {
return binarySearch(array.drop(array.size/2 + 1), word)
}
}
var v = Source.fromFile("vocabulary.csv").getLines()
printToFile(new File("idf.csv"))(out => {
v.foreach(word =>{
var docCount: Int = 0
val s = Source.fromFile("documents.csv").getLines()
s.foreach(line => {
val split = line.split("\\[")
val fpStr = split(1).init
docCount = if (binarySearch(fpStr.split(","), word.trim().toInt)) docCount + 1 else docCount
})
val output = word + ", " + math.log10(500448 / (docCount + 1))
out.println(output)
println(output)
})
})
There must be a faster way to do this, can anyone think of a way?

From what I understand of your code, you are trying to find every word in the dictionary in the document list.
Hence, you are making N*M comparisons, where N is the number of words (in the dictionary of integers) and M is the number of documents in the document list. Plugging in your values, that is 10^6 * 5*10^5 = 5*10^11 comparisons, which is infeasible.
Why not create a mutable map with all the integers in the dictionary as keys (1000000 ints in memory is roughly 3.8M from my measurements) and pass through the document list only once? For each document you extract the integers and increment the respective count values in the map (for which the integer is the key).
Something like this:
import collection.mutable.Map
import scala.util.Random._
val maxValue = 1000000
val documents = collection.mutable.Map[String,List[(Int,Int)]]()
// util function just to insert fake input; disregard
def provideRandom(key:String) ={ (1 to nextInt(4)).foreach(_ => documents.put(key,(nextInt(maxValue),nextInt(maxValue)) :: documents.getOrElse(key,Nil)))}
// inserting fake documents into our fake Document map
(1 to 500000).foreach(_ => {val key = nextString(5); provideRandom(key)})
// word count map
val wCount = collection.mutable.Map[Int,Int]()
// Counting the numbers and incrementing them in the map
documents.foreach(doc => doc._2.foreach(k => wCount.put(k._1, (wCount.getOrElse(k._1,0)+1))))
scala> wCount
res5: scala.collection.mutable.Map[Int,Int] = Map(188858 -> 1, 178569 -> 2, 437576 -> 2, 660074 -> 2, 271888 -> 2, 721076 -> 1, 577416 -> 1, 77760 -> 2, 67471 -> 1, 804106 -> 2, 185283 -> 1, 41623 -> 1, 943946 -> 1, 778258 -> 2...
The result is a map whose keys are the numbers in the dictionary and whose values are the number of times each appears in the document list.
This is oversimplified, since:
I don't verify that the number exists in the dictionary, although you only need to initialise the map with the dictionary values and then increment a count only if the map already has that key;
I don't do any IO, which speeds up the whole thing.
This way you only pass through the documents once, which makes the task feasible again.
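A minimal sketch of that single pass with the parsing included, assuming every document line ends with a bracketed list of `key -> value` pairs as in the question's sample. The names `IdfSketch`, `keysOf`, `docFrequencies` and `idf` are hypothetical, and the formula uses floating-point division where the question's code used integer division.

```scala
import scala.collection.mutable

object IdfSketch {
  // Hypothetical helper: pull the Int keys out of "..., [k1 -> v1,k2 -> v2]".
  def keysOf(line: String): Array[Int] = {
    val start = line.indexOf('[')
    val end = line.lastIndexOf(']')
    if (start < 0 || end <= start) Array.empty[Int]
    else
      line.substring(start + 1, end)
        .split(",")
        .map(_.split("->")(0).trim)
        .filter(_.nonEmpty)
        .map(_.toInt)
  }

  // One pass over the documents; only vocabulary words are counted,
  // and each word is counted at most once per document.
  def docFrequencies(vocabulary: Iterator[Int],
                     documents: Iterator[String]): mutable.Map[Int, Int] = {
    val counts = mutable.Map[Int, Int]()
    vocabulary.foreach(w => counts(w) = 0)
    for (line <- documents; key <- keysOf(line).distinct if counts.contains(key))
      counts(key) += 1
    counts
  }

  // Same +1 smoothing as the question, without truncating the ratio.
  def idf(totalDocs: Int, docCount: Int): Double =
    math.log10(totalDocs.toDouble / (docCount + 1))
}
```

With real files, `Source.fromFile("vocabulary.csv").getLines().map(_.trim.toInt)` and `Source.fromFile("documents.csv").getLines()` from the question could be passed in directly, so the documents file is read exactly once.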


Retain MAX value of Aerospike CDT List

Context
Consider having a stream of tuples (string, timestamp), with the goal of having a bin containing a Map of the minimal timestamp received for each string.
Another constraint is that the update for this Map will be atomic.
For this input example:
("a", 1)
("b", 2)
("a", 3)
the expected output: Map("a" -> 1, "b" -> 2) without the maximal timestamp 3.
The current implementation I chose is using CDT of list to hold the timestamps, so my result is Map("a" -> ArrayList(1), "b" -> ArrayList(2)) as follows:
private val boundedAndOrderedListPolicy = new ListPolicy(ListOrder.ORDERED, ListWriteFlags.INSERT_BOUNDED)
private def bundleContext(str: String) =
CTX.mapKeyCreate(Value.get(str), MapOrder.UNORDERED)
private def buildInitTimestampOp(tuple: (String, Long)): List[Operation] = {
// having these two ops together assure that the size of the list is always 1
val str = tuple._1
val timestamp = tuple._2
val bundleCtx: CTX = bundleContext(str)
List(
ListOperation.append(boundedAndOrderedListPolicy, initBin, Value.get(timestamp), bundleCtx),
ListOperation.removeByRankRange(initBin, 1, ListReturnType.NONE, bundleCtx), // keep the first element of the order list - earliest time
)
}
This works as expected. However, if you have a better way to achieve this without the list and in an atomic manner - I would love to hear it.
My question
What does not work for me is retaining the max timestamp received for each input str. For the example above, the desired result should be Map("a" -> ArrayList(3), "b" -> ArrayList(2)). My implementation attempt is:
private def buildLastSeenTimestampOp(tuple: (String, Long)): List[Operation] = {
// having these two ops together assure that the size of the list is always 1
val str = tuple._1
val timestamp = tuple._2
val bundleCtx: CTX = bundleContext(str)
List(
ListOperation.append(boundedAndOrderedListPolicy, lastSeenBin, Value.get(timestamp), bundleCtx),
ListOperation.removeByRankRange(lastSeenBin, 1, ListReturnType.NONE | ListReturnType.REVERSE_RANK, bundleCtx), // keep the last element of the ordered list - last time
)
}
Any idea why it doesn't work?
So, I've solved it:
private def buildLastSeenTimestampOp(tuple: (String, Long)): List[Operation] = {
// having these two ops together assure that the size of the list is always 1
val str = tuple._1
val timestamp = tuple._2
val bundleCtx: CTX = bundleContext(str)
List(
ListOperation.append(boundedAndOrderedListPolicy, lastSeenBin, Value.get(timestamp), bundleCtx),
ListOperation.removeByRankRange(lastSeenBin, -1, ListReturnType.NONE | ListReturnType.INVERTED, bundleCtx), // keep the last element of the ordered list - last time
)
}
When addressing elements by rank or index in ascending order (removeByIndexRange is the equivalent for indexes), -1 represents the maximum rank/index.
Using ListReturnType.INVERTED retains the selected range (a single element in this case) by deleting every element of the list that is not in that range.
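To make the rank arithmetic concrete, here is a plain-Scala model of the two operation pairs. It is only a sketch of the semantics described above, not the Aerospike client API: `RankRangeModel`, its `removeByRankRange`, `update`, `keepMin` and `keepMax` are hypothetical names operating on an in-memory `Vector`.

```scala
object RankRangeModel {
  // Models "remove every element from `rank` to the end" on an ascending list.
  // A negative rank counts from the largest element (-1 is the maximum).
  // With inverted = true the selected range is kept and the rest is removed.
  def removeByRankRange(sorted: Vector[Long], rank: Int, inverted: Boolean): Vector[Long] = {
    val from =
      if (rank < 0) math.max(sorted.length + rank, 0)
      else math.min(rank, sorted.length)
    val (before, selected) = sorted.splitAt(from)
    if (inverted) selected else before
  }

  // Ordered append followed by a trim, applied to the list stored under `key`.
  def update(state: Map[String, Vector[Long]], key: String, ts: Long,
             rank: Int, inverted: Boolean): Map[String, Vector[Long]] = {
    val appended = (state.getOrElse(key, Vector.empty[Long]) :+ ts).sorted
    state.updated(key, removeByRankRange(appended, rank, inverted))
  }

  // rank 1, not inverted: drop everything above the smallest element.
  def keepMin(events: Seq[(String, Long)]): Map[String, Vector[Long]] =
    events.foldLeft(Map.empty[String, Vector[Long]]) {
      case (s, (k, t)) => update(s, k, t, 1, inverted = false)
    }

  // rank -1, inverted: keep only the largest element.
  def keepMax(events: Seq[(String, Long)]): Map[String, Vector[Long]] =
    events.foldLeft(Map.empty[String, Vector[Long]]) {
      case (s, (k, t)) => update(s, k, t, -1, inverted = true)
    }
}
```

For the input from the question, `keepMin` leaves 1 under "a" and 2 under "b", while `keepMax` leaves 3 under "a" and 2 under "b".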

Converting arabic numbers into chinese financial numbers

I am trying to create a function, in a functional programming style, which receives a normal Int value, translates it to financial Chinese numbers and returns a String, for example: 301 = 三百零一. To begin, I have two maps, one with every digit from 0 to 9, and the other one with the powers of ten, from 10 to 1000000.
val digits: Map[Int, String] = Map(0 -> "〇", 1 -> "壹", 2 -> "貳", 3 -> "參", 4 -> "肆", 5 -> "伍", 6 -> "陸", 7 -> "柒", 8 -> "捌", 9 -> "玖");
val exponent: Map[Int, String] = Map(1 -> "", 10 -> "拾", 100 -> "佰", 1000 -> "仟", 10000 -> "萬", 100000 -> "億", 1000000 -> "兆");
For those who don't know, here is a short explanation of how Chinese numbers work. If you already know, feel free to skip this paragraph. In Chinese numbers, when you want to write a large number, for example 5000, you write the symbols for 5 and for 1000 (伍仟) to indicate that you are multiplying 5 * 1000. If you have 539, it's 5*100 + 3*10 + 9. This would be 伍佰參拾玖. Lastly, if the number has 0's between multiplications, no matter how many there are, you write only one 0 between the other characters. For example: 501 = 5*100 + 1. This is 伍佰〇壹. One last example for clarification: 50103 = 5*10000 + 1*100 + 3. This is 伍萬〇壹佰〇參.
So what I could do, is the following:
def format(unit: Int): String = {
val l = unit.toString.map(_.asDigit).toList
if(l.isEmpty) ""
else if(l.tail.isEmpty) digits(l.head)
else digits(l.head) + format(l.tail.mkString.toInt)
}
This translates the characters one by one. For example:
format(135) // "壹參伍"
And I don't know how to continue.
If I understood your problem correctly you can do something like this:
// digitsMap and exponentsMap are assumed to be the question's `digits` and `exponent` maps
def toChineseFinancial(number: Int): String = {
val digits = number.toString.iterator.map(_.asDigit).toList
val length = digits.length
val exponents = List.tabulate(length)(n => math.pow(10, n).toInt)
val (sb, _) =
digits
.iterator
.zip(exponents.reverseIterator)
.foldLeft(new collection.mutable.StringBuilder(length * 2) -> false) {
case ((sb, flag), (digit, exp)) =>
if (digit == 0) sb -> true
else if (flag) sb.append("〇").append(digitsMap(digit)).append(exponentsMap(exp)) -> false
else sb.append(digitsMap(digit)).append(exponentsMap(exp)) -> false
}
sb.result()
}
Note: I used mutable.StringBuilder because building Strings is somewhat expensive, but if you want to avoid any kind of mutability you can easily replace it with a normal String.
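A minimal sketch of that immutable variant, folding into a plain String. The object name `ImmutableSketch` and the function name `toChineseFinancialImmutable` are hypothetical, and the exponent map here is deliberately cut off at 萬, so this version only covers 1 to 99999.

```scala
object ImmutableSketch {
  val digitsMap: Map[Int, String] = Map(
    0 -> "〇", 1 -> "壹", 2 -> "貳", 3 -> "參", 4 -> "肆",
    5 -> "伍", 6 -> "陸", 7 -> "柒", 8 -> "捌", 9 -> "玖"
  )
  // Assumption: only powers of ten up to 10000 are supported in this sketch.
  val exponentsMap: Map[Int, String] =
    Map(1 -> "", 10 -> "拾", 100 -> "佰", 1000 -> "仟", 10000 -> "萬")

  def toChineseFinancialImmutable(number: Int): String = {
    require(number > 0 && number < 100000, "sketch only covers 1 to 99999")
    val ds = number.toString.map(_.asDigit).toList
    val exps = List.tabulate(ds.length)(n => math.pow(10, n).toInt).reverse
    // Accumulator: (text so far, whether a zero is waiting to be written).
    val (result, _) = ds.zip(exps).foldLeft(("", false)) {
      case ((acc, pendingZero), (digit, exp)) =>
        if (digit == 0) (acc, true)
        else if (pendingZero)
          (acc + digitsMap(0) + digitsMap(digit) + exponentsMap(exp), false)
        else (acc + digitsMap(digit) + exponentsMap(exp), false)
    }
    result
  }
}
```

As in the StringBuilder version, a run of zeros is remembered with a flag and emitted as a single 〇 only when a later non-zero digit follows, so trailing zeros produce nothing.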
I would expand the exponents Map using a simple case class for its values to cover:
numbers of magnitude 1, 10, 10^2, ..., 10^12
10's, 100's and 1000's of "萬" (10^4), "億" (10^8) and "兆" (10^12)
as shown below:
case class CNU(unit: String, factor: Int)
val exponents: Map[Long, CNU] = Map(
1L -> CNU("", 1),
10L -> CNU("拾", 1),
100L -> CNU("佰", 1),
1000L -> CNU("仟", 1),
10000L ->CNU("萬", 1),
100000L -> CNU("萬", 10),
1000000L -> CNU("萬", 100),
10000000L -> CNU("萬", 1000),
100000000L -> CNU("億", 1),
1000000000L -> CNU("億", 10),
10000000000L -> CNU("億", 100),
100000000000L -> CNU("億", 1000),
1000000000000L -> CNU("兆", 1),
10000000000000L -> CNU("兆", 10),
100000000000000L -> CNU("兆", 100),
1000000000000000L -> CNU("兆", 1000)
)
Creating the method:
val digits: Map[Int, String] = Map(
0 -> "〇", 1 -> "壹", 2 -> "貳", 3 -> "參", 4 -> "肆",
5 -> "伍", 6 -> "陸", 7 -> "柒", 8 -> "捌", 9 -> "玖"
)
def toChineseNumber(num: Long): String = {
val s = num.toString
val ds = s.map(_.asDigit).zip(s.length-1 to 0 by -1)
ds.foldRight(List.empty[String], 0){ case ((d, i), (accList, dPrev)) =>
val cnu = exponents(math.pow(10, i).toLong)
val digit =
if (d == 0) {
if (dPrev != 0 || num == 0) digits(d) else ""
}
else
digits(d)
val unit =
if (d == 0)
""
else {
if (cnu.factor == 1) cnu.unit else exponents(cnu.factor).unit
}
((digit + unit) :: accList, d)
}.
_1.mkString
}
Note that method foldRight is used to traverse and process the input number from right to left and dPrev in the tuple-accumulator is for carrying digits across iterations for handling repetitive 0's.
Testing it:
toChineseNumber(50)
// res1: String = 伍拾
toChineseNumber(30001)
// res2: String = 參萬〇壹
toChineseNumber(1023405)
// res3: String = 壹佰〇貳萬參仟肆佰〇伍
toChineseNumber(2233007788L)
// res4: String = 貳拾貳億參仟參佰〇柒仟柒佰捌拾捌

How to pair each element of a Seq with the rest?

I'm looking for an elegant way to combine every element of a Seq with the rest for a large collection.
Example: Seq(1,2,3).someMethod should produce something like
Iterator(
(1,Seq(2,3)),
(2,Seq(1,3)),
(3,Seq(1,2))
)
Order of elements doesn't matter. It doesn't have to be a tuple, a Seq(Seq(1),Seq(2,3)) is also acceptable (although kinda ugly).
Note the emphasis on large collection (which is why my example shows an Iterator).
Also note that this is not combinations.
Ideas?
Edit:
In my use case, the numbers are expected to be unique. If a solution can eliminate the dupes, that's fine, but not at additional cost. Otherwise, dupes are acceptable.
Edit 2: In the end, I went with a nested for-loop, and skipped the case when i == j. No new collections were created. I upvoted the solutions that were correct and simple ("simplicity is the ultimate sophistication" - Leonardo da Vinci), but even the best ones are quadratic just by the nature of the problem, and some create intermediate collections by usage of ++ that I wanted to avoid because the collection I'm dealing with has close to 50000 elements, 2.5 billion when quadratic.
The following code has constant runtime (it does everything lazily), but accessing every element of the resulting collections has constant overhead (when accessing each element, an index shift must be computed every time):
def faceMap(i: Int)(j: Int) = if (j < i) j else j + 1
def facets[A](simplex: Vector[A]): Seq[(A, Seq[A])] = {
val n = simplex.size
(0 until n).view.map { i => (
simplex(i),
(0 until n - 1).view.map(j => simplex(faceMap(i)(j)))
)}
}
Example:
println("Example: facets of a 3-dimensional simplex")
for ((i, v) <- facets((0 to 3).toVector)) {
println(i + " -> " + v.mkString("[", ",", "]"))
}
Output:
Example: facets of a 3-dimensional simplex
0 -> [1,2,3]
1 -> [0,2,3]
2 -> [0,1,3]
3 -> [0,1,2]
This code expresses everything in terms of simplices, because "omitting one index" corresponds exactly to the face maps for a combinatorially described simplex. To further illustrate the idea, here is what the faceMap does:
println("Example: how `faceMap(3)` shifts indices")
for (i <- 0 to 5) {
println(i + " -> " + faceMap(3)(i))
}
gives:
Example: how `faceMap(3)` shifts indices
0 -> 0
1 -> 1
2 -> 2
3 -> 4
4 -> 5
5 -> 6
The facets method uses the faceMaps to create a lazy view of the original collection that omits one element by shifting the indices by one starting from the index of the omitted element.
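The same index-shifting idea can also be sketched with plain iterators, which may be convenient if you do not want the result to depend on how views are typed in your Scala version. `OthersSketch` and `others` are hypothetical names; `faceMap` is copied from above.

```scala
object OthersSketch {
  // Index shift from the answer above: skip position i.
  def faceMap(i: Int)(j: Int): Int = if (j < i) j else j + 1

  // Pairs each element with a lazy iterator over all the other elements.
  // No intermediate collection is built; each inner iterator is single-use.
  def others[A](xs: IndexedSeq[A]): Iterator[(A, Iterator[A])] =
    xs.indices.iterator.map { i =>
      (xs(i), Iterator.range(0, xs.size - 1).map(j => xs(faceMap(i)(j))))
    }
}
```

For example, `OthersSketch.others(Vector(1, 2, 3))` yields 1 paired with an iterator over 2 and 3, then 2 with 1 and 3, then 3 with 1 and 2. Visiting every pair is still quadratic overall, as noted in the question's second edit.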
If I understand what you want correctly, in terms of handling duplicate values (i.e., duplicate values are to be preserved), here's something that should work. Given the following input:
import scala.util.Random
val nums = Vector.fill(20)(Random.nextInt)
This should get you what you need:
for (i <- Iterator.from(0).take(nums.size)) yield {
nums(i) -> (nums.take(i) ++ nums.drop(i + 1))
}
On the other hand, if you want to remove dups, I'd convert to Sets:
val numsSet = nums.toSet
for (num <- nums) yield {
num -> (numsSet - num)
}
seq.iterator.map { case x => x -> seq.filter(_ != x) }
This is quadratic, but I don't think there is very much you can do about that, because at the end of the day, creating a collection is linear, and you are going to need N of them.
import scala.annotation.tailrec
def prems(s : Seq[Int]):Map[Int,Seq[Int]]={
@tailrec
def p(prev: Seq[Int],s :Seq[Int],res:Map[Int,Seq[Int]]):Map[Int,Seq[Int]] = s match {
case x::Nil => res+(x->prev)
case x::xs=> p(x +: prev,xs, res+(x ->(prev++xs)))
}
p(Seq.empty[Int],s,Map.empty[Int,Seq[Int]])
}
prems(Seq(1,2,3,4))
res0: Map[Int,Seq[Int]] = Map(1 -> List(2, 3, 4), 2 -> List(1, 3, 4), 3 -> List(2, 1, 4),4 -> List(3, 2, 1))
I think you are looking for permutations. You can map the resulting lists into the structure you are looking for:
Seq(1,2,3).permutations.map(p => (p.head, p.tail)).toList
res49: List[(Int, Seq[Int])] = List((1,List(2, 3)), (1,List(3, 2)), (2,List(1, 3)), (2,List(3, 1)), (3,List(1, 2)), (3,List(2, 1)))
Note that the final toList call is only there to trigger the evaluation of the expressions; otherwise, the result is an iterator as you asked for.
In order to get rid of the duplicate heads, toMap seems like the most straight-forward approach:
Seq(1,2,3).permutations.map(p => (p.head, p.tail)).toMap
res50: scala.collection.immutable.Map[Int,Seq[Int]] = Map(1 -> List(3, 2), 2 -> List(3, 1), 3 -> List(2, 1))

How to find the total average of filtered values in scala

The following code gives me the values per filtered key. How do I sum and average all the values together, i.e. combine the results of all filtered values?
val f= p.groupBy(d => (d.Id))
.mapValues(totavg =>
(totavg.groupBy(_.Day).filterKeys(Set(2,3,4)).mapValues(_.map(_.Amount))))
Sample output:
Map(A9 -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
I would like to add all values together and compute total average.
i.e (473.3+676.48+685.45+812.73)/4
For the given Map, you can apply flatMap twice to flatten it into a single sequence of values, then calculate the average:
val m = Map("A9" -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
val s = m.flatMap(_._2.flatMap(_._2))
// s: scala.collection.immutable.Iterable[Double] = List(473.3, 676.48, 685.45, 812.73)
s.sum/s.size
// res14: Double = 661.99
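If the nested Map is not needed for anything else, the average can also be computed straight from the records. This is only a sketch: the record type `Payment`, its field names, and `averageForDays` are hypothetical stand-ins for whatever `p` contains in the question, and an id with no matching days gets `None` instead of a division by zero.

```scala
object AverageSketch {
  // Hypothetical record type standing in for the elements of `p`.
  final case class Payment(id: String, day: Int, amount: Double)

  // Average of all amounts on the selected days, per id.
  def averageForDays(payments: Seq[Payment], days: Set[Int]): Map[String, Option[Double]] =
    payments.groupBy(_.id).map { case (id, ps) =>
      val amounts = ps.filter(p => days(p.day)).map(_.amount)
      id -> (if (amounts.isEmpty) None else Some(amounts.sum / amounts.size))
    }
}
```

Called with the four sample amounts under id "A9" and `Set(2, 3, 4)`, this returns the same total average as the flatMap version above, up to floating-point rounding.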

How do I populate a list of objects with new values

Apologies: I'm well noob
I have an items class
class item(ind:Int,freq:Int,gap:Int){}
I have an ordered list of ints
val listVar = a.toList
where a is an array
I want a list of items called metrics where
ind is the (unique) integer
freq is the number of times that ind appears in list
gap is the minimum gap between ind and the number in the list before it
so far I have:
def metrics = for {
n <- 0 until 255
listVar filter (x == n) count > 0
}
yield new item(n, (listVar filter == n).count,0)
It's crap and I know it - any clues?
Well, some of it is easy:
val freqMap = listVar groupBy identity mapValues (_.size)
This gives you ind and freq. To get gap I'd use a fold:
val gapMap = listVar.sliding(2).foldLeft(Map[Int, Int]()) {
case (map, List(prev, ind)) =>
map + (ind -> (map.getOrElse(ind, Int.MaxValue) min ind - prev))
}
Now you just need to unify them:
freqMap.keys.map( k => new item(k, freqMap(k), gapMap.getOrElse(k, 0)) )
Ideally you want to traverse the list only once and in the course for each different Int, you want to increment a counter (the frequency) as well as keep track of the minimum gap.
You can use a case class to store the frequency and the minimum gap, the value stored will be immutable. Note that minGap may not be defined.
case class Metric(frequency: Int, minGap: Option[Int])
In the general case you can use a Map[Int, Metric] to look up the immutable Metric object. Looking for the minimum gap is the harder part. To find the gap, you can use the sliding(2) method. It will traverse the list with a sliding window of size two, allowing you to compare each Int to its previous value so that you can compute the gap.
Finally you need to accumulate and update the information as you traverse the list. This can be done by folding each element of the list into your temporary result until you traverse the whole list and get the complete result.
Putting things together:
listVar.sliding(2).foldLeft(
Map[Int, Metric]().withDefaultValue(Metric(0, None))
) {
case (map, List(a, b)) =>
val metric = map(b)
val newGap = metric.minGap match {
case None => math.abs(b - a)
case Some(gap) => math.min(gap, math.abs(b - a))
}
val newMetric = Metric(metric.frequency + 1, Some(newGap))
map + (b -> newMetric)
case (map, List(a)) =>
map + (a -> Metric(1, None))
case (map, _) =>
map
}
Result for listVar: List[Int] = List(2, 2, 4, 4, 0, 2, 2, 2, 4, 4)
scala.collection.immutable.Map[Int,Metric] = Map(2 -> Metric(4,Some(0)),
4 -> Metric(4,Some(0)), 0 -> Metric(1,Some(4)))
You can then turn the result into your desired item class using result.toSeq.map { case (i, m) => new item(i, m.frequency, m.minGap.getOrElse(-1)) }, where result is the Map produced by the fold.
You can also create your item objects directly in the process, but I thought the code would be harder to read.
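For comparison, here is a sketch that builds the final objects directly in two simple passes, one for frequencies and one for gaps. `MetricsSketch`, the case class `Item` with an optional gap, and `metrics` are hypothetical names. Unlike the sliding(2) fold above, the frequency here also counts the first element of the list, which that fold never visits as the "current" value.

```scala
object MetricsSketch {
  // Hypothetical variant of the question's `item` class; gap is None when
  // the value never has a predecessor (it only occurs as the first element).
  final case class Item(ind: Int, freq: Int, gap: Option[Int])

  def metrics(xs: List[Int]): List[Item] = {
    // Pass 1: how often each value occurs.
    val freq: Map[Int, Int] = xs.groupBy(identity).map { case (k, v) => k -> v.size }
    // Pass 2: smallest absolute difference between a value and its predecessor.
    val gaps: Map[Int, Int] = xs.zip(xs.drop(1)).foldLeft(Map.empty[Int, Int]) {
      case (acc, (prev, cur)) =>
        val g = math.abs(cur - prev)
        acc.updated(cur, acc.get(cur).fold(g)(old => math.min(old, g)))
    }
    freq.keys.toList.sorted.map(k => Item(k, freq(k), gaps.get(k)))
  }
}
```

For `List(2, 2, 4, 4, 0, 2, 2, 2, 4, 4)` this gives a frequency of 5 for the value 2, 4 for the value 4 and 1 for the value 0, with minimum gaps of 0, 0 and 4 respectively.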