Retain MAX value of Aerospike CDT List - scala

Context
Consider a stream of tuples (string, timestamp), with the goal of maintaining a bin containing a Map of the minimal timestamp received for each string.
Another constraint is that the update of this Map must be atomic.
For this input example:
("a", 1)
("b", 2)
("a", 3)
the expected output is Map("a" -> 1, "b" -> 2), i.e. without the maximal timestamp 3.
The implementation I chose uses a list CDT to hold the timestamps, so my result is Map("a" -> ArrayList(1), "b" -> ArrayList(2)), built as follows:
private val boundedAndOrderedListPolicy = new ListPolicy(ListOrder.ORDERED, ListWriteFlags.INSERT_BOUNDED)

private def bundleContext(str: String) =
  CTX.mapKeyCreate(Value.get(str), MapOrder.UNORDERED)

private def buildInitTimestampOp(tuple: (String, Long)): List[Operation] = {
  // having these two ops together ensures that the size of the list is always 1
  val str = tuple._1
  val timestamp = tuple._2
  val bundleCtx: CTX = bundleContext(str)
  List(
    ListOperation.append(boundedAndOrderedListPolicy, initBin, Value.get(timestamp), bundleCtx),
    ListOperation.removeByRankRange(initBin, 1, ListReturnType.NONE, bundleCtx) // keep the first element of the ordered list - the earliest timestamp
  )
}
This works as expected. However, if you have a better way to achieve this without the list, and in an atomic manner, I would love to hear it.
My question
What does not work for me is retaining the max timestamp received for each input str. For the example above, the desired result is Map("a" -> ArrayList(3), "b" -> ArrayList(2)). My attempt is:
private def buildLastSeenTimestampOp(tuple: (String, Long)): List[Operation] = {
  // having these two ops together ensures that the size of the list is always 1
  val str = tuple._1
  val timestamp = tuple._2
  val bundleCtx: CTX = bundleContext(str)
  List(
    ListOperation.append(boundedAndOrderedListPolicy, lastSeenBin, Value.get(timestamp), bundleCtx),
    ListOperation.removeByRankRange(lastSeenBin, 1, ListReturnType.NONE | ListReturnType.REVERSE_RANK, bundleCtx) // keep the last element of the ordered list - the latest timestamp
  )
}
Any idea why it doesn't work?

So, I've solved it:
private def buildLastSeenTimestampOp(tuple: (String, Long)): List[Operation] = {
  // having these two ops together ensures that the size of the list is always 1
  val str = tuple._1
  val timestamp = tuple._2
  val bundleCtx: CTX = bundleContext(str)
  List(
    ListOperation.append(boundedAndOrderedListPolicy, lastSeenBin, Value.get(timestamp), bundleCtx),
    ListOperation.removeByRankRange(lastSeenBin, -1, ListReturnType.NONE | ListReturnType.INVERTED, bundleCtx) // keep the last element of the ordered list - the latest timestamp
  )
}
When dealing with rank/index (removeByIndexRange for indexes) in ascending order, -1 represents the maximum rank/index.
ListReturnType.INVERTED retains the range (or, in this case, the single element) selected by the initial rank/index and count, by deleting all elements of the list that are not in the selected range.
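For reference, here is a minimal sketch of how such a pair of operations can be submitted as one atomic call with the Aerospike Java client; the client construction, namespace, set and key names here are assumptions for illustration only, not part of the original code:
val client = new com.aerospike.client.AerospikeClient("localhost", 3000) // placeholder host/port
val key = new com.aerospike.client.Key("test", "demo", "timestamps-record") // placeholder namespace/set/key

// Both list operations run in a single operate() call, so the record is
// updated atomically: append the new timestamp, then trim the list back to
// its single max element. A null WritePolicy uses the client's defaults.
val ops = buildLastSeenTimestampOp(("a", 3L))
val record = client.operate(null, key, ops: _*)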

Related

Scala - conditional product/join of two arrays with default values using for comprehensions

I have two Sequences, say:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
How do I get a product such that each element of the first array is joined with every element of the second array that starts with it, and also yields a default result when no element in the second array meets the condition?
Effectively, to get the output:
expectedProductArray = Array("B-B25", "B-B80", "B-B50", "L-Default", "T-T70")
I tried doing,
val myProductArray: Array[String] = for {
  f <- first
  s <- second if s.startsWith(f)
} yield s"""$f-$s"""
and I get:
myProductArray = Array("B-B25", "B-B80", "B-B50", "T-T70")
Is there an idiomatic way of adding a default value for elements of the first sequence that have no corresponding value in the second sequence under the given criteria? Appreciate your thoughts.
Here's one approach: turn array second into a Map and look up the elements of array first with getOrElse:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
val m = second.groupBy(_(0).toString)
// m: scala.collection.immutable.Map[String,Array[String]] =
// Map(M -> Array(M100), A -> Array(A50), B -> Array(B25, B80, B50), T -> Array(T70))
first.flatMap(x => m.getOrElse(x, Array("Default")).map(x + "-" + _))
// res1: Array[String] = Array(B-B25, B-B80, B-B50, L-Default, T-T70)
In case you prefer using for-comprehension:
for {
  x <- first
  y <- m.getOrElse(x, Array("Default"))
} yield s"$x-$y"

How to copy matrix to column array

I'm trying to copy a column of a matrix into an array, and I also want to make this matrix public.
Here's my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i = 0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
  val cols = line.split(",").map(_.trim)
  years(i) = cols(3)(i)
}
I want cols to be a global matrix and to copy column 3 into years; because of the way I obtain cols, I don't know how to define it.
There are three different problems in your attempt:
Your regexp will fail for this dataset. I suggest you change it to:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
This will capture the blocks wrapped in double quotes but containing commas (courtesy of Luke Sheppard on regexr)
This val i=0; is not very Scala-ish / functional. We can replace it with a zipWithIndex in the for comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
You can create the "global matrix" by extracting elements from each line (val Array (...)) and returning them as the value of the for-comprehension block (yield):
It looks like this:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme, title, artist, year, spotify_url) = line....
  ...
  (theme, title, artist, year, spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"

val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme, title, artist, year, spotify_url) = line.split(regex, -1).map(_.trim)
  years(count) = Array(year)
  (theme, title, artist, year, spotify_url)
}

// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList
Here's a function that will get column col from the CSV. There is no error handling for empty rows or other conditions; this is a proof of concept, so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
  .getLines()
  .map(_.split(",")(col).trim())
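For example, it can be applied like this (the file name and column index are taken from the question; calling toArray just materializes the iterator):
// Lazily reads the file, extracts column 3 (YEAR), and forces evaluation.
val yearValues = years("Top_1_000_Songs_To_Hear_Before_You_Die.csv", 3).toArray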
Here's a suggestion if you are looking to keep the contents of the file in a map. Again, there's no error handling, just a proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"

def lines(file: String) = scala.io.Source.fromFile(file).getLines()

val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()) {
  case (acc, (v, idx)) => acc.updated(idx, v.trim())
}

def mapColumns = (values: Iterator[String]) =>
  values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()) {
    case (acc, (v, idx)) => acc.updated(idx, mapRow(v))
  }

val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int,Map[Int, String]] which you can use elsewhere. The first index of the map is the row number and the index of the inner map is the column number. years is an Iterable[String] that contains the year values.
Consider adding contents to a collection at the same time as it is created, in contrast to allocating space first and then updating it; for instance like this:
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))

Efficient countByValue of each column Spark Streaming

I want to find countByValue for each column in my data. I can find countByValue() for each column (e.g. 2 columns for now) on a basic batch RDD as follows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
         double.map(x => {
           val token = x.split(",")
           math.round(token(index).toDouble)
         }).countByValue()
       }))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to do the same with DStreams. I have a windowedDStream and want to countByValue on each column. My data has 50 columns. I have done it as follows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index => {
  val counts = windowedDStream.map(x => {
    val token = x.split(",")
    math.round(token(index).toDouble)
  }).countByValue()
  counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this; the only issue is that I want to apply more operations on counts, and it's not available outside the map.
You misunderstand what parallelize does. You think that when you give it a Seq of two elements, those two elements will be calculated in parallel. That is not the case, and it would be impossible for it to be the case.
What parallelize actually does is it creates an RDD from the Seq that you provided.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
})
Is equal to this:
val counts = (0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed. parallelize cannot retroactively make it so that the calculation happened in parallel.
The solution to your problem is to not use parallelize. It is entirely pointless.
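One possible direction (a sketch added here, not part of the original answer): key each value by its column index so everything stays inside the DStream, and further DStream operations can then be chained on the counts. windowedDStream is the stream from the question.
// Emit (columnIndex, roundedValue) pairs for every column of every line,
// then let Spark Streaming count occurrences of each pair per batch.
val perColumnCounts = windowedDStream
  .flatMap { line =>
    line.split(",").zipWithIndex.map { case (token, index) =>
      (index, math.round(token.toDouble))
    }
  }
  .countByValue() // DStream[((Int, Long), Long)]: ((column, value), count)

// Further operations can now be applied to the counts, e.g.:
perColumnCounts.print()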

Summing items within a Tuple

Below is a data structure of a List of tuples, of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id. So the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a Tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get it into the format I expect, which is of type RDD[(String, Seq[(String, Int)])], corresponding to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this (using the Spark shell to illustrate):
val l = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy { case (id1, id2, v) => (id1, id2) }
val result = grouped.map { case ((id1, id2), values) => (id1, id2, values.foldLeft(0) { case (cumm, tuple) => cumm + tuple._3 }) }
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
  .groupBy(tupl => (tupl._1, tupl._2))
  .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum))
  .values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a Map with the ids as keys.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
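For reference, a minimal sketch of that last step (my own completion of the outline above, not the answerer's code), grouping on the id and summing the Int field:
val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))

val summed = data3
  .groupBy(_._1)                                                       // Map(id -> tuples with that id)
  .map { case (id, group) => (id, group.head._2, group.map(_._3).sum) } // keep the name, sum the counts
  .toList
// List((id1,a,3), (id2,a,1)) -- element order may vary, since groupBy returns a Map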
The following is the most readable, efficient, and scalable:
data.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)] (which you can map back to flat (String, String, Int) tuples if needed). By using reduceByKey the summation will parallelize, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it will only be able to distribute across 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: since the grouping key already carries both fields, .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) can be replaced by .map { case ((id, name), v) => (id, name, v.map(_._3).sum) } (which also makes the .values step unnecessary).
Using a foldLeft here is really horrible when .map(_._3).sum, as shown above, will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

Slow IO with large data

I'm trying to find a better way to do this, as it could take years to compute! I need to compute a map which is too large to fit in memory, so I am trying to make use of IO as follows.
I have a file that contains a list of Ints, about 1 million of them. I have another file that contains data about my (500,000) document collection. For every Int in the first file, I need to calculate a function of the count of how many documents (lines in the second file) it appears in. Let me give an example:
File1:
-1
1
2
etc...
file2:
E01JY3-615, CR93E-177 , [-1 -> 2,1 -> 1,2 -> 2,3 -> 2,4 -> 2,8 -> 2,... // truncated for brevity]
E01JY3-615, CR93E-177 , [1 -> 2,2 -> 2,4 -> 2,5 -> 2,8 -> 2,... // truncated for brevity]
etc...
Here is what I have tried so far
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
  val p = new java.io.PrintWriter(new BufferedWriter(new FileWriter(f)))
  try {
    op(p)
  } finally {
    p.close()
  }
}

def binarySearch(array: Array[String], word: Int): Boolean = array match {
  case Array() => false
  case xs =>
    if (array(array.size / 2).split("->")(0).trim().toInt == word) {
      true
    } else if (array(array.size / 2).split("->")(0).trim().toInt > word) {
      binarySearch(array.take(array.size / 2), word)
    } else {
      binarySearch(array.drop(array.size / 2 + 1), word)
    }
}

var v = Source.fromFile("vocabulary.csv").getLines()

printToFile(new File("idf.csv"))(out => {
  v.foreach(word => {
    var docCount: Int = 0
    val s = Source.fromFile("documents.csv").getLines()
    s.foreach(line => {
      val split = line.split("\\[")
      val fpStr = split(1).init
      docCount = if (binarySearch(fpStr.split(","), word.trim().toInt)) docCount + 1 else docCount
    })
    val output = word + ", " + math.log10(500448 / (docCount + 1))
    out.println(output)
    println(output)
  })
})
There must be a faster way to do this. Can anyone think of one?
From what I understand of your code, you are trying to find every word in the dictionary in the document list.
Hence, you are making N*M comparisons, where N is the number of words (the integers in the dictionary) and M is the number of documents in the document list. Plugging in your values, you are trying to calculate 10^6 * 5*10^5 comparisons, which is 5*10^11. Unfeasible.
Why not create a mutable map with all the integers in the dictionary as keys (1,000,000 Ints in memory is roughly 3.8M from my measurements) and pass through the document list only once, where for each document you extract the integers and increment the respective count values in the map (for which the integer is the key)?
Something like this:
import collection.mutable.Map
import scala.util.Random._

val maxValue = 1000000
val documents = collection.mutable.Map[String, List[(Int, Int)]]()

// util function just to insert fake input; disregard
def provideRandom(key: String) = {
  (1 to nextInt(4)).foreach(_ =>
    documents.put(key, (nextInt(maxValue), nextInt(maxValue)) :: documents.getOrElse(key, Nil)))
}

// inserting fake documents into our fake document map
(1 to 500000).foreach(_ => { val key = nextString(5); provideRandom(key) })

// word count map
val wCount = collection.mutable.Map[Int, Int]()

// counting the numbers and incrementing them in the map
documents.foreach(doc => doc._2.foreach(k => wCount.put(k._1, wCount.getOrElse(k._1, 0) + 1)))
scala> wCount
res5: scala.collection.mutable.Map[Int,Int] = Map(188858 -> 1, 178569 -> 2, 437576 -> 2, 660074 -> 2, 271888 -> 2, 721076 -> 1, 577416 -> 1, 77760 -> 2, 67471 -> 1, 804106 -> 2, 185283 -> 1, 41623 -> 1, 943946 -> 1, 778258 -> 2...
The result is a map whose keys are numbers from the dictionary and whose values are the number of times each appears in the document list.
This is oversimplified, since:
I don't verify whether the number exists in the dictionary; you would only need to initialize the map with those values and then increment a value only if the map already has that key.
I don't do IO, which speeds up the whole thing.
This way you only pass through the documents once, which makes the task feasible again.
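Applied to the files from the question, a single-pass version including the IO could look roughly like this (a sketch that assumes documents.csv lines match the example format above; printToFile is the helper from the question):
import scala.io.Source
import scala.collection.mutable

// One pass over the documents: count, for each integer id, how many
// document lines it appears in.
val docCounts = mutable.Map.empty[Int, Int].withDefaultValue(0)

for (line <- Source.fromFile("documents.csv").getLines()) {
  val fpStr = line.split("\\[")(1).init // the "k -> v, ..." block without the trailing ']'
  fpStr.split(",").foreach { entry =>
    docCounts(entry.split("->")(0).trim.toInt) += 1
  }
}

// One pass over the vocabulary to write out the idf values.
printToFile(new java.io.File("idf.csv")) { out =>
  for (word <- Source.fromFile("vocabulary.csv").getLines()) {
    out.println(word.trim + ", " + math.log10(500448.0 / (docCounts(word.trim.toInt) + 1)))
  }
}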