How can I consume a service that returns pages as a Stream of items?
Amazon S3, for example, lets you fetch the initial object listing or the next object listing from the previous one.
For example, consider this code that simulates such behavior:
import math._

case class Page(number: Int)
case class Pages(pages: Seq[Page], truncated: Boolean)

class PagesService(pageSize: Int, pagesServed: Int) {
  def getPages =
    Pages((1 to pageSize).map(Page), pageSize < pagesServed)

  def nextPages(previous: Pages) = {
    val first = previous.pages.last.number + 1
    val last = min(first + pageSize, pagesServed)
    Pages((first to last).map(Page), last < pagesServed)
  }
}

object PagesClient extends App {
  val service = new PagesService(10, 100)

  val first = service.getPages
  assert(first.truncated)
  first.pages.foreach(println(_))

  val second = service.nextPages(first)
  second.pages.foreach(println(_))

  val book: Stream[Page] = ???
}
How could I write that last expression?
val book: Stream[Pages] = first #:: book.map(service.nextPages).takeWhile(_.pages.nonEmpty)
val pages: Stream[Page] = book.flatten(_.pages)
I do not know if it is a typo. If you meant Stream[Pages], it is simple:
val book: Stream[Pages] = first #:: book.map(x => service.nextPages(x))
If you meant Stream[Page], i.e. a stream of Page values from all pages, then:
val first = service.getPages
val second = service.nextPages(first)

val books: Stream[Page] = {
  val currentPages = book.iterator
  val firstPages = currentPages.next.pages.iterator

  def inner(current: Iterator[Page]): Stream[Page] = {
    if (current.hasNext) {
      current.next #:: inner(current)
    } else {
      val i = currentPages.next
      inner(i.pages.iterator)
    }
  }

  inner(firstPages)
}
The above basically takes a Pages and returns its Page values as part of the stream. When one Pages is exhausted, it moves on to the next Pages, and so on.
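As a more compact alternative, here is a minimal sketch (an addition, not part of the answer above) that uses the truncated flag of the PagesService to terminate the stream instead of relying on an empty page:

lazy val book: Stream[Pages] =
  first #:: book.takeWhile(_.truncated).map(service.nextPages)

// Flatten the batches into a stream of individual Page values
val allPages: Stream[Page] = book.flatMap(_.pages)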
Related
I'm implementing an iterator over an HTTP resource from which I can retrieve a paged list of elements. I tried to do this with a plain Iterator, but it's a blocking implementation, and since I'm using Akka it makes my dispatcher go a little crazy.
What I want is to implement the same iterator using akka-stream. The problem is that I need a slightly different retry strategy.
The service returns a list of elements, each identified by an id, and sometimes when I query for the next page, the service returns the same elements as the current page.
My current algorithm is this:
import scala.annotation.tailrec
import scala.concurrent.Await
import scala.concurrent.duration._

var seenIds = Set.empty[String] // assuming the ids are Strings
var position = 0

def isProblematicPage(elements: Seq[Element]): Boolean = {
  val currentIds = elements.map(_.id).toSet
  val intersection = seenIds & currentIds
  val hasOnlyNewIds = intersection.isEmpty
  if (hasOnlyNewIds) {
    seenIds = seenIds | currentIds
  }
  !hasOnlyNewIds
}

def incrementPage(): Unit = {
  position += 10
}

def doBackOff(attempt: Int): Unit = {
  // Backoff logic
}

@tailrec
def fetchPage(attempt: Int = 0): Iterator[Element] = {
  if (attempt > MaxRetries) {
    incrementPage()
    Iterator.empty
  } else {
    val eventualPage = service.retrievePage(position, position + 10)
    val page = Await.result(eventualPage, 5.minutes)
    if (isProblematicPage(page)) {
      doBackOff(attempt)
      fetchPage(attempt + 1)
    } else {
      incrementPage()
      page.iterator
    }
  }
}
I'm doing the implementation using akka-streams but I can't figure out how to accumulate the pages and test for repetition using the streams structure.
Any suggestions?
The Flow.scan method is useful in such situations.
I would start your stream with a source of positions:
type Position = Int

// 0, 10, 20, ...
def positionIterator(): Iterator[Position] = Iterator from (0, 10)

val positionSource: Source[Position, _] = Source fromIterator positionIterator
This position source can then be directed to a Flow.scan, which uses a function similar to your fetchPage. (Side note: you should avoid Awaits as much as possible; there is a way to get rid of them, but that is outside the scope of your original question.) The new function needs to take in the "state" of already-seen Elements:
def fetchPageWithState(service: Service)
                      (seenEls: Set[Element], position: Position): Set[Element] = {
  val maxRetries = 10
  val seenIds = seenEls map (_.id)

  @tailrec
  def readPosition(attempt: Int): Seq[Element] = {
    if (attempt > maxRetries)
      Seq.empty
    else {
      val eventualPage: Seq[Element] =
        Await.result(service.retrievePage(position, position + 10), 5.minutes)

      if (eventualPage.map(_.id).exists(seenIds.contains)) {
        doBackOff(attempt)
        readPosition(attempt + 1)
      }
      else
        eventualPage
    }
  } // end def readPosition

  seenEls ++ readPosition(0).toSet
} // end def fetchPageWithState
This can now be used within a Flow:
def fetchFlow(service: Service): Flow[Position, Set[Element], _] =
  Flow[Position].scan(Set.empty[Element])(fetchPageWithState(service))
The new Flow can be easily connected to your Position Source to create a Source of Set[Element]:
def elementsSource(service: Service): Source[Set[Element], _] =
  positionSource via fetchFlow(service)
Each new value from elementsSource will be an ever-growing Set of unique Elements from the fetched pages.
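A hypothetical usage sketch (the ActorSystem setup and the take count are assumptions, not part of the answer):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import scala.concurrent.Future

implicit val system = ActorSystem("pages")
implicit val materializer = ActorMaterializer()

// Take the first 10 accumulated Sets and keep the last (largest) one
val allElements: Future[Set[Element]] =
  elementsSource(service).take(10).runWith(Sink.last)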
The Flow.scan stage was good advice, but it lacked the ability to deal with Futures, so I implemented an asynchronous version, Flow.scanAsync; it's now available in Akka 2.4.12.
The current implementation is:
val service: WebService
val maxTries: Int
val backOff: FiniteDuration

def retry[T](zero: T, attempt: Int = 0)(f: => Future[T]): Future[T] = {
  f.recoverWith {
    case ex if attempt >= maxTries =>
      Future(zero)
    case ex =>
      akka.pattern.after(backOff, system.scheduler)(retry(zero, attempt + 1)(f))
  }
}

def isProblematicPage(lastPage: Seq[Element], currPage: Seq[Element]): Boolean = {
  val lastPageIds = lastPage.map(_.id).toSet
  val currPageIds = currPage.map(_.id).toSet
  val intersection = lastPageIds & currPageIds
  intersection.nonEmpty
}

def retrievePage(lastPage: Seq[Element], startIndex: Int): Future[Seq[Element]] = {
  retry(Seq.empty[Element]) {
    service.fetchPage(startIndex).map { currPage: Seq[Element] =>
      if (isProblematicPage(lastPage, currPage)) throw new ProblematicPageException(startIndex)
      else currPage
    }
  }
}

val pagesRange: Range = Range(0, maxItems, pageSize)
val scanAsyncFlow = Flow[Int].via(ScanAsync(Seq.empty[Element])(retrievePage))

Source(pagesRange)
  .via(scanAsyncFlow)
  .mapConcat(identity)
  .runWith(Sink.seq)
Thanks Ramon for the advice :)
I have a uuid generator, like:
import java.util.UUID

class NewUuid {
  def apply(): String = UUID.randomUUID().toString.replace("-", "")
}
And another class can use it:
class Dialog {
  val newUuid = new NewUuid
  def newButtons(): Seq[Button] = Seq(new Button(newUuid()), new Button(newUuid()))
}
Now I want to test the Dialog and mock the newUuid:
val dialog = new Dialog {
  override val newUuid = mock[NewUuid]
  newUuid.apply() returns "uuid1"
}
dialog.newButtons().map(_.text) === Seq("uuid1", "uuid1")
You can see the returned uuid is always uuid1.
Is it possible to have newUuid return different values for different calls? E.g. the first call returns uuid1, the second returns uuid2, etc.
newUuid.apply() returns "uuid1" thenReturns "uuid2"
https://etorreborre.github.io/specs2/guide/SPECS2-3.5/org.specs2.guide.UseMockito.html
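Putting it together, a minimal sketch (assuming the Dialog and NewUuid classes from the question, inside a specs2 specification with the Mockito trait mixed in):

val dialog = new Dialog {
  override val newUuid = mock[NewUuid]
  newUuid.apply() returns "uuid1" thenReturns "uuid2"
}

// The two buttons now get distinct uuids
dialog.newButtons().map(_.text) === Seq("uuid1", "uuid2")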
Use an iterator to produce your UUIDs:
def newUuid() = UUID.randomUUID().toString.replace("-", "")

val defaultUuidSource = Iterator continually newUuid()

class Dialog(uuids: Iterator[String] = defaultUuidSource) {
  def newButtons() = Seq(
    Button(uuids.next()),
    Button(uuids.next())
  )
}
Then provide a different one for testing:
val testUuidSource = Iterator from 1 map {"uuid" + _}
new Dialog(testUuidSource)
I'm trying to use Spark GraphX, and encountering what I think is a problem in how I'm using Scala. I'm a newbie to both Scala and Spark.
I create a graph by invoking my own function:
val initialGraph: Graph[VertexAttributes, Int] = sim.createGraph
VertexAttributes is a class I defined:
class VertexAttributes(var pages: List[Page], var ads: List[Ad], var step: Long,
                       val inDegree: Int, val outDegree: Int)
  extends java.io.Serializable {

  // Define alternative methods to be used as the score
  def averageScore() = {
    this.ads.map(_.score).sum / this.ads.length
  }

  def maxScore() = {
    if (this.ads.length == 0) None else Some(this.ads.map(_.score).max)
  }

  // Select averageScore as the function to be used
  val score = averageScore _
}
After some computations, I use the GraphX vertices() function to get the scores for each vertex:
val nodeRdd = g.vertices.map(v => if(v._2.score() == 0)(v._1 + ",'0,0,255'") else (v._1 + ",'255,0,0'"))
But this won't compile; the sbt message is:
value score is not a member of type parameter VertexAttributes
I have googled this error message, but frankly can't follow the conversation. Can anyone please explain the cause of the error and how I can fix it?
Thank you.
P.S. Below is my code for the createGraph method:
// Define a class to run the simulation
class Butterflies() extends java.io.Serializable {
  // A boolean flag to enable debug statements
  var debug = true
  // A boolean flag to read an edgelist file rather than compute the edges
  val readEdgelistFile = true

  // Create a graph from a page file and an ad file
  def createGraph(): Graph[VertexAttributes, Int] = {
    // Just needed for textFile() method to load an RDD from a textfile
    // Cannot use the global Spark context because SparkContext cannot be serialized from master to worker
    val sc = new SparkContext

    // Parse a text file with the vertex information
    val pages = sc.textFile("hdfs://ip-172-31-4-59:9000/user/butterflies/data/1K_nodes.txt")
      .map { l =>
        val tokens = l.split("\\s+") // split("\\s") will split on whitespace
        val id = tokens(0).trim.toLong
        val tokenList = tokens.last.split('|').toList
        (id, tokenList)
      }
    println("********** NUMBER OF PAGES: " + pages.count + " **********")

    // Parse a text file with the ad information
    val ads = sc.textFile("hdfs://ip-172-31-4-59:9000/user/butterflies/data/1K_ads.txt")
      .map { l =>
        val tokens = l.split("\\s+") // split("\\s") will split on whitespace
        val id = tokens(0).trim.toLong
        val tokenList = tokens.last.split('|').toList
        val next: VertexId = 0
        val score = 0
        //val vertexId: VertexId = id % 1000
        val vertexId: VertexId = id
        (vertexId, Ad(id, tokenList, next, score))
      }
    println("********** NUMBER OF ADS: " + ads.count + " **********")

    // Check if we should simply read an edgelist file, or compute the edges from scratch
    val edgeGraph =
      if (readEdgelistFile) {
        // Create a graph from an edgelist file
        GraphLoader.edgeListFile(sc, "hdfs://ip-172-31-4-59:9000/user/butterflies/data/1K_edges.txt")
      } else {
        // Create the edges between similar pages:
        // create a list of all possible pairs of pages,
        // then check if any pair shares at least one token.
        // We only need the pair ids for the edgelist.
        val allPairs = pages.cartesian(pages).filter { case (a, b) => a._1 < b._1 }
        val similarPairs = allPairs.filter { case (page1, page2) => page1._2.intersect(page2._2).length >= 1 }
        val idOnly = similarPairs.map { case (page1, page2) => Edge(page1._1, page2._1, 1) }
        println("********** NUMBER OF EDGES: " + idOnly.count + " **********")

        // Save the list of edges as a file, to be used instead of recomputing the edges every time
        //idOnly.saveAsTextFile("hdfs://ip-172-31-4-59:9000/user/butterflies/data/saved_edges")

        // Create a graph from an edge list RDD
        Graph.fromEdges[Int, Int](idOnly, 1)
      }

    // Copy into a graph with nodes that have vertexAttributes
    //val attributeGraph: Graph[VertexAttributes, Int] =
    val attributeGraph =
      edgeGraph.mapVertices { (id, v) => new VertexAttributes(Nil, Nil, 0, 0, 0) }

    // Add the node information into the graph
    val nodeGraph = attributeGraph.outerJoinVertices(pages) {
      (vertexId, attr, pageTokenList) =>
        new VertexAttributes(List(Page(vertexId, pageTokenList.getOrElse(List.empty), 0)),
          attr.ads, attr.step, attr.inDegree, attr.outDegree)
    }

    // Add the node degree information into the graph
    val degreeGraph = nodeGraph
      .outerJoinVertices(nodeGraph.inDegrees) {
        case (id, attr, inDegree) =>
          new VertexAttributes(attr.pages, attr.ads, attr.step, inDegree.getOrElse(0), attr.outDegree)
      }
      .outerJoinVertices(nodeGraph.outDegrees) {
        case (id, attr, outDegree) =>
          new VertexAttributes(attr.pages, attr.ads, attr.step, attr.inDegree, outDegree.getOrElse(0))
      }

    // Add the ads to the nodes
    val adGraph = degreeGraph.outerJoinVertices(ads) {
      (vertexId, attr, ad) =>
        if (ad.isEmpty) {
          new VertexAttributes(attr.pages, List.empty, attr.step, attr.inDegree, attr.outDegree)
        } else {
          new VertexAttributes(attr.pages, List(Ad(ad.get.id, ad.get.tokens, ad.get.next, ad.get.score)),
            attr.step, attr.inDegree, attr.outDegree)
        }
    }

    // Display the graph for debug only
    if (debug) {
      println("********** GRAPH **********")
      //printVertices(adGraph)
    }

    // Return the generated graph
    adGraph
  }
}
VertexAttributes in your code refers to a type parameter, not to the VertexAttributes class. The mistake is probably in your createGraph function. For example, it may look like this:
class Sim {
  def createGraph[VertexAttributes]: Graph[VertexAttributes, Int]
}
or:
class Sim[VertexAttributes] {
  def createGraph: Graph[VertexAttributes, Int]
}
In both cases you have a type parameter called VertexAttributes. This is the same as if you wrote:
class Sim[T] {
  def createGraph: Graph[T, Int]
}
The compiler doesn't know that T has a score method (because it doesn't). You don't need that type parameter. Just write:
class Sim {
  def createGraph: Graph[VertexAttributes, Int]
}
Now VertexAttributes will refer to the class, not to the local type parameter.
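A minimal sketch of the call site that should then compile (assuming the sim value from the question):

val initialGraph: Graph[VertexAttributes, Int] = sim.createGraph

// score is a member of the concrete VertexAttributes class, so this now resolves
val nodeRdd = initialGraph.vertices.map(v =>
  if (v._2.score() == 0) v._1 + ",'0,0,255'" else v._1 + ",'255,0,0'")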
I have the following sample code:
package models

import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

case class Task(id: Int, label: String)

object Task {
  private val buffer = new ArrayBuffer[Task]
  private val incrementer = new AtomicInteger()

  def all(): List[Task] = buffer.toList

  def create(label: String): Int = {
    val newId = incrementer.incrementAndGet()
    buffer += Task(newId, label)
    newId
  }

  def delete(id: Int): Boolean = {
    // TODO: add code
  }
}
In the delete method I need to find a Task whose id equals the id parameter; if one is found, I need to remove it from the collection and return true from the method. Otherwise (if none is found) I should just return false.
I know how to do this in an imperative language such as C# or Java, but Scala stumps me.
PS: The code is strictly used to understand the language and the platform; it sucks too much to be pushed into production. Don't worry.
This is one possible solution. However, in this case I think it's also possible to switch to a var holding an immutable collection and use filter (a sketch of that follows the example below). Also note that this code is not thread-safe:
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

case class Task(id: Int, label: String)

object Task {
  private val buffer = new ArrayBuffer[Task]
  private val incrementer = new AtomicInteger()

  def all(): List[Task] = buffer.toList

  def create(label: String): Int = {
    val newId = incrementer.incrementAndGet()
    buffer.append(Task(newId, label))
    newId
  }

  def delete(id: Int): Boolean = {
    buffer.
      find(_.id == id).  // find task by id
      map(buffer -= _).  // remove it from buffer
      exists(_ => true)  // the same as: map(_ => true).getOrElse(false)
  }
}
val id1 = Task.create("aaa")
val id2 = Task.create("bbb")
println(s"Id1 = $id1 Id2 = $id2")
println(s"All = ${Task.all()}")
val deleted = Task.delete(id1)
println(s"Deleted = $deleted")
println(s"All = ${Task.all()}")
println(s"Not Deleted = ${Task.delete(123)}")
We have some code which needs to run faster. It has already been profiled, so we would like to make use of multiple threads. Usually I would set up an in-memory queue and have a number of threads taking jobs off the queue and calculating the results. For the shared data I would use a ConcurrentHashMap or similar.
I don't really want to go down that route again. From what I have read, using actors will result in cleaner code, and if I use Akka, migrating to more than one JVM should be easier. Is that true?
However, I don't know how to think in actors, so I am not sure where to start.
To give a better idea of the problem here is some sample code:
case class Trade(price: Double, volume: Int, stock: String) {
  def value(priceCalculator: PriceCalculator) =
    (priceCalculator.priceFor(stock) - price) * volume
}

class PriceCalculator {
  def priceFor(stock: String) = {
    Thread.sleep(20) // a slow operation which can be cached
    50.0
  }
}

object ValueTrades {
  def valueAll(trades: List[Trade],
               priceCalculator: PriceCalculator): List[(Trade, Double)] = {
    trades.map { trade => (trade, trade.value(priceCalculator)) }
  }

  def main(args: Array[String]) {
    val trades = List(
      Trade(30.5, 10, "Foo"),
      Trade(30.5, 20, "Foo")
      // usually much longer
    )
    val priceCalculator = new PriceCalculator
    val values = valueAll(trades, priceCalculator)
  }
}
I'd appreciate it if someone with experience using actors could suggest how this would map on to actors.
This is a complement to my comment on shared results for expensive calculations. Here it is:
import scala.actors._
import Actor._
import Futures._

case class PriceFor(stock: String) // ask for a result

// The following could be an "object" as well, if it's supposed to be a singleton
class PriceCalculator extends Actor {
  val map = new scala.collection.mutable.HashMap[String, Future[Double]]()

  def act = loop {
    react {
      case PriceFor(stock) => reply(map getOrElseUpdate (stock, future {
        Thread.sleep(2000) // a slow operation
        50.0
      }))
    }
  }
}
Here's a usage example:
scala> val pc = new PriceCalculator; pc.start
pc: PriceCalculator = PriceCalculator#141fe06
scala> class Test(stock: String) extends Actor {
| def act = {
| println(System.currentTimeMillis().toString+": Asking for stock "+stock)
| val f = (pc !? PriceFor(stock)).asInstanceOf[Future[Double]]
| println(System.currentTimeMillis().toString+": Got the future back")
| val res = f.apply() // this blocks until the result is ready
| println(System.currentTimeMillis().toString+": Value: "+res)
| }
| }
defined class Test
scala> List("abc", "def", "abc").map(new Test(_)).map(_.start)
1269310737461: Asking for stock abc
res37: List[scala.actors.Actor] = List(Test#6d888e, Test#1203c7f, Test#163d118)
1269310737461: Asking for stock abc
1269310737461: Asking for stock def
1269310737464: Got the future back
scala> 1269310737462: Got the future back
1269310737465: Got the future back
1269310739462: Value: 50.0
1269310739462: Value: 50.0
1269310739465: Value: 50.0
scala> new Test("abc").start // Should return instantly
1269310755364: Asking for stock abc
res38: scala.actors.Actor = Test#15b5b68
1269310755365: Got the future back
scala> 1269310755367: Value: 50.0
For simple parallelization, where I throw a bunch of work out to process and then wait for it all to come back, I tend to like to use a Futures pattern.
class ActorExample {
  import actors._
  import Actor._

  class Worker(val id: Int) extends Actor {
    def busywork(i0: Int, i1: Int) = {
      var sum, i = i0
      while (i < i1) {
        i += 1
        sum += 42 * i
      }
      sum
    }
    def act() { loop { react {
      case (i0: Int, i1: Int) => sender ! busywork(i0, i1)
      case None => exit()
    }}}
  }

  val workforce = (1 to 4).map(i => new Worker(i)).toList

  def parallelFourSums = {
    workforce.foreach(_.start())
    val futures = workforce.map(w => w !! ((w.id, 1000000000)))
    val computed = futures.map(f => f() match {
      case i: Int => i
      case _ => throw new IllegalArgumentException("I wanted an int!")
    })
    workforce.foreach(_ ! None)
    computed
  }

  def serialFourSums = {
    val solo = workforce.head
    workforce.map(w => solo.busywork(w.id, 1000000000))
  }

  def timed(f: => List[Int]) = {
    val t0 = System.nanoTime
    val result = f
    val t1 = System.nanoTime
    (result, t1 - t0)
  }

  def go {
    val serial = timed(serialFourSums)
    val parallel = timed(parallelFourSums)
    println("Serial result: " + serial._1)
    println("Parallel result:" + parallel._1)
    printf("Serial took %.3f seconds\n", serial._2 * 1e-9)
    printf("Parallel took %.3f seconds\n", parallel._2 * 1e-9)
  }
}
Basically, the idea is to create a collection of workers, one per workload, and then throw all the data at them with !!, which immediately gives back a future. When you try to read the future, the sender blocks until the worker is actually done with the data.
You could rewrite the above so that PriceCalculator extended Actor instead, and valueAll coordinated the return of the data; a sketch follows.
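A minimal sketch of that rewrite (assuming the actor-based PriceCalculator and its PriceFor message from the previous answer; the double future unwrapping mirrors the Test class above):

def valueAll(trades: List[Trade],
             pc: PriceCalculator): List[(Trade, Double)] = {
  // Fire off every request first so the slow price lookups overlap
  val pending = trades.map(t => (t, pc !! PriceFor(t.stock)))
  // Then block on each reply; the reply itself is a Future[Double]
  pending.map { case (t, f) =>
    val price = f().asInstanceOf[Future[Double]].apply()
    (t, (price - t.price) * t.volume)
  }
}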
Note that you have to be careful passing non-immutable data around.
Anyway, on the machine I'm typing this from, if you run the above you get:
scala> (new ActorExample).go
Serial result: List(-1629056553, -1629056636, -1629056761, -1629056928)
Parallel result:List(-1629056553, -1629056636, -1629056761, -1629056928)
Serial took 1.532 seconds
Parallel took 0.443 seconds
(Obviously I have at least four cores; the parallel timing varies rather a bit depending on which worker gets what processor and what else is going on on the machine.)