Scala Future Sequence Mapping: finding length?

I want to return both a Future[Seq[String]] from a method and the length of that Seq[String] as well. Currently I'm building the Future[Seq[String]] using a mapping function from another Future[T].
Is there any way to do this without awaiting the Future?

You can map over the current Future to create a new one with the new data added to the type.
val fss: Future[Seq[String]] = Future(Seq("a","b","c"))
val x: Future[(Seq[String],Int)] = fss.map(ss => (ss, ss.length))
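The caller can then consume both values once the future completes, for example (an implicit ExecutionContext is assumed to be in scope, as it already is for the snippet above):

x.foreach { case (ss, n) => println(s"$n elements: $ss") }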

If you somehow know what the length of the Seq will be without actually waiting for it, then you can do something like this:
val t: Future[T] = ???
def foo: (Int, Future[Seq[String]]) = {
  val length = 42 // ???
  val fut: Future[Seq[String]] = t map { v =>
    genSeqOfLength42(v)
  }
  (length, fut)
}
If you don't, then you will have to return Future[(Int, Seq[String])] as jwvh said, or you can easily get the length later in the calling function.
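For the latter, a minimal sketch (fetchNames is a hypothetical stand-in for your method, and an implicit ExecutionContext is assumed to be in scope):

def fetchNames(): Future[Seq[String]] = ??? // hypothetical

// Derive the length at the call site, still without blocking:
val withLength: Future[(Seq[String], Int)] = fetchNames().map { names =>
  (names, names.length)
}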

Extend / Replicate Scala collections syntax to create your own collection?

I want to build a map, but I want to discard all keys with empty values, as shown below:
import scala.annotation.tailrec

@tailrec
def safeFiltersMap(
    map: Map[String, String],
    accumulator: Map[String, String] = Map.empty): Map[String, String] = {
  if (map.isEmpty) return accumulator
  val curr = map.head
  val (key, value) = curr
  safeFiltersMap(
    map.tail,
    if (value.nonEmpty) accumulator + (key -> value)
    else accumulator
  )
}
Now this works fine; however, I need to use it like this:
val safeMap = safeFiltersMap(Map("a"->"b","c"->"d"))
whereas I want to use it like the way we instantiate a map:
val safeMap = safeFiltersMap("a"->"b","c"->"d")
What syntax can I follow to achieve this?
The -> syntax isn't special syntax in Scala; it's just a convenient way of constructing a 2-tuple. So you can write your own functions that take 2-tuples as well. You don't need to define a new Map type; you just need a function that filters the existing one.
def safeFiltersMap(args: (String, String)*): Map[String, String] =
  Map(args: _*).filter { result =>
    val (_, value) = result
    value.nonEmpty
  }
Then call it using:
safeFiltersMap("a"->"b","c"->"d")
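As a quick sanity check (plain standard library, nothing specific to this answer), -> really does just build a tuple, and args: _* expands the collected varargs into Map's own vararg constructor:

val pair: (String, String) = "a" -> "b" // identical to ("a", "b")
println(pair) // (a,b)

println(safeFiltersMap("a" -> "b", "c" -> "")) // Map(a -> b): the empty value is dropped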

Best way to get List[String] or Future[List[String]] from List[Future[List[String]]] Scala

I have a flow that returns List[Future[List[String]]] and I want to convert it to List[String].
Here's what I'm currently doing to achieve it:
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

val functionReturnedValue: List[Future[List[String]]] = functionThatReturnsListOfFutureList()

val listBuffer = new ListBuffer[String]
functionReturnedValue.map { futureList =>
  val list = Await.result(futureList, Duration(10, "seconds"))
  list.map(string => listBuffer += string)
}
listBuffer.toList
Waiting inside the loop is not good, and I also need to avoid using ListBuffer.
Alternatively, is it possible to get Future[List[String]] from List[Future[List[String]]]?
Could someone please help with this?
There is no way to get a value from an asynchronous context into a synchronous context without blocking the synchronous context while it waits on the asynchronous one.
But you can delay that blocking as much as possible to get better results.
val listFutureList: List[Future[List[String]]] = ???
val listListFuture: Future[List[List[String]]] = Future.sequence(listFutureList)
val listFuture: Future[List[String]] = listListFuture.map(_.flatten)
val list: List[String] = Await.result(listFuture, Duration.Inf)
Using Await.result invokes a blocking operation, which you should avoid if you can.
Just as a side note: in your code you are using .map, but since you are only interested in the (mutable) ListBuffer, you can use foreach instead, which has Unit as its return type.
And instead of mapping and appending item by item, you can use .appendAll:
functionReturnedValue.foreach { fl =>
  listBuffer.appendAll(Await.result(fl, Duration(10, "seconds")))
}
As you don't want to use ListBuffer, another way is to use .sequence within a for comprehension and then .flatten:
val fls: Future[List[String]] = for (
  lls <- Future.sequence(functionReturnedValue)
) yield lls.flatten
You can transform a List[Future[In]] into a Future[List[In]] safely as follows:
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal
import scala.util.{Failure, Success}

def aggregateSafeSequence[In](futures: List[Future[In]])(implicit ec: ExecutionContext): Future[List[In]] = {
  val futureTries = futures.map(_.map(Success(_)).recover { case NonFatal(ex) => Failure(ex) })
  Future.sequence(futureTries).map {
    _.foldRight(List[In]()) {
      case (curr, acc) =>
        curr match {
          case Success(res) => res :: acc
          case Failure(ex) =>
            println("Failure occurred", ex)
            acc
        }
    }
  }
}
Then you can use Await.result to wait if you like, but it's not recommended and you should avoid it if possible.
Note that with plain Future.sequence, if one of the futures fails, the resulting future fails as a whole, so I went with a slightly different approach.
You can use the same approach for List[Future[List[String]]] and so on.
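For instance, applied to the List[Future[List[String]]] from the question (a sketch; the failing future here is purely illustrative):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

val input: List[Future[List[String]]] = List(
  Future(List("a", "b")),
  Future.failed(new RuntimeException("boom")), // logged and skipped, not fatal
  Future(List("c"))
)

val out: Future[List[String]] =
  aggregateSafeSequence(input).map(_.flatten) // eventually List("a", "b", "c")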

request timeout from flatMapping over cats.effect.IO

I am attempting to transform some data that is encapsulated in cats.effect.IO, using a Map that is also in an IO monad. I'm using http4s with the Blaze server, and when I use the following code the request times out:
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
  implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
  implicit val shiftJsonReader = new Reader[ShiftJson] {
    def read(value: JValue): ShiftJson = value.extract[ShiftJson]
  }
  implicit val shiftJsonDec = jsonOf[IO, ShiftJson]

  // get the shifts
  var getDbShifts: IO[List[Shift]] = shiftModel.findByUserId(userId)

  // use the userRoleId to get the RoleId then get the tasks for this role
  val taskMap: IO[Map[String, Double]] = taskModel.findByUserId(userId).flatMap {
    case tskLst: List[Task] => IO(tskLst.map((task: Task) => task.name -> task.standard).toMap)
  }

  val traversed: IO[List[Shift]] = for {
    shifts <- getDbShifts
    traversed <- shifts.traverse { (shift: Shift) =>
      val lstShiftJson: IO[List[ShiftJson]] = read[List[ShiftJson]](shift.roleTasks)
        .map((sj: ShiftJson) =>
          taskMap.flatMap((tm: Map[String, Double]) =>
            IO(ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / tm.get(sj.name).get))))
        .sequence

      //TODO: this flatMap is bricking my request
      lstShiftJson.flatMap { (sjLst: List[ShiftJson]) =>
        IO(Shift(shift.id, shift.shiftDate, shift.shiftStart, shift.shiftEnd,
          shift.lunchDuration, shift.shiftDuration, shift.breakOffProd, shift.systemDownOffProd,
          shift.meetingOffProd, shift.trainingOffProd, shift.projectOffProd, shift.miscOffProd,
          write[List[ShiftJson]](sjLst), shift.userRoleId, shift.isApproved, shift.score, shift.comments
        ))
      }
    }
  } yield traversed

  traversed.flatMap((sLst: List[Shift]) => Ok(write[List[Shift]](sLst)))
}
As you can see from the TODO comment, I've narrowed the problem in this method down to the flatMap below that comment. If I remove that flatMap and merely return IO(shift) into the traversed variable, the request does not time out; however, that doesn't help me much, because I need to make use of the lstShiftJson variable, which holds my transformed JSON.
My intuition tells me I'm abusing the IO monad somehow, but I'm not quite sure how.
Thank you for your time in reading this!
So with the guidance of Luis's comment I refactored my code to the following. I don't think it is optimal (i.e. the flatMap at the end seems unnecessary, but I couldn't figure out how to remove it), but it's the best I've got.
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
  implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
  implicit val shiftJsonReader = new Reader[ShiftJson] {
    def read(value: JValue): ShiftJson = value.extract[ShiftJson]
  }
  implicit val shiftJsonDec = jsonOf[IO, ShiftJson]

  // FOR EACH SHIFT
  // - read the shift.roleTasks into a ShiftJson object
  // - divide each task value by the task.standard where task.name = shiftJson.name
  // - write the list of shiftJson back to a string
  val traversed = for {
    taskMap <- taskModel.findByUserId(userId).map((tList: List[Task]) => tList.map((task: Task) => task.name -> task.standard).toMap)
    shifts <- shiftModel.findByUserId(userId)
    traversed <- shifts.traverse { (shift: Shift) =>
      val lstShiftJson: List[ShiftJson] = read[List[ShiftJson]](shift.roleTasks)
        .map((sj: ShiftJson) => ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap.get(sj.name).get))
      shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
      IO(shift)
    }
  } yield traversed

  traversed.flatMap((t: List[Shift]) => Ok(write[List[Shift]](t)))
}
Luis mentioned that mapping my List[Task] to a Map[String, Double] is a pure operation, so we want to use map instead of flatMap.
He also mentioned that I was wrapping every operation that comes from the database in IO, which was causing a great deal of recomputation (including DB transactions).
To solve this issue I moved all of the database operations inside my for comprehension; using the <- operator to flatMap each of the return values lets those values reside within the IO monad, preventing the recomputation I experienced before.
I do think there must be a better way of producing the return value: flatMapping the traversed variable to get back inside the IO monad seems like unnecessary recomputation, so please, anyone, correct me.
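Note: that trailing flatMap is not actually recomputation; it just sequences one more IO step. If it reads awkwardly, the response construction can be folded into the same comprehension. A sketch, assuming the same models, JSON helpers, and http4s Ok as above (implicits elided):

def getScoresByUserId(userId: Int): IO[Response[IO]] =
  for {
    taskMap <- taskModel.findByUserId(userId)
      .map(_.map(task => task.name -> task.standard).toMap)
    shifts <- shiftModel.findByUserId(userId)
    scored <- shifts.traverse { shift =>
      val json = read[List[ShiftJson]](shift.roleTasks).map { sj =>
        ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap(sj.name))
      }
      shift.roleTasks = write[List[ShiftJson]](json)
      IO.pure(shift)
    }
    resp <- Ok(write[List[Shift]](scored))
  } yield resp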

Scala: Convert a vector of tuples containing a future to a future of a vector of tuples

I'm looking for a way to convert a Vector[(Future[TypeA], TypeB)] to a Future[Vector[(TypeA, TypeB)]].
I'm aware of the conversion of a collection of futures to a future of a collection using Future.sequence(...), but I cannot figure out how to handle the step from a tuple containing a future to a future of a tuple.
So I'm looking for something that implements the desired functionality of the dummy extractFutureFromTuple in the following.
val vectorOfTuples: Vector[(Future[TypeA], TypeB)] = ...
val vectorOfFutures: Vector[Future[(TypeA, TypeB)]] = vectorOfTuples.map(_.extractFutureFromTuple)
val futureVector: Future[Vector[(TypeA, TypeB)]] = Future.sequence(vectorOfFutures)
Note that you can do this with a single call to Future.traverse:
val input: Vector[(Future[Int], Long)] = ???
val output: Future[Vector[(Int, Long)]] = Future.traverse(input) {
case (f, v) => f.map(_ -> v)
}
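For comparison, the two-step version sketched in the question (map each tuple into a future of a tuple, then sequence) is equivalent, just more verbose. Int and Long stand in for TypeA and TypeB here:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

val vectorOfTuples: Vector[(Future[Int], Long)] = ???

val vectorOfFutures: Vector[Future[(Int, Long)]] =
  vectorOfTuples.map { case (fa, b) => fa.map(a => (a, b)) }

val futureVector: Future[Vector[(Int, Long)]] =
  Future.sequence(vectorOfFutures)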

Strange "Task not serializable" with Spark

In my program, I have a method which returns some RDD; let's call it myMethod. It takes a non-serializable parameter, and let the RDD be of type Long (my real RDD is a tuple type, but it contains only primitive types).
When I try something like this:
val x: NonSerializableThing = ...
val l: Long = ...
myMethod(x, l).map(res => res + l) // myMethod's RDD does NOT include the NonSerializableThing
I get Task not serializable.
When I replace res + l by res + 1L (i.e., some constant), it runs.
From the serialization trace, it tries to serialize the NonSerializableThing and chokes there, but I double-checked my method and this object never appears in an RDD.
When I try to collect output of myMethod directly, i.e. with
myMethod(x, l).take(1) foreach println
I also get no problems.
The method uses the NonSerializableThing to get a (local) Seq of values on which multiple Cassandra queries are made (this is needed because I need to construct the partition keys to query for), like this:
def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
  val someParam1: String = x.someProperty
  x.getSomeSeq.flatMap { (y: OtherNonSerializableThing) =>
    val someParam2: String = y.someOtherProperty
    y.someOtherSeq.map { (someParam3: String) =>
      sc.cassandraTable("fooKeyspace", "fooTable").
        select("foo").
        where("bar=? and quux=? and baz=? and l=?", someParam1, someParam2, someParam3, l).
        map(_.getLong(0))
    }
  }.reduce((a, b) => a.union(b))
}
getSomeSeq and someOtherSeq return plain (non-Spark) Seqs.
What I want to achieve is to "union" multiple Cassandra queries.
What could be the problem here?
EDIT, Addendum, as requested by Jem Tucker:
What I have in my class is something like this:
implicit class MySparkExtension(sc: SparkContext) {

  def getThing(/* some parameters */): NonSerializableThing = { ... }

  def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
    val someParam1: String = x.someProperty
    x.getSomeSeq.flatMap { (y: OtherNonSerializableThing) =>
      val someParam2: String = y.someOtherProperty
      y.someOtherSeq.map { (someParam3: String) =>
        sc.cassandraTable("fooKeyspace", "fooTable").
          select("foo").
          where("bar=? and quux=? and baz=? and l=?", someParam1, someParam2, someParam3, l).
          map(_.getLong(0))
      }
    }.reduce((a, b) => a.union(b))
  }
}
This is declared in a package object. The problem occurs here:
// SparkContext is already declared as sc
import my.pkg.with.extension._
val thing = sc.getThing(/* parameters */)
val l = 42L
val rdd = sc.myMethod(thing, l)
// until now, everything is OK.
// The following still works:
rdd.take(5) foreach println
// The following causes the exception:
rdd.map(x => x >= l).take(5) foreach println
// While the following works:
rdd.map(x => x >= 42L).take(5) foreach println
I tested this both typed "live" into a Spark shell and in an algorithm submitted via spark-submit.
What I now want to try (as per my last comment) is the following:
implicit class MySparkExtension(sc: SparkContext) {

  def getThing(/* some parameters */): NonSerializableThing = { ... }

  def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
    val param1 = x.someProperty
    val partitionKeys =
      x.getSomeSeq.flatMap { y =>
        val param2 = y.someOtherProperty
        y.someOtherSeq.map(param3 => (param1, param2, param3, l))
      }
    queryTheDatabase(partitionKeys)
  }

  private def queryTheDatabase(partitionKeys: Seq[(String, String, String, Long)]): RDD[Long] =
    partitionKeys.map { k =>
      sc.cassandraTable("fooKeyspace", "fooTable").
        select("foo").
        where("bar=? and quux=? and baz=? and l=?", k._1, k._2, k._3, k._4).
        map(_.getLong(0))
    }.reduce((a, b) => a.union(b))
}
I believe this could work because the RDD is now constructed in the queryTheDatabase method, where no NonSerializableThing exists.
Another option might be this: the NonSerializableThing would indeed be serializable, but I pass the SparkContext into it as an implicit constructor parameter. I think if I make that parameter transient, the object would (uselessly) get serialized but not cause any problems.
When you replace l with 1L, Spark no longer has to serialize the class holding the method and its variables, so the error is not thrown.
You should be able to fix this by marking val x: NonSerializableThing = ... as transient, e.g.
@transient
val x: NonSerializableThing = ...
This means when the class is serialized this variable should be ignored.
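A minimal sketch of how that can look when the val is a field of a driver-side class captured by the closure (the Driver class and its run method are placeholders; everything else matches the names above):

import org.apache.spark.SparkContext
import my.pkg.with.extension._ // brings sc.getThing / sc.myMethod into scope

class Driver(@transient val sc: SparkContext) extends Serializable {
  // Skipped during closure serialization; only used on the driver.
  @transient val x: NonSerializableThing = sc.getThing(/* parameters */)
  val l: Long = 42L

  def run(): Unit = {
    // The closure captures `this` to read `l`; `x` and `sc` are
    // excluded from serialization because they are marked @transient.
    sc.myMethod(x, l).map(res => res >= l).take(5).foreach(println)
  }
}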