Spark streaming - transform two streams and join - scala

I've got an issue where I need to transform two streams I am reading from Spark before joining them. Once I do the transformation I can no longer join; I guess the type is no longer DStream[(String, String)] but DStream[Map[String, String]]:
val windowStream1 = act1Stream.window(Seconds(5)).transform { rdd => rdd.map(_._2).map(l => (...toMap)) }
val windowStream2 = act2Stream.window(Seconds(5)).transform { rdd => rdd.map(_._2).map(l => (...toMap)) }
val joinedWindow = windowStream1.join(windowStream2) //can't join
Any idea?

This doesn't solve your problem directly, but it makes it more digestible. You can split up the method chain and document which types you expect at each step by defining temporary val/def/var identifiers with the expected type. This way you can easily spot where the type stops matching your expectations.
E.g. I expect your act1Stream and act2Stream instances to be of type DStream[(String, String)], which I will call s1 and s2 for the moment. Comment if that is not the case.
def joinedWindow(
  s1: DStream[(String, String)],
  s2: DStream[(String, String)]
): DStream[...] = {
  val w1 = windowedStream(s1)
  val w2 = windowedStream(s2)
  w1.join(w2)
}

def windowedStream(actStream: DStream[(String, String)]): DStream[Map[...]] = {
  val windowed: DStream[(String, String)] = actStream.window(Seconds(5))
  windowed.transform(myTransform)
}

def myTransform(rdd: RDD[(String, String)]): RDD[Map[...]] = {
  val mapped: RDD[String] = rdd.map(_._2)
  // not enough information to conclude
  // the result type from the given code
  mapped.map(l => (...toMap))
}
From there you can conclude the rest of the types by filling in the ... sections, eliminating compiler errors line by line until you get your desired result. The relevant documentation:
DStream[T]
  def window(windowDuration: Duration): DStream[T]
  def transform[U](transformFunc: (RDD[T]) ⇒ RDD[U])(implicit arg0: ClassTag[U]): DStream[U]
PairDStreamFunctions[K,V]
  def join[W](other: DStream[(K, W)])(implicit arg0: ClassTag[W]): DStream[(K, (V, W))]
RDD[T]
  def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
At least this way you get to the point where you know exactly where the expected type and the produced type diverge.
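Note from the signatures above that join is only defined on pair DStreams (DStream[(K, W)]), so one way out is to keep a join key alongside the parsed Map instead of producing a bare DStream[Map[String, String]]. A minimal sketch under assumptions the question doesn't state (the "id" key and the "k:v,k:v" line format are made up):
// Sketch only: assumes each value is a line of comma-separated "key:value"
// pairs and that the parsed map contains an "id" entry to join on.
def toKeyedMap(s: DStream[(String, String)]): DStream[(String, Map[String, String])] =
  s.window(Seconds(5)).transform { rdd =>
    rdd.map(_._2).map { l =>
      val m = l.split(",").map(_.split(":")).collect { case Array(k, v) => k -> v }.toMap
      (m.getOrElse("id", ""), m) // keep a key so join stays available
    }
  }

val joinedWindow = toKeyedMap(act1Stream).join(toKeyedMap(act2Stream))
// DStream[(String, (Map[String, String], Map[String, String]))]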

Related

How to convert Future[Seq[A]] into Map[String, Seq[A]]?

I have a function which returns a Future[Seq[A]]:
def foo(x: String): Future[Seq[A]]
How can I convert it into a Map[String, Seq[A]]?
This is what I have tried:
foo(x).map { e =>
  (ResponseHeader.Ok, e.groupBy(_.use).mapValues(???))
}
EDIT: What I want to achieve is to group my Seq based on one of its columns as the key, and convert it into a Map[key, Seq[A]].
I managed to group it, but I don't know what to put inside the mapValues.
The mapValues call is not required; the groupBy alone will give you what you want:
val e: Seq[A] = ???
val map: Map[String, Seq[A]] = e.groupBy(_.use)

val res: Future[(Int, Map[String, Seq[A]])] =
  foo("").map { e =>
    (ResponseHeader.Ok, e.groupBy(_.use))
  }
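To make the behavior concrete, here is a minimal self-contained sketch with a made-up case class A (the use field and sample data are my own, not from the question):
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Hypothetical A with a `use` column, just to show what groupBy returns
case class A(use: String, value: Int)

def foo(x: String): Future[Seq[A]] =
  Future.successful(Seq(A("a", 1), A("a", 2), A("b", 3)))

val grouped: Future[Map[String, Seq[A]]] = foo("").map(_.groupBy(_.use))
// yields Map(a -> List(A(a,1), A(a,2)), b -> List(A(b,3)))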

cats-effect:How to transform Map[x,IO[y]] to IO[Map[x,y]]

I have a map of String to IO, i.e. Map[String, IO[String]], and I want to transform it into IO[Map[String, String]]. How do I do that?
It would be nice to use unorderedTraverse here, but as codenoodle pointed out, it doesn't work because IO is not a commutative applicative. However there is a type that is, and it's called IO.Par. Like the name suggests, its ap combinator won't execute things sequentially but in parallel, so it's commutative – doing a and then b is not the same as doing b and then a, but doing a and b concurrently is the same as doing b and a concurrently.
So you can use unorderedTraverse using a function that doesn't return IO but IO.Par. However the downside to that is that now you need to convert from IO to IO.Par and then back – hardly an improvement.
To solve this problem, I have added the parUnorderedTraverse method in cats 2.0 that will take care of these conversions for you. And because it all happens in parallel it will also be more efficient! There are also parUnorderedSequence, parUnorderedFlatTraverse and parUnorderedFlatSequence.
I should also point out that this works not only for IO but also for everything else with a Parallel instance, such as Either[A, ?] (where A is a CommutativeSemigroup). It should also be possible for List/ZipList, but nobody appears to have bothered to do it yet.
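A sketch of what that looks like with cats 2.x and cats-effect 2.x, assuming the parallel syntax from cats.implicits (in CE2 the ContextShift is what gives IO its Parallel instance; the map contents are made up):
import cats.effect.{ContextShift, IO}
import cats.implicits._
import scala.concurrent.ExecutionContext

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

val m: Map[String, IO[String]] = Map("a" -> IO.pure("1"), "b" -> IO.pure("2"))

// Runs the effects in parallel via IO.Par, then reassembles the Map
val n: IO[Map[String, String]] = m.parUnorderedSequence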
You'll have to be a little careful with this one. Maps in Scala are unordered, so if you try to use cats's sequence like this…
import cats.instances.map._
import cats.effect.IO
import cats.UnorderedTraverse

object Example1 {
  type StringMap[V] = Map[String, V]
  val m: StringMap[IO[String]] = Map("1" -> IO { println("1"); "1" })
  val n: IO[StringMap[String]] = UnorderedTraverse[StringMap].unorderedSequence[IO, String](m)
}
you'll get the following error:
Error: could not find implicit value for evidence parameter of type cats.CommutativeApplicative[cats.effect.IO]
The issue here is that the IO monad is not actually commutative. Here is the definition of commutativity:
map2(u, v)(f) = map2(v, u)(flip(f)) // Commutativity (Scala)
This definition shows that the result is the same even when the effects happen in a different order.
You can make the above code compile by providing an instance of CommutativeApplicative[IO] but that still doesn't make the IO monad commutative. If you run the following code you can see the side effects are not processed in the same order:
import cats.effect.IO
import cats.{Applicative, CommutativeApplicative}

object Example2 {
  implicit object FakeEvidence extends CommutativeApplicative[IO] {
    override def pure[A](x: A): IO[A] = IO(x)
    override def ap[A, B](ff: IO[A => B])(fa: IO[A]): IO[B] =
      implicitly[Applicative[IO]].ap(ff)(fa)
  }

  def main(args: Array[String]): Unit = {
    def flip[A, B, C](f: (A, B) => C) = (b: B, a: A) => f(a, b)
    val fa = IO { println(1); 1 }
    val fb = IO { println(true); true }
    val f = (a: Int, b: Boolean) => s"$a$b"
    println(s"IO is not commutative: ${FakeEvidence.map2(fa, fb)(f).unsafeRunSync()} == ${FakeEvidence.map2(fb, fa)(flip(f)).unsafeRunSync()} (look at the side effects above^^)")
  }
}
Which outputs the following:
1
true
true
1
IO is not commutative: 1true == 1true (look at the side effects above^^)
In order to get around this I would suggest making your map something with an order, like a List, where sequence will not require commutativity. The following example is just one way to do this:
import cats.effect.IO
import cats.implicits._

object Example3 {
  val m: Map[String, IO[String]] = Map("1" -> IO { println("1"); "1" })
  val l: IO[List[(String, String)]] =
    m.toList.traverse[IO, (String, String)] { case (s, io) => io.map(s2 => (s, s2)) }
  val n: IO[Map[String, String]] = l.map(_.toMap)
}
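For completeness, running the final effect should print 1 exactly once and then yield the rebuilt map (a quick check, assuming the Example3 code above):
// Prints "1" once (the side effect), then the resulting map
println(Example3.n.unsafeRunSync()) // Map(1 -> 1)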

Maps and Flatmaps in Scala

I am new to Scala. I need a lot of help with using map and flatMap with tuples.
I have functions as follows:
def extract(url: String): String = { /* some code */ }
def splitstring(content: String): Array[String] = { /* some code */ }
def SentenceDetect(paragraph: Array[String]): Array[String] = { /* some code */ }
def getMd5(literal: String): String = { /* some code */ }
I have an incoming list of URLs, and I want it to go through the above series of functions using map and flatMap.
var extracted_content=url_list.map(url => (url,extract(url)))
val trimmed_content=extracted_content.map(t => (t._1,splitstring(t._2)))
val sentences=trimmed_content.map(t => (t._1,SentenceDetect(t._2)))
val hashed_values=sentences.flatMap(t => (t._1,getMd5(t._2)))
The issue is that I am getting a type mismatch error at the flatMap:
Error:(68, 46) type mismatch;
found : (String, String)
required: scala.collection.GenTraversableOnce[?]
val hashed_values=sentences.flatMap(t => (t._1,getMd5(t._2.toString)))
How can I get this done?
I think this is what you're after.
val hashed_values = sentences.map(t => (t._1, t._2.map(getMd5)))
This should result in type List[(String,Array[String])]. This assumes that you actually want the Md5 calculation of each element in the t._2 array.
Recall that the signature of flatMap() is flatMap(f: (A) ⇒ GenTraversableOnce[B]); in other words, it takes a function that takes an element and returns a collection of transformed elements. A tuple, (String, String), is not GenTraversableOnce, thus the error you're getting.
You are getting this error because getMd5(...) accepts a String; however, sentences is of type List[(String, Array[String])] (assuming url_list is a List[String]), so t._2 is of type Array[String].
Anyway, some notes regarding your code (see the sketch after this list):
variable names in Scala are "lower camel" (https://docs.scala-lang.org/style/naming-conventions.html), not lower-with-underscore
extracted_content should probably be a val
since you only want to transform the values, you could convert your list of pairs to a Map and use .mapValues instead of .map
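Putting those notes together, a minimal sketch of the whole pipeline (assuming urlList: List[String] and the four functions from the question; here every sentence in each array gets hashed, which is what the accepted fix does too):
val urlList: List[String] = List("http://example.com") // hypothetical input

val extractedContent: Map[String, String]        = urlList.map(url => (url, extract(url))).toMap
val trimmedContent:   Map[String, Array[String]] = extractedContent.mapValues(splitstring)
val sentences:        Map[String, Array[String]] = trimmedContent.mapValues(SentenceDetect)
val hashedValues:     Map[String, Array[String]] = sentences.mapValues(_.map(getMd5))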

Stacking monads Writer and OptionT

I have the following code:
override def getStandsByUser(email: String): Try[Seq[Stand]] =
  (for {
    user     <- OptionT(userService.findOneByEmail(email)) // Try[Option[User]]
    stands   <- OptionT.liftF(standService.list())         // Try[List[Stand]]
    filtered  = stands.filter(stand => user.stands.contains(stand.id))
  } yield filtered).getOrElse(Seq())
I want to add logging at each stage of the processing, so I need to introduce the Writer monad and stack it with the OptionT monad transformer. Could you please suggest how to do that?
The best way to do this is to convert your service calls to use cats-mtl.
For representing Try or Option you can use MonadError and for logging you can use FunctorTell. Now I don't know what exactly you're doing inside your userService or standService, but I wrote some code to demonstrate what the result might look like:
type Log = List[String]

//inside UserService
def findOneByEmail[F[_]](email: String)
    (implicit F: MonadError[F, Error], W: FunctorTell[F, Log]): F[User] = ???

//inside StandService
def list[F[_]]()
    (implicit F: MonadError[F, Error], W: FunctorTell[F, Log]): F[List[Stand]] = ???

def getStandsByUser[F[_]](email: String)
    (implicit F: MonadError[F, Error], W: FunctorTell[F, Log]): F[List[Stand]] =
  for {
    user   <- userService.findOneByEmail(email)
    stands <- standService.list()
  } yield stands.filter(stand => user.stands.contains(stand.id))

//here we actually run the function
val result =
  getStandsByUser[WriterT[OptionT[Try, ?], Log, ?]](email) // yields WriterT[OptionT[Try, ?], Log, List[Stand]]
    .run   // yields OptionT[Try, (Log, List[Stand])]
    .value // yields Try[Option[(Log, List[Stand])]]
This way we can avoid all of the calls to liftF and easily compose our different services even if they will use different monad transformers at runtime.
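To give a sense of what a service body could look like under these constraints, here is a hypothetical sketch (fetchUserFromDb is made up, and note that FunctorTell was later renamed to Tell in cats-mtl 1.0):
import cats.MonadError
import cats.implicits._
import cats.mtl.FunctorTell

// Hypothetical implementation sketch, not from the answer above
def findOneByEmail[F[_]](email: String)
    (implicit F: MonadError[F, Error], W: FunctorTell[F, Log]): F[User] =
  for {
    _    <- W.tell(List(s"looking up user by email $email"))
    user <- fetchUserFromDb[F](email) // made-up persistence call
    _    <- W.tell(List(s"found user ${user.id}"))
  } yield user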
If you take a look at the definition of cats.data.Writer you will see that it is an alias to cats.data.WriterT with the effect fixed to Id.
What you want to do is use WriterT directly and instead of Id use OptionT[Try, YourType].
Here is a small code example of how that can be achieved:
object Example {
  import cats.data._
  import cats.implicits._
  import scala.util.Try

  type MyType[A] = OptionT[Try, A]

  def myFunction: MyType[Int] = OptionT(Try(Option(1)))

  def main(args: Array[String]): Unit = {
    val tmp: WriterT[MyType, List[String], Int] = for {
      _ <- WriterT.tell[MyType, List[String]](List("Before first invocation"))
      i <- WriterT.liftF[MyType, List[String], Int](myFunction)
      _ <- WriterT.tell[MyType, List[String]](List("After second invocation"))
      j <- WriterT.liftF[MyType, List[String], Int](myFunction)
      _ <- WriterT.tell[MyType, List[String]](List(s"Result is ${i + j}"))
    } yield i + j

    val result: Try[Option[(List[String], Int)]] = tmp.run.value
    println(result)
    // Success(Some((List(Before first invocation, After second invocation, Result is 2),2)))
  }
}
The type annotations make this a bit ugly, but depending on your use case you might be able to get rid of them. As you can see, myFunction returns a result of type OptionT[Try, Int] and WriterT.liftF will push that into a writer object that also carries a List[String] for your logs.

Supplying a code block as one of multiple method parameters

Consider these overloaded groupBy signatures:
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}

def groupBy[K](
    f: T => K,
    numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy(f, new HashPartitioner(numPartitions))
}
A correct/working invocation of the former is as follows:
val groupedRdd = df.rdd.groupBy{ r => r.getString(r.fieldIndex("centroidId"))}
But I am unable to determine how to add the second parameter. Here is the obvious attempt - which gives syntax errors:
val groupedRdd = df.rdd.groupBy{ r => r.getString(r.fieldIndex("centroidId")),
nPartitions}
I had also tried (also with syntax errors):
val groupedRdd = df.rdd.groupBy({ r => r.getString(r.fieldIndex("centroidId"))},
nPartitions)
btw, here is an approach that does work... but I am looking for the inline syntax:
def func(r: Row) = r.getString(r.fieldIndex("centroidId"))
val groupedRdd = df.rdd.groupBy( func _, nPartitions)
Since this is a generic method with type parameters T and K, Scala sometimes can't infer what those types should be from the context. In such cases you can help it by providing a type annotation, like this:
df.rdd.groupBy({ r: Row => r.getString(r.fieldIndex("centroidId")) }, nPartitions)
This is also the reason why this approach works:
def func(r: Row) = r.getString(r.fieldIndex("centroidId"))
val groupedRdd = df.rdd.groupBy(func _, nPartitions)
This fixes the type of r to be Row, similarly to the approach above.
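Another inline variant that works for the same reason (my addition, not from the answer above): bind the function to a val with an explicit function type, which likewise fixes r to Row without needing a standalone def:
// The explicit Row => String type removes the need for inference
val byCentroid: Row => String = r => r.getString(r.fieldIndex("centroidId"))
val groupedRdd = df.rdd.groupBy(byCentroid, nPartitions)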