Scio Apache Beam - How to properly separate a pipeline code? - scala

I have a pipeline with a set of PTransforms and my method is getting very long.
I'd like to write my DoFns and my composite transforms in a separate package and use them back in my main method. With python it's pretty straightforward, how can I achieve that with Scio? I don't see any example of doing that. :(
withFixedWindows(
FIXED_WINDOW_DURATION,
options = WindowOptions(
trigger = groupedWithinTrigger,
timestampCombiner = TimestampCombiner.END_OF_WINDOW,
accumulationMode = AccumulationMode.ACCUMULATING_FIRED_PANES,
allowedLateness = Duration.ZERO
)
)
.sumByKey
// How to write this in an another file and use it here?
.transform("Format Output") {
_
.withWindow[IntervalWindow]
.withTimestamp
}

If I understand your question correctly, you want to bundle your map, groupBy, ... transformations in a separate package, and use them in your main pipeline.
One way would be to use applyTransform, but then you would end up using PTransforms, which are not scala-friendly.
You can simply write a function that receives an SCollection and returns the transformed one, like:
def myTransform(input: SCollection[InputType]): Scollection[OutputType] = ???
But if you intend to write your own Source/Sink, take a look at the ScioIO class

You can use map function to map your elements example.
Instead of passing a lambda, you can pass a method reference from another class
Example .map(MyClass.MyFunction)

I think one way to solve this could be to define an object in another package and then create a method in that object that would have the logic required for your transformation. For example:
def main(cmdlineArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(cmdlineArgs)
val defaulTopic = "tweets"
val input = args.getOrElse("inputTopic", defaulTopic)
val output = args("outputTopic")
val inputStream: SCollection[Tweet] = sc.withName("read from pub sub").pubsubTopic(input)
.withName("map to tweet class").map(x => {parse(x).extract[Tweet]})
inputStream
.flatMap(sentiment.predict) // object sentiment with method predict
}
object sentiment {
def predict(tweet: Tweet): Option[List[TweetSentiment]] = {
val data = tweet.text
val emptyCase = Some("")
Some(data) match {
case `emptyCase` => None
case Some(v) => Some(entitySentimentFile(data)) // I used another method, //not defined
}
}
Please also this link for an example given in the Scio examples

Related

apply a function to list in Scala

I'm trying to learn Scala.
I ask for help to understand loop foreach
I have a function, it reads only last csv from path. But it works when I point only one way:
val path = "src/main/resources/historical_novel/"
def getLastFile(path: String, spark: SparkSession): String = {
val hdfs = ...}
but how can I apply this function to the list such as
val paths: List[String] = List(
"src/main/resources/historical_novel/",
"src/main/resources/detective/",
"src/main/resources/adventure/",
"src/main/resources/horror/")
I want to get such result:
src/main/resources/historical_novel/20221027.csv
src/main/resources/detective/20221026.csv
src/main/resources/adventure/20221026.csv
src/main/resources/horror/20221027.csv
I create df with column (path), then apply function through WithColumn and it is work,
but I want to do it with foreach, understand it.
let's say your function is like this
def f(s: String): Unit = {}
you can simply do this
paths.foreach(p => f(p))
After your edit, I think you may want use map, a function that can transform a collection to another collection. like this
val result = paths.map(p => getLastFile(p, yourSparkSession))
foreach applies a function you define or provide on each element in a Collection.
The simplest example is to print each path to the console
paths.foreach(path => println(path))
To apply a series of functions as you describe you can use {} in the foreach body and call multiple functions.
paths.foreach(path => {
val file = loadFile(path)
writeToDataBase(file)
})

Use ScalaCheck generator inside another generator

I have a generator that creates a very compelx object. I cannot create this object through something like
val myGen = for{
a <- Gen.choose(-10,10)
...
} yield new MyClass(a,b,c,...)
I tried an approach of creating a custom generator like this
val myComplexGen :Gen[ComplexObject] = {
...
val myTempVariable = Gen.choose(-10,10)
val otherTempVal = Gen.choose(100,2000)
new MyComplexObject(myTempVariable,otherTempVal,...)
}
and then
test("myTest") {
forAll(myComplexGen){ complexObj =>
... // Here, complexObj.myTempVariable is always the same through all the iterations
}
}
While this works, the values generated are always the same. The inner Gen.choose yield always the same value.
Is there any way I can write a custom Gen with its own logic, and use inner Gen.choose inside, that would be random ?
I've been able to workaround the problem. The solution is definitely not elegant but that's the only way I could work it out.
I have transformed myComplexGen into a def, and called it inside another gen with dummy variables
def myComplexGen :ComplexObject = {
...
val myTempVariable = Gen.choose(-10,10)
val otherTempVal = Gen.choose(100,2000)
new MyComplexObject(myTempVariable,otherTempVal,...)
}
val realComplexGen :Gen[ComplexObject] = for {
i <- Gen.choose(0,10) // Not actually used, but for cannot be empty
} yield myComplexGen()
Now I can use realComplexGenin a forAll and the object is really random.

Does scala have a lazy evaluating wrapper?

I want to return a wrapper/holder for a result that I want to compute only once and only if the result is actually used. Something like:
def getAnswer(question: Question): Lazy[Answer] = ???
println(getAnswer(q).value)
This should be pretty easy to implement using lazy val:
class Lazy[T](f: () => T) {
private lazy val _result = Try(f())
def value: T = _result.get
}
But I'm wondering if there's already something like this baked into the standard API.
A quick search pointed at Streams and DelayedLazyVal but neither is quite what I'm looking for.
Streams do memoize the stream elements, but it seems like the first element is computed at construction:
def compute(): Int = { println("computing"); 1 }
val s1 = compute() #:: Stream.empty
// computing is printed here, before doing s1.take(1)
In a similar vein, DelayedLazyVal starts computing upon construction, even requires an execution context:
val dlv = new DelayedLazyVal(() => 1, { println("started") })
// immediately prints out "started"
There's scalaz.Need which I think you'd be able to use for this.

passing a code block to method without execution

I have following code:
import java.io._
import com.twitter.chill.{Input, Output, ScalaKryoInstantiator}
import scala.reflect.ClassTag
object serializer {
val instantiator = new ScalaKryoInstantiator
instantiator.setRegistrationRequired(false)
val kryo = instantiator.newKryo()
def load[T](file:_=>_,name:String,cls:Class[T]):T = {
if (java.nio.file.Files.notExists(new File(name).toPath())) {
val temp = file
val baos = new FileOutputStream(name)
val output = new Output(baos, 4096)
kryo.writeObject(output, temp)
temp.asInstanceOf[T]
}
else {
println("loading from " + name)
val baos = new FileInputStream(name)
val input = new Input(baos)
kryo.readObject(input,cls)
}
}
}
I want to use it in this way:
val mylist = serializer.load((1 to 100000).toList,"allAdj.bin",classOf[List[Int]])
I don't want to run (1 to 100000).toList every time so I want to pass it to the serializer and then decide to compute it for the first time and serialize it for future or load it from file.
The problem is that the code block is running first in my code, how can I pass the code block without executing it?
P.S. Is there any scala tool that do the exact thing for me?
To have parameters not be evaluated before being passed, use pass-by-name, like this:
def method(param: =>ParamType)
Whatever you pass won't be evaluated at the time you pass, but will be evaluated each time you use param, which might not be what you want either. To have it be evaluated only the first time you use, do this:
def method(param: =>ParamType) = {
lazy val p: ParamType = param
Then use only p on the body. The first time p is used, param will be evaluated and the value will be stored. All other uses of p will use the stored value.
Note that this happens every time you invoke method. That is, if you call method twice, it won't use the "stored" value of p -- it will evaluate it again on first use. If you want to "pre-compute" something, then perhaps you'd be better off with a class instead?

How to create a play.api.libs.iteratee.Enumerator which inserts some data between the items of a given Enumerator?

I use Play framework with ReactiveMongo. Most of ReactiveMongo APIs are based on the Play Enumerator. As long as I fetch some data from MongoDB and return it "as-is" asynchronously, everything is fine. Also the transformation of the data, like converting BSON to String, using Enumerator.map is obvious.
But today I faced a problem which at the bottom line narrowed to the following code. I wasted half of the day trying to create an Enumerator which would consume items from the given Enumerator and insert some items between them. It is important not to load all the items at once, as there could be many of them (the code example has only two items "1" and "2"). But semantically it is similar to mkString of the collections. I am sure it can be done very easily, but the best I could come with - was this code. Very similar code creating an Enumerator using Concurrent.broadcast serves me well for WebSockets. But here even that does not work. The HTTP response never comes back. When I look at Enumeratee, it looks that it is supposed to provide such functionality, but I could not find the way to do the trick.
P.S. Tried to call chan.eofAndEnd in Iteratee.mapDone, and chunked(enums >>> Enumerator.eof instead of chunked(enums) - did not help. Sometimes the response comes back, but does not contain the correct data. What do I miss?
def trans(in:Enumerator[String]):Enumerator[String] = {
val (res, chan) = Concurrent.broadcast[String]
val iter = Iteratee.fold(true) { (isFirst, curr:String) =>
if (!isFirst)
chan.push("<-------->")
chan.push(curr)
false
}
in.apply(iter)
res
}
def enums:Enumerator[String] = {
val en12 = Enumerator[String]("1", "2")
trans(en12)
//en12 //if I comment the previous line and uncomment this, it prints "12" as expected
}
def enum = Action {
Ok.chunked(enums)
}
Here is my solution which I believe to be correct for this type of problem. Comments are welcome:
def fill[From](
prefix: From => Enumerator[From],
infix: (From, From) => Enumerator[From],
suffix: From => Enumerator[From]
)(implicit ec:ExecutionContext) = new Enumeratee[From, From] {
override def applyOn[A](inner: Iteratee[From, A]): Iteratee[From, Iteratee[From, A]] = {
//type of the state we will use for fold
case class State(prev:Option[From], it:Iteratee[From, A])
Iteratee.foldM(State(None, inner)) { (prevState, newItem:From) =>
val toInsert = prevState.prev match {
case None => prefix(newItem)
case Some(prevItem) => infix (prevItem, newItem)
}
for(newIt <- toInsert >>> Enumerator(newItem) |>> prevState.it)
yield State(Some(newItem), newIt)
} mapM {
case State(None, it) => //this is possible when our input was empty
Future.successful(it)
case State(Some(lastItem), it) =>
suffix(lastItem) |>> it
}
}
}
// if there are missing integers between from and to, fill that gap with 0
def fillGap(from:Int, to:Int)(implicit ec:ExecutionContext) = Enumerator enumerate List.fill(to-from-1)(0)
def fillFrom(x:Int)(input:Int)(implicit ec:ExecutionContext) = fillGap(x, input)
def fillTo(x:Int)(input:Int)(implicit ec:ExecutionContext) = fillGap(input, x)
val ints = Enumerator(10, 12, 15)
val toStr = Enumeratee.map[Int] (_.toString)
val infill = fill(
fillFrom(5),
fillGap,
fillTo(20)
)
val res = ints &> infill &> toStr // res will have 0,0,0,0,10,0,12,0,0,15,0,0,0,0
You wrote that you are working with WebSockets, so why don't you use dedicated solution for that? What you wrote is better for Server-Sent-Events rather than WS. As I understood you, you want to filter your results before sending them back to client? If its correct then you Enumeratee instead of Enumerator. Enumeratee is transformation from-to. This is very good piece of code how to use Enumeratee. May be is not directly about what you need but I found there inspiration for my project. Maybe when you analyze given code you would find best solution.