Can the subflows of groupBy depend on the keys they were generated from? - scala

I have a flow with data associated to users. I also have a state for each user, which I can get asynchronously from the DB.
I want to split my flow into one subflow per user, and load the state for each user when materializing the subflow, so that the elements of the subflow can be processed with respect to this state.
If I don't want to merge the subflows downstream, I can do something with groupBy and Sink.lazyInit:
def getState(userId: UserId): Future[UserState] = ...
def getUserId(element: Element): UserId = ...
def treatUser(state: UserState): Sink[Element, _] = ...
val treatByUser: Sink[Element, _] = Flow[Element].groupBy(
  Int.MaxValue,
  getUserId
).to(
  Sink.lazyInit(
    elt => getState(getUserId(elt)).map(treatUser),
    ??? // this is never called, since the subflow is created when an element comes
  )
)
However, this does not work if treatUser becomes a Flow, since there is no equivalent of Sink.lazyInit for flows.
Since the subflows of groupBy are materialized only when a new element is pushed, it should be possible to use this element to materialize the subflow, but I wasn't able to adapt the source code of groupBy so that this works consistently. Likewise, Sink.lazyInit doesn't seem to be easily translatable to the Flow case.
Any idea on how to solve this issue?

The relevant Akka issue you have to look at is #20129: add Sink.dynamic and Flow.dynamic.
In the associated PR #20579 they actually implemented the LazySink stuff.
They are planning to do LazyFlow next:
Will do next lazyFlow with similar signature.
Unfortunately you have to wait for that functionality to be implemented in Akka, or write it yourself (and then consider contributing it back as a PR to Akka).
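In the meantime, one possible workaround (a rough sketch only, not the planned LazyFlow) is to peek at the first element of each subflow with prefixAndTail(1), use it to load the state, and only then build the per-user flow. Here treatUserFlow is a hypothetical variant of treatUser that returns a Flow instead of a Sink, and the terminal Sink.ignore is just a placeholder; the APIs used (Source.fromFuture, flatMapConcat) are the 2.4/2.5-era ones:

// Sketch: approximate a lazily materialized per-key Flow using prefixAndTail.
// Assumes: import akka.stream.scaladsl.{ Flow, Sink, Source }
def treatUserFlow(state: UserState): Flow[Element, Element, _] = ??? // hypothetical

val treatByUser: Sink[Element, _] =
  Flow[Element]
    .groupBy(Int.MaxValue, getUserId)
    .prefixAndTail(1)                  // peek at the first element of each subflow
    .flatMapConcat {
      case (Seq(first), tail) =>
        // load the state using the first element, then run it plus the tail
        // through the state-dependent flow
        Source
          .fromFuture(getState(getUserId(first)))
          .flatMapConcat(state => Source.single(first).concat(tail).via(treatUserFlow(state)))
      case (_, tail) =>
        tail // empty prefix: the substream completed without elements
    }
    .to(Sink.ignore)                   // or whatever per-user sink you need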

Related

Pass each item of dataset through list and update

I am working on refactoring code for a Spark job written in Scala. We have a dataset of rolled-up data, "rollups", that we pass through a list of rules. Each rollup has a "flags" value, a list to which we append rule information to keep track of which rule was triggered by that rollup. Each rule is an object that looks at the data of a rollup and decides whether or not to add its identifying information to the rollup's "flags".
So here is where the problem is. Currently each rule takes in the rollup data, adds a flag and then returns it again.
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): RollupData = {
    if (*condition on rollup data is triggered*) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
    rollupData
  }
}
This allows us to calculate the rules like:
val anomalyInfo = rollups.flatMap(x =>
  AnomalyRules.rules.map(y =>
    y.identifyAnomalies(x)
  ).filter(a => a.flags.nonEmpty)
)
// do later processing on anomalyInfo
The rollups val here is a Dataset of our rollup data and rules is a list of rule objects.
The issue with this method is that it creates a duplicate of each rollup for every rule. For example, if we have 7 rules, each rollup will be duplicated 7 times because each rule returns the rollup passed into it. Running dropDuplicates() on the dataset takes care of this issue, but it's ugly and confusing.
This is why I wanted to refactor, but if I instead set up the rules to only append the flag, like this:
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): Unit = {
    if (*condition on rollup data is triggered*) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
  }
}
We can instead write:
rollups.foreach(rollup =>
  AnomalyRules.rules.foreach(rule =>
    rule.identifyAnomalies(rollup)
  )
)
// do later processing on rollups
This seems like the more intuitive approach. However, while running these rules works fine in unit tests, no "flag" information is added when the rollups dataset is passed through. I think it is because datasets are not mutable? This method actually does work if I collect the rollups as a list, but the dataset I am testing on is much smaller than what is in production, so we can't do that. This is where I am stuck and cannot think of a cleaner way of writing this code. I feel like I am missing some fundamental programming concept or do not understand mutability well enough.
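One shape I could fall back to is keeping the value-returning rules and folding them over each rollup inside a single map, so every rollup is emitted exactly once and no dropDuplicates() is needed. This is only a sketch, assuming RollupData keeps the flags field shown above and that identifyAnomalies returns the (possibly flagged) rollup:

// Sketch: apply every rule to a rollup inside one map, so each rollup
// appears once in the output instead of once per rule.
val anomalyInfo = rollups.map { rollup =>
  AnomalyRules.rules.foldLeft(rollup) { (current, rule) =>
    rule.identifyAnomalies(current) // each rule returns the (possibly flagged) rollup
  }
}.filter(_.flags.nonEmpty)
// do later processing on anomalyInfo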

What would be the best way to define a Gatling user flow using ChainBuilder?

I am new to Scala and Gatling and I am trying to figure out the best way to define a user story and pass it as a ChainBuilder to a Gatling scenario.
By "user story" I mean a flow that consists of a login, many different calls, and then a loop over another list of calls for the whole duration of the test.
I have created the following function to create a scenario:
def createScenario(name: String, feed: FeederBuilder, chains: ChainBuilder*): ScenarioBuilder = {
  scenario(name).feed(feed).forever() {
    exec(chains).pause(Config.pauseBetweenRequests)
  }
}
And here is how I execute this function:
val scenario = createScenario(Config.testName, feeder.random,
  setSessionParams(PARAM1, Config.param1),
  setSessionParams(PARAM2, Config.param2),
  login,
  executeSomeCall1,
  executeSomeCall2,
  executeSomeCall3,
  executeSomeCall4,
  executeSomeCall5,
  executeSomeCall6,
  executeSomeCall7,
  executeSomeCall8,
  executeSomeCall9
)
Here is an example of what executeSomeCall function looks like:
def executeSomeCall = {
  exec(http("ET Call Home")
    .post("/et/call/home")
    .body(ElFileBody("/redFingerBody.json")).asJson
    .check(status is 200))
}
My first question:
Is that the correct way to define a chain of REST calls and feed it to the scenario? I am asking because, when I define a flow like that, for some reason not all of my REST calls are actually executed. Weirdly enough, if I change the order of the calls it does work and all functions are called. (So I am definitely doing something wrong.)
My second question:
How can I define an infinite loop within this flow? (Infinite for as long as the test is running.)
So, for example, I'd like the above flow to start, and when it reaches executeSomeCall8, loop over executeSomeCall8 and executeSomeCall9 for the whole duration of the test.
I don't see why your calls would not be executed; however, the way you're constructing your scenario is not that flexible. You can make use of chaining without requiring a createScenario() method.
That leads to your second question: you can chain the scenario like this:
val scn = scenario("something")
  ...
  .exec(someCall7)
  .forever() {
    exec(someCall8)
      .exec(someCall9)
  }
  ...
...
where someCallN in my case looks like:
val someCall = http("request name")
  .get("/some/uri")
  ...
Note: forever() is just an example; you can use other loop statements that suit your needs.
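For example, if you only want the last two calls to loop for a fixed length of time rather than forever, something like Gatling's during loop should work (untested sketch; Config.testDuration is a placeholder for however you hold the test duration, it is not from your code):

// Sketch: loop someCall8 and someCall9 for a fixed duration instead of forever.
val scn = scenario("something")
  ...
  .exec(someCall7)
  .during(Config.testDuration) {
    exec(someCall8)
      .exec(someCall9)
  }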
I hope it helps.

Returning value from a Scala Future

In the below code, I'm trying to do two operations: one, to create a customer in a DB, and the other, to create an event in the DB. The creation of the event is dependent on the creation of the user.
I'm new to Scala, and confused about the role of Futures here. I'm trying to query a DB and see if the user is there, and if not, create the user. The below code is supposed to check if the user exists with the customerByPhone() function, and if it doesn't, go into the createUserAndEvent() function.
What it's actually doing is skipping the response from customerByPhone and going straight into createUserAndEvent(). I thought that by using flatMap, the program would automatically wait for the response and that I wouldn't have to use Await.result; is that not the case? Is there a way to avoid using Await.result so as not to block the thread in production code?
override def findOrCreate(phoneNumber: String, creationReason: String): Future[AvroCustomer] = {
  // query for customer in db
  // TODO this goes into createUserAndEvent before checking that response comes back empty from querying for user
  customerByPhone(phoneNumber)
    .flatMap(_ => createUserAndEvent(phoneNumber, creationReason, 1.0))
}
You don't need to use Await.result or any other blocking call. You do in fact have the result from customerByPhone; you're just ignoring it with the _. I think what you want is something like this:
customerByPhone(phoneNumber)
  .flatMap(customer => {
    if (customer == null)
      createUserAndEvent(phoneNumber, creationReason, 1.0)
    else
      Future.successful(customer) // no need to schedule work for an already-available value
  })
You need to code the logic to do something only if the customer isn't there.
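If customerByPhone actually returns a Future[Option[AvroCustomer]] rather than a nullable value (an assumption, not something shown above), the same idea reads a bit more idiomatically with pattern matching:

// Sketch, assuming customerByPhone: String => Future[Option[AvroCustomer]]
override def findOrCreate(phoneNumber: String, creationReason: String): Future[AvroCustomer] =
  customerByPhone(phoneNumber).flatMap {
    case Some(customer) => Future.successful(customer)                    // user already exists
    case None           => createUserAndEvent(phoneNumber, creationReason, 1.0)
  }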

Spark: How to structure a series of side-effect actions inside a mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: if this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the Spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory
As you can see, this implies lazily activated calculations, which perform the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not as important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
  splittedRDD = arg.map { split the string };
  assRDD = splittedRDD.map { associate to a table };
  flaggedRDD = assRDD.map { add a boolean parameter under a if condition + send mail };
  externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
  enrichRDD = flaggedRDD.map { enrich with external data };
  externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it, at least the mail has been sent.
In the end, the methods I found were these:
In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformation, the other gets the .foreachRDD{ outside action }. There is no major downside to this, as it is just one more RDD on a worker node. A sketch of this option follows below.
Extracting the {outside action} from the transformation and keeping track of the mails already sent: filter out elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
Caching before going on (although I was trying to avoid it, there was not much else to do).
If you are trying to avoid caching, solution 1 is the way to go.
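A rough sketch of option 1, just to make the shape concrete (names such as parseAndAssociate, flag, needsMail, sendMail and saveToFile are illustrative only, not from the code above):

// Keep the flagging map pure; the mail is sent only inside one output operation.
val flagged = lines
  .map(parseAndAssociate)
  .map(flag)                                   // no side effect in here

// branch 1: the outside action lives only in this output operation
flagged.foreachRDD { rdd =>
  rdd.filter(_.needsMail).collect().foreach(sendMail)
}

// branch 2: continue the pipeline and save both stages
flagged.foreachRDD(rdd => saveToFile(rdd.collect()))
flagged.map(enrich).foreachRDD(rdd => saveToFile(rdd.collect()))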

Spark Streaming -- (Top K requests) how to maintain a state on a driver?

I have a stream of logs with URLs users request.
Every minute I want to get the top 100 pages requested over all time and save them to HDFS.
I understand how to maintain the number of requests for each URL:
val ratingItemsStream: DStream[(String, Long)] = lines
  .map(LogEntry(_))
  .map(entry => (entry.url, 1L))
  .reduceByKey(_ + _)
  .updateStateByKey(updateRequestCount)
  // this provides a DStream of (url, number of requests) tuples
But what do I do next?
Obviously I need to pass all the updates to the driver to maintain a priority queue there, and then take the top K of it every minute.
How can I achieve this?
UPD: I've seen the Spark examples and the algebird MapMonoid used there. But since I do not understand how it works (surprisingly, no information was found online), I don't want to use it. There must be some way, right?
You could approach it by taking x-minute window aggregations of the data and applying sorting to get the ranking.
val window = ratingItemsStream.window(Seconds(windowSize), Seconds(windowSize))
window.foreachRDD { rdd =>
  val byScore = rdd.map(_.swap).sortByKey(ascending = false).zipWithIndex
  val top100 = byScore.collect { case ((score, url), index) if index < 100 => (url, score) }
  top100.saveAsTextFile("./path/to/file/")
}
(sample code, not tested!)
Note that rdd.top(x) will give you better performance than sorting/zipping, but it returns an array, and therefore you're on your own to save it to HDFS using the Hadoop API (which is an option, I think).
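For completeness, an untested sketch of that rdd.top variant; turning the small result array back into an RDD via parallelize in order to reuse saveAsTextFile is just one possible choice, not something claimed above:

window.foreachRDD { rdd =>
  // take the 100 entries with the highest count, ordering the (url, count) tuples by count
  val top100 = rdd.top(100)(Ordering.by { case (_, count) => count })
  // the result is tiny, so write it out as a single partition / single output file
  rdd.sparkContext.parallelize(top100.toSeq, numSlices = 1).saveAsTextFile("./path/to/file/")
}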