how to detect duplicated line using scala akka stream - scala

we have a scala application that read lines from text file and process them using Akka Stream. for better performance we set parallelism to 5. the problem is if the multiple lines contains the same email we only keep one of the line and treated others as duplicated and throw error. I tried to use a java concurrentHashMap to detect duplication but it didn't work, here is my code:
allIdentifiers = new ConcurrentHashMap[String, Int]()
Source(rows)
.mapAsync(config.parallelism.value) {
case (dataRow, index) => {
val eventResendResult: EitherT[Future, NonEmptyList[ResendError], ResendResult] =
for {
cleanedRow <- EitherT.cond[Future](
!allIdentifiers.containsKey(dataRow.lift(emailIndex)), {
allIdentifiers.put(dataRow.lift(emailIndex),index)
dataRow
}, {
NonEmptyList.of(
DuplicatedError(
s"Duplicated record at row $index",
List(identifier)
)
)
}
)
_ = logger.debug(
LoggingMessage(
requestId = RequestId(),
message = s"allIdentifiers: $allIdentifiers"
)
)
... more process step ...
} yield foldResponses(sent)
eventResendResult
.leftMap(errors => ResendResult(errors.toList, List.empty))
.merge
}
}
.runWith(Sink.reduce { (result1: ResendResult, result2: ResendResult) =>
ResendResult(
result1.errors ++ result2.errors,
result1.results ++ result2.results
)
})
we have config.parallelism.value set to 5, means any moment it'll process up to 5 lines concurrently. what I observed is if there are duplicated lines right next to each other, it didn't work, example:
line 0 contains email1
line 1 contains email1
line 2 contains email2
line 3 contains email2
line 4 contains email3
from the log i see the concurrentHashMap was populated with entries, but all lines passed the duplication detect and moved to the next process step.
so Akka Stream's parallelism is not the same thing as java's multithreads? how can i detect duplicated line in this case?

The problem is in the following snippet:
cleanedRow <- EitherT.cond[Future](
!allIdentifiers.containsKey(dataRow.lift(emailIndex)), {
allIdentifiers.put(dataRow.lift(emailIndex),index)
dataRow
}, {
NonEmptyList.of(
DuplicatedError(
s"Duplicated record at row $index",
List(identifier)
)
)
}
)
In particular: imagine two threads simultaneously processing an email which should be deduplicated. It is possible for the following to happen (in order)
The first thread checks containsKey and finds the email is not in the map
The second thread checks containsKey and finds the email is not in the map
The first thread adds the email to the map (based on results from step 1.) and passes the email through
The second thread adds the email to the map (based on results from step 3.) and passes the email through
In other words: you need to atomically check the map for the key and update it. This is a pretty common sort of thing to want, so it is exactly what ConcurrentHashMap's put does: it updates the value at the key and returns the previous value it replaced, if there was one.
I'm not too familiar with the combinators in Cats, so the following might not be idiomatic. However, note how it inserts and checks for a previous value in one atomic step.
cleanedRow <- EitherT(Future.successful {
val previous = allIdentifiers.put(dataRow.lift(emailIndex), index)
Either.cond(
previous != null,
dataRow,
NonEmptyList.of(
DuplicatedError(
s"Duplicated record at row $index",
List(identifier)
)
)
)
})

Related

Have assertion error return respective list position instead of the object?

test("Comparison") {
val list: List[String] = List("Thing", "Entity", "Variable")
val expected: List[String] = List("Thingg", "Entityy", "Variablee")
var expectedPosition = 0
for (item <- list) {
assert(list(expectedPosition) == expected(expectedPosition))
expectedPosition += 1
}
}
In Scala, in order to make my low-level code tests more readable, I thought it would be a good idea to just use one assertion and have it iterate through a loop and increase an accumulator at the end.
This is to test more than one input at a time when the multiple inputs would more or less have similar attributes. When the assertion fails, it comes back as "Thing[]" did not equal "Thing[g]". Instead of it reporting the item in the list that failed, is there a way to get it to directly state the list position without concatenating the list position before the assertion or using a conditional statement that returns the list position? Like I would rather keep it all contained within the assert() error report.
val expected: LazyList[String] = ...
expected.zip(list)
.zipWithIndex
.foreach{case ((exp,itm),idx) =>
assert(exp == itm, s"off at index $idx")
}
//java.lang.AssertionError: assertion failed: off at index 0
expected is lazy so that it is traversed only once even though it gets zipped twice.

Using flatMapGroupsWithState with forEachBatch

I have a streaming spark app, wherein the running stream I'm removing duplicate rows using Stateful Aggregation with flatMapGroupsWithState.
But when I used forEachBatch on the stream, and used the same functions I created to remove duplicates on stream, it is treating each Batch as an independent entity, and returning duplicates only among that single Micro Batch.
Code:
case class User(name: String, userId: Integer)
case class StateClass(totalUsers: Int)
def removeDuplicates(inputData: Dataset[User]): Dataset[User] = {
inputData
.groupByKey(_.userId)
.flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(removeDuplicatesInternal)
}
def removeDuplicatesInternal(id: Integer, newData: Iterator[User], state: GroupState[StateClass]): Iterator[User] = {
if (state.hasTimedOut) {
state.remove() // Removing state since no same UserId in 4 hours
return Iterator()
}
if (newData.isEmpty)
return Iterator()
if (!state.exists) {
val firstUserData = newData.next()
val newState = StateClass(1) // Total count = 1 initially
state.update(newState)
state.setTimeoutDuration("4 hours")
Iterator(firstUserData) // Returning UserData first time
}
else {
val newState = StateClass(state.get.totalUsers + 1)
state.update(newState)
state.setTimeoutDuration("4 hours")
Iterator() // Returning empty since state already exists (Already sent this UserData before)
}
}
Input Stream I used is userStream.
Above functions works fine when I directly pass stream to it.
val results = removeDuplicates(userStream)
But when I do something like:
userStream
.writeStream
.foreachBatch { (batch, batchId) => writeBatch(batch) }
def writeBatch(batch: Dataset[User]): Unit = {
val distinctBatch = removeDuplicates(batch)
}
I get distinct User data only within that Micro Batch. But I want it to be distinct overall across 4 hour timeout.
For Eg:
If 1st batch has UserIds (1, 3, 5, 1), and second batch has UserIds (2, 3, 1).
Expected Behaviour:
Output: 1st Batch = (1, 3, 5) and 2nd Batch = (2)
My Output: 1st Batch = (1, 3, 5) and 2nd Batch = (2, 3, 1)
How can I enable it to use the same State throughout? Right now, it is treating each micro-batch different, and creating a separate state for each batch.
PS: Problem is not limited to getting duplicates on Stream, I need to use forEachBatch for some computations on Micro batches, and remove duplicates before writing.
For Running test script, refer this: https://ideone.com/nZ5pq2
The behavior is actually expected one.
flatMapGroupsWithState leverages state store only when the query is streaming one. (For batch query, it doesn't even create a state store because it's not necessary.) Once you call forEachBatch, the provided batch parameter is no longer continuous one across batches - consider it as a dataset from batch query, which the batch means "a" micro-batch.
So you still need to pass your streaming dataset to removeDuplicate, or have your own way to deduplicate records across batches in forEachBatch.

Creating Seq after waiting for all results from map/foreach in Scala

I am trying to loop over inputs and process them to produce scores.
Just for the first input, I want to do some processing that takes a while.
The function ends up returning just the values from the 'else' part. The 'if' part is done executing after the function returns the value.
I am new to Scala and understand the behavior but not sure how to fix it.
I've tried inputs.zipWithIndex.map instead of foreach but the result is the same.
def getscores(
inputs: inputs
): Future[Seq[scoreInfo]] = {
var scores: Seq[scoreInfo] = Seq()
inputs.zipWithIndex.foreach {
case (f, i) => {
if (i == 0) {
// long operation that returns Future[Option[scoreInfo]]
getgeoscore(f).foreach(gso => {
gso.foreach(score => {
scores = scores.:+(score)
})
})
} else {
scores = scores.:+(
scoreInfo(
id = "",
score = 5
)
)
}
}
}
Future {
scores
}
}
For what you need, I would drop the mutable variable and replace foreach with map to obtain an immutable list of Futures and recover to handle exceptions, followed by a sequence like below:
def getScores(inputs: Inputs): Future[List[ScoreInfo]] = Future.sequence(
inputs.zipWithIndex.map{ case (input, idx) =>
if (idx == 0)
getGeoScore(input).map(_.getOrElse(defaultScore)).recover{ case e => errorHandling(e) }
else
Future.successful(ScoreInfo("", 5))
})
To capture/print the result, one way is to use onComplete:
getScores(inputs).onComplete(println)
The part your missing is understanding a tricky element of concurrency, and that is that the order of execution when using multiple futures is not guaranteed.
If your block here is long running, it will take a while before appending the score to scores
// long operation that returns Future[Option[scoreInfo]]
getgeoscore(f).foreach(gso => {
gso.foreach(score => {
// stick a println("here") in here to see what happens, for demonstration purposes only
scores = scores.:+(score)
})
})
Since that executes concurrently, your getscores function will also simultaneously continue its work iterating over the rest of inputs in your zipWithindex. This iteration, especially since it's trivial work, likely finishes well before the long-running getgeoscore(f) completes the execution of the Future it scheduled, and the code will exit the function, moving on to whatever code is next after you called getscores
val futureScores: Future[Seq[scoreInfo]] = getScores(inputs)
futureScores.onComplete{
case Success(scoreInfoSeq) => println(s"Here's the scores: ${scoreInfoSeq.mkString(",")}"
}
//a this point the call to getgeoscore(f) could still be running and finish later, but you will never know
doSomeOtherWork()
Now to clean this up, since you can run a zipWithIndex on your inputs parameter, I assume you mean it's something like a inputs:Seq[Input]. If all you want to do is operate on the first input, then use the head function to only retrieve the first option, so getgeoscores(inputs.head) , you don't need the rest of the code you have there.
Also, as a note, if using Scala, get out of the habit of using mutable vars, especially if you're working with concurrency. Scala is built around supporting immutability, so if you find yourself wanting to use a var , try using a val and look up how to work with the Scala's collection library to make it work.
In general, that is when you have several concurrent futures, I would say Leo's answer describes the right way to do it. However, you want only the first element transformed by a long running operation. So you can use the future return by the respective function and append the other elements when the long running call returns by mapping the future result:
def getscores(inputs: Inputs): Future[Seq[ScoreInfo]] =
getgeoscore(inputs.head)
.map { optInfo =>
optInfo ++ inputs.tail.map(_ => scoreInfo(id = "", score = 5))
}
So you neither need zipWithIndex nor do you need an additional future or join the results of several futures with sequence. Mapping the future just gives you a new future with the result transformed by the function passed to .map().

RxJS interleaving merged observables (priority queue?)

UPDATE
I think I've figured out the solution. I explain it in this video. Basically, use timeoutWith, and some tricks with zip (within zip).
https://youtu.be/0A7C1oJSJDk
If I have a single observable like this:
A-1-2--B-3-4-5-C--D--6-7-E
I want to put the "numbers" as lower priority; it should wait until the "letters" is filled up (a group of 2 for example) OR a timeout is reached, and then it can emit. Maybe the following illustration (of the desired result) can help:
A------B-1-----C--D-2----E-3-4-5-6-7
I've been experimenting with some ideas... one of them: first step is to split that stream (groupBy), one containing letters, and the other containing numbers..., then "something in the middle" happen..., and finally those two (sub)streams get merged.
It's that "something in the middle" what I'm trying to figure out.
How to achieve it? Is that even possible with RxJS (ver 5.5.6)? If not, what's the closest one? I mean, what I want to avoid is having the "numbers" flooding the stream, and not giving enough chance for the "letters" to be processed in timely manner.
Probably this video I made of my efforts so far can clarify as well:
Original problem statement: https://www.youtube.com/watch?v=mEmU4JK5Tic
So far: https://www.youtube.com/watch?v=HWDI9wpVxJk&feature=youtu.be
The problem with my solution so far (delaying each emission in "numbers" substream using .delay) is suboptimal, because it keeps clocking at slow pace (10 seconds) even after the "characters" (sub)stream has ended (not completed -- no clear boundary here -- just not getting more value for indeterminate amount of time). What I really need is, to have the "numbers" substream raise its pace (to 2 seconds) once that happen.
Unfortunately I don't know RxJs5 that much and use xstream myself (authored by one of the contributor to RxJS5) which is a little bit simpler in terms of the number of operators.
With this I crafted the following example:
(Note: the operators are pretty much the same as in Rx5, the main difference is with flatten wich is more or less like switch but seems to handle synchronous streams differently).
const xs = require("xstream").default;
const input$ = xs.of("A",1,2,"B",3,4,5,"C","D",6,7,"E");
const initialState = { $: xs.never(), count: 0, buffer: [] };
const state$ = input$
.fold((state, value) => {
const t = typeof value;
if (t === "string") {
return {
...state,
$: xs.of(value),
count: state.count + 1
};
}
if (state.count >= 2) {
const l = state.buffer.length;
return {
...state,
$: l > 0 ? xs.of(state.buffer[0]) : xs.of(value) ,
count: 0,
buffer: state.buffer.slice(1).concat(value)
};
}
return {
...state,
$: xs.never(),
buffer: state.buffer.concat(value),
};
}, initialState);
xs
.merge(
state$
.map(s => s.$),
state$
.last()
.map(s => xs.of.apply(xs, s.buffer))
)
.flatten()
.subscribe({
next: console.log
});
Which gives me the result you are looking for.
It works by folding the stream on itself, looking at the type of values and emitting a new stream depending on it. When you need to wait because not enough letters were dispatched I emit an emptystream (emits no value, no errors, no complete) as a "placeholder".
You could instead of emitting this empty stream emit something like
xs.empty().endsWith(xs.periodic(timeout)).last().mapTo(value):
// stream that will emit a value only after a specified timeout.
// Because the streams are **not** flattened concurrently you can
// use this as a "pending" stream that may or may not be eventually
// consumed
where value is the last received number in order to implement timeout related conditions however you would then need to introduce some kind of reflexivity with either a Subject in Rx or xs.imitate with xstream because you would need to notify your state that your "pending" stream has been consumed wich makes the communication bi-directionnal whereas streams / observables are unidirectionnal.
The key here the use of timeoutWith, to switch to the more aggresive "pacer", when the "events" kicks in. In this case the "event" is "idle detected in the higher-priority stream".
The video: https://youtu.be/0A7C1oJSJDk

Need help - How to loop through a list and/or a map

Scala is pretty new for me and I have problems as soon as a leave the gatling dsl.
In my case I call an API (Mailhog) which responds with a lot of mails in json-format. I can’t grab all the values.
I need it with “jsonPath” and I need to “regex” as well.
That leads into a map and a list which I need to iterate through and save each value.
.check(jsonPath("$[*]").ofType[Map[String,Any]].findAll.saveAs("id_map"))
.check(regex("href=3D\\\\\"(.*?)\\\\\"").findAll.saveAs("url_list"))
At first I wanted to loop the “checks” but I did’nt find any to repeat them without repeating the “get”-request too. So it’s a map and a list.
1) I need every value of the map and was able to solve the problem with the following foreach loop.
.foreach("${id_map}", "idx") {
exec(session => {
val idMap = session("idx").as[Map[String,Any]]
val ID = idMap("ID")
session.set("ID", ID)
})
.exec(http("Test")
.get("/{ID}"))
})}
2) I need every 3rd value of the list and make a get-request on them. Before I can do this, I need to replace a part of the string. I tried to replace parts of the string while checking for them. But it won’t work with findAll.
.check(regex("href=3D\\\\\"(.*?)\\\\\"").findAll.transform(raw => raw.replace("""=\r\n""","")).saveAs("url"))
How can I replace a part of every string in my list?
also how can I make a get-request on every 3rd element in the list.
I can't get it to work with the same foreach structure above.
I was abole to solve the problem by myself. At first I made a little change to my check(regex ...) part.
.check(regex("href=3D\\\\\"(.*?)\\\\\"").findAll.transform(_.map(raw => raw.replace("""=\r\n""",""))).saveAs("url_list"))
Then I wanted to make a Get-Request only on every third element of my list (because the URLs I extracted appeared three times per Mail).
.exec(session => {
val url_list =
session("url_list").as[List[Any]].grouped(3).map(_.head).toList
session.set("url_list", url_list)
})
At the end I iterate through my final list with a foreach-loop.
foreach("${url_list}", "urls") {
exec(http("Activate User")
.get("${urls}")
)
}