Can Vertx executeBlocking be used on a list?

Trying to process a list of long-running jobs in a Vert.x way.
One would hope one could do something like:
use executeBlocking to process each long-running job asynchronously
use a CompositeFuture to wait for the futures to complete
I'm aware the approach below does not work: the list of Futures is never populated before the code drops into the CompositeFuture.
Is there an executeBlocking approach, or does one have to use the event bus or the Vert.x utilities that support lists?
java.util.ArrayList futureList = new ArrayList()
for (i = 0; i < 100; i++) {
    vertx.executeBlocking({ future ->
        int id = i
        println "Running " + id
        java.lang.Thread.sleep(1000)
        println "Thread done " + id
        future.complete()
    }, true, { res ->
        if (res.succeeded()) {
            print "."
        } else {
            print "x"
        }
    })
}
CompositeFuture.join(futureList).setHandler({ ar ->
    if (ar.succeeded()) {
        System.err.println "all threads should be done.."
    }
})
Results in "all threads should be done.." printing early:
Running 84
Running 87
Running 87
Running 95
all threads should be done..
done.
Thread done 3
Thread done 36
Thread done 3
Thread done 0

In your example, futureList is empty so CompositeFuture.join(futureList) is completed immediately.
Change your example like this:
java.util.ArrayList futureList = new ArrayList()
for (i = 0; i < 100; i++) {
    Future jobFuture = Future.future()
    futureList.add(jobFuture)
    vertx.executeBlocking({ future ->
        int id = i
        println "Running " + id
        java.lang.Thread.sleep(1000)
        println "Thread done " + id
        future.complete()
    }, true, { res ->
        if (res.succeeded()) {
            print "."
        } else {
            print "x"
        }
        jobFuture.complete()
    })
}
Notice the jobFuture creation:
Future jobFuture = Future.future()
futureList.add(jobFuture)
As well as completion:
jobFuture.complete()
Now the CompositeFuture.join(futureList) handler will be executed only after all jobs complete.
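For comparison, here is a minimal Java sketch of the same pattern, assuming a newer Vert.x (4.x) where executeBlocking returns the job's Future directly, so the list can be filled from the return values:
import io.vertx.core.CompositeFuture;
import io.vertx.core.Future;
import io.vertx.core.Vertx;
import java.util.ArrayList;
import java.util.List;

public class BlockingJobs {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        List<Future> futureList = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            int id = i;
            // In Vert.x 4, executeBlocking returns the Future of the blocking job
            Future<Void> jobFuture = vertx.executeBlocking(promise -> {
                System.out.println("Running " + id);
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("Thread done " + id);
                promise.complete();
            }, false); // false = jobs may run in parallel on the worker pool
            futureList.add(jobFuture);
        }
        CompositeFuture.join(futureList).onComplete(ar -> {
            System.err.println("all threads should be done..");
            vertx.close();
        });
    }
}
Either way, the important part is that the list passed to CompositeFuture.join already contains one future per job before join is called.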

Related

Stateful Structured Spark Streaming: Timeout is not getting triggered

I've set the timeout duration to "2 minutes" as follows:
def updateAcrossEvents (tuple3: Tuple3[String, String, String], inputs: Iterator[R00tJsonObject],
                        oldState: GroupState[MyState]): OutputRow = {
  println("$$$$ Inside updateAcrossEvents with : " + tuple3._1 + ", " + tuple3._2 + ", " + tuple3._3)
  var state: MyState = if (oldState.exists) oldState.get else MyState(tuple3._1, tuple3._2, tuple3._3)
  if (oldState.hasTimedOut) {
    println("##### oldState has timed out ####")
    // Logic to Write OutputRow
    OutputRow("some values here...")
  } else {
    for (input <- inputs) {
      state = updateWithEvent(state, input)
      oldState.update(state)
      oldState.setTimeoutDuration("2 minutes")
    }
    OutputRow(null, null, null)
  }
}
I have also specified ProcessingTimeTimeout in 'mapGroupsWithState' as follows...
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)
But 'hasTimedOut' is never true so I don't get any output! What am I doing wrong?
It seems the timeout only fires if input data keeps flowing. I had stopped the input job because I had enough data, but timeouts are only evaluated when a new micro-batch is processed, so with no incoming data hasTimedOut never becomes true. Not sure why it's designed that way; it makes unit/integration tests a bit harder to write, but I am sure there's a reason for the design. Thanks.

How to use doOnNext, doOnSubscribe and doOnComplete?

New to RxJava2/RxAndroid and Android development, but am pretty familiar with Java.
However, I've run into quite a roadblock when trying to "optimize" and update the UI between a bunch of calls to the same resource.
My code is as follows:
private int batch = 0;
private int totalBatches = 0;
private List<ItemInfo> apiRetItems = new ArrayList<>();

private Observable<ItemInfo[]> apiGetItems(int[] ids) {
    int batchSize = 100;
    return Observable.create(emitter -> {
        int[] idpart = new int[0];
        for (int i = 0; i < ids.length; i += batchSize) {
            batch++;
            idpart = Arrays.copyOfRange(ids, i, Math.min(ids.length, i + batchSize));
            ItemInfo[] items = client.items().get(idpart);
            emitter.onNext(items);
        }
        emitter.onComplete();
    }).doOnSubscribe(__ -> {
        Log.d("GW2DB", "apiGetItems subscribed to with " + ids.length + " ids.");
        totalBatches = (int) Math.ceil(ids.length / batchSize);
        progressbarUpdate(0, totalBatches);
    }).doOnNext(items -> {
        Log.d("GW2DB", batch + " batches of " + totalBatches + " batches completed.");
        progressbarUpdate(batch, totalBatches);
    }).doOnComplete(() -> {
        Log.d("GW2DB", "Fetching items completed!");
        progressbarReset();
    });
}
If I remove the doOnSubscribe, doOnNext and doOnComplete I get no errors in Android Studio, but if I use any of them I get Incompatible types. Required: Observable<[...].ItemInfo[]>. Found: Observable<java.lang.Object>
I'm using RxAndroid 2.1.1 and RxJava 2.2.16.
Any ideas?
Since you are adding a chain of method calls, the compiler is just unable to correctly guess the type for the generic parameter in Observable.create. You can set it explicitly using Observable.<ItemInfo[]>create(...).
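As a minimal, self-contained illustration (using String[] in place of the question's ItemInfo[]), the explicit type witness keeps the chained type as Observable<String[]> all the way through the doOnXxx calls:
import io.reactivex.Observable;

public class TypeWitnessDemo {
    public static void main(String[] args) {
        // Without <String[]> the chain would be inferred as Observable<Object>
        Observable<String[]> source = Observable.<String[]>create(emitter -> {
                emitter.onNext(new String[] {"a", "b"});
                emitter.onComplete();
            })
            .doOnSubscribe(d -> System.out.println("subscribed"))
            .doOnNext(items -> System.out.println(items.length + " items"))
            .doOnComplete(() -> System.out.println("done"));
        source.subscribe();
    }
}
Declaring the lambda parameter type explicitly, e.g. (ObservableEmitter<ItemInfo[]> emitter) -> { ... }, should also give the compiler enough information to infer the type.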

Efficient way to optimise a Scala code to read large file that doesn't fit in memory

Problem Statement Below,
We have a large log file which stores user interactions with an application. The entries in the log file follow the following schema: {userId, timestamp, actionType} where actionType is one of two possible values: [open, close]
Constraints:
The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.
Code has to be able to run on a single machine.
Should not use an out-of-the box implementation of mapreduce or 3rd party database; don’t assume we have a Hadoop or Spark or other distributed computing framework.
There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.
Timestamps will come in strictly ascending order.
For this problem, we need to implement a class/classes that computes the average time spent by each user between open and close. Keep in mind that there are missing entries for some users, so we will have to make a choice about how to handle these entries when making our calculations. Code should follow a consistent policy with regards to how we make that choice.
The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.
Sample log file (comma-separated, text file)
1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
Approach
Below is the code I've written in Scala and Python. It does not seem efficient or up to the expectations of the given scenario, so I'd like feedback from the developer community in this forum on how we could better optimise this code for the scenario described.
Scala implementation
import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
  }
  val userMetrics = new UserMetrics()
  userMetrics.readInputFile(args(0), if (args.length == 1) 600000 else args(1).toInt)
}

case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {
  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr: String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr);
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine();
        processInput(line, timeOut)
      }
      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo: UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence > 0) userInfo.timeSpent / userInfo.occurence else 0
        println("{" + key + "," + timespent + "}")
      }
      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")
    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()
    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)
    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus
      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if (prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if (lsUserInfo.size() == 0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }
}
Python Implementation
import sys

def fileBlockStream(fp, number_of_blocks, block):
    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.
    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given
    assert 0 < number_of_blocks
    fp.seek(0,2) #seek to end of file to compute block size
    file_size = fp.tell()
    ini = file_size * block / number_of_blocks #compute start & end point of file block
    end = file_size * (1 + block) / number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini-1)
        fp.readline()
    while fp.tell() < end:
        yield fp.readline() #iterate over lines of the particular chunk or block

def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):
    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close":
            #Check if already existing userID with expected Close event 0 - Open; 1 - Close
            #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":
            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime
            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":
            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][openTmPos]=openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif curEvent == "open":
            #Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]
        elif curEvent == "close":
            #Initialize userid with missing handler function since there is no Open event for this User
            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]

def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut,curTimeVal
    else:
        return lastTimeVal - curTimeVal,curTimeVal

def computeAvg(avgTimeSpentDict,defaultTimeOut):
    resDict = {}
    for k,v in avgTimeSpentDict.iteritems():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1]/v[0]
    return resDict

if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)
    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print (computeAvg(avgTimeSpentDict,defaultTimeOut))
    avgTimeSpentDict.clear() #Nullify dictionary
    fileObj.close() #Close the file object
Both programs above give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementations.
Thanks in Advance!!
What you are after is iterator usage. I'm not going to rewrite your code, but the trick here is likely to be using an iterator. Fortunately, Scala provides decent out-of-the-box tooling for the job.
import scala.io.Source

object ReadBigFiles {
  def read(fileName: String): Unit = {
    val lines: Iterator[String] = Source.fromFile(fileName).getLines
    // now you get iterator semantics for the file line traversal
    // that means you can only go through the lines once, but you don't incur a penalty on heap usage
  }
}
For your use case, you seem to require a lastUser, so you're dealing with groups of 2 entries. I think you have two choices: either go for iterator.sliding(2), which will produce an iterator for every pair, or simply add recursion to the mix using Options.
def navigate(source: Iterator[String], last: Option[User]): ResultType = {
  if (source.hasNext) {
    val current = parseUser(source.next()) // parseUser is a placeholder for your own line-parsing logic
    last match {
      case Some(existing) => // compare with previous user etc
      case None => navigate(source, Some(current))
    }
  } else {
    // exit recursion, return result
  }
}
You can avoid all the code you've written to read the file and so on. If you need to count occurrences, simply build a Map inside your recursion, and increment the occurrences at every step based on your business logic.
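The same single-pass idea can also be written in plain Java, for comparison. This is only a minimal sketch, and it hard-codes one possible missing-entry policy: a repeated open replaces the pending one, and a close without a pending open is ignored.
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class AverageSessionTime {
    public static void main(String[] args) throws Exception {
        // per user: [0] = total time across open/close pairs, [1] = number of pairs, [2] = pending open timestamp (-1 = none)
        Map<String, long[]> acc = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {   // one line at a time, so memory use stays constant
                String[] parts = line.split(",");
                if (parts.length != 3) continue;
                long ts = Long.parseLong(parts[1]);
                long[] state = acc.computeIfAbsent(parts[0], k -> new long[] {0, 0, -1});
                if (parts[2].equals("open")) {
                    state[2] = ts;                          // a second open replaces the pending one
                } else if (state[2] >= 0) {
                    state[0] += ts - state[2];              // close with a pending open: record the interval
                    state[1]++;
                    state[2] = -1;
                }                                           // close without a pending open is ignored
            }
        }
        acc.forEach((user, s) ->
            System.out.println("{" + user + "," + (s[1] == 0 ? 0 : s[0] / s[1]) + "}"));
    }
}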
from queue import LifoQueue, Queue

# Another take in Python: stack each user's events, then average (close - open) per user
def averageTime() -> float:
    logs = {}
    records = Queue()
    with open("log.txt") as fp:
        for line in fp.readlines():
            userId, timestamp, action = line.strip().split(",")
            if userId not in logs:
                logs[userId] = LifoQueue()
            logs[userId].put((int(timestamp), action))
    for k in logs:
        somme = 0
        count = 0
        while not logs[k].empty():
            l = logs[k].get()
            # subtract open timestamps, add close timestamps, so each pair contributes (close - open)
            somme = (somme - l[0]) if l[1] == "open" else (somme + l[0])
            count = count + 1
        if count // 2 > 0:
            records.put([k, somme, count // 2])
    while not records.empty():
        record = records.get()
        print(f"UserId={record[0]} Avg={record[1]/record[2]}")

Using QueryResults does not return anything

I am trying to count the number of objects by using QueryResults. My rule is:
query "Test"
m : Message()
end
function int countFacts(String queryString) {
QueryResults queryResults = DroolsTest.getQueryResults(queryString);
if (queryResults != null) {
System.out.println("Total FACTS found: " + queryResults.size());
return queryResults.size();
}
return 0;
}
rule "Hello World"
when
m : Message( status == Message.HELLO, myMessage : message )
eval(countFacts("Test")>0 )
then
System.out.println( myMessage );
end
And on the Java side:
public static QueryResults getQueryResults(String queryName) {
    System.out.println("inside queryResults for queryName " + queryName);
    QueryResults queryResults = kSession.getQueryResults(queryName);
    return queryResults;
}
When I try to run the rule, execution stops and nothing happens.
The kSession.getQueryResults(queryName) call never returns, and after some time I have to terminate the execution manually.
What is wrong here?
I think the issue here is your thread being blocked. If your requirement is to count the number of facts in your session as part of a rule, then you can do it in a much more "Drools-friendly" way using an accumulate:
rule "More than 2 HELLO messages"
when
Number(longValue > 2) from accumulate (
Message( status == Message.HELLO),
count(1)
)
then
System.out.println( myMessage );
end
If you really want to call a query from the when part of a rule, you will need to pass an unbound variable to the query in order to get the result back in your rule:
query countMessages(long $n)
    Number( $n := longValue ) from accumulate (
        Message(),
        count(1)
    )
end

rule "Hello World"
when
    m : Message( status == Message.HELLO, myMessage : message )
    countMessages($n;)
    eval( $n > 0 )
then
    System.out.println( m );
end
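If the results really do need to be read from plain Java rather than from within a rule, a minimal sketch would be to run the query only after fireAllRules() has returned (assuming the kSession and the "Test" query from the question):
import org.kie.api.runtime.KieSession;
import org.kie.api.runtime.rule.QueryResults;
import org.kie.api.runtime.rule.QueryResultsRow;

public class QueryAfterFiring {
    public static void printMessages(KieSession kSession) {
        kSession.fireAllRules();                              // let the engine finish firing first
        QueryResults results = kSession.getQueryResults("Test");
        System.out.println("Total FACTS found: " + results.size());
        for (QueryResultsRow row : results) {
            Object message = row.get("m");                    // "m" is the binding declared in the query
            System.out.println(message);
        }
    }
}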
Hope it helps,

How to execute map, filter, flatMap using multiple threads in RxScala/Java?

How to run filter, map and flatMap on Observable using multiple threads:
def withDelay[T](delay: Duration)(t: => T) = {
  Thread.sleep(delay.toMillis)
  t
}

Observable
  .interval(500 millisecond)
  .filter(x => {
    withDelay(1 second) { x % 2 == 0 }
  })
  .map(x => {
    withDelay(1 second) { x * x }
  }).subscribe(println(_))
The goal is to run filtering and transformation operations concurrently using multiple threads.
You can use Async.toAsync() on each operation.
It's in the rxjava-async package; see its documentation.
This will process each item of the collection on a different thread (RxJava 3):
var collect = Observable.fromIterable(Arrays.asList("A", "B", "C"))
        .flatMap(v -> {
            return Observable.just(v)
                    .observeOn(Schedulers.computation())
                    .map(v1 -> {
                        int time = ThreadLocalRandom.current().nextInt(1000);
                        Thread.sleep(time);
                        return String.format("processed-%s", v1);
                    });
        })
        .observeOn(Schedulers.computation())
        .blockingStream()
        .collect(Collectors.toList());
You have to use the observeOn operator, which makes all subsequent operators in the pipeline execute on the scheduler you specify:
/**
 * Once you set observeOn in your pipeline, all the next steps of your pipeline will be executed in another thread.
 * Shall print
 * First step main
 * Second step RxNewThreadScheduler-2
 * Third step RxNewThreadScheduler-1
 */
@Test
public void testObservableObserverOn() throws InterruptedException {
    Subscription subscription = Observable.just(1)
            .doOnNext(number -> System.out.println("First step " + Thread.currentThread()
                    .getName()))
            .observeOn(Schedulers.newThread())
            .doOnNext(number -> System.out.println("Second step " + Thread.currentThread()
                    .getName()))
            .observeOn(Schedulers.newThread())
            .doOnNext(number -> System.out.println("Third step " + Thread.currentThread()
                    .getName()))
            .subscribe();
    // keeps the test alive long enough for the asynchronous steps to run
    new TestSubscriber((Observer) subscription)
            .awaitTerminalEvent(100, TimeUnit.MILLISECONDS);
}
More async examples here https://github.com/politrons/reactive/blob/master/src/test/java/rx/observables/scheduler/ObservableAsynchronous.java
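To address the original question directly (running both the slow filter and the slow map off the main thread), one option is to move both operators into the per-item inner Observable, as in this minimal sketch (RxJava 3 assumed; note that flatMap does not preserve ordering):
import io.reactivex.rxjava3.core.Observable;
import io.reactivex.rxjava3.schedulers.Schedulers;
import java.util.concurrent.TimeUnit;

public class ParallelFilterMap {
    public static void main(String[] args) {
        Observable.interval(500, TimeUnit.MILLISECONDS)
                .take(10)
                .flatMap(x -> Observable.just(x)
                        .subscribeOn(Schedulers.computation())   // each element gets its own computation-pool thread
                        .filter(v -> { Thread.sleep(1000); return v % 2 == 0; })
                        .map(v -> { Thread.sleep(1000); return v * v; }))
                .blockingSubscribe(v ->
                        System.out.println(Thread.currentThread().getName() + " -> " + v));
    }
}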