Project Reactor: How does blockLast() work? - reactive-programming

From the documentation for blockLast():
Subscribe to this Flux and block indefinitely until the upstream signals its last value or completes. Returns that value, or null if the Flux completes empty. In case the Flux errors, the original exception is thrown (wrapped in a RuntimeException if it was a checked exception).
Let's say for an example code sample:
Flux
.range(0, 1000)
.doOnNext(i -> System.out.println("i = " + i + "Thread: " + Thread.currentThread().getName()))
.flatMap(i -> {
System.out.println("end"+ i + " Thread: " + Thread.currentThread().getName());
return Mono.just(i);
}).blockLast();
If I were to understand this based off the documentation's own description, I'd think blockLast means to block the publisher (in this case till all 1000 integers are emitted successfully, last one included).
After which .flatMap(..) is called, one at a time (since we don't specifically force parallel processing).
However I see the following in the console when run:
i = 0Thread: main
end0 Thread: main
i = 1Thread: main
end1 Thread: main
i = 2Thread: main
end2 Thread: main
i = 3Thread: main
end3 Thread: main
i = 4Thread: main
end4 Thread: main
i = 5Thread: main
Isn't i = 0Thread: main supposed to run up to i = 999Thread: main first, and only then .flatMap gets executed?
i.e.
i = 0Thread: main
i = 1Thread: main
i = 2Thread: main
i = 3Thread: main
i = 4Thread: main
.
.
end1 Thread: main
end2 Thread: main
end3 Thread: main
The behavior is exactly the same if .subscribe() is used. I'm kinda confused here.

The observed behaviour is expected. A Flux describes a sequence of operations that are executed as each element is emitted.
So, in your example, each integer generated by range is immediately processed by the next operation in the chain, i.e. flatMap here.
This is the same behaviour as with the standard java.util.stream.Stream API.
There are two reasons for this behaviour:
Avoid buffering all elements between processing steps.
A data source can emit an infinite number of messages, and it can emit them at any frequency (with a constant delay or not, very fast or very slow, etc.). So a stream API is designed to process and return each element as soon as it is received, independently of the messages before or after it.
As for blockLast specifically: internally, it subscribes to the Flux and waits for the completion or error signal, then returns the last value or throws the error to the caller.
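The same element-by-element interleaving can be reproduced with java.util.stream, which the answer compares Flux to. The following is only a sketch: peek and map stand in for doOnNext and flatMap, forEach stands in for blockLast()/subscribe(), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

class InterleavingDemo {
    // Runs a two-stage pipeline over 0..count-1 and records the order
    // in which each stage sees each element.
    static List<String> trace(int count) {
        List<String> log = new ArrayList<>();
        IntStream.range(0, count)
                .peek(i -> log.add("i = " + i))   // first stage (role of doOnNext)
                .map(i -> {                       // second stage (role of flatMap)
                    log.add("end" + i);
                    return i;
                })
                .forEach(i -> { });               // terminal operation pulls elements through
        return log;
    }

    public static void main(String[] args) {
        trace(3).forEach(System.out::println);
    }
}
```

Each element travels through every stage before the next element is produced, so the log alternates between the two stages instead of listing all "i = ..." lines first.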

Related

How does history replay work in cadence?

How does history replay work in cadence?
I have a workflow which calls two activities sequentially.
Say the first activity has completed and the second has 100 lines of code. If the app server restarts while executing the 50th line of activity2, does execution resume exactly from the 50th line? If yes, what magic is happening inside cadence?
@Override
public String composeGreeting(String greeting, String name) throws Exception {
FileWriter fw =
new FileWriter(
"/Users/kumble-004/Documents/Uber_Cadence/Sample_Projects/TestCadence/src/com/company/"+name+".txt");
System.out.println(
DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss").format(LocalDateTime.now())
+ " [Activity] started");
long time = System.currentTimeMillis() + 240000;
int i = 0, j=1;
while (System.currentTimeMillis() < time) {
if(i++ %10000000 == 0) {
fw.write("print - " + j++ + " " +
DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss").format(LocalDateTime.now()) +"\n");
}
}
fw.close();
System.out.println(
DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss").format(LocalDateTime.now())
+ " [Activity] ended");
return greeting + " " + name + "!";
}
}
I have the above code in my hello activity. This code runs for 4 minutes and writes data to a file whenever a condition is met.
I started a workflow and quit the cadence server after [Activity] started was printed. I didn't restart it, I just stopped it. But after 4 minutes [Activity] ended is printed in the console. I am wondering how this is possible, because I stopped the server but the code keeps executing and data is still being written to the file.
When I check it via the cadence UI, it shows that the last history event is
ActivityTaskStarted. Then I started my server. After 15 minutes (because scheduleToCloseTimeoutSeconds is 15 minutes) the activity returns with the event ActivityTaskTimedOut, and the whole workflow fails due to this timeout.
Kindly explain what happens when the cadence server is restarted.
If the app server restarts while executing the 50th line of activity2, does execution resume exactly from the 50th line?
No, it will not automatically resume from the 50th line of the activity.
Replay only happens for the workflow. It relies on the history to replay and rebuild the memory stack. Everything that happens in the workflow is stored in the history:
Every step in the workflow code generates a batch of results called a "Decision"
Activity/ChildWorkflow results
External events like Signals
Timers
Etc.
For more details, please refer to the doc about replay history and What exactly is a Cadence decision task?
But after 4 minutes [Activity] ended is printed in the console. I am wondering how this is possible, because I stopped the server but the code keeps executing and data is still being written to the file.
That's because your activity worker is still running. The code you are running is purely activity code.
However, the activity result cannot be reported to the server while the server is down. This means the history will not record it, and the workflow may schedule the activity again (if retry is enabled).
Please refer to the doc about activity timeout and retry
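To make the distinction concrete, here is a toy model of history replay in plain Java. It is not Cadence's API: ReplaySketch, executeActivity and the in-memory history list are all hypothetical stand-ins, and a real server also handles timeouts and retries that this sketch ignores. It only illustrates that completed activity results are read back from the history, while an interrupted activity has no recorded result and therefore starts again from its first line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class ReplaySketch {
    // Toy stand-in for the persisted history: one recorded result per completed activity.
    static final List<String> history = new ArrayList<>();
    // Counts how many times activity code was actually started.
    static int executions = 0;

    // Position in the history for the current run of the workflow function.
    private int next = 0;

    String executeActivity(Supplier<String> activity) {
        if (next < history.size()) {
            return history.get(next++);    // replay: reuse the recorded result, run nothing
        }
        executions++;
        String result = activity.get();    // may fail part-way; then nothing is recorded
        history.add(result);
        next++;
        return result;
    }

    // The workflow: two activities called sequentially.
    static String workflow(ReplaySketch ctx, boolean crashInSecond) {
        String first = ctx.executeActivity(() -> "Hello");
        String second = ctx.executeActivity(() -> {
            if (crashInSecond) {
                throw new IllegalStateException("worker died mid-activity");
            }
            return "World";
        });
        return first + " " + second;
    }

    static String runWithCrashAndRestart() {
        history.clear();
        executions = 0;
        try {
            workflow(new ReplaySketch(), true);       // first attempt dies inside activity 2
        } catch (IllegalStateException e) {
            // simulated crash: only activity 1 made it into the history
        }
        return workflow(new ReplaySketch(), false);   // restart: workflow code runs from the top
    }

    public static void main(String[] args) {
        System.out.println(runWithCrashAndRestart() + ", activity starts: " + executions);
    }
}
```

In this model the first activity's code runs once in total, and the second activity's code is started twice, each time from its beginning.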

How to use/control RxJava Observable.cache

I am trying to use the RxJava caching mechanism (RxJava2), but I can't figure out how it works or how I can control the cached contents with the cache operator.
I want to verify the cached data with some conditions before emitting the new data.
For example:
someObservable
.repeat()
.filter { it.age < maxAge }
.map { it.name }
.cache()
How can I check and filter the cached value, emit it if it passes, and request a new value if it does not?
Since the value changes periodically, I need to verify that the cache is still valid before I request a new one.
There is also the ObservableCache<T> class, but I can't find any resources on using it.
Any help would be much appreciated. Thanks.
This is not how replay/cache works. Please read the #replay/#cache documentation first.
replay
This operator returns a ConnectableObservable, which has some methods (#refCount/#connect/#autoConnect) for connecting to the source.
When #replay is applied without an overload, the source subscription is multicast and all values emitted since connection are replayed. The source subscription is lazy; the connection to the source is made via #refCount/#connect/#autoConnect.
Returns a ConnectableObservable that shares a single subscription to the underlying ObservableSource that will replay all of its items and notifications to any future Observer.
Applying #replay without any connect method (#refCount/#connect/#autoConnect) will not emit any values on subscription.
A Connectable ObservableSource resembles an ordinary ObservableSource, except that it does not begin emitting items when it is subscribed to, but only when its connect method is called.
replay(1)#autoConnect(-1) / #refCount(1) / #connect
Applying replay(1) will cache the last value and emit the cached value on each subscription. #autoConnect(-1) opens a connection immediately and keeps it open until a terminal event (onComplete, onError) happens. #refCount is similar, but disconnects from the source when all subscribers have gone. The #connect operator can be used when you need to wait until all subscriptions to the observable have been made, in order not to miss values.
usage
#replay(1) -- most of the time it should be used at the end of the observable.
sourceObs
.filter()
.map()
.replay(bufferSize)
.refCount(connectWhenXSubscribersSubscribed)
caution
Applying #replay without a buffer limit or expiration will lead to memory leaks when your observable is infinite.
cache / cacheWithInitialCapacity
These operators are similar to #replay with autoConnect(1). They cache every value and replay all of them on each subscription.
The operator subscribes only when the first downstream subscriber subscribes and maintains a single subscription towards this ObservableSource. In contrast, the operator family of replay() that return a ConnectableObservable require an explicit call to ConnectableObservable.connect().
Note: You sacrifice the ability to dispose the origin when you use the cache Observer so be careful not to use this Observer on ObservableSources that emit an infinite or very large number of items that will use up memory. A possible workaround is to apply takeUntil with a predicate or another source before (and perhaps after) the application of cache().
example
@Test
fun skfdsfkds() {
val create = PublishSubject.create<Int>()
val cacheWithInitialCapacity = create
.cacheWithInitialCapacity(1)
cacheWithInitialCapacity.subscribe()
create.onNext(1)
create.onNext(2)
create.onNext(3)
cacheWithInitialCapacity.test().assertValues(1, 2, 3)
cacheWithInitialCapacity.test().assertValues(1, 2, 3)
}
usage
Use the cache operator when you cannot control the connect phase.
This is useful when you want an ObservableSource to cache responses and you can't control the subscribe/dispose behavior of all the Observers.
caution
As with replay() the cache is unbounded and could lead to memory-leaks.
Note: The capacity hint is not an upper bound on cache size. For that, consider replay(int) in combination with ConnectableObservable.autoConnect() or similar.
further reading
https://blog.danlew.net/2018/09/25/connectable-observables-so-hot-right-now/
https://blog.danlew.net/2016/06/13/multicasting-in-rxjava/
If your event source (Observable) is an expensive operation, such as reading from a database, you shouldn't use Subject to observe the events, since that will repeat the expensive operation for each subscriber. Caching can also be risky with infinite streams due to "OutOfMemory" exceptions. A more appropriate solution may be ConnectableObservable, which only performs the source operation once, and broadcasts the updated value to all subscribers.
Here is a code sample. I didn't bother creating an infinite periodic stream or including error handling to keep the example simple. Let me know if it does what you need.
class RxJavaTest {
private final int maxValue = 50;
private final ConnectableObservable<Integer> source =
Observable.<Integer>create(
subscriber -> {
log("Starting Event Source");
subscriber.onNext(readFromDatabase());
subscriber.onNext(readFromDatabase());
subscriber.onNext(readFromDatabase());
subscriber.onComplete();
log("Event Source Terminated");
})
.subscribeOn(Schedulers.io())
.filter(value -> value < maxValue)
.publish();
void run() throws InterruptedException {
log("Starting Application");
log("Subscribing");
source.subscribe(value -> log("Subscriber 1: " + value));
source.subscribe(value -> log("Subscriber 2: " + value));
log("Connecting");
source.connect();
log("Application Terminated");
// Sleep to give the event source enough time to complete
sleep(4000);
}
private Integer readFromDatabase() throws InterruptedException {
// Emulate long database read time
log("Reading data from database...");
sleep(1000);
int randomValue = new Random().nextInt(2 * maxValue) + 1;
log(String.format("Read value: %d", randomValue));
return randomValue;
}
private static void log(Object message) {
System.out.println(
Thread.currentThread().getName() + " >> " + message
);
}
}
Here's the output:
main >> Starting Application
main >> Subscribing
main >> Connecting
main >> Application Terminated
RxCachedThreadScheduler-1 >> Starting Event Source
RxCachedThreadScheduler-1 >> Reading data from database...
RxCachedThreadScheduler-1 >> Read value: 88
RxCachedThreadScheduler-1 >> Reading data from database...
RxCachedThreadScheduler-1 >> Read value: 42
RxCachedThreadScheduler-1 >> Subscriber 1: 42
RxCachedThreadScheduler-1 >> Subscriber 2: 42
RxCachedThreadScheduler-1 >> Reading data from database...
RxCachedThreadScheduler-1 >> Read value: 37
RxCachedThreadScheduler-1 >> Subscriber 1: 37
RxCachedThreadScheduler-1 >> Subscriber 2: 37
RxCachedThreadScheduler-1 >> Event Source Terminated
Note the following:
Events only start firing once connect() is called on the source, not when observers subscribe to the source.
Database calls are only made once per event update
Filtered values are not emitted to subscribers
All subscribers are executed in the same thread
Application terminates before the events are processed due to concurrency. Normally your app will run in an event loop, so your app will remain responsive during slow operations.

SubscribeOn does not change the thread pool for the whole chain

I want to trigger a longer-running operation via a REST request and WebFlux. The call should just return a notice that the operation has started. I want to run the long-running operation on a different scheduler (e.g. Schedulers.single()). To achieve that I used subscribeOn:
Mono<RecalculationRequested> recalculateAll() {
return provider.size()
.doOnNext(size -> log.info("Size: {}", size))
.doOnNext(size -> recalculate(size))
.map(RecalculationRequested::new);
}
private void recalculate(int toRecalculateSize) {
Mono.just(toRecalculateSize)
.flatMapMany(this::toPages)
.flatMap(page -> recalculate(page))
.reduce(new RecalculationResult(), RecalculationResult::increment)
.subscribeOn(Schedulers.single())
.subscribe(result -> log.info("Result of recalculation - success:{}, failed: {}",
result.getSuccess(), result.getFailed()));
}
private Mono<RecalculationResult> recalculate(RecalculationPage pageToRecalculate) {
return provider.findElementsToRecalculate(pageToRecalculate.getPageNumber(), pageToRecalculate.getPageSize())
.flatMap(this::recalculateSingle)
.reduce(new RecalculationResult(), RecalculationResult::increment);
}
private Mono<RecalculationResult> recalculateSingle(ElementToRecalculate elementToRecalculate) {
return recalculationTrigger.recalculate(elementToRecalculate)
.doOnNext(result -> {
log.info("Finished recalculation for element: {}", elementToRecalculate);
})
.doOnError(error -> {
log.error("Error during recalculation for element: {}", elementToRecalculate, error);
});
}
From the above I want to call:
private void recalculate(int toRecalculateSize)
in a different thread. However, it does not run on the single scheduler; it uses a different thread pool. I would expect subscribeOn to change it for the whole chain. What should I change, and why, to execute it on the single scheduler?
Just to mention - method:
provider.findElementsToRecalculate(...)
uses WebClient to get elements.
One caveat of subscribeOn is it does what it says: it runs the act of "subscribing" on the provided Scheduler. Subscribing flows from bottom to top (the Subscriber subscribes to its parent Publisher), at runtime.
Usually you see in documentation and presentations that subscribeOn affects the whole chain. That is because most operators / sources will not themselves change threads, and by default will start sending onNext/onComplete/onError signals from the thread from which they were subscribed to.
But as soon as one operator switches threads in that top-to-bottom data path, the reach of subscribeOn stops there. Typical example is when there is a publishOn in the chain.
The source of data in this case is reactor-netty and netty, which operate on their own threads and thus act as if there was a publishOn at the source.
For WebFlux, I'd say favor using publishOn in the main chain of operators, or alternatively use subscribeOn inside of inner chains, like inside flatMap.
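The effect can be imitated without Reactor. The sketch below is an analogy only: subscribe is a hypothetical callback-style source that, like a network client, delivers data on a thread it owns, and the single-thread executor plays the role of Schedulers.single(). The thread names "single-1" and "source-io" are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class SubscribeThreadSketch {
    // Hypothetical source: whatever thread calls subscribe, data arrives on "source-io".
    static void subscribe(Consumer<String> onNext) {
        Thread io = new Thread(() -> onNext.accept("payload"), "source-io");
        io.start();
    }

    // Returns { thread that performed the subscription, thread that received the element }.
    static String[] run() {
        ExecutorService single =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "single-1"));
        CompletableFuture<String> subscribedOn = new CompletableFuture<>();
        CompletableFuture<String> receivedOn = new CompletableFuture<>();
        // Analogue of subscribeOn: only the act of subscribing runs on "single-1".
        single.execute(() -> {
            subscribedOn.complete(Thread.currentThread().getName());
            subscribe(item -> receivedOn.complete(Thread.currentThread().getName()));
        });
        try {
            return new String[] { subscribedOn.join(), receivedOn.join() };
        } finally {
            single.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] threads = run();
        System.out.println("subscribed on " + threads[0] + ", received on " + threads[1]);
    }
}
```

The subscription happens on the chosen thread, but the element is handled on the source's own thread, which is why downstream operators after such a source need an explicit thread switch.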
As per the documentation, all operators prefixed with doOn are sometimes referred to as having a “side-effect”. They let you peek inside the sequence’s events without modifying them.
If you want to chain the 'recalculate' step after 'provider.size()' do it with flatMap.

Variable Multithread Access - Corruption

In a nutshell:
I have one counter variable that is accessed from many threads. Although I've implemented multi-thread read/write protections, the variable seems to still -in an inconsistent way- get written to simultaneously, leading to incorrect results from the counter.
Getting into the weeds:
I'm using a "for loop" that triggers roughly 100 URL requests in the background, each in its “DispatchQueue.global(qos: .userInitiated).async” queue.
These processes are async; once they finish they update a “counter” variable. This variable is supposed to be multi-thread protected, meaning it’s always accessed from one thread and it’s accessed synchronously. However, something is wrong: from time to time the variable is accessed simultaneously by two threads, leading to the counter not updating correctly. Here's an example; let's imagine we have 5 URLs to fetch:
We start with the Counter variable at 5.
1 URL Request Finishes -> Counter = 4
2 URL Request Finishes -> Counter = 3
3 URL Request Finishes -> Counter = 2
4 URL Request Finishes (and for some reason – I assume variable is accessed at the same time) -> Counter = 2
5 URL Request Finishes -> Counter = 1
As you can see, this leads to the counter being 1, instead of 0, which then affects other parts of the code. This error happens inconsistently.
Here is the multi-thread protection I use for the counter variable:
Dedicated Global Queue
// Background queue to synchronize data access
fileprivate let globalBackgroundSyncronizeDataQueue = DispatchQueue(label: "globalBackgroundSyncronizeSharedData")
Variable is always accessed via accessor:
var numberOfFeedsToFetch_Value: Int = 0
var numberOfFeedsToFetch: Int {
set (newValue) {
globalBackgroundSyncronizeDataQueue.sync() {
self.numberOfFeedsToFetch_Value = newValue
}
}
get {
return globalBackgroundSyncronizeDataQueue.sync {
numberOfFeedsToFetch_Value
}
}
}
I assume I may be missing something but I've used profiling and all seems to be good, also checked the documentation and I seem to be doing what they recommend. Really appreciate your help.
Thanks!!
Answer from Apple Forums (https://forums.developer.apple.com/message/322332#322332):
The individual accessors are thread safe, but an increment operation
isn't atomic given how you've written the code. That is, while one
thread is getting or setting the value, no other threads can also be
getting or setting the value. However, there's nothing preventing
thread A from reading the current value (say, 2), thread B reading the
same current value (2), each thread adding one to this value in their
private temporary, and then each thread writing their incremented
value (3 for both threads) to the property. So, two threads
incremented but the property did not go from 2 to 4; it only went from
2 to 3. You need to do the whole increment operation (get, increment
the private value, set) in an atomic way such that no other thread can
read or write the property while it's in progress.
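The question's code is Swift, but the principle is language-independent. The following is a sketch in Java rather than the poster's actual fix: FeedCounter and its methods are illustrative names, and synchronized plays the role of the serial dispatch queue. It contrasts a get-then-set update, which can lose updates, with a decrement that performs the whole read-modify-write under one lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class FeedCounter {
    private int value;

    FeedCounter(int initial) {
        value = initial;
    }

    // Individually thread-safe accessors, like the queue-protected getter and setter.
    synchronized int get() { return value; }
    synchronized void set(int newValue) { value = newValue; }

    // NOT atomic: another thread can run between get() and set(), so an update can be lost.
    void racyDecrement() { set(get() - 1); }

    // Atomic: the read, the subtraction and the write all happen while holding the same lock.
    synchronized int decrement() { return --value; }

    // Starts at `tasks`, performs `tasks` concurrent atomic decrements, returns the final value.
    static int runAtomicDecrements(int tasks) {
        FeedCounter counter = new FeedCounter(tasks);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < tasks; i++) {
            pool.execute(counter::decrement);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runAtomicDecrements(1000));
    }
}
```

In the Swift code the equivalent change would be to expose a single decrement operation that does the read and the write inside one sync block on the queue, instead of calling the getter and then the setter.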

Python-Multithreading Time Sensitive Task

from random import randrange
from time import sleep
#import thread
from threading import Thread
from Queue import Queue
'''The idea is that there is a Seeker method that would search a location
for tasks, I have no idea how many tasks there will be, could be 1 could be 100.
Each task needs to be put into a thread, does its thing and finishes. I have
stripped down a lot of what this is really supposed to do just to focus on the
correct queuing and threading aspect of the program. The locking was just
me experimenting with locking'''
class Runner(Thread):
    current_queue_size = 0
    def __init__(self, queue):
        self.queue = queue
        data = queue.get()
        self.ID = data[0]
        self.timer = data[1]
        #self.lock = data[2]
        Runner.current_queue_size += 1
        Thread.__init__(self)
    def run(self):
        #self.lock.acquire()
        print "running {ID}, will run for: {t} seconds.".format(ID = self.ID,
                                                                t = self.timer)
        print "Queue size: {s}".format(s = Runner.current_queue_size)
        sleep(self.timer)
        Runner.current_queue_size -= 1
        print "{ID} done, terminating, ran for {t}".format(ID = self.ID,
                                                           t = self.timer)
        print "Queue size: {s}".format(s = Runner.current_queue_size)
        #self.lock.release()
        sleep(1)
        self.queue.task_done()
def seeker():
    '''Gathers data that would need to enter its own thread.
    For now it just uses a count and random numbers to assign
    both a task ID and a time for each task'''
    queue = Queue()
    queue_item = {}
    count = 1
    #lock = thread.allocate_lock()
    while (count <= 40):
        random_number = randrange(1,350)
        queue_item[count] = random_number
        print "{count} dict ID {key}: value {val}".format(count = count, key = random_number,
                                                          val = random_number)
        count += 1
    for n in queue_item:
        #queue.put((n,queue_item[n],lock))
        queue.put((n,queue_item[n]))
        '''I assume it is OK to put a tuple in and pull it out later'''
        worker = Runner(queue)
        worker.setDaemon(True)
        worker.start()
    worker.join()
    '''Which one of these is necessary and why? The queue object
    joining or the thread object'''
    #queue.join()
if __name__ == '__main__':
    seeker()
I have put most of my questions in the code itself, but to go over the main points (Python2.7):
I want to make sure I am not creating some massive memory leak for myself later.
I have noticed that when I run it at a count of 40 in putty or VNC on
my linuxbox that I don't always get all of the output, but when
I use IDLE and Aptana on windows, I do.
Yes, I understand that the point of Queue is to stagger your
threads so you are not flooding your system's memory, but the tasks at
hand are time sensitive, so they need to be processed as soon as they
are detected regardless of how many or how few there are; I have
found that when I have a Queue I can clearly dictate when a task has
finished as opposed to letting the garbage collector guess.
I still don't know why I am able to get away with using either the
.join() on the thread or queue object.
Tips, tricks, general help.
Thanks for reading.
If I understand you correctly you need a thread to monitor something to see if there are tasks that need to be done. If a task is found you want that to run in parallel with the seeker and other currently running tasks.
If this is the case then I think you might be going about this wrong. Take a look at how the GIL works in Python. I think what you might really want here is multiprocessing.
Take a look at this from the pydocs:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.