publisher.collect(<#T##strategy: Publishers.TimeGroupingStrategy<Scheduler>##Publishers.TimeGroupingStrategy<Scheduler>#>)
I couldn't find any example anywhere, and the documentation is bland... the free Using Combine book has nothing interesting either.
In Xcode 11.3, code completion doesn't work especially well here, but the format isn't too complex. There are two options (as of iOS 13.3) published in the TimeGroupingStrategy enumeration:
byTime
byTimeOrCount
When you specify either strategy, you also need to specify a scheduler for it to run on, which is one of the parameters to those enumeration cases.
For example, to collect by time, with a 1.0 second interval of collection, using a DispatchQueue, you might use:
let q = DispatchQueue(label: self.debugDescription)
let cancellable = publisher
.collect(.byTime(q, 1.0))
The byTime version may use an unbounded amount of memory during the specified interval, since it buffers however many values the upstream publisher provides.
The byTimeOrCount takes an additional count parameter which puts an upper bound on the number of items collected before the buffered collection is sent to subscribers.
For example, the same code with a 10 item max buffer size:
let q = DispatchQueue(label: self.debugDescription)
let cancellable = publisher
.collect(.byTimeOrCount(q, 1.0, 10))
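If it helps, here is a self-contained sketch (the timer interval and queue label are arbitrary choices, not from the original question): a Timer publisher emits a Date every 0.2 seconds, and collect(.byTime) batches whatever arrived during each 1-second window into a single array.
import Combine
import Foundation
let q = DispatchQueue(label: "collect.example")
let cancellable = Timer.publish(every: 0.2, on: .main, in: .common)
    .autoconnect()
    .collect(.byTime(q, 1.0))
    .sink { dates in
        // Each value is the array of Dates emitted in the last second,
        // typically 4-5 elements with this timing.
        print("batch of \(dates.count): \(dates)")
    }
Run in a playground, this prints one array roughly every second; keep the cancellable alive for as long as you want values.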
You can see more specific examples of the code being used, and unit tests verifying how they're operating, in the Using Combine project:
testCollectByTime
testCollectByTimeOrCount
Related
I am looking to implement concurrency inside part of my app in order to speed up processing. The input array can be large, and I need to check multiple things related to it. Here is some sample code.
EDITED:
This is helpful for striding through the array, which was something else I was looking at doing, but I think the answers are sliding away from the original question, since I already have a DispatchQueue.concurrentPerform present in the code.
Inside the outer loop I need to run further for loops, because I have to revisit the same data multiple times. The inputArray is an array of structs, so in the outer loop I look at one value in the struct, and in the inner loops I look at a different value. In the change below, I turned the two inner for loops into function calls to make the code a bit clearer. In general, I want to make the funcA and funcB calls and wait until they are both done before continuing in the main loop.
// assume the start and end values will be within the bounds of the
// array and won't under/overflow
private func funcA(inputArray: [Int], startValue: Int, endValue: Int) -> Bool{
for index in startValue...endValue {
let dataValue = inputArray[index]
if dataValue == 1_000_000 {
return true
}
}
return false
}
private func funcB(inputArray: [Int], startValue: Int, endValue: Int) -> Bool{
for index in startValue...endValue {
let dataValue = inputArray[index]
if dataValue == 10 {
return true
}
}
return false
}
private func testFunc(inputArray: [Int]) {
let dataIterationArray = Array(Set(inputArray))
let syncQueue = DispatchQueue(label: "syncQueue")
DispatchQueue.concurrentPerform(iterations: dataIterationArray.count) { index in
//I want to do these two function calls starting roughly one after another,
//to work them in parallel, but i want to wait until both are complete before
//moving on. funcA is going to take much longer than funcB in this case,
//just because there are more values to check.
let funcAResult = funcA(inputArray: dataIterationArray, startValue: 10, endValue: 2_000_000)
let funcBResult = funcB(inputArray: dataIterationArray, startValue: 5, endValue: 9)
//Wait for both above to finish before continuing
if funcAResult && funcBResult {
print("Yup we are good!")
} else {
print("Nope")
}
//And then wait here until all of the loops are done before processing
}
}
In your revised question, you contemplated a concurrentPerform loop where each iteration called funcA and then funcB, and you said that you wanted them "to work in parallel".
Unfortunately, that is not how concurrentPerform works. It runs the separate iterations in parallel, but the code within the closure should be synchronous and run sequentially. If the closure introduces additional parallelism, it will adversely affect how concurrentPerform reasons about how many worker threads to use.
Before we consider some alternatives, let us see what will happen if funcA and funcB remain synchronous. In short, you will still enjoy parallel execution benefits.
Below, I logged this with “Points of Interest” intervals in Instruments, and you will see that funcA (in green) never runs concurrently with funcB (in purple) for the same iteration (i.e., for the same range of start and end indices). In this example, I am processing an array with 180 items, striding 10 items at a time, ending up with 18 iterations running on an iPhone 12 Pro Max with six cores:
But, as you can see, although funcB for a given range of indices will not start until funcA finishes for the same range of indices, it does not really matter, because we are still enjoying full parallelism on the device, taking advantage of all the CPU cores.
I contend that, given that we are already enjoying parallelism, there is little benefit to making funcA and funcB run concurrently with respect to each other, too. Just let the individual iterations run parallel to each other, let A and B run sequentially, and call it a day.
If you really want to have funcA and funcB run parallel with each other, as well, you will need to consider a different pattern. The concurrentPerform simply is not designed for launching parallel tasks that, themselves, are asynchronous. You could consider:
Have concurrentPerform launch, using my example, 36 iterations, half of which do funcA and half of which do funcB.
Or you might consider using OperationQueue with a reasonable maxConcurrentOperationCount (though you lose concurrentPerform's dynamic limiting of the degree of concurrency to the device's CPU cores).
Or you might use async-await structured concurrency (a task group or async let), which will limit itself to the cooperative thread pool; see the sketch after this list.
But you will not want to have concurrentPerform have a closure that launches asynchronous tasks or introduces additional parallel execution.
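Here is a rough sketch of that structured-concurrency route, assuming the funcA and funcB from the question; each async let starts a child task on the cooperative thread pool, so A and B run concurrently, and the await joins them:
func testFunc(inputArray: [Int]) async {
    let array = Array(Set(inputArray))
    // Each `async let` starts a child task, so A and B run in parallel.
    async let a = funcA(inputArray: array, startValue: 10, endValue: 2_000_000)
    async let b = funcB(inputArray: array, startValue: 5, endValue: 9)
    // Suspends until both children have finished.
    let (resultA, resultB) = await (a, b)
    print(resultA && resultB ? "Yup we are good!" : "Nope")
}
But, again, the equality tests in the question are too cheap for this to pay off.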
And, as I discuss below, the example provided in the question is not a good candidate for parallel execution. Mere tests of equality are not computationally intensive enough to enjoy parallelism benefits. It will undoubtedly just be slower than the serial pattern.
My original answer, below, outlines the basic concurrentPerform considerations.
The basic idea is to “stride” through the values. So calculate how many “iterations” are needed and calculate the “start” and “end” index for each iteration:
private func testFunc(inputArray: [Int]) {
DispatchQueue.global().async {
let array = Array(Set(inputArray))
let syncQueue = DispatchQueue(label: "syncQueue") // for synchronizing any combined results (see notes below)
// calculate how many iterations will be needed
let count = array.count
let stride = 10
let (quotient, remainder) = count.quotientAndRemainder(dividingBy: stride)
let iterations = remainder == 0 ? quotient : quotient + 1
// now iterate
DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
// calculate the `start` and `end` indices
let start = stride * iteration
let end = min(start + stride, count)
// now loop through that range
for index in start ..< end {
let value = array[index]
print("iteration =", iteration, "index =", index, "value =", value)
}
}
// you won't get here until they're all done; obviously, if you
// want to now update your UI or model, you may want to dispatch
// back to the main queue, e.g.,
//
// DispatchQueue.main.async {
// ...
// }
}
}
Note, if something is so slow that it merits concurrentPerform, you probably want to dispatch the whole thing to a background queue, too. Hence the DispatchQueue.global().async {…} shown above. You would probably want to add a completion handler to this method, now that it runs asynchronously, but I will leave that to the reader.
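For example, a hypothetical shape for that completion handler (the parameter name is illustrative):
private func testFunc(inputArray: [Int], completion: @escaping () -> Void) {
    DispatchQueue.global().async {
        // ... the striding concurrentPerform loop shown above ...
        DispatchQueue.main.async {
            completion() // called on the main queue once every iteration is done
        }
    }
}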
Needless to say, there are quite a few additional considerations:
The stride should be large enough to ensure there is enough work on each iteration to offset the modest overhead introduced by multithreading. Some experimentation is often required to empirically determine the best striding value.
The work done in each thread must be significant (again, to justify the multithreading overhead). I.e., simply printing values is obviously not enough. (Worse, print statements compound the problem by introducing a hidden synchronization.) Even building a new array with some simple calculation will not be sufficient. This pattern really only works if you are doing something very computationally intensive.
You have a "sync" queue, which suggests that you understand that you need to synchronize the combination of the results of the various iterations. That is good. I will point out, though, that you will want to minimize the total number of synchronizations you do. E.g., let's say you have 1,000 values and you end up doing 10 iterations, each striding through 100 values. You generally want each iteration to build a local result and do a single synchronization per iteration. Using my example, you should strive to end up with only 10 total synchronizations, not 1,000 of them; excessive synchronization can easily negate any performance gains. See the sketch below.
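Here is what that pattern might look like (a sketch; the doubling is a stand-in for real, more expensive work):
let array = Array(0 ..< 1_000)
let stride = 100
let iterations = (array.count + stride - 1) / stride
let syncQueue = DispatchQueue(label: "syncQueue")
var combined = [Int]()
DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    let start = iteration * stride
    let end = min(start + stride, array.count)
    // Build this stride's result locally, without any locking...
    var localResult = [Int]()
    for index in start ..< end {
        localResult.append(array[index] &* 2) // stand-in for real work
    }
    // ...then synchronize exactly once per iteration.
    syncQueue.sync {
        combined.append(contentsOf: localResult)
    }
}
// Only `iterations` synchronizations in total, not one per element.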
Bottom line, making a routine execute in parallel is complicated and you can easily find that the process is actually slower than the serial rendition. Some processes simply don’t lend themselves to parallel execution. We obviously cannot comment further without understanding what your processes entail. Sometimes other technologies, such as Accelerate or Metal can achieve better results.
I will explain it here, since a comment is too small, but I will delete this later if it doesn't answer the question.
Instead of looping over iterations: dataIterationArray.count, base the number of iterations on the number of desired parallel streams of work, not on the array size. For example, since you mentioned you want three streams of work, you should have three iterations, each processing an independent part of the work:
DispatchQueue.concurrentPerform(iterations: 3) { iteration in
    switch iteration {
    case 0:
        for i in 1...10 {
            print("i \(i)")
        }
    case 1:
        for j in 11...20 {
            print("j \(j)")
        }
    case 2:
        for k in 21...30 {
            print("k \(k)")
        }
    default:
        break // never hit with iterations: 3, but the switch must be exhaustive
    }
}
And the "and then wait here until all of the loops are done before processing" will happen automatically; that is what concurrentPerform guarantees.
I'd like some help understanding why my publishers aren't emitting elements through the combineLatest operator. I have a publisher that emits video frames, and another publisher that consumes those frames and extracts faces from them. I'm now trying to combine the original video frames and the transformed output into one stream using combineLatest (I am using some custom publishers to extract video frames and transform them):
let videoPublisher = VideoPublisher //Custom Publisher that outputs CVImageBuffers
.share()
let faceDetectionPublisher = videoPublisher
.detectFaces() // Custom Publisher/subscriber that takes in video frames and outputs an array of VNFaceObservations
let featurePublisher = videoPublisher.combineLatest(faceDetectionPublisher)
.sink(receiveCompletion:{_ in
print("done")
}, receiveValue: { (video, faces) in
print("video", video)
print("faces", faces)
})
I'm not getting any activity out of combineLatest, however. After some debugging, I think the issue is that all the video frames from videoPublisher are published before any can successfully flow through faceDetectionPublisher. If I attach print statements to the end of videoPublisher and faceDetectionPublisher, I can see output from the former but none from the latter. I've read up on Combine and techniques such as multicasting, but haven't figured out a working solution. I'd love any Combine expertise or guidance on how to better understand the framework!
Your combineLatest won't emit anything until each of its sources emits at least one value. Since detectFaces() never emits, your chain stalls. Either something is wrong in your detectFaces() operator, or there are no faces to detect, in which case your logic is off.
If it's the latter, use prepend on the result of detectFaces() to seed the pipeline with some default value (maybe an empty array?).
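For example, something along these lines (a sketch against the pipeline in the question; it assumes the face publisher's output is an array of VNFaceObservation):
let featurePublisher = videoPublisher
    .combineLatest(faceDetectionPublisher.prepend([VNFaceObservation]()))
    .sink(receiveCompletion: { _ in
        print("done")
    }, receiveValue: { (video, faces) in
        print("video", video)
        print("faces", faces) // [] until detection produces a real value
    })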
I am building a Data object that looks something like this:
struct StructuredData {
var crc: UInt16
var someData: UInt32
var someMoreData: UInt64
// etc.
}
I'm running a CRC algorithm that will start at byte 2 and process length 12.
When the CRC is returned, it must exist at the beginning of the Data object. As I see it, my options are:
Generate a Data object that does not include the CRC, process it, and then build another Data object that does (so that the CRC value that I now have will be at the start of the Data object).
Generate the Data object to include a zeroed-out CRC to start with and then mutate the data at range [0..<2].
Obviously, option 2 would be preferable, as it uses less memory and less processing, though I'm not sure this type of optimization is necessary anymore. I'd still rather go with option 2, except I do not know how to mutate the data at a given index range. Any help is greatly appreciated.
I do not recommend mutating Data this way:
data.replaceSubrange(0..<2, with: UnsafeBufferPointer(start: &self.crc, count: 1))
Please try this:
data.replaceSubrange(0..<2, with: &self.crc, count: 2)
It is hard to explain why, but I'll try...
In Swift, inout parameters work with copy-in copy-out semantics. When you write something like this:
aMethod(&param)
Swift allocates a region large enough to hold the content of param,
copies param into that region (copy-in),
calls the method, passing the address of that region,
and, when the call returns, copies the content of the region back into param (copy-out).
In many cases, Swift optimizes these steps away (which may happen even with -Onone) by simply passing the actual address of param, but this behavior is not clearly documented.
So, when an inout parameter is passed to the initializer of UnsafeBufferPointer, the address received by UnsafeBufferPointer may point to a temporary region, which will be released as soon as the initializer returns.
Thus, replaceSubrange(_:with:) may copy bytes from an already-released region into the Data.
I believe the first code would happen to work in this case, since crc is a property of a struct, but when there is a simple and safe alternative, you had better avoid the unsafe way.
ADDITION for the comment of Brandon Mantzey's own answer.
data.append(UnsafeBufferPointer(start: &self.crcOfRecordData, count: 1))
Using safe in the sense above: this is not safe, for the same reason described earlier.
I would write it as:
data.append(Data(bytes: &self.crcOfRecordData, count: MemoryLayout<UInt16>.size))
(Assuming the type of crcOfRecordData as UInt16.)
If you do not prefer creating an extra Data instance, you can write it as:
withUnsafeBytes(of: &self.crcOfRecordData) {urbp in
data.append(urbp.baseAddress!.assumingMemoryBound(to: UInt8.self), count: MemoryLayout<UInt16>.size)
}
This is not referred to in the comment, but in the same sense of safe, the following line is not safe either.
let uint32Data = Data(buffer: UnsafeBufferPointer(start: &self.someData, count: 1))
All for the same reason.
I would write it as:
let uint32Data = Data(bytes: &self.someData, count: MemoryLayout<UInt32>.size)
That said, observable unexpected behavior would occur only under some very limited conditions and with very low probability.
Such behavior would happen only when the following two conditions are met:
Swift compiler generates non-Optimized copy-in-copy-out code
During the very narrow window between the temporary region being released and the append method (or Data.init) finishing its copy of the whole content, the region is reused and modified for another purpose.
Condition #1 is true in only limited cases in the current implementation of Swift.
Condition #2 happens very rarely, and only in a multithreaded environment. (Though Apple's frameworks use many hidden threads, as you can see in the Xcode debugger.)
In fact, I have not seen any questions caused by the unsafe cases above, so my notion of safe may be sort of overkill.
But the safe alternatives are not so complex, are they?
In my opinion, you would do better to get accustomed to writing all-cases-safe code.
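Putting the pieces together, option 2 from the question might look like this (computeCRC is a hypothetical stand-in for a real CRC-16 routine, and native byte order is assumed):
import Foundation
// Hypothetical stand-in; substitute a real CRC-16 implementation.
func computeCRC(_ bytes: Data) -> UInt16 {
    bytes.reduce(0) { $0 &+ UInt16($1) }
}
var crc: UInt16 = 0 // zeroed placeholder (option 2)
var someData: UInt32 = 42
var someMoreData: UInt64 = 42
var data = Data(bytes: &crc, count: MemoryLayout<UInt16>.size)
data.append(Data(bytes: &someData, count: MemoryLayout<UInt32>.size))
data.append(Data(bytes: &someMoreData, count: MemoryLayout<UInt64>.size))
// The CRC starts at byte 2 and covers the 12 payload bytes (4 + 8),
// then overwrites the zeroed placeholder in place.
crc = computeCRC(data[2..<14])
data.replaceSubrange(0..<2, with: &crc, count: MemoryLayout<UInt16>.size)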
I figured it out. I actually had a syntax error that was boggling me because I hadn't seen it before.
Here's the answer:
data.replaceSubrange(0..<2, with: UnsafeBufferPointer(start: &self.crc, count: 1))
My first question!
I am doing CPU-intensive image processing on a video feed, and I wanted to use OperationQueue. However, the results are absolutely horrible. Here's an example—let's say I have a CPU intensive operation:
var data = [Int].init(repeating: 0, count: 1_000_000)
func run() {
let startTime = DispatchTime.now().uptimeNanoseconds
for i in data.indices { data[i] = data[i] &+ 1 }
NSLog("\(DispatchTime.now().uptimeNanoseconds - startTime)")
}
It takes about 40ms on my laptop to execute. I time a hundred runs:
(1...100).forEach { i in run(i) }
They average about 42ms each, for about 4200ms total. I have 4 physical cores, so I try to run it on an OperationQueue:
var q = OperationQueue()
(1...100).forEach { i in
q.addOperation {
run(i)
}
}
q.waitUntilAllOperationsAreFinished()
Interesting things happen depending on q.maxConcurrentOperationCount:
concurrency    single operation    total
1              45ms                4500ms
2              100-250ms           8000ms
3              100-300ms           7200ms
4              250-450ms           9000ms
5              250-650ms           9800ms
6              600-800ms           11300ms
I use the default QoS of .background and can see that the thread priority is default (0.5). Looking at the CPU utilization with Instruments, I see a lot of wasted cycles (the first part is running it on main thread, the second is running with OperationQueue):
I wrote a simple thread queue in C and used that from Swift and it scales linearly with the cores, so I'm able to get my 4x speed increase. But what am I doing wrong with Swift?
Update: I think we have concluded that this is a legitimate bug in DispatchQueue. The question then becomes: what is the correct channel for reporting issues in DispatchQueue?
You seem to be measuring the wall-clock time of each run execution. That is not the right metric: parallelizing the problem does not mean that each run will execute faster... it just means that you can do several runs at once.
Anyhow, let me verify your results.
Your run function seems to take a parameter only some of the time. Let me define a similar function for clarity:
func increment(_ offset: Int) {
for i in data.indices { data[i] = data[i] &+ offset }
}
On my test machine, in release mode, this code takes 0.68 ns per entry or about 2.3 cycles (at 3.4 GHz) per addition. Disabling bound checking helps a bit (down to 0.5 ns per entry).
Anyhow. So next let us parallelize the problem as you seem to suggest:
var q = OperationQueue()
let queues = 100 // assumed iteration count; the original source defines this elsewhere
for i in 1...queues {
q.addOperation {
increment(i)
}
}
q.waitUntilAllOperationsAreFinished()
That does not seem particularly safe, but is it fast?
Well, it is faster... I hit 0.3 ns per entry.
Source code : https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/extra/swift/opqueue
.background will run the threads with the lowest priority. If you are looking for fast execution, consider .userInitiated, and make sure you are measuring performance with compiler optimizations turned on.
Also consider using DispatchQueue instead of OperationQueue. It might have less overhead and better performance.
Update based on your comments: try this. It goes from 38s on my laptop to 14 or so.
Notable changes:
I made the queue explicitly concurrent
I run the thing in release mode
Replaced the inner-loop calculation with a random number; the original computation got optimized out
Set QoS to a higher level: QoS now works as expected, and with .background it runs forever
var data = [Int].init(repeating: 0, count: 1_000_000)
func run() {
let startTime = DispatchTime.now().uptimeNanoseconds
for i in data.indices { data[i] = Int(arc4random_uniform(1000)) }
print("\((DispatchTime.now().uptimeNanoseconds - startTime)/1_000_000)")
}
let startTime = DispatchTime.now().uptimeNanoseconds
let g = DispatchGroup()
let q = DispatchQueue(label: "myQueue", qos: .userInitiated, attributes: [.concurrent])
(1...100).forEach { i in
q.async(group: g) {
run()
}
}
g.wait()
print("\((DispatchTime.now().uptimeNanoseconds - startTime)/1_000_000)")
Something is still wrong, though - a serial queue runs 3x faster even though it does not use all cores.
For the sake of future readers, two observations on multithreaded performance:
There is a modest overhead introduced by multithreading. You need to make sure that there is enough work on each thread to offset this overhead. As the old Concurrency Programming Guide says:
You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25.
And goes on to say:
Although dispatch queues have very low overhead, there are still costs to scheduling each loop iteration on a thread. Therefore, you should make sure your loop code does enough work to warrant the costs. Exactly how much work you need to do is something you have to measure using the performance tools.
A simple way to increase the amount of work in each loop iteration is to use striding. With striding, you rewrite your block code to perform more than one iteration of the original loop.
You should be wary of using either operations or GCD dispatches directly to build multithreaded algorithms, as that can lead to "thread explosion". Instead, use DispatchQueue.concurrentPerform (previously known as dispatch_apply). This is a mechanism for performing loops in parallel while ensuring that the degree of concurrency does not exceed the capabilities of the device, as sketched below.
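For example, the benchmark from the question might be restructured like this (a sketch; the stride of 100,000 is arbitrary and worth tuning empirically):
import Foundation
var data = [Int](repeating: 0, count: 1_000_000)
let stride = 100_000
let iterations = data.count / stride
data.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
        let start = iteration * stride
        // Each iteration increments a disjoint slice, so no locking is needed.
        for i in start ..< start + stride {
            buffer[i] = buffer[i] &+ 1
        }
    }
}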
I'm producing a sequence of 50 items every three seconds. I then want to batch them at max 20 items, but also without waiting more than one second before I release the buffer.
That works great!
But since the interval never dies, Buffer keeps firing empty batch chunks...
How can I avoid that? Sure, Where(buf => buf.Count > 0) should help - but that seems like a hack.
Observable
.Interval(TimeSpan.FromSeconds(3))
.Select(n => Observable.Repeat(n, 50))
.Merge()
.Buffer(TimeSpan.FromSeconds(1), 20)
.Subscribe(e => Console.WriteLine(e.Count));
Output:
0-0-0-20-20-10-0-20-20-10-0-0-20-20
The Where filter you propose is a sound approach; I'd go with that.
You could perhaps wrap the Buffer and Where into a single helper method named to make the intent clearer, but rest assured: the Where clause is idiomatic Rx in this scenario.
Think of it this way: an empty Buffer relays the information that no events occurred in the last second. While you can argue that this is implicit, it would require extra work to detect if Buffer didn't emit an empty list. It just so happens that this is not information you are interested in - so Where is an appropriate way to filter it out.
A lazy timer solution
Following on from your comment ("...the timer... be[ing] lazily initiated..."), you can do this to create a lazy timer and omit the zero counts:
var source = Observable.Interval(TimeSpan.FromSeconds(3))
.Select(n => Observable.Repeat(n, 50))
.Merge();
var xs = source.Publish(pub =>
pub.Buffer(() => pub.Take(1).Delay(TimeSpan.FromSeconds(1))
.Merge(pub.Skip(19)).Take(1)));
xs.Subscribe(x => Console.WriteLine(x.Count));
Explanation
Publishing
This query requires subscribing to the source events multiple times. To avoid unexpected side effects, we use Publish to give us pub, a stream that multicasts the source, creating just a single subscription to it. This replaces the older Publish().RefCount() technique that achieved the same end, effectively giving us a "hot" version of the source stream.
In this case, this is necessary to ensure that the buffer-closing streams produced after the first one start with the current events - if the source were cold, they would start over each time. I wrote a bit about publishing here.
The main query
We use an overload of Buffer that accepts a factory function that is called for every buffer emitted to obtain an observable stream whose first event is a signal to terminate the current buffer.
In this case, we want to terminate the buffer when either the first event into the buffer has been there for a full second, or when 20 events have appeared from the source - whichever comes first.
To achieve this we Merge streams that describe each case - the Take(1).Delay(...) combo describes the first condition, and the Skip(19).Take(1) describes the second.
However, I would still test performance the easy way, because I still suspect this is overkill, but a lot depends on the precise details of the platform and scenario etc.
After using the accepted answer for quite a while, I would now suggest a different implementation (inspired by James' Skip/Take approach and this answer):
var source = Observable.Interval(TimeSpan.FromSeconds(3))
.Select(n => Observable.Repeat(n, 50))
.Merge();
var xs = source.BufferOmitEmpty(TimeSpan.FromSeconds(1), 20);
xs.Subscribe(x => Console.WriteLine(x.Count));
With an extension method BufferOmitEmpty like:
public static IObservable<IList<TSource>> BufferOmitEmpty<TSource>(this IObservable<TSource> observable, TimeSpan maxDelay, int maxBufferCount)
{
return observable
.GroupByUntil(x => 1, g => Observable.Timer(maxDelay).Merge(g.Skip(maxBufferCount - 1).Take(1).Select(x => 1L)))
.Select(x => x.ToArray())
.Switch();
}
It is 'lazy' because no groups are created as long as there are no elements in the source sequence, so there are no empty buffers. And as in Tom's answer, there is another nice advantage over the Buffer/Where implementation: the buffer is started when the first element arrives, so elements that follow each other within the buffer time after a quiet period are processed in the same buffer.
Why not to use the Buffer method
Three problems occurred when I was using the Buffer approach (they might be irrelevant to the scope of the question, so this is a warning to people who, like me, use Stack Overflow answers in different contexts):
Because of the Delay one thread is used per subscriber.
In scenarios with long-running subscribers, elements from the source sequence can be lost.
With multiple subscribers it sometimes creates buffers with count greater than maxBufferCount.
(I can supply sample code for 2 and 3, but I'm unsure whether to post it here or in a separate question, because I cannot fully explain why it behaves this way.)
RxJS 5 has hidden features buried in its source code. It turns out this is pretty easy to achieve with bufferTime.
From the source code, the signature looks like this:
export function bufferTime<T>(this: Observable<T>, bufferTimeSpan: number, bufferCreationInterval: number, maxBufferSize: number, scheduler?: IScheduler): Observable<T[]>;
So your code would be like this:
observable.bufferTime(1000, null, 20)