I'm estimating last-mile delivery costs in a large urban network using by-route distances. I have over 8,000 customer agents and over 100 retail store agents plotted on a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:
d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all pairs of customers assigned to a single store
I've written a startup function with a simple foreach loop to assign each customer to a store based on by-route distance (customers have a parameter, customer.pStore, of Store type). This function also adds each customer, in turn, to the store agent's collection of customers (store.colCusts, an ArrayList with Customer-type elements).
Next, I have a function that iterates through the store agent population, calculates the two average distance measures above (d0_bar and d1_bar), and writes the results to a txt file (see code below). The code works, fortunately. The problem is that with such a massive dataset, iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever: it has been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or is there a better way in AnyLogic to get these two distance measures for each store in my network?
Thanks in advance.
//for each store, record all customers assigned to it
for (Store store : stores)
{
    distancesStore.print(store.storeCode + "," + store.colCusts.size() + "," + store.colCusts.size()*(store.colCusts.size()-1)/2 + ",");
    //calculates average distance from store j to customer nodes that belong to store j
    double sumFirstDistByStore = 0.0;
    int h = 0;
    while (h < store.colCusts.size())
    {
        sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
        h++;
    }
    distancesStore.print((sumFirstDistByStore/store.colCusts.size())/1609.34 + ",");
    //calculates average of distances between all customer nodes belonging to store j
    double custDistSumPerStore = 0.0;
    int loopLimit = store.colCusts.size();
    int i = 0;
    while (i < loopLimit - 1)
    {
        int j = 1;
        while (j < loopLimit)
        {
            custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
            j++;
        }
        i++;
    }
    distancesStore.print((custDistSumPerStore/(loopLimit*(loopLimit-1)/2))/1609.34);
    distancesStore.println();
}
First, a few simple comments:
Have you tried timing a single distanceByRoute call? For example, run store.distanceByRoute(store.colCusts.get(0)) just to see how long one call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
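A minimal timing sketch (this uses AnyLogic's traceln; any store with at least one assigned customer will do):

long t0 = System.currentTimeMillis();
double d = store.distanceByRoute(store.colCusts.get(0));  // one routing request
traceln("One distanceByRoute call took " + (System.currentTimeMillis() - t0) + " ms");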
The first simple change is to use Java parallelism. Instead of using this:
for (Store store : stores)
{ ...
use this:
stores.parallelStream().forEach(store -> {
...
});
This will process the stores in parallel using the standard Java Streams API.
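One caveat if you parallelize: the loop writes to a shared text file, and that writer is unlikely to be safe to call from multiple threads at once. A minimal sketch, assuming the heavy distance work is done in parallel and collected into a thread-safe map first, then written out on one thread (computeAverages is a hypothetical helper wrapping the two loops from the question and returning {d0_bar, d1_bar}):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

Map<String, double[]> results = new ConcurrentHashMap<>();
stores.parallelStream().forEach(store -> {
    double[] averages = computeAverages(store);  // hypothetical helper, runs in parallel
    results.put(store.storeCode, averages);
});
// single-threaded write after the parallel part finishes
for (Map.Entry<String, double[]> e : results.entrySet()) {
    distancesStore.print(e.getKey() + "," + e.getValue()[0] + "," + e.getValue()[1]);
    distancesStore.println();
}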
It also looks like the second loop, where the average distance between customers is calculated, doesn't take account of mirroring; that is to say, distance a->b is equal to b->a. Hence, for example, 4 customers will require 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. In the case of 4 customers, your second while loop will instead perform 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention.
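If that is the issue, a minimal sketch of the corrected pair loop would start the inner index at i + 1, so each unordered pair is counted exactly once, matching the loopLimit*(loopLimit-1)/2 divisor already used in the print statement:

double custDistSumPerStore = 0.0;
int loopLimit = store.colCusts.size();
for (int i = 0; i < loopLimit - 1; i++) {
    for (int j = i + 1; j < loopLimit; j++) {  // j starts at i + 1: skips self-pairs and mirrored pairs
        custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
    }
}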
Generally, for long-running operations it is a good idea to include some traceln output showing progress and timing.
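A minimal sketch (processed and startMs are illustrative names):

long startMs = System.currentTimeMillis();
int processed = 0;
for (Store store : stores) {
    // ... the existing distance calculations ...
    processed++;
    traceln("Stores done: " + processed + "/" + stores.size()
            + " after " + (System.currentTimeMillis() - startMs) / 1000 + " s");
}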
Please have a look at the above and post your results. With more information, additional performance improvements may be possible.
My goal is to count success and failure messages from source to destination per second and to sum the results on a daily basis.
I had two options to do that:
Approach 1: stream the events, then group them by time#source#destination:
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream
    .selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .aggregate(
        DO SOME Aggregation,
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes));
After trying this approach, we noticed that the state store keeps growing because the number of unique keys keeps increasing and, if I am correct, because the state topics are "compact" only, the keys never expire.
NumberOfUniqueKeys = 86,400 seconds per day × SOURCE × DESTINATION
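For example, with 10 sources and 10 destinations (illustrative counts), that is already 86,400 × 10 × 10 = 8,640,000 unique keys per day.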
Then we thought that if we did not put a time field in the key, we could reduce the state store size, so we tried a windowing operation as a second approach.
Approach 2: use a windowing operation with persistentWindowStore, CustomTimeStampExtractor, windowedBy, and suppress:
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream
    .selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
    .aggregate(
        {
            DO SOME Aggregation
        },
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
After trying this second approach we reduced the state store size, but we now had a problem with late-arriving events. We then added a 5-second grace period together with the suppress operation, but even grace plus suppress did not guarantee that every late-arriving event was handled. Another side effect of suppress is latency, because it emits the aggregation result only after the window's grace period has passed.
By the way, using the windowing operation caused a WARNING message like:
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason in the source code and found it here:
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
* Safely construct a time window of the given size,
* taking care of bounding endMs to Long.MAX_VALUE if necessary
*/
static TimeWindow timeWindowForSize(final long startMs,
final long windowSize) {
long endMs = startMs + windowSize;
if (endMs < 0) {
LOG.warn("Warning: window end time was truncated to Long.MAX");
endMs = Long.MAX_VALUE;
}
return new TimeWindow(startMs, endMs);
}
But it actually does not make sense to me how endMs can be lower than 0...
Questions:
If we go with approach 1, how can we reduce the state store size? With approach 1 it was guaranteed that every event would be processed and none would be missed because of latency.
If we go with approach 2, how should I tune my logic to catch late-arriving data and reduce latency?
Why do I get the warning message with approach 2, although all time fields in my model are positive?
What other options can you suggest besides these two approaches?
I need some expert help :)
BR,
Regarding the warning message
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
this is what the Kafka mailing list wrote to me:
You can get this message "o.a.k.s.state.internals.WindowKeySchema: Warning: window end time was truncated to Long.MAX" when your TimeWindowedDeserializer is created without a windowSize. There are two constructors for a TimeWindowedDeserializer; are you using the one with windowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with Long.MAX_VALUE:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90
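If that is indeed the cause, a minimal sketch of the suggested fix (assuming the same one-second window as in approach 2) is to use the constructor that takes an explicit window size, which avoids the Long.MAX_VALUE fallback in WindowKeySchema:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.TimeWindowedDeserializer;
import java.time.Duration;

// window size here must match the TimeWindows used in the topology
TimeWindowedDeserializer<String> windowedDeserializer =
        new TimeWindowedDeserializer<>(Serdes.String().deserializer(),
                                       Duration.ofSeconds(1).toMillis());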
I am trying to write a unit test for a DStream.
I put data into my stream with a mutable queue: ssc.queueStream(red)
I set the ManualClock to 0,
start my streaming context,
and advance my ManualClock by batchDuration millis.
When I do
stream.slice(Time(0), Time(clock.getTimeMillis())).map(_.collect().toList)
I get a result.
But when I do
for (time <- 0L to stream.slideDuration.milliseconds + 10) {
    println("time " + time + " " + stream.compute(Time(time)).map(_.collect().toList))
}
none of them contains a result, not even stream.compute(Time(clock.getTimeMillis())).
So what is the difference between these two functions, leaving aside the differences in their parameters?
compute will return an RDD only if the provided time is a valid time in a sliding window, i.e. the zero time plus a multiple of the slide duration.
slice will align both the from and to times to the slide duration and compute for each of them.
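To make that concrete with illustrative numbers: with zeroTime = 0 and a slide duration of 1000 ms, compute(Time(1500)) is not a valid batch time, so per the above it yields nothing, while slice(Time(0), Time(1500)) aligns 1500 down to 1000 and computes the batches at times 0 and 1000.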
In slice you provide a time interval, and from that interval, as long as it is valid, a Seq[Time] is generated:
def to(that: Time, interval: Duration): Seq[Time] = {
    (this.milliseconds) to (that.milliseconds) by (interval.milliseconds) map (new Time(_))
}
and then "compute" is called for each instance in the Seq[Time]:
alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>
    if (time >= zeroTime) getOrCompute(time) else None
}
As opposed to that, compute only computes for the single Time instance that is passed to the method.
Apologies if the question is poorly phrased; I'll do my best.
If I have a sequence of values with times as an Observable[(U,T)], where U is a value type and T is a time-like type (or anything difference-able, I suppose), how could I write an operator which is an auto-reset one-touch barrier: it stays silent while abs(u_n - u_reset) < barrier, but emits t_n - t_reset if the barrier is touched, at which point it also resets u_reset = u_n.
That is to say, the first value this operator receives becomes the baseline, and it emits nothing. Henceforth it monitors the values of the stream, and as soon as one of them is beyond the baseline value (above or below), it emits the elapsed time (measured by the timestamps of the events), and resets the baseline. These times then will be processed to form a high-frequency estimate of the volatility.
For reference, I am trying to write a volatility estimator outlined in http://www.amazon.com/Volatility-Trading-CD-ROM-Wiley/dp/0470181990 , where rather than measuring the standard deviation (deviations at regular, homogeneous times), you repeatedly measure the time taken to breach a fixed barrier amount.
Specifically, could this be written using existing operators? I'm a bit stuck on how the state would be reset; maybe I need two nested operators, one which is one-shot and another which keeps re-creating that one-shot. I know it could be done by writing one by hand, but then I would need to write my own publisher, etc.
Thanks!
I don't fully understand the algorithm and the variables in your example, but you can use flatMap with some heap state and return empty() or just() as needed:
int[] var1 = { 0 };

source.flatMap(v -> {
    var1[0] += v;
    if ((var1[0] & 1) == 0) {
        return Observable.just(v);
    }
    return Observable.empty();
});
If you need a per-sequence state because of multiple consumers, you can defer the whole thing:
Observable.defer(() -> {
    int[] var1 = { 0 };
    return source.flatMap(v -> {
        var1[0] += v;
        if ((var1[0] & 1) == 0) {
            return Observable.just(v);
        }
        return Observable.empty();
    });
}).subscribe(...);
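For the barrier logic specifically, here is a rough sketch of the same defer + flatMap pattern applied to your description (RxJava 1.x assumed; Tick, barrierTimes, and the field names are assumptions about your (U, T) type, treating U as double and T as epoch millis):

import rx.Observable;

// assumed timestamped value type standing in for (U, T)
final class Tick {
    final double value;
    final long time;
    Tick(double value, long time) { this.value = value; this.time = time; }
}

static Observable<Long> barrierTimes(Observable<Tick> source, double barrier) {
    return Observable.defer(() -> {
        double[] baseValue = { Double.NaN };  // NaN marks "no baseline yet"
        long[] baseTime = { 0L };
        return source.flatMap(tick -> {
            if (Double.isNaN(baseValue[0])) {
                // first value becomes the baseline; emit nothing
                baseValue[0] = tick.value;
                baseTime[0] = tick.time;
                return Observable.empty();
            }
            if (Math.abs(tick.value - baseValue[0]) >= barrier) {
                // barrier touched: emit elapsed time and re-baseline
                long elapsed = tick.time - baseTime[0];
                baseValue[0] = tick.value;
                baseTime[0] = tick.time;
                return Observable.just(elapsed);
            }
            return Observable.empty();  // still inside the barrier: stay silent
        });
    });
}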
Is there a way to have a MATLAB timer pass different data on each subsequent call to the timer function? My goal is to cycle through intervals at a fixed rate, and the pause function inside a loop is not precise enough.
I have working MATLAB code that uses a for loop to send data via serial ports, then waits a specified time before the next iteration of the loop. The serial communication varies in speed, so if I specify 300 seconds as the period, the loop actually executes every 340-360 seconds. Here is the existing code:
clear all;
testFile = input('What is the name of the test data file (with extension): ', 's');
measurementData = csvread(testFile);
intervalDuration = input('What is the measurement change period (seconds): ');
intervalNumber = size(measurementData,2);
% Set up the COM PORT communication
sensorComPort = [101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120];
controllerComPort = [121,122,123,124];
for j = 1:intervalNumber
    tic
    fprintf('\nInterval # %2d\n', rem(j,24));
    % columns 1:20 of interval j, transposed to a row (same values as listing each element)
    sensorMeasurementPS = measurementData(1:20,j)';
    serialSensorObj = startsensorPSWithoutReset(sensorComPort, sensorMeasurementPS);
    serialSensorObj = changeMeasurement(serialSensorObj, sensorMeasurementPS);
    rc = stopsensorPS(serialSensorObj);
    controllerMeasurementPS = measurementData(21:24,j)';
    serialControllerObj = startControllerPSWithoutReset(controllerComPort, controllerMeasurementPS);
    serialControllerObj = changeMeasurement(serialControllerObj, controllerMeasurementPS);
    rc2 = stopControllerPS(serialControllerObj);
    pause(intervalDuration);
    t = toc;
    fprintf('Elapsed time = %3.4f\n', t);
end
clear serialSensorObj;
clear serialControllerObj;
The serial functions are specified in other files and they are working as intended.
What I need to do is have the serial communication execute on a more precise 5-minute interval. (The actual timing of the commands inside the interval will still vary slightly, but the commands will kick off every 5 minutes over the course of 24 hours. The current version loses time and gets out of sync with another system that is reading the measurements I'm setting by serial port.)
My first thought is to use a MATLAB timer with the fixedRate execution mode, which queues the timer function at fixed intervals. However, it doesn't appear that I can send the timer function different data on each interval. I thought about having the timer function change a counter in the workspace, similar to j in my existing for loop, but I know that having functions interact with the base workspace is not recommended.
Here's what I've come up with so far for the timer method:
function [nextJ] = changeMeasurement_fcn(obj,event,j,sensorComPort,controllerComPort)
    tic
    fprintf('\nInterval # %2d\n', rem(j,24));
    sensorMeasurementPS = measurementData(1:20,j)';
    serialSensorObj = startSensorPSWithoutReset(sensorComPort, sensorMeasurementPS);
    serialSensorObj = changeMeasurement(serialSensorObj, sensorMeasurementPS);
    rc = stopSensorPS(serialSensorObj);
    controllerMeasurementPS = measurementData(21:24,j)';
    serialControllerObj = startControllerPSWithoutReset(controllerComPort, controllerMeasurementPS);
    serialControllerObj = changeMeasurement(serialControllerObj, controllerMeasurementPS);
    rc2 = stopControllerPS(serialControllerObj);
    t2 = toc;
    fprintf('Elapsed time = %3.4f\n', t2);
and this is how I would call it from the main m file:
t = timer('TimerFcn',@changeMeasurement_fcn,'ExecutionMode','fixedRate','Period',intervalDuration);
% then I need some code to accept the returned nextJ from the timer function
This feels a bit sloppy, so I'm hoping there's a built-in way to have a timer cycle through a data set.
Another idea I had was to keep the for loop but change the pause to a value calculated so that each iteration's total time adds up to 5 minutes.
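In code terms that could be something like pause(max(0, intervalDuration - toc)), with tic at the top of the loop, so the time the serial work takes is subtracted from the pause (the pause itself still isn't perfectly precise, so a little drift may remain).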
To summarize my question:
a) Can I have a timer pass different data to the timer function on each iteration?
b) Is that a good way to go about cycling through the intervals in my data on a precise 5-minute interval?
Thanks!
I stumbled on this page: http://www.mathworks.com/company/newsletters/articles/tips-and-tricks-simplifying-measurement-and-timer-callbacks-with-nested-functions-new-online-support-features.html and learned that timer callback functions can be nested inside other functions (but not inside regular scripts).
Using that information, I cut my scenario down to the basics and came up with this code:
function timerTestMain_fcn
    testFile = input('What is the name of the test data file (with extension): ', 's');
    testData = csvread(testFile);
    intervalDuration = input('What is the voltage change period (seconds): ');
    intervalNumber = size(testData,2);
    t = timer('ExecutionMode','fixedRate','Period',intervalDuration,'TasksToExecute',intervalNumber);
    t.TimerFcn = {@timerTest_fcn};
    start(t);
    wait(t);
    delete(t);

    function timerTest_fcn(obj,event)
        tic
        event_time = datestr(event.Data.time);
        interval_id = t.TasksExecuted;
        data_value = testData(1,interval_id);
        msg = ['Interval ' num2str(interval_id) ' occurred at ' event_time ...
               ' with data value of ' num2str(data_value)];
        disp(msg)
        t2 = toc;
        fprintf('Elapsed time = %3.4f\n', t2);
    end
end
The test data file it requests must be a CSV containing a row vector. For example, you could put the values 11, 12, 13, 14, 15 across the first row of the CSV. The output messages would then say 'Interval 1 occurred at [time] with data value of 11', 'Interval 2 occurred at [time] with data value of 12', etc.
The key is that by nesting the functions, the timer callback can reference both the test data and the timer attributes contained in the outer function. The TasksExecuted property of the timer serves as the counter.
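(As an aside, a timer's TimerFcn can also be given as a cell array such as {@timerTest_fcn, extraArg} to pass additional arguments, but those arguments are fixed when the timer is created, which is why the nested-function approach with TasksExecuted works better for data that changes on each interval.)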
Thanks to anyone who thought about my question. Comments welcome.