I want to iterate through all items of a CSV file and, for each item, distribute the requests uniformly so that all of the SearchProduct functions (SearchProduct1, SearchProduct2 and SearchProduct3) are called the same number of times.
val products = csv("products.csv").records

val start = exec(repeat(products.size, "n") {
  feed(products.queue)
    .uniformRandomSwitch(
      exec(searchProduct1),
      exec(searchProduct2),
      exec(searchProduct3)
    )
})
I expect that if I have 9 products, SearchProduct1 is called 3 times, SearchProduct2 is called 3 times, and SearchProduct3 is also called 3 times.
But the statistics often show that SearchProduct3 was called 5 times while SearchProduct2 and SearchProduct1 were called 2 times each. Am I doing anything wrong? Should I do the repeat inside the uniformRandomSwitch?
So I understand uniformRandomSwitch to mean that the probability of executing each of these three functions is the same. It could happen that in 9 iterations SearchProduct1 is executed 8 times and SearchProduct2 once (and SearchProduct3 never). With uniformRandomSwitch I am not forcing every function to be executed the same number of times. Right?
I think what you want is the roundRobinSwitch directive. This will iterate through each chain, moving on to the next and then repeating at the beginning, as new requests come through.
With uniformRandomSwitch, each chain has a 1/N chance of being called on every iteration. Only over many iterations would the number of calls converge, in your example, to 3/3/3.
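Here is a minimal sketch of that change, reusing the names from your snippet (searchProduct1..3 and products.csv are assumed to be defined as in the question). roundRobinSwitch cycles through its chains in order, so with 9 records you get exactly 3 calls to each search function:

import io.gatling.core.Predef._ // standard Gatling DSL import

val products = csv("products.csv").records

val start = exec(repeat(products.size, "n") {
  feed(products.queue)
    .roundRobinSwitch(          // deterministic rotation instead of random choice
      exec(searchProduct1),
      exec(searchProduct2),
      exec(searchProduct3)
    )
})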
The "Assembler" should stop working for 2 hours after 10 assemblies are done.
How can I achieve that?
There are many ways to do this, depending on what "stop working" means and what the implications are for the incoming parts, but here's one option.
Create a ResourcePool called Machine; it will be used along with the technicians.
In the "On exit" action of the Assembler do this (I use 9 instead of 10 because out.count() doesn't count the agent until it is completely out, so when it reads 9 you have actually produced 10):
if (self.out.count() == 9) {         // the 10th assembly is leaving (count not yet updated)
    machine.set_capacity(0);         // stop the Machine resource pool
    create_MyDynamicEvent(2, HOUR);  // schedule the restart in 2 hours
}
In your dynamic event (which you have to create) add the following code:
machine.set_capacity(1); // restore the Machine capacity after the 2 hours
A second option is to have a variable countAssembler count the number of items produced. Then:
In the "On exit" action write countAssembler++;
In the "On enter delay" action write the following:
if (countAssembler == 10) {
    self.suspend(agent);                    // suspend this agent inside the assembler
    create_MyDynamicEvent(2, HOUR, agent);  // schedule the resume in 2 hours, passing the agent
}
In the dynamic event write:
assembler.resume(agent); // resume the suspended agent
Don't forget to add the agent parameter needed in the dynamic event.
Create a variable called countAssembler of type int and increment it as agents pass through the assembler. Also create a variable called assemblerStopTime, and record the assembler stop time with assemblerStopTime = time().
Place a SelectOutputOut block before the assembler and let agents in if the countAssembler value is less than 10; otherwise send them to a Wait block.
Now, to maintain the FIFO rule, in the first SelectOutputOut condition you also need to check whether there is any agent in the Wait block and whether the current time minus assemblerStopTime is greater than 2 (hours). If so, free that waiting agent and send it to the assembler with the wait.free(0) function, send the current agent to the Wait block instead, and reset countAssembler to zero.
I have a lot of data and I have experimented with partitions of cardinality in the [20k, 200k+] range.
I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C1 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.
Then, the takeSample() implementation doesn't seem to call itself recursively or anything like that, so I would expect KMeans() to call takeSample() only once. So why does the monitor show two takeSample()s per KMeans()?
Note: I run more KMeans() calls and they all invoke two takeSample()s, regardless of whether the data is .cache()'d or not.
Moreover, the number of partitions doesn't affect the number of times takeSample() is called; it is constant at 2.
I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!
I brought this to the mailing list of the Spark devs, so I am updating:
Details of 1st takeSample():
Details of 2nd takeSample():
where one can see that the same code is executed.
As suggested by Shivaram Venkataraman in Spark's mailing list:
I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
However, as one can see, the comment says this shouldn't happen often, yet it happens every time for me, so if anyone has another idea, please let me know.
It was also suggested that this was a UI problem and that takeSample() was actually called only once, but that turned out not to be the case.
For some reason, my Akka streams always wait for a second message before "emitting"(?) the first.
Here is some example code that demonstrates my problem.
val rx = Source((1 to 100).toStream.map { t =>
  Thread.sleep(1000)
  println(s"doing $t")
  t
})
rx.runForeach(println)
yields output:
doing 1
doing 2
1
doing 3
2
doing 4
3
doing 5
4
doing 6
5
...
What I want:
doing 1
1
doing 2
2
doing 3
3
doing 4
4
doing 5
5
doing 6
6
...
The way your code is set up now, you are completely transforming the Source before it is allowed to start emitting elements downstream. You can clearly see that behavior (as @slouc stated) by removing the toStream on the range of numbers that represents the source. If you do that, you will see the Source be completely transformed first, before it starts responding to downstream demand. If you actually want to run a Source into a Sink with a transformation step in the middle, you can structure things like this:
val transform =
  Flow[Int].map { t =>
    Thread.sleep(1000)
    println(s"doing $t")
    t
  }

Source((1 to 100).toStream).
  via(transform).
  to(Sink.foreach(println)).
  run()
If you make that change, then you will get the desired effect, which is that an element flowing downstream gets processed all the way through the flow before the next element starts to be processed.
You are using .toStream, which means that the whole collection is lazy. Without it, your output would first be a hundred "doing"s followed by the numbers from 1 to 100. However, a Stream eagerly evaluates only its first element, which gives the "doing 1" output, and stops there. Each subsequent element is evaluated only when needed.
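As a small plain-Scala illustration of that laziness (no Akka involved; the value name s is just for this example): only the head of a mapped Stream is evaluated eagerly, and each further element is evaluated the first time it is forced.

val s = (1 to 3).toStream.map { t =>
  println(s"doing $t")
  t
}
// "doing 1" is printed immediately, because a Stream's head is strict
println(s(1)) // forces the second element: prints "doing 2", then "2"

(This is the classic Scala 2 Stream; in newer Scala versions LazyList replaces it and is lazy in the head as well.)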
Now, I couldn't find any details on this in the docs, but I presume that runForeach (or rather the materialized stream, which buffers elements internally) requests the next element before invoking the function on the current one. So before calling println on element n, it first pulls element n+1 (forcing its evaluation), which produces the "doing n+1" message; then it runs your println on the current element, which prints "n".
Do you really need to map() before you runForeach? I mean, do you need two traversals through the data? I know I'm probably stating the obvious, but if you just process your data in one go like this:
val rx = Source((1 to 100).toStream)

rx.runForeach { t =>
  Thread.sleep(1000)
  println(s"doing $t")
  // do something with 't', which is now equal to what "doing" says
}
then you don't have a problem of what's evaluated when.
I have a quick question about the details of running a model in JAGS and BUGS.
Say I run a model with n.burnin=5000, n.iter=5000 and thin=2. Does this mean that the program will:
Run 5,000 iterations, and discard results; and then
Run another 10,000 iterations, only keeping every second result?
If I save these simulations as a CODA object, are all 10,000 saved, or only the thinned 5,000? I'm just trying to understand which set of iterations is used to make the ACF plot.
With JAGS, n.burnin=5000, n.iter=5000 and thin=2 means you keep nothing. You run 5000 iterations in total, discard the first 5000 of those 5000 as burn-in, and then keep only half of whatever remains of the chain (keep one value, discard the next, and so on).
Use, for example, n.burnin=2000, n.iter=7000, thin=50, n.chains=5: then you keep (7000 - 2000) / 50 * 5 = 500 values.
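To make that arithmetic explicit, here is a tiny sketch (plain Scala, function and parameter names are my own) of how many samples end up being kept under this reading of the arguments:

// kept samples = chains * (total iterations - burn-in) / thinning interval
def keptSamples(nIter: Int, nBurnin: Int, nThin: Int, nChains: Int): Int =
  nChains * ((nIter - nBurnin) / nThin)

keptSamples(7000, 2000, 50, 5) // = 500, the example above
keptSamples(5000, 5000, 2, 1)  // = 0, the settings in the question keep nothing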
Could you be more specific about which software you're talking about? It looks like you're referring to the arguments of the function bugs() in the R2WinBUGS package (except that the argument is called n.thin, not thin). Looking at help(bugs), it just says n.burnin is the "number of iterations to discard at the beginning". That doesn't specifically answer your question, but looking at the source of bugs.script() in that package suggests to me that it would run 5000 iterations of burn-in, as you suspected. You could send a suggestion to the maintainers of that package to clarify their documentation.
In your example, bugs() would then run 0 further iterations after the burn-in. Here the documentation is clearer - n.iter is the total number of iterations including the burn-in.
For your second question, the CODA output from WinBUGS (and any software which calls WinBUGS or OpenBUGS) will only include the thinned sample.
I have four variables, each saved in 365 mat-files (size: 8 x 92 x 240). I try to load these into my function in a for loop over day = 1:365, one variable file per day. However, the first two variables always take an abnormally long time to load. My code for loading looks like this:
load([eraFolder sprintf('Y%dD%d-tempSD.mat',year,day)], 'tempSD'); % took 5420 s to load
load([eraFolder sprintf('Y%dD%d-tempDewSD.mat',year,day)], 'tempDewSD')
load([eraFolder sprintf('Y%dD%d-eEraSD.mat',year,day)], 'eEraSD'); % took 6 seconds to load
load([eraFolder sprintf('Y%dD%d-pEraSD.mat',year,day)], 'pEraSD');
Using the Profiler, I could see that the first two variables took 5420 seconds to load over 365 calls, whereas the last two variables took 6 and 4 seconds, respectively, over 365 calls. When I swap the order in which the variables are loaded, e.g. eEraSD before tempSD, it is still the first two loads that take more time.
When using tic/toc to track the time spent on loading, it appears that the time to load the first or second variable increases exponentially with the number of calls (with the last calls taking 50 seconds each), whereas for the third and fourth variable the loading time stays around 0.02-0.04 seconds per file, more or less independent of how far into the for loop I am. See figures below.
When using importdata instead of 'load', the first line takes about 8000 seconds to load 365 times (with the loading exponentially increasing as shown for T in the second figure). The other lines then take about 10 seconds to load 365 times.
I can't understand why it looks this way and what I can do to decrease the loading time. Would greatly appreciate any idea of a possible solution for this.
I suppose your data sets are in the same directory (network or local) and have the same attributes, e.g. access properties and so on.
Then the only option left is the characteristics of the variables stored in those mat-files. Can you check how large those variables are, e.g. by loading a sample one? That will help narrow down the problem.
Hope that helps.
FS
Thanks for your help. I finally found out what caused the problem. In a 'for' loop later in the script, I saved other data to a folder I called temp. After renaming that folder to something else (e.g. temporary), the data loading problem disappeared.
(Doesn't matter so much now that the practical problem is solved, but I can't really say I understand why there was this peculiar relationship between the later 'save' call and this 'importdata' or 'load' call.)
Please see new question about the temp folder