Spark::KMeans calls takeSample() twice? - scala

I have many data and I have experimented with partitions of cardinality [20k, 200k+].
I call it like that:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.
Then the takeSample() implementation doesn't seem to call itself or something like that, so I would expect KMeans() to call takeSample() once. So why the monitor shows two takeSample()s per KMeans()?
Note: I execute more KMeans() and they all invoke two takeSample()s, regardless of the data being .cache()'d or not.
Moreover, the number of partitions doesn't affect the number takeSample() is called, it's constant to 2.
I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!
I brought this to the mailing list of the Spark devs, so I am updating:
Details of 1st takeSample():
Details of 2nd takeSample():
where one can see that the same code is executed.

As suggested by Shivaram Venkataraman in Spark's mailing list:
I think takeSample itself runs multiple jobs if the amount of samples
collected in the first pass is not enough. The comment and code path
at GitHub
should explain when this happens. Also you can confirm this by
checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
numIters += 1
}
However, as one can see, the 2nd comment said it shouldn't happen often, and it does happen always to me, so if anyone has another idea, please let me know.
It was also suggested that this was a problem of the UI and takeSample() was actually called only once, but that was just hot air.

Related

How to debug further this dropped record in apache beam?

I am seeing intermittent dropped records(only for error messages though not for success ones). We have a test case that intermittenly fails/passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline.java" in the test case. This is the relevant setup code where I have tracked the dropped record too ....
PCollectionTuple processed = records
.apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
.withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE))
);
errors = errors.and(processed.get(TupleTags.FAILURE));
PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);
PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
.apply("Flatten Roster File Error Count", Flatten.pCollections())
.apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));
The relevant code in ProcessRosterRecordFn.java is this
if(dto.hasValidationErrors()) {
RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
error.getValidationErrors().addAll(dto.getValidationErrors());
error.getOldValidationErrors().addAll(dto.getOldValidationErrors());
log.info("Tagging record row number="+record.getRowNumber());
c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
return;
}
I see this log for the lost record of Tagging record row for 2 rows that fail. After that however, inside the first line of ErrorPublisherFn.java, we log immediately after receiving each message. We only receive 1 of the 2 rows SOMETIMES. When we receive both, the test passes. The test is very flaky in this regard.
Apache Beam is really annoying in it's naming of threads(they are all the same name), so I added a logback thread hashcode to get more insight and I don't see any and the ErrorPublisherFn could publish #4 on any thread anyways.
Ok, so now the big question: How to insert more things to figure out why this is being dropped INTERMITTENTLY?
Do I have to debug apache beam itself? Can I insert other functions or make changes to figure out why this error is 'sometimes' lost on some test runs and not others?
EDIT: Thankfully, this set of tests are not testing errors upstream and this line "errors = errors.and(processed.get(TupleTags.FAILURE));" can be removed which forces me to remove ".apply("Flatten Roster File Error Count", Flatten.pCollections())" and in removing those 2 lines, the issue goes away for 10 test runs in a row(ie. can't completely say it is gone with this flaky stuff going on). Are we doing something wrong in the join and flattening? I checked the Error structure and rowNumber is a part of equals and hashCode so there should be no duplicates and I am not sure why it would be intermittently failure if there are duplicate objects either.
What more can be done to debug here and figure out why this join is not working in the TestPipeline?
How to get insight into the flatten and join so I can debug why we are losing an event and why it is only 'sometimes' we lose the event?
Is this a windowing issue? even though our job started with a file to read in and we want to process that file. We wanted a constant dataflow stream available as google kept running into limits but perhaps this was the wrong decision?

"While loop" not working in my Anylogic model

I have the model which I posted before on Stack. I am currently running the iterations through 5 Flow Chart blocks contain enter block and service block. when agent fill service block 5 in flow chart 5, the exit block should start to fill block one and so on. I have used While infinite loop to loop between the five flow chart blocks but it isn't working.
while(true)
{
for (Curing_Drying currProcess : collection) {
if (currProcess.allowedDay == (int)time(DAY)) {
currProcess.enter.take(agent);
}
}
if (queue10.size() <= Throughtput1){
break;
}
}
Image for further illustration 1
Image for further illustration 2
Wondering if someone can tell me what is wrong in the code.
Based on the description and the pictures provided, it isn't clear why the while loop is necessary. The On exit action is executed for each Agent arrival to the Exit block. It seems that the intention is to find the appropriate Curing_Drying block based on number of days since the model start time? If so, then just iterating through the collection is enough.
Also, it is generally a good practice to provide more meaningful names to collections. Using simply collection doesn't say anything about the contents and can get pretty confusing later on.

Simpy 3.0.4, setting resource priority

I am having trouble with resource priority in simpy. Consider the following code:
import simpy
env = simpy.Environment()
res = simpy.PriorityResource(env, capacity = 1)
def go(id):
with res.request(priority = id) as req:
yield req
print id,res
env.process(go(3))
env.process(go(2))
env.process(go(4))
env.process(go(5))
env.process(go(1))
env.run()
Lower number means higher priority, so I should get 1,2,3,4,5. But instead i am getting 3,1,2,4,5. So the first output is wrong, after that its sorted!
Thanks in advance for your help.
This is correct. When "3" requests the resource, it is empty so it gets the
slot. The remaining processes have to queue and will get the resource in the
order 1, 2, 4, 5.
If you use the PreemptiveResource instead (like request(priority=id,
preempt=True)), 3 will still get the resource first but will be preempted by
2. 2 will then get preempted by 1. 2 and 3 would then have to request the
resource again to gain access to it.
Even I had the same problem where I was supposed to make a factory FIFO. At that time I assigned a reaction time to a part and made it to follow the previous part. That is only if the previous part had got into service of resource, I made the next part request. It solved the problem objectively but seemed like it slowed down the simulation little and also gave a rexn time to the part. It was basically a revamp of the factory process. But I would love to see a feature when the part doesn't have to request again.
Can it be done in the present version?

Calling a method from ABL code not working

When I create a new quote from Epicor I would like to add an item from the parts form automatically.
I am trying to do this using the following ABL code which runs when 'GetNewQuoteHed' is called:
run Update.
run GetNewQuoteDtl.
run ChangePartNumMaster("Rod Tube").
ttQuoteDtl.OrderQty = 5.
run Update.
I am getting the error:
Index -1 is either negative or above rows count.
This error occurs for each line in my ABL code.
What am I doing wrong?
That's not the proper format for a 4GL error message (nor is it at all familiar) so I'd say it is an Epicor application message. Epicor support is probably your best bet. However... Just guessing but it sounds like you might need to somehow initialize the thing that you're updating.
Agree with #Tom, but i would also say try and isolate the error and see where the error is raised as soon as you find the point the error is actually raised it is normally much easier to figure out exactly what is going wrong and how to solve it.
Working between a 0 based and a 1 based system there can be issues with the 1st or last entry depending on which way you moving. As the index for 0 based systems starts at 0 and ends at n-1 where 1 based systems start at 1 and end at n.

How many iterations are saved by JAGS/BUGS when burnin and thinning are specified?

I have a quick question about the details of running a model in JAGS and BUGS.
Say I run a model with n.burnin=5000, n.iter=5000 and thin=2. Does this mean that the program will:
Run 5,000 iterations, and discard results; and then
Run another 10,000 iterations, only keeping every second result?
If I save these simulations as a CODA object, are all 10,000 saved, or only the thinned 5,000? I'm just trying to understand which set of iterations are used to make the ACF plot?
With JAGS, n.burnin=5000, n.iter=5000 and thin=2, means you keep nothing. You run 5000, discard the first 5000 of these 5000 and then only keep a half of the remaining values of the chain (keep 1 value and discard the next one ..).
Use for example n.burnin=2000, n.iter=7000, thin=50, n.chains=5 : so you have (7000-2000)/50 * 5 = 500 values.
Could you be more specific which software you're talking about? It looks like you're referring to the arguments of the function bugs() in the R2WinBUGS package (except that the argument is called n.thin not thin). Looking at help(bugs) it just says n.burnin is the "number of iterations to discard at the beginning". Which doesn't specifically answer your question, but looking at the source for bugs.script() in that package suggests to me that it would run 5000 iterations burn in, as you suspected. You could send a suggestion to the maintainers of that package to clarify their documentation.
In your example, bugs() would then run 0 further iterations after the burn-in. Here the documentation is clearer - n.iter is the total number of iterations including the burn-in.
For your second question, the CODA output from WinBUGS (and any software which calls WinBUGS or OpenBUGS) will only include the thinned sample.