GraphX: Wrong output without cache() - scala

I'm doing the following:
var count = 1L  // nonzero so the loop is entered at least once
while (count > 0) {
  val messages = graph.vertices.flatMap {
    // Create messages for other nodes
  }
  // Caching here is critical for the correct execution
  messages.cache()
  count = messages.count()
  val msgType1 = messages.filter(...)
  val msgType2 = messages.filter(...)
  println(count)
  // Should be exactly messages.count()
  println(msgType1.count() + msgType2.count())
  println("---")
}
If I execute it exactly like this, the output is:
8
6 2
---
11
3 8
---
0
0 0
---
which add up exactly to the message count.
If I remove the messages.cache() after the flatMap operation, then the filtering of the messages is wrong after the messages are counted. It looks as if counting clears the messages or something like that.
The output is then:
8
0 0
---
0
0 0
---
Why is that happening? Is it okay that my program only works if I use the cache operation at that point, or should it also work without caching the messages?

My problem was that, as long as flatMap() was called only once per loop iteration, the output was correct.
If it is called twice in one iteration (which can happen if the messages must be recomputed), then the first output is correct and the following ones are not, because my operations inside flatMap() may only be executed once per node, not multiple times.
So if I call cache(), the flatMap is executed only once. Without cache it is re-executed for every count() operation, so the first result was correct and the following two were wrong.
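To make the recomputation visible, here is a minimal, self-contained sketch (not the original program: the data, the accumulator, and the local master are assumptions for illustration, using the Spark 2.x longAccumulator API). Each action re-evaluates the uncached flatMap, which is exactly what breaks side-effecting logic that may only run once per element:

import org.apache.spark.{SparkConf, SparkContext}

object RecomputeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("recompute-demo").setMaster("local[*]"))
    val evaluations = sc.longAccumulator("flatMap evaluations")

    val messages = sc.parallelize(1 to 8).flatMap { i =>
      evaluations.add(1)   // side effect: count how often the flatMap body runs
      Seq(i % 2)           // stand-in for "create messages for other nodes"
    }
    // messages.cache()    // uncomment so the flatMap is evaluated only once

    messages.count()                 // 1st action
    messages.filter(_ == 0).count()  // 2nd action
    messages.filter(_ == 1).count()  // 3rd action

    // Without cache(): 24 evaluations (8 elements x 3 actions).
    // With cache():     8 evaluations (only the first action computes the RDD).
    println(s"flatMap body evaluated ${evaluations.value} times")
    sc.stop()
  }
}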

Related

Scala java.lang.NoSuchElementException Error

I have a Scala for loop that goes like this:
val a = sc.textFile("path to file containing 8 elements")
for (i <- 0 to a.count.toInt) {
  println(a.take(i).last)
}
But it throws a java.lang.NoSuchElementException.
I am not able to understand what's wrong or how to resolve it.
There are two problems:
1) The "to" operator used to define the range (in 0 to a.count.toInt) is a problem here because it is an inclusive range from 0 to 8, so in a collection of 8 elements it tries to access the element at index 8. You can use 0 until a.count.toInt instead.
2) The second problem is the way "last" is called. When i = 0, the expression a.take(i) is an empty collection, and calling "last" on it results in the NoSuchElementException.
Why would you iteratively take 1, 2, 3, ..., 8 elements from a collection just to take the last element every time? It is fine to do this with a collection of 8 elements, but if you want to do something similar on a larger RDD, you should consider caching it first.
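A minimal corrected sketch combining both points (my illustration, not the asker's code): start the range at 1 so a.take(i) is never empty, and cache the RDD so the repeated take() calls do not re-read the file on every iteration (sc is assumed to be an existing SparkContext).

val a = sc.textFile("path to file containing 8 elements").cache()
for (i <- 1 to a.count.toInt) {
  println(a.take(i).last) // the i-th element of the file
}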

Getting the number of function calls of fminsearch for each iteration

options = optimset('Display','iter','MaxIter',3,'OutputFcn',@outfun);
[x,fval,~,output] = fminsearch(@(param) esm6(param,identi),result(k,1:end-1),options);
This code finds a local minimum of my esm6 function, and because of the 'Display' option it prints output like this:
Iteration   Func-count     min f(x)     Procedure
    0            1         36.9193
    1            5         35.9815      initial simplex
    2            7         35.4924      contract inside
    3            9         35.4924      contract inside
    4           11         33.0085      expand
So in the command window I get the function count for each iteration step. However, the output structure created by fminsearch only contains the total func-count. Is there a way to also get all the information that is printed in the command window into the output structure?
EDIT:
I think I'm pretty close to the solution. I wrote this output function:
function stop = outfun(x, optimvalues, state)
stop = false;
if state == 'iter'
    history = evalin('base', 'history');
    history = [history; optimvalues.iteration optimvalues.funcCount];
    assignin('base', 'history', history);
end
end
According to http://de.mathworks.com/help/matlab/math/output-functions.html this should work, but in fact MATLAB tells me:
??? Reference to non-existent field 'funcCount'.
Any idea why this happens?

Unexpected spark caching behavior

I've got a spark program that essentially does this:
def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
The strange behavior that I'm seeing is that spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case. When I look in the "storage" tab of the running job, very few of the partitions of c are cached.
Also, 10 copies of that stage immediately show as "active". 10 copies of the stage corresponding to val next = some_other_rdd_ops(c, current) show up as pending, and they roughly alternate execution.
Am I misunderstanding how to get Spark to cache RDDs?
Edit: here is a gist containing a program to reproduce this: https://gist.github.com/jfkelley/f407c7750a086cdb059c. It expects as input the edge list of a graph (with edge weights). For example:
a b 1000.0
a c 1000.0
b c 1000.0
d e 1000.0
d f 1000.0
e f 1000.0
g h 1000.0
h i 1000.0
g i 1000.0
d g 400.0
Lines 31-42 of the gist correspond to the simplified version above. I get 10 stages corresponding to line 31 when I would expect only 1.
The problem here is that calling cache is lazy. Nothing will be cached until an action is triggered and the RDD is evaluated. All the call does is set a flag in the RDD to indicate that it should be cached when evaluated.
Unpersist, however, takes effect immediately. It clears the flag indicating that the RDD should be cached and also begins purging data from the cache. Since you only have a single action at the end of your application, this means that by the time any of the RDDs are evaluated, Spark does not see that any of them should be persisted!
I agree that this is surprising behaviour. The way that some Spark libraries (including the PageRank implementation in GraphX) work around this is by explicitly materializing each RDD between the calls to cache and unpersist. For example, in your case you could do the following:
def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    next.foreachPartition(x => {}) // materialize before unpersisting
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
Caching doesn't reduce the number of stages; it just avoids recomputing a stage every time.
In the first iteration you can see in the stage's "Input Size" that the data comes from Hadoop and that shuffle input is read. In subsequent iterations the data comes from memory and there is no more shuffle input; execution time is also vastly reduced.
New map stages are created whenever shuffle output has to be written, for example when the partitioning changes; in your case, when adding a key to the RDD.
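The laziness both answers describe is easy to reproduce in isolation. Below is a minimal sketch (the data and the method name are made up for illustration): persist() only marks the RDD, and nothing is stored until an action forces evaluation, which is why materializing before unpersist() matters.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def cacheDemo(sc: SparkContext): Unit = {
  val c = sc.parallelize(1 to 1000).map(_ * 2)
  c.persist(StorageLevel.MEMORY_ONLY_SER)  // sets a flag; stores nothing yet
  println(c.toDebugString)                 // lineage shows only the requested storage level

  c.foreachPartition(_ => ())              // cheap action: materializes and actually caches c
  println(c.toDebugString)                 // now reports cached partitions

  c.unpersist()                            // eager: the cached partitions are dropped immediately
}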

Rx debouncing inputs

I need to debounce an input stream.
At the first occurrence of state 1 I need to wait for 5 seconds and verify that the last state was also 1.
Only then do I have a stable signal.
(time) 0-1-2-3-4-5-6-7-8-9
(state) 0-0-0-0-0-1-0-1-0-1
(result) -> 1
Here is an example of a non-stable signal.
(time) 0-1-2-3-4-5-6-7-8-9
(state) 0-0-0-0-0-1-0-1-0-0
(result) -> 0
I tried using a buffer, but a buffer has a fixed starting point, and I need to wait for 5 seconds starting from my first event.
Taking your requirements literally
At the first occurrence of state 1 I need to wait for 5 seconds and verify that the last state was also 1. Only then do I have a stable signal.
I can come up with a few ways to solve this problem.
To clarify my assumptions: you just want to push the last value produced 5 seconds after the first occurrence of a 1. This results in a single-value sequence producing either a 0 or a 1 (i.e. regardless of any further values the source sequence produces beyond those 5 seconds).
Here I recreate your sequence with some jiggery-pokery.
var source = Observable.Timer(TimeSpan.Zero,TimeSpan.FromSeconds(1))
.Take(10)
.Select(i=>{if(i==5 || i==7 || i==9){return 1;}else{return 0;}}); //Should produce 1;
//.Select(i=>{if(i==5 || i==7 ){return 1;}else{return 0;}}); //Should produce 0;
All of the options below need to share the source sequence. To share a sequence safely in Rx we Publish() and connect it; here I use automatic connecting via the RefCount() operator.
var sharedSource = source.Publish().RefCount();
1) In this solution we take the first value of 1 and then buffer the values of the sequence into buffers of 5 seconds. We take only the first of these buffers. Once we get this buffer, we take its last value and push that. If the buffer is empty, we push a 1, as the last value seen was the '1' that started the buffer.
sharedSource.Where(state => state == 1)
    .Take(1)
    .SelectMany(_ => sharedSource.Buffer(TimeSpan.FromSeconds(5)).Take(1))
    .Select(buffer =>
    {
        if (buffer.Any())
        {
            return buffer.Last();
        }
        else
        {
            return 1;
        }
    })
    .Dump();
2) In this solution I only start listening once we get a valid value (1), and then take all values until a timer triggers termination. From those we take the last value produced.
var fromFirstValid = sharedSource.SkipWhile(state=>state==0);
fromFirstValid
.TakeUntil(
fromFirstValid.Take(1)
.SelectMany(_=>Observable.Timer(TimeSpan.FromSeconds(5))))
.TakeLast(1)
.Dump();
3) In this solution I use the Window operator to create a single window that opens when the first value of '1' arrives and closes when 5 seconds have elapsed. Again, we just take the last value.
sharedSource.Window(
sharedSource.Where(state=>state==1),
_=>Observable.Timer(TimeSpan.FromSeconds(5)))
.SelectMany(window=>window.TakeLast(1))
.Take(1)
.Dump();
So, lots of different ways to skin a cat.
It sounds (at a glance) like you want Throttle, not Buffer, although some more information on your use case would help pin that down. At any rate, here's how you might Throttle your stream:
void Main()
{
    var subject = new Subject<int>();
    var source = subject.Publish().RefCount();

    var query = source
        // Emit a value only once the stream has been silent for 5 seconds
        .Throttle(x => Observable.Timer(TimeSpan.FromSeconds(5)));

    using (query.Subscribe(Console.WriteLine))
    {
        // This sequence should produce a one
        subject.OnNext(1);
        subject.OnNext(0);
        subject.OnNext(1);
        subject.OnNext(0);
        subject.OnNext(1);
        subject.OnNext(1);
        Console.ReadLine();

        // This sequence should produce a zero
        subject.OnNext(0);
        subject.OnNext(0);
        subject.OnNext(0);
        subject.OnNext(0);
        subject.OnNext(1);
        subject.OnNext(0);
        Console.ReadLine();
    }
}

Issues using retract in then condition of a rule

I am trying to write a rule to detect whether a given event has occurred 'n' times within the last 'm' duration of time.
I am using Drools version 5.4.Final. I have also tried 5.5.Final, with no effect.
I found that there are a couple of Conditional Elements, as Drools calls them: accumulate and collect. I have used collect in my sample rule below.
rule "check-login-attack-rule-1"
dialect "java"
when
$logMessage: LogMessage()
$logMessages : ArrayList ( size >= 3 )
from collect(LogMessage(getAction().equals(Action.Login)
&& isProcessed() == false)
over window:time(10s))
then
LogManager.debug(Poc.class, "!!!!! Login Attack detected. Generating alert.!!!"+$logMessages.size());
LogManager.debug(Poc.class, "Current Log Message: "+$logMessage.getEventName()+":"+(new Date($logMessage.getTime())));
int size = $logMessages.size();
for(int i = 0 ; i < size; i++) {
Object msgObj = $logMessages.get(i);
LogMessage msg = (LogMessage) msgObj;
LogManager.debug(Poc.class, "LogMessage: "+msg.getEventName()+":"+(new Date(msg.getTime())));
msg.setProcessed(true);
update(msgObj); // Does not work. Rule execution does not proceed beyond this point.
// retract(msgObj) // Does not work. Rule execution does not proceed beyond this point.
}
// Completed processing the logs over a given window. Now removing the processed logs.
//retract($logMessages) // Does not work. Rule execution does not proceed beyond this point.
end
The code to inject logs is below. It inserts a log message every 3 seconds and fires the rules.
final StatefulKnowledgeSession kSession = kBase.newStatefulKnowledgeSession();
long msgId = 0;
while (true) {
    // Generate a log message every 3 seconds.
    // Every alternate log message will satisfy the rule condition.
    LogMessage log = new LogMessage();
    log.setEventName("msg:" + msgId);
    log.setAction(LogMessage.Action.Login);
    LogManager.debug(Poc.class, "PUSHING LOG: " + log.getEventName() + ":" + log.getTime());
    kSession.insert(log);
    kSession.fireAllRules();
    LogManager.debug(Poc.class, "PUSHED LOG: " + log.getEventName() + ":" + (new Date(log.getTime())));
    // Sleep for 3 seconds
    try {
        sleep(3 * 1000L);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    msgId++;
}
With this, what I could achieve is checking for the existence of the above LogMessages within the last 10 seconds. I could also find out the exact set of LogMessages from the last 10 seconds that triggered the rule.
The problem is that once these messages are processed, they should not take part in the next cycle of evaluation. This is something I have not been able to achieve. I'll explain with an example.
Consider the timeline below. It shows the insertion of log messages and the alert generation that should happen.
Expected Result
Secs -- Log -- Alert
0 -- LogMessage1 -- No Alert
3 -- LogMessage2 -- No Alert
6 -- LogMessage3 -- Alert1 (LogMessage1, LogMessage2, LogMessage3)
9 -- LogMessage4 -- No Alert
12 -- LogMessage5 -- No Alert
15 -- LogMessage6 -- Alert2 (LogMessage4, LogMessage5, LogMessage6)
But what happens with the current code is:
Actual Result
Secs -- Log -- Alert
0 -- LogMessage1 -- No Alert
3 -- LogMessage2 -- No Alert
6 -- LogMessage3 -- Alert1 (LogMessage1, LogMessage2, LogMessage3)
9 -- LogMessage4 -- Alert2 (LogMessage2, LogMessage3, LogMessage4)
12 -- LogMessage5 -- Alert3 (LogMessage3, LogMessage4, LogMessage5)
15 -- LogMessage6 -- Alert4 (LogMessage4, LogMessage5, LogMessage6)
Essentially, I am not able to discard the messages that have already been processed and have taken part in an alert generation. I tried to use retract to remove the processed facts from working memory, but when I added retract to the then part of the rule, the rules stopped firing altogether. I have not been able to figure out why the rules stop firing after adding the retract.
Kindly let me know where I am going wrong.
You seem to be forgetting to mark the other 3 facts in the list as processed. You would need a helper class as a global to do so, because it has to be done in a for loop. Otherwise, these groups of messages can trigger the rule as well:
1 - no triggering
1,2 - no triggering
1,2,3 - triggers
2,3,4 - triggers because a new fact is added and 2 and 3 were in the list
3,4,5 - triggers because a new fact is added and 3 and 4 were in the list
and so on.
Hope this helps.