Loops not working in Scala for a Flink job - scala

I have a Scala script that hits Flink to process (filter and aggregate) some data. However, before I start the Flink process, I need to build a list to pass into the filter. To simplify the problem, let's just say I need to take a string, split it on commas, and return the long values. So my code looks like
items.split(",")
  .map(item => item.trim.toLong)
  .toSet
The unit test has no issue. However, when I run the job on Kubernetes, it hangs and, after a timeout, restarts. It seems like the .map(...) is where it gets stuck (the split ran fine when I separated split and map and logged in between). No error is thrown, though...
I tried foregoing .map(...) and using for, foreach, and while, but all of them get stuck and restart. I even replaced item.trim.toLong with just 1L (simple enough?), and it still gets stuck.
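For reference, a minimal sketch of the split-then-log check mentioned above (the logger name and messages are illustrative; slf4j is assumed to be available, as it ships with Flink):
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("ListBuilderDebug")

val parts = items.split(",")
log.info(s"split produced ${parts.length} parts") // this message did show up in the logs

val ids = parts.map(_.trim.toLong).toSet          // this is the step that appeared to hang
log.info(s"parsed ${ids.size} ids")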
Any idea what's going on?
I'm using Flink version 1.13.5.

Related

spark iterative programming - exit condition without launching a job

When writing iterative programs, a common situation is that you need to define a condition at which the program will stop execution and return the result. This stop condition can be, for example, rdd.isEmpty. The problem is that this "condition test" is an action which triggers a job to be executed, and therefore incurs scheduling, serialisation and other costs for each iteration:
def iterate(layer: RDD[Long]): RDD[Long] = {
  layer.cache()
  if (layer.isEmpty) return null
  val nextLayer = process(layer) // contains hash joins, joins, filters, cache
  iterate(nextLayer)
}
The timeline will look like:
[isempty][------spacing----][isempty][------spacing----][isempty]
What is the best way to do iterative programming in such a situation? We should not be forced to launch a job in each iteration.
Is there a method to check for an empty RDD without executing an action?
Possible solution:
As you can see in the image below, isEmpty is now executed only every 5 iterations; each iteration is represented by a periodic triplet of blue rectangles. I did this by modifying the stop condition to the following:
if (layer.index % 5 == 0 && layer.isEmpty) return null;
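A hedged sketch of how that might look with an explicit counter (RDDs have no built-in iteration index, so depth below is a hypothetical parameter threaded through the recursion; process is the function from the question):
import org.apache.spark.rdd.RDD

def iterate(layer: RDD[Long], depth: Int = 0): RDD[Long] = {
  layer.cache()
  // only pay for the isEmpty action on every 5th iteration
  if (depth % 5 == 0 && layer.isEmpty) return null
  iterate(process(layer), depth + 1)
}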
But as you can see in the figure below, I am still getting actions that are executed as "run at ThreadPoolExecutor.java". Some research shows that those actions happen because I am doing broadcast hash joins of small DataFrames with larger ones.
[screenshots: "threadpoolexecutor reason", "timeline"]
You can try using
layer.cache()
layer.isEmpty
This means that the check for empty is going to trigger an action, but the RDD will be cached, so when you pass it to the process method, the work already done for isEmpty will effectively be skipped.
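A minimal, self-contained sketch of that pattern (the filter, local[*] master, and count are illustrative; in the question, process would take the place of the count):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("cache-before-isEmpty"))

val layer = sc.parallelize(1L to 1000L).filter(_ % 2 == 0)
layer.cache()               // lazy: only marks the RDD for caching
if (!layer.isEmpty()) {     // first action: the partitions it evaluates are kept in the cache
  println(layer.count())    // later actions reuse the cached partitions instead of recomputing the filter
}
sc.stop()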

VSTS Test fails but vstest.console passes; the assert executes before the code for some reason?

Well, the system we have has a bunch of dependencies, but I'll try to summarize what's going on without divulging too many details.
A test assembly in the form of a .dll is what gets executed. A lot of these tests call an API.
In the problematic method, there are two API calls that have an await on them: one to write a record to an external interface, and another to extract all records and then read the last one from that external interface, both via the API. The test simply checks whether writing the last record was successful in an end-to-end context; that's why there is both a write and then a read.
If we execute the test in Visual Studio, everything works as expected. I also tested it manually by running vstest.console.exe from the command line, and the expected results always come out as well.
However, when it comes to the VS Test task in VSTS, it fails for some reason. We've been trying to figure it out, and eventually we reached the point where we printed the list from the 'read' part. It turns out the last record we inserted isn't in the data we pulled, but if we check the external interface via a different method, we can confirm that the write actually happened. What gives? Why is VSTest getting what looks like an outdated set of records?
We also noticed two things:
1.) For the tests that pass, none of the Console.WriteLine outputs appear in the logs; they only show up for failed tests.
2.) Even though our Data.Should.Be call is at the very end of the TestMethod, the logs report the failure BEFORE the lines are printed! And even then, the printing should happen after reading the list of records, yet when the prints do happen we're still missing the record we just wrote.
Is there some bottom-to-top thing we're missing here? It really seems like VSTS vstest is executing the assert before the actual code. The TestMethods do run in the right order, though (the 4th test written top-to-bottom in the code is executed 4th rather than 4th from last), and we need them to run in that order because some of the later tests depend on the earlier ones succeeding.
Anything we're missing here? I'd post source code, but there are a bunch of things I'd need to scrub first.
It turns out we were sorely misunderstanding what 'await' does. We're now using .Wait() on the culprit instead, and we will also go back through the other tests to check them for the same issue.
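The tests themselves are .NET, but the same race can be sketched with Scala Futures (writeRecord is a hypothetical stand-in for the asynchronous API call):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for the asynchronous write to the external interface.
def writeRecord(record: String): Future[Unit] = Future { /* call the API here */ }

val pending = writeRecord("last record")
// Without blocking here, the subsequent read (and the assert) can run while the
// write is still in flight, so the record looks "missing" even though it lands later.
Await.result(pending, 30.seconds) // the rough Scala analogue of calling .Wait() on a Task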

How long does a Scala Spark job take to process a million lines in a file?

I have a file called file1 in HDFS that contains paths of several files:
this/is/path1
this/is/path2
this/is/path3
.
.
.
this/is/path1000000
If I get all the lines from this file as a list by executing the following in Scala (Source is scala.io.Source),
val lines = Source.fromFile("/my/path/file1.txt").getLines.toList
and if I use a 'for' loop as follows, to process each line of file1 in a separate function that involves some mapping functionality for each line,
for (i <- lines) {
  val firstLines = sc.hadoopFile(i, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }
}
how long will this take to run, given that file1 contains more than a million lines? This Scala job has been running on my machine for more than an hour, and I would like to know whether it has gotten stuck somewhere or is going through an infinite loop, or something like that.
That is a bit of a loaded question. But it shouldn't take long in general. My guess is something has gone wrong. From personal experience, I would guess you don't have enough executors available.
Memory gets a lot of focus with Spark, but the number of available executors has given me more fits than memory issues, especially because you will see behavior like this where it won't error out; it will just stall indefinitely.
That said, that is just a guess with very little knowledge about the job and env. Time to debug on your part and see if you can't find the issue or come back with a more specific problem/question.
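For reference, the loop in the question only defines RDD transformations; nothing is computed until an action is called on each firstLines RDD. A hedged sketch of what that would look like with an action added (collect is just an illustrative choice, and launching one Spark job per path is still expensive for a million files):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

for (i <- lines) {
  val firstLines = sc.hadoopFile(i, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    .flatMap { case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String] }
  // collect() is the action that actually launches a Spark job for this file
  val firstLine = firstLines.collect().headOption
}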

Spark stuck at removing broadcast variable (probably)

Spark 2.0.0-preview
We've got an app that uses a fairly big broadcast variable. We run this on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]].
At the end of saveAsTextFile, the output in the folder seems to be complete and correct (apart from .crc files still being there) BUT the spark-submit process is stuck on, seemingly, removing the broadcast variable. The stuck logs look like this: http://pastebin.com/wpTqvArY
My last run lasted for 12 hours after doing saveAsTextFile - just sitting there. I did a jstack on the driver process; most threads are parked: http://pastebin.com/E29JKVT7
Full story:
We used this code with Spark 1.5.0 and it worked, but then the data changed and something stopped fitting into Kryo's serialisation buffer. Increasing it didn't help, so I had to disable the KryoSerialiser. Tested it again - it hung. Switched to 2.0.0-preview - seems like the same issue.
I'm not quite sure what's even going on, given that there's almost no CPU activity and no output in the logs, yet the output is not finalised the way it used to be.
Would appreciate any help, thanks.
I had a very similar issue.
I was updating from spark 1.6.1 to 2.0.1 and my steps were hanging after completion.
In the end, I managed to solve it by adding a sparkContext.stop() at the end of the task.
Not sure why this is needed, but it solved my issue.
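A minimal sketch of where that call goes (the object and app names are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-job"))

    // ... build the broadcast Map, run the transformations, saveAsTextFile ...

    // Explicitly stopping the context at the end lets the driver exit cleanly
    // instead of hanging after the output has been written.
    sc.stop()
  }
}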
Hope this helps.
ps: this post reminds me of this https://xkcd.com/979/

How to halt the invocation of the mapper or reducer

I am trying to run my Hadoop map/reduce job inside Eclipse (not on a node or cluster) to debug my map/reduce logic. I want to be able to put a breakpoint on the mapper and reducer and have Eclipse stop on those breakpoints, but this is not happening and the mapper just seems to get stuck. I noticed that if I hit suspend and resume a couple of times, it will eventually break in the mapper and reducer. I am very new to Eclipse. What am I doing wrong?
I am literally running the word count code at http://wiki.apache.org/hadoop/WordCount and have break points on lines 22, 35.
Maybe you have disabled breakpoints? Breakpoints are displayed with a strike-through icon if that is the case.
When not running locally, it is possible that your breakpoints will not be hit, because the tasks run in new, isolated JVMs. However, that does not seem to be the case here, because suspend would not work either in that case.
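For completeness, a hedged sketch of forcing the job into Hadoop's local mode, so the map and reduce tasks run inside the debugged JVM and Eclipse breakpoints in them can be hit. This assumes Hadoop 2.x and the new mapreduce API; the old JobConf-based WordCount uses different property names:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Local mode keeps the map and reduce tasks in this JVM instead of spawning separate ones.
val conf = new Configuration()
conf.set("mapreduce.framework.name", "local") // run tasks in-process
conf.set("fs.defaultFS", "file:///")          // use the local filesystem

val job = Job.getInstance(conf, "word count (local debug)")
// ...set the mapper, reducer, and input/output paths exactly as in the WordCount example...
job.waitForCompletion(true)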