Spark stuck at removing broadcast variable (probably) - scala

Spark 2.0.0-preview
We've got an app that uses a fairly big broadcast variable. We run this on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]].
At the end of saveAsTextFile, the output in the folder seems to be complete and correct (apart from .crc files still being there) BUT the spark-submit process is stuck on, seemingly, removing the broadcast variable. The stuck logs look like this: http://pastebin.com/wpTqvArY
My last run lasted for 12 hours after doing saveAsTextFile - just sitting there. I did a jstack on the driver process; most threads are parked: http://pastebin.com/E29JKVT7
Full story:
We used this code with Spark 1.5.0 and it worked, but then the data changed and something stopped fitting into Kryo's serialisation buffer. Increasing it didn't help, so I had to disable the Kryo serialiser. Tested it again - it hung. Switched to 2.0.0-preview - seems like the same issue.
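For reference, roughly what those serializer changes looked like in the driver config (the key names are standard Spark settings; the app name and values here are just illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Illustrative only - the buffer size and app name are made up for this sketch.
val conf = new SparkConf()
  .setAppName("broadcast-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // the Kryo setup whose buffer stopped being big enough
  .set("spark.kryoserializer.buffer.max", "1g")                          // the buffer increase that didn't help
// "Disabling Kryo" just means dropping the spark.serializer line, which falls back to Java serialization.
val sc = new SparkContext(conf)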
I'm not quite sure what's even going on, given that there's almost no CPU activity and no output in the logs, yet the output is not finalised the way it used to be.
Would appreciate any help, thanks.

I had a very similar issue.
I was updating from Spark 1.6.1 to 2.0.1 and my steps were hanging after completion.
In the end, I managed to solve it by adding a sparkContext.stop() at the end of the task.
Not sure why this is needed, but it solved my issue.
Hope this helps.
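For what it's worth, here's a minimal sketch of the shape of the fix. The app name, data, and output path are placeholders, not our actual job:

import org.apache.spark.{SparkConf, SparkContext}

object SaveJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-job")) // hypothetical app name
    // Stand-in for the big Map[String, Array[String]] broadcast in the question.
    val lookup = sc.broadcast(Map("a" -> Array("x", "y")))

    sc.parallelize(Seq("a", "b"))
      .flatMap(key => lookup.value.getOrElse(key, Array.empty[String]))
      .saveAsTextFile("/tmp/output") // placeholder output path

    // The part that made spark-submit exit cleanly instead of hanging:
    sc.stop()
  }
}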
ps: this post reminds me of this https://xkcd.com/979/

Related

Loops not working in Scala for a Flink job

I have a Scala script that hits Flink to process (filter and aggregate) some data. However, before I hit the Flink process, I need to create a list to pass into the filter. To simplify the problem, let's just say I need to take a string, split it on commas, and return the long values. So my code looks like:
items.split(",")
  .map(item => item.trim.toLong) // trim each token and parse it as a Long
  .toSet
The unit test has no issue. However, when I run the job on Kubernetes, it hangs, and after a timeout it restarts. It seems like the .map(...) gets stuck (the split ran fine when I separated split and map and added a log in between). No error is thrown, though...
I tried forgoing .map(...) and used for, foreach, and while, but all of them get stuck and restart. I even replaced item.trim.toLong with just 1L (simple enough?) and it still gets stuck.
Any idea what's going on?
I'm using Flink version 1.13.5.
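For context, here's a rough sketch of the shape of the pipeline, with hypothetical names and inline source data (the streaming API is used here only for illustration):

import org.apache.flink.streaming.api.scala._

// Hypothetical reconstruction - "items", the element source, and the job name are placeholders.
val items = "1, 2, 3"
val allowed: Set[Long] = items.split(",").map(_.trim.toLong).toSet

val env = StreamExecutionEnvironment.getExecutionEnvironment
env
  .fromElements(1L, 2L, 5L)           // stand-in for the real source
  .filter(id => allowed.contains(id)) // the set is captured in the filter's closure
  .print()

env.execute("filter-job")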

kubernetes+GKE / status is now: NodeHasDiskPressure

I've read a bit through here (https://kubernetes.io/docs/admin/out-of-resource/) without ending up with a clear understanding; I'm trying here to gather more info about what actually happens.
We run two n1-standard-2 instances with a 300 GB disk attached.
More specifically, a "nodefs.inodesFree" problem seems to be reported, and this one is quite unclear. It seems to happen during builds (when the image is being created). Should we understand that it takes too much space on disk? What would be the most obvious reason?
It feels like it is not tied to the CPU/memory requests/limits that can be specified on a node, but still, as we've "overcommitted" the limits, can that have any impact regarding this issue?
Thanks for sharing your experience on this one
Could you run df -i on the affected node please?

Odd failures with PS v2 remoting

I have a moderately complex script made up of a PS1 file that does Import-Module on a number of PSM1 files, and includes a small number of global variables that define state.
I have it working as a regular script, and I am now trying to implement it for Remoting, and running into some very odd issues. I am seeing a ton of .NET runtime errors with eventID of 0, and the script seems to work intermittently, with time between attempts seeming to affect results. I know that isn't a very good description of the problem, but I haven't had a chance to test more deeply, and I am just wondering if I am perhaps pushing PowerShell v2 further than it can really handle, trying to do remoting with a complex and large script like this? Or does this look more like something I have wrong in code and once I get that sorted I will get consistent script processing? I am relatively new to PowerShell and completely new to remoting.
The event data is
.NET Runtime version: 2.0.50727.5485 - Application Error
Application has generated an exception that could not be handled. Process ID=0x52c (1324), Thread ID=0x760 (1888).
Click OK to terminate the application. Click CANCEL to debug the application.
Which doesn't exactly provide anything meaningful. Also, rather oddly, if I clear the event log, it seems like I have a better chance of not having an issue. Once there are errors in the event log the chance of the script failing is much higher.
Any thoughts before I dig into troubleshooting are much appreciated. And suggestions on best practices when troubleshooting Remote scripts also much appreciated.
One thing about v2 remoting is that the shell memory limit is set pretty small (150 MB). You might try bumping that up to, say, 1 GB like so:
Set-Item WSMan:\localhost\shell\MaxMemoryPerShellMB 1024 -force

Eclipse heap space (out of memory error)

I am facing a memory issue in Eclipse. Initially I was getting this error: ‘Unhandled event loop exception java heap space’ and also sometimes ‘An out of memory error has occurred’.
I somehow managed to increase my heap size up to -Xmx990m, but it's still not working. When I try to increase the heap size beyond that, I get the error ‘Unable to create virtual machine’ while starting Eclipse.
I tried to make other changes in the eclipse.ini file. When I change -XX:MaxPermSize, it gives me a ‘permGen memory error’. A few times I got other kinds of errors, like ‘Unhandled event loop exception GC overhead limit exceeded’ and 2-3 more different types. Any help on what can be done would be great!
Jeshurun's somewhat flippant comment about buying more RAM is actually fairly accurate. Eclipse is a memory HOG! On my machine right now Eclipse is using 2.1GB; no joke. If you want to be able to use Eclipse really effectively, with all the great features, you really need lots of memory.
That being said, there are ways to use Eclipse with less memory. The biggest helper I've found is disabling ALL validators (check "Suspend all validators" under Window>Preferences>Validation; just disabling the individual ones doesn't help enough). Another common source of memory-suckage is plugins. If you're going to stay at your current memory limit, I strongly recommend that you:
Uninstall your current Eclipse
Download the core/standalone/just-Java version of Eclipse (the one with the smallest file size and no plug-ins built in)
Try using just that for a while, and see how the performance is. If it's OK, try installing the plug-ins you like, one at a time. Never install multiple at once, and give each one a week or two of trial.
You'll likely find that some plug-ins dramatically increase memory usage; don't use those (or if you do, get more RAM).
Hope that helps.
I also faced the same problem. I resolved it by doing the build with the following steps:
--> Right-click the project and select Run As -> Run Configurations.
--> Select your project as the base directory.
--> In place of goals, give eclipse:eclipse install.
--> In the second tab, give -Xmx1024m as the VM arguments.
I faced a similar situation. My program had to run a simulation for 10,000 trials.
I tried -Xmx1024m: it still crashed.
Then I realized that my program had too much to print to the console, so the console-display buffer might have been running out of memory.
Simple solution: right-click the console > Preferences > check "Limit console output" > enter the buffer size in characters (default: 80000).
I had unchecked it to analyze a single run, but when the final run had 10,000 trials, it started to crash past 500 trials.
Today was the day: more than once I have thought about how programming in Java lets me skip the whole job of memory deallocation, and cursed C for it. And here I am, having spent the last 2 1/2 hours finding out how to force GC and how to deallocate a variable (by the way, none of it was required).
Have a good day!

perl File::Tail synchronization

I'm having this situation:
I'm parsing some log files with a Perl daemon. This daemon writes data to a MySQL DB.
The log file can:
be rotated ('solved by file size and some logic')
not exist (the 'ignore_nonexistant' parameter in Tail)
The daemon:
Can be killed
Can die for some reason.
I'm using File::Tail to tail the file. For file rotation, the creation date or the file size can help. But what mechanism should I use to start tailing from some position in the file? (Assume there are a lot of such daemons and no write access to the filesystem.)
I've thought about a position variable in the DB, but this won't help me.
Maybe some mechanism to pass a position parameter to the parent process?
I just don't want to reinvent the wheel.
File::Tail already detects rotation and continues reading from the new file.
To deal with the daemon dying and restarting, can you query the database for the last record written when the daemon restarts, and just skip logfile lines until you get to a later line?
Try http://search.cpan.org/dist/Log-Unrotate/.
You'll have to implement your own Log::Unrotate::Cursor class if you wish to store position files in a DB instead of the local filesystem, but that should be trivial.
We wrote Log::Unrotate and have used it for 5 years in production, and it tries really hard to never skip any data. (It tries so hard that it throws an exception if your cursor becomes invalid, for example if the log got rotated several times while the reader wasn't working for some reason. You may want to enable the autofix_cursor option to change this behavior.)
Also take a look at http://search.cpan.org/dist/File-LogReader/. I've never used it, but it's supposed to solve the same task.