How to reduce the verbosity of Spark's runtime output? - scala

How can I reduce the amount of trace info the Spark runtime produces?
The default is too verbose; how do I turn it off, and turn it back on when I need it?
Thanks
Verbose mode
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
15/01/28 09:57:24 INFO SparkContext: Starting job: collect at <console>:15
15/01/28 09:57:24 INFO DAGScheduler: Got job 3 (collect at <console>:15) with 1 output
...
15/01/28 09:57:24 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/01/28 09:57:24 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 626 bytes result sent to driver
15/01/28 09:57:24 INFO DAGScheduler: Stage 3 (collect at <console>:15) finished in 0.002 s
15/01/28 09:57:24 INFO DAGScheduler: Job 3 finished: collect at <console>:15, took 0.020061 s
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Silent mode (expected)
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)

Spark 1.4.1
sc.setLogLevel("WARN")
From the comments in the source code:
Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
Spark 2.x - 2.3.1
sparkSession.sparkContext().setLogLevel("WARN")
Spark 2.3.2
sparkSession.sparkContext.setLogLevel("WARN")
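For example, in spark-shell the level can be toggled interactively (a minimal sketch; sc is the SparkContext the shell already provides):
// spark-shell sketch: quieten the output, run a job, then restore it
sc.setLogLevel("WARN")    // only WARN and above from here on
val la = sc.parallelize(List(12, 4, 5, 3, 4, 4, 6, 781))
la.collect                // prints the result without the INFO noise
sc.setLogLevel("INFO")    // turn the detailed output back on when needed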

Quoting from the 'Learning Spark' book:
You may find the logging statements that get printed in the shell distracting. You can control the verbosity of the logging. To do this, you can create a file in the conf directory called log4j.properties. The Spark developers already include a template for this file called log4j.properties.template. To make the logging less verbose, make a copy of conf/log4j.properties.template called conf/log4j.properties and find the following line:
log4j.rootCategory=INFO, console
Then lower the log level so that we only show WARN messages and above by changing it to the following:
log4j.rootCategory=WARN, console
When you re-open the shell, you should see less output.

Logging configuration at the Spark app level
With this approach there is no need for a code change in the cluster for a Spark application.
Create a new file log4j.properties from log4j.properties.template, then change the verbosity with the log4j.rootCategory property.
Say we only need to see ERRORs for a given jar; then set log4j.rootCategory=ERROR, console
The spark-submit command would be:
spark-submit \
... # Other Spark props go here
--files prop/file/location \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
jar/location \
[application arguments]
Now you will see only the logs that are categorised as ERROR.
Plain Log4j way without Spark (but needs a code change)
Set logging for the org and akka packages to ERROR:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
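For these calls to take effect before Spark starts logging, place them at the top of main, before the SparkContext is created. A minimal standalone sketch (the QuietApp name and the toy job are illustrative):
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object QuietApp {
  def main(args: Array[String]): Unit = {
    // Silence Spark's (org.*) and Akka's internal chatter before the context starts logging
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val sc = new SparkContext(new SparkConf().setAppName("QuietApp").setMaster("local[*]"))
    println(sc.parallelize(1 to 10).sum())   // only your own output and ERRORs appear
    sc.stop()
  }
}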

If you are invoking a command from a shell, there is a lot you can do without changing any configurations. That is by design.
Below are a couple of Unix examples using pipes, but you could do similar filters in other environments.
To completely silence the log (at your own risk)
Pipe stderr to /dev/null, i.e.:
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 2> /dev/null
To ignore INFO messages
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 | awk '{if ($3 != "INFO") print $0}'

Related

SparkSession.read.csv from S3 gives java.lang.OutOfMemoryError: Java heap space (Command exiting with ret '137')

I have a spark job that I stripped down completely to:
spark.read.option("delimiter", delimiter)
  .schema(Encoders.product[MyData].schema)
  .csv("s3://bucket/data/*/*.gz")
  .as[MyData]
to isolate the error and it's still giving me a java.lang.OutOfMemoryError when running on AWS EMR on YARN. The total file size is approximately 4.7 GB gzipped (each partition file is approximately 1 to 20 kB); total number of rows = 373 063 082.
MyData (obfuscated) schema:
case class MyData(field1: Long, field2: String, field3: Int, field4: Float, field5: Float, field6: Option[Int] = None, field7: Option[Int])
The strange thing is that the job completely works in all of the following cases:
On a much larger dataset (70 GB gzipped); each individual partition file is approximately the same size as the partition files in the smaller dataset.
On each half of the files individually; i.e. I ran one job on s3://bucket/data/2017*/*.gz, and another on s3://bucket/data/2018*/*.gz, and both succeed.
On my local machine using master("local[*]"). The only difference is on the cluster it uses YARN (tested with MASTER: 1 x m4.2xlarge, CORE: 25 x m4.2xlarge, TASK: 25 x m4.2xlarge, and with smaller configurations): they all failed.
In the stderr logs I get:
[Stage 0:===============================================> (9536 + 411) / 10000]
[Stage 0:=================================================>(9825 + 175) / 10000]
[Stage 0:==================================================>(9964 + 36) / 10000]
[Stage 0:===================================================>(9992 + 8) / 10000]
[Stage 0:===================================================>(9997 + 3) / 10000]
[Stage 0:===================================================>(9998 + 2) / 10000]
[Stage 0:===================================================>(9999 + 1) / 10000]
Command exiting with ret '137'
And then the Spark-UI freezes around 9999/10000.
I also ran on s3://bucket/data/201[7-8]*/*.gz to see if it was a regex issue that was capturing more files than it was supposed to. It ended up giving the same errors.
Finally, I also checked Ganglia to try to figure out what was going on, and didn't really see anything that caught my eye.
My cluster deployment command (information starred out):
aws emr create-cluster --name $NAME --release-label emr-5.12.0 \
--log-uri s3://bucket/logs/ \
--instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m4.xlarge}'] \
InstanceFleetType=CORE,TargetSpotCapacity=25,InstanceTypeConfigs=['{InstanceType=m4.xlarge,BidPrice=0.2,WeightedCapacity=1}'],LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
InstanceFleetType=TASK,TargetSpotCapacity=25,InstanceTypeConfigs=['{InstanceType=m4.xlarge,BidPrice=0.2,WeightedCapacity=1}'],LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
--ec2-attributes KeyName="*****",SubnetId=subnet-d******* --use-default-roles \
--applications Name=Spark Name=Ganglia \
--steps Type=CUSTOM_JAR,Name=CopyAppFromS3,ActionOnFailure=CONTINUE,Jar="command-runner.jar",Args=[aws,s3,cp,s3://bucket/assembly-0.1.0.jar,/home/hadoop] \
Type=Spark,Name=MyApp,ActionOnFailure=CONTINUE,Args=[/home/hadoop/assembly-0.1.0.jar] --configurations file://$CONFIG_FILE --auto-terminate
I'd like to understand why Spark can't read the smaller dataset when it can read one that's 15x larger (same cluster configuration), and why it runs on my local machine but not on AWS, and finally why it runs on both halves separately but not together. What kind of data could cause this? What can I do to solve this problem or avoid it in the future?
EDIT: Local machine is a MacBook Pro Retina 15-inch 2015 with 2.8 GHz Intel Core i7, 16 GB RAM, and 1 TB SSD.
EDIT2: I also got this in stderr once:
18/05/28 16:15:49 ERROR SparkContext: Exception getting thread dump from executor 1
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.SparkContext.getExecutorThreadDump(SparkContext.scala:607)
at org.apache.spark.ui.exec.ExecutorThreadDumpPage.render(ExecutorThreadDumpPage.scala:40)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:524)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)

sbt file doesn't recognize the spark input

I am trying to execute Scala code in Spark. An example of the code and the build.sbt file can be found here.
There is one difference to this example: I am already using Spark 2.0.0 (I have downloaded this version locally and defined its path in the .bashrc file). I have also modified my build.sbt file and set the version to 2.0.0.
After that I get error messages.
Case 1:
I just executed the code of SparkMeApp as given in the link. I got an error message saying that I have to call the setMaster function.
16/09/05 19:37:01 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
Case 2:
I defined the setMaster function with different arguments and got the following error messages:
Input: setMaster("spark://<username>:7077) or setMaster("local[2]")
Error:
[error] (run-main-0) java.lang.ArrayIndexOutOfBoundsException: 0
java.lang.ArrayIndexOutOfBoundsException: 0
(this error means that my string is empty)
In other cases I just get this error: 16/09/05 19:44:29 WARN
StandaloneAppClient$ClientEndpoint: Failed to connect to master <...>
org.apache.spark.SparkException: Exception thrown in awaitResult
Additionally, I have only a little experience with Scala and sbt, so my sbt is probably configured incorrectly. Can somebody please tell me the right way?
This is how your minimal build.sbt will look:
name := "SparkMe Project"
version := "1.0"
scalaVersion := "2.11.7"
organization := "pl.japila"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
And here is your SparkMeApp object:
// assumes spark-core on the classpath (see build.sbt above)
import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkMe Application")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val fileName = args(0)
    val lines = sc.textFile(fileName).cache
    val c = lines.count
    println(s"There are $c lines in $fileName")
  }
}
Execute it like:
$ sbt "run [your file path]"
@Abhi, thank you very much for your answer. In general it works. Anyway, I get some error messages after the correct execution of the code.
I have created a test txt file with 4 lines:
test file
test file
test file
test file
In the SparkMeApp I have changed the code line to:
val fileName = "/home/usr/test.txt"
After I execute run SparkMeApp.scala I get the following output:
16/09/06 09:15:34 INFO DAGScheduler: Job 0 finished: count at SparkMeApp.scala:11, took 0.348171 s
There are 4 lines in /home/usr/test.txt
16/09/06 09:15:34 ERROR ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:175)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1229)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:172)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:67)
16/09/06 09:15:34 ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:67)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1229)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64)
16/09/06 09:15:34 INFO SparkUI: Stopped Spark web UI at http://<myip>:4040
[success] Total time: 7 s, completed Sep 6, 2016 9:15:34 AM
I can see the correct output of my code (second line), but after it I get the interrupt error. How can I fix it? Anyway, I hope the code is working correctly.
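One common cause of this kind of shutdown-time InterruptedException under sbt run is that the SparkContext is never stopped before the JVM exits. A hedged sketch of SparkMeApp with an explicit sc.stop() at the end of main (same code as above, otherwise unchanged):
import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val fileName = args(0)
    val lines = sc.textFile(fileName).cache
    println(s"There are ${lines.count} lines in $fileName")
    sc.stop()   // stop the context so the ContextCleaner and SparkListenerBus threads exit cleanly
  }
}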

Spark Pipe example

I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following code works from the command line
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
In order to solve this for Hadoop streaming I would just use the --files attribute, so I tried the same thing for Spark. I start Spark with the following command
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize this, but it appears that the file paths are different when running Spark in local and cluster mode. When running Spark without --master the paths to the pipe command are relative to the local machine. When running Spark with --master the paths to the pipe command are ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD the command string is evaluated on the driver and then passed to the worker. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be ./ because SparkContext.addFile() should put that file on ./ relative to where each worker is run from. But I'm so sour on .pipe now that I've taken .pipe out of my code entirely in favor of .mapPartitions in combination with a PipeUtils object that I wrote here. This is actually more efficient because I only have to incur the script startup costs once per partition instead of once per example.
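For reference, the addFile approach described above would look roughly like this in spark-shell (a sketch, assuming the same hdfsLocation and hdfsPreprocessedLocation values as in the question, and that preprocess.py is executable, reading stdin and writing stdout):
sc.addFile("preprocess.py")     // ships the script to each executor's working directory
sc.textFile(hdfsLocation)
  .pipe("./preprocess.py")      // the command is resolved on the executor, relative to that directory
  .saveAsTextFile(hdfsPreprocessedLocation)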

ScalaTest in Intellij does not print out console messages

I am running Spark tests that use ScalaTest. They are very chatty on the command line using the following command (as an aside, the -Dtest= is apparently ignored; all core tests are being run):
mvn -Pyarn -Phive test -pl core -Dtest=org.apache.spark.MapOutputTrackerSuite
There are thousands of lines of output, here is a taste:
7:03:30.251 INFO org.apache.spark.scheduler.TaskSetManager: Finished TID 4417 in 23 ms on localhost (progress: 4/4)
17:03:30.252 INFO org.apache.spark.scheduler.TaskSchedulerImpl: Removed TaskSet 38.0, whose tasks have all completed, from pool
17:03:30.252 INFO org.apache.spark.scheduler.DAGScheduler: Completed ResultTask(38, 3)
17:03:30.252 INFO org.apache.spark.scheduler.DAGScheduler: Stage 38 (apply at Transformer.scala:22) finished in 0.050 s
17:03:30.288 INFO org.apache.spark.ui.SparkUI: Stopped Spark web UI at http://localhost:4041
17:03:30.289 INFO org.apache.spark.scheduler.DAGScheduler: Stopping DAGScheduler
However, in IntelliJ only test pass/fail results are printed out. So how can I view the same chatty INFO-level output as on the command line?
The log4j.properties was not on the classpath. The way I fixed this:
(a) create a log4j.properties inside the test/resources folder
(b) the following log4j.properties file worked for me:
# Log everything at DEBUG level to the console
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Ignore messages below warning level from Jetty, because it's a bit verbose
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
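As a quick way to confirm the file is picked up, a test that starts a local SparkContext should now show the same chatty output in IntelliJ's test console (a minimal sketch, assuming spark-core and ScalaTest are on the test classpath; the suite name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class LoggingSmokeTest extends FunSuite {
  test("a local Spark job emits log output") {
    // With test/resources/log4j.properties on the classpath, the scheduler's
    // INFO/DEBUG logging shows up in IntelliJ's test console as well.
    val sc = new SparkContext(new SparkConf().setAppName("logging-smoke").setMaster("local[2]"))
    try {
      assert(sc.parallelize(1 to 100).sum() == 5050.0)
    } finally {
      sc.stop()
    }
  }
}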

How to use Hadoop streaming input parameter for matlab shell script

I want to execute my MATLAB code in Hadoop Streaming. My question is how to use the Hadoop Streaming input parameter value as the input for my MATLAB script. For example,
this is my MATLAB file imreadtest.m (simple code):
rgbImage = imread('/usr/new.jpg');
imwrite(rgbImage,'/usr/OT/testedimage1.jpg');
My shell script is:
#!/bin/sh
matlabbg imreadtest.m -nodisplay
Normally this works well on my Ubuntu machine (not in Hadoop). I have stored these two files in my HDFS using Hue. Now my MATLAB script looks like this (imrtest.m):
rgbImage = imread(STDIN);
imwrite(rgbImage,STDOUT);
My shell script (imrtest.sh) is:
#!/bin/sh
matlabbg imrtest.m -nodisplay
I have tried to execute this with Hadoop Streaming:
hadoop#xxx:/usr/local/master/hadoop$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -mapper /usr/OT/imrtest.sh -file /usr/OT/imrtest.sh -input /usr/OT/testedimage.jpg -output /usr/OT/opt
But I got an error like this:
packageJobJar: [/usr/OT/imrtest.sh, /usr/local/master/temp/hadoop- unjar4018041785380098978/] [] /tmp/streamjob7077345699332124679.jar tmpDir=null
14/03/06 15:51:41 WARN snappy.LoadSnappy: Snappy native library is available
14/03/06 15:51:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/06 15:51:41 INFO snappy.LoadSnappy: Snappy native library loaded
14/03/06 15:51:41 INFO mapred.FileInputFormat: Total input paths to process : 1
14/03/06 15:51:42 INFO streaming.StreamJob: getLocalDirs(): [/usr/local/master/temp/mapred/local]
14/03/06 15:51:42 INFO streaming.StreamJob: Running job: job_201403061205_0015
14/03/06 15:51:42 INFO streaming.StreamJob: To kill this job, run:
14/03/06 15:51:42 INFO streaming.StreamJob: /usr/local/master/hadoop/bin/hadoop job -Dmapred.job.tracker=slave3:8021 -kill job_201403061205_0015
14/03/06 15:51:42 INFO streaming.StreamJob: Tracking URL: http://slave3:50030/jobdetails.jsp?jobid=job_201403061205_0015
14/03/06 15:51:43 INFO streaming.StreamJob: map 0% reduce 0%
14/03/06 15:52:15 INFO streaming.StreamJob: map 100% reduce 100%
14/03/06 15:52:15 INFO streaming.StreamJob: To kill this job, run:
14/03/06 15:52:15 INFO streaming.StreamJob: /usr/local/master/hadoop/bin/hadoop job -Dmapred.job.tracker=slave3:8021 -kill job_201403061205_0015
14/03/06 15:52:15 INFO streaming.StreamJob: Tracking URL: http://slave3:50030/jobdetails.jsp?jobid=job_201403061205_0015
14/03/06 15:52:15 ERROR streaming.StreamJob: Job not successful. Error: NA
14/03/06 15:52:15 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
The JobTracker error log for this job is:
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Thu Mar 06 15:51:51 IST 2014
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:297)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.streaming.Pipe
java.io.IOException: log:null
.
.
.
Please suggest how to get the Hadoop Streaming input into my MATLAB script's input, and similarly for the output.