We have been trying to use documents4j to convert docx to pdf.
With LocalConverter it runs perfectly as expected, but when I try the RemoteConverter I get the error "com.documents4j.throwables.ConversionInputException: The input file seems to be corrupt". The same file works with LocalConverter on the same machine.
To run the RemoteConverter:
java -jar **\Downloads\documents4j-server-standalone-1.1.3-shaded.jar http://127.0.0.1:9998 -log **\Downloads\Documents4jlog.txt -level DEBUG
java -jar **\Downloads\documents4j-client-standalone-1.1.3-shaded.jar http://127.0.0.1:9998 -log **\Downloads\Documents4jlogClient.txt
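For context, the programmatic remote conversion is requested through the client API roughly like this (a minimal sketch in Scala, not our exact code; the paths are placeholders and DocumentType.DOCX is assumed to be available in this version — the Java calls are the same):

import java.io.File
import com.documents4j.api.{DocumentType, IConverter}
import com.documents4j.job.RemoteConverter

val source = new File("input.docx")   // placeholder path
val target = new File("output.pdf")   // placeholder path

// Point the client at the standalone conversion server started above.
val converter: IConverter = RemoteConverter.builder()
  .baseUri("http://127.0.0.1:9998")
  .build()

try {
  // Declaring the real source format matters: MS_WORD means legacy .doc,
  // so a .docx should be declared as DocumentType.DOCX (assumption: that
  // constant exists in the documents4j version in use).
  converter
    .convert(source).as(DocumentType.DOCX)
    .to(target).as(DocumentType.PDF)
    .execute()
} finally {
  converter.shutDown()
}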
Error reported on the client:
com.documents4j.throwables.ConversionInputException: The sent input is invalid
Below is the server log:
2020-06-05 18:11:04,939 INFO [pool-3-thread-2] c.d.c.msoffice.MicrosoftWordBridge - Requested conversion from C:\Users\DIVYAL~2\AppData\Local\Temp\1591360694906-0\5b54b28b-b20d-4f41-9647-18b89f154c28\temp3 (application/msword) to C:\Users\DIVYAL~2\AppData\Local\Temp\1591360694906-0\5b54b28b-b20d-4f41-9647-18b89f154c28\temp4 (application/pdf)
2020-06-05 18:11:04,939 DEBUG [pool-3-thread-2] org.zeroturnaround.exec.ProcessExecutor - Executing [cmd, /S, /C, ""C:\Users\DIVYAL~2\AppData\Local\Temp\1591360694906-0\word_convert1288062732.vbs" "C:\Users\DIVYAL~2\AppData\Local\Temp\1591360694906-0\5b54b28b-b20d-4f41-9647-18b89f154c28\temp3" "C:\Users\DIVYAL~2\AppData\Local\Temp\1591360694906-0\5b54b28b-b20d-4f41-9647-18b89f154c28\temp4" "17""] in C:\Users\DIVYAL~2\AppData\Local\Temp\1591360694906-0.
2020-06-05 18:11:04,952 DEBUG [pool-3-thread-2] org.zeroturnaround.exec.ProcessExecutor - Started java.lang.ProcessImpl#8f30115
2020-06-05 18:11:05,189 DEBUG [WaitForProcess-java.lang.ProcessImpl#8f30115] org.zeroturnaround.exec.WaitForProcess - java.lang.ProcessImpl#8f30115 stopped with exit code -2
2020-06-05 18:11:05,196 INFO [pool-3-thread-2] c.d.w.e.AsynchronousConversionResponse - Sending exceptional response for org.glassfish.jersey.server.ServerRuntime$AsyncResponder#250891f5
com.documents4j.throwables.ConversionInputException: The input file seems to be corrupt
at com.documents4j.util.Reaction$ConversionInputExceptionBuilder.make(Reaction.java:159)
at com.documents4j.util.Reaction$ExceptionalReaction.apply(Reaction.java:75)
at com.documents4j.conversion.ExternalConverterScriptResult.resolve(ExternalConverterScriptResult.java:70)
at com.documents4j.conversion.ProcessFutureWrapper.evaluateExitValue(ProcessFutureWrapper.java:48)
at com.documents4j.conversion.ProcessFutureWrapper.get(ProcessFutureWrapper.java:36)
at com.documents4j.conversion.ProcessFutureWrapper.get(ProcessFutureWrapper.java:11)
at com.documents4j.job.AbstractFutureWrappingPriorityFuture.run(AbstractFutureWrappingPriorityFuture.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Kindly let me know if I'm missing anything.
Thanks in advance.
I get an "output location validation failed" exception in my pig script on EMR.
It fails when saving data back S3.
I use this simple script to narrow the problem:
REGISTER /home/hadoop/lib/mongo-java-driver-2.13.0.jar
REGISTER /home/hadoop/lib/mongo-hadoop-core-1.3.2.jar
REGISTER /home/hadoop/lib/mongo-hadoop-pig-1.3.2.jar
example = LOAD 's3://xxx/example-full.bson'
USING com.mongodb.hadoop.pig.BSONLoader();
STORE example INTO 's3n://xxx/out/example.bson' USING com.mongodb.hadoop.pig.BSONStorage();
This is the stack trace produced:
================================================================================
Pig Stack Trace
---------------
ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias example
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1637)
at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1091)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:543)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:317)
at org.apache.pig.PigServer.compilePp(PigServer.java:1382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
at org.apache.pig.PigServer.execute(PigServer.java:1299)
at org.apache.pig.PigServer.access$400(PigServer.java:124)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1632)
... 13 more
Caused by: org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 26 more
To set up the MongoDB Hadoop connector I used this bootstrap script:
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-core-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-pig-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-hive-1.3.2.jar
cp /home/hadoop/lib/mongo* /home/hadoop/hive/lib
cp /home/hadoop/lib/mongo* /home/hadoop/pig/lib
The error suggests that the output directory does not exist.
The obvious solution is to create the output directory before running the script.
For a quick check, you can also point the output at the same directory as the input. If the directory actually does exist, it may be a permissions issue.
How can I reduce the amount of trace info the Spark runtime produces?
The default is too verbose.
How do I turn it off, and turn it back on when I need it?
Thanks
Verbose mode
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
15/01/28 09:57:24 INFO SparkContext: Starting job: collect at <console>:15
15/01/28 09:57:24 INFO DAGScheduler: Got job 3 (collect at <console>:15) with 1 output
...
15/01/28 09:57:24 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/01/28 09:57:24 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 626 bytes result sent to driver
15/01/28 09:57:24 INFO DAGScheduler: Stage 3 (collect at <console>:15) finished in 0.002 s
15/01/28 09:57:24 INFO DAGScheduler: Job 3 finished: collect at <console>:15, took 0.020061 s
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Silent mode (expected)
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Spark 1.4.1 and later
sc.setLogLevel("WARN")
From the comments in the source code:
Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
Spark 2.x (Java API)
sparkSession.sparkContext().setLogLevel("WARN")
Spark 2.x (Scala API)
sparkSession.sparkContext.setLogLevel("WARN")
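With any of these you can toggle the verbosity at runtime. A small sketch reusing the example above in the spark-shell:

sc.setLogLevel("WARN")   // only WARN and above from here on
val la = sc.parallelize(List(12, 4, 5, 3, 4, 4, 6, 781))
la.collect()             // prints the result without the INFO chatter
sc.setLogLevel("INFO")   // turn the verbose output back on when needed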
Quoting from the 'Learning Spark' book:

You may find the logging statements that get printed in the shell distracting. You can control the verbosity of the logging. To do this, you can create a file in the conf directory called log4j.properties. The Spark developers already include a template for this file called log4j.properties.template. To make the logging less verbose, make a copy of conf/log4j.properties.template called conf/log4j.properties and find the following line:
log4j.rootCategory=INFO, console
Then lower the log level so that we only show WARN messages and above by changing it to the following:
log4j.rootCategory=WARN, console
When you re-open the shell, you should see less output.
Logging configuration at the Spark application level
With this approach, no code change is needed for the Spark application on the cluster.
Create a new log4j.properties file from log4j.properties.template, then change the verbosity with the log4j.rootCategory property. Say we want to see only the ERRORs of a given jar; then set log4j.rootCategory=ERROR, console.
The spark-submit command would be:
spark-submit \
... #Other spark props goes here
--files prop/file/location \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
jar/location \
[application arguments]
Now you will see only the logs that are categorised as ERROR.
Plain Log4j way without Spark APIs (but needs a code change)
Quiet the org and akka packages by raising their log level (ERROR here; Level.OFF would silence them entirely):
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
If you are invoking a command from a shell, there is a lot you can do without changing any configurations. That is by design.
Below are a couple of Unix examples using pipes, but you could do similar filters in other environments.
To completely silence the log (at your own risk)
Pipe stderr to /dev/null, i.e.:
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 2> /dev/null
To ignore INFO messages
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 | awk '{if ($3 != "INFO") print $0}'
I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following works from the command line:
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the Spark code above I get the following error:
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
In order to solve this for Hadoop Streaming I would just use the --files option, so I tried the same thing for Spark. I start Spark with the following command:
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize it, but it appears that the file paths are different when running Spark in local and cluster mode. When running Spark without --master, the paths to the pipe command are relative to the local machine. When running Spark with --master, the paths to the pipe command are ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD, the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be prefixed with ./ because SparkContext.addFile() puts the file in ./ relative to where each worker runs. But I'm so sour on .pipe now that I've taken it out of my code entirely in favor of .mapPartitions in combination with a PipeUtils object that I wrote here. This is actually more efficient because I only incur the script startup cost once per partition instead of once per example.
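For completeness, a minimal sketch of the ./ approach described above, in case someone still wants to use .pipe (assuming preprocess.py is executable on the workers and that hdfsLocation / hdfsPreprocessedLocation are defined as in the question):

sc.addFile("/local/path/to/preprocess.py")   // placeholder path; or ship it with --files on spark-submit
val processed = sc.textFile(hdfsLocation)
  .pipe("./preprocess.py")                   // the shipped copy appears as ./preprocess.py in each worker's working dir
processed.saveAsTextFile(hdfsPreprocessedLocation)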
I am running Spark tests that use ScalaTest. They are very chatty on the command line when using the following command (as an aside, the -Dtest= is apparently ignored; all core tests are being run):
mvn -Pyarn -Phive test -pl core -Dtest=org.apache.spark.MapOutputTrackerSuite
There are thousands of lines of output, here is a taste:
7:03:30.251 INFO org.apache.spark.scheduler.TaskSetManager: Finished TID 4417 in 23 ms on localhost (progress: 4/4)
17:03:30.252 INFO org.apache.spark.scheduler.TaskSchedulerImpl: Removed TaskSet 38.0, whose tasks have all completed, from pool
17:03:30.252 INFO org.apache.spark.scheduler.DAGScheduler: Completed ResultTask(38, 3)
17:03:30.252 INFO org.apache.spark.scheduler.DAGScheduler: Stage 38 (apply at Transformer.scala:22) finished in 0.050 s
17:03:30.288 INFO org.apache.spark.ui.SparkUI: Stopped Spark web UI at http://localhost:4041
17:03:30.289 INFO org.apache.spark.scheduler.DAGScheduler: Stopping DAGScheduler
However, in IntelliJ only the test pass/fail results are printed out. So how can I view the same chatty INFO-level output as on the command line?
The log4j.properties was not on the classpath. The way I fixed this:
(a) create a log4j.properties file inside the test/resources folder
(b) the following log4j.properties file worked for me:
# Log everything to the console at DEBUG level
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Ignore messages below warning level from Jetty, because it's a bit verbose
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO