Executing HDFS commands from inside a Scala script

I'm trying to execute an HDFS-specific command from inside a Scala script being executed by Spark in cluster mode. Below is the command:
import scala.sys.process._
val cmd = Seq("hdfs","dfs","-copyToLocal","/tmp/file.dat","/path/to/local")
val result = cmd.!!
The job fails at this stage with the error as below:
java.io.FileNotFoundException: /var/run/cloudera-scm-agent/process/2087791-yarn-NODEMANAGER/log4j.properties (Permission denied)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:262)
at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:108)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
However, when I run the same command separately in Spark shell, it executes just fine and the file is copied as well.
scala> val cmd = Seq("hdfs","dfs","-copyToLocal","/tmp/file_landing_area/file.dat","/tmp/local_file_area")
cmd: Seq[String] = List(hdfs, dfs, -copyToLocal, /tmp/file_landing_area/file.dat, /tmp/local_file_area)
scala> val result = cmd.!!
result: String = ""
I don't understand the permission denied error, even though it is reported as a FileNotFoundException. Totally confusing.
Any ideas?

As per the error, the process is trying to read a file under /var, which I suspect is a configuration issue or it is not pointing to the correct one.
Shelling out to the hdfs command via a Seq is not good practice. It is convenient in the Spark shell, but using the same approach in application code is not advisable. Instead, try the Hadoop FileSystem API below to move data to or from HDFS. Please check the sample code below, just for reference, which might help you.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val hdfspath = new Path("hdfs:///user/nikhil/test.csv")
val localpath = new Path("file:///home/cloudera/test/")
// Resolve the FileSystem for the HDFS path, then copy the file down to the local directory
val fs = hdfspath.getFileSystem(conf)
fs.copyToLocalFile(hdfspath, localpath)
Please use the link below for more reference on the Hadoop FileSystem API.
https://hadoop.apache.org/docs/r2.9.0/api/org/apache/hadoop/fs/FileSystem.html#copyFromLocalFile(boolean,%20boolean,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.Path)
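The linked overload goes in the other direction (local to HDFS). A minimal sketch of that call, assuming the same conf and fs as above and using example paths; delSrc controls whether the local source is deleted and overwrite controls whether an existing HDFS file is replaced:
val localsrc = new Path("file:///home/cloudera/test/test.csv")
val hdfsdst = new Path("hdfs:///user/nikhil/")
// delSrc = false keeps the local copy; overwrite = true replaces any existing target
fs.copyFromLocalFile(false, true, localsrc, hdfsdst)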

Related

Spark Pipe function throws No such file or directory

I am running the Spark pipe function on the EMR master server in the REPL, just to test out the pipe functionality. I am using the following examples:
https://stackoverflow.com/a/32978183/8876462
http://blog.madhukaraphatak.com/pipe-in-spark/
http://hadoop-makeitsimple.blogspot.com/2016/05/pipe-in-spark.html
This is my code:
import org.apache.spark._
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "PipeEx.sh"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
I have tried different things, like making the file executable and placing the file in /usr/lib/spark/bin as suggested in another post. I also changed distScript to
"file:///home/hadoop/PipeEx.sh"
I always get "No such file or directory" for the tmp/spark*/userFiles* location. I have tried to access and run the shell program directly from that tmp location and it runs fine.
My shell script is the same as http://blog.madhukaraphatak.com/pipe-in-spark/
Here is the first part of the log:
[Stage 9:> (0 + 2) / 2]
18/03/19 19:58:22 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 72, ip-172-31-42-11.ec2.internal, executor 9): java.io.IOException: Cannot run program "/mnt/tmp/spark-bdd582ec-a5ac-4bb1-874e-832cd5427b18/userFiles-497f6051-6f49-4268-b9c5-a28c2ad5edc6/PipeEx.sh": error=2, No such file or directory
Does anyone have any idea? I am using Spark 2.2.1 and Scala 2.11.8.
Thanks
I was able to solve this once I removed the
SparkFiles.get(distScriptName)
call.
So my final code looks like this:
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "./PipeEx.sh"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(distScriptName)
opData.collect().foreach(println)
I am not very sure why removing the SparkFiles.get() call solved the problem. My guess is that the path returned by SparkFiles.get() on the driver points at the driver's own userFiles directory, which does not exist on the executors, whereas "./PipeEx.sh" resolves against each task's working directory, where addFile has already placed the script.
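A small sketch (my addition, not part of the original answer) to verify where addFile actually places the script on each executor; SparkFiles.get called inside a task resolves against that executor's own userFiles directory:
import org.apache.spark.SparkFiles
// Each task reports its own SparkFiles root and the resolved path of the distributed script
sc.parallelize(1 to 2).map { _ =>
  s"${SparkFiles.getRootDirectory()} -> ${SparkFiles.get("PipeEx.sh")}"
}.collect().foreach(println)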

Error while submitting a spark scala job using spark-submit

I have written a simple app in scala using Eclipse -> New Scala Project.
I am using Scala 2.10.6 and Spark 2.0.2. The app is compiling without error and I also exported the jar file.
I am using the following command to execute the JAR
spark-submit TowerTest.jar --class com.IFTL.EDI.LocateTower MobLocationData Output1
The Scala code snippet is as follows:
package com.IFTL.EDI
import scala.math.pow
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object LocateTower {
def main(args: Array[String]){
//create Spark context with Spark configuration
val sc = new SparkContext(new SparkConf().setAppName("TowerCount"))
//helper function to add locations, used in finding the tower centroid
def addLocations(p1: (Double,Double), p2: (Double,Double)) ={
(p1._1 + p2._1,p1._2 + p2._2)
}
}
This is not the full code. When I run this, I get the following error:
[cloudera@quickstart ~]$ spark-submit --class com.IFTL.EDI.LocateTower TowerTest.jar MobLocationData LocationOut1
java.lang.ClassNotFoundException: com.IFTL.EDI.LocateTower
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:176)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am new to Spark and Scala, so I'm not sure what I'm missing.
Try this order instead; if you want to pass some parameters, you can put them after the jar file. Also make sure you specify the path to the jar, or run the command from the jar's location:
spark-submit --class com.IFTL.EDI.LocateTower /Users/myJarFilePath/TowerTest.jar
Try it like that first; once you have it working, you can add the command-line arguments.
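For example, with the two arguments from the question appended after the jar (the jar path is illustrative):
spark-submit --class com.IFTL.EDI.LocateTower /Users/myJarFilePath/TowerTest.jar MobLocationData Output1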

spark - hadoop argument

I am running both Hadoop and Spark, and I want to use files from HDFS as an argument to spark-submit, so I made a folder in HDFS with the files,
e.g. /user/hduser/test/input
and I want to run spark-submit like this:
$SPARK_HOME/bin/spark-submit --master spark://admin:7077 ./target/scala-2.10/test_2.10-1.0.jar hdfs://user/hduser/test/input
but I can't make it work. What's the right way to do it?
The error I am getting is:
WARN FileInputDStream: Error finding new files
java.lang.NullPointerException
Check whether you are able to access HDFS from Spark code. If yes, then you need to add the following imports to your Scala code:
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkFiles
Then add the following lines in your code:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fileSystem = FileSystem.get(hadoopConf)
val path = new Path(args(0))
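As a quick sanity check (a sketch, assuming the imports and values above), you can confirm that the argument actually resolves to an existing HDFS path before handing it to the rest of the job:
// Prints the fully qualified path if it exists, so you can see which file system it resolved against
if (fileSystem.exists(path))
  println(s"Found input at ${fileSystem.makeQualified(path)}")
else
  println(s"Path does not exist: $path")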
Actually the problem was the path: I had to use hdfs://localhost:9000/user/hduser/...

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in a local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._
spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
.show()
But it fails at runtime with the exception stack trace below:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies. But I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well, but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on hadoop-hdfs 2.7.2 with Scala 2.11.
EDIT: One additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised.
https://issues.apache.org/jira/browse/SPARK-15899
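A workaround that is often suggested for this symptom (my assumption, not from the original answer; the directory below is just an example) is to pass a well-formed file: URI for the warehouse directory when building the session, so the catalog does not construct the malformed file:C:/... path:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master(masterName)
  .appName(appName)
  // Explicit, well-formed file: URI for the warehouse location
  .config("spark.sql.warehouse.dir", "file:///C:/test/sampleApp/spark-warehouse")
  .getOrCreate()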

Finding Scala libraries location from within Scala program

I'm trying to make one Scala program spawn another Scala program. I managed to obtain the java executable from System.getProperty("java.home"), I've obtained some path from System.getProperty("java.class.path") (the sbt-launcher.jar location), and with a ClassLoader I've got the project/target/scala-2.11/classes directory.
However, I am still unable to run it. The JVM complains that it is unable to find the Scala library's classes:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/concurrent/ExecutionContext
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: scala.concurrent.ExecutionContext
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
I am looking for a way to add those files to the classpath, but I want it to be portable. I am not looking for solutions like hardcoding the Scala location on the local computer, nor do I want to use environment variables or parameters other than the ones that already exist. I also don't want to rely on SBT or Activator being present in the user's environment.
Since the parent JVM process can use them, their location has to be stored somewhere, and I'll be thankful for help finding that location.
To successfully spawn one Scala App from another I had to fix several issues with my code:
1. correct main class:
object ChildApp extends App {
println("success")
}
To make sure that ChildApp is runnable by Java, it has to be an object. Scala has no concept of static, but object methods are (and main will be) compiled into static methods.
2. correct class name:
While ChildApp.getClass.getName returns ChildApp$, that name refers to the object (so that we can pass the otherwise static-method-only class around). Java expects the name without the $ on the command line; in other words, I had to remove the trailing $ before passing it into the process builder.
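A one-line sketch of that step (assuming ChildApp from above; mainName is the value handed to the process builder later on):
val mainName = ChildApp.getClass.getName.stripSuffix("$") // "ChildApp"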
3. complete class path
I haven't found all the used JARs within System.getProperty("java.class.path"):
val pcp = System getProperty "java.class.path" split File.pathSeparator // sbt-launcher.jar only
I haven't found them in SystemClassLoader either:
val scp = ClassLoader.getSystemClassLoader.asInstanceOf[URLClassLoader].getURLs.map(_.toString) // same as above
I did find the compiled files from my project using the Class' resources:
// format like jar:file:/(your-project/compiled.jar)!some/package/ChildApp.class
lazy val jarClassPathPattern = "jar:(file:)?([^!]+)!.+".r
// format like file:/(your-project/compiled/some/package/ChildApp).class
lazy val fileClassPathPattern = "file:(.+).class".r
val jcp = jarClassPathPattern.findFirstMatchIn(pathToClass).map { matcher =>
  val jarDir = Paths.get(matcher.group(2)).getParent
  s"${jarDir}/*"
}.toSet
val fcp = fileClassPathPattern.findFirstMatchIn(pathToClass).map { matcher =>
  val suffix = "/" + clazz.getName
  val fullPath = matcher.group(1)
  fullPath.substring(0, fullPath.length - suffix.length)
}.toList
Finally I found where all those dependencies were stored:
// use App class' ClassLoader instead of system one
val lcp = ChildApp.getClass.getClassLoader.asInstanceOf[URLClassLoader].getURLs.map(_.toString)
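One way (my sketch, assuming the entries in lcp above are file: URLs) to turn those URLs into the single classPath string used further below:
import java.io.File
import java.net.URI
import java.nio.file.Paths

// Convert each file: URL to a local path and join them with the platform's path separator
val classPath = lcp
  .map(url => Paths.get(new URI(url)).toString)
  .mkString(File.pathSeparator)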
4. bonus - JVM params and java location
val jvmArgs = ManagementFactory.getRuntimeMXBean.getInputArguments.toList
lazy val javaHome = System getProperty "java.home"
lazy val java = Seq(
  Paths.get(javaHome, "bin", "java"),
  Paths.get(javaHome, "bin", "java.exe")
).filter(Files.exists(_)).head
Then you have everything you need for ProcessBuilder / Process:
val executable = java.toString
val arguments = jvmArgs ++ List("-cp", classPath, mainName) ++ mainClassArguments
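For completeness, a minimal sketch (my addition, assuming the executable, arguments, and mainClassArguments values above) of actually starting the child JVM with those pieces:
import scala.collection.JavaConverters._

// inheritIO forwards the child's stdout/stderr to the parent process for easy debugging
val process = new ProcessBuilder((executable +: arguments).asJava)
  .inheritIO()
  .start()
process.waitFor()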
PS. I checked several times: those additional JARs are passed neither via the CLASSPATH environment variable nor with the -cp parameter (sbt-launcher.jar's MANIFEST file didn't have anything either). So if anyone knows how they are passed and why my solution actually works, please explain.