I keep getting this error:
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs:/filename.txt
I have set up a standalone Spark cluster and I am trying to run this code on my master node.
val conf = new SparkConf()
  .setAppName("Recommendation Engine1")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)
val rawUserArtistData = sc.textFile("hdfs:/user_artist_data.txt").sample(false, 0.05)
On my terminal I run:
spark-submit --class com.latentview.spark.Reco --master spark://MASTERNODE U IP:PORT --deploy-mode client
/home/cloudera/workspace/new/Sparksample/target/Sparksample-0.0.1-SNAPSHOT-jar-with-dependencies.jar
These are the various things that I tried:
I replaced hdfs:/filename.txt with the fs.defaultFS path that is present in my core-site.xml file.
Replaced hdfs:/filename.txt with hdfs:// (in case that makes any difference).
Replaced hdfs:/ with file:// and later with file:/// to access the files on my local drive.
None of this seems to work. Is there anything else that could be going wrong?
If I do hadoop fs -ls, this is where my files are.
Generally the path is:
hdfs://name-nodeIP:8020/path/to/file
In your case it should be:
hdfs://localhost:8020/user_artist_data.txt
or
hdfs://machinname:8020/user_artist_data.txt
The org.apache.hadoop.mapred.InvalidInputException error means that Spark cannot create the RDD because there is no file at "hdfs:/user_artist_data.txt".
Try connecting to hdfs://localhost:8020/user_artist_data.txt and see if there are any files there.
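For example, a minimal sketch of the same read with a fully qualified path; localhost:8020 is an assumption, so substitute the namenode host and port from your fs.defaultFS:
import org.apache.spark.{SparkConf, SparkContext}

// A sketch only: the namenode authority (localhost:8020) is an assumption.
// Take the real value from fs.defaultFS in core-site.xml.
val conf = new SparkConf().setAppName("Recommendation Engine1")
val sc = new SparkContext(conf)

// Fully qualified HDFS path: scheme, namenode host, port, then an absolute path
val rawUserArtistData = sc
  .textFile("hdfs://localhost:8020/user_artist_data.txt")
  .sample(withReplacement = false, fraction = 0.05)

println(s"Sampled ${rawUserArtistData.count()} lines")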
Related
Scala version: 2.11.12
Spark version: 2.4.0
emr-5.23.0
I get the following when running the command below to create an Amazon EMR cluster:
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar
Exception
ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
How I create my Spark session (where sparkMaster = yarn):
lazy val spark: SparkSession = {
  val logger: Logger = Logger.getLogger("etl")
  val sparkAppName = EnvConfig.ETL_NAME
  val sparkMaster = EnvConfig.ETL_SPARK_MASTER

  val sparkInstance = SparkSession
    .builder()
    .appName(sparkAppName)
    .master(sparkMaster)
    .getOrCreate()

  val hadoopConf = sparkInstance.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  hadoopConf.set("fs.s3a.access.key", EnvConfig.ETL_AWS_ACCESS_KEY_ID)
  hadoopConf.set("fs.s3a.secret.key", EnvConfig.ETL_AWS_SECRET_ACCESS_KEY)

  logger.info("Created My SparkSession")
  logger.info(s"Spark Application Name: $sparkAppName")
  logger.info(s"Spark Master: $sparkMaster")

  sparkInstance
}
UPDATE:
I determined that, due to the application logic, in certain cases we did not initialize the Spark session. Because of this, it seems that when the cluster terminates it also tries to do something with the session (perhaps close it) and thus fails. Now that I have figured out this issue, the application runs but never actually completes. Currently it seems to hang in a particular Spark step when running in cluster mode:
val data: DataFrame = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(s"s3://$csvPath/$fileKey")
.toDF()
20/03/16 18:38:35 INFO Client: Application report for application_1584324418613_0031 (state: RUNNING)
AFAIK, EnvConfig.ETL_AWS_ACCESS_KEY_ID and ETL_AWS_SECRET_ACCESS_KEY are not getting populated, so the SparkSession cannot be instantiated with null or empty values. Try to print and debug the values.
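A minimal sketch of that check, assuming EnvConfig exposes plain String fields as in your question (the error messages are just illustrative); it fails fast before the SparkSession is built:
// Sketch: fail fast if the env-driven settings did not reach the application master
val accessKey = Option(EnvConfig.ETL_AWS_ACCESS_KEY_ID).filter(_.trim.nonEmpty)
val secretKey = Option(EnvConfig.ETL_AWS_SECRET_ACCESS_KEY).filter(_.trim.nonEmpty)

require(accessKey.isDefined, "ETL_AWS_ACCESS_KEY_ID is null or empty")
require(secretKey.isDefined, "ETL_AWS_SECRET_ACCESS_KEY is null or empty")

println(s"Spark master from env: ${EnvConfig.ETL_SPARK_MASTER}")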
Also, reading the properties passed via --conf spark.xxx should look like this example; I hope you are following this:
spark.sparkContext.getConf.getOption("spark.ETL_AWS_ACCESS_KEY_ID")
Once you check that, this example should work:
/**
* Hadoop-AWS Configuration
*/
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", proxyHost)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", proxyPort)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem
Another thing: you can use --master yarn or --master local[*] instead of
--conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn
UPDATE:
--conf spark.driver.port=20002 may solve this issue, where 20002 is an arbitrary port. It seems like it waits for that particular port for some time, retries for a while, and then fails with the exception you got.
I got this idea by walking through the Spark ApplicationMaster code from here and this comment: "This a bit hacky, but we need to wait until the spark.driver.port property has been set by the Thread executing the user class."
You can try this and let me know.
Further reading: Apache Spark: How to change the port the Spark driver listens to
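If you would rather pin the port in code than on the command line, a hedged sketch (the app name is a placeholder and 20002 is still an arbitrary port):
import org.apache.spark.sql.SparkSession

// Sketch: pin the driver port explicitly; any free port works, 20002 is arbitrary
val spark = SparkSession
  .builder()
  .appName("etl")
  .config("spark.driver.port", "20002")
  .getOrCreate()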
In my case (after resolving the application issues), I needed to include core AND task node types when deploying in cluster mode.
I am learning Spark + Scala with IntelliJ, and started with the small piece of code below:
import org.apache.spark.{SparkConf, SparkContext}

object ActionsTransformations {
  def main(args: Array[String]): Unit = {
    // Create a SparkContext to initialize Spark
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Word Count")
    val sc = new SparkContext(conf)

    val numbersList = sc.parallelize(1.to(10000).toList)
    println(numbersList)
  }
}
When trying to run it, I get the exception below:
Exception in thread "main" java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Process finished with exit code 1
Can anyone suggest what to do?
Add SPARK_LOCAL_IP to the load-spark-env.sh file located in the spark/bin directory:
export SPARK_LOCAL_IP="127.0.0.1"
Sometimes the problem is related to a connected VPN or something like that! Just disconnect your VPN or any other tool that may affect your networking, and then try again.
It seems like you've used an old version of Spark. In your case, try adding this line:
conf.set("spark.driver.bindAddress", "127.0.0.1")
If you use Spark 2.0+, the following should do the trick:
val spark: SparkSession = SparkSession.builder()
.appName("Word Count")
.master("local[*]")
.config("spark.driver.bindAddress", "127.0.0.1")
.getOrCreate()
The following should do the trick:
sudo hostname -s 127.0.0.1
This worked for me for the same error with PySpark:
from pyspark import SparkContext, SparkConf
conf_spark = SparkConf().set("spark.driver.host", "127.0.0.1")
sc = SparkContext(conf=conf_spark)
I think setMaster and setAppName return a new SparkConf object, and the line conf.setMaster("local") will not have an effect on the conf variable. So you should try:
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Word Count")
It seems like the ports Spark is trying to bind to are already in use. Did this issue start happening after you ran Spark successfully a few times? You may want to check whether those previously run Spark processes are still alive and holding onto some ports (a simple jps / ps -ef should tell you that). If so, kill those processes and try again.
Add your hostname with your internal IP to /etc/hosts.
More explanation
Get your hostname with this command:
hostname
# OR
cat /proc/sys/kernel/hostname
Get your internal ip with this command:
ip a
Change values and add it to /etc/hosts
${INTERNAL_IP} ${HOSTNAME}
Example:
192.168.1.5 bashiri_pc
Or (Previous line is better!)
127.0.0.1 bashiri_pc
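To confirm the mapping took effect, a small Scala check of what the JVM resolves the local host to (a verification aid only, not part of the fix):
import java.net.InetAddress

// Should now print the internal IP you added to /etc/hosts, e.g. 192.168.1.5
val localHost = InetAddress.getLocalHost
println(s"${localHost.getHostName} -> ${localHost.getHostAddress}")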
I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through mesos.
I create spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to specify the MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But, this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?
Add the spark-mesos jar to your classpath.
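For example, with sbt this could be a hedged build.sbt sketch, assuming Spark 2.x where Mesos support ships as a separate spark-mesos module (the version is a placeholder; match it to your Spark version):
// build.sbt sketch: in Spark 2.x the Mesos cluster manager lives in its own module
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.0",
  "org.apache.spark" %% "spark-mesos" % "2.4.0"
)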
Note that the following local setting does work:
sc = new SparkContext("local[8]", testName)
But setting the remote master programmatically does not work:
sc = new SparkContext(master, testName)
or (same end result)
val sconf = new SparkConf()
.setAppName(testName)
.setMaster(master)
sc = new SparkContext(sconf)
In both of the latter cases the result is:
[16:25:33,427][INFO ][AppClient$ClientActor] Connecting to master akka.tcp://sparkMaster#mellyrn:7077/user/Master...
[16:25:33,439][WARN ][ReliableDeliverySupervisor] Association with remote system [akka.tcp://sparkMaster#mellyrn:7077]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
The following command-line approach for setting the Spark master consistently works (verified on multiple projects):
$SPARK_HOME/bin/spark-submit --master spark://mellyrn.local:7077
--class $1 $curdir/sparkclass.jar )
Clearly there is some additional configuration happening related to the command line spark-submit. Anyone want to posit what that might be?
In the UNIX shell script below:
SP_MAST_URL=$($CASSANDRA_HOME/dse client-tool spark master-address)
echo $SP_MAST_URL
This will print the master URL from your Spark cluster environment. You may try this command-line utility and pass the value it prints on to the spark-submit command.
Note: CASSANDRA_HOME is the path where the Apache Cassandra installation lives. It could be any UNIX file path, depending on the environment.
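If you want the standalone program itself to consume that value, a hedged Scala sketch that reads it from an environment variable (exporting SP_MAST_URL beforehand is an assumption):
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: read the master address exported by the shell script above,
// falling back to local mode if SP_MAST_URL is not set
val masterUrl = sys.env.getOrElse("SP_MAST_URL", "local[*]")

val sconf = new SparkConf()
  .setAppName("testName")
  .setMaster(masterUrl)
val sc = new SparkContext(sconf)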
I have used Spark before in yarn-cluster mode and it's been good so far.
However, I wanted to run it in "local" mode, so I created a simple Scala app, added Spark as a dependency via Maven, and then tried to run it like a normal application.
However, I get the above exception on the very first line, where I try to create a SparkConf object.
I don't understand why I need Hadoop to run a standalone Spark app. Could someone point out what's going on here?
My two-line app:
val sparkConf = new SparkConf().setMaster("local").setAppName("MLPipeline.AutomatedBinner")//.set("spark.default.parallelism", "300").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.mb", "256").set("spark.akka.frameSize", "256").set("spark.akka.timeout", "1000") //.set("spark.akka.threads", "300")//.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") //.set("spark.akka.timeout", "1000")
val sc = new SparkContext(sparkConf)