Spark reading file from local gives InvalidInputException - scala

Using Spark 2.2.0 installed via Homebrew on macOS High Sierra. I started spark-shell and tried to read a local file like so:
val lines = sc.textFile("file:///Users/username/Documents/Notes/sampleFile")
val llist = lines.collect()
This gives me:
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/Users/bsj625/Documents/Notes/sampleFile
I've tried a bunch of variations, file:/ and file://. I also tried running spark-shell in local mode like so:
spark-shell --master local
But I'm still getting the same error. Are there any environment variables I need to set? Any help appreciated.
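For reference, a minimal sanity check (a sketch, assuming the path from the question) that can be run inside spark-shell to confirm the file is actually visible to the local driver before Spark tries to read it:
// Run inside spark-shell started with --master local[*]
// Plain Java check first: if this prints false, the problem is the path itself,
// not Spark or Hadoop configuration.
println(new java.io.File("/Users/username/Documents/Notes/sampleFile").exists())
// Then read it with an explicit file:// URI
val lines = sc.textFile("file:///Users/username/Documents/Notes/sampleFile")
val llist = lines.collect()
println(llist.length)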

Related

Add Jar file to Jupyter notebook: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

I have a pyspark script that was previously run using this bash script:
Now I am running the pyspark script in a Jupyter notebook. I added the Teradata JAR like this:
But when I tried to use "spark.read.jdbc" later to run a query to retrieve Teradata data, I got this error:
May I know how to solve this issue?
Try setting PYSPARK_SUBMIT_ARGS before the SparkContext is created:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /jar/path/ pyspark-shell'

zeppelin-0.7.3 Interpreter pyspark not found

I get the below error when I use pyspark via Zeppelin.
The python & spark interpreters work and all environment variables are set correctly.
print os.environ['PYTHONPATH']
/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/pyspark.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/pyspark:/x01/spark_u/zeppelin/interpreter/python/py4j-0.9.2/src:/x01/spark_u/zeppelin/interpreter/lib/python
zeppelin-env.sh is set with the below vars:
export PYSPARK_PYTHON=/usr/local/bin/python2
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}
export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"
See the log output below:
INFO [2017-11-01 12:30:42,972] ({pool-2-thread-4} RemoteInterpreter.java[init]:221) - Create remote interpreter org.apache.zeppelin.spark.PySparkInterpreter
org.apache.zeppelin.interpreter.InterpreterException: paragraph_1509038605940_-1717438251's Interpreter pyspark not found
Thank you in advance
I found a workaround for the above issue. The "interpreter pyspark not found" error does not happen when I create a note inside a directory; it only happens when I use notes at the top level. Additionally, I found out that this issue does not happen in version 0.7.2.

Test Spark with Tachyon

I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put a file "X" into the Tachyon file system as they describe:
$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was point to an existing file (that I found through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I ran count, I got the error below:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie !!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is the correct path. I cannot retrieve it via either the browser or wget.
I found the issue: I hadn't done this:
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found out the proper path is this:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine after all. The documentation at http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.
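Putting the pieces together, here is a minimal sketch of the sequence that ended up working, assuming Tachyon is running on localhost:19998 and the Tachyon client jar is already on the Spark classpath (the output path name is just an example):
// Tell Hadoop's FileSystem API how to handle the tachyon:// scheme
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
// Read a file that already exists in Tachyon (the LICENSE file from the exercise above)
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
println(file.count())
// Writing back into Tachyon uses the same scheme
file.saveAsTextFile("tachyon://localhost:19998/LICENSE_copy")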

Spark Shell unable to read file at valid path

I am trying to read a file that comes with the Cloudera CentOS distribution, using the Spark shell on my local machine. Following are the commands I have entered in the Spark shell.
spark-shell
val fileData = sc.textFile("hdfs://user/home/cloudera/cm_api.py");
fileData.count
I also tried this statement to read the file:
val fileData = sc.textFile("user/home/cloudera/cm_api.py");
However, I am getting:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera/cm_api.py
I haven't changed any settings or configurations. What am I doing wrong?
You are missing the leading slash in your URL, so the path is relative. To make it absolute, use
val fileData = sc.textFile("hdfs:///user/home/cloudera/cm_api.py")
or
val fileData = sc.textFile("/user/home/cloudera/cm_api.py")
I think you need to put the file into HDFS first with hadoop fs -put, then check it with hadoop fs -ls, then start spark-shell and run val fileData = sc.textFile("cm_api.py").
In "hdfs://user/home/cloudera/cm_api.py", you are missing the hostname of the URI. You should have pass something like "hdfs://<host>:<port>/user/home/cloudera/cm_api.py", where <host> is Hadoop NameNode host and the <port> is, well, port number of Hadoop NameNode, which is 50070 by default.
The error message says hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera/cm_api.py does not exist. The path looks suspicious! The file you mean is probably at hdfs://quickstart.cloudera:8020/user/cloudera/cm_api.py.
If it is, you can access it by using that full path. Or, since relative paths are resolved against your HDFS home directory (hdfs://quickstart.cloudera:8020/user/cloudera/), you can simply use cm_api.py.
You may be confused between HDFS file paths and local file paths. By specifying
hdfs://quickstart.cloudera:8020/user/home/cloudera/cm_api.py
you are saying two things:
1) there is a computer by the name "quickstart.cloudera" reachable via the network (try ping to ensure that is the case), and it is running HDFS.
2) the HDFS file system contains a file at /user/home/cloudera/cm_api.py (try hdfs dfs -ls /user/home/cloudera/ to verify this).
If you are trying to access a file on the local file system you have to use a different URI:
file:///user/home/cloudera/cm_api.py
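To summarize the path styles discussed in these answers, here is a short sketch, assuming the Cloudera quickstart defaults from the error message, that the file has been copied into HDFS under /user/cloudera, and (for the last line) an assumed local location in the cloudera user's home directory:
// Fully qualified HDFS URI (NameNode host and port)
val fromFullUri = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/cm_api.py")
// Absolute path, resolved against the default file system (fs.defaultFS)
val fromDefaultFs = sc.textFile("/user/cloudera/cm_api.py")
// Relative path, resolved against the user's HDFS home directory (/user/cloudera)
val fromHome = sc.textFile("cm_api.py")
// Local file on the machine running the driver (adjust to wherever the file actually lives)
val fromLocal = sc.textFile("file:///home/cloudera/cm_api.py")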

How to start a Spark Shell using pyspark in Windows?

I am a beginner in Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run in cmd the following:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive "'python' is not recognized as an internal or external command, operable program or batch file".
You need to have Python available on the system path; you can add it with setx:
setx path "%path%;C:\Python27"
I'm a fairly new Spark user (as of today, really). I am using Spark 1.6.0 on Windows 10 and 7 machines. The following worked for me:
import os
import sys

# SPARK_HOME must already be set in the environment for this to work
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put the PySpark sources and the bundled py4j on sys.path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Launch the interactive PySpark shell (Python 2's execfile)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Using the code above, I was able to launch Spark in an IPython notebook and in my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if you have your environment variables set correctly for Python and Spark (pyspark).
I run this set of path settings whenever I start pyspark in IPython:
import os
import sys
# Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') for R
### Restart Spark using: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
os.environ['SPARK_HOME']="G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
##sc.stop() # IF you wish to stop the context
sc = SparkContext("local", "Simple App")
With the help of the user "maxymoo" I was also able to find a way to set a PERMANENT path in Windows 7. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
Simply set path in System -> Environment Variables -> Path
R Path in my system C:\Program Files\R\R-3.2.3\bin
Python Path in my system c:\python27
Spark Path in my system c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.